World Community Grid Forums
Thread Status: Active Total posts in this thread: 214
wildhagen
Veteran Cruncher, The Netherlands. Joined: Jun 5, 2009. Post Count: 845
Seems to be better here too....
Only OPN1 work units available, I think? For the last few hours I haven't gotten any OPNG, ARP1 or MCM1 units at all. But downloads (and uploads) of the OPN1 units are working OK here, no backoffs or retries.
TPCBF
Master Cruncher, USA. Joined: Jan 2, 2011. Post Count: 1951
Quote: Seems to be better here too....

Just ran some updates before going to sleep, and both of my GPU hosts each got about a dozen OPNG WUs, which will keep them busy for about an hour...

Quote: Only OPN1 work units available, I think? For the last few hours I haven't gotten any OPNG, ARP1 or MCM1 units at all. But downloads (and uploads) of the OPN1 units are working OK here, no backoffs or retries.

After the website and forum went down again for a while earlier this evening, I wasn't very hopeful, but maybe, just maybe, we can get back to normal procedures. Unfortunately though, still crickets from WCG/Krembil...

Ralf
||
|
poppinfresh99
Cruncher. Joined: Feb 29, 2020. Post Count: 49
Quote: Since you only need to download the large MCM1 file once (as long as you have a steady supply of MCM1 work units), the download issues continue even when not downloading this file. Besides, the downloading of the OPN files should be entirely independent of the large MCM1 file. I would welcome any alternative explanation. Cheers

One of us (or both) has a misunderstanding, because your points don't make any sense to me. The download SERVER is overwhelmed. Some fraction of users like me, due to the inconsistent work from Krembil and due to us not storing much work, have downloaded mcm1.dataset-sarc1.txt many times since Krembil started. Since this file is *MUCH* larger than the other files, the download server needs to do MUCH more work. Certainly, whether or not Sgt. Joe is downloading a file doesn't much affect the download server.

Why should OPN not affect MCM? The download server (servers?) is at Krembil regardless. It *seems* to me that they do affect each other. I only run MCM and, when downloads are slow, I often see people saying that some OPN tasks just became available.

When a bunch of small files are waiting to download, the BOINC client applies a project backoff to protect the download server. Just because a BOINC client is sitting there waiting for a bunch of small files to download doesn't mean that the WCG download server is being burdened (though it might be, or perhaps the server that generates the small files is being burdened?). When the connection to the server is finally made, the small files download in an instant.

I agree with whoever said that we should stop focusing on symptoms. The way I see it, the inability to get a connection to the download server is the symptom. A hypothesis is that the partial cause is the download server being busy from repeatedly serving large files due to the inconsistent work.
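A client-side "project backoff" like the one described above is, in general terms, randomized exponential backoff: each consecutive failure roughly doubles the wait before the next attempt, up to a cap, and a random component spreads retries out so that many clients do not all come back at the same moment. The following is a minimal sketch under stated assumptions; `backoff_delay`, the 60-second base and the 4-hour cap are hypothetical illustration values, not the BOINC client's actual code or constants.

```python
import random
from typing import Optional


def backoff_delay(failures: int, base: float = 60.0, cap: float = 4 * 3600.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retrying after `failures` consecutive failed attempts.

    `base` (first delay) and `cap` (upper bound) are hypothetical values chosen
    for illustration only; they are not BOINC's real settings.
    """
    if failures <= 0:
        return 0.0  # no failures yet: try immediately
    rng = rng or random.Random()
    # Exponential growth, clamped to `cap`; the exponent is limited to avoid float overflow.
    ceiling = min(cap, base * 2 ** min(failures - 1, 30))
    # "Equal jitter": wait between half and all of the ceiling, so clients that
    # failed at the same moment do not all retry at the same moment.
    return ceiling / 2 + rng.uniform(0, ceiling / 2)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the demo is repeatable
    for n in range(1, 10):
        print(f"after {n} failure(s): wait ~{backoff_delay(n, rng=rng):.0f} s")
```

The jitter is the part that protects the server: without it, every client that lost its connection during the same outage would retry in lockstep and recreate the overload.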
Just1vet
Cruncher. Joined: Nov 9, 2005. Post Count: 25
The whole weekend without a hiccup here. I did increase the number of days of work stored to 4, which seemed to help. Another few days like that, and I'll restart the farm.
|
mwroggenbuck
Advanced Cruncher, USA. Joined: Nov 1, 2006. Post Count: 77
The website is definitely faster. I also have had no download problems. Something must have changed. It would be nice to know whether the WCG staff did something significant. Maybe, just maybe, things are getting better...
Kirel2
Advanced Cruncher, United States. Joined: Sep 24, 2014. Post Count: 99
Yeah, website and down/uploads have been very snappy for the last 12 hours or so. Fingers crossed.
TPCBF
Master Cruncher, USA. Joined: Jan 2, 2011. Post Count: 1951
Quote: I agree with whoever said that we should stop focusing on symptoms. The way I see it, the inability to get a connection to the download server is the symptom. A hypothesis is that the partial cause is the download server being busy from repeatedly serving large files due to the inconsistent work.

You agree (with me in this case), yet you do exactly that: make assumptions based on symptoms. No, the connection issue is not just a symptom, as there is a clear error message being returned.

An external CDN may not work well for WCG, as pretty much all of the content is "dynamic" (constantly changing, at least for the WUs; no two requests result in the same file being sent). But if that large MCM1 text file is the same across a large number of requests, the read cache of the underlying file system on the server should take care of it. And if the FS doesn't, a load-balancing proxy like HAProxy, which is apparently being used on the software end, should do some caching for cases like this.

Even when download issues are very bad, those large files download reliably once they get a connection. At least I have not seen any issue once they started to load. Yes, at times it seemed rather slow, but as long as there were WUs available, I have not seen any of my hosts run out of work because of that. And a lot of the people who complain about those things the most seem to run with modified settings, which makes testing on the server/project side so much more difficult.

I don't know why MCM1 was stopped over the weekend: whether the hopper just ran empty, the project is taking a break, or the techs are testing specific issues on their end. THAT is the root of all problems. IMHO, it would be much more useful if WCG/Krembil were MUCH more communicative, telling us what is going on, so that there would be less speculation based on perceived symptoms, like it used to be in "the good old days". We don't really have insight into the monitoring data at Krembil's end, but only that data can help to narrow down the issues. And even then, only if the techs get qualified responses from the users at the far end, based on information from their end about what they might have tweaked/adjusted. But instead, the silence out of Toronto is deafening...

Ralf
Paul Schlaffer
Senior Cruncher, USA. Joined: Jun 12, 2005. Post Count: 244
Quote: ... I don't know why MCM1 was stopped over the weekend: whether the hopper just ran empty, the project is taking a break, or the techs are testing specific issues on their end. THAT is the root of all problems. IMHO, it would be much more useful if WCG/Krembil were MUCH more communicative, telling us what is going on, so that there would be less speculation based on perceived symptoms, like it used to be in "the good old days". We don't really have insight into the monitoring data at Krembil's end, but only that data can help to narrow down the issues. And even then, only if the techs get qualified responses from the users at the far end, based on information from their end about what they might have tweaked/adjusted. But instead, the silence out of Toronto is deafening... Ralf

Agreed. That was very well stated. We don't know whether the change we see on our end is due to the change in work units (now mostly OpenPandemics CPU), perhaps a drop-off in participation, or something else. Some updates from the lab on what they are doing (or not) regarding the issue would be much appreciated.

“Where an excess of power prevails, property of no sort is duly respected. No man is safe in his opinions, his person, his faculties, or his possessions.” – James Madison (1792)

[Edit 1 times, last edit by Paul Schlaffer at Oct 10, 2022 10:22:07 PM]
Sgt.Joe
Ace Cruncher, USA. Joined: Jul 4, 2006. Post Count: 7675
Quote: The download SERVER is overwhelmed.

This is assuming there is only one download server. Since there has been mention of a load balancer, this implies the existence of two or more servers for downloads. It is not unreasonable to suspect that the load is partitioned among multiple servers by project, especially for the projects that require the greatest number of downloads. If it is not partitioned by project, it would fall on the load balancer to properly apportion the available work units to meet the set proportion of work units for each project. However, not being privy to the topology of the setup for distributing work units to the volunteers, this is all mere speculation on my part. I do recall an old post by Uplinger detailing the percentage of resource allocation set in the system for each project, which they would occasionally tweak at the request of the researchers. They would try to provide the researchers with enough results to satisfy their needs without overwhelming the resources they had to store and process the results. Cheers
Sgt. Joe
*Minnesota Crunchers*
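The second scenario Sgt. Joe describes, one dispatcher apportioning work units to meet a set share for each project, can be illustrated with a "largest deficit first" rule: each unit goes to the project that is furthest behind its target share, and projects with an empty queue are skipped. This is a hypothetical sketch; the function names, project shares and queue sizes are made up for illustration and say nothing about how WCG's scheduler actually works.

```python
from collections import Counter
from typing import Dict, List


def pick_project(targets: Dict[str, float], sent: Counter, available: Dict[str, int]) -> str:
    """Choose the project whose share of units sent so far lags its target the most.

    `targets` maps project -> desired fraction (summing to 1.0); `available` maps
    project -> units still queued. Projects with nothing queued are skipped.
    """
    total = sum(sent.values())
    candidates = [p for p in targets if available.get(p, 0) > 0]
    if not candidates:
        raise LookupError("no work available for any project")
    # Deficit = units the project "should" have received by now minus units it got.
    return max(candidates, key=lambda p: targets[p] * (total + 1) - sent[p])


def dispatch(targets: Dict[str, float], available: Dict[str, int], n: int) -> List[str]:
    """Hand out up to n work units, one at a time, following the target shares."""
    sent: Counter = Counter()
    remaining = dict(available)
    order: List[str] = []
    for _ in range(n):
        if not any(remaining.get(p, 0) > 0 for p in targets):
            break  # every targeted project's queue is empty
        p = pick_project(targets, sent, remaining)
        sent[p] += 1
        remaining[p] -= 1
        order.append(p)
    return order


if __name__ == "__main__":
    # Hypothetical shares and queue sizes, for illustration only.
    shares = {"MCM1": 0.5, "OPN1": 0.3, "ARP1": 0.2}
    queues = {"MCM1": 1000, "OPN1": 1000, "ARP1": 0}  # e.g. the ARP1 hopper is empty
    print(Counter(dispatch(shares, queues, 100)))
```

With ARP1's queue empty in the example, the projects that still have work absorb its share, which is one way a stalled project could shift load onto the others.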
Link64
Advanced Cruncher. Joined: Feb 19, 2021. Post Count: 129
Quote: No, we should not need to do this. And this never has been a problem before (the move)...

Perhaps it wasn't an issue for IBM, but there are still lots of people with slow connections and/or without unlimited traffic. Not me, I was just doing it "for fun". It's just pretty stupid to let people download the same file again and again when BOINC offers the possibility to keep it in the project folder for future use.
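Link64's point, that a large input file only has to be fetched once if the client keeps it, amounts to a check-before-download step. A minimal sketch, assuming the client knows the file's expected MD5 checksum; `ensure_file` and the `fetch` callback are hypothetical names standing in for a real download, not BOINC's API, and the real client's mechanism for keeping files is configured by the project rather than written like this.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def md5_of(path: Path) -> str:
    """MD5 hex digest of a file, read in 1 MiB chunks so large files need little memory."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def ensure_file(path: Path, expected_md5: str, fetch: Callable[[], bytes]) -> bool:
    """Make sure `path` holds the expected file; return True only if a download happened.

    `fetch` is a hypothetical stand-in for downloading the file from the server.
    """
    if path.exists() and md5_of(path) == expected_md5:
        return False  # cached copy is intact: skip the download entirely
    data = fetch()
    if hashlib.md5(data).hexdigest() != expected_md5:
        raise ValueError("downloaded file does not match the expected checksum")
    path.write_bytes(data)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "dataset.txt"
        payload = b"example dataset contents"
        digest = hashlib.md5(payload).hexdigest()
        print(ensure_file(target, digest, lambda: payload))  # True: first call downloads
        print(ensure_file(target, digest, lambda: payload))  # False: copy is reused
```

Verifying the checksum, rather than only checking that the file exists, means a truncated or corrupted copy still triggers a fresh download.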