World Community Grid Forums
Thread Status: Active. Total posts in this thread: 3319
Mike.Gibson
Ace Cruncher, England. Joined: Aug 23, 2007. Post Count: 12436. Status: Offline.
Results returned have dropped from 22K to 17K per day, and those would have been done by fast machines, so there are more priority units for us ordinary crunchers.

Mike
AnandBhat
Cruncher. Joined: Apr 2, 2020. Post Count: 10. Status: Offline.
The big hitters may have run into this issue when ramping up their ARP output -- https://github.com/BOINC/boinc/issues/4572
Former Member
Cruncher. Joined: May 22, 2018. Post Count: 0. Status: Offline.
I'm skeptical of the problem described in 4572. I currently run 64 concurrent ARP1 work units and have done so for the past 28 days (I broke away from ARP1 to run MCM for a challenge). I have also run 128 concurrent, but cut back due to the length of time it took to complete a WU. I have never encountered the described issue after contributing over 100 years of computer time to ARP1.

I have seen uploads fail and accumulate, such as during system maintenance windows or when WCG was encountering filesystem errors on the upload storage device (resulting in a lot of HTTP errors). During those times I did see the same messages as described in the incident, but I was able to retry the uploads and get them to clear. Yes, if enough accumulated in upload-pending status the downloads would cease, but at no time did BOINC Manager disconnect or require a reboot or client restart to clear.

In the past there have been very rare instances where, due to circumstances, the upload process got interrupted and would not restart without intervention by the WCG staff (for example, deleting the upload file from the upload filesystem so that the client and remote end were back in sync), after which the upload would finish as normal. I have only seen this documented 4 or 5 times in 14 years, and it was usually due to a power outage or similar immediate disconnect that took down not only the client but also the OS.

I have also encountered times where I lost internet connectivity and the uploads accumulated on the client until a connection was re-established (maybe 24 hours or more later). Once the connection was established, a flood of large file uploads would commence (I have mine set to 10 concurrent uploads) and would complete without a problem. Yes, it sometimes took as long as an hour, but they did complete without intervention.
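For anyone who wants to script the retry step described above rather than clicking through BOINC Manager, here is a minimal sketch using the stock `boinccmd` CLI. It assumes `boinccmd` is on the PATH and a client is running locally; the project URL and result name in the comment are placeholders, not real values.

```shell
# Sketch: list pending transfers and retry a stuck upload via boinccmd.
# BOINCCMD can be overridden (e.g. for testing with a stub command).
BOINCCMD="${BOINCCMD:-boinccmd}"

list_transfers() {
    # Shows all pending uploads/downloads and their status
    "$BOINCCMD" --get_file_transfers
}

retry_upload() {
    # $1 = project URL, $2 = file name as shown by --get_file_transfers
    "$BOINCCMD" --file_transfer "$1" "$2" retry
}

if command -v "$BOINCCMD" >/dev/null 2>&1; then
    list_transfers
    # Example with placeholder values:
    # retry_upload "https://www.worldcommunitygrid.org/boinc/" "SOME_RESULT_NAME_0_0"
else
    echo "boinccmd not found; nothing to do"
fi
```

This is only a sketch of the manual retry the post describes; on a live client you would copy the project URL and file name straight out of the `--get_file_transfers` output.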
Mike.Gibson
Ace Cruncher, England. Joined: Aug 23, 2007. Post Count: 12436. Status: Offline.
Kevin has already said why the big hitters' output fluctuates: they concentrate on the work their machines were bought to perform and only contribute to WCG when they have spare capacity.

Problem 4572 seems to occur when the uploads are batched. The answer, it seems to me, is not to batch them but to upload each unit as it completes, spreading the load. My broadband is on 24/7, so I have no reason to batch. I don't look very often, but if I see ARP units checkpointing close to each other, I suspend the second running unit for a few minutes. This avoids any possible overload at checkpointing or uploading.

Mike
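The manual "suspend the second running unit for a few minutes" step can also be done from the command line; a hedged sketch with `boinccmd` (the project URL and task name below are placeholders, and a local client is assumed):

```shell
# Sketch: briefly suspend one of two running ARP tasks so their
# checkpoints and uploads don't coincide, then resume it.
BOINCCMD="${BOINCCMD:-boinccmd}"
PROJECT_URL="https://www.worldcommunitygrid.org/boinc/"   # placeholder
TASK_NAME="ARP1_EXAMPLE_TASK_0"                           # placeholder

suspend_task() { "$BOINCCMD" --task "$1" "$2" suspend; }
resume_task()  { "$BOINCCMD" --task "$1" "$2" resume; }

if command -v "$BOINCCMD" >/dev/null 2>&1; then
    suspend_task "$PROJECT_URL" "$TASK_NAME"
    sleep 300   # "a few minutes", per the post
    resume_task "$PROJECT_URL" "$TASK_NAME"
else
    echo "boinccmd not found; nothing to do"
fi
```

On a live client, the task name would come from `boinccmd --get_tasks` rather than the placeholder above.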
Mike.Gibson
Ace Cruncher, England. Joined: Aug 23, 2007. Post Count: 12436. Status: Offline.
Kevin's last report said that there were 130 units currently stuck, having errored out. 96 of these would seem to be what remains of generations 079 - 095 plus most of 098 & 099.
I will look a little deeper into this for my next report at the weekend.

Mike
Sgt.Joe
Ace Cruncher, USA. Joined: Jul 4, 2006. Post Count: 7697. Status: Offline.
I have seen a similar problem on 24- and 32-thread systems, but not very often. Since I don't run a lot of ARP units, it has occurred with both MCM and OPN units. I have had luck suspending network activity, waiting about 15 seconds, and then resuming network activity. Once the logjam breaks, the rest of the uploads proceed as normal. In the past I have run up to 120 threads on a single internet connection through some Rube Goldberg concoction of switches, routers and range extenders. Most of the time it works without any problems, but occasionally these logjams happen for no apparent reason.
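The suspend-and-resume trick above maps directly onto `boinccmd`'s network-mode switch; a minimal sketch, assuming a local client and `boinccmd` on the PATH:

```shell
# Sketch: break an upload logjam by pausing network activity for ~15 s,
# then resuming, as described above.
BOINCCMD="${BOINCCMD:-boinccmd}"

pause_network()  { "$BOINCCMD" --set_network_mode never; }
resume_network() { "$BOINCCMD" --set_network_mode auto; }

if command -v "$BOINCCMD" >/dev/null 2>&1; then
    pause_network
    sleep 15
    resume_network
else
    echo "boinccmd not found; nothing to do"
fi
```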
Cheers
Sgt. Joe
*Minnesota Crunchers*
AnandBhat
Cruncher. Joined: Apr 2, 2020. Post Count: 10. Status: Offline.
My apologies for the distraction. I ran into connectivity issues with my 16-thread system and a similar logjam. I found that report by chance, and since the reporter had expressed an interest in pushing ARP at a rate of 2,000 WUs/day, I thought they (and other similar contributors) might have work sitting there waiting to be sent.
Former Member
Cruncher. Joined: May 22, 2018. Post Count: 0. Status: Offline.
Over the past 24 hours I have been getting approximately 85% non-priority work, but total validated results are still approximately 17,000 per day, perhaps suggesting that more machines have become reliable.
spRocket
Senior Cruncher. Joined: Mar 25, 2020. Post Count: 277. Status: Offline.
I've bumped my ARP limits from 3 to 6 on the main cruncher (16 threads, 15 active) and from 1 to 2 on the laptops (4 threads each). So far there don't seem to be any issues, but it will mean more lost work when I need to reboot them.