World Community Grid Forums
Thread Status: Active. Total posts in this thread: 3319
Mike.Gibson
Ace Cruncher, England. Joined: Aug 23, 2007. Post Count: 12436. Status: Offline.
Results returned have dropped from 22K to 17K per day, and those would have been done by fast machines, so there are more priority units for us ordinary crunchers.

Mike
AnandBhat
Cruncher. Joined: Apr 2, 2020. Post Count: 10. Status: Offline.
The big hitters may have run into this issue when ramping up their ARP output -- https://github.com/BOINC/boinc/issues/4572
Former Member
Cruncher. Joined: May 22, 2018. Post Count: 0. Status: Offline.
I'm skeptical of the problem described in 4572. I currently run 64 concurrent ARP1 work units and have done so for the past 28 days (I broke away from ARP1 to run MCM for a challenge). I have also run 128 concurrent, but cut back due to the length of time it took to complete a WU. I have never encountered the described issue after contributing over 100 years of computer time to ARP1.

I have seen uploads fail and accumulate, such as during system maintenance windows or when WCG was encountering filesystem errors on the upload storage device (resulting in a lot of HTTP errors). During those times I did see the same messages as described in the incident, but I was able to retry the uploads and get them to clear. Yes, if enough accumulated in upload-pending status the downloads would cease, but at no time did BOINC Manager disconnect or require a reboot or client restart to clear.

In the past there have been very rare instances where, due to circumstances, the upload process got interrupted and would not restart without intervention by the WCG staff (for example, deleting the upload file from the upload filesystem so that the client and remote end were back in sync), after which the upload would finish as normal. I have only seen this documented 4 or 5 times in 14 years, and it was usually due to a power outage or similar immediate disconnect that took down not only the client but also the OS.

I have also encountered times where I lost internet connectivity and the uploads accumulated on the client until a connection was re-established (maybe 24 hours or more later). Once the connection was established, a flood of large file uploads would commence (I have mine set to 10 concurrent uploads) and would complete without a problem. Yes, it sometimes took as long as an hour, but they did complete without intervention.
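For anyone who wants to script the retry step described above rather than clicking through BOINC Manager, here is a minimal sketch using the stock `boinccmd` CLI. It assumes `boinccmd` is on the PATH and a client is running locally; the project URL and result name in the comment are placeholders, not real values.

```shell
# Sketch: list pending transfers and retry a stuck upload via boinccmd.
# BOINCCMD can be overridden (e.g. for testing with a stub command).
BOINCCMD="${BOINCCMD:-boinccmd}"

list_transfers() {
    # Shows all pending uploads/downloads and their status
    "$BOINCCMD" --get_file_transfers
}

retry_upload() {
    # $1 = project URL, $2 = file name as shown by --get_file_transfers
    "$BOINCCMD" --file_transfer "$1" "$2" retry
}

if command -v "$BOINCCMD" >/dev/null 2>&1; then
    list_transfers
    # Example with placeholder values:
    # retry_upload "https://www.worldcommunitygrid.org/boinc/" "SOME_RESULT_NAME_0_0"
else
    echo "boinccmd not found; nothing to do"
fi
```

This is only a sketch of the manual retry the post describes; on a live client you would copy the project URL and file name straight out of the `--get_file_transfers` output.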
Mike.Gibson
Ace Cruncher, England. Joined: Aug 23, 2007. Post Count: 12436. Status: Offline.
Kevin has already said why the big hitters' output fluctuates: they concentrate on the work their machines were bought to perform and only contribute to WCG when they have spare capacity.

Problem 4572 seems to occur when the uploads are batched. The answer, it seems to me, is not to batch them but to upload each unit as it completes, spreading the load. My broadband is on 24/7, so I have no reason to batch. I don't look very often, but if I see ARP units checkpointing close to each other, I suspend the second running unit for a few minutes. This avoids any possible overload at checkpointing or uploading.

Mike
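The manual "suspend the second running unit for a few minutes" step can also be done from the command line; a hedged sketch with `boinccmd` (the project URL and task name below are placeholders, and a local client is assumed):

```shell
# Sketch: briefly suspend one of two running ARP tasks so their
# checkpoints and uploads don't coincide, then resume it.
BOINCCMD="${BOINCCMD:-boinccmd}"
PROJECT_URL="https://www.worldcommunitygrid.org/boinc/"   # placeholder
TASK_NAME="ARP1_EXAMPLE_TASK_0"                           # placeholder

suspend_task() { "$BOINCCMD" --task "$1" "$2" suspend; }
resume_task()  { "$BOINCCMD" --task "$1" "$2" resume; }

if command -v "$BOINCCMD" >/dev/null 2>&1; then
    suspend_task "$PROJECT_URL" "$TASK_NAME"
    sleep 300   # "a few minutes", per the post
    resume_task "$PROJECT_URL" "$TASK_NAME"
else
    echo "boinccmd not found; nothing to do"
fi
```

On a live client, the task name would come from `boinccmd --get_tasks` rather than the placeholder above.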
Mike.Gibson
Ace Cruncher, England. Joined: Aug 23, 2007. Post Count: 12436. Status: Offline.
Kevin's last report said that there were 130 units currently stuck, having errored out. 96 of these would seem to be what remains of generations 079 - 095 plus most of 098 & 099.
I will look a little deeper into this for my next report at the weekend.

Mike
Sgt.Joe
Ace Cruncher, USA. Joined: Jul 4, 2006. Post Count: 7697. Status: Offline.
I have seen a similar problem on 24- and 32-thread systems, but not very often. Since I don't run a lot of ARP units, it has occurred with both MCM and OPN units. I have had luck suspending network activity, waiting about 15 seconds, and then resuming network activity. Once the logjam breaks, the rest of the uploads proceed as normal. In the past I have run up to 120 threads on a single internet connection through some Rube Goldberg concoction of switches, routers and range extenders. Most of the time it works without any problems, but occasionally these logjams happen for no apparent reason.
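The suspend-and-resume trick above maps directly onto `boinccmd`'s network-mode switch; a minimal sketch, assuming a local client and `boinccmd` on the PATH:

```shell
# Sketch: break an upload logjam by pausing network activity for ~15 s,
# then resuming, as described above.
BOINCCMD="${BOINCCMD:-boinccmd}"

pause_network()  { "$BOINCCMD" --set_network_mode never; }
resume_network() { "$BOINCCMD" --set_network_mode auto; }

if command -v "$BOINCCMD" >/dev/null 2>&1; then
    pause_network
    sleep 15
    resume_network
else
    echo "boinccmd not found; nothing to do"
fi
```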
Cheers
Sgt. Joe
*Minnesota Crunchers*
AnandBhat
Cruncher. Joined: Apr 2, 2020. Post Count: 10. Status: Offline.
My apologies for the distraction. I ran into connectivity issues with my 16-thread system and a similar logjam. I found that report by chance, and since the reporter had expressed an interest in pushing ARP at a rate of 2,000 WUs/day, I thought they (and other similar contributors) might have work sitting there waiting to be sent.
Former Member
Cruncher. Joined: May 22, 2018. Post Count: 0. Status: Offline.
Over the past 24 hours I have been getting approximately 85% non-priority work, but total validated results are still approximately 17,000 per day, perhaps suggesting that more machines have become reliable.
spRocket
Senior Cruncher. Joined: Mar 25, 2020. Post Count: 277. Status: Offline.
I've bumped my ARP limits from 3 to 6 on the main cruncher (16 threads, 15 active) and from 1 to 2 on the laptops (4 threads each). So far there don't seem to be any issues, but it will mean more lost work when I need to reboot them.