World Community Grid - View Thread

World Community Grid Forums

Category: Active Research

Forum: Africa Rainfall Project

Thread: Work Available

Thread Status: Active
Total posts in this thread: 3195

This topic has been viewed 2709413 times and has 3194 replies

geophi
Advanced Cruncher
U.S.
Joined: Sep 3, 2007
Post Count: 102
Status: Offline

Re: Work Available

I got one of the formerly stuck units, which is now in generation 90. I'm assuming the timestep size for these unstuck units has gone back to normal?

It's running on the 64-bit executable, and those typically complete in 6.5 to 7.5 hours on my Ryzen. BOINC is estimating 10 hours at startup, but it usually does a bad job of estimation.

[Dec 31, 2021 10:56:46 PM]

geophi

Re: Work Available

Well it took nearly 10 hours, so it is likely the timestep duration is shorter for this unit. Perhaps this was the first try to "unstick" it, or perhaps these units will continue on with the shorter timesteps.

https://www.worldcommunitygrid.org/contribution/workunit/100022979

[Jan 1, 2022 6:38:35 PM]

Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12349
Status: Offline

Re: Work Available

Was it a twin or a triplet? If a twin, it would have been from the unstuck generation. If a triplet, then it would probably have been from the next generation. That is a generic question for the benefit of others.

I see that it was a triplet, so probably from the next generation, which indicates that it will continue like that, at least for a while.

Mike

----------------------------------------
[Edit 1 times, last edit by Mike.Gibson at Jan 2, 2022 1:34:35 AM]

[Jan 2, 2022 1:30:35 AM]

Hype
Cruncher
Germany
Joined: Nov 18, 2011
Post Count: 43
Status: Offline

Re: Work Available

How many ARP WUs should you run with 64 MB of L3 cache?

[Jan 2, 2022 12:24:19 PM]

Mike.Gibson

Re: Work Available

The general advice is to run a maximum of half of your threads on ARP and the rest on other projects - currently MCM & OPN1. The problem is not just L3 cache.

The ARP problems stem from intense checkpointing, which occurs at every 12.5% of progress and at the end. It is also a good idea to keep the checkpoints apart: if 2 are close, even at different checkpointing stages, it is advisable to suspend the task that would checkpoint second for a couple of minutes to increase the separation.
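As a rough illustration of that advice, the sketch below lists the expected checkpoint times for a few tasks and flags the ones that fall close together. It is not a BOINC feature: the task names, start times, runtimes and the 2-minute threshold are all hypothetical, and it assumes progress is linear in time so that checkpoints land at 12.5%, 25%, ... 100% of the runtime.

```python
# Sketch only: assumes progress is linear in time, so the eight
# checkpoints fall at 12.5%, 25%, ..., 100% of each task's runtime.
CHECKPOINT_FRACTIONS = [i / 8 for i in range(1, 9)]


def checkpoint_times(start_min, runtime_min):
    """Minutes (on a shared clock) at which one task is expected to checkpoint."""
    return [start_min + f * runtime_min for f in CHECKPOINT_FRACTIONS]


def close_pairs(tasks, min_gap_min=2.0):
    """Return (gap, earlier_task, later_task) for consecutive checkpoints of
    different tasks that are less than min_gap_min minutes apart."""
    events = sorted(
        (t, name)
        for name, (start, runtime) in tasks.items()
        for t in checkpoint_times(start, runtime)
    )
    clashes = []
    for (t1, n1), (t2, n2) in zip(events, events[1:]):
        if n1 != n2 and t2 - t1 < min_gap_min:
            clashes.append((round(t2 - t1, 2), n1, n2))
    return clashes


# Hypothetical tasks: name -> (start minute, estimated runtime in minutes)
tasks = {"arp_a": (0, 480), "arp_b": (59, 480), "arp_c": (200, 420)}
for gap, first, second in close_pairs(tasks):
    print(f"{second} checkpoints {gap} min after {first}: consider suspending {second} briefly")
```

The task named second in each flagged pair is the one the post suggests suspending for a couple of minutes.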

Mike.

[Jan 2, 2022 2:43:17 PM]

Mike.Gibson

Re: Work Available

Thank you, Kevin.

Best Wishes for a Happy New Year.

60 stuck units have now been freed up so only 70 remain.

7 days have passed since the last full report. In that time, the last of the ultra laggards has moved on 6 generations to generation 064 and their leader has moved on 7 generations to 075 and is now only 4 generations behind the oldest stuck unit.

The 70 units still stuck appear to include generation 079 x 1, 102 x 2 and 105 x 1, plus 66 others. The list will become clearer.

There are 59 units listed as extreme. 6 are the ultras, so there are 53 others, which are unstuck units. 1 unstuck unit has escaped the extreme range, so 6 of the 60 unstuck units have yet to be validated.

The last of the accelerated (Priority) generations is 111 and their leader is 115.

The current leading generation is 125.

Each of those leaders has moved on 2 generations, so the definition for extreme laggard is still -15 generations, with accelerated remaining at -10 generations. However, there are still stuck units only in generations 102 - 110.

59,942 units have been validated in the last 7 days (up to 8,563.1 per day).

There are now 2,340,653 units to be crunched to finish the project assuming that a full generation 182 is the last. My target for completion is now calculated as 1 October 2022. That is 16 days earlier than my last summary, assumes everyone works as now until the last day and reflects the increase in work returned.
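For anyone wanting to redo this kind of projection, here is a minimal sketch using only the figures quoted above. It assumes a constant validation rate and simply divides the remaining units by the daily rate. The exact method behind the 1 October 2022 target is not stated in the post, so this naive version lands in the first days of October 2022 rather than matching it exactly.

```python
from datetime import date, timedelta
import math

# Figures quoted in the post above.
validated_last_7_days = 59_942
units_remaining = 2_340_653
report_date = date(2022, 1, 2)

# Assumption: the validation rate stays constant until the last unit.
units_per_day = validated_last_7_days / 7
days_left = math.ceil(units_remaining / units_per_day)
projected_finish = report_date + timedelta(days=days_left)

print(f"{units_per_day:.1f} units/day, {days_left} days left, finish around {projected_finish}")
```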

It is likely that we will see the project finish before the end of 2022 due to the efforts to close up the laggards. My guess would be October 2022 assuming those university computers come back on line after their holiday.

Mike

----------------------------------------
[Edit 2 times, last edit by Mike.Gibson at Jan 2, 2022 4:30:36 PM]

[Jan 2, 2022 4:14:28 PM]

Hype

Re: Work Available

Thanks for the suggestion.
I have 32 threads, and 16 definitely seems to be too many.
Even with 12 ARP threads, the OPN tasks running in parallel slow down.
I seem to have the best performance with 8 ARP in parallel, 10 might also be okay.

[Jan 3, 2022 5:11:16 PM]

Mike.Gibson

Re: Work Available

RAM is also a factor. I can run ARP on all 8 threads because my machine has 20 GB of RAM, albeit with each unit taking about a third longer than when I run ARP on 4 threads.

The problem with a large number of threads is that the checkpoints become more frequent. With 16 ARP threads out of 32, you are likely to have checkpoints every 5 minutes, say, if they are well spread. As they don't take the same amount of time, some checkpoints clash and clog up the machine. And if more than 2 clash you have a bigger problem.
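To put a rough number on how clashes grow with the task count, the sketch below estimates the chance that one task's checkpoint overlaps a checkpoint from at least one other task. It is a simplified model, not a measurement: it assumes equal runtimes, 8 checkpoints per unit, independent random timing between tasks, and a hypothetical 2-minute checkpoint duration that is not given anywhere in the thread.

```python
def clash_probability(n_tasks, runtime_min, checkpoint_min, checkpoints_per_unit=8):
    """Chance that one task's checkpoint overlaps a checkpoint of at least
    one other task, assuming equal runtimes and independent random phases."""
    period = runtime_min / checkpoints_per_unit  # time between a task's own checkpoints
    p_pair = min(1.0, 2 * checkpoint_min / period)  # overlap with one specific other task
    return 1 - (1 - p_pair) ** (n_tasks - 1)


# Hypothetical inputs: 10-hour units and 2-minute checkpoints.
for n in (4, 8, 16):
    print(n, "ARP tasks:", round(clash_probability(n, 600, 2.0), 2))
```

Under these assumptions the probability rises steadily with the number of ARP tasks, which is consistent with the point made above. Real runtimes differ between units, so treat the figures as indicative only.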

Mike

[Jan 3, 2022 6:26:45 PM]

spRocket
Senior Cruncher
Joined: Mar 25, 2020
Post Count: 274
Status: Offline
Project Badges:

50 year badge for Mapping Cancer Markers

20 year badge for OpenPandemics - COVID-19


Re: Work Available

I'm running six ARPs on my Ryzen 7, and haven't noticed any undue issues. I reserve one thread for bookkeeping and GPU feeding, so I normally run 15 CPU tasks total, with a GPU task taking the 16th slot if any OPNG work units are available.

On my main cruncher, /var/lib/boinc-client lives on a ZFS storage pool: two mirrored pairs of hard drives plus a pair of SSDs providing log and cache devices. It does a beautiful job of handling any load thrown at it. The SSDs aren't showing abnormal wear, and if one goes, it's just a performance degradation until it's replaced.

[Jan 4, 2022 3:02:42 AM]

Mike.Gibson

Re: Work Available

spRocket

With your Ryzen 7, I suspect you may be completing your ARP units in 12-18 hours. If that were 15 hours, with 6 running you would be completing 1 every 150 minutes, and with 8 checkpoints per unit you would be checkpointing every 19 minutes. Even if it were 12 hours per unit, your checkpointing would only occur every 15 minutes, which is plenty of time for them to keep apart nearly all the time.

Hype has twice as many threads which means that his checkpointing is much more likely to clash.
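The arithmetic in this post can be written out as a small helper. The function name is made up for illustration, and the figure of 8 checkpoints per unit is inferred from the 12.5% spacing mentioned earlier in the thread. The last call is a hypothetical case of 16 concurrent ARP tasks at the same 15-hour runtime.

```python
def mean_checkpoint_interval_min(runtime_hours, n_tasks, checkpoints_per_unit=8):
    """Average minutes between checkpoints across all concurrent ARP tasks."""
    minutes_per_completed_unit = runtime_hours * 60 / n_tasks
    return minutes_per_completed_unit / checkpoints_per_unit


print(mean_checkpoint_interval_min(15, 6))   # 18.75, the "19 minutes" in the post
print(mean_checkpoint_interval_min(12, 6))   # 15.0
print(mean_checkpoint_interval_min(15, 16))  # hypothetical: 16 tasks at the same speed
```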

Mike

[Jan 4, 2022 4:10:41 AM]
