World Community Grid Forums
Thread Status: Active Total posts in this thread: 3195
geophi
Advanced Cruncher U.S. Joined: Sep 3, 2007 Post Count: 102 Status: Offline |
I got one of the formerly stuck units, which is now in generation 90. I'm assuming the timestep size for these unstuck units has gone back to normal?
It's running on the 64-bit executable, and those typically complete in 6.5 to 7.5 hours on my Ryzen. BOINC is estimating 10 hours at startup, but it usually does a bad job of estimation. |
||
|
geophi
Advanced Cruncher U.S. Joined: Sep 3, 2007 Post Count: 102 Status: Offline |
Well, it took nearly 10 hours, so it is likely the timestep duration is shorter for this unit. Perhaps this was the first try to "unstick" it, or perhaps these units will continue on with the shorter timesteps. https://www.worldcommunitygrid.org/contribution/workunit/100022979 |
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12349 Status: Offline |
Was it a twin or a triplet? If a twin, it would have been from the unstuck generation. If a triplet, then it would probably have been from the next generation. That is a generic question for the benefit of others.
I see that it was a triplet, so probably from the next generation, which indicates that it will continue like that, at least for a while. Mike [Edit 1 times, last edit by Mike.Gibson at Jan 2, 2022 1:34:35 AM] |
||
|
Hype
Cruncher Germany Joined: Nov 18, 2011 Post Count: 43 Status: Offline |
How many ARP WUs should you run with 64 MB of L3 cache? |
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12349 Status: Offline |
The general advice is to run a maximum of half of your threads on ARP and the rest on other projects - currently MCM & OPN1. The problem is not just L3 cache.
The ARP problems stem from intense checkpointing, which occurs at each 12.5% of progress and at the end. It is also a good idea to keep the checkpoints apart: if 2 are close, even at different checkpointing stages, it is advisable to suspend the task that would checkpoint second for a couple of minutes to increase the separation. Mike. |
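As a rough illustration of the "keep the checkpoints apart" advice, here is a minimal Python sketch that estimates when each running ARP task will next checkpoint and flags pairs that are about to land close together. It assumes progress is linear in time, that checkpoints fall exactly on each 12.5% boundary, and a 2-minute minimum gap; the task names, progress values and runtimes are made up for the example, and the helper names are hypothetical rather than anything from BOINC.

```python
import math

CHECKPOINT_STEP = 0.125  # one checkpoint per 12.5% of progress, as described in the post


def minutes_to_next_checkpoint(progress, est_runtime_min, step=CHECKPOINT_STEP):
    """Minutes until the next checkpoint, ASSUMING progress is linear in time."""
    next_boundary = (math.floor(progress / step) + 1) * step
    return (min(next_boundary, 1.0) - progress) * est_runtime_min


def close_pairs(tasks, min_gap_min=2.0):
    """Return (first, second) name pairs whose next checkpoints are < min_gap_min apart.

    `tasks` maps a task name to (progress fraction, estimated total runtime in minutes).
    The second name in each pair is the one the post suggests suspending briefly.
    """
    upcoming = sorted(
        (minutes_to_next_checkpoint(p, r), name) for name, (p, r) in tasks.items()
    )
    return [
        (a_name, b_name)
        for (a_t, a_name), (b_t, b_name) in zip(upcoming, upcoming[1:])
        if b_t - a_t < min_gap_min
    ]


# Hypothetical snapshot of three running ARP tasks.
tasks = {
    "arp_a": (0.30, 600),
    "arp_b": (0.548, 600),
    "arp_c": (0.10, 620),
}

for first, second in close_pairs(tasks):
    print(f"{second} would checkpoint soon after {first}; consider suspending it briefly")
```

With these made-up numbers, arp_a and arp_b are due to checkpoint a little over a minute apart, so arp_b is the one flagged. Real tasks do not progress perfectly linearly, so treat the output as a hint only.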
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12349 Status: Offline |
Thank you, Kevin.
Best Wishes for a Happy New Year.
60 stuck units have now been freed up, so only 70 remain.
7 days have passed since the last full report. In that time, the last of the ultra laggards has moved on 6 generations to generation 064, and their leader has moved on 7 generations to 075 and is now only 4 generations behind the oldest stuck unit.
The 70 units still stuck appear to include generation 079 x 1, 102 x 2 and 105 x 1, plus 66 others. The list will become clearer.
There are 59 units listed as extreme. 6 are the ultras, so there are 53 others, which are unstuck units. 1 unstuck unit has escaped the extreme range, so 6 unstuck units have yet to be validated.
The last of the accelerated (Priority) generations is 111 and their leader is 115. The current leading generation is 125. Each of those leaders has moved on 2 generations, so the definition for extreme laggard is still -15 generations, with accelerated remaining at -10 generations. However, there are still stuck units only in generations 102 - 110.
59,942 units have been validated in the last 7 days (up to 8,563.1 per day). There are now 2,340,653 units to be crunched to finish the project, assuming that a full generation 182 is the last.
My target for completion is now calculated as 1 October 2022. That is 16 days earlier than my last summary, assumes everyone works as now until the last day and reflects the increase in work returned.
It is likely that we will see the project finish before the end of 2022 due to the efforts to close up the laggards. My guess would be October 2022, assuming those university computers come back on line after their holiday.
Mike [Edit 2 times, last edit by Mike.Gibson at Jan 2, 2022 4:30:36 PM] |
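The completion estimate above can be reproduced approximately with a straight-line projection. The sketch below is a guess at the arithmetic, not Mike's actual spreadsheet: it divides the remaining units by the 7-day average validation rate and adds the result to a start date. The start date (the day of the post) and the function names are assumptions.

```python
from datetime import date, timedelta

VALIDATED_LAST_7_DAYS = 59_942   # from the report
UNITS_REMAINING = 2_340_653      # from the report, assuming generation 182 is the last


def days_to_finish(units_remaining, units_per_day):
    """Days needed if the validation rate stays constant (a strong assumption)."""
    return units_remaining / units_per_day


def projected_finish(start, units_remaining, units_per_day):
    """Calendar date reached after the projected number of whole days."""
    return start + timedelta(days=int(days_to_finish(units_remaining, units_per_day)))


rate = VALIDATED_LAST_7_DAYS / 7           # about 8,563.1 units per day
days = days_to_finish(UNITS_REMAINING, rate)
finish = projected_finish(date(2022, 1, 2), UNITS_REMAINING, rate)  # ASSUMED start date

print(f"{rate:.1f} units/day, {days:.1f} days, finishing around {finish.isoformat()}")
```

This gives a little over 273 days, which lands at the very start of October 2022 and is in line with the stated target; the exact day depends on which date the count starts from.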
||
|
Hype
Cruncher Germany Joined: Nov 18, 2011 Post Count: 43 Status: Offline |
Thanks for the suggestion. I have 32 threads, and 16 definitely seems to be too many. Even with 12 threads, the OPN tasks running in parallel are slowing down. I seem to get the best performance with 8 ARP in parallel; 10 might also be okay. |
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12349 Status: Offline |
RAM is also a factor. I can run ARP on all 8 threads because my machine has 20 GB of RAM, albeit taking a third more time than if I run ARP on 4 threads.
The problem with a large number of threads is that the checkpoints become more frequent. With 16 ARP threads out of 32, you are likely to have checkpoints every 5 minutes, say, if they are well spread. As they don't take the same amount of time, some checkpoints clash and clog up the machine. And if more than 2 clash you have a bigger problem. Mike |
||
|
spRocket
Senior Cruncher Joined: Mar 25, 2020 Post Count: 274 Status: Offline |
I'm running six ARPs on my Ryzen 7, and haven't noticed any undue issues. I reserve one thread for bookkeeping and GPU feeding, so I normally run 15 CPU tasks total, with a GPU task taking the 16th slot if any OPNG work units are available.
On my main cruncher, /var/lib/boinc-client lives on a ZFS storage pool: two mirrored pairs of hard drives plus a pair of SSDs providing log and cache devices. It does a beautiful job of handling any load thrown at it. The SSDs aren't showing abnormal wear, and if one goes, it's just a performance degradation until it's replaced. |
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12349 Status: Offline |
spRocket
With your Ryzen 7, I suspect you may be completing your ARP units in 12-18 hours. If that were 15 hours, you would be completing 1 every 150 minutes, so you would be checkpointing every 19 minutes. Even if it were 12 hours per unit, your checkpointing would only occur every 15 minutes, which is plenty of time for them to keep apart nearly all the time. Hype has twice as many threads, which means that his checkpointing is much more likely to clash. Mike |
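The arithmetic in this post can be written out as a small Python sketch. It assumes 8 checkpoints per work unit (one per 12.5%, the last coinciding with the end) and that the tasks are evenly staggered; the function name is hypothetical, and the 12-hour runtime used for the 16-task case is an assumption, since Hype's actual runtimes are not given in the thread.

```python
def checkpoint_interval_minutes(runtime_hours, concurrent_tasks, checkpoints_per_task=8):
    """Average minutes between checkpoints across all tasks, if evenly spread.

    ASSUMES every task takes `runtime_hours` and checkpoints
    `checkpoints_per_task` times (each 12.5% of progress).
    """
    return runtime_hours * 60 / (concurrent_tasks * checkpoints_per_task)


# spRocket's six ARP tasks, at the two runtimes discussed above.
print(checkpoint_interval_minutes(15, 6))   # 18.75, i.e. roughly every 19 minutes
print(checkpoint_interval_minutes(12, 6))   # 15.0

# 16 concurrent ARP tasks at an ASSUMED 12 hours each.
print(checkpoint_interval_minutes(12, 16))  # 5.625, close to the "every 5 minutes, say" figure
```

The interval shrinks in proportion to the number of concurrent ARP tasks, which is why clashes become much more likely on a 32-thread machine running 16 of them.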
||