World Community Grid Forums
Thread Status: Active | Total posts in this thread: 3317
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
OK. All 200 cores have been assigned to ARP1. However, I completely forgot about the existing error in the BOINC client (7.16.11) where it ignores the work queue setting at times. Looked at it this morning and had 990 WUs assigned to the machines. I set them to no new tasks for now and will have to babysit these things this week. Will get through most of them but there may be a few (< 1%) that miss the deadline.
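For reference, a minimal sketch of doing the "no new tasks" step from the command line, assuming the stock boinccmd tool that ships with the BOINC client; the project URL shown is an assumption, so use whatever boinccmd --get_project_status reports for World Community Grid:

    # Stop fetching new WCG work on this host; queued tasks keep running.
    boinccmd --project https://www.worldcommunitygrid.org/ nomorework

    # Re-enable work fetch once the backlog has been crunched down.
    boinccmd --project https://www.worldcommunitygrid.org/ allowmorework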
----------------------------------------
Mike.Gibson
Ace Cruncher | England | Joined: Aug 23, 2007 | Post Count: 12436 | Status: Offline
entity
Don't forget that they now have 8 days rather than the 7 shown in BOINC Manager (or 4.5 rather than 3.5). They are working on the deadline shown in Result Status. Check after 4 days, and delete some only if there are more left than they have already crunched.

Mike
----------------------------------------
paulch2
Cruncher | Joined: Aug 6, 2020 | Post Count: 25 | Status: Offline
I guess it takes a while for new machines to be trusted to run stragglers.
The ones I added back on 14th June have only just started seeing earlier gens, while the older, slower machines I'm running have been getting them quite often.
----------------------------------------
Stiwi
Advanced Cruncher | Joined: May 19, 2012 | Post Count: 75 | Status: Offline
> OK. All 200 cores have been assigned to ARP1. However, I completely forgot about the existing error in the BOINC client (7.16.11) where it ignores the work queue setting at times. Looked at it this morning and had 990 WUs assigned to the machines. I set them to no new tasks for now and will have to babysit these things this week. Will get through most of them but there may be a few (< 1%) that miss the deadline.

If you are running ARP on all cores you probably won't have enough L3 cache, which will increase your runtime. If I remember correctly, on my 3900X the runtime doubled when I ran ARP on 24 threads; 12 seems fine.
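For anyone who wants to cap ARP without babysitting it, a minimal app_config.xml sketch along these lines should do it. The file goes in the World Community Grid project folder inside the BOINC data directory; the app name "arp1" is an assumption based on the ARP1_* result names, so check client_state.xml for the exact name on your install:

    <!-- Cap concurrent Africa Rainfall Project tasks so they do not
         fight over L3 cache; other WCG apps fill the remaining cores.
         The app name "arp1" is a guess - verify it in client_state.xml. -->
    <app_config>
        <app>
            <name>arp1</name>
            <max_concurrent>12</max_concurrent>
        </app>
    </app_config>

After saving it, Options > Read config files in BOINC Manager should pick it up without restarting the client.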
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
You are correct, the runtime does double, but the amount of work done per day is essentially the same. The EPYC server seems to handle 128 simultaneous tasks reasonably well, averaging 24 to 26 hours per WU. If I ran 64, they would probably run in 12 to 15 hours - the same number of WUs per day, though. The biggest impact on that server is the time spent handling hardware interrupts. The machines with consumer-grade chips are a little different: the L3 cache conflict is considerably more noticeable on those machines.
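To put rough numbers on "same amount of work per day", a quick sketch using only the figures in this post (the 64-task runtime is an estimate, not a measurement):

    # Rough WUs/day for the two EPYC configurations described above.
    # Figures from the post: ~25 h/WU at 128 concurrent tasks, and an
    # estimated ~13.5 h/WU at 64 concurrent tasks.
    for concurrent, hours_per_wu in [(128, 25.0), (64, 13.5)]:
        wus_per_day = concurrent * 24 / hours_per_wu
        print(f"{concurrent} tasks at {hours_per_wu} h/WU is about {wus_per_day:.0f} WUs/day")

Both work out to roughly 110 to 125 WUs per day, which matches the observation that total throughput barely changes.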
----------------------------------------
Mike.Gibson
Ace Cruncher | England | Joined: Aug 23, 2007 | Post Count: 12436 | Status: Offline
It is better to use half the threads on ARP and the rest on other projects; that achieves the highest throughput. OPN, MCM & HST are worthwhile projects.

Mike
----------------------------------------
phytell
Cruncher | Joined: Sep 8, 2014 | Post Count: 37 | Status: Offline
@Entity:
I've been wondering how well one of those 128-thread monsters would handle ARP - thanks for posting your runtimes! If you don't mind sharing, what are you using as storage? (I can only imagine that many units would slaughter SSDs.)
----------------------------------------
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0 | Status: Offline
> entity
>
> That might be so at the moment, but they are on 8-day deadlines whereas the earlier ones are on 4.5-day deadlines, so should reappear more frequently. The earlier ones also have a higher priority, so should be turned around faster. There are 5302 pre-077 out there compared with 30307 at 077-081, but they have many more generations to get through (35227 compared with 86167), which redresses the balance to some extent.
>
> A generation is 35609 units, but a generation can be completed in 4 days at 18000 results per day. The earliest stragglers have 80 generations to get through, so are likely to take 3 months even if they get through 1 generation per day. That is most unlikely, as both copies have to validate before moving on.
>
> Your extra machines will help the project to finish sooner, but might cause a local heatwave!
>
> Mike

Mike, I'm reluctant to completely buy into your scenario. It might be true if everything were taking the maximum amount of time to complete (8 and 4.5 days). However, I'm getting copious amounts of current-generation WUs and turning them around in less than a day. If a machine gets a high-priority WU and sits on it for 4 days, I have already turned around 800 current-generation units by the time that one comes back. I don't know how Kevin has the server configured for ARP1 priority work. In other words, what is the definition of a reliable machine for ARP1?

Question for Kevin: Would it be worthwhile to redefine reliable machines for ARP1 as those that return work in less than 2 days? At any given time there probably aren't that many high-priority WUs available, since they have to run consecutively. I can make all 208 cores available for priority work only, if that helps the backlog, and they will all come back in less than 2 days.

Another question for Kevin: What is the average return time for the high-priority work? Is it considerably less than the 4.5 days?
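A back-of-envelope sketch of the timeline being debated, using only the figures quoted above (80 generations left for the earliest stragglers, and a generation can only advance once the slower of the two copies has validated):

    # Rough completion estimates for the earliest stragglers.
    # 80 remaining generations is the figure quoted above; the days per
    # generation is set by the slower of the two wingmen on each unit.
    generations_left = 80
    for days_per_generation in (0.5, 1, 2, 4):
        total_days = generations_left * days_per_generation
        print(f"{days_per_generation} days/gen: about {total_days:.0f} days ({total_days / 30:.1f} months)")

At 1 generation per day that is roughly the 3 months mentioned above; fast hosts only shorten it if both copies of each unit happen to land on fast hosts.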
----------------------------------------
Crystal Pellet
Veteran Cruncher | Joined: May 21, 2008 | Post Count: 1323 | Status: Offline
Received: ARP1_0002240_081_1 26 Jul 14:40:43 UTC
----------------------------------------
Mike.Gibson
Ace Cruncher | England | Joined: Aug 23, 2007 | Post Count: 12436 | Status: Offline
entity
If I express it differently, it might be clearer.

Firstly, your 'new' machines may have to establish their reliability. I don't know how many results they will have to return to do that, but it will be several, so a few days.

Secondly, there are only 5302 pre-077 units whereas a full generation is 35609, so there are many more new units by comparison with the stragglers. However, when a new-generation unit has been crunched, it has to wait for the next generation to be created, whereas the stragglers are automatically moved on to the next generation when they have been crunched.

These factors mean that your 'new' machines should start to get an increasing number of stragglers. My current machine is mostly getting 077 & 078 because they are over half the available units.

Mike