World Community Grid Forums
Thread Status: Active Total posts in this thread: 3195
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2153 Status: Offline
Aperture_Science_Innovators:
"I wonder why they drop the time_step (& thus make the WU take 2x as long to complete) _and_ also give it such a short deadline. Seems like a poor combination...."

There is a legitimate reason for this. Sometimes all tasks in an ARP1-workunit error out. So the error has to be fixed in some way. To resolve the issue, they found that they had to reduce the time_step. (Note: in three cases they even had to start all over again from generation 000 to resolve a particular issue.)

See this interesting post 669410 and this one (post 671450) ("3 cannot be processed even with changing the more granular step_size of 24" - BTW, Kevin means time_step there).

Also, this question about reliable machines is interesting in post 671952: "If you are reducing the deadline for extreme case, are you reducing the definition of reliable to narrow the machines eligible?"

Another question about running tasks on the fastest turnaround machines is also interesting, and the answer is: "I agree that would be nice, but no such mechanism exists in BOINC."

Also, about the three most Extreme units: "We have also discussed this with the Delft team and they know that these 3 might take another 3-4 weeks to finish after the bulk are done."

Hope this helps a lot!

Adri

PS I noticed slot number 125 on your device. (This is outrageous.)

[Edit 1 times, last edit by adriverhoef at May 8, 2023 3:27:02 PM]
MJH333
Senior Cruncher England Joined: Apr 3, 2021 Post Count: 266 Status: Offline
"Don't have access to the system to check now"

As Mike Gibson pointed out, the time_step can be identified by looking at the progress % in BOINC Manager. "If it increases by 0.014% (occasionally 0.013%) then that is a 24 second time step. If it increases by 0.021% (occasionally 0.020%) then that is the standard 36 second time step."

"I wonder why they drop the time_step (& thus make the WU take 2x as long to complete) _and_ also give it such a short deadline"

Dropping the time_step was a workaround to enable "stuck" workunits to make progress. Reducing the deadline was intended to help the laggards catch up. I agree that the short deadline doesn't work so well for 18 or 24 second time steps. I think that's why they send these laggards out to 3 machines initially, rather than 2. Once a stuck unit gets unstuck, they increase the time_step to 36 again (or at least they used to when IBM was in charge).

Cheers,
Mark

Edit: I see that Adri was too quick for me and has already answered the question rather better than I did!

[Edit 1 times, last edit by MJH333 at May 8, 2023 3:27:28 PM]
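The progress-increment rule of thumb quoted above can be turned into a small check. The following is only a sketch: it assumes that progress rises by one fixed increment per model step, and that the standard 36-second time_step corresponds to about 4800 steps per task (inferred from the quoted 0.021%, not confirmed by the project). The function names are made up for illustration.

```python
# Assumption: progress grows by one fixed increment per model step, so the
# increment is proportional to time_step. Calibrated on the figures quoted
# in the post: roughly 0.021% per step at the standard 36-second time_step.
KNOWN_TIME_STEPS = (18, 24, 36)      # seconds; the values mentioned in this thread
STANDARD_TIME_STEP = 36              # seconds
STANDARD_INCREMENT = 100.0 / 4800    # percent per step (assumed 4800 steps)


def guess_time_step(progress_increment_percent):
    """Return the known time_step closest to the observed progress increment."""
    estimate = STANDARD_TIME_STEP * progress_increment_percent / STANDARD_INCREMENT
    return min(KNOWN_TIME_STEPS, key=lambda ts: abs(ts - estimate))


def relative_runtime(time_step):
    """Runtime relative to a 36-second task, assuming constant cost per step."""
    return STANDARD_TIME_STEP / time_step


if __name__ == "__main__":
    for increment in (0.013, 0.014, 0.020, 0.021):
        ts = guess_time_step(increment)
        print(f"{increment:.3f}% per update -> time_step {ts} s, "
              f"about {relative_runtime(ts):.1f}x the standard runtime")
```

Under these assumptions a 24-second task needs about 1.5 times the standard runtime, and an 18-second task about twice the standard runtime.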
Aperture_Science_Innovators
Advanced Cruncher United States Joined: Jul 6, 2009 Post Count: 139 Status: Offline
Thank you both. I knew I'd find some experts here who could weigh in. Looks like they sent out another three copies of the WU about two hours ago after none of the devices got it in on time. I suppose I'll see what happens this evening when my copy comes back, if it's deemed "too late" or if they abort one of the resends.
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2153 Status: Offline
Mark:
"Edit: I see that Adri was too quick for me and has already answered the question rather better than I did!"

After some digging in the vast vaults of the forum.

Just another thought: are they perhaps sending all Extreme workunits to 3 machines initially?

Adri

[Edit 1 times, last edit by adriverhoef at May 8, 2023 5:38:48 PM]
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12349 Status: Offline
I notice that 2 of the extremes recently flagged were sent out as newbies with 36 hour deadlines. This seems to be a departure from the previous rule for extreme newbies of 72 hours. 36 hours used to be used for extreme re-sends after the half-way mark of the original deadline.
This can create a problem when the timestep is reduced, causing the unit to take longer to crunch.

To summarise recent posts:
Normals go to 2 machines with 6 day deadlines.
Accelerated go to 2 machines with 3 day deadlines.
Extremes go to 3 machines with 3 day deadlines.
Re-sends replace errors or no replies, with the deadline halved after the halfway point of the original unit.
Units with reduced timesteps have their 36 secs. timestep restored after 2 or 3 successful generations.

Or that was how IBM did it!

Mike

[Edit 1 times, last edit by Mike.Gibson at May 8, 2023 11:46:32 PM]
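The summary above can be written down as a small lookup table. This is a sketch of the rules as reported in this thread, not of actual server configuration: the dictionary layout, the function name, and the reading of "after the halfway point" as elapsed time since the first send are all assumptions.

```python
# Rules as summarised in the post above (how IBM reportedly handled it).
# Layout and names are illustrative; nothing here is taken from WCG code.
DISPATCH_RULES = {
    "normal":      {"copies": 2, "deadline_hours": 6 * 24},
    "accelerated": {"copies": 2, "deadline_hours": 3 * 24},
    "extreme":     {"copies": 3, "deadline_hours": 3 * 24},
}


def resend_deadline_hours(category, hours_since_first_send):
    """Deadline for a re-send: halved once the original deadline is half used up."""
    original = DISPATCH_RULES[category]["deadline_hours"]
    if hours_since_first_send > original / 2:
        return original // 2
    return original


if __name__ == "__main__":
    # An extreme unit re-sent 40 hours after the first send (past the
    # 36-hour halfway mark) would get a 36-hour deadline under these rules.
    print(resend_deadline_hours("extreme", 40))
```

This matches the earlier observation that 36 hours used to be the deadline for extreme re-sends after the halfway mark of the original deadline.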
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12349 Status: Offline
Sunday Report
Sorry this is a day late. I missed the 7 May output so have interpolated between 6 & 8 May.

All categories have been released this week.
977 units have been validated this week.
Assuming that a full generation 182 will be the last, there are 1,637,090 units still outstanding.
I will re-start my forecasting once output stabilises.
The definitions of accelerated and extreme are unchanged.
There are still 35 Extremes and 57 Accelerated units listed as none have moved. The numbers in their generations are 2,200 & 3,837.

Mike
paulch2
Cruncher Joined: Aug 6, 2020 Post Count: 25 Status: Offline
Some of the extremes seem to be 36 hours rather than 3 days
https://www.worldcommunitygrid.org/contribution/workunit/301765111
paulch2
Cruncher Joined: Aug 6, 2020 Post Count: 25 Status: Offline
"Some of the extremes seem to be 36 hours rather than 3 days
https://www.worldcommunitygrid.org/contribution/workunit/301765111"

Ignore that. Mike already mentioned it.
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2153 Status: Offline
Aperture_Science_Innovators:
"I suppose I'll see what happens this evening when my copy comes back, if it's deemed 'too late' or if they abort one of the resends."

If you return it within 24 hours of the needed validations, you should be safe. So as long as nobody returns a result, you have at least 24 hours to return your result, as long as the deadline is met by the others. Plus, they have an extra 24 hours if they don't meet the deadline if you didn't meet the deadline. However, if you return your result later than 24 hours after the needed results were received, then you're out of luck.

Example 1: Our task, OPNG_0184849_00003_1, missed the deadline ("Due time") by some 30 hours (!); luckily the last needed wingman returned their result, OPNG_0184849_00003_2, only some 20 hours and 38 minutes before we did:
$ wcgstats -wJSrrr= 300214794

Example 2: Our task, OPNG_0184849_00005_1, missed the deadline ("Due time") by some 28 hours (!), only to find that the last needed wingman had returned their result, OPNG_0184849_00005_2, more than 24 hours (!) before we did:
$ wcgstats -wJSrr= 300214793

Adri

PS I just noticed one result is now Pending Validation:
Result name  OS type  Status  Sent time  Due / Return time  CPUtime/Elapsed  Claimed/Granted
Whoopsie, another one coming in right at this time:
ARP1_0033793_102_2  Linux Debian  P.Val.  2023-05-07 03:22:47  2023-05-09 09:23:19  18.89/19  1,422/0

[Edit 1 times, last edit by adriverhoef at May 9, 2023 9:32:13 AM]
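The 24-hour rule described above can be sketched as a simple check. This assumes the window works exactly as stated in the post (a late result still counts if it arrives no more than 24 hours after the last needed result); it is a rule of thumb from this thread, not documented server behaviour, and the function name and timestamps below are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: the 24-hour window is the rule of thumb from the post above.
GRACE = timedelta(hours=24)


def late_result_still_counts(our_return, deadline, last_needed_return=None):
    """Return True if our result should still be accepted under the 24-hour rule."""
    if our_return <= deadline:
        return True  # returned on time, nothing to worry about
    if last_needed_return is None:
        return True  # the needed results have not all come in yet
    return our_return - last_needed_return <= GRACE


if __name__ == "__main__":
    deadline = datetime(2023, 5, 1, 12, 0)  # hypothetical due time

    # Like Example 1: 30 hours late, last wingman returned 20 h 38 min earlier.
    ours = deadline + timedelta(hours=30)
    wingman = ours - timedelta(hours=20, minutes=38)
    print(late_result_still_counts(ours, deadline, wingman))

    # Like Example 2: 28 hours late, last wingman returned over 24 hours earlier.
    ours = deadline + timedelta(hours=28)
    wingman = ours - timedelta(hours=25)
    print(late_result_still_counts(ours, deadline, wingman))
```

The first case falls inside the window and the second falls outside it, mirroring the two examples in the post.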
Aperture_Science_Innovators
Advanced Cruncher United States Joined: Jul 6, 2009 Post Count: 139 Status: Offline
Yup, that Linux Mint device is mine. 41 hours of CPU time is astonishing. Hopefully when a third one comes back it proves to be what the task needed though and validates without a hitch.