World Community Grid Forums
Thread Status: Active Total posts in this thread: 3204
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 952 Status: Offline |
Thank you, Al.
Yup; much to my chagrin I missed that one in my scan of that thread because it was in sam6861's forensic analysis post rather than as a failure report per se :-( -- as he'd just reported 33395 I probably misread the forensic one to be the same...
That is an unstuck unit from adriverhoef's September thread. I could only guess that the reason for 32 bit might be because one of the 3 machines is 32 bit.
I'd assumed that, so I've been looking at wingmen for all my recent 32-bit jobs (rather than just the unstuck tasks...) and most of the time there's nothing obvious about the O/S in use - typically a 64-bit system... Sometimes, however, it's obvious what's going on as there's a 32-bit kernel in play, albeit not that often!
Mike
However, I suspect that some of those systems actually have a 32-bit client without the configuration adjustment for 64-bit tasks; that would go a long way to explaining it.
Cheers - Al. |
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline |
A quick update before we head into the weekend. There were three work units that we had to restart from the beginning. These have now been reset and generation 0 has been sent out. They are the units:
I have one final workunit that I'm rerunning clean jobs on that will get submitted into the grid tomorrow. At that point all of the units will be back running on the grid.*
One change that we have made is that because those three units have so far to catch up we need them to advance quickly. In order to do that we are reducing the report deadline for 'extreme' jobs. This will ensure that those three don't get stuck on a generation for very long and that they have the best chance to catch up with the pack before the project ends. Note that this change will impact all of the 'extreme' jobs. We will monitor things to make sure that we don't see a sudden spike in jobs not being finished by the deadline due to this change (we don't expect it to be a problem, but we will watch for it anyway).
*Note that the report that counts the number of work-units running as 'extreme' doesn't count the first generation after I put them on the grid. Once the first generation runs successfully, its child will get the flag that makes it run as an extreme job and then it will show up in the report. |
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12359 Status: Offline |
Thank you, Kevin.
If you are reducing the deadline for extreme case, are you reducing the definition of reliable to narrow the machines eligible?
As there are only 3 units restarting at 000, could you not have a short list of the fastest machines to receive them? That way they would close up faster.
There are currently only 261 days to my calculated completion date and those 3 units will need to be run through 183 generations, which means they will need to turn around every 1.43 days (34 hours) to meet that date.
I would suggest that only machines regularly returning units within 18 hours should be eligible for those 3 units. The other extremes could continue as now. That would rule me out, but help the project!
Mike |
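The turnaround figure in the post above can be checked with a couple of lines of arithmetic. This is only a sketch using the two numbers Mike quotes (261 days, 183 generations); the function name is made up for illustration and is not part of any WCG or BOINC tool.

```python
# Check of the turnaround arithmetic quoted in the post above.
# required_turnaround is a hypothetical helper name, used only for illustration.

def required_turnaround(days_remaining, generations_remaining):
    """Return the time budget per generation as (days, hours)."""
    days = days_remaining / generations_remaining
    return days, days * 24.0

# Figures taken from the post: 261 days left, 183 generations still to run.
days_per_gen, hours_per_gen = required_turnaround(261, 183)
print(f"{days_per_gen:.2f} days per generation (~{hours_per_gen:.0f} hours)")
# prints "1.43 days per generation (~34 hours)"
```

Note that this is an average budget per generation, so any generation that overruns has to be paid back by faster ones later.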
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12359 Status: Offline |
We now have 3 new ultras in generation 000 which includes the one that was stuck in generation 079.
Only 2 of the old ultras are still discernible.
There will be a reduced reporting deadline for the extremes. This new deadline has not been notified so if anyone picks one up, please post here.
There is an error in the latest daily text file ....stats/state.txt. The max_generation for the extremes should be 118.
There are 123 units in the extreme range but only 114 listed as extreme. This means that there are 9 units which have yet to complete their first generation since being re-started. This includes the 3 in generation 000.
My current calculated completion date forecast for ARP is 22 October so those 3 units which have restarted will have to turn around very quickly (34 hours average for the entire period).
Mike |
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline |
"If you are reducing the deadline for extreme case, are you reducing the definition of reliable to narrow the machines eligible?"
No - while I would like to do this, the reliable mechanism in the BOINC code applies to everything on World Community Grid, and making this change would cause too many jobs to need a reliable host, which would in turn gum up the works.
"As there are only 3 units restarting at 000, could you not have a short list of the fastest machines to receive them? That way they would close up faster."
I agree that would be nice, but no such mechanism exists in BOINC.
"There are currently only 261 days to my calculated completion date and those 3 units will need to be run through 183 generations, which means they will need to turn around every 1.43 days (34 hours) to meet that date."
That's in line with my estimate. Currently the extremes average 38 hours while the median is 29 hours (as an aside, 62% of the workunits finish within 34 hrs). The difference between the average and the median is the hard luck cases that stretch out to 100+ hours. As long as these three jobs have a minimal number of hard luck cases, they should be able to be close by the end of the project. In the absence of better tools to target these at specific hosts, the next best tool is to make sure that the worst case time to finish a single generation isn't too bad. That way, if the average is pushed closer to the median (and in particular, below 34 hours), they should be able to catch up. Once most of the extreme workunits become accelerated we can look at tightening the report deadline for those three further to be only slightly longer than the reliable threshold.
We have also discussed this with the Delft team and they know that these 3 might take another 3-4 weeks to finish after the bulk are done. |
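To see why pushing the average towards the median matters, here is a rough projection using only numbers from the posts above (183 generations, 261 days, and per-generation times of 38, 34 and 29 hours). It is a sketch under a strong simplifying assumption: every generation takes exactly the stated time and the next one starts immediately. The names are illustrative and not taken from any WCG report.

```python
# Rough projection of how long 183 generations take at a fixed per-generation time.
# Assumption (not from the thread): generations run back-to-back with no gaps.

GENERATIONS_REMAINING = 183        # from Mike's post
DAYS_TO_FORECAST_COMPLETION = 261  # from Mike's post

def projected_days(hours_per_generation):
    """Total days needed if every generation takes the same number of hours."""
    return GENERATIONS_REMAINING * hours_per_generation / 24.0

# 38 h = current average, 34 h = required pace, 29 h = current median.
for label, hours in [("average", 38), ("required", 34), ("median", 29)]:
    total = projected_days(hours)
    slack = DAYS_TO_FORECAST_COMPLETION - total
    print(f"{label:>8}: {hours} h/gen -> {total:.2f} days (slack {slack:+.2f} days)")
```

At the current 38-hour average the three units would finish roughly four weeks after the forecast date, which is of the same order as the 3-4 weeks mentioned above, although the post does not say how that figure was derived.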
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline |
"There will be a reduced reporting deadline for the extremes. This new deadline has not been notified so if anyone picks one up, please post here."
The base deadline assigned to the workunit is changed from 7 -> 3.5 days. However, there are some complications due to the way that the deadline is modified based on the reliable mechanism. Most of the jobs are getting assigned a 2.75 day deadline. |
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12359 Status: Offline |
Kevin
Thank you for your replies.
I note that 7 divided by 2 is 3.5 and 3.5 divided by 2 is 1.75. Is the 2.75 arrived at by adding the 1.75 and the 1 leeway allowed part way through the project?
Did you find the error on ....stats/state?
Cheers
Mike |
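Mike's question above boils down to a small piece of arithmetic, sketched below. The second halving and the one-day leeway are Mike's guess at how 2.75 days might arise; the posts above do not confirm that this is how the deadline is actually calculated, and the variable names are purely illustrative.

```python
# Mike's hypothesised route from the 7-day base deadline to 2.75 days.
# This reproduces the guess in the post; it is NOT a confirmed BOINC/WCG formula.

OLD_BASE_DEADLINE_DAYS = 7.0
LEEWAY_DAYS = 1.0  # the "1 leeway" Mike refers to

new_base_deadline = OLD_BASE_DEADLINE_DAYS / 2   # 3.5 days, as stated by knreed
halved_again = new_base_deadline / 2             # 1.75 days (Mike's assumption)
guessed_deadline = halved_again + LEEWAY_DAYS    # 2.75 days

print(new_base_deadline, halved_again, guessed_deadline)
# prints "3.5 1.75 2.75"
```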
||
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 951 Status: Offline |
Thank you Kevin for taking the time to let us know what is going on with the project. I love the updates, and getting to see what gen each of the tasks is on.
Thanks to Mike also for posting quick updates. I think the best thing we can all do is keep our own queues as small as possible, and keep reminding others to keep their queues small. This project is very stable (unlike seti where a queue was needed) so one really doesn't need much of a queue. So close to getting my year badge, but I'm going to stick with this project, as it is the only one on WCG with feedback. Hopefully there will be a new project to jump to when this one is done. |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7660 Status: Offline |
I have ARP1_0000026_123. I don't recall seeing a number as low as 0000026 in a long time. It has a 5 day deadline.
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
Robokapp
Senior Cruncher Joined: Feb 6, 2012 Post Count: 249 Status: Offline |
Woke up this morning to find ARP1_0031093_123_0 and ARP1_0019553_123_2 had failed to initiate. I was squinting at my list, wondering "why does it look like there's fewer than 8 running", and then I found them when I scrolled down the task list, stuck at 0.000% and "high Priority".
I abandoned both so... these two buggers are still in limbo. Sorry all. My little Intel couldn't. :D |
||
|