World Community Grid Forums
Thread Status: Active Total posts in this thread: 3281
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline |
> Thank you, Kevin. Now we know. Have you allowed for 0033711_099 and 0034392_089 restuck this week?
> Cheers
> Mike

0033711_099 has been fixed, resubmitted to the grid and validated: https://www.worldcommunitygrid.org/contribution/workunit/127861651

0034392_089 has been fixed, resubmitted to the grid and is currently running: https://www.worldcommunitygrid.org/contribution/workunit/131568984

[edit - and apparently entity is running 0034392_089 as noted above]
[Edit 1 times, last edit by knreed at Jan 26, 2022 3:36:03 PM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
> Of the remaining 41, we are re-running them on our servers before we put them back on the grid. This is a slightly slow process but it allows us to be sure that we understand the issue and that they are running properly before we send them out again. Some have to be re-run for multiple generations and as a result we have only been able to put 4-5 back into circulation each day. With the help of Delft, we have a way to detect the issue in the validator so we will identify this issue in the first generation it occurs in from now on so we shouldn't get these stuck jobs again (we will still have to periodically re-run the jobs with a smaller step size). I hope to have the remaining 41 running again within the next 7-10 days.

Will they come back into the grid at the same generation as when they were stuck or will they come back to the grid at a future generation? |
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline |
> > Of the remaining 41, we are re-running them on our servers before we put them back on the grid. This is a slightly slow process but it allows us to be sure that we understand the issue and that they are running properly before we send them out again. Some have to be re-run for multiple generations and as a result we have only been able to put 4-5 back into circulation each day. With the help of Delft, we have a way to detect the issue in the validator so we will identify this issue in the first generation it occurs in from now on so we shouldn't get these stuck jobs again (we will still have to periodically re-run the jobs with a smaller step size). I hope to have the remaining 41 running again within the next 7-10 days.
>
> Will they come back into the grid at the same generation as when they were stuck or will they come back to the grid at a future generation?

I expect to put them into the grid at the same generation as when they were stuck (reserving, of course, the right to do something differently if something unexpected comes up). |
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12409 Status: Recently Active |
No need for any backtracking, then?
Mike |
||
|
MJH333
Senior Cruncher England Joined: Apr 3, 2021 Post Count: 268 Status: Offline |
Mike,
I've got ARP1_0034245_107, a triplet with a 36-second time step. Could that be a formerly stuck unit?
Cheers, Mark |
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12409 Status: Recently Active |
Thank you, Mark.
All the extremes are either ultras, which seem to be in the low 10000s, or unstuck units, which seem to be in the 30000s. The recently unstuck units will have 24-second time steps for a few generations before reverting to 36-second ones.
Mike |
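For anyone who wants to sort their own task list along these lines, here is a minimal Python sketch. It assumes task names follow the pattern seen in this thread, `ARP1_<unit>_<generation>` with an optional trailing number (as in ARP1_0033880_104_1); treating that trailing number as a replica index is an assumption. The function names and the numeric ranges are hypothetical, chosen only to mirror the "seem to be" observations above, and are not official project definitions.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from names quoted in this thread, e.g. ARP1_0034245_107
# or ARP1_0033880_104_1. Reading the optional last field as a replica index
# is an assumption.
_NAME_RE = re.compile(r"^ARP1_(\d{7})_(\d{3})(?:_(\d+))?$")


class ArpTask(NamedTuple):
    unit: int                # e.g. 34245
    generation: int          # e.g. 107
    replica: Optional[int]   # e.g. 1, or None when the name has no suffix


def parse_task_name(name: str) -> ArpTask:
    """Split a task name into unit number, generation and optional replica."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised task name: {name!r}")
    unit, generation, replica = match.groups()
    return ArpTask(int(unit), int(generation),
                   int(replica) if replica is not None else None)


def rough_category(task: ArpTask,
                   ultra_range: range = range(10000, 11000),
                   unstuck_range: range = range(30000, 40000)) -> str:
    """Label a task using guessed unit-number ranges (illustrative only)."""
    if task.unit in ultra_range:
        return "possible ultra"
    if task.unit in unstuck_range:
        return "possible unstuck unit"
    return "other"


if __name__ == "__main__":
    for name in ("ARP1_0034245_107", "ARP1_0033880_104_1"):
        task = parse_task_name(name)
        print(name, task, rough_category(task))
```

The ranges are keyword arguments so they can be adjusted as more units are observed; a unit number alone cannot confirm whether a task was ever stuck.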
||
|
MJH333
Senior Cruncher England Joined: Apr 3, 2021 Post Count: 268 Status: Offline |
Thanks Mike.
I've now got ARP1_0034390_092, another triplet with a 36-second time step. This has errored out on one of the initial 3 machines with an unhandled exception error.
Cheers, Mark |
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12409 Status: Recently Active |
That is one to watch. 34392, which could be near, errored out in 089.
Mike |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
> That is one to watch. 34392, which could be near, errored out in 089.
> Mike

It is still running after 33 hours. Supposedly, it has about 10 hours left. |
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 971 Status: Recently Active |
Mike,
Just noticed I've got ARP1_0033880_104_1, which is part of a triplet with a 36-second time step. For what it's worth, it's running the 32-bit executable :-(
Cheers - Al. |
||