World Community Grid Forums
Category: Beta Testing Forum: Beta Test Support Forum Thread: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ] |
Thread Status: Active Total posts in this thread: 118
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
The agent is supposed to retain a task in memory when paused/suspended, even if LAIM is off, until it writes the first checkpoint. This implies the task reset for another reason, which one would expect to be recorded in the event or result log. For running tasks, the latter can be found in the slots.
----------------------------------------[Edit 1 times, last edit by Former Member at Aug 20, 2014 8:17:28 AM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
And here's a sad case caused by run-dependent convergence.
BETA_ E225108_ 655_ S.328.C42H26N6O1.BXDZVSAWISJAKK-UHFFFAOYSA-N.4_ s1_ 14_ 2-- 640 Valid 19/08/14 22:20:26 20/08/14 01:05:22 0.82 25.0 / 42.8
BETA_ E225108_ 655_ S.328.C42H26N6O1.BXDZVSAWISJAKK-UHFFFAOYSA-N.4_ s1_ 14_ 1-- 640 Invalid 18/08/14 17:58:40 19/08/14 12:29:52 6.84 185.6 / 185.6
BETA_ E225108_ 655_ S.328.C42H26N6O1.BXDZVSAWISJAKK-UHFFFAOYSA-N.4_ s1_ 14_ 0-- 640 Valid 18/08/14 17:45:29 18/08/14 19:14:02 1.17 42.8 / 42.8
Guess which WU continued to Job#6 |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Was the validation for cep2 not based on 'whoever did least work', meaning if one did 5 tasks and another 6, the first 5 were compared and the 'excess' assumed to be fine? Maybe just a matter of more validator tuning.
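To make the 'whoever did least work' idea concrete, here is a minimal sketch of that kind of comparison. It is not the actual CEP2 validator: the function name `compare_common_jobs`, the idea of one number per finished job, and the tolerance are all assumptions made up for illustration.

```python
from typing import Sequence


def compare_common_jobs(a: Sequence[float], b: Sequence[float],
                        tol: float = 1e-6) -> bool:
    """Compare two results only over the jobs both of them finished.

    Hypothetical model: each finished job is reduced to one number.
    Any 'excess' jobs in the longer result are not checked at all.
    """
    n = min(len(a), len(b))
    if n == 0:
        return False  # nothing in common, so nothing can be confirmed
    return all(abs(x - y) <= tol for x, y in zip(a[:n], b[:n]))


# One result finished 5 jobs, the other 6: only the first 5 are compared.
five_jobs = [1.0, 2.0, 3.0, 4.0, 5.0]
six_jobs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.5]
print(compare_common_jobs(five_jobs, six_jobs))
```

Under this scheme a result that stops early can still agree with one that ran much longer, which may be what a tuned validator would need to allow for.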
|
||
|
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1294 Status: Offline Project Badges: |
And here's a sad case caused by run-dependent convergence.
BETA_ E225108_ 655_ S.328.C42H26N6O1.BXDZVSAWISJAKK-UHFFFAOYSA-N.4_ s1_ 14_ 2-- 640 Valid 19/08/14 22:20:26 20/08/14 01:05:22 0.82 25.0 / 42.8
BETA_ E225108_ 655_ S.328.C42H26N6O1.BXDZVSAWISJAKK-UHFFFAOYSA-N.4_ s1_ 14_ 1-- 640 Invalid 18/08/14 17:58:40 19/08/14 12:29:52 6.84 185.6 / 185.6
BETA_ E225108_ 655_ S.328.C42H26N6O1.BXDZVSAWISJAKK-UHFFFAOYSA-N.4_ s1_ 14_ 0-- 640 Valid 18/08/14 17:45:29 18/08/14 19:14:02 1.17 42.8 / 42.8
Guess which WU continued to Job#6

This looks like a serious problem. I can't look over somebody else's shoulder, but probably the 2 Valids were finished, because of the 30 sec. heartbeat. The application exited likely with RC = 0x1. For BOINC that's a successful finish.

Edit: I got 1 resend for verification:
BETA_ E225108_ 119_ S.328.C40H24N8O1.LRPCEOGSRDGPRS-UHFFFAOYSA-N.5_ s1_ 14_ 2-- - In Progress 8/20/14 08:09:48 8/24/14 08:09:48 0.00 0.0 / 0.0 <-- mine
BETA_ E225108_ 119_ S.328.C40H24N8O1.LRPCEOGSRDGPRS-UHFFFAOYSA-N.5_ s1_ 14_ 1-- 640 Pending Verification 8/18/14 17:41:36 8/20/14 08:09:39 9.75 249.8 / 0.0
BETA_ E225108_ 119_ S.328.C40H24N8O1.LRPCEOGSRDGPRS-UHFFFAOYSA-N.5_ s1_ 14_ 0-- 640 Pending Verification 8/18/14 17:41:32 8/18/14 21:28:59 1.19 35.5 / 0.0

The shorter PVer's exited during job #0:
Result Name: BETA_ E225108_ 119_ S.328.C40H24N8O1.LRPCEOGSRDGPRS-UHFFFAOYSA-N.5_ s1_ 14_ 0--
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[22:15:40] Number of jobs = 8
[22:15:40] Starting job 0,CPU time has been restored to 0.000000.
Application exited with RC = 0x1
[23:28:29] Finished Job #0
[23:28:29] Starting job 1,CPU time has been restored to 4287.172682.
[23:28:29] Skipping Job #1
[23:28:29] Starting job 2,CPU time has been restored to 4287.172682.
[23:28:29] Skipping Job #2
[23:28:29] Starting job 3,CPU time has been restored to 4287.172682.
[23:28:29] Skipping Job #3
[23:28:29] Starting job 4,CPU time has been restored to 4287.172682.
[23:28:29] Skipping Job #4
[23:28:29] Starting job 5,CPU time has been restored to 4287.172682.
[23:28:29] Skipping Job #5
[23:28:29] Starting job 6,CPU time has been restored to 4287.172682.
[23:28:29] Skipping Job #6
[23:28:29] Starting job 7,CPU time has been restored to 4287.172682.
[23:28:29] Skipping Job #7
23:28:30 (11876): called boinc_finish
</stderr_txt>
]]>
[Edit 2 times, last edit by Crystal Pellet at Aug 20, 2014 9:11:33 AM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Crystal,
All 3 cases finished with RC = 0x1, _0 (mine) & _2 during Job#0, _1 during Job#6. The Result Logs for all 3 have no sign of a heartbeat message or restart. Here's mine.

INFO: No state to restore. Start from the beginning.
[18:49:49] Number of jobs = 8
[18:49:49] Starting job 0,CPU time has been restored to 0.000000.
Application exited with RC = 0x1
[20:05:25] Finished Job #0
[20:05:25] Starting job 1,CPU time has been restored to 4223.617874.
[20:05:25] Skipping Job #1
[20:05:25] Starting job 2,CPU time has been restored to 4223.617874.
[20:05:25] Skipping Job #2
[20:05:25] Starting job 3,CPU time has been restored to 4223.617874.
[20:05:26] Skipping Job #3
[20:05:26] Starting job 4,CPU time has been restored to 4223.617874.
[20:05:26] Skipping Job #4
[20:05:26] Starting job 5,CPU time has been restored to 4223.617874.
[20:05:26] Skipping Job #5
[20:05:26] Starting job 6,CPU time has been restored to 4223.617874.
[20:05:26] Skipping Job #6
[20:05:26] Starting job 7,CPU time has been restored to 4223.617874.
[20:05:26] Skipping Job #7
20:05:27 (8572): called boinc_finish

It looks more like convergence failed in Job#0 in _0 & _2 but managed to continue to Job#6 in _1. It'll be interesting to see the validation outcome of your PVer example! |
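For anyone wanting to check their own Result Logs the same way, here is a rough sketch that pulls out the job in which the "RC =" line appears and which jobs were finished or skipped. It relies only on the message wording seen in the logs quoted in this thread; the function name `summarize_log` and the returned keys are my own invention, and other log variants may well not match.

```python
import re

# Patterns taken from the log lines quoted in this thread.
TOKEN = re.compile(
    r"Starting job (?P<start>\d+)"
    r"|Application exited with RC = (?P<rc>0x[0-9a-fA-F]+)"
    r"|Finished Job #(?P<fin>\d+)"
    r"|Skipping Job #(?P<skip>\d+)"
)


def summarize_log(text: str) -> dict:
    """Scan a result log (one line or many) and summarize job progress."""
    current = None           # job most recently announced as starting
    exit_job = exit_rc = None
    finished, skipped = [], []
    for m in TOKEN.finditer(text):
        if m.group("start") is not None:
            current = int(m.group("start"))
        elif m.group("rc") is not None:
            if exit_job is None:  # keep only the first exit message
                exit_job, exit_rc = current, m.group("rc")
        elif m.group("fin") is not None:
            finished.append(int(m.group("fin")))
        else:
            skipped.append(int(m.group("skip")))
    return {"exit_job": exit_job, "exit_rc": exit_rc,
            "finished": finished, "skipped": skipped}


# First few lines of the log quoted above.
sample = """[18:49:49] Starting job 0,CPU time has been restored to 0.000000.
Application exited with RC = 0x1
[20:05:25] Finished Job #0
[20:05:25] Starting job 1,CPU time has been restored to 4223.617874.
[20:05:25] Skipping Job #1
[20:05:25] Starting job 2,CPU time has been restored to 4223.617874.
[20:05:25] Skipping Job #2"""
print(summarize_log(sample))
# {'exit_job': 0, 'exit_rc': '0x1', 'finished': [0], 'skipped': [1, 2]}
```

Because it scans for the messages rather than splitting on lines, it should also cope with a log pasted as one long line.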
||
|
branjo
Master Cruncher Slovakia Joined: Jun 29, 2012 Post Count: 1892 Status: Offline Project Badges: |
tonyh205 wrote:
Patrick, so LAIM is on but a suspended wu restarts from the beginning - hmmm, can't explain that

It happened to me too yesterday - LAIM on, 7.0.65, Mac OS X Mavericks, i5-2500S, 12 GB RAM. I was working on the PC (no RAM-eating or heavy-IO app running) and the WU restarted from the latest checkpoint (lost 1.75 hours of work). The WU finished normally (7 h CPU time) and it is in PVal prison now.

Cheers

ETA: here is the part with the above-mentioned "problem": Result Log

Crunching@Home since January 13 2000. Shrubbing@Home since January 5 2006
[Edit 3 times, last edit by branjo at Aug 20, 2014 10:43:49 AM] |
||
|
Crystal Pellet
Veteran Cruncher Joined: May 21, 2008 Post Count: 1294 Status: Offline Project Badges: |
Tony wrote:
It'll be interesting to see the validation outcome of your PVer example!

The 2 already finished and validated on that machine took 14 and 16.4 hours. If this one also runs that long, it will be after midnight here, so I'll report tomorrow.
----------------------------------------[Edit 1 times, last edit by Crystal Pellet at Aug 20, 2014 11:57:44 AM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Mmmm. Another thing that I'm not sure is a pure beta issue but it's a beta WU and a new validator, so here goes.
WU BETA_ E225108_ 412_ S.328.C42H27N7.LLZNKJDGUWRARQ-UHFFFAOYSA-N.15_ s1_ 14_ 0 ran for 18 hours on one of my machines and was killed "because cpu time has been exceeded". It was still in the first sub-job. However, it was allocated the status "Error" (rather than "Invalid") and zero points. The wingman made it beyond the first sub-job and a third copy was issued for validation of that. I quite understand that my machine did nothing useful, but I don't see why this was regarded as an error because there is insufficient information to evaluate the validity of the processing that the machine did before the application terminated. I would have expected it to have been regarded as invalid on the basis of a 'no user fault' problem. Similarly, I think zero points is a bit harsh. Again, just my 2p'th. |
||
|
littlepeaks
Veteran Cruncher USA Joined: Apr 28, 2007 Post Count: 748 Status: Offline Project Badges: |
tonyh205:
It looks more like convergence failed in Job#0 in _0 & _2 but managed to continue to Job#6 in _1. It'll be interesting to see the validation outcome of your PVer example!

I had one that exited in Job#6, while two other "contestants" exited in Job#0. Mine turned from PV to Error, whereas the other two got Valids. It was BETA_ E225108_ 711_ S.328.C42H26N6O1.RJMUXNLAPBPODN-UHFFFAOYSA-N.20_ s1_ 14_ 1--
----------------------------------------[Edit 1 times, last edit by littlepeaks at Aug 20, 2014 7:48:12 PM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I had one that exited in Job#6, while two other "contestants" exited in job#0. Mine turned from PV to error, whereas the other two got valids.

That's odd, if it was a "success"-type exit from Job#6. What does the Result Log show for your unit? The common successful exit seems to give the line "Application exited with RC = 0x1". |
||
|
|