Thread Status: Active
Total posts in this thread: 118
This topic has been viewed 11396 times and has 117 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

The agent is supposed to retain a task in memory when paused/suspended, even if LAIM is off, until it writes the first checkpoint. This implies the task reset for another reason, which one would expect to be recorded in the event log or the result log. For running tasks, the latter can be found in the slots directory.
----------------------------------------
[Edit 1 times, last edit by Former Member at Aug 20, 2014 8:17:28 AM]
[Aug 20, 2014 8:07:11 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

And here's a sad case caused by run-dependent convergence.

BETA_ E225108_ 655_ S.328.C42H26N6O1.BXDZVSAWISJAKK-UHFFFAOYSA-N.4_ s1_ 14_ 2-- 640 Valid 19/08/14 22:20:26 20/08/14 01:05:22 0.82 25.0 / 42.8
BETA_ E225108_ 655_ S.328.C42H26N6O1.BXDZVSAWISJAKK-UHFFFAOYSA-N.4_ s1_ 14_ 1-- 640 Invalid 18/08/14 17:58:40 19/08/14 12:29:52 6.84 185.6 / 185.6
BETA_ E225108_ 655_ S.328.C42H26N6O1.BXDZVSAWISJAKK-UHFFFAOYSA-N.4_ s1_ 14_ 0-- 640 Valid 18/08/14 17:45:29 18/08/14 19:14:02 1.17 42.8 / 42.8

Guess which WU continued to Job#6 [crying]
[Aug 20, 2014 8:16:42 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

Wasn't the validation for CEP2 based on 'whoever did least work', meaning that if one result did 5 jobs and another 6, the first 5 were compared and the 'excess' was assumed to be fine? Maybe it's just a matter of more validator tuning.
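To make the 'whoever did least work' idea concrete, here is a minimal sketch of such a comparison. It is only an illustration of the rule as guessed at above, not the actual CEP2 validator: the function name `compare_common_jobs`, the tolerance, and the per-job numbers are all hypothetical.

```python
def compare_common_jobs(results_a, results_b, tol=1e-6):
    """Compare only the jobs that BOTH results completed.

    results_a / results_b are hypothetical lists holding one number per
    finished job. Jobs beyond the shorter list (the 'excess') are not
    checked at all, which is the behaviour guessed at in the post.
    """
    n = min(len(results_a), len(results_b))
    agree = all(abs(a - b) <= tol
                for a, b in zip(results_a[:n], results_b[:n]))
    return n, agree


# Made-up placeholder values: one result stopped after 5 jobs, the other after 6.
five_jobs = [1.0, 2.0, 3.0, 4.0, 5.0]
six_jobs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(compare_common_jobs(five_jobs, six_jobs))
```

Note that under this rule a result that stops before finishing any job gives n = 0, and the comparison passes trivially because there is nothing to compare.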
[Aug 20, 2014 8:34:07 AM]
Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1294
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

And here's a sad case caused by run-dependent convergence.

BETA_ E225108_ 655_ S.328.C42H26N6O1.BXDZVSAWISJAKK-UHFFFAOYSA-N.4_ s1_ 14_ 2-- 640 Valid 19/08/14 22:20:26 20/08/14 01:05:22 0.82 25.0 / 42.8
BETA_ E225108_ 655_ S.328.C42H26N6O1.BXDZVSAWISJAKK-UHFFFAOYSA-N.4_ s1_ 14_ 1-- 640 Invalid 18/08/14 17:58:40 19/08/14 12:29:52 6.84 185.6 / 185.6
BETA_ E225108_ 655_ S.328.C42H26N6O1.BXDZVSAWISJAKK-UHFFFAOYSA-N.4_ s1_ 14_ 0-- 640 Valid 18/08/14 17:45:29 18/08/14 19:14:02 1.17 42.8 / 42.8

Guess which WU continued to Job#6 [crying]

This looks like a serious problem.

I can't look over somebody else's shoulder, but the 2 Valids probably finished early because of the 30-second heartbeat timeout.
The application likely exited with RC = 0x1. For BOINC, that's a successful finish.

Edit:
I got 1 resend for verification:
BETA_ E225108_ 119_ S.328.C40H24N8O1.LRPCEOGSRDGPRS-UHFFFAOYSA-N.5_ s1_ 14_ 2-- - In Progress 8/20/14 08:09:48 8/24/14 08:09:48 0.00 0.0 / 0.0 <-- mine
BETA_ E225108_ 119_ S.328.C40H24N8O1.LRPCEOGSRDGPRS-UHFFFAOYSA-N.5_ s1_ 14_ 1-- 640 Pending Verification 8/18/14 17:41:36 8/20/14 08:09:39 9.75 249.8 / 0.0
BETA_ E225108_ 119_ S.328.C40H24N8O1.LRPCEOGSRDGPRS-UHFFFAOYSA-N.5_ s1_ 14_ 0-- 640 Pending Verification 8/18/14 17:41:32 8/18/14 21:28:59 1.19 35.5 / 0.0

The shorter PVer exited during job #0:

Result Name: BETA_ E225108_ 119_ S.328.C40H24N8O1.LRPCEOGSRDGPRS-UHFFFAOYSA-N.5_ s1_ 14_ 0--

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[22:15:40] Number of jobs = 8
[22:15:40] Starting job 0,CPU time has been restored to 0.000000.
Application exited with RC = 0x1
[23:28:29] Finished Job #0
[23:28:29] Starting job 1,CPU time has been restored to 4287.172682.
[23:28:29] Skipping Job #1
[23:28:29] Starting job 2,CPU time has been restored to 4287.172682.
[23:28:29] Skipping Job #2
[23:28:29] Starting job 3,CPU time has been restored to 4287.172682.
[23:28:29] Skipping Job #3
[23:28:29] Starting job 4,CPU time has been restored to 4287.172682.
[23:28:29] Skipping Job #4
[23:28:29] Starting job 5,CPU time has been restored to 4287.172682.
[23:28:29] Skipping Job #5
[23:28:29] Starting job 6,CPU time has been restored to 4287.172682.
[23:28:29] Skipping Job #6
[23:28:29] Starting job 7,CPU time has been restored to 4287.172682.
[23:28:29] Skipping Job #7
23:28:30 (11876): called boinc_finish

</stderr_txt>
]]>
----------------------------------------

----------------------------------------
[Edit 2 times, last edit by Crystal Pellet at Aug 20, 2014 9:11:33 AM]
[Aug 20, 2014 8:38:46 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

Crystal,

All 3 cases finished with RC = 0x1, _0 (mine) & _2 during Job#0, _1 during Job#6. The Result Logs for all 3 have no sign of a heartbeat message or restart. Here's mine.

INFO: No state to restore. Start from the beginning.
[18:49:49] Number of jobs = 8
[18:49:49] Starting job 0,CPU time has been restored to 0.000000.
Application exited with RC = 0x1
[20:05:25] Finished Job #0
[20:05:25] Starting job 1,CPU time has been restored to 4223.617874.
[20:05:25] Skipping Job #1
[20:05:25] Starting job 2,CPU time has been restored to 4223.617874.
[20:05:25] Skipping Job #2
[20:05:25] Starting job 3,CPU time has been restored to 4223.617874.
[20:05:26] Skipping Job #3
[20:05:26] Starting job 4,CPU time has been restored to 4223.617874.
[20:05:26] Skipping Job #4
[20:05:26] Starting job 5,CPU time has been restored to 4223.617874.
[20:05:26] Skipping Job #5
[20:05:26] Starting job 6,CPU time has been restored to 4223.617874.
[20:05:26] Skipping Job #6
[20:05:26] Starting job 7,CPU time has been restored to 4223.617874.
[20:05:26] Skipping Job #7
20:05:27 (8572): called boinc_finish

It looks more like convergence failed in Job#0 in _0 & _2 but managed to continue to Job#6 in _1.

It'll be interesting to see the validation outcome of your PVer example!
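For anyone who wants to check this kind of reading across several Result Logs, here is a rough Python sketch that reports in which job each "Application exited with RC" line occurred and which jobs were skipped afterwards. The regular expressions are based only on the log lines quoted in this thread, and the function name `summarize` is made up for the illustration.

```python
import re

# Patterns taken from the log lines quoted in this thread.
START = re.compile(r"Starting job (\d+),")
EXIT_RC = re.compile(r"Application exited with RC = (0x[0-9a-fA-F]+)")
SKIP = re.compile(r"Skipping Job #(\d+)")


def summarize(log_text):
    """Return ([(job, rc), ...] for every exit line, [skipped job numbers])."""
    current_job = None
    exits = []
    skipped = []
    for line in log_text.splitlines():
        m = START.search(line)
        if m:
            current_job = int(m.group(1))
            continue
        m = EXIT_RC.search(line)
        if m:
            # Attribute the exit to the most recently started job.
            exits.append((current_job, m.group(1)))
            continue
        m = SKIP.search(line)
        if m:
            skipped.append(int(m.group(1)))
    return exits, skipped


# First lines of the _0 log quoted above.
sample = """\
INFO: No state to restore. Start from the beginning.
[18:49:49] Number of jobs = 8
[18:49:49] Starting job 0,CPU time has been restored to 0.000000.
Application exited with RC = 0x1
[20:05:25] Finished Job #0
[20:05:25] Starting job 1,CPU time has been restored to 4223.617874.
[20:05:25] Skipping Job #1
[20:05:25] Starting job 2,CPU time has been restored to 4223.617874.
[20:05:25] Skipping Job #2
"""

print(summarize(sample))  # ([(0, '0x1')], [1, 2])
```

The sketch only says where the exit happened. Whether RC = 0x1 means a convergence failure is still just the interpretation above.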
[Aug 20, 2014 9:21:57 AM]
branjo
Master Cruncher
Slovakia
Joined: Jun 29, 2012
Post Count: 1892
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

tonyh205 wrote:
Patrick, so LAIM is on but a suspended WU restarts from the beginning - hmmm, can't explain that [sad]


It also happened to me yesterday: LAIM on, 7.0.65, Mac OS X Mavericks, i5-2500S, 12 GB RAM. I was working on the PC (no RAM-eating or I/O-heavy app running) and the WU restarted from the latest checkpoint (I lost 1.75 hours of work). The WU finished normally (7 h CPU time) and is in PVal prison now.

Cheers [peace]

ETA: here is the part of the log with the above-mentioned "problem":

Result Log  

Result Name: BETA_ E225108_ 905_ S.328.C40F3H25N4O1.AQNWUMLLBGZFFC-UHFFFAOYSA-N.7_ s1_ 14_ 0--

<core_client_version>7.0.65</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[12:40:58] Number of jobs = 8
...
[18:05:04] End of Job
[18:05:10] Finished Job #5
[18:05:10] Starting job 6,CPU time has been restored to 18098.625686.
[18:05:13] Starting new Job
[18:05:13] Qink name = fldman
[18:05:21] Qink name = gesman
[18:05:22] Qink name = scfman
No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
Parent process was killed, exiting
[19:52:39] Number of jobs = 8
[19:52:39] Starting job 6,CPU time has been restored to 18098.625686.
[19:52:42] Starting new Job
[19:52:42] Qink name = fldman
[19:52:49] Qink name = gesman
[19:52:51] Qink name = scfman
No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[21:45:04] Number of jobs = 8
[21:45:04] Starting job 6,CPU time has been restored to 18098.625686.
[21:45:08] Starting new Job
[21:45:08] Qink name = fldman
[21:45:16] Qink name = gesman
[21:45:18] Qink name = scfman
Application exited with RC = 0x100
[23:51:26] Finished Job #6
[23:51:26] Starting job 7,CPU time has been restored to 25120.430338.
[23:51:26] Skipping Job #7
called boinc_finish

</stderr_txt>
]]>
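A quick way to see how often a task was hit by the heartbeat exit, and in which job, is to count the "No heartbeat from core client" lines per started job. This is only a sketch based on the log lines shown above. The helper name `heartbeat_exits` is made up, and the sample is a shortened copy of the log.

```python
import re
from collections import Counter

START = re.compile(r"Starting job (\d+),")
HEARTBEAT = "No heartbeat from core client"


def heartbeat_exits(log_text):
    """Count heartbeat exits, keyed by the job that was running at the time."""
    current_job = None
    exits = Counter()
    for line in log_text.splitlines():
        m = START.search(line)
        if m:
            current_job = int(m.group(1))
        elif HEARTBEAT in line:
            exits[current_job] += 1
    return dict(exits)


# Shortened copy of the Job #6 part of the log above.
sample = """\
[18:05:10] Starting job 6,CPU time has been restored to 18098.625686.
[18:05:22] Qink name = scfman
No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[19:52:39] Starting job 6,CPU time has been restored to 18098.625686.
[19:52:51] Qink name = scfman
No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[21:45:04] Starting job 6,CPU time has been restored to 18098.625686.
Application exited with RC = 0x100
"""

print(heartbeat_exits(sample))  # {6: 2}
```

In this log Job #6 was restarted twice, and both times the CPU time was restored to the same 18098.625686 seconds.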

----------------------------------------

Crunching@Home since January 13 2000. Shrubbing@Home since January 5 2006

----------------------------------------
[Edit 3 times, last edit by branjo at Aug 20, 2014 10:43:49 AM]
[Aug 20, 2014 9:41:29 AM]
Crystal Pellet
Veteran Cruncher
Joined: May 21, 2008
Post Count: 1294
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

Tony wrote:

It'll be interesting to see the validation outcome of your PVer example!
We have to be patient. The task has been running for 3 hours now.
The 2 tasks already finished and validated on that machine took 14 and 16.4 hours.
If this one also runs that long, it will be after midnight here, so I'll report tomorrow.
----------------------------------------

----------------------------------------
[Edit 1 times, last edit by Crystal Pellet at Aug 20, 2014 11:57:44 AM]
[Aug 20, 2014 11:55:38 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

Mmmm. Another thing that I'm not sure is a pure beta issue but it's a beta WU and a new validator, so here goes.

WU BETA_ E225108_ 412_ S.328.C42H27N7.LLZNKJDGUWRARQ-UHFFFAOYSA-N.15_ s1_ 14_ 0 ran for 18 hours on one of my machines and was killed "because cpu time has been exceeded". It was still in the first sub-job. However, it was allocated the status "Error" (rather than "Invalid") and zero points. The wingman made it beyond the first sub-job and a third copy was issued for validation of that.

I quite understand that my machine did nothing useful, but I don't see why this was regarded as an error because there is insufficient information to evaluate the validity of the processing that the machine did before the application terminated. I would have expected it to have been regarded as invalid on the basis of a 'no user fault' problem. Similarly, I think zero points is a bit harsh.

Again, just my 2p'th.
[Aug 20, 2014 7:34:26 PM]
littlepeaks
Veteran Cruncher
USA
Joined: Apr 28, 2007
Post Count: 748
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

tonyh205:


It looks more like convergence failed in Job#0 in _0 & _2 but managed to continue to Job#6 in _1.

It'll be interesting to see the validation outcome of your PVer example!


I had one that exited in Job#6, while two other "contestants" exited in job#0. Mine turned from PV to error, whereas the other two got valids.
It was BETA_ E225108_ 711_ S.328.C42H26N6O1.RJMUXNLAPBPODN-UHFFFAOYSA-N.20_ s1_ 14_ 1--
----------------------------------------
[Edit 1 times, last edit by littlepeaks at Aug 20, 2014 7:48:12 PM]
[Aug 20, 2014 7:44:58 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Clean Energy Project - Phase 2 BETA test new workunits - Aug 15, 2014 [ Issues Thread ]

I had one that exited in Job#6, while two other "contestants" exited in job#0. Mine turned from PV to error, whereas the other two got valids.
That's odd, if it was a "success"-type exit from Job#6. What does the Result Log show for your unit? The common successful exit seems to give the line "Application exited with RC = 0x1".
[Aug 20, 2014 8:05:24 PM]