Thread Status: Active
Total posts in this thread: 14
This topic has been viewed 1911 times and has 13 replies.
simjoe
Cruncher
Joined: Dec 4, 2013
Post Count: 35
Status: Offline
The new WU's result all in error on all of my machines

Hello, the new WUs are all ending in error on my 3 Win7 boxes.

Here is a log of one of the jobs :


Result Log

Result Name: E225005_ 228_ S.162.C21H13N1O2.BIHMRZOPJMFTPB-UHFFFAOYSA-N.5_ s1_ 14_ 0--
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[07:25:47] Number of jobs = 8
[07:25:47] Starting job 0,CPU time has been restored to 0.000000.
[08:11:20] Finished Job #0
[08:11:20] Starting job 1,CPU time has been restored to 2666.259891.
[08:14:27] Finished Job #1
[08:14:27] Starting job 2,CPU time has been restored to 2849.249064.
[08:18:21] Finished Job #2
[08:18:21] Starting job 3,CPU time has been restored to 3077.416127.
[08:22:37] Finished Job #3
[08:22:37] Starting job 4,CPU time has been restored to 3328.983340.
[08:25:26] Finished Job #4
[08:25:26] Starting job 5,CPU time has been restored to 3494.484800.
[08:27:34] Finished Job #5
[08:27:34] Starting job 6,CPU time has been restored to 3618.801997.
[08:52:26] Finished Job #6
[08:52:26] Starting job 7,CPU time has been restored to 5090.484231.
[09:43:28] Finished Job #7
09:43:30 (1944): called boinc_finish

</stderr_txt>
]]>

I have suspended CEP2 for now.

rgds.
----------------------------------------
[Edit 1 times, last edit by simjoe at Aug 2, 2014 11:00:52 AM]
[Aug 2, 2014 9:21:02 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: The new WU's result all in error on all of my machines

Please confirm that you checked under My Contributions > Result Status whether they truly went into error. The log does not say so: it shows the work unit processing properly and finishing through job 7.
[Aug 2, 2014 11:18:35 AM]
cjslman
Master Cruncher
Mexico
Joined: Nov 23, 2004
Post Count: 2082
Status: Offline
Re: The new WU's result all in error on all of my machines

I have 5 WUs that have ended in error (out of 8, which I assume are from the new batch). All 5 WUs made it to job 7. Running on Windows 7.
Result Name: E225004_ 358_ S.154.C9H3N7S3.JSKHZHVDTCQGAR-UHFFFAOYSA-N.2_ s1_ 14_ 0--
<core_client_version>7.2.47</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[03:19:40] Number of jobs = 8
[03:19:40] Starting job 0,CPU time has been restored to 0.000000.
[04:17:51] Finished Job #0
[04:17:51] Starting job 1,CPU time has been restored to 3012.816113.
[04:20:39] Finished Job #1
[04:20:39] Starting job 2,CPU time has been restored to 3157.319839.
[04:25:08] Finished Job #2
[04:25:08] Starting job 3,CPU time has been restored to 3392.709748.
[04:28:50] Finished Job #3
[04:28:50] Starting job 4,CPU time has been restored to 3588.615804.
[04:31:42] Finished Job #4
[04:31:42] Starting job 5,CPU time has been restored to 3736.847954.
[04:33:18] Finished Job #5
[04:33:18] Starting job 6,CPU time has been restored to 3817.032468.
[04:58:39] Finished Job #6
[04:58:39] Starting job 7,CPU time has been restored to 5131.403293.
[05:50:00] Finished Job #7
05:50:02 (120): called boinc_finish

</stderr_txt>
]]>

These are the WUs:
E225005_ 439_ S.162.C20H12N2O2.NOLKOJKGODBZTF-UHFFFAOYSA-N.2_ s1_ 14_ 0-- R8XZ5B4 Error 8/2/14 05:33:06 8/2/14 11:50:41 3.07 / 3.21 57.3 / 0.0
E225004_ 358_ S.154.C9H3N7S3.JSKHZHVDTCQGAR-UHFFFAOYSA-N.2_ s1_ 14_ 0-- R8XZ5B4 Error 8/2/14 05:33:06 8/2/14 11:50:41 2.16 / 2.27 40.5 / 0.0
E225003_ 8_ S.142.C15F3H9O2.YNEUFDIDKUKGDE-UHFFFAOYSA-N.2_ s1_ 14_ 1-- R8XZ5B4 Error 8/2/14 04:08:43 8/2/14 08:38:26 2.35 / 2.46 44.3 / 0.0
E225003_ 94_ S.142.C17H11N3O1.YIENJBITZGIPPA-UHFFFAOYSA-N.2_ s1_ 14_ 1-- R8XZ5B4 Error 8/2/14 04:08:43 8/2/14 08:38:26 3.18 / 3.31 59.6 / 0.0
E225001_ 323_ S.104.C11H8N2S1.PHVZUFNHHKCGGM-UHFFFAOYSA-N.1_ s1_ 14_ 0-- R8XZ5B4 Error 8/2/14 02:51:05 8/2/14 06:43:55 1.01 / 1.08 19.5 / 0.0

So what do I do? Keep on crunching the CEP2 WUs or abort?

Thanks,

CJSL

Crunching for a better world...
----------------------------------------
I follow the Gimli philosophy: "Keep breathing. That's the key. Breathe."
Join The Cahuamos Team


[Aug 2, 2014 12:28:58 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: The new WU's result all in error on all of my machines

Hi CJSL,

I am not an expert at decoding the error messages, but if it's got to job 7 then I think that may be the final job - are they simply timing out? The last three calculations in this new setup are much more computationally expensive so this could be the case.

Your Harvard CEP Team
[Aug 2, 2014 12:38:45 PM]
ca05065
Senior Cruncher
Joined: Dec 4, 2007
Post Count: 325
Status: Offline
Re: The new WU's result all in error on all of my machines

Job 7 is the final job. So far all validated work units in the E225999 series have errored.
Some appear to work normally, with no error messages:
Result Name: E225008_ 152_ S.174.C18H10N8.HSPIONFNEKQQNT-UHFFFAOYSA-N.1_ s1_ 14_ 0--
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[08:45:32] Number of jobs = 8
[08:45:32] Starting job 0,CPU time has been restored to 0.000000.
[09:23:58] Finished Job #0
[09:23:58] Starting job 1,CPU time has been restored to 2281.109022.
[09:27:17] Finished Job #1
[09:27:17] Starting job 2,CPU time has been restored to 2476.406674.
[09:32:08] Finished Job #2
[09:32:08] Starting job 3,CPU time has been restored to 2763.542115.
[09:36:40] Finished Job #3
[09:36:40] Starting job 4,CPU time has been restored to 3031.863835.
[09:39:51] Finished Job #4
[09:39:51] Starting job 5,CPU time has been restored to 3219.751439.
[09:42:25] Finished Job #5
[09:42:25] Starting job 6,CPU time has been restored to 3367.359585.
[10:12:46] Finished Job #6
[10:12:46] Starting job 7,CPU time has been restored to 5172.821559.
[11:10:50] Finished Job #7
11:10:53 (11084): called boinc_finish

</stderr_txt>

Another has a 0x1 error in job 6:

Result Log

Result Name: E225009_ 473_ S.176.C21H13N3O2.DTEHWBYFTDJDRU-UHFFFAOYSA-N.4_ s1_ 14_ 0--
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[11:19:40] Number of jobs = 8
[11:19:40] Starting job 0,CPU time has been restored to 0.000000.
[12:10:01] Finished Job #0
[12:10:01] Starting job 1,CPU time has been restored to 2975.282272.
[12:13:27] Finished Job #1
[12:13:27] Starting job 2,CPU time has been restored to 3177.880771.
[12:17:55] Finished Job #2
[12:17:55] Starting job 3,CPU time has been restored to 3440.664455.
[12:23:02] Finished Job #3
[12:23:02] Starting job 4,CPU time has been restored to 3742.978793.
[12:26:45] Finished Job #4
[12:26:45] Starting job 5,CPU time has been restored to 3962.597001.
[12:29:16] Finished Job #5
[12:29:16] Starting job 6,CPU time has been restored to 4109.503143.
Application exited with RC = 0x1
[13:26:29] Finished Job #6
[13:26:29] Starting job 7,CPU time has been restored to 7517.937792.
[13:26:29] Skipping Job #7
13:26:31 (9808): called boinc_finish

</stderr_txt>

A few months ago a 0x1 error in job 12 was a normal valid completion.
Is the validator set up correctly for this new batch?
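For anyone comparing their own result logs, here is a rough Python sketch that pulls the per-job wall-clock time, any "Application exited with RC" code, and any skipped jobs out of a pasted stderr log. It only assumes the line formats visible in the logs in this thread; the function name `summarize` and the `SAMPLE` excerpt are just for illustration, and it says nothing about how the validator itself decides.

```python
import re
from datetime import datetime, timedelta

# Patterns match only the line shapes seen in the logs posted in this thread.
START = re.compile(r"^\[(\d\d:\d\d:\d\d)\] Starting job (\d+),")
FINISH = re.compile(r"^\[(\d\d:\d\d:\d\d)\] Finished Job #(\d+)")
SKIP = re.compile(r"^\[(\d\d:\d\d:\d\d)\] Skipping Job #(\d+)")
RC = re.compile(r"^Application exited with RC = (0x[0-9a-fA-F]+)")


def summarize(stderr_text):
    """Return {job: {"minutes": float or None, "rc": str or None, "skipped": bool}}."""
    jobs = {}
    started = {}
    current = None
    for raw in stderr_text.splitlines():
        line = raw.strip()
        if m := START.match(line):
            current = int(m.group(2))
            started[current] = datetime.strptime(m.group(1), "%H:%M:%S")
            jobs[current] = {"minutes": None, "rc": None, "skipped": False}
        elif m := RC.match(line):
            # The RC line carries no job number; attribute it to the job in progress.
            if current is not None:
                jobs[current]["rc"] = m.group(1)
        elif m := FINISH.match(line):
            n = int(m.group(2))
            if n in started:
                delta = datetime.strptime(m.group(1), "%H:%M:%S") - started[n]
                if delta < timedelta(0):
                    delta += timedelta(days=1)  # log rolled past midnight
                jobs[n]["minutes"] = round(delta.total_seconds() / 60, 1)
        elif m := SKIP.match(line):
            n = int(m.group(2))
            jobs.setdefault(n, {"minutes": None, "rc": None, "skipped": False})
            jobs[n]["skipped"] = True
    return jobs


# Excerpt from the E225009_ 473 log above (jobs 5 to 7 only).
SAMPLE = """\
[12:26:45] Starting job 5,CPU time has been restored to 3962.597001.
[12:29:16] Finished Job #5
[12:29:16] Starting job 6,CPU time has been restored to 4109.503143.
Application exited with RC = 0x1
[13:26:29] Finished Job #6
[13:26:29] Starting job 7,CPU time has been restored to 7517.937792.
[13:26:29] Skipping Job #7
13:26:31 (9808): called boinc_finish
"""

for job, info in sorted(summarize(SAMPLE).items()):
    print(job, info)
```

On that excerpt, job 6 shows up with RC 0x1 and job 7 as skipped, which makes it easy to separate the "ran cleanly through job 7" results from the "exited in job 6" ones.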
[Aug 2, 2014 12:49:51 PM]
cjslman
Master Cruncher
Mexico
Joined: Nov 23, 2004
Post Count: 2082
Status: Offline
Re: The new WU's result all in error on all of my machines

If I look at one of the WUs that didn't error out (Pending Validation), the result log looks the same (finished job 7).
"I am not an expert at decoding the error messages, but if it's got to job 7 then I think that may be the final job - are they simply timing out? The last three calculations in this new setup are much more computationally expensive so this could be the case."
I have no idea what "timing out" means in reference to CEP2 WUs or if that is something bad/unrecoverable.
I'm sure that this will be hitting (or already is) all the CEP2 crunchers and a determination needs to be made as to what is to be done:
- Do we keep crunching (because the error is a false positive)? or
- Do we abort (because it is a true error)?

I hope the person(s) that can guide us hasn't/haven't left for the weekend...

[EDIT]: All the WUs that errored out ran for 1-3 hours

Thanks,
CJSL

Crunching for a brighter future...
----------------------------------------
I follow the Gimli philosophy: "Keep breathing. That's the key. Breathe."
Join The Cahuamos Team


----------------------------------------
[Edit 1 times, last edit by cjslman at Aug 2, 2014 1:25:44 PM]
[Aug 2, 2014 1:01:52 PM]
ca05065
Senior Cruncher
Joined: Dec 4, 2007
Post Count: 325
Status: Offline
Re: The new WU's result all in error on all of my machines

CEP2 jobs used to be set to time out after 12 hours. This was increased to 18 hours a few months ago. The percentage completed is calculated using this 18 hours and not the estimated time.
My work units are completing in less than 3 hours.
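As a rough illustration of why the timeout looks unlikely here: if the percentage really is elapsed time over the 18-hour limit, a unit that finishes in about 3 hours never gets anywhere near 100%. The sketch below assumes a simple linear ratio, which is my reading of the description and not something confirmed from the CEP2 application itself; `percent_complete` is a made-up helper name.

```python
TIMEOUT_HOURS = 18  # time limit quoted in this post (previously 12)


def percent_complete(elapsed_hours, limit_hours=TIMEOUT_HOURS):
    """Assumed linear progress: elapsed time as a share of the time limit, capped at 100."""
    return min(100.0, 100.0 * elapsed_hours / limit_hours)


# A work unit that ends after roughly 3 hours is far below the limit.
print(round(percent_complete(3), 1))
```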
[Aug 2, 2014 1:09:14 PM]
ca05065
Senior Cruncher
Joined: Dec 4, 2007
Post Count: 325
Status: Offline
Re: The new WU's result all in error on all of my machines

Has anyone seen a E225xxx series work unit which has passed validation?
[Aug 2, 2014 1:13:11 PM]
AgrFan
Senior Cruncher
USA
Joined: Apr 17, 2008
Post Count: 376
Status: Offline
Re: The new WU's result all in error on all of my machines

This unit failed in Job #6 with a 0x100. I'm pretty sure this RC passed validation with the previous batches. Looks like the validator is not configured for the new batches.

E225009_ 481_ S.176.C23H15N1O2.JYMUQWGEOQUDBZ-UHFFFAOYSA-N.3_ s1_ 14_ 0-- 640 Error 8/2/14 09:40:29 8/2/14 12:55:14 3.11 46.3 / 0.0

[07:39:23] Starting job 6,CPU time has been restored to 6856.130000.
[07:39:23] Starting new Job
[07:39:23] Qink name = fldman
[07:39:25] Qink name = gesman
[07:39:26] Qink name = scfman
Application exited with RC = 0x100
[08:51:51] Finished Job #6
[08:51:51] Starting job 7,CPU time has been restored to 11113.500000.
[08:51:51] Skipping Job #7
08:51:53 (27967): called boinc_finish

</stderr_txt>
]]>
----------------------------------------
[Edit 2 times, last edit by AgrFan at Aug 2, 2014 1:23:54 PM]
[Aug 2, 2014 1:18:36 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: The new WU's result all in error on all of my machines

Hi all,

I have alerted the IBM team to this. It is my feeling that they are false errors due to the new-look jobs, i.e. the validation process is looking for the existence of files which are no longer meant to be there, and this could be a hangover from the overlap of the end of one library with the start of the new one. I will keep you all updated as soon as I hear anything, but don't panic. From my point of view, the fact that it is getting to the later jobs is a positive thing, and means it is unlikely to be truly erroring.
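To make that hypothesis concrete, here is a toy Python sketch of how a file-existence check written for the old library could flag a perfectly complete result from the new one. Everything in it is hypothetical: the file names, the job count used for the old library, and the `missing_files` helper are invented for illustration and are not taken from the real validator.

```python
def missing_files(expected, produced):
    """Return the expected files that are absent from the result, in expected order."""
    produced_set = set(produced)
    return [name for name in expected if name not in produced_set]


# Hypothetical: a checklist left over from the old library, with more jobs per work unit.
old_checklist = [f"job_{n}.out" for n in range(13)]

# Hypothetical: a new-library work unit that ran all 8 of its jobs (0 through 7).
new_result = [f"job_{n}.out" for n in range(8)]

missing = missing_files(old_checklist, new_result)
status = "Error" if missing else "Valid"
print(status, missing)
```

If something like this is what is happening, the result would be marked as an error even though every job the new work unit contains finished normally.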

I will try my best to remain as responsive as possible over the weekend!

Your Harvard CEP Team
[Aug 2, 2014 1:28:33 PM]