World Community Grid Forums
Thread Status: Active Total posts in this thread: 14
simjoe
Cruncher Joined: Dec 4, 2013 Post Count: 35 Status: Offline |
Hello, the new WUs all end in error on my 3 Win7 boxes.
Here is a log of one of the jobs:

Result Log
Result Name: E225005_ 228_ S.162.C21H13N1O2.BIHMRZOPJMFTPB-UHFFFAOYSA-N.5_ s1_ 14_ 0--
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[07:25:47] Number of jobs = 8
[07:25:47] Starting job 0,CPU time has been restored to 0.000000.
[08:11:20] Finished Job #0
[08:11:20] Starting job 1,CPU time has been restored to 2666.259891.
[08:14:27] Finished Job #1
[08:14:27] Starting job 2,CPU time has been restored to 2849.249064.
[08:18:21] Finished Job #2
[08:18:21] Starting job 3,CPU time has been restored to 3077.416127.
[08:22:37] Finished Job #3
[08:22:37] Starting job 4,CPU time has been restored to 3328.983340.
[08:25:26] Finished Job #4
[08:25:26] Starting job 5,CPU time has been restored to 3494.484800.
[08:27:34] Finished Job #5
[08:27:34] Starting job 6,CPU time has been restored to 3618.801997.
[08:52:26] Finished Job #6
[08:52:26] Starting job 7,CPU time has been restored to 5090.484231.
[09:43:28] Finished Job #7
09:43:30 (1944): called boinc_finish
</stderr_txt>
]]>

I have suspended CEP2 for the time being.
Rgds.
[Edit 1 times, last edit by simjoe at Aug 2, 2014 11:00:52 AM]
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Please confirm that you checked under My Contributions > Result Status whether they truly ended in error. The log does not say so; it shows the work unit processing properly and finishing through job 7.
|
||
|
cjslman
Master Cruncher Mexico Joined: Nov 23, 2004 Post Count: 2082 Status: Offline |
I have 5 WUs that have ended in error.

Result Name: E225004_ 358_ S.154.C9H3N7S3.JSKHZHVDTCQGAR-UHFFFAOYSA-N.2_ s1_ 14_ 0--
<core_client_version>7.2.47</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[03:19:40] Number of jobs = 8
[03:19:40] Starting job 0,CPU time has been restored to 0.000000.
[04:17:51] Finished Job #0
[04:17:51] Starting job 1,CPU time has been restored to 3012.816113.
[04:20:39] Finished Job #1
[04:20:39] Starting job 2,CPU time has been restored to 3157.319839.
[04:25:08] Finished Job #2
[04:25:08] Starting job 3,CPU time has been restored to 3392.709748.
[04:28:50] Finished Job #3
[04:28:50] Starting job 4,CPU time has been restored to 3588.615804.
[04:31:42] Finished Job #4
[04:31:42] Starting job 5,CPU time has been restored to 3736.847954.
[04:33:18] Finished Job #5
[04:33:18] Starting job 6,CPU time has been restored to 3817.032468.
[04:58:39] Finished Job #6
[04:58:39] Starting job 7,CPU time has been restored to 5131.403293.
[05:50:00] Finished Job #7
05:50:02 (120): called boinc_finish
</stderr_txt>
]]>

These are the WUs:
E225005_ 439_ S.162.C20H12N2O2.NOLKOJKGODBZTF-UHFFFAOYSA-N.2_ s1_ 14_ 0-- R8XZ5B4 Error 8/2/14 05:33:06 8/2/14 11:50:41 3.07 / 3.21 57.3 / 0.0
E225004_ 358_ S.154.C9H3N7S3.JSKHZHVDTCQGAR-UHFFFAOYSA-N.2_ s1_ 14_ 0-- R8XZ5B4 Error 8/2/14 05:33:06 8/2/14 11:50:41 2.16 / 2.27 40.5 / 0.0
E225003_ 8_ S.142.C15F3H9O2.YNEUFDIDKUKGDE-UHFFFAOYSA-N.2_ s1_ 14_ 1-- R8XZ5B4 Error 8/2/14 04:08:43 8/2/14 08:38:26 2.35 / 2.46 44.3 / 0.0
E225003_ 94_ S.142.C17H11N3O1.YIENJBITZGIPPA-UHFFFAOYSA-N.2_ s1_ 14_ 1-- R8XZ5B4 Error 8/2/14 04:08:43 8/2/14 08:38:26 3.18 / 3.31 59.6 / 0.0
E225001_ 323_ S.104.C11H8N2S1.PHVZUFNHHKCGGM-UHFFFAOYSA-N.1_ s1_ 14_ 0-- R8XZ5B4 Error 8/2/14 02:51:05 8/2/14 06:43:55 1.01 / 1.08 19.5 / 0.0

So what do I do? Keep on crunching the CEP2 WUs or abort?
Thanks,
CJSL
Crunching for a better world...
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi CJSL,
I am not an expert at decoding the error messages, but if it's got to job 7 then I think that may be the final job. Are they simply timing out? The last three calculations in this new setup are much more computationally expensive, so this could be the case.
Your Harvard CEP Team
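To see how the cost is spread across jobs, the CPU totals printed in the logs can be differenced. The following is only a rough sketch: it assumes that the "CPU time has been restored to" value is the cumulative CPU seconds at the start of each job, which is how the quoted logs read but is not confirmed by the project. The function name `cpu_seconds_per_job` is made up for illustration.

```python
import re

# Matches e.g. "Starting job 1,CPU time has been restored to 2666.259891."
RESTORED = re.compile(r"Starting job (\d+),CPU time has been restored to (\d+\.\d+)")

def cpu_seconds_per_job(log_text):
    """Estimate CPU seconds spent in each job from one stderr log.

    Assumption: the restored value is the cumulative CPU time at the
    start of the job, so the difference between consecutive values is
    the CPU time of the earlier job. The last job has no following
    value and is therefore left out.
    """
    marks = [(int(job), float(total)) for job, total in RESTORED.findall(log_text)]
    return {
        job: round(next_total - total, 3)
        for (job, total), (_, next_total) in zip(marks, marks[1:])
    }

SAMPLE = (
    "[07:25:47] Starting job 0,CPU time has been restored to 0.000000. "
    "[08:11:20] Finished Job #0 "
    "[08:11:20] Starting job 1,CPU time has been restored to 2666.259891. "
    "[08:14:27] Finished Job #1 "
    "[08:14:27] Starting job 2,CPU time has been restored to 2849.249064."
)

if __name__ == "__main__":
    for job, seconds in cpu_seconds_per_job(SAMPLE).items():
        print(f"job {job}: {seconds} CPU seconds")
```

Applied to the full first log in this thread, this gives roughly 2666 s for job 0, between about 124 s and 252 s for each of jobs 1 to 5, and about 1472 s for job 6. Job 7 cannot be estimated this way because no CPU total is printed after it.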
||
|
ca05065
Senior Cruncher Joined: Dec 4, 2007 Post Count: 325 Status: Offline |
Job 7 is the final job. So far all validated work units in the E225999 series have errored.
Some appear to work normally with no error messages:

Result Name: E225008_ 152_ S.174.C18H10N8.HSPIONFNEKQQNT-UHFFFAOYSA-N.1_ s1_ 14_ 0--
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[08:45:32] Number of jobs = 8
[08:45:32] Starting job 0,CPU time has been restored to 0.000000.
[09:23:58] Finished Job #0
[09:23:58] Starting job 1,CPU time has been restored to 2281.109022.
[09:27:17] Finished Job #1
[09:27:17] Starting job 2,CPU time has been restored to 2476.406674.
[09:32:08] Finished Job #2
[09:32:08] Starting job 3,CPU time has been restored to 2763.542115.
[09:36:40] Finished Job #3
[09:36:40] Starting job 4,CPU time has been restored to 3031.863835.
[09:39:51] Finished Job #4
[09:39:51] Starting job 5,CPU time has been restored to 3219.751439.
[09:42:25] Finished Job #5
[09:42:25] Starting job 6,CPU time has been restored to 3367.359585.
[10:12:46] Finished Job #6
[10:12:46] Starting job 7,CPU time has been restored to 5172.821559.
[11:10:50] Finished Job #7
11:10:53 (11084): called boinc_finish
</stderr_txt>

Another has a 0x1 error in job 6:

Result Log
Result Name: E225009_ 473_ S.176.C21H13N3O2.DTEHWBYFTDJDRU-UHFFFAOYSA-N.4_ s1_ 14_ 0--
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[11:19:40] Number of jobs = 8
[11:19:40] Starting job 0,CPU time has been restored to 0.000000.
[12:10:01] Finished Job #0
[12:10:01] Starting job 1,CPU time has been restored to 2975.282272.
[12:13:27] Finished Job #1
[12:13:27] Starting job 2,CPU time has been restored to 3177.880771.
[12:17:55] Finished Job #2
[12:17:55] Starting job 3,CPU time has been restored to 3440.664455.
[12:23:02] Finished Job #3
[12:23:02] Starting job 4,CPU time has been restored to 3742.978793.
[12:26:45] Finished Job #4
[12:26:45] Starting job 5,CPU time has been restored to 3962.597001.
[12:29:16] Finished Job #5
[12:29:16] Starting job 6,CPU time has been restored to 4109.503143.
Application exited with RC = 0x1
[13:26:29] Finished Job #6
[13:26:29] Starting job 7,CPU time has been restored to 7517.937792.
[13:26:29] Skipping Job #7
13:26:31 (9808): called boinc_finish
</stderr_txt>

A few months ago a 0x1 error in job 12 was a normal valid completion. Is the validator set up correctly for this new batch?
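For anyone who wants to compare many of these logs quickly, here is a minimal sketch of a script that pulls out which jobs finished, which were skipped, and any non-zero return code. It relies only on the message wording seen in the logs in this thread; the function name `summarize_log` and the output layout are invented for illustration, and it says nothing about what the validator itself checks.

```python
import re

# One pattern per message type seen in the stderr logs in this thread.
EVENT = re.compile(
    r"Starting job (?P<start>\d+),"
    r"|Finished Job #(?P<finish>\d+)"
    r"|Skipping Job #(?P<skip>\d+)"
    r"|Application exited with RC = (?P<rc>0x[0-9A-Fa-f]+)"
)

def summarize_log(log_text):
    """Return {job_number: {"status": str, "rc": str or None}}.

    Works on logs with one message per line and on logs flattened
    into a single line, because it scans for messages in order.
    """
    jobs = {}
    current = None  # job most recently started
    for match in EVENT.finditer(log_text):
        if match.group("start") is not None:
            current = int(match.group("start"))
            jobs[current] = {"status": "started", "rc": None}
        elif match.group("rc") is not None:
            # An RC line belongs to the job that is currently running.
            if current is not None:
                jobs[current]["rc"] = match.group("rc")
        elif match.group("finish") is not None:
            job = int(match.group("finish"))
            jobs.setdefault(job, {"status": "started", "rc": None})["status"] = "finished"
        else:
            job = int(match.group("skip"))
            jobs.setdefault(job, {"status": "started", "rc": None})["status"] = "skipped"
    return jobs

SAMPLE = (
    "[12:29:16] Starting job 6,CPU time has been restored to 4109.503143. "
    "Application exited with RC = 0x1 "
    "[13:26:29] Finished Job #6 "
    "[13:26:29] Starting job 7,CPU time has been restored to 7517.937792. "
    "[13:26:29] Skipping Job #7"
)

if __name__ == "__main__":
    for job, info in sorted(summarize_log(SAMPLE).items()):
        print(job, info["status"], info["rc"])
```

On the tail of the second log above, this reports job 6 as finished with RC 0x1 and job 7 as skipped, which is the pattern that distinguishes it from the clean log.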
||
|
cjslman
Master Cruncher Mexico Joined: Nov 23, 2004 Post Count: 2082 Status: Offline |
If I look at one of the WUs that didn't error out (Pending Validation), the result log looks the same (finished job 7).

"I am not an expert at decoding the error messages, but if it's got to job 7 then I think that may be the final job. Are they simply timing out? The last three calculations in this new setup are much more computationally expensive, so this could be the case."

I have no idea what "timing out" means in reference to CEP2 WUs or if that is something bad/unrecoverable. I'm sure that this will be hitting (or already is hitting) all the CEP2 crunchers, and a determination needs to be made as to what is to be done:
- Do we keep crunching (because the error is a false positive)? or
- Do we abort (because it is a true error)?
I hope the person(s) that can guide us hasn't/haven't left for the weekend...
[EDIT]: All the WUs that errored out ran for 1-3 hours
Thanks,
CJSL
Crunching for a brighter future...
[Edit 1 times, last edit by cjslman at Aug 2, 2014 1:25:44 PM]
||
|
ca05065
Senior Cruncher Joined: Dec 4, 2007 Post Count: 325 Status: Offline |
CEP2 jobs used to be set to time out after 12 hours. This was increased to 18 hours a few months ago. The percentage completed is calculated using this 18 hours and not the estimated time.
My work units are completing in less than 3 hours. |
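As a rough illustration of why the percentage looks low on these units, here is a small sketch of progress measured against a fixed time limit. The 18-hour figure comes from the post above; the formula itself (elapsed time divided by the limit, capped at 100%) is an assumption for illustration, not something taken from the CEP2 application, and `percent_complete` is a made-up name.

```python
TIME_LIMIT_HOURS = 18.0  # limit quoted in this thread; assumed to be fixed

def percent_complete(elapsed_hours, limit_hours=TIME_LIMIT_HOURS):
    """Progress as a share of the time limit, capped at 100%.

    Assumption: progress is driven purely by elapsed time against the
    limit, not by the estimated run time of the work unit.
    """
    if limit_hours <= 0:
        raise ValueError("limit_hours must be positive")
    return min(100.0, 100.0 * elapsed_hours / limit_hours)

if __name__ == "__main__":
    for hours in (1, 3, 12, 18):
        print(f"{hours:>2} h elapsed: {percent_complete(hours):.1f}%")
```

Under that assumption, a work unit that finishes in 3 hours would only have reached about 17% on the progress bar just before it completes.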
||
|
ca05065
Senior Cruncher Joined: Dec 4, 2007 Post Count: 325 Status: Offline |
Has anyone seen an E225xxx series work unit which has passed validation?
|
||
|
AgrFan
Senior Cruncher USA Joined: Apr 17, 2008 Post Count: 376 Status: Offline |
This unit failed in Job #6 with a 0x100. I'm pretty sure this RC passed validation with the previous batches. Looks like the validator is not configured for the new batches.

E225009_ 481_ S.176.C23H15N1O2.JYMUQWGEOQUDBZ-UHFFFAOYSA-N.3_ s1_ 14_ 0-- 640 Error 8/2/14 09:40:29 8/2/14 12:55:14 3.11 46.3 / 0.0

[07:39:23] Starting job 6,CPU time has been restored to 6856.130000.
[07:39:23] Starting new Job
[07:39:23] Qink name = fldman
[07:39:25] Qink name = gesman
[07:39:26] Qink name = scfman
Application exited with RC = 0x100
[08:51:51] Finished Job #6
[08:51:51] Starting job 7,CPU time has been restored to 11113.500000.
[08:51:51] Skipping Job #7
08:51:53 (27967): called boinc_finish
</stderr_txt>
]]>

[Edit 2 times, last edit by AgrFan at Aug 2, 2014 1:23:54 PM]
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi all,
I have alerted the IBM team to this. It is my feeling that they are false errors due to the new-look jobs, i.e. the validation process is looking for the existence of files which are no longer meant to be there. This could be a hangover from the overlap of the end of one library with the start of the new one. I will keep you all updated as soon as I hear anything, but don't panic. From my point of view, the fact that it is getting to the later jobs is a positive thing, and means it is unlikely to be truly erroring. I will try my best to remain as responsive as possible over the weekend!
Your Harvard CEP Team