Thread Status: Active
Total posts in this thread: 3
yoro42
Ace Cruncher
United States
Joined: Feb 19, 2011
Post Count: 8976
Status: Offline
Computational Error

Had a CEP2 job end with 'Computational Error' in the 'Status' column before it was reported. The WU ended after more than 11 hours of CPU time.

Here is the data I collected:

Event Log entries:
9/22/2011 6:38:21 PM | World Community Grid | Restarting task E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2 using cep2 version 640
9/23/2011 7:27:02 AM | World Community Grid | Computation for task E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2 finished
9/23/2011 7:27:02 AM | World Community Grid | Output file E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2_0 for task E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2 absent
9/23/2011 7:27:02 AM | World Community Grid | Output file E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2_1 for task E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2 absent
9/23/2011 7:27:02 AM | World Community Grid | Output file E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2_2 for task E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2 absent
9/23/2011 7:27:58 AM | World Community Grid | Started upload of E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2_3
9/23/2011 7:27:59 AM | World Community Grid | Finished upload of E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2_3

Results Status:
E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2-- Rory Error 9/21/11 20:13:29 9/23/11 15:56:13 11.47 258.3 / 0.0

E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2-- 640 Error 9/21/11 20:13:29 9/23/11 15:56:13 11.47 258.3 / 0.0
Result Name: E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2--
<core_client_version>6.12.34</core_client_version>
<![CDATA[
<message>
The pipe is being closed. (0xe8) - exit code 232 (0xe8)
</message>
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[08:47:55] Number of jobs = 16
[08:47:55] Starting job 0,CPU time has been restored to 0.000000.
[08:50:44] Finished Job #0
[08:50:44] Starting job 1,CPU time has been restored to 150.556565.
[08:59:22] Finished Job #1
[08:59:22] Starting job 2,CPU time has been restored to 621.554784.
12:35:31 (8988): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[12:35:42] Number of jobs = 16
[12:35:42] Starting job 2,CPU time has been restored to 621.554784.
Quit requested: Exiting
[16:04:52] Number of jobs = 16
[16:04:52] Starting job 2,CPU time has been restored to 621.554784.
[17:01:18] Number of jobs = 16
[17:01:18] Starting job 2,CPU time has been restored to 621.554784.
Quit requested: Exiting
[17:23:08] Number of jobs = 16
[17:23:08] Starting job 2,CPU time has been restored to 621.554784.
Quit requested: Exiting
[17:46:14] Number of jobs = 16
[17:46:14] Starting job 2,CPU time has been restored to 621.554784.
[18:38:21] Number of jobs = 16
[18:38:21] Starting job 2,CPU time has been restored to 621.554784.
[23:11:43] Finished Job #2
[23:11:43] Starting job 3,CPU time has been restored to 14802.763289.
[23:21:56] Finished Job #3
[23:21:56] Starting job 4,CPU time has been restored to 15323.510227.
[23:28:30] Finished Job #4
[23:28:30] Starting job 5,CPU time has been restored to 15690.689780.
[23:36:02] Finished Job #5
[23:36:02] Starting job 6,CPU time has been restored to 16089.069534.
[23:42:39] Finished Job #6
[23:42:39] Starting job 7,CPU time has been restored to 16457.278694.
[23:51:58] Finished Job #7
[23:51:58] Starting job 8,CPU time has been restored to 16956.357094.
[23:58:42] Finished Job #8
[23:58:42] Starting job 9,CPU time has been restored to 17322.522641.
[00:08:17] Finished Job #9
[00:08:17] Starting job 10,CPU time has been restored to 17716.222364.
[00:28:45] Finished Job #10
[00:28:45] Starting job 11,CPU time has been restored to 18558.487364.
[00:39:48] Finished Job #11
[00:39:48] Starting job 12,CPU time has been restored to 19024.353150.
[01:40:01] Finished Job #12
[01:40:01] Starting job 13,CPU time has been restored to 21903.538806.
[03:36:04] Finished Job #13
[03:36:04] Starting job 14,CPU time has been restored to 28335.023233.
[05:19:06] Finished Job #14
[05:19:06] Starting job 15,CPU time has been restored to 34124.844347.
[07:27:01] Finished Job #15
Unable to open result file C.27.C21H14N2S2Si2.00582877.3.noopt.bp86.sto6g.n.sp.out
</stderr_txt>
]]>
E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_0-- - No Reply 9/11/11 20:10:44 9/21/11 20:10:44 0.00 0.0 / 0.0
E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_1-- 640 Pending Validation 9/11/11 19:48:02 9/13/11 01:53:05 12.00 172.0 / 0.0
Result Name: E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_1--
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[05:13:46] Number of jobs = 16
[05:13:46] Starting job 0,CPU time has been restored to 0.000000.
[05:16:59] Finished Job #0
[05:16:59] Starting job 1,CPU time has been restored to 174.845921.
[05:27:33] Finished Job #1
[05:27:33] Starting job 2,CPU time has been restored to 689.040817.
[10:07:47] Finished Job #2
[10:07:47] Starting job 3,CPU time has been restored to 15312.402956.
[10:16:50] Finished Job #3
[10:16:50] Starting job 4,CPU time has been restored to 15832.775492.
[10:24:27] Finished Job #4
[10:24:27] Starting job 5,CPU time has been restored to 16235.850875.
[10:33:01] Finished Job #5
[10:33:01] Starting job 6,CPU time has been restored to 16656.913174.
[10:40:03] Finished Job #6
[10:40:03] Starting job 7,CPU time has been restored to 17061.486168.
[10:50:11] Finished Job #7
[10:50:11] Starting job 8,CPU time has been restored to 17594.650786.
[10:57:57] Finished Job #8
[10:57:57] Starting job 9,CPU time has been restored to 17991.642130.
[11:05:48] Finished Job #9
[11:05:48] Starting job 10,CPU time has been restored to 18443.670628.
[11:23:03] Finished Job #10
[11:23:03] Starting job 11,CPU time has been restored to 19390.019494.
[11:32:46] Finished Job #11
[11:32:46] Starting job 12,CPU time has been restored to 19912.841246.
Quit requested: Exiting
[12:14:47] Number of jobs = 16
[12:14:47] Starting job 12,CPU time has been restored to 19912.841246.
[13:34:32] Finished Job #12
[13:34:32] Starting job 13,CPU time has been restored to 23290.387697.
[15:48:31] Finished Job #13
[15:48:31] Starting job 14,CPU time has been restored to 30347.919737.
[17:48:44] Finished Job #14
[17:48:44] Starting job 15,CPU time has been restored to 36648.066922.
Killing job because cpu time has been exceeded. Subjob start time = 607810028, Subjob current time = 1088546050
[19:51:28] Finished Job #15
19:51:42 (2828): called boinc_finish

</stderr_txt>
]]>
E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_3-- - Waiting to be sent — — 0.00 0.0 / 0.0
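
To see where the CPU time went, the "restored to" values in the stderr output above can be differenced. The following is only a rough sketch: it assumes each "restored to" figure is the cumulative CPU seconds at the start of that job, and the function name `cpu_seconds_per_job` is made up for illustration. The sample lines are copied from the log above.

```python
import re

# Matches e.g. "[08:59:22] Starting job 2,CPU time has been restored to 621.554784."
RESTORED = re.compile(r"Starting job (\d+),CPU time has been restored to ([\d.]+)\.$")

def cpu_seconds_per_job(text):
    """Return {job: CPU seconds used}, assuming 'restored to' is cumulative CPU time."""
    checkpoints = {}
    for line in text.splitlines():
        m = RESTORED.search(line.strip())
        if m:
            # Restarts repeat the same line; the last value seen simply overwrites.
            checkpoints[int(m.group(1))] = float(m.group(2))
    # A job's CPU time is the next job's starting value minus its own.
    return {j: checkpoints[j + 1] - checkpoints[j]
            for j in sorted(checkpoints) if j + 1 in checkpoints}

sample = """\
[08:50:44] Starting job 1,CPU time has been restored to 150.556565.
[08:59:22] Starting job 2,CPU time has been restored to 621.554784.
[12:35:42] Starting job 2,CPU time has been restored to 621.554784.
[23:11:43] Starting job 3,CPU time has been restored to 14802.763289.
"""

for job, secs in cpu_seconds_per_job(sample).items():
    print(f"job {job}: {secs / 3600:.2f} h")
```

On this reading, job 2 alone accounts for a bit under 4 hours of CPU time, and it restarted from the same 621.554784 value every time, so no progress inside job 2 was carried across the restarts.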
----------------------------------------

[Sep 23, 2011 4:31:49 PM]
anhhai
Veteran Cruncher
Joined: Mar 22, 2005
Post Count: 839
Status: Offline
Re: Computational Error

I don't know if this is widespread, but I have noticed a large uptick in repair jobs for CEP2 lately (almost 1/4 of my CEP2 tasks are repair jobs). That is not a good sign.
----------------------------------------

[Sep 23, 2011 4:39:08 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Computational Error

The core error in this case is:

"12:35:31 (8988): No heartbeat from core client for 30 sec - exiting"

The system was too busy, so the job(s) got killed: any job that cannot communicate with the core client within 30 seconds of its last contact exits, which works out to a heartbeat of 2 per minute.
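
If you want to check your own stderr output for the same pattern, here is a minimal sketch that counts starts, heartbeat losses and quit requests. The message strings are copied from the log in the first post; the function name `summarize_stderr` and the dictionary keys are made up for illustration and are not part of BOINC.

```python
import re

# Message fragments as they appear in the stderr_txt quoted in this thread.
HEARTBEAT = "No heartbeat from core client"
QUIT = "Quit requested"
START = re.compile(r"^\[\d\d:\d\d:\d\d\] Number of jobs = \d+")

def summarize_stderr(text):
    """Count task (re)starts, heartbeat losses and quit requests in a stderr dump."""
    summary = {"starts": 0, "heartbeat_losses": 0, "quit_requests": 0}
    for line in text.splitlines():
        line = line.strip()
        if START.match(line):
            summary["starts"] += 1
        elif HEARTBEAT in line:
            # The follow-up line "No heartbeat: Exiting" is deliberately not counted twice.
            summary["heartbeat_losses"] += 1
        elif line.startswith(QUIT):
            summary["quit_requests"] += 1
    return summary

sample = """\
[08:59:22] Starting job 2,CPU time has been restored to 621.554784.
12:35:31 (8988): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[12:35:42] Number of jobs = 16
[12:35:42] Starting job 2,CPU time has been restored to 621.554784.
Quit requested: Exiting
[16:04:52] Number of jobs = 16
"""

print(summarize_stderr(sample))
```

A non-zero heartbeat count points at the client being starved; quit requests on their own are normal suspends or shutdowns.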

I've set BOINC to pause when my Linux quad gets a non-BOINC load greater than 75%:

Suspend work if CPU usage is above 75.0 % of cpu

with LAIM (Leave Applications In Memory) on. Typically system updates cause this to happen, with the root cause somewhere in the Wi-Fi instability. When a high network demand occurs, BOINC gets starved.

--//--
[Sep 23, 2011 4:56:49 PM]