World Community Grid Forums
Category: Completed Research | Forum: The Clean Energy Project - Phase 2 Forum | Thread: Computational Error
Thread Status: Active Total posts in this thread: 3
yoro42
Ace Cruncher | United States | Joined: Feb 19, 2011 | Post Count: 8976
Had a CEP2 job end with 'Computational Error' in the 'Status' column before it was reported. The WU ended after more than 11 hours of CPU time.
Here is the data I collected:

Event Log entries:
9/22/2011 6:38:21 PM | World Community Grid | Restarting task E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2 using cep2 version 640
9/23/2011 7:27:02 AM | World Community Grid | Computation for task E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2 finished
9/23/2011 7:27:02 AM | World Community Grid | Output file E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2_0 for task E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2 absent
9/23/2011 7:27:02 AM | World Community Grid | Output file E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2_1 for task E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2 absent
9/23/2011 7:27:02 AM | World Community Grid | Output file E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2_2 for task E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2 absent
9/23/2011 7:27:58 AM | World Community Grid | Started upload of E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2_3
9/23/2011 7:27:59 AM | World Community Grid | Finished upload of E203169_805_C.27.C21H14N2S2Si2.00582877.3.set1d06_2_3

Results Status:
E203169_ 805_ C.27.C21H14N2S2Si2.00582877.3.set1d06_ 2-- Rory Error 9/21/11 20:13:29 9/23/11 15:56:13 11.47 258.3 / 0.0
E203169_ 805_ C.27.C21H14N2S2Si2.00582877.3.set1d06_ 2-- 640 Error 9/21/11 20:13:29 9/23/11 15:56:13 11.47 258.3 / 0.0

Result Name: E203169_ 805_ C.27.C21H14N2S2Si2.00582877.3.set1d06_ 2--
<core_client_version>6.12.34</core_client_version>
<![CDATA[
<message>
The pipe is being closed. (0xe8) - exit code 232 (0xe8)
</message>
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[08:47:55] Number of jobs = 16
[08:47:55] Starting job 0,CPU time has been restored to 0.000000.
[08:50:44] Finished Job #0
[08:50:44] Starting job 1,CPU time has been restored to 150.556565.
[08:59:22] Finished Job #1
[08:59:22] Starting job 2,CPU time has been restored to 621.554784.
12:35:31 (8988): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[12:35:42] Number of jobs = 16
[12:35:42] Starting job 2,CPU time has been restored to 621.554784.
Quit requested: Exiting
[16:04:52] Number of jobs = 16
[16:04:52] Starting job 2,CPU time has been restored to 621.554784.
[17:01:18] Number of jobs = 16
[17:01:18] Starting job 2,CPU time has been restored to 621.554784.
Quit requested: Exiting
[17:23:08] Number of jobs = 16
[17:23:08] Starting job 2,CPU time has been restored to 621.554784.
Quit requested: Exiting
[17:46:14] Number of jobs = 16
[17:46:14] Starting job 2,CPU time has been restored to 621.554784.
[18:38:21] Number of jobs = 16
[18:38:21] Starting job 2,CPU time has been restored to 621.554784.
[23:11:43] Finished Job #2
[23:11:43] Starting job 3,CPU time has been restored to 14802.763289.
[23:21:56] Finished Job #3
[23:21:56] Starting job 4,CPU time has been restored to 15323.510227.
[23:28:30] Finished Job #4
[23:28:30] Starting job 5,CPU time has been restored to 15690.689780.
[23:36:02] Finished Job #5
[23:36:02] Starting job 6,CPU time has been restored to 16089.069534.
[23:42:39] Finished Job #6
[23:42:39] Starting job 7,CPU time has been restored to 16457.278694.
[23:51:58] Finished Job #7
[23:51:58] Starting job 8,CPU time has been restored to 16956.357094.
[23:58:42] Finished Job #8
[23:58:42] Starting job 9,CPU time has been restored to 17322.522641.
[00:08:17] Finished Job #9
[00:08:17] Starting job 10,CPU time has been restored to 17716.222364.
[00:28:45] Finished Job #10
[00:28:45] Starting job 11,CPU time has been restored to 18558.487364.
[00:39:48] Finished Job #11
[00:39:48] Starting job 12,CPU time has been restored to 19024.353150.
[01:40:01] Finished Job #12
[01:40:01] Starting job 13,CPU time has been restored to 21903.538806.
[03:36:04] Finished Job #13
[03:36:04] Starting job 14,CPU time has been restored to 28335.023233.
[05:19:06] Finished Job #14
[05:19:06] Starting job 15,CPU time has been restored to 34124.844347.
[07:27:01] Finished Job #15
Unable to open result file C.27.C21H14N2S2Si2.00582877.3.noopt.bp86.sto6g.n.sp.out
</stderr_txt>
]]>

E203169_ 805_ C.27.C21H14N2S2Si2.00582877.3.set1d06_ 0-- - No Reply 9/11/11 20:10:44 9/21/11 20:10:44 0.00 0.0 / 0.0
E203169_ 805_ C.27.C21H14N2S2Si2.00582877.3.set1d06_ 1-- 640 Pending Validation 9/11/11 19:48:02 9/13/11 01:53:05 12.00 172.0 / 0.0

Result Name: E203169_ 805_ C.27.C21H14N2S2Si2.00582877.3.set1d06_ 1--
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[05:13:46] Number of jobs = 16
[05:13:46] Starting job 0,CPU time has been restored to 0.000000.
[05:16:59] Finished Job #0
[05:16:59] Starting job 1,CPU time has been restored to 174.845921.
[05:27:33] Finished Job #1
[05:27:33] Starting job 2,CPU time has been restored to 689.040817.
[10:07:47] Finished Job #2
[10:07:47] Starting job 3,CPU time has been restored to 15312.402956.
[10:16:50] Finished Job #3
[10:16:50] Starting job 4,CPU time has been restored to 15832.775492.
[10:24:27] Finished Job #4
[10:24:27] Starting job 5,CPU time has been restored to 16235.850875.
[10:33:01] Finished Job #5
[10:33:01] Starting job 6,CPU time has been restored to 16656.913174.
[10:40:03] Finished Job #6
[10:40:03] Starting job 7,CPU time has been restored to 17061.486168.
[10:50:11] Finished Job #7
[10:50:11] Starting job 8,CPU time has been restored to 17594.650786.
[10:57:57] Finished Job #8
[10:57:57] Starting job 9,CPU time has been restored to 17991.642130.
[11:05:48] Finished Job #9
[11:05:48] Starting job 10,CPU time has been restored to 18443.670628.
[11:23:03] Finished Job #10
[11:23:03] Starting job 11,CPU time has been restored to 19390.019494.
[11:32:46] Finished Job #11
[11:32:46] Starting job 12,CPU time has been restored to 19912.841246.
Quit requested: Exiting
[12:14:47] Number of jobs = 16
[12:14:47] Starting job 12,CPU time has been restored to 19912.841246.
[13:34:32] Finished Job #12
[13:34:32] Starting job 13,CPU time has been restored to 23290.387697.
[15:48:31] Finished Job #13
[15:48:31] Starting job 14,CPU time has been restored to 30347.919737.
[17:48:44] Finished Job #14
[17:48:44] Starting job 15,CPU time has been restored to 36648.066922.
Killing job because cpu time has been exceeded. Subjob start time = 607810028, Subjob current time = 1088546050
[19:51:28] Finished Job #15
19:51:42 (2828): called boinc_finish
</stderr_txt>
]]>

E203169_ 805_ C.27.C21H14N2S2Si2.00582877.3.set1d06_ 3-- - Waiting to be sent - - 0.00 0.0 / 0.0
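For anyone who wants to check a stderr dump like the ones above without reading it line by line, here is a minimal Python sketch. It counts how often each sub-job was started and estimates the CPU seconds per sub-job from the "Starting job N,CPU time has been restored to X" lines. The names `START`, `SAMPLE` and `summarize` are made up for this illustration, and `SAMPLE` is only a shortened excerpt of the first result's log, not the full output.

```python
import re
from collections import Counter

# Matches entries such as:
#   [08:59:22] Starting job 2,CPU time has been restored to 621.554784.
START = re.compile(r"Starting job (\d+),CPU time has been restored to (\d+\.\d+)")

# Shortened excerpt of the errored result's stderr (illustration only).
SAMPLE = """\
[08:47:55] Starting job 0,CPU time has been restored to 0.000000.
[08:50:44] Finished Job #0
[08:50:44] Starting job 1,CPU time has been restored to 150.556565.
[08:59:22] Finished Job #1
[08:59:22] Starting job 2,CPU time has been restored to 621.554784.
12:35:31 (8988): No heartbeat from core client for 30 sec - exiting
[12:35:42] Starting job 2,CPU time has been restored to 621.554784.
[16:04:52] Starting job 2,CPU time has been restored to 621.554784.
[23:11:43] Finished Job #2
[23:11:43] Starting job 3,CPU time has been restored to 14802.763289.
"""


def summarize(stderr_text):
    """Return (restarts per job, CPU seconds per job) for a stderr dump."""
    starts = Counter()
    cpu_at_start = {}
    for job, cpu in START.findall(stderr_text):
        starts[int(job)] += 1
        cpu_at_start[int(job)] = float(cpu)
    # A job that was started n times was restarted n - 1 times.
    restarts = {job: n - 1 for job, n in starts.items() if n > 1}
    # CPU time of job k = restored CPU time at job k+1 minus that at job k.
    cpu_per_job = {
        job: cpu_at_start[job + 1] - cpu_at_start[job]
        for job in cpu_at_start
        if job + 1 in cpu_at_start
    }
    return restarts, cpu_per_job


restarts, cpu_per_job = summarize(SAMPLE)
print("restarts:", restarts)
for job, secs in sorted(cpu_per_job.items()):
    print(f"job {job}: {secs / 3600:.2f} CPU hours")
```

Run against the full log of the errored result, this kind of count shows job 2 being started seven times, always with the same restored CPU time (621.554784), which suggests each restart began job 2 from scratch.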
anhhai
Veteran Cruncher | Joined: Mar 22, 2005 | Post Count: 839
I don't know if this is widespread, but I have noticed a large uptick in repair jobs for CEP2 lately (almost 1/4 of my CEP2 tasks are repair jobs). That is not a good sign.
Former Member
Cruncher | Joined: May 22, 2018 | Post Count: 0
The core error in this case is:
"12:35:31 (8988): No heartbeat from core client for 30 sec - exiting"
The system was too busy, so the job(s) got shut down. Any job that cannot communicate with the core client within 30 seconds of the last contact exits; that is the 30-second heartbeat window, i.e. two per minute.
I've set BOINC to pause when my Linux quad gets a non-BOINC load greater than 75%:
Suspend work if CPU usage is above 75.0 %
with LAIM (Leave Applications In Memory) on.
Typically system updates cause this to happen, with the root somewhere in the Wi-Fi instability. When high network demand occurs, BOINC feels deprived.
--//--
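To make the 30-second heartbeat rule described above more concrete, here is a small Python sketch of a generic heartbeat watchdog. It is not BOINC's actual implementation; the class names `HeartbeatWatchdog` and `FakeClock` and the injectable clock are assumptions made for this illustration. It only shows the idea that a worker gives up once it has heard nothing from its controller for longer than a timeout.

```python
import time


class HeartbeatWatchdog:
    """Remembers the last heartbeat and reports when it is overdue."""

    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_beat = clock()

    def beat(self):
        # Called whenever a heartbeat arrives from the controller.
        self.last_beat = self.clock()

    def expired(self):
        # True once more than timeout_s seconds passed without a heartbeat.
        return self.clock() - self.last_beat > self.timeout_s


class FakeClock:
    """Manually advanced clock so the demo does not have to sleep."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
watchdog = HeartbeatWatchdog(timeout_s=30.0, clock=clock)

clock.advance(29)            # controller silent for 29 s: still fine
ok_at_29 = not watchdog.expired()

clock.advance(2)             # silent for 31 s: the worker would exit here
expired_at_31 = watchdog.expired()

watchdog.beat()              # a heartbeat arrives and resets the window
ok_after_beat = not watchdog.expired()

print(ok_at_29, expired_at_31, ok_after_beat)
```

On a heavily loaded machine the controller may simply not get scheduled often enough to send the heartbeat in time, which matches the "system was too busy" reading of the log line.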