World Community Grid Forums
Thread Status: Active Total posts in this thread: 18
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Only a few, but since this Linux box was solid from the first beta through 7.24, it raises suspicion: is it rigidity of the validator, or instability in the processing/concatenation of result pieces after a restart? Homogeneous redundancy applies; all wingmen are on Linux too.
1) MCM1_ 0000148_ 4938_ 2-- 2524499 Invalid 11/20/13 15:13:58 11/20/13 19:58:42 3.58 / 3.59 70.3 / 35.4
2) MCM1_ 0000110_ 8524_ 0-- 2524499 Invalid 11/19/13 16:34:04 11/19/13 23:02:03 5.89 / 5.91 82.7 / 46.7

Wingman list 1):
MCM1_ 0000148_ 4938_ 4-- 726 Valid 11/20/13 22:05:07 11/21/13 11:44:14 3.63 68.7 / 70.8
MCM1_ 0000148_ 4938_ 3-- - Detached 11/20/13 20:51:27 11/20/13 21:19:59 0.00 0.0 / 0.0
MCM1_ 0000148_ 4938_ 2-- 726 Invalid 11/20/13 15:13:58 11/20/13 19:58:42 3.58 70.3 / 35.4
MCM1_ 0000148_ 4938_ 1-- 726 Error 11/20/13 15:10:31 11/20/13 15:13:42 0.00 79.8 / 0.0
MCM1_ 0000148_ 4938_ 0-- 726 Valid 11/20/13 15:10:29 11/20/13 20:51:02 2.71 72.9 / 70.8

Wingman list 2):
MCM1_ 0000110_ 8524_ 2-- 726 Valid 11/20/13 02:23:13 11/20/13 22:44:38 8.38 88.6 / 93.3
MCM1_ 0000110_ 8524_ 1-- 726 Valid 11/19/13 16:34:08 11/20/13 02:20:36 4.24 98.1 / 93.3
MCM1_ 0000110_ 8524_ 0-- 726 Invalid 11/19/13 16:34:04 11/19/13 23:02:03 5.89 82.7 / 46.7

The invalid and error results show restarts; note that LAIM (Leave Application in Memory) was off. I suspect I know why the restarts occurred: I have several <exclusive_app> processes that stop BOINC while they run, whether started manually or on a schedule. That makes the next 4 candidates for a check-up once the wingmen have checked in.

2314 21-11-2013 07:59 Suspending computation - an exclusive app is running
2315 World Community Grid 21-11-2013 07:59 [cpu_sched] Preempting MCM1_0000166_6184_1 (removed from memory)
2316 World Community Grid 21-11-2013 07:59 [cpu_sched] Preempting MCM1_0000173_5530_0 (removed from memory)
2317 World Community Grid 21-11-2013 07:59 [cpu_sched] Preempting MCM1_0000174_9099_0 (removed from memory)
2318 World Community Grid 21-11-2013 07:59 [cpu_sched] Preempting MCM1_0000174_4101_0 (removed from memory)
2319 21-11-2013 08:00 Resuming computation

I normally run with LAIM on, but hey, are we testing to break things or not?
The restart theory gets wobbly, and the random number generator gets to slip a foot in. The log of one of the 4 above looks EXACTLY like those of the invalid results, yet it has gone valid:

MCM1_ 0000174_ 9099_ 0-- 2524499 Valid 11/21/13 06:17:17 11/21/13 10:09:48 3.50 / 3.55 68.5 / 71.9

Result Log
Result Name: MCM1_ 0000174_ 9099_ 0--
<core_client_version>7.2.28</core_client_version>
<![CDATA[
<stderr_txt>
Commandline = ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.26_x86_64-pc-linux-gnu -SettingsFile MCM1_0000174_9099.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Initializing wcg_learn_limit = 500000
Running
[07:19:08]: Computing pass 0
Commandline = ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.26_x86_64-pc-linux-gnu -SettingsFile MCM1_0000174_9099.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Initializing wcg_learn_limit = 500000
Running
[08:00:08]: Computing pass 0
Commandline = ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.26_x86_64-pc-linux-gnu -SettingsFile MCM1_0000174_9099.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Initializing wcg_learn_limit = 500000
Running
[08:46:38]: Computing pass 0
Result.out = 3899480.000000
Run complete, CPU time: 12612.136050
11:09:22 (17667): called boinc_finish
</stderr_txt>
]]>

The other 3 are listed as PV (Pending Validation), with identical logs indicating interruption:

MCM1_ 0000166_ 6184_ 1-- 2524499 Pending Validation 11/21/13 00:00:28 11/21/13 08:50:57 8.51 / 8.56 164.7 / 0.0
MCM1_ 0000173_ 5530_ 0-- 2524499 Pending Validation 11/21/13 05:04:38 11/21/13 08:47:46 3.48 / 3.52 67.7 / 0.0
MCM1_ 0000174_ 4101_ 0-- 2524499 Pending Validation 11/21/13 06:24:38 11/21/13 09:38:31 2.99 / 3.03 58.3 / 0.0

Leaving LAIM off for now, but I will eventually turn it on to see whether that makes a difference... then no more invalids [or is Murphy riding high?]

Edit: Title
[Edit 1 times, last edit by Former Member at Dec 2, 2013 10:07:29 AM] |
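For anyone who wants to check their own result logs for the pattern described above, here is a minimal Python sketch. It rests on one observation from the logs in this thread: every start of the science application prints a fresh "Commandline =" line, so more than one such line suggests the task was restarted. The function name `summarize_stderr` and the shortened sample log are made up for illustration; this is not a WCG or BOINC tool.

```python
import re


def summarize_stderr(stderr_text):
    """Summarize one MCM1 stderr log of the kind posted in this thread.

    Hypothetical helper: it only counts markers visible in the posted logs.
    """
    # Each (re)start of the science app prints its command line again.
    starts = stderr_text.count("Commandline =")
    # The final value appears as "Result.out = <number>"; keep it as text
    # so that it can be compared character for character between copies.
    match = re.search(r"Result\.out = (\d+(?:\.\d+)?)", stderr_text)
    return {
        "starts": starts,
        "restarts": max(starts - 1, 0),
        "result_out": match.group(1) if match else None,
    }


# Shortened, made-up stand-in for a log with three starts (two restarts).
sample_log = """\
Commandline = (shortened for the example)
Running
Commandline = (shortened for the example)
Running
Commandline = (shortened for the example)
Running
Result.out = 3899480.000000
"""

summary = summarize_stderr(sample_log)
print(summary["restarts"], summary["result_out"])
```

A restart count above zero only shows that the task was interrupted. As the valid result above shows, an interrupted task can still validate.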
||
|
BobCat13
Senior Cruncher Joined: Oct 29, 2005 Post Count: 295 Status: Offline |
Same thing here on Linux Mint 15 64-bit only. 3 invalid and 4 in Pending Verification, all were reported to WCG at the same time. Of the 3 copies of each workunit that were sent out, very few have any restarts in the stderr.txt, but on my Invalids the Result.out file is 1 or 2 bytes less than the valid results.
|
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I now have 5 that went from Pending Validation (PVal) to Pending Verification (PVer), and sadly, it's also befalling Windows. In all cases one of the first 2 copies had a restart, which is visible in the log. The next thing to check is what happens when 2 restarted results meet 1 non-restarted one. I have not encountered that yet, but would anticipate a _3 copy being issued.
|
||
|
gomeyer
Senior Cruncher USA Joined: Jul 11, 2008 Post Count: 161 Status: Offline |
I got two on a solid Linux box, both had a restart.
(EDIT: I mean I restarted the machine; the work unit didn't restart itself.)
Also one on a Windoz machine with a "no heartbeat" message; the first time I've seen that on this box.
[Edit 1 times, last edit by gomeyer at Nov 22, 2013 1:50:05 AM] |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7655 Status: Offline |
I also have quite a rash of these invalids, on three separate Linux machines. The Result.out line differs from the one in the results that come back valid. They all have several lines about "No heartbeat from client for 30 sec - exiting". However, some of my valid units also have the no-heartbeat line in them.

Here is one which is marked as "Valid."

Result Log
Result Name: MCM1_ 0000154_ 1457_ 1--
<core_client_version>7.0.27</core_client_version>
<![CDATA[
<stderr_txt>
Commandline = ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.26_x86_64-pc-linux-gnu -SettingsFile MCM1_0000154_1457.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Initializing wcg_learn_limit = 500000
Running
13:41:20 (8644): No heartbeat from client for 30 sec - exiting
13:41:20 (8644): timer handler: client dead, exiting
Commandline = ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.26_x86_64-pc-linux-gnu -SettingsFile MCM1_0000154_1457.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Initializing wcg_learn_limit = 500000
Running
Result.out = 1144692.000000
Run complete, CPU time: 17591.939426
18:36:32 (8676): called boinc_finish
</stderr_txt>
]]>

The bolded lines only show up occasionally in the valid units, but always show up in the invalid units. In the invalid units the Result.out line is always different than in the valid units. Beats me as to the cause of the invalids or the Result.out difference.
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
cjslman
Master Cruncher Mexico Joined: Nov 23, 2004 Post Count: 2082 Status: Offline |
I have just noticed that I have 7 invalids that got generated within two days:

MCM1_ 0000281_ 7216_ 0-- PBYBWHW Invalid 11/25/13 02:41:18 11/30/13 07:41:51 2.94 / 2.96 65.0 / 39.6
MCM1_ 0000302_ 5492_ 2-- R8XZ4P5 Invalid 11/27/13 00:21:00 11/30/13 05:58:54 5.23 / 5.44 118.9 / 55.5
MCM1_ 0000278_ 3201_ 0-- PBYBWHW Invalid 11/24/13 23:20:00 11/30/13 02:37:26 5.61 / 5.75 124.4 / 69.2
MCM1_ 0000275_ 3673_ 0-- PBYBWHW Invalid 11/24/13 22:38:11 11/29/13 22:33:38 5.58 / 5.73 119.9 / 67.5
MCM1_ 0000299_ 9235_ 2-- R8XZ4P5 Invalid 11/26/13 22:08:27 11/29/13 22:28:37 4.44 / 4.80 79.7 / 40.4
MCM1_ 0000275_ 2037_ 0-- PBYBWHW Invalid 11/24/13 22:35:06 11/29/13 20:57:03 5.56 / 5.71 119.9 / 71.6
MCM1_ 0000275_ 6725_ 2-- PBYBWHW Invalid 11/24/13 22:32:01 11/29/13 02:07:53

They all seem to have basically the same output:

Result Name: MCM1_ 0000302_ 5492_ 2--
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
Commandline = projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.26_windows_intelx86 -SettingsFile MCM1_0000302_5492.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Initializing wcg_learn_limit = 500000
Running
Commandline = projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.26_windows_intelx86 -SettingsFile MCM1_0000302_5492.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Initializing wcg_learn_limit = 500000
Running
Commandline = projects/www.worldcommunitygrid.org/wcgrid_mcm1_7.26_windows_intelx86 -SettingsFile MCM1_0000302_5492.txt -DatabaseFile dataset-17_72_SDG_v1.txt
Initializing wcg_learn_limit = 500000
Running
Result.out = 3326321.000000
Run complete, CPU time: 18812.536777
23:29:26 (5892): called boinc_finish
</stderr_txt>
]]>

The invalids happened on 2 Intel i5 laptops. Any comments?
[EDIT]: forget about the request for comments... I see that the thread for this topic is here
Thanks,
CJSL
Crunching for a better world...
[Edit 1 times, last edit by cjslman at Dec 2, 2013 12:21:04 AM] |
||
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline |
We are looking into the invalids for MCM1. It looks like it could be a checkpoint/restart issue, so if possible, change the Leave Application in Memory setting to yes while we investigate.
Thanks, armstrdj |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Regrettably, LAIM only -prevents- half the problem. Client and device restarts cause the same thing. This will predominantly affect any cruncher who runs part-time without using hibernation/sleep.
What is disturbing is that even with matching Result.out values, the validator would not pass the mustering, but in summation, we've also seen all 5 copies, the maximum, come back with different Result.out values and then be moved to Too Late [which you will have seen in the take-out reports]. Anyway, thanks for looking into this [with the Beta hunters waiting with bated breath not far behind, to take that urgency away] |
||
|
nanoprobe
Master Cruncher Classified Joined: Aug 29, 2008 Post Count: 2998 Status: Offline |
"We are looking into the invalids for MCM1. It looks like it could be a checkpoint/restart issue, so if possible, change the Leave Application in Memory setting to yes while we investigate. Thanks, armstrdj"

Having LAIM checked doesn't seem to make any difference. I have 5 tasks from yesterday and today, all from 1 machine running with LAIM checked, that are pending verification. 3 restarted, 2 didn't. It's the machine named "Home Premium 764" if the techs want to take a look. The machine and client have not been restarted that I know of.

I can confirm what Sek said about restarts. I had to reboot one of my machines several times yesterday, and all the tasks that were running at the time are headed to pending verification purgatory.
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.
[Edit 4 times, last edit by nanoprobe at Dec 2, 2013 4:39:09 PM] |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7655 Status: Offline |
"What is disturbing is that even with matching Result.out values, the validator would not pass the mustering,"

I wondered if anyone else had noticed this. Hopefully the techs can fix this.

MCM1_ 0000317_ 5063_ 4-- 726 Valid 12/1/13 20:44:40 12/2/13 13:41:26 9.94 162.8 / 155.8
MCM1_ 0000317_ 5063_ 3-- - No Reply 11/28/13 20:44:31 12/1/13 20:44:31 0.00 0.0 / 0.0
MCM1_ 0000317_ 5063_ 2-- 726 Invalid 11/27/13 19:37:58 11/28/13 20:43:52 5.93 124.0 / 77.9
MCM1_ 0000317_ 5063_ 1-- 726 Invalid 11/26/13 16:40:50 11/27/13 19:37:36 6.63 167.1 / 77.9 <Mine
MCM1_ 0000317_ 5063_ 0-- 726 Valid 11/26/13 16:40:31 11/27/13 12:37:45 6.67 148.7 / 155.8

All have the same result out: Result.out = 3982717.000000
Cheers
Sgt. Joe
*Minnesota Crunchers* |
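The case above (copies marked Invalid although their Result.out matches a Valid copy) can be searched for with a short Python sketch like the one below. It assumes you have already collected, by hand or otherwise, the result name, status and Result.out text for each copy of a workunit. The names `workunit_of` and `suspicious_invalids` are hypothetical, and the comparison is plain string equality, which is an assumption about what "matching" means, not a description of the actual validator.

```python
from collections import defaultdict


def workunit_of(result_name):
    """Drop the trailing copy index: MCM1_0000317_5063_1 -> MCM1_0000317_5063."""
    return result_name.rsplit("_", 1)[0]


def suspicious_invalids(results):
    """Return Invalid results whose Result.out equals that of a Valid copy.

    results: list of (result_name, status, result_out) tuples, where
    result_out is the text after "Result.out = " or None if unknown.
    """
    valid_values = defaultdict(set)
    for name, status, value in results:
        if status == "Valid" and value is not None:
            valid_values[workunit_of(name)].add(value)
    return [
        name
        for name, status, value in results
        if status == "Invalid" and value in valid_values[workunit_of(name)]
    ]


# The workunit from the post above; every returned copy reported the same value.
results = [
    ("MCM1_0000317_5063_0", "Valid", "3982717.000000"),
    ("MCM1_0000317_5063_1", "Invalid", "3982717.000000"),
    ("MCM1_0000317_5063_2", "Invalid", "3982717.000000"),
    ("MCM1_0000317_5063_3", "No Reply", None),
    ("MCM1_0000317_5063_4", "Valid", "3982717.000000"),
]
print(suspicious_invalids(results))
```

For this workunit the sketch flags copies _1 and _2, the two Invalid results with the same Result.out as the Valid ones. That is the pattern that suggests the validator compares more than the single Result.out line.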
||
|
|
![]() |