World Community Grid Forums
Thread Status: Active Total posts in this thread: 117
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
cjslman, interrupting an MCM task is fatal, as simple as that. Hibernate is the only way to preserve a task and resume without the model rebuild that's needed on restart. Hibernation consumes no power.
And yes, armstrdj [search on this author] replied that they're addressing this [and a Mac-related build issue]. See Rickjb's post above, which has an actual link to the key post [armstrdj made about 4 posts relating to MCM, spread over multiple threads]. |
||
|
cjslman
Master Cruncher Mexico Joined: Nov 23, 2004 Post Count: 2082 Status: Offline |
Thanks for the info. I'll see what I can do about the hibernate vs the shut down.
----------------------------------------
CJSL Crunching for a better future... |
||
|
littlepeaks
Veteran Cruncher USA Joined: Apr 28, 2007 Post Count: 748 Status: Offline |
So, reading between the lines, if I need to do software updates, or shut down my PC to clean the dust out (which I REALLY need to do), I need to select "no new tasks", finish all my pending WUs, then shut down my PC? This is a Windows machine, but I haven't had to reboot it since MCM started. It's getting near the Windows Updates time of the month again, though.
|
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
littlepeaks,
Since the invalid problem only affects MCM, I've already resorted to this for a planned shutdown/boot: flip the profile to short FAHV tasks and just let it rip. At some point only FAHV tasks are crunching; then do whatever you need to do. Afterwards, flip the profile back to include [or exclusively do] MCM again. Hardly any time lost, no idling cores, maximum crunch return. I've been doing the same for CEP2 because of its widely spaced checkpoints... there's no checkpoint-tuned planning possible when 4-8 tasks run concurrently. Sorry, triple Emming ** is a must these days [or just shrug your shoulders and take the who-cares, Set-And-Forget route: just crunch and let the fallout be the fallout]. ** Micro Management Methodology. |
||
|
littlepeaks
Veteran Cruncher USA Joined: Apr 28, 2007 Post Count: 748 Status: Offline |
Thanks -- good idea SekeRob
|
||
|
keithhenry
Ace Cruncher Senile old farts of the world ....uh.....uh..... nevermind Joined: Nov 18, 2004 Post Count: 18665 Status: Offline |
FWIW, I've taken a look through my Results Status at the MCM WUs that went Pending Verification over the last several days. I looked at the result logs for those that got Invalid on each WU, and the BOINC levels were spread across multiple BOINC 6.x and 7.x levels. The one thing they all had was multiple "Command line, Initializing, Running" entries. I think I even found invalids with the same size Results.out file. From the various threads on this issue, it would seem this may help confirm that the problem is with reloading from checkpoints. I normally run with LAIM (leave applications in memory) on, so I would have to force a machine reboot to test whether that causes the problem for my copy of a WU. Since the techs are working on this, I'm not going to force that, but I will try to track what happens if I have a reboot soon.
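The check described above (looking for repeated "Command line, Initializing, Running" groups in a result log) could be roughly automated as sketched below. This is only an illustration: the log layout, the `Initializing` marker text, and the `count_restarts` helper are assumptions based on the wording in this post, not the actual MCM log format.

```python
import re

# Assumption: every (re)start of a task writes one "Initializing" entry to the
# result log, as the "Command line, Initializing, Running" groups suggest.
START_MARKER = re.compile(r"\bInitializing\b", re.IGNORECASE)


def count_restarts(result_log: str) -> int:
    """Return how many times the task started again after its first start."""
    starts = len(START_MARKER.findall(result_log))
    # The first start is not a restart; an empty log counts as zero restarts.
    return max(starts - 1, 0)


# Hypothetical log text, made up for illustration only.
sample_log = """\
Command line: (hypothetical)
Initializing
Running
Command line: (hypothetical)
Initializing
Running
"""

print(count_restarts(sample_log))
```

A count above zero would flag a result as having been reloaded from a checkpoint at least once, which is the pattern reported for the Invalid results.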
---------------------------------------- |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Project Name: Mapping Cancer Markers
Created: 11/29/2013 21:17:03
Name: MCM1_0000292_9680
Minimum Quorum: 2
Replication: 5

| Result Name | App Version Number | Status | Sent Time | Time Due / Return Time | CPU Time / Elapsed Time (hours) | Claimed / Granted BOINC Credit |
|---|---|---|---|---|---|---|
| MCM1_0000292_9680_4-- | - | In Progress | 12/7/13 13:20:27 | 12/11/13 01:20:27 | 0.00 | 0.0 / 0.0 |
| MCM1_0000292_9680_3-- | 726 | Pending Verification | 12/6/13 16:55:53 | 12/7/13 13:20:01 | 3.12 | 59.8 / 0.0 |
| MCM1_0000292_9680_2-- | 726 | Pending Verification | 12/2/13 09:56:10 | 12/6/13 16:51:41 | 3.24 | 70.2 / 0.0 |
| MCM1_0000292_9680_1-- | 726 | Pending Verification | 12/1/13 09:24:54 | 12/1/13 22:40:27 | 4.05 | 70.9 / 0.0 |
| MCM1_0000292_9680_0-- | 726 | Pending Verification | 12/1/13 09:24:45 | 12/2/13 09:55:42 | 6.79 | 77.2 / 0.0 <-- mine |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Project Name: Mapping Cancer Markers
Created: 11/30/2013 01:51:21
Name: MCM1_0000308_0969
Minimum Quorum: 2
Replication: 4

| Result Name | App Version Number | Status | Sent Time | Time Due / Return Time | CPU Time / Elapsed Time (hours) | Claimed / Granted BOINC Credit |
|---|---|---|---|---|---|---|
| MCM1_0000308_0969_3-- | 726 | Valid | 07/12/13 07:00:19 | 07/12/13 16:50:40 | 4.74 | 96.5 / 98.1 |
| MCM1_0000308_0969_2-- | 726 | Valid | 06/12/13 23:51:28 | 07/12/13 06:59:46 | 4.80 | 99.6 / 98.1 |
| MCM1_0000308_0969_1-- | 726 | Invalid | 01/12/13 12:58:24 | 02/12/13 18:41:09 | 6.36 | 126.6 / 49.0 |
| MCM1_0000308_0969_0-- | 726 | Invalid | 01/12/13 12:58:21 | 06/12/13 23:51:11 | 15.21 | 112.5 / 49.0 |

All have the same Result.out. _0 restarted five times. _1 restarted three times. _2 and _3 ran straight through. It would be interesting to find one where two Invalids both restarted the same number of times, but I think I'm going to turn this machine over to FAAH until this is fixed. It's just wasting everyone's time to have a machine that gets switched off run MCM1. (And, no, it's not in my control and it's a desktop machine, so there's no option to suspend it.) |
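The search suggested above, for a work unit where two Invalid results restarted the same number of times, might be sketched as follows. The tuple layout and the `invalids_with_same_restarts` helper are hypothetical; the rows are the statuses and restart counts reported in this post, and the restart counts would still have to be read from each result log by hand.

```python
from collections import defaultdict

# (work unit, result suffix, status, restarts) as reported in the post above.
results = [
    ("MCM1_0000308_0969", 0, "Invalid", 5),
    ("MCM1_0000308_0969", 1, "Invalid", 3),
    ("MCM1_0000308_0969", 2, "Valid", 0),
    ("MCM1_0000308_0969", 3, "Valid", 0),
]


def invalids_with_same_restarts(rows):
    """Map (work unit, restart count) to the Invalid result suffixes sharing it.

    Only groups containing at least two Invalid results are returned.
    """
    groups = defaultdict(list)
    for work_unit, suffix, status, restarts in rows:
        if status == "Invalid":
            groups[(work_unit, restarts)].append(suffix)
    return {key: suffixes for key, suffixes in groups.items() if len(suffixes) >= 2}


print(invalids_with_same_restarts(results))
```

For the work unit shown, the two Invalid results restarted five and three times, so no matching pair is found.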
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
....I did the same with one of mine that my son uses as a games machine, so it is constantly suspending and LAIM cannot be used. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Weird.
Before I changed things, this unit went valid:

Project Name: Mapping Cancer Markers
Created: 12/05/2013 23:25:38
Name: MCM1_0000433_1016
Minimum Quorum: 2
Replication: 2

| Result Name | App Version Number | Status | Sent Time | Time Due / Return Time | CPU Time / Elapsed Time (hours) | Claimed / Granted BOINC Credit |
|---|---|---|---|---|---|---|
| MCM1_0000433_1016_1-- | 726 | Valid | 06/12/13 23:25:28 | 07/12/13 07:55:35 | 1.96 | 73.3 / 62.8 |
| MCM1_0000433_1016_0-- | 726 | Valid | 06/12/13 23:25:25 | 08/12/13 18:06:27 | 7.18 | 52.3 / 62.8 |

and although the wingman went straight through, the unit on this machine restarted twice - and it still validated. Maybe something has now changed? Anyway, I'll let it run some more and see what happens. |
||
|