World Community Grid Forums
Thread Status: Active Total posts in this thread: 18
martin64
Senior Cruncher Germany Joined: May 11, 2009 Post Count: 445 Status: Offline
My computer is set to go into hibernation when inactive for more than 2 hours or so. It works fine with WCG usually, but with the new betas, which I am watching a bit more closely, I discovered something unusual:
[01:34:44] Starting job 13,CPU time has been restored to 14649.687500.
08:12:05 (9584): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[08:12:11] Number of jobs = 16
[08:12:11] Starting job 13,CPU time has been restored to 14649.687500.
[09:35:57] Finished Job #13

Computer went into hibernation at about 2:00, and it woke up around 8:12. Job 13 didn't resume, but it restarted. This heartbeat thing may have caused that - of course there is no heartbeat when the computer is turned off.

Since CEP2 has some jobs that run for around 2 hours, hibernation might waste some significant processor time. If the heartbeat is the issue, maybe the code could be extended to wait for 30 seconds *twice*, which would mean that after waking up from hibernation there are another 30 seconds for the WU to react?

Regards,
Martin
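The "wait twice" idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not BOINC's actual heartbeat code: the class name `HeartbeatWatchdog`, its methods, and the `freeze_threshold` parameter are all hypothetical. The assumption is that the science app polls about once per second, so a very long gap between two polls suggests the whole machine was suspended, and the app then grants the client a fresh 30-second window instead of exiting at once.

```python
class HeartbeatWatchdog:
    """Hypothetical sketch: decide whether the client should be treated as gone."""

    def __init__(self, timeout=30.0, freeze_threshold=10.0):
        self.timeout = timeout                    # seconds without a heartbeat before exiting
        self.freeze_threshold = freeze_threshold  # poll gap that implies we were frozen ourselves
        self.last_heartbeat = None
        self.last_poll = None

    def heartbeat(self, now):
        """Record that the client was heard from at time `now` (seconds)."""
        self.last_heartbeat = now

    def should_exit(self, now):
        """Poll once; return True only if the timeout expired while we were awake."""
        if self.last_heartbeat is None:
            self.last_heartbeat = now
        if self.last_poll is not None and now - self.last_poll > self.freeze_threshold:
            # We did not poll for a long stretch, so the machine was probably
            # suspended or hibernating: start a fresh timeout window.
            self.last_heartbeat = now
        self.last_poll = now
        return now - self.last_heartbeat > self.timeout


if __name__ == "__main__":
    wd = HeartbeatWatchdog()
    wd.heartbeat(0.0)
    print(wd.should_exit(1.0))       # normal poll shortly after a heartbeat
    print(wd.should_exit(22000.0))   # first poll after a long hibernation
```

In real use `now` would have to come from a clock that keeps counting across hibernation, such as wall-clock time; that choice is also an assumption of this sketch.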
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
It's the BOINC core client itself that controls this: it checks whether a running science app is really running. This mechanism has been a thorn in the side of all of BOINC for a very long time, and once in a while it crops up in discussions about how to change it so that the task is not restarted or even killed. Possibly the cc_config.xml option <start_delay>nseconds</start_delay> could be made to also act when coming out of hibernation, in addition to the regular boot-up delay, but that would have to be discussed on the BOINC developers Alpha-Mail list.
That said, I think this belongs in the Beta forum thread for v 6.35, beta 11. The question is whether this is reproducible or just a one-off on an extremely busy system that happened to power down/up right when the model was at its biggest, together with anything else that was loaded, such as the Firefox memory hog. --//--
||
|
martin64
Senior Cruncher Germany Joined: May 11, 2009 Post Count: 445 Status: Offline
"Whilst, I think this belongs in the Beta forum thread of v 6.35, beta 11, the question being, if this is reproducible or just one-off extreme busy system that happened to power down/up right when the model was at its biggest... and anything else that was loaded such as the Firefox memory hog."

I posted it here because it looked to me not so much like an issue of the beta, but of the CEP code. The system (dual core Win7/64) was idle except for 2 CEP2 WUs running. The other one didn't have that issue. I will watch whether it happens again with regular CEP2 WUs.

Regards,
Martin
||
|
sk..
Master Cruncher Joined: Mar 22, 2007 Post Count: 2324 Status: Offline
quantum chemistry at play:
Result Name: E201242_ 980_ A.31.C19H7N7S5.40.0.set1d06_ 1--
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[10:14:55] Number of jobs = 16
[10:14:55] Starting job 0,CPU time has been restored to 0.000000.
[10:18:22] Finished Job #0
[10:18:22] Starting job 1,CPU time has been restored to 199.781250.
[10:24:52] Finished Job #1
[10:24:52] Starting job 2,CPU time has been restored to 564.562500.
[14:11:33] Finished Job #2
[14:11:33] Starting job 3,CPU time has been restored to 13413.937500.
[14:19:13] Finished Job #3
[14:19:13] Starting job 4,CPU time has been restored to 13849.578125.
[14:25:14] Finished Job #4
[14:25:14] Starting job 5,CPU time has been restored to 14186.546875.
14:27:26 (3380): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[14:27:30] Number of jobs = 16
[14:27:30] Starting job 5,CPU time has been restored to 14186.546875.
[14:34:00] Finished Job #5

[Edit 1 times, last edit by skgiven at Feb 14, 2011 11:23:53 PM]
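For anyone who wants to spot this pattern without reading the whole stderr by eye, here is a minimal Python sketch. It assumes the log has one entry per line and the exact wording shown in the excerpts in this thread; the function name `find_heartbeat_restarts` is made up for illustration. It reports every job that was started a second time right after a "No heartbeat" message.

```python
import re

# Matches entries such as "[14:27:30] Starting job 5,CPU time has been restored to ..."
START_RE = re.compile(r"\[(\d\d:\d\d:\d\d)\] Starting job (\d+),")


def find_heartbeat_restarts(log_text):
    """Return (job, first_start, restart) tuples for jobs restarted after a lost heartbeat."""
    restarts = []
    last_start = None        # (job number, timestamp) of the latest "Starting job" entry
    heartbeat_lost = False
    for line in log_text.splitlines():
        if "No heartbeat from core client" in line:
            heartbeat_lost = True
            continue
        match = START_RE.search(line)
        if not match:
            continue
        timestamp, job = match.group(1), int(match.group(2))
        if heartbeat_lost and last_start is not None and last_start[0] == job:
            restarts.append((job, last_start[1], timestamp))
        last_start = (job, timestamp)
        heartbeat_lost = False
    return restarts


# Excerpt taken from the result log quoted in this thread.
SAMPLE = """\
[14:25:14] Starting job 5,CPU time has been restored to 14186.546875.
14:27:26 (3380): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[14:27:30] Number of jobs = 16
[14:27:30] Starting job 5,CPU time has been restored to 14186.546875.
[14:34:00] Finished Job #5
"""

if __name__ == "__main__":
    for job, first, again in find_heartbeat_restarts(SAMPLE):
        print(f"job {job}: started {first}, restarted {again} after a missed heartbeat")
```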
||
|
sk..
Master Cruncher Joined: Mar 22, 2007 Post Count: 2324 Status: Offline
and again,
Result Name: E201241_ 638_ A.28.C22H17N3SSi2.42.3.set1d06_ 0--
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
[14:21:55] Starting job 12,CPU time has been restored to 17133.812500.
14:27:26 (3396): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[14:27:31] Number of jobs = 16
[14:27:31] Starting job 12,CPU time has been restored to 17133.812500.

Same system, so probably the same issue at the same time. The tasks finished so it was not a serious issue, probably a system stall.

[ot] Below is a fine example of why we should let Boinc always run,

Result Name: E201232_ 765_ A.27.C23H16S2SeSi.238.3.set1d06_ 0--
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[09:27:35] Number of jobs = 16
[09:27:35] Starting job 0,CPU time has been restored to 0.000000.
[09:31:12] Finished Job #0
[09:31:12] Starting job 1,CPU time has been restored to 201.890625.
[09:42:09] Finished Job #1
[09:42:09] Starting job 2,CPU time has been restored to 846.093750.
[15:27:54] Finished Job #2
[15:27:54] Starting job 3,CPU time has been restored to 21079.765625.
[15:39:19] Finished Job #3
[15:39:19] Starting job 4,CPU time has been restored to 21747.843750.
[15:47:32] Finished Job #4
[15:47:32] Starting job 5,CPU time has been restored to 22233.937500.
[15:55:52] Finished Job #5
[15:55:52] Starting job 6,CPU time has been restored to 22725.375000.
[16:04:02] Finished Job #6
[16:04:02] Starting job 7,CPU time has been restored to 23209.421875.
[16:15:24] Finished Job #7
[16:15:24] Starting job 8,CPU time has been restored to 23880.562500.
[16:23:17] Finished Job #8
[16:23:17] Starting job 9,CPU time has been restored to 24346.453125.
[08:48:41] Number of jobs = 16
[08:48:41] Starting job 9,CPU time has been restored to 24346.453125.
Quit requested: Exiting
[08:53:36] Number of jobs = 16
[08:53:36] Starting job 9,CPU time has been restored to 24346.453125.
Quit requested: Exiting
[08:56:14] Number of jobs = 16
[08:56:14] Starting job 9,CPU time has been restored to 24346.453125.
[09:04:52] Finished Job #9
[09:04:52] Starting job 10,CPU time has been restored to 24858.859375.
Quit requested: Exiting
[09:08:36] Number of jobs = 16
[09:08:36] Starting job 10,CPU time has been restored to 24858.859375.
Quit requested: Exiting
[09:13:52] Number of jobs = 16
[09:13:52] Starting job 10,CPU time has been restored to 24858.859375.
Quit requested: Exiting
[09:19:52] Number of jobs = 16
[09:19:52] Starting job 10,CPU time has been restored to 24858.859375.
Quit requested: Exiting
[09:35:15] Number of jobs = 16
[09:35:15] Starting job 10,CPU time has been restored to 24858.859375.
Quit requested: Exiting
[09:40:55] Number of jobs = 16
[09:40:55] Starting job 10,CPU time has been restored to 24858.859375.
Quit requested: Exiting
[09:42:31] Number of jobs = 16
[09:42:31] Starting job 10,CPU time has been restored to 24858.859375.
Quit requested: Exiting
[09:51:08] Number of jobs = 16
[09:51:08] Starting job 10,CPU time has been restored to 24858.859375.
Quit requested: Exiting
[09:59:56] Number of jobs = 16
[09:59:56] Starting job 10,CPU time has been restored to 24858.859375.
[10:19:02] Finished Job #10
[10:19:02] Starting job 11,CPU time has been restored to 25996.515625.
[10:29:51] Finished Job #11
[10:29:51] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[10:43:45] Number of jobs = 16
[10:43:45] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[10:47:40] Number of jobs = 16
[10:47:40] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[10:59:22] Number of jobs = 16
[10:59:22] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[11:02:16] Number of jobs = 16
[11:02:16] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[11:05:39] Number of jobs = 16
[11:05:39] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[11:07:45] Number of jobs = 16
[11:07:45] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[11:10:25] Number of jobs = 16
[11:10:25] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[11:35:28] Number of jobs = 16
[11:35:28] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[11:52:38] Number of jobs = 16
[11:52:38] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[11:55:12] Number of jobs = 16
[11:55:12] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[12:01:44] Number of jobs = 16
[12:01:44] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[12:11:56] Number of jobs = 16
[12:11:56] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[12:23:56] Number of jobs = 16
[12:23:56] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[12:40:25] Number of jobs = 16
[12:40:25] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[12:49:23] Number of jobs = 16
[12:49:23] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[12:57:35] Number of jobs = 16
[12:57:35] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:01:28] Number of jobs = 16
[13:01:28] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:09:36] Number of jobs = 16
[13:09:36] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:20:44] Number of jobs = 16
[13:20:44] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:26:57] Number of jobs = 16
[13:26:57] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:32:11] Number of jobs = 16
[13:32:11] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:38:25] Number of jobs = 16
[13:38:25] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:42:47] Number of jobs = 16
[13:42:47] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:45:05] Number of jobs = 16
[13:45:05] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:46:43] Number of jobs = 16
[13:46:43] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:53:29] Number of jobs = 16
[13:53:29] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:59:32] Number of jobs = 16
[13:59:32] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[14:04:27] Number of jobs = 16
[14:04:27] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[14:16:43] Number of jobs = 16
[14:16:43] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[14:33:05] Number of jobs = 16
[14:33:05] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[14:39:06] Number of jobs = 16
[14:39:06] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[15:12:10] Number of jobs = 16
[15:12:10] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[15:17:43] Number of jobs = 16
[15:17:43] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[16:28:32] Number of jobs = 16
[16:28:32] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[16:38:05] Number of jobs = 16
[16:38:05] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[16:45:46] Number of jobs = 16
[16:45:46] Starting job 12,CPU time has been restored to 26634.906250.
[17:51:07] Finished Job #12
[17:51:07] Starting job 13,CPU time has been restored to 30524.453125.
[19:46:32] Finished Job #13
[19:46:32] Starting job 14,CPU time has been restored to 37425.218750.
Killing job because cpu time has been exceeded. Subjob start time = 0, Subjob current time = 1088570919
[21:22:59] Finished Job #14
21:23:11 (3324): called boinc_finish
</stderr_txt>

As someone once put it, doh!
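To put a number on how often each job was thrown back to its checkpoint, a log like the one above can be tallied with a short Python sketch. The function name `count_quit_restarts` is hypothetical, and the only assumption is that the log uses the exact "Quit requested: Exiting" and "Starting job N," wording seen here. It scans the text as a stream, so it does not matter whether the entries are on separate lines or run together.

```python
import re
from collections import Counter

# Either a quit notice or the start of a job, in the order they appear in the log.
EVENT_RE = re.compile(r"(Quit requested: Exiting)|Starting job (\d+),")


def count_quit_restarts(log_text):
    """Count, per job number, how often the job was started again after a quit request."""
    counts = Counter()
    quit_seen = False
    for match in EVENT_RE.finditer(log_text):
        if match.group(1):
            quit_seen = True
        elif quit_seen:
            counts[int(match.group(2))] += 1
            quit_seen = False
    return counts


# Short excerpt from the log quoted above.
SAMPLE = """\
[16:23:17] Starting job 9,CPU time has been restored to 24346.453125.
[08:48:41] Number of jobs = 16
[08:48:41] Starting job 9,CPU time has been restored to 24346.453125.
Quit requested: Exiting
[08:53:36] Number of jobs = 16
[08:53:36] Starting job 9,CPU time has been restored to 24346.453125.
Quit requested: Exiting
[08:56:14] Number of jobs = 16
[08:56:14] Starting job 9,CPU time has been restored to 24346.453125.
[09:04:52] Finished Job #9
[09:04:52] Starting job 10,CPU time has been restored to 24858.859375.
Quit requested: Exiting
[09:08:36] Number of jobs = 16
[09:08:36] Starting job 10,CPU time has been restored to 24858.859375.
"""

if __name__ == "__main__":
    for job, restarts in sorted(count_quit_restarts(SAMPLE).items()):
        print(f"job {job}: restarted {restarts} time(s) after a quit request")
```

Each counted restart begins again from the same restored CPU time, so whatever was computed since the last checkpoint is lost every time.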
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I noticed similar things when just waking the computer from monitor-off power-saving mode, while everything else, including the hard disk, was working as usual (no hibernation etc.).
Hibernation with CEP2 works fine for me, but I have to suspend these units (or the project) manually, with the LAIM option activated. After waking up the PC, the project continues without CPU time loss. Anyway, it seems to be a CEP2-specific problem; I never had any similar experiences with other projects.
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
OSes and modern CPUs these days are highly anticipative in behaviour. If a device has gone a long time without user input, things in memory get shuffled around substantially. I suppose that includes whatever is on screen, which then has to be fetched from somewhere.

The heartbeat issue is without a doubt a situation where the overall system was so busy that the BOINC service / core client (normal priority) and the science app (running at lowest priority) don't manage to get an interrupt in to tell that the science app is still alive, which

How to avoid this: Suspend BOINC, with the said LAIM on anyway as recommended for CEP2 in particular, then hibernate. Have the system resume, and then unsuspend BOINC. By the time you're able to do that, it is probable that everything has been loaded and is ready for action. Remember, hibernating writes the whole memory image to disk, including the monster CEP2 model(s).

With Vista/W7 and a USB stick, ReadyBoost on or off, that experience might be different. I've never investigated whether it really makes any perceptible difference at all beyond a first power-up.

--//--
edit: *MUST* inserted for emphasis.
[Edit 1 times, last edit by Former Member at Feb 15, 2011 11:52:12 AM]
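The suspend, hibernate, resume, unsuspend sequence can also be driven from a script rather than from the BOINC Manager menu. The following Python sketch assumes that `boinccmd` is installed at the Windows path shown (adjust it for your machine) and uses its `--set_run_mode` option with the `never` and `auto` modes, the same calls that appear in the batch files later in this thread. The function names are made up for illustration, and the LAIM setting still has to be enabled separately in the preferences.

```python
import subprocess

# Assumed install location; change this to wherever boinccmd lives on your system.
BOINCCMD = r"C:\Program Files\BOINC\boinccmd.exe"


def run_mode_command(mode, boinccmd=BOINCCMD):
    """Build the boinccmd argument list that switches the client's run mode."""
    if mode not in ("always", "auto", "never"):
        raise ValueError(f"unsupported run mode: {mode!r}")
    return [boinccmd, "--set_run_mode", mode]


def set_run_mode(mode, boinccmd=BOINCCMD):
    """Run boinccmd; raises subprocess.CalledProcessError if the command fails."""
    subprocess.run(run_mode_command(mode, boinccmd), check=True)


def suspend_before_hibernate():
    """Step 1: stop all computation before the machine goes down."""
    set_run_mode("never")


def resume_after_wake():
    """Step 2: once the system has settled after waking, return to preference-based running."""
    set_run_mode("auto")


if __name__ == "__main__":
    # Only print the commands here so the sketch can be tried without BOINC installed.
    print(run_mode_command("never"))
    print(run_mode_command("auto"))
```

Calling `resume_after_wake()` by hand, or after a deliberate delay, is what gives the system time to reload everything before the science app has to answer the heartbeat.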
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I have never made the monitor power-down feature work correctly on my HP Windows 7 64-bit computer that I got last October. For some reason, the whole computer freezes if the monitor powers down and I have to do a power-off reboot. I spent 2 weeks on that problem before I finally decided to live with it. Perhaps some chip has a broken circuit or the motherboard has an undiagnosed fault?
Anyway, I think that a lot of hibernation problems are caused by poor implementation by the hardware manufacturers.
Lawrence
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi everybody,
Maybe the IBM/WCG team has an idea on this issue. Well, you could just forget about the hibernation and keep CEP2 running all night…
Best wishes from
Your Harvard CEP team
||
|
tfmagnetism
Cruncher Joined: Jul 22, 2011 Post Count: 25 Status: Offline
Also having a heartbeat issue with hibernation here.
I've just started running CEP2 again, this time on 7.0.64 (W7 64-bit), after previously running it on 6.12.34 on the same machine. I've found that where previously job 3(?) would usually checkpoint at anywhere up to the 40%-ish progress mark (or several hours of CPU time, say), the job 3 checkpoint is now at over 9 hrs CPU time! Therefore, instead of shutting down Windows as I have been doing since I last ran CEP2, I've gone back to hybrid sleeping (and then pulling the plug), AKA hibernating/resuming, as I did with 6.12.34. This used to pose no problem (with the Activity menu in BOINC Manager left set at "Run based on preferences"), but now I'm getting a heartbeat problem on resume, with CEP2 resetting to the last checkpoint.

The machine isn't the fastest in the world (AMD Athlon X2 5000+ w/ 2GB DDR2 RAM) and takes several minutes to resume, so I'm guessing, since the HD activity LED is on solidly for certainly more than 30s on resume, that this busy system isn't affording adequate heartbeat communications during resume.

I've noticed that the timestamp of "Resuming computation" in the BOINC event log exactly coincides with the first logged event in Event Viewer on Windows after the resume (something about the system time having changed). Isn't this a bit early? "Resuming computation" is also happening nearly 3 minutes before "Windows is resuming operations" as noted in the BOINC event log. If Windows isn't resuming operations until 3 minutes later, why doesn't "Resuming computation" happen *after* this instead of before? After all, "Suspending computation" happens *after* "Windows is suspending operations".

Please see the logs below (computer sleep ~ 03:40):

stdoutdae.txt (BOINC Event Log):
28-Jul-2013 03:39:34 [---] Windows is suspending operations
28-Jul-2013 03:39:34 [---] Suspending computation - requested by operating system
28-Jul-2013 03:39:34 [---] Suspending network activity - requested by operating system
28-Jul-2013 03:39:55 [---] Resuming after OS suspension
28-Jul-2013 16:07:40 [---] Resuming computation
28-Jul-2013 16:07:40 [---] Resuming network activity
28-Jul-2013 16:09:19 [World Community Grid] Task E214657_724_C.35.C30H18S3Si2.00815051.1.set1d06_2 exited with zero status but no 'finished' file
28-Jul-2013 16:09:19 [World Community Grid] If this happens repeatedly you may need to reset the project.
28-Jul-2013 16:10:17 [---] Windows is resuming operations

stderr.txt (CEP2 slots folder in c:\ProgramData\BOINC):
16:08:39 (4472): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
16:08:44 (4472): No heartbeat from core client for 30 sec - exiting
16:08:45 (4472): No heartbeat from core client for 30 sec - exiting
16:08:46 (4472): No heartbeat from core client for 30 sec - exiting
16:08:47 (4472): No heartbeat from core client for 30 sec - exiting
16:08:48 (4472): No heartbeat from core client for 30 sec - exiting
16:08:49 (4472): No heartbeat from core client for 30 sec - exiting
16:08:50 (4472): No heartbeat from core client for 30 sec - exiting
16:08:51 (4472): No heartbeat from core client for 30 sec - exiting
16:08:52 (4472): No heartbeat from core client for 30 sec - exiting
16:08:53 (4472): No heartbeat from core client for 30 sec - exiting
16:08:54 (4472): No heartbeat from core client for 30 sec - exiting
16:08:55 (4472): No heartbeat from core client for 30 sec - exiting
16:08:56 (4472): No heartbeat from core client for 30 sec - exiting

After reading previous posts, I have manually tried using Activity->Suspend from BOINC Manager before sleeping (and hence hibernating) to see if this helps. I have noted that this seems to mitigate the problem somewhat, although not completely cure it (more observation needed).

I have therefore looked into automating this process with a couple of simple batch files. I tried Task Scheduler in Windows, triggering on sleep/hibernate from the event [Log name: System, Source: Kernel-Power, Event ID: 42], but this happens too late in the sleep process to be of help (it doesn't trigger until resume). I have therefore opted instead for "Power Triggers" at http://win7suspendresume.codeplex.com/. I have two batch files in use:

1) BOINCsleep.bat
============
"C:\Program Files\BOINC\boinccmd.exe" --set_run_mode never
cmd /V:ON /C "echo sleep OK at !date! !time!">>D:\Users\Public\BOINCsuspend\BAToutput.txt

2) BOINCresume.bat
=============
"C:\Program Files\BOINC\boinccmd.exe" --set_run_mode auto
cmd /V:ON /C "echo resume OK at !date! !time!">>D:\Users\Public\BOINCsuspend\BAToutput.txt

The first line in each file is the only one needed. The second line just prints out the time and date to a log file (D:\...BAToutput.txt). You should change the path and filename there to suit your system, as well as the path to the boinccmd.exe file in your BOINC folder.

In setting up "Power Triggers", use the "Start Process" tab and the "Other" button, then specify the path to the relevant batch file. In the "execution sequence" box to the right, I used Action: "Wait for exit"; Applies to: "Service account". You can use the "Test" button here to test running your batch file for suspend and for resume.

My "Resuming computation" is now occurring about 3 minutes later than it would have before (i.e. compared to "Resuming network activity") and after "Windows is resuming operations":

29-Jul-2013 03:47:08 [---] Windows is suspending operations
29-Jul-2013 03:47:08 [---] Suspending network activity - requested by operating system
29-Jul-2013 03:47:21 [---] Resuming after OS suspension
29-Jul-2013 14:41:44 [---] Resuming network activity
29-Jul-2013 14:42:23 [---] Windows is resuming operations
29-Jul-2013 14:44:09 [World Community Grid] Task E214657_724_C.35.C30H18S3Si2.00815051.1.set1d06_2 exited with zero status but no 'finished' file
29-Jul-2013 14:44:09 [World Community Grid] If this happens repeatedly you may need to reset the project.
29-Jul-2013 14:44:43 [---] Resuming computation

Notice that I had a heartbeat problem again today after using Power Triggers, even though the CEP2 task was *manually* suspended (suspend button in the Tasks tab). As I said earlier, suspending things (either this way or through the menu Activity->Suspend) doesn't seem to completely solve the problem. I'll post an update when I have more info on how much this is all helping.
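The ordering of "Resuming computation" relative to "Windows is resuming operations" can be checked mechanically. The Python sketch below assumes the event-log line format shown in this post (date, time, then the message) and English month abbreviations; the helper names `event_time` and `resume_lead_seconds` are hypothetical.

```python
from datetime import datetime


def event_time(log_lines, message):
    """Return the timestamp of the first event-log line whose text contains `message`."""
    for line in log_lines:
        if message in line:
            # Lines look like "28-Jul-2013 16:07:40 [---] Resuming computation".
            stamp = " ".join(line.split()[:2])
            return datetime.strptime(stamp, "%d-%b-%Y %H:%M:%S")
    return None


def resume_lead_seconds(log_lines):
    """Seconds by which 'Resuming computation' precedes 'Windows is resuming operations'.

    A positive value means BOINC resumed computing before Windows reported it was back;
    a negative value means computation resumed afterwards.
    """
    computing = event_time(log_lines, "Resuming computation")
    windows = event_time(log_lines, "Windows is resuming operations")
    if computing is None or windows is None:
        return None
    return (windows - computing).total_seconds()


# Relevant lines from the two event logs quoted in this post.
WITHOUT_TRIGGERS = [
    "28-Jul-2013 16:07:40 [---] Resuming computation",
    "28-Jul-2013 16:07:40 [---] Resuming network activity",
    "28-Jul-2013 16:10:17 [---] Windows is resuming operations",
]
WITH_TRIGGERS = [
    "29-Jul-2013 14:41:44 [---] Resuming network activity",
    "29-Jul-2013 14:42:23 [---] Windows is resuming operations",
    "29-Jul-2013 14:44:43 [---] Resuming computation",
]

if __name__ == "__main__":
    print(resume_lead_seconds(WITHOUT_TRIGGERS))
    print(resume_lead_seconds(WITH_TRIGGERS))
```

A positive result for the first log and a negative one for the second would match what is described above: with the batch files in place, computation resumes only after Windows reports that it is back.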
||
|