Thread Status: Active
Total posts in this thread: 18
This topic has been viewed 14279 times and has 17 replies
martin64
Senior Cruncher
Germany
Joined: May 11, 2009
Post Count: 445
Status: Offline
CEP2 and hibernation

My computer is set to go into hibernation when inactive for more than 2 hours or so. It usually works fine with WCG, but with the new betas, which I am watching a bit more closely, I discovered something unusual:

[01:34:44] Starting job 13,CPU time has been restored to 14649.687500.
08:12:05 (9584): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[08:12:11] Number of jobs = 16
[08:12:11] Starting job 13,CPU time has been restored to 14649.687500.
[09:35:57] Finished Job #13

The computer went into hibernation at about 2:00 and woke up around 8:12. Job 13 didn't resume where it left off; it restarted from the beginning. This heartbeat thing may have caused that - of course there is no heartbeat when the computer is turned off ;-)

Since CEP2 has some jobs that run for around 2 hours, hibernation might waste some significant processor time. If the heartbeat is the issue, maybe the code could be extended to wait for 30 seconds *twice*, which would mean that after waking up from hibernation there are another 30 seconds for the WU to react?
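
To make the "wait for 30 seconds twice" idea concrete, here is a minimal Python sketch of a watchdog that only gives up after two consecutive windows without a heartbeat. This is not BOINC's actual code: the class name `HeartbeatWatchdog`, the `grace_windows` parameter and the explicit `now` arguments are all assumptions made for illustration.

```python
class HeartbeatWatchdog:
    """Hypothetical watchdog: report exit only after `grace_windows`
    consecutive timeout windows have passed without a heartbeat."""

    def __init__(self, timeout=30.0, grace_windows=2, now=0.0):
        self.timeout = timeout
        self.grace_windows = grace_windows
        self.window_start = now  # start of the current wait window
        self.missed = 0          # consecutive windows without a heartbeat

    def heartbeat(self, now):
        """Record a heartbeat and reset the miss counter."""
        self.window_start = now
        self.missed = 0

    def should_exit(self, now):
        """Call periodically; True once all grace windows are used up."""
        if now - self.window_start >= self.timeout:
            self.missed += 1
            # Start the next window from *now*, so a clock jump after
            # hibernation costs one window instead of all of them.
            self.window_start = now
        return self.missed >= self.grace_windows


wd = HeartbeatWatchdog(now=0.0)
print(wd.should_exit(6 * 3600.0))       # first check after a 6 h sleep: False
print(wd.should_exit(6 * 3600.0 + 31))  # still no heartbeat 31 s later: True
```

The point of the sketch is the reset of `window_start` on a miss: the long gap caused by hibernation counts as one missed window, and the app then has a fresh 30 seconds to hear from the client before exiting.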

Regards,
Martin
----------------------------------------

[Jan 12, 2011 10:48:14 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: CEP2 and hibernation

It's the BOINC core client itself that controls this: it checks whether a running science app is really still running. This mechanism has been a thorn in the side of all of BOINC for a very long time, and once in a while it crops up in discussions about how to change it so that the task is not restarted or even killed. Possibly the cc_config.xml <start_delay>nseconds</start_delay> option could be made to also act when coming out of hibernation, in addition to the regular boot-up delay, but that would have to be discussed on the BOINC developers' Alpha-Mail list.

That said, I think this belongs in the Beta forum thread for v 6.35, beta 11. The question is whether this is reproducible or just a one-off on an extremely busy system that happened to power down/up right when the model was at its biggest... plus anything else that was loaded, such as the Firefox memory hog.

--//--
[Jan 12, 2011 11:10:50 AM]
martin64
Senior Cruncher
Germany
Joined: May 11, 2009
Post Count: 445
Status: Offline
Re: CEP2 and hibernation

Quote: "Whilst, I think this belongs in the Beta forum thread of v 6.35, beta 11, the question being, if this is reproducible or just one-off extreme busy system that happened to power down/up right when the model was at it's biggest... and anything else that was loaded such as the Firefox memory hog."

I posted it here because it looked to me not so much like an issue with the beta as with the CEP code. The system (dual core Win7/64) was idle except for 2 CEP2 WUs running. The other one didn't have that issue. I will watch whether it happens again with regular CEP2 WUs.

Regards,
Martin
----------------------------------------

[Jan 12, 2011 1:38:13 PM]
sk..
Master Cruncher
Joined: Mar 22, 2007
Post Count: 2324
Status: Offline
Re: CEP2 and hibernation

quantum chemistry at play:

Result Name: E201242_ 980_ A.31.C19H7N7S5.40.0.set1d06_ 1--
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[10:14:55] Number of jobs = 16
[10:14:55] Starting job 0,CPU time has been restored to 0.000000.
[10:18:22] Finished Job #0
[10:18:22] Starting job 1,CPU time has been restored to 199.781250.
[10:24:52] Finished Job #1
[10:24:52] Starting job 2,CPU time has been restored to 564.562500.
[14:11:33] Finished Job #2
[14:11:33] Starting job 3,CPU time has been restored to 13413.937500.
[14:19:13] Finished Job #3
[14:19:13] Starting job 4,CPU time has been restored to 13849.578125.
[14:25:14] Finished Job #4
[14:25:14] Starting job 5,CPU time has been restored to 14186.546875.
14:27:26 (3380): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting

[14:27:30] Number of jobs = 16
[14:27:30] Starting job 5,CPU time has been restored to 14186.546875.
[14:34:00] Finished Job #5
----------------------------------------
[Edit 1 times, last edit by skgiven at Feb 14, 2011 11:23:53 PM]
[Jan 12, 2011 9:04:53 PM]
sk..
Master Cruncher
Joined: Mar 22, 2007
Post Count: 2324
Status: Offline
Re: CEP2 and hibernation

and again,
Result Name: E201241_ 638_ A.28.C22H17N3SSi2.42.3.set1d06_ 0--
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>

[14:21:55] Starting job 12,CPU time has been restored to 17133.812500.
14:27:26 (3396): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[14:27:31] Number of jobs = 16
[14:27:31] Starting job 12,CPU time has been restored to 17133.812500.

Same system, so probably the same issue at the same time. The tasks finished so it was not a serious issue, probably a system stall.
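
A quick way to spot this pattern in a longer stderr excerpt is to look for a job number that is started more than once. Below is a minimal Python sketch, assuming the "Starting job N," line format seen in the excerpts above; the function name `restarted_jobs` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as "[14:21:55] Starting job 12,CPU time has been ..."
START_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] Starting job (\d+),")


def restarted_jobs(stderr_text):
    """Return {job_number: number_of_extra_starts} for jobs started more than once."""
    starts = Counter()
    for line in stderr_text.splitlines():
        match = START_RE.match(line.strip())
        if match:
            starts[int(match.group(2))] += 1
    return {job: count - 1 for job, count in starts.items() if count > 1}


sample = """\
[14:21:55] Starting job 12,CPU time has been restored to 17133.812500.
14:27:26 (3396): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[14:27:31] Number of jobs = 16
[14:27:31] Starting job 12,CPU time has been restored to 17133.812500.
"""
print(restarted_jobs(sample))  # {12: 1}
```

Each extra start means the work done since the previous start of that job was thrown away, which is why the restored CPU time is identical on both lines.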

[ot]
Below is a fine example of why we should let BOINC always run:

Result Name: E201232_ 765_ A.27.C23H16S2SeSi.238.3.set1d06_ 0--
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[09:27:35] Number of jobs = 16
[09:27:35] Starting job 0,CPU time has been restored to 0.000000.
[09:31:12] Finished Job #0
[09:31:12] Starting job 1,CPU time has been restored to 201.890625.
[09:42:09] Finished Job #1
[09:42:09] Starting job 2,CPU time has been restored to 846.093750.
[15:27:54] Finished Job #2
[15:27:54] Starting job 3,CPU time has been restored to 21079.765625.
[15:39:19] Finished Job #3
[15:39:19] Starting job 4,CPU time has been restored to 21747.843750.
[15:47:32] Finished Job #4
[15:47:32] Starting job 5,CPU time has been restored to 22233.937500.
[15:55:52] Finished Job #5
[15:55:52] Starting job 6,CPU time has been restored to 22725.375000.
[16:04:02] Finished Job #6
[16:04:02] Starting job 7,CPU time has been restored to 23209.421875.
[16:15:24] Finished Job #7
[16:15:24] Starting job 8,CPU time has been restored to 23880.562500.
[16:23:17] Finished Job #8
[16:23:17] Starting job 9,CPU time has been restored to 24346.453125.
[08:48:41] Number of jobs = 16
[08:48:41] Starting job 9,CPU time has been restored to 24346.453125.
Quit requested: Exiting
[08:53:36] Number of jobs = 16
[08:53:36] Starting job 9,CPU time has been restored to 24346.453125.
Quit requested: Exiting
[08:56:14] Number of jobs = 16
[08:56:14] Starting job 9,CPU time has been restored to 24346.453125.
[09:04:52] Finished Job #9
[09:04:52] Starting job 10,CPU time has been restored to 24858.859375.
Quit requested: Exiting
[09:08:36] Number of jobs = 16
[09:08:36] Starting job 10,CPU time has been restored to 24858.859375.
Quit requested: Exiting
[09:13:52] Number of jobs = 16
[09:13:52] Starting job 10,CPU time has been restored to 24858.859375.
Quit requested: Exiting
[09:19:52] Number of jobs = 16
[09:19:52] Starting job 10,CPU time has been restored to 24858.859375.
Quit requested: Exiting
[09:35:15] Number of jobs = 16
[09:35:15] Starting job 10,CPU time has been restored to 24858.859375.
Quit requested: Exiting
[09:40:55] Number of jobs = 16
[09:40:55] Starting job 10,CPU time has been restored to 24858.859375.
Quit requested: Exiting
[09:42:31] Number of jobs = 16
[09:42:31] Starting job 10,CPU time has been restored to 24858.859375.
Quit requested: Exiting
[09:51:08] Number of jobs = 16
[09:51:08] Starting job 10,CPU time has been restored to 24858.859375.
Quit requested: Exiting
[09:59:56] Number of jobs = 16
[09:59:56] Starting job 10,CPU time has been restored to 24858.859375.
[10:19:02] Finished Job #10
[10:19:02] Starting job 11,CPU time has been restored to 25996.515625.
[10:29:51] Finished Job #11
[10:29:51] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[10:43:45] Number of jobs = 16
[10:43:45] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[10:47:40] Number of jobs = 16
[10:47:40] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[10:59:22] Number of jobs = 16
[10:59:22] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[11:02:16] Number of jobs = 16
[11:02:16] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[11:05:39] Number of jobs = 16
[11:05:39] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[11:07:45] Number of jobs = 16
[11:07:45] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[11:10:25] Number of jobs = 16
[11:10:25] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[11:35:28] Number of jobs = 16
[11:35:28] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[11:52:38] Number of jobs = 16
[11:52:38] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[11:55:12] Number of jobs = 16
[11:55:12] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[12:01:44] Number of jobs = 16
[12:01:44] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[12:11:56] Number of jobs = 16
[12:11:56] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[12:23:56] Number of jobs = 16
[12:23:56] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[12:40:25] Number of jobs = 16
[12:40:25] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[12:49:23] Number of jobs = 16
[12:49:23] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[12:57:35] Number of jobs = 16
[12:57:35] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:01:28] Number of jobs = 16
[13:01:28] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:09:36] Number of jobs = 16
[13:09:36] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:20:44] Number of jobs = 16
[13:20:44] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:26:57] Number of jobs = 16
[13:26:57] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:32:11] Number of jobs = 16
[13:32:11] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:38:25] Number of jobs = 16
[13:38:25] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:42:47] Number of jobs = 16
[13:42:47] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:45:05] Number of jobs = 16
[13:45:05] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:46:43] Number of jobs = 16
[13:46:43] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:53:29] Number of jobs = 16
[13:53:29] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[13:59:32] Number of jobs = 16
[13:59:32] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[14:04:27] Number of jobs = 16
[14:04:27] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[14:16:43] Number of jobs = 16
[14:16:43] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[14:33:05] Number of jobs = 16
[14:33:05] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[14:39:06] Number of jobs = 16
[14:39:06] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[15:12:10] Number of jobs = 16
[15:12:10] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[15:17:43] Number of jobs = 16
[15:17:43] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[16:28:32] Number of jobs = 16
[16:28:32] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[16:38:05] Number of jobs = 16
[16:38:05] Starting job 12,CPU time has been restored to 26634.906250.
Quit requested: Exiting
[16:45:46] Number of jobs = 16
[16:45:46] Starting job 12,CPU time has been restored to 26634.906250.
[17:51:07] Finished Job #12
[17:51:07] Starting job 13,CPU time has been restored to 30524.453125.
[19:46:32] Finished Job #13
[19:46:32] Starting job 14,CPU time has been restored to 37425.218750.
Killing job because cpu time has been exceeded. Subjob start time = 0, Subjob current time = 1088570919
[21:22:59] Finished Job #14
21:23:11 (3324): called boinc_finish

</stderr_txt>

As someone once put it, doh!
[Feb 14, 2011 11:41:11 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: CEP2 and hibernation

I noticed similar things when just waking the computer from monitor-off power saving mode, while everything else, including the hard disk, was working as usual (no hibernation etc.).
Hibernation with CEP2 works fine for me, but I have to suspend these units (or the project) manually, with the LAIM option activated. After waking up the PC, the project continues without CPU time loss.
Anyway, it seems to be a CEP2-specific problem. I never had any similar experiences with other projects.
[Feb 15, 2011 11:18:01 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: CEP2 and hibernation

OSes and modern CPUs these days are highly anticipative in behaviour. If a device has gone a long time without user input, things in memory get shuffled around substantially. I suppose that means the stuff that is on screen has to be fetched from somewhere.

The heartbeat issue is without a doubt a situation where the overall system was so busy that the BOINC service / core client (normal priority) and the science app (running at lowest priority) don't manage to get an interrupt in to signal that the science app is still alive, which it MUST do at least once every 30 seconds. This often happens simultaneously with high disk activity (usually memory swapping).

How to avoid this: suspend BOINC, with the said LAIM on anyway as recommended for CEP2 in particular, then hibernate. Have the system resume, and then unsuspend BOINC. By the time you're able to do that, everything has probably been loaded and is ready for action.

Remember, hibernating writes the whole memory image to disk, including the monster CEP2 model(s). With Vista/W7 and a USB stick, with ReadyBoost on or off, that experience might be different. I've never investigated whether it really makes any perceptible difference at all beyond a first power-up.

--//--

edit: *MUST* inserted for emphasis.
----------------------------------------
[Edit 1 times, last edit by Former Member at Feb 15, 2011 11:52:12 AM]
[Feb 15, 2011 11:50:16 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: CEP2 and hibernation

I have never made the monitor power-down feature work correctly on the HP Windows 7 64-bit computer that I got last October. For some reason, the whole computer freezes if the monitor powers down, and I have to do a power-off reboot. I spent 2 weeks on that problem before I finally decided to live with it. Perhaps some chip has a broken circuit or the motherboard has an undiagnosed fault? :-(

Anyway, I think that a lot of hibernation problems are caused by poor implementation by the hardware manufacturers.

Lawrence
[Feb 15, 2011 1:53:02 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: CEP2 and hibernation

Hi everybody,

Maybe the IBM/WCG team has an idea on this issue. Well, you could just forget about the hibernation and keep CEP2 running all night… :-D

Best wishes from

Your Harvard CEP team
[Feb 15, 2011 6:48:16 PM]
tfmagnetism
Cruncher
Joined: Jul 22, 2011
Post Count: 25
Status: Offline
Re: CEP2 and hibernation

Also having a heartbeat issue with hibernation here.

I've just started running CEP2 again, this time on 7.0.64 (W7 64-bit), after previously running it on 6.12.34 on the same machine. Previously, job 3(?) would usually checkpoint anywhere up to the 40%-ish progress mark (or after several hours of CPU time, say); now the job 3 checkpoint comes at over 9 hrs CPU time! Therefore, instead of shutting down Windows as I have been doing since I last ran CEP2, I've gone back to hybrid sleeping (and then pulling the plug), AKA hibernating/resuming, as I did with 6.12.34. This used to pose no problem (with the Activity menu in BOINC Manager left set at "Run based on preferences"), but now I'm getting a heartbeat problem on resume, with CEP2 resetting to the last checkpoint.

The machine isn't the fastest in the world (AMD Athlon X2 5000+ w/ 2GB DDR2 RAM) and takes several minutes to resume. Since the HD activity LED is on solidly for certainly more than 30 s on resume, I'm guessing that this busy system isn't affording adequate heartbeat communications during resume.

I've noticed that the timestamp of "Resuming computation" in the BOINC event log exactly coincides with the first logged event in the Windows Event Viewer after the resume (something about the system time having changed). Isn't this a bit early? "Resuming computation" is also happening nearly 3 minutes before "Windows is resuming operations" as noted in the BOINC event log. If Windows isn't resuming operations until 3 minutes later, why doesn't "Resuming computation" happen *after* this instead of before? After all, "Suspending computation" happens *after* "Windows is suspending operations".

Please see the logs below (computer sleep ~ 03:40):

stdoutdae.txt (BOINC Event Log):

28-Jul-2013 03:39:34 [---] Windows is suspending operations
28-Jul-2013 03:39:34 [---] Suspending computation - requested by operating system
28-Jul-2013 03:39:34 [---] Suspending network activity - requested by operating system
28-Jul-2013 03:39:55 [---] Resuming after OS suspension
28-Jul-2013 16:07:40 [---] Resuming computation
28-Jul-2013 16:07:40 [---] Resuming network activity
28-Jul-2013 16:09:19 [World Community Grid] Task E214657_724_C.35.C30H18S3Si2.00815051.1.set1d06_2 exited with zero status but no 'finished' file
28-Jul-2013 16:09:19 [World Community Grid] If this happens repeatedly you may need to reset the project.
28-Jul-2013 16:10:17 [---] Windows is resuming operations

stderr.txt (CEP2 slots folder in c:\ProgramData\BOINC):

16:08:39 (4472): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
16:08:44 (4472): No heartbeat from core client for 30 sec - exiting
16:08:45 (4472): No heartbeat from core client for 30 sec - exiting
16:08:46 (4472): No heartbeat from core client for 30 sec - exiting
16:08:47 (4472): No heartbeat from core client for 30 sec - exiting
16:08:48 (4472): No heartbeat from core client for 30 sec - exiting
16:08:49 (4472): No heartbeat from core client for 30 sec - exiting
16:08:50 (4472): No heartbeat from core client for 30 sec - exiting
16:08:51 (4472): No heartbeat from core client for 30 sec - exiting
16:08:52 (4472): No heartbeat from core client for 30 sec - exiting
16:08:53 (4472): No heartbeat from core client for 30 sec - exiting
16:08:54 (4472): No heartbeat from core client for 30 sec - exiting
16:08:55 (4472): No heartbeat from core client for 30 sec - exiting
16:08:56 (4472): No heartbeat from core client for 30 sec - exiting
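
The "nearly 3 minutes" gap can be checked directly from the event log excerpt above. Here is a minimal Python sketch, assuming the "DD-Mon-YYYY HH:MM:SS" timestamp prefix seen in stdoutdae.txt; the helper name `event_time` is made up for illustration.

```python
from datetime import datetime


def event_time(log_text, message):
    """Return the timestamp of the first event-log line containing `message`."""
    for line in log_text.splitlines():
        line = line.strip()
        if message in line:
            # The first 20 characters look like "28-Jul-2013 16:07:40".
            # %b expects English month abbreviations (the default C locale).
            return datetime.strptime(line[:20], "%d-%b-%Y %H:%M:%S")
    raise ValueError(f"no line containing {message!r}")


log = """\
28-Jul-2013 03:39:34 [---] Windows is suspending operations
28-Jul-2013 16:07:40 [---] Resuming computation
28-Jul-2013 16:10:17 [---] Windows is resuming operations
"""
gap = event_time(log, "Windows is resuming operations") - event_time(log, "Resuming computation")
print(gap.total_seconds())  # 157.0
```

So in this log computation was resumed 157 seconds before Windows reported that it was resuming operations, and the "No heartbeat" messages in stderr.txt (16:08:39 onwards) fall inside that window.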

After reading previous posts, I have manually tried using Activity->Suspend from BOINC manager to see if this helps before sleeping (and hence hibernating). I have noted that this seems to mitigate the problem somewhat, although not completely cure it (more observation needed).

I have therefore looked into automating this process with a couple of simple batch files. I tried Task Scheduler in Windows, triggering on sleep/hibernate from the event [Log name: System, Source: Kernel-Power, Event ID: 42], but this fires too late in the sleep process to be of help (it doesn't trigger until resume). I have therefore opted instead for "Power Triggers" at http://win7suspendresume.codeplex.com/. I have two batch files in use:

1) BOINCsleep.bat
============
"C:\Program Files\BOINC\boinccmd.exe" --set_run_mode never
cmd /V:ON /C "echo sleep OK at !date! !time!">>D:\Users\Public\BOINCsuspend\BAToutput.txt

2) BOINCresume.bat
=============
"C:\Program Files\BOINC\boinccmd.exe" --set_run_mode auto
cmd /V:ON /C "echo resume OK at !date! !time!">>D:\Users\Public\BOINCsuspend\BAToutput.txt

The first line in each file is the only one needed. The second line just appends the date and time to a log file (D:\...BAToutput.txt). You should change that path and filename accordingly, as well as the path to boinccmd.exe in your BOINC folder.
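
For anyone who would rather keep one script than two batch files, the same two boinccmd calls can be wrapped in Python. This is only a sketch: the boinccmd.exe path and the `--set_run_mode never` / `--set_run_mode auto` options are the ones from the batch files above, while the function names are made up for illustration.

```python
import subprocess

# Path taken from the batch files above; adjust to your own install.
BOINCCMD = r"C:\Program Files\BOINC\boinccmd.exe"


def run_mode_command(mode, boinccmd=BOINCCMD):
    """Build the boinccmd argument list for the given run mode."""
    if mode not in ("always", "auto", "never"):
        raise ValueError(f"unsupported run mode: {mode}")
    return [boinccmd, "--set_run_mode", mode]


def set_run_mode(mode):
    """Run boinccmd and raise CalledProcessError if it exits non-zero."""
    subprocess.run(run_mode_command(mode), check=True)
```

Calling `set_run_mode("never")` before sleep and `set_run_mode("auto")` after resume mirrors BOINCsleep.bat and BOINCresume.bat. Splitting out `run_mode_command` makes it possible to check the command line without actually touching a running BOINC client.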

In setting up "Power Triggers", use the "Start Process" tab, "Other" button, specify the path to the relevant batch file. In the "execution sequence" box to the right, I used Action: "Wait for exit"; Applies to: "Service account". You can use the "Test" button here to test running your batch file for suspend and for resume.

My "resuming computation" is now occurring about 3 minutes later than it would have before (i.e. compared to "resuming network activity") and after "Windows is resuming operations":

29-Jul-2013 03:47:08 [---] Windows is suspending operations
29-Jul-2013 03:47:08 [---] Suspending network activity - requested by operating system
29-Jul-2013 03:47:21 [---] Resuming after OS suspension
29-Jul-2013 14:41:44 [---] Resuming network activity
29-Jul-2013 14:42:23 [---] Windows is resuming operations
29-Jul-2013 14:44:09 [World Community Grid] Task E214657_724_C.35.C30H18S3Si2.00815051.1.set1d06_2 exited with zero status but no 'finished' file
29-Jul-2013 14:44:09 [World Community Grid] If this happens repeatedly you may need to reset the project.
29-Jul-2013 14:44:43 [---] Resuming computation

Notice that I also had a heartbeat problem today again after using Power Triggers even though the CEP2 was *manually* suspended (suspend button in the tasks tab), but as I said earlier, suspending things (either this way or through menu Activity->Suspend) doesn't seem to completely solve the problem.

I'll make an update when I have more info on how much this is all helping.
[Jul 29, 2013 6:08:10 PM]