World Community Grid Forums
Category: Completed Research
Forum: The Clean Energy Project - Phase 2 Forum
Thread: High error rate? Heartbeat.
Thread Status: Active Total posts in this thread: 5
Thanassos
Cruncher Joined: Jun 21, 2013 Post Count: 24 Status: Offline Project Badges: |
Since these units started back up in the past few days, just about all of them on my cruncher end in Error after a few hours with Heartbeat problems. Can anyone please explain what exactly that is?
Any idea? Useful work done per day has taken a sharp nose dive! The rig is a 4x AMD Opteron 6174 with 64 GB RAM / 256 GB SSD and is on a 150 Mbit/40 Mbit connection. This one was after 8 hours.
No heartbeat: Exiting
[22:59:53] Number of jobs = 8
[22:59:53] Starting job 0,CPU time has been restored to 0.000000.
23:01:17 (6620): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[23:06:46] Number of jobs = 8
[23:06:46] Starting job 0,CPU time has been restored to 0.000000.
23:08:43 (3900): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[23:16:11] Number of jobs = 8
[23:16:11] Starting job 0,CPU time has been restored to 0.000000.
23:16:47 (5372): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
If you've been away from CEP2 and are coming back to it, don't let the tasks all start simultaneously. Worse, if the machine has to be rebooted while multiple CEP2 tasks were running, start them one by one by hand on resume. The trouble is the heavy storage I/O, which bottlenecks when these tasks start, when several resume at once, or when several checkpoint at the same time. Even one task finishing and uploading (lots of zipping) while another is starting could lead to heartbeat issues. Things have changed recently: the really heavy part is now job #0, right at the beginning. If checkpointing for several CEP2 tasks is simultaneous, things can go cardio palpitative.
A proposal is out to automate the stepped starting, but there's no 'sponsor' with the push to get it elevated attention from the developers. Until then, expect the project default to remain at just one task at a time being allowed onto a host, for this heartbeat issue and other reasons.
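To make the failure mode concrete, here is a rough sketch of a heartbeat check of the kind the log message suggests. It is not the actual BOINC code: the class name `HeartbeatWatchdog` and its methods are made up for illustration, and only the 30 second limit comes from the log line "No heartbeat from core client for 30 sec". The idea is simply that if the task does not see a sign of life from the client within the timeout (for instance because the machine is stalled on disk I/O), it gives up and exits.

```python
import time

# Taken from the log line "No heartbeat from core client for 30 sec".
HEARTBEAT_TIMEOUT_S = 30.0


class HeartbeatWatchdog:
    """Hypothetical task-side heartbeat check (illustration only)."""

    def __init__(self, timeout_s=HEARTBEAT_TIMEOUT_S, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_beat = clock()    # treat start-up as the first heartbeat

    def beat(self):
        """Record that a heartbeat from the client was just seen."""
        self._last_beat = self._clock()

    def expired(self):
        """Return True if no heartbeat was seen within the timeout."""
        return self._clock() - self._last_beat > self._timeout_s
```

In this picture, a task would call `beat()` whenever it hears from the client and check `expired()` periodically; anything that delays either side for longer than the timeout ends the task.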
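For anyone who wants to script the stepped start themselves in the meantime, a minimal sketch might look like the following. Everything here is an assumption for illustration: `start_staggered` is a made-up helper, `start_task` is whatever callback you use to resume a single task, and the gap between starts is something you would have to tune for your own disks.

```python
import time


def start_staggered(task_names, start_task, gap_seconds, sleep=time.sleep):
    """Start tasks one at a time, waiting gap_seconds between starts.

    start_task is a hypothetical callback that resumes one task by name.
    sleep is injectable so the function can be tried without real waiting.
    """
    if gap_seconds < 0:
        raise ValueError("gap_seconds must not be negative")
    for index, name in enumerate(task_names):
        if index > 0:
            sleep(gap_seconds)  # let the previous task get past its heavy start
        start_task(name)
```

The point of the gap is to keep the I/O-heavy start of one task from overlapping with the start of the next.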
||
|
Thanassos
Cruncher Joined: Jun 21, 2013 Post Count: 24 Status: Offline Project Badges: |
Oh okay, many thanks for the information.
I can't sit there and manually control 48 threads of work though ha. I can add more I/O to the system though, might give that a shot. Thanks again.
||
|
Yarensc
Advanced Cruncher USA Joined: Sep 24, 2011 Post Count: 134 Status: Offline Project Badges: |
Thanassos wrote: "I can't sit there and manually control 48 threads of work though ha. I can add more I/O to the system though, might give that a shot."
If the added hardware doesn't help, you could try limiting the number of CEP tasks that can run at one time with an app_config.xml file. For instructions on how to set it up, see SekeRob's post here: http://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,34705
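As a rough idea of what such a file contains, here is a small sketch that generates one. The `<app_config>`, `<app>`, `<name>` and `<max_concurrent>` elements are the standard BOINC ones; the app name `cep2` and the limit of 4 are assumptions, so check the name your client reports and pick a limit that suits your disks. `build_app_config` is just a helper name made up for this example, and SekeRob's linked post remains the reference for the actual setup steps.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

APP_CONFIG_TEMPLATE = """<app_config>
  <app>
    <name>{name}</name>
    <max_concurrent>{limit}</max_concurrent>
  </app>
</app_config>
"""


def build_app_config(app_name, max_concurrent):
    """Return app_config.xml text limiting how many tasks of one app run at once."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    text = APP_CONFIG_TEMPLATE.format(name=escape(app_name), limit=int(max_concurrent))
    ET.fromstring(text)  # sanity check: raises if the XML is not well-formed
    return text


if __name__ == "__main__":
    # "cep2" is an assumed app name; 4 is an arbitrary example limit.
    print(build_app_config("cep2", 4))
```

The printed text would be saved as app_config.xml in the project's folder under the BOINC data directory, after which the client needs to re-read its config files or be restarted.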
||
|
Thanassos
Cruncher Joined: Jun 21, 2013 Post Count: 24 Status: Offline Project Badges: |
All good, thanks for the suggestion, will try it if there are problems.
Installed 2 Samsung EVO Pro HDDs in RAID 0. Started up 16 CEP tasks and the disk activity was hovering around 500 MB/second for a few minutes. Didn't think they thrashed the drives THAT much! Either way, no errors for 24 hours now. Cheers all!
||
|
|