Thanassos
Cruncher
Joined: Jun 21, 2013
Post Count: 24
Status: Offline
High error rate? Heartbeat.

Since these units started back up in the past few days, just about all of the ones my cruncher gets end in Error after a few hours with heartbeat problems. Can anyone please explain what exactly that is?

Any idea? Useful work done per day has taken a sharp nose dive!

The rig is a 4x AMD Opteron 6174 with 64 GB RAM and a 256 GB SSD, on a 150 Mbit/40 Mbit connection.

This one was after 8 hours.

No heartbeat: Exiting
[22:59:53] Number of jobs = 8
[22:59:53] Starting job 0,CPU time has been restored to 0.000000.
23:01:17 (6620): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[23:06:46] Number of jobs = 8
[23:06:46] Starting job 0,CPU time has been restored to 0.000000.
23:08:43 (3900): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[23:16:11] Number of jobs = 8
[23:16:11] Starting job 0,CPU time has been restored to 0.000000.
23:16:47 (5372): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[Sep 4, 2014 1:39:33 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: High error rate? Heartbeat.

If you've been away from CEP2 and come back to it, do not let the tasks start simultaneously. Worse, if the device has to be rebooted while multiple CEP2 tasks were running, start them one by one by hand on resume. The trouble is the heavy storage I/O bottleneck when these tasks start, when several resume, or when several checkpoint at the same time. Even one task finishing and uploading (lots of zipping) while another is starting could lead to heartbeat issues. Things have changed recently: the really heavy job is #0, right at the beginning. If several CEP2 tasks checkpoint simultaneously, things can go cardio palpitative.
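
As for what the heartbeat is: going by the log line "No heartbeat from core client for 30 sec - exiting", the task expects a regular signal from the BOINC core client and quits if none arrives within 30 seconds. The following is only a rough Python sketch of that idea, not BOINC's actual code. The names HeartbeatWatchdog and simulate are made up for illustration, and the only figure taken from the log is the 30-second timeout.

```python
import time


class HeartbeatWatchdog:
    """Remembers when the last heartbeat arrived and reports if it is overdue."""

    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        self.timeout_s = timeout_s      # 30 s matches the log message
        self.clock = clock              # injectable so the sketch can be tested
        self.last_beat = clock()

    def beat(self):
        """Record a heartbeat from the (hypothetical) client side."""
        self.last_beat = self.clock()

    def expired(self):
        """True if more than timeout_s has passed since the last heartbeat."""
        return self.clock() - self.last_beat > self.timeout_s


def simulate(beat_times, check_times, timeout_s=30.0):
    """Replay heartbeats and checks on a fake clock.

    Returns the first check time at which the watchdog would give up,
    or None if every check still sees a recent heartbeat.
    """
    now = [0.0]
    dog = HeartbeatWatchdog(timeout_s, clock=lambda: now[0])
    events = sorted([(t, "beat") for t in beat_times] +
                    [(t, "check") for t in check_times])
    for t, kind in events:
        now[0] = t
        if kind == "beat":
            dog.beat()
        elif dog.expired():
            return t
    return None
```

In this toy model, heartbeats that keep arriving every 10 seconds never trip the watchdog, while a gap of more than 30 seconds (for example, everything stalled behind disk I/O) makes the next check give up.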

A proposal is out to automate the stepped starting, but there's no 'sponsor' with enough push to get it elevated attention from the developers. Until then, expect the project default to remain at just one task at a time being allowed onto a host, for this heartbeat issue and other reasons.
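
Until something like that exists, the one-by-one resume could in principle be scripted. Here is a rough Python sketch, assuming the tasks were suspended first and that the boinccmd command-line tool is on the PATH; it relies on boinccmd's `--task URL task_name resume` operation. The function names, the 120-second gap, and any task names passed in are assumptions for illustration, not project recommendations.

```python
import subprocess
import time


def resume_commands(project_url, task_names):
    """Build one `boinccmd --task URL NAME resume` argument list per task."""
    return [["boinccmd", "--task", project_url, name, "resume"]
            for name in task_names]


def resume_staggered(project_url, task_names, gap_s=120,
                     run=subprocess.run, sleep=time.sleep):
    """Resume tasks one at a time, waiting gap_s seconds between them.

    gap_s is a guess; it should be long enough for each task to get past
    its heavy start-up I/O. `run` and `sleep` are injectable for dry runs.
    """
    commands = resume_commands(project_url, task_names)
    for index, argv in enumerate(commands):
        if index:               # no wait before the first task
            sleep(gap_s)
        run(argv, check=True)   # raises if boinccmd reports a failure
    return len(commands)
```

For a dry run, pass `run=print` with a small `gap_s` to see the commands without touching the client. The project URL and task names have to match what the client itself reports, for example via `boinccmd --get_tasks`.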
[Sep 4, 2014 2:01:52 PM]
Thanassos
Cruncher
Joined: Jun 21, 2013
Post Count: 24
Status: Offline
Re: High error rate? Heartbeat.

Oh okay, many thanks for the information.

I can't sit there and manually control 48 threads of work though ha.

I can add more I/O to the system though, might give that a shot.

Thanks again.
[Sep 4, 2014 10:43:07 PM]
Yarensc
Advanced Cruncher
USA
Joined: Sep 24, 2011
Post Count: 134
Status: Offline
Re: High error rate? Heartbeat.

Thanassos wrote:
    I can't sit there and manually control 48 threads of work though ha.
    I can add more I/O to the system though, might give that a shot.

If the added hardware doesn't help, you could try limiting the number of CEP tasks that can run at one time with an app_config.xml file.

For instructions on how to set it up, see SekeRob's post here:
http://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,34705
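
As a rough illustration of what such a file contains, here is a small Python sketch that generates one. It assumes the application's short name is cep2 and picks a limit of 4 purely as an example; check the actual app name in your client and follow the linked post for the authoritative steps. The helper name build_app_config is made up for this sketch.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, max_concurrent):
    """Return app_config.xml text limiting how many tasks of one app run at once."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name          # assumed short name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    if hasattr(ET, "indent"):                           # pretty-print on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Example values only: "cep2" and 4 are assumptions, not project advice.
    print(build_app_config("cep2", 4))
```

The resulting text would be saved as app_config.xml in the World Community Grid project folder inside the BOINC data directory, after which the client needs to re-read its config files or be restarted.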
[Sep 5, 2014 3:48:06 PM]
Thanassos
Cruncher
Joined: Jun 21, 2013
Post Count: 24
Status: Offline
Re: High error rate? Heartbeat.

All good, thanks for the suggestion, will try it if there are problems.

Installed 2 Samsung EVO Pro SSDs in RAID-0. Started up 16 CEP tasks and the disk activity was hovering around 500 MB/s for a few minutes. Didn't think they thrashed the drives THAT much!

Either way no errors for 24 hours now. Cheers all!
[Sep 6, 2014 4:23:41 AM]