Redbird10
Cruncher
Joined: Aug 11, 2011
Post Count: 3
CEP2 Errors?

<core_client_version>7.0.64</core_client_version>
<![CDATA[
<message>
All pipe instances are busy.
(0xe7) - exit code 231 (0xe7)
</message>
<stderr_txt>
INFO: No state to restore. Start from the beginning.
[18:51:12] Number of jobs = 16
[18:51:12] Starting job 0,CPU time has been restored to 0.000000.
[18:53:28] Finished Job #0
[18:53:28] Starting job 1,CPU time has been restored to 131.118841.
[19:02:24] Finished Job #1
[19:02:24] Starting job 2,CPU time has been restored to 633.488861.
[22:58:59] Finished Job #2
[22:58:59] Starting job 3,CPU time has been restored to 14025.971910.
Application exited with RC = 0x40010004
[00:05:59] Finished Job #3
[00:05:59] Starting job 4,CPU time has been restored to 14203.485447.
[00:05:59] Skipping Job #4
[00:05:59] Starting job 5,CPU time has been restored to 14203.485447.
[00:05:59] Skipping Job #5
[00:05:59] Starting job 6,CPU time has been restored to 14203.485447.
[00:05:59] Skipping Job #6
[00:05:59] Starting job 7,CPU time has been restored to 14203.485447.
[00:05:59] Skipping Job #7
[00:05:59] Starting job 8,CPU time has been restored to 14203.485447.
[00:05:59] Skipping Job #8
[00:05:59] Starting job 9,CPU time has been restored to 14203.485447.
[00:05:59] Skipping Job #9
[00:05:59] Starting job 10,CPU time has been restored to 14203.485447.
[00:05:59] Skipping Job #10
[00:05:59] Starting job 11,CPU time has been restored to 14203.485447.
[00:05:59] Skipping Job #11
[00:05:59] Starting job 12,CPU time has been restored to 14203.485447.
[00:05:59] Skipping Job #12
[00:05:59] Starting job 13,CPU time has been restored to 14203.485447.
[00:05:59] Skipping Job #13
[00:05:59] Starting job 14,CPU time has been restored to 14203.485447.
[00:05:59] Skipping Job #14
[00:05:59] Starting job 15,CPU time has been restored to 14203.485447.
[00:05:59] Skipping Job #15
00:06:03 (7884): called boinc_finish
Error reading job description file C.34.C27H15N3S3Si.00527311.2.jobs

</stderr_txt>
]]>


It seems to occur when I pause the work unit (at about 35% progress).
What causes it?
Thanks
[May 16, 2013 12:30:13 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: CEP2 Errors?

"How does it come about?" I don't know, partly because exit code 231 has so far only been reported with HCC-GPU tasks. A previous thread, http://www.worldcommunitygrid.org/forums/wcg/viewthread_thread,34169, suggests the cause is external to BOINC.
[May 16, 2013 12:56:16 PM]
Mindworks.hu
Cruncher
Joined: Nov 19, 2010
Post Count: 8
Re: CEP2 Errors?

I have also been getting many signal 11 errors since I joined. Before that I hardly ever got work units with any errors. What can I try to reduce the number of failures?
[May 25, 2013 9:43:00 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: CEP2 Errors?

Hello Mindworks.hu,
I doubt that there is anything for you to do. The algorithm works for an adequate sample of the molecular space. The project scientists are happy. Leave it to them to decide whether or not these problems can be ignored.

Lawrence

Added: knred just posted about some interesting changes coming for CEP2: https://secure.worldcommunitygrid.org/forums/...ead,35152_offset,0#422494
[May 26, 2013 12:50:13 AM]
Mindworks.hu
Cruncher
Joined: Nov 19, 2010
Post Count: 8
Re: CEP2 Errors?

The problem is that I've wasted 12 of 38 days of work because of these failures. I've been running the project on Linux x86-64 and Windows XP 32-bit with the latest BOINC, so I think the problem is specific to CEP2.
[May 26, 2013 9:37:02 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: CEP2 Errors?

Signal 11 is a classic on Linux, not just for CEP2, but it occurs more easily with this heavy science application. BOINC needs an unimpaired connection between the process controller [the BOINC core client, aka the daemon] and the science applications at least once every 30 seconds. If that fails, it will often raise a heartbeat condition.

I've set BOINC on Ubuntu to pause [with "Leave application in memory when suspended" enabled] when non-BOINC CPU load is greater than 35%. I've also added a couple of exclusive apps to cc_config.xml, such as apt-get and synaptic. When these run, BOINC is paused, regardless of whether "run always" or "run based on preferences" is selected. I'd rather have the crunching hold for a few minutes than have a task crash out, or incur a zero status and drop back to the last checkpoint, which particularly with CEP2 can mean hours of lost computing time.
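
For illustration only, the exclusive-app entries described above might look roughly like the sketch below. It assumes the process names really are apt-get and synaptic as they show up in the process list on your machine (check before relying on it), and that the file is saved as cc_config.xml in the BOINC data directory. The client has to re-read its config files or be restarted before the change takes effect.

```xml
<!-- cc_config.xml: sketch, executable names are assumptions for this example -->
<cc_config>
  <options>
    <!-- BOINC suspends computing while any of these executables is running -->
    <exclusive_app>apt-get</exclusive_app>
    <exclusive_app>synaptic</exclusive_app>
  </options>
</cc_config>
```

If a cc_config.xml already exists, add the exclusive_app lines inside its existing options section rather than creating a second one.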

If CEP2 crashes out with signal 11 while the system is idle, then review the overall system load, including reducing to just one CEP2 task at a time. There are huge amounts of I/O to the storage systems, which Linux is actually poor at managing, and this starves the heartbeat exchange between the client and the science apps. When you have 4, 6, or 8 running at the same time, that is asking for trouble on an average Linux machine. Recommended: run CEP2 on no more than 50% of all threads, e.g. on a quad core no more than 2 at the same time. This is easily managed by using the 7.0.40+ client and setting <max_concurrent> in a user-created app_config.xml file [see other threads for discussion].
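
A minimal sketch of such an app_config.xml for the quad-core case, limiting CEP2 to 2 tasks at a time. The short application name "cep2" is an assumption here; look up the actual name in client_state.xml on your own machine. The file goes in the World Community Grid project folder under the BOINC data directory, and the client needs to re-read its config files or be restarted afterwards.

```xml
<!-- app_config.xml: sketch, the app name "cep2" is an assumption -->
<app_config>
  <app>
    <name>cep2</name>
    <!-- run at most 2 CEP2 tasks at once, i.e. 50% of a quad core -->
    <max_concurrent>2</max_concurrent>
  </app>
</app_config>
```

Other projects fill the remaining threads as usual; only the number of concurrent CEP2 tasks is capped.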

Edit: P.S. The heartbeat function is a long-known core client weakness. For years there has been talk of replacing this system, but that depends on Berkeley. If it is ever replaced [so that the science processes continue even if the core client is gone for a little while, or else quit gracefully], it will require a global transition in how the apps are compiled with the wrapper [hundreds of projects around the world]. Too many different clients, old and new, are in use, and upgrading clients is a major undertaking, if it is even possible to reach them all. The science wrapper would have to cater for legacy systems, so that old and new ways are handled properly and no zombie processes are left running.
[May 26, 2013 10:06:03 AM]
Mindworks.hu
Cruncher
Joined: Nov 19, 2010
Post Count: 8
Re: CEP2 Errors?

SekeRob, thanks for the reply! Yes, I've been running CEP2 on all cores. I don't think it's a Linux-specific or 64-bit-specific issue, because this error is happening on WinXP too.

Unfortunately, I can't predict system slowdowns, as I'm actively using both nodes and any of my apps can occasionally do heavy I/O. Oddly, I really haven't experienced this with other WCG projects.
[May 26, 2013 4:02:11 PM]