Thread Status: Active
Total posts in this thread: 9
This topic has been viewed 711 times and has 8 replies
gdlxn
Cruncher
Joined: Nov 16, 2004
Post Count: 24
Status: Offline
exited with zero status but no 'finished' file

I recently started getting quite a few "exited with zero status but no 'finished' file" errors from "The Clean Energy Project - Phase 2" work units:
1/26/2013 1:51:36 PM World Community Grid Task E211466_489_C.31.C27H14OS2Se.01679803.4.set1d06_0 exited with zero status but no 'finished' file
I'm running World Community Grid - BOINC 6.10.58 on Windows 7 64-bit. Any ideas what might be causing this and how to resolve the problem?
----------------------------------------
[Edit 1 times, last edit by gdlxn at Jan 26, 2013 8:01:51 PM]
[Jan 26, 2013 8:01:27 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: exited with zero status but no 'finished' file

Hello gdlxn,
This is not a problem unless your Results Status page starts showing errors or invalids. If you want to research this problem, start by searching for "finished" in Start Here.

Lawrence
[Jan 26, 2013 8:50:13 PM]
gdlxn
Cruncher
Joined: Nov 16, 2004
Post Count: 24
Status: Offline
Re: exited with zero status but no 'finished' file

The problem is that the work units that report "exited with zero status but no 'finished' file" appear to restart continually and never really finish. For example, I found today that of the eight running processes on one system, all had expired deadlines. I aborted these work units. I'm also resetting the project on this system to see if that helps.

I also searched for finished in Start Here, but that resulted in 0 hits.
[Feb 1, 2013 8:19:52 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: exited with zero status but no 'finished' file

Hi gdlxn,
Here is the FAQ: https://secure.worldcommunitygrid.org/forums/...ead,16784_offset,0#133103

What you need is to keep the work unit running continuously until it finishes. The first thing to do is to work on the profile. Select My Grid - Device Manager - (Selected Profile). If it is the default profile, changing it could be confusing. I use 'home' as my profile. First I checked 'Maximum Output', then I selected 'Custom Profile' so that I could change those settings, and I checked 'Set as Default'. I set everything to run full blast: no GPU, only 1 CEP project at a time, and no HCC, since the GPU computers are much more efficient than my CPU. Under the Memory Usage section I have 'Leave applications in memory while suspended? YES'. Hit SAVE at the bottom. There are various ways to make BOINC load your profile, but a reboot always works.

Once you are running on your new profile, check Task Manager to make sure that your BOINC projects are using nearly 100% of your CPU without anything interfering. This should do the trick.

Lawrence
----------------------------------------
[Edit 1 times, last edit by Former Member at Feb 1, 2013 9:43:35 PM]
[Feb 1, 2013 9:40:29 PM]
gdlxn
Cruncher
Joined: Nov 16, 2004
Post Count: 24
Status: Offline
CEP2 work units continually restarting (was exited with zero status but no 'finished' file)

Lawrence,

Thanks for the suggestions.

My work units are using "100%" of the system. I have "Number of workunits per host for The Clean Energy Project - Phase 2?" set to Unlimited. Why do you suggest setting it to "1 - Default"?

I've found the problem with my work units continually restarting isn't due to the "exited with zero status but no 'finished' file", as the running work units have recently restarted and there are no new "exited with zero status but no 'finished' file" messages.

I checked the stderr.txt file for one of the restarted work units and found:
INFO: No state to restore. Start from the beginning.
[15:18:45] Number of jobs = 16
[15:18:45] Starting job 0,CPU time has been restored to 0.000000.
[15:27:41] Finished Job #0
[15:27:41] Starting job 1,CPU time has been restored to 438.097608.
[15:56:42] Finished Job #1
[15:56:42] Starting job 2,CPU time has been restored to 2037.123458.
Quit requested: Exiting
[20:51:56] Number of jobs = 16
[20:51:56] Starting job 2,CPU time has been restored to 2037.123458.
Quit requested: Exiting
[20:52:56] Number of jobs = 16
[20:52:56] Starting job 2,CPU time has been restored to 2037.123458.

The work unit seems to have restarted at 20:51:56 and 20:52:56. Looking at my messages I find:
2/1/2013 8:51:45 PM Suspending computation - CPU usage is too high
2/1/2013 8:51:55 PM Resuming computation
2/1/2013 8:52:45 PM Suspending computation - CPU usage is too high
2/1/2013 8:52:55 PM Resuming computation

So it seems that the restart is happening when the work unit is suspended and then resumed. I've changed Leave applications in memory while suspended? from No to Yes to see if that helps.
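
To check the suspend/resume theory against more than a couple of lines, the two logs can be matched up by time of day. This is only a rough sketch in Python: the sample text is copied from the excerpts above, the function names and the 5-second matching window are my own choices, and it ignores dates because stderr.txt only records the time.

```python
from datetime import datetime
import re

# Sample lines copied from the BOINC messages quoted above.
MESSAGES = """\
2/1/2013 8:51:45 PM Suspending computation - CPU usage is too high
2/1/2013 8:51:55 PM Resuming computation
2/1/2013 8:52:45 PM Suspending computation - CPU usage is too high
2/1/2013 8:52:55 PM Resuming computation
"""

# Sample lines copied from the stderr.txt excerpt quoted above.
STDERR = """\
[15:56:42] Starting job 2,CPU time has been restored to 2037.123458.
Quit requested: Exiting
[20:51:56] Number of jobs = 16
[20:51:56] Starting job 2,CPU time has been restored to 2037.123458.
Quit requested: Exiting
[20:52:56] Number of jobs = 16
[20:52:56] Starting job 2,CPU time has been restored to 2037.123458.
"""


def resume_times(messages):
    """Return the times of day at which the client resumed computation."""
    times = []
    for line in messages.splitlines():
        if "Resuming computation" in line:
            stamp = " ".join(line.split()[:3])  # e.g. "2/1/2013 8:51:55 PM"
            times.append(datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p").time())
    return times


def restart_times(stderr):
    """Return the times of day at which the application logged a (re)start."""
    pattern = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] Number of jobs = \d+")
    matches = (pattern.match(line) for line in stderr.splitlines())
    return [datetime.strptime(m.group(1), "%H:%M:%S").time() for m in matches if m]


def seconds(t):
    """Convert a time of day to seconds since midnight."""
    return t.hour * 3600 + t.minute * 60 + t.second


def restarts_after_resume(messages, stderr, window=5):
    """Pair each resume with any restart logged within `window` seconds after it.

    The window is an assumed tolerance, not a BOINC setting.
    """
    pairs = []
    for resumed in resume_times(messages):
        for restarted in restart_times(stderr):
            if 0 <= seconds(restarted) - seconds(resumed) <= window:
                pairs.append((resumed, restarted))
    return pairs


for resumed, restarted in restarts_after_resume(MESSAGES, STDERR):
    print(f"resumed {resumed} -> restarted {restarted}")
```

With the sample above, both resumes pair with a restart one second later, which fits the idea that the application is being unloaded on suspend and started again on resume.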

Geoff
[Feb 2, 2013 4:41:33 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: CEP2 work units continually restarting (was exited with zero status but no 'finished' file)

Hi gdlxn,
Glad to hear that you are making progress in tracking down the problem. CPU Usage is in your profile. You get to choose a limit past which BOINC suspends computation. Naturally, my limit is 100.0%. Why would I want my CPU to ever go idle? CEP2 is a very demanding program that can fill up a long queue of I/O operations for the hard disk. Out of caution, I only run one CEP2 work unit at a time, even though any fast computer should be able to run 2 at a time and some people boast of running 4 or even 8 on their speed demons. But that control was added to the profile after we first started Clean Energy and quickly ran into trouble on normal computers.

The 'Leave applications in memory' setting is vital, since anything that stops the project, such as a CPU limit in the profile, will otherwise force CEP2 to start over at a checkpoint and lose as many as 3 hours of computing: there are only 15 checkpoints, some of them hours apart.

Lawrence
[Feb 2, 2013 6:58:44 AM]
gdlxn
Cruncher
Joined: Nov 16, 2004
Post Count: 24
Status: Offline
Re: CEP2 work units continually restarting (was exited with zero status but no 'finished' file)

I'm continuing to see the "exited with zero status but no 'finished' file / No heartbeat from core client for 30 sec - exiting" error on all of my systems, which is causing significant amounts of work to be lost:
INFO: No state to restore. Start from the beginning.
[15:59:41] Number of jobs = 16
[15:59:41] Starting job 0,CPU time has been restored to 0.000000.
[16:03:29] Finished Job #0
[16:03:29] Starting job 1,CPU time has been restored to 217.828125.
20:40:05 (3608): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[03:54:21] Number of jobs = 16
[03:54:21] Starting job 1,CPU time has been restored to 217.828125.
[04:05:24] Finished Job #1
[04:05:24] Starting job 2,CPU time has been restored to 857.343750.
06:37:58 (6064): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[18:13:08] Number of jobs = 16
[18:13:08] Starting job 2,CPU time has been restored to 857.343750.
20:09:17 (5216): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[20:09:50] Number of jobs = 16
[20:09:50] Starting job 2,CPU time has been restored to 857.343750.
01:05:53 (2516): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[01:08:46] Number of jobs = 16
[01:08:46] Starting job 2,CPU time has been restored to 857.343750.
06:04:48 (984): No heartbeat from core client for 30 sec - exiting
No heartbeat: Exiting
[06:05:20] Number of jobs = 16
[06:05:20] Starting job 2,CPU time has been restored to 857.343750.
As I've only seen this problem with the CEP2 project, and since it happens on all of my systems, I'm beginning to suspect that it's a problem with the CEP2 project rather than a problem in my environment.
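
To put a number on how often each job is being redone, the "Starting job N" lines in stderr.txt can be counted. The following is a small Python sketch, not anything that ships with BOINC: the function names are made up for illustration, and the sample is a shortened excerpt of the log above.

```python
import re
from collections import Counter

# Shortened excerpt of the stderr.txt output quoted above.
SAMPLE = """\
[15:59:41] Starting job 0,CPU time has been restored to 0.000000.
[16:03:29] Starting job 1,CPU time has been restored to 217.828125.
No heartbeat: Exiting
[03:54:21] Starting job 1,CPU time has been restored to 217.828125.
[04:05:24] Starting job 2,CPU time has been restored to 857.343750.
No heartbeat: Exiting
[18:13:08] Starting job 2,CPU time has been restored to 857.343750.
No heartbeat: Exiting
[20:09:50] Starting job 2,CPU time has been restored to 857.343750.
"""


def job_start_counts(stderr):
    """Count how many times each job index appears in a 'Starting job N' line."""
    pattern = re.compile(r"Starting job (\d+),")
    return Counter(int(m.group(1)) for m in pattern.finditer(stderr))


def repeated_starts(stderr):
    """Return {job index: number of times that job was started again}."""
    return {job: n - 1 for job, n in job_start_counts(stderr).items() if n > 1}


print(repeated_starts(SAMPLE))
```

Any job that shows up here was started again from the same checkpoint at least once, so the time spent on it before each exit was lost.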
[Feb 4, 2013 2:28:47 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: CEP2 work units continually restarting (was exited with zero status but no 'finished' file)

Summarizing previous advice in this thread and expanding:

1) Revert to running 1 CEP2 task at a time. With clients prior to 7.0.39, that can only be done through the web device profile. With 7.0.39 and up, it can be done by setting <max_concurrent> in a user-creatable app_config.xml project file.
2) Set "Leave applications in memory while suspended" to On / tick mark.
3) In Local/Computing Preferences, set the option "While processor usage is less than" to 25% (experiment with the percentage, up or down, to see when the heartbeat/zero status issues stop).
4) In BOINC Manager > Activity menu, set the CPU to "Run based on preferences".
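
For point 1 on a 7.0.39+ client, the app_config.xml file can be written by hand or generated. Below is a minimal Python sketch that builds the file contents. The short application name "cep2" is an assumption here (the name the client actually uses is listed in client_state.xml), and the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, max_concurrent):
    """Build the text of an app_config.xml that caps concurrent tasks for one app."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


# "cep2" is an assumed application name; verify it before using the file.
print(build_app_config("cep2", 1))
```

The output would be saved as app_config.xml in the World Community Grid project folder inside the BOINC data directory, after which the client needs to re-read its config files or be restarted.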

These settings work for me, on all my devices.

CEP2 is *not* recommended to run concurrently on more than half of the processors, meaning that if you have a quad-core, you should not run more than 2 simultaneously. Individual devices can be tuned up or down based on their typical use. A device that you work on should run fewer tasks, so that you do not notice them or run into problems, than a device that is used for nothing but crunching or a little file serving.
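
As a starting point for that tuning, the half-of-the-processors rule can be written down as a one-line calculation. This Python sketch is only an illustration of the rule of thumb: the function name is hypothetical, and the floor of 1 task is my own assumption for single-processor machines.

```python
import os


def max_cep2_tasks(logical_cpus):
    """Rule of thumb: at most half of the processors, but never fewer than 1 task."""
    return max(1, logical_cpus // 2)


# os.cpu_count() can return None, so fall back to 1 in that case.
cpus = os.cpu_count() or 1
print(f"{cpus} processors -> start with at most {max_cep2_tasks(cpus)} CEP2 task(s)")
```

Treat the result as an upper bound and lower it on machines that are in daily use.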

Since I can run 8 concurrently on my octo-hyperthreaded machine with 98% efficiency or better, but only when it is left completely alone (nothing else running, no user input), there is little doubt that the cause is the individual device's setup, utilization, and capability [for instance a slow CPU, a shortage of RAM, or too much memory or disk swapping]. Because we know CEP2 is a tough cookie to run, the project is explicitly opt-in. Since we do not want any member to be impaired by this or any other WCG project, or to have it frequently produce invalid/error results or results that restart many times and thus run very inefficiently, the advice in that case is indeed to opt out of this science.

Let us know if points 1-4 do not resolve the matter, even when running 1 at a time.
[Feb 4, 2013 2:50:52 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: CEP2 work units continually restarting (was exited with zero status but no 'finished' file)

Hi gdlxn,
If you can run other projects without any problems, then the easy way to proceed is to drop CEP2 from your Projects. The idea is to be happy about the research you are contributing without a lot of worry or bother.

Lawrence
[Feb 4, 2013 2:51:34 PM]