World Community Grid - View Thread - exited with zero status

World Community Grid Forums

Category: Completed Research

Forum: The Clean Energy Project - Phase 2 Forum

Thread: exited with zero status


Thread Status: Active
Total posts in this thread: 26


This topic has been viewed 1831 times and has 25 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: exited with zero status

Hello tfmagnetism,

11-Mar-2012 20:31:23 [---] don't compute while active

Is this a laptop? Do you have Leave Application In Memory selected? Is the CPU set to run BOINC 100% or does it turn BOINC projects off every few seconds?

Just trying to figure out what might be causing strange behavior.
Lawrence

[Mar 11, 2012 10:40:06 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: exited with zero status

Suggest you insert this line into the <options> section of the cc_config.xml

<start_delay>120</start_delay>

This will ensure that your operating system and any other auto-start bits are loaded and running before BOINC begins to compute. Here's the cc_config.xml manual: http://boinc.berkeley.edu/wiki/Cc_config.xml
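For anyone unsure where that line sits in the file, here is a minimal sketch in Python that builds a cc_config.xml containing only the start_delay option. The helper name build_cc_config is made up for illustration; the element names come from the line quoted above and the manual linked there. If you already have a cc_config.xml, add the line to its existing <options> section rather than overwriting the file.

```python
import xml.etree.ElementTree as ET


def build_cc_config(start_delay_seconds: int) -> str:
    """Return a minimal cc_config.xml document as a string.

    Hypothetical helper for illustration; it only sets <start_delay>.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "start_delay").text = str(start_delay_seconds)
    # ET.indent only exists in Python 3.9+; skip pretty-printing on older versions.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the file contents; save them as cc_config.xml in the BOINC data directory.
    print(build_cc_config(120))
```

The client reads cc_config.xml at startup, so restart BOINC after saving the file for the delay to take effect.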

Consider setting the processor % to 50 so the client only runs 1 task at a time when running CEP2.

--//--

[Mar 11, 2012 10:44:35 PM]

tfmagnetism
Cruncher
Joined: Jul 22, 2011
Post Count: 17
Status: Offline


Re: exited with zero status

Hi Lawrence,

No, it's a desktop. LAIM is on. Not quite sure what you mean about the CPU, but it's all on 100%, with an idle time of 0.04 minutes. Ah, you guys have been holding out on me lol! From what I just read in the forums this is all normal (ish)! There's a checkpoint only at 11-12 mins CPU time, and job 2 must be taking an eternity until it makes another checkpoint. What's WEIRD is why this only happens in about 50% of WUs. I've been reading that it's normal for job 2 to take hours before checkpointing again. Anyway, here are my global prefs if you want to have a look:

<global_preferences>
  <source_project>http://www.worldcommunitygrid.org/</source_project>
  <source_scheduler>https://grid.worldcommunitygrid.org/boinc/wcg_cgi/fcgi</source_scheduler>
  <mod_time>1328483436</mod_time>
  <cpu_scheduling_period_minutes>120</cpu_scheduling_period_minutes>
  <disk_interval>60.0</disk_interval>
  <disk_max_used_gb>10.0</disk_max_used_gb>
  <disk_max_used_pct>50.0</disk_max_used_pct>
  <disk_min_free_gb>0.5</disk_min_free_gb>
  <end_hour>0</end_hour>
  <idle_time_to_run>0.04</idle_time_to_run>
  <leave_apps_in_memory/>
  <max_bytes_sec_down>0.0</max_bytes_sec_down>
  <max_bytes_sec_up>0.0</max_bytes_sec_up>
  <daily_xfer_period_days>0</daily_xfer_period_days>
  <daily_xfer_limit_mb>0.0</daily_xfer_limit_mb>
  <max_cpus>16</max_cpus>
  <max_ncpus_pct>100.0</max_ncpus_pct>
  <suspend_cpu_usage>0.0</suspend_cpu_usage>
  <net_end_hour>0</net_end_hour>
  <net_start_hour>0</net_start_hour>
  <start_hour>0</start_hour>
  <cpu_usage_limit>100.0</cpu_usage_limit>
  <ram_max_used_busy_pct>50.0</ram_max_used_busy_pct>
  <ram_max_used_idle_pct>75.0</ram_max_used_idle_pct>
  <vm_max_used_pct>75.0</vm_max_used_pct>
  <work_buf_min_days>0.0</work_buf_min_days>
  <work_buf_additional_days>0.0</work_buf_additional_days>
  <suspend_if_no_recent_input>20.0</suspend_if_no_recent_input>
</global_preferences>

Sekerob,
Not sure what that's going to achieve since this seems to be a checkpoint problem like what several others have been reporting. But I'll certainly consider it.

I'm going to look into hibernation OR boosting CEP2 a bit before shutdown until it makes a checkpoint.

In BOINC, slots, 0 folder I get in stdout:

[00:18:31] [INFO] Skipping checkpoint per user settings.
[00:21:05] [INFO] Checkpoint complete.
[00:32:16] [INFO] Checkpoint complete.

But I'm sure that's nothing to worry about. A bit worried about the "skipping" bit there.

This from client_state in BOINC directory:

<active_task>
  <project_master_url>http://www.worldcommunitygrid.org/</project_master_url>
  <result_name>E206605_695_C.24.C21H16N2Si.01364647.4.set1d06_0</result_name>
  <active_task_state>1</active_task_state>
  <app_version_num>640</app_version_num>
  <slot>0</slot>
  <checkpoint_cpu_time>735.310714</checkpoint_cpu_time>
  <checkpoint_elapsed_time>812.553997</checkpoint_elapsed_time>
  <checkpoint_fraction_done>0.000000</checkpoint_fraction_done>
  <checkpoint_fraction_done_elapsed_time>0.000000</checkpoint_fraction_done_elapsed_time>
  <current_cpu_time>3578.678500</current_cpu_time>
  <once_ran_edf>0</once_ran_edf>
  <swap_size>226050048.000000</swap_size>
  <working_set_size>67145728.000000</working_set_size>
  <working_set_size_smoothed>67145728.000000</working_set_size_smoothed>
  <page_fault_rate>14.694281</page_fault_rate>
</active_task>

The above is from client_state as updated when the other WU running today (non-CEP2) had just downloaded. Quite why it shows 0% fraction done seems strange, but perhaps that's normal.
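As a rough way to see how much work is at risk, the two CPU-time fields in that active_task block can be subtracted. The sketch below assumes that everything done since the last checkpoint is lost if the task is stopped and removed from memory (for example on shutdown), and that both fields are in seconds; the function name unsaved_cpu_seconds is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The two relevant fields, copied from the active_task block above.
ACTIVE_TASK = """<active_task>
<checkpoint_cpu_time>735.310714</checkpoint_cpu_time>
<current_cpu_time>3578.678500</current_cpu_time>
</active_task>"""


def unsaved_cpu_seconds(active_task_xml: str) -> float:
    """CPU seconds accumulated since the last checkpoint (hypothetical helper)."""
    task = ET.fromstring(active_task_xml)
    checkpoint = float(task.findtext("checkpoint_cpu_time"))
    current = float(task.findtext("current_cpu_time"))
    return current - checkpoint


lost = unsaved_cpu_seconds(ACTIVE_TASK)
print(f"{lost:.0f} s (~{lost / 60:.0f} min) of CPU time since the last checkpoint")
```

With the numbers above that comes to roughly 47 minutes of CPU time that would have to be redone after a shutdown at that moment.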

This CEP2 is certainly a strange one I'll say that!

[Mar 12, 2012 5:16:20 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: exited with zero status

Hi tfmagnetism,

This CEP2 is certainly a strange one I'll say that!

That seems to be the key. As we run through the library of molecules to check, we keep running into new behavior. Back when we started, the first few check points came in a few hours, then we would start running into long hours between check points.

Lawrence

[Mar 12, 2012 5:58:33 PM]

tfmagnetism
Cruncher
Joined: Jul 22, 2011
Post Count: 17
Status: Offline


Re: exited with zero status

Hi Lawrence,

I just pushed the boat out, and ran up to 30% and it just "checkpointed". So it is working after all, just the checkpoint of job 2 (or something) must be really variable! Wow, my other 50% WUs were checkpointing so that I didn't have problems shutting down/restarting even after only several percent. CPU time 3h:34m at checkpoint, elapsed time 3h:45m! Wow. Oh my god that's quite a jump from a few percent. That's a really variable checkpointing system there.
boinc_task_state ( BOINC,slots,0 folder) gives:

<active_task>
  <project_master_url>http://www.worldcommunitygrid.org/</project_master_url>
  <result_name>E206605_695_C.24.C21H16N2Si.01364647.4.set1d06_0</result_name>
  <checkpoint_cpu_time>12890.737033</checkpoint_cpu_time>
  <checkpoint_elapsed_time>13330.422573</checkpoint_elapsed_time>
  <fraction_done>0.298397</fraction_done>
</active_task>
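Those checkpoint values read more easily as hours and minutes. Here is a small sketch of the conversion, assuming the fields are in seconds; the helper name hms is made up for illustration.

```python
def hms(seconds: float) -> str:
    """Format a duration in seconds as hours, minutes and whole seconds."""
    total = int(seconds)  # drop the fractional part
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h:{minutes:02d}m:{secs:02d}s"


print(hms(12890.737033))  # checkpoint_cpu_time     -> 3h:34m:50s
print(hms(13330.422573))  # checkpoint_elapsed_time -> 3h:42m:10s
```

So the file as pasted puts the checkpoint at about 3h:34m of CPU time and about 3h:42m elapsed.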

Now I'm understanding things. I'd better undelete that hibernation file that I got rid of - I never use hibernation. Wow, that's a lot of lost work for a lot of people, presumably. But it only seems to be so noticeable for around 50% of WUs. I think I'll have another look at the FAQs later and see if that's made clear. You see, we usually only do about 20-25% of a CEP2 WU before we shut down again. OK, well at least now I know what's happening.

[Mar 12, 2012 7:18:25 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: exited with zero status

May I offer a few suggestions.

Defrag your drive: with CEP throwing thousands of small files on a drive, it can get fragmented pretty easily after a while. Check the SMART status of your drive. If you find a significant number of read/write sector remaps, it might be time to get a new drive, or... give it a good hard defrag. I had one drive die and had to ghost it to a new drive and do a Windows 'boot up fix' on the new clone to get it to run. After that, I defragged the old drive and ran the full check disk on it (not the abbreviated one), and the old drive seems to be working a lot better, with far fewer errors.

As was said, if your system gets busy it can cause this to happen; an HDD thrashing around can definitely cause that 'busy' problem.

Aaron

[Mar 29, 2012 8:16:23 PM]
