World Community Grid - View Thread - Restarting tasks. Why?? TCEPP2

World Community Grid Forums

Category: Completed Research

Forum: The Clean Energy Project - Phase 2 Forum

Thread: Restarting tasks. Why?? TCEPP2

Thread Status: Active
Total posts in this thread: 13

This topic has been viewed 3003 times and has 12 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Restarting tasks. Why?? TCEPP2

This is very likely caused by 'too many' CEP2 tasks starting up simultaneously. Every time I have to reboot the CEP2-only machine, I have to suspend all tasks and then release them one by one... a staggered start, which would be nice to have built in and controlled via the app_config.xml. Sample:

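<app_config>
   <app>
      <name>cep2</name>
      <!-- run at most 4 CEP2 tasks at the same time -->
      <max_concurrent>4</max_concurrent>
      <!-- wished-for setting from this post: seconds to wait between task starts -->
      <start_interval>300</start_interval>
   </app>
</app_config>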

In a production environment, if there is more than just CEP2 on the host, what would likely happen is that the other sciences start on the idle cores, and the practical interval then becomes however long it takes for that other work to complete.
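To make the wished-for behaviour concrete, here is a small Python sketch of how a start interval might space out task launches. It only illustrates the idea from the sample above: `start_interval` is the proposed setting, not an option BOINC provides, and the function name `staggered_start_times` is made up for this example.

```python
def staggered_start_times(num_tasks, start_interval):
    """Return each task's start offset in seconds when launches are
    spaced start_interval seconds apart (hypothetical behaviour)."""
    if num_tasks < 0 or start_interval < 0:
        raise ValueError("num_tasks and start_interval must be non-negative")
    return [i * start_interval for i in range(num_tasks)]


if __name__ == "__main__":
    # Values taken from the sample: 4 concurrent CEP2 tasks, 300 s apart.
    for slot, offset in enumerate(staggered_start_times(4, 300)):
        print(f"task {slot}: starts at +{offset} s")
```

On a host that also runs other sciences, the real gaps would be set by when that other work finishes, so these offsets are best read as a lower bound.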

At any rate, knreed submitted a development ticket to Dr. A's group in Berkeley 3-4 months ago: http://boinc.berkeley.edu/trac/ticket/1321. We need this to prop up CEP2 production and not be forced into substantial MMing of restarts.

The staggered manual start can be achieved by editing the app_config.xml <max_concurrent> value [BOINC v7.0.40 and up]. Start with 1 after a boot, then slowly increase the value and do a read config each time. In practice I'm doing this every 15 minutes, which also allows a CEP2 task to get past the heaviest phase, job #0, which has massive storage IO and model building. Another way is to set the processor percentage to, for instance, 12.5% for 1 core on an octo, then increase it to 25%, 37.5% etc. at time intervals.
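The manual ramp described above can be written down as a simple plan. This Python sketch prints, for each 15-minute step, the <max_concurrent> value to put in app_config.xml and the equivalent processor percentage. The function names and the template are illustrative assumptions. The script writes no files and does not talk to BOINC, so the read config step stays manual.

```python
# Template mirrors the sample from this thread, using only <max_concurrent>.
APP_CONFIG_TEMPLATE = """<app_config>
   <app>
      <name>{name}</name>
      <max_concurrent>{n}</max_concurrent>
   </app>
</app_config>
"""


def render_app_config(max_concurrent, name="cep2"):
    """Return app_config.xml text for the given concurrency limit."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    return APP_CONFIG_TEMPLATE.format(name=name, n=max_concurrent)


def ramp_plan(cores, step_minutes=15):
    """Return (minutes_after_boot, max_concurrent, cpu_percent) per step."""
    return [(i * step_minutes, i + 1, 100.0 * (i + 1) / cores)
            for i in range(cores)]


if __name__ == "__main__":
    for minutes, limit, percent in ramp_plan(8):
        print(f"+{minutes:3d} min: max_concurrent={limit} (or {percent}% of processors)")
    # Text to save as app_config.xml for the first step after a boot.
    print(render_app_config(1))
```

Use either the <max_concurrent> column or the percentage column, not both, since the two are alternative ways of doing the same ramp.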

[Nov 25, 2013 2:14:50 PM]

Randzo
Senior Cruncher
Slovakia
Joined: Jan 10, 2008
Post Count: 339
Status: Offline

Re: Restarting tasks. Why?? TCEPP2

I think that maybe the QChem developers could check this too.
In case of a lack of resources, the app should just run slower as a result of waiting on them (the resources) rather than just throwing an error. I do not know of any other application with such behavior.

[Nov 25, 2013 5:33:21 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Restarting tasks. Why?? TCEPP2

Lol Randzo, 8 started concurrently on my octo leads to an hour plus of Elapsed time before the first seconds of CPU time are logged... oddly, my octo succeeds in getting all 8 running without reaching the fatal '100 times zero' status / restart, but this stretch leads to an initial 8 core hours without a single CPU second to show for it... so which one to slow? It's a BOINC problem which can be brought under control with a wrapper-readable control.

Let's see what WCG cooks up with CEP2v2... last time I asked, the response was 'few more higher priority items'... that was 2 full moons ago or so. Maybe the multi-copy unpacking problem will find the same fix as with MCM: soft-linking back to one set in the project folder... no more throwing around X slots times 6700 files to unzip and create in each job folder. That's one of the key suckers, it appears. 'A' tech expressed interest, so who knows.
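As a rough illustration of the soft-linking idea, here is a Python sketch that links every file from one shared folder into a slot folder instead of unpacking a fresh copy per slot. It is not how WCG or MCM actually do it. The directory layout and the `link_shared_files` name are assumptions, and it relies on a platform where `os.symlink` is permitted.

```python
import os
import tempfile


def link_shared_files(shared_dir, slot_dir):
    """Symlink each entry of shared_dir into slot_dir; return how many
    links were created. Existing entries in slot_dir are left alone."""
    os.makedirs(slot_dir, exist_ok=True)
    created = 0
    for entry in sorted(os.listdir(shared_dir)):
        source = os.path.join(shared_dir, entry)
        target = os.path.join(slot_dir, entry)
        if not os.path.lexists(target):
            os.symlink(source, target)
            created += 1
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # One unpacked set in a stand-in "project" folder.
        shared = os.path.join(root, "project")
        os.makedirs(shared)
        for i in range(3):
            with open(os.path.join(shared, f"file{i}.dat"), "w") as fh:
                fh.write("data")
        # Each slot gets links rather than its own copies.
        for slot in range(2):
            n = link_shared_files(shared, os.path.join(root, "slots", str(slot)))
            print(f"slot {slot}: linked {n} files")
```

The point is only that the heavy unzip happens once. Each job folder then costs a batch of cheap link creations rather than thousands of file writes.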

[Nov 25, 2013 6:08:44 PM]
