Total posts in this thread: 13
This topic has been viewed 3003 times and has 12 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Restarting tasks. Why?? TCEPP2

Very likely a product of too many CEP2 tasks starting up simultaneously. Every time I have to reboot the CEP2-only machine, I have to suspend all tasks and then release them one by one... a staggered start, which would be nice to have as a feature controlled via app_config.xml. A sample of what that could look like (<start_interval> is the wished-for option here, not an existing one):
<app_config>
<app>
<name>cep2</name>
<max_concurrent>4</max_concurrent>
<start_interval>300</start_interval>
</app>
</app_config>


In a production environment, if there is more than just CEP2 on the host, what would likely happen is that the other sciences start on the idle cores, and the practical interval is then set by when that other work completes.

At any rate, knreed submitted a development ticket to Dr.A's group in Berkeley some 3-4 months ago: http://boinc.berkeley.edu/trac/ticket/1321 . We need this to prop up CEP2 production and not be forced into substantial MMing restarts.

The staggered manual start can be achieved by editing the <max_concurrent> value in app_config.xml [BOINC v7.0.40 and up]. Start with 1 after a boot, then slowly increase the value and do a read config after each change. In practice I'm doing this every 15 minutes, which also allows a CEP2 task to get past its heaviest phase, job#0, which has massive storage IO and model building. Another way is to set the processor percentage to, for instance, 12.5% for 1 core on an octo, then increase it to 25%, 37.5% etc. at timed intervals.
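
The manual ramp described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `render_app_config` and `ramp_schedule` are made up for this example, the 15 minute step and the 8 core host are taken from the post, and the sketch only prints the schedule and the file content. Writing the file into the project folder and doing the read config is still left to the user.

```python
from xml.etree import ElementTree as ET


def render_app_config(app_name: str, max_concurrent: int) -> str:
    """Build app_config.xml text that caps how many tasks of one app run at once."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


def ramp_schedule(cores: int, step_minutes: int = 15):
    """Return (minutes after boot, max_concurrent, equivalent CPU percent) per step."""
    return [
        (step * step_minutes, step + 1, 100.0 * (step + 1) / cores)
        for step in range(cores)
    ]


if __name__ == "__main__":
    # Hypothetical octo host: one extra CEP2 task is released every 15 minutes.
    for minutes, limit, percent in ramp_schedule(cores=8):
        print(f"t+{minutes:3d} min: max_concurrent={limit} (or CPU usage {percent:g}%)")
    # File content for the first step after a boot.
    print(render_app_config("cep2", 1))
```

The third column is the processor percentage alternative: on an octo each step adds 12.5%, so the two methods release cores at the same pace.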
[Nov 25, 2013 2:14:50 PM]
Randzo
Senior Cruncher
Slovakia
Joined: Jan 10, 2008
Post Count: 339
Status: Offline
Re: Restarting tasks. Why?? TCEPP2

I think that maybe QChem developers could check this too.
In case of a lack of resources, the app should just run slower as a result of waiting on them (the resources) rather than throw an error. I do not know of any other application with such behavior.
[Nov 25, 2013 5:33:21 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Restarting tasks. Why?? TCEPP2

I think that maybe QChem developers could check this too.
In case of a lack of resources, the app should just run slower as a result of waiting on them (the resources) rather than throw an error. I do not know of any other application with such behavior.

Lol Randzo, starting 8 concurrently on my octo leads to an hour plus of Elapsed time before the first seconds of CPU time are logged... oddly my octo succeeds in getting all 8 running without reaching the fatal '100 times zero' status / restart, but this stretch leads to an initial 8 core hours without any CPU seconds to show for it... so which one to slow? It's a BOINC problem which can be brought under control with a wrapper-readable control.

Let's see what WCG cooks up with CEP2v2... last time I asked, the response was 'few more higher priority items'... that was 2 full moons ago or so. Maybe the multi-copy unpacking problem will find the same fix as with MCM: soft-linking back to one set in the project folder... no more throwing around X slots times 6700 files to unzip and create in each job folder. That's one of the key suckers, it appears. 'A' tech expressed interest, so who knows.
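
To make the soft-linking idea concrete, here is a small Python sketch. It is not how MCM or CEP2 actually do it; the function names, folder names and file names are all hypothetical, and it assumes a platform where an ordinary user can create symbolic links. It only shows the principle: one unpacked file set, and each slot folder gets links instead of its own copies.

```python
import os
import tempfile


def link_shared_files(shared_dir: str, slot_dir: str) -> int:
    """Symlink every file in shared_dir into slot_dir; return the number of links made."""
    os.makedirs(slot_dir, exist_ok=True)
    count = 0
    for name in sorted(os.listdir(shared_dir)):
        source = os.path.join(shared_dir, name)
        target = os.path.join(slot_dir, name)
        # Skip sub-folders and anything already present in the slot.
        if os.path.isfile(source) and not os.path.lexists(target):
            os.symlink(source, target)
            count += 1
    return count


def demo() -> dict:
    """Create one shared file set and link it into two hypothetical slot folders."""
    base = tempfile.mkdtemp()
    shared = os.path.join(base, "project")
    os.makedirs(shared)
    for i in range(3):
        with open(os.path.join(shared, f"data_{i}.txt"), "w") as handle:
            handle.write(f"payload {i}")
    return {
        slot: link_shared_files(shared, os.path.join(base, "slots", slot))
        for slot in ("0", "1")
    }


if __name__ == "__main__":
    print(demo())
```

With links, starting a task costs one directory entry per file rather than an unzip and a write per file, which is where the storage IO at startup would be saved.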
[Nov 25, 2013 6:08:44 PM]