World Community Grid Forums
Category: Completed Research Forum: The Clean Energy Project - Phase 2 Forum Thread: Restarting tasks. Why?? TCEPP2 |
Thread Status: Active Total posts in this thread: 13
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Very likely a result of 'too many' CEP2 tasks starting up simultaneously. Every time I have to reboot the CEP2-only machine, I have to suspend all tasks and then release them one by one... a staggered start, which would be nice to have built in and controlled via the app_config.xml. Sample.
<app_config>
In a production environment, if there is more than just CEP2 on the host, what likely happens is that the other sciences start on the idle cores, so the practical stagger interval is set by when that other work completes. At any rate, knreed submitted a development ticket to Dr. A's group in Berkeley some 3-4 months ago: http://boinc.berkeley.edu/trac/ticket/1321 . We need this to prop up CEP2 production without being forced into substantial MMing of restarts.
The staggered start can be done manually by editing the <max_concurrent> value in app_config.xml [BOINC v7.0.40 and up]. Start with 1 after a boot, then slowly increase the value and do a read config each time. In practice I'm doing this every 15 minutes, which also allows a CEP2 task to get past the heaviest phase, job #0, which has massive storage IO and model building. Another way is to set the processor percentage to, for instance, 12.5% for 1 core on an octo, then increase it to 25%, 37.5% etc. at time intervals.
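The manual procedure above (raise <max_concurrent> by one every 15 minutes and re-read the config) could be scripted roughly as follows. This is only a sketch under assumptions, not a tested tool: the app name "cep2", the function names, and the config path are illustrative, and it assumes that `boinccmd --read_cc_config` also makes the client re-read app_config.xml, which may depend on the client version. The last column of the schedule is the equivalent processor percentage for the alternative method.

```python
import subprocess
import time
import xml.etree.ElementTree as ET
from pathlib import Path


def render_app_config(app_name: str, max_concurrent: int) -> str:
    """Build a minimal app_config.xml body limiting one app's concurrent tasks."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name  # "cep2" is an assumed app name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


def stagger_schedule(cores: int, interval_min: int = 15):
    """Return (minutes_after_boot, max_concurrent, cpu_percent) for each step."""
    return [
        (step * interval_min, step + 1, 100.0 * (step + 1) / cores)
        for step in range(cores)
    ]


def run_staggered_start(config_path: str, app_name: str = "cep2",
                        cores: int = 8, interval_min: int = 15) -> None:
    """Rewrite app_config.xml step by step and ask the client to re-read it.

    Assumption: boinccmd is on PATH and --read_cc_config triggers a re-read
    of app_config.xml on the installed client version.
    """
    for _, limit, _ in stagger_schedule(cores, interval_min):
        Path(config_path).write_text(render_app_config(app_name, limit))
        subprocess.run(["boinccmd", "--read_cc_config"], check=True)
        if limit < cores:
            time.sleep(interval_min * 60)  # let the new task get past job #0
```

Calling `stagger_schedule(8)` gives the steps for an octo; `run_staggered_start` would be pointed at the app_config.xml in the project folder, and the existing file should be backed up first since the sketch overwrites it.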
||
|
Randzo
Senior Cruncher Slovakia Joined: Jan 10, 2008 Post Count: 339 Status: Offline Project Badges: |
I think that maybe the QChem developers could check this too.
If resources are lacking, the app should just run slower as a result of waiting for those resources rather than throwing an error. I do not know of any other application with such behavior. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I think that maybe QChem developers could check this too. In case of lack of resources the app should just run slower as result of waiting on them (resources) rather than just throw an error. I do not know any other application with such behavior.
Lol Randzo, as 8 started concurrently on my octo leads to an hour plus of Elapsed time before the first seconds of CPU time are logged... oddly my octo succeeds in getting all 8 running without reaching the fatal 100 times zero status / restart, but this stretch leads to an initial 8 core-hours without a single CPU second to show for it... so which one to slow? It's a BOINC problem which can be brought under control with a wrapper-readable control. Let's see what WCG cooks up with CEP2v2... last time I asked, the response was 'few more higher priority items'... that was 2 full moons ago or so. Maybe the multi-copy, unpacking problem will find the same fix as with MCM, soft-linking back to one set in the project folder... no more throwing around X slots times 6700 files to unzip and create in each job folder. That's one of the key suckers, it appears. 'A' tech expressed interest, so who knows. |