World Community Grid Forums
Category: Completed Research Forum: The Clean Energy Project - Phase 2 Forum Thread: Multiple tasks failing (exit with zero stats but no finished file) |
Thread Status: Active Total posts in this thread: 10
|
Author |
|
imakuni
Advanced Cruncher Joined: Jun 11, 2009 Post Count: 90 Status: Offline Project Badges: |
As the title says. It's been happening for a few days on one of my machines. The majority of the tasks end up with this error, although eventually one or two WUs succeed.
I already tried resetting the project, as well as detaching from WCG. Neither has worked. The client is set to use all available disk space (and there is plenty), so it's not that either. It never happened with MCM, and when I was running CEP (around a month ago), it was doing just fine.
----------------------------------------
Want to have an image of yourself like this on? Check this thread: https://secure.worldcommunitygrid.org/forums/wcg/viewthread_thread,29840 |
||
|
Seoulpowergrid
Veteran Cruncher Joined: Apr 12, 2013 Post Count: 799 Status: Offline Project Badges: |
I've been having multiple errors for CEP2 these days on multiple machines (Win 7 and a recent Linux/Ubuntu install). What part of the code should I look at to see if I've been having the same errors as you?
---------------------------------------- |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
All the answers and possible mitigation actions are covered extensively in this forum. Default is 1 at a time! Repeat: 1 at a time when opting in. Only if your hardware/OS is tuned to handle more can you run more.
|
||
|
imakuni
Advanced Cruncher Joined: Jun 11, 2009 Post Count: 90 Status: Offline Project Badges: |
"All the answers and possible mitigation actions are covered extensively in this forum. Default is 1 at a time! Repeat: 1 at a time when opting in. Only if your hardware/OS is tuned to handle more can you run more."
How about a 16-thread (8-core) Xeon with 16 GB of RAM and a 2 TB HD? Shouldn't that be enough to handle extra WUs? After all, it used to, around a month ago (when I was running CEP on that machine). But now it seems to be failing for some reason, and I DOUBT it is because of weak hardware. I also have MANY other weaker machines that can run multiple CEP tasks at a time, and none of them are showing that problem.
Last, if this is "extensively covered in this forum", I suppose it wouldn't be that hard to copy one of those many links and post it here, no? We appreciate the help, but we'll be waiting for a proper response.
[Edit 1 times, last edit by imakuni at Jun 22, 2015 6:54:38 PM] |
||
|
petehardy
Senior Cruncher USA Joined: May 4, 2007 Post Count: 318 Status: Offline Project Badges: |
"How about a 16-thread (8-core) Xeon with 16 GB of RAM and a 2 TB HD? Shouldn't that be enough to handle extra WUs?"
You need your BOINC data directory to be on a RAM disk or an SSD.
"Patience is a virtue", I can't wait to learn it! |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
There's a discussion that popped up, started by the BOINC developer, about a WCG project that causes failures for him on Windows. He hasn't mentioned which one, but it may be related to an API used in compiling the science prior to Oct. 31, 2014. The Techs might like to get in touch with David Anderson to ascertain exactly which one [zombie processes if LAIM is -NOT- set and BOINC is suspended, which is when sciences are supposed to unload]. For CEP2 it has always been strongly recommended to run with LAIM -ON-, but I've not heard of zombie processes related to CEP2 before. My thinking goes towards any of the sciences that have a controller/stager part and a worker, with the controller exiting but not the worker.
----------------------------------------
Edit: Not CEP2 but VINA, which narrows it down to OET1 and makes FAHV an academic case, if it happens to this one. [Edit 1 times, last edit by Former Member at Jun 22, 2015 10:14:35 PM] |
||
|
Yarensc
Advanced Cruncher USA Joined: Sep 24, 2011 Post Count: 134 Status: Offline Project Badges: |
"How about a 16-thread (8-core) Xeon with 16 GB of RAM and a 2 TB HD? Shouldn't that be enough to handle extra WUs?"
Your powerful computer might actually be more of a problem than a benefit for your specific problem. Since it can run so many tasks at once and (presumably) quicker than normal, there's a good chance multiple tasks are trying to write to your one hard drive at the same time. This can result in a timeout where a task crashes because it thinks the system has crashed, when in reality there are just 2 or 3 tasks ahead of it in line for the drive. CEP2 has very large files (which is why it has a default of 1 at a time, as SekeRob pointed out). This problem can be mitigated by running on an SSD or RAM disk, as petehardy pointed out, or by setting up an app_config.xml file to limit the number running concurrently and staggering their start. If you don't want to go through the work for either of those, just reset your profile to one at a time and fill the other 15 threads with other projects.
Setting up max_concurrent with app_config -> https://secure.worldcommunitygrid.org/forums/...ead,37845_offset,0#487614 |
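For readers who have not used app_config.xml before, here is a minimal sketch in Python that generates such a file with a max_concurrent limit. The app name "cep2" is taken from the BOINCTasks history quoted later in this thread; the helper name build_app_config and the limit of 1 are illustrative assumptions, not something posted by the project. The sketch only prints the XML. Saving it as app_config.xml in the project's folder under the BOINC data directory and having the client re-read its config files is left to the reader.

```python
def build_app_config(app_name, max_concurrent):
    """Return app_config.xml text limiting how many tasks of one app run at once.

    build_app_config is a hypothetical helper written for this example.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    return (
        "<app_config>\n"
        "  <app>\n"
        "    <name>{name}</name>\n"
        "    <max_concurrent>{limit}</max_concurrent>\n"
        "  </app>\n"
        "</app_config>\n"
    ).format(name=app_name, limit=max_concurrent)


# Assumption: "cep2" is the app name, as shown in the BOINCTasks history in this thread.
xml_text = build_app_config("cep2", 1)
print(xml_text)
```

Note that max_concurrent only caps how many CEP2 tasks run at once; it does not stagger their start times.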
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
An interesting observation here is that CEP2 on my 4770 at 1900 MHz instead of the regular 3700 MHz [summer heat dictated] showed several percentage points better efficiency [now 98-99%], implying the HD is not keeping up when the CPU is going full out.
Sample of the last few recorded by BOINCTasks history:
7.00 cep2 E231218_479_S.244.C20F6H10N4S2.CECRYFGIAHSAQJ-UHFFFAOYSA-N.11_s1_14_0 07:21:37 (07:15:28) 6/23/2015 5:21:46 PM 6/23/2015 5:25:46 PM 98,61 Reported: OK * 207.66 MB 415.66 MB
7.00 cep2 E231217_429_S.240.C32H34N2.ZIQIFKYUGFEYKA-UHFFFAOYSA-N.6_s1_14_0 14:08:18 (13:58:06) 6/23/2015 10:02:21 AM 6/23/2015 10:06:22 AM 98,80 Reported: OK * 281.31 MB 476.03 MB |
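As a quick check on the efficiency figures above, the sketch below assumes that the first time in each history row is elapsed time, the time in parentheses is CPU time, and efficiency is CPU time divided by elapsed time. Those readings of the columns are assumptions, but the ratio does reproduce the 98,61 and 98,80 values shown in the sample. The function names are made up for this example.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(elapsed, cpu):
    """CPU time as a percentage of elapsed time (assumed meaning of the column)."""
    return 100.0 * to_seconds(cpu) / to_seconds(elapsed)


# (elapsed, cpu) pairs copied from the two history rows in the post above.
samples = [("07:21:37", "07:15:28"), ("14:08:18", "13:58:06")]
results = [cpu_efficiency(elapsed, cpu) for elapsed, cpu in samples]

for (elapsed, cpu), pct in zip(samples, results):
    print("elapsed {} cpu {} -> {:.2f}%".format(elapsed, cpu, pct))
```

The time not accounted for by the CPU (about 1.2 to 1.4 percent here) is time the task spent waiting, which is consistent with the disk being the limiting factor.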
||
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline Project Badges: |
"There's a pop-up discussion started by the BOINC developer regarding a project of WCG that causes him failures, on Windows."
SekeRob, is this on one of the BOINC dev email lists?
Thanks,
armstrdj |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Yes, the noted Dr. A commented on the client alpha mailing list and pointed at a VINA app of WCG. Then another user wrote that another project, which got it from WCG (Find?), was not checkpointing, which to me points the finger at the first OET, which did not checkpoint either.
|
||
|
|