World Community Grid Forums
Thread Status: Active Total posts in this thread: 10
Dayle Diamond
Senior Cruncher Joined: Jan 31, 2013 Post Count: 452 Status: Offline
Hi folks, sorry if this may have been asked before.
One of my computers is pretty strong and stays on all the time, so I'm experimenting with contributing to CEP-2. I have it running eleven copies, with one core supporting the GPU tasks for GPU Grid. There doesn't seem to be a bottleneck: there's plenty of hard disk space for the project, and the tasks are using less RAM than the system requirements warned (6.5 GB in use out of 16 GB). But for the last few days, whenever I check on them, most are progressing steadily through their first HOUR of work. Any advice?
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
CEP2 does a lot of I/O in bursts, so the bottleneck might be the HDD especially as you're running so many in parallel. You may notice heartbeat errors in the logs, particularly if they're all (re)starting at the same time.
My advice would be to run fewer at once. You can tweak BOINC to set a maximum. See elsewhere in this forum for how-to. |
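As a rough illustration of the "set a maximum" advice, BOINC reads an optional app_config.xml from the project's folder inside the BOINC data directory, and its max_concurrent element caps how many tasks of one application run at once. The sketch below only builds the file's text. The application short name "cep2" and the limit of 4 are assumptions for illustration; check the real name in your client_state.xml before using it.

```python
from xml.sax.saxutils import escape


def build_app_config(app_name: str, max_concurrent: int) -> str:
    """Return the text of an app_config.xml that caps one application."""
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")
    return (
        "<app_config>\n"
        "  <app>\n"
        f"    <name>{escape(app_name)}</name>\n"
        f"    <max_concurrent>{max_concurrent}</max_concurrent>\n"
        "  </app>\n"
        "</app_config>\n"
    )


if __name__ == "__main__":
    # "cep2" is an assumed application short name, and 4 is an arbitrary limit.
    print(build_app_config("cep2", 4))
```

After saving the output as app_config.xml in the project folder, the client has to re-read its config files (or be restarted) before the limit takes effect.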
||
|
keithhenry
Ace Cruncher Senile old farts of the world ....uh.....uh..... nevermind Joined: Nov 18, 2004 Post Count: 18665 Status: Offline
CEP2 also checkpoints very infrequently, so the slightest hiccup will back a WU up to the last checkpoint. That many concurrent WUs could be maxing out your memory, making LAIM (leave applications in memory) pointless. First thing to do would be to review the BOINC event log for unusual messages. Also, if you follow the directory tree down, you may be able to find the log for one of the active WUs (stderr.txt I *think*) and see what's been written to it so far. Post what you find here.
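For anyone who would rather not click through the directory tree by hand, here is a minimal sketch that collects the tail of each active task's log. It assumes the usual BOINC layout, where every running task has a numbered folder under slots/ in the data directory and writes a stderr.txt there. The data directory path is something you supply; the function name is made up for this example.

```python
from pathlib import Path


def tail_slot_logs(data_dir, lines=20, log_name="stderr.txt"):
    """Return {slot folder: last `lines` lines of its log} for slots that have one."""
    results = {}
    slots = Path(data_dir) / "slots"
    if not slots.is_dir():
        return results
    for slot in sorted(slots.iterdir()):
        log = slot / log_name
        if log.is_file():
            # errors="replace" keeps going even if the log has odd bytes in it.
            text = log.read_text(errors="replace")
            results[str(slot)] = text.splitlines()[-lines:]
    return results
```

Calling tail_slot_logs with your own BOINC data directory and printing each entry gives a quick view of what every running WU last wrote.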
||
|
Dayle Diamond
Senior Cruncher Joined: Jan 31, 2013 Post Count: 452 Status: Offline
Oho. Exited with zero status but no 'finished' file. Finally, an error message!
I'd be happy to decrease the number of work units, if that's correlated with this error message.
1/1/2015 1:16:11 PM | World Community Grid | Task E227433_692_S.254.C34H22N4.QRZDCTKWRXZCIL-UHFFFAOYSA-N.7_s1_14_0 exited with zero status but no 'finished' file
1/1/2015 1:16:11 PM | World Community Grid | If this happens repeatedly you may need to reset the project.
1/1/2015 1:16:11 PM | World Community Grid | Task E227435_398_S.254.C32H20N6.KMBASDFPNPOUPR-UHFFFAOYSA-N.9_s1_14_0 exited with zero status but no 'finished' file
1/1/2015 1:16:11 PM | World Community Grid | If this happens repeatedly you may need to reset the project.
1/1/2015 1:16:11 PM | World Community Grid | Task E227449_173_S.254.C34H22N4.CQMPUMPMOALEFI-UHFFFAOYSA-N.6_s1_14_0 exited with zero status but no 'finished' file
1/1/2015 1:16:11 PM | World Community Grid | If this happens repeatedly you may need to reset the project.
1/1/2015 1:16:11 PM | World Community Grid | Task E227451_930_S.256.C32H20N4S1.DNPKIYWXPSOUDJ-UHFFFAOYSA-N.3_s1_14_0 exited with zero status but no 'finished' file
1/1/2015 1:16:11 PM | World Community Grid | If this happens repeatedly you may need to reset the project.
1/1/2015 1:16:11 PM | World Community Grid | Task E227454_357_S.256.C34H22N2S1.HEPVMXLHWJGLHY-UHFFFAOYSA-N.9_s1_14_0 exited with zero status but no 'finished' file
1/1/2015 1:16:11 PM | World Community Grid | If this happens repeatedly you may need to reset the project.
1/1/2015 1:16:11 PM | World Community Grid | Computation for task E227432_996_S.254.C28H12N6S2.KQZHOPHCFKURLS-UHFFFAOYSA-N.3_s1_14_0 finished
1/1/2015 1:16:11 PM | FiND@Home | Sending scheduler request: To fetch work.
1/1/2015 1:16:11 PM | FiND@Home | Requesting new tasks for CPU and NVIDIA GPU
1/1/2015 1:17:03 PM | World Community Grid | Task E227420_961_S.250.C36H27N1.KUHWGGPVXFMOLS-UHFFFAOYSA-N.6_s1_14_0 exited with zero status but no 'finished' file
1/1/2015 1:17:03 PM | World Community Grid | If this happens repeatedly you may need to reset the project.
1/1/2015 1:17:03 PM | World Community Grid | Task E227423_672_S.252.C26H15N7S2.RREHHLBQLUMRNP-UHFFFAOYSA-N.3_s1_14_0 exited with zero status but no 'finished' file
1/1/2015 1:17:03 PM | World Community Grid | If this happens repeatedly you may need to reset the project.
1/1/2015 1:17:03 PM | World Community Grid | Task E227423_721_S.252.C30H21N5O2.HICZTGFSAJMRCA-UHFFFAOYSA-N.16_s1_14_0 exited with zero status but no 'finished' file
1/1/2015 1:17:03 PM | World Community Grid | If this happens repeatedly you may need to reset the project.
1/1/2015 1:17:03 PM | World Community Grid | Task E227427_507_S.252.C32H24N4O1.SIOKLUZBYXREFR-UHFFFAOYSA-N.6_s1_14_0 exited with zero status but no 'finished' file
1/1/2015 1:17:03 PM | World Community Grid | If this happens repeatedly you may need to reset the project.
1/1/2015 1:17:03 PM | World Community Grid | Task E227429_654_S.252.C27H20N10.LUFRHPJDNNECAF-UHFFFAOYSA-N.17_s1_14_1 exited with zero status but no 'finished' file
1/1/2015 1:17:03 PM | World Community Grid | If this happens repeatedly you may need to reset the project.
1/1/2015 1:17:03 PM | World Community Grid | Task E227431_2_S.254.C32H20N6.IIOWKCXICUURSX-UHFFFAOYSA-N.12_s1_14_0 exited with zero status but no 'finished' file
1/1/2015 1:17:03 PM | World Community Grid | If this happens repeatedly you may need to reset the project.
1/1/2015 1:17:04 PM | World Community Grid | Started upload of E227432_996_S.254.C28H12N6S2.KQZHOPHCFKURLS-UHFFFAOYSA-N.3_s1_14_0_0
1/1/2015 1:17:04 PM | World Community Grid | Started upload of E227432_996_S.254.C28H12N6S2.KQZHOPHCFKURLS-UHFFFAOYSA-N.3_s1_14_0_1
1/1/2015 1:17:05 PM | World Community Grid | Finished upload of E227432_996_S.254.C28H12N6S2.KQZHOPHCFKURLS-UHFFFAOYSA-N.3_s1_14_0_0
1/1/2015 1:17:05 PM | World Community Grid | Started upload of E227432_996_S.254.C28H12N6S2.KQZHOPHCFKURLS-UHFFFAOYSA-N.3_s1_14_0_2
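To see whether restarts cluster in time, the event log lines can be grouped by timestamp. This is only a sketch: it assumes each entry sits on its own line in the "timestamp | project | message" layout shown in this post, and the function name is invented for the example.

```python
import re
from collections import defaultdict

# Matches only the restart message; the timestamp is kept as an opaque string.
RESTART = re.compile(
    r"^(?P<when>.+?) \| (?P<project>[^|]*) \| Task (?P<task>\S+) "
    r"exited with zero status but no 'finished' file$"
)


def group_restarts(log_lines):
    """Map each timestamp to the list of tasks that restarted at that moment."""
    groups = defaultdict(list)
    for line in log_lines:
        match = RESTART.match(line.strip())
        if match:
            groups[match.group("when")].append(match.group("task"))
    return dict(groups)
```

Applied to the excerpt in this post, the grouping shows five tasks restarting at 1:16:11 PM and six at 1:17:03 PM, so all eleven CEP2 tasks were restarted within about a minute of each other.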
||
|
Mamajuanauk
Master Cruncher United Kingdom Joined: Dec 15, 2012 Post Count: 1900 Status: Offline
It's been recorded in the forum that I've been running 64 instances of CEP2 on the same machine. The point is that running many at once is not a problem per se; I found the main problems were availability of RAM and starting too many instances at the same time.
Manually suspending the tasks/WUs and staggering their restart should help. I also have my machines on UPS/battery backup to prevent any power glitches causing problems...
Mamajuanauk is the Name! Crunching is the Game!
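The suspend-and-stagger approach can also be scripted. The sketch below only plans the commands and does not run anything. It assumes the boinccmd tool's "--task URL task_name resume" form; the project URL, the task names and the five-minute gap are placeholders you would replace with your own values.

```python
def stagger_plan(task_names, project_url, gap_seconds=300):
    """Return (delay_seconds, argv) pairs that resume one task per interval."""
    plan = []
    for index, name in enumerate(task_names):
        # One boinccmd call per task, each scheduled gap_seconds after the last.
        argv = ["boinccmd", "--task", project_url, name, "resume"]
        plan.append((index * gap_seconds, argv))
    return plan


if __name__ == "__main__":
    # Hypothetical task names and URL, for illustration only.
    for delay, argv in stagger_plan(["task_a", "task_b"], "http://example.invalid/"):
        print(delay, " ".join(argv))
```

Each planned command could then be handed to subprocess.run after sleeping for its delay, once all the tasks have been suspended first.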
||
|
Dayle Diamond
Senior Cruncher Joined: Jan 31, 2013 Post Count: 452 Status: Offline
64 instances is amazing!
Okay, my first step is manually restarting the tasks one by one. I'll let you know how it goes. Not sure if I can do anything about the power supply for the next few days. |
||
|
Mamajuanauk
Master Cruncher United Kingdom Joined: Dec 15, 2012 Post Count: 1900 Status: Offline
Quoting Dayle Diamond: "64 instances is amazing!"
I've just moved one of my machines back to CEP2, loaded it up with 64 WUs and ensured they all started at different times...
Quoting Dayle Diamond: "Okay, my first step is manually restarting the tasks one by one. I'll let you know how it goes. Not sure if I can do anything about the power supply for the next few days."
I'll post any errors... A UPS is not essential; it just helps if there are any power glitches.
Mamajuanauk is the Name! Crunching is the Game!
||
|
Dayle Diamond
Senior Cruncher Joined: Jan 31, 2013 Post Count: 452 Status: Offline
Alright, I've paid more attention to my BOINC manager than I thought possible and caught one of my work units restarting. I kept only 2 CEP2 tasks running at a time, and found one with just 40 seconds on the runtime. Checked the log, and it hadn't been updated for a few minutes; no activity was recorded.
The second CEP2 was running with no interruption. Does that still point to a power issue? |
||
|
Mamajuanauk
Master Cruncher United Kingdom Joined: Dec 15, 2012 Post Count: 1900 Status: Offline
Check how much memory and disk space you have allocated in BOINC Manager locally, and be careful to check the actual memory available in the log; it's within the first few lines at startup.
The percentage allocation can restrict what is actually available: even though you have allocated 20 GB, if the % is restricted to 50, or the "leave xx GB" setting is in effect, then there may be less than you intend. Check Task Manager to see how much memory is being used in total. Post some log entries from BOINC again: the first 20-30 from the start of the log and some of the latest ones...
Edit - Look out for these lines:
31/12/2014 15:38:58 | | max memory usage when active: 29399.16MB
Mamajuanauk is the Name! Crunching is the Game!
[Edit 1 times, last edit by Mamajuanauk at Jan 3, 2015 2:20:59 PM]
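The arithmetic behind the percentage setting is easy to check by hand or with a few lines of code. This sketch assumes the limit is simply total RAM times the configured percentage, and it reads the reported figure from a startup line in the format quoted in this post. The per-task footprint is a made-up parameter, since the thread gives no measured value for it.

```python
import re


def memory_budget_mb(total_ram_mb, percent_allowed):
    """RAM the client may use under a percentage limit, in MB."""
    return total_ram_mb * percent_allowed / 100.0


def tasks_that_fit(total_ram_mb, percent_allowed, per_task_mb):
    """Whole number of tasks that fit in the budget at an assumed per-task size."""
    if per_task_mb <= 0:
        raise ValueError("per_task_mb must be positive")
    return int(memory_budget_mb(total_ram_mb, percent_allowed) // per_task_mb)


def parse_max_memory_mb(line):
    """Extract the MB figure from a 'max memory usage when active' log line."""
    match = re.search(r"max memory usage when active: ([\d.]+)\s*MB", line)
    return float(match.group(1)) if match else None
```

For instance, a 16 GB machine (16384 MB) limited to 50% has a budget of 8192 MB, which at a hypothetical 1024 MB per task leaves room for 8 tasks, not the 16 the raw RAM figure might suggest.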
||
|
Dayle Diamond
Senior Cruncher Joined: Jan 31, 2013 Post Count: 452 Status: Offline
My event log doesn't go back any farther than yesterday; it's full of FiND & Malaria Control projects that have been running while I slowly restart CEP2.
Settings are to use 70% memory while in use, 90% while idle. Should I increase that, and if so, to what? Again, I'm running 16GB of DDR3.
Edit: Screw it, Harvard or no, I can't justify wasting any more months of computation on this project. I'll stick with the default mix.
[Edit 1 times, last edit by Dayle Diamond at Jan 3, 2015 3:06:54 PM]
||