World Community Grid Forums
Thread Status: Active Total posts in this thread: 33
LCB001
Advanced Cruncher CANADA Joined: Oct 14, 2009 Post Count: 69 Status: Offline
6.10.17 - Vista x64 HP + 2x gpu Folding@Home
6.10.18 - Vista x64 HP + 1x gpu Folding@Home
6.10.18 - W7 x64 Ult + 3x gpu Folding@Home
6.10.18 - W7 x64 HP + 1x gpu SETI (part-time) - Laptop
Zero Problems with DDDT2
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
I'm running 6.10.56 on my Vista without any issues... My Windows XP with a slightly older version did run into a block of errored WUs in a row (like 12), but I haven't been running into many errors, and all the errors were less than 1 min.

Have you confirmed these were DDDT2, or is sleeplessness causing you to mix this up with HPF2's typical behavior?
WCG
Please help to make the Forums an enjoyable experience for All! |
Rickjb
Veteran Cruncher Australia Joined: Sep 17, 2006 Post Count: 666 Status: Offline
Got just one bad DDDT2 WU: ts05_a239_pqa008
Mine was copy _4; copies _1, _2 & _3 gave the same error, with these lines in their error logs:
> The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
> ...
> CHARGE OUTSIDE INNER GSBP REGION
> Encountered error. Exiting.
The error occurred early in the WU, after about 2.4 claimed credits' worth of crunching. Copy _0 is still In Progress. Looks like a genuine bad WU to me ...
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Seemingly there are still devices that can handle them straight out, or rather, some that can't:
ts05_ a192_ pca003_ 2-- | 617 | Valid | 8/18/10 17:57:44 | 8/19/10 15:13:07 | 4.65 | 58.1 / 85.1 < Joe, The Underclaiming Repairman on 6.2.15
ts05_ a192_ pca003_ 1-- | 617 | Error | 8/18/10 11:36:17 | 8/18/10 17:46:27 | 1.24 | 40.6 / 40.6
ts05_ a192_ pca003_ 0-- | 617 | Valid | 8/18/10 11:35:54 | 8/20/10 10:24:58 | 4.72 | 85.1 / 85.1 < Moi, in grant hog heaven.

Bill, The Error Generator's log:

Result Name: ts05_ a192_ pca003_ 1-- -227 =
<core_client_version>6.10.56</core_client_version>
<![CDATA[
<message>
process exited with code 29 (0x1d, -227)
</message>
<stderr_txt>
Calling gridPlatform.init()
INFO: No state to restore. Start from the beginning.
CHARGE OUTSIDE INNER GSBP REGION
Encountered error. Exiting.
</stderr_txt>
]]>

ERR_RMDIR -227 In BOINC 6.0 and above: Remove (delete) directory failed.
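For anyone comparing many of these logs, the three numbers in the exit line can be pulled out mechanically. This is only a minimal sketch, assuming the `<message>` text always has the exact shape shown in the log above; `parse_exit_message` is a made-up helper name, not something BOINC provides.

```python
import re

# Matches the line exactly as it appears in the log quoted above:
#   process exited with code 29 (0x1d, -227)
EXIT_RE = re.compile(
    r"process exited with code (?P<code>-?\d+) "
    r"\((?P<hex>0x[0-9a-fA-F]+), (?P<third>-?\d+)\)"
)

def parse_exit_message(log_text):
    """Return (decimal code, hex code as int, third value), or None if absent."""
    m = EXIT_RE.search(log_text)
    if m is None:
        return None
    return int(m.group("code")), int(m.group("hex"), 16), int(m.group("third"))

sample = "<message> process exited with code 29 (0x1d, -227) </message>"
print(parse_exit_message(sample))
```

The first two values are the same number written in decimal and hex. It may also be worth noticing that 29 - 256 = -227; that is just arithmetic on the numbers in the log, not a confirmed statement about how the client builds this line.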
evilkats
Senior Cruncher USA Joined: May 4, 2007 Post Count: 162 Status: Offline
I think that most errors are due to the fact that each WU uses something like 400-500 MB of page file. Multiply that by the number of cores and you get a situation where the system just can't allocate that much space in the seconds the task requires. This would not be a problem for systems running 24/7, where tasks gradually start and finish in sequence, but it is for a system that runs on a schedule. When 4+ tasks all at once request 400+ MB of swap file each, you get nothing but failed tasks with
'INFO: No state to restore. Start from the beginning.
forrtl: severe (98): cannot allocate memory for the file buffer - out of memory'
within seconds. What's worse, the failed task may not release all that paged space and more tasks will fail in sequence because the system is out of resources. I had 20 tasks fail one after another in the matter of minutes on a 16 Core server twice already in the past week.
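To put rough numbers on that, here is a small sketch of the multiplication described in the post. The 400-500 MB per WU, the "4+ tasks" and the 16 cores come from the post; the function name is made up, and the result is simply the amount requested at once, not a measurement of what any system actually used.

```python
def simultaneous_pagefile_demand_mb(tasks_starting, per_task_mb):
    """Total page file requested if every task asks for its share at once."""
    return tasks_starting * per_task_mb

# Figures from the post: 400-500 MB per WU, 4+ tasks at once, a 16-core server.
for tasks in (4, 16):
    low = simultaneous_pagefile_demand_mb(tasks, 400)
    high = simultaneous_pagefile_demand_mb(tasks, 500)
    print(f"{tasks} tasks starting together: {low}-{high} MB requested")
```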
Rickjb
Veteran Cruncher Australia Joined: Sep 17, 2006 Post Count: 666 Status: Offline
> evilkats: What's worse, the failed task may not release all that paged space and more tasks will fail in sequence because the system is out of resources. I had 20 tasks fail one after another in the matter of minutes on a 16 Core server twice already in the past week.
If your machine runs out of pagefile space, the error log files can contain:
> CreateProcess() failed - The paging file is too small for this operation to complete. (0x5af)
I bumped into this when I suspended some DDDT2 WUs to change the order in which they would be run, and when I hit the pagefile limit, the rest of the WUs in the cache all tried to start but failed, and they went down like a row of dominoes.

I think the programmers of the DDDT2 CHARMM software expected it to be run under proper operating systems, where only those chunks of the 500MB VM reservation that are actually written to or read from are taken from the pagefile data pool. Sekerob mentioned a few days ago that his Linux system(s) use very little pagefile space with DDDT2, and I think that is the explanation. Also, if DDDT2 was actually using all of the 500MB of pagefile space per WU, we would be seeing much more disc activity, especially on multi-core and HT machines.
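The difference between charging the whole reservation up front and charging only what is touched can be illustrated with a toy model. This is a sketch of the argument in the post, not a description of how Windows or Linux actually account for memory. The 500 MB reservation and the 16 tasks come from the thread; the 3000 MB limit and the 50 MB "touched" figure are invented purely for illustration.

```python
def start_tasks(limit_mb, reserve_mb, touched_mb, n_tasks, charge_full_reservation):
    """Return (started, failed) counts for n_tasks launched back to back.

    charge_full_reservation=True  -> each task is charged its whole reservation
    charge_full_reservation=False -> each task is charged only what it touches
    """
    charge = reserve_mb if charge_full_reservation else touched_mb
    used = 0
    started = failed = 0
    for _ in range(n_tasks):
        if used + charge <= limit_mb:
            used += charge
            started += 1
        else:
            failed += 1  # the "paging file is too small" case; next task tries anyway
    return started, failed

# Hypothetical limit (3000 MB) and touched size (50 MB); 500 MB and 16 are from the thread.
print("full reservation charged:", start_tasks(3000, 500, 50, 16, True))
print("only touched pages charged:", start_tasks(3000, 500, 50, 16, False))
```

Under the first policy the limit is reached after a handful of tasks and everything behind them fails in a row, which matches the domino effect described above; under the second, all tasks fit.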
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
The sequential starting of same-science tasks to try to find one that fits in the available memory was, frankly, stupid. The newest clients were supposed to learn what each science app needs and not do that churning through the whole buffer of same. Properly, if not enough memory space is available, the client is supposed to suspend one or more cores, so what was that client version/platform again?

Review too your Disk/Memory permissions. They are working to the 'least of all' rule... or or or. 50% of 2GB free space is less than 10% of 100GB.

Yes, running system monitor suggests very little VM use even with 4 concurrent. Currently it has 2 DDDT2 and 2 CEP2 + 2 waiting-to-start DDDT2's (to skip the CEP2 jobs ahead), i.e. 6 in memory with LAIM (Leave app in memory when pre-empted) on. SM shows 1% of 3GB VM used **.

I'm not sure, but presume that LAIM is on with evilkats' for that build-up to take place. Can't remember anyone mentioning this in this thread. DDDT2 has short checkpoints, so LAIM is not all too important (works instantaneously from local prefs). For CEP2 it is, as checkpoints can be multiple hours apart.

** conflicts with TOP info, showing fixed VM use of 399M for DDDT2 and 305M for CEP2?

edit: inserted in first line "one that fits in the available memory was"
[Edit 1 times, last edit by Sekerob at Aug 22, 2010 4:28:13 PM]
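The 'least of all' remark in the post above can be worked through with its own numbers. This is a minimal sketch that assumes the "100GB" is the total disk size and that the two percentages are separate caps of which the smaller one applies; the function and parameter names are illustrative and are not the client's actual preference names.

```python
def effective_disk_allowance_gb(free_gb, total_gb, max_pct_of_free, max_pct_of_total):
    """Apply each cap separately and let the smallest one win."""
    cap_from_free = free_gb * max_pct_of_free / 100
    cap_from_total = total_gb * max_pct_of_total / 100
    return min(cap_from_free, cap_from_total)

# Numbers from the post: 50% of 2 GB free versus 10% of 100 GB.
print(effective_disk_allowance_gb(free_gb=2, total_gb=100,
                                  max_pct_of_free=50, max_pct_of_total=10))
```

With those figures the free-space cap (1 GB) is the one that bites, even though the percentage-of-disk setting alone would allow 10 GB.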
sk..
Master Cruncher Joined: Mar 22, 2007 Post Count: 2324 Status: Offline
There are a few bad work units that people might encounter:
Project Name: Discovering Dengue Drugs - Together - Phase 2
Created: 17/08/10
Name: ts05_a239_pqb005
Minimum Quorum: 2
Replication: 2
ts05_ a239_ pqb005_ 4-- | 617 | Server Aborted | 21/08/10 06:07:03 | 21/08/10 21:37:00 | 0.00 | 0.0 / 0.0
ts05_ a239_ pqb005_ 3-- | 617 | Error | 20/08/10 18:16:36 | 21/08/10 04:51:11 | 0.10 | 2.2 / 2.2
ts05_ a239_ pqb005_ 2-- | 617 | Error | 20/08/10 15:34:38 | 20/08/10 18:04:35 | 0.21 | 2.8 / 2.8
ts05_ a239_ pqb005_ 1-- | 617 | Error | 18/08/10 14:29:52 | 20/08/10 15:32:21 | 0.12 | 2.6 / 2.6
ts05_ a239_ pqb005_ 0-- | 617 | Error | 18/08/10 14:29:44 | 21/08/10 20:34:17 | 0.19 | 2.5 / 2.5

Tasks that may be somewhat error prone:

Project Name: Discovering Dengue Drugs - Together - Phase 2
Created: 17/08/10
Name: ts05_a169_pr89a0
Minimum Quorum: 2
Replication: 2
ts05_ a169_ pr89a0_ 3-- | 617 | Valid | 20/08/10 17:27:52 | 21/08/10 13:29:28 | 1.87 | 48.2 / 50.0
ts05_ a169_ pr89a0_ 2-- | 617 | Valid | 19/08/10 17:40:56 | 20/08/10 17:26:30 | 2.38 | 51.7 / 50.0
ts05_ a169_ pr89a0_ 1-- | 617 | Error | 18/08/10 08:06:27 | 19/08/10 17:37:15 | 0.00 | 0.0 / 0.0
ts05_ a169_ pr89a0_ 0-- | 617 | Error | 18/08/10 08:05:57 | 19/08/10 02:42:18 | 5.68 | 58.2 / 0.0

Some that we are not sure about yet:

Project Name: Discovering Dengue Drugs - Together - Phase 2
Created: 14/08/10
Name: ts05_d317_pr78a1
Minimum Quorum: 2
Replication: 2
ts05_ d317_ pr78a1_ 2-- | - | In Progress | 22/08/10 07:27:56 | 26/08/10 07:27:56 | 0.00 | 0.0 / 0.0
ts05_ d317_ pr78a1_ 1-- | 617 | Error | 15/08/10 19:14:02 | 16/08/10 11:45:50 | 5.65 | 59.1 / 0.0
ts05_ d317_ pr78a1_ 0-- | 617 | Inconclusive | 15/08/10 19:14:01 | 22/08/10 07:13:26 | 3.87 | 75.0 / 0.0

Some that should turn out OK:

Project Name: Discovering Dengue Drugs - Together - Phase 2
Created: 19/08/10
Name: ts05_d279_sr34a1
Minimum Quorum: 2
Replication: 2
ts05_ d279_ sr34a1_ 2-- | - | In Progress | 21/08/10 19:19:01 | 25/08/10 19:19:01 | 0.00 | 0.0 / 0.0
ts05_ d279_ sr34a1_ 1-- | 617 | Pending Validation | 20/08/10 22:59:10 | 21/08/10 00:23:22 | 1.15 | 18.6 / 0.0
ts05_ d279_ sr34a1_ 0-- | 617 | Error | 20/08/10 22:59:09 | 21/08/10 19:08:19 | 0.00 | 0.0 / 0.0

...and half a million that worked

[Edit 1 times, last edit by skgiven at Aug 22, 2010 2:44:01 PM]
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I have gotten no errors on windows xp, vista, or 7. We lost power yesterday so had to reboot all, but no problems. I am running BOINC 6.10.56.
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
This is a question regarding the Inconclusive status for a workunit. I see my system completed a WU in 0.93 hours while my wingman took 2.98 hours. Is this unusual or can the processing in some instances have that much of a difference? I am using a Lenovo ThinkPad in this case and I have not seen other WUs run in quite the same amount of time.
Thanks,
Dave

Workunit Status
Project Name: Discovering Dengue Drugs - Together - Phase 2
Created: 8/20/10
Name: ts05_e476_pr67a0
Minimum Quorum: 2
Replication: 3

Result Name | App Version Number | Status | Sent Time | Time Due / Return Time | CPU Time (hours) | Claimed / Granted BOINC Credit
ts05_ e476_ pr67a0_ 2-- | - | In Progress | 8/26/10 09:38:03 | 8/30/10 09:38:03 | 0.00 | 0.0 / 0.0
ts05_ e476_ pr67a0_ 1-- | 617 | Inconclusive | 8/22/10 06:12:35 | 8/25/10 16:19:01 | 0.93 | 14.2 / 0.0
ts05_ e476_ pr67a0_ 0-- | 617 | Inconclusive | 8/22/10 06:12:34 | 8/26/10 04:14:31 | 2.98 | 76.3 / 0.0
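One way to look at the gap between the two Inconclusive results is to compare claimed credit per CPU hour rather than raw run time. This is only a quick sketch using the figures from the table above; the short labels `_1` and `_0` stand for the two result copies, and the numbers say nothing about which result, if either, will validate.

```python
# CPU hours and claimed credit copied from the two Inconclusive rows above.
results = {
    "_1": {"cpu_hours": 0.93, "claimed": 14.2},
    "_0": {"cpu_hours": 2.98, "claimed": 76.3},
}

def claimed_per_hour(result):
    """Claimed BOINC credit divided by CPU time in hours."""
    return result["claimed"] / result["cpu_hours"]

for label, result in results.items():
    print(f"copy {label}: {claimed_per_hour(result):.2f} claimed credits per CPU hour")

runtime_ratio = results["_0"]["cpu_hours"] / results["_1"]["cpu_hours"]
print(f"copy _0 ran {runtime_ratio:.1f}x as long as copy _1")
```

The two copies differ both in run time and in the rate at which they claimed credit, so the difference is more than one machine simply being faster at the same amount of work.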