World Community Grid - View Thread - Getting lots of Computation errors on DDDT2

Forum: Discovering Dengue Drugs - Together - Phase 2 Forum

Thread: Getting lots of Computation errors on DDDT2

Thread Status: Active
Total posts in this thread: 33

This topic has been viewed 11543 times and has 32 replies

LCB001
Advanced Cruncher
CANADA
Joined: Oct 14, 2009
Post Count: 69
Status: Offline

Re: Getting lots of Computation errors on DDDT2

6.10.17 - Vista x64 HP + 2x gpu Folding@Home
6.10.18 - Vista x64 HP + 1x gpu Folding@Home
6.10.18 - W7 x64 Ult + 3x gpu Folding@Home
6.10.18 - W7 x64 HP + 1x gpu SETI (part-time) - Laptop

Zero Problems with DDDT2

----------------------------------------

[Aug 19, 2010 9:39:42 PM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: Getting lots of Computation errors on DDDT2

I'm running 6.10.56 on my Vista without any issues...

My Windows XP with a slightly older version did run into a block of errored WUs in a row (like 12), but I haven't been running into many errors and all the errors were less than 1 min.

Have you confirmed these were DDDT2 or is sleeplessness causing you to mix this with HPF2's typical behavior?

----------------------------------------

WCG

Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!

[Aug 20, 2010 1:52:26 AM]

Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline

Re: Getting lots of Computation errors on DDDT2

Got just one bad DDDT2 WU: ts05_a239_pqa008
Mine was copy _4, copies _1, _2 & _3 gave the same error, with these lines in their error logs:
> The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
> ...
> CHARGE OUTSIDE INNER GSBP REGION
> Encountered error. Exiting.
The error occurred early in the WU, after about 2.4 claimed credit's worth of crunching.
Copy _0 is still In progress.

Looks like a genuine bad WU to me ...

[Aug 20, 2010 5:56:13 AM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: Getting lots of Computation errors on DDDT2

Seemingly there are still devices that can handle them straight out, or rather, some that can't:

ts05_ a192_ pca003_ 2-- 617 Valid 8/18/10 17:57:44 8/19/10 15:13:07 4.65 58.1 / 85.1 < Joe, The Underclaiming Repairman on 6.2.15
ts05_ a192_ pca003_ 1-- 617 Error 8/18/10 11:36:17 8/18/10 17:46:27 1.24 40.6 / 40.6
ts05_ a192_ pca003_ 0-- 617 Valid 8/18/10 11:35:54 8/20/10 10:24:58 4.72 85.1 / 85.1 < Moi, in grant hog heaven.

Bill, The Error Generator's log:

Result Name: ts05_ a192_ pca003_ 1--
<core_client_version>6.10.56</core_client_version>
<![CDATA[
<message>
process exited with code 29 (0x1d, -227)
</message>
<stderr_txt>
Calling gridPlatform.init()
INFO: No state to restore. Start from the beginning.
CHARGE OUTSIDE INNER GSBP REGION
Encountered error. Exiting.

</stderr_txt>
]]>

-227 = ERR_RMDIR. In BOINC 6.0 and above: Remove (delete) directory failed.
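The three numbers in "process exited with code 29 (0x1d, -227)" can all be read as views of the same byte, which suggests the match with ERR_RMDIR may be a coincidence. Here is a minimal sketch of that arithmetic. The helper name is hypothetical, and it is only an assumption that the client derives the third number by setting every bit above the low byte.

```python
def exit_code_views(exit_code):
    """Return (decimal, hex, high-bits-set) views of an 8-bit exit code.

    Assumption: the negative number in the log is the low byte with all
    higher bits set, i.e. low_byte - 256 in two's complement.
    """
    low_byte = exit_code & 0xFF
    high_bits_set = (~0xFF) | low_byte
    return low_byte, hex(low_byte), high_bits_set


print(exit_code_views(29))  # (29, '0x1d', -227)
```

Read this way, 29 is the Windows error from the earlier log ("The system cannot write to the specified device. (0x1d)") and -227 is just 29 - 256.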


[Aug 20, 2010 10:49:49 AM]

evilkats
Senior Cruncher
USA
Joined: May 4, 2007
Post Count: 162
Status: Offline

Re: Getting lots of Computation errors on DDDT2

I think that most errors are due to the fact that each WU uses something like 400-500 MB of page file. Multiply that by the number of cores and you get a situation where the system just can't allocate that much space in the few seconds the task requires. This would not be a problem for systems running 24/7, where tasks gradually start and finish in sequence, but on a system that runs on a schedule it is. When 4+ tasks all request 400+ MB of swap file each at once, you get nothing but failed tasks with
'INFO: No state to restore. Start from the beginning.
forrtl: severe (98): cannot allocate memory for the file buffer - out of memory'
within seconds. What's worse, the failed task may not release all that paged space and more tasks will fail in sequence because the system is out of resources. I had 20 tasks fail one after another in the matter of minutes on a 16 Core server twice already in the past week.
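To put rough numbers on that, here is a back-of-the-envelope sketch. The function names and the 4096 MB of free page file are made up for illustration. Only the 500 MB per task and the 16 cores come from the post.

```python
def peak_pagefile_demand_mb(tasks_starting, per_task_mb):
    """Page file requested if all tasks start at the same moment."""
    return tasks_starting * per_task_mb


def tasks_that_fit(available_mb, per_task_mb):
    """How many tasks can reserve their page file before it runs out."""
    return available_mb // per_task_mb


# 16 cores, each task asking for about 500 MB (upper figure from the post).
print(peak_pagefile_demand_mb(16, 500))  # 8000

# Hypothetical machine with 4096 MB of page file still free.
print(tasks_that_fit(4096, 500))  # 8
```

On those assumed figures, only half of the 16 tasks could start and the rest would fail with the out-of-memory error.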

[Aug 20, 2010 1:16:59 PM]

Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline


Re: Getting lots of Computation errors on DDDT2

> evilkats: What's worse, the failed task may not release all that paged space and more tasks will fail in sequence because the system is out of resources. I had 20 tasks fail one after another in the matter of minutes on a 16 Core server twice already in the past week.

If your machine runs out of pagefile space, the error log files can contain:
> CreateProcess() failed - The paging file is too small for this operation to complete. (0x5af)
I bumped into this when I suspended some DDDT2 WUs to change the order in which they would be run, and when I hit the pagefile limit, the rest of the WUs in the cache all tried to start but failed, and they went down like a row of dominoes.
I think the programmers of the DDDT2 CHARMM software expected it to be run under proper operating systems, where only those chunks of the 500MB VM reservation that are actually written to or read from are taken from the pagefile data pool. Sekerob mentioned a few days ago that his Linux system(s) use very little pagefile space with DDDT2, and I think that is the explanation. Also, if DDDT2 was actually using all of the 500MB of pagefile space per WU, we would be seeing much more disc activity, especially on multi-core and HT machines.

[Aug 22, 2010 6:29:14 AM]

Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline


Re: Getting lots of Computation errors on DDDT2

The sequential starting of same-science tasks to try to find one that fits in the available memory was, frankly, stupid. The newest clients were supposed to learn what each science app needs and not churn through the whole buffer of the same. Properly, if not enough memory space is available, the client is supposed to suspend one or more cores, so what was that client version/platform again?

Review your Disk/Memory permissions too. They work to the 'least of all' rule... or or or. 50% of 2GB free space is less than 10% of 100GB.
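The 'least of all' rule can be sketched as taking the minimum of every cap that applies. This is only an illustration of the arithmetic in the example above, with hypothetical helper names. It is not the client's actual preference logic.

```python
def pct_of(percent, size_gb):
    """Percentage of a size in GB, e.g. pct_of(50, 2.0) -> 1.0."""
    return percent * size_gb / 100


def least_of_all(*caps_gb):
    """The effective limit is whichever cap is smallest."""
    return min(caps_gb)


half_of_free = pct_of(50, 2.0)      # 50% of 2 GB free  -> 1.0 GB
tenth_of_total = pct_of(10, 100.0)  # 10% of 100 GB     -> 10.0 GB

print(least_of_all(half_of_free, tenth_of_total))  # 1.0
```

So a generous-looking percentage of the total disk does not help if another setting, applied to a small amount of free space, works out lower.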

Yes, running system monitor suggests very little VM use even with 4 concurrent. Currently it has 2 DDDT2 and 2 CEP2 + 2 waiting to start DDDT2's (to skip the CEP2 jobs ahead), i.e. 6 in memory with LAIM (Leave app in memory when pre-empted) on. SM shows 1% of 3GB VM used **. I'm not sure, but presume that LAIM is on at evilkats' end for that build-up to take place. Can't remember anyone mentioning this in this thread. DDDT2 has short checkpoints, so LAIM is not all too important (works instantaneously from local prefs). For CEP2 it is, as checkpoints can be multiple hours apart.

** conflicts with TOP info, showing fixed VM use of 399M for DDDT2 and 305M for CEP2?

edit: inserted in first line "one that fits in the available memory was"

----------------------------------------
[Edit 1 times, last edit by Sekerob at Aug 22, 2010 4:28:13 PM]

[Aug 22, 2010 7:02:02 AM]

sk..
Master Cruncher
Joined: Mar 22, 2007
Post Count: 2324
Status: Offline

Re: Getting lots of Computation errors on DDDT2

There are a few bad work units that people might encounter:

Project Name: Discovering Dengue Drugs - Together - Phase 2
Created: 17/08/10
Name: ts05_a239_pqb005
Minimum Quorum: 2
Replication: 2

ts05_ a239_ pqb005_ 4-- 617 Server Aborted 21/08/10 06:07:03 21/08/10 21:37:00 0.00 0.0 / 0.0
ts05_ a239_ pqb005_ 3-- 617 Error 20/08/10 18:16:36 21/08/10 04:51:11 0.10 2.2 / 2.2
ts05_ a239_ pqb005_ 2-- 617 Error 20/08/10 15:34:38 20/08/10 18:04:35 0.21 2.8 / 2.8
ts05_ a239_ pqb005_ 1-- 617 Error 18/08/10 14:29:52 20/08/10 15:32:21 0.12 2.6 / 2.6
ts05_ a239_ pqb005_ 0-- 617 Error 18/08/10 14:29:44 21/08/10 20:34:17 0.19 2.5 / 2.5

Tasks that may be somewhat error prone:

Project Name: Discovering Dengue Drugs - Together - Phase 2
Created: 17/08/10
Name: ts05_a169_pr89a0
Minimum Quorum: 2
Replication: 2

ts05_ a169_ pr89a0_ 3-- 617 Valid 20/08/10 17:27:52 21/08/10 13:29:28 1.87 48.2 / 50.0
ts05_ a169_ pr89a0_ 2-- 617 Valid 19/08/10 17:40:56 20/08/10 17:26:30 2.38 51.7 / 50.0
ts05_ a169_ pr89a0_ 1-- 617 Error 18/08/10 08:06:27 19/08/10 17:37:15 0.00 0.0 / 0.0
ts05_ a169_ pr89a0_ 0-- 617 Error 18/08/10 08:05:57 19/08/10 02:42:18 5.68 58.2 / 0.0

Some that we are not sure about yet:

Project Name: Discovering Dengue Drugs - Together - Phase 2
Created: 14/08/10
Name: ts05_d317_pr78a1
Minimum Quorum: 2
Replication: 2

ts05_ d317_ pr78a1_ 2-- - In Progress 22/08/10 07:27:56 26/08/10 07:27:56 0.00 0.0 / 0.0
ts05_ d317_ pr78a1_ 1-- 617 Error 15/08/10 19:14:02 16/08/10 11:45:50 5.65 59.1 / 0.0
ts05_ d317_ pr78a1_ 0-- 617 Inconclusive 15/08/10 19:14:01 22/08/10 07:13:26 3.87 75.0 / 0.0

Some that should turn out OK:

Project Name: Discovering Dengue Drugs - Together - Phase 2
Created: 19/08/10
Name: ts05_d279_sr34a1
Minimum Quorum: 2
Replication: 2

ts05_ d279_ sr34a1_ 2-- - In Progress 21/08/10 19:19:01 25/08/10 19:19:01 0.00 0.0 / 0.0
ts05_ d279_ sr34a1_ 1-- 617 Pending Validation 20/08/10 22:59:10 21/08/10 00:23:22 1.15 18.6 / 0.0
ts05_ d279_ sr34a1_ 0-- 617 Error 20/08/10 22:59:09 21/08/10 19:08:19 0.00 0.0 / 0.0

...and half a million that worked.
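For anyone wanting to sort their own result pages the same way, here is a rough sketch that pulls the status out of rows in the format pasted above and flags a workunit where every returned copy errored. The regular expression and the "at least two returned copies, all Error" rule are my own assumptions, not anything WCG uses.

```python
import re

# name ending in "--", app version (digits or "-"), status words, then a date
ROW = re.compile(
    r"^(?P<name>\S.*?--)\s+(?P<ver>\d+|-)\s+(?P<status>[A-Za-z ]+?)\s+"
    r"\d{1,2}/\d{1,2}/\d{2}\s"
)


def parse_status(line):
    """Return the status text ('Error', 'Valid', 'In Progress', ...)."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError("unrecognised result row: " + line)
    return match.group("status")


def looks_like_bad_workunit(lines):
    """True if at least two copies came back and all of them errored."""
    statuses = [parse_status(line) for line in lines]
    returned = [s for s in statuses
                if s not in ("In Progress", "Server Aborted")]
    return len(returned) >= 2 and all(s == "Error" for s in returned)


PQB005 = [
    "ts05_ a239_ pqb005_ 4-- 617 Server Aborted 21/08/10 06:07:03 21/08/10 21:37:00 0.00 0.0 / 0.0",
    "ts05_ a239_ pqb005_ 3-- 617 Error 20/08/10 18:16:36 21/08/10 04:51:11 0.10 2.2 / 2.2",
    "ts05_ a239_ pqb005_ 2-- 617 Error 20/08/10 15:34:38 20/08/10 18:04:35 0.21 2.8 / 2.8",
    "ts05_ a239_ pqb005_ 1-- 617 Error 18/08/10 14:29:52 20/08/10 15:32:21 0.12 2.6 / 2.6",
    "ts05_ a239_ pqb005_ 0-- 617 Error 18/08/10 14:29:44 21/08/10 20:34:17 0.19 2.5 / 2.5",
]
PR89A0 = [
    "ts05_ a169_ pr89a0_ 3-- 617 Valid 20/08/10 17:27:52 21/08/10 13:29:28 1.87 48.2 / 50.0",
    "ts05_ a169_ pr89a0_ 2-- 617 Valid 19/08/10 17:40:56 20/08/10 17:26:30 2.38 51.7 / 50.0",
    "ts05_ a169_ pr89a0_ 1-- 617 Error 18/08/10 08:06:27 19/08/10 17:37:15 0.00 0.0 / 0.0",
    "ts05_ a169_ pr89a0_ 0-- 617 Error 18/08/10 08:05:57 19/08/10 02:42:18 5.68 58.2 / 0.0",
]

print(looks_like_bad_workunit(PQB005))  # True
print(looks_like_bad_workunit(PR89A0))  # False
```

A mix of Valid and Error on the same workunit, as with pr89a0, points more towards the hosts than the workunit itself.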

----------------------------------------
[Edit 1 times, last edit by skgiven at Aug 22, 2010 2:44:01 PM]

[Aug 22, 2010 2:06:23 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Getting lots of Computation errors on DDDT2

I have gotten no errors on Windows XP, Vista, or 7. We lost power yesterday so had to reboot all, but no problems. I am running BOINC 6.10.56.

[Aug 22, 2010 4:19:18 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Getting lots of Computation errors on DDDT2

This is a question regarding the Inconclusive status for a workunit. I see my system completed a WU in 0.93 hours while my wingman took 2.98 hours. Is this unusual or can the processing in some instances have that much of a difference? I am using a Lenovo ThinkPad in this case and I have not seen other WUs run in quite the same amount of time.

Thanks,
Dave

Workunit Status

Project Name: Discovering Dengue Drugs - Together - Phase 2
Created: 8/20/10
Name: ts05_e476_pr67a0
Minimum Quorum: 2
Replication: 3

Result Name   App Version Number   Status   Sent Time   Time Due / Return Time   CPU Time (hours)   Claimed / Granted BOINC Credit
ts05_ e476_ pr67a0_ 2-- - In Progress 8/26/10 09:38:03 8/30/10 09:38:03 0.00 0.0 / 0.0
ts05_ e476_ pr67a0_ 1-- 617 Inconclusive 8/22/10 06:12:35 8/25/10 16:19:01 0.93 14.2 / 0.0
ts05_ e476_ pr67a0_ 0-- 617 Inconclusive 8/22/10 06:12:34 8/26/10 04:14:31 2.98 76.3 / 0.0

[Aug 28, 2010 12:04:29 AM]
