Thread Status: Active
Total posts in this thread: 42
This topic has been viewed 51755 times and has 41 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: exited with code 29 (0x1d, -227)

After 4% and 2hrs both WUs use 216MB with 805MB VM. During checkpointing, the max used RAM value jumped from 218MB to 221MB. Maybe this rise is the crucial point for my computer since the error occurred always with percentage .960%.
I made a new backup. When restarting both WUs had 209MB RAM and 805MB VM, but RAM moved soon to 213MB before the percentage was raised at all.
BTW the computer is an intel i7 920 with 6GB RAM, HT active. VM is set to max 6434. The latest error occurred while running both remaining DDDT2 WUs along with 4 HCMD2 WUs and 2 HFCC WUs.
And concerning backups - if I wrote 'stop boinc' I meant stopping all programs/services/etc. of boinc, not only the boinc manager. And as I wrote in my first posting restoring the WU comprises the slot directory, the client state file (active_tasks, file error codes, result section) and the changed project file(s). CPDN has some extensive manuals how to achieve it. And if checkpointing works as it should you can restore ANY WU from backup if you know what you are doing. Of course power failures during checkpoint writing are another topic - that way I had some HFCC WUs which restarted at 0% after 80% done with active time remaining unchanged getting the usual credit for twice the time...
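The backup half of this procedure could be sketched roughly as below. This is only an illustration, not the CPDN procedure: the helper name `backup_boinc_state` and the backup layout are made up for this example, and it assumes the usual BOINC data directory layout (`client_state.xml`, `slots/`, `projects/`) and that every BOINC process and service has already been stopped.

```python
import shutil
from pathlib import Path

# Items named in the post: client state file, slot directories, project files.
# Assumes the usual BOINC data directory layout.
ITEMS = ("client_state.xml", "slots", "projects")


def backup_boinc_state(data_dir, backup_dir):
    """Copy the state file, slots and project files into backup_dir.

    Hypothetical helper: run it only after ALL BOINC processes and
    services are stopped, otherwise lock files or half-written
    checkpoints may end up in the copy.
    Returns the names of the items that were actually copied.
    """
    src = Path(data_dir)
    dst = Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in ITEMS:
        item = src / name
        if item.is_dir():
            shutil.copytree(item, dst / name, dirs_exist_ok=True)
        elif item.is_file():
            shutil.copy2(item, dst / name)
        else:
            continue  # item not present in this installation, skip it
        copied.append(name)
    return copied
```

Restoring would be the copy in the other direction. The edits to the client state file mentioned above (active_tasks, file error codes, result section) are not covered by this sketch.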
[Apr 18, 2010 9:10:11 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: exited with code 29 (0x1d, -227)

hmmm 221+805 is more than 1 GB... but that should give an exceeded memory size error. Not looked at the rsc control parms to know what the set limit is.

Edit: from the System Requirements page in the Help section:
Research Project  	Memory Available  	Disk Space  	Operating Systems
Discovering Dengue Drugs - Together - Phase 2  	1024 MB  	250 MB  	Windows1, Mac, Linux3

----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
----------------------------------------
[Edit 1 times, last edit by Sekerob at Apr 18, 2010 9:28:46 AM]
[Apr 18, 2010 9:26:50 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: exited with code 29 (0x1d, -227)

It seems like these monster tasks for DDDT-2 were poorly designed, particularly in the case where ts05_a193_ps0000_1 and ts05_b159_ps0000_1 appear to be running successfully by stopping and starting BOINC. I think of BOINC as a user interface to see what is being executed, and not the actual execution of the tasks which are being continuously executed in the background with BOINC active or not. Wouldn't suspending and resuming a task with BOINC have the same results? The checkpoint of the task is to provide a point at which to restart should your computer go down or needs to be rebooted for some other reason such as a Windows update for security reasons.


dkt, the tasks are not finished yet - currently at 68% and 92%, but beyond all other copies of those WUs, which had errors. It's just a theory which I am trying to support or refute.
And suspending/resuming isn't sufficient as long as you have the leave_apps_in_memory option active (didn't test whether it works without this option). For backups all boinc services must be stopped to avoid lock file conflicts.
[Apr 18, 2010 9:32:58 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: exited with code 29 (0x1d, -227)

After 4% and 2hrs both WUs use 216MB with 805MB VM. During checkpointing, the max used RAM value jumped from 218MB to 221MB. Maybe this rise is the crucial point for my computer since the error occurred always with percentage .960%.
I made a new backup. When restarting both WUs had 209MB RAM and 805MB VM, but RAM moved soon to 213MB before the percentage was raised at all.
BTW the computer is an intel i7 920 with 6GB RAM, HT active. VM is set to max 6434. The latest error occurred while running both remaining DDDT2 WUs along with 4 HCMD2 WUs and 2 HFCC WUs.
And concerning backups - if I wrote 'stop boinc' I meant stopping all programs/services/etc. of boinc, not only the boinc manager. And as I wrote in my first posting restoring the WU comprises the slot directory, the client state file (active_tasks, file error codes, result section) and the changed project file(s). CPDN has some extensive manuals how to achieve it. And if checkpointing works as it should you can restore ANY WU from backup if you know what you are doing. Of course power failures during checkpoint writing are another topic - that way I had some HFCC WUs which restarted at 0% after 80% done with active time remaining unchanged getting the usual credit for twice the time...

Ah, I see it now. Indeed, restoring and doing it right is not for the faint-hearted, computing-for-the-masses, set-and-forget cruncher. I've backed away from the VM max setting, only setting a minimum these days to reduce the chance of fragmentation. I've just created a mini partition on W7-64 and will be moving the VM writing there. This way, I anticipate, there will never be swap file fragmentation [page defrag no longer works].
[Apr 18, 2010 9:47:08 AM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: exited with code 29 (0x1d, -227)

Thanks for the explanation of how you resurrect dead WUs, Matthias (aka mweisensee). Very clever. Can you do it for humans, too? (wink)
However, before "everybody" starts doing this, I think we should hear the opinion of the scientists. It is possible that it could introduce bias into the results, as I explain in the thread "Changes to distribution of error work units", in my post about Monte Carlo methods in the DDDT2 CHARMM program.

I don't think that "... some of the code 29 errors are caused by extremely ambitious requirements of some (monster) tasks." It is more likely that some parameters are going out of range and causing the equivalent of the simulation bumping into the side of the simulation space. This explanation was given for some problems encountered by the CEP.
[Apr 18, 2010 11:23:07 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: exited with code 29 (0x1d, -227)

Thanks for the explanation of how you resurrect dead WUs, Matthias (aka mweisensee). Very clever. Can you do it for humans, too? (wink)


I could. But at the moment I see no need to do so. If more than a handful of WUs need special treatment, there is something wrong. The average WCG user should not need to edit the client state file because it can crash all of his current WUs (yes, it's a first-hand experience...). CPDN has some very detailed manuals on how to do it because it can be very useful for their long-running tasks. WCG should not need it.
And remember - my WUs are still not complete. If the results have to be bitwise comparable, they can still fail at 99.96%, even when using Monte Carlo.
I'll keep you informed whether they complete or crash.

Matthias
[Apr 18, 2010 12:12:43 PM]
pirogue
Veteran Cruncher
USA
Joined: Dec 8, 2008
Post Count: 685
Status: Offline
Re: exited with code 29 (0x1d, -227)

PS: v.v Resources, if anyone sees more than 210Mb RAM use and 730Mb VM for the A-Type, please speak up with the result name.

How much more RAM is significant?

Here are 3 that I'm currently running that use more than 210MB:
ts05_a048_ps0000_2 - using 211,562K of RAM.
ts05_b494_ps0000_1 - using 215,220K of RAM.
ts05_b426_ps0000_0 - using 216,744K of RAM.
I don't know how much VM.
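Whether these figures are above 210MB depends on how the "K" is read. A small sketch, assuming the three values above are Task Manager figures in units of 1024 bytes; `parse_k` is just a helper name for this example:

```python
def parse_k(text):
    """Turn a Task Manager style figure such as '211,562K' into an int."""
    return int(text.rstrip("Kk").replace(",", ""))


tasks = {
    "ts05_a048_ps0000_2": "211,562K",
    "ts05_b494_ps0000_1": "215,220K",
    "ts05_b426_ps0000_0": "216,744K",
}

for name, raw in tasks.items():
    k = parse_k(raw)
    # Two readings: 1 MB = 1000 K (decimal) or 1 MB = 1024 K (binary).
    print(f"{name}: {k / 1000:.1f} MB decimal, {k / 1024:.1f} MB binary")
```

With the decimal reading all three are above 210; with the binary reading the first one comes out at about 206.6 and drops below it.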
[Apr 18, 2010 1:07:18 PM]
Ingleside
Veteran Cruncher
Norway
Joined: Nov 19, 2005
Post Count: 974
Status: Offline
Re: exited with code 29 (0x1d, -227)

hmmm 221+805 is more than 1 GB... but that should give an exceeded memory size error. Not looked at the rsc control parms to know what the set limit is.

The WU parameter for the memory limit isn't enforced by the client; it is used by the scheduling server when deciding whether or not to send a task to the client.

The user-supplied max limits of "Use at most N % of memory...", on the other hand, are enforced.

So, if a project has misconfigured its WUs with too low a memory requirement of, for example, 100 MB, but actual usage is 500 MB, any 1 GB computer with a memory setting of 50% or higher won't error out any of these WUs. The same 1 GB computer with a 40% memory setting, on the other hand, will error out all of these WUs, since its max memory limit was exceeded.
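The example can be put as a small sketch. It only models the behaviour as described in this post, not the actual BOINC client or scheduler code. The function names are made up, 1 GB is taken as 1024 MB, and comparing the declared bound against the preference-limited memory on the server side is an assumption.

```python
def scheduler_would_send(wu_mem_bound_mb, usable_mb):
    """Server side: only the WU's declared memory bound is compared.

    Assumption: the comparison is against the memory the host may use.
    """
    return wu_mem_bound_mb <= usable_mb


def client_would_abort(actual_usage_mb, ram_mb, use_at_most_pct):
    """Client side: the user's 'use at most N %' preference is enforced."""
    limit_mb = ram_mb * use_at_most_pct / 100
    return actual_usage_mb > limit_mb


RAM_MB = 1024        # the 1 GB computer from the example
DECLARED_MB = 100    # misconfigured requirement
ACTUAL_MB = 500      # real usage

for pct in (50, 40):
    usable = RAM_MB * pct / 100
    print(pct, scheduler_would_send(DECLARED_MB, usable),
          client_would_abort(ACTUAL_MB, RAM_MB, pct))
```

In both cases the task is sent, because 100 MB fits. At 50% the limit is 512 MB and the 500 MB task survives; at 40% the limit is 409.6 MB and the task errors out.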
----------------------------------------


"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
----------------------------------------
[Edit 1 times, last edit by Ingleside at Apr 18, 2010 1:29:29 PM]
[Apr 18, 2010 1:28:01 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: exited with code 29 (0x1d, -227)

pirogue,

In Task Manager you can add columns for the different items measured. In Process Explorer (from MS Sysinternals), you can right-click a process to open the properties screen and go to the Performance tab. There's a Virtual and Physical memory section.

edit: address
[Edit 1 times, last edit by Sekerob at Apr 18, 2010 1:31:16 PM]
[Apr 18, 2010 1:29:54 PM]
pirogue
Veteran Cruncher
USA
Joined: Dec 8, 2008
Post Count: 685
Status: Offline
Re: exited with code 29 (0x1d, -227)

All of the VM Sizes are < 693MB.
[Apr 18, 2010 1:40:24 PM]