World Community Grid Forums
Thread Status: Active Total posts in this thread: 42
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
After 4% and 2 hrs, both WUs use 216MB RAM with 805MB VM. During checkpointing, the maximum RAM used jumped from 218MB to 221MB. Maybe this rise is the crucial point for my computer, since the error always occurred at .960%.
I made a new backup. After restarting, both WUs had 209MB RAM and 805MB VM, but RAM soon moved to 213MB before the percentage rose at all. BTW, the computer is an Intel i7 920 with 6GB RAM, HT active. VM is set to max 6434. The latest error occurred while running both remaining DDDT2 WUs along with 4 HCMD2 WUs and 2 HFCC WUs.
Concerning backups: when I wrote 'stop boinc' I meant stopping all programs/services/etc. of BOINC, not only the BOINC Manager. And as I wrote in my first posting, restoring the WU comprises the slot directory, the client state file (active_tasks, file error codes, result section) and the changed project file(s). CPDN has some extensive manuals on how to achieve it. If checkpointing works as it should, you can restore ANY WU from backup if you know what you are doing.
Of course, power failures during checkpoint writing are another topic. That way I had some HFCC WUs which restarted at 0% after 80% done, with active time remaining unchanged, getting the usual credit for twice the time... |
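The backup step described in this post could be sketched roughly as below. This is only an illustration, not the poster's actual procedure: the function name `backup_boinc_data`, the folder naming and any paths passed in are hypothetical, and the sketch assumes that every BOINC program and service has already been stopped by hand.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_data(data_dir: Path, backup_root: Path) -> Path:
    """Copy a whole BOINC data directory into a timestamped backup folder.

    Assumption: the BOINC client and all of its services are already
    stopped. This function does not check for or stop anything itself.
    """
    if not data_dir.is_dir():
        raise FileNotFoundError(f"BOINC data directory not found: {data_dir}")

    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"boinc-backup-{stamp}"

    # Copy everything in one go so that the slot directories, the project
    # files and the client state file all come from the same moment in time.
    shutil.copytree(data_dir, target)
    return target
```

Calling it would look like `backup_boinc_data(Path(r"C:\path\to\boinc\data"), Path(r"D:\backups"))`, with both paths replaced by whatever applies on your machine. Copying the whole directory rather than single files avoids having to pick out the matching pieces at backup time; the selective part (slot directory, state file sections, project files) only matters when restoring.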
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Hmmm, 221 + 805 is more than 1 GB... but that should give an exceeded-memory-size error. I have not looked at the rsc control parms to know what the set limit is.
Edit, from the help section's system requirements page:
Research Project | Memory | Available Disk Space | Operating Systems
WCG
Please help to make the Forums an enjoyable experience for All!
[Edit 1 times, last edit by Sekerob at Apr 18, 2010 9:28:46 AM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"It seems like these monster tasks for DDDT-2 were poorly designed, particularly in the case where ts05_a193_ps0000_1 and ts05_b159_ps0000_1 appear to be running successfully by stopping and starting BOINC. I think of BOINC as a user interface to see what is being executed, and not the actual execution of the tasks, which are being continuously executed in the background with BOINC active or not. Wouldn't suspending and resuming a task with BOINC have the same results? The checkpoint of the task is to provide a point at which to restart should your computer go down or need to be rebooted for some other reason, such as a Windows update for security reasons."
dkt, the tasks are not finished yet. They are currently at 68% and 92%, which is beyond the point reached by all other copies of those WUs, which had errors. It's just a theory which I am trying to support or refute. And suspending/resuming isn't sufficient as long as you have the leave_apps_in_memory option active (I didn't test whether it works without this option). For backups, all BOINC services must be stopped to avoid lock file conflicts. |
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
"After 4% and 2 hrs, both WUs use 216MB RAM with 805MB VM. During checkpointing, the maximum RAM used jumped from 218MB to 221MB. Maybe this rise is the crucial point for my computer, since the error always occurred at .960%. I made a new backup. After restarting, both WUs had 209MB RAM and 805MB VM, but RAM soon moved to 213MB before the percentage rose at all. BTW, the computer is an Intel i7 920 with 6GB RAM, HT active. VM is set to max 6434. The latest error occurred while running both remaining DDDT2 WUs along with 4 HCMD2 WUs and 2 HFCC WUs. Concerning backups: when I wrote 'stop boinc' I meant stopping all programs/services/etc. of BOINC, not only the BOINC Manager. And as I wrote in my first posting, restoring the WU comprises the slot directory, the client state file (active_tasks, file error codes, result section) and the changed project file(s). CPDN has some extensive manuals on how to achieve it. If checkpointing works as it should, you can restore ANY WU from backup if you know what you are doing. Of course, power failures during checkpoint writing are another topic. That way I had some HFCC WUs which restarted at 0% after 80% done, with active time remaining unchanged, getting the usual credit for twice the time..."
Ah, I see it now. Indeed, restoring and doing it right is not for the faint-hearted, set-and-forget, computing-for-the-masses cruncher. I've backed away from setting a VM maximum and only set a minimum these days, to reduce the chance of fragmentation. I've just created a mini partition on W7-64 and will be moving the VM writing there. This way, I anticipate, there will never be swap file fragmentation [page defrag no longer works]. |
||
|
Rickjb
Veteran Cruncher Australia Joined: Sep 17, 2006 Post Count: 666 Status: Offline |
Thanks for the explanation of how you resurrect dead WUs, Mathias (aka mweisensee). Very clever. Can you do it for humans, too?
However, before "everybody" starts doing this, I think we should hear the opinion of the scientists. It is possible that it could introduce bias into the results, as I explain in the thread "Changes to distribution of error work units", in my post about Monte Carlo methods in the DDDT2 CHARMM program.
I don't think that "... some of the code 29 errors are caused by extremely ambitious requirements of some (monster) tasks." It is more likely that some parameters are going out of range and causing the equivalent of the simulation bumping into the side of the simulation space. This explanation was given for some problems encountered by the CEP. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"Thanks for the explanation of how you resurrect dead WUs, Mathias (aka mweisensee). Very clever. Can you do it for humans, too?"
I could. But at the moment I see no need to do so. If more than a handful of WUs need special treatment, there is something wrong. The average WCG user should not need to edit the client state file, because doing so can crash all of his current WUs (yes, that's first-hand experience...). CPDN has some very detailed manuals on how to do it because it can be very useful for their long-running tasks. WCG should not need it.
And remember, my WUs are still not complete. If the results have to be bitwise comparable, they can still have errors at 99.96%, even when using Monte Carlo. I'll keep you informed whether they complete or crash.
Matthias |
||
|
pirogue
Veteran Cruncher USA Joined: Dec 8, 2008 Post Count: 685 Status: Offline |
"PS: v.v Resources, if anyone sees more than 210Mb RAM use and 730Mb VM for the A-Type, please speak up with the result name."
How much more RAM is significant? Here are 3 that I'm currently running that use more than 210MB:
ts05_a048_ps0000_2 - using 211,562K of RAM.
ts05_b494_ps0000_1 - using 215,220K of RAM.
ts05_b426_ps0000_0 - using 216,744K of RAM.
I don't know how much VM. |
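Whether these three results really exceed 210MB depends on how the "K" figures are converted. The short sketch below shows both readings. It assumes the values are kilobyte counts as shown by a process monitor, and the helper `kb_to_mb` is made up for illustration; the thread does not say which convention the 210MB threshold uses.

```python
# RAM figures exactly as reported in the post above, in K.
REPORTED_KB = {
    "ts05_a048_ps0000_2": 211_562,
    "ts05_b494_ps0000_1": 215_220,
    "ts05_b426_ps0000_0": 216_744,
}

THRESHOLD_MB = 210  # the figure asked about in the quoted request


def kb_to_mb(kb: int, binary: bool = True) -> float:
    """Convert K to MB using 1024 K per MB (binary) or 1000 K per MB."""
    return kb / (1024 if binary else 1000)


for name, kb in REPORTED_KB.items():
    mb_binary = kb_to_mb(kb, binary=True)
    mb_decimal = kb_to_mb(kb, binary=False)
    print(
        f"{name}: {mb_binary:.1f} MB (1024-based), "
        f"{mb_decimal:.1f} MB (1000-based), "
        f"over threshold (1024-based): {mb_binary > THRESHOLD_MB}"
    )
```

With the 1024-based conversion, the first result comes out below 210 MB and the other two only slightly above it, so the excess is small under either reading.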
||
|
Ingleside
Veteran Cruncher Norway Joined: Nov 19, 2005 Post Count: 974 Status: Offline |
"Hmmm, 221 + 805 is more than 1 GB... but that should give an exceeded-memory-size error. I have not looked at the rsc control parms to know what the set limit is."
The WU parameter for the memory limit isn't enforced by the client; it is used by the scheduling server when deciding whether it can send a task to the client. The user-supplied maximum limits of "Use at most N % of memory..." on the other hand are enforced.
So, if a project has misconfigured its WUs with too low a memory requirement, for example 100 MB when actual usage is 500 MB, any 1 GB computer with a memory setting of 50% or higher won't error out any of these WUs. The same 1 GB computer with a 40% memory setting, on the other hand, will error out all of these WUs, since its maximum memory limit was exceeded.
"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
[Edit 1 times, last edit by Ingleside at Apr 18, 2010 1:29:29 PM] |
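The 1 GB example above can be checked with a few lines of arithmetic. This is a simplified sketch, not the BOINC client's actual logic: the function names are invented for illustration, 1 GB is taken as 1024 MB, and only a single task is considered.

```python
def user_memory_limit_mb(total_ram_mb: float, use_at_most_pct: float) -> float:
    """Memory allowed by a 'Use at most N % of memory' style preference."""
    return total_ram_mb * use_at_most_pct / 100.0


def would_error_out(task_usage_mb: float, total_ram_mb: float,
                    use_at_most_pct: float) -> bool:
    """True if a single task's actual usage exceeds the user-set limit."""
    return task_usage_mb > user_memory_limit_mb(total_ram_mb, use_at_most_pct)


TOTAL_RAM_MB = 1024      # the 1 GB computer from the example
ACTUAL_USAGE_MB = 500    # real usage of the misconfigured WU

for pct in (50, 40):
    limit = user_memory_limit_mb(TOTAL_RAM_MB, pct)
    verdict = "errors out" if would_error_out(ACTUAL_USAGE_MB, TOTAL_RAM_MB, pct) else "runs"
    print(f"{pct}% setting -> limit {limit:.1f} MB -> task {verdict}")
```

At 50% the limit is 512 MB, which is above the 500 MB actually used. At 40% the limit is 409.6 MB, which the task exceeds. The 100 MB figure declared by the project plays no part in either check, which matches the point that it only matters to the scheduling server.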
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
pirogue,
In Task Manager you can add columns for the different items measured. In Process Explorer (by MS Sysinternals, too), you can right-click for the properties screen and visit the performance tab. There's a Virtual and a Physical memory section.
Edit: address
[Edit 1 times, last edit by Sekerob at Apr 18, 2010 1:31:16 PM] |
||
|
pirogue
Veteran Cruncher USA Joined: Dec 8, 2008 Post Count: 685 Status: Offline |
All of the VM Sizes are < 693MB.