World Community Grid Forums
Thread Status: Active Total posts in this thread: 100
SekeRob
Master Cruncher Joined: Jan 7, 2013 Post Count: 2741 Status: Offline |
The 'hardcoded' checkpoint is at the end of every structure computed in a task, which is a natural point with a smaller intermediate result file; that is why checkpoints happen only at the end of a structure. If 10 targets are included in a task, there are 10 checkpoints. BUT the computing effort for a structure is variable and non-deterministic, so the intervals vary, as does the number of structures packaged in a task. Neither that number nor the checkpoint sequence number is advertised in the event log, even if checkpoint_debug logging is switched on. The only way I know to find out how many structures are in the task is to visit the job slot and open the stderr file, which prints the count near the top.
The but, but, but you may address to the techs/programmers.
----------------------------------------
[Edit 1 times, last edit by SekeRob* at Sep 18, 2017 4:34:35 PM] |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7675 Status: Offline |
I have noticed another anomaly with this project. I have some machines which run 24/7 and others I need to shut down on a regular basis. One of the systems I shut down is running Windows Vista. I have found with the MIP project that I need to explicitly exit BOINC before initiating a shutdown, because if I just do a normal shutdown, any MIP units that were running will immediately throw an error when that machine is started later. I do not recall any other project having this problem in the past, nor has it been a problem with any of the other current projects. So there is a workaround that keeps the problem from recurring, but perhaps the techs/programmers could also take a peek at this issue.
----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Interesting observation: I restarted BOINC (no reboot) on two Linux machines and the WUs resumed using roughly half of their original amount of memory. They were using about 360MB before the restart and 150MB after. One box has been running for about 30 minutes since the restart and the WUs still haven't grown back to their original size. On my one Windows 7 box, the WUs use about 100MB less memory than on Linux.
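For anyone who wants to log the before/after numbers rather than read them off a monitor, here is a minimal Python sketch. It assumes a Linux host, where `/proc/<pid>/status` exposes the resident set size as the `VmRSS` field (the same quantity `top` reports as RES). The function names are made up for this illustration and are not part of BOINC or the MIP1 application.

```python
import re

def parse_vmrss_kb(status_text):
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else None

def rss_mb(pid):
    """Resident memory of a running process in MB (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        kb = parse_vmrss_kb(f.read())
    return None if kb is None else kb / 1024

def fraction_of_original(before_mb, after_mb):
    """Share of the pre-restart memory still in use after the restart."""
    return after_mb / before_mb

# Figures quoted in the post: about 360MB before, about 150MB after.
print(f"{fraction_of_original(360, 150):.0%} of the original footprint")
```

Calling `rss_mb(pid)` once before and once after the restart for each WU process gives the two values to compare.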
|
||
|
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1673 Status: Offline |
Dear MIP scientists and WCG Tech Team,
We would really appreciate your feedback about the observed problems regarding MIP1; at the least, we would like to read that you acknowledge that the science is not really efficient at this time and that it is causing trouble.
We are volunteers. We support your projects at our own cost, and the minimum fair play we expect is to be able to run well-designed and efficient science applications.
At this time, multiple contributors report reproducible inappropriate MIP1 behaviours and we do not receive any feedback from your side; this situation is not OK!
Should we all rigorously boycott this new science to make you aware that there is a significant problem?
Yves
---
(addition) PS: You report that the project is temporarily suspended. I suppose that this announcement is related to the batch failures of the previous days. I am not sure that the performance issue is the reason for the suspension. Again, feedback would be really appreciated.
----------------------------------------
[Edit 2 times, last edit by KerSamson at Sep 21, 2017 9:16:03 AM] |
||
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline |
KerSamson,
We have tried to target MIP workunits to a 4 hour run time. The first batches that ran when MIP1 launched contained considerably shorter protein sequences than we ran in beta. For this reason they did not fit our estimation script very well and ran considerably shorter than desired. The shorter run times didn't add any strain to our system, so we did not adjust the sizing. We are now getting some much longer sequences mixed in with the shorter ones. For the longer sequences we are much more likely to hit our 4 hour target, but the shorter sequences will continue to run shorter. For this reason you can't really compare one run to the next without knowing the sequence length. We will look at the database to see if there are any average run times that do not match what we would expect for the sequence length.
On your system, are you running with hyperthreading on? You may want to see if you get better performance with it turned off for MIP1.
Thanks,
armstrdj |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Members have identified anomalies besides run time. The issue is not run time per se but what might be causing the elongated run time. What about the problem where the WU uses half the memory after a BOINC restart and stays that way until the WU ends? That would indicate to me that half of the memory allocated to the WU is not needed but never freed.
We believe that the elongated run times are due to some concurrency problem. I have seen the behavior on AMD systems that don't have hyperthreading. If your recommendation is to turn hyperthreading OFF, why wasn't that identified during Beta and listed on the requirements page?
Most of my observations were related to a single work unit while it was running. If I suspended most of the other work, that work unit would speed up; if I resumed the other work, that same work unit would slow down. It wasn't a comparison between work units. We all expect a work unit to slow down on a fully loaded machine, but not by 5X or more. On a fully loaded 32 thread machine, a WU runs almost 8 hours. If I run them one at a time, the run time drops to about 1.5 hours. Both units were working on 3 structures, as were most units in my queue at the time. I found this out by doing a quick "grep nstruct *" in the projects directory to get the WU command line.
I also noticed that the WU uses about 150MB until the first checkpoint, after which it jumps up to about 360 to 390MB until it ends. If you restart the WU in the middle, after the first checkpoint, it drops back to 150MB and stays there. What's up with that? I did it on 6 systems and that happened regardless of hyperthreading.
----------------------------------------
[Edit 3 times, last edit by Doneske at Sep 21, 2017 4:50:10 PM] |
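A rough Python equivalent of the "grep nstruct *" check, plus the slowdown arithmetic from the post, might look like the sketch below. It only assumes that the WU command line contains an option of the form `-nstruct N`; the function names are hypothetical, and the sample string is a made-up fragment, not a real MIP1 command line.

```python
import re

# Matches an option of the form "-nstruct N" on a command line.
NSTRUCT_RE = re.compile(r"-nstruct\s+(\d+)")

def find_nstruct(text):
    """Return the first -nstruct value found in text, or None if absent."""
    m = NSTRUCT_RE.search(text)
    return int(m.group(1)) if m else None

def slowdown(loaded_hours, solo_hours):
    """How many times longer a WU runs on a loaded machine than alone."""
    return loaded_hours / solo_hours

# Illustrative fragment only; a real command line carries more options.
sample = "some_app -nstruct 3 -other_option value"
print(find_nstruct(sample))
# Run times quoted in the post: almost 8 hours loaded, about 1.5 hours alone.
print(round(slowdown(8, 1.5), 2))
```

With the figures quoted above, the ratio comes out at a little over 5, which matches the "5X or more" description.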
||
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline |
Doneske,
Can you send me a workunit name where you see that memory behaviour and I will see if I can recreate the issue locally? Thanks, armstrdj |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I did one right after I read your note. It is easy to re-create. Just let them run long enough to reach their first checkpoint, then stop and restart BOINC. I'm doing this on Linux BTW. I have 16 WUs running on this machine and they all exhibited the memory drop. I first noticed it in BoincTasks but I verified the working set using the top command.
Curiosity question: Why are these programs allocating a shared memory segment since they aren't multi-threaded and are single process? Is it actually being used? What are they sharing it with? To communicate with BOINC?
MIP1_00004740_4790_0
Working set before restart was 382.08MB
After restart it was 146.73MB
Stderr.txt:
[2017- 9-21 18: 2:23:] :: BOINC:: Initializing ... ok.
[2017- 9-21 18: 2:23:] :: BOINC :: boinc_init()
INFO: result number = 0
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
command: ../../projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.11_x86_64-pc-linux-gnu -in::file::zip MIP1_databasev2.zip @./MIP1_00004740.flags -out::file::silent result_silent.out -run:jran 312479616 -nstruct 2 -out::level 100 -run::no_scorefile true
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/www.worldcommunitygrid.org/mip1.MIP1_databasev2.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
set_shared_memory_fully_initialized ...
abrelax ...
abrelax.run
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Sequence Length = 265
Starting work on structure: _0001
Finished _0001 in 4770.6 seconds.
Starting work on structure: _0002
[2017- 9-21 20: 2:45:] :: BOINC:: Initializing ... ok. <== Restart
[2017- 9-21 20: 2:45:] :: BOINC :: boinc_init()
INFO: result number = 0
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
command: ../../projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.11_x86_64-pc-linux-gnu -in::file::zip MIP1_databasev2.zip @./MIP1_00004740.flags -out::file::silent result_silent.out -run:jran 312479616 -nstruct 2 -out::level 100 -run::no_scorefile true
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/www.worldcommunitygrid.org/mip1.MIP1_databasev2.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
set_shared_memory_fully_initialized ...
abrelax ...
abrelax.run
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting work on structure: _0002
----------------------------------------
[Edit 3 times, last edit by Doneske at Sep 22, 2017 2:06:34 AM] |
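A stderr file like the one above can be summarized with a few lines of Python. This is only a sketch: it assumes the log keeps the exact wording shown in this post ("Starting work on structure:", "Finished ... in ... seconds.", "BOINC:: Worker startup."), and `summarize_stderr` is a name invented for this illustration.

```python
import re

START_RE = re.compile(r"Starting work on structure: (_\d+)")
FINISH_RE = re.compile(r"Finished (_\d+) in ([\d.]+) seconds")

def summarize_stderr(text):
    """Summarize per-structure timings and restarts from stderr text."""
    starts = START_RE.findall(text)
    # Map each finished structure to its reported run time in seconds.
    finished = {name: float(secs) for name, secs in FINISH_RE.findall(text)}
    # Every worker startup after the first one indicates a restart.
    restarts = max(text.count("BOINC:: Worker startup.") - 1, 0)
    # A structure started more than once was resumed after a restart.
    resumed = sorted({s for s in starts if starts.count(s) > 1})
    return {"finished": finished, "restarts": restarts, "resumed": resumed}

# Shortened excerpt of the log posted above.
excerpt = (
    "BOINC:: Worker startup.\n"
    "Sequence Length = 265\n"
    "Starting work on structure: _0001\n"
    "Finished _0001 in 4770.6 seconds.\n"
    "Starting work on structure: _0002\n"
    "BOINC:: Worker startup.\n"
    "Starting work on structure: _0002\n"
)
print(summarize_stderr(excerpt))
```

For the excerpt, the summary shows one finished structure, one restart, and structure _0002 being resumed, which is the point at which the memory drop was observed.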
||
|
KerSamson
Master Cruncher Switzerland Joined: Jan 29, 2007 Post Count: 1673 Status: Offline |
Hi armstrdj,
Doneske described the problem perfectly. I do not have the science source code. Nevertheless, I would first investigate a possible memory leak and inadequate memory segmentation, which would make the CPU wait too often on data because of recurring cache faults. I can understand that hyperthreading could make the situation worse. However, it is not unusual to contribute to WCG using HT. Moreover, Doneske reported similar problems even on AMD CPUs without HT.
In advance, we thank you for your support and investigation.
Cheers,
Yves
----------------------------------------
[Edit 1 times, last edit by KerSamson at Sep 22, 2017 1:05:00 PM] |
||
|
armstrdj
Former World Community Grid Tech Joined: Oct 21, 2004 Post Count: 695 Status: Offline |
Doneske,
Are you seeing that when they resume they use less memory, and that they continue to use less memory for the remainder of the run until the workunit finishes? Thanks, armstrdj |
||