World Community Grid - View Thread

World Community Grid Forums

Category: Completed Research

Forum: Microbiome Immunity Project

Thread: WU Characteristics

Thread Status: Active
Total posts in this thread: 100

This topic has been viewed 17902 times and has 99 replies

SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Status: Offline


Re: WU Characteristics

The 'hardcoded' checkpoint is at the end of every structure computed in a task, where the intermediate result file is naturally smaller; that is the reason it only happens at the end of a structure. With 10 targets included in a task, there are 10 checkpoints. BUT, the computing effort for a structure is variable and non-deterministic, so the intervals vary, as does the number of structures packaged in a task. That number is not advertised in the event log, nor is the checkpoint sequence number, even if checkpoint_debug logging is switched on. The only way I know to find out how many are in the task is to visit the job slot and open the stderr file, which prints the count at the top.

The 'but, but, but' you may address to the techs/programmers.
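
As a rough illustration of the stderr approach, the sketch below pulls the structure count out of the task's command line. It assumes the count appears as an `-nstruct N` argument on a line starting with `command:`, which is how it looks in the stderr output posted later in this thread. The function name `structures_in_task` and the sample text are made up for the example.

```python
import re
from typing import Optional


def structures_in_task(stderr_text: str) -> Optional[int]:
    """Return the -nstruct value from the first 'command:' line, or None."""
    for line in stderr_text.splitlines():
        if line.startswith("command:"):
            match = re.search(r"-nstruct\s+(\d+)", line)
            if match:
                return int(match.group(1))
    return None


# Hypothetical, shortened stderr excerpt (not taken from a real task).
sample = (
    "INFO: result number = 0\n"
    "command: ./app -run:jran 1 -nstruct 10 -out::level 100\n"
)
count = structures_in_task(sample)
print(count)
```

With one checkpoint per structure, the value returned is also the number of checkpoints to expect for that task.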

----------------------------------------
[Edit 1 times, last edit by SekeRob* at Sep 18, 2017 4:34:35 PM]

[Sep 18, 2017 4:32:26 PM]

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7675
Status: Offline

Re: WU Characteristics

I have noticed another anomaly with this project. I have some machines which run 24/7 and others I need to shut down on a regular basis. One of the systems I shut down is running Windows Vista. I have found with the MIP project that I need to explicitly exit BOINC before initiating a shutdown, because if I just do a normal shutdown, any MIP units running at the time will immediately throw an error when that machine is started later. I do not recall any other project having this problem in the past, nor has it been a problem with any of the other current projects. So there is a workaround to keep the problem from recurring, but perhaps the techs/programmers could also take a peek at this issue.
Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

[Sep 18, 2017 6:23:56 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: WU Characteristics

Interesting observation: I restarted BOINC (no reboot) on two Linux machines and the WUs restarted using 50% of the original amount of memory. They were using about 360MB before and 150MB after the restart. One box has been running about 30 minutes since the restart and they still haven't grown back to the original size. On my one Windows 7 box, the WUs use about 100MB less memory than on Linux.

[Sep 19, 2017 7:28:01 PM]

KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1673
Status: Offline

Re: WU Characteristics - Performance problems - Feedback from scientists is expected

Dear MIP-scientists and WCG Tech Team,
we would really appreciate your feedback about the observed problems regarding MIP1; at the very least, we would like to read that you acknowledge that the science is not really efficient at this time and that it is causing trouble.
We are volunteers, we support your projects at our own cost, and the minimum we expect in fair play is to be able to run well-designed and efficient sciences. At this time, multiple contributors report reproducible inappropriate MIP1 behaviours and we do not receive any feedback from your side; this situation is not OK!
Should we all rigorously boycott this new science for making you aware that there is a significant problem?
Yves
---
(addition)
PS: You report that the project is temporarily suspended; I suppose that this announcement is related to the batch failures of the previous days. I am not sure that the performance issue is the reason for the suspension. Again, feedback would be really appreciated.

----------------------------------------

Décrypthon team progress - KerSamson's contribution

----------------------------------------
[Edit 2 times, last edit by KerSamson at Sep 21, 2017 9:16:03 AM]

[Sep 21, 2017 8:59:39 AM]

armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline

Re: WU Characteristics - Performance problems - Feedback from scientists is expected

KerSamson,

We have tried to target MIP workunits to a 4 hour runtime. The first batches that were run when MIP1 launched were considerably shorter protein sequences than we ran in beta. For this reason they did not fit our estimation script very well and ran considerably shorter than desired. The shorter run times didn't add any strain to our system so we did not adjust the sizing. We are now getting into a mix of some much longer sequences along with the shorter ones. For the longer sequences we are much more likely to hit our 4 hour target but the shorter sequences will continue to run shorter. For this reason you can't really compare one run to the next without knowing the sequence length. We will look at the database to see if there are any average runtimes that do not match what we would expect with the sequence length.
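
The estimation script itself is not shown in this thread, so the following is only a guess at the general shape of such sizing, not the actual WCG logic. It assumes a per-structure runtime estimate is already available and packs as many whole structures as fit into the 4 hour target, with a minimum of one. The function name `structures_for_target` and all the runtime figures are hypothetical.

```python
TARGET_SECONDS = 4 * 3600  # the 4 hour target mentioned above


def structures_for_target(est_seconds_per_structure: float,
                          target_seconds: float = TARGET_SECONDS) -> int:
    """Whole structures that fit in the target runtime, never fewer than one."""
    if est_seconds_per_structure <= 0:
        raise ValueError("estimate must be positive")
    return max(1, int(target_seconds // est_seconds_per_structure))


# Hypothetical estimates: a short, a medium and a very long sequence.
for estimate in (1500.0, 4800.0, 20000.0):
    print(estimate, structures_for_target(estimate))
```

Under this kind of rule, an estimate that is too high for short sequences yields too few structures per task, which would be consistent with tasks finishing well under the target.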

On your system are you running with hyperthreading on? You may want to see if you get better performance with it turned off for MIP1.

Thanks,
armstrdj

[Sep 21, 2017 2:51:26 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: WU Characteristics - Performance problems - Feedback from scientists is expected

Members have identified anomalies other than run time. The issue is not run time per se but what might be causing the elongated run time. What about the problem where the WU uses half the memory after a BOINC restart and stays that way until the WU ends? That would indicate to me that half of the memory allocated to the WU is not needed but never freed.

We believe that the elongated run times are due to some concurrency problem. I have seen the behavior on AMD systems that don't have hyperthreading. If your recommendation is to turn hyperthreading OFF, why wasn't that identified during Beta and listed on the requirements page?

Most of my observations were related to a single work unit while it was running. If I suspended most of the other work, that work unit would speed up; when I resumed the other work, that same work unit would slow down. It wasn't a comparison between work units. We all expect a work unit to slow down on a fully loaded machine, but not by 5X or more. On a 32 thread machine fully loaded, a WU runs almost 8 hours. If I run them one at a time, the run time drops to about 1.5 hours. Both units were working on 3 structures, as were most units in my queue at the time. I found this out by doing a quick "grep nstruct *" in the projects directory to get the WU command line.

I also noticed that the WU uses about 150MB until the first checkpoint, after which it jumps up to about 360 to 390MB until it ends. If you restart the WU in the middle, after the first checkpoint, it drops back to 150MB and stays there. What's up with that? I did it on 6 systems and that happened regardless of hyperthreading.
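
For a sense of scale, the run times quoted in this post can be turned into a slowdown factor and a rough throughput figure. This is only arithmetic on the numbers in the post, taking "almost 8 hours" as 8.0 and assuming the fully loaded case means 32 tasks running at once; the helper names are invented for the example.

```python
def slowdown(loaded_hours: float, solo_hours: float) -> float:
    """How many times longer a task takes on the loaded machine."""
    return loaded_hours / solo_hours


def tasks_per_hour(concurrent_tasks: int, hours_per_task: float) -> float:
    """Aggregate completion rate when tasks run side by side."""
    return concurrent_tasks / hours_per_task


factor = slowdown(8.0, 1.5)        # per-task slowdown under full load
loaded = tasks_per_hour(32, 8.0)   # assumed: 32 tasks at once, 8 h each
solo = tasks_per_hour(1, 1.5)      # one task at a time, 1.5 h each

print(f"slowdown {factor:.2f}x, loaded {loaded:.2f}/h, solo {solo:.2f}/h")
```

The first figure matches the "5X or more" mentioned in the post. The other two are only as good as the assumption that all 32 threads were running MIP1 tasks at that speed.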

----------------------------------------
[Edit 3 times, last edit by Doneske at Sep 21, 2017 4:50:10 PM]

[Sep 21, 2017 4:31:29 PM]

armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline

Re: WU Characteristics - Performance problems - Feedback from scientists is expected

Doneske,

Can you send me a workunit name where you see that memory behaviour and I will see if I can recreate the issue locally?

Thanks,
armstrdj

[Sep 21, 2017 7:04:04 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: WU Characteristics - Performance problems - Feedback from scientists is expected

I did one right after I read your note. It is easy to re-create. Just let them run long enough to reach their first checkpoint, then stop and restart BOINC. I'm doing this on Linux BTW. I have 16 WUs running on this machine and they all exhibited the memory drop. I first noticed it in BoincTasks but I verified the working set using the top command.
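
For anyone who wants to log the same figure over time instead of watching top, here is a minimal sketch. It assumes Linux and reads the `VmRSS` line from `/proc/<pid>/status`, which reports resident memory in kB (the same quantity top shows as RES). The function names and the sample status text are illustrative only.

```python
import re
from typing import Optional


def rss_kb_from_status(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in kB) from /proc/<pid>/status content."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid: int) -> Optional[int]:
    """Read resident memory for a live process on Linux."""
    with open(f"/proc/{pid}/status") as status_file:
        return rss_kb_from_status(status_file.read())


# Made-up status excerpt so the parser can be tried without a live process.
sample_status = "Name:\tdemo\nVmPeak:\t  500000 kB\nVmRSS:\t  150000 kB\n"
print(rss_kb_from_status(sample_status))
```

Calling `read_rss_kb` with the PID of a running work unit before and after a BOINC restart would give the two numbers to compare.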

Curiosity question: Why are these programs allocating a shared memory segment when they are single-process and not multi-threaded? Is it actually being used? What are they sharing it with? Is it to communicate with BOINC?

MIP1_00004740_4790_0 Working set before restart was 382.08MB
After restart it was 146.73MB

Stderr.txt:
[2017- 9-21 18: 2:23:] :: BOINC:: Initializing ... ok.
[2017- 9-21 18: 2:23:] :: BOINC :: boinc_init()
INFO: result number = 0
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
command: ../../projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.11_x86_64-pc-linux-gnu -in::file::zip MIP1_databasev2.zip @./MIP1_00004740.flags -out::file::silent result_silent.out -run:jran 312479616 -nstruct 2 -out::level 100 -run::no_scorefile true
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/www.worldcommunitygrid.org/mip1.MIP1_databasev2.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
set_shared_memory_fully_initialized ...
abrelax ...
abrelax.run
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Sequence Length = 265
Starting work on structure: _0001
Finished _0001 in 4770.6 seconds.
Starting work on structure: _0002
[2017- 9-21 20: 2:45:] :: BOINC:: Initializing ... ok. <== Restart
[2017- 9-21 20: 2:45:] :: BOINC :: boinc_init()
INFO: result number = 0
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
command: ../../projects/www.worldcommunitygrid.org/wcgrid_mip1_rosetta_7.11_x86_64-pc-linux-gnu -in::file::zip MIP1_databasev2.zip @./MIP1_00004740.flags -out::file::silent result_silent.out -run:jran 312479616 -nstruct 2 -out::level 100 -run::no_scorefile true
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/www.worldcommunitygrid.org/mip1.MIP1_databasev2.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
set_shared_memory_fully_initialized ...
abrelax ...
abrelax.run
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting work on structure: _0002
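
A log like the one above can be summarised mechanically. The sketch below counts restarts and collects per-structure times, assuming each BOINC start writes a line containing `BOINC:: Initializing` and each completed structure a `Finished _NNNN in X seconds.` line, as in this log. The function name `summarize` is invented, and the sample is an abridged copy of the lines above.

```python
import re

FINISHED = re.compile(r"Finished (_\d+) in ([\d.]+) seconds\.")


def summarize(stderr_text: str) -> dict:
    """Count restarts and collect seconds spent per finished structure."""
    starts = 0
    finished = {}
    for line in stderr_text.splitlines():
        if "BOINC:: Initializing" in line:
            starts += 1
        match = FINISHED.search(line)
        if match:
            finished[match.group(1)] = float(match.group(2))
    # The first initialisation is the normal start, not a restart.
    return {"restarts": max(0, starts - 1), "finished": finished}


# Abridged excerpt of the stderr above.
sample_log = """\
[2017- 9-21 18: 2:23:] :: BOINC:: Initializing ... ok.
Sequence Length = 265
Starting work on structure: _0001
Finished _0001 in 4770.6 seconds.
Starting work on structure: _0002
[2017- 9-21 20: 2:45:] :: BOINC:: Initializing ... ok.
Starting work on structure: _0002
"""
summary = summarize(sample_log)
print(summary)
```

For this excerpt the summary shows one restart and one finished structure, with `_0002` started twice because the restart resumed from the checkpoint taken at the end of `_0001`.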

----------------------------------------
[Edit 3 times, last edit by Doneske at Sep 22, 2017 2:06:34 AM]

[Sep 22, 2017 1:19:00 AM]

KerSamson
Master Cruncher
Switzerland
Joined: Jan 29, 2007
Post Count: 1673
Status: Offline

Re: WU Characteristics - Performance problems - Feedback from scientists is expected

Hi armstrdj,
Doneske perfectly summarised the situation. The increasing duration for computing a WU is not the cause but the consequence of other problems. On my side, I discovered the problem (as reported in early September 2017) because the granted credit and the CPU temperature constantly decreased over one week even though the CPU load remained at 100%. Afterwards, I noticed that the memory consumption was a little bit "surprising". You can find my observations above in this thread.
I do not have the science source code. Nevertheless, I would first investigate a possible memory leak and an inadequate memory segmentation, which would cause the CPU to wait too often on data because of recurring cache faults.
I can understand that hyperthreading could make the situation worse. However, it is not unusual to contribute to WCG using HT. Moreover, Doneske reported similar problems even on AMD CPUs without HT.
In advance, we thank you for your support and investigation.
Cheers,
Yves

----------------------------------------

Décrypthon team progress - KerSamson's contribution

----------------------------------------
[Edit 1 times, last edit by KerSamson at Sep 22, 2017 1:05:00 PM]

[Sep 22, 2017 1:28:03 AM]

armstrdj
Former World Community Grid Tech
Joined: Oct 21, 2004
Post Count: 695
Status: Offline

Re: WU Characteristics - Performance problems - Feedback from scientists is expected

Doneske,

Are you seeing that when they resume they use less memory, and that they continue to use less memory for the remainder of the run until the workunit finishes?

Thanks,
armstrdj

[Sep 22, 2017 2:07:41 AM]