SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Status: Offline
Re: Fight AIDS@Home, Phase 2 - project update, Sept. 2016

Right now the purpose I see in the trickles is twofold:

1) To increase the interactive dynamicity, if you will.
2) To salvage any progress should the client stall, crash or be powered down for longer... if you are on trickle 7 and disaster strikes, those trickles can be used to seed the next task (see the sketch at the end of this post).

The one open issue is a task going invalid at the end, and the question of what happens then with the good trickles. It's been asked, by moi, and left unanswered: is it invalid for the whole task, or just for the last trickle? The latter is what the description said, i.e. use the good part. In short, there (hate the word) 'should' not be any invalids, unless trickle 1 itself is flawed.
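To make the "seed the next task" idea concrete, here is a minimal sketch of how a resend could be built from the last good trickle. Everything in it (the trickle fields, the helper names, the 50k step spacing) is my own illustration, not WCG's actual backend:

# Hypothetical sketch: restart a lost trajectory from its last good trickle.
# Field names, helpers and step spacing are illustrative assumptions only.

def find_last_good_trickle(trickles):
    """Return the most recent trickle that passed validation, or None."""
    valid = [t for t in trickles if t.get("validated")]
    return max(valid, key=lambda t: t["step"]) if valid else None

def build_resend_wu(trajectory_id, trickles, total_steps=500_000):
    last = find_last_good_trickle(trickles)
    start_step = last["step"] if last else 0       # 0 only if no trickle survived
    return {
        "trajectory": trajectory_id,
        "start_step": start_step,                  # resend picks up here...
        "end_step": total_steps,                   # ...instead of at step 0
        "restart_file": last["checkpoint"] if last else None,
    }

# Example: a task that died right after its 7th trickle (50k steps apart).
trickles = [{"step": 50_000 * i, "validated": True,
             "checkpoint": "traj42_t%d.rst" % i} for i in range(1, 8)]
print(build_resend_wu("traj42", trickles))         # start_step = 350000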
[Oct 4, 2016 3:25:14 PM]
wflynny
FightAIDS@Home Scientist
Joined: Jun 17, 2014
Post Count: 10
Status: Offline
Re: Fight AIDS@Home, Phase 2 - project update, Sept. 2016

Re: trickles

Expanding on SekeRob's post above, trickles enable work on a single trajectory (30+ WUs) to continue as quickly as possible. If a WU trickles at 50k steps and then disappears, we send out another WU to a different volunteer picking up at 50k steps. While some complete trajectories never experience this problem, the vast majority of the thousands that have already run have had to be restarted from a trickle message, often several times, so we do see a benefit from them on our end. But I understand the pushback against trickles. If the AsyncRE WUs are short enough, then we may not need to rely on trickles to ensure trajectories aren't stalling. I'm fairly certain reliability scores per host are already baked into the BOINC scheduler, but I don't know if a dichotomous "reliable hosts don't have to trickle" and "unreliable hosts do" scheme is feasible (or necessarily desirable).
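On the feasibility question: a graded scheme could in principle sit on top of whatever per-host reliability metric the scheduler already tracks. A minimal sketch under that assumption; the metric name and the thresholds below are invented for illustration and are not actual BOINC scheduler fields:

# Hypothetical policy: trickle less often on hosts with a strong track record.
# 'valid_fraction' and the thresholds are illustrative assumptions only.

def trickle_interval_steps(valid_fraction, base_interval=50_000):
    """How many MD steps to run between trickle uploads for this host."""
    if valid_fraction >= 0.98:       # very reliable host: trickle rarely
        return base_interval * 4
    if valid_fraction >= 0.90:       # average host: default behaviour
        return base_interval
    return base_interval // 2        # flaky host: trickle more often

for frac in (0.99, 0.95, 0.70):
    print(frac, trickle_interval_steps(frac))   # 200000, 50000, 25000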

Re: invalid trickles

Not sure, will ask around.

Re: AsyncRE manager

On another subject: reading the papers on the AsyncRE computing architecture, it seems the AsyncRE Manager runs on the same server as BOINC. That raises a concern about backend resources, since we already see messages from BOINC today about deferring scheduler requests due to high load. I would imagine that the AsyncRE Manager will have to handle input from maybe 100,000 cores, and this would just add to the existing load. Additionally, communication between BOINC and the AsyncRE Manager seems like a potential bottleneck.

Uplinger (and others) are integrating a BOINC-native implementation on the BOINC head server, with the intent to make the exchange part of replica exchange ("information sharing") as quick and efficient as possible, with little additional CPU load. Luckily we won't be exchanging information between every replica currently running, only among groups which correspond to the same protein-ligand complex, which cuts down on the CPU cycles needed for the exchange process.
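As a rough illustration of why the grouping matters, here is a sketch of candidate-pair selection restricted to replicas of the same complex. The data layout and function names are my assumptions, not the actual server code:

# Hypothetical sketch: only replicas simulating the same protein-ligand
# complex are candidates for exchanging parameters with each other.
from collections import defaultdict
from itertools import combinations

def group_by_complex(replicas):
    groups = defaultdict(list)
    for r in replicas:                  # r is a dict describing one replica
        groups[r["complex_id"]].append(r)
    return groups

def exchange_candidates(replicas):
    """Yield pairs of replicas eligible to attempt an exchange."""
    for complex_id, members in group_by_complex(replicas).items():
        for a, b in combinations(members, 2):
            yield complex_id, a["id"], b["id"]   # pairs never cross complexes

replicas = [
    {"id": 1, "complex_id": "HIV-PR_ligA", "lam": 0.2},
    {"id": 2, "complex_id": "HIV-PR_ligA", "lam": 0.4},
    {"id": 3, "complex_id": "HIV-PR_ligB", "lam": 0.2},
]
print(list(exchange_candidates(replicas)))       # only the ligA pair shows up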

Re: what happens when the validator gets stuck and needs the proverbial kicking

We currently face the same problem when the validator gets stuck, since new WUs rely on the output of previous WUs. AsyncRE essentially just changes how we choose the parameters assigned to each new WU (but these choices are key to accelerating the simulations properly).
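For anyone curious what "choosing the parameters" boils down to, the standard Metropolis criterion for a Hamiltonian replica-exchange swap looks roughly like the sketch below. Whether the WCG/BEDAM implementation uses exactly this form is my assumption, so treat it as illustrative only:

# Illustrative only: Metropolis acceptance for swapping the coupling
# parameter lambda between two replicas, with U_lambda(x) = U_0(x) + lambda*u(x).
import math
import random

def accept_swap(lam_i, lam_j, u_i, u_j, beta=1.678):   # beta ~ 1/kT at 300 K in 1/(kcal/mol)
    """Return True if replicas i and j should swap their lambda values."""
    delta = beta * (lam_i - lam_j) * (u_j - u_i)       # energy cost of the swap
    return delta <= 0 or random.random() < math.exp(-delta)

# Neighbouring lambdas with similar binding energies: always accepted here.
print(accept_swap(lam_i=0.4, lam_j=0.5, u_i=-12.3, u_j=-11.8))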
----------------------------------------
[Edited 1 time, last edit by wflynny at Oct 4, 2016 4:15:22 PM]
[Oct 4, 2016 4:14:49 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Fight AIDS@Home, Phase 2 - project update, Sept. 2016

Concerning the load on the server, the exchange piece is probably a small part of the processing load, relatively speaking. The AsyncRE Manager consists of several other modules that will be doing other work in addition to the exchange process, such as telling BOINC there is other work to schedule and the background processing required to build that work. I used 100,000 cores as a conservative estimate: based on 216,000 active hosts and assuming a very conservative 4 cores per host, that gives ~800,000 cores (probably closer to 1,000,000), and then I assumed a 20% participation rate (most likely higher at the beginning of the new process). The paper mentioned that this architecture could support GPGPU nodes. Has any thought been given to implementing this on GPUs as well?
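For reference, the arithmetic behind that estimate, using the figures quoted above (the 4 cores/host and 20% participation are of course assumptions):

# Back-of-the-envelope check of the core estimate quoted above.
active_hosts = 216_000
cores_per_host = 4                 # "very conservative" per the post
participation = 0.20               # assumed fraction crunching FAAH2

total_cores = active_hosts * cores_per_host     # 864,000 (~800k-1M quoted)
faah2_cores = int(total_cores * participation)  # ~173,000
print(total_cores, faah2_cores)                 # so 100,000 is indeed conservative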

Since this seems like a significant architecture change, is this still going to be FAAH2 or is this FAAH3 (or maybe FAAH 2.5)?
[Oct 4, 2016 5:45:16 PM]
SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Status: Offline
Re: Fight AIDS@Home, Phase 2 - project update, Sept. 2016

I think you can read back in an Uplinger post that when the trickle validator last stalled for a longer period, it took several hours to catch up. It's like a whole extra result set per day; all combined, the scheduler thus seems to handle over 2 million 'validation attempts' per day in the present setup.
[Oct 4, 2016 6:01:46 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Fight AIDS@Home, Phase 2 - project update, Sept. 2016

I'm not wanting to focus too much on the individual functions, but rather look at the overall processing environment. Recent messages seem to suggest that the BOINC environment is reaching its processing limits, as evidenced by scheduler messages saying requests are delayed due to heavy load. This would indicate to me that the BOINC process is shedding work in order to handle other work, and it seems reasonable to assume that the server/BOINC process is close to some resource limit. During normal processing everything is fine, but any "anomaly" in day-to-day activities stretches resources to that limit. I'm just suggesting that AsyncRE could be that "anomaly" when trying to manage 1,000,000 cores' worth of work. Remember, this is work on top of what is already there. Don't forget that CEP2 will be coming back online at some point (but it's also shedding UGM, so maybe a wash).
[Oct 4, 2016 8:48:38 PM]
Papa3
Senior Cruncher
Joined: Apr 23, 2006
Post Count: 360
Status: Offline
Re: Fight AIDS@Home, Phase 2 - project update, Sept. 2016

trickles enable work on a single trajectory (30+ WUs) to continue as quickly as possible. If a WU trickles at 50k steps and then disappears, we send out another WU to a different volunteer picking up at 50k steps. While some complete trajectories never experience this problem, the vast majority of the thousands that have already run have had to be restarted from a trickle message, often several times
With that much WU failure happening, it can't just be attributable to volunteers aborting jobs or turning off their computers, etc. - it sounds much more like lots of WUs are being sent out with major birth defects. In other words, the project is royally screwing up WU production and the trickled-upon volunteers are paying the price! Is that correct?
[Oct 4, 2016 10:11:44 PM]
SekeRob
Master Cruncher
Joined: Jan 7, 2013
Post Count: 2741
Status: Offline
Re: Fight AIDS@Home, Phase 2 - project update, Sept. 2016

With a good measure of buffer, there's no reason whatsoever for me to look at the event log... it's the department of resignation, having been arm-wrestled into making it 'not my problem'. If I do look in the buffer, I find FAH2 is drowning out every other subscribed project at WCG, hence the app_config max_concurrent, 3 of 8 cores on the octo (see the example below). I've tried to convince the Techs they need to do something about it so the results net a speedier return, but they are resolved to stick to "at this time we are not planning to...", so if there's no meeting somewhere in the middle, I have no qualms about using the abort key, which I do when I see yet another pileup. There's no point in having these tasks sit in the queue for 3 days, eventually leading to idle cores because BOINC determines the cache is overcommitted, and then, without an initial trickle, watching them being server-aborted (code 202 in the BOINCTasks history). Might as well speed up that process for the scheduler so another client can start on them sooner... manually helping the FAH2 scientists get their results sooner. I don't like it AT ALL, but let me be in charge of how the volunteered resources are being distributed, same as with money going to charitable causes.
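For anyone wanting to do the same, a minimal app_config.xml along these lines; note the application short name below is my assumption (check the exact name in your client's event log or the project's applications page), and the file goes into the WCG project folder before using the client's "Read config files" option:

<!-- Minimal example: cap concurrent FightAIDS@Home Phase 2 tasks at 3.
     The <name> value is an assumption; verify it against your own client. -->
<app_config>
   <app>
      <name>fahb</name>
      <max_concurrent>3</max_concurrent>
   </app>
</app_config>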
[Oct 4, 2016 10:22:41 PM]
wflynny
FightAIDS@Home Scientist
Joined: Jun 17, 2014
Post Count: 10
Status: Offline
Re: Fight AIDS@Home, Phase 2 - project update, Sept. 2016

@Doneske re: AsyncRE bottlenecks, etc.

In terms of current WUs from FA@H2, there already exists a background process which builds the WUs by parsing parameters from a configuration file and parsing the step number from the last output corresponding to that trajectory. AsyncRE will parse a bit more (certain energy values, etc.) reported in the last output file, but otherwise only adds the exchange mechanism. While the Python implementation of AsyncRE available on GitHub has lots of modules to support different architectures, the WCG version will be stripped down to include only the essentials. So in terms of *additional* CPU load to the head server, that should only come from the exchange routine, which we hope will be minimal. (I can't really comment on the current system hitting hardware limits since I can only see that from my own BOINC logs.)
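To picture the extra parsing, a toy sketch of that builder step; the output format, field names and helpers are invented for illustration and are not the actual FA@H2 backend code:

# Toy sketch of the per-trajectory WU builder described above.
# The output-file format and every field name here are invented examples.
import re

def parse_last_output(text):
    """Pull the step count (current builder) plus an energy (AsyncRE extra)."""
    step   = int(re.search(r"^STEP\s+(\d+)", text, re.M).group(1))
    energy = float(re.search(r"^BIND_ENERGY\s+(-?\d+\.\d+)", text, re.M).group(1))
    return {"step": step, "binding_energy": energy}

def build_next_wu(trajectory_id, last_output_text, lam):
    state = parse_last_output(last_output_text)
    return {
        "trajectory": trajectory_id,
        "start_step": state["step"],      # continue where the last WU stopped
        "lambda": lam,                    # possibly updated by the exchange step
        "last_energy": state["binding_energy"],
    }

sample = "STEP 150000\nBIND_ENERGY -10.42\n"
print(build_next_wu("traj42", sample, lam=0.35))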

In terms of GPGPUs, while AsyncRE (the framework for asynchronous replica exchange) supports GPUs, the MD engine IMPACT currently does not have a BEDAM GPU implementation. So for now GPGPU computing for this project isn't on the table (though collaborators at Brooklyn College are working on the necessary components to make this a possibility).

@Papa3 re: bad workunits

Bad WUs tend to blow up pretty quickly. If you get to the first trickle without problems, the workunit is most likely fine. So I would attribute the "restarted-by-trickle" WUs to volunteers aborting/restarting/logging off.
[Oct 5, 2016 12:03:05 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Fight AIDS@Home, Phase 2 - project update, Sept. 2016

@wflynny

Thanks for the additional information. I think we have just about beat this topic into submission :). Let's get 'er going and find out what happens. Looking forward to the beta testing....
[Oct 5, 2016 1:41:10 AM]
Rickjb
Veteran Cruncher
Australia
Joined: Sep 17, 2006
Post Count: 666
Status: Offline
Re: Fight AIDS@Home, Phase 2 - project update, Sept. 2016

Anything that will improve the accuracy and speed of screening for potential therapeutic drugs for any disease will be a great advance.
Thanks for your hard work, Team Levy.

However, there is something about running ARE on WCG that I haven't seen mentioned. As I understand ARE, in order to be able to exchange info between WUs running on different members' machines, the WUs would need to be running concurrently. That may be simple to arrange on an in-house cluster, but we members have different work cache and project mix settings, so some replicas will want to be finished before others have even started. If my understanding is correct, how do you propose to overcome this?
[Oct 5, 2016 5:26:36 AM]