alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 983
Status: Recently Active
Re: All OPNG tasks erroring on all gpus-all hosts

I'll go out on a limb and say it's not just the custom Einstein code you're running but also these newer OPNG WUs, which have 10x more jobs inside them, causing the issue.


Where does this 10x value come from? Looking at the few tasks I completed recently, they only ran ~88 autodock jobs each, which is in line with what they were pushing out a few months ago. These are not excessively long and seem to take the same amount of time as previous work.

Did they change to really short WUs with only ~10 autodock jobs at some point in the last few months and then change back to "longer" tasks?
For information (and only related to the current receptor):

OPNG has been tackling the current target since the end of November 2021. If the tasks I have been receiving form a fair sample of what everyone else was getting, there were no tasks with huge job counts until the 11th of January, and from that point on the only low job counts were for retries!

The new, bigger tasks have lots of ligands with fewer atoms, lower branch counts, or both. That causes the workunit builder to think the individual jobs will run a lot quicker (I wish...), so we get a lot more of them!

As for 10 times as many jobs: true if one takes the extreme cases for both low and high counts! From my relatively small sample of about 2600 tasks for this receptor, I've seen fewer than 100 tasks with a dozen jobs or fewer and fewer than 40 with 100 jobs or more.
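To make the packing effect a bit more concrete, here's a toy sketch. The cost formula, weights and effort budget below are invented purely for illustration; this is not the actual WCG workunit builder:

[code]
# Toy sketch of a cost-based workunit packer -- NOT the actual WCG code.
# Assumption: each ligand's estimated run time grows with its atom and
# branch counts, and ligands (jobs) are added to a workunit until an
# effort budget is used up.

def estimate_cost(atoms, branches):
    return atoms + 5 * branches        # invented weighting

def pack_workunit(ligands, budget=2000):
    """Greedily add ligands until the estimated effort budget is spent."""
    workunit, spent = [], 0
    for atoms, branches in ligands:
        cost = estimate_cost(atoms, branches)
        if workunit and spent + cost > budget:
            break
        workunit.append((atoms, branches))
        spent += cost
    return workunit

small = [(20, 2)] * 500     # small/simple ligands, estimated cost 30 each
large = [(60, 40)] * 500    # big/branchy ligands, estimated cost 260 each
print(len(pack_workunit(small)))   # ~66 jobs end up in one workunit
print(len(pack_workunit(large)))   # ~7 jobs -- roughly the 10x spread described above
[/code]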

If it's of any interest I'll post some basic statistics related to the above (though that's likely to get very long...); I can also go back through previous target receptor data to find other cases of the many versus the few... Whatever, I hope the above helps...

Cheers - Al.
[Jan 16, 2022 3:52:14 AM]
immortal
Cruncher
USA
Joined: Nov 4, 2018
Post Count: 8
Status: Offline
Re: All OPNG tasks erroring on all gpus-all hosts

Something has definitely changed recently. My system had been completing OPNG tasks in under 3 minutes up until 1/9, and since then they have been taking 12-13 minutes.
[Jan 16, 2022 4:27:11 AM]
bluestang
Senior Cruncher
USA
Joined: Oct 1, 2010
Post Count: 272
Status: Offline
Re: All OPNG tasks erroring on all gpus-all hosts

I'll go out on a limb and say it's not just the custom Einstein code you're running but also these newer OPNG WUs, which have 10x more jobs inside them, causing the issue.


Where does this 10x value come from? Looking at the few tasks I completed recently, they only ran ~88 autodock jobs each, which is in line with what they were pushing out a few months ago. These are not excessively long and seem to take the same amount of time as previous work.

Did they change to really short WUs with only ~10 autodock jobs at some point in the last few months and then change back to "longer" tasks?


The jobs inside the WUs have been increased substantially. Go to your Results page here and click on a recently completed WU and on one from a couple of weeks ago, and you'll see the difference in the job count in the file.
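If you'd rather not eyeball the files, a quick throwaway script like the one below can count the entries for you. The marker string and file names are only placeholders (I'm not claiming this matches the exact file layout), so adjust them to whatever the saved workunit files actually contain:

[code]
# Quick-and-dirty job counter for two saved workunit files.
# Assumptions: you've saved the result pages/files locally, and each job
# shows up on its own line containing some recognizable marker string.
# "autodock", "recent_wu.txt" and "older_wu.txt" are placeholders only.

def count_jobs(path, marker="autodock"):
    with open(path, encoding="utf-8", errors="ignore") as f:
        return sum(1 for line in f if marker in line.lower())

print("recent:", count_jobs("recent_wu.txt"), "jobs")
print("older: ", count_jobs("older_wu.txt"), "jobs")
[/code]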
[Jan 16, 2022 5:44:40 AM]
Ian-n-Steve C.
Senior Cruncher
United States
Joined: May 15, 2020
Post Count: 180
Status: Offline
Re: All OPNG tasks erroring on all gpus-all hosts

I did. The tasks I checked had 88 jobs in them. Is that outside of the norm? Because that's about the same number of jobs as early OPNG tasks.

example: [url]https://www.worldcommunitygrid.org/contribution/workunit/116283546[/url]

and referencing some of my older posts, there were times when the WUs had over 1000 jobs... [url]https://www.worldcommunitygrid.org/forums/wcg...d,43386_offset,370#657089[/url]

This one was an anomaly, but 75-100+ jobs was fairly normal from what I remember. 88 jobs is in that same range.
----------------------------------------

EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
[Jan 16, 2022 2:21:50 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 983
Status: Recently Active
Re: All OPNG tasks erroring on all gpus-all hosts

I did. The tasks I checked had 88 jobs in them. Is that outside of the norm? Because that's about the same number of jobs as early OPNG tasks.

example: [url]https://www.worldcommunitygrid.org/contribution/workunit/116283546[/url]

and referencing some of my older posts, there were times when the WUs had over 1000 jobs... [url]https://www.worldcommunitygrid.org/forums/wcg...d,43386_offset,370#657089[/url]

This one was an anomaly, but 75-100+ jobs was fairly normal from what I remember. 88 jobs is in that same range.
When I posted up-thread, I forgot to include a link to the post by Keith (Uplinger) discussing the work unit packaging algorithm.

Unfortunately, it appears that ligands of a similar size and number of branches tend to be clustered in the data provided for work unit generation. So instead of work units getting a mix of jobs across the range of possible effort required, there are either a few large/complex ligand jobs or lots of small/simple ligand jobs... Then we get the swings in run time and, in some cases, tasks failing because limits are exceeded :-( During the great OPNG job storm last year we saw all sorts of tasks at once, so that tended to hide the swing effect.

I have some monitoring software which collects OPN1/OPNG job parameters per task and can confirm that (at least for my two systems[1]) it appears that tasks are [still] built using that algorithm, or something very similar. It also appears that the majority of tasks for most receptors have had job counts in the lower ranges. That would be expected if jobs of similar potential difficulty tend to get clustered... To date, I've got data for 27108 OPNG tasks (over a variety of target receptors), comprising 986498 jobs. That's an average of just over 36 jobs/task, but the actual distribution is heavily biased towards tasks with a lower job count. However, using the WCG job assessment algorithm to score the individual jobs, over 55% of them scored so low that they would end up in 75+ job tasks if grouped only with their peers!
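As a quick sanity check on the average, plus a made-up 90/10 split (not my real data) just to show how a mean of roughly 36 jobs/task can come from a distribution where most tasks have low counts:

[code]
# Check the average quoted above.
total_jobs, total_tasks = 986498, 27108
print(round(total_jobs / total_tasks, 1))      # -> 36.4, i.e. just over 36 jobs/task

# Made-up illustration: 90 tasks of 15 jobs plus 10 tasks of 150 jobs.
# Most tasks have low counts, yet over half of all jobs sit in the big tasks.
tasks = [15] * 90 + [150] * 10
print(round(sum(t for t in tasks if t >= 75) / sum(tasks), 2))   # -> 0.53
[/code]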

If the scientists were able to tweak the ordering of the ligand selections per batch, this might improve things... However, that may not be as straightforward as it sounds! And there are other issues with the algorithm and [as a side-effect] credit assignment anyway; hopefully, this can all be revisited after the transition to Krembil is completed (see Igor Jurisica's post in response to a thread about low credit...)
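To illustrate why the ordering of the ligand selections matters, here's a toy comparison using the same invented cost model as the sketch further up the thread (again, an assumed model, not the real pipeline): pack a batch where similar ligands arrive clustered together, then pack the same batch shuffled.

[code]
# Toy comparison: clustered vs. shuffled ligand ordering -- invented model,
# not the real WCG pipeline.
import random
random.seed(1)

def estimate_cost(atoms, branches):
    return atoms + 5 * branches            # same invented weighting as before

def pack_all(ligands, budget=2000):
    """Split the whole batch into consecutive workunits under an effort budget;
    return the number of jobs in each workunit."""
    counts, current, spent = [], 0, 0
    for atoms, branches in ligands:
        cost = estimate_cost(atoms, branches)
        if current and spent + cost > budget:
            counts.append(current)
            current, spent = 0, 0
        current += 1
        spent += cost
    if current:
        counts.append(current)
    return counts

clustered = [(20, 2)] * 300 + [(60, 40)] * 300   # similar ligands arrive together
mixed = clustered[:]
random.shuffle(mixed)

print(min(pack_all(clustered)), max(pack_all(clustered)))  # huge spread in jobs per WU
print(min(pack_all(mixed)), max(pack_all(mixed)))          # counts bunch together
[/code]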

Cheers - Al.

[1] I only have a single 1050 Ti and a single 1660 Ti in a pair of Linux machines and don't run anything to try to maintain a continuous stream of OPNG tasks, so the statistics I see won't have as many of the really extreme cases as might show up for a big hitter!
[Jan 17, 2022 2:37:49 AM]