alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 983
Status: Recently Active
Re: All OPNG tasks erroring on all gpus-all hosts

I'll go out on a limb and say it's not just the custom Einstein code you're running but also these newer OPNG WUs, which have 10x more jobs inside them, causing the issue.


Where does this 10x value come from? Looking at the few tasks I completed recently, they only ran ~88 autodock jobs each, which is in line with what they were pushing out a few months ago. These are not excessively long and seem to take the same amount of time as previous work.

Did they change to really short WUs with only ~10 autodock jobs at some point in the last few months and then change back to "longer" tasks?
For information (and only related to the current receptor):

OPNG has been tackling the current target since the end of November 2021. If the tasks I have been receiving form a fair sample of what everyone else was getting, there were no tasks with huge job counts until the 11th of January, and from that point on the only low job counts were for retries!

The new, bigger tasks have lots of ligands with fewer atoms, lower branch counts, or both. That causes the workunit builder to think the individual jobs will run a lot quicker (I wish...), so we get a lot more of them!

As for 10 times as many jobs: true if one takes the extreme cases for both low and high counts! From my relatively small sample of about 2600 tasks for this receptor, I've seen fewer than 100 tasks with a dozen jobs or fewer and fewer than 40 with 100 jobs or more.
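To make the packing effect a bit more concrete, here's a toy sketch. The cost formula, weights and effort budget below are invented purely for illustration; this is not the actual WCG workunit builder:

[code]
# Toy sketch of a cost-based workunit packer -- NOT the actual WCG code.
# Assumption: each ligand's estimated run time grows with its atom and
# branch counts, and ligands (jobs) are added to a workunit until an
# effort budget is used up.

def estimate_cost(atoms, branches):
    return atoms + 5 * branches        # invented weighting

def pack_workunit(ligands, budget=2000):
    """Greedily add ligands until the estimated effort budget is spent."""
    workunit, spent = [], 0
    for atoms, branches in ligands:
        cost = estimate_cost(atoms, branches)
        if workunit and spent + cost > budget:
            break
        workunit.append((atoms, branches))
        spent += cost
    return workunit

small = [(20, 2)] * 500     # small/simple ligands, estimated cost 30 each
large = [(60, 40)] * 500    # big/branchy ligands, estimated cost 260 each
print(len(pack_workunit(small)))   # ~66 jobs end up in one workunit
print(len(pack_workunit(large)))   # ~7 jobs -- roughly the 10x spread described above
[/code]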

If it's of any interest I'll post some basic statistics related to the above (though that's likely to get very long...); I can also go back through previous target receptor data to find other cases of the many versus the few... Whatever, I hope the above helps...

Cheers - Al.
[Jan 16, 2022 3:52:14 AM]
immortal
Cruncher
USA
Joined: Nov 4, 2018
Post Count: 8
Status: Offline
Re: All OPNG tasks erroring on all gpus-all hosts

Something has definitely changed recently. My system had been completing OPNG tasks in under 3 minutes up until 1/9, and since then they have been taking 12-13 minutes.
[Jan 16, 2022 4:27:11 AM]
bluestang
Senior Cruncher
USA
Joined: Oct 1, 2010
Post Count: 272
Status: Offline
Re: All OPNG tasks erroring on all gpus-all hosts

I'll go out on a limb and say it's not just the custom Einstein code you're running but also these newer OPNG WUs, which have 10x more jobs inside them, causing the issue.


Where does this 10x value come from? Looking at the few tasks I completed recently, they only ran ~88 autodock jobs each, which is in line with what they were pushing out a few months ago. These are not excessively long and seem to take the same amount of time as previous work.

Did they change to really short WUs with only ~10 autodock jobs at some point in the last few months and then change back to "longer" tasks?


The jobs inside the WUs have been increased substantially. Go to your Results page here and click on a recently completed WU and on one from a couple of weeks ago, and you'll see the difference in the job count in the file.
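If you'd rather not eyeball the files, a quick throwaway script like the one below can count the entries for you. The marker string and file names are only placeholders (I'm not claiming this matches the exact file layout), so adjust them to whatever the saved workunit files actually contain:

[code]
# Quick-and-dirty job counter for two saved workunit files.
# Assumptions: you've saved the result pages/files locally, and each job
# shows up on its own line containing some recognizable marker string.
# "autodock", "recent_wu.txt" and "older_wu.txt" are placeholders only.

def count_jobs(path, marker="autodock"):
    with open(path, encoding="utf-8", errors="ignore") as f:
        return sum(1 for line in f if marker in line.lower())

print("recent:", count_jobs("recent_wu.txt"), "jobs")
print("older: ", count_jobs("older_wu.txt"), "jobs")
[/code]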
[Jan 16, 2022 5:44:40 AM]
Ian-n-Steve C.
Senior Cruncher
United States
Joined: May 15, 2020
Post Count: 180
Status: Offline
Re: All OPNG tasks erroring on all gpus-all hosts

I did. The tasks I checked had 88 jobs in them. Is that outside of the norm? Because that's about the same number of jobs as early OPNG tasks.

example: [url]https://www.worldcommunitygrid.org/contribution/workunit/116283546[/url]

and referencing some of my older posts, there were times when the WUs had over 1000 jobs... [url]https://www.worldcommunitygrid.org/forums/wcg...d,43386_offset,370#657089[/url]

This one was an anomaly, but 75-100+ jobs was fairly normal from what I remember. 88 jobs is in that same range.
----------------------------------------

EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti
[Jan 16, 2022 2:21:50 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 983
Status: Recently Active
Re: All OPNG tasks erroring on all gpus-all hosts

I did. The tasks I checked had 88 jobs in them. Is that outside of the norm? Because that's about the same number of jobs as early OPNG tasks.

example: [url]https://www.worldcommunitygrid.org/contribution/workunit/116283546[/url]

and referencing some of my older posts, there were times when the WUs had over 1000 jobs... [url]https://www.worldcommunitygrid.org/forums/wcg...d,43386_offset,370#657089[/url]

This one was an anomaly, but 75-100+ jobs was fairly normal from what I remember. 88 jobs is in that same range.
When I posted up-thread, I forgot to include a link to the post by Keith (Uplinger) discussing the work unit packaging algorithm.

Unfortunately, it appears that ligands of a similar size and number of branches tend to be clustered in the data provided for work unit generation. So instead of work units getting a mix of jobs across the range of possible effort required, there are either a few large/complex ligand jobs or lots of small/simple ligand jobs... Then we get the swings in run time and, in some cases, tasks failing because limits are exceeded :-( During the great OPNG job storm last year we saw all sorts of tasks at once, so that tended to hide the swing effect.

I have some monitoring software which collects OPN1/OPNG job parameters per task and can confirm that (at least for my two systems[1]) it appears that tasks are [still] built using that algorithm, or something very similar. It also appears that the majority of tasks for most receptors have had job counts in the lower ranges. That would be expected if jobs of similar potential difficulty tend to get clustered... To date, I've got data for 27108 OPNG tasks (over a variety of target receptors), comprising 986498 jobs. That's an average of just over 36 jobs/task, but the actual distribution is heavily biased towards tasks with a lower job count. However, using the WCG job assessment algorithm to score the individual jobs, over 55% of them scored so low that they would end up in 75+ job tasks if grouped only with their peers!
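As a quick sanity check on the average, plus a made-up 90/10 split (not my real data) just to show how a mean of roughly 36 jobs/task can come from a distribution where most tasks have low counts:

[code]
# Check the average quoted above.
total_jobs, total_tasks = 986498, 27108
print(round(total_jobs / total_tasks, 1))      # -> 36.4, i.e. just over 36 jobs/task

# Made-up illustration: 90 tasks of 15 jobs plus 10 tasks of 150 jobs.
# Most tasks have low counts, yet over half of all jobs sit in the big tasks.
tasks = [15] * 90 + [150] * 10
print(round(sum(t for t in tasks if t >= 75) / sum(tasks), 2))   # -> 0.53
[/code]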

If the scientists were able to tweak the ordering of the ligand selections per batch, this might improve things... However, that may not be as straightforward as it sounds! And there are other issues with the algorithm and [as a side-effect] credit assignment anyway; hopefully, this can all be revisited after the transition to Krembil is completed (see Igor Jurisica's post in response to a thread about low credit...)
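To illustrate why the ordering of the ligand selections matters, here's a toy comparison using the same invented cost model as the sketch further up the thread (again, an assumed model, not the real pipeline): pack a batch where similar ligands arrive clustered together, then pack the same batch shuffled.

[code]
# Toy comparison: clustered vs. shuffled ligand ordering -- invented model,
# not the real WCG pipeline.
import random
random.seed(1)

def estimate_cost(atoms, branches):
    return atoms + 5 * branches            # same invented weighting as before

def pack_all(ligands, budget=2000):
    """Split the whole batch into consecutive workunits under an effort budget;
    return the number of jobs in each workunit."""
    counts, current, spent = [], 0, 0
    for atoms, branches in ligands:
        cost = estimate_cost(atoms, branches)
        if current and spent + cost > budget:
            counts.append(current)
            current, spent = 0, 0
        current += 1
        spent += cost
    if current:
        counts.append(current)
    return counts

clustered = [(20, 2)] * 300 + [(60, 40)] * 300   # similar ligands arrive together
mixed = clustered[:]
random.shuffle(mixed)

print(min(pack_all(clustered)), max(pack_all(clustered)))  # huge spread in jobs per WU
print(min(pack_all(mixed)), max(pack_all(mixed)))          # counts bunch together
[/code]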

Cheers - Al.

[1] I only have a single 1050 Ti and a single 1660 Ti in a pair of Linux machines and don't run anything to try to maintain a continuous stream of OPNG tasks, so the statistics I see won't have as many of the really extreme cases as might show up for a big hitter!
[Jan 17, 2022 2:37:49 AM]