World Community Grid Forums
Thread Status: Active | Total posts in this thread: 36
bluestang
Senior Cruncher, USA | Joined: Oct 1, 2010 | Post Count: 272 | Status: Offline
Might have something to do with these new WUs that have significantly more jobs inside them. You may have seen runtimes increase to 3-6x longer than before.
Not sure how many concurrent tasks you are running, but might that be crashing your GPU now with these new WUs, thus causing the error? Either way, these new WUs just suck!
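If over-committing the GPU is the trigger bluestang suggests, the usual BOINC-side mitigation is to cap concurrency with an app_config.xml in the World Community Grid project directory. The sketch below is an illustration, not official WCG guidance, and the short app name "opng" is an assumption; check the actual app names in client_state.xml before relying on it.

```xml
<!-- Hypothetical app_config.xml sketch for the WCG project directory.
     The app name "opng" is an assumption; verify it in client_state.xml.
     Apply with "Options > Read config files" or by restarting the client. -->
<app_config>
  <app>
    <name>opng</name>
    <!-- run at most one OPNG task at a time -->
    <max_concurrent>1</max_concurrent>
    <gpu_versions>
      <!-- one task per GPU, with a full CPU core reserved to feed it -->
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```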
Richard Haselgrove
Senior Cruncher, United Kingdom | Joined: Feb 19, 2021 | Post Count: 360 | Status: Offline
> Richard, what the heck is a -255 error? I looked at the old BOINC error list page and it is not one of the mentioned ones.

I don't recognise it from BOINC either, and a quick Google didn't throw up anything that seemed relevant (except an example from SETI@Home in 2006...!)

Your logs all end abruptly, after starting AutoDock and identifying the device. Taking a valid result from one of my own machines, it ends:

```
INFO:[08:29:44] Start AutoDock for OB3ZINC000901198554_1--6y84_005_nter-missing-thr199rot--LYS137_inert.dpf(Job #85)...
OpenCL device: NVIDIA GeForce GTX 1660 SUPER
INFO:[08:29:52] End AutoDock...
INFO:Cpu time = 621.383183
08:29:55 (15956): called boinc_finish(0)
```

So yours end within an AutoDock run (although many runs have finished normally by then). The 255 error is probably generated by AutoDock. I'll keep scratching my head about it.
Richard Haselgrove
Senior Cruncher, United Kingdom | Joined: Feb 19, 2021 | Post Count: 360 | Status: Offline
> Richard, what the heck is a -255 error? I looked at the old BOINC error list page and it is not one of the mentioned ones.

I've had a quick look through the sources, and I can confirm that we're both right - 255 doesn't occur in BOINC, either as an exit code or an error number.

I've also looked at the autodock-vina sources on GitHub, and it doesn't appear in there either. There are plenty of abnormal end-cases, but they all print a text description and exit with EXIT_FAILURE. That isn't defined locally, so it must be the standard language code, which is conventionally 1. Not 255.

So that leaves us with the actual executable program being run, which is "wcgrid_opng_autodockgpu", with various version/platform/device specifiers on the end. I don't think the specific wcgrid modifications to the underlying AutoDock code are open source? AutoDock itself is under an Apache open-source licence, but there's no separate licence file in the derivative version distributed here (and I don't think Apache requires one). We might need input from an IBM or Krembil techie on that.

[Edit 1 times, last edit by Richard Haselgrove at Jan 13, 2022 12:38:26 PM]
PMH_UK
Veteran Cruncher, UK | Joined: Apr 26, 2007 | Post Count: 774 | Status: Offline
My understanding below (I was a mainframe coder, but not in C).

The message is "process exited with code 255 (0xff, -1)", so the hex value is FF; as a number it is either 255 if treated as an unsigned integer, or -1 if signed. -1 is often used as a (non-specific) error value in code or functions.

This likely means an issue not specifically coded for. Possibly a bad return from a lower-level function such as memory allocation or free.

Paul.
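To make Paul's arithmetic concrete, here is a minimal sketch (my own illustration, nothing from the WCG or AutoDock code) of why a process that exits with -1 is reported as 255: on POSIX systems the kernel keeps only the low 8 bits of the exit status, and 0xFF reads back as 255 when treated as unsigned.

```python
# Illustrative only - demonstrates 8-bit exit-status truncation.
import subprocess
import sys

# Spawn a child that calls the raw C-level _exit(-1).
child = subprocess.run([sys.executable, "-c", "import os; os._exit(-1)"])

# The OS keeps only the low byte of the status, so -1 (0xFF)
# is reported as 255. On Linux this prints: 255
print(child.returncode)
```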
Keith Myers
Senior Cruncher, USA | Joined: Apr 6, 2021 | Post Count: 193 | Status: Offline
The tasks all appear to run correctly for some time until they quit. Then some other task from another project takes their place. No GPU has ever fallen off the bus, and they all continue to run normally with other projects' work.
I have 32 GB of RAM in each of three hosts and 128 GB of RAM in each of two other hosts. There shouldn't be any RAM pressure even with these supposedly bigger work units. I have enough to run the Python tasks from GPUGrid easily.

So, without any clue as to what is wrong with all of my hosts ONLY with WCG, I have suspended the project. No point in wasting bandwidth or time on tasks that all fail.

A proud member of the OFA (Old Farts Association)

[Edit 1 times, last edit by Keith Myers at Jan 13, 2022 5:10:23 PM]
Keith Myers
Senior Cruncher, USA | Joined: Apr 6, 2021 | Post Count: 193 | Status: Offline
I also want to point out that the great majority of my errored tasks have a wingman error with the same -255 error.
Case in point: https://www.worldcommunitygrid.org/contribution/results/186982635/log

A proud member of the OFA (Old Farts Association)
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 986 | Status: Recently Active
I'm with Richard on this one! I reckon it's an issue somewhere in the wrapper code (either checkpoint management or the set-up/tear-down of individual GPU sessions...) And my reasoning? As mentioned further up-thread, these tasks are much larger in terms of the amount of output produced, and in the intensity of said output. On one of my systems I can actually see the effects of this on overall BOINC agent behaviour! Thus...
I only have two systems which run GPU tasks - both are on Ubuntu 20.04 and have reasonably up-to-date approved NVIDIA drivers (470.86 at present). On both systems there are other things going on, including monitoring the arrival and completion of BOINC tasks(!), so I don't let BOINC have too many cores...

My i7-7700 + GTX 1050 Ti has a usual work mix of two ARP1, two MCM1, one OPN1 and a non-WCG GPU job; it just drops one of the WCG CPU tasks, then handles these without seeming in the least bit concerned (though disk activity does increase somewhat...).

My Ryzen 3700X + GTX 1660 Ti has a usual work mix of three ARP1, five MCM1, one OPN1 and a non-WCG GPU task; it will try to run two OPNG at once if it has them. I've got one spare core allocated to WCG to allow for the first one, and the second one kicks out a CPU task... And that's when the fun starts! -- once I have two OPNG tasks running, the disk activity light comes on and stays on!!! And if I look at disk I/O stats it isn't kidding! It seems to be the same whether it has kicked out an ARP1 task or not -- ARP1 is a bit wordy too, but nothing like OPN1/OPNG on short jobs!

Now you might be thinking "So what? It does a lot of I/O!" But it definitely chokes up BOINC task wrap-up; my monitoring scripts allow about 90 seconds from the completion entry appearing in the project's job log to find the stderr section in the results block for the task in client_state.xml (there's a sketch of that polling idea after this post), and there are occasions where that gives up and [if I notice in time] I have to get the data off the web site instead. The worst case of this that I've seen (during the OPNG soak test in 2021) actually resulted in a "Finish file not found" message using a recent client! (Didn't they stretch the time for that to something like 5 or 10 minutes?)

It only seems to have these problems when OPNG is running (or when I used to have lots of OPN1 running - same bulk I/O volumes?...) I don't know how much influence the storage medium might have on the sluggishness - I'm not using SSDs for BOINC storage...

Now, if there are any watchdog threads associated with OPNG and something holds up progress for too long, maybe the entire process gets killed without the benefit of any error messages!

This is speculative, based on my experience of sluggish BOINC agent work with two OPNG tasks, and the fact that lots of folks with lesser equipment (and less time stress per job) don't seem to be having problems at all (and are saying so!), whilst I suspect that the people being bitten have the systems most likely to overtax the wrapper/agent code (e.g. multiple GPUs, lots of tasks running at once. Keith's case in point had a powerful GPU - I wonder how many tasks that user is running at once?!?)

Anyway, without access to the WCG OPNG-specific BOINC interface and task management code, we are left guessing (and I suspect that unless they start getting workunits lost because no-one can run them, there won't be any effort put into sorting it out at the moment).

I notice that Keith has posted a plea for help in the GPU Support Forum. Worth a try!!! In another, older, thread in there, adriverhoef has recently posted about a different issue with these long jobs, and I remarked that I've tried to use "Contact us" to ask them to put a ceiling on the number of jobs in a task (as they did for OPN1!) -- I have no idea whether contact was successful...

So now we wait... :-(

Cheers - Al.
P.S. PMH_UK is right about the 8-bit 255 error code (though well-written code should be able to cope with memory management issues and at least fail gracefully!) It's [also] a generic O/S response for "what the heck happened" cases, unrecognised error states, et cetera, and could well be the result of a forced termination not done via a signal...

[Edit - added clarification about when I lose task monitoring output...]

[Edit 1 times, last edit by alanb1951 at Jan 13, 2022 11:13:22 PM]
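The polling idea mentioned above could look something like the sketch below. This is a hypothetical reconstruction, not alanb1951's actual script: the function name and timing parameters are my assumptions; only client_state.xml and its result/name/stderr_txt elements come from the standard BOINC client state file.

```python
# Hypothetical sketch of the kind of monitoring alanb1951 describes.
# Poll the BOINC client state file for up to ~90 s, waiting for the
# stderr text of a named result to appear after the task completes.
import time
import xml.etree.ElementTree as ET

def wait_for_stderr(result_name, state_file="client_state.xml",
                    timeout=90, poll_interval=5):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            root = ET.parse(state_file).getroot()
        except ET.ParseError:
            time.sleep(poll_interval)  # the client may be mid-rewrite
            continue
        for result in root.iter("result"):
            if result.findtext("name") == result_name:
                stderr = result.findtext("stderr_txt")
                if stderr:  # appears once the task has wrapped up
                    return stderr
        time.sleep(poll_interval)
    return None  # gave up - fetch the result log from the website instead
```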
knreed
Former World Community Grid Tech | Joined: Nov 8, 2004 | Post Count: 4504 | Status: Offline
After reviewing your results and the other results returned for the workunits you have run, I don't see any common reason why yours are failing. Over the past 12 hours we have had around 70,000 results returned, and 97.2% of them ran correctly. This is a normal level, and it indicates that the application and the current set of data are running in a healthy state (especially for GPU apps, which tend to have more errors). I've also checked, and we are not seeing individual workunits failing.

It is very likely that something changed on your machine - my guess would be the graphics driver (or the inter-operation of it and the kernel). I realize that your computer is running GPU apps from other projects successfully, but there is nothing on our side that points to why your specific machine is suddenly having issues.
Keith Myers
Senior Cruncher, USA | Joined: Apr 6, 2021 | Post Count: 193 | Status: Offline
Alan, on the question of how loaded my systems are... all of them are running at 90% or less. I always keep 2-4 cores free from BOINC use for system maintenance activity.
I have never noticed any sluggishness in the UI on any of my hosts. Storage is on M.2 PCIe SSDs, none of which is approaching 25% of its allocated storage space.

I too notice the OPNG tasks hitting the storage activity LED more actively than previous tasks.

A proud member of the OFA (Old Farts Association)
Keith Myers
Senior Cruncher, USA | Joined: Apr 6, 2021 | Post Count: 193 | Status: Offline
All the GPUs have been on the default Nvidia LTS 470 driver for months, with no issues on any of my projects, including WCG, until just recently.
The kernel has been updated often with all the kernel security updates. I have looked at the failed tasks' wingmen, and I see cases of similar 255 failures both with the same kernel I am running and with other kernels. I have also seen wingmen validate my tasks using the same kernel as I am using.

I don't see any resolution to this issue unless I back-level about 4-5 kernel updates, to when I was successfully running the tasks, and I am not invested enough in WCG to compromise kernel security.

So unless someone can brilliantly figure out what is going on, or can duplicate my error case, I'll just suspend the project, check back in down the road, and run a few test tasks to see whether whatever is causing the issue on all my hosts has been remedied.

A proud member of the OFA (Old Farts Association)