World Community Grid Forums
Thread Status: Active | Total posts in this thread: 36
bluestang
Senior Cruncher, USA | Joined: Oct 1, 2010 | Post Count: 272 | Status: Offline
Might have something to do with these new WUs that have significantly more jobs inside them. You may have seen runtimes increase to 3-6x longer than before.
Not sure how many concurrent tasks you are running, but might that be crashing your GPU now with these new WUs, thus causing the error? Either way, these new WUs just suck!
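If over-committing the GPU is the trigger bluestang suggests, the usual BOINC-side mitigation is to cap concurrency with an app_config.xml in the World Community Grid project directory. The sketch below is an illustration, not official WCG guidance, and the short app name "opng" is an assumption; check the actual app names in client_state.xml before relying on it.

```xml
<!-- Hypothetical app_config.xml sketch for the WCG project directory.
     The app name "opng" is an assumption; verify it in client_state.xml.
     Apply with "Options > Read config files" or by restarting the client. -->
<app_config>
  <app>
    <name>opng</name>
    <!-- run at most one OPNG task at a time -->
    <max_concurrent>1</max_concurrent>
    <gpu_versions>
      <!-- one task per GPU, with a full CPU core reserved to feed it -->
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```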
Richard Haselgrove
Senior Cruncher, United Kingdom | Joined: Feb 19, 2021 | Post Count: 360 | Status: Offline
> Richard, what the heck is a -255 error? I looked at the old BOINC error list page and it is not one of the mentioned ones.

I don't recognise it from BOINC either, and a quick Google didn't throw up anything that seemed relevant (except an example from SETI@Home in 2006...!)

Your logs all end abruptly, after starting AutoDock and identifying the device. Taking a valid result from one of my own machines, it ends:

```
INFO:[08:29:44] Start AutoDock for OB3ZINC000901198554_1--6y84_005_nter-missing-thr199rot--LYS137_inert.dpf(Job #85)...
OpenCL device: NVIDIA GeForce GTX 1660 SUPER
INFO:[08:29:52] End AutoDock...
INFO:Cpu time = 621.383183
08:29:55 (15956): called boinc_finish(0)
```

So yours end within an AutoDock run (although many runs have finished normally by then). The 255 error is probably generated by AutoDock. I'll keep scratching my head about it.
Richard Haselgrove
Senior Cruncher, United Kingdom | Joined: Feb 19, 2021 | Post Count: 360 | Status: Offline
> Richard, what the heck is a -255 error? I looked at the old BOINC error list page and it is not one of the mentioned ones.

I've had a quick look through the sources, and I can confirm that we're both right - 255 doesn't occur in BOINC, either as an exit code or an error number.

I've also looked at the autodock-vina sources on GitHub, and it doesn't appear in there either. There are plenty of abnormal end-cases, but they all print a text description and exit with EXIT_FAILURE. That isn't defined locally, so it must be the standard language code, which is conventionally 1. Not 255.

So that leaves us with the actual executable program being run, which is "wcgrid_opng_autodockgpu", with various version/platform/device specifiers on the end. I don't think the specific wcgrid modifications to the underlying AutoDock code are open source? AutoDock itself is under an Apache open-source licence, but there's no separate licence file in the derivative version distributed here (and I don't think Apache requires one). We might need input from an IBM or Krembil techie on that.

[Edit 1 times, last edit by Richard Haselgrove at Jan 13, 2022 12:38:26 PM]
PMH_UK
Veteran Cruncher, UK | Joined: Apr 26, 2007 | Post Count: 774 | Status: Offline
My understanding below (I was a mainframe coder, but not in C).

The message is "process exited with code 255 (0xff, -1)", so the hex value is FF; as a number it is either 255 if treated as an unsigned integer, or -1 if signed. -1 is often used as a (non-specific) error value in code or functions.

This likely means an issue not specifically coded for. Possibly a bad return from a lower-level function such as memory allocation or free.

Paul.
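To make Paul's arithmetic concrete, here is a minimal sketch (my own illustration, nothing from the WCG or AutoDock code) of why a process that exits with -1 is reported as 255: on POSIX systems the kernel keeps only the low 8 bits of the exit status, and 0xFF reads back as 255 when treated as unsigned.

```python
# Illustrative only - demonstrates 8-bit exit-status truncation.
import subprocess
import sys

# Spawn a child that calls the raw C-level _exit(-1).
child = subprocess.run([sys.executable, "-c", "import os; os._exit(-1)"])

# The OS keeps only the low byte of the status, so -1 (0xFF)
# is reported as 255. On Linux this prints: 255
print(child.returncode)
```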
Keith Myers
Senior Cruncher, USA | Joined: Apr 6, 2021 | Post Count: 193 | Status: Offline
The tasks all appear to run correctly for some time until they quit. Then some other task from another project takes their place. No GPU has ever fallen off the bus, and they all continue to run normally with other projects' work.
I have 32 GB of RAM in each of three hosts and 128 GB of RAM in each of two other hosts. There shouldn't be any RAM pressure even with these supposedly bigger work units. I have enough to run the Python tasks from GPUGrid easily.

So, without any clue as to what is wrong with all of my hosts ONLY with WCG, I have suspended the project. No point in wasting bandwidth or time on tasks that all fail.

A proud member of the OFA (Old Farts Association)

[Edit 1 times, last edit by Keith Myers at Jan 13, 2022 5:10:23 PM]
Keith Myers
Senior Cruncher, USA | Joined: Apr 6, 2021 | Post Count: 193 | Status: Offline
I also want to point out that the great majority of my errored tasks have a wingman error with the same -255 error.
Case in point: https://www.worldcommunitygrid.org/contribution/results/186982635/log

A proud member of the OFA (Old Farts Association)
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 986 | Status: Recently Active
I'm with Richard on this one! I reckon it's an issue somewhere in the wrapper code (either checkpoint management or the set-up/tear-down of individual GPU sessions...) And my reasoning? As mentioned further up-thread, these tasks are much larger in terms of the amount of output produced, and in the intensity of said output. On one of my systems I can actually see the effects of this on overall BOINC agent behaviour! Thus...
I only have two systems which run GPU tasks - both are on Ubuntu 20.04 and have reasonably up-to-date approved NVIDIA drivers (470.86 at present). On both systems there are other things going on, including monitoring the arrival and completion of BOINC tasks(!), so I don't let BOINC have too many cores...

My i7-7700 + GTX 1050 Ti has a usual work mix of two ARP1, two MCM1, one OPN1 and a non-WCG GPU job; it just drops one of the WCG CPU tasks, then handles these without seeming in the least bit concerned (though disk activity does increase somewhat...).

My Ryzen 3700X + GTX 1660 Ti has a usual work mix of three ARP1, five MCM1, one OPN1 and a non-WCG GPU task; it will try to run two OPNG at once if it has them. I've got one spare core allocated to WCG to allow for the first one, and the second one kicks out a CPU task... And that's when the fun starts! -- once I have two OPNG tasks running, the disk activity light comes on and stays on!!! And if I look at disk I/O stats it isn't kidding! It seems to be the same whether it has kicked out an ARP1 task or not -- ARP1 is a bit wordy too, but nothing like OPN1/OPNG on short jobs!

Now you might be thinking "So what? It does a lot of I/O!" But it definitely chokes up BOINC task wrap-up; my monitoring scripts allow about 90 seconds from the completion entry appearing in the project's job log to find the stderr section in the results block for the task in client_state.xml (there's a sketch of that polling idea after this post), and there are occasions where that gives up and [if I notice in time] I have to get the data off the web site instead. The worst case of this that I've seen (during the OPNG soak test in 2021) actually resulted in a "Finish file not found" message using a recent client! (Didn't they stretch the time for that to something like 5 or 10 minutes?)

It only seems to have these problems when OPNG is running (or when I used to have lots of OPN1 running - same bulk I/O volumes?...) I don't know how much influence the storage medium might have on the sluggishness - I'm not using SSDs for BOINC storage...

Now, if there are any watchdog threads associated with OPNG and something holds up progress for too long, maybe the entire process gets killed without the benefit of any error messages!

This is speculative, based on my experience of sluggish BOINC agent work with two OPNG tasks, and the fact that lots of folks with lesser equipment (and less time stress per job) don't seem to be having problems at all (and are saying so!), whilst I suspect that the people being bitten have the systems most likely to overtax the wrapper/agent code (e.g. multiple GPUs, lots of tasks running at once. Keith's case in point had a powerful GPU - I wonder how many tasks that user is running at once?!?)

Anyway, without access to the WCG OPNG-specific BOINC interface and task management code, we are left guessing (and I suspect that unless they start getting workunits lost because no-one can run them, there won't be any effort put into sorting it out at the moment).

I notice that Keith has posted a plea for help in the GPU Support Forum. Worth a try!!! In another, older, thread in there, adriverhoef has recently posted about a different issue with these long jobs, and I remarked that I've tried to use "Contact us" to ask them to put a ceiling on the number of jobs in a task (as they did for OPN1!) -- I have no idea whether contact was successful...

So now we wait... :-(

Cheers - Al.
P.S. PMH_UK is right about the 8-bit 255 error code (though well-written code should be able to cope with memory management issues and at least fail gracefully!) It's [also] a generic O/S response for "what the heck happened" cases, unrecognised error states, et cetera, and could well be the result of a forced termination not done via a signal...

[Edit - added clarification about when I lose task monitoring output...]

[Edit 1 times, last edit by alanb1951 at Jan 13, 2022 11:13:22 PM]
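The polling idea mentioned above could look something like the sketch below. This is a hypothetical reconstruction, not alanb1951's actual script: the function name and timing parameters are my assumptions; only client_state.xml and its result/name/stderr_txt elements come from the standard BOINC client state file.

```python
# Hypothetical sketch of the kind of monitoring alanb1951 describes.
# Poll the BOINC client state file for up to ~90 s, waiting for the
# stderr text of a named result to appear after the task completes.
import time
import xml.etree.ElementTree as ET

def wait_for_stderr(result_name, state_file="client_state.xml",
                    timeout=90, poll_interval=5):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            root = ET.parse(state_file).getroot()
        except ET.ParseError:
            time.sleep(poll_interval)  # the client may be mid-rewrite
            continue
        for result in root.iter("result"):
            if result.findtext("name") == result_name:
                stderr = result.findtext("stderr_txt")
                if stderr:  # appears once the task has wrapped up
                    return stderr
        time.sleep(poll_interval)
    return None  # gave up - fetch the result log from the website instead
```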
knreed
Former World Community Grid Tech | Joined: Nov 8, 2004 | Post Count: 4504 | Status: Offline
After reviewing your results and the other results returned for the workunits you have run, I don't see any common reason why yours are failing. Over the past 12 hours we have had around 70,000 results returned, and 97.2% of them ran correctly. This is a normal level, and it indicates that the application and the current set of data are running in a healthy state (especially for GPU apps, which tend to have more errors). I've also checked, and we are not seeing individual workunits failing.

It is very likely that something changed on your machine - my guess would be the graphics driver (or the inter-operation of it and the kernel). I realize that your computer is running GPU apps from other projects successfully, but there is nothing on our side that points to why your specific machine is suddenly having issues.
Keith Myers
Senior Cruncher, USA | Joined: Apr 6, 2021 | Post Count: 193 | Status: Offline
Alan, on the question of how loaded my systems are... all of them are running at 90% or less. I always keep 2-4 cores free from BOINC use for system maintenance activity.
I have never noticed any sluggishness in the UI on any of my hosts. Storage is on M.2 PCIe SSDs, none of which is approaching 25% of its allocated storage space.

I too notice the OPNG tasks hitting the storage activity LED more actively than previous tasks.

A proud member of the OFA (Old Farts Association)
Keith Myers
Senior Cruncher, USA | Joined: Apr 6, 2021 | Post Count: 193 | Status: Offline
All the GPUs have been on the default Nvidia LTS 470 driver for months, with no issues on any of my projects, including WCG, until just recently.
The kernel has been updated often with all the kernel security updates. I have looked at the failed tasks' wingmen, and I see cases of similar 255 failures both with the same kernel I am running and with other kernels. I have also seen wingmen validate my tasks using the same kernel as I am using.

I don't see any resolution to this issue unless I back-level about 4-5 kernel updates, to when I was successfully running the tasks, and I am not invested enough in WCG to compromise kernel security.

So unless someone can brilliantly figure out what is going on, or can duplicate my error case, I'll just suspend the project, check back in down the road, and run a few test tasks to see whether whatever is causing the issue on all my hosts has been remedied.

A proud member of the OFA (Old Farts Association)