World Community Grid Forums
Thread Status: Active Total posts in this thread: 14
coolstream
Senior Cruncher SCOTLAND Joined: Nov 8, 2005 Post Count: 475 Status: Offline |
Several of the latest batch of GPU units appear to be sticking on one of my machines.
X0900075400xxxxxxxxxxxxxx_x: these units have now been stuck for over two hours. One of them is at 100% progress. Previously, I have been running 3 GPUs without any problems in that machine (each GPU with 0.333 allocation). Another almost identical machine with the same configuration is running the new batch without a problem.
Any suggestions? Should I just bite the bullet and abort the offending WUs?
...a couple of examples
EDIT: Post title changed

Application: Help Conquer Cancer 7.05 (ati_hcc1)
Workunit name: X0900075400201200609081303
State: Running
Received: 13/11/2012 15:16:00
Report deadline: 20/11/2012 15:16:01
Estimated app speed: 13.35 GFLOPs/sec
Estimated task size: 13'107 GFLOPs
Resources: 1 CPUs + 0.333 ATI GPUs (device 1)
CPU time at last checkpoint: 00:00:00
CPU time: 02:12:08
Elapsed time: 02:12:17
Estimated time remaining: --
Fraction done: 0.000%
Virtual memory size: 73.81 MB
Working set size: 32.04 MB
Directory: slots/8
Process ID: 5396

Application: Help Conquer Cancer 7.05 (ati_hcc1)
Workunit name: X0900075400202200609081303
State: Running
Received: 13/11/2012 15:16:00
Report deadline: 20/11/2012 15:16:00
Estimated app speed: 13.35 GFLOPs/sec
Estimated task size: 13'107 GFLOPs
Resources: 1 CPUs + 0.333 ATI GPUs (device 0)
CPU time at last checkpoint: 00:00:00
CPU time: 02:16:41
Elapsed time: 02:17:45
Estimated time remaining: 00:19:02
Fraction done: 16.569%
Virtual memory size: 133.85 MB
Working set size: 86.48 MB
Directory: slots/3
Process ID: 1404
----------------------------------------
Crunching in memory of my Mum PEGGY, cousin ROPPA and Aunt AUDREY.
[Edit 1 times, last edit by coolstream at Nov 13, 2012 7:30:28 PM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
GPU units that normally take ~5 minutes, stuck for 2 hours... Plank them, i.e. abort them [after taking a copy of the slot files]. ;o
----------------------------------------
[Edit 1 times, last edit by Former Member at Nov 13, 2012 7:29:31 PM] |
||
|
coolstream
Senior Cruncher SCOTLAND Joined: Nov 8, 2005 Post Count: 475 Status: Offline |
Thanks, Rob.
Not sure what you mean by 'after taking copy of the slot files'. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
In the BOINC data dir each task has an assigned numeric slot. Match the slot to the result name that has a problem and make a copy of the files in there; stderr.txt is of particular interest. [Just in case it becomes a recurring event rather than an incidental one, and the techs develop an interest.] After copying, push the task over the edge, i.e. abort it.
(My octo has 24 slots set at the moment, many of them pre-empted, so it can take a little digging to find the right slot.) |
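The slot-copy step described above can be sketched in Python. This is only an illustration under stated assumptions, not an official BOINC tool: it assumes the data directory contains a `slots` folder with one numbered subfolder per task (as in the `slots/8` and `slots/3` entries in the task properties), and that at least one file inside the slot mentions the result name as plain text. The function names `find_slots` and `backup_slot` are hypothetical helpers made up for this example.

```python
import shutil
import time
from pathlib import Path


def find_slots(data_dir, name_fragment):
    """Return slot directories in which any file mentions name_fragment.

    Assumption: some file in the slot contains the result/workunit
    name as plain text. Files that cannot be read are skipped.
    """
    needle = name_fragment.encode()
    matches = []
    slots_root = Path(data_dir) / "slots"
    for slot in sorted(slots_root.iterdir(), key=lambda p: p.name):
        if not slot.is_dir():
            continue
        for entry in slot.iterdir():
            try:
                if entry.is_file() and needle in entry.read_bytes():
                    matches.append(slot)
                    break
            except OSError:
                continue  # e.g. a file locked by the running task
    return matches


def backup_slot(slot_dir, backup_root):
    """Copy one slot directory (stderr.txt included) to a timestamped folder."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"slot{Path(slot_dir).name}-{stamp}"
    shutil.copytree(slot_dir, dest)
    return dest


# Example use (both paths are placeholders for your own locations):
#   for slot in find_slots("path/to/BOINC/data", "X0900075400201200609081303"):
#       print("saved to", backup_slot(slot, "path/to/backups"))
```

Copying may fail for files that the running task holds open, so suspending the task before copying is probably the safer order.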
||
|
coolstream
Senior Cruncher SCOTLAND Joined: Nov 8, 2005 Post Count: 475 Status: Offline |
Thanks again. Tthrottle proved useful in quickly identifying the relevant slots.
All of the offending folders have now been saved. Is there anywhere I can upload them to, or should I just wait for a request from admins? |
||
|
nanoprobe
Master Cruncher Classified Joined: Aug 29, 2008 Post Count: 2998 Status: Offline |
@coolstream. Instead of aborting, try to resume or restart them to see if they will complete. That may give the techs some added info on what happened. If they still get stuck, then aborting may be the only option.
----------------------------------------
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.
||
|
coolstream
Senior Cruncher SCOTLAND Joined: Nov 8, 2005 Post Count: 475 Status: Offline |
OK will do.
They seem to be pretty unresponsive. One I aborted was stuck at 0%. |
||
|
nanoprobe
Master Cruncher Classified Joined: Aug 29, 2008 Post Count: 2998 Status: Offline |
Quote: OK will do. They seem to be pretty unresponsive. One I aborted was stuck at 0%
What driver version are you running?
||
|
coolstream
Senior Cruncher SCOTLAND Joined: Nov 8, 2005 Post Count: 475 Status: Offline |
Discovered another batch of six with abnormal times (again one stuck at 0%). Put the GPU into snooze and captured the slots as suggested by SekeRob. Came out of snooze and ALL units started from 0%.
All six have now completed without a problem!
On another machine, I also found one stuck unit and did a PAUSE and RESUME, which restarted the unit from 0%; it then completed without a problem. (Relevant slot details saved and available if required.)
I have no idea what is causing them to stick, but I know that I have lost over 36 hours of processing due to this today. If I find more, I will continue to pause and resume them.
Does anyone have a rule for BoincTasks that will send an email alert for stuck units? |
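For illustration only, the kind of check such an alert rule would need can be sketched in Python. This is not a BoincTasks rule and makes no claim about BoincTasks' rule syntax; it is a simple heuristic, assuming a normal runtime of about 5 minutes per unit (the figure quoted earlier in this thread) and a made-up threshold factor of 6. The names `parse_hms` and `looks_stuck` are hypothetical, and the elapsed times are the two examples from the first post. Sending the email itself is not shown.

```python
from datetime import timedelta


def parse_hms(text):
    """Convert 'HH:MM:SS' (as shown in the task properties) to a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def looks_stuck(elapsed, typical=timedelta(minutes=5), factor=6):
    """Flag a task whose elapsed time is far beyond the typical runtime.

    Assumptions: 'typical' is the ~5 minutes quoted in the thread and
    'factor' is an arbitrary safety margin; tune both for your own setup.
    """
    return parse_hms(elapsed) > typical * factor


# Elapsed times taken from the two task listings in the first post.
tasks = [
    ("X0900075400201200609081303", "02:12:17"),
    ("X0900075400202200609081303", "02:17:45"),
]

stuck = [name for name, elapsed in tasks if looks_stuck(elapsed)]
```

With several units sharing one GPU, normal runtimes will be longer than a single unit's, so the threshold should be set generously to avoid false alarms.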
||
|
coolstream
Senior Cruncher SCOTLAND Joined: Nov 8, 2005 Post Count: 475 Status: Offline |
Quote: What driver version are you running?
nanoprobe, I'm using ATI 12.10 on BOINC 7.0.36. |
||