Thread Status: Active
Total posts in this thread: 14
Posts: 14   Pages: 2
This topic has been viewed 1870 times and has 13 replies.
coolstream
Senior Cruncher
SCOTLAND
Joined: Nov 8, 2005
Post Count: 475
X0900075400xxxxxxxxxxxxxx_x GPU units sticking

Several units from the latest batch of GPU work units appear to be sticking on one of my machines:

X0900075400xxxxxxxxxxxxxx_x

These units have now been stuck for over two hours. One of them is at 100% progress.

Previously, I have been running 3 GPUs without any problems in that machine (each GPU with 0.333 allocation). Another almost identical machine with the same configuration is running the new batch without a problem.

Any suggestions? Should I just bite the bullet and abort the offending WUs?

A couple of examples:

EDIT: Post title changed
Application Help Conquer Cancer 7.05 (ati_hcc1)
Workunit name X0900075400201200609081303
State Running
Received 13/11/2012 15:16:00
Report deadline 20/11/2012 15:16:01
Estimated app speed 13.35 GFLOPs/sec
Estimated task size 13'107 GFLOPs
Resources 1 CPUs + 0.333 ATI GPUs (device 1)
CPU time at last checkpoint 00:00:00
CPU time 02:12:08
Elapsed time 02:12:17
Estimated time remaining --
Fraction done 0.000%
Virtual memory size 73.81 MB
Working set size 32.04 MB
Directory slots/8
Process ID 5396

Application Help Conquer Cancer 7.05 (ati_hcc1)
Workunit name X0900075400202200609081303
State Running
Received 13/11/2012 15:16:00
Report deadline 20/11/2012 15:16:00
Estimated app speed 13.35 GFLOPs/sec
Estimated task size 13'107 GFLOPs
Resources 1 CPUs + 0.333 ATI GPUs (device 0)
CPU time at last checkpoint 00:00:00
CPU time 02:16:41
Elapsed time 02:17:45
Estimated time remaining 00:19:02
Fraction done 16.569%
Virtual memory size 133.85 MB
Working set size 86.48 MB
Directory slots/3
Process ID 1404
----------------------------------------

Crunching in memory of my Mum PEGGY, cousin ROPPA and Aunt AUDREY.
----------------------------------------
[Edit 1 times, last edit by coolstream at Nov 13, 2012 7:30:28 PM]
[Nov 13, 2012 7:22:10 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: GPU units sticking

GPU units that normally take ~5 minutes, stuck for 2 hours... Plank them [after taking copy of the slot files]. ;o
----------------------------------------
[Edit 1 times, last edit by Former Member at Nov 13, 2012 7:29:31 PM]
[Nov 13, 2012 7:28:49 PM]
coolstream
Senior Cruncher
SCOTLAND
Joined: Nov 8, 2005
Post Count: 475
Re: GPU units sticking

Thanks, Rob.

I'm not sure what you mean by 'after taking copy of the slot files'.
[Nov 13, 2012 7:32:17 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Re: GPU units sticking

In the BOINC data directory each running task is assigned a numbered slot folder. Match the slot to the result name that has the problem and make a copy of the files in it; stderr.txt is of particular interest. [Just in case it becomes a recurring event rather than an incidental one and the techs develop an interest.] After copying, push the task over the edge.

(My octo-core currently has 24 slots set, many of them pre-empted, so it can take a little digging to find the right slot.)
[Nov 13, 2012 7:46:32 PM]
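A minimal Python sketch of the slot hunt described above. It assumes the BOINC client writes an `init_data.xml` file into each `slots/N` folder containing a `<result_name>` element, and that a result name is the workunit name plus an `_N` suffix (as in the thread title). `find_slot` and `copy_slot` are hypothetical helpers, not part of BOINC or TThrottle. Copying while a task is running may hit locked files on some systems, so snoozing first may help.

```python
import re
import shutil
from pathlib import Path


def find_slot(boinc_data_dir, name):
    """Return the slot folder running `name`, or None if no slot matches.

    `name` may be a full result name or a workunit name; a result name is
    assumed (not guaranteed) to be the workunit name plus an `_N` suffix.
    """
    slots_dir = Path(boinc_data_dir) / "slots"
    # Visit numbered slots in numeric order (0, 1, 2, ... 24).
    for slot in sorted(slots_dir.iterdir(),
                       key=lambda p: int(p.name) if p.name.isdigit() else -1):
        init_file = slot / "init_data.xml"
        if not init_file.is_file():
            continue  # empty slot or stray file
        match = re.search(r"<result_name>\s*(.*?)\s*</result_name>",
                          init_file.read_text(errors="replace"))
        if match and (match.group(1) == name
                      or match.group(1).startswith(name + "_")):
            return slot
    return None


def copy_slot(slot, dest_root):
    """Copy the whole slot folder (stderr.txt included) under dest_root."""
    dest = Path(dest_root) / f"slot_{slot.name}"
    shutil.copytree(slot, dest, dirs_exist_ok=True)
    return dest
```

`find_slot` returns `None` when the task is not in any slot, so check the result before passing it to `copy_slot` together with your own BOINC data directory path.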
coolstream
Senior Cruncher
SCOTLAND
Joined: Nov 8, 2005
Post Count: 475
Re: GPU units sticking

Thanks again. TThrottle proved useful for quickly identifying the relevant slots.

All of the offending folders have now been saved. Is there anywhere I can upload them to, or should I just wait for a request from admins?
[Nov 13, 2012 8:55:17 PM]
nanoprobe
Master Cruncher
Classified
Joined: Aug 29, 2008
Post Count: 2998
Re: GPU units sticking

@coolstream: Instead of aborting, try to resume or restart them to see whether they complete. That may give the techs some added information about what happened. If they still get stuck, aborting may be the only option.
----------------------------------------
In 1969 I took an oath to defend and protect the U S Constitution against all enemies, both foreign and Domestic. There was no expiration date.


[Nov 13, 2012 9:20:15 PM]
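The suggestion to resume or restart a stuck task can also be scripted. A hedged Python sketch that drives the `boinccmd --task URL task_name {suspend|resume|abort}` form of the BOINC command-line tool; it assumes `boinccmd` is on the PATH and can reach the local client, and the helper names and the 10-second pause are hypothetical. Because the task properties posted earlier show `CPU time at last checkpoint 00:00:00`, a resumed task has no checkpoint to return to and would be expected to start again from 0%.

```python
import subprocess
import time

# Per-task operations used here, as accepted by `boinccmd --task`.
TASK_OPS = {"suspend", "resume", "abort"}


def task_cmd(project_url, result_name, op):
    """Build a `boinccmd --task` command line for one task."""
    if op not in TASK_OPS:
        raise ValueError(f"unsupported task operation: {op!r}")
    return ["boinccmd", "--task", project_url, result_name, op]


def suspend_then_resume(project_url, result_name, pause_s=10.0):
    """Suspend a task, wait a little, then resume it.

    A task that has never checkpointed may restart from 0% after this.
    """
    subprocess.run(task_cmd(project_url, result_name, "suspend"), check=True)
    time.sleep(pause_s)
    subprocess.run(task_cmd(project_url, result_name, "resume"), check=True)
```

The project URL and full result name can be read from `boinccmd --get_tasks`; for example `suspend_then_resume(url, "X0900075400202200609081303_0")`, where the `_0` suffix is only a placeholder.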
coolstream
Senior Cruncher
SCOTLAND
Joined: Nov 8, 2005
Post Count: 475
Re: GPU units sticking

OK will do.

They seem to be pretty unresponsive. One I aborted was stuck at 0%.
[Nov 13, 2012 11:40:09 PM]
nanoprobe
Master Cruncher
Classified
Joined: Aug 29, 2008
Post Count: 2998
Re: GPU units sticking

What driver version are you running?
[Nov 14, 2012 2:45:26 AM]
coolstream
Senior Cruncher
SCOTLAND
Joined: Nov 8, 2005
Post Count: 475
Re: GPU units sticking

I discovered another batch of six with abnormal run times (again, one was stuck at 0%). I put the GPU into snooze and captured the slots as suggested by SekeRob. When I came out of snooze, ALL the units restarted from 0%.

All six have now completed without a problem!

On another machine I also found one stuck unit and did a PAUSE and RESUME; the unit restarted from 0% and then completed without a problem. (The relevant slot details are saved and available if required.)

I have no idea what is causing them to stick, but I know that I have lost over 36 hours of processing due to this today. If I find more, I will continue to pause and resume them.

Does anyone have a rule for BoincTasks that will send an email alert for stuck units?
[Nov 14, 2012 3:08:09 AM]
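This is not a BoincTasks rule; it is a hedged Python illustration of the check such a rule (or a separate script) could make, reading the task properties format pasted at the top of the thread. It assumes a task is stuck when it has run more than `factor` times the ~5 minutes quoted earlier in the thread and has never checkpointed (both examples above show `CPU time at last checkpoint 00:00:00` after more than two hours). The thresholds and function names are hypothetical, and sending the e-mail is left out.

```python
# Property labels as they appear in the BOINC Manager task properties.
KEYS = ("Workunit name", "CPU time at last checkpoint",
        "Elapsed time", "Fraction done")


def parse_hms(text):
    """Convert 'HH:MM:SS' to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_props(text):
    """Pick the known label/value pairs out of a pasted properties dump."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        for key in KEYS:
            if line.startswith(key + " "):
                props[key] = line[len(key):].strip()
                break
    return props


def looks_stuck(props, expected_s=5 * 60, factor=6):
    """Flag a task that ran `factor` times longer than expected with no checkpoint."""
    elapsed = parse_hms(props["Elapsed time"])
    last_checkpoint = parse_hms(props["CPU time at last checkpoint"])
    return elapsed > factor * expected_s and last_checkpoint == 0
```

With the defaults, anything running over 30 minutes without a checkpoint is flagged; both tasks posted at the start of the thread would be.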
coolstream
Senior Cruncher
SCOTLAND
Joined: Nov 8, 2005
Post Count: 475
Re: GPU units sticking

nanoprobe, I'm using the ATI 12.10 driver with BOINC 7.0.36.
[Nov 14, 2012 3:17:33 AM]