Total posts in this thread: 39
This topic has been viewed 4240 times and has 38 replies.
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Sticking W.U.

Hi
This work unit:
Project Name: Human Proteome Folding - Phase 2
Created: 07/12/2008 04:00:22
Name: lu841_00013

It was stuck at 59% for about 6 hours. After a full BOINC restart it restarted and has now gone well past that point.
I know this issue is pretty rare; I'm just posting this as an on-guard thing really, in case there is a batch of 'em out there.
On this WU I may well be the only victim, as two others are already pending.
Cheers
Chris.
[Jul 13, 2008 9:38:10 AM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3715
Status: Offline
Re: Sticking W.U.

Hello Chris!
Thank you for the warning.

Your analysis of this type of problem is very good: it is something which may happen sometimes, normally very rarely, depending on the project and particular circumstances in a given machine. From memory I think it happens a little more often for HPF2 than for other projects, but even for HPF2 it is rather exceptional.

Usually the trick to unlock the situation is to suspend the WCG project (not the task) from the Project tab in the advanced view and to resume it about 60 seconds later. Then the task usually goes back to the latest checkpoint and runs fine till the end.
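
For anyone who prefers to script the suspend/resume trick, here is a minimal sketch. It assumes the `boinccmd` command-line tool that ships with the BOINC client is on the PATH (older clients may name it differently), and that the project URL below matches the one your client shows. The URL and the helper names are assumptions, not something taken from this thread.

```python
import subprocess
import time

# Assumption: the project's master URL as your client lists it.
PROJECT_URL = "http://www.worldcommunitygrid.org/"


def project_command(action, url=PROJECT_URL, tool="boinccmd"):
    """Build the command line for a project-level suspend or resume."""
    if action not in ("suspend", "resume"):
        raise ValueError("action must be 'suspend' or 'resume'")
    return [tool, "--project", url, action]


def bounce_project(wait_seconds=60, run=subprocess.run, sleep=time.sleep):
    """Suspend the whole project (not a single task), wait, then resume."""
    run(project_command("suspend"), check=True)
    sleep(wait_seconds)
    run(project_command("resume"), check=True)
```

Calling `bounce_project()` suspends the project, waits 60 seconds, and resumes it. Whether the stuck task is actually unloaded in between may depend on the client's memory settings.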

Obviously we have no clear explanation of this random phenomenon, otherwise we would have tried to correct it. :)

Cheers. Jean.
----------------------------------------
Team--> Decrypthon -->Statistics/Join -->Thread
[Jul 13, 2008 11:40:51 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Sticking W.U.

Thanks for the warning, I had one of those too (that's why I entered the forum).

It was at 67.10% / 22:20 hours; the time kept increasing, but the progress did not.

After a full BOINC restart, the time went back to 02:19 and the progress immediately increased to 67.18%. It seems to run fine from this point.

So it obviously hasn't done anything for ~20 hours, including not writing checkpoints and stuff.

I didn't back up the slot and project folders, so I can no longer see the last modification times from before the restart.
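
A quick way to capture that information before restarting, as a minimal sketch: it walks a directory and reports how long ago any file in it was last written. The function names are made up for illustration, and where BOINC keeps its slot folders depends on the installation, so the folder has to be supplied by the user.

```python
import os
import time


def newest_mtime(folder):
    """Return the most recent modification time (epoch seconds) of any file
    under folder, or None if the folder contains no files."""
    newest = None
    for root, _dirs, files in os.walk(folder):
        for name in files:
            try:
                mtime = os.path.getmtime(os.path.join(root, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def seconds_since_last_write(folder, now=None):
    """Seconds elapsed since the newest file under folder was written."""
    newest = newest_mtime(folder)
    if newest is None:
        return None
    return (time.time() if now is None else now) - newest
```

Run against a task's slot folder, a value of many hours would support the suspicion that no checkpoints were being written.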


p.s.: the workunit is this one : https://secure.worldcommunitygrid.org/ms/devi...us.do?workunitId=35866695
my result name is lv893_ 00002_ 7--


p.p.s.: I don't think that suspending the project would have helped in my case, as I leave apps in memory while suspended


p.p.p.s.: Quad-core Q9450, Win2k Server, 3.2GB RAM (4GB installed but no 64-bit OS), BOINC graphics disabled.
----------------------------------------
[Edit 4 times, last edit by Former Member at Aug 9, 2008 10:25:09 AM]
[Aug 9, 2008 9:59:13 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Sticking W.U.

Hi Ananas,

That's why the advice is to suspend the project instead of the job. Suspending the project unloads the job; at least that's what it does when I have this event, which I last encountered 4-5 months ago, with LIM (leave applications in memory) on. Either way, we have an FAQ (Section 3 of the Support FAQ index, link in my sig) for these hung cases and the steps to take in attempting to restart them.

Are you using BOINC throttle (the ThreadMaster GUI is superior on Vista and works on W2K too)? Did a benchmark run while that job was running? What client version?

If only there were a feature like the one in BOINCview, with colour coding and a progress alert pop-up, there'd be no 22 hours of lost time on semi-attended clients. Hey, I love Speedfan. It sends me a mail if a system gets too hot :D

ciao
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
----------------------------------------
[Edit 1 times, last edit by Sekerob at Aug 9, 2008 10:53:03 AM]
[Aug 9, 2008 10:51:15 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Sticking W.U.

Quote: "That's why the advice is to suspend the project instead of the job. Suspending the project unloads the job,"

Hmmm ... it doesn't do that for any other project on my boxes, so it must be a special feature of either BOINC 6.x or WCG.

Quote: "Are you using BOINC throttle (the ThreadMaster GUI is superior on Vista and works on W2K too)? Did a benchmark run while that job was running? What client version?"

There was a benchmark while that WU was running, yes. No throttling (it's a headless cruncher), CC 5.10.28.

Quote: "If only there were a feature like the one in BOINCview, with colour coding and a progress alert pop-up ..."

I'm using BOINCview and have set it to alert me when not enough CPU time per wall-clock time is used, but a "no progress" alert wouldn't work, as some projects have quite jumpy progress. Yoyo Muons (the quickest runs) only show 1/3 and 2/3 progress, so it would alert me all the time there.
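
The idea of alerting on CPU time per wall-clock time, rather than on progress, can be sketched as follows. This is not BOINCview's actual logic, just an illustration; the function name and the 0.5 threshold are assumptions.

```python
def cpu_starved(prev, curr, min_ratio=0.5):
    """Check whether a task got too little CPU between two samples.

    prev and curr are (wallclock_seconds, cpu_seconds) readings for one task.
    Returns True when the CPU time gained per wall-clock second falls below
    min_ratio (hypothetical threshold).
    """
    wall = curr[0] - prev[0]
    cpu = curr[1] - prev[1]
    if wall <= 0:
        raise ValueError("samples must be in increasing wall-clock order")
    return cpu / wall < min_ratio
```

Because it ignores the progress percentage, this check is not fooled by projects that report progress in large jumps. The flip side is that a task which keeps accumulating CPU time without advancing, as the rising time counter on the stuck WU suggests, would not trip it.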

I did see that the % was stuck, but I assumed it had to be like that, as HPF jobs on FaD took quite some time too. Then I saw that many of my wingmen had already finished the complete thing, and my box isn't what I would call a slug ;-)
[Aug 9, 2008 11:04:54 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Sticking W.U.

Quote: "There was a benchmark while that WU was running, yes. No throttling (it's a headless cruncher), CC 5.10.28"

Same here: benchmarks interrupted the W.U. in the original post. It happened during my night time, so I could not be 100% sure this was the cause, but from checking the messages later it must have been very close!
Chris.
[Aug 9, 2008 11:58:42 AM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Sticking W.U.

Now that is very interesting... I've seen related reports at Berkeley where one or another freeze occurred, and armstrdj, one of the WCG programmers, also asked that question earlier.

At one time I thought it might also be of use to:

1. Time the benchmark to fall on a checkpoint, but that was single-core thinking and thus does not work in the real world of dual/quad/octo cores.

I do think it could help to:

2. Switch LIM off and extend the project time slicing to 240 minutes. I think benchmarking these days no longer causes an offload and a resume from checkpoint, as it did under 5.4. Most people have enough RAM these days to avoid excessive disk I/O, so does LIM still have a valid reason to exist?

[edit: spelling]
----------------------------------------
[Edit 1 times, last edit by Sekerob at Aug 9, 2008 12:48:04 PM]
[Aug 9, 2008 12:46:48 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Sticking W.U.

3. Don't renew the benchmark values if there are no active projects that require a benchmark. Just for new work requests, an outdated benchmark is as good as a current one.


LIM is required by some projects, switching it off isn't an option if you crunch those. Others can become unstable without LIM.

My checkpoints have been set to 180 seconds for a long time already. I don't need that many checkpoints, as I'd rather lose 3 minutes every 3 months than waste 5 seconds every minute ;-)
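
To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. The figures are the ones quoted above, taken literally rather than measured, and a "month" is assumed to be 30 days.

```python
def checkpoint_overhead_seconds(days, cost_per_checkpoint_s, interval_s):
    """Total time spent writing checkpoints over the given number of days."""
    checkpoints = days * 24 * 3600 / interval_s
    return checkpoints * cost_per_checkpoint_s


days = 90  # assumption: three months of 30 days each

# "5 seconds every minute", taken literally: one 5 s checkpoint per 60 s.
frequent = checkpoint_overhead_seconds(days, cost_per_checkpoint_s=5, interval_s=60)

# One restart in three months, losing one full 180 s checkpoint interval.
worst_case_loss = 180

print(frequent / 3600)       # hours spent checkpointing over 90 days
print(worst_case_loss / 60)  # minutes of work lost to the one restart
```

Under these assumptions the frequent checkpoints cost far more than the rare restart does. The comparison only holds for restarts that resume from the last checkpoint, not for a hang that goes unnoticed for hours.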

Application rescheduling is set to 100 minutes for me, as several projects have results that can then be finished without interruption.
----------------------------------------
[Edit 2 times, last edit by Former Member at Aug 9, 2008 1:10:02 PM]
[Aug 9, 2008 12:59:39 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: Sticking W.U.

Well, that one I'd not thought about, but surely you know it runs once every 140 hours of wall-clock time. If you are hands-on in managing your projects, force the update prior to running WCG. Then that's one item excluded as a potential source.

Meantime we'll highlight this to the programmers. Something in BOINC, but given that 9 times out of 10 it is an HPF2 task that gets reported, it may be the particular combination.
[Aug 9, 2008 1:10:29 PM]
JmBoullier
Former Community Advisor
Normandy - France
Joined: Jan 26, 2007
Post Count: 3715
Status: Offline
Re: Sticking W.U.

Quote: "Most people have enough RAM these days to avoid excessive disk I/O, so does LIM still have a valid reason to exist?"

I think it was mentioned explicitly in the BOINC release notes some versions ago that benchmarking no longer causes WUs to be offloaded, but I cannot confirm it since I have always run with LIM ON.

But there is still the more frequent case of "repair" WUs on multicore machines. While a repair job with a reasonably shorter deadline rarely seems to cause the WU in progress to be suspended on a single-core machine, it seems to be almost automatic on a quad, even if there is no real urgency. In that case LIM ON avoids losing work by going back to the previous checkpoint of the suspended WU, which can be a lot of work for some projects.

Cheers. Jean.
[Aug 9, 2008 1:29:36 PM]