Total posts in this thread: 129
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: A few unusual HPF2 work units

WhoCrazy, I think that for the most part the BOINC agent and the science application are independent. BOINC just manages the traffic. They (BOINC, BOINCMgr and the science application) talk to each other through RPC as I understand it... but I could be wrong.

If a unit has run for 11 hours uninterrupted, is stuck on one percentage, and the HPF2_Rosetta process shows no CPU time being consumed in Task Manager, you know what needs doing in BOINC: select the WU and hit the abort button. If you see Rosetta still eating a high CPU percentage, try killing BOINCMgr.exe first, since it may simply have lost contact with the science application. Count to 10 and restart BOINCMgr.exe... who knows. I had that often in the early days of BOINC 5.2.13, but for RPC reasons.

Meanwhile, after the first general bug fixes, I think what remains is in substance a few hardware-related issues. I'm now at 40-odd HPF2s done since starting with 5.4.9. Only the first 2, on HPF2 v5.06, were invalid... not a single error, unless by my own dumb actions.

PS: I just got 2 HPF2s that each already had errors reported. If I get through those and they come back valid, the pointers continue homing in. For those interested, they are:

za086_ 00086 (with 2 errors reported)
za114_ 00454 (with 4 errors reported)

The latter, with 4 errors, only had 2 more copies sent out, both 'in progress'. I'm not sure whether this one is already lined up for pulling, as the 4th error came after I received the 2nd 'in progress' copy. Maybe the system waits to receive a 'pending validation' copy before sending out any more?
[Edit 1 times, last edit by Sekerob at Jul 14, 2006 6:47:13 PM]
[Jul 14, 2006 6:46:33 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: My unusual HPF2 work units

Since I downloaded the updated HPF2 program on 06/28/2006, I have had 1 error and 4 invalids. Here they are, together with my Computer ID, etc, from the Messages Tab in BOINC Manager.

Starting BOINC client version 5.4.9 for windows_intelx86
Processor: 1 AuthenticAMD AMD Sempron(tm) Processor 3100+
Memory: 895.48 MB physical, 2.12 GB virtual
Disk: 40.33 GB total, 29.13 GB free
Computer ID: 32229;

Work Unit      Time Sent             Time Returned         CPU Time

Error

za095_ 00852   07/05/2006 21:18:42   07/06/2006 09:23:15   3.68

Invalid

za053_ 00268   07/03/2006 08:35:42   07/03/2006 17:38:16   4.58
za083_ 00265   07/03/2006 01:38:35   07/03/2006 12:49:27   4.48
za082_ 00363   07/02/2006 10:18:15   07/02/2006 19:25:46   4.73
za067_ 00001   07/01/2006 03:25:32   07/01/2006 14:56:31   5.06

[Jul 14, 2006 7:52:35 PM]
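For anyone who wants to compare how long each of these units was out against the reported CPU time, here is a minimal Python sketch. It assumes the timestamps are in MM/DD/YYYY format and that each row has exactly the columns shown in the post; the helper names `parse_row` and `turnaround_hours` are made up for illustration, and the unit of the CPU time column is not stated in the post, so the value is printed as reported.

```python
from datetime import datetime

# Rows copied from the post above: work unit, time sent, time returned, CPU time.
ROWS = """\
za095_ 00852 07/05/2006 21:18:42 07/06/2006 09:23:15 3.68
za053_ 00268 07/03/2006 08:35:42 07/03/2006 17:38:16 4.58
za083_ 00265 07/03/2006 01:38:35 07/03/2006 12:49:27 4.48
za082_ 00363 07/02/2006 10:18:15 07/02/2006 19:25:46 4.73
za067_ 00001 07/01/2006 03:25:32 07/01/2006 14:56:31 5.06
"""

# Assumption: month/day/year order, 24-hour clock.
FMT = "%m/%d/%Y %H:%M:%S"


def parse_row(line):
    """Split one row into (name, sent, returned, cpu_time_as_reported)."""
    prefix, number, sent_d, sent_t, ret_d, ret_t, cpu = line.split()
    sent = datetime.strptime(f"{sent_d} {sent_t}", FMT)
    returned = datetime.strptime(f"{ret_d} {ret_t}", FMT)
    return f"{prefix} {number}", sent, returned, float(cpu)


def turnaround_hours(sent, returned):
    """Wall-clock hours between the unit being sent and being returned."""
    return (returned - sent).total_seconds() / 3600


for line in ROWS.splitlines():
    name, sent, returned, cpu = parse_row(line)
    hours = turnaround_hours(sent, returned)
    print(f"{name}: turnaround {hours:.2f} h, CPU time {cpu}")
```

Turnaround here is only the gap between the two timestamps, so it includes any time the unit sat in the queue before it started running.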
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: A few unusual HPF2 work units

I have been getting the same results as others.
It has run for 48:00+ and still 0%. Tried a reboot and now at 31:00+ still the same 0%.
Agent Version 3.0 (2844)
Device ID 209699
Any thoughts?
[Jul 14, 2006 8:39:37 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: A few unusual HPF2 work units

"The latter, with 4 errors, only had 2 more copies sent out, both 'in progress'. I'm not sure whether this one is already lined up for pulling, as the 4th error came after I received the 2nd 'in progress' copy. Maybe the system waits to receive a 'pending validation' copy before sending out any more?"

That's normal. The ones that fail for everybody normally end up with just 6 Errors returned and nothing else, so there must be a limit of 6 (Errors + In Progress) copies to stop wasting time on bad WUs.

I'm surprised you never got one of these until now. I'll be even more surprised if it works for you when it fails for everybody else.
[Jul 14, 2006 8:44:16 PM]
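The rule guessed at in the post above (no more than 6 copies counting Errors plus In Progress) can be written down as a tiny Python check. This is only a sketch of the poster's inference, not a documented WCG setting; the constant `MAX_ERRORS_PLUS_IN_PROGRESS` and the function `can_send_another_copy` are hypothetical names.

```python
# Hypothetical cap inferred by the poster; not a documented WCG setting.
MAX_ERRORS_PLUS_IN_PROGRESS = 6


def can_send_another_copy(errors, in_progress, cap=MAX_ERRORS_PLUS_IN_PROGRESS):
    """Return True if the assumed cap still leaves room for one more copy."""
    return errors + in_progress < cap


# The case from the quoted post: 4 errors and 2 copies in progress.
print(can_send_another_copy(errors=4, in_progress=2))  # False

# A unit with only 2 errors so far would still be eligible under this rule.
print(can_send_another_copy(errors=2, in_progress=0))  # True
```

Under that assumption, a unit with 4 errors and 2 copies in progress has already reached the cap, which would explain why no further copies were sent out.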
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: A few unusual HPF2 work units

BOINC User ID 225561, Host ID 41341.

Error #13: za114_ 00337, returned 07/14/2006 13:30:51, aborted after 2.5 hours of normal checkpoints with exit code 10, Exception code: 0xc0000005, Exception address: 0x00A876DD. 4 other copies returned errors; 1 other copy is still in progress.
[Jul 14, 2006 8:48:37 PM]
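For readers wondering what the exception code in the report above means: 0xC0000005 is the Windows status code for an access violation, which is when a process reads or writes memory it is not allowed to touch. A small Python lookup along these lines can translate a few common codes. The table is deliberately partial, and the function name `describe_exception` is made up for illustration.

```python
# A few well-known Windows NTSTATUS exception codes (partial list).
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC0000094: "STATUS_INTEGER_DIVIDE_BY_ZERO",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}


def describe_exception(code_text):
    """Map a hex string such as '0xc0000005' to its status name, if known."""
    code = int(code_text, 16)
    return NTSTATUS_NAMES.get(code, f"unknown (0x{code:08X})")


print(describe_exception("0xc0000005"))
```

The code alone does not say whether the cause is a bug in the application or faulty hardware; both can produce an access violation.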
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: A few unusual HPF2 work units

OK, here are the gory details:
za114_00557_2, device id=50207.
I hope this is all the info you need. Aborting work unit now.
[Jul 14, 2006 9:17:50 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: A few unusual HPF2 work units

"I have been getting the same results as others.
It has run for 48:00+ and still 0%. Tried a reboot and now at 31:00+ still the same 0%.
Agent Version 3.0 (2844)
Device ID 209699
Any thoughts?"

Does that mean 48 hours and 31 hours?
[Jul 14, 2006 11:43:14 PM]
jholdren
Cruncher
Joined: Jul 8, 2005
Post Count: 5
Status: Offline
Re: A few unusual HPF2 work units

Looks like I'm not the only one. A Mr. Hardin suggested that I drop this info into this thread for those interested in finding out why it is happening.

I am also having this 'stuck' problem with my current work unit/job. It shows 45 hr 42 min right now and still ticking but the task progress has not gone past 0%. I am using the UD agent running Proteome_folding_2.
Agent version 3.0 (2844)
Device ID 347381
Last results returned 07/09/06 16:14:48 UTC

and I am using Rosetta v5.0.5.3
[Jul 15, 2006 3:49:16 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: A few unusual HPF2 work units

Hi there. If someone aborts a work unit and then writes in to tell you the device ID, does that not give you much to go on?
Perhaps the next time someone gets a dodgy work unit, they could upload the wcg_hpf2.out file somewhere?
Wouldn't this help you debug the work unit quicker?
[Jul 15, 2006 8:45:55 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: A few unusual HPF2 work units

WCG do get the error logs.

But what really helps them is being able to link reports and descriptions of how it failed with a particular work unit, so they can dig out the failed result and start debugging it based on the problem description.

Just looking at all the raw errors is going to be unprofitable. First, you have to filter out all the normal failures, caused by overclocking and broken computers. Then, you have to work out which types of error seem to correspond with a particular bug (the same bug won't always produce the same error). Then, you can finally get an idea of what bugs are causing most trouble.

Having the verbal descriptions makes WCG's task infinitely easier. They know what the main issues are, and by taking a sample of work units displaying a particular problem, they can narrow down the type of error and eventually pinpoint the bug causing the problem.
[Jul 15, 2006 8:58:24 PM]
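The triage steps described in the post above (filter out hosts that fail for their own reasons, then see which errors cluster on which work units) could be sketched roughly like this in Python. Everything here is an assumption for illustration: the report tuples are made-up data, "fails on more than 2 different work units" is only a crude stand-in for "overclocked or broken computer", and none of the function names come from WCG.

```python
from collections import Counter, defaultdict

# Made-up reports for illustration: (host_id, work_unit, error_signature).
REPORTS = [
    ("host_a", "wu_1", "access violation"),
    ("host_b", "wu_1", "access violation"),
    ("host_c", "wu_1", "stuck at 0%"),
    ("host_x", "wu_2", "access violation"),
    ("host_x", "wu_3", "access violation"),
    ("host_x", "wu_4", "illegal instruction"),
    ("host_b", "wu_5", "stuck at 0%"),
]


def drop_unreliable_hosts(reports, max_failed_units=2):
    """Drop reports from hosts that failed on many different work units."""
    failed_units = defaultdict(set)
    for host, unit, _ in reports:
        failed_units[host].add(unit)
    return [r for r in reports if len(failed_units[r[0]]) <= max_failed_units]


def signatures_by_unit(reports):
    """Count error signatures per work unit; one bug may show several."""
    grouped = defaultdict(Counter)
    for _, unit, signature in reports:
        grouped[unit][signature] += 1
    return grouped


def rank_units(reports):
    """Order work units by how many distinct hosts failed on them."""
    hosts = defaultdict(set)
    for host, unit, _ in reports:
        hosts[unit].add(host)
    return sorted(hosts, key=lambda u: (-len(hosts[u]), u))


kept = drop_unreliable_hosts(REPORTS)
grouped = signatures_by_unit(kept)
for unit in rank_units(kept):
    print(unit, dict(grouped[unit]))
```

A verbal description from a forum post would then tell whoever is debugging which of the top-ranked units to pull first, which is the point the post is making.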