World Community Grid Forums
Thread Status: Active Total posts in this thread: 129
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
WhoCrazy, I think that for the most part the BOINC agent and the science application are independent. BOINC just manages the traffic. They (BOINC, BOINCMgr and the science app) talk to each other through RPC as I understand it... but I could be wrong.

If a work unit has run 11 hours uninterrupted, is stuck on a percentage, and the HPF2_Rosetta process shows no CPU time being consumed in Task Manager, you know what needs doing in BOINC: select the WU and hit the abort button. If you see Rosetta still eating a high CPU number, try killing BOINCMgr.exe first; it could be that it lost contact with the science app. Count to 10 and restart BOINCMgr.exe again... who knows. I had that often in the early days of BOINC 5.2.13, but for RPC reasons.

Meantime, after the first general bug fixes, I think what remains is in substance a few hardware-related issues. I'm now at 40-odd HPF2s done since starting with 5.49. Only the first 2, on HPF2 v 5.06, were invalid... not a single error, unless by my own dumb actions.

PS: I just got 2 HPF2s that each already had errors reported. If I get through those and they come back valid, the pointers continue homing in. For those interested, they are:

za086_ 00086 (with 2 errors reported)
za114_ 00454 (with 4 errors reported)

The latter, with 4 errors, only had 2 more copies sent out, both 'in progress'. Not sure if this one is already lined up for pulling, as the 4th error came in after I received the 2nd 'in progress' copy. Maybe the system waits to receive a 'pending validation' copy before sending out any more?
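The "stuck" check described above can be written down as a small rule of thumb. This is only a sketch: the function name, its parameters, and the idea of comparing two readings taken some hours apart are assumptions made for illustration, and the 11-hour threshold is simply the figure quoted in this post, not an official WCG or BOINC value.

```python
def looks_hung(progress_then, progress_now, cpu_secs_then, cpu_secs_now,
               hours_elapsed, threshold_hours=11.0):
    """Return True if a work unit looks stuck by the rule of thumb above.

    All names here are hypothetical. The two readings are whatever the
    volunteer noted from BOINC Manager (percent done) and Task Manager
    (CPU time of the science process) at two points in time.
    """
    stuck_on_percent = progress_now == progress_then   # progress has not moved
    no_cpu_consumed = cpu_secs_now == cpu_secs_then    # process used no CPU time
    return hours_elapsed >= threshold_hours and stuck_on_percent and no_cpu_consumed


# Stuck at 37% with no CPU time consumed over 11 hours: candidate for abort.
print(looks_hung(37.0, 37.0, 5400, 5400, 11.0))
# Percentage frozen but CPU time still climbing: try restarting BOINCMgr first.
print(looks_hung(37.0, 37.0, 5400, 9000, 11.0))
```

A frozen percentage with CPU time still rising is deliberately not flagged, since that is the case where the post suggests restarting BOINCMgr.exe before aborting anything.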
WCG
----------------------------------------
Please help to make the Forums an enjoyable experience for All!
[Edit 1 times, last edit by Sekerob at Jul 14, 2006 6:47:13 PM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Since I downloaded the updated HPF2 program on 06/28/2006, I have had 1 error and 4 invalids. Here they are, together with my Computer ID, etc., from the Messages tab in BOINC Manager.

Starting BOINC client version 5.4.9 for windows_intelx86
Processor: 1 AuthenticAMD AMD Sempron(tm) Processor 3100+
Memory: 895.48 MB physical, 2.12 GB virtual
Disk: 40.33 GB total, 29.13 GB free
Computer ID: 32229

Work Unit --- Time Sent --- Time Returned --- CPU Time

Error:
za095_ 00852 --- 07/05/2006 21:18:42 --- 07/06/2006 09:23:15 --- 3.68

Invalid:
za053_ 00268 --- 07/03/2006 08:35:42 --- 07/03/2006 17:38:16 --- 4.58
za083_ 00265 --- 07/03/2006 01:38:35 --- 07/03/2006 12:49:27 --- 4.48
za082_ 00363 --- 07/02/2006 10:18:15 --- 07/02/2006 19:25:46 --- 4.73
za067_ 00001 --- 07/01/2006 03:25:32 --- 07/01/2006 14:56:31 --- 5.06 |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I have been getting the same results as others.
It has run for 48:00+ and is still at 0%. I tried a reboot, and now at 31:00+ it is still at the same 0%.
Agent Version 3.0 (2844)
Device ID 209699
Any thoughts? |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"Latter with 4 errors, only got 2 more send out with 'in progress'. Not sure if this one is already lined up for pulling, as the 4th error is after i received the 2nd 'in progress' Maybe the system waits to receive a 'pending validation' copy before sending out any more?"

That's normal. The ones that fail for everybody normally end up with just 6 Errors returned and nothing else, so there must be a limit of 6 (Errors + In Progress) copies to stop wasting time on bad WUs. I'm surprised you never got one of these until now. I'll be even more surprised if it works for you when it fails for everybody else. |
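To make the guessed limit concrete, here is a minimal sketch of the rule as described in this post. It is an assumption inferred from what volunteers observed, not WCG's actual scheduler logic; the function name and the constant are hypothetical.

```python
# Assumed cap on (errors + in progress) copies, as guessed in the post above.
MAX_COPIES = 6


def can_send_another_copy(errors, in_progress, max_copies=MAX_COPIES):
    """Return True if the guessed rule would still allow one more copy out."""
    return errors + in_progress < max_copies


# za114_ 00454 as reported earlier in the thread: 4 errors, 2 in progress.
print(can_send_another_copy(4, 2))  # False
# A work unit with 2 errors and 1 copy in progress could still be re-sent.
print(can_send_another_copy(2, 1))  # True
```

Under this reading, a work unit with 4 errors and 2 copies in progress has already reached the cap, which would explain why no further copies went out.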
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
BOINC User ID 225561, Host ID 41341.
Error #13: za114_ 00337, returned 07/14/2006 13:30:51, aborted after 2.5 hours of normal checkpoints with exit code 10, Exception code: 0xc0000005, Exception address: 0x00A876DD. 4 other copies errored, 1 other copy still in progress. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
OK, here are the gory details:
za114_00557_2, device ID 50207. I hope this is all the info you need. Aborting the work unit now. |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"I have been getting the same results as others. It has run for 48:00+ and still 0%. Tried a reboot and now at 31:00+ still the same 0%. Agent Version 3.0 (2844) Device ID 209699 Any thoughts.?"

Does that mean 48 hours and 31 hours? |
||
|
jholdren
Cruncher Joined: Jul 8, 2005 Post Count: 5 Status: Offline |
Looks like I'm not the only one. A Mr. Hardin suggested that I drop this info into this thread for those interested in finding out why it is happening.
I am also having this 'stuck' problem with my current work unit/job. It shows 45 hr 42 min right now and still ticking but the task progress has not gone past 0%. I am using the UD agent running Proteome_folding_2. Agent version 3.0 (2844) Device ID 347381 Last results returned 07/09/06 16:14:48 UTC and I am using Rosetta v5.0.5.3 |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
Hi there. If someone aborts a work unit and then writes in and tells you the device ID, does that not give you much to go on?
Perhaps the next time someone gets a dodgy work unit, they could upload the wcg_hpf2.out file somewhere? Wouldn't this help you debug the work unit quicker? |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
WCG do get the error logs.
But what really helps them is being able to link reports and descriptions of how it failed with a particular work unit, so they can dig out the failed result and start debugging it based on the problem description. Just looking at all the raw errors is going to be unprofitable. First, you have to filter out all the normal failures, caused by overclocking and broken computers. Then, you have to work out which types of error seem to correspond with a particular bug (the same bug won't always produce the same error). Then, you can finally get an idea of what bugs are causing most trouble. Having the verbal descriptions makes WCG's task infinitely easier. They know what the main issues are, and by taking a sample of work units displaying a particular problem, they can narrow down the type of error and eventually pinpoint the bug causing the problem. |
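The steps described above (filter out the normal failures, group what is left by error type, then sample the work units that come with a description) can be sketched roughly as follows. Everything here is hypothetical: the record fields, the function name, and the sample data are invented for illustration and do not reflect how WCG actually stores or processes its error logs.

```python
from collections import defaultdict


def triage(reports, descriptions):
    """Group error reports by error code and keep the described work units.

    reports: list of dicts with hypothetical keys 'work_unit',
             'error_code' and 'suspect_hardware'.
    descriptions: dict mapping work unit name -> a volunteer's description.
    """
    by_code = defaultdict(list)
    for report in reports:
        # Step 1: drop failures blamed on overclocking or broken computers.
        if report["suspect_hardware"]:
            continue
        # Step 2: group the remaining failures by type of error.
        by_code[report["error_code"]].append(report["work_unit"])
    # Step 3: for each error type, sample the work units someone described.
    return {code: [wu for wu in wus if wu in descriptions]
            for code, wus in by_code.items()}


# Made-up sample data, for illustration only.
reports = [
    {"work_unit": "wu_a", "error_code": "0xc0000005", "suspect_hardware": False},
    {"work_unit": "wu_b", "error_code": "0xc0000005", "suspect_hardware": True},
    {"work_unit": "wu_c", "error_code": "stuck_at_0_percent", "suspect_hardware": False},
]
descriptions = {"wu_a": "aborted after 2.5 hours of normal checkpoints"}

result = triage(reports, descriptions)
print(result)
```

Error types that end up with an empty sample list are the ones where nobody has posted a description yet, which is exactly where a report in this thread helps most.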
||
|