World Community Grid Forums
Thread Status: Active Total posts in this thread: 42
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
WU ts05_b159_ps0000 finished properly:
ts05_b159_ps0000_3-- - In Progress 18.04.10 06:59:00 26.04.10 06:59:00 0.00 0.0 / 0.0
ts05_b159_ps0000_2-- 612 Error 16.04.10 14:46:12 18.04.10 05:21:52 29.41 443.6 / 0.0
ts05_b159_ps0000_0-- 612 Error 14.04.10 20:26:15 16.04.10 14:16:33 17.68 271.7 / 0.0
ts05_b159_ps0000_1-- 612 Pending Validation 14.04.10 20:26:15 18.04.10 14:14:14 63.10 1,259.4 / 0.0
The two errors occurred at different positions, 63.28% and 38.72%. So it seems proven that a task which hits error 29 when running uninterrupted can be completed by stopping BOINC (and maybe restarting the computer) from time to time (ideally just before the error strikes...). I leave the scientific side of this to other, more competent people. I'm quite confident that my other WU will complete as well (at the moment it is at 76%). But it's most probably for love, because all other tasks will have error 29 and therefore no credit and no time will be credited... |
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Looking at the run time and claim, it strikes me that there is a lot more than there should be... the scenario being that it resumed from the start without losing CPU time.
"But it's most probably for love, because all other tasks will have error 29 and therefore no credit and no time will be credited..."
To repeat ad nauseam: under a policy change applied a number of weeks ago, all run time, credit and result count for error results are given after the 4th or 5th result has been computed. But that depends on how the error occurred and whether it's in the "known science app failure list". So far all the DDDT-2 failures I've had have been given credit.
edit: Today I'll use ZoSo's QED [Latin is kind of half Italian, but what would I know about that ;>]
ts05_b198_ps0000_1-- 1112084 Error 14-4-10 20:29:48 15-4-10 11:45:05 11.50 210.2 / 210.2
ts05_a260_ps0000_1-- 95711 Error 14-4-10 19:47:53 15-4-10 06:00:54 8.98 91.6 / 91.6
ts01_a442_pe0000_3-- 1112084 Error 11-4-10 16:03:55 11-4-10 18:22:56 1.36 24.6 / 24.6
WCG
----------------------------------------Please help to make the Forums an enjoyable experience for All! [Edit 1 times, last edit by Sekerob at Apr 18, 2010 2:56:43 PM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
OK, I didn't have the same luck with my second WU. ts05_a193_ps0000_1 finally hit error 29 at 95.72%. But at least this error position was reproducible (three times at exactly the same position). So it could be exhausted system resources, task-internal restrictions or Monte Carlo (reproducible?). The system monitor showed a last RAM usage of 221 MB and (as always) 805 MB VM.
Up to now all three other tasks have had error 29 as well, two at position 34.88% and one at position 16.72%. Any conclusions? No (I'm no scientist). But at least a tiny number of error 29 WUs can be completed successfully on an average computer just by stopping and restarting BOINC from time to time (WU ts05_b159_ps0000_1). So technical restrictions cannot be excluded a priori. I would definitely not advise anyone to stop and restart BOINC periodically now; it's not worth the effort. Error 29 is not that frequent, and in most cases it could well be that the task just detects that it is of no further use. But I'd be glad if this mechanism could be reviewed.
Matthias |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"But I'd be glad if this mechanism could be reviewed. Matthias"
Agreed, but I'm not sure whether, in order to get this fixed, CHARMM itself has to be modified, which may take some time. |
||
|
wplachy
Senior Cruncher Joined: Sep 4, 2007 Post Count: 423 Status: Offline |
"To repeat ad nauseam: under a policy change applied a number of weeks ago, all run time, credit and result count for error results are given after the 4th or 5th result has been computed. But that depends on how the error occurred and whether it's in the 'known science app failure list'. So far all the DDDT-2 failures I've had have been given credit."
It appears that credit is not always being granted for 0x1d errors. I have one WU where one repair job and I errored out. The other initial wingman and one repair job validated. The two valids were awarded credit; the 0x1d errors were not.
ts05_b200_ps0000_3-- 612 Valid 4/17/10 05:19:17 4/20/10 08:56:00 44.35 680.6 / 777.2
ts05_b200_ps0000_2-- 612 Error 4/15/10 21:33:32 4/17/10 05:19:12 14.28 331.9 / 0.0
ts05_b200_ps0000_1-- 612 Valid 4/14/10 20:30:05 4/17/10 01:39:34 42.02 1,006.7 / 777.2
ts05_b200_ps0000_0-- 612 Error 4/14/10 20:30:03 4/15/10 21:33:29 24.61 465.3 / 0.0 <---mine
Both error results were: The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
I have another WU that appears to be headed for the same fate. Mine and one repair job had 0x1d errors; the other initial wingman is PV and the repair job is In Progress.
To me this raises 2 questions. First, if 0x1d indicates "the science that is being run is determined not a good lab test case" and is really not an error, why are 2 WUs not getting the same result? Second, is the statement that 0x1d "errors" will receive credit still correct if some of the WUs error and some do not?
Edit: corrected typo
Bill P
----------------------------------------![]() [Edit 1 times, last edit by wplachy at Apr 20, 2010 11:38:58 PM] |
||
|
Sekerob
Ace Cruncher Joined: Jul 24, 2005 Post Count: 20043 Status: Offline |
Guess: 3 errors tell it the WU was bad; 2 errors / 2 valid tells it they're errors. At least, that may be how the rule was coded. With 2 valid the distribution ends, noting that 1 of the original 2 was valid. I've seen several of these; it's simply not clear whether it was intended like that.
Am I glad they're running this softly, softly and have taken an intermission to study the ts (test-process verification units) output. Over to the techs! PS: I did not boot the duo that had 4 of this run for the whole duration. The first task went with same, the next 3 ran through and their quorums show no error generations.
WCG
----------------------------------------Please help to make the Forums an enjoyable experience for All! [Edit 1 times, last edit by Sekerob at Apr 21, 2010 6:05:48 AM] |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"To me this raises 2 questions. First, if 0x1d indicates 'the science that is being run is determined not a good lab test case' and is really not an error, why are 2 WUs not getting the same result? Second, is the statement that 0x1d 'errors' will receive credit still correct if some of the WUs error and some do not? Edit: corrected typo"
Just a note... I had one that I invested 21 hours in and it came back as an error. As someone noted way back in this thread, the task has to fail 5 times to be dead. Today the 5th wingman also came back with an error, and though the task is still listed as an error in the stats, I did get 461.7 CS as claimed... the others also got what they claimed.
Bottom line: though it does not follow the "normal" BOINC protocols as we know them for most projects, the error is not an error in the normal sense, and in the end all is likely to end well...
My point, I suppose, is that I will just let the project run, and when work becomes much more available I will up my numbers and work as hard as I can on the sub-project... :) Happy crunching... |
||
|
pirogue
Veteran Cruncher USA Joined: Dec 8, 2008 Post Count: 685 Status: Offline |
"To repeat ad nauseam: under a policy change applied a number of weeks ago, all run time, credit and result count for error results are given after the 4th or 5th result has been computed. But that depends on how the error occurred and whether it's in the 'known science app failure list'. So far all the DDDT-2 failures I've had have been given credit."
"It appears that credit is not always being granted for 0x1d errors. I have one WU where one repair job and I errored out. The other initial wingman and one repair job validated. The two valids were awarded credit; the 0x1d errors were not.
ts05_b200_ps0000_3-- 612 Valid 4/17/10 05:19:17 4/20/10 08:56:00 44.35 680.6 / 777.2
ts05_b200_ps0000_2-- 612 Error 4/15/10 21:33:32 4/17/10 05:19:12 14.28 331.9 / 0.0
ts05_b200_ps0000_1-- 612 Valid 4/14/10 20:30:05 4/17/10 01:39:34 42.02 1,006.7 / 777.2
ts05_b200_ps0000_0-- 612 Error 4/14/10 20:30:03 4/15/10 21:33:29 24.61 465.3 / 0.0 <---mine
Both error results were: The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
I have another WU that appears to be headed for the same fate. Mine and one repair job had 0x1d errors; the other initial wingman is PV and the repair job is In Progress.
To me this raises 2 questions. First, if 0x1d indicates 'the science that is being run is determined not a good lab test case' and is really not an error, why are 2 WUs not getting the same result? Second, is the statement that 0x1d 'errors' will receive credit still correct if some of the WUs error and some do not?"
What wplachy said ^^^^
How can these sometimes be errors that receive credit and sometimes be errors that do not?
ts05_a048_ps0000_2-- Error 4/17/10 06:16:23 4/18/10 17:36:51 22.25 559.5 / 0.0
That's over 200 hours of code 29 errors that might or might not be WU errors, with no credit granted. Maybe I missed something, but what is the exact explanation for this discrepancy? |
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
"That's over 200 hours of code 29 errors that might or might not be WU errors, with no credit granted. Maybe I missed something, but what is the exact explanation for this discrepancy?"
I guess it's in the nature of the validator. If it sees two valid results, all error results are counted as errors. If it sees at least four error results (with at most one PV), it looks at the errors and accepts them if they are error 29. The validator should treat error 29 results as equal to valid ones, but I don't think it can be configured that easily.
The real problem is that error 29 results are nondeterministic (IMHO resource-dependent), i.e. there is no guarantee that all tasks of a WU end with error 29. In some cases at least some of them can be brought to completion if you have adequate hardware. If that problem could be solved, the special treatment by the validator would become redundant. |
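To make the guess above concrete, here is a minimal Python sketch of the rule as described in this post. It is only an illustration of the poster's hypothesis, not the actual WCG validator: the function name `credit_decisions` and the status labels `valid`, `error29`, `error_other` and `pending` are made up for this example, and what happens to a lone pending result is left undecided because the thread does not say.

```python
def credit_decisions(results):
    """Return 'credit', 'no credit' or 'undecided' for each result of one WU.

    `results` is a list of hypothetical status labels:
    'valid', 'error29', 'error_other' or 'pending'.
    This encodes the rule guessed at in the thread, nothing more.
    """
    n_valid = results.count("valid")
    n_pending = results.count("pending")
    n_error = sum(1 for r in results if r.startswith("error"))

    decisions = []
    for r in results:
        if n_valid >= 2:
            # Two valid results: quorum reached, every error stays an error.
            if r == "valid":
                decisions.append("credit")
            elif r.startswith("error"):
                decisions.append("no credit")
            else:
                decisions.append("undecided")
        elif n_error >= 4 and n_pending <= 1:
            # Four or more errors, at most one PV: error 29 results are accepted.
            if r == "error29":
                decisions.append("credit")
            elif r.startswith("error"):
                decisions.append("no credit")
            else:
                decisions.append("undecided")
        else:
            # Not enough results yet to settle the WU either way.
            decisions.append("undecided")
    return decisions


# wplachy's WU: two valid, two error 29 results.
print(credit_decisions(["valid", "error29", "valid", "error29"]))
# A WU where all five results ended with error 29.
print(credit_decisions(["error29"] * 5))
```

Under these assumptions the first call leaves both error 29 results without credit, which matches what wplachy reported, while the second credits all five, which matches the report of a task that failed five times.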
||
|
boulmontjj
Senior Cruncher France Joined: Nov 17, 2004 Post Count: 317 Status: Offline |
Hi everybody,
I have a type A WU that finished OK, and the result was sent back before the deadline. The same WU finished in error for the 4 other members who received it. When I look at the results, the WU appears as "too late"!
Is this normal? |
||
|