Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: exited with code 29 (0x1d, -227)

WU ts05_b159_ps0000 finished properly:

ts05_b159_ps0000_3-- - In Progress 18.04.10 06:59:00 26.04.10 06:59:00 0.00 0.0 / 0.0
ts05_b159_ps0000_2-- 612 Error 16.04.10 14:46:12 18.04.10 05:21:52 29.41 443.6 / 0.0
ts05_b159_ps0000_0-- 612 Error 14.04.10 20:26:15 16.04.10 14:16:33 17.68 271.7 / 0.0
ts05_b159_ps0000_1-- 612 Pending Validation 14.04.10 20:26:15 18.04.10 14:14:14 63.10 1,259.4 / 0.0

The two errors occurred at different positions: 63.28% and 38.72%.
So it seems to be proven that a task which hits error 29 when run uninterrupted can be completed by stopping BOINC (and maybe restarting the computer) from time to time (ideally just before the error strikes...). I leave the scientific side of this to other, more competent people.
I'm quite confident that my other WU will complete as well (at the moment it is at 76%). But it's most probably for love, because all the other tasks will end with error 29, so neither credit nor run time will be counted...
[Apr 18, 2010 2:25:46 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: exited with code 29 (0x1d, -227)

Looking at the run time and the claim, it strikes me that there is a lot more than there should be... as in the scenario of the task having resumed from the start without losing CPU time.
But it's most probably for love, because all other tasks will have error 29 and therefore no credit and no time will be credited...

To repeat, ad nauseam: under a policy change applied a number of weeks ago, all run time on error results, credit, and result count are given after the 4th or 5th result has been computed. But that depends on how the error occurred and whether it's in the "known science app failure list". So far all the DDDT-2 failures I have had have been given credit.

edit: Today I'll use ZoSo's QED [Latin is kind of half Italian, but what would I know about that ;>]

ts05_ b198_ ps0000_ 1-- 1112084 Error 14-4-10 20:29:48 15-4-10 11:45:05 11.50 210.2 / 210.2
ts05_ a260_ ps0000_ 1-- 95711 Error 14-4-10 19:47:53 15-4-10 06:00:54 8.98 91.6 / 91.6
ts01_ a442_ pe0000_ 3-- 1112084 Error 11-4-10 16:03:55 11-4-10 18:22:56 1.36 24.6 / 24.6
----------------------------------------
WCG Global & Research > Make Proposal Help: Start Here!
Please help to make the Forums an enjoyable experience for All!
----------------------------------------
[Edit 1 times, last edit by Sekerob at Apr 18, 2010 2:56:43 PM]
[Apr 18, 2010 2:53:54 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: exited with code 29 (0x1d, -227)

OK, I was not as lucky with my second WU: ts05_a193_ps0000_1 finally ended with error 29 at 95.72%. But at least this error position was reproducible (three times at exactly the same position). So the cause could be exhausted system resources, task-internal restrictions, or Monte Carlo (but would that be reproducible?). The system monitor showed a final RAM usage of 221 MB and (as always) 805 MB of virtual memory.
So far all three other tasks have ended with error 29 as well, two at 34.88% and one at 16.72%.
Any conclusions? No (I'm no scientist). But at least a tiny number of error 29 WUs can be completed successfully on an average computer just by stopping and restarting BOINC from time to time (WU ts05_b159_ps0000_1). So technical restrictions cannot be excluded a priori.
I definitely would not advise anyone to stop and restart BOINC periodically now; it's not worth the effort. Error 29 is not that frequent, and in most cases it may well be that the task simply detects that it is of no further use. But I'd be glad if this mechanism could be reviewed.

Matthias
[Apr 19, 2010 7:45:41 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: exited with code 29 (0x1d, -227)

But I'd be glad if this mechanism could be reviewed.

Matthias


Agreed, but I'm not sure whether CHARMM itself has to be modified in order to get this fixed, which may take some time :( ...
[Apr 20, 2010 5:18:29 AM]
wplachy
Senior Cruncher
Joined: Sep 4, 2007
Post Count: 423
Status: Offline
Re: exited with code 29 (0x1d, -227)

To add an ad nauseam, policy change applied a number of weeks ago, all run time on error results, credit, result count is given after the 4th or 5th result has been computed. But that depends how that error occurred and if it's in the 'known science app failure list". So far all DDDT-2 fails I had have been given credit.

It appears that credit is not always being granted for 0x1d errors. I have one WU on which one repair job and I errored out. The other initial wingman and one repair job validated. The two valid results were awarded credit; the 0x1d errors were not.

ts05_ b200_ ps0000_ 3-- 612 Valid 4/17/10 05:19:17 4/20/10 08:56:00 44.35 680.6 / 777.2
ts05_ b200_ ps0000_ 2-- 612 Error 4/15/10 21:33:32 4/17/10 05:19:12 14.28 331.9 / 0.0
ts05_ b200_ ps0000_ 1-- 612 Valid 4/14/10 20:30:05 4/17/10 01:39:34 42.02 1,006.7 / 777.2
ts05_ b200_ ps0000_ 0-- 612 Error 4/14/10 20:30:03 4/15/10 21:33:29 24.61 465.3 / 0.0 <---mine

Both error results were: The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)
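For readers puzzled by the three numbers in the thread subject, here is a minimal sketch of how they relate. 0x1d is simply 29 in hexadecimal, and the message quoted above is the text Windows gives for system error 29. The -227 happens to equal 29 - 256; that BOINC derives it this way is an assumption, since the thread does not say. The variable names are illustrative only.

```python
# Exit code as reported in the result log, in hexadecimal.
exit_code = 0x1D

# Decimal value of the same code.
as_decimal = int(exit_code)

# Assumption: the third number in the subject line is the same value
# shifted down by 256. The thread itself does not explain it.
shifted = exit_code - 256

print(as_decimal, hex(exit_code), shifted)
```

So "29", "0x1d" and "-227" would all refer to the same exit status rather than three different errors.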

I have another WU that appears to be headed for the same fate. Mine and one repair job had 0x1d errors. The other initial wingman is PV and the second repair job is In Progress.

To me this raises two questions. First, if 0x1d indicates that "the science that is being run is determined not a good lab test case" and is really not an error, why are two results of the same WU not ending the same way? Second, is the statement that 0x1d "errors" will receive credit still correct if some results of a WU error out and some do not?
Edit: corrected typo
----------------------------------------
Bill P

----------------------------------------
[Edit 1 times, last edit by wplachy at Apr 20, 2010 11:38:58 PM]
[Apr 20, 2010 11:36:09 PM]
Sekerob
Ace Cruncher
Joined: Jul 24, 2005
Post Count: 20043
Status: Offline
Re: exited with code 29 (0x1d, -227)

My guess: 3 errors tell the system the work unit was bad, while 2 errors plus 2 valid results tell it the errors were genuine; at least that may be how the rule was coded. With 2 valid results the distribution ends, noting that 1 of the original 2 was valid. I've seen several of these; it's simply not clear whether it was intended like that.

Am I glad they're running this softly, softly and have taken an intermission to study the output of the ts (test-process verification) units.

over to techs!

PS: I did not reboot the duo that had 4 of this run for the whole duration. The first task went the same way; the next 3 ran through, and their quorums show no errors generated.
----------------------------------------
[Edit 1 times, last edit by Sekerob at Apr 21, 2010 6:05:48 AM]
[Apr 21, 2010 6:05:13 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: exited with code 29 (0x1d, -227)

To me this raises 2 questions. The first, if the 0x1d indicates "the science that is being run is determined not a good lab test case" and is really not an error why are 2 wus not getting the same result? The second is if the statement that 0x1d "errors" will receive credit still correct if some of the wus error and some do not?
Edit: corrected typo

Just a note... I had one that I invested 21 hours in and it came back as an error.

As someone noted way back in this thread, the task has to fail 5 times to be dead. Today the 5th wingman also came back with an error, and though the task is still listed as an error in the stats, I did get the 461.7 CS I claimed ... the others also got what they claimed.

Bottom line, though it does not follow the "normal" BOINC protocols as we know them for most projects, the error is not an error in the normal sense and in the end all is likely to end well ...

My point I suppose is that for me, I think I will just let the project run and when work becomes much more available I will up my numbers and work as hard as I can on the sub-project ... :)

Happy crunching...
[Apr 21, 2010 4:25:27 PM]
pirogue
Veteran Cruncher
USA
Joined: Dec 8, 2008
Post Count: 685
Status: Offline
Re: exited with code 29 (0x1d, -227)

To add an ad nauseam, policy change applied a number of weeks ago, all run time on error results, credit, result count is given after the 4th or 5th result has been computed. But that depends how that error occurred and if it's in the 'known science app failure list". So far all DDDT-2 fails I had have been given credit.

It appears that credit is not always being granted for 0x1d errors. I have one WU that 1 repair job and I errored out. The other initial wing man and 1 repair job validated. The two valids were awarded credit, the 0x1d errors were not.

ts05_ b200_ ps0000_ 3-- 612 Valid 4/17/10 05:19:17 4/20/10 08:56:00 44.35 680.6 / 777.2
ts05_ b200_ ps0000_ 2-- 612 Error 4/15/10 21:33:32 4/17/10 05:19:12 14.28 331.9 / 0.0
ts05_ b200_ ps0000_ 1-- 612 Valid 4/14/10 20:30:05 4/17/10 01:39:34 42.02 1,006.7 / 777.2
ts05_ b200_ ps0000_ 0-- 612 Error 4/14/10 20:30:03 4/15/10 21:33:29 24.61 465.3 / 0.0 <---mine

Both error results were: The system cannot write to the specified device. (0x1d) - exit code 29 (0x1d)

I have another WU that appears to be headed for the same fate. Mine and 1 repair job had 0x1d errors. the other initial wing man is PV and the repair job is In Progress.

To me this raises 2 questions. The first, if the 0x1d indicates "the science that is being run is determined not a good lab test case" and is really not an error why are 2 wus not getting the same result? The second is if the statement that 0x1d "errors" will receive credit still correct if some of the wus error and some do not?

What wplachy said^^^^^^^^^^^^^^^^

How can these sometimes be errors receiving credit and sometimes be errors and not receive credit?
ts05_ a048_ ps0000_ 2-- Error 4/17/10 06:16:23 4/18/10 17:36:51 22.25 559.5 / 0.0
ts05_ b371_ ps0000_ 1-- Error 4/14/10 20:45:13 4/18/10 02:24:28 35.40 603.9 / 0.0
ts05_ b213_ ps0000_ 1-- Error 4/14/10 20:31:24 4/17/10 11:06:35 15.87 330.6 / 0.0
ts05_ a436_ ps0000_ 0-- Error 4/14/10 20:07:19 4/19/10 02:22:15 85.32 864.7 / 864.7
ts05_ a405_ ps0000_ 1-- Error 4/14/10 20:04:42 4/16/10 12:18:41 25.86 782.5 / 782.5
ts05_ a399_ ps0000_ 1-- Error 4/14/10 20:04:13 4/16/10 15:43:58 36.37 372.0 / 0.0
ts05_ a384_ ps0000_ 0-- Error 4/14/10 20:02:35 4/16/10 15:28:23 35.40 603.9 / 0.0
ts05_ a325_ ps0000_ 1-- Error 4/14/10 19:55:28 4/15/10 20:28:19 19.24 484.8 / 0.0
ts05_ a318_ ps0000_ 0-- Error 4/14/10 19:54:35 4/15/10 10:26:20 10.70 311.4 / 311.4
ts05_ a314_ ps0000_ 0-- Error 4/14/10 19:54:02 4/16/10 11:11:28 25.73 767.4 / 0.0
ts05_ a293_ ps0000_ 1-- Error 4/14/10 19:51:38 4/15/10 04:21:19 4.57 115.0 / 115.0
ts05_ a266_ ps0000_ 0-- Error 4/14/10 19:48:24 4/16/10 11:17:19 26.81 799.8 / 0.0

That's over 200 hours of code 29 errors that might or might not be WU errors, with no credit granted.
Maybe I missed something, but what is the exact explanation for this discrepancy?
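As a quick check of the "over 200 hours" figure, the following sketch adds up the CPU hours from the list above, split by whether any credit was granted. The rows are copied from the list (result names written without the embedded spaces); the tuple layout and variable names are illustrative choices, not anything taken from the WCG site.

```python
# (result name, CPU hours, claimed credit, granted credit), copied from the list above.
rows = [
    ("ts05_a048_ps0000_2", 22.25, 559.5, 0.0),
    ("ts05_b371_ps0000_1", 35.40, 603.9, 0.0),
    ("ts05_b213_ps0000_1", 15.87, 330.6, 0.0),
    ("ts05_a436_ps0000_0", 85.32, 864.7, 864.7),
    ("ts05_a405_ps0000_1", 25.86, 782.5, 782.5),
    ("ts05_a399_ps0000_1", 36.37, 372.0, 0.0),
    ("ts05_a384_ps0000_0", 35.40, 603.9, 0.0),
    ("ts05_a325_ps0000_1", 19.24, 484.8, 0.0),
    ("ts05_a318_ps0000_0", 10.70, 311.4, 311.4),
    ("ts05_a293_ps0000_1", 4.57, 115.0, 115.0),
    ("ts05_a266_ps0000_0", 26.81, 799.8, 0.0),
    ("ts05_a314_ps0000_0", 25.73, 767.4, 0.0),
]

# CPU hours on error results that received no credit.
uncredited_hours = sum(hours for _, hours, _, granted in rows if granted == 0.0)

# CPU hours on error results that were granted their claim.
credited_hours = sum(hours for _, hours, _, granted in rows if granted > 0.0)

print(f"no credit: {uncredited_hours:.2f} h, credited: {credited_hours:.2f} h")
```

The rows with 0.0 granted add up to roughly 217 hours, which is consistent with the "over 200 hours" stated above.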
----------------------------------------

[Apr 22, 2010 3:05:49 AM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: exited with code 29 (0x1d, -227)

That's over 200 hours of code 29 errors that might or might not be WU errors, with no credit granted.
Maybe I missed something, but what is the exact explanation for this discrepancy?


I guess it's in the nature of the validator. If it sees two valid results, all error results are counted as errors. If it sees at least four error results (with at most one PV), it looks at the errors and accepts them if they are error 29. The validator should treat error 29 results as equivalent to valid ones, but I don't think it can be configured that easily. The real problem is that error 29 results are nondeterministic (IMHO resource dependent), i.e. there is no guarantee that all tasks of a WU end with error 29. In some cases at least some of them can be brought to completion if you have adequate hardware. If that problem could be solved, the special treatment by the validator would become redundant.
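The rule guessed at above can be written down as a small sketch. This is only a model of the poster's guess, not the actual WCG validator: the function name, the status strings and the thresholds are all hypothetical, and what happens to a pending result in the second case is left open because the post does not say.

```python
def guess_credit(results, min_quorum=2, error_limit=4):
    """Model of the guessed rule. `results` is a list of (status, exit_code)
    pairs with status in {"valid", "error", "pending", "in_progress"}.
    Returns one entry per result: True (credit), False (no credit) or
    None (undecided), or None overall if neither rule applies yet."""
    statuses = [status for status, _ in results]
    n_valid = statuses.count("valid")
    n_error = statuses.count("error")
    n_pending = statuses.count("pending")

    # Rule 1: quorum of valid results reached; every error counts as an error.
    if n_valid >= min_quorum:
        return [status == "valid" for status in statuses]

    # Rule 2: enough errors and at most one pending result;
    # errors are accepted if they are exit code 29.
    if n_error >= error_limit and n_pending <= 1:
        return [
            (code == 29) if status == "error" else None
            for status, code in results
        ]

    # Neither rule applies: the work unit is still open.
    return None


# wplachy's ts05_b200 case: two valid results, two error 29 results.
b200 = [("valid", 0), ("error", 29), ("valid", 0), ("error", 29)]
print(guess_credit(b200))

# A work unit where all five results ended with error 29.
all_errors = [("error", 29)] * 5
print(guess_credit(all_errors))
```

Under this model the ts05_b200 errors get nothing because two wingmen validated, while a work unit whose results all fail with error 29 gets credit for every result. That matches both kinds of cases reported in this thread, but it remains a guess.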
[Apr 22, 2010 4:44:47 AM]
boulmontjj
Senior Cruncher
France
Joined: Nov 17, 2004
Post Count: 317
Status: Offline
Re: exited with code 29 (0x1d, -227)

Hi everybody,

I have a Type A task that finished OK, and the result was sent back before the deadline.
The same WU ended in error for the 4 other members who received it.
When I look at the results, the WU appears as "too late"!!



Work Unit Status


Project name: Discovering Dengue Drugs - Together - Phase 2 (Type A)
Created on: 14/04/10
Name: ts05_b039_ps0000
Minimum quorum: 2
Initial replication: 2



Result name  App version number  Status  Sent time  Time due / Return time  CPU time (hours)  BOINC credit claimed / granted
ts05_ b039_ ps0000_ 4-- 612 Error 18/04/10 22:21:11 22/04/10 00:56:35 25,44 417,7 / 417,7
ts05_ b039_ ps0000_ 3-- 612 Error 17/04/10 19:04:35 18/04/10 22:21:09 16,73 282,0 / 282,0
ts05_ b039_ ps0000_ 2-- 612 Too late 15/04/10 18:09:02 21/04/10 14:33:32 49,72 757,2 / 757,2
ts05_ b039_ ps0000_ 1-- 612 Error 14/04/10 20:14:58 15/04/10 18:08:53 10,56 183,5 / 183,5
ts05_ b039_ ps0000_ 0-- 612 Error 14/04/10 20:14:56 17/04/10 18:41:13 22,73 388,3 / 388,3


Is this normal?
----------------------------------------

Join us and visit the France team's site here: http://www.grid-france.fr
[Apr 22, 2010 1:24:40 PM]