World Community Grid Forums
Thread Status: Active Total posts in this thread: 20
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2167 Status: Offline
Curiously enough, three tasks in the same workunit, all wingmen ending in 'Error'.
The Outcome is showing code '6', which is a validation error.
workunit 163304830
ARP1_0028811_130_0 Linux Ubuntu Error 2022-09-12T21:20:10 2022-09-19T14:47:20 19.28/19.67 620.7/0.0
---------------------------------------------------------------------------------------------------------------------------------------
Details:
ARP1_0028811_130_0 Linux Ubuntu Error 2022-09-12T21:20:10 2022-09-19T14:47:20 19.28/19.67 620.7/0.0
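For readers unfamiliar with the numeric Outcome values: they appear to follow the result outcome constants used by the BOINC server software. Here is a minimal Python sketch of that mapping, assuming WCG uses the stock BOINC numbering (the thread itself only confirms that 6 means validation error). The names `Outcome` and `describe_outcome` are made up for this illustration.

```python
from enum import IntEnum


class Outcome(IntEnum):
    """Result outcome codes as defined by stock BOINC (assumed to apply to WCG)."""
    INIT = 0
    SUCCESS = 1
    COULDNT_SEND = 2
    CLIENT_ERROR = 3
    NO_REPLY = 4
    DIDNT_NEED = 5
    VALIDATE_ERROR = 6   # the code discussed in this thread
    CLIENT_DETACHED = 7


def describe_outcome(code: int) -> str:
    """Turn a numeric outcome code into a readable label."""
    try:
        return Outcome(code).name.replace("_", " ").lower()
    except ValueError:
        # Codes outside the known range are reported rather than guessed at.
        return f"unknown outcome {code}"


print(describe_outcome(6))
```

A lookup like this is only useful when the raw Outcome value is available, which, as noted further down the thread, is not the case for every API endpoint.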
||
|
catchercradle
Advanced Cruncher Joined: Jan 16, 2009 Post Count: 128 Status: Offline
I have had that before but not on WCG. Never did get to the bottom of it. I wouldn't worry too much as long as it is the only instance of it. My only error showing so far from 76 tasks on my results page is a "couldn't get input file" one, so almost certainly to do with the server problems.
Unless, and this is a bit of a long shot given that I'm not so familiar with WCG practices, could it be something to do with the validation problems?
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2167 Status: Offline
"I have had that before but not on WCG. Never did get to the bottom of it. I wouldn't worry too much as long as it is the only instance of it. My only error showing so far from 76 tasks on my results page is a "couldn't get input file" one, so almost certainly to do with the server problems."
For me it is the only instance, that's correct, and I'm running tens of ARP1-tasks per day. I only report problems if there are any.
The last time I reported one with ARP1 was in post 674290 (in July 2022), which was 1700 ARP1-tasks ago for me.
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 968 Status: Offline
Adri,
Thanks for flagging this work unit up as a possible issue -- I see it now has two retries (short deadline, unfortunately). Here's hoping those succeed...
Regarding unexplained errors -- I often see one or two MCM1 wingmen returning results where the stderr log looks fine but the result is marked as an Error. It's always from one of the same small number of host systems so it suggests some sort of "hidden" issue with said host(s)... It's not exactly the same as your case, but it is an example of an Error without an obvious diagnostic symptom[1] :-)
I wonder if that happens when there's something wrong with the uploaded result file(s) - possibly even reporting a result with files that never uploaded properly? - different validators may respond to such issues in different ways, I guess. [I know, I know; useless speculation :-)...]
Cheers - Al.
[1] I'm picking up the data for wingman analysis using the new API, which [unfortunately] doesn't include all the status flag values, so I don't get to see Outcome, ServerState or ValidateState explicitly, just their combined value as translated into the status used to report on a new-style WCG results page! Ah, well, it gets all I need 99.99% of the time ...
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2167 Status: Offline
Al:
"[1] I'm picking up the data for wingman analysis using the new API, which [unfortunately] doesn't include all the status flag values, so I don't get to see Outcome, ServerState or ValidateState explicitly, just their combined value as translated into the status used to report on a new-style WCG results page!"
That's correct. We only seem to get a look at the values of Outcome, ServerState and ValidateState that you mentioned when using our personal member URL "$HTTPS$URLBASE/api/members/$MEMBER/results?code=$VERIFY&format=json$ServerState$ValidateState$Outcome$SortBy$Offset&Limit=$opt_l" (as copied from wcgresults; that was the easiest way for me).
Adri
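To make the shape of that member URL a bit more concrete, here is a rough Python sketch that assembles it from its parts. Only `code`, `format` and `Limit` appear literally in the template above; the query keys `ServerState`, `ValidateState`, `Outcome`, `SortBy` and `Offset` are inferred from the shell variable names and should be treated as assumptions, as should the function name `build_results_url`. The base URL is left as a parameter because the template does not spell it out.

```python
from urllib.parse import quote, urlencode


def build_results_url(base_url, member, verify_code, *, server_state=None,
                      validate_state=None, outcome=None, sort_by=None,
                      offset=None, limit=None):
    """Assemble a member results URL following the template quoted above."""
    # 'code' and 'format=json' are always present in the template.
    params = [("code", verify_code), ("format", "json")]
    # Optional filters, in the order the template lists them.
    # The key names are assumptions based on the variable names.
    optional = [
        ("ServerState", server_state),
        ("ValidateState", validate_state),
        ("Outcome", outcome),
        ("SortBy", sort_by),
        ("Offset", offset),
        ("Limit", limit),
    ]
    params += [(key, value) for key, value in optional if value is not None]
    member_part = quote(member, safe="")  # escape spaces etc. in the member name
    return f"{base_url.rstrip('/')}/api/members/{member_part}/results?{urlencode(params)}"


# Placeholder values only; substitute your own base URL, member name and code.
print(build_results_url("https://example.invalid", "some member", "abc123",
                        outcome=6, limit=25))
```

Leaving the filters optional mirrors the template, where each shell variable can presumably expand to nothing when that filter is not wanted.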
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2167 Status: Offline
Here is another workunit where the Outcome was probably 6 (validation error); my machine failed to grab some input file:
workunit 163359747
ARP1_0017853_130_0 Linux Ubuntu Error 2022-09-12T21:05:24 2022-09-14T08:02:43 10.79/10.96 90.1/0.0
---------------------------------------------------------------------------------------------------------------------------------------
Details: ($ wcgstats -wHrrr -=ARP1_0017853_130)
ARP1_0017853_130_0 Linux Ubuntu Error 2022-09-12T21:05:24 2022-09-14T08:02:43 10.79/10.96 90.1/0.0
[Edit 1 times, last edit by adriverhoef at Sep 26, 2022 5:59:24 PM]
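For anyone wanting to process rows like the one above in a script, here is a small Python sketch of a parser. It is not part of wcgstats; `parse_row` and `TaskRow` are hypothetical names. It assumes the status is a single word (as in "Error") and that the OS name may span several words ("Linux Ubuntu"). The thread does not say what the two slash-separated pairs at the end mean, so they are kept as plain number pairs without a label.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TaskRow:
    task: str
    os_name: str
    status: str
    sent: datetime
    returned: datetime
    pair_a: tuple  # first "x/y" pair; meaning not stated in the thread
    pair_b: tuple  # second "x/y" pair; meaning not stated in the thread


def _pair(text):
    """Split 'x/y' into two floats."""
    left, right = text.split("/")
    return float(left), float(right)


def parse_row(line):
    """Parse one result row, reading fixed fields from the right-hand side."""
    parts = line.split()
    if len(parts) < 7:
        raise ValueError(f"too few fields: {line!r}")
    return TaskRow(
        task=parts[0],
        os_name=" ".join(parts[1:-5]),  # everything between task and status
        status=parts[-5],               # assumed to be a single word
        sent=datetime.fromisoformat(parts[-4]),
        returned=datetime.fromisoformat(parts[-3]),
        pair_a=_pair(parts[-2]),
        pair_b=_pair(parts[-1]),
    )


row = parse_row("ARP1_0017853_130_0 Linux Ubuntu Error "
                "2022-09-12T21:05:24 2022-09-14T08:02:43 10.79/10.96 90.1/0.0")
print(row.task, row.status, row.returned - row.sent)
```

Reading from the right keeps the parser working for both "Linux Ubuntu" and single-word OS names, but a multi-word status would need extra handling.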
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2167 Status: Offline
Workunit 163304830 (mentioned at the top of this thread) has, despite the 3 initial Errors for the devices that received tasks _0, _1 and _2, nevertheless validated for wingmen _3 and _4:
workunit 163304830: ($ wcgstats -wIrrr -=ARP1_0028811_130)
App: Africa Rainfall Project
---------------------------------------------------------------------------------------------------------------------------------------
Details: ($ wcgstats -wHrrr -=ARP1_0028811_130)
…
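The suffixes _0 to _4 mentioned above are the last part of each task name, with the rest being the workunit name (for example ARP1_0028811_130). A minimal Python sketch of that split, assuming every task name ends in an underscore followed by a number; `split_task_name` is a made-up helper name:

```python
def split_task_name(task_name):
    """Split 'ARP1_0028811_130_3' into ('ARP1_0028811_130', 3)."""
    workunit, sep, suffix = task_name.rpartition("_")
    if not sep or not workunit or not suffix.isdigit():
        raise ValueError(f"unexpected task name: {task_name!r}")
    return workunit, int(suffix)


# Tasks _0.._2 ended in Error, _3 and _4 validated (per the post above).
for name in ("ARP1_0028811_130_0", "ARP1_0028811_130_3"):
    print(split_task_name(name))
```

Grouping rows by the first element of that tuple is one way to collect all wingmen of a workunit.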
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 968 Status: Offline
Adri - I see some worrying trends regarding ARP1 tasks at the moment, and your two examples serve to highlight the problems :-(
For what it's worth, the first example eventually validated without recourse to a sixth task, an outcome I'm seeing quite often (though the errors are different - see below...) The second one is a real problem though -- that's a task that is now stuck but not because the model data is causing a crash.
When monitoring the progress of my ARP1 tasks I expect to see "No Reply" and "Not Started by Deadline" returns amongst the wingmen, but over the last week or so I'm seeing an alarming increase in wingmen getting download errors (usually "wrong size"). The worst I've seen in an ARP1 task was two such errors, usually for the initial pair of tasks, but I've seen three for MCM1 tasks a couple of times now... Fortunately, all such tasks so far have managed to complete, sometimes needing all six attempts!
Your second example is, I suspect, one of quite a few similar cases, most of which will pass unobserved by end users... I hope that the WCG folks are aware of this and have access to whatever knowledge and tools the IBM folks used to resolve the original stuck unit problems, as I fear we may be in for a stuck unit re-run that has nothing to do with model-induced SIGSEGV errors :-(
And, of course, there needs to be a solution to the "wrong size" and "checksum error" download problems too -- I seem to recall that used to happen occasionally at IBM-WCG, but I don't think it was with the same frequency.
Cheers - Al.
P.S. Strangely, I haven't experienced a download error yet (said he, tempting providence...) -- it has always been other tasks for the same unit...
||
|
PMH_UK
Veteran Cruncher UK Joined: Apr 26, 2007 Post Count: 771 Status: Offline
Another 2 in error with clean looking logs:
https://www.worldcommunitygrid.org/contribution/workunit/163419828
Paul.
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2167 Status: Offline
One task (on a client that has consistently been returning 9 valid tasks per day over the past few weeks) is showing an anomaly with an Outcome of 'validation error' (code: 6); my task was the blue-coloured one with suffix _2.
workunit 657037701
ARP1_0016751_140_0 Linuxmint Error 2025-01-30T14:55:30 2025-02-04T15:35:28 0.00/0.00 0.0/0.0
Details:
---------------------------------------------------------------------------------------------------------------------------------------
ARP1_0016751_140_0 Linuxmint Error 2025-01-30T14:55:30 2025-02-04T15:35:28 0.00/0.00 0.0/0.0
Adri
[Edit 1 times, last edit by adriverhoef at Feb 5, 2025 9:54:34 PM]