Thread Status: Active
Total posts in this thread: 65
catchercradle
Advanced Cruncher
Joined: Jan 16, 2009
Post Count: 126
Re: Validator not running...or something else?

[1] Without access to the system logs we can't tell what is and is not running, so we have to settle for what we can deduce from what we can see, and evidence suggests that we were only getting retries for the last week or so...


Why a server status page as provided by the native BOINC server software would be useful.
This morning I have gone from 3 to 5 tasks that have escaped from PV jail, and I have downloaded 8 tasks, seven of which are _0 or _1.
[Sep 20, 2022 5:55:25 AM]
Ingleside
Veteran Cruncher
Norway
Joined: Nov 19, 2005
Post Count: 974

P.S. The PVal jail problem isn't necessarily a Validator issue (despite what the results pages might suggest), as the way the validator picks up work units to look at depends on a flag set by the transitioner, whereas the results pages just look at the various state conditions of individual results and don't know whether that flag has been set or not!... So if the transitioner has got a bit blocked up with respect to ARP1 tasks, this sort of issue could appear (and probably won't resolve without intervention)

There's no mention of the transitioner having an application-specific part. Instead, it's only mentioned that the transitioner looks at WU and result states and triggers validation or generates new results as necessary.

With no application-specific part in the transitioner, even if multiple transitioners are used, one of them being backlogged wouldn't mean that only ARP1 gets backlogged. Instead, all projects would see some of their results stuck for multiple days waiting on validation. So if one of the transitioners really was 1 week backlogged, I find it strange that there aren't many reports of MCM and Open Pandemic also having 1-week-old results stuck waiting on validation, and that it's only ARP1 that is stuck for 1 week.
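
To illustrate the reasoning above, here is a toy Python model. It is not BOINC server code: the `WorkUnit` class, the `need_validate` field as used here, the `budget` parameter and the work unit names are all made up for illustration. It only assumes what is described in this thread: a transitioner that doesn't care which application a work unit belongs to, and a validator that only picks up work units of its own application that the transitioner has flagged.

```python
from dataclasses import dataclass


@dataclass
class WorkUnit:
    name: str
    app: str                     # illustrative application label
    results_in: int              # results received so far
    quorum: int = 2              # results needed before validation (assumed)
    need_validate: bool = False  # hypothetical stand-in for the transitioner's flag


def transitioner_pass(queue, budget):
    """Handle at most `budget` work units from one shared queue, oldest first.

    The loop never looks at wu.app, so a backlog hits every application.
    Returns the work units that were not reached in this pass.
    """
    for wu in queue[:budget]:
        if wu.results_in >= wu.quorum:
            wu.need_validate = True
    return queue[budget:]


def validator_pass(work_units, app):
    """A per-application validator only sees flagged work units of its own app."""
    return [wu.name for wu in work_units if wu.app == app and wu.need_validate]


wus = [
    WorkUnit("arp1_a", "ARP1", 2),
    WorkUnit("mcm_a", "MCM", 2),
    WorkUnit("opn_a", "OPN", 2),
    WorkUnit("arp1_b", "ARP1", 2),
    WorkUnit("mcm_b", "MCM", 2),
]

# A transitioner that can only keep up with 3 of the 5 work units.
backlog = transitioner_pass(wus, budget=3)
stuck_apps = sorted({wu.app for wu in backlog})
print(stuck_apps)                   # applications with unflagged work units
print(validator_pass(wus, "ARP1"))  # what the ARP1 validator would pick up
```

In this model the leftover work units belong to more than one application, which is why a backlog limited to ARP1 points away from a plain transitioner backlog.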

Since a backlogged transitioner doesn't seem to be the explanation for ARP1 validation being backlogged, more likely explanations include that the ARP1 validator(s) are turned off most of the time and only run infrequently, or that the ARP1 validator(s) are starved by higher-priority processes if, for example, the machine doubles as a temporary download server.

Unless the upload servers start filling up with old ARP1 results, it's not really much of a problem to have a backlog of unvalidated ARP1 results.

Note, of course, that a "we've temporarily slowed down ARP1 validation until we've fixed the work distribution problems" or some other explanation would definitely be nice to get, since, if nothing else, we wouldn't need to guess at what the problem really is.
----------------------------------------


"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
[Sep 20, 2022 1:09:49 PM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 952

Ingleside - yes, a transitioner is application-independent, but the flow of data for different applications can be very different. As we don't have "Server Status" figures or information on whether there are significant non-standard aspects to the ARP1 validator (or work unit generator?) that might interfere with things, we can only speculate as to the actual reasons for problems (and speculate we will...) -- something has caused the PVal jail issue for ARP1, it doesn't seem to be going away, and we'd like to know what the problem is :-)

ARP1 is the only current project that is likely to be troubled by significant numbers of "timed out" results (No Reply or Not Started by Deadline), which may play into whatever is causing the current issues. All my results stuck in PVal jail (for work units with at least one "timed out" result) don't seem able to escape until 6 or more days after the last viable result came in; if it wasn't for that I'd be convinced it was the validator, but...

It could be that there's something in the custom part of the ARP1 validator that is struggling at present, but how would we know? The obvious comment would be "WCG-IBM didn't have these problems", but the new system configuration is likely to be somewhat different at present.

Whatever the main problem may be, it would be interesting to know whether there is (or has been) a backlog of ARP1 work units waiting to get their "validator needed" flag set for one reason or another. After all, if the ARP1 validator(s) really is/are running at least 6 days behind all the time I think we should be seriously worried :-)

Unless the upload servers start filling up with old ARP1 results, it's not really much of a problem to have a backlog of unvalidated ARP1 results.
Given that creating the next generation of work unit for each of the 35609 grid cells requires the previous generation's unit to have validated and been assimilated, it will be a problem if this continues into "production mode", but hopefully that won't be the case...

Cheers - Al.

P.S. Whilst we can't see WCG-specific versions of BOINC Server code, a look at the generic versions of things is quite interesting (and nightmare-inducing, especially the transitioner!)...
[Sep 20, 2022 8:07:22 PM]
catchercradle
Advanced Cruncher
Joined: Jan 16, 2009
Post Count: 126

P.S. Whilst we can't see WCG-specific versions of BOINC Server code, a look at the generic versions of things is quite interesting (and nightmare-inducing, especially the transitioner!)...


I will have to take your word for that. I have installed the client and manager from source on my Linux box but haven't looked at the server code at all.
[Sep 21, 2022 8:10:08 AM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 951

I see some resends from errors, so I'm going to take a wild uneducated guess and say that the transitioner is working enough to resend errors (and maybe timeouts). It is when there are 2 or more replies that it just sits there in my list. I can see from this thread that there was a day or so that we all noticed some action. I'm going to guess that the validator is only run under supervision.

We already know that they have to manually "fill the hopper", so I'm guessing that they manually run the validator.
[Sep 21, 2022 3:14:41 PM]