World Community Grid - View Thread - Validator not running...or something else?

World Community Grid Forums

Category: Active Research

Forum: Africa Rainfall Project

Thread: Validator not running...or something else?

Thread Status: Active
Total posts in this thread: 65

This topic has been viewed 8381 times and has 64 replies

geophi
Advanced Cruncher
U.S.
Joined: Sep 3, 2007
Post Count: 102
Status: Offline
Project Badges:

1 year badge for Help Fight Childhood Cancer

45 day badge for The Clean Energy Project - Phase 2

14 day badge for Computing for Clean Water

180 day badge for GO Fight Against Malaria

14 day badge for Uncovering Genome Mysteries

90 day badge for Outsmart Ebola Together

180 day badge for FightAIDS@Home - Phase 2

180 day badge for Smash Childhood Cancer

1 year badge for Microbiome Immunity Project

2 year badge for Africa Rainfall Project

5 year badge for OpenPandemics - COVID-19


Validator not running...or something else?

I have several ARP tasks that are pending validation even though a wingman has also completed the same work unit. Some of these tasks were completed by both PCs at least two or three days ago. No other task from these work units has been sent to another PC, so there is no _2 replication. If the problem were differing results between the two returned tasks, the status should be pending verification. Example:

https://www.worldcommunitygrid.org/contribution/workunit/155120909

A few of the tasks I've completed within the last day were validated, as I ran the _2 task from the work unit after one of the initial two tasks errored out.

Is anyone else seeing this, or does anyone have an explanation?

[Aug 31, 2022 7:58:04 PM]

MJH333
Senior Cruncher
England
Joined: Apr 3, 2021
Post Count: 266
Status: Offline
Project Badges:

50 year badge for Mapping Cancer Markers

10 year badge for Africa Rainfall Project


Re: Validator not running...or something else?

Anyone else seeing this

Yes, I’ve got a few like that and assumed that there was some issue with the validator.
By the way, the machine running the _0 task in the example you gave was quick!
Cheers,
Mark

[Aug 31, 2022 8:28:04 PM]

geophi
Advanced Cruncher
U.S.
Joined: Sep 3, 2007
Post Count: 102
Status: Offline

Re: Validator not running...or something else?

Yeah, that's my Ryzen 5 5600 running 4 ARP and 2 OPN at the same time. ARPs love Ryzens.

[Aug 31, 2022 8:54:48 PM]

MJH333
Senior Cruncher
England
Joined: Apr 3, 2021
Post Count: 266
Status: Offline

Re: Validator not running...or something else?

That’s impressive performance from a $180-ish CPU!

[Aug 31, 2022 9:10:46 PM]

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 945
Status: Recently Active
Project Badges:

1 year badge for Human Proteome Folding - Phase 2

14 day badge for Discovering Dengue Drugs - Together

14 day badge for Nutritious Rice for the World

180 day badge for Help Fight Childhood Cancer

90 day badge for Help Cure Muscular Dystrophy - Phase 2

1 year badge for The Clean Energy Project - Phase 2

180 day badge for Computing for Clean Water

1 year badge for Drug Search for Leishmaniasis

14 day badge for Computing for Sustainable Water

2 year badge for Uncovering Genome Mysteries

5 year badge for Outsmart Ebola Together

10 year badge for FightAIDS@Home - Phase 2

10 year badge for Microbiome Immunity Project

5 year badge for Africa Rainfall Project

10 year badge for OpenPandemics - COVID-19


Re: Validator not running...or something else?

There is definitely something amiss with ARP1 validation -- retry jobs issued because of "no reply" or the "not started by deadline" error validate fairly promptly, but tasks where both initial results are returned well within the deadline sit at Pending Validation. Looking at corresponding work unit numbers shows (as one might expect) that the ones validating have much lower WU numbers than the ones that are stalled!

I'm not sure whether this happens because the transitioners are overloaded or because the validator just isn't seeing the higher-numbered units because of the way it picks up results to check. What follows is "informed speculation" (i.e. I've done a code-dive on BOINC code for another project...) but may be way off if WCG/IBM have hacked the core BOINC code of the validators or the transitioner (not a good idea, so probably not the case!)

Note that it is possible to run multiple validators and transitioners, using an algorithm to ensure that each one processes a distinct set of work units; I'd hope WCG does that! The below refers to an individual transitioner or validator...
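As a rough illustration of how such a split could work, here is a minimal Python sketch in which each of n daemon instances takes only the work units whose number leaves a particular remainder when divided by n. The function name and the sample numbers are made up for illustration, and nothing here is known to match WCG's actual setup.

```python
def owned_by(instance, n_instances, wu_ids):
    """Return the work unit ids that one daemon instance would process.

    Assumption: work is split by id modulo the number of instances,
    so no two instances ever handle the same work unit.
    """
    return [wu for wu in wu_ids if wu % n_instances == instance]


# Hypothetical ids, loosely styled on the ones linked in this thread.
wu_ids = [155117644, 155120909, 155120910, 155120911]

# One share per instance when three instances are running.
shares = [owned_by(i, 3, wu_ids) for i in range(3)]
```

Every work unit lands in exactly one share, so the instances never compete for the same unit.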

The validator deals with work units that the transitioner has marked as needing validation, and seems to process them in ascending order of work unit number. The validator will process a certain number of work units and associated results (1000?) then have a brief sleep before starting another batch. Each time it'll tend to pick up the lowest-numbered units, so if more retries have come in during the process interval, higher numbered ones will be missed again... However, the number of ARP1 units processed per day is low enough that getting them through the validator(s) should not be an issue (unless they're sometimes turning validators off to reduce server load...)
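The "lowest-numbered first" effect described above can be shown with a toy simulation. This is only a sketch of the speculated behaviour, not real BOINC code: the function name, the batch size and all the numbers are assumptions chosen to keep the example small.

```python
def run_validator(pending, arrivals_per_pass, batch_size, passes):
    """Simulate a validator that always takes the lowest ids first.

    pending           -- ids already flagged for validation
    arrivals_per_pass -- ids newly flagged before each pass (list of lists)
    batch_size        -- how many work units one pass handles (assumed)
    Returns the set of ids still waiting after all passes.
    """
    pending = set(pending)
    for p in range(passes):
        pending.update(arrivals_per_pass[p])
        batch = sorted(pending)[:batch_size]  # lowest-numbered units first
        pending.difference_update(batch)
    return pending


# Three high-numbered units wait while two low-numbered retries arrive
# before every pass, and each pass can only handle two units.
stalled = {900, 901, 902}
retries = [[10, 11], [12, 13], [14, 15]]
left = run_validator(stalled, retries, batch_size=2, passes=3)
```

With these made-up numbers the three high-numbered units are never picked. Raising batch_size above the arrival rate clears them, which is why the low daily volume of ARP1 makes this explanation unlikely on its own.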

This leaves us with the possibility that the transitioner(s) might be suffering a backlog. The normal behaviour of the transitioner is to run through a pre-defined number of work units then, possibly after a brief pause, start again... Each batch of work units is selected based on when the work unit was marked for transitioner attention (which could be in the past!), and then [probably] in ascending order of work unit number within that. It pays no attention to which applications are involved.

If there aren't enough transitioners running, the sheer volume of the relatively short run-time OPN1/OPNG work might cause issues.
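The selection order speculated above can be sketched as follows. The field names, the tick values and the batch limit are all hypothetical, and the tie-break on work unit number is the "[probably]" part of the description rather than a confirmed rule.

```python
from typing import NamedTuple


class WorkUnit(NamedTuple):
    wu_id: int
    transition_time: int  # when the unit was flagged (arbitrary ticks)
    app: str              # carried along but never used for ordering


def next_batch(work_units, now, limit):
    """Pick up to `limit` units that are due, oldest flag time first.

    Assumption: ties on flag time are broken by ascending work unit id.
    """
    due = [wu for wu in work_units if wu.transition_time <= now]
    due.sort(key=lambda wu: (wu.transition_time, wu.wu_id))
    return due[:limit]


queue = [
    WorkUnit(500, 30, "arp1"),
    WorkUnit(200, 10, "opn1"),
    WorkUnit(100, 10, "opng"),
    WorkUnit(300, 20, "opn1"),
    WorkUnit(400, 99, "mcm1"),  # flagged for a future time, so not due yet
]
batch = next_batch(queue, now=50, limit=3)
```

In this toy queue the ARP1 unit misses the batch purely because three units flagged earlier fill it; the application name plays no part in the ordering.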

If an official BOINC server expert reads this and can correct it, feel free! And, of course, if there's some other explanation for this issue, I'd love to know what it is; always willing to learn :-)

Cheers - Al.

P.S. I've got some scripts that analyse work unit result behaviour for my own results and their wingmen. If necessary, I can actually provide data on this issue from the viewpoint of a single user :-)

[Sep 1, 2022 1:51:55 AM]

Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 946
Status: Recently Active
Project Badges:

45 day badge for Microbiome Immunity Project

1 year badge for Africa Rainfall Project

1 year badge for OpenPandemics - COVID-19


Re: Validator not running...or something else?

https://www.worldcommunitygrid.org/contribution/workunit/155117644

same here. just came to the forums to see what was up.

[Sep 1, 2022 3:11:02 AM]

catchercradle
Advanced Cruncher
Joined: Jan 16, 2009
Post Count: 126
Status: Offline
Project Badges:

14 day badge for Drug Search for Leishmaniasis

180 day badge for Africa Rainfall Project

14 day badge for OpenPandemics - COVID-19


Re: Validator not running...or something else?

Snap! Just checked and the tasks awaiting validation from a few days ago haven't changed their status.

[Sep 1, 2022 9:55:08 AM]

alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 945
Status: Recently Active

Re: Validator not running...or something else?

In the "Workunits are being sent out" News thread, Grumpy Swede has recently posted that the same issues now appear to be showing up with the other projects :-(

I have checked my returns over the last few days and can confirm that MCM1 and OPN1 have just started showing this behaviour today. It'll be interesting to see how long the stalled units stay in that condition for these projects...

And the above-mentioned post does put it out in the News area; hopefully WCG folks will spot it (if they aren't already aware there's an issue...)

Cheers - Al.

----------------------------------------
[Edit 1 times, last edit by alanb1951 at Sep 1, 2022 3:52:02 PM]

[Sep 1, 2022 3:50:21 PM]

spRocket
Senior Cruncher
Joined: Mar 25, 2020
Post Count: 274
Status: Offline
Project Badges:

20 year badge for OpenPandemics - COVID-19


Re: Validator not running...or something else?

I just checked my results, and the four ARPs that show up as valid have turn-in dates of today or yesterday. I have a bunch of other ARP units stacked up as well in the "pending validation" status, but I have to assume that some of these got stuck during the early days of the restart. ARP can take quite a while to run, and so it may be a while before they get validated. One of my currently-running units is a _4 with two user aborts, for instance.

[Sep 1, 2022 7:15:59 PM]

Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 946
Status: Recently Active

Re: Validator not running...or something else?

Just to make sure that the Krembil people take this seriously: these aren't WUs waiting for someone else to finish. These are WUs with multiple possible valid results. The validator has failed to compare these results and either declare them valid or issue another copy of the WU.

There are links to WUs in this thread, and more can be provided if needed. While the validator might be working on some WUs, some have been skipped and are in limbo.

[Sep 1, 2022 7:51:54 PM]
