World Community Grid Forums
Thread Status: Active Total posts in this thread: 65
geophi
Advanced Cruncher U.S. Joined: Sep 3, 2007 Post Count: 102 Status: Offline |
I have several ARP tasks that are pending validation, each with a wingman who has also completed the task. Some of these tasks were completed by both PCs at least two or three days ago. No other task from these work units has been sent to another PC, so there is no _2 replication. If it were a problem with differing results between the two returned tasks, it should be pending verification. Example:
https://www.worldcommunitygrid.org/contribution/workunit/155120909
A few of the tasks I've completed within the last day were validated, as I ran the _2 task from the work unit after one of the initial two tasks errored out. Anyone else seeing this or have an explanation? |
||
|
MJH333
Senior Cruncher England Joined: Apr 3, 2021 Post Count: 266 Status: Offline |
"Anyone else seeing this"
Yes, I’ve got a few like that and assumed that there was some issue with the validator. By the way, the machine running the _0 task in the example you gave was quick!
Cheers, Mark |
||
|
geophi
Advanced Cruncher U.S. Joined: Sep 3, 2007 Post Count: 102 Status: Offline |
Yeah, that's my Ryzen 5 5600 running 4 ARP and 2 OPN at the same time. ARPs love Ryzens.
|
||
|
MJH333
Senior Cruncher England Joined: Apr 3, 2021 Post Count: 266 Status: Offline |
That’s impressive performance from a $180-ish CPU!
|
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 945 Status: Recently Active |
There is definitely something amiss with ARP1 validation -- retry jobs issued because of "no reply" or the "not started by deadline" error validate fairly promptly, but tasks where both initial results are returned well within the deadline sit at Pending Validation. Looking at corresponding work unit numbers shows (as one might expect) that the ones validating have much lower WU numbers than the ones that are stalled!
I'm not sure whether this happens because the transitioners are overloaded or because the validator just isn't seeing the higher-numbered units because of the way it picks up results to check. What follows is "informed speculation" (i.e. I've done a code-dive on BOINC code for another project...) but may be way off if WCG/IBM have hacked the core BOINC code of the validators or the transitioner (not a good idea, so probably not the case!)
Note that it is possible to run multiple validators and transitioners, using an algorithm to ensure that each one processes a distinct set of work units; I'd hope WCG does that! The below refers to an individual transitioner or validator...
The validator deals with work units that the transitioner has marked as needing validation, and seems to process them in ascending order of work unit number. The validator will process a certain number of work units and associated results (1000?) then have a brief sleep before starting another batch. Each time it'll tend to pick up the lowest-numbered units, so if more retries have come in during the process interval, higher-numbered ones will be missed again... However, the number of ARP1 units processed per day is low enough that getting them through the validator(s) should not be an issue (unless they're sometimes turning validators off to reduce server load...)
This leaves us with the possibility that the transitioner(s) might be suffering a backlog. The normal behaviour of the transitioner is to run through a pre-defined number of work units then, possibly after a brief pause, start again... Each batch of work units is selected based on when the work unit was marked for transitioner attention (which could be in the past!), and then [probably] in ascending order of work unit number within that. It pays no attention to which applications are involved. If there aren't enough transitioners running, the sheer volume of the relatively short run-time OPN1/OPNG work might cause issues.
If an official BOINC server expert reads this and can correct it, feel free! And, of course, if there's some other explanation for this issue, I'd love to know what it is; always willing to learn :-)
Cheers - Al.
P.S. I've got some scripts that analyse work unit result behaviour for my own results and their wingmen. If necessary, I can actually provide data on this issue from the viewpoint of a single user :-) |
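The batch behaviour speculated about in this post can be tried out with a toy simulation. This is only a sketch of the idea as described (lowest-numbered units first, fixed batch size, new low-numbered retries arriving between batches); it is not BOINC code, and the function name, batch size, and work unit numbers are all made up for illustration.

```python
import heapq

def simulate_validator(pending, arrivals_per_cycle, batch_size, cycles):
    """Toy model (hypothetical, not BOINC code): each cycle the validator
    takes the `batch_size` lowest-numbered pending work units, then any
    newly arrived work units join the pending set."""
    heap = list(pending)
    heapq.heapify(heap)
    validated = []
    for cycle in range(cycles):
        # Process one batch, always starting from the lowest WU number.
        for _ in range(min(batch_size, len(heap))):
            validated.append(heapq.heappop(heap))
        # Retries that came in while the batch was being processed.
        for wu in arrivals_per_cycle(cycle):
            heapq.heappush(heap, wu)
    return validated, sorted(heap)

# Assumed numbers: three old low-numbered units, five newer high-numbered
# ones, and three low-numbered retries arriving every cycle.
initial = [1, 2, 3, 1000, 1001, 1002, 1003, 1004]
retries = lambda cycle: [10 + 3 * cycle + i for i in range(3)]

validated, stalled = simulate_validator(initial, retries, batch_size=3, cycles=5)
print("validated:", validated)
print("still pending:", stalled)
```

With these assumed numbers the arrivals keep pace with the batch size, so the high-numbered units are never reached. If the batch size is larger than the arrival rate, as the post suggests should be the case for ARP1, the backlog clears instead.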
||
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 946 Status: Recently Active |
https://www.worldcommunitygrid.org/contribution/workunit/155117644
Same here. Just came to the forums to see what was up. |
||
|
catchercradle
Advanced Cruncher Joined: Jan 16, 2009 Post Count: 126 Status: Offline |
Snap! Just checked and the tasks awaiting validation from a few days ago haven't changed their status.
|
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 945 Status: Recently Active |
In the "Workunits are being sent out" News thread, Grumpy Swede has recently posted that the same issues now appear to be showing up with the other projects :-(
I have checked my returns over the last few days and can confirm that MCM1 and OPN1 have just started showing this behaviour today. It'll be interesting to see how long the stalled units stay in that condition for these projects...
And the above-mentioned post does put it out in the News area; hopefully WCG folks will spot it (if they aren't already aware there's an issue...)
Cheers - Al.
[Edit 1 times, last edit by alanb1951 at Sep 1, 2022 3:52:02 PM] |
||
|
spRocket
Senior Cruncher Joined: Mar 25, 2020 Post Count: 274 Status: Offline |
I just checked my results, and the four ARPs that show up as valid have turn-in dates of today or yesterday. I have a bunch of other ARP units stacked up as well in the "pending validation" status, but I have to assume that some of these got stuck during the early days of the restart. ARP can take quite a while to run, and so it may be a while before they get validated. One of my currently-running units is a _4 with two user aborts, for instance.
|
||
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 946 Status: Recently Active |
Just to make sure that the Krembil people take this seriously: these aren't WUs waiting for someone else to finish. These are WUs with multiple returned results that could be valid. The validator has failed to compare these results and either declare them valid or issue another copy of the WU.
There are links to WUs in this thread, and more can be provided if needed. While the validator might be working on some WUs, some have been skipped and are in limbo. |
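The decision described in this post (compare the returned results, then either declare them valid or issue another copy) can be sketched roughly as below. This is a hypothetical illustration only: a real validator uses project-specific comparison logic, and the function name, the quorum of 2, and the use of plain string equality on a result "hash" are all assumptions.

```python
from collections import Counter

def check_quorum(result_hashes, quorum=2):
    """Hypothetical sketch: if at least `quorum` returned results agree,
    report them as valid; otherwise ask for another copy of the WU."""
    if result_hashes:
        value, count = Counter(result_hashes).most_common(1)[0]
        if count >= quorum:
            return ("valid", value)
    return ("need_another_copy", None)

print(check_quorum(["abc", "abc"]))         # two matching results
print(check_quorum(["abc", "xyz"]))         # results differ
print(check_quorum(["abc", "xyz", "abc"]))  # third copy settles it
```

The stalled WUs in this thread are ones where neither branch appears to have been taken: two results are in, but no comparison has happened.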
||
|