World Community Grid Forums
Thread Status: Active Total posts in this thread: 144
Author |
|
wildhagen
Veteran Cruncher The Netherlands Joined: Jun 5, 2009 Post Count: 832 Status: Offline |
Validation seems to be at least partially running now.
Earlier today I had 45 pages of WU's awaiting validation, that is back down to 'just' 37 pages now, and still dropping. |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1951 Status: Offline |
It seems there is still an overall issue with nobody properly monitoring the whole system, including a stuck validator.
Since yesterday, the number of my PV jail inmates has increased by about 50%, and my overall stats, both on the WCG contribution page and on external stats sites, have been cut at least in half. I wonder how long there's gonna be just crickets from Krembil...
Ralf |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7664 Status: Offline |
I think it may still be stuck. If there is more than one validator, at least one is stuck. The midday update showed 221,000 for MCM, which is about half of normal. Personally, I still show 560 in pending validation status, which is still about twice normal.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers* |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1951 Status: Offline |
Quote (wildhagen): "Validation seems to be at least partially running now."

It's not that there are no WUs validated at all; the problem is that more and more WUs are getting stuck in PV jail.

Quote (wildhagen): "Earlier today I had 45 pages of WU's awaiting validation, that is back down to 'just' 37 pages now, and still dropping."

Some are what bfmores also complained about: WUs of folks who are hoarding WUs "because we are running out of work!" and selfishly increase their buffers to rather unreasonable amounts, not realizing how this backfires on the system as a whole. Those WUs then time out and result in resends (_2, possibly _3 and _4, in the case of MCM), which might then take another 3 days or so to be returned (or not).

The other issue is that there seem to be more and more WUs where both results of the original WU have been successfully returned (in time) but do not validate within a reasonable amount of time.

Either way, this is a situation that needs to be looked at, and while Cyclops seems to have been lurking since the number of posts about this increased, there is not even a post acknowledging the problem...
Ralf |
||
|
Cyclops
Senior Cruncher Joined: Jun 13, 2022 Post Count: 295 Status: Offline |
Hi all, the tech team and I are aware of the issue and are monitoring the constantly increasing pool of workunits that are stuck pending validation for MCM1. We are looking into ways to mitigate the effect that volunteers who hoard units they are too slow to actually return in time are having on the rest of the community. Once a solution has been agreed on and put into action, we will share it here. Sorry for the confusion.
----------------------------------------
[Edit 1 times, last edit by Cyclops at Feb 8, 2023 8:43:57 PM] |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1951 Status: Offline |
Quote (Cyclops): "Hi all, the tech team and I are aware of the issue and are monitoring the constantly increasing pool of workunits that are stuck pending validation for MCM1. We are looking into ways to mitigate the effect that volunteers who hoard units they are too slow to actually return in time are having on the rest of the community. Once a solution has been agreed on and put into action, we will share it here. Sorry for the confusion."

Well, I always wonder why it takes so long to even get an acknowledgement of a problem...

Anyway, just to be clear, there seem to be at least two different issues here, which might not necessarily be connected. The expiring, hoarded WUs and subsequent resends are just one of those issues. The other is that successfully returned WUs (with and without resends) are getting stuck, apparently at random and for no apparent reason.
Ralf |
||
|
cz50975
Advanced Cruncher Joined: Dec 9, 2004 Post Count: 95 Status: Offline |
There has been a problem with some MCM validators since approximately 2023-02-06 21:30:33 UTC.

Here is an example, WU MCM1_0196018_0172:
https://www.worldcommunitygrid.org/contribution/workunit/260080169

MCM1_0196018_0172_0 Win11 PenVal 2023-02-05 07:42:05 UTC 2023-02-06 21:30:33 UTC 2.67 / 2.84 93.9 / 0
MCM1_0196018_0172_1 Win10 PenVal 2023-02-05 07:42:12 UTC 2023-02-05 14:22:48 UTC 1.32 / 1.34 75.2 / 0

The majority of uploaded MCM WUs are now waiting in the queue for validation, like the example above. After more than 48 hours in the queue with no progress, this is not just a delay in processing; it's stuck. I have a bulk of such examples, but this is the oldest. |
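The "more than 48 hours with no progress" check described above can be sketched in a few lines. This is only an illustration, not anything WCG runs: the helper names `parse_utc` and `is_stuck` are hypothetical, the later of the two timestamps on each row is taken to be the return time (an assumption, since the column headers are not shown), and the check time `now` is an assumed value chosen for the example.

```python
from datetime import datetime, timedelta, timezone

# Timestamp layout as it appears in the example rows above.
FMT = "%Y-%m-%d %H:%M:%S UTC"


def parse_utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM:SS UTC' string into an aware datetime."""
    return datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc)


def is_stuck(returned_times, now, threshold=timedelta(hours=48)):
    """Hypothetical check: all results are back, and the most recent one
    was returned more than `threshold` ago without validation."""
    if not returned_times:
        return False
    last_return = max(parse_utc(t) for t in returned_times)
    return now - last_return > threshold


# Assumed return times of the _0 and _1 results from the example.
returns = ["2023-02-06 21:30:33 UTC", "2023-02-05 14:22:48 UTC"]
# Assumed time of the check, not taken from the thread.
now = datetime(2023, 2, 9, 0, 0, 0, tzinfo=timezone.utc)
stuck = is_stuck(returns, now)
```

With these assumed inputs the last result came back a little over 50 hours before the check, so the workunit is flagged; a check made 24 hours after the last return would not flag it.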
||
|
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 953 Status: Offline |
Quote (Cyclops): "Hi all, the tech team and I are aware of the issue and are monitoring the constantly increasing pool of workunits that are stuck pending validation for MCM1. We are looking into ways to mitigate the effect that volunteers who hoard units they are too slow to actually return in time are having on the rest of the community. Once a solution has been agreed on and put into action, we will share it here. Sorry for the confusion."

Cyclops,

It would be interesting to know how many validators are being run for MCM1 -- I suspect it needs more than one, but...

As for mitigations (other than more validators!) -- the obvious one is to put a fairly tight total jobs ceiling on issued tasks per host -- even the fastest current CPUs can't get through several MCM1 tasks per CPU per hour [at present] and big, powerful machines will tend to be permanently connected to the Internet. Unfortunately, someone will then complain that it will inconvenience them if there are internet/download issues. You can't please everyone all the time...

Tinkering with deadlines and grace days might reduce the number of No Reply retries that turn out to not be needed because the No Reply machine does return a [valid] result 24 hours or so later, but I doubt that would significantly reduce the number of excess tasks out in the field :-(

Good luck to the tech team in their quest -- sadly, this over-provisioning issue is not unique to WCG and I'm not convinced most projects have an answer that doesn't irritate folks with very fast-turnaround systems (or others who run large buffers with less possible justification).

Cheers - Al. |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7664 Status: Offline |
One of my machines is a dual-CPU machine with 32 threads. I have it set to queue 64 MCM units. At present the turnaround for a unit is under 1.5 days. If the share of longer-running MCM units increases, this will degrade a bit, but will still be somewhere around 2 days. I know there are a few users with 128- or 256-thread machines who can safely and easily run queues of probably twice their thread count. These users should not be penalized by some hard limit.

I agree there should be some limit for those who over-queue the capacity of their machine, but I don't presently know how this might be implemented. Somebody smarter than me may be able to devise some feedback system where machines with chronic no-reply, late-reply, etc. results would cause the servers to involuntarily limit the number of work units issued to such machines, regardless of the individual machine setting. An override, so to speak. Some kind of enhancement or tweak to the reliable/not reliable parameter for the machine. Just speculation and musings for the moment.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers* |
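The feedback idea above, combined with the per-host ceiling suggested earlier in the thread, could look roughly like the following. It is a minimal sketch under stated assumptions, not how the WCG or BOINC servers actually work: `HostHistory`, `task_limit`, and the values `per_thread_cap=2.0` and `min_per_thread=0.25` are all hypothetical, chosen so that a fully reliable host keeps a queue of twice its thread count.

```python
from dataclasses import dataclass


@dataclass
class HostHistory:
    """Hypothetical per-host record kept by the server."""
    threads: int            # usable CPU threads on the host
    returned_on_time: int   # results returned before the deadline
    late_or_no_reply: int   # results returned late or never returned


def task_limit(host: HostHistory,
               per_thread_cap: float = 2.0,
               min_per_thread: float = 0.25) -> int:
    """Server-side cap on tasks in progress, overriding the client setting.

    The cap scales with thread count, so large machines are not penalized,
    and shrinks in proportion to the share of late or missing results.
    """
    total = host.returned_on_time + host.late_or_no_reply
    # Hosts with no history are trusted fully here (an assumption).
    reliability = 1.0 if total == 0 else host.returned_on_time / total
    per_thread = max(min_per_thread, per_thread_cap * reliability)
    return max(1, int(per_thread * host.threads))


reliable = HostHistory(threads=32, returned_on_time=500, late_or_no_reply=0)
hoarder = HostHistory(threads=8, returned_on_time=50, late_or_no_reply=150)
limits = (task_limit(reliable), task_limit(hoarder))
```

Under these assumed numbers the reliable 32-thread host keeps its 64-task queue, while the 8-thread host that returns only a quarter of its work on time is cut to 4 tasks.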
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7664 Status: Offline |
Does not appear fixed yet. 670 in PV this morning.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers* |
||