World Community Grid Forums
Thread Status: Active | Total posts in this thread: 155
Aperture_Science_Innovators
Advanced Cruncher | United States | Joined: Jul 6, 2009 | Post Count: 139 | Status: Offline
Issue appears to have spread to SCC too. Only getting resends since about 07:00 UTC.
Sgt.Joe
Ace Cruncher | USA | Joined: Jul 4, 2006 | Post Count: 7664 | Status: Offline
> Issue appears to have spread to SCC too. Only getting resends since about 07:00 UTC.

Same here. One machine dry, others to follow soon. Cheers
Sgt. Joe
*Minnesota Crunchers*
Unixchick
Veteran Cruncher | Joined: Apr 16, 2020 | Post Count: 955 | Status: Offline
Welcome to the weekend. Looks like there is only a smattering of resends for MCM and SCC available.
Unixchick
Veteran Cruncher | Joined: Apr 16, 2020 | Post Count: 955 | Status: Offline
Bumping this back up in hopes we'll see fewer "No WUs" posts.
TigerLily
Senior Cruncher | Joined: May 26, 2023 | Post Count: 280 | Status: Offline
Hi everyone,
The issue of no work unit availability on weekends was brought to the team last week. They are currently investigating and working on a fix.
Mike.Gibson
Ace Cruncher | England | Joined: Aug 23, 2007 | Post Count: 12369 | Status: Recently Active
TigerLily
Thanks, but we have been reporting this for months.

Mike
Unixchick
Veteran Cruncher | Joined: Apr 16, 2020 | Post Count: 955 | Status: Offline
Enjoying the workunits we are getting now. Thank you TigerLily for letting the team know about the weekend issues.
I'm hoping for a good weekend without issues soon.
adriverhoef
Master Cruncher | The Netherlands | Joined: Apr 3, 2009 | Post Count: 2157 | Status: Offline
Regarding the issue of no workunit availability on weekends.
Each time lately, before the weekend, I'm seeing a buildup of tasks that are "Waiting to be sent". Currently I am seeing a large number of "Waiting to be sent" tasks that don't get sent out again. A sample entry:

    <10> * MCM1_0202379_7015_0 Fedora Linux Pending Validation 2023-08-10T17:06:19 2023-08-12T06:42:29

(Generated by wcgstats -wsPQ -a0 -m0 -SS -P100, then redacted on nrs. 61-72 and 75-87.)

That's it, almost 50 tasks that look like they refuse to get sent within a reasonable amount of time. Anyone else seeing such large numbers? (You have to dig deeper than just looking at your own tasks, so a tool like wcgstats would be nice to have/use.)

Adri
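For anyone without wcgstats, here is a rough sketch of how one could tally statuses from a pasted task listing. The field layout (task name, OS, status, two timestamps) is only assumed from the sample entry above, not taken from any official format specification, and the status strings are illustrative:

```python
from collections import Counter

# Status strings assumed to match what the WCG task pages show;
# adjust to whatever your own listing actually contains.
STATUSES = ("Waiting to be sent", "Pending Validation", "In Progress", "Valid")

def tally(lines):
    """Count how many listing lines mention each known status."""
    counts = Counter()
    for line in lines:
        for status in STATUSES:
            if status in line:
                counts[status] += 1
                break
    return counts

# Two made-up lines in the same shape as the sample entry above.
sample = [
    "<10> * MCM1_0202379_7015_0 Fedora Linux Pending Validation "
    "2023-08-10T17:06:19 2023-08-12T06:42:29",
    "<11> * MCM1_0202380_0001_1 Fedora Linux Waiting to be sent "
    "2023-08-10T17:06:19 2023-08-12T06:42:29",
]
print(tally(sample))
```

This only does substring matching, so it is a quick triage aid rather than a parser; a real tool would split the fields properly.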
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 953 | Status: Offline
Adri,
I'm operating on a smaller sample than you are, but I can confirm the observation. I suspect it's a characteristic of the current "feast or famine" supply of new work units.

I posted a quite long comment about this in the News thread "2023-07-31 Update (MCM1 issue resolved)" in which I commented on the cyclic nature of this behaviour. You might find it interesting... Because Ralf's posts still seem to be moderated, I missed his response at the time[1]; it made a valid point about this being recent behaviour, which I might have commented on there if I'd seen it earlier!

This cyclic behaviour seems to have kicked in for MCM1 after the late July outage, so users with larger buffers may have suddenly acquired far more units, all due at about the same time, once supplies were restored. If user buffers were being replenished at a steadier rate (as was probably the case before the outage), the time distribution of retry requests would be far more even (and unlikely to be as problematic...)

I am unsure whether the tools they use to control the issue of new work are flexible enough to deal with this -- only an insider would know. And I fear that the only ways to stop the cyclic work pattern once it has started would be either to put a [temporary] cap on the number of tasks any user could have (for MCM1) or to find a way to get the (MCM1) feeder mechanism to give [slightly?] less priority to retries...

I'll also note here that it's no surprise that a project (such as SCC1) that uses adaptive replication is more likely to simply run out of work than to go through a huge backlog of retries! MCM1, however, could easily end up with many, many thousands[2] of retries in a worst-case scenario...

Cheers - Al.
[1] If I'd seen it then, I might have asked him whether he had actually read the bit that offered a possible explanation for the cyclic pattern, or whether he just thought it was irrelevant :-)

[2] With the large number of MCM1 work units processed each week, even a 1% retry rate would be a lot of tasks if most of them were asked for at about the same time. And judging by my recent wingmen, I suspect that the present retry rate is quite a bit higher :-(
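The burst-refill effect described above can be illustrated with a toy back-of-the-envelope model. All figures here (1,000,000 units per week, a 1% retry rate, a 6-hour deadline window) are made-up assumptions for illustration, not actual WCG numbers; the point is only that the same retry total hurts far more when every deadline was set at once:

```python
# Toy model: steady replenishment vs. a post-outage burst refill.
# All numbers are illustrative assumptions, not real WCG figures.
weekly_units = 1_000_000   # assumed MCM1 throughput per week
retry_rate = 0.01          # assumed 1% of tasks need a resend
retries = int(weekly_units * retry_rate)

# Steady case: deadlines (and thus retries) spread over the whole week.
steady_per_hour = retries / (7 * 24)

# Burst case: an outage empties every buffer, the refill lands at once,
# so all those deadlines -- and the resulting retries -- fall due in a
# narrow window (assume 6 hours here).
burst_per_hour = retries / 6

print(f"{retries} retries in total")
print(f"steady: ~{steady_per_hour:.0f} retries/hour")
print(f"burst:  ~{burst_per_hour:.0f} retries/hour")
```

Even with identical totals, the burst case concentrates the load into a spike the feeder has to absorb all at once, which matches the "feast or famine" pattern Alan describes.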
Unixchick
Veteran Cruncher | Joined: Apr 16, 2020 | Post Count: 955 | Status: Offline
Thank you Adri and Alan for taking the time to dig into the info you can gather and for posting it. I hope TigerLily will pass some of this additional info on to the team so they can solve the problem quicker.

There is definitely an issue where a repeat server outage causes a massive cache-refill event, with everything due back at the same time, and also pushes users to expand their cache sizes to avoid being caught out the next time it happens. That is a valid reaction, but it makes the initial problem worse and/or harder to diagnose. I know that on the SCC side the macOS valid/invalid mismatch issue is causing more resends, as I personally flip between being a trusted machine and earning trust again.