World Community Grid Forums
Thread Status: Active | Total posts in this thread: 129
Crystal Pellet
Veteran Cruncher | Joined: May 21, 2008 | Post Count: 1322 | Status: Recently Active
I can confirm that (at least) several of them stalled. As recently as two days ago, I deliberately held on to 3 tasks to let them pass their deadlines, and their duplicates (or 'retries', perhaps the preferable term, which I've seen Al use) are still Waiting to be sent. I have 4 SCCs pending validation where the retries are waiting to be sent:

SCC1_0004203_MyoD1-C_38453_2 Waiting to be sent since 2023-08-16 06:34:31 UTC
KerSamson
Master Cruncher | Switzerland | Joined: Jan 29, 2007 | Post Count: 1673 | Status: Offline
Hi Sgt.Joe,

I may know the reason for so many resends at once. My two best crunching machines (16 threads each), each with a full 3-day buffer, suffered a severe power outage at my office. Unfortunately for me, the electricity connection was only restored after 6 days. All waiting WUs were cancelled by the WCG servers when the machines restarted last Wednesday evening.

Cheers, Yves
Sgt.Joe
Ace Cruncher | USA | Joined: Jul 4, 2006 | Post Count: 7668 | Status: Offline
> I may know the reason for so many resends at once. My two best crunching machines (16 threads each), each with a full 3-day buffer, suffered a severe power outage at my office. Unfortunately for me, the electricity connection was only restored after 6 days. All waiting WUs were cancelled by the WCG servers when the machines restarted last Wednesday evening. Cheers, Yves

Well, maybe so. But now somebody has kick-started the SCC feeder and I am back to a full supply again. We will just have to see if the supply holds out for the weekend.

Cheers
Sgt. Joe
*Minnesota Crunchers*
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 964 | Status: Offline
> > I may know the reason for so many resends at once. My two best crunching machines (16 threads each), each with a full 3-day buffer, suffered a severe power outage at my office. Unfortunately for me, the electricity connection was only restored after 6 days. All waiting WUs were cancelled by the WCG servers when the machines restarted last Wednesday evening. Cheers, Yves
>
> Well, maybe so. But now somebody has kick-started the SCC feeder and I am back to a full supply again. We will just have to see if the supply holds out for the weekend. Cheers

Regarding new work: there appears to be a new target (FLI1-B) -- the lowest batch seen so far seems to be 4262 and the highest so far is 4271. (I've seen these boundaries, and so has Adri's periodic sampler script.) Hopefully, that should keep us in full swing over the weekend :-)

Regarding Yves and the power outage: that may well have been a contributory factor, but I suspect there were tens of thousands of retries queued at one stage, so I don't think it's all one person's fault :-)

Cheers - Al.

[Edit 1 times, last edit by alanb1951 at Aug 26, 2023 5:38:07 AM]
Sgt.Joe
Ace Cruncher | USA | Joined: Jul 4, 2006 | Post Count: 7668 | Status: Offline
> Regarding new work: there appears to be a new target (FLI1-B) -- the lowest batch seen so far seems to be 4262 and the highest so far is 4271. (I've seen these boundaries, and so has Adri's periodic sampler script.) Hopefully, that should keep us in full swing over the weekend :-)

This series must have a small target like the "A" series, because they are just ripping through, most in less than an hour, some as short as 30 minutes.

Cheers
Sgt. Joe
*Minnesota Crunchers*
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 964 | Status: Offline
> This series must have a small target like the "A" series, because they are just ripping through, most in less than an hour, some as short as 30 minutes.

I think most of the difference you're seeing is probably down to the different sizes [and complexity] of the ligands -- smaller ligands seem to be getting out first for FLI1-B (and I think that happened for FLI1-A too...) Run time for a given receptor goes up with ligand size and complexity (though I've not yet had time to try to work out an estimator for given sizes to match the one used to size OPN1/OPNG tasks.) The two FLI1 targets both have the same number of atoms.

If you're interested, you can check this fairly easily -- the results log for an SCC1 task should have a line identifying the two files used, each name being followed by a "size = n b" item, where n is the number of atoms and b the number of branches. Both FLI1 receptor files are "size = 899 0"...

I looked at data I collected for the recent FLI1 tasks on one of my Ryzens: the newer FLI1-A tasks all had ligands over 20 atoms [and quite a few had 30 or more], whilst FLI1-B tasks don't seem to be getting into the mid-to-high 20s yet. I dug back far enough to find FLI1-A tasks with similarly small ligands, and those tended to take about the same time as similarly sized FLI1-B ones.

With a mix of FLI1-B and the appearance of yet more FLI1-A, I think the run times are likely to be all over the place for a while :-)

Cheers - Al.

P.S. The MyoD1-C receptor has 1268 atoms (about 1.4 times as large), and tasks for a given ligand size seem to take nearly twice as long to run (that part of the code seems to be order n-squared...)
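For anyone who'd like to try that log check without eyeballing the file, here is a minimal sketch in Python. The log path and the exact "name ... size = n b" line layout are assumptions based on the description above (not confirmed WCG output), so treat it as a starting point rather than official tooling:

```python
import re
import sys

# Minimal sketch: pull the "size = n b" items out of an SCC1 results
# log, where n is the number of atoms and b the number of branches (as
# described above). The exact layout is an assumption -- adjust the
# pattern to whatever your own results log actually contains.
SIZE_RE = re.compile(r"(\S+)\s+size\s*=\s*(\d+)\s+(\d+)")

def sizes(log_path):
    """Yield (file_name, atoms, branches) for every size item found."""
    with open(log_path) as log:
        for line in log:
            for m in SIZE_RE.finditer(line):
                yield m.group(1), int(m.group(2)), int(m.group(3))

if __name__ == "__main__":
    for name, atoms, branches in sizes(sys.argv[1]):
        print(f"{name}: {atoms} atoms, {branches} branches")

    # Sanity check on the "order n-squared" remark: MyoD1-C (1268 atoms)
    # vs. a FLI1 receptor (899 atoms) predicts a run-time ratio of about
    # (1268 / 899) ** 2 ~= 1.99, i.e. "nearly twice as long".
    print("predicted MyoD1-C / FLI1 run-time ratio:",
          round((1268 / 899) ** 2, 2))
```

The arithmetic in the last comment lines up with the P.S.: a receptor about 1.4 times as large taking nearly twice as long is exactly what quadratic scaling in atom count would predict.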
Sgt.Joe
Ace Cruncher | USA | Joined: Jul 4, 2006 | Post Count: 7668 | Status: Offline
Yes, those MyoD1-C do take longer, so I guessed they were bigger.
Cheers
Sgt. Joe
*Minnesota Crunchers*
adriverhoef
Master Cruncher | The Netherlands | Joined: Apr 3, 2009 | Post Count: 2166 | Status: Offline
> Regarding new work: there appears to be a new target (FLI1-B) -- the lowest batch seen so far seems to be 4262 and the highest so far is 4271. (I've seen these boundaries, and so has Adri's periodic sampler script.) Hopefully, that should keep us in full swing over the weekend :-)

Al, your inspirational remark about targets and my sampler script indeed directed me to said script where, after applying some small modifications, I found a nice way to inject the accompanying targets, so that each batch number is coupled with its target from now on. See the new result here. (Press <End> to jump to the end of the list.)

Adri
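Adri's actual sampler script isn't posted here, but the batch-to-target coupling can be illustrated from the task names that appear in this thread, which follow an SCC1_<batch>_<target>_... layout (e.g. SCC1_0004203_MyoD1-C_38453_2). A hypothetical sketch under that naming assumption, not the real script:

```python
import re
from collections import defaultdict

# Illustrative sketch only -- not the actual sampler script. It assumes
# SCC1 task names follow the SCC1_<batch>_<target>_... layout seen in
# this thread, e.g. SCC1_0004203_MyoD1-C_38453_2.
NAME_RE = re.compile(r"SCC1_(\d+)_([A-Za-z0-9-]+)_")

def batch_targets(task_names):
    """Couple each batch number with the target(s) seen for it."""
    mapping = defaultdict(set)
    for name in task_names:
        m = NAME_RE.match(name)
        if m:
            mapping[int(m.group(1))].add(m.group(2))
    return dict(mapping)

# The second name below is invented for demonstration (batch 4262 and
# target FLI1-B are from this thread; the rest of the name is not).
print(batch_targets(["SCC1_0004203_MyoD1-C_38453_2",
                     "SCC1_0004262_FLI1-B_00001_0"]))
# -> {4203: {'MyoD1-C'}, 4262: {'FLI1-B'}}
```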
hchc
Veteran Cruncher | USA | Joined: Aug 15, 2006 | Post Count: 802 | Status: Offline
> Hi hchc, Sgt.Joe, Taurus Oldbull, and GB033533, Thanks for bringing the issue to our attention. This has been forwarded to the tech team to investigate and I will provide updates as they become available.

Thanks, TigerLily. When you posted this message, I did notice a dozen or so really old work units get sent to other people, so the tech team must've kicked a few off, but it still (as of Sunday the 27th) seems to be an issue. Lotta old work units in the "waiting to be sent" state. Just letting you know! Not being impatient, just an update.

[Edit 1 times, last edit by hchc at Aug 28, 2023 2:26:11 AM]
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 964 | Status: Offline
Just a thought, given the large number of different ways feeders can be configured...
It's behaving as if new work has somehow been explicitly given priority to try to avoid the log-jam caused by large numbers of retries :-) If so, there's a high chance that most (or all?) long-standing retries won't clear while there's a lot of new work hitting the feeder. (Most retries triggered by the validator to resolve Pending Verification seem to be immune to the problem [at present...])

As hchc noted above, some delayed retries got out a few days ago, at a time when there was precious little new work (around 24th/25th August) -- coincidence?

I'd love to know how many SCC1 WUs are stuck waiting for retries (and how long it might take to get them distributed given the "same platform" requirement), but such data-dives can't be high priority for WCG under the circumstances; I suspect the number might be high enough to surprise many folks...

Probably the only long-term solution to the stop/start flow problem for SCC1 is to find a way of reducing the number of simultaneous retries in the system. Unfortunately, the best way to do that is to educate users to maintain as small a cache as is necessary for normal running [yeah, right!...] so there's a bigger supply of available work for those who can turn it around quickly, and any site-based alternative (such as shortening the deadline for SCC1 whilst adding a compensatory grace period, or forcibly capping the maximum SCC1 task count) is likely to lead to a barrage of complaints :-(

Cheers - Al.

P.S. If the priority mechanism isn't the cause, there must be an exotic problem in the server code that only seems to bite SCC1 [at present] -- MCM1 retries seem to get out in a timely fashion. Who knows whether it's a configuration choice or a server bug...
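To make the suspected starvation effect concrete, here is a toy queue model. It is emphatically not WCG's feeder code (which isn't public here); it only shows that a strict "new work first" policy never reaches queued retries while new work keeps arriving as fast as the feeder drains it:

```python
from collections import deque

# Toy model of the starvation effect described above -- NOT the real
# WCG feeder. New work arrives at the same rate the feeder serves, so
# under strict priority the queued retries never get a slot.
new_work = deque()
retries = deque(range(100))          # 100 long-standing retries
served_retries = 0

for tick in range(1000):
    new_work.extend(range(5))        # 5 new tasks hit the feeder per tick
    for _ in range(5):               # 5 feeder slots per tick
        if new_work:                 # strict priority: new work first
            new_work.popleft()
        elif retries:
            retries.popleft()
            served_retries += 1

print(f"retries served after 1000 ticks: {served_retries} of 100")
# -> 0; the retries only move when new work dries up, which matches the
#    delayed retries going out around 24th/25th August when new work
#    was scarce.
```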