World Community Grid Forums
Thread Status: Active Total posts in this thread: 98
|
Author |
|
Cyclops
Senior Cruncher Joined: Jun 13, 2022 Post Count: 295 Status: Offline |
Dear volunteers,
We have taken additional measures to increase both the number of WUs we can send out and the number of WUs in flight at any given time. Volunteers should see this reflected on their devices now, and may already have noticed it over the past week.

We are also relieved to share that the hosting data centre has assigned additional personnel on site to resolve our networking issues, meaning a fix is imminent. We will share with you any further updates we receive from the data centre. The network fix will allow us to bring our remaining servers online, stabilizing and further increasing the WU supply.

Until we are able to deploy all dedicated servers, we must continuously monitor and adjust the tasks scheduled in Aurora/Mesos to keep them balanced and the workunits flowing, and so far this process has been unduly intensive and sporadic. For example, a recurring job may saturate the scheduler by creating a large number of downstream jobs. This flood of new jobs might then throttle the processing rate of other waiting jobs and thereby interrupt the supply of work. To fix the problem, we would need to temporarily deschedule the parent job, decrease its frequency, or decrease the priority of its children in a way that does not starve other stages of the pipeline.

Last week, we mentioned that we have begun to investigate concerns raised by volunteers over statistics, credit, streaks, and database dumps. We will have an update on some of these issues next week. We also plan to release a more structured breakdown from the tech team, similar to a CHANGELOG, starting next week or the week after, so that we can increase the frequency and clarity of updates.
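As an illustration of the scheduling problem described above, here is a minimal Python sketch, not the actual Aurora/Mesos configuration. It assumes a toy scheduler that runs one job per tick; the names `Job`, `effective_priority`, `run_schedule`, `stock_feeder`, and the `aging_interval` parameter are all hypothetical. It shows how lowering the priority of a flood of child jobs lets another pipeline stage run first, while a simple aging rule keeps the demoted jobs from being starved.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int   # lower number runs sooner
    submitted: int  # tick at which the job entered the queue


def effective_priority(job, now, aging_interval):
    # A waiting job gains one priority level per `aging_interval` ticks,
    # so a demoted job cannot be starved indefinitely.
    return job.priority - (now - job.submitted) // aging_interval


def run_schedule(jobs, aging_interval=4):
    """Run one job per tick and return the job names in execution order."""
    pending = list(jobs)
    order = []
    now = 0
    while pending:
        # Ties are broken by submission tick, then by name.
        nxt = min(
            pending,
            key=lambda j: (effective_priority(j, now, aging_interval),
                           j.submitted, j.name),
        )
        pending.remove(nxt)
        order.append(nxt.name)
        now += 1
    return order


def make_queue(child_priority):
    # Six children spawned by one recurring parent job, plus one job
    # from another stage of the pipeline (hypothetical name).
    children = [Job(f"child{i}", child_priority, 0) for i in range(6)]
    return children + [Job("stock_feeder", 0, 0)]


flooded = run_schedule(make_queue(child_priority=0))
demoted = run_schedule(make_queue(child_priority=2))
print("same priority:", flooded)
print("children demoted:", demoted)
```

In this toy model the other stage waits behind the whole flood when the children share its priority, and runs first once they are demoted. Every child still runs in both cases.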
Future Plans for Aurora/Mesos Replacement by SLURM at the WCG

With the above in mind, although we should be able to deploy additional server resources for Aurora/Mesos job scheduling as soon as the networking issues are resolved, our team has greater familiarity and experience with the SLURM scheduler, an alternative to Aurora/Mesos. SLURM is a mature technology currently in use at many of the world’s foremost supercomputing centres, and we intend to transition fully to SLURM soon after the full WCG restart.

Pending some investigation, we may also look to expand our message-passing layer and implement a publisher/subscriber model with some notion of back-pressure to govern the chain of downloading data from researchers and creating workunits with which to stock the feeder.

From what we have observed, we expect that the move to SLURM will distribute our internal server resources more efficiently than Aurora/Mesos currently does, without losing any functionality. The port should be relatively straightforward since it overlaps with the existing skill set of the team. However, this work is not a higher priority than addressing the long-standing concerns of volunteers, which we are finally carving out the bandwidth to address.

Thanks for your patience and have a great weekend!

-WCG Tech Team |
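The back-pressure idea mentioned in the post can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's design: it uses a single publisher and a single subscriber connected by a bounded in-process queue, and the names `run_pipeline`, `max_in_flight`, `batch-N`, and `wu-from-...` are hypothetical. The point is that the publisher (downloading research data) blocks whenever the subscriber (creating workunits) falls behind, so the backlog cannot grow without limit.

```python
import queue
import threading


def run_pipeline(num_batches, max_in_flight=3):
    """Push batches through a bounded queue and return the results."""
    # A bounded queue is the back-pressure mechanism: put() blocks
    # while the queue already holds `max_in_flight` items.
    buffer = queue.Queue(maxsize=max_in_flight)
    workunits = []
    peak_depth = 0

    def publisher():
        nonlocal peak_depth
        for i in range(num_batches):
            buffer.put(f"batch-{i}")  # waits if the subscriber is behind
            peak_depth = max(peak_depth, buffer.qsize())
        buffer.put(None)  # sentinel: no more batches

    def subscriber():
        while True:
            item = buffer.get()
            if item is None:
                break
            workunits.append(f"wu-from-{item}")

    threads = [threading.Thread(target=publisher),
               threading.Thread(target=subscriber)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return workunits, peak_depth


workunits, peak = run_pipeline(10, max_in_flight=3)
print(len(workunits), "workunits created; backlog stayed within limit:", peak <= 3)
```

A real message-passing layer would run across servers rather than threads, but the same principle applies: the producer is slowed by the consumer's capacity instead of flooding it.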
||
|
dough boy
Cruncher Joined: May 22, 2012 Post Count: 8 Status: Offline |
I have been able to download several days' worth of data for almost a week now.
|
||
|
bfmorse
Senior Cruncher US Joined: Jul 26, 2009 Post Count: 296 Status: Offline |
Thank you for the detailed status report.
Many volunteers, myself included, will find this update refreshing and welcome: it details what has been done, what will be done, and a MORE DETAILED timeline, along with the CHANGELOG info you will be implementing. Thanks to the WCG Tech Team for contributing to this update! I have commented elsewhere on the WUs provided and the ongoing but lessening HTTP errors. Thanks again, Bruce |
||
|
danwat1234
Cruncher Joined: Apr 18, 2020 Post Count: 39 Status: Offline |
Thank you for the update! All of my machines seem to have been getting regular WCG work this past week or so. I'm surprised how smooth it was once the day-or-two server hiccup was through. Keep it up!
----------------------------------------[Edit 1 times, last edit by danwat1234 at Aug 27, 2022 4:23:57 AM] |
||
|
Foxus
Cruncher Joined: Oct 22, 2008 Post Count: 2 Status: Offline |
Thanks for the update. I can confirm I am getting some WUs. Nice to see the new environment coming to life.
Great job by your team. We are keeping our fingers crossed that the remaining problems will soon vanish and that you get some sleep after everything you have been through these last weeks. Good luck and may a light shine on all your ways. |
||
|
phillipspencer
Advanced Cruncher France Joined: Apr 9, 2015 Post Count: 71 Status: Offline |
Appreciate the detailed update and the indication of future priorities. Good to understand your allocation of resources too.
|
||
|
mdparkhill
Advanced Cruncher Joined: May 2, 2007 Post Count: 60 Status: Offline |
Good news to hear that someone is finally taking the networking issue seriously. It was nice to learn more about the backend for the scheduling. I had not even considered that as an issue. I do mainframes, and sometimes our schedulers go ape and it's all my fault even when it's user error. Thanks again for the updates. I just got 60-60 tasks; downloads are still a little slow, but it appears to be working better (I hope, fingers crossed).
|
||
|
nivrip
Senior Cruncher North Yorkshire Joined: Sep 13, 2007 Post Count: 264 Status: Offline |
Thanks for the info. Getting plenty of WUs now, but still occasional hiccups with some of them stuck in Transfers. Using the Retry button always does the trick within a minute or two.
----------------------------------------
ЮРКШИР КРУНЧЕР
|
||
|
spRocket
Senior Cruncher Joined: Mar 25, 2020 Post Count: 274 Status: Offline |
Still having to babysit my zoo, but the work is coming in steadily. Thanks for the update!
|
||
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 949 Status: Offline |
Love the detailed update. I'm getting WUs with some minor hiccups, but it is great to know my machine is useful again.
|
||
|
|