World Community Grid Forums
Thread Status: Active Total posts in this thread: 35
Author |
|
Cyclops
Senior Cruncher Joined: Jun 13, 2022 Post Count: 295 Status: Offline |
ARP & OPN1 workunits
On Monday afternoon, many volunteers reported receiving new ARP1 and OPN1 workunits. These workunits are not from a new batch; these are older WUs that were never sent out due to an overloaded server causing problems in our workunit-distribution process. ARP1 and OPN1/OPNG teams remain on temporary pause, preparing new workunits.

In addition, this infusion of about 2 million WUs helped us to confirm that the networking/download issues we have in the data center persist under a normal load. Improvements made by the SHARCNET team did reduce network congestion. However, based on these results, they are now implementing further modifications to the network, which should resolve these issues for the future. We will keep you updated with further details about the upcoming maintenance, once we receive more information from the SHARCNET team.

Thank you for sending reports of HTTP errors that were experienced by volunteers processing the recent ARP1/OPN1 workunits, which helped us diagnose these errors. The effect is especially strong after an outage, because of the pent-up demand by all the connected BOINC clients. The backlog of workunits released for distribution over the last few days produced the same effect.

We continue working together with the SHARCNET team on improving our network. In parallel, we are finalizing the SSD storage upgrade we mentioned in December, and this will also help in improving WCG backend performance.

If you have any comments or questions, please leave them in this thread for us to answer. Thank you for your support, patience and understanding.

WCG team |
||
|
Hans Sveen
Veteran Cruncher Norge Joined: Feb 18, 2008 Post Count: 818 Status: Offline |
Hello
Thank you again for the information! Looking forward to further information as the project is getting back to running at full steam👍🤞🏻😊 With regards, Hans S. Oslo |
||
|
Sgt.Joe
Ace Cruncher USA Joined: Jul 4, 2006 Post Count: 7655 Status: Offline |
Thank you for the update.

----------------------------------------
Cheers
Sgt. Joe
*Minnesota Crunchers* |
||
|
ADDIE2014
Cruncher Joined: Apr 13, 2019 Post Count: 31 Status: Offline |
Thanks for the update Cyclops
|
||
|
Aperture_Science_Innovators
Advanced Cruncher United States Joined: Jul 6, 2009 Post Count: 139 Status: Offline |
Aw, I was enjoying seeing work from several sub-projects again :-)

----------------------------------------
Ty for the update regardless, and may the teams get their projects ready for more work soon! |
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2153 Status: Recently Active |
Thanks for informing us volunteers!
Cyclops:

> On Monday afternoon, many volunteers reported receiving new ARP1 and OPN1 workunits. These workunits are not from a new batch; these are older WUs that were never sent out due to an overloaded server causing problems in our workunit-distribution process. ARP1 and OPN1/OPNG teams remain on temporary pause, preparing new workunits.

Does the last sentence ("teams … preparing new workunits") also apply to ARP1? I can imagine it only applies to OPN1/OPNG. ARP1-workunits are generated from the previous generation, unless they error out and get stuck, isn't it? So, as soon as an ARP1-workunit has been declared Valid, you can generate the next generation on the server and there's no need for the ARP1-team to "remain on temporary pause", unless the ARP1-team still isn't ready for downloading Valid results, of course. Is the ARP1 research team ready yet or is the ARP team still finalizing storage issues (see your post 681390)?

> In addition, this infusion of about 2 million WUs helped us to confirm that the networking/download issues we have in the data center persist under a normal load.

Is it my imagination or have the transient HTTP errors already mostly disappeared? Since 10:44 UTC and after downloading 70 tasks (OPN1, MCM1) in 37 transfer-sessions I haven't seen any HTTP error. It has become a common experience: after a few days, after an outage, the (transient) HTTP errors are disappearing.

In my experience, this also happens when all ARP1-workunits from their current generation have been sent while no new generation is being generated; in other words, once all ARP1-workunits have been sent and distributed, after turning in the computed result no new generations will be generated and the distribution of new ARP1-tasks dries out eventually.

Having said this, I haven't seen any new ARP1-tasks since 06:00 UTC this morning after turning in 13 ARP1-tasks during the past ten hours (at 16:11, 16:08, 15:43, 15:40, 15:32, 14:29, 14:24, 14:08, 13:33, 13:24, 12:54, 10:09 and 07:43 UTC). Lately, it is also a common experience that once the distribution of ARP1-tasks has completely dwindled down/dried out and a fresh restart of about 35,000 new generations happens, the HTTP errors rear their ugly heads again.

Finally, Cyclops, back in December you wrote (in post 680326) that you were thinking of starting to crunch in January. Have you had any luck yet installing BOINC? |
||
|
Cyclops
Senior Cruncher Joined: Jun 13, 2022 Post Count: 295 Status: Offline |
Hi adriverhoef,
> Does the last sentence ("teams … preparing new workunits") also apply to ARP1?

You're right about that; we should have been a bit clearer that "preparing new workunits" does not apply to ARP1. It would be more accurate to say that they are all on pause to varying degrees.

> I can imagine it only applies to OPN1/OPNG. ARP1-workunits are generated from the previous generation, unless they error out and get stuck, isn't it? So, as soon as an ARP1-workunit has been declared Valid, you can generate the next generation on the server and there's no need for the ARP1-team to "remain on temporary pause", unless the ARP1-team still isn't ready for downloading Valid results, of course. Is the ARP1 research team ready yet or is the ARP team still finalizing storage issues (see your post 681390)?

The ARP team is still working on their storage and will tell us when they are ready to send out new workunits.

> Is it my imagination or have the transient HTTP errors already mostly disappeared? Since 10:44 UTC and after downloading 70 tasks (OPN1, MCM1) in 37 transfer-sessions I haven't seen any HTTP error. It has become a common experience: after a few days, after an outage, the (transient) HTTP errors are disappearing.

The decrease in errors is likely because not all clients are asking for new workunits; some are processing existing ones, which puts less strain on the server. When EVERYONE is downloading new units, it becomes much more congested (like we saw when a lot of ARP/OPN units were downloaded earlier this week). We are working to improve our server so that even at the height of activity on our servers, HTTP errors won't happen to such a degree.

> In my experience, this also happens when all ARP1-workunits from their current generation have been sent while no new generation is being generated; in other words, once all ARP1-workunits have been sent and distributed, after turning in the computed result no new generations will be generated and the distribution of new ARP1-tasks dries out eventually. Having said this, I haven't seen any new ARP1-tasks since 06:00 UTC this morning after turning in 13 ARP1-tasks during the past ten hours (at 16:11, 16:08, 15:43, 15:40, 15:32, 14:29, 14:24, 14:08, 13:33, 13:24, 12:54, 10:09 and 07:43 UTC). Lately, it is also a common experience that once the distribution of ARP1-tasks has completely dwindled down/dried out and a fresh restart of about 35,000 new generations happens, the HTTP errors rear their ugly heads again. Finally, Cyclops, back in December you wrote (in post 680326) that you were thinking of starting to crunch in January. Have you had any luck yet installing BOINC?

Thanks for asking, I did start crunching at the beginning of January. My progress isn't available yet since I asked the tech team to use my device as a testing ground to solve the ongoing missing devices/results situation. |
||
|
bfmorse
Senior Cruncher US Joined: Jul 26, 2009 Post Count: 296 Status: Offline |
Cyclops,
> Thanks for asking, I did start crunching at the beginning of January. My progress isn't available yet since I asked the tech team to use my device as a testing ground to solve the ongoing missing devices/results situation.

Have they made any progress on our systems? As you may recall, one of my recently added systems has also been volunteered for the same purpose. |
||
|
Mike.Gibson
Ace Cruncher England Joined: Aug 23, 2007 Post Count: 12349 Status: Offline |
Cyclops, I presume that the recovery of download times is due to the servers running out of ARP1 units.
Now that you have cleared out those delayed units, will you be attempting to restart the extreme and accelerated units that have been stuck for some time? IBM managed to get previously stuck units going again by reducing the timestep from 36 seconds to 24 seconds. This applies especially to the 3 units stuck in generations 14, 16 & 17, otherwise known as ultra extremes.

Mike |
||
|
Gretar
Cruncher Iceland Joined: Dec 28, 2008 Post Count: 23 Status: Offline |
Thanks for the info Cyclops.
|
||
|