Total posts in this thread: 35
This topic has been viewed 80661 times and has 34 replies.
Cyclops
Senior Cruncher
Joined: Jun 13, 2022
Post Count: 295
Status: Offline
2023-01-25 Update (ARP & OPN1 workunits)

ARP & OPN1 workunits

On Monday afternoon, many volunteers reported receiving new ARP1 and OPN1 workunits. These workunits are not from a new batch; these are older WUs that were never sent out due to an overloaded server causing problems in our workunit-distribution process. ARP1 and OPN1/OPNG teams remain on temporary pause, preparing new workunits.

In addition, this infusion of about 2 million WUs helped us confirm that the networking/download issues in the data center persist even under a normal load. Improvements made by the SHARCNET team did reduce network congestion; however, based on these results, they are now implementing further modifications to the network, which should resolve these issues going forward. We will keep you updated with further details about the upcoming maintenance once we receive more information from the SHARCNET team.

Thank you to the volunteers who reported the HTTP errors experienced while processing the recent ARP1/OPN1 workunits; these reports helped us diagnose the errors. The effect is especially strong after an outage because of the pent-up demand from all the connected BOINC clients. The backlog of workunits released for distribution over the last few days produced the same effect. We continue working with the SHARCNET team on improving our network. In parallel, we are finalizing the SSD storage upgrade mentioned in December, which will also help improve WCG backend performance.
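(For readers curious about the mechanics: the usual way download clients avoid this post-outage pile-up is exponential backoff with jitter. The sketch below is a hypothetical illustration, not BOINC's or WCG's actual code; the function name and constants are made up for the example.)

```python
import random

# Illustrative sketch (assumed parameters, not BOINC's actual code):
# exponential backoff with "full jitter", so thousands of clients that
# all failed at the same moment do not retry at the same moment too.
def retry_delay(attempt, base=1.0, cap=3600.0):
    """Seconds to wait before retry number `attempt` (0-based)."""
    exp = min(cap, base * (2 ** attempt))
    # Pick uniformly in [0, exp]: the average wait still doubles per
    # attempt, but retries are spread out instead of synchronized.
    return random.uniform(0, exp)

# Delays grow roughly on a 1 s, 2 s, 4 s, ... scale up to the 1-hour cap.
delays = [round(retry_delay(a), 1) for a in range(8)]
print(delays)
```

The jitter is the important part: plain exponential backoff keeps the clients synchronized, so every retry wave hits the server at once.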

If you have any comments or questions, please leave them in this thread for us to answer. Thank you for your support, patience and understanding.

WCG team
[Jan 26, 2023 2:15:57 AM]
Hans Sveen
Veteran Cruncher
Norge
Joined: Feb 18, 2008
Post Count: 818
Status: Offline
Re: 2023-01-25 Update (ARP & OPN1 workunits)

Hello
Thank you again for the information!
Looking forward to further information as the project gets back to running at full steam 👍🤞🏻😊

With regards,
Hans S.
Oslo
[Jan 26, 2023 10:22:34 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7655
Status: Offline
Re: 2023-01-25 Update (ARP & OPN1 workunits)

Thank you for the update.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Jan 26, 2023 2:50:49 PM]
ADDIE2014
Cruncher
Joined: Apr 13, 2019
Post Count: 31
Status: Offline
Re: 2023-01-25 Update (ARP & OPN1 workunits)

Thanks for the update, Cyclops.
[Jan 26, 2023 3:07:55 PM]
Aperture_Science_Innovators
Advanced Cruncher
United States
Joined: Jul 6, 2009
Post Count: 139
Status: Offline
Re: 2023-01-25 Update (ARP & OPN1 workunits)

Aw, I was enjoying seeing work from several sub-projects again :-)

Ty for the update regardless, and may the teams get their projects ready for more work soon!
[Jan 26, 2023 4:22:04 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2153
Status: Recently Active
Re: 2023-01-25 Update (ARP & OPN1 workunits)

Thanks for informing us volunteers!

Cyclops:
On Monday afternoon, many volunteers reported receiving new ARP1 and OPN1 workunits. These workunits are not from a new batch; these are older WUs that were never sent out due to an overloaded server causing problems in our workunit-distribution process. ARP1 and OPN1/OPNG teams remain on temporary pause, preparing new workunits.

Does the last sentence ("teams … preparing new workunits") also apply to ARP1?
I can imagine it only applies to OPN1/OPNG. ARP1-workunits are generated from the previous generation, unless they error out and get stuck, isn't that right?
So, as soon as an ARP1-workunit has been declared Valid, you can generate the next generation on the server and there's no need for the ARP1-team to "remain on temporary pause", unless the ARP1-team still isn't ready for downloading Valid results, of course.
Is the ARP1-researchteam ready yet or is the ARP team still finalizing storage issues (see your post 681390)?

In addition, this infusion of about 2 million WUs helped us to confirm that the networking/download issues we have in the data center persist under a normal load.

Is it my imagination or have the transient HTTP errors already mostly disappeared? Since 10:44 UTC and after downloading 70 tasks (OPN1, MCM1) in 37 transfer-sessions I haven't seen any HTTP error. It has become a common experience: after a few days, after an outage, the (transient) HTTP errors are disappearing.
In my experience, this also happens when all ARP1-workunits from their current generation have been sent while no new generation is being generated; in other words, once all ARP1-workunits have been sent and distributed, after turning in the computed result no new generations will be generated and the distribution of new ARP1-tasks dries out eventually.
Having said this, I haven't seen any new ARP1-tasks since 06:00 UTC this morning after turning in 13 ARP1-tasks during the past ten hours (at 16:11, 16:08, 15:43, 15:40, 15:32, 14:29, 14:24, 14:08, 13:33, 13:24, 12:54, 10:09 and 07:43 UTC).
Lately, it is also a common experience that once the distribution of ARP1-tasks has completely dwindled down/dried out and a fresh restart of about 35,000 new generations happen, the HTTP errors rear their ugly heads again.

Finally, Cyclops, back in December you wrote (in post 680326) that you were thinking of starting to crunch in January. Have you had any luck yet installing BOINC?
[Jan 26, 2023 4:32:56 PM]
Cyclops
Senior Cruncher
Joined: Jun 13, 2022
Post Count: 295
Status: Offline
Re: 2023-01-25 Update (ARP & OPN1 workunits)

Hi adriverhoef,

Does the last sentence ("teams … preparing new workunits") also apply to ARP1?
I can imagine it only applies to OPN1/OPNG. ARP1-workunits are generated from the previous generation, unless they error out and get stuck, isn't it?
So, as soon as an ARP1-workunit has been declared Valid, you can generate the next generation on the server and there's no need for the ARP1-team to "remain on temporary pause", unless the ARP1-team still isn't ready for downloading Valid results, of course.
You're right about that, we should have been a bit more clear that "preparing new workunits" does not apply to ARP1. It would be more accurate to say that they are all on pause to varying degrees.

Is the ARP1-researchteam ready yet or is the ARP team still finalizing storage issues (see your post 681390)?
The ARP team is still working on their storage and will tell us when they are ready to send out new workunits.

Is it my imagination or have the transient HTTP errors already mostly disappeared? Since 10:44 UTC and after downloading 70 tasks (OPN1, MCM1) in 37 transfer-sessions I haven't seen any HTTP error. It has become a common experience: after a few days, after an outage, the (transient) HTTP errors are disappearing.
In my experience, this also happens when all ARP1-workunits from their current generation have been sent while no new generation is being generated; in other words, once all ARP1-workunits have been sent and distributed, after turning in the computed result no new generations will be generated and the distribution of new ARP1-tasks dries out eventually.
Having said this, I haven't seen any new ARP1-tasks since 06:00 UTC this morning after turning in 13 ARP1-tasks during the past ten hours (at 16:11, 16:08, 15:43, 15:40, 15:32, 14:29, 14:24, 14:08, 13:33, 13:24, 12:54, 10:09 and 07:43 UTC).
Lately, it is also a common experience that once the distribution of ARP1-tasks has completely dwindled down/dried out and a fresh restart of about 35,000 new generations happen, the HTTP errors rear their ugly heads again.
The decrease in errors is likely because not all clients are asking for new workunits; some are processing existing ones, which puts less strain on the server. When EVERYONE is downloading new units, the network becomes much more congested (as we saw when a lot of ARP/OPN units were downloaded earlier this week). We are working to improve our server so that even at the height of activity, HTTP errors won't happen to such a degree.
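(A back-of-the-envelope way to see this effect: the same total demand is harmless when spread out and overwhelming when synchronized. The numbers below are assumptions for illustration only, not WCG measurements.)

```python
# Toy model (assumed numbers, not WCG's): why synchronized downloads
# overload a server that handles the same total demand fine otherwise.
CAPACITY = 1000   # requests per minute the server can absorb (assumed)
CLIENTS = 30000   # connected BOINC clients all wanting work (assumed)

def peak_load(window_minutes):
    """Peak requests/minute if all clients request within the window."""
    return CLIENTS / window_minutes

burst = peak_load(1)     # everyone retries in the same minute: 30000 req/min
steady = peak_load(60)   # same demand spread over an hour: 500 req/min

print(burst > CAPACITY, steady <= CAPACITY)  # prints: True True
```

With these numbers the post-outage burst is 30x capacity, which is exactly when the transient HTTP errors appear, while the steady-state load sits comfortably below it.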

Finally, Cyclops, back in December you wrote (in post 680326) that you were thinking of starting to crunch in January. Have you had any luck yet installing BOINC?
Thanks for asking, I did start crunching at the beginning of January. My progress isn't available yet since I asked the tech team to use my device as a testing ground to solve the ongoing missing devices/results situation.
[Jan 26, 2023 7:54:16 PM]
bfmorse
Senior Cruncher
US
Joined: Jul 26, 2009
Post Count: 296
Status: Offline
Re: 2023-01-25 Update (ARP & OPN1 workunits)

Cyclops,
Thanks for asking, I did start crunching at the beginning of January. My progress isn't available yet since I asked the tech team to use my device as a testing ground to solve the ongoing missing devices/results situation.

Have they made any progress on our systems? As you may recall, one of my recently added systems has also been volunteered for the same purpose.
[Jan 26, 2023 10:29:06 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12349
Status: Offline
Re: 2023-01-25 Update (ARP & OPN1 workunits)

Cyclops, I presume that the recovery of download times is due to the servers running out of ARP1 units.

Now that you have cleared out those delayed units, will you be attempting to restart the extreme and accelerated units that have been stuck for some time? IBM managed to get previously stuck units going again by reducing the timestep from 36 seconds to 24 seconds. This applies especially to the 3 units stuck in generations 14, 16 & 17, otherwise known as ultra extremes.

Mike
[Jan 26, 2023 11:23:00 PM]
Gretar
Cruncher
Iceland
Joined: Dec 28, 2008
Post Count: 23
Status: Offline
Re: 2023-01-25 Update (ARP & OPN1 workunits)

Thanks for the info, Cyclops.
[Jan 27, 2023 11:29:18 AM]