Total posts in this thread: 214
Posts: 214   Pages: 22   [ Previous Page | 4 5 6 7 8 9 10 11 12 13 | Next Page ]
This topic has been viewed 95090 times and has 213 replies
spRocket
Senior Cruncher
Joined: Mar 25, 2020
Post Count: 277
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

Well, something got changed (I am careful at this point not to say "fixed", as that might jinx it), but since late last night and during the night, those download problems have all but disappeared.

As far as I can tell the distribution of (new) ARP1-tasks has stopped.

That's what happened last time, too:
The distribution of ARP1-tasks stopped and the download problems, too.
The distribution of ARP1-tasks started and the download problems, too.


That's what I've been (casually, no stats) observing. It would make sense, since ARP units are far and away the most data-heavy of the bunch, and in both directions as well.

OPNG also seems to have a bearing on the download problems, only because there are so many separate small files in a work unit that each require separate requests.
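A back-of-the-envelope sketch of why the number of files could matter when the bottleneck is getting a request accepted at all. It assumes every request is independently accepted with the same probability `p_accept`; that model and any numbers fed into it are illustrative assumptions, not measurements from WCG.

```python
def expected_requests(n_files: int, p_accept: float) -> float:
    """Expected total requests needed to fetch n_files when each request is
    independently accepted with probability p_accept (retry until accepted)."""
    if not 0.0 < p_accept <= 1.0:
        raise ValueError("p_accept must be in (0, 1]")
    return n_files / p_accept


def p_all_first_try(n_files: int, p_accept: float) -> float:
    """Probability that every file of a work unit is accepted on its first request."""
    if not 0.0 <= p_accept <= 1.0:
        raise ValueError("p_accept must be in [0, 1]")
    return p_accept ** n_files
```

Under this simple model, a work unit with many small files is much less likely to download cleanly on the first pass than one with a few large files, even if the total bytes are the same.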
[Sep 23, 2022 6:01:40 PM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1956
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

That's what I've been (casually, no stats) observing. It would make sense, since ARP units are far and away the most data-heavy of the bunch, and in both directions as well.

OPNG also seems to have a bearing on the download problems, only because there are so many separate small files in a work unit that each require separate requests.
I think that this would be a misinterpretation of the symptoms.
What shows up for all project WUs is a problem of getting a connection, rather than a problem of actual transfer speed with large data files such as those of ARP1.

After mentioning that I haven't received any ARP1 WUs in 2 weeks, I just got 3 a while ago, two of them on machines local to me, so I could try and babysit the downloads for a bit. As with any other project, it took some effort to get a connection for any of those files. But once that connection was established after a number of forced retries, they all downloaded fine. A bit slow maybe, but without interruption, and all but one of the data files came in within 5 minutes. I would consider that rather acceptable for a WU that on that particular system (an older Xeon-based server running a 32-bit server OS) takes about 36h (CPU time, 48h clock time)...
[Sep 23, 2022 7:17:29 PM]
cubes
World Community Grid Tech, Mapping Cancer Markers and Help Conquer Scientist
Canada
Joined: Mar 3, 2007
Post Count: 58
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

We have made some improvements to the WCG system today that should improve the download situation (repeated download attempts and "transient" HTTP errors in the BOINC client logs). In short, we have doubled the number of World Community Grid download servers and have begun tuning a related part of the system.

A somewhat longer explanation:

The WCG back-end system operates as a network of virtual servers on a private cloud. File upload and download requests are received first by our load balancer, which directs each request to an available upload/download server. As designed, our system should run with two u/d servers, but one of them was affected by a mysterious network problem that has kept several of our virtual servers offline for weeks. We suspected ghosts, cursed VM images, and OpenStack glitches, but recently, our hosting provider ruled those out for us, determining the problem to lie between a physical server and a router. The problem is not 100% fixed, but with the cause identified, we managed to squeeze the second u/d VM onto another physical server, and successfully brought it online about 9.5 hours ago.

Prior to that happy event, we looked into the source of the "transient" errors reported in client logs. As it happens, the BOINC client will log almost any kind of HTTP/HTTPS error status as a "transient HTTP error". We first investigated our upload/download server, but its logs showed a >99.9% rate of successful responses, and the server load was generally low. Whatever the exact errors the clients were receiving, it seemed they did not come directly from there. So we moved on to the load balancer. Our load balancer runs HAProxy. Examining its operating stats showed it was the source of the BOINC "transient" errors, apparently configured to be a little over-protective of our u/d server, turning down lots of requests. Our HAProxy configuration was originally copied from IBM's, then adapted to work in the new environment, though we left many of the parameters unchanged -- maximum number of simultaneous connections, etc. As it turns out, some of those settings do not work well in the Krembil WCG cluster, at least when we're at 50% download capacity. We made a cautious change or two, but with the new server online now, we will wait until the system settles into a new equilibrium to resume parameter tuning.
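To illustrate the effect described above, here is a toy model of a load balancer that caps simultaneous connections per back-end server and refuses anything over the cap. This is only a sketch: the class, function names, and numbers are hypothetical, it is not the WCG or IBM HAProxy configuration, and real HAProxy behaviour is more involved (for example, it can queue requests rather than refuse them immediately).

```python
class CappedBalancer:
    """Toy load balancer: each backend accepts at most max_conn concurrent
    connections; requests arriving when all backends are full are refused."""

    def __init__(self, n_backends: int, max_conn: int) -> None:
        self.active = [0] * n_backends  # current connections per backend
        self.max_conn = max_conn

    def try_connect(self):
        """Return the index of the least-loaded backend, or None if every
        backend is at its cap (the client would see an HTTP error)."""
        idx = min(range(len(self.active)), key=self.active.__getitem__)
        if self.active[idx] >= self.max_conn:
            return None
        self.active[idx] += 1
        return idx

    def release(self, idx: int) -> None:
        """Mark one connection on backend idx as finished."""
        self.active[idx] -= 1


def refused_fraction(n_backends: int, max_conn: int, concurrent_clients: int) -> float:
    """Fraction of requests refused when concurrent_clients all try to
    connect at once and nobody has finished yet."""
    if concurrent_clients <= 0:
        return 0.0
    lb = CappedBalancer(n_backends, max_conn)
    refused = sum(lb.try_connect() is None for _ in range(concurrent_clients))
    return refused / concurrent_clients
```

In this model, going from one back-end to two with the same per-server cap lowers the refused fraction but does not remove it while demand still exceeds total capacity, which matches the "reduces the pain" expectation rather than a full fix.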

The changes probably won't eliminate the "transient" errors -- initial stats from HAProxy say both download servers are saturated now, but hopefully the second download server reduces the pain, and tuning our load balancer should improve things further.

Christian
[Sep 24, 2022 3:53:27 AM]
Aperture_Science_Innovators
Advanced Cruncher
United States
Joined: Jul 6, 2009
Post Count: 139
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

Thanks for the technical update, Christian. Updates are always good news in my book.
[Sep 24, 2022 4:58:51 AM]
Strandkievit
Cruncher
Joined: Mar 19, 2020
Post Count: 1
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

Glad to see it is not on my end. I was happy to see covid-gpu units in my download list and even put another project on hold, yet now I am down to three CPU units and the GPU units are still pending.
Greets from The Netherlands.
[Sep 24, 2022 5:13:47 AM]
Blount
Senior Cruncher
Joined: Aug 19, 2005
Post Count: 474
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

Thank you for the technical update. However, I see little to no improvement in transient download errors. If there is an improvement, it is slight. Without taking the log and doing an analysis, I am guessing there are about 8 errors for each 'Finished' download. After 4 transient errors the downloads are delayed. Thus the errors have a huge impact on getting tasks to complete.

If you would like a more detailed analysis of my logs, just ask.
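A rough sketch of how such an errors-per-download estimate could be taken from a saved client log instead of guessed. The matched message fragments ("Finished download of" and "transient HTTP error") are assumptions about the log wording and may need adjusting for a given client version; the function names are hypothetical.

```python
import re

# Assumed message fragments; adjust to match your own client log.
PATTERNS = {
    "finished": re.compile(r"Finished download of"),
    "transient": re.compile(r"transient HTTP error", re.IGNORECASE),
}


def count_download_events(lines):
    """Count log lines matching each assumed pattern."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def errors_per_finished(counts):
    """Transient errors per finished download, or None if nothing finished."""
    if counts["finished"] == 0:
        return None
    return counts["transient"] / counts["finished"]
```

Usage would be along the lines of `errors_per_finished(count_download_events(open("boinc_log.txt")))`, where `boinc_log.txt` stands in for wherever the log was saved.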
[Sep 24, 2022 10:28:06 AM]
mwroggenbuck
Advanced Cruncher
USA
Joined: Nov 1, 2006
Post Count: 77
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

We suspected ghosts, cursed VM images, and OpenStack glitches


Don't forget to check for Gremlins. Those little rascals are really good at hiding.

Seriously, thanks for the update. I appreciate it very much.
----------------------------------------
[Edit 2 times, last edit by mwroggenbuck at Sep 24, 2022 1:47:20 PM]
[Sep 24, 2022 12:22:02 PM]
ncoded.com
Advanced Cruncher
United Kingdom
Joined: Aug 16, 2016
Post Count: 62
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

Thank you for the detailed explanation of the issues, and their possible resolutions.
[Sep 24, 2022 12:26:18 PM]
as1981
Cruncher
Joined: Dec 3, 2006
Post Count: 49
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

Thank you for the detailed update.
[Sep 24, 2022 12:39:18 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2171
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

Many thanks for keeping us updated.
[Sep 24, 2022 1:57:36 PM]