World Community Grid Forums
Thread Status: Active Total posts in this thread: 214
spRocket
Senior Cruncher Joined: Mar 25, 2020 Post Count: 277 Status: Offline |
Well, something got changed (I am careful at this point not to say "fixed", as that might jinx it), but since late last night and during the night, those download problems have all but disappeared. As far as I can tell the distribution of (new) ARP1-tasks has stopped. That's what happened last time, too: The distribution of ARP1-tasks stopped and the download problems, too. The distribution of ARP1-tasks started and the download problems, too. That's what I've been (casually, no stats) observing. It would make sense, since ARP units are far and away the most data-heavy of the bunch, and in both directions as well. OPNG also seems to have a bearing on the download problems, only because there are so many separate small files in a work unit that each require separate requests. |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1956 Status: Offline |
spRocket wrote: "That's what I've been (casually, no stats) observing. It would make sense, since ARP units are far and away the most data-heavy of the bunch, and in both directions as well."
I think that this would be a misinterpretation of the symptoms.
spRocket wrote: "OPNG also seems to have a bearing on the download problems, only because there are so many separate small files in a work unit that each require separate requests."
What shows up for all project WUs is a problem of getting a connection, rather than a problem of actual transfer speed with large data files such as those of ARP1. After mentioning that I hadn't received any ARP1 WUs in 2 weeks, I got 3 a while ago, two of them on machines local to me, so I could try to babysit the downloads for a bit. As with any other project, it took some effort to get a connection for any of those files. But once that connection was established after a number of forced retries, they all downloaded fine. A bit slow maybe, but without interruption, and all but one of the data files came in within 5 minutes. I would consider that rather acceptable for a WU that on that particular system (an older Xeon-based server running a 32-bit server OS) takes about 36h (CPU time, 48h clock time)... |
||
|
cubes
World Community Grid Tech, Mapping Cancer Markers and Help Conquer Scientist Canada Joined: Mar 3, 2007 Post Count: 58 Status: Offline |
We have made some improvements to the WCG system today that should improve the download situation (repeated download attempts and "transient" HTTP errors in the BOINC client logs). In short, we have doubled the number of World Community Grid download servers and have begun tuning a related part of the system.
A somewhat longer explanation: The WCG back-end system operates as a network of virtual servers on a private cloud. File-upload and download requests are received first by our load balancer, which directs each request to an available upload/download server. As designed, our system should run with two u/d servers, but one of them was affected by a mysterious network problem that has kept several of our virtual servers offline for weeks. We suspected ghosts, cursed VM images, and OpenStack glitches, but recently, our hosting provider ruled those out for us, determining the problem to lie between a physical server and a router. The problem is not 100% fixed, but with the cause identified, we managed to squeeze the second u/d VM onto another physical server, and successfully brought it online about 9.5 hours ago.
Prior to that happy event, we looked into the source of the "transient" errors reported in client logs. As it happens, the BOINC client will log almost any kind of HTTP/HTTPS error status as a "transient HTTP error". We first investigated our upload/download server, but its logs showed a >99.9% rate of successful responses, and the server load was generally low. Whatever the exact errors the clients were receiving, it seemed they did not come directly from there. So we moved on to the load balancer.
Our load balancer runs HAProxy. Examining its operating stats showed it was the source of the BOINC "transient" errors, apparently configured to be a little over-protective of our u/d server, turning down lots of requests. Our HAProxy configuration was originally copied from IBM's, then adapted to work in the new environment, though we left many of the parameters unchanged -- maximum number of simultaneous connections, etc. As it turns out, some of those settings do not work well in the Krembil WCG cluster, at least when we're at 50% download capacity.
We made a cautious change or two, but with the new server online now, we will wait until the system settles into a new equilibrium to resume parameter tuning. The changes probably won't eliminate the "transient" errors -- initial stats from HAProxy say both download servers are saturated now, but hopefully the second download server reduces the pain, and tuning our load balancer should improve things further. Christian |
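For readers who want to picture the behaviour described in the post above, here is a minimal Python sketch of a load balancer that turns requests away once every back-end server has reached its connection cap. It is only an illustration under stated assumptions: the `Server`, `LoadBalancer`, and `simulate` names, the cap of 100 connections, and the burst of 250 simultaneous requests are all invented for the example, and it is not WCG's HAProxy configuration or HAProxy's actual algorithm (which, for instance, can queue requests before refusing them).

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    maxconn: int      # hypothetical per-server cap on simultaneous connections
    active: int = 0   # connections currently held by this server


@dataclass
class LoadBalancer:
    servers: list
    accepted: int = 0
    rejected: int = 0

    def dispatch(self) -> bool:
        """Hand one request to the least-loaded server that still has room."""
        candidates = [s for s in self.servers if s.active < s.maxconn]
        if not candidates:
            # Every server is at its cap: the request is turned down, which a
            # client would experience as an HTTP error and a later retry.
            self.rejected += 1
            return False
        target = min(candidates, key=lambda s: s.active)
        target.active += 1
        self.accepted += 1
        return True


def simulate(n_servers: int, maxconn: int, concurrent_requests: int) -> LoadBalancer:
    """Send a burst of simultaneous requests through a fresh load balancer."""
    lb = LoadBalancer([Server(f"ud{i + 1}", maxconn) for i in range(n_servers)])
    for _ in range(concurrent_requests):
        lb.dispatch()
    return lb


one_server = simulate(n_servers=1, maxconn=100, concurrent_requests=250)
two_servers = simulate(n_servers=2, maxconn=100, concurrent_requests=250)
print(one_server.accepted, one_server.rejected)    # 100 150
print(two_servers.accepted, two_servers.rejected)  # 200 50
```

With these made-up numbers, a second server cuts the rejections but does not remove them, which matches the post's expectation that doubling the download servers reduces the errors without necessarily eliminating them.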
||
|
Aperture_Science_Innovators
Advanced Cruncher United States Joined: Jul 6, 2009 Post Count: 139 Status: Offline |
Thanks for the technical update, Christian. Updates are always good news in my book. |
||
|
Strandkievit
Cruncher Joined: Mar 19, 2020 Post Count: 1 Status: Offline |
Glad to see it is not on my end. I was happy to see covid-gpu units in my download list and even put another project on hold, yet now I am down to three CPU units and the GPU units are still pending.
Greets from The Netherlands. |
||
|
Blount
Senior Cruncher Joined: Aug 19, 2005 Post Count: 474 Status: Offline |
Thank you for the technical update. However, I see little to no improvement in transient download errors. If there is an improvement, it is slight. Without taking the log and doing an analysis, I am guessing there are about 8 errors for each 'Finished' download. After 4 transient errors the downloads are delayed. Thus the errors have a huge impact on getting tasks to complete.
If you would like a more detailed analysis of my logs, just ask. |
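The rough estimate in the post above (errors per 'Finished' download) could be checked against a saved client log with a few lines of Python. This is a sketch under assumptions: it takes for granted that failed attempts contain the phrase "transient HTTP error" and completed ones contain "Finished download of", the `download_error_ratio` helper is a name invented here, and the sample lines are made up for illustration rather than copied from anyone's log.

```python
import re

# Assumed phrases; adjust them to whatever your own client log actually prints.
ERROR_PATTERN = re.compile(r"transient HTTP error", re.IGNORECASE)
FINISHED_PATTERN = re.compile(r"Finished download of", re.IGNORECASE)


def download_error_ratio(log_lines):
    """Count error and finished lines in one pass and return their ratio."""
    errors = 0
    finished = 0
    for line in log_lines:
        if ERROR_PATTERN.search(line):
            errors += 1
        elif FINISHED_PATTERN.search(line):
            finished += 1
    # Avoid dividing by zero when nothing has finished yet.
    ratio = errors / finished if finished else None
    return errors, finished, ratio


# Invented sample lines, only to show the shape of the calculation.
sample_log = [
    "Started download of file_a",
    "Temporarily failed download of file_a: transient HTTP error",
    "Temporarily failed download of file_a: transient HTTP error",
    "Temporarily failed download of file_a: transient HTTP error",
    "Finished download of file_a",
    "Temporarily failed download of file_b: transient HTTP error",
    "Finished download of file_b",
]

errors, finished, ratio = download_error_ratio(sample_log)
print(errors, finished, ratio)  # 4 2 2.0
```

To run it on a real log, pass an open file object instead of `sample_log`, since the function only needs something it can iterate over line by line.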
||
|
mwroggenbuck
Advanced Cruncher USA Joined: Nov 1, 2006 Post Count: 77 Status: Offline |
cubes wrote: "We suspected ghosts, cursed VM images, and OpenStack glitches"
Don't forget to check for Gremlins. Those little rascals are really good at hiding. Seriously, thanks for the update. I appreciate it very much.
[Edit 2 times, last edit by mwroggenbuck at Sep 24, 2022 1:47:20 PM] |
||
|
ncoded.com
Advanced Cruncher United Kingdom Joined: Aug 16, 2016 Post Count: 62 Status: Offline |
Thank you for the detailed explanation of the issues, and their possible resolutions. |
||
|
as1981
Cruncher Joined: Dec 3, 2006 Post Count: 49 Status: Offline |
Thank you for the detailed update.
|
||
|
adriverhoef
Master Cruncher The Netherlands Joined: Apr 3, 2009 Post Count: 2171 Status: Offline |
Many thanks for keeping us updated.
|
||