World Community Grid Forums
Thread Status: Active Total posts in this thread: 54
Richard Haselgrove
Senior Cruncher United Kingdom Joined: Feb 19, 2021 Post Count: 360 Status: Offline |
I have been allocated a number of OPNG tasks today - this batch, for Intel GPU under Windows.
The allocation is received OK, but BOINC struggles to download the associated data files. The server response is
19/08/2022 15:00:50 | World Community Grid | [http] [ID#8341] Received header from server: HTTP/1.0 503 Service Unavailable
19/08/2022 15:00:50 | World Community Grid | [http] [ID#8341] Received header from server: No server is available to handle this request.
and multiple retries are needed. |
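For anyone who wants to quantify how often these responses occur, here is a minimal sketch that tallies the HTTP status codes echoed in `[http]` debug lines like the two above. The function name `count_http_statuses` and the regular expression are illustrative assumptions based only on the log format quoted in this post, not part of BOINC itself.

```python
import re
from collections import Counter

# Matches the status line echoed in the client's [http] debug output, e.g.
# "... Received header from server: HTTP/1.0 503 Service Unavailable".
# The pattern is an assumption derived from the two lines quoted in the post.
STATUS_RE = re.compile(r"Received header from server: HTTP/\d(?:\.\d)? (\d{3})")


def count_http_statuses(log_lines):
    """Return a Counter mapping HTTP status code (int) to its number of occurrences."""
    counts = Counter()
    for line in log_lines:
        match = STATUS_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# The two log lines from the post; only the first carries a status code.
sample = [
    "19/08/2022 15:00:50 | World Community Grid | [http] [ID#8341] "
    "Received header from server: HTTP/1.0 503 Service Unavailable",
    "19/08/2022 15:00:50 | World Community Grid | [http] [ID#8341] "
    "Received header from server: No server is available to handle this request.",
]

print(count_http_statuses(sample))
```

Feeding a whole saved event log through the function (one line per list item) would give a rough ratio of 503 responses to successful ones over time.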
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1948 Status: Offline |
> I have been allocated a number of OPNG tasks today - this batch, for Intel GPU under Windows.

Yes, that's what everybody is dealing with. Looks like Krembil's hardware is not up to snuff to handle the workload of uploading the WUs. Happens to all of us...

> The allocation is received OK, but BOINC struggles to download the associated data files. The server response is
> 19/08/2022 15:00:50 | World Community Grid | [http] [ID#8341] Received header from server: HTTP/1.0 503 Service Unavailable
> 19/08/2022 15:00:50 | World Community Grid | [http] [ID#8341] Received header from server: No server is available to handle this request.
> and multiple retries are needed.

Ralf |
||
|
Richard Haselgrove
Senior Cruncher United Kingdom Joined: Feb 19, 2021 Post Count: 360 Status: Offline |
It's the first time I've been allocated enough tasks to notice since the project restart. If they're ramping up the work creation, they need to beef up, or fine tune, the download servers at the same rate.
Exactly the same error message occurs under Linux, but Linux seems better at holding on to an open connection, and re-using it, once it's been established. That might be one specific detail to explore. |
||
|
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 946 Status: Offline |
I'm hoping that the fact that they have started sending us more tasks is so a network tech can work on the issues. You can't find the leaks in the hose if the water is turned off. The OPNGs are now going out, so we can check that issue off the list.
Again, it would be nice if we knew what was going on officially instead of guessing. |
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1948 Status: Offline |
> Exactly the same error message occurs under Linux, but Linux seems better at holding on to an open connection, and re-using it, once it's been established. That might be one specific detail to explore.

Well, no. Those errors are HTTPS errors, clearly defined protocol errors, and they are the same regardless of which OS is involved on either side...

Ralf |
||
|
Richard Haselgrove
Senior Cruncher United Kingdom Joined: Feb 19, 2021 Post Count: 360 Status: Offline |
One other group of messages that made me make that remark was
19/08/2022 15:21:16 | World Community Grid | [http] [ID#21847] Info: Too old connection (133 seconds), disconnect it
19/08/2022 15:21:16 | World Community Grid | [http] [ID#21847] Info: Connection 19130 seems to be dead!
19/08/2022 15:21:16 | World Community Grid | [http] [ID#21847] Info: Closing connection 19130
19/08/2022 15:21:16 | World Community Grid | [http] [ID#21847] Info: Found bundle for host download.worldcommunitygrid.org: 0x55951fe30e40 [can multiplex]
19/08/2022 15:21:16 | World Community Grid | [http] [ID#21847] Info: Re-using existing connection! (#19129) with host download.worldcommunitygrid.org
19/08/2022 15:21:16 | World Community Grid | [http] [ID#21847] Info: Connected to download.worldcommunitygrid.org (199.241.167.118) port 443 (#19129)
19/08/2022 15:21:16 | World Community Grid | [http] [ID#21847] Info: Using Stream ID: 53 (easy handle 0x55951fd74840)
Disconnecting after 133 seconds when it is still needed seems inefficient. It's all about how well the HTTPS tools are being used. Windows seems to drop connections after 20 seconds, which is even worse. |
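To compare connection handling between hosts more systematically, a small sketch along these lines could pull the connection ages and IDs out of such `Info:` lines, which look like libcurl verbose output relayed by the client. The function name `summarize_connections` and the three patterns are assumptions drawn solely from the excerpt above; other message variants would need their own patterns.

```python
import re

# Patterns for three "Info:" messages that appear in the excerpt above.
# They are assumptions based on that excerpt, not an exhaustive list.
TOO_OLD_RE = re.compile(r"Too old connection \((\d+) seconds\)")
CLOSING_RE = re.compile(r"Closing connection (\d+)")
REUSING_RE = re.compile(r"Re-using existing connection! \(#(\d+)\)")


def summarize_connections(log_lines):
    """Collect ages of connections dropped as too old, plus closed and re-used IDs."""
    summary = {"too_old_ages": [], "closed": [], "reused": []}
    rules = (
        ("too_old_ages", TOO_OLD_RE),
        ("closed", CLOSING_RE),
        ("reused", REUSING_RE),
    )
    for line in log_lines:
        for key, pattern in rules:
            match = pattern.search(line)
            if match:
                summary[key].append(int(match.group(1)))
    return summary


# Message parts of four lines from the excerpt (timestamps omitted for brevity).
sample = [
    "[http] [ID#21847] Info: Too old connection (133 seconds), disconnect it",
    "[http] [ID#21847] Info: Connection 19130 seems to be dead!",
    "[http] [ID#21847] Info: Closing connection 19130",
    "[http] [ID#21847] Info: Re-using existing connection! (#19129) "
    "with host download.worldcommunitygrid.org",
]

print(summarize_connections(sample))
```

In this excerpt, connection 19130 is closed as too old while connection 19129 is re-used for the same host, so running the summary over a longer Windows and Linux log would show whether the two really differ in how long connections survive.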
||
|
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1948 Status: Offline |
> One other group of messages that made me make that remark was

Those are not http errors but probably from the BOINC client, referring to BOINC transfer protocol issues.

> 19/08/2022 15:21:16 | World Community Grid | [http] [ID#21847] Info: Too old connection (133 seconds), disconnect it
> 19/08/2022 15:21:16 | World Community Grid | [http] [ID#21847] Info: Connection 19130 seems to be dead!
> 19/08/2022 15:21:16 | World Community Grid | [http] [ID#21847] Info: Closing connection 19130
> 19/08/2022 15:21:16 | World Community Grid | [http] [ID#21847] Info: Found bundle for host download.worldcommunitygrid.org: 0x55951fe30e40 [can multiplex]
> 19/08/2022 15:21:16 | World Community Grid | [http] [ID#21847] Info: Re-using existing connection! (#19129) with host download.worldcommunitygrid.org
> 19/08/2022 15:21:16 | World Community Grid | [http] [ID#21847] Info: Connected to download.worldcommunitygrid.org (199.241.167.118) port 443 (#19129)
> 19/08/2022 15:21:16 | World Community Grid | [http] [ID#21847] Info: Using Stream ID: 53 (easy handle 0x55951fd74840)
> Disconnecting after 133 seconds when it is still needed seems inefficient. It's all about how well the HTTPS tools are being used. Windows seems to drop connections after 20 seconds, which is even worse.

By and large, the current problems that we all experience do not have anything to do with the OS involved, and I am pretty sure that the WCG stuff is running on Linux based servers... |
||
|
narf57
Cruncher Joined: Dec 19, 2014 Post Count: 3 Status: Offline |
Also getting lots of new WU, but all of them have transient HTTP errors, and require multiple retries to download. At least I now have 22 WU in the queue.
|
||
|
Richard Haselgrove
Senior Cruncher United Kingdom Joined: Feb 19, 2021 Post Count: 360 Status: Offline |
Well, there's HTTP, and there's HTTPS. HTTPS has much greater overheads in establishing each separate connection, which probably limits the number of concurrent connects to any one given server.
We're getting into network wrangling here, and that's an extreme discipline even within the general area of data management. In going on about it, I'm drawing on 15 years' experience computing with BOINC, and during that time it's become clear to me that even the most experienced BOINC project administrators and server operators have very little direct knowledge of the BOINC client behaviour as seen from our end of the cable. [I once heard an enthusiastic and emphatic 'hear, hear' down the line, when I made a statement like that on a BOINC teleconference. It came from one of the most experienced BOINC project administrators of all, overseeing a network that was comparably busy or even busier than WCG at its peak.] I'm simply trying to give the Krembil team some experience by proxy of what we can see here, and what they might want to think about. And I think I've said enough for now. |
||
|
Ian-n-Steve C.
Senior Cruncher United States Joined: May 15, 2020 Post Count: 180 Status: Offline |
they get through eventually if you just keep retrying them.
for example, this worked fine to hammer through my stuck transfers on linux using the boinccmd tool.
watch -n 30 ./boinccmd --network_available
once the files downloaded, the tasks processed and uploaded normally.
----------------------------------------
EPYC 7V12 / [5] RTX A4000
EPYC 7B12 / [5] RTX 3080Ti + [2] RTX 2080Ti
EPYC 7B12 / [6] RTX 3070Ti + [2] RTX 3060
[2] EPYC 7642 / [2] RTX 2080Ti |
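On systems without `watch` (Windows, for instance), the same periodic nudge could be sketched in Python. This is only an illustration under stated assumptions: `nudge_transfers` is a hypothetical helper name, the default `./boinccmd` path mirrors the command in the post, and the `run` and `sleep` parameters exist purely so the loop can be exercised without a BOINC client present.

```python
import subprocess
import time


def nudge_transfers(attempts, interval_seconds, boinccmd="./boinccmd",
                    run=subprocess.run, sleep=time.sleep):
    """Call `boinccmd --network_available` a fixed number of times.

    Pauses `interval_seconds` between calls and returns how many calls
    exited with status 0. `run` and `sleep` are injectable for testing.
    """
    successes = 0
    for attempt in range(attempts):
        # Same command as in the post: ask the client to retry deferred transfers.
        result = run([boinccmd, "--network_available"], check=False)
        if result.returncode == 0:
            successes += 1
        # No need to wait after the final attempt.
        if attempt < attempts - 1:
            sleep(interval_seconds)
    return successes


# Example call (left as a comment so importing this file does not run boinccmd):
# nudge_transfers(attempts=10, interval_seconds=30)
```

Unlike `watch`, this version stops after a fixed number of attempts, which avoids leaving a loop hammering the download servers indefinitely.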
||
|