World Community Grid Forums
Thread Status: Active | Total posts in this thread: 3195
imakuni
Advanced Cruncher | Joined: Jun 11, 2009 | Post Count: 103
> Well, your auto-clicker scripts are just making things worse. The basic issue is not only the net bandwidth; there are also limits on the number of concurrent connections. And the latter will be flooded when everyone is issuing automated retry requests far too quickly/often... Ralf

I'm 99% sure an auto-clicker, with tasks actually finishing their downloads and then getting computed before the deadline, is less taxing on the server than having 29 tasks assigned, downloading upwards of a GB of various files without any one unit actually managing to get through, then having them automatically aborted because they took too long to begin. And I'm also 100% sure it's better to have that one user trying to make 2 connections every minute than 29 users trying to make 48 connections at a time.

The only thing I'm not sure of is whether failing a task by missing the deadline puts your computer on the list of "bad" hosts, forcing extra computers to work on the same task. If that's not the case, ignore the following; but if it is, that's even MORE reason to use an auto-clicker to make damn sure you can finish your tasks in time, avoiding further strain from extra tasks.

Oh, and on that note, the raw number of connections isn't as likely a problem as the lack of bandwidth is. Well, okay, it's also possible the server is CPU / RAM / whatever starved, but I don't think we can see that on our end... Regardless, tasks end up stuck in download not because too many people are trying to download, but because the server can't feed the data fast enough to the various hosts. If it could, it wouldn't be an issue; many of us have dozens of MB/s of download speed, yet tasks download at dozens of KB/s instead because the server can't get them out quickly.

Now sure, I'll admit that limiting based on network won't address the core problem (not enough "server" power). But with the resources they do have, this would provide a band-aid fix: network load too high means too many downloads stuck, so focus on finishing those before creating more tasks to assign to more users, who will just end up with more stuck tasks. Does that not sound reasonable? Do you have a better solution the devs can pull off on their end?

Like, look, I know you have no idea what you're talking about, but you're not going to convince me I'm incorrect in my assessment if you neither offer a better explanation nor a better suggestion of what to do (and why that would be better).

Want to have an image of yourself like this one? Check this thread: https://secure.worldcommunitygrid.org/forums/wcg/viewthread_thread,29840
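For anyone tempted to script this rather than click by hand, a gentler alternative to a GUI auto-clicker is a slow, rate-limited nudge through boinccmd. This is only a minimal sketch, not anything the project endorses: it assumes boinccmd is on your PATH and can reach the local client (you may need --host or --passwd for your setup), and the two-minute interval is an arbitrary choice.

```python
# Minimal sketch only -- not project-endorsed. Asks the local BOINC client to
# retry deferred network activity on a fixed, generous interval instead of
# hammering "Retry now" as fast as a GUI auto-clicker can click.
import subprocess
import time

RETRY_INTERVAL = 120  # seconds between nudges; keep this large to avoid flooding the server

while True:
    # boinccmd --network_available tells the client the network is usable,
    # which prompts it to retry deferred transfers.
    subprocess.run(["boinccmd", "--network_available"], check=False)
    time.sleep(RETRY_INTERVAL)
```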
Mike.Gibson
Ace Cruncher | England | Joined: Aug 23, 2007 | Post Count: 12349
imakuni
The problem is that we get these queues every time there is a restart of ARP. Auto-clicking or manual retry might give one person an advantage, but it pushes someone else back in the queue.

A much better idea would be to slow down the initial releases until the queue stabilises: limit any one machine to 2 units as a start and gradually increase that as the queue reduces. That way the units get downloaded and returned faster, no-one misses a deadline and the techs don't have to intervene to extend deadlines.

But we have been saying this ever since it went to Krembil.

Mike
imakuni
Advanced Cruncher | Joined: Jun 11, 2009 | Post Count: 103
> The problem is that we get these queues every time there is a restart of ARP. Auto-clicking or manual retry might give one person an advantage, but it pushes someone else back in the queue. A much better idea would be to slow down the initial releases until the queue stabilises: limit any one machine to 2 units as a start and gradually increase that as the queue reduces. That way the units get downloaded and returned faster, no-one misses a deadline and the techs don't have to intervene to extend deadlines. But we have been saying this ever since it went to Krembil.

Okay, this is so incredibly obvious to me that I don't understand how people can't see it... so I'll try a different approach. Let's say there are 100 tasks total; let's work with that for simplicity. You folks are claiming the following, so please explain to me, in detail, how:

- 50 machines, with 2 tasks each, trying to download 100 files at the same time, is better than
- ONE single machine with an auto-clicker and a "cheater" BOINC config trying to download 16 files at a time, where once those 16 connections are made, no new connections / requests are made until another slot goes vacant.

Please. I beg of you. Explain to me, logically, how 100 requests at the same time is better than 16. Spoilers: it isn't.

Y'all need to understand, limiting tasks per machine only makes the problem WORSE, not better. The more users out there, the worse the problem becomes: the more connections the server has to handle. The limit needs to be on the total number of tasks in flight, and that needs to be severely lowered. And ideally, those fewer tasks should be sent out to as few hosts as possible, for the fewer hosts, the BETTER.

EDIT: small mistake with the numbers. Dw, I'm dumb.

Want to have an image of yourself like this one? Check this thread: https://secure.worldcommunitygrid.org/forums/wcg/viewthread_thread,29840

[Edit 1 time, last edit by imakuni at Nov 4, 2024 11:40:55 PM]
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 945
Regarding connections and collecting files for tasks...
As far as I can tell (on my Linux systems), a connection to WCG is only good for one file when things get congested (and possibly always, though when things are quieter it might not seem so obvious!) The connection probably isn't re-used if the request fails. I don't try for more than two connections (which seems to have been the default when I set my systems up), and I don't run an auto-clicker for network activity. I occasionally do a manual full retry (checking to make sure I'm not already fetching something at the time!) in order to try to reset the eventually excessive retry times the client sets.

I've just watched some of my systems take 14 hours to collect all the files they needed for a single ARP1 task because of the high number of large files, the relatively slow download rates and the repeated "transient HTTP errors" (or, worse, the "Project communication failed" cases); having been allocated two or three ARP1 tasks at once, the problem was very obvious! Note that if such errors crop up during a download rather than at its start, there is the possibility that a file will end up being reported as "wrong size" and the whole task is discarded :-(

It has been suggested in the past that "wrong size" errors seemed to be more common if a user had enabled a higher number of connections. This is consistent with the way the ftp/http library (libcurl?) handles problems related to lots of connection errors: the faster one collects connection errors, the sooner it has to take panic measures. One user (can't recall who it was offhand and don't currently have time to search for the references) reported solving their problems by cutting down to one connection at a time :-)

The "get lots of files at once" or "get lots of tasks at once" approaches might be fine when there isn't congestion, although they might starve out "smaller" users (and let's not get into what happens when a user's system acquires [far] more work than it can possibly process in time, even without delays...) By the way, their data centre had actually asked them to cut down the maximum number of connections allowed (reported a long time ago). But as others have pointed out, there's more to the issue than just the number of admissible connections anyway.

We just have to wait it out, hoping that the situation a few days further in isn't made worse by large numbers of retries due to late returners, and that we all manage to strike a balance between getting work and swamping capacity. [I know, I'm being over-optimistic :-)]

Cheers - Al.

P.S. I just checked in on one of my still-delayed systems and, as I did so, it was fetching the last two pending files, so it got another ARP1 (no stalled downloads...) -- that new task is a retry for an initial task that was "sent" at 08:18 UTC and failed because of a "wrong size" error after UTC midnight. So that user had a waste of time, and my system still can't get MCM1 work because there are stalled ARP downloads :-( :-)

P.P.S. It's been a long time since I coded anything using a CURL library, so someone else might be able to give better insights on that aspect :-)
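For readers unfamiliar with the retry behaviour Al describes, the pattern is roughly "wait longer after each transient failure". The sketch below is not BOINC's or libcurl's actual code, just an illustration of that general back-off pattern; the function name, URL and file name are placeholders.

```python
# Illustration of exponential back-off on transient HTTP errors (e.g. 503).
# Not BOINC's actual download code; URL and destination are placeholders.
import time
import urllib.error
import urllib.request

def fetch_with_backoff(url, dest, max_tries=6, base_delay=30):
    delay = base_delay
    for attempt in range(1, max_tries + 1):
        try:
            urllib.request.urlretrieve(url, dest)
            return True
        except (urllib.error.HTTPError, urllib.error.URLError) as err:
            print(f"attempt {attempt} failed ({err}); retrying in {delay}s")
            time.sleep(delay)
            delay *= 2  # each failure roughly doubles the wait, like the client's growing retry timers
    return False

# Example with placeholder names:
# fetch_with_backoff("https://example.org/ARP1_xxx.input", "ARP1_xxx.input")
```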
Maxxina
Advanced Cruncher | Joined: Jan 5, 2008 | Post Count: 124
Sigh, these speeds... When I started I got like 30 Kbps on download. Now I'm down to like 3-4 Kbps when downloading files, and 10 on upload. Managed to download and finish two jobs in 24 hours. This is AWFULLY BAD.
[Edit 1 time, last edit by Maxxina at Nov 5, 2024 4:58:59 AM]
catchercradle
Advanced Cruncher | Joined: Jan 16, 2009 | Post Count: 126
Just finished uploading a task. I have set the project to no new tasks until the current ones have finished downloading. I have 16 real cores and, on my settings, BOINC tried to fill them. One task is running, one is complete, and I'm just hoping another finishes downloading by the time the running one finishes! This is taking longer than when I ran CPDN tasks via dial-up!
Crystal Pellet
Veteran Cruncher | Joined: May 21, 2008 | Post Count: 1320
Some figures on the download of a single ARP1 task containing 11 files, with sizes from 1.74 K up to 48227.62 K, and the retry count per file:

ARP1_0018249_140_ARP1_0018249_140.input - 1.74 K - retries: 8

Most of the time one gets HTTP error 503: Service Unavailable. Even during a standing connection, the connection is broken by the server.
stfn
Cruncher | Joined: Jul 28, 2022 | Post Count: 1
I'm slowly moving forward with the ARP tasks, having already crunched and returned six of them. The downloads work best in the early morning my time, when the US is sleeping.

However, I have two tasks, ARP1_0021368_140 and ARP1_0019449_140, that have been stuck all day at "Waiting to run". They have been waiting even when there were no other tasks to crunch. I've changed the daily schedules, the allocated CPU percentage, and every other sensible option, but they do not want to start. Is anyone else having such problems?
spRocket
Senior Cruncher | Joined: Mar 25, 2020 | Post Count: 274
Check your profiles, folks. I noticed my "heavy hitters" (16- and 20-thread boxes) were always trying to run six ARP units, while everything else ran two. Not good when it takes ALL FREAKING NIGHT to download six of them, and it also impacts MCM, since the BOINC client always seems to have an easier time snagging the bigger files. The result is MCM units piling up in the download queue until the ARP units have finally been picked up.

Whoops. I just took a look and realized that I had my "Default" profile set to 6 ARP tasks and "home" set to two. Needless to say, everything is on "home" now, and I'm on the fence about setting that to 1 ARP until/unless the bottleneck gets resolved. This clearly can't go on as-is, clickers or not.
phytell
Cruncher | Joined: Sep 8, 2014 | Post Count: 33
Reducing the maximum number of file transfers seems to have had some moderate success for me, at least in terms of the amount of management required. Transfers still fail constantly, but with a maximum concurrent file transfer setting of 1 a steady trickle of data seems to be moving without my having to go in and clear the blocks as much.
I did turn off the project this morning though: 20 kb/sec is not enough to transfer the kind of data ARP requires, even if it never pauses. If you're sticking with it, try lowering the maximum number of file transfers in cc_config.xml. |
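For reference, the setting phytell mentions lives in cc_config.xml in the BOINC data directory. A minimal sketch of a config that caps transfers at one at a time, using the standard client options <max_file_xfers> and <max_file_xfers_per_project> (restart the client, or re-read the config files from the Manager, for it to take effect):

```xml
<!-- cc_config.xml -- minimal example: allow at most one concurrent file
     transfer overall and per project. Place in the BOINC data directory,
     then restart the client or re-read the config files from the Manager. -->
<cc_config>
  <options>
    <max_file_xfers>1</max_file_xfers>
    <max_file_xfers_per_project>1</max_file_xfers_per_project>
  </options>
</cc_config>
```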