World Community Grid Forums
Thread Status: Active | Total posts in this thread: 3195
imakuni
Advanced Cruncher | Joined: Jun 11, 2009 | Post Count: 103
> Well, your auto-clicker scripts are just making things worse. The basic issue is not only the net bandwidth; there are also limits on the number of concurrent connections. And the latter will be flooded when everyone is issuing automated retry requests far too quickly/often... Ralf

I'm 99% sure an auto-clicker, with tasks actually finishing their downloads and then getting computed before the deadline, is less taxing on the server than having 29 tasks assigned, downloading upwards of a GB of various files without any one unit actually managing to get through, then having them automatically aborted because they took too long to begin. And I'm also 100% sure it's better to have that one user trying to make 2 connections every minute than 29 users trying to make 48 connections at a time.

The only thing I'm not sure of is whether failing a task by missing the deadline puts your computer on the list of "bad" hosts, forcing extra computers to work on the same task. If that's not the case, ignore the following; but if it is, that's even MORE reason to use an auto-clicker to make damn sure you can finish your tasks in time, avoiding further strain from extra tasks.

Oh, and on that note, the raw number of connections isn't as likely a problem as the lack of bandwidth is. Well, okay, it's also possible the server is CPU / RAM / whatever starved, but I don't think we can see that on our end... Regardless, tasks end up stuck in download not because too many people are trying to download, but because the server can't feed the data fast enough to the various hosts. If it could, it wouldn't be an issue; many of us have dozens of MB/s of download speed, yet tasks download at dozens of KB/s instead because the server can't get them out quickly.

Now sure, I'll admit that limiting based on network won't address the core problem (not enough "server" power). But with the resources they do have, this would provide a band-aid fix: network load too high means too many downloads stuck, so focus on finishing those before creating more tasks to assign to more users, who will just end up with more stuck tasks. Does that not sound reasonable? Do you have a better solution the devs can pull off on their end?

Like, look, I know you have no idea what you're talking about, but you're not going to convince me I'm incorrect in my assessment if you neither offer a better explanation nor a better suggestion of what to do (and why that would be better).

Want to have an image of yourself like this one? Check this thread: https://secure.worldcommunitygrid.org/forums/wcg/viewthread_thread,29840
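For anyone tempted to script this rather than click by hand, a gentler alternative to a GUI auto-clicker is a slow, rate-limited nudge through boinccmd. This is only a minimal sketch, not anything the project endorses: it assumes boinccmd is on your PATH and can reach the local client (you may need --host or --passwd for your setup), and the two-minute interval is an arbitrary choice.

```python
# Minimal sketch only -- not project-endorsed. Asks the local BOINC client to
# retry deferred network activity on a fixed, generous interval instead of
# hammering "Retry now" as fast as a GUI auto-clicker can click.
import subprocess
import time

RETRY_INTERVAL = 120  # seconds between nudges; keep this large to avoid flooding the server

while True:
    # boinccmd --network_available tells the client the network is usable,
    # which prompts it to retry deferred transfers.
    subprocess.run(["boinccmd", "--network_available"], check=False)
    time.sleep(RETRY_INTERVAL)
```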
Mike.Gibson
Ace Cruncher | England | Joined: Aug 23, 2007 | Post Count: 12349
imakuni
The problem is that we get these queues every time there is a restart of ARP. Auto-clicking or manual retry might give one person an advantage, but it pushes someone else back in the queue.

A much better idea would be to slow down the initial releases until the queue stabilises: limit any one machine to 2 units as a start and gradually increase that as the queue reduces. That way the units get downloaded and returned faster, no-one misses a deadline and the techs don't have to intervene to extend deadlines.

But we have been saying this ever since it went to Krembil.

Mike
imakuni
Advanced Cruncher | Joined: Jun 11, 2009 | Post Count: 103
> The problem is that we get these queues every time there is a restart of ARP. Auto-clicking or manual retry might give one person an advantage, but it pushes someone else back in the queue. A much better idea would be to slow down the initial releases until the queue stabilises: limit any one machine to 2 units as a start and gradually increase that as the queue reduces. That way the units get downloaded and returned faster, no-one misses a deadline and the techs don't have to intervene to extend deadlines. But we have been saying this ever since it went to Krembil.

Okay, this is so incredibly obvious to me that I don't understand how people can't see it... so I'll try a different approach. Let's say there are 100 tasks total; let's work with that for simplicity. You folks are claiming the following, so please explain to me, in detail, how:

- 50 machines, with 2 tasks each, trying to download 100 files at the same time, is better than
- ONE single machine with an auto-clicker and a "cheater" BOINC config trying to download 16 files at a time, where once those 16 connections are made, no new connections / requests are made until another slot goes vacant.

Please. I beg of you. Explain to me, logically, how 100 requests at the same time is better than 16. Spoilers: it isn't.

Y'all need to understand, limiting tasks per machine only makes the problem WORSE, not better. The more users out there, the worse the problem becomes: the more connections the server has to handle. The limit needs to be on the total number of tasks in flight, and that needs to be severely lowered. And ideally, those fewer tasks should be sent out to as few hosts as possible, for the fewer hosts, the BETTER.

EDIT: small mistake with the numbers. Dw, I'm dumb.

Want to have an image of yourself like this one? Check this thread: https://secure.worldcommunitygrid.org/forums/wcg/viewthread_thread,29840

[Edit 1 time, last edit by imakuni at Nov 4, 2024 11:40:55 PM]
alanb1951
Veteran Cruncher | Joined: Jan 20, 2006 | Post Count: 945
Regarding connections and collecting files for tasks...
As far as I can tell (on my Linux systems), a connection to WCG is only good for one file when things get congested (and possibly always, though when things are quieter it might not seem so obvious!) The connection probably isn't re-used if the request fails. I don't try for more than two connections (which seems to have been the default when I set my systems up), and I don't run an auto-clicker for network activity. I occasionally do a manual full retry (checking to make sure I'm not already fetching something at the time!) in order to try to reset the eventually excessive retry times the client sets.

I've just watched some of my systems take 14 hours to collect all the files they needed for a single ARP1 task because of the high number of large files, the relatively slow download rates and the repeated "transient HTTP errors" (or, worse, the "Project communication failed" cases); having been allocated two or three ARP1 tasks at once, the problem was very obvious! Note that if such errors crop up during a download rather than at its start, there is the possibility that a file will end up being reported as "wrong size" and the whole task is discarded :-(

It has been suggested in the past that "wrong size" errors seemed to be more common if a user had enabled a higher number of connections. This is consistent with the way the ftp/http library (libcurl?) handles problems related to lots of connection errors: the faster one collects connection errors, the sooner it has to take panic measures. One user (can't recall who it was offhand and don't currently have time to search for the references) reported solving their problems by cutting down to one connection at a time :-)

The "get lots of files at once" or "get lots of tasks at once" approaches might be fine when there isn't congestion, although they might starve out "smaller" users (and let's not get into what happens when a user's system acquires [far] more work than it can possibly process in time, even without delays...) By the way, their data centre had actually asked them to cut down the maximum number of connections allowed (reported a long time ago). But as others have pointed out, there's more to the issue than just the number of admissible connections anyway.

We just have to wait it out, hoping that the situation a few days further in isn't made worse by large numbers of retries due to late returners, and that we all manage to strike a balance between getting work and swamping capacity. [I know, I'm being over-optimistic :-)]

Cheers - Al.

P.S. I just checked in on one of my still-delayed systems and, as I did so, it was fetching the last two pending files, so it got another ARP1 (no stalled downloads...) -- that new task is a retry for an initial task that was "sent" at 08:18 UTC and failed because of a "wrong size" error after UTC midnight. So that user had a waste of time, and my system still can't get MCM1 work because there are stalled ARP downloads :-( :-)

P.P.S. It's been a long time since I coded anything using a CURL library, so someone else might be able to give better insights on that aspect :-)
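For readers unfamiliar with the retry behaviour Al describes, the pattern is roughly "wait longer after each transient failure". The sketch below is not BOINC's or libcurl's actual code, just an illustration of that general back-off pattern; the function name, URL and file name are placeholders.

```python
# Illustration of exponential back-off on transient HTTP errors (e.g. 503).
# Not BOINC's actual download code; URL and destination are placeholders.
import time
import urllib.error
import urllib.request

def fetch_with_backoff(url, dest, max_tries=6, base_delay=30):
    delay = base_delay
    for attempt in range(1, max_tries + 1):
        try:
            urllib.request.urlretrieve(url, dest)
            return True
        except (urllib.error.HTTPError, urllib.error.URLError) as err:
            print(f"attempt {attempt} failed ({err}); retrying in {delay}s")
            time.sleep(delay)
            delay *= 2  # each failure roughly doubles the wait, like the client's growing retry timers
    return False

# Example with placeholder names:
# fetch_with_backoff("https://example.org/ARP1_xxx.input", "ARP1_xxx.input")
```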
Maxxina
Advanced Cruncher | Joined: Jan 5, 2008 | Post Count: 124
Sigh, these speeds... When I started I got like 30 Kbps on download. Now I'm down to like 3-4 Kbps when downloading files, and 10 on upload. Managed to download and finish two jobs in 24 hours. This is AWFULLY BAD.
[Edit 1 time, last edit by Maxxina at Nov 5, 2024 4:58:59 AM]
catchercradle
Advanced Cruncher | Joined: Jan 16, 2009 | Post Count: 126
Just finished uploading a task. I have set the project to no new tasks until the current ones have finished downloading. I have 16 real cores and, on my settings, BOINC tried to fill them. One task is running, one is complete, and I'm just hoping another finishes downloading by the time the running one finishes! This is taking longer than when I ran CPDN tasks via dial-up!
Crystal Pellet
Veteran Cruncher | Joined: May 21, 2008 | Post Count: 1320
Some figures on the download of a single ARP1 task containing 11 files, with sizes from 1.74 K up to 48227.62 K, and the retry count per file:

ARP1_0018249_140_ARP1_0018249_140.input - 1.74 K - retries: 8

Most of the time one gets HTTP error 503: Service Unavailable. Even during a standing connection, the connection is broken by the server.
stfn
Cruncher | Joined: Jul 28, 2022 | Post Count: 1
I'm slowly moving forward with the ARP tasks, having already crunched and returned six of them. The downloads work best in the early morning my time, when the US is sleeping.

However, I have two tasks, ARP1_0021368_140 and ARP1_0019449_140, that have been stuck all day at "Waiting to run". They have been waiting even when there were no other tasks to crunch. I've changed the daily schedules, the allocated CPU percentage, and every other sensible option, but they do not want to start. Is anyone else having such problems?
spRocket
Senior Cruncher | Joined: Mar 25, 2020 | Post Count: 274
Check your profiles, folks. I noticed my "heavy hitters" (16- and 20-thread boxes) were always trying to run six ARP units, while everything else ran two. Not good when it takes ALL FREAKING NIGHT to download six of them, and it also impacts MCM, since the BOINC client always seems to have an easier time snagging the bigger files. The result is MCM units piling up in the download queue until the ARP units have finally been picked up.

Whoops. I just took a look and realized that I had my "Default" profile set to 6 ARP tasks and "home" set to two. Needless to say, everything is on "home" now, and I'm on the fence about setting that to 1 ARP until/unless the bottleneck gets resolved. This clearly can't go on as-is, clickers or not.
phytell
Cruncher | Joined: Sep 8, 2014 | Post Count: 33
Reducing the maximum number of file transfers seems to have had some moderate success for me, at least in terms of the amount of management required. Transfers still fail constantly, but with a maximum concurrent file transfer setting of 1 a steady trickle of data seems to be moving without my having to go in and clear the blocks as much.
I did turn off the project this morning though: 20 kb/sec is not enough to transfer the kind of data ARP requires, even if it never pauses. If you're sticking with it, try lowering the maximum number of file transfers in cc_config.xml. |
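For reference, the setting phytell mentions lives in cc_config.xml in the BOINC data directory. A minimal sketch of a config that caps transfers at one at a time, using the standard client options <max_file_xfers> and <max_file_xfers_per_project> (restart the client, or re-read the config files from the Manager, for it to take effect):

```xml
<!-- cc_config.xml -- minimal example: allow at most one concurrent file
     transfer overall and per project. Place in the BOINC data directory,
     then restart the client or re-read the config files from the Manager. -->
<cc_config>
  <options>
    <max_file_xfers>1</max_file_xfers>
    <max_file_xfers_per_project>1</max_file_xfers_per_project>
  </options>
</cc_config>
```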