World Community Grid - View Thread - 2022-09-15 Update (Networking & Workunits)

Thread Status: Active
Total posts in this thread: 214

This topic has been viewed 91606 times and has 213 replies

wildhagen
Veteran Cruncher
The Netherlands
Joined: Jun 5, 2009
Post Count: 845
Status: Offline
Project Badges:

2 year badge for Human Proteome Folding - Phase 2

180 day badge for Nutritious Rice for the World

1 year badge for Help Fight Childhood Cancer

14 day badge for Influenza Antiviral Drug Search

1 year badge for Help Cure Muscular Dystrophy - Phase 2

90 day badge for Discovering Dengue Drugs - Together - Phase 2

1 year badge for The Clean Energy Project - Phase 2

2 year badge for Computing for Clean Water

1 year badge for Drug Search for Leishmaniasis

1 year badge for GO Fight Against Malaria

14 day badge for Computing for Sustainable Water

200 year badge for Mapping Cancer Markers

1 year badge for Uncovering Genome Mysteries

10 year badge for Outsmart Ebola Together

5 year badge for FightAIDS@Home - Phase 2

10 year badge for Smash Childhood Cancer

20 year badge for Microbiome Immunity Project

10 year badge for Africa Rainfall Project

20 year badge for OpenPandemics - COVID-19


Re: 2022-09-15 Update (Networking & Workunits)

Seems to be better here too....

Only OPN1 work units seem to be available, I think? For the last few hours I haven't received any OPNG, ARP1 or MCM1 units at all.

But downloads (and uploads) of OPN1 are working OK here, no backoffs or retries.

[Oct 10, 2022 6:32:29 AM]

TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1951
Status: Offline
Project Badges:

10 year badge for Help Fight Childhood Cancer

2 year badge for Help Cure Muscular Dystrophy - Phase 2

5 year badge for The Clean Energy Project - Phase 2

2 year badge for Drug Search for Leishmaniasis

2 year badge for GO Fight Against Malaria

2 year badge for Computing for Sustainable Water

5 year badge for Uncovering Genome Mysteries

50 year badge for Outsmart Ebola Together

20 year badge for FightAIDS@Home - Phase 2

50 year badge for Smash Childhood Cancer

50 year badge for Microbiome Immunity Project

5 year badge for Africa Rainfall Project

100 year badge for OpenPandemics - COVID-19


Re: 2022-09-15 Update (Networking & Workunits)

Just ran some updates before going to sleep, and both of my GPU hosts each got about a dozen OPNG WUs, which will keep them busy for about an hour...

After the web site and forum went out again earlier this evening for some time, I wasn't very hopeful, but maybe, just maybe, we can get back to normal procedures.

Unfortunately though, still crickets from WCG/Krembil...

Ralf

----------------------------------------

[Oct 10, 2022 6:56:33 AM]

poppinfresh99
Cruncher
Joined: Feb 29, 2020
Post Count: 49
Status: Offline
Project Badges:

20 year badge for Mapping Cancer Markers

1 year badge for Microbiome Immunity Project

1 year badge for Africa Rainfall Project

5 year badge for OpenPandemics - COVID-19


Re: 2022-09-15 Update (Networking & Workunits)

Since you only need to download the large MCM1 once (as long as you have a steady supply of MCM1 work units), the download issues continue even when not downloading this file. Besides, the downloading of the OPN files should be entirely independent of the large MCM1 file. I would welcome any alternative explanation.
Cheers

One of us (or both) has misunderstandings because your points don't make any sense to me.

The download SERVER is overwhelmed. Some fraction of users like me, due to the inconsistent work from Krembil and due to us not storing much work, have downloaded mcm1.dataset-sarc1.txt many times since Krembil started. Since this file is *MUCH* larger than other files, the download server needs to do MUCH more work. Certainly whether or not Sgt. Joe is downloading a file doesn't much affect the download server.

Why should OPN not affect MCM? The download server (servers?) is at Krembil regardless. It *seems* to me that they do affect each other. I only run MCM and, when downloads are slow, I often see people saying that some OPN tasks just became available.

When a bunch of small files are waiting to download, the BOINC client applies a project backoff to protect the download server. Just because a BOINC client is sitting there waiting for a bunch of small files to download doesn't mean that the WCG download server is being burdened (though it might be, or perhaps the server that generates the small files is being burdened?). When the connection to the server is finally made, the small files download in an instant.
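
The backoff idea described above can be sketched roughly as follows. This is a generic randomized exponential backoff, not the BOINC client's actual algorithm; the `base` and `cap` values and the function name are assumptions chosen only for illustration.

```python
import random


def backoff_delays(attempts, base=60.0, cap=4 * 3600.0, seed=None):
    """Return one randomized wait time (in seconds) per failed attempt.

    The ceiling doubles with every failure until it reaches `cap`.
    Both `base` and `cap` are hypothetical values, not BOINC constants.
    """
    rng = random.Random(seed)
    delays = []
    for n in range(attempts):
        ceiling = min(cap, base * 2 ** n)
        # Randomize within the upper half of the window so that many
        # clients do not all retry at the same moment.
        delays.append(rng.uniform(ceiling / 2, ceiling))
    return delays


if __name__ == "__main__":
    for n, delay in enumerate(backoff_delays(8, seed=1), start=1):
        print(f"failure {n}: wait about {delay / 60:.1f} min")
```

While a client sits in such a wait it sends nothing, which is why a long list of pending small files on the client side does not by itself mean the server is busy.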

I agree with whoever said that we should stop focusing on symptoms. The way I see it, the inability to get a connection to the download server is the symptom. A hypothesis is that the partial cause is the download server being busy from repeatedly serving large files due to the inconsistent work.

[Oct 10, 2022 12:09:53 PM]

Just1vet
Cruncher
Joined: Nov 9, 2005
Post Count: 25
Status: Offline
Project Badges:

1 year badge for Human Proteome Folding - Phase 2

14 day badge for Help Cure Muscular Dystrophy

2 year badge for Help Fight Childhood Cancer

2 year badge for The Clean Energy Project - Phase 2

20 year badge for Outsmart Ebola Together

50 year badge for OpenPandemics - COVID-19


Re: 2022-09-15 Update (Networking & Workunits)

The whole weekend without a hiccup here. Did increase number of days storage to 4. That seemed to help. Another few days like that, and I'll restart the farm.

[Oct 10, 2022 12:11:30 PM]

mwroggenbuck
Advanced Cruncher
USA
Joined: Nov 1, 2006
Post Count: 77
Status: Offline
Project Badges:

14 day badge for Human Proteome Folding - Phase 2

14 day badge for Help Fight Childhood Cancer

14 day badge for Help Cure Muscular Dystrophy - Phase 2

14 day badge for Computing for Clean Water

14 day badge for Drug Search for Leishmaniasis

14 day badge for GO Fight Against Malaria

45 day badge for Outsmart Ebola Together

14 day badge for FightAIDS@Home - Phase 2

180 day badge for Smash Childhood Cancer

14 day badge for Microbiome Immunity Project

14 day badge for Africa Rainfall Project


Re: 2022-09-15 Update (Networking & Workunits)

The website is definitely faster. I also have had no download problems. Something must have changed. It would be nice to know if the WCG staff did do something significant. Maybe, just maybe, things are getting better...

[Oct 10, 2022 1:50:28 PM]

Kirel2
Advanced Cruncher
United States
Joined: Sep 24, 2014
Post Count: 99
Status: Offline
Project Badges:

45 day badge for The Clean Energy Project - Phase 2

45 day badge for Uncovering Genome Mysteries

90 day badge for Outsmart Ebola Together

90 day badge for FightAIDS@Home - Phase 2

180 day badge for Microbiome Immunity Project

45 day badge for Africa Rainfall Project

1 year badge for OpenPandemics - COVID-19


Re: 2022-09-15 Update (Networking & Workunits)

Yeah, website and down/uploads have been very snappy for the last 12 hours or so. Fingers crossed.

----------------------------------------

[Oct 10, 2022 2:05:27 PM]

TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1951
Status: Offline
Project Badges:


Re: 2022-09-15 Update (Networking & Workunits)

I agree with whoever said that we should stop focusing on symptoms. The way I see it, the inability to get a connection to the download server is the symptom. A hypothesis is that the partial cause is the download server being busy from repeatedly serving large files due to the inconsistent work.

You agree (with me in this case), yet you do exactly that: make assumptions based on symptoms. No, the connection issue is not just a symptom, as there is a clear error message that is being returned.
Even if an external CDN isn't working well with WCG, as pretty much all the content is "dynamic" (constantly changing, at least for all the WUs; no two requests result in the same file being sent), if that large MCM1 text file is the same over a large number of requests, this is something that the read cache of the underlying file system at the server level should take care of. And if the FS doesn't do it, a load-balancing proxy like HAProxy, which is apparently being used on the software end, should do some caching for cases like this.
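
As a rough illustration of the caching argument, the sketch below runs a small least-recently-used cache over a request stream in which one unchanging file is interleaved with one-off work unit files. It is a toy model under stated assumptions, not how the WCG servers, the file system cache, or HAProxy actually behave; the `wu_N` names and the cache capacity are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache that counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, name, load):
        if name in self.items:
            self.items.move_to_end(name)  # mark as most recently used
            self.hits += 1
            return self.items[name]
        self.misses += 1
        data = load(name)  # stands in for a slow read from disk
        self.items[name] = data
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used
        return data


def simulate(requests, capacity=4):
    """Return (hits, misses) for a sequence of requested file names."""
    cache = LRUCache(capacity)
    for name in requests:
        cache.get(name, load=lambda n: f"contents of {n}")
    return cache.hits, cache.misses


if __name__ == "__main__":
    requests = []
    for i in range(5):
        # The same large file, followed by a unique (hypothetical) WU file.
        requests += ["mcm1.dataset-sarc1.txt", f"wu_{i}"]
    print(simulate(requests))
```

In this model the repeatedly requested file is read from disk only once, while every unique file misses the cache, which is the distinction the paragraph above draws between the static dataset and the dynamic work units.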
Even when download issues are very bad, those large files, once they get a connection, download reliably. At least I have not seen any issue once they started to load. Yes, at times it seemed rather slow, but if there were WUs available, I have not seen that any of my hosts would run out of work because of that. And a lot of people that complain about those things the most are ones that seem to run with modified settings, which is something that makes testing on the server/project side so much more difficult.

I don't know why MCM1 was stopped over the weekend, if just the hopper ran empty, the project is taking a break or if the techs are testing specific issues on their end. THAT is the root of all problems. IMHO, it would be much more useful if WCG/Krembil would be MUCH more communicative, telling us what is going on, so that there would be less speculation based on perceived symptoms. Like it used to be in "the good old days". We don't really have insight into the monitoring data at Krembil's end, but only that is what can help to narrow down the issues. But then only when the techs would get qualified responses from the users at the far end, based on information provided at their end what they might have tweaked/adjusted.
But instead, the silence out of Toronto is deafening...

Ralf

----------------------------------------

[Oct 10, 2022 2:38:53 PM]

Paul Schlaffer
Senior Cruncher
USA
Joined: Jun 12, 2005
Post Count: 244
Status: Offline
Project Badges:

10 year badge for Human Proteome Folding - Phase 2

180 day badge for Discovering Dengue Drugs - Together

14 day badge for The Clean Energy Project

50 year badge for The Clean Energy Project - Phase 2

100 year badge for Mapping Cancer Markers

10 year badge for Uncovering Genome Mysteries

2 year badge for Outsmart Ebola Together

2 year badge for FightAIDS@Home - Phase 2

2 year badge for Africa Rainfall Project


Re: 2022-09-15 Update (Networking & Workunits)

...
I don't know why MCM1 was stopped over the weekend, if just the hopper ran empty, the project is taking a break or if the techs are testing specific issues on their end. THAT is the root of all problems. IMHO, it would be much more useful if WCG/Krembil would be MUCH more communicative, telling us what is going on, so that there would be less speculation based on perceived symptoms. Like it used to be in "the good old days". We don't really have insight into the monitoring data at Krembil's end, but only that is what can help to narrow down the issues. But then only when the techs would get qualified responses from the users at the far end, based on information provided at their end what they might have tweaked/adjusted.
But instead, the silence out of Toronto is deafening...

Ralf

Agreed. That was very well stated.
We don't know whether the change we see on our end is from the change in work-units (now mostly OpenPandemics CPU), perhaps participation drop off, or something else.
Some updates from the lab on what they are doing (or not) regarding the issue would be much appreciated.

----------------------------------------

“Where an excess of power prevails, property of no sort is duly respected. No man is safe in his opinions, his person, his faculties, or his possessions.” – James Madison (1792)

----------------------------------------
[Edit 1 times, last edit by Paul Schlaffer at Oct 10, 2022 10:22:07 PM]

[Oct 10, 2022 3:34:46 PM]

Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7675
Status: Offline
Project Badges:

2 year badge for Discovering Dengue Drugs - Together

2 year badge for Nutritious Rice for the World

90 day badge for Influenza Antiviral Drug Search

45 day badge for Discovering Dengue Drugs - Together - Phase 2

5 year badge for Drug Search for Leishmaniasis

5 year badge for GO Fight Against Malaria

10 year badge for FightAIDS@Home - Phase 2

100 year badge for Smash Childhood Cancer

10 year badge for Microbiome Immunity Project


Re: 2022-09-15 Update (Networking & Workunits)

The download SERVER is overwhelmed.

This is assuming there is only one download server. Since there has been mention of a load balancer, this implies the existence of 2 or more servers for downloads. It is not unreasonable to suspect the partitioning of the load among multiple servers to be segmented by project, especially for the projects which require the greatest number of downloads. If they are not partitioned by project, the load would fall on the load balancer to properly apportion the available work units to meet the set proportion of work units for each project.
However, not being privy to the topology of the setup for disbursing work units to the volunteers, this is all mere speculation on my part. I do recall an old post by Uplinger which did detail the percentage of resource allocation for each project which was set into the system, which they would occasionally tweak at the request of the researchers. They would try to provide to the researchers enough results to satisfy their needs without overwhelming the resources they had to store and process the results.
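
To make the idea of a set proportion of work units per project concrete, here is a minimal sketch of weighted selection. The share values below are invented for illustration and are not WCG's actual allocation, and nothing here reflects how the real scheduler or load balancer is implemented.

```python
import random
from collections import Counter


def pick_project(shares, rng):
    """Pick one project name with probability proportional to its share."""
    names = list(shares)
    weights = [shares[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def apportion(shares, n, seed=None):
    """Hand out n work units and count how many each project received."""
    rng = random.Random(seed)
    return Counter(pick_project(shares, rng) for _ in range(n))


if __name__ == "__main__":
    # Hypothetical shares, in percent; not taken from WCG.
    shares = {"OPN1": 50, "MCM1": 30, "ARP1": 20}
    counts = apportion(shares, 10_000, seed=42)
    for name in shares:
        print(f"{name}: {counts[name] / 10_000:.1%}")
```

Tweaking a project's share, as described above, would then only mean changing one number in the table rather than moving work between servers.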
Cheers

----------------------------------------

Sgt. Joe
*Minnesota Crunchers*

[Oct 10, 2022 4:09:17 PM]

Link64
Advanced Cruncher
Joined: Feb 19, 2021
Post Count: 129
Status: Offline
Project Badges:

14 day badge for OpenPandemics - COVID-19


Re: 2022-09-15 Update (Networking & Workunits)

No, we should not need to do this. And this never has been a problem before (the move)...

Perhaps it wasn't an issue for IBM, but there are still lots of people with slow connections and/or limited traffic. Not me, I was just doing it "for fun". It's just pretty stupid to make people download the same file again and again when BOINC offers the possibility to keep it in the project folder for future use.
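
The general idea of reusing a file that is already on disk can be sketched as a check made before any download. This is not BOINC's actual mechanism; the function name, the use of SHA-256, and the assumption that the server publishes a checksum for the file are all hypothetical.

```python
import hashlib
from pathlib import Path


def needs_download(path, expected_sha256):
    """Return True only if the local copy is missing or does not match.

    `expected_sha256` is assumed to be a checksum published by the
    server; that is an illustrative assumption, not a documented API.
    """
    local = Path(path)
    if not local.is_file():
        return True
    digest = hashlib.sha256(local.read_bytes()).hexdigest()
    return digest != expected_sha256


if __name__ == "__main__":
    # Hypothetical local path and placeholder checksum.
    if needs_download("mcm1.dataset-sarc1.txt", "0" * 64):
        print("local copy missing or stale: download required")
    else:
        print("local copy is current: skip the download")
```

With a check like this, a large file that has not changed costs the server nothing after the first download, which matters most for people on slow or metered connections.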

----------------------------------------

[Oct 10, 2022 4:27:00 PM]