Thread Status: Active
Total posts in this thread: 214
This topic has been viewed 91606 times and has 213 replies
wildhagen
Veteran Cruncher
The Netherlands
Joined: Jun 5, 2009
Post Count: 845
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

Seems to be better here too....

Only OPN1 work units available, I think? For the last few hours I haven't received any OPNG, ARP1 or MCM1 units at all.

But downloads (and uploads) of OPN1 are working OK here, no backoffs or retries.
[Oct 10, 2022 6:32:29 AM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1951
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

Seems to be better here too....

Only OPN1 work units available, I think? For the last few hours I haven't received any OPNG, ARP1 or MCM1 units at all.

But downloads (and uploads) of OPN1 are working OK here, no backoffs or retries.
Just ran some updates before going to sleep and both of my GPU hosts each got about a dozen OPNG WUs, which will keep them busy for about an hour...

After the web site and forum went down again earlier this evening for some time, I wasn't very hopeful, but maybe, just maybe, we can get back to normal procedures.

Unfortunately though, still crickets from WCG/Krembil...

Ralf
[Oct 10, 2022 6:56:33 AM]
poppinfresh99
Cruncher
Joined: Feb 29, 2020
Post Count: 49
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

Since you only need to download the large MCM1 once (as long as you have a steady supply of MCM1 work units), the download issues continue even when not downloading this file. Besides, the downloading of the OPN files should be entirely independent of the large MCM1 file. I would welcome any alternative explanation.
Cheers


One of us (or both) has misunderstandings because your points don't make any sense to me.

The download SERVER is overwhelmed. Some fraction of users like me, due to the inconsistent work from Krembil and due to us not storing much work, have downloaded mcm1.dataset-sarc1.txt many times since Krembil started. Since this file is *MUCH* larger than other files, the download server needs to do MUCH more work. Certainly whether or not Sgt. Joe is downloading a file doesn't much affect the download server.

Why should OPN not affect MCM? The download server (servers?) is at Krembil regardless. It *seems* to me that they do affect each other. I only run MCM and, when downloads are slow, I often see people saying that some OPN tasks just became available.

When a bunch of small files are waiting to download, there is a project backoff done by the BOINC client to protect the download server. Just because a BOINC client is sitting there waiting for a bunch of small files to download doesn't mean that the WCG download server is being burdened (though it might be, or perhaps the server that generates the small files is being burdened?). When the connection to the server is finally made, the small files download in an instant.
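To illustrate the general idea of a client-side backoff, here is a minimal sketch of capped exponential backoff with random jitter. It is not the BOINC client's actual algorithm: the function names and the `base` and `cap` values are assumptions chosen only for illustration.

```python
import random


def backoff_ceilings(retries, base=60.0, cap=3600.0):
    """Upper bound (seconds) on the wait before each retry.

    The bound doubles after every failed attempt and is capped at `cap`.
    `base` and `cap` are hypothetical values, not BOINC defaults.
    """
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def backoff_delays(retries, base=60.0, cap=3600.0, rng=None):
    """Actual waits: a random value between 0 and each ceiling.

    The randomness spreads clients out so they do not all retry at once.
    """
    rng = rng or random.Random()
    return [rng.uniform(0.0, c) for c in backoff_ceilings(retries, base, cap)]


if __name__ == "__main__":
    for ceiling, delay in zip(backoff_ceilings(8), backoff_delays(8)):
        print(f"wait up to {ceiling:6.0f}s, this time {delay:7.1f}s")
```

While a client is sitting in one of these waits it sends nothing at all, which is why a long queue of pending small files on the client side says little about the load on the server.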

I agree with whoever said that we should stop focusing on symptoms. The way I see it, the inability to get a connection to the download server is the symptom. A hypothesis is that the partial cause is the download server being busy from repeatedly serving large files due to the inconsistent work.
[Oct 10, 2022 12:09:53 PM]
Just1vet
Cruncher
Joined: Nov 9, 2005
Post Count: 25
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

The whole weekend without a hiccup here. I did increase the number of days of stored work to 4. That seemed to help. Another few days like that, and I'll restart the farm.
[Oct 10, 2022 12:11:30 PM]
mwroggenbuck
Advanced Cruncher
USA
Joined: Nov 1, 2006
Post Count: 77
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

The website is definitely faster. I also have had no download problems. Something must have changed. It would be nice to know whether the WCG staff did something significant. Maybe, just maybe, things are getting better...
[Oct 10, 2022 1:50:28 PM]
Kirel2
Advanced Cruncher
United States
Joined: Sep 24, 2014
Post Count: 99
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

Yeah, website and down/uploads have been very snappy for the last 12 hours or so. Fingers crossed.
[Oct 10, 2022 2:05:27 PM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1951
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

I agree with whoever said that we should stop focusing on symptoms. The way I see it, the inability to get a connection to the download server is the symptom. A hypothesis is that the partial cause is the download server being busy from repeatedly serving large files due to the inconsistent work.
You agree (with me in this case), yet you do exactly that: make assumptions based on symptoms. No, the connection issue is not just a symptom, as there is a clear error message that is being returned.
Even if an external CDN wouldn't work well with WCG, as pretty much all the content is "dynamic" (constantly changing, at least for all the WUs; no two requests result in the same file being sent), if that large MCM1 text file is the same over a large number of requests, this is something that the read cache of the underlying file system at the server level should take care of. And if the FS doesn't do it, a load balancing proxy like HAProxy, which is apparently being used on the software end, should do some caching for cases like this.
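As a rough illustration of the caching argument, here is a minimal sketch of a read cache in front of a slow backing store. It assumes nothing about how the WCG servers, their file system, or HAProxy are actually configured; `CachingFileServer` and `fake_disk` are hypothetical names used only for this example.

```python
class CachingFileServer:
    """Serve files, reading each distinct file from the backing store only once."""

    def __init__(self, read_from_disk):
        self._read_from_disk = read_from_disk  # callable: file name -> bytes
        self._cache = {}
        self.disk_reads = 0  # counts how often the slow path was taken

    def get(self, name):
        if name not in self._cache:
            # Cache miss: pay the cost of the real read once.
            self.disk_reads += 1
            self._cache[name] = self._read_from_disk(name)
        return self._cache[name]


def fake_disk(name):
    # Stand-in for an expensive read of a large file (hypothetical).
    return ("contents of " + name).encode()


server = CachingFileServer(fake_disk)
for _ in range(1000):
    data = server.get("mcm1.dataset-sarc1.txt")
print(server.disk_reads)  # prints 1
```

A thousand requests for the same unchanging file cost one real read; the per-request WU files, which differ every time, would get no benefit from a cache like this.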
Even when download issues are very bad, those large files, once they get a connection, download reliably. At least I have not seen any issue once they started to load. Yes, at times it seemed rather slow, but if there were WUs available, I have not seen that any of my hosts would run out of work because of that. And a lot of people that complain about those things the most are ones that seem to run with modified settings, which is something that makes testing on the server/project side so much more difficult.

I don't know why MCM1 was stopped over the weekend, whether the hopper just ran empty, the project is taking a break, or the techs are testing specific issues on their end. THAT is the root of all problems. IMHO, it would be much more useful if WCG/Krembil would be MUCH more communicative, telling us what is going on, so that there would be less speculation based on perceived symptoms. Like it used to be in "the good old days". We don't really have insight into the monitoring data at Krembil's end, but only that can help to narrow down the issues. And even then, only if the techs get qualified responses from the users at the far end, based on information provided at their end about what they might have tweaked/adjusted.
But instead, the silence out of Toronto is deafening...

Ralf
[Oct 10, 2022 2:38:53 PM]
Paul Schlaffer
Senior Cruncher
USA
Joined: Jun 12, 2005
Post Count: 244
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)


...
I don't know why MCM1 was stopped over the weekend, whether the hopper just ran empty, the project is taking a break, or the techs are testing specific issues on their end. THAT is the root of all problems. IMHO, it would be much more useful if WCG/Krembil would be MUCH more communicative, telling us what is going on, so that there would be less speculation based on perceived symptoms. Like it used to be in "the good old days". We don't really have insight into the monitoring data at Krembil's end, but only that can help to narrow down the issues. And even then, only if the techs get qualified responses from the users at the far end, based on information provided at their end about what they might have tweaked/adjusted.
But instead, the silence out of Toronto is deafening...

Ralf


Agreed. That was very well stated.
We don't know whether the change we see on our end is from the change in work-units (now mostly OpenPandemics CPU), perhaps participation drop off, or something else.
Some updates from the lab on what they are doing (or not) regarding the issue would be much appreciated.
----------------------------------------

“Where an excess of power prevails, property of no sort is duly respected. No man is safe in his opinions, his person, his faculties, or his possessions.” – James Madison (1792)
----------------------------------------
[Edit 1 times, last edit by Paul Schlaffer at Oct 10, 2022 10:22:07 PM]
[Oct 10, 2022 3:34:46 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7675
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

The download SERVER is overwhelmed.

This is assuming there is only one download server. Since there has been mention of a load balancer, this implies the existence of 2 or more servers for downloads. It is not unreasonable to suspect the partitioning of the load among multiple servers to be segmented by project, especially for the projects which require the greatest number of downloads. If they are not partitioned by project, the load would fall on the load balancer to properly apportion the available work units to meet the set proportion of work units for each project.
However, not being privy to the topology of the setup for disbursing work units to the volunteers, this is all mere speculation on my part. I do recall an old post by Uplinger which did detail the percentage of resource allocation for each project which was set into the system, which they would occasionally tweak at the request of the researchers. They would try to provide to the researchers enough results to satisfy their needs without overwhelming the resources they had to store and process the results.
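For what a per-project resource allocation could look like in principle, here is a minimal sketch that splits a batch of work units across projects according to set percentages. The share values are made up for illustration and are not WCG's real settings, and `apportion` is a hypothetical helper, not anything from the WCG or BOINC server code.

```python
def apportion(total, shares):
    """Split `total` work units across projects in proportion to `shares`.

    Uses the largest-remainder method so the counts always add up to `total`.
    """
    weight_sum = sum(shares.values())
    exact = {p: total * w / weight_sum for p, w in shares.items()}
    counts = {p: int(x) for p, x in exact.items()}  # round down first
    leftover = total - sum(counts.values())
    # Hand the remaining units to the projects with the largest fractions.
    by_remainder = sorted(shares, key=lambda p: exact[p] - counts[p], reverse=True)
    for project in by_remainder[:leftover]:
        counts[project] += 1
    return counts


# Hypothetical shares, for illustration only.
example_shares = {"OPN1": 50, "MCM1": 30, "ARP1": 15, "OPNG": 5}
print(apportion(1000, example_shares))
```

Changing one number in the share table is then all it takes to "tweak" the mix at a researcher's request, whichever component ends up enforcing it.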
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Oct 10, 2022 4:09:17 PM]
Link64
Advanced Cruncher
Joined: Feb 19, 2021
Post Count: 129
Status: Offline
Re: 2022-09-15 Update (Networking & Workunits)

No, we should not need to do this. And this never has been a problem before (the move)...

Perhaps it wasn't an issue for IBM, but there are still lots of people with slow connections and/or without unlimited traffic. Not me, I was just doing it "for fun". It's just pretty stupid to let people download the same file again and again when BOINC offers the possibility to keep it in the project folder for future use.
[Oct 10, 2022 4:27:00 PM]