Thread Status: Active
Total posts in this thread: 51
This topic has been viewed 388745 times and has 50 replies
Greg_BE
Advanced Cruncher
Joined: May 9, 2016
Post Count: 82
Re: 2023-04-06 Update (WU Distribution Update)

4/8/2023 10:56:39 PM | World Community Grid | Temporarily failed upload of OPN1_0129871_01847_0_r1837762973_0: transient HTTP error


Are you kidding me...what the blank now?!??!
[Apr 8, 2023 9:58:09 PM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7668
Re: 2023-04-06 Update (WU Distribution Update)

Maybe it is just here, but my Windows system is getting the transient errors, though eventually they do go through. The Linux systems appear to upload much more quickly with fewer retries. I have not had to press the retry button on them and they appear, at least for the time being, to be uploading more or less properly. It doesn't really make sense.
Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Apr 8, 2023 11:35:18 PM]
hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 802
Re: 2023-04-06 Update (WU Distribution Update)

My one box (Linux) fails all re-tries, and I'm also unable to get new work units in both Linux and Windows. I'm glad I set a 2 day cache on the Linux box when WCG restarted. I may increase that to 3 days until confidence is restored in WCG availability.

I'm impressed the validator seems to have caught up Friday-Saturday. That's great!

Not impressed at the transient HTTP errors again even on the new flash storage array. Are we out of disk space, or is this just a stuck process?

A self-healing BOINC architecture would be really nice, if such a manual process is simple enough that it can be automated.
----------------------------------------
  • i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
  • i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
  • i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

[Apr 9, 2023 2:49:10 AM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1951
Re: 2023-04-06 Update (WU Distribution Update)

Maybe it is just here, but my Windows system is getting the transient errors, though eventually they do go through. The Linux systems appear to upload much more quickly with fewer retries. I have not had to press the retry button on them and they appear, at least for the time being, to be uploading more or less properly. It doesn't really make sense.
Cheers
Well, on those machines that I can reach, I have to retry maybe up to 20-30 times...

So it looks like fall all over again, so here's something to pass the time while waiting for the uploads to go through....

https://www.youtube.com/watch?v=VZXylmSZf4Q


Ralf :(
----------------------------------------

[Apr 9, 2023 5:10:22 AM]
nivrip
Senior Cruncher
North Yorkshire
Joined: Sep 13, 2007
Post Count: 264
Re: 2023-04-06 Update (WU Distribution Update)

Well, I'm the lucky one. I'm having no problems at all with either downloading or uploading on both OPN (and one OPNG) and MCM WUs.

All seems to be well. (I shouldn't have said that.)



I definitely shouldn't have said it !!!!!!!

Almost immediately, I started getting upload errors. Sod's Law in all its glory.
----------------------------------------
ЮРКШИР КРУНЧЕР
[Apr 9, 2023 8:28:05 AM]
cuphi
Cruncher
Joined: Aug 8, 2021
Post Count: 7
Re: 2023-04-06 Update (WU Distribution Update)

Not only am I having a lot of trouble uploading results, I have now stopped getting new WUs.
[Apr 9, 2023 11:33:34 AM]
Jesse Viviano
Cruncher
United States of America
Joined: Dec 14, 2007
Post Count: 15
Re: 2023-04-06 Update (WU Distribution Update)

My one box (Linux) fails all re-tries, and I'm also unable to get new work units in both Linux and Windows. I'm glad I set a 2 day cache on the Linux box when WCG restarted. I may increase that to 3 days until confidence is restored in WCG availability.

I'm impressed the validator seems to have caught up Friday-Saturday. That's great!

Not impressed at the transient HTTP errors again even on the new flash storage array. Are we out of disk space, or is this just a stuck process?

A self-healing BOINC architecture would be really nice, if such a manual process is simple enough that it can be automated.

I am pretty sure that WCG is not out of disk space. BOINC upload servers on other projects have reported that they are out of disk space when they are unable to find any free inodes or disk space.

Einstein@home once suffered a situation where its storage ran out of unused inodes (file system structures used to describe files in Unix-style operating systems) and started taking too long to find used-but-now-free inodes to take uploads with. During that time, it often reported out-of-disk-space errors to BOINC clients. The situation was solved when Einstein@home moved to storage formatted with a later version of its file system that includes a dynamically built B-tree of freed inodes, so that inodes that once belonged to deleted files can be quickly recycled for new files. This is how I know that BOINC upload servers will report out-of-space errors to their clients if the storage cannot accept an upload.
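
To make the distinction between running out of bytes and running out of inodes concrete, here is a minimal Python sketch for a Unix-style system. It only reads the figures the operating system reports through `os.statvfs`; the function names `storage_headroom` and `can_accept_upload` are hypothetical and say nothing about how the WCG or Einstein@home upload servers actually check their storage.

```python
import os

def storage_headroom(path="."):
    """Return free-space and free-inode figures for the file system holding `path`."""
    st = os.statvfs(path)  # Unix-only standard-library call
    return {
        "free_bytes": st.f_bavail * st.f_frsize,  # space available to unprivileged users
        "total_inodes": st.f_files,               # can be 0 on file systems without a fixed inode table
        "free_inodes": st.f_favail,               # inodes available to unprivileged users
    }

def can_accept_upload(path, size_bytes):
    """Hypothetical check: a new file needs enough bytes AND one free inode."""
    h = storage_headroom(path)
    has_space = h["free_bytes"] >= size_bytes
    # Assumption: an inode total of 0 means the file system has no fixed inode limit.
    has_inode = h["total_inodes"] == 0 or h["free_inodes"] > 0
    return has_space and has_inode

if __name__ == "__main__":
    print(storage_headroom("."))
    print(can_accept_upload(".", 10 * 1024))
```

A volume can show plenty of free bytes and still refuse new files when `free_inodes` reaches zero, which is the kind of situation described above.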

These errors feel like the earlier WCG download errors where the download server(s) ran out of threads to allow clients to download files, and if you did manage to snag a thread to download a file, the download was super slow. The download side's problems appear to be solved by replacing the HDDs with SSDs. Now the upload side feels like it has the same problems that the download side used to have. When the download side was the bottleneck, there apparently was no way to have enough work units in the field to allow result uploads to crush the upload servers. The OpenPandemics and Mapping Cancer Markers projects have very small uploads of several kilobytes that the upload server(s) can easily take. The Africa Rainfall Project has big uploads of around 105 megabytes split into up to 7 files per result, by my estimation from what I can see in BOINC's upload page in the advanced view, so these big uploads are crushing the upload server(s).
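
As a rough back-of-the-envelope check of that size gap, the sketch below uses the figures from this post (about 105 MB per Africa Rainfall Project result, up to 7 files). The 10 KB value standing in for "several kilobytes" and the use of decimal units are assumptions made only for illustration.

```python
MB = 1000 * 1000  # decimal megabytes; an assumption, BOINC Manager may count in 1024s
KB = 1000

arp_result_bytes = 105 * MB   # "around 105 megabytes" per ARP result (from the post)
arp_files_per_result = 7      # "up to 7 files per result" (from the post)
small_upload_bytes = 10 * KB  # assumed stand-in for "several kilobytes"

def average_file_size(total_bytes, file_count):
    """Average size of one file when a result is split evenly."""
    return total_bytes / file_count

arp_file_bytes = average_file_size(arp_result_bytes, arp_files_per_result)
ratio = arp_result_bytes / small_upload_bytes

print(f"Average ARP file: {arp_file_bytes / MB:.0f} MB")  # prints "Average ARP file: 15 MB"
print(f"One ARP result is about {ratio:,.0f} times the size of a 10 KB upload")
```

Under these assumptions a single ARP result moves as many bytes as roughly ten thousand small OPN or MCM uploads.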
----------------------------------------
[Edit 1 times, last edit by Jesse Viviano at Apr 9, 2023 1:16:41 PM]
[Apr 9, 2023 1:14:02 PM]
grumpy.
Cruncher
Joined: Sep 20, 2009
Post Count: 4
Re: 2023-04-06 Update (WU Distribution Update)

Plenty of transient HTTP errors myself with MCM & OPN.
I have had plenty of stalls at 100%, but I have noticed something I had not noticed before: the file sizes. When an upload reaches 100% and stalls, the two sizes shown in the upload progress are not the same; the uploaded size is reported as larger than the file size.

ex. 10.30/10.20 KB, 29.77/29.66 KB @ 100%

If I do a retry and the reported file sizes are equal (10.20/10.20 KB), it will upload. If not, it stalls.
I wonder if this is a bug.
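
For a sense of scale, this small Python sketch converts the progress figures quoted above into the number of bytes by which the "sent" value exceeds the file size. The helper name `overshoot_bytes` is hypothetical, and treating 1 KB as 1024 bytes is an assumption about how BOINC Manager displays sizes; the displayed values are also rounded to two decimals, so the result is only approximate.

```python
from decimal import Decimal

def overshoot_bytes(progress, bytes_per_kb=1024):
    """Bytes by which 'sent' exceeds 'size' in a string like '10.30/10.20 KB'."""
    figures = progress.upper().replace("KB", "").strip()
    sent, total = (Decimal(part) for part in figures.split("/"))
    return (sent - total) * bytes_per_kb

for sample in ("10.30/10.20 KB", "29.77/29.66 KB", "10.20/10.20 KB"):
    print(sample, "->", overshoot_bytes(sample), "bytes over")
```

Both stalled examples come out at roughly 100 bytes over the file size, while the retry that goes through shows no overshoot.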
[Apr 9, 2023 6:45:39 PM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1951
Re: 2023-04-06 Update (WU Distribution Update)

These errors feel like the earlier WCG download errors where the download server(s) ran out of threads to allow clients to download files, and if you did manage to snag a thread to download a file, the download was super slow. The download side's problems appear to be solved by replacing the HDDs with SSDs. Now the upload side feels like it has the same problems that the download side used to have. When the download side was the bottleneck, there apparently was no way to have enough work units in the field to allow result uploads to crush the upload servers. The OpenPandemics and Mapping Cancer Markers projects have very small uploads of several kilobytes that the upload server(s) can easily take. The Africa Rainfall Project has big uploads of around 105 megabytes split into up to 7 files per result, by my estimation from what I can see in BOINC's upload page in the advanced view, so these big uploads are crushing the upload server(s).
Well, while the current upload errors might "feel" like the ones we had mainly last year and around the turn of the year, they seem to be distinctly different in what is actually happening.

Back in the past, the system wasn't able to establish a connection and just sent a +100 byte header with a respective error message back. No actual data was transferred.

In this weekend's sequel, however, my observation over the last few hours (on and off) is that the system actually is transferring data (from host to server(s)), but then either craps out in the middle of the transfer or, right after the transfer has reached 100%, throws an error. This is noticeable on the smaller files of just several KB, where those +100 bytes are added to the total bytes transferred in the BOINC Manager.

While that was just a waste of available connections in the past, this new behavior is now also significantly wasting bandwidth as there are a lot of multi (2+) MB sized files that get interrupted and retried from scratch each time until successful.
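
A quick sketch of how much traffic restart-from-scratch retries can generate. The 2 MB file size comes from this post; the count of 20 failed attempts is an assumption borrowed from the low end of the "20-30" retries mentioned earlier in the thread, and `bytes_sent` is a hypothetical helper, not anything BOINC provides.

```python
MB = 1000 * 1000  # decimal megabytes, assumed for simplicity

def bytes_sent(file_bytes, failed_attempts, fail_fraction=1.0):
    """Total bytes put on the wire when every failed attempt restarts from zero.

    fail_fraction is the share of the file sent before each failure
    (1.0 models the "error right after reaching 100%" case).
    """
    wasted = failed_attempts * file_bytes * fail_fraction
    return wasted + file_bytes  # plus the one attempt that finally succeeds

file_bytes = 2 * MB    # "multi (2+) MB sized files" from the post
failed_attempts = 20   # assumed number of failed tries before success

total = bytes_sent(file_bytes, failed_attempts)
print(f"{total / MB:.0f} MB sent to deliver a {file_bytes / MB:.0f} MB file")
# prints "42 MB sent to deliver a 2 MB file"
```

In the older failure mode, where only a small error header came back and no data moved, the same 20 failures would have cost almost nothing in bandwidth.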

It is really disheartening to see WCG cave in to the whining of people that they can't crunch this or that (sub)project and release all this just before a long weekend, instead of trying it in a more controlled fashion at a time when it can be monitored more closely and issues like the ones currently observed can be reacted to more promptly... :(


Ralf
----------------------------------------

[Apr 9, 2023 6:49:52 PM]
phillipspencer
Advanced Cruncher
France
Joined: Apr 9, 2015
Post Count: 71
Re: 2023-04-06 Update (WU Distribution Update)

It is really disheartening to see WCG cave in to the whining of people that they can't crunch this or that (sub)project and release all this just before a long weekend, instead of trying it in a more controlled fashion at a time when it can be monitored more closely and issues like the ones currently observed can be reacted to more promptly... :(
Ralf

I have to agree that moving beyond MCM and OPN WUs, which had been working fine, just before a holiday weekend was ill-advised, given that the support arrangements are "working day business hours only". I hope that someone at Krembil schedules a "lessons learnt" exercise so that things can be managed more smoothly in future.
Oh well, at least with the transient errors the uploads get through eventually (with a lot of manual intervention).
[Apr 9, 2023 7:34:35 PM]