Thread Status: Active
Total posts in this thread: 265
This topic has been viewed 575304 times and has 264 replies
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Unplanned website outage 2017-06-25.

I would imagine that once the servers in the cluster are activated, there will be a lot of filesystem checking and resyncing going on prior to mounting. Depending on the size of the filesystems, it could take hours.
Actually, this should be happening "on the fly", with no need to wait for a "checking and resync". That's the whole idea of having a cloud-based storage cluster running a distributed file system like ZFS.

Having a whole cluster fail is far from common, be it on-premise or "in the cloud". There must be a bigger underlying issue at hand here, and that is most likely the reason why things are taking that long.

Ralf

But it was the storage cluster that failed. There is no "on the fly" if there aren't any servers in the storage cluster running. If they went down hard and the filesystem wasn't unmounted cleanly, it will have to be checked, and then ALL the nodes (servers in the cluster) will need to see the same consistent view of the filesystem. And that is just the storage cluster backend. What the cluster presents to the other servers could be a whole different thing.
[Jun 25, 2017 10:21:36 PM]
keithhenry
Ace Cruncher
Senile old farts of the world ....uh.....uh..... nevermind
Joined: Nov 18, 2004
Post Count: 18665
Status: Offline
Re: Unplanned website outage 2017-06-25.

In the past, when a server ran out of disk space, it took a LONG time to get it running. It wasn't that the server hadn't come back; it had to check the filesystem. Yesterday was not a clean shutdown. While different parts of the file cluster may not have to sync before they're available, I can't imagine that the OS would not require a check (like a CHKDSK on a Wintel box) to run first. I would imagine they have brought back up whichever servers can be started first and have to wait for this check to complete before starting the rest. With the sizes of the filesystems WCG has, I don't expect to see any uploads/downloads and such until midday Monday.
----------------------------------------
Join/Website/IMODB



[Jun 25, 2017 10:27:22 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Unplanned website outage 2017-06-25.

My 1st notification of this problem:

6/24/2017 8:26:19 PM | World Community Grid | Temporarily failed upload of OET1_0004784_x1WLJ_rig_16319_0_r353631776_0: connect() failed

-- it is now 6/25/2017 7:26 PM (23 hours of downtime and counting)

Can we have an official update on the recovery effort please?
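
The "23 hours" figure above can be checked from the two timestamps in this post. A minimal Python sketch, assuming both times are in the same local time zone (the post does not say which) and taking the seconds of the second timestamp as zero:

```python
from datetime import datetime

# Timestamps copied from the post; assumed to share one local time zone.
FMT = "%m/%d/%Y %I:%M:%S %p"
first_failure = datetime.strptime("6/24/2017 8:26:19 PM", FMT)
time_of_post = datetime.strptime("6/25/2017 7:26:00 PM", FMT)  # seconds assumed

elapsed = time_of_post - first_failure
hours_down = elapsed.total_seconds() / 3600
print(f"{hours_down:.1f} hours since the first failed upload")
```

This only measures time since the first failed upload seen by this one client, which may be later than the actual start of the outage.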
[Jun 25, 2017 11:28:33 PM]
Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline
Re: Unplanned website outage 2017-06-25.

WowBoo!!!
[Jun 25, 2017 11:29:25 PM]
rotalumis
Cruncher
Joined: Apr 26, 2012
Post Count: 1
Status: Offline
Re: Unplanned website outage 2017-06-25.

A Cluster Something
[Jun 25, 2017 11:30:20 PM]
gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 2982
Status: Offline
Re: Unplanned website outage 2017-06-25.

Can we have an official update on the recovery effort please?
The last official update was by Keith Uplinger, 8 hrs ago at Jun 25, 2017 3:30:49 PM (UTC) - see here. So yes, a brief update would be appreciated, but not if it's going to distract the WCG techs from resolving the situation (possibly a WCG Admin person could provide an update?).
[Jun 25, 2017 11:36:05 PM]
RTorpey
Advanced Cruncher
Joined: Aug 24, 2005
Post Count: 67
Status: Offline
Re: Unplanned website outage 2017-06-25.

"Can we have an official update on the recovery effort please?"

As someone who has spent many years supporting complex, critical systems, I can nearly guarantee that the current status is "A number of very talented people are busting their butts trying to get the system back up as quickly as possible".
No systems admin enjoys their systems being down, especially when there are hundreds of thousands of people around the world watching and asking "Is it done yet? When will it be done? Is it done yet????"
[Jun 25, 2017 11:43:01 PM]
caitilarkin
Former World Community Grid Admin
USA
Joined: Nov 4, 2015
Post Count: 331
Status: Offline
Re: Unplanned website outage 2017-06-25.

Greetings everyone. We are currently rebooting the storage cluster. It is the main reason why everything is down. It does have redundancy in case some of the servers go down. However, all the servers for that cluster went down. We are still in the process of bringing those up.

Also, with regard to workunits being returned too late: I will extend the time limit on the results out there by 24 hours from when everything is turned back on. I will also do the staggered start for FAHB trickle messages, similar to what we did for the migration, to help minimize the chance of invalids for members.

Again, we are sorry about the outage and hope to resolve it as quickly as possible. Thank you for your patience and support towards World Community Grid.

Thanks,
-Uplinger


Here's the most recent update from the tech team (the post has gotten pushed back by subsequent replies). They are heads-down working on this. We'll let everyone know as soon as we have more information. Thanks again for your patience.
[Jun 25, 2017 11:43:20 PM]
knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Re: Unplanned website outage 2017-06-25.

We have a number of people working on this issue and trying to understand what changed to cause this outage. For those who are technical: when we start up the storage cluster, we get a large number of "kernel:NMI watchdog: BUG: soft lockup" issues being reported, the load on the systems reaches a very high level, and the systems become very sluggish. We have reached out beyond our standard support teams and are bringing more people onto the issue.
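
For readers who want to gauge how often a message like the one quoted above shows up in their own logs, here is a rough Python sketch that counts matching lines. The sample lines are hypothetical and not taken from the WCG servers, and the optional "CPU#<n>" suffix is an assumption about how the kernel commonly words the message, not something stated in this post.

```python
import re
from collections import Counter

# Pattern based on the message quoted in the post. The " - CPU#<n>" suffix is
# an assumption about the usual kernel wording, so it is matched optionally.
SOFT_LOCKUP = re.compile(r"NMI watchdog: BUG: soft lockup(?: - CPU#(\d+))?")

def count_soft_lockups(lines):
    """Return (total matches, Counter of matches keyed by CPU id or 'unknown')."""
    per_cpu = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            per_cpu[match.group(1) or "unknown"] += 1
    return sum(per_cpu.values()), per_cpu

# Hypothetical sample lines for illustration only.
sample = [
    "kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [example:101]",
    "kernel:NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [example:101]",
    "kernel:NMI watchdog: BUG: soft lockup",
    "kernel: some unrelated message",
]
total, per_cpu = count_soft_lockups(sample)
print(total, dict(per_cpu))
```

A count like this only shows how widespread the lockups are; it says nothing about the underlying cause.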
[Jun 25, 2017 11:49:59 PM]
gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 2982
Status: Offline
Re: Unplanned website outage 2017-06-25.

Thanks caitilarkin for confirming what we suspected, that you and the techs are hard at work resolving this situation (it's always reassuring to know that, despite not having an update every 8 hrs or so, the techs are still there, hard at work).

Edit: Also, thanks Kevin for popping by and giving that extra update - very much appreciated.
[Edit 1 times, last edit by gb009761 at Jun 25, 2017 11:53:41 PM]
[Jun 25, 2017 11:52:16 PM]