World Community Grid - View Thread - Unplanned website outage 2017-06-25.

World Community Grid Forums

Category: Support

Forum: Website Support

Thread: Unplanned website outage 2017-06-25.


Thread Status: Active
Total posts in this thread: 265


This topic has been viewed 575304 times and has 264 replies

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Unplanned website outage 2017-06-25.

I would imagine that once the servers in the cluster are activated, there will be a lot of filesystem checking and resyncing going on prior to mounting. Depending on the size of the filesystems, that could take hours.

Actually, this should be happening "on the fly", with no need to wait for a "checking and resync". That's the whole idea of having a cloud-based storage cluster running a distributed file system like ZFS.

Having a whole cluster fail is far from common, be it on-premises or "in the cloud". There must be a bigger underlying issue at hand here, and that is most likely the reason why things are taking so long.

Ralf

But it was the storage cluster that failed. There is no "on the fly" if there aren't any servers in the storage cluster running. If they went down hard and the filesystem wasn't unmounted cleanly, it will have to be checked, and then ALL the nodes (servers in the cluster) will need to see the same consistent view of the filesystem. And that is just the storage cluster backend. What the cluster presents to the other servers could be a whole different thing.

[Jun 25, 2017 10:21:36 PM]

keithhenry
Ace Cruncher
Senile old farts of the world ....uh.....uh..... nevermind
Joined: Nov 18, 2004
Post Count: 18665
Status: Offline
Project Badges:

2 year badge for Human Proteome Folding - Phase 2

180 day badge for Discovering Dengue Drugs - Together

2 year badge for Nutritious Rice for the World

90 day badge for The Clean Energy Project

2 year badge for Help Fight Childhood Cancer

90 day badge for Influenza Antiviral Drug Search

2 year badge for Help Cure Muscular Dystrophy - Phase 2

2 year badge for Discovering Dengue Drugs - Together - Phase 2

2 year badge for The Clean Energy Project - Phase 2

2 year badge for Computing for Clean Water

2 year badge for Drug Search for Leishmaniasis

2 year badge for GO Fight Against Malaria

180 day badge for Computing for Sustainable Water

100 year badge for Mapping Cancer Markers

10 year badge for Uncovering Genome Mysteries

10 year badge for Outsmart Ebola Together

10 year badge for FightAIDS@Home - Phase 2

20 year badge for Smash Childhood Cancer

20 year badge for Microbiome Immunity Project

20 year badge for Africa Rainfall Project

20 year badge for OpenPandemics - COVID-19


Re: Unplanned website outage 2017-06-25.

In the past, when a server ran out of disk space, it took a LONG time to get it running again. It wasn't that the server hadn't come back; it had to check the filesystem first. Yesterday was not a clean shutdown. While different parts of the file cluster may not have to sync before they're available, I can't imagine that the OS would not require a check (like a CHKDSK on a Wintel box) to run first. I would imagine they have brought back up whichever servers can be started first and have to wait for this check to complete before starting the rest. With the sizes of the filesystems WCG has, I don't expect to see any uploads/downloads and such until midday Monday.

----------------------------------------

Join/Website/IMODB

[Jun 25, 2017 10:27:22 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Unplanned website outage 2017-06-25.

My 1st notification of this problem:

6/24/2017 8:26:19 PM | World Community Grid | Temporarily failed upload of OET1_0004784_x1WLJ_rig_16319_0_r353631776_0: connect() failed

-- it is now 6/25/2017 7:26 PM (23 hours of downtime and counting)

Can we have an official update on the recovery effort please?
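The downtime figure above can be checked from the two timestamps in the post. This is only a rough sketch: it assumes the log line follows a "timestamp | project | message" layout, that both times come from the same local clock, and that an English locale is in effect for the AM/PM marker. The helper name `parse_event` is made up for illustration.

```python
from datetime import datetime

# Timestamp layout as it appears in the quoted log line (assumed 12-hour clock).
FMT = "%m/%d/%Y %I:%M:%S %p"

def parse_event(line):
    """Split a 'timestamp | project | message' line into its three parts."""
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    return datetime.strptime(stamp, FMT), project, message

log_line = (
    "6/24/2017 8:26:19 PM | World Community Grid | Temporarily failed upload of "
    "OET1_0004784_x1WLJ_rig_16319_0_r353631776_0: connect() failed"
)
first_failure, project, message = parse_event(log_line)

# "It is now" time from the post; seconds are assumed to be :00.
now = datetime.strptime("6/25/2017 7:26:00 PM", FMT)
elapsed = now - first_failure
print(elapsed)
```

Under those assumptions the gap comes out at just under 23 hours, which matches the figure in the post.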

[Jun 25, 2017 11:28:33 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: Unplanned website outage 2017-06-25.

WowBoo!!!

[Jun 25, 2017 11:29:25 PM]

rotalumis
Cruncher
Joined: Apr 26, 2012
Post Count: 1
Status: Offline
Project Badges:

1 year badge for Human Proteome Folding - Phase 2

14 day badge for Help Fight Childhood Cancer

14 day badge for Help Cure Muscular Dystrophy - Phase 2

14 day badge for Computing for Clean Water

180 day badge for Drug Search for Leishmaniasis

14 day badge for Computing for Sustainable Water

10 year badge for Mapping Cancer Markers

2 year badge for Uncovering Genome Mysteries

2 year badge for Outsmart Ebola Together

5 year badge for FightAIDS@Home - Phase 2

180 day badge for Smash Childhood Cancer

10 year badge for Microbiome Immunity Project

2 year badge for OpenPandemics - COVID-19


Re: Unplanned website outage 2017-06-25.

A Cluster Something

[Jun 25, 2017 11:30:20 PM]

gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 2982
Status: Offline
Project Badges:

90 day badge for Help Cure Muscular Dystrophy

90 day badge for Discovering Dengue Drugs - Together

90 day badge for Nutritious Rice for the World

1 year badge for Help Cure Muscular Dystrophy - Phase 2

1 year badge for Discovering Dengue Drugs - Together - Phase 2

1 year badge for The Clean Energy Project - Phase 2

1 year badge for Computing for Clean Water

180 day badge for GO Fight Against Malaria

20 year badge for Mapping Cancer Markers

2 year badge for FightAIDS@Home - Phase 2

2 year badge for Microbiome Immunity Project

2 year badge for Africa Rainfall Project

5 year badge for OpenPandemics - COVID-19


Re: Unplanned website outage 2017-06-25.

"Can we have an official update on the recovery effort please?"

The last official update was by Keith Uplinger, 8 hrs ago at Jun 25, 2017 3:30:49 PM (UTC) - see here. So yes, a brief update would be appreciated, but not if it's going to distract the WCG techs from resolving the situation (possibly a WCG Admin person could provide an update?).

----------------------------------------

[Jun 25, 2017 11:36:05 PM]

RTorpey
Advanced Cruncher
Joined: Aug 24, 2005
Post Count: 67
Status: Offline
Project Badges:

10 year badge for Human Proteome Folding - Phase 2

2 year badge for Help Cure Muscular Dystrophy

45 day badge for Discovering Dengue Drugs - Together

1 year badge for Nutritious Rice for the World

14 day badge for The Clean Energy Project

45 day badge for Influenza Antiviral Drug Search

5 year badge for The Clean Energy Project - Phase 2

1 year badge for GO Fight Against Malaria

50 year badge for Mapping Cancer Markers

100 year badge for Outsmart Ebola Together

20 year badge for FightAIDS@Home - Phase 2

100 year badge for Microbiome Immunity Project

100 year badge for OpenPandemics - COVID-19


Re: Unplanned website outage 2017-06-25.

"Can we have an official update on the recovery effort please?"

As someone who has spent many years supporting complex, critical systems, I can nearly guarantee that the current status is "A number of very talented people are busting their butts trying to get the system back up as quickly as possible."
No systems admin enjoys their systems being down, especially when there are hundreds of thousands of people around the world watching and asking "Is it done yet? When will it be done? Is it done yet????"

[Jun 25, 2017 11:43:01 PM]

caitilarkin
Former World Community Grid Admin
USA
Joined: Nov 4, 2015
Post Count: 331
Status: Offline
Project Badges:

14 day badge for Uncovering Genome Mysteries

180 day badge for Outsmart Ebola Together

90 day badge for FightAIDS@Home - Phase 2

180 day badge for Africa Rainfall Project


Re: Unplanned website outage 2017-06-25.

Greetings everyone. We are currently rebooting the storage cluster. It is the main reason everything is down. It does have redundancy in case some of the servers go down. However, all the servers for that cluster went down. We are still in the process of bringing those up.

Also, with regard to workunits being returned too late: I will extend the time limit on the results out there by 24 hours from when everything is turned back on. I will also do a staggered start for FAHB trickle messages, similar to what we did for the migration, to help minimize the chance of invalids for members.

Again, we are sorry about the outage and hope to resolve it as quickly as possible. Thank you for your patience and support towards World Community Grid.

Thanks,
-Uplinger

Here's the most recent update from the tech team (the post has gotten pushed back by subsequent replies). They are heads-down working on this. We'll let everyone know as soon as we have more information. Thanks again for your patience.
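For the technically curious, the 24-hour extension and the staggered start mentioned in the quoted update could look roughly like the sketch below. It is not the WCG implementation. It assumes one reading of the update (no outstanding result is due earlier than 24 hours after the restart), and every name, time, and batch size in it is hypothetical.

```python
from datetime import datetime, timedelta

def extend_deadlines(deadlines, restart, grace=timedelta(hours=24)):
    """Return new deadlines so that nothing is due before restart + grace."""
    floor = restart + grace
    return {name: max(due, floor) for name, due in deadlines.items()}

def staggered_offsets(members, batch_size, gap):
    """Give each member a start delay; each batch begins `gap` after the last."""
    return {m: (i // batch_size) * gap for i, m in enumerate(members)}

# Made-up restart time and results, for illustration only.
restart = datetime(2017, 6, 26, 12, 0)
deadlines = {
    "result_a": datetime(2017, 6, 25, 9, 0),  # would have expired during the outage
    "result_b": datetime(2017, 7, 1, 9, 0),   # already due well after the restart
}
new_deadlines = extend_deadlines(deadlines, restart)

# Three hypothetical members, two per batch, batches ten minutes apart.
offsets = staggered_offsets(["m1", "m2", "m3"], batch_size=2, gap=timedelta(minutes=10))
```

Under this reading only results that would otherwise expire too early are moved, and spreading members across batches keeps them from all reporting in the same minute once the servers are back.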

[Jun 25, 2017 11:43:20 PM]

knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline
Project Badges:

180 day badge for Human Proteome Folding

90 day badge for Human Proteome Folding - Phase 2

45 day badge for Help Cure Muscular Dystrophy - Phase 2

90 day badge for Computing for Clean Water

45 day badge for Outsmart Ebola Together

180 day badge for FightAIDS@Home - Phase 2

1 year badge for Microbiome Immunity Project

1 year badge for Africa Rainfall Project

180 day badge for OpenPandemics - COVID-19


Re: Unplanned website outage 2017-06-25.

We have a number of people working on this issue and trying to understand what changed to cause this outage. For those who are technical: when we start up the storage cluster, we get a large number of "kernel:NMI watchdog: BUG: soft lockup" issues reported, the load on the systems reaches a very high level, and the systems become very sluggish. We have reached out beyond our standard support teams and are bringing more people onto the issue.
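As a rough illustration of how one might gauge the "large number" of these reports, the sketch below counts lines containing the quoted message. The sample lines are placeholders, not real WCG logs; only the quoted phrase comes from the post, and the function name is made up.

```python
# Message text as quoted in the post above.
MARKER = "NMI watchdog: BUG: soft lockup"

def count_soft_lockups(lines):
    """Count the lines that contain the soft-lockup marker."""
    return sum(1 for line in lines if MARKER in line)

# Placeholder input for illustration only.
sample = [
    "kernel:NMI watchdog: BUG: soft lockup",
    "unrelated placeholder line",
    "kernel:NMI watchdog: BUG: soft lockup",
]
soft_lockup_count = count_soft_lockups(sample)
```

In practice the same function could be fed the lines of a saved kernel log file, one host at a time, to compare how badly each node is affected.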

[Jun 25, 2017 11:49:59 PM]

gb009761
Master Cruncher
Scotland
Joined: Apr 6, 2005
Post Count: 2982
Status: Offline


Re: Unplanned website outage 2017-06-25.

Thanks caitilarkin for confirming what we suspected, that you and the techs are hard at work resolving this situation (it's always reassuring to know that, despite there not being an update every 8 hrs or so, the techs are still there, hard at work).

Edit: Also, thanks Kevin for popping by and giving that extra update - very much appreciated.

----------------------------------------
[Edit 1 times, last edit by gb009761 at Jun 25, 2017 11:53:41 PM]

[Jun 25, 2017 11:52:16 PM]
