World Community Grid Forums
Thread Status: Active Total posts in this thread: 265
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
I would imagine that once the servers in the cluster are activated, there will be a lot of filesystem checking and resyncing going on prior to mounting. Depending on the size of the filesystems, could take hours.

Actually, this should be happening "on the fly", no need to wait for a "checking and resync", that's the whole idea of having a cloud based storage cluster, running a distributed file system like ZFS. To have a whole cluster failing, that is far from common, be it on-premise or "in the cloud". There must be a bigger underlying issue at hand here, and that is most likely the reason why things are taking that long. Ralf

But it was the storage cluster that failed. There is no "on the fly" if there aren't any servers in the storage cluster running. If they went down hard and the filesystem wasn't unmounted cleanly, it will have to be checked, and then ALL the nodes (servers in the cluster) will need to see the same consistent view of the filesystem, and that is just the storage cluster backend. What the cluster presents to the other servers could be a whole different thing. |
||
|
keithhenry
Ace Cruncher Senile old farts of the world ....uh.....uh..... nevermind Joined: Nov 18, 2004 Post Count: 18665 Status: Offline |
In the past when a server ran out of disk space, it took a LONG time to get it running. Not that it wasn't back, it had to check the filesystem. Yesterday was not a clean shutdown. While different parts of the file cluster may not have to sync before they're available, I can't imagine that the OS would not require a check (like a CHKDSK on a wintel box) to run first. I would imagine they have brought back up whichever servers can be started first and have to wait for this check to complete before starting the rest. With the sizes of the filesystems WCG has, I don't expect to see any uploads/downloads and such until midday Monday.
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
My 1st notification of this problem:
6/24/2017 8:26:19 PM | World Community Grid | Temporarily failed upload of OET1_0004784_x1WLJ_rig_16319_0_r353631776_0: connect() failed

It is now 6/25/2017 7:26 PM (23 hours of downtime and counting). Can we have an official update on the recovery effort please? |
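For anyone who wants to check the "23 hours" figure in the post above, here is a rough Python sketch of the arithmetic. The two timestamps are copied from the post; the format string is an assumption based on how they are written there, and the seconds on the second timestamp are assumed to be :00 since the post does not give them.

```python
from datetime import datetime

# Assumed format: month/day/year with a 12-hour clock, as written in the post.
FMT = "%m/%d/%Y %I:%M:%S %p"

# First failed upload, taken from the log line in the post.
first_failure = datetime.strptime("6/24/2017 8:26:19 PM", FMT)
# Time the post was written; seconds are not given, so :00 is assumed.
time_of_post = datetime.strptime("6/25/2017 7:26:00 PM", FMT)

downtime = time_of_post - first_failure
hours = downtime.total_seconds() / 3600
print(f"Downtime so far: about {hours:.1f} hours")
```

This only measures the time since that one client first noticed the problem, so the outage itself may have started somewhat earlier.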
||
|
Former Member
Cruncher Joined: May 22, 2018 Post Count: 0 Status: Offline |
WowBoo!!!
|
||
|
rotalumis
Cruncher Joined: Apr 26, 2012 Post Count: 1 Status: Offline |
A Cluster Something
|
||
|
gb009761
Master Cruncher Scotland Joined: Apr 6, 2005 Post Count: 2982 Status: Offline |
"Can we have an official update on the recovery effort please?"

The last official update was by Keith Uplinger, 8 hrs ago at Jun 25, 2017 3:30:49 PM (UTC) - see here, so yes, a brief update would be appreciated - but not if it's going to distract the WCG techs from attempting to resolve the situation (possibly a WCG Admin person could provide an update?). |
||
|
RTorpey
Advanced Cruncher Joined: Aug 24, 2005 Post Count: 67 Status: Offline |
"Can we have an official update on the recovery effort please?"
As someone who has spent many years supporting complex, critical systems, I can nearly guarantee that the current status is "A number of very talented people are busting their butts trying to get the system back up as quickly as possible". No systems admin enjoys their systems being down, especially when there are hundreds of thousands of people around the world watching and asking "Is it done yet? When will it be done? Is it done yet????" |
||
|
caitilarkin
Former World Community Grid Admin USA Joined: Nov 4, 2015 Post Count: 331 Status: Offline |
Greetings everyone. We are currently rebooting the storage cluster. It is the main reason everything is down. It does have redundancy in case some of the servers go down. However, all the servers for that cluster went down. We are still in the process of bringing those up.

Also, with regard to workunits and "too late": I will extend the time limit on the results out there by 24 hours from when everything is turned back on. I will also do the staggered start for FAHB trickle messages for members, similar to what we did for the migration, to help minimize the chance of invalids for members.

Again, we are sorry about the outage and hope to resolve it as quickly as possible. Thank you for your patience and support towards World Community Grid.

Thanks,
-Uplinger

Here's the most recent update from the tech team (the post has gotten pushed back by subsequent replies). They are heads-down working on this. We'll let everyone know as soon as we have more information. Thanks again for your patience. |
||
|
knreed
Former World Community Grid Tech Joined: Nov 8, 2004 Post Count: 4504 Status: Offline |
We have a number of people working on this issue and trying to understand what changed to cause this outage. For those who are technical, when we start up the storage cluster we get a large number of "kernel:NMI watchdog: BUG: soft lockup" issues being reported, the load on the systems reaches a very high level, and the systems become very sluggish. We have reached out beyond our standard support teams and are bringing more people onto the issue.
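For readers curious how often a message like the one quoted above turns up in their own logs, here is a minimal Python sketch that simply counts matching lines. Only the message text comes from the post; the sample lines (including the `host1` name) are made up for illustration, and this is not a description of how the WCG team is diagnosing the problem.

```python
# Message text as quoted in the post above.
SOFT_LOCKUP_MARKER = "BUG: soft lockup"


def count_soft_lockups(lines):
    """Return how many of the given log lines contain the marker text."""
    return sum(1 for line in lines if SOFT_LOCKUP_MARKER in line)


# Made-up sample lines for illustration only; these are not real WCG logs.
sample_lines = [
    "host1 kernel:NMI watchdog: BUG: soft lockup",
    "host1 kernel: some unrelated message",
    "host1 kernel:NMI watchdog: BUG: soft lockup",
]

print(count_soft_lockups(sample_lines))
```

A plain substring match is enough here because the goal is only a count; pulling out CPU numbers or durations would need the exact message layout on the system in question.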
|
||
|
gb009761
Master Cruncher Scotland Joined: Apr 6, 2005 Post Count: 2982 Status: Offline |
Thanks caitilarkin for confirming what we suspected, that you and the techs are hard at work resolving this situation (it's always reassuring, knowing that, despite not having an update every 8 hrs or so, that the techs are still there, hard at work).
Edit: Also, thanks Kevin for popping by and giving that extra update - very much appreciated.
[Edit 1 times, last edit by gb009761 at Jun 25, 2017 11:53:41 PM] |
||