Total posts in this thread: 16
This topic has been viewed 126462 times and has 15 replies
TigerLily
Senior Cruncher
Joined: May 26, 2023
Post Count: 280
Update: July 21st system outage

As many of you are aware, we experienced an unexpected outage on Friday, July 21st due to issues with the data centre where WCG servers reside. The issues arose from the failure of both the primary and failover DHCP agents, which are responsible for renewing the leases our virtual machines hold on their IP addresses on the virtual networks in our cloud environment. Some time after the agents failed, those leases expired and our virtual machines were no longer able to interact with each other or the internet. After the DHCP agents were restarted, followed by the virtual machines, further network issues appeared. Some of these continue, such as some interfaces that still do not work as they did before, and we are trying to mitigate them until they can be resolved.
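
For readers less familiar with DHCP, here is a minimal sketch of why an outage like this only begins some time after the agents fail. It assumes the default timers from RFC 2131 (renewal attempts start at 50% of the lease, rebinding at 87.5%) and a hypothetical 24-hour lease. The actual lease length and timers on the WCG virtual networks are not stated in this post, and the function names are made up for illustration.

```python
from datetime import datetime, timedelta


def lease_timeline(acquired: datetime, lease_seconds: int) -> dict:
    """Return renewal (T1), rebinding (T2), and expiry times for a lease.

    Uses the RFC 2131 default fractions: T1 = 50%, T2 = 87.5% of the lease.
    """
    lease = timedelta(seconds=lease_seconds)
    return {
        "renew_at": acquired + 0.5 * lease,
        "rebind_at": acquired + 0.875 * lease,
        "expires_at": acquired + lease,
    }


def lease_state(now: datetime, timeline: dict) -> str:
    """Classify where a client is in its lease, assuming no renewal succeeds."""
    if now < timeline["renew_at"]:
        return "bound"        # address in use, no renewal attempted yet
    if now < timeline["rebind_at"]:
        return "renewing"     # asking the original server to extend the lease
    if now < timeline["expires_at"]:
        return "rebinding"    # asking any server to extend the lease
    return "expired"          # address must be given up


# Hypothetical example: a 24-hour lease obtained at midnight on July 21st.
timeline = lease_timeline(datetime(2023, 7, 21, 0, 0, 0), 24 * 60 * 60)
print(lease_state(datetime(2023, 7, 21, 22, 0, 0), timeline))
print(lease_state(datetime(2023, 7, 22, 0, 0, 0), timeline))
```

With these assumed numbers, a client keeps its address through the "renewing" and "rebinding" phases even though nothing answers, and only loses it once the full lease has run out.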

Thanks for your support, patience, and understanding. If you have any questions or comments, please leave them in this thread for us to answer.
[Aug 2, 2023 5:43:26 PM]
phillipspencer
Advanced Cruncher
France
Joined: Apr 9, 2015
Post Count: 71
Re: Update: July 21st system outage

Thank you for the explanation. It is good to understand what went wrong, though I continue to worry about how fragile the Krembil systems are.
[Aug 2, 2023 7:55:50 PM]
hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 796
Re: Update: July 21st system outage

Thanks for the info, TigerLily and team. Big oof @ total DHCP failure taking down an entire environment. That's... interesting.

Edit: I mean terrifying lol.
----------------------------------------
  • i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
  • i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
  • i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

----------------------------------------
[Edit 2 times, last edit by hchc at Aug 2, 2023 8:05:52 PM]
[Aug 2, 2023 8:05:03 PM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1949
Re: Update: July 21st system outage

You are running servers on DHCP addresses?

Well, why am I surprised about this? The common good practice that any machine that needs to be reachable, from anywhere, gets a static IP address just doesn't seem to be known or followed...


Ralf
[Aug 2, 2023 8:55:12 PM]
Unixchick
Veteran Cruncher
Joined: Apr 16, 2020
Post Count: 946
Re: Update: July 21st system outage

Thanks for the update.
[Aug 2, 2023 10:01:26 PM]
hchc
Veteran Cruncher
USA
Joined: Aug 15, 2006
Post Count: 796
Re: Update: July 21st system outage

You are running servers on DHCP addresses?

Well, why am I surprised about this? The common good practice that any machine that needs to be reachable, from anywhere, gets a static IP address just doesn't seem to be known or followed...


Ralf

You're absolutely right, Ralf. Forgot to notice that. My access point, printer, all servers/services are assigned statically even at home. Only client devices get DHCP leases.

To prevent this from happening, WCG network and server admins can simply make sure all servers/services, routers, switches, load balancers, etc. are using static network configs. (Of course, if further changes are made to the infrastructure, that'll break things, naturally.)
----------------------------------------
  • i5-7500 (Kaby Lake, 4C/4T) @ 3.4 GHz
  • i5-4590 (Haswell, 4C/4T) @ 3.3 GHz
  • i5-3570 (Ivy Bridge, 4C/4T) @ 3.4 GHz

----------------------------------------
[Edit 1 times, last edit by hchc at Aug 2, 2023 11:36:09 PM]
[Aug 2, 2023 11:35:18 PM]
markfw
Cruncher
Joined: Oct 13, 2016
Post Count: 22
Re: Update: July 21st system outage

When are the stats exports going to be working again?
[Aug 3, 2023 12:32:24 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 946
Re: Update: July 21st system outage

You are running servers on DHCP addresses?

Well, why am I surprised about this? The common good practice that any machine that needs to be reachable, from anywhere, gets a static IP address just doesn't seem to be known or followed...


Ralf

You're absolutely right, Ralf. Forgot to notice that. My access point, printer, all servers/services are assigned statically even at home. Only client devices get DHCP leases.

To prevent this from happening, WCG network and server admins can simply make sure all servers/services, routers, switches, load balancers, etc. are using static network configs. (Of course, if further changes are made to the infrastructure, that'll break things, naturally.)
Isn't the use of DHCP a consequence of the use of VMs to run services? It's not quite the same as a home or small office network :-) -- something like a network file server might be eligible for a fixed address, but VMs are more akin to client services (even if they are being used to run server-type software). I think WCG are running everything but their network file server on VMs...

Cheers - Al.
[Aug 3, 2023 12:53:27 AM]
thunder7
Senior Cruncher
Netherlands
Joined: Mar 6, 2013
Post Count: 232
Re: Update: July 21st system outage

So does this mean the outage/maintenance that was planned for the 25th has already happened, or were you unable to do it because systems were down, so that it will still have to happen at some point in the future?
[Aug 3, 2023 4:08:50 AM]
TPCBF
Master Cruncher
USA
Joined: Jan 2, 2011
Post Count: 1949
Re: Update: July 21st system outage

You are running servers on DHCP addresses?

Well, why am I surprised about this? The common good practice that any machine that needs to be reachable, from anywhere, gets a static IP address just doesn't seem to be known or followed...


Ralf

You're absolutely right, Ralf. Forgot to notice that. My access point, printer, all servers/services are assigned statically even at home. Only client devices get DHCP leases.

To prevent this from happening, WCG network and server admins can simply make sure all servers/services, routers, switches, load balancers, etc. are using static network configs. (Of course, if further changes are made to the infrastructure, that'll break things, naturally.)
Isn't the use of DHCP a consequence of the use of VMs to run services? It's not quite the same as a home or small office network :-) -- something like a network file server might be eligible for a fixed address, but VMs are more akin to client services (even if they are being used to run server-type software). I think WCG are running everything but their network file server on VMs...

Cheers - Al.
Using DHCP is the lazy way out, but nothing prevents you from setting up a machine with a static IP instead, VM or not. It just requires a tad more work when setting up and, of course, a bit more planning, but in the end it would prevent any such occurrence as was given as the excuse here...
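
As a rough illustration of the planning part, here is a sketch of a sanity check for a static address plan: every address should sit inside the subnet, stay out of any remaining DHCP pool, and be used only once. The subnet, pool range, and host names are invented for the example and have nothing to do with the actual WCG setup.

```python
import ipaddress


def check_static_plan(subnet, dhcp_pool, static_hosts):
    """Return a list of problems found in a static address plan.

    subnet       -- network in CIDR form, e.g. "10.0.0.0/24" (hypothetical)
    dhcp_pool    -- (first, last) address still handed out by DHCP
    static_hosts -- mapping of host name to its planned static address
    """
    network = ipaddress.ip_network(subnet)
    pool_start = ipaddress.ip_address(dhcp_pool[0])
    pool_end = ipaddress.ip_address(dhcp_pool[1])

    problems = []
    seen = {}  # address -> first host name that claimed it
    for name, addr_text in static_hosts.items():
        addr = ipaddress.ip_address(addr_text)
        if addr not in network:
            problems.append(f"{name}: {addr} is outside {network}")
        if pool_start <= addr <= pool_end:
            problems.append(f"{name}: {addr} overlaps the DHCP pool")
        if addr in seen:
            problems.append(f"{name}: {addr} already assigned to {seen[addr]}")
        else:
            seen[addr] = name
    return problems


# Hypothetical plan with three deliberate mistakes.
plan = {
    "web": "10.0.0.10",
    "db": "10.0.0.10",       # duplicate of "web"
    "files": "10.0.0.150",   # inside the DHCP pool
    "stats": "10.0.1.5",     # wrong subnet
}
for problem in check_static_plan("10.0.0.0/24", ("10.0.0.100", "10.0.0.200"), plan):
    print(problem)
```

An empty result means the plan passes these three checks; it says nothing about routing, DNS, or whatever else the environment needs.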

Ralf
[Aug 3, 2023 4:32:11 AM]