World Community Grid Forums
Thread Status: Active Total posts in this thread: 16
TigerLily
Senior Cruncher Joined: May 26, 2023 Post Count: 280
As many of you are aware, we experienced an unexpected outage on Friday, July 21st due to issues at the data centre where the WCG servers reside. The issues arose from the failure of both the primary and failover DHCP agents, whose job is to renew the leases our virtual machines hold on their IP addresses on our virtual networks in our cloud environment. Once the agents had failed, those leases eventually expired, and our virtual machines could no longer communicate with each other or with the internet. Restarting the DHCP agents, and then the virtual machines, led to further network problems. Some of these remain (for example, some interfaces still do not work as they did before), and we are mitigating them until they can be resolved.
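For readers wondering why the failure took a while to bite: under the default timers in RFC 2131, a DHCP client tries to renew its lease with its server at T1 (half the lease time), tries to rebind with any server at T2 (87.5% of the lease time), and must stop using the address once the lease expires. The Python sketch below models that under a simplifying assumption that renewals happen exactly at T1. The 24-hour lease and the outage start time in the example are hypothetical; the post does not say how leases are configured in WCG's cloud environment.

```python
from __future__ import annotations


def dhcp_timers(lease_hours: float) -> tuple[float, float, float]:
    """RFC 2131 default client timers, in hours after a lease is granted:
    T1 (renew with the same server), T2 (rebind with any server), expiry."""
    if lease_hours <= 0:
        raise ValueError("lease_hours must be positive")
    return 0.5 * lease_hours, 0.875 * lease_hours, lease_hours


def address_lost_at(lease_hours: float, agents_down_at: float) -> float:
    """Hour (counted from the first grant at t = 0) at which a client must stop
    using its address if every DHCP agent is unreachable from `agents_down_at` on.

    Simplified model (an assumption, not how any particular cloud behaves):
    the client renews exactly at T1, and a renewal succeeds only if it
    happens before the agents go down.
    """
    t1, _t2, _expiry = dhcp_timers(lease_hours)
    granted = 0.0
    # While the agents still answer, each renewal at T1 restarts the lease.
    while granted + t1 < agents_down_at:
        granted += t1
    # From here on, renewing at T1 and rebinding at T2 both fail,
    # so the address is usable only until the current lease runs out.
    return granted + lease_hours


if __name__ == "__main__":
    # Hypothetical numbers: 24-hour leases, agents fail 30 hours after the first grant.
    print(dhcp_timers(24))          # (12.0, 21.0, 24)
    print(address_lost_at(24, 30))  # 48.0
```

In this made-up example the agents fail at hour 30, yet the machines keep their addresses until hour 48, which is the kind of delayed failure the post describes.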
Thanks for your support, patience, and understanding. If you have any questions or comments, please leave them in this thread for us to answer.
phillipspencer
Advanced Cruncher France Joined: Apr 9, 2015 Post Count: 71
Thank you for the explanation. It is good to understand what went wrong, though I continue to worry about how fragile the Krembil systems are.
hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 796
Thanks for the info, TigerLily and team. Big oof @ total DHCP failure taking down an entire environment. That's... interesting.
Edit: I mean terrifying lol.
[Edit 2 times, last edit by hchc at Aug 2, 2023 8:05:52 PM]
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1949
You are running servers on DHCP addresses?
Well, why am I surprised about this? The common good practice of giving every machine that needs to be reachable from anywhere a static IP address just doesn't seem to be known or followed...
Ralf
Unixchick
Veteran Cruncher Joined: Apr 16, 2020 Post Count: 946
Thanks for the update.
hchc
Veteran Cruncher USA Joined: Aug 15, 2006 Post Count: 796
[Quoting TPCBF's post above]
You're absolutely right, Ralf. I missed that. Even at home, my access point, printer, and all servers/services have static addresses; only client devices get DHCP leases. To prevent this from happening again, WCG network and server admins can simply make sure that all servers/services, routers, switches, load balancers, etc. use static network configs. (Naturally, later changes to the infrastructure could still break things.)
[Edit 1 times, last edit by hchc at Aug 2, 2023 11:36:09 PM]
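Alongside static addressing, a common precaution is to keep static addresses outside the range the DHCP server hands out dynamically, so a lease can never be issued for an address a server already uses. The Python sketch below checks a hypothetical addressing plan for that, as well as for addresses outside the subnet and duplicates. The subnet, pool range, and host names are invented for illustration and say nothing about WCG's actual network.

```python
from __future__ import annotations

import ipaddress


def check_static_plan(subnet: str, pool_start: str, pool_end: str,
                      hosts: dict[str, str]) -> list[str]:
    """Return problems found in a static address plan (an empty list means none)."""
    net = ipaddress.ip_network(subnet)
    start = ipaddress.ip_address(pool_start)
    end = ipaddress.ip_address(pool_end)
    problems: list[str] = []
    owners = {}  # address -> first host name that claimed it
    for name, text in hosts.items():
        addr = ipaddress.ip_address(text)
        if addr not in net:
            problems.append(f"{name}: {addr} is outside {net}")
        if start <= addr <= end:
            problems.append(f"{name}: {addr} is inside the DHCP pool")
        if addr in owners:
            problems.append(f"{name}: {addr} is already used by {owners[addr]}")
        else:
            owners[addr] = name
    return problems


if __name__ == "__main__":
    # Hypothetical plan: 10.0.0.100 to 10.0.0.199 is the dynamic DHCP pool.
    plan = {
        "router": "10.0.0.1",
        "load-balancer": "10.0.0.10",
        "web-server": "10.0.0.20",
        "db-server": "10.0.0.150",  # deliberately placed inside the pool
    }
    for problem in check_static_plan("10.0.0.0/24", "10.0.0.100", "10.0.0.199", plan):
        print(problem)  # db-server: 10.0.0.150 is inside the DHCP pool
```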
markfw
Cruncher Joined: Oct 13, 2016 Post Count: 22
When are the stats exports going to be working again?
alanb1951
Veteran Cruncher Joined: Jan 20, 2006 Post Count: 946
[Quoting hchc's post above]
Cheers - Al.
thunder7
Senior Cruncher Netherlands Joined: Mar 6, 2013 Post Count: 232
So does this mean the outage/maintenance that was planned for the 25th has already taken place, or were you unable to carry it out because the systems were down, so that it will still have to happen at some point in the future?
TPCBF
Master Cruncher USA Joined: Jan 2, 2011 Post Count: 1949
[Quoting alanb1951's post above]
Ralf