Not arguing root cause. And, not sure how WinOS rigs would take down a piece of backbone unless someone fucked up really bad.
We’re pushing a lot of traffic for research and analysis. Can’t go Chicago west. Routing is fucked. For those sessions, we’re starting on whatever network looks best at the moment and auto-restarting when we detect degradation.
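For anyone curious, the restart-on-degradation loop is nothing fancy; roughly the sketch below, where the network names, the session script (./run_analysis_session.sh), and the latency threshold are all made up for illustration, not our actual setup:

    import subprocess
    import time

    # Hypothetical sketch: start a session on whatever network looks best,
    # then kill and restart it when latency degrades past a threshold.
    # Network names, the session script, and the thresholds are invented.

    CANDIDATE_NETWORKS = ["primary-east", "backup-mpls", "lte-failover"]
    LATENCY_LIMIT_MS = 250          # restart the session above this sustained RTT
    CHECK_INTERVAL_S = 30

    def measure_latency_ms(network: str) -> float:
        """Ping a reference host and return the average RTT in ms.
        A real setup would bind to the given network's interface or use
        its own probes; here we just shell out to plain ping."""
        out = subprocess.run(["ping", "-c", "3", "8.8.8.8"],
                             capture_output=True, text=True, check=False)
        for line in out.stdout.splitlines():
            if "avg" in line:                    # e.g. "rtt min/avg/max/mdev = ..."
                return float(line.split("/")[4])
        return float("inf")                      # no reply: treat as fully degraded

    def pick_best_network() -> str:
        return min(CANDIDATE_NETWORKS, key=measure_latency_ms)

    def run_session_until_degraded() -> None:
        net = pick_best_network()
        session = subprocess.Popen(["./run_analysis_session.sh", net])  # hypothetical job
        while session.poll() is None:
            time.sleep(CHECK_INTERVAL_S)
            if measure_latency_ms(net) > LATENCY_LIMIT_MS:
                session.terminate()              # degradation detected: restart
                break

    if __name__ == "__main__":
        while True:
            run_session_until_degraded()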
Absolutely, we are entirely unsure of the real root cause too (until Microsoft releases an explanation).
We have pretty simple networking, but for a while internal vnet communication was really all over the place. That seems to have stabilised for us recently (UK south / west regions).
Though I started it, there are better venues for the nuance. People here want to know how it’ll affect their lives.
If your shift isn’t getting cut because a computer is broken then you’re fine.
No idea why someone downvoted you for this!
I’m either somewhat sociopathic or ahead of the pack in understanding and committing to the endeavor of social change in the United States. I sometimes get followed around and my posts downvoted randomly.
Exactly! We made sure to contact everyone who would travel to an office to tell them it’s at least a half day off and to check in at lunchtime. Emailed everyone WFH the same thing, and sent an SMS too just in case their computer is down.
But I (I assume you are in the same boat) was up much earlier to tackle this! Fun day.
I’ve got easy mode: On-call woke me up. We’re five techies and equity holders with one employee, an intern. We pushed a few buttons to effect plan B and enable plan C, and sent a medium priority notification by chat.
Then, we sent the intern to the office to watch the meters. If she fucks up, it’s not a problem until about Tuesday. So, we’re waiting to see whether she automates the task, uses the existing routing template to make sure she’s notified of wonkiness if she leaves the office, and asks permission. She’s reinventing the wheel and can then compare to our work. I thought I’d give her a couple of hours before checking in.
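If it helps, the automated version we’re hoping she lands on is roughly this; the meter endpoint, the chat webhook, and the throughput floor below are placeholders, not our real ones:

    import json
    import time
    import urllib.request

    # Rough sketch of the meter watch, automated. The meter endpoint, the chat
    # webhook, and the throughput floor are all placeholders.

    METER_URL = "http://monitor.internal/api/meters"        # hypothetical endpoint
    CHAT_WEBHOOK = "https://chat.example.com/hooks/oncall"  # hypothetical hook
    THROUGHPUT_FLOOR_MBPS = 400                             # below this = wonky
    POLL_INTERVAL_S = 60

    def read_meters() -> dict:
        with urllib.request.urlopen(METER_URL, timeout=10) as resp:
            return json.load(resp)

    def notify(message: str) -> None:
        """Push an alert through the existing chat routing so it reaches her
        whether or not she's still in the office."""
        body = json.dumps({"text": message}).encode()
        req = urllib.request.Request(CHAT_WEBHOOK, data=body,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req, timeout=10)

    if __name__ == "__main__":
        while True:
            meters = read_meters()
            westbound = meters.get("westbound_mbps", 0)
            if westbound < THROUGHPUT_FLOOR_MBPS:
                notify(f"Westbound throughput looks wonky: {westbound} Mbps")
            time.sleep(POLL_INTERVAL_S)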
Article has been updated with the root cause - CrowdStrike. The reason is simple: Azure has tons of Windows systems that are protected with CrowdStrike Falcon. CrowdStrike released a bad update that is causing boot loops on Windows machines, including Windows VM servers.
At a shallow glance at very limited data not collected for this purpose, it looks like we have a tier 1 failure, maybe Chicago westbound.
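For anyone wanting to do their own shallow glance: a couple of traceroutes toward westbound targets and eyeballing where the hops stop answering is about all this amounts to. The targets below are documentation addresses, not real ones:

    import subprocess

    # Quick-and-dirty path check: traceroute a few westbound targets and eyeball
    # where the hops stop answering. Targets are placeholder documentation IPs.

    WEST_TARGETS = ["198.51.100.10", "203.0.113.25"]

    def trace(target: str) -> None:
        print(f"--- {target} ---")
        # -n: numeric output (skip reverse DNS), -m 20: cap at 20 hops
        out = subprocess.run(["traceroute", "-n", "-m", "20", target],
                             capture_output=True, text=True, check=False)
        print(out.stdout)

    if __name__ == "__main__":
        for t in WEST_TARGETS:
            trace(t)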
I don’t have enough topical information or expertise to have a discussion about causality or truth. This is not the right venue and I’d just be an observer to whatever conversation was taking place there.
Microsoft, Azure, and CrowdStrike have all stated the root cause at this point. Furthermore, this tells me most of the Falcon sensor installs are configured badly, as we also use CrowdStrike and have ours set to “latest version - 1” to ensure this exact thing doesn’t happen.
Cool. But, routers don’t run MS and neither does my organization sitting on either side of the connection. So, right now I don’t give a flying fuck about what some assholes did to Windows or the root cause. I want my throughput west back on primary so I can keep my hoppers full. Right now it looks like some other assholes fucked up tier 1.
There aren’t any backbone outages being discussed right now. Many servers that run MANY services are on Windows, using CrowdStrike: flights, banks, entertainment (some of Netflix, for example).
The overall result: it looks like a backbone outage, but isn’t.
Thank you.
But, fuck. That means we screwed up primary design or someone broke the contract.
Gotta work today.