Delta CEO says CrowdStrike-Microsoft outage cost the airline $500 million

MicroWave@lemmy.world · 1 year ago


rekorse@lemmy.world · 1 year ago

First thank you for taking the time to type all of that out.

I think I follow your theory well enough, but (I know this is 2 weeks later, so I won’t look up any new information) I was under the impression Delta was an outlier in their response compared to other airlines.

And one point about redundancies. Why shouldn’t they consider a single operating system as a single point of failure? If all 6 servers in the multiple locations run Windows, and Windows fails, that’s awful, right? Can they not dual boot or have a second set of servers? I do this in my own home, but maybe that’s not something that scales well.

I’m interested if your opinion has changed now that there has been a bit of time to have some more data come out on it.

ricecake@sh.itjust.works · 1 year ago

You are correct that Delta was an outlier, but not with regard to the scale of the outage: their scheduling software was down far longer, and they handled a lot of the customer side of things significantly less well.

Generally, your protection against operating system issues is the aforementioned restriction on changes and how they go out.
If something is stable, you can expect it to remain stable unless something changes or random chance breaks something.
The operational cost of running multiple operating systems in production like you describe would be high. Typically software is only written to work on one platform, and while it can be modified to work on others, it’s usually a cost with no benefit outside of a consumer environment.
Different operating systems have different performance characteristics you need to factor in for load scaling, different security models, and different maintenance requirements.
Often, but not always, server administrators will focus on one OS, so adding more to the mix can mean people are rusty with whichever is your backup, which can be worse than just focusing on fixing the issue with the primary.
OS bugs are rare, and they usually manifest early or randomly. It’s why production deployments tend to use the OS as long as it’s supported: change means learning the new issues, and you’ve probably already encountered all the bullshit with what you’re currently using. That’s why Linux distros tend to have long-term support versions, and Windows Server tends to just get support for a long time with terrible documentation.

I’m a Linux guy, so defending Windows feels weird, and I want to include that I don’t think anyone should use it, particularly for a server, but the professional in me acknowledges that it’s a perfectly functional hammer.

As we’ve learned more, I’ve become more disparaging of Delta’s choice not to keep the scheduling system modernized in a way that could recover faster, and not to invest enough in making systems homogeneous across different airports. I still think that these issues are largely independent of their actual disaster recovery or resiliency plans.
Inevitably, the lawsuits will determine that the blame for the damage is split between the two of them. My bet is 70/30 CrowdStrike/Delta, since Delta can easily demonstrate that the issue was fundamentally caused by CrowdStrike and negatively impacted other airlines and businesses in general. Some was clearly Delta’s fault for failing to keep a system modernized enough to handle a massive shift like this; it would have been similarly disrupted by any outage with flight cancellations.

rekorse@lemmy.world · 1 year ago

Would you say that an OS forced-update type of error like this is so rare that Delta didn’t need to plan for it? If I understand you right, it’s not actually a problem that Delta used Windows for their servers, at least not to the point it would affect liability.

If Delta was the only airline that set up their infrastructure in this way, to the point it was markedly different from other companies, could CrowdStrike argue that Delta essentially didn’t protect itself at all?

I’m still having a lot of trouble figuring out how CrowdStrike would even assess a risk like this if the possible payment is based on how well a company recovers and how much income they lost.

I actually agree with your 70/30 split, but unless Delta paid more than the other airlines to justify the payout in damages, it’s still confusing to me how the amount CrowdStrike has to pay depends to some degree on Delta’s setup and restoration.

I think there’s just not a better way to handle this, and I’m searching for an answer that doesn’t exist.