What's the worst way you ever broke production?

RacerX@lemm.ee · 2 years ago

What's the worst way you ever broke production?

spaghetti_carbanana@krabb.org · edit-2 2 years ago

Worked for an MSP, we had a large storage array which was our cloud backup repository for all of our clients. It locked up and was doing this semi-regularly, so we decided to run an “OS reinstall”. Basically these things install the OS across all of the disks, on a separate partition to where the data lives. “OS Reinstall” clones the OS from the flash drive plugged into the mainboard back to all the disks and retains all configuration and data. “Factory default”, however, does not.

This array was particularly… special… In that you booted it up, held a paperclip into the reset pin, and the LEDs would flash a pattern to let you know you’re in the boot menu. You click the pin to move through the boot menu options, each time you click it the lights flash a different pattern to tell you which option is selected. First option was normal boot, second or third was OS reinstall, the very next option was factory default.

I head into the data centre. I had the manual, I watched those lights like a hawk and verified the “OS reinstall” LED flash pattern matched up, then I held the pin in for a few seconds to select the option.

All the disks lit up, away we go. 10 minutes pass. Nothing. Not responding on its interface. 15 minutes. 20 minutes, I start sweating. I plug directly into the NIC and head to the default IP filled with dread. It loads. I enter the default password, it works.

There staring back at me: “0B of 45TB used”.

Fuck.

This was in the days where 50M fibre was rare and most clients had 1-20M ADSL. Yes, asymmetric. We had to send guys out as far as 3 hour trips with portable hard disks to re-seed the backups over a painful 30ish days of re-ingesting them into the NAS.

The worst part? Years later I discovered that, completely undocumented, you can plug a VGA cable in and you get a text menu on the screen that shows you which option you have selected.

I (somehow) did not get fired.

Appoxo@lemmy.dbzer0.com · 2 years ago

You still remember so. That means you learned and probably won’t do it again.

GolfNovemberUniform@lemmy.ml · 2 years ago

Installed a flatpak app (can’t remember which one but it wasn’t obscure or shady) and smh it broke the file system on one of my main machines :) (at least I think that’s what happened because the machine started lagging, any app refused to launch and after a reboot I got an fsck error or something like that)

SorteKanin@feddit.dk · edit-2 2 years ago

Pretty run of the mill for me, so not that bad: Pushed a long-running migration during peak load hours that locked an important table for an extended period of time, effectively taking our site offline.
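One common way to avoid this kind of lock is to split the data change into small batches, so each transaction commits quickly and other queries can get through in between. Below is a minimal sketch of that idea using Python's built-in `sqlite3` as a stand-in. The `users` table, the `email_lower` column, and the batch size are all hypothetical; the comment above doesn't say which database or migration was involved, and real locking behaviour differs between engines.

```python
import sqlite3


def backfill_in_batches(conn, batch_size=1000):
    """Fill the (hypothetical) email_lower column in small committed batches.

    Returns the total number of rows updated.
    """
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET email_lower = lower(email) "
            "WHERE id IN ("
            "  SELECT id FROM users "
            "  WHERE email_lower IS NULL AND email IS NOT NULL "
            "  LIMIT ?"
            ")",
            (batch_size,),
        )
        conn.commit()  # end the transaction so locks are released between batches
        if cur.rowcount == 0:
            break  # nothing left to backfill
        total += cur.rowcount
    return total


def demo():
    """Build an in-memory table with 2500 rows that still need backfilling."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, email_lower TEXT)"
    )
    conn.executemany(
        "INSERT INTO users (email) VALUES (?)",
        [(f"User{i}@Example.com",) for i in range(2500)],
    )
    conn.commit()
    return conn
```

Calling `backfill_in_batches(demo())` walks the table in chunks of 1000 rows. On a real system you would probably also pause between batches and run it outside peak hours.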

Also consider !ask_experienced_devs@programming.dev :)

Quazatron@lemmy.world · 2 years ago

Did you know that “Terminate” is not an appropriate way to stop an AWS EC2 instance? I sure as hell didn’t.

Billegh@lemmy.world · 2 years ago

It doesn’t help that the webui used to hide stop. I think it still does.

Flax@feddit.uk · 2 years ago

Explain more?

Quazatron@lemmy.world · 2 years ago

Noob was told to change some parameters on an AWS EC2 instance, requiring a stop/start. Selected terminate instead, killing the instance.

Crappy company, running production infrastructure in AWS without giving proper training and securing a suitable backup process.

BestBouclettes@jlai.lu · 2 years ago

Apparently Terminate means stop and destroy. Definitely something to use with care.

tslnox@reddthat.com · 2 years ago

Maybe there should be some warning message… Maybe a question requiring you to manually type “yes I want it” or something.

synae[he/him]@lemmy.sdf.org · 2 years ago

Maybe an entire feature that disables it so you can’t do it accidentally, call it “termination protection” or something

ilinamorato@lemmy.world · 2 years ago

“Stop” is the AWS EC2 verb for shutting down a box, but leaving the configuration and storage alone. You do it for load balancing, or when you’re done testing or developing something for the day but you’ll need to go back to it tomorrow. To undo a Stop, you just do a Start, and it’s just like power cycling a computer.

“Terminate” is the AWS EC2 verb for shutting down a box, deleting the configuration and (usually) deleting the storage as well. It’s the “nuke it from orbit” option. You do it for temporary instances or instances with sensitive information that needs to go away. To undo a Terminate, you weep profusely and then manually rebuild everything; or, if you’re very, very lucky, you restore from backups (or an AMI).
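For anyone scripting this, here is a rough sketch of the two safer calls. It assumes `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`) that is passed in; the helper names `stop_instance` and `enable_termination_protection` are made up for illustration, and the instance ID is whatever you supply.

```python
def stop_instance(ec2, instance_id):
    """Stop: shut the box down but keep its configuration and storage.

    Reversible with ec2.start_instances(InstanceIds=[instance_id]).
    """
    return ec2.stop_instances(InstanceIds=[instance_id])


def enable_termination_protection(ec2, instance_id):
    """Turn on termination protection so terminate requests are rejected
    until the attribute is switched off again."""
    return ec2.modify_instance_attribute(
        InstanceId=instance_id,
        DisableApiTermination={"Value": True},
    )
```

Enabling protection first means a mistaken Terminate comes back as an error instead of a deleted instance.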

𝕱𝖎𝖗𝖊𝖜𝖎𝖙𝖈𝖍@lemmy.world · edit-2 2 years ago

Accidentally deleted an entire column in a police department’s evidence database 😬

Thankfully, it only contained filepaths that could be reconstructed via a script. But I was sweating 12+1 bullets.

SuperDuper@lemmy.world · 2 years ago

deleted an entire column in a police department’s evidence database

Based and ACAB-pilled

aksdb@lemmy.world · 2 years ago

And if you couldn’t reconstruct, you still had backups, right? … right?!

𝕱𝖎𝖗𝖊𝖜𝖎𝖙𝖈𝖍@lemmy.world · 2 years ago

Oh sweet summer child

FartsWithAnAccent@lemmy.world · 2 years ago

What the fuck is a “backups”?

z00s@lemmy.world · 2 years ago

He’s the guy that sits next to fuckups

BestBouclettes@jlai.lu · 2 years ago

I was still a wee IT technician, and I was supposed to remove some cables from a patch panel. I pulled at least two cables that were used for iSCSI from the hypervisors to the storage bays. During production hours. Not my proudest memory.

shyguyblue@lemmy.world · 2 years ago

Updated WordPress…

Previous Web Dev had a whole mess of code inside the theme that was deprecated between WP versions.

Fuck WordPress for static sites…

-RJ-@lemmy.world · 2 years ago

Plugged a server in after it had been repaired, because the person whose responsibility it was insisted it would be fine. They hadn't released the FSMO roles from it and its clock was an hour out, so it changed the time EVERYWHERE and broke ALL THE THINGS. Not technically my fault, but I should have pushed harder for them to have demoted it before I turned it back on.

Futs@lemmy.world · 2 years ago

Advertised an OS deployment to the ‘All Workstations’ collection by mistake. I only realized after 30 minutes when people’s workstations started rebooting. Worked right through the night recovering and restoring about 200 machines.

CyanFen@lemmy.one · 2 years ago

Flushed the entire AD not realizing I somehow got back into prod

TheMadIrishman@sh.itjust.works · 2 years ago

Was troubleshooting a failed drive in a RAID array on a small business DC/File Serv/Print/Everything else box. The replaced drive still showed failed. Moved it to another bay thinking it was the slot, not the drive. Accidentally hit yes when asked to initialize the array. Blew the whole thing away. It was an OLD server the customer was working on replacing, so I told them it finally gave up the ghost and I was taking it back to the office to keep working on it. I had been on the job for about 4 months and thought for SURE I was fired. Turns out we were already working on moving them to the cloud, so it ended up not being a big deal.

Monkey With A Shell@lemmy.socdojo.com · 2 years ago

Found out the hard way to triple check your work when adding a new line to the proxy policy. Or, more accurately 2 lines when you only planned one, and that second one defaulted to a ‘deny all’ and resulted in dropping all web traffic out for the company…

That made for a REAL tense meeting the next day after it got deployed and people started asking WTF happened…
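A cheap pre-deploy check would have caught this kind of thing: diff the old and new policy, count the added lines, and flag anything that looks like a blanket deny. The sketch below is only an illustration. The one-rule-per-line text format and the `deny all` syntax are invented, since the comment doesn't say which proxy or policy language was in use.

```python
import difflib


def added_lines(old, new):
    """Return the lines that appear in the new policy but not the old one."""
    return [line[2:] for line in difflib.ndiff(old, new) if line.startswith("+ ")]


def check_change(old, new, expected_additions):
    """Return a list of problems found in the change; empty means it looks sane."""
    added = added_lines(old, new)
    problems = []
    if len(added) != expected_additions:
        problems.append(
            f"expected {expected_additions} new line(s), found {len(added)}"
        )
    for line in added:
        # Hypothetical syntax: a rule of the form "deny all" blocks everything.
        if line.split()[:2] == ["deny", "all"]:
            problems.append(f"blanket deny added: {line!r}")
    return problems
```

Run it with the number of lines you meant to add; any non-empty result is a reason to stop and look again before deploying.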

Tarkcanis@lemmy.world · 2 years ago

Forgot to turn the commercial power back on after testing the battery backups… oopsie.

FaceDeer@kbin.social · 2 years ago

It wasn’t “worst” in terms of how much time it wasted, but the worst in terms of how tricky it was to figure out. I submitted a change list that worked on my machine as well as 90% of the build farm and most other dev and QA machines, but threw a baffling linker error on the remaining 10%. It turned out that the change worked fine on any machine that used to have a particular old version of Visual Studio installed on it, even though we no longer used that version and had phased it out for a newer one. The code I had written depended on a library that was no longer in current VS installs but got left behind when uninstalling the old one. So only very new computers were hitting that, mostly belonging to newer hires who were least equipped to figure out what was going on.

tslnox@reddthat.com · 2 years ago

That reminds me of when some of my former colleagues and I were at a training on programming an industrial camera system that judges the quality of produced parts. I’m not really a programmer, just a guy who can troubleshoot and google stuff and occasionally hack together some simple code with heavy help from Google too.

The guy was a German (we are Czech and we communicated in English) programmer who coded the whole thing in Omron software but he also wrote his own plugin for it. All was well when he was showing us on the big screen, but when he sent us the program file so we could experiment on it (changing parameters, adding steps to the flow…) the app would crash. I finally delved into the app logs and with the help of Google I found it was because he compiled his plugin with debug flags and it worked for him because he had the VS debug DLLs installed but we didn’t.

karmiclychee@sh.itjust.works · 2 years ago

I feel a repressed memory or two stirring 😐

slazer2au@lemmy.world · 2 years ago

I took down an ISP for a couple hours because I forgot the ‘add’ keyword at the end of a Cisco configuration line

sloppy_diffuser@sh.itjust.works · 2 years ago

That’s a rite of passage for anyone working on Cisco’s shit TUI. At least it’s gotten better with some of the newer stuff. IOS-XR supported commits and diffing.