Fess up. You know it was you.
One time I was deleting a user from our MySQL-backed RADIUS database.
DELETE FROM PASSWORDS;
And yeah, if you don’t have a WHERE clause? It just deletes everything. About 60,000 records for a decent-sized ISP.
That afternoon really, really sucked. We had only ad-hoc backups. It was not a well-run business.
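One guard worth knowing for exactly this failure mode: MySQL has a safe-updates mode that rejects UPDATE and DELETE statements lacking a key-based WHERE clause (or a LIMIT). A minimal sketch, borrowing the table name from the story; the `username` column is an assumption, since the real schema isn't shown:

```sql
-- Refuse UPDATE/DELETE statements that have no key-based WHERE
-- clause (or LIMIT) for the rest of this session:
SET SQL_SAFE_UPDATES = 1;

-- Now the classic blunder fails with error 1175 instead of
-- emptying the table:
DELETE FROM PASSWORDS;

-- A WHERE clause on a key column still works as expected:
DELETE FROM PASSWORDS WHERE username = 'jdoe';
```

The mysql command-line client can also be started with `--safe-updates` (memorably aliased `--i-am-a-dummy`), which turns this on automatically for the session.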
Now when I interview sysadmins (or these days devops), I always ask about their worst cock-up. It tells you a lot about a candidate.
I always put the where clause first since a fuck up in my early 20s lost a loans company £40k of business.
My trick is writing it as a SELECT statement first, making sure it’s returning the right number of records, and then switching out the SELECT for DELETE. Hasn’t steered me wrong yet.
This.
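Spelled out, the SELECT-first habit looks like this (table and column names are invented for illustration):

```sql
-- Step 1: write the dangerous statement as a SELECT and run it.
SELECT * FROM passwords WHERE customer_id = 1042;

-- Step 2: eyeball the result. One row, the right user? Good.
-- Step 3: swap only the verb, leaving the WHERE clause untouched:
DELETE FROM passwords WHERE customer_id = 1042;
```

The point of editing the verb in place, rather than retyping the statement, is that the WHERE clause you verified is exactly the one that runs.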
The hero we don’t deserve.
Always skeptical of people that don’t own up to mistakes. Would much rather they own it and speak to what they learned.
Exactly!
It’s difficult because you have a 50/50 chance of getting a manager who doesn’t respect mistakes and will immediately get you fired for them (to the best of their abilities), versus one who considers such a mistake to be very expensive training.
I simply can’t blame people for self-defense. I interned at a ‘non-profit’ where there had apparently been a revolving door of employees fired for making entirely reasonable mistakes. Looking back at it a dozen years later, it’s no surprise that nobody was getting anything done in that environment.
Incredibly short-sighted, especially for a nonprofit. You just spent some huge amount of time and money training a person to never make that mistake again, why would you throw that investment away?
This is what I was told when I started work: if you make a mistake, just admit to it. They most likely won’t punish you for it if it wasn’t out of pure negligence.
BEGIN TRAN
ROLLBACK TRAN
This. My comment was going to be “what kind of maniac uses auto commit?”
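For engines where autocommit is off (or the work is wrapped explicitly), the same idea as a transaction, in the T-SQL-style syntax of the comment above; the table and column names are invented:

```sql
BEGIN TRAN;

DELETE FROM passwords WHERE customer_id = 1042;

-- Check the "rows affected" count before going any further.
-- 1 row, as expected? Make it permanent:
COMMIT TRAN;

-- 60,000 rows? Breathe, and undo everything instead:
-- ROLLBACK TRAN;
```

The transaction buys you one free look at the damage before it becomes real; it does nothing for you once you’ve committed, which is why the SELECT-first habit is still worth keeping alongside it.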
I worked for a company where the testing database was also the only backup.
I was a sysadmin in the US Air Force for 20 years. One of my assignments was working at the headquarters for AFCENT (Air Forces Central Command), which oversees every deployed base in the middle east. Specifically, I worked on a tier 3 help desk, solving problems that the help desks at deployed bases couldn’t figure out.
Normally, we got our issues in tickets forwarded to us from the individual base’s Communications Squadron (the IT squadron at a base). But one day, we got a call from the commander of a base’s Comm Sq. Apparently, every user account on the base had disappeared and he needed our help restoring accounts!
The first thing we did was dig through server logs to determine what caused it. No sense fixing it if an automated process was the cause and would just undo our work, right?
We found one Technical Sergeant logged in who had run a command to delete every single user account in the directory tree. We sought him out and he claimed he was trying to remove one individual, but accidentally selected the tree instead of the individual. It just so happened to be the base’s tree, not an individual office or squadron.
As his rank implies, he’s supposed to be the technical expert in his field. But this guy was an idiot who shouldn’t have been touching user accounts in the first place. Managing user accounts is an Airman’s job: a simple task given to our lowest-ranking members as they’re learning how to be sysadmins. And he couldn’t even do that.
It was a very large base. It took 3 days to recover all accounts from backup. The Technical Sergeant had his admin privileges revoked and spent the rest of his deployment sitting in a corner, doing administrative paperwork.
Installed a Flatpak app (can’t remember which one, but it wasn’t obscure or shady) and smh it broke the file system on one of my main machines :) (At least I think that’s what happened: the machine started lagging, every app refused to launch, and after a reboot I got an fsck error or something like that.)
Pretty run of the mill for me, so not that bad: Pushed a long-running migration during peak load hours that locked an important table for an extended period of time, effectively taking our site offline.
Also consider !ask_experienced_devs@programming.dev :)
Did you know that “Terminate” is not an appropriate way to stop an AWS EC2 instance? I sure as hell didn’t.
Explain more?
Apparently Terminate means stop and destroy. Definitely something to use with care.
Maybe there should be some warning message… Maybe a question requiring you to manually type “yes I want it” or something.
Maybe an entire feature that disables it so you can’t do it accidentally, call it “termination protection” or something
“Stop” is the AWS EC2 verb for shutting down a box, but leaving the configuration and storage alone. You do it for load balancing, or when you’re done testing or developing something for the day but you’ll need to go back to it tomorrow. To undo a Stop, you just do a Start, and it’s just like power cycling a computer.
“Terminate” is the AWS EC2 verb for shutting down a box, deleting the configuration and (usually) deleting the storage as well. It’s the “nuke it from orbit” option. You do it for temporary instances or instances with sensitive information that needs to go away. To undo a Terminate, you weep profusely and then manually rebuild everything; or, if you’re very, very lucky, you restore from backups (or an AMI).
Noob was told to change some parameters on an AWS EC2 instance, requiring a stop/start. Selected terminate instead, killing the instance.
Crappy company, running production infrastructure in AWS without giving proper training and securing a suitable backup process.
It doesn’t help that the web UI used to hide Stop. I think it still does.
Accidentally deleted an entire column in a police department’s evidence database 😬
Thankfully, it only contained filepaths that could be reconstructed via a script. But I was sweating 12+1 bullets.
deleted an entire column in a police department’s evidence database
Based and ACAB-pilled
And if you couldn’t reconstruct, you still had backups, right? … right?!
Oh sweet summer child
What the fuck is a “backups”?
He’s the guy that sits next to fuckups
I was still a wee IT technician and was supposed to remove some cables from a patch panel. I pulled at least two cables that were carrying iSCSI traffic from the hypervisors to the storage bays. During production hours. Not my proudest memory.
Updated WordPress…
Previous Web Dev had a whole mess of code inside the theme that was deprecated between WP versions.
Fuck WordPress for static sites…
Plugged a server in after it had been repaired because the person whose responsibility it was insisted it would be fine. They hadn’t released the FSMO roles from it, its clock was an hour out, and it changed the time EVERYWHERE and broke ALL THE THINGS. Not technically my fault, but I should have pushed harder for them to demote it before I turned it back on.
Advertised an OS deployment to the ‘All Workstations’ collection by mistake. I only realized after 30 minutes, when people’s workstations started rebooting. Worked right through the night recovering and restoring about 200 machines.
Flushed the entire AD, not realizing I had somehow gotten back into prod.
Was troubleshooting a failed drive in a RAID array on a small-business DC/file/print/everything-else box. The replacement drive still showed as failed. Moved it to another bay, thinking it was the slot and not the drive. Accidentally hit yes when asked to initialize the array. Blew the whole thing away. It was an OLD server the customer was already working on replacing, so I told them it had finally given up the ghost and I was taking it back to the office to keep working on it. I had been on the job for about 4 months and thought for SURE I was fired. Turns out we were already working on moving them to the cloud, so it ended up not being a big deal.
Found out the hard way to triple-check your work when adding a new line to the proxy policy. Or, more accurately, adding 2 lines when you only planned one, where the second one defaulted to ‘deny all’ and dropped all outbound web traffic for the company…
That made for a REAL tense meeting the next day after it got deployed and people started asking WTF happened…
Forgot to turn the commercial power back on after testing the battery backups… oopsie.
It wasn’t “worst” in terms of how much time it wasted, but the worst in terms of how tricky it was to figure out. I submitted a change list that worked on my machine as well as 90% of the build farm and most other dev and QA machines, but threw a baffling linker error on the remaining 10%. It turned out that the change worked fine on any machine that used to have a particular old version of Visual Studio installed on it, even though we no longer used that version and had phased it out for a newer one. The code I had written depended on a library that was no longer in current VS installs but got left behind when uninstalling the old one. So only very new computers were hitting that, mostly belonging to newer hires who were least equipped to figure out what was going on.
I feel a repressed memory or two stirring 😐
That reminds me of when some of my former colleagues and I were at a training on programming an industrial camera system that judges the quality of produced parts. I’m not really a programmer, just a guy who can troubleshoot and google stuff and occasionally hack together simple code, with heavy help from Google too.
The instructor was a German programmer (we are Czech, and we communicated in English) who had coded the whole thing in Omron software, but he had also written his own plugin for it. All was well when he was demonstrating on the big screen, but when he sent us the program file so we could experiment with it (changing parameters, adding steps to the flow…), the app would crash. I finally delved into the app logs and, with the help of Google, found it was because he had compiled his plugin with debug flags: it worked for him because he had the VS debug DLLs installed, but we didn’t.
I took down an ISP for a couple of hours because I forgot the ‘add’ keyword in a Cisco configuration line.
That’s a rite of passage for anyone working on Cisco’s shit TUI. At least it’s gotten better with some of the newer stuff. IOS-XR supports commits and diffing.
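For anyone who hasn’t been bitten yet, the classic place this happens is the allowed-VLAN list on a trunk port (an assumption here; the original comment doesn’t name the exact command). Without `add`, IOS replaces the whole list instead of appending to it:

```
! Intent: also allow VLAN 30 on this trunk.

! This REPLACES the allowed list with just VLAN 30, dropping
! every other VLAN on the trunk:
switchport trunk allowed vlan 30

! This APPENDS VLAN 30 to the existing list:
switchport trunk allowed vlan add 30
```

Run the first form on the uplink trunk of a busy switch and everything not in VLAN 30 goes dark at once, which is a very fast way to take an ISP down for a couple of hours.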