
What to do if you fu*k up and break production code?

Shantnu Tiwari

Aug 20, 2021 • 6 min read

Question: So you come into work one day, and find everyone staring at you. Most people are avoiding your eyes, and you hear the CEO ask “Who’s this [insert your name] guy? Bring him to me when he turns up.”
You sit down at your computer, and find that the one line of code you changed broke every single test in the code base. And an email with your name has been sent to every manager in the company. And then you look down and see you forgot to wear trousers to work. What would you do?

Before I give you the answer, I will make a few bad jokes and pad the word count. If you want a hint at the answer, it involves growing (or shaving) a moustache and moving to Mexico.

Thanks to Sanjay Sharma for asking this question.

Don’t ask me – I write perfect code

My first question is: Why are you asking me? Do I look like the sort of man who would break code? I write perfect code, being the Picasso of Programming (PoP). Anything that flows from my fingers is a work of art. In fact, the first time I saw a debugger was after ten years of programming.

Anyway.

First, a true story. It involves my evil twin brother, Bhrantu. Bhrantu looks just like me, except he has a mustache and wears a weird Mexican hat.

Shan-mooch

He even looks like an evil twin

Unlike me, he breaks code all the time.

When Bhrantu broke every test in the whole company

So my evil twin brother, Bhrantu, who is totally not me (pinky swear), made a one-line change to the code. He ran the pre-commit and post-commit tests, and they all passed. So he was happy and checked the code in.

Next morning, he came in to work humming “Girls just wanna have fun” (as all men do, from time to time). He felt a few people staring at him. When he checked the email, he found out why.

Every single one of the fifty overnight tests had failed, and Bhrantu’s name was attached to each. So all the managers and most of the engineers received an email like:

Test 0 failed. Last checkin: Bhrantu.

Test 1 failed. Last checkin: Bhrantu.

…

Test 49 failed. Last checkin: Bhrantu.

Oops.

Luckily, Bhrantu is a master of escape. He changed his name to Jose, grew a mustache, wore a fake Mexican hat, and escaped the country. He is still being followed by assassins from the test team. No one knows where he is.

And that’s one way to deal with a screw up. The actual thing I, sorry, Bhrantu did was, he walked into the team meeting and said “It was me!”

It was Jim Carrey all along...

What to do when your screw up breaks production code

In the example above, the problem only affected the overnight test system. In a good company, it should be physically impossible for one person to break the production code, even if that person took an axe and smashed all the servers (because you have backups, right?).

So what do you do?

The first thing is to accept responsibility. Calmly explain what happened, and try to fix it. If it is someone else’s fault, let them and your bosses know, but without blaming or being bitchy. So don’t do this:

JJ checked in this code before me, and he’s the one who broke it! He should have asked me if he was going to re-architect my design.

Do this:

JJ checked in some changes before me. Unfortunately, his changes didn’t gel with my architecture. I’m going to discuss with him how we can best move this forward.

In the second example, you aren’t moaning or bitching. Instead, you are trying to solve the actual problem.

It’s the system’s fault, not your fault

I don’t mean this in a hippy way. “Hey man, the police and courts are like, so corrupt man! Spread the love, man!”

Big fuck-ups are almost always caused by bad management decisions, not programmers. Don’t believe me? Here are a few examples:

1. You check in a change, and it breaks the production server. Customers complain, and your boss says you have to stay the weekend to fix the problem.

But: Why was your code allowed to go near the production server anyway? It should have run on a test server first. A team of testers should then have tested things like UX, which can’t be easily scripted.

Test machines are cheap: just buy a $300 machine and load Ubuntu on it (I don’t know how Windows guys would do it). Add extra RAM and SSDs if you want speed. Add an automated script that runs the tests on the server every time you commit anything. Developing automated tests isn’t that hard. There are automated solutions like Jenkins and Travis CI that you can use as the infrastructure.

2. You have pre-commit tests, but your change still breaks the overnight system

This is what happened to Bhrantu. Basically, the pre-commit tests didn’t test a flag that the overnight tests did. And the change that broke all the tests was one which affected this flag.

So again, add that flag to your local tests. Again, this is a system problem. Screw ups will happen, just fix them as soon as you find out.

3. You have no tests. That’s what the customer is for!

Well, good luck. I admire your courage.

See what I mean that it’s always a system problem?
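Case 2 is the easiest one to check for mechanically. Here is a minimal sketch, assuming you can list the flags each test suite runs with. The function name and the flag names are made up for illustration; they are not from any real test system.

```python
def missing_flags(precommit_flags, overnight_flags):
    """Return the flags the overnight suite uses but the pre-commit suite never does."""
    return sorted(set(overnight_flags) - set(precommit_flags))


# Hypothetical flag names, for illustration only.
precommit = {"--fast", "--unit"}
overnight = {"--fast", "--unit", "--legacy-mode"}

gaps = missing_flags(precommit, overnight)
if gaps:
    print("Pre-commit tests never exercise:", ", ".join(gaps))
```

If the list is not empty, your local tests have a blind spot that the overnight run will happily find for you, with your name attached.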

So what do you do when the fecal matter hits the revolving air circulation device?

You need to have a good talk with your team and boss, and you need to figure out which part of the system broke, and how you can stop it from breaking again. Fixing the immediate problem is not enough. Unless you fix the underlying problem, the screw up will happen again and again. It’s like papering over a broken wall. You aren’t fooling anyone.

You need to convince management to buy you proper testing hardware. This is what you can say:

“Hey boss. I spent more than four hours fixing this problem. So you not only lost money on the time I worked on it, but you also lost a huge opportunity, because the time I spent fixing this mess would have been better spent adding features to our code.”

This last sentence is important. Companies can lose millions if they don’t have one tiny feature the customer absolutely needs. Not only that, if you are developing software on contract, your customer can sue you for non-delivery or late delivery. Trust me, I say this from personal experience. There is nothing like a multi-million dollar suit to convince management to get off their asses and fix structural problems in testing. Say this to your boss: “Would you rather fix this problem once our biggest customer leaves, because we saved pennies by not testing?”

Just buying hardware isn’t enough. You then need to develop tests for it. And this is where most managers will balk. Writing automated tests takes time. But ask your boss this: “Would you rather spend $50,000 on tests now, or $500,000 on legal fees when we get sued for shipping broken code to customers?”

Put it in an email if the boss isn’t convinced, and CC everyone important. If the company still won’t do anything, well, tough. That’s why I said at the start: This is almost always a management problem.

Trying to cut costs by cutting testing is like trying to lose weight by cutting your arm off. It works in the short term, but kills you in the long term.

Automated testing for you

No matter where you work, it should not be physically possible to put code from your machine to the production server. Depending on how big you are, you need to do the following:

1. Big company

You should have dedicated test machines, and dedicated test engineers. I plan to cover this in a future blog, but if you can’t afford test engineers, hire finance and MBA students. Before the recession, thousands entered this field hoping to make a good buck, and are now crying. They usually are good with computers, and with some training, can be taught to script basic tests. Like monkeys, only cheaper.

2. Medium/small company

You may not be able to afford a full-time test team (although, think again. What’s the biggest penalty for not shipping on time? If it’s in the six figures, you can afford a tester. Simple business decision even an MBA can understand).

At the very minimum, have proper servers that automatically run tests when anything is checked in. I keep repeating this, but the automated part is important.

You should never be surprised by your test results. If you are, the system is broken. Of course, tests will fail, but they should be fairly predictable. You should never have a situation where your whole website is down because someone didn’t realize that clicking a button on the web browser kills a Linux driver and causes a segfault (ask me how I know that. Actually, don’t ask, as I start crying every time I tell that story).

3. Very small / Freelancer

At the bare minimum, you need to run your own tests on your own PC, if you can’t afford a server. You may have to pay a small amount, but it’s much cheaper than spending hours of your time debugging. Wouldn’t you rather be playing Doom, or whatever it is kids are playing nowadays?

To summarise: If your code breaks the production server, this shows a serious design fault and management failure. It is not your fault, nor the fault of any other engineer. This doesn’t mean you are not responsible. As engineers, it’s our job to guide management on these issues. Who do you think understands more about integration testing: a fifty-year-old head of product development, or a twenty-something engineer?

So let management know, and always, always fix the underlying issue.
