The Art and Science of Debugging Software Systems
Debugging is the process of diagnosing why a software system fails to behave correctly. As with medical diagnostics, finding the cause of an illness is often the hard part, while offering the treatment requires much less work; there can be many reasons why a patient is suffering from headaches, but once you realize they are constantly dehydrated, the cure is straightforward (“please drink more water, sir.”).
Of course, the state of medical science is never complete. We still do not understand why many things happen, and for some of the things we do understand, there is no known cure. Luckily, software systems are way simpler than the human body, so our chances of actually understanding and fixing troubling issues aren’t too bad.
Over the years I have found that a disciplined and structured approach is essential to effectively debugging even simple systems. The approach that is pretty much synonymous with “structured and disciplined inquiry” is the scientific method, and in this post I hope to show how and why it is especially relevant to debugging software. Much has been written on the scientific method, so allow me to briefly summarize. The scientific method is a rigorous process that we can use to expand our knowledge. The steps are:
- Define the question. Explicitly state what we want to answer. Example: Why are our users constantly getting an error message when trying to upload new images to our site on weekends?
- Form a hypothesis. Come up with a testable model that might be the answer to our question. Example: At a certain request rate, our connection pool to the database is saturated, causing requests to time out, which returns HTTP 502 errors to the client.
- Design an experiment. Figure out a way to reproducibly and quantitatively test whether our hypothesis is correct. Example: Instrument RPS, error rate, and in-flight DB connections; deploy to a staging environment and run a stress test.
- Run & Measure. Run the test a few times and collect the resulting data. Example: Collect the instrumentation data into a spreadsheet.
- Analyze. Check if the data aligns with our hypothesis. Example: Looking for a correlation between error rate (%), RPS, and in-flight DB connections, we see the DB connections increase with RPS, then flatline at the DB connection pool size, at which point 502 errors increase.
- Conclude or form a new hypothesis. Either conclude our experiment 🙌 or go back to the drawing board to find a new hypothesis. Example: Our measurements seem to confirm our hypothesis 💪, we are done here!
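The "Analyze" step from the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Sample` type, its field names, the pool size of 20, and every number in `SAMPLES` are made up for the example and are not measurements from a real system.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    rps: int              # requests per second during the measurement window
    in_flight_conns: int  # DB connections in use during the window
    error_rate: float     # fraction of requests that returned a 502


def saturation_point(samples, pool_size):
    """Return the lowest RPS at which the pool was fully used, or None."""
    saturated = [s.rps for s in samples if s.in_flight_conns >= pool_size]
    return min(saturated) if saturated else None


def supports_hypothesis(samples, pool_size, error_threshold=0.01):
    """True if errors stay low below saturation and exceed the threshold from there on."""
    point = saturation_point(samples, pool_size)
    if point is None:
        return False  # the pool never saturated, so it cannot explain the errors
    below = [s for s in samples if s.rps < point]
    at_or_above = [s for s in samples if s.rps >= point]
    return (all(s.error_rate <= error_threshold for s in below)
            and all(s.error_rate > error_threshold for s in at_or_above))


# Hypothetical stress-test results, one sample per load level.
SAMPLES = [
    Sample(rps=50, in_flight_conns=5, error_rate=0.0),
    Sample(rps=100, in_flight_conns=11, error_rate=0.0),
    Sample(rps=200, in_flight_conns=20, error_rate=0.04),
    Sample(rps=300, in_flight_conns=20, error_rate=0.22),
]

print(saturation_point(SAMPLES, pool_size=20))
print(supports_hypothesis(SAMPLES, pool_size=20))
```

Note that a check like this can only tell us the data is consistent with the hypothesis. If the errors had started well before the pool filled up, it would have told us to go looking for a different explanation.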
Getting Better at Debugging
As we’ve discussed, scientific inquiry is an iterative process of coming up with hypotheses, disproving them, and replacing them with better and better hypotheses that get closer to explaining the phenomenon we are trying to understand.
Thinking about debugging this way, we realize that we can only get stuck in one of a few ways:
- We ran out of hypotheses to test — after disproving everything we previously generated.
- We can’t think of a way to test a hypothesis that is practical, close enough to the real production system, and feasibly measurable.
- We can’t conclusively analyze the results of an experiment we ran.
Debugging is often considered a kind of “black art”, some secret intuition that only a small selection of software engineers possess that enables them to glare into the suffering soul of a failing software system and glean insight into what is making it crash.
Thinking about debugging as an application of the scientific method allows us to see it for what it is: a method, a technique, a set of skills that you can acquire and become good at. Let’s try to consider what skills are required for being a good debugger:
1. Defining the question
Clearly defining what we are trying to understand is crucial to conducting a successful inquiry. The key skill here is separating what you know from what you assume, which is very easy to get wrong. It is far better to ask “Why does the client see an error message when trying to change their phone number?” than to ask “Why did the latest release break the update phone number flow on the server?” before we are sure that the client is behaving correctly and that the problem first appeared after the last release. More than once, it was only while formulating the question and validating my assumptions that I realized there was, in fact, no problem at all.
2. Forming a hypothesis
Coming up with possible explanations for the phenomenon we are trying to explain is the first step toward gaining any knowledge. The key skill is being able to generate an exhaustive list of ways in which our system can fail in the way we are observing. To do this, we need to develop a mental model of the system, its different components, and its dependencies.
The higher the fidelity of our mental model, the more modes of failure we can come up with! This means accepting and embracing Joel Spolsky’s Law of Leaky Abstractions: “All non-trivial abstractions, to some degree, are leaky.” Our “connection” to a database is an object in memory (subject to race conditions and memory management bugs), a file descriptor (an OS-level resource that can be exhausted), and a TCP connection to a remote server (subject to firewall rules and network instability).
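To make those three layers concrete, here is a small Python sketch that looks at a single connection from each angle. It assumes a throwaway listener on localhost as a stand-in for the database server, so that the example is self-contained; a real database driver wraps the same layers behind its own API.

```python
import socket

# A throwaway local listener stands in for the database server (an assumption
# made only to keep the example self-contained).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen(1)
server_addr = server.getsockname()

conn = socket.create_connection(server_addr)

# Layer 1: an ordinary object in our process's memory.
as_object = repr(conn)
# Layer 2: a file descriptor, an OS resource with a finite per-process limit.
as_fd = conn.fileno()
# Layer 3: a TCP connection identified by local and remote (address, port) pairs.
local_addr = conn.getsockname()
remote_addr = conn.getpeername()

print(as_object, as_fd, local_addr, remote_addr, sep="\n")

conn.close()
# The Python object still exists after close(), but the descriptor under it is
# gone: fileno() now reports -1.
closed_fd = conn.fileno()
server.close()
```

Each layer fails in its own way: the object can be shared unsafely between threads, the descriptor can leak until the process hits its limit, and the TCP connection can be dropped by something in the network that our code never sees.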
We use abstractions because we must, but to understand faulty software systems, we must reverse that process and try to see the systems in their concreteness, because in all likelihood it is the misinformed assumptions about how the abstraction is implemented that are causing us grief.
The topic of forming hypotheses deserves its own post, but a few things are worth mentioning in this context. It is generally a good idea to immerse yourself in some data (logs, metric dashboards, and exception stack traces, for example), after which you will hopefully have some initial ideas of what to investigate. However, be aware of the “Streetlight effect”, a type of cognitive bias where you only search in the places where it is easy to look. Logs and dashboards are oftentimes scar tissue from previous problems (that were already resolved), so it is quite possible that the data most relevant to understanding the issue you are facing is not readily available. Rolling up our sleeves and digging into layers of the stack that we are not experienced and comfortable with can be scary, but fear not! You’ve come this far in your career by learning new things, and this hairy bug you’re dealing with is an opportunity to keep doing just that.
3. Designing and running experiments
Once we have a hypothesis, we need to come up with a way to test it. For example:
Hypothesis: processes running in our us-west-1 Kubernetes cluster are unable to create a network connection to our Redis instance in that region.
Experiment: open a shell on one of these machines and run nc -zv <redis ip> 6379. If we are able to connect, great! Hypothesis invalidated, we can move on to the next one.
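When nc is not installed on the machine, or when we want to run the same check from a script, roughly the same experiment can be written in Python. This is a minimal sketch: `can_connect` is a name made up for this post, and the demo below runs against a throwaway local listener rather than a real Redis instance.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, DNS failure, ...
        return False


# Demo against a local listener (a stand-in for the Redis port).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

reachable_while_listening = can_connect("127.0.0.1", port)
listener.close()
reachable_after_close = can_connect("127.0.0.1", port)

print(reachable_while_listening, reachable_after_close)
```

In the real experiment we would call can_connect with the Redis IP and port 6379 from inside the cluster, since the whole point is to test connectivity from where the failing processes run.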
There are two important skills in play here. The first is learning to use all of the tools of the trade: network probing tools, OS profiling tools, application profilers, debuggers, application logging and instrumentation, vendor dashboards, and observability tooling. The more tools we know how to use, the more experiments we can come up with.
The second skill is designing experiments that test no more and no less than the hypothesis you are testing. In the example above, if we conclude from the fact that we managed to create a network connection to our Redis instance that our application is able to read data from it and write data to it, we may be misleading ourselves. Being able to create a network connection to the remote Redis port is a prerequisite, but it is not a sufficient condition (you could have a misconfigured client, insufficient access permissions, or even a bug in the client or server that will prevent you from doing what you want).
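One way to keep ourselves honest is to have the experiment report exactly how far it got. The sketch below separates "the port accepts connections" from "the server answers a PING", using the fact that Redis accepts plain-text inline commands. `probe_redis` and `fake_server` are names made up for this post, and the two servers in the demo are local fakes that simulate a healthy reply and an error reply; they are not real Redis instances.

```python
import socket
import threading


def probe_redis(host, port, timeout=3.0):
    """Classify how far we get: 'unreachable', 'connected-only', or 'ping-ok'."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"
    with conn:
        try:
            conn.sendall(b"PING\r\n")  # inline command, no client library needed
            reply = conn.recv(64)
        except OSError:
            reply = b""
    return "ping-ok" if reply.startswith(b"+PONG") else "connected-only"


def fake_server(reply):
    """Start a one-shot local TCP server that answers one request with `reply`."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port = srv.getsockname()[1]

    def serve():
        client, _ = srv.accept()
        with client:
            client.recv(64)
            client.sendall(reply)
        srv.close()

    threading.Thread(target=serve, daemon=True).start()
    return port


# Simulated replies: a healthy server, and one that rejects the command.
healthy = probe_redis("127.0.0.1", fake_server(b"+PONG\r\n"))
rejected = probe_redis("127.0.0.1", fake_server(b"-NOAUTH Authentication required.\r\n"))

print(healthy, rejected)
```

Even a successful PING only proves that something speaking the Redis protocol answered. Whether our application can read and write the keys it needs is yet another hypothesis, and it deserves its own experiment.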
4. Analyzing results
Once we run our experiments and collect result data, we need to analyze it to check whether it validates our hypothesis or debunks it. Some important skills here are working effectively with spreadsheets, understanding stack traces, reading monitoring charts, and applying some basic statistics.
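As a small example of the kind of basic statistics that helps here, the sketch below summarizes a set of request latencies. The numbers are invented for illustration, and `summarize` is a name made up for this post. The point is that one summary number can hide the very thing we are looking for: the median looks perfectly healthy, while the 95th percentile exposes the slow tail.

```python
import statistics


def summarize(latencies_ms):
    """Return the mean, median, and 95th percentile of a list of latencies."""
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%; the last one is p95.
    cuts = statistics.quantiles(latencies_ms, n=20)
    return {
        "mean": statistics.mean(latencies_ms),
        "median": statistics.median(latencies_ms),
        "p95": cuts[-1],
    }


# Hypothetical measurements: most requests are fast, a few are very slow.
latencies_ms = [10] * 18 + [12, 500]

summary = summarize(latencies_ms)
print(summary)
```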
TL;DR: “The 3 Rules of Debugging”
Thinking about debugging as an application of the scientific method is a useful thought experiment as we’ve seen, but I have yet to see an engineer pull out a lab notebook and start spelling out the formal steps of inquiry while trying to put out a production fire. To end with something practical, I usually present engineers with this advice:
To effectively debug an issue you should:
- Reproduce the issue in an automated way (with a strong preference for a local invocation, although some issues only present themselves in cloud environments, for example). By reproducing the bug, you prove that it exists. Take it from someone who has spent many long days chasing bugs that don’t: this is a very important thing to do. If you cannot reproduce it, your chances of being able to measure what’s going on in the system as it is failing are very low. Worse, unless you can reliably reproduce an issue, you cannot know if any change you apply to the system fixes the issue or not.
- Instrument the 💩 out of the app — logging, more logging, profile, more logging, debuggers — until you can see the problem happening with your eyes.
- Fix: make a single change to the system and rerun your issue-reproducing automation. If the problem persists, make another change, and so on.
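The reproduce-then-fix loop can be as small as the following sketch. Everything in it is hypothetical: `buggy_upload` stands in for whatever operation is failing (here it deterministically fails on every fifth call), and `fixed_upload` stands in for the same operation after a candidate change.

```python
import itertools


def failure_rate(repro, runs=50):
    """Run `repro` repeatedly and return the fraction of runs that raised."""
    failures = 0
    for _ in range(runs):
        try:
            repro()
        except Exception:
            failures += 1
    return failures / runs


_calls = itertools.count(1)


def buggy_upload():
    """Hypothetical stand-in for the failing operation: every 5th call fails."""
    if next(_calls) % 5 == 0:
        raise TimeoutError("simulated 502 from upstream")


def fixed_upload():
    """Hypothetical stand-in for the same operation after a candidate fix."""
    return "ok"


before = failure_rate(buggy_upload, runs=50)
after = failure_rate(fixed_upload, runs=50)
print(before, after)
```

Measuring a failure rate over many runs, rather than running the reproduction once, matters for intermittent bugs: a single passing run after a change tells us very little when the bug only showed up in one run out of five to begin with.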
Debugging is one of the most important skills a software engineer can possess. Being able to diagnose why a complex system is failing at a critical time will make you an invaluable member of your team. I hope I have managed to convince you in this post that it is not a dark art but rather a collection of pretty mundane skills that can be practiced and learned. When applied together in the correct order, these skills allow us to reliably gain a better understanding of why a system is failing in a certain way.
Originally published at https://rotemtam.com on December 28, 2020.