The Art and Science of Debugging Software Systems

  1. Define the question. State explicitly what we want to answer. Example: Why do our users keep getting an error message when they try to upload new images to our site on weekends?
  2. Form a hypothesis. Come up with a testable model that might answer our question. Example: At a certain request rate, our database connection pool is saturated, causing requests to time out, which returns HTTP 502 errors to the client.
  3. Design an experiment. Figure out a way to test reproducibly and quantitatively whether our hypothesis is correct. Example: Instrument RPS, error rate, and in-flight DB connections; deploy to a staging environment and run a stress test.
  4. Run and measure. Run the test a few times and collect the resulting data. Example: Collect the instrumentation data in a spreadsheet.
  5. Analyze. Check whether the data aligns with our hypothesis. Example: Looking for a correlation between error rate (%), RPS, and in-flight DB connections, we see the DB connections increase with RPS, then flatline at the DB connection pool size, at which point 502 errors increase.
  6. Conclude or form a new hypothesis. Either conclude our experiment 🙌 or go back to the drawing board to find a new hypothesis. Example: Our measurements seem to confirm our hypothesis 💪, so we are done here!
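
The analysis in step 5 can be sketched in a few lines of Python instead of a spreadsheet. This is only an illustration: the pool size and every sample value below are made up for the example, not measurements from a real system, and the names `POOL_SIZE`, `samples`, and `split_by_saturation` are hypothetical.

```python
from statistics import mean

POOL_SIZE = 20  # hypothetical connection pool size (assumption)

# Hypothetical samples, invented for illustration only:
# (requests per second, in-flight DB connections, error rate in %)
samples = [
    (50, 5, 0.0),
    (100, 10, 0.0),
    (150, 15, 0.1),
    (200, 20, 2.5),
    (250, 20, 9.0),
    (300, 20, 17.5),
]


def split_by_saturation(samples, pool_size):
    """Split samples into those below the pool limit and those at the limit."""
    below = [s for s in samples if s[1] < pool_size]
    saturated = [s for s in samples if s[1] >= pool_size]
    return below, saturated


def mean_error_rate(samples):
    """Average error rate (%) over a group of samples; 0.0 for an empty group."""
    return mean(s[2] for s in samples) if samples else 0.0


below, saturated = split_by_saturation(samples, POOL_SIZE)
summary = {
    "mean_error_below": mean_error_rate(below),
    "mean_error_saturated": mean_error_rate(saturated),
    # Lowest request rate at which the pool was already full.
    "first_saturated_rps": min(s[0] for s in saturated) if saturated else None,
}
print(summary)
```

If the hypothesis holds, the mean error rate in the saturated group should be clearly higher than in the group below the pool limit. If the two groups look alike, the data does not support the hypothesis and it is time for a new one.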

Getting Better at Debugging

Photo by Joshua Aragon on Unsplash
  • We ran out of hypotheses to test after disproving every one we had generated.
  • We can’t think of a way to test a hypothesis that is practical, close enough to the real production system, and feasibly measurable.
  • We can’t conclusively analyze the results of an experiment we ran.
  1. Defining the question

TL;DR: “The 3 Rules of Debugging”

  1. Reproduce the issue in an automated way. Strongly prefer a local invocation, although some issues only show up in cloud environments, for example. By reproducing the bug you prove that it exists, and take it from someone who has spent many long days chasing bugs that don't: this matters. If you cannot reproduce it, your chances of measuring what is going on in the system as it fails are very low. Worse, unless you can reliably reproduce an issue, you cannot know whether any change you apply to the system fixes it.
  2. Instrument the 💩 out of the app: logging, more logging, profiling, more logging, debuggers, until you can see the problem happening with your own eyes.
  3. Fix. Make a single change to the system and run your issue-reproducing automation. If the problem persists, make another change, and so on.
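
As a rough sketch of how the three rules fit together, the Python below reproduces a pool-exhaustion failure against a toy model, logs what happens, and reruns the same reproduction after a single change. Everything here is an assumption made for illustration: `ConnectionPool`, `handle_burst`, and `reproduce` are hypothetical names, and the toy pool stands in for whatever real system is being debugged.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("repro")


class ConnectionPool:
    """Toy stand-in for a database connection pool (hypothetical)."""

    def __init__(self, size):
        self.size = size
        self.in_flight = 0

    def acquire(self):
        if self.in_flight >= self.size:
            raise TimeoutError("connection pool exhausted")
        self.in_flight += 1

    def release(self):
        self.in_flight -= 1


def handle_burst(pool, concurrent_requests):
    """Simulate a burst of simultaneous requests; return how many failed."""
    failures = 0
    acquired = 0
    for request_id in range(concurrent_requests):
        try:
            pool.acquire()
            acquired += 1
        except TimeoutError as exc:
            failures += 1
            # Rule 2: log enough context to see the problem as it happens.
            log.info("request=%d failed: %s (in_flight=%d)",
                     request_id, exc, pool.in_flight)
    for _ in range(acquired):
        pool.release()
    return failures


def reproduce(pool_size, concurrent_requests=30):
    """Rule 1: an automated, repeatable check. True means the bug showed up."""
    failures = handle_burst(ConnectionPool(pool_size), concurrent_requests)
    log.info("pool_size=%d requests=%d failures=%d",
             pool_size, concurrent_requests, failures)
    return failures > 0


# Rule 3: change one thing at a time and rerun the same reproduction.
bug_before = reproduce(pool_size=20)
bug_after = reproduce(pool_size=40)
print(bug_before, bug_after)
```

Because `reproduce` returns a plain boolean, the same function can later serve as a regression test once the fix is in place.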

Wrapping up


