The quest for simple and robust signals in production systems
It is easier today than ever to generate large volumes of observability data (metrics, logs, and traces) from your applications and forward it to a central location where it can be queried efficiently. Even so, systems can still be hard to reason about. An abundance of application-specific data does not guarantee that an on-call engineer or operator will know which queries yield meaningful insight into the current state of the system.
This is why frameworks like Google’s “Four Golden Signals” or Weaveworks’ RED method were created: to reduce the cognitive load on humans trying to understand systems they are not closely familiar with. The RED method, for example, prescribes three high-level signals that should be observed for each application: Request Rate, Error Rate, and Request Duration. This is a great idea, which I like very much, but it does fall short, as noted by its author, Tom Wilkie:
“It is fair to say this method only works for request-driven services — it breaks down for batch-oriented or streaming services for instance. It is also not all-encompassing.”
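To make the three RED signals concrete, here is a minimal sketch in Python that computes them from a batch of request records collected over a fixed time window. The `Request` record, the `red_signals` function, the choice to count only 5xx responses as errors, and the use of the 50th and 95th percentiles to summarize duration are all my own assumptions for illustration; the RED method itself does not prescribe them.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One handled request (hypothetical record shape, for illustration only)."""
    duration_s: float  # how long the request took, in seconds
    status: int        # HTTP-style status code


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p * len(sorted_values)))
    return sorted_values[rank - 1]


def red_signals(requests, window_s):
    """Summarize the requests observed during a window of window_s seconds."""
    total = len(requests)
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    return {
        "request_rate": total / window_s,                # requests per second
        "error_rate": errors / total if total else 0.0,  # fraction of failed requests
        "duration_p50": percentile(durations, 0.50) if durations else None,
        "duration_p95": percentile(durations, 0.95) if durations else None,
    }


if __name__ == "__main__":
    sample = [
        Request(0.1, 200),
        Request(0.2, 200),
        Request(0.3, 500),
        Request(0.4, 404),
    ]
    print(red_signals(sample, window_s=60))
```

Notice that every field in this sketch assumes there is a request to measure. A batch job or a stream consumer has no obvious equivalent of a request, which is exactly the limitation described in the quote above.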
In this article, I want to present two ideas we came up with at Nexar (where I work) that, when combined, might be used to create flexible frameworks covering any kind of service: the service taxonomy and platform-level metrics (PLMs).
The service taxonomy
Taxonomies are systematic classifications of things. Grouping things into distinct classes helps us look at their commonalities in the abstract instead of on an individual basis. When developing software, the abstractions we choose can make or break a system. Choose an abstraction that is too coarse, and each interaction with it requires work to peek under the hood before you can get anything useful done; choose an abstraction that is too fine, and you’re stuck with a system that’s inflexible and hard to test and evolve. But a good abstraction is magical: at once your system seems simple, and component boundaries just fall into place; the important stuff is crystal clear, and the gory implementation details are confined to their place. Having a good classification of the services in your company…