Monitoring and Observability


I am talking at a meetup in a few weeks. My topic: Overcoming Metric Fatigue with Artificial Intelligence. The meetup is about Software Observability and Continuous Updates. Now that I am preparing my talk, I am taking a deep dive into observability and continuous updates. What do those terms mean beyond the buzzwords?

Software Observability

So first off, we are talking about software, not hardware, which means log statements (and logs), exceptions, stack traces, and events.

But what is observability? Alongside observability we have the related concepts of monitoring and instrumentation. Let’s split this up.

Instrumentation is for business intelligence. The goal here is to understand how people use the system and to extract additional information from this usage in order to improve the software, create new products, or add new processes to the pipeline (like altering a sales pipeline). This is perhaps a problem to be solved in the data warehouse or OLAP system, with statistical methods and visual displays. We want to capture as much information as possible and then run analysis on it.

Monitoring is for understanding the health of a system. Here the goal is to capture the most salient information so that a human can act on it. We imagine a one-to-one correlation between an alert in monitoring and something like a PagerDuty alert. Rather than logging an exception, monitoring is about looking for events and for certain ranges and thresholds (like 10 events a minute, or an event with a certain value). We can engage in whitebox monitoring, where an alert is triggered by monitoring the internals of a system, or blackbox monitoring, where an alert is triggered by monitoring the external interfaces of a system (such as user interaction or systems integrations).
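As a rough illustration of the "10 events a minute" kind of threshold, here is a minimal sketch in Python. The class name `RateThresholdMonitor` and its parameters are made up for this post and do not come from any particular monitoring tool. It keeps a sliding window of event timestamps and reports when the count in the window reaches the limit.

```python
from collections import deque


class RateThresholdMonitor:
    """Hypothetical monitor: fires when `max_events` land within `window_seconds`."""

    def __init__(self, max_events=10, window_seconds=60.0):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one event (timestamp in seconds); return True if we should alert."""
        self._timestamps.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while timestamp - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.max_events


monitor = RateThresholdMonitor(max_events=10, window_seconds=60.0)
# Ten events in ten seconds: only the tenth one crosses the threshold.
alerts = [monitor.record(float(t)) for t in range(10)]
```

In a real setup the `True` result would be what pages a human. Everything below the threshold stays quiet, which is the point of monitoring as opposed to instrumentation.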

Instrumentation can help refine the decision about what to monitor.


Observability is related to understanding how a system is behaving internally.  

To reiterate: monitoring is about looking for threshold events and immediately actionable alerts, instrumentation is about logging information to help decisions (about product, pipelines, and monitoring), and observability is about understanding how the internal pieces of a system work together (or rather don’t work together as expected).

The idea is that if an event is triggered via our monitoring software, then we can use tools in the observability camp to track down and understand what is happening, and then fix it. Observability is the ability to tie a piece of data in instrumentation or monitoring back to the software code itself.

Observability is also about context. It is about examining the state of a system as it compares to its recent (or ancient) history. Whereas monitoring is about an event or a thing, observability is about ranges. When observing a system, I want to see how the software is working within its historical context. When an alert is triggered via monitoring, we also want to observe what is happening within context (either in real time or in log files) in order to understand what is happening and to apply an appropriate fix later.
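To make "within its historical context" a bit more concrete, here is a small sketch, again in Python. The function name `describe_in_context` and the sample latency numbers are invented for illustration. It places a current reading against the range and spread of past readings, rather than against a single fixed threshold.

```python
import statistics


def describe_in_context(history, current):
    """Hypothetical helper: place `current` within the range of past readings."""
    low, high = min(history), max(history)
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    # How many standard deviations the current reading is from the historical mean.
    z_score = (current - mean) / stdev if stdev > 0 else 0.0
    return {
        "low": low,
        "high": high,
        "mean": mean,
        "within_range": low <= current <= high,
        "z_score": z_score,
    }


# Made-up request latencies in milliseconds, followed by a suspicious new reading.
past_latencies = [100, 110, 90, 105, 95]
context = describe_in_context(past_latencies, 180)
```

The output is a description rather than an alert. It is the kind of thing I would want to look at after a monitor has already fired.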

Metrics and Monitoring

Where do metrics come in? Metrics are the things that we build our alerts around. What we are monitoring are metrics. Any time I look at a systems diagram, wherever I see a line I see an opportunity for a metric. The number of lines keeps growing as we dive into the internals of a system. There is a great old blog post by Etsy about measurement. It talks about how to measure everything. Measure everything, though, and you end up with metric fatigue.

For instrumentation we want to measure everything; for monitoring we only want to measure the most salient things; for observability we want to monitor things that a human brain can realistically comprehend.

This is the subject of my talk: how we can separate out the different types of metrics. This is a notion of worlding or worldview; we can also call it phase transition or granularity. Just like we can examine a mountain by looking at its ranges, its individual rocks, or the atoms that make up the rocks, we can also look at instrumentation, observability, and monitoring as different worldviews of our systems.

There are many different ways of setting up observability in a system, as in this detailed but probably out-of-date document from Twitter. There are also many different ways to integrate it, as in this Medium post.

My talk is about adding AI into the mix: how to use AI to determine which metrics to observe. Once you do this, though, you need to add additional metrics and observation to gain intuition into how the AI algorithm is making its decisions. It’s turtles all the way down.
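I have not settled on an algorithm here, so take the following only as a toy stand-in for "let the machine pick which metrics to observe". It is plain statistics, not AI: it drops any metric that is almost perfectly correlated with one we have already decided to keep. The function names, the 0.95 cutoff, and the sample series are all assumptions made up for this post.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def prune_redundant(metrics, threshold=0.95):
    """Keep a metric only if it is not near-duplicated by one already kept."""
    kept = []
    for name, series in metrics.items():
        if all(abs(pearson(series, metrics[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up series: bytes_out moves in lockstep with requests, errors does not.
sample = {
    "requests": [1, 2, 3, 4, 5],
    "bytes_out": [10, 20, 30, 40, 50],
    "errors": [0, 1, 0, 1, 0],
}
to_observe = prune_redundant(sample)
```

Even in this tiny version the turtles show up: to trust the pruning I would want to see the correlation values behind each decision, which is one more set of numbers to look at.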

Maybe next I’ll write about continuous updates.