Achieving Reliable Observability Part 1 – Making Cloud-Native Observability More Robust
I was having a conversation with a CxO-level customer as part of an AIOps/observability workshop, and from what I could tell, many organizations are confused about how to properly operationalize cloud-native production environments – especially the monitoring/observability portion. Here is how the conversation went.
“Andy, we are thinking about getting [vendor] to use for our observability solution based on your recent research. What do you think?”
“Well, I don’t want to endorse any specific vendor, as they are all good at what they do. But let’s talk about what you want to do, and what they can do for you, so you can figure out whether or not they are the right fit for you.” The conversation continued for a while, but the last piece is worthy of being called out specifically.
“So, we will be running our production microservices in AWS in the ____ region. And we are planning to use this particular observability provider to monitor our Kubernetes clusters.”
“Couple of items to discuss. First, you realize that this particular provider you are speaking of also runs in the same region of the same cloud provider as yours, right?”
“We didn’t know that. Is that going to be a problem?”
“Not particularly. However, you may get into a ‘circular dependency’ situation.”
“What is that?”
“Well, as an enterprise architect, I always call for separation of duties as a best practice. For example, having your developers test their own code is a bad idea, and so is having them figure out deployment on their own. The same applies when your production services run in the same region as your monitoring software: how would you know about a production outage if the cloud region takes a hit, and your observability solution goes down at the same time your production services do?”
“Should we dump them and go get this other solution instead?”
“No, I am not saying that. Figure out what you are trying to achieve and have a plan for it. Selection of an observability tool should fit your overall strategy.”
For those unfamiliar with the issue raised in the conversation above, here is why this scenario could be a problem.
Coming from an enterprise architecture background, we were taught, as a best practice, to operationalize production systems so as to avoid circular dependencies. This means not having two services depend on each other, and not colocating monitoring, governance and compliance systems with the production systems themselves. If you were to monitor your production system, you would do it from a separate, isolated sub-system (server, data center rack, subnet, etc.) so that if your production system goes down, the monitoring system doesn’t go down with it. The same goes for public cloud regions – although it’s unlikely, individual regions and services do experience outages. If your production infrastructure runs on the same services in the same region as your SaaS monitoring provider, not only will you not know that your production systems are down, but you also won’t have the data to analyze what went wrong. The whole idea behind a good observability system is to quickly know when things went bad, what went wrong, where the problem is and why it happened, so you can quickly fix it. You can check out this blog where I explain this in detail.
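To make the failure-domain argument concrete, here is a minimal sketch (my own illustration – the service, cloud and region names are made up) that flags a monitoring system sharing a cloud and region with the very services it watches:

```python
from typing import Dict, List, Tuple

# service -> (cloud, region) it runs in; purely hypothetical placements.
PLACEMENT: Dict[str, Tuple[str, str]] = {
    "checkout-api":  ("aws", "us-east-1"),
    "orders-db":     ("aws", "us-east-1"),
    "observability": ("aws", "us-east-1"),   # same failure domain!
}

# monitor -> services it is responsible for watching
WATCHES: Dict[str, List[str]] = {
    "observability": ["checkout-api", "orders-db"],
}


def shared_failure_domains(placement, watches):
    """Return (monitor, service) pairs where both sit in the same
    cloud *and* region -- i.e., one regional outage takes out both
    the watched service and its watcher."""
    risky = []
    for monitor, targets in watches.items():
        for svc in targets:
            if placement[monitor] == placement[svc]:
                risky.append((monitor, svc))
    return risky


risky_pairs = shared_failure_domains(PLACEMENT, WATCHES)
print(risky_pairs)  # both watched services share a region with the monitor
```

A check like this is trivial to run against an inventory/CMDB export, and it catches exactly the circular-dependency trap described above before an outage does.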
The best practice would be to either:
- Consider an observability solution that runs in a different region from your production workloads. Better yet, consider one that runs on a different cloud provider altogether. Although exceptionally rare, there have been cloud service outages that crossed regions; the chances of two cloud providers going down at the same time, however, are slim. Remember what is at stake: if the cloud region goes down (a region-wide outage is quite possible, and seems to be more frequent of late), your observability system goes down with it. You wouldn’t even know your production servers are down in order to switch to your backup systems, unless you have a “hot” production backup. Not only will your customers find out about your outage before you do, but you won’t even be able to initiate your playbook, because you aren’t aware that your production servers are down.
- Consider having your observability solution in a different location/region, yet still close enough to your production services that latency stays very low. Most cloud providers operate multiple regions in close geographic proximity, so it is easy to find one.
- Another option is to get a solution that gives you deployment flexibility. For example, there are a couple of observability solutions that let you deploy them in any cloud and observe your production systems from anywhere – both in the cloud and on-premises.
- You can also consider sending the monitoring data from your instrumentation to two observability locations, though that will cost you slightly more; alternatively, ask what the vendor’s business continuity/disaster recovery plans are. While some assume dual-sending costs much more, I disagree, for a couple of reasons. First, monitoring is mainly time-series metric data, so the volume, and the cost to transport it, is not as high as for logs or traces. Second, unless your observability provider is VPC-peered with you, chances are your data is already routed over the internet even when you are both hosted in the same cloud provider, so a second destination adds little extra cost. Having observability data about your production system at all times is critical during an outage.
- A very commonly overlooked consideration is monitoring your full-stack observability system itself. While it is preferable to have a monitoring instance in every region where your production services run, that may not be feasible, whether for cost or manageability reasons. On such occasions, monitor the monitoring system. You could do synthetic monitoring by checking the monitoring API endpoints (or by pushing synthetic data and verifying it arrives) to make sure your monitoring system is properly watching your production system. Better yet, find a monitoring vendor that will do this on your behalf.
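The dual-destination idea above can be sketched in a few lines. This is a simplified, hypothetical in-process fan-out – real deployments would typically do this in an agent or collector rather than application code, and the backends here are stand-in functions:

```python
import json
import time
from typing import Callable, List

# Hypothetical sender: takes a serialized metric payload and returns
# True on successful delivery to one backend.
Sender = Callable[[str], bool]


class DualShipper:
    """Fan each metric datapoint out to every configured backend.

    A failure to reach one backend must not block the others -- that
    independence is the whole point of dual-writing.
    """

    def __init__(self, senders: List[Sender]):
        self.senders = senders

    def ship(self, name: str, value: float) -> List[bool]:
        payload = json.dumps({"metric": name, "value": value, "ts": time.time()})
        results = []
        for send in self.senders:
            try:
                results.append(send(payload))
            except Exception:
                # One backend being down is exactly the scenario we are
                # protecting against; record the failure and keep going.
                results.append(False)
        return results


# Demo with two stand-in backends: the "primary" region is up,
# the "secondary" region is unreachable.
primary_buffer: List[str] = []

def primary(payload: str) -> bool:
    primary_buffer.append(payload)
    return True

def secondary(payload: str) -> bool:
    raise ConnectionError("secondary region unreachable")

shipper = DualShipper([primary, secondary])
results = shipper.ship("cpu.utilization", 0.42)
print(results)  # the datapoint still reached the primary backend
```

The key design choice is that delivery attempts are isolated from one another: losing one destination degrades redundancy, not visibility.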
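The “monitor the monitoring system” check can likewise be sketched as a freshness test on a synthetic canary metric. The probe and thresholds here are hypothetical stand-ins for a vendor’s query API:

```python
import time
from typing import Callable, Optional

# Hypothetical probe: returns the epoch timestamp of the last datapoint
# the monitoring system ingested for our canary metric. In a real setup
# this would call the vendor's query API over HTTPS.
Probe = Callable[[], float]


def monitor_is_healthy(probe: Probe, max_staleness_s: float,
                       now: Optional[float] = None) -> bool:
    """Return True if the monitoring system recently ingested the
    synthetic canary metric we keep pushing into it."""
    now = time.time() if now is None else now
    try:
        last_seen = probe()
    except Exception:
        # Can't even reach the monitoring API -- treat as unhealthy.
        return False
    return (now - last_seen) <= max_staleness_s


# Simulated probes: one fresh, one stale, one unreachable.
NOW = 1_000_000.0
fresh = lambda: NOW - 30          # canary last seen 30 seconds ago
stale = lambda: NOW - 600         # canary last seen 10 minutes ago
def down() -> float:
    raise TimeoutError("monitoring API not responding")

print(monitor_is_healthy(fresh, max_staleness_s=120, now=NOW))
print(monitor_is_healthy(stale, max_staleness_s=120, now=NOW))
print(monitor_is_healthy(down, max_staleness_s=120, now=NOW))
```

Run from a location outside the monitored failure domain (a different region, or even a laptop cron job), a check like this answers the one question the monitoring system cannot answer about itself: is it still watching?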
When/if you get the dreaded 2 a.m. call, what is your plan of action? Just think it through thoroughly before it happens and have a playbook ready, so you won’t have to panic in a crisis.