Observability Lessons Learned From the AWS East-1 Outage
Achieve Reliable Observability: Bolster Cloud-Native Observability
The recent AWS US-EAST-1 outage is a catalyst for customers to rapidly reassess their AIOps and observability capabilities, especially the monitoring portion. In particular, be aware of situations like the one AWS itself described during the incident, and be prepared for them. According to AWS, “We are seeing an impact on multiple AWS APIs in the US-EAST-1 Region. This issue is also affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates. We have identified the root cause and are actively working towards recovery.”
If you haven’t seen it already, AWS has posted a detailed analysis of what happened: Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region.
In recent conversations with CXOs, there appears to be great confusion about how to properly operationalize cloud-native production environments. Here is how a typical conversation goes.
CXO: “Andy, we are thinking about using [vendor] for our observability solution. What vendors do you think we should shortlist?”
AT: “Well, I don’t want to endorse any specific vendor, as they are all good at what they do. But let’s talk about what you want to do and what they can do for you, so you can figure out whether they are the right fit.” The conversation continued for a while, but the last piece is worth calling out specifically.
CXO: “So, we will be running our production microservices in AWS in the ____ region. And we are planning to use this particular observability provider to monitor our Kubernetes clusters.”
AT: “A couple of items to discuss. First, you realize that this particular provider you are speaking of also runs in the same region of the same cloud provider as yours, right?”
CXO: “We didn’t know that. Is that going to be a problem?”
AT: “Definitely. You may get into a ‘circular dependency’ situation.”
CXO: “What is that?”
AT: “Well, from my enterprise architect’s perspective, we often recommend a separation of duties as a best practice. For example, having your developers test their own code is a bad idea, and so is having them figure out deployments on their own. The same applies when your production services run in the same region as your monitoring software: how would you know about a production outage if the cloud region takes a hit and your observability solution goes down at the same time your production services do?”
CXO: “Should we dump them and go get this other solution instead?”
AT: “No, I am not saying that. Figure out what you are trying to achieve and have a plan for it. Selection of an observability tool should fit your overall strategy.”
Always Avoid Circular Dependencies
Enterprise architects often recommend avoiding circular dependencies as a best practice. For instance, this means not having two services depend on each other, and not collocating monitoring, governance and compliance systems with the production systems themselves. If you monitor your production system, do it from a separate, isolated sub-system (server, data center rack, subnet, etc.) so that if the production system goes down, the monitoring system doesn’t go down with it.
The same goes for public cloud regions: although it’s unlikely, individual regions and services do experience outages. If your production infrastructure runs on the same services in the same region as your SaaS monitoring provider, not only will the enterprise be unaware that its production systems are down, it also won’t have the data to analyze what went wrong. The whole idea behind a good observability system is to quickly know when things went bad, what went wrong, where the problem is and why it happened, so you can quickly fix it. For more details, check out this blog post.
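To make this separation concrete, below is a minimal sketch of an out-of-band watchdog that could run from a different region or cloud than production. It probes both a production health endpoint and the observability backend, so a failure of either one is still detected from an independent failure domain. The URLs, alert hook and probe interval are illustrative assumptions, not any particular vendor’s API.

```python
# Minimal out-of-band watchdog sketch (all URLs are hypothetical placeholders).
# Run this from a region or cloud that is isolated from the production stack,
# so a regional outage cannot take down production and its watchdog together.
import time
import urllib.request

TARGETS = {
    "production": "https://prod.example.com/healthz",
    "observability": "https://monitoring.example.net/api/health",
}

def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with an HTTP 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

def page_oncall(name: str, url: str) -> None:
    """Placeholder alert hook; wire this to email, SMS or a paging service."""
    print(f"ALERT: {name} ({url}) is unreachable from the watchdog location")

if __name__ == "__main__":
    while True:
        for name, url in TARGETS.items():
            if not is_up(url):
                page_oncall(name, url)
        time.sleep(60)  # probe once a minute
```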
Apply These Five Best Practices Before The Next Cloud Outage
When (or if) you receive the dreaded 2 a.m. call, what will the organization’s plan of action be? Think it through thoroughly before it happens and have a playbook ready, so you won’t have to panic in a crisis. Here are five best practices based on hundreds of client interactions:
- Place your observability solution outside your production workloads or cloud. Consider an observability solution that runs in a different region than your production workloads. Better yet, consider one that runs on a different cloud provider altogether: although exceptionally rare, there have been cloud service outages that cross regions, and the chances of two cloud providers going down at the same time are slim. If your observability solution lives in the same region as production and that region goes down (a region-wide cloud outage is quite possible, and seems to be happening more frequently of late), your observability systems will be down too. You wouldn’t even know your production servers are down in order to switch to your backup systems, unless you have a “hot” production backup. Not only will your customers find out about your outage before you do, you won’t even be able to initiate your playbook, because you aren’t aware that your production servers are down.
- Keep the observability solution physically near production systems to minimize latency. Consider having your observability solution in a different region, cloud provider or location, yet still close enough to your production services that latency stays very low. Most cloud providers have regions in close geographic proximity to one another, so it is usually easy to find a suitable location.
- Deploy both on-premises and in-the-cloud options. For example, there are a couple of observability solutions that can be deployed in any cloud and observe your production systems from anywhere, both in the cloud and on-premises.
- Build redundancy. Consider sending the monitoring data from your instrumentation to two observability locations (see the dual-write sketch after this list), which will cost slightly more, or at least ask what the vendor’s business continuity/disaster recovery plans are. While some think the extra cost would be much higher, I disagree, for a couple of reasons. First, monitoring is mainly time-series metric data, so the volume and the cost to transport it are not as high as for logs or traces. Second, unless your observability provider is VPC-peered with you, chances are your data is already routed over the internet even when you are both hosted in the same cloud provider, so shipping to a second location adds little. Having observability data about your production system at all times is critical during outages.
- Monitor your full-stack observability system. While it is preferable to have a monitoring instance in every region where your production services run, that may not be feasible because of cost or manageability. In such cases, monitor the monitoring system itself. You can do synthetic monitoring by checking the monitoring API endpoints (or by injecting test data and verifying it arrives) to make sure your monitoring system is properly watching your production system; a minimal sketch of such a check follows this list. Better yet, find a monitoring vendor that will do this on your behalf.
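As a rough illustration of the dual-write idea in the “Build redundancy” item above, the sketch below fans the same metric sample out to two independent ingest endpoints. The endpoint URLs and the payload shape are assumptions for illustration, not any specific vendor’s API; real agents and collectors typically support multiple export targets natively.

```python
# Sketch of dual-writing one metric sample to two observability backends.
# Endpoints and payload format are hypothetical; adapt to your vendor's API.
import json
import time
import urllib.request

# Two independent ingest endpoints, ideally in different regions or clouds.
BACKENDS = [
    "https://ingest.primary-observability.example.com/v1/metrics",
    "https://ingest.secondary-observability.example.net/v1/metrics",
]

def ship_metric(name: str, value: float) -> None:
    """Send one time-series sample to every configured backend."""
    payload = json.dumps({
        "metric": name,
        "value": value,
        "timestamp": int(time.time()),
    }).encode("utf-8")
    for url in BACKENDS:
        req = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"}
        )
        try:
            urllib.request.urlopen(req, timeout=5)
        except Exception as exc:
            # A failed backend must not block delivery to the other one.
            print(f"failed to ship to {url}: {exc}")

ship_metric("checkout.latency_ms", 42.0)
```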
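And for the “monitor the monitoring system” practice, a minimal synthetic check can write a known canary value into the observability system and immediately query it back. Again, the ingest and query endpoints, parameters and response shape here are hypothetical placeholders for whatever API your vendor actually exposes.

```python
# Minimal "monitor the monitor" round-trip check (hypothetical endpoints/format).
# Write a canary sample, query it back, and alert if the round trip fails.
import json
import time
import urllib.request

WRITE_URL = "https://monitoring.example.net/api/ingest"  # assumed ingest API
QUERY_URL = "https://monitoring.example.net/api/query"   # assumed query API

def synthetic_roundtrip() -> bool:
    """Return True if a freshly written canary sample can be read back."""
    canary = time.time()  # unique-enough value for this probe
    sample = json.dumps({"metric": "watchdog.canary", "value": canary}).encode()
    req = urllib.request.Request(
        WRITE_URL, data=sample, headers={"Content-Type": "application/json"}
    )
    try:
        urllib.request.urlopen(req, timeout=5)
        time.sleep(10)  # give the pipeline a moment to ingest the sample
        with urllib.request.urlopen(f"{QUERY_URL}?metric=watchdog.canary", timeout=5) as resp:
            points = json.loads(resp.read())  # assumed: a JSON list of samples
        return any(p.get("value") == canary for p in points)
    except Exception:
        return False

if __name__ == "__main__":
    if not synthetic_roundtrip():
        print("ALERT: observability pipeline failed the synthetic round-trip check")
```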