In the rapidly evolving landscape of software development, the adoption of observability strategies has become paramount for businesses aiming to maintain efficient and resilient systems. However, this pursuit is often marred by several challenges that hinder the effectiveness of observability.
Increasingly complex environments—compounded by the proliferation of Kubernetes and microservices-based applications in multi-cloud environments—and the ever-escalating speed of deployments are just two of the hurdles that organizations face.
Traditional observability solutions contribute to the chaos with heavy code-level implementation, black-box analysis and sprawling platforms that offer endless layers of functionality seemingly designed to increase vendor lock-in and further cement broken pricing models. Moreover, from an end-user standpoint, teams are drowning in data and buried under volumes of alerts that ultimately lead to engineering fatigue.
But there are new alternatives that can help. Enter (you guessed it) artificial intelligence (AI), which in the context of observability isn't simply hype and marketing noise but a transformative force that promises to reshape strategies and alleviate the burdens imposed by the above challenges. I can assure you that companies like mine are already well along this path of transformation, one that will massively benefit observability users.
Challenges in Observability: A Systematic Perspective
Let’s take a quick look at three key challenges currently obscuring the observability landscape and some practical, concrete examples of just how AI can be harnessed to overcome them.
1. Complexity and Alert Fatigue:
● Challenge: Super-complex environments, massive data volumes and traditional threshold-based analysis lead to an avalanche of alerts, many of which go unaddressed.
● AI Solution: AI-driven anomaly detection, when properly defined based on service level objectives (SLOs), helps filter out noise. By learning from ongoing troubleshooting of real-world issues and tapping into a massive database of known resolution actions, AI offers a new way of cutting through the data and better prioritizing and informing investigations. It ensures alerts align with critical objectives, reducing unnecessary distractions and improving responsiveness.
2. Alert Handling and Escalation:
● Challenge: Inefficient alert handling leads to delayed responses and a worse mean time to resolution (MTTR). Teams struggling to keep up with triage and troubleshooting have little hope of digging out.
● AI Solution: AI models, trained through the same kinds of learning described above, can also prioritize responsive actions, ranking issues based on past successful resolutions and automating key tasks. Integration with generative AI adds a deep well of contextual knowledge for addressing known issues, helping users of all skill levels accelerate their work and improve MTTR.
3. Targeted and Top-down Querying:
● Challenge: Querying is one of the most time-consuming and manual observability practices, with detailed knowledge of the systems, parameters and querying languages themselves serving as obstacles.
● AI Solution: Natural language and broad-based parameter querying augmented by AI lets users quickly create and execute analyses that become progressively more refined as the AI builds its understanding of the system and the query language. Not only are routine queries accelerated, but LLM integrations also open querying to less technical analysts, along with line-of-business and other SLO-driven audiences.
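To make the natural-language querying idea more concrete, here is a minimal Python sketch of how a question phrased in plain English might be translated into a PromQL-style query. The `call_llm` function is a hypothetical stand-in for whatever LLM integration a platform actually uses; it is stubbed with a canned response so the snippet runs on its own.

```python
# Minimal sketch: translating a natural-language question into a PromQL-style
# query. `call_llm` is a hypothetical stand-in for a real LLM integration and
# is stubbed with a canned response so the example runs on its own.

PROMPT_TEMPLATE = """You translate questions about service health into PromQL.
Known metrics: http_request_duration_seconds_bucket, http_requests_total.
Question: {question}
PromQL:"""

def call_llm(prompt: str) -> str:
    # Stub: a real implementation would send `prompt` to an LLM endpoint.
    return ('histogram_quantile(0.95, sum(rate('
            'http_request_duration_seconds_bucket{service="payments"}[5m])) by (le))')

def translate_to_query(question: str) -> str:
    """Turn a plain-English question into an executable query string."""
    return call_llm(PROMPT_TEMPLATE.format(question=question))

if __name__ == "__main__":
    q = "What was the 95th percentile latency of the payments service over the last 5 minutes?"
    print(translate_to_query(q))
```

In practice the prompt would also carry metric and schema metadata, which is where the progressive refinement described above comes from.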
The Role of AI in Observability: Unlocking Opportunities
Despite the hype and noise around AI, we believe that these types of innovations show immediate promise and that more advanced AIOps-type capabilities will eventually emerge as a standard practice in the IT industry. However, our current focus is on informing human experts (rather than letting systems make decisions autonomously), and both generative AI and LLM integrations are already making this possible.
For example, generative AI is already helping us with rapid contextualization and recommendations for alert resolution, as previously described. We are using generative AI to examine how an issue has been successfully handled before and to elevate and prioritize potential solutions. AI modeling of successful resolutions can help engineers take the right actions more efficiently and resolve issues faster, lowering MTTR.
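As an illustration of that resolution-ranking idea (not a description of any vendor's actual model), the sketch below scores candidate actions by how often each one resolved past incidents with overlapping symptoms. The incident records and symptom tags are invented for the example.

```python
# Illustrative sketch (not any vendor's actual model): rank candidate resolution
# actions for a new alert by how often each action resolved past incidents with
# similar symptoms. "Similarity" here is simple tag overlap (Jaccard).
from collections import defaultdict

past_incidents = [
    {"symptoms": {"high_latency", "payments"}, "action": "scale out payment pods", "resolved": True},
    {"symptoms": {"high_latency", "payments"}, "action": "restart cache",          "resolved": False},
    {"symptoms": {"error_rate", "payments"},   "action": "roll back last deploy",  "resolved": True},
    {"symptoms": {"high_latency", "checkout"}, "action": "scale out payment pods", "resolved": True},
]

def rank_actions(alert_symptoms):
    """Score each known action by (similarity to this alert) x (past success)."""
    scores = defaultdict(float)
    for incident in past_incidents:
        overlap = len(alert_symptoms & incident["symptoms"]) / len(alert_symptoms | incident["symptoms"])
        if incident["resolved"]:
            scores[incident["action"]] += overlap
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# A new alert tagged with the payment service's latency symptoms:
print(rank_actions({"high_latency", "payments"}))
```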
This kind of guidance is particularly valuable for organizations with less experienced engineers, who gain quick and easy access to "lessons learned" by more experienced users in the past. Similarly, natural language querying will be used in the future both to expand the knowledge base and to make insights accessible to larger sets of observability users. Natural language search also has a global reach, serving users around the world regardless of their native language.
Observability still has much to gain from AI-driven anomaly detection and its ability to reduce alert fatigue. Today, when we survey companies, they tell us that they do nothing about more than 50% of the alerts they get because they simply cannot get to them all. Unfortunately, too many organizations set up alerts to try to capture every anomaly in their system; it's a model inherited from traditional platforms that simply does not work anymore.
The reality is that the entire IT environment is one big anomaly! A big customer joins, the cluster crashes and a pod is restarted — all of these things happen all the time, and they all generate an alert. It’s no wonder, then, that alert overload leads to numbness if not paralysis, defeating the whole purpose of observability.
AI can play an invaluable role in reducing the alerts generated by our observability platforms so that engineering teams can trust that the alerts they do receive are worth addressing. But before applying AI, every organization should first define its SLOs. For example, suppose I have a payment service that is supposed to respond within 200 milliseconds and maintain an error rate below 0.001.
With these SLOs defined, I can set up anomaly detection on those parameters: I want to make sure that my error rate doesn't increase, and I want to make sure that my response time doesn't change. With these critical parameters defined, I can then apply AI algorithms to identify and raise alerts for situations that fall outside those boundaries.
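Here is a minimal sketch of what that SLO-driven alerting could look like in code, using the payment-service numbers above. The metric values are hard-coded for illustration; a real setup would pull them from the observability platform and could layer more sophisticated anomaly detection on top of the simple threshold check shown here.

```python
# Minimal sketch of SLO-driven alerting for the payment-service example above:
# raise an alert only when a measured value breaches its objective, rather than
# on every fluctuation. Metric values are hard-coded for illustration.
from dataclasses import dataclass

@dataclass
class SLO:
    name: str
    limit: float                # the objective's boundary value
    higher_is_bad: bool = True  # True if exceeding the limit violates the SLO

payment_slos = [
    SLO("p95_response_time_ms", limit=200.0),
    SLO("error_rate", limit=0.001),
]

def evaluate(slos, measurements):
    """Return alert messages only for SLO breaches; everything else stays quiet."""
    alerts = []
    for slo in slos:
        value = measurements[slo.name]
        breached = value > slo.limit if slo.higher_is_bad else value < slo.limit
        if breached:
            alerts.append(f"SLO breach: {slo.name}={value} (limit {slo.limit})")
    return alerts

print(evaluate(payment_slos, {"p95_response_time_ms": 240.0, "error_rate": 0.0004}))
```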
Ultimately, it is about cutting through the noise. We’re never going to investigate every alert. So, where do we focus? What is critical? AI helps us filter huge data sets to escalate the truly bad and de-escalate the time-consuming and non-critical.
And then there's the issue of cost, as everyone deals with the staggering price of traditional observability and APM tools. AI is already being employed to advance observability in this area, filtering enormous datasets to identify and prioritize the data that is actually worth retaining, investigating and paying for. In all of these cases, AI is informing human experts, helping them do their work smarter and faster.
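For the cost angle, here is a hedged sketch of one common filtering approach: always retain error and SLO-critical telemetry and sample the routine noise. The service names and sampling rate are illustrative assumptions, not recommendations.

```python
# Hedged sketch of cost-aware telemetry filtering: always keep errors and
# SLO-critical services, and sample routine debug noise. Service names and the
# sampling rate are illustrative assumptions, not recommendations.
import random

SLO_CRITICAL_SERVICES = {"payments", "checkout"}
DEBUG_SAMPLE_RATE = 0.05  # retain roughly 5% of routine low-severity logs

def should_retain(log):
    if log["level"] in ("ERROR", "WARN"):
        return True                                 # always keep failures
    if log["service"] in SLO_CRITICAL_SERVICES:
        return True                                 # always keep SLO-relevant services
    return random.random() < DEBUG_SAMPLE_RATE      # sample everything else

logs = [
    {"service": "payments",  "level": "INFO",  "msg": "charge ok"},
    {"service": "frontend",  "level": "DEBUG", "msg": "render took 12ms"},
    {"service": "inventory", "level": "ERROR", "msg": "db timeout"},
]
print([entry["msg"] for entry in logs if should_retain(entry)])
```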
Will the next phase of AI transformation be AI generating automated responses? That remains to be seen! We don't know when, if ever, true AIOps will evolve into standard practice. In the meantime, what can CIOs do to build an effective observability strategy?
Building an Effective Observability Strategy: A CIO’s Guide
As decision-makers embark on building or refining an observability strategy that incorporates the transformational power of AI, several key considerations should be kept in mind:
● Immediate Value and Quick Wins: Identify areas where AI can deliver immediate value and contribute to quick wins, such as consolidating data effectively. How can you quickly identify the right data to observe, and who on your team is best suited to interpret results?
● Centralized Use Cases: Start with centralized use cases tailored to specific personas within the organization, ensuring observability aligns with their needs.
● Data and Cost Efficiency: Adopt a strategy that prioritizes data and cost efficiency from day one, focusing on key requirements and pain points rather than everything under the sun.
● AI-Backed Querying: Leverage AI-backed querying for efficient data extraction and translation into business-level insights.
AI presents an unprecedented opportunity to revolutionize observability, turning this critical practice from a burden into a strategic asset. By embracing AI-driven solutions, everyone from technical users to C-suite executives can lead their organizations toward a more efficient, responsive and cost-effective observability strategy, ultimately enhancing the resilience of their IT operations in the face of evolving challenges.