
By Tatsiana Kakareka, Senior Solution Architect, IBA Group

Most organizations have no shortage of observability data. What they often lack is a reliable way to connect technical signals with business outcomes.

I have watched a lot of incident reviews, and after enough of them, certain patterns become difficult to ignore. The engineering team comes in with a clean story about what went wrong: which service slowed down, where the latency came from, and which call timed out where. The technical explanation is usually accurate. The business implications are often left unexplored.

For more than two decades, I have seen organizations invest heavily in collecting observability data while paying far less attention to how that information is interpreted and used. Few organizations attempt to quantify the cost of that disconnect. Whenever someone does, the figure turns out to be larger than expected.

Where observability loses business relevance

Observability did its job. We can answer questions about distributed systems that a monitoring tool from a decade ago could not reach. We know why a service slowed down, not just that it did. Projects like OpenTelemetry made these signals consistent across teams and stacks. The technical layer is fine.

What is not fine is what happens after the signal is generated. The question a system can now answer, why did the checkout service slow down, is not the question being asked on the next floor up. A CFO is asking how many checkouts dropped while it was slow and how that compares to the same hour last quarter. Both questions exist within the same organization, but responsibility for connecting them is often unclear.

I have sat in incident reviews where the engineering team had the dashboards, the traces, and the timeline of the outage down to the second. Everything they presented was correct. None of it survived contact with the executive team. A two-percent spike in 500 errors got reported as a two-percent spike in 500 errors because nobody in the room could convert it into orders. The discussion went back to engineering. The issue got fixed. The strategic conversation that should have followed did not happen.

What rarely gets measured is the cost of decisions that were never informed by the available data.

Connecting observability to business outcomes

The connection between technical and business metrics is established long before an incident review takes place. By the time someone is presenting at an incident review, the work was either done six months earlier, or it cannot be done now.

A 300-millisecond increase in checkout latency is not a business signal on its own. It becomes one when the system has been instrumented to track conversion at that endpoint, so the same data point reads as “we lost roughly X percent of paying users for every 100 milliseconds we added here.” A rising error rate in the recommendations service remains a metric until somebody correlates it with average order value, at which point it is a margin issue with a number attached. A degrading authorization API remains an incident until it is tracked against churn cohorts, at which point it is a retention risk with a forecast.
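To make the "X percent for every 100 milliseconds" reading concrete, here is a minimal sketch of the kind of query that joins the two signals. It assumes that checkout latency and conversion have already been recorded for the same time windows. The `Bucket` type, its field names, and the sample figures are illustrative assumptions, not data from a real system, and a least-squares slope shows correlation rather than proven cause.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Bucket:
    """One time window of checkout traffic (hypothetical schema)."""
    latency_ms: float  # typical checkout latency in this window
    sessions: int      # checkout sessions started
    orders: int        # sessions that ended in a paid order


def conversion_slope_per_100ms(buckets: Sequence[Bucket]) -> float:
    """Least-squares slope of conversion rate against latency, per 100 ms.

    A negative result means conversion falls as latency rises.
    """
    xs = [b.latency_ms for b in buckets]
    ys = [b.orders / b.sessions for b in buckets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("latency does not vary across buckets")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov_xy / var_x * 100  # slope per ms, scaled to 100 ms


# Illustrative numbers only.
sample = [
    Bucket(latency_ms=200, sessions=1000, orders=50),
    Bucket(latency_ms=300, sessions=1000, orders=48),
    Bucket(latency_ms=400, sessions=1000, orders=46),
    Bucket(latency_ms=500, sessions=1000, orders=44),
]
slope = conversion_slope_per_100ms(sample)
print(f"conversion change per +100 ms: {slope * 100:.2f} percentage points")
```

The arithmetic is trivial. The hard part is the precondition: sessions and orders have to be recorded at the same endpoint and over the same windows as latency, which is an instrumentation decision made months before anyone needs the number.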

This is not exotic engineering. It requires deciding, upfront, that business events get instrumented next to technical ones and that somebody owns the queries that join them. In most cases, the obstacle is not technical capability. Engineering does not own the KPIs. Product does not own the telemetry. Responsibility often falls between teams, leaving neither side accountable for connecting the two.

We saw this on a project for a global IT leader. The client had a web-based information system that gathered contract-related financial and healthcare information from different sources, including project management tools, claims, ledgers, customer billing systems, and third-party orders and invoices. The application worked well with a small number of users. But once it had to serve more than 500 concurrent users, the spreadsheet became unreadable and the system's availability left much to be desired. With a monolithic approach, one has to work with the entire database. With a microservices approach, we could work on one metric or a few metrics at a time, and step by step we reached the required level of availability for all major metrics. Eventually, we achieved the scalability the system needed and consolidated numerous contract management systems, each serving individual locations or divisions, into a single system.

The practices that hold this together

Tooling matters, but less than the vendors selling it would suggest. A stack built around Prometheus and Grafana for metrics, Loki or Elasticsearch for logs, and Jaeger or Tempo for traces gets a team to technical sufficiency. Commercial platforms layer on convenience and faster correlation. For most organizations, the difference is convenience rather than capability.

What turns the stack into something the business can use is process discipline. Structured logs that use the same field names across services. Trace identifiers that survive every hop. Latency, errors, and saturation are measured the same way in every service. New services get dashboards before they go to production, not six months later when somebody asks for them. Alerts get reviewed and pruned because an alert that has not fired in a year is just noise the team has learned to ignore.
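As an illustration of the first two habits, the sketch below uses Python's standard `logging` module to emit JSON log lines that always carry the same field names, including a trace identifier. The field list in `SHARED_FIELDS` and the `order_placed` event name are assumptions made for the example, not a standard. In practice the schema would be agreed across teams and the trace identifier would come from the tracing library in use.

```python
import json
import logging

# Hypothetical shared schema: every service logs these names, spelled the same way.
SHARED_FIELDS = ("service", "trace_id", "event")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of keys."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Missing fields are written as null instead of being dropped,
        # so downstream queries can rely on the keys being present.
        for field in SHARED_FIELDS:
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# A business event logged next to technical ones, carrying the same trace id.
logger.info(
    "order placed",
    extra={"service": "checkout", "trace_id": "abc123", "event": "order_placed"},
)
```

Because the business event and the technical logs share `trace_id`, a later query can join an order to the request that produced it without guessing at field names per service.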

These practices rarely attract the same attention as new tooling, despite having a greater impact on long-term reliability. The DORA research has been saying for years that mature observability practices correlate with both delivery performance and business outcomes. Observability still gets cut first when budgets tighten.

We introduce DevOps practices widely. The approach is not my own invention, but it is very effective. It lets us reduce lead time, that is, the time from a bug fix to its deployment, and increase the observability of all processes. With the right processes, it is possible to gather data every day, not once a month or once a quarter. The data are more accurate and show, for example, which systems fail on a regular basis.

Bringing reliability into business decision-making

The teams I have seen close this gap stopped running observability as a separate function. The same review cadence that covers financial performance covers reliability, because once the data is translated, they are the same review. A service-level objective that protects revenue gets approved by the CFO. A latency budget that protects conversion goes into next quarter’s plan alongside marketing spend. An outage report quantifies what the customer lost and gets read in the same room that approves capital.
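A rough sketch of what "translated" can mean for a service-level objective follows. It turns a request-based SLO into an error budget and attaches a revenue figure to the failures. The function name, its parameters, and the flat `revenue_per_request` value are assumptions for illustration. Treating every failed request as lost revenue is a deliberate simplification that a real model would refine, for example by accounting for retries.

```python
def error_budget_report(
    slo_target: float,
    total_requests: int,
    failed_requests: int,
    revenue_per_request: float,
) -> dict:
    """Summarize SLO consumption in both technical and financial terms.

    slo_target is the required success ratio, e.g. 0.999 for "three nines".
    revenue_per_request is an assumed average, not a measured value.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    allowed_failures = (1 - slo_target) * total_requests
    budget_used = (
        failed_requests / allowed_failures if allowed_failures else float("inf")
    )
    return {
        "allowed_failures": allowed_failures,
        "budget_used": budget_used,  # 1.0 means the budget is exhausted
        "estimated_revenue_at_risk": failed_requests * revenue_per_request,
    }


# Illustrative numbers only.
report = error_budget_report(
    slo_target=0.999,
    total_requests=1_000_000,
    failed_requests=500,
    revenue_per_request=2.0,
)
print(report)
```

The first two figures are what engineering already tracks. The third is the one that lets the same report be read in a financial review.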

After the move to a microservices architecture, our client decided to redesign the systems that regularly malfunctioned, and the development roadmap changed accordingly. The amount already invested and the amount still needed also became clearly visible, so management could make more accurate and better-grounded decisions.

In many organizations, the technical foundations are already in place. The unresolved challenge is ownership of how that information is interpreted and acted upon.

The organizations that close this gap rarely do so through a single initiative. Over time, reliability data becomes part of the same conversations that shape investment, operational priorities, and business performance.

About the Author

Tatsiana Kakareka is a seasoned IT professional with extensive experience in designing high-load systems, data management solutions, predictive analytics, and process automation. She joined IBA Group in 2004 and has held multiple technical and leadership roles, advancing to Senior Solution Architect in August 2022. Tatsiana holds a Master of Science degree in Applied Mathematics.
