A ChatGPT-generated image of a man observing an application
Contextualizing Your Application Logs and Metrics to Understand Your Applications Better
There are many platforms for logging, like Splunk, Dynatrace, and Prometheus, among others. But what do we really need in order to understand whether our application is performing as we need it to? That depends on where the application lives and runs, though there is some similarity across architectures. This article presents a vision for making applications observable in ways that matter, are maximally useful for developers, and efficiently serve business needs. Full application observability promotes deeper knowledge of the application design and data flow across the team.
So, what do you need to know to determine whether your application is working right for your users and displaying the most up-to-date information? Here, I argue for four key pieces that get us there:
- Documentation and demonstration of microservice flow
- Co-location of logs connected to pieces of the microservice flow
- Application response metrics (normally UI plus API) to capture uptime and latency
- Contextual cost tracking
Cost tracking is included in the list above, but cost tracking and anomaly detection sit on top of the other parts, since they use observability data to draw more directed insights. For example, maybe you have had more runs of certain processes lately due to a growing user base. That cost data requires context to understand whether it is appropriately correlated with usage or with other things going on in your business and application.
Can we build a system that demonstrates application functionality in its observability platform?

1. Have all your backend processes succeeded? If not, where did they fail within the context of that microservice flow?
Generally, we think of backend process observability in terms of traces. Still, I believe that end-to-end tracing and services like AWS X-Ray overcomplicate or obfuscate a much simpler need most applications have: understanding whether the core pieces of transformation, business logic, etc., that make up your application have succeeded and when they last ran.
A simpler event-based approach that runs when processes complete should suffice in most cases. An event structure I’ve used before that will hopefully help you is:
{
  "process-name": "<name of process>",
  "service": "<cloud or other service this is running on>",
  "account": "<account number, etc.>",
  "grain": "<granularity of the process(es) running, such as job or event>",
  "success-count": "<count of successes>",
  "input-details": "<json object of input details with dates, run-ids, or other context>",
  "failure-count": "<count of failures>",
  "timestamp": "<timestamp>"
}
The key to defining this structure for yourself is that you want to make it easy both to identify where issues exist and to co-locate your logs in one place.

As shown in Figure 2, you want to know what all of your microservices are, how the pieces flow together, and where a failure occurred. If you receive an event that step 3 failed, the context that steps 1 and 2 succeeded, along with a timestamp of when, is critical to troubleshooting.
And while many tools can help with bubbling up errors, they often do not provide sufficient value to justify their costs, because a big part of the value comes from understanding where in the flow the issues exist. Shipping everything to a centralized SaaS can add latency, cost egress, or simply be impossible in regulated, air-gapped, or edge setups. But ultimately, context is everything. Whether or not these tools work for you is deeply business-dependent. Anecdotally, I've seen third-party observability tools work well with many distributed servers or EC2 instances, or with many high-traffic customer sites. But many applications just aren't that.

This image shows an example of how you might send events that are generated from distinct microservices (of which Figure 2 shows one) to a common event bus. An event bus might have too much overhead, so it could instead make sense to use a messaging queue or pub/sub system.
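To make that concrete, here is a rough sketch of publishing one of those completion events. It assumes AWS EventBridge via boto3; the source, detail type, and bus name are hypothetical placeholders rather than a prescribed setup, and a queue or pub/sub client would follow the same pattern of building the JSON event and handing it to the transport.

import json
from datetime import datetime, timezone

import boto3  # assumes AWS EventBridge; a queue or pub/sub client would look similar

events = boto3.client("events")

def publish_process_event(process_name, success_count, failure_count, input_details):
    """Publish a process-completion event using the structure shown above."""
    detail = {
        "process-name": process_name,
        "service": "aws-lambda",    # illustrative value
        "account": "123456789012",  # illustrative value
        "grain": "job",
        "success-count": success_count,
        "input-details": input_details,
        "failure-count": failure_count,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    events.put_events(
        Entries=[{
            "Source": "my-app.observability",             # hypothetical source name
            "DetailType": "process-completion",
            "Detail": json.dumps(detail),
            "EventBusName": "application-observability",  # hypothetical bus name
        }]
    )

publish_process_event("nightly-sales-transform", success_count=1, failure_count=0,
                      input_details={"run-id": "2024-05-01-a", "dates": ["2024-05-01"]})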
Do all your processes have compatible data?
Is the data that feeds your application as up-to-date as possible? Or do you at least know when it isn't, and why? Many systems rely on a mix of data velocities and veracities (as well as variety and volume), and sometimes the veracity, which is deeply tied to the value, leaves something to be desired due to time-alignment issues. If some data is updated monthly and another set is updated hourly, you need to know that and communicate it clearly for system validity, synchronization, and fast troubleshooting; accounting for and recording these details in your JSON structure can help you track your data over time. This sort of record-keeping can also be a great foundation for justifying improvements to your larger organization's application landscape when there are dramatic synchronization issues between systems.
For example, imagine we get weekly sales data, but real-time production data. Aside from handling any granularity differences, you should take special care to ensure your data products or systems don’t have any funky downstream effects based on how data is used. Logging granularity and data inputs can help you efficiently troubleshoot data issues; granularity might be “job” or “event”, or it might be much more specific to your application.
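As a small illustration of that record-keeping, the input-details object from the event structure above could capture each input's cadence and last-updated time; the field names below are hypothetical and should be adapted to your domain.

# Hypothetical input-details payload recording data freshness and granularity
# so that time-alignment issues are visible when troubleshooting.
input_details = {
    "run-id": "2024-05-06-weekly-rollup",
    "grain": "job",
    "inputs": [
        {"source": "sales", "cadence": "weekly", "last-updated": "2024-05-05T00:00:00Z"},
        {"source": "production", "cadence": "real-time", "last-updated": "2024-05-06T14:32:10Z"},
    ],
}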
2. What is happening in your logs?

Can events be drawn back to log traces? Can you see the traceback in the context of the application? Logs should be immutable, timestamped records of discrete events in the flow. Once again, context is everything. Tying logs to the specific step in the flow that failed enables speedy recovery and root cause analysis.
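One hedged sketch of making that tie explicit: emit structured JSON log lines that carry the process name, step, and run ID so a failure event can be joined back to its log records. The field names mirror the event structure above; everything else here is illustrative.

import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as a single JSON line with flow context."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Flow context attached via the `extra` argument below.
            "process-name": getattr(record, "process_name", None),
            "step": getattr(record, "step", None),
            "run-id": getattr(record, "run_id", None),
        }
        if record.exc_info:
            payload["traceback"] = self.formatException(record.exc_info)
        return json.dumps(payload)

logger = logging.getLogger("my-app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

try:
    raise ValueError("bad input row")
except ValueError:
    logger.exception("step failed",
                     extra={"process_name": "nightly-sales-transform",
                            "step": 3, "run_id": "2024-05-01-a"})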
3. How are the pieces of your application that others interact with performing?
API and application uptime and latency tell you how the processes that impact users or other applications are performing over time. Your goal is to provide quantifiable trends that show system health and performance. Once baselines are established, you can enable alerting on thresholds or anomalies and systematically work to lower latency. Knowing these details is also useful for capacity planning and detecting performance regressions.
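As one possible sketch, a small timing wrapper can publish per-endpoint latency as a custom metric; this example assumes CloudWatch via boto3, and the namespace, metric, and endpoint names are illustrative only. The same measurement could just as easily feed a Prometheus histogram.

import time
from functools import wraps

import boto3

cloudwatch = boto3.client("cloudwatch")

def record_latency(endpoint):
    """Decorator that measures a handler's latency and publishes it as a metric."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                cloudwatch.put_metric_data(
                    Namespace="MyApp/API",  # hypothetical namespace
                    MetricData=[{
                        "MetricName": "Latency",
                        "Dimensions": [{"Name": "Endpoint", "Value": endpoint}],
                        "Value": elapsed_ms,
                        "Unit": "Milliseconds",
                    }],
                )
        return wrapper
    return decorator

@record_latency("/orders")
def get_orders():
    # ... handler logic ...
    return {"orders": []}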
4. How do you determine if your costs are reasonable, at least relatively so?
To determine whether your cloud computing costs are reasonable, start by knowing what your costs are relative to each other each month, and then proceed with correlating spending with usage metrics. A healthy cost profile shows proportional growth; when utilization, like compute hours, data transfers, or active users, increases, costs rise accordingly, and when usage dips, costs follow suit. This correlation helps flag anomalies such as idle resources, poor lifecycle policies, misconfigured autoscaling, or unoptimized storage tiers. Over time, using dashboards that split up resource costs by teams, environments, or workloads allows you to refine budgets, detect drift, and ensure that spending remains both justifiable and efficient. It also lets teams learn from each other about cost-efficient application design by showing what they spend relative to one another, opening the door to data-driven efficiency conversations between and among teams.
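A minimal sketch of that correlation check, using made-up monthly figures purely for illustration: if cost and a usage driver such as compute hours move together, the correlation stays close to 1, and a sudden divergence is a flag worth investigating.

from statistics import correlation  # Python 3.10+

# Illustrative monthly figures only; in practice these would come from your
# billing export and usage metrics.
compute_hours = [1200, 1350, 1500, 1480, 2100, 2250]
monthly_cost = [950, 1040, 1180, 1160, 1650, 1780]

r = correlation(compute_hours, monthly_cost)
print(f"cost vs. usage correlation: {r:.2f}")  # close to 1.0 means costs track usage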
Conclusion and Considerations
What are the core processes in your application? If it’s a complicated application, this might seem tedious, but it is worth it and much less tedious when done at project inception or when new microservices are added.
What are the pieces of information you need to genericize the events from each process? This is likely domain-specific. I hope the event structure I shared above can be a good starting point, but it is unlikely to be sufficient for your domain.
When your application fails, can you connect that failure in your observability platform to the logs and troubleshoot the stack trace efficiently?

A very smart person I work with describes this vision as the “pizza tracker approach” to application observability. Can we use Sankey diagrams or other visuals to show the application flow? Figure 5 shows what that might look like. Can you link an overall failure easily to specific log errors from the observability tool or dashboard? That might be done using process IDs or simply thoughtful JSON logging structures.
Can we build a system that demonstrates application functionality in its observability platform? I have been thinking about what underlies good application observability, and the conclusion I have come to is that it should be self-documenting. I should be able to look at the flow in Figure 5 and know that Process A has four pieces: 1 and 2 must run first and have x and y purposes, then 3 runs using data from 1 and 2, and finally 4 runs after that. These pieces might not all run back-to-back. There might be no explicit trigger between them, but the results of 3 depend on 1 and 2, and 4 depends on 3.
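One hedged way to get there is to declare the flow as data that both documents the dependencies and drives the “pizza tracker” view; the structure below is only a sketch of what such a declaration might look like for the hypothetical Process A.

# Hypothetical declarative flow for Process A: each step names its purpose and
# its upstream dependencies, even when there is no explicit trigger between steps.
PROCESS_A_FLOW = {
    "1": {"purpose": "ingest source x", "depends_on": []},
    "2": {"purpose": "ingest source y", "depends_on": []},
    "3": {"purpose": "join and transform data from 1 and 2", "depends_on": ["1", "2"]},
    "4": {"purpose": "publish results downstream", "depends_on": ["3"]},
}

def downstream_of(step, flow=PROCESS_A_FLOW):
    """Return every step whose results depend, directly or indirectly, on `step`."""
    impacted = {s for s, spec in flow.items() if step in spec["depends_on"]}
    for s in list(impacted):
        impacted |= downstream_of(s, flow)
    return impacted

# If step 1 fails, the tracker can mark 3 and 4 as blocked:
print(sorted(downstream_of("1")))  # ['3', '4']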
Do you have any thoughts on application observability as a concept? Please reach out to us at New Math Data or comment below.