April 20, 2021
Application Monitoring Best Practices
Monitoring is essential to ensure your applications' reliable operation and a vital practice of the DevOps movement. Learn more about the role of monitoring in the development workflow and the best practices for monitoring your applications.
As a developer, you might be responsible for developing features and ensuring that they run reliably. But why is monitoring so important, and what are the best practices for monitoring your application?
In this article, you will learn about:
- What monitoring is
- Why monitoring is important
- How monitoring relates to agile development methodologies
- How monitoring fits into your development workflow
- What you should monitor in your application
- How to monitor, and useful tools for monitoring
- How monitoring relates to distributed tracing
Note: The article uses the terms backend, application, and service interchangeably. They all refer to the same thing.
What is monitoring?
Monitoring is the practice of collecting, processing, aggregating, and visualizing real-time quantitative data about a system. In the context of an application, the measured data might include request counts, error counts, request latency, database query latency, and resource utilization.
For example, suppose you were developing new search functionality for an application and introduced a new API endpoint that queries the database. You might be interested in measuring the time taken to serve such search requests and tracking how the endpoint performs as the concurrent load on it increases. You might then discover that latency increases when users search specific fields because of a missing index. Monitoring can help you detect such anomalies and performance bottlenecks.
Why is monitoring important?
There are several reasons why monitoring is important, and understanding them informs your choice of implementation and tools. At a high level, monitoring helps you ensure the reliable operation of your application.
A more thorough exploration of the reasons includes:
- Alerting: when your application fails, you usually want to fix it as soon as possible. Alerting is made possible by the real-time data you collect about your system. You can define alerts that fire when a monitored metric exceeds a threshold, indicating an error that requires developer intervention. In turn, you're able to respond swiftly and reduce both the downtime experienced by your users and the potential revenue loss.
- Debugging: when your application fails, you don't want to be groping in the dark. Monitoring assists you in finding the root cause for the failure and helps you resolve the issue.
- Analyzing long-term trends: being able to see the extent to which your application utilizes resources over time with relation to active user growth can assist with capacity planning and scaling endeavors. Moreover, monitoring can provide insight into the relationship between new features and user adoption.
Web applications tend to grow in complexity over time. Even supposedly simple apps can be hard to understand once deployed, especially when you consider how they function under load. Moreover, layers of abstraction and the use of external libraries obscure the app's underlying mechanics and failure modes. Monitoring provides you with x-ray-like vision into the health and operation of your application.
Monitoring is an indispensable tool in customer-centric SaaS companies. SaaS companies often guarantee their service's reliability using uptime expectations in the service-level agreement (SLA). A service-level agreement defines what a customer should expect from the SaaS service provider and acts as a legal document. For example, many cloud services guarantee 99.99% uptime, which equates to roughly 52.6 minutes of acceptable downtime per year. Monitoring allows you to reduce risk and track downtime while, with alerting, keeping it as low as possible.
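To make the arithmetic behind an uptime guarantee concrete, here is a minimal sketch that converts an uptime percentage into a downtime budget. The function name and the choice of an average year of 365.25 days are assumptions made for illustration; an actual SLA defines its own measurement period.

```python
# Assumption: an average year of 365.25 days; a real SLA states its own period.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(uptime_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime budget, in minutes, implied by an uptime guarantee."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_minutes * (1 - uptime_percent / 100)


yearly_budget = allowed_downtime_minutes(99.99)
monthly_budget = allowed_downtime_minutes(99.99, period_minutes=30 * 24 * 60)
print(f"{yearly_budget:.2f} minutes per year")
print(f"{monthly_budget:.2f} minutes per 30-day month")
```

Under these assumptions, a 99.99% guarantee leaves a budget of about 52.6 minutes per year, or a little over four minutes in a 30-day month.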
How does monitoring relate to agile development methodologies?
These days, engineering teams are increasingly adopting agile development methodologies, focusing on delivering incremental changes frequently and relying on automation tools to allow continuous delivery of changes.
Agile development methodologies enable development teams to reduce the time to market by embracing change and frequently deploying.
This is typically achieved by automating the repeatable aspects of the development workflow and creating a real-time feedback loop that allows engineers to analyze the impact of their changes on production systems. In such environments, monitoring serves as one of the essential tools for creating the real-time feedback loop.
How does monitoring fit into your development workflow?
A helpful way to understand how monitoring fits into the development workflow is by looking at the DevOps infinity loop. The DevOps philosophy brings developing and operating software closer together. Traditionally, these two disciplines were seen as separate concerns handled by different teams.
The merging of these two disciplines broadens engineering teams' responsibility and empowers them to own the entire development lifecycle from development to operating production deployments. The DevOps principle "you build it, you run it" captures the essence of this idea.
Practically speaking, monitoring is a concern that is addressed during three stages:
- Planning phase
- Development phase
- Operating phase after you deploy changes
During the planning phase, you will identify some initial Service Level Indicators (SLIs) derived from the Service Level Agreement (SLA). SLIs are measurable metrics about your application that indicate whether you are meeting the SLA. For example, if your SLA states 99.9% uptime, the corresponding SLI will be the monthly or yearly downtime of your service.
Note that setting SLI goals, while often derived from the SLA, is also influenced by your application's architecture and the metrics that you decide to measure and track.
During the development phase, the application code is instrumented with monitoring logic which exposes metrics such as the application's internal performance, load, and error counts. For example, when building an application that exposes a GraphQL API, you might instrument the code to collect metrics such as request counts (grouped by HTTP response code) and request latency. Upon each request, the instrumentation code increments the request count and tracks the request latency.
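The instrumentation described above can be sketched with nothing but the standard library. This is an illustrative sketch, not how any particular monitoring client works: the `RequestMetrics` class, the `instrumented` decorator, and the `handle_graphql` handler are hypothetical names, and the metrics live only in memory.

```python
import functools
import time
from collections import Counter, defaultdict


class RequestMetrics:
    """In-memory request counters and latency samples (illustrative only)."""

    def __init__(self):
        self.request_counts = Counter()              # status code -> count
        self.latencies_seconds = defaultdict(list)   # status code -> samples

    def observe(self, status_code: int, duration_seconds: float) -> None:
        self.request_counts[status_code] += 1
        self.latencies_seconds[status_code].append(duration_seconds)


metrics = RequestMetrics()


def instrumented(handler):
    """Wrap a request handler so that every call is counted and timed."""
    @functools.wraps(handler)
    def wrapper(request):
        start = time.perf_counter()
        status_code = 500  # recorded if the handler raises before returning
        try:
            status_code, body = handler(request)
            return status_code, body
        finally:
            metrics.observe(status_code, time.perf_counter() - start)
    return wrapper


@instrumented
def handle_graphql(request):
    # Hypothetical handler: a real one would parse and execute the query.
    if "query" not in request:
        return 400, {"errors": ["missing query"]}
    return 200, {"data": {}}


handle_graphql({"query": "{ posts { id } }"})
handle_graphql({})
print(dict(metrics.request_counts))
```

Keeping the instrumentation in a decorator means the handler itself stays free of monitoring logic, and every handler is measured the same way.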
After your application code is tested and deployed to production, you use a monitoring tool (or service) to collect, store and visualize the information exposed by the instrumentation in the application. Typically such tools also come with alerting functionality that you can configure to send alerts when a failure requires developer intervention.
Visualizing the collected metrics gives you an overview of your application's health and internal condition in real-time. For example, drawing on the previous example, you might create a dashboard with graphs to visualize requests per minute, request latency, and system resource utilization (CPU, disk, network I/O, and memory). Additionally, you might set up an alert for when request latency is above a certain threshold.
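A latency alert of the kind mentioned above boils down to comparing a window of recent measurements against a threshold. The sketch below assumes the alert evaluates the mean latency of a window; the `Alert` type, the alert name, and the 500 ms threshold are made up for illustration, and monitoring tools typically offer richer conditions.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class Alert:
    name: str
    value: float
    threshold: float


def check_latency(samples_ms: Sequence[float],
                  threshold_ms: float) -> Optional[Alert]:
    """Return an Alert when the mean latency of the window exceeds the threshold."""
    if not samples_ms:
        return None  # no data in the window, so there is nothing to evaluate
    average = mean(samples_ms)
    if average > threshold_ms:
        return Alert("HighRequestLatency", average, threshold_ms)
    return None


healthy_window = [120, 180, 150, 210]
degraded_window = [450, 700, 820, 640]
print(check_latency(healthy_window, threshold_ms=500))
print(check_latency(degraded_window, threshold_ms=500))
```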
In summary, you should think about monitoring throughout the development workflow.
What to monitor?
When setting up monitoring, there are two things that you want monitoring to help you answer: what is broken and why. In other words, you want to monitor both for things that indicate symptoms and for their potential causes.
For example, if you were to monitor only HTTP response codes, you would be able to alert when there are problems with your application. However, this kind of monitoring won't help you answer the question of why requests are failing.
Another aspect to consider when deciding what to monitor is the broader goal of the application and how it fits into the business's goals. For example, you might know from your product or analytics team that there's a user dropoff that might be linked to slow responses. For the business, this can mean revenue loss. In such situations, you want to set Service Level Objectives (SLOs), which define the expectations for your application, e.g., serve requests in under 500ms, and define a corresponding metric that you monitor.
This is where SLOs and SLIs come together. While the SLOs define your goals, the SLIs are the corresponding measurements that you want to monitor in your application.
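As a rough sketch of how the two fit together, the snippet below computes a latency SLI (the fraction of requests served in under 500ms) and compares it with an SLO. The 95% objective and the sample durations are assumptions chosen for illustration.

```python
from typing import Sequence


def latency_sli(durations_ms: Sequence[float], target_ms: float = 500) -> float:
    """SLI: the fraction of requests served faster than target_ms."""
    if not durations_ms:
        raise ValueError("at least one request is needed to compute the SLI")
    fast = sum(1 for duration in durations_ms if duration < target_ms)
    return fast / len(durations_ms)


SLO_TARGET = 0.95  # hypothetical objective: 95% of requests under 500 ms

durations = [120, 340, 90, 510, 230, 480, 700, 150, 60, 310]
sli = latency_sli(durations)
print(f"SLI: {sli:.0%}, SLO met: {sli >= SLO_TARGET}")
```

Here two of the ten requests are slower than the target, so the SLI is 80% and the hypothetical 95% objective is missed.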
Ideally, the monitoring data you collect should be actionable. If you collect too many metrics, your signal-to-noise ratio will decrease, making it harder to debug production problems. If you cannot use a metric to drive alerts or to provide a bird's eye view of the overall health of a system, consider removing it.
Black-box and White-box monitoring
There are two kinds of monitoring: black-box and white-box, and both play a critical role in your monitoring setup.
Black-box monitoring is when you measure externally visible behavior as observed by your users. Another way to look at it is that black-box monitoring probes the external characteristics of a service and helps answer questions such as: how fast was the service able to respond to the client request? Did it return the correct data or response code?
While black-box monitoring helps understand your application's state, it doesn't reveal much about the internal causes of the problem.
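A black-box check can be sketched as a probe that calls the service from the outside and judges only what a client would see: the response code and the time it took. In this sketch the `probe` function and its thresholds are assumptions, and `fake_fetch` stands in for a real HTTP call so the example runs without a network.

```python
import time
from typing import Callable, NamedTuple


class ProbeResult(NamedTuple):
    ok: bool
    status_code: int
    latency_seconds: float


def probe(fetch: Callable[[], int],
          expected_status: int = 200,
          max_latency_seconds: float = 1.0) -> ProbeResult:
    """Call the service externally and check the status code and latency."""
    start = time.perf_counter()
    status_code = fetch()
    latency = time.perf_counter() - start
    ok = status_code == expected_status and latency <= max_latency_seconds
    return ProbeResult(ok, status_code, latency)


def fake_fetch() -> int:
    # Stand-in for a real HTTP call, e.g. urllib.request.urlopen(url).status
    return 200


result = probe(fake_fetch)
print(result.ok, result.status_code)
```

Note that the probe knows nothing about the service's internals, which is exactly why it can tell you that something is wrong but not why.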
White-box monitoring is concerned with your application's internals and includes metrics exposed directly from your application code. White-box metrics should be chosen so that the cause of an issue is identifiable. Examples include:
- Errors and exceptions
- Request rates
- Database query latency
- Count of pending database queries
- Latency in communication with other services.
Applications deployed to the cloud typically involve communication with other data sources and services, especially in Microservices architectures. Given the myriad of potential failure modes that such a system is bound to, you should also consider monitoring metrics that allow you to evaluate the effect these have on your service. If you own those other services, consider monitoring them too, because a problem in one service might cause a symptom in another.
Several methodologies assist in choosing what to monitor. Follow along to learn more.
The RED method – Rate, Errors, Duration
The RED method defines three key metrics that you should measure in your application/service:
- Request Rate: the number of requests per second your application is serving.
- Request Errors: the number of failed requests per second.
- Request Duration: the distribution of the time each request takes.
Note that these metrics refer to client/user requests. In an application that uses a database, you could also apply these metrics to database queries and measure database query counts, errors, and durations.
The three metrics give you an overview of the load on your service, the number of errors, and the relationship between load and latency.
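To illustrate, here is a minimal sketch that derives the three RED metrics from a window of recorded requests. The `Request` record, the function name, and the choice of the 50th and 95th percentiles to summarize the duration distribution are assumptions; monitoring tools usually compute these for you from histograms.

```python
from statistics import quantiles
from typing import List, NamedTuple


class Request(NamedTuple):
    duration_ms: float
    failed: bool


def red_summary(requests: List[Request], window_seconds: float) -> dict:
    """Summarize rate, errors, and duration for one observation window."""
    durations = [request.duration_ms for request in requests]
    # 99 cut points; index 49 is the 50th percentile, index 94 the 95th.
    # quantiles() needs at least two data points.
    cuts = quantiles(durations, n=100, method="inclusive")
    return {
        "rate_per_second": len(requests) / window_seconds,
        "errors_per_second": sum(r.failed for r in requests) / window_seconds,
        "duration_p50_ms": cuts[49],
        "duration_p95_ms": cuts[94],
    }


window = [
    Request(120, False), Request(95, False), Request(310, False),
    Request(870, True), Request(150, False), Request(205, False),
]
summary = red_summary(window, window_seconds=2)
print(summary)
```

Reporting percentiles rather than a single average keeps slow outliers, such as the 870 ms request here, visible.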
In summary, the RED method works well for request-driven applications such as API backends. The RED method was heavily influenced by Google's four golden signals approach, which we cover next.
The Four Golden Signals
The four golden signals approach defines the following four categories of metrics you can measure in an application:
- Latency: the time it takes to serve requests categorized by whether the request was successful or not.
- Traffic: A measure of the load on your application, e.g., requests per second.
- Errors: The rate of requests that fail.
- Saturation: A measure of how loaded your application is. This can be hard to measure unless you know the upper bounds of load that your application can handle, which you can find by running load tests. Therefore, a proxy measure of resource utilization is often used, for example, CPU load and memory consumption.
This method was conceived by Google's Site Reliability Engineering teams and popularized by the Site Reliability Engineering book.
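The following sketch shows one way the four signals could be derived from raw measurements. It differs from a RED-style summary in two ways taken from the definitions above: latency is split by whether the request succeeded, and saturation is included as a utilization proxy. The function name, the input shape, and the CPU utilization value are assumptions for illustration.

```python
from statistics import mean
from typing import List, Tuple


def golden_signals(requests: List[Tuple[float, bool]],
                   window_seconds: float,
                   cpu_utilization: float) -> dict:
    """requests holds (duration_ms, failed) pairs observed in the window."""
    succeeded = [duration for duration, failed in requests if not failed]
    errored = [duration for duration, failed in requests if failed]
    return {
        # Latency, tracked separately for successful and failed requests
        "latency_ok_ms": mean(succeeded) if succeeded else None,
        "latency_failed_ms": mean(errored) if errored else None,
        # Traffic: load on the application
        "traffic_per_second": len(requests) / window_seconds,
        # Errors: share of requests that failed
        "error_ratio": len(errored) / len(requests) if requests else 0.0,
        # Saturation: resource utilization used as a proxy (0.0 to 1.0)
        "saturation": cpu_utilization,
    }


observed = [(100, False), (200, False), (900, True), (300, False)]
signals = golden_signals(observed, window_seconds=2, cpu_utilization=0.65)
print(signals)
```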
How to monitor?
The previous sections laid the theoretical foundations of monitoring. In this section, you will learn about the tools and platforms for monitoring.
The landscape of tools and platforms for monitoring is rapidly expanding. Moreover, new terms are coined, which can be confusing to understand and make it hard to compare solutions. Therefore, it's helpful to think about the five stages of a metric:
- Instrumentation: exposing internal service metrics in your application code in a format that the monitoring tool can digest.
- Metrics collection: a mechanism by which the metrics are collected by or sent to the monitoring tool.
- Metrics storage and processing: the way that metrics are stored and processed in order to provide you with insights over time.
- Metrics visualization: a mechanism to visually represent the metrics in a human-readable way, typically as part of a dashboard.
- Alerting: a mechanism to notify you when anomalies or failures are detected by the metrics data and require your intervention.
The implementation of the five stages largely depends on the monitoring tool you choose. Monitoring tools can be divided into two categories: self-hosted or managed, and there are several trade-offs to consider.
Self-hosted monitoring tools come with the overhead of another infrastructure component to manage. Because they serve such an important role, failure of the monitoring tool can lead to unnoticed downtime. Moreover, your architecture may not be complex enough to justify self-hosting. On the other hand, monitoring data can be sensitive, and depending on the security policies, using a third-party hosted service may not be an option.
Sidenote: OpenTelemetry is an effort to create a single, open-source standard and a set of technologies to capture and export metrics, traces, and logs from your applications and infrastructure, so that you can analyze your software's performance and behavior. The standard reached v1.0 in February 2021, so expect some rough edges. Overall, though, it could simplify the developer workflows necessary to instrument and collect metrics and provide more interoperability between platforms and tools.
Now, let's look at some of the tools and services and how they relate to the five stages.
Prometheus, Grafana & Alertmanager
Prometheus is an open-source monitoring system and alerting toolkit. Internally, it has a time series database where metrics are stored. It also has a query language, PromQL, for querying and aggregating metrics.
Grafana is an open-source visualization tool that supports many data sources, including Prometheus. With Grafana, you can create dashboards to visualize metrics.
Typically, Prometheus is self-hosted; however, in recent years, hosted Prometheus services have come out, reducing the overhead associated with running it.
Prometheus encourages using the pull model, whereby Prometheus pulls metrics by making an HTTP call to the metrics endpoint that the instrumentation code in your application exposes.
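To show what the pull model asks of your application, here is a minimal sketch of a metrics endpoint that renders a counter in the Prometheus text exposition format using only the standard library. The metric name, the port, and the in-memory `REQUEST_COUNTS` dictionary are assumptions for illustration; in practice you would normally use an official Prometheus client library rather than formatting the output by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical counter, keyed by HTTP status code
REQUEST_COUNTS = {"200": 0, "500": 0}


def render_metrics() -> str:
    """Render the counter in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests served.",
        "# TYPE http_requests_total counter",
    ]
    for code, count in sorted(REQUEST_COUNTS.items()):
        lines.append(f'http_requests_total{{code="{code}"}} {count}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, serving /metrics for Prometheus to scrape."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


REQUEST_COUNTS["200"] += 3
print(render_metrics())
```

Calling `serve()` would start the endpoint; the application only exposes its current values, and the decision of when to collect them is left to Prometheus.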
The Prometheus ecosystem consists of multiple components:
- The Prometheus server which scrapes and stores time-series data
- Client libraries for instrumenting application code
- Prometheus Alertmanager to handle alerts
Practically, monitoring with Prometheus looks as follows:
- Instrument your application using the client libraries available for many popular languages, including Go, Node.js, and more. The client libraries provide you with four core metric types: counter, gauge, histogram, and summary, which you instantiate in your code based on what you want to measure. For example, you might add a request counter and increment it every time a request comes in. The last step is to create a metrics HTTP endpoint which Prometheus routinely scrapes.
- Deploy Prometheus server and configure the instrumented services to be scraped.
- Deploy Prometheus Alertmanager and configure alerts.
- Deploy Grafana, add Prometheus as a data source, and set up dashboards to visualize the metrics you're tracking.
Prometheus routinely makes an HTTP request to the configured services' metrics endpoints and stores the information in its time-series database. At a regular interval, it also evaluates the alerting rules against the stored metrics. If an alert condition is met, Prometheus sends the alert to Alertmanager, which takes care of notifying you.
Prometheus is extremely powerful and is well suited for a Microservices architecture. However, it requires running several infrastructure components (Prometheus, Alertmanager, Grafana) and might be overkill if you have a relatively simple architecture.
Sentry
Sentry is an open-source application monitoring platform that provides error tracking, monitoring, and alerting functionality. Sentry is unique in how it allows you to monitor both your frontend and backend, providing insight into the whole stack. Moreover, it helps you fix production errors and optimize your application's performance.
Sentry takes a holistic approach to monitoring under the premise that errors typically begin with code changes. In practice, Sentry collects data about both the internals of your application, e.g., unhandled errors and performance metrics, as well as metadata about releases, i.e., deployments. This approach is broader than that of a typical monitoring system and allows you to link errors and performance degradations to specific code changes.
In contrast to Prometheus, Sentry uses the push model to push errors and metrics to the Sentry platform.
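The push model can be sketched generically as a client that buffers events inside your application and sends them to the platform in batches. This is not Sentry's SDK or its wire format: the `PushClient` class, the event fields, and the batch size are hypothetical, and the transport is a plain callable so the example runs without a network.

```python
import json
from typing import Callable, List


class PushClient:
    """Buffers events and pushes them in batches through a transport callable."""

    def __init__(self, send: Callable[[bytes], None], batch_size: int = 10):
        self._send = send
        self._batch_size = batch_size
        self._buffer: List[dict] = []

    def capture(self, event: dict) -> None:
        self._buffer.append(event)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered; call this on shutdown as well."""
        if self._buffer:
            self._send(json.dumps(self._buffer).encode("utf-8"))
            self._buffer = []


# Stand-in transport: a real one would send the payload over HTTP.
sent_payloads: List[bytes] = []
client = PushClient(send=sent_payloads.append, batch_size=2)

try:
    1 / 0
except ZeroDivisionError as error:
    client.capture({"type": "error", "message": str(error)})
client.capture({"type": "metric", "name": "request_latency_ms", "value": 182})
print(len(sent_payloads))
```

With the push model the application decides when data leaves the process, which is why flushing on shutdown matters: anything still in the buffer would otherwise be lost.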
Practically, monitoring with Sentry's hosted platform looks as follows:
- Create a Sentry account
- Instrument your backend with Sentry's language-specific SDK. Once you initialize the SDK, it sends unhandled errors and collects metrics that you define about your application.
- Set up alerts to notify you when errors occur or when metrics are above a threshold.
Sentry supports two kinds of alerts: metric and issue alerts. Metric alerts are triggered when a given metric crosses a threshold you define. Issue alerts are triggered whenever Sentry catches an uncaught error in the application.
You can configure alerts to notify you via email or via one of the supported integrations.
New Relic
New Relic is an observability platform that provides features and tools to analyze and troubleshoot problems across your entire software stack.
Note: The line between observability and monitoring is often blurry. Because the two are closely related, it's worth clarifying the distinction. Observability can be seen as a superset of monitoring which includes, in addition to monitoring, traces and logs. While monitoring is used to report the overall health of systems, observability provides highly granular insights into the behavior of systems along with rich context, which makes it ideal for debugging.
New Relic's observability functionality is broader than just monitoring and includes four essential data types of observability:
- Metrics: numeric measurements about your application as defined earlier in the article
- Events: domain-specific events about your application. For example, in an e-commerce application, you might emit an OrderConfirmed event whenever a user makes an order.
- Logs: The logs emitted by your application
- Traces: Data of causal chains of events between different components in a microservices ecosystem.
Practically, monitoring with New Relic looks as follows:
- Create a New Relic account
- Instrument your application with the New Relic agent which sends metrics and traces to New Relic
- Define alerts and configure dashboards on New Relic
New Relic supports many different integrations that allow you to collect data from various programming languages, platforms, and frameworks, making it attractive if your architecture is complex and consists of components and services written in different languages.
How does monitoring relate to distributed tracing?
While the focus of this article is monitoring, it's worth mentioning distributed tracing as it often comes up in the context of monitoring and observability.
In a Microservices architecture, it's common for requests to span multiple services. Each service handles a request by performing one or more operations, e.g., database queries, publishing events to a message queue, and updating the cache. Developers working with such architectures can quickly lose sight of the global system behavior, making it hard to troubleshoot problems.
Distributed tracing is a method used to profile and monitor applications, especially those built using a Microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance.
It does so by assigning each external request a unique request ID, which gets passed to all services involved in handling the request. Each service records information (such as start time and end time) about the requests and operations it performs. The tracing tool collects the recorded information and visualizes it.
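The mechanism can be sketched in a few lines: the first service assigns an ID, every downstream call carries it in a header, and each service records a span with its start and end time. Everything here is illustrative. The `x-request-id` header name, the `traced` helper, and the two services are assumptions; real tracing systems define their own propagation headers (for example, the W3C Trace Context `traceparent` header) and span formats.

```python
import time
import uuid
from typing import Callable, Dict, List, Sequence

SPANS: List[dict] = []  # stand-in for the tracing tool's storage


def traced(service: str, operation: str, headers: Dict[str, str],
           downstream: Sequence[Callable[[Dict[str, str]], str]] = ()) -> str:
    """Record a span for this operation and propagate the request ID."""
    # Reuse the incoming ID, or assign one if this is the external entry point.
    request_id = headers.get("x-request-id") or uuid.uuid4().hex
    start = time.time()
    for call in downstream:
        call({"x-request-id": request_id})  # pass the ID to the next service
    SPANS.append({
        "request_id": request_id,
        "service": service,
        "operation": operation,
        "start": start,
        "end": time.time(),
    })
    return request_id


def database_service(headers: Dict[str, str]) -> str:
    return traced("database", "select posts", headers)


def api_service(headers: Dict[str, str]) -> str:
    return traced("api", "GET /posts", headers, downstream=[database_service])


request_id = api_service({})  # the external request arrives without an ID
for span in SPANS:
    print(span["service"], span["operation"], span["request_id"] == request_id)
```

Because both spans share the same request ID, a tracing tool can group them into one trace and show how much of the API request's time was spent in the database call.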
Distributed tracing complements monitoring with a subtle but fundamental distinction. While monitoring helps ensure the reliability of specific services, distributed tracing can help you understand and debug the relationship between services. In other words, tracing is suitable for debugging microservices architecture, where the relationships between services can lead to bottlenecks and errors.
In this article, you learned about the best practices for monitoring your application. It began with the foundations, then delved into how monitoring fits into the development workflow, what to monitor, and the tools that help you do so.
Choosing a monitoring tool or platform can be tricky. Therefore it's crucial to understand the principles behind monitoring, as it allows you to make more informed choices.
Monitoring alone will not make your application immune to failure. Instead, it will provide you with a panoramic view of system behavior and performance in production, allowing you to see the impact of any failure and guiding you in determining the root cause.
In summary, monitoring is a critical aspect of software development and an essential skill in enabling rapid development while ensuring the reliability and performance of your application.