Written by Shuwethaa Prakesh.
Observability is an important consideration for organisations seeking to understand the performance and reliability of their systems. It is achieved by monitoring a system's metrics, logs, and traces to determine the key issues preventing it from performing at an optimal level. Datadog is a product that provides this through capabilities such as infrastructure monitoring, application performance monitoring (APM), log management, and real-user monitoring (RUM). In this blog, we will cover Datadog's different capabilities as well as its limitations to determine whether it is a viable option for observability.
Capabilities
Infrastructure
Datadog can monitor and analyse the performance of hosts, containers, and processes through infrastructure monitoring. An example of a feature covered by Datadog is infrastructure mapping, which provides an overall view of hosts and containers in a single dashboard and uses colour to give a visual indication of their CPU usage.
Containers
In addition to an overview of hosts, Datadog also provides infrastructure monitoring for containers. The containers page displays real-time metrics in a table view, outlining metrics such as total CPU usage, memory usage, bytes sent and received, and when the container was started.
Processes
For enterprises, Datadog also includes the ability to view which processes are running on the infrastructure. Through this, a user can determine how many resources each process is consuming on each host or container, allowing them to optimise any part of a service that is consuming too much. A benefit of Datadog over other observability software is that it provides real-time data collection at a 2-second resolution, which offers more detail than most alternatives, which often collect at a rate of around 10 seconds. This makes Datadog a strong candidate when choosing observability software.
Overall infrastructure monitoring is also incredibly easy to configure: all that is required is installing the Datadog Agent. Once that is done, additional monitoring for containers and processes can be enabled through the Datadog Agent's configuration YAML file.
Logs
Datadog log management collects, processes, archives, and monitors logs, sending them through to the dashboard. Traditionally, most products require users to pay for a daily volume of logs, which is often costly for businesses. To manage this, users filter logs before collecting them, which can lead to valuable data being accidentally filtered out. Datadog instead separates log ingestion from indexing (which is normally used for sorting and searching), creating two layers: one for ingestion and one for indexing. The benefit is greater flexibility: filters can be chosen, logs can be archived, and the setup can be tailored to the organisation's needs. Datadog calls this feature "Logging without Limits", and it can help with costs because the organisation controls which logs to index rather than indexing them all. Datadog can also integrate with common logging frameworks such as Rsyslog, Syslog-ng, NXlog, FluentD, and Logstash.
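To illustrate the ingestion side, a service can simply write structured (JSON) logs to a file, and the Datadog Agent can be configured to tail that file and forward the entries. The sketch below is a minimal, hypothetical Python example; the log path, attribute names, and the Agent-side configuration that tails the file are assumptions rather than anything prescribed by Datadog.

```python
import json
import logging

# Minimal sketch: emit JSON-formatted logs to a file that the Datadog Agent is
# assumed to tail. The path and attribute names are illustrative only.
class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.FileHandler("/var/log/my-app/app.log")  # hypothetical path
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("my-app")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("checkout completed")
logger.error("payment gateway timed out")
```

Because ingestion and indexing are separate, everything written this way can still be ingested and archived, while exclusion filters decide what actually gets indexed.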
Metrics
Metrics help track an environment over time through values such as error rates, latency, CPU usage, and disk usage. They are an important pillar of observability, as they provide a general overview of a system's health.
Datadog provides out-of-the-box metric integrations for around 700 different technologies. To configure these, Datadog provides specific documentation to follow, which generally involves adding a YAML file for the technology in the Datadog Agent folder and then editing the Datadog Agent's YAML file as well.
For custom metrics, the user can define metrics about specific aspects of an application to provide more insight than the standard integrations. This is useful because users can tailor the metrics to specific scenarios and therefore gain deeper visibility into the performance and behaviour of their applications. For example, jobs can be monitored by sending a metric for each successful or failed request. The most straightforward way to do this is through DogStatsD, a metrics aggregation service included with the Datadog Agent that uses the StatsD protocol and adds the following Datadog-specific extensions: service checks, tagging, events, and the histogram metric type. The collected metrics are periodically sent over UDP, meaning that the application is not interrupted if DogStatsD becomes unavailable.
An example of metrics received through Datadog. This is an out-of-the-box metric for Wildfly.
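As a minimal sketch of what sending custom metrics through DogStatsD looks like in Python, the snippet below uses the official datadog client pointed at the Agent's default DogStatsD address; the metric names and tags are hypothetical.

```python
from datadog import initialize, statsd

# Minimal sketch: send custom metrics to the DogStatsD server bundled with the
# Datadog Agent. Host/port are the DogStatsD defaults; metric names and tags
# are hypothetical.
initialize(statsd_host="127.0.0.1", statsd_port=8125)

def record_job_result(succeeded: bool) -> None:
    """Count successful and failed runs of a (hypothetical) nightly job."""
    if succeeded:
        statsd.increment("batch_job.success", tags=["job:nightly-export"])
    else:
        statsd.increment("batch_job.failure", tags=["job:nightly-export"])

record_job_result(succeeded=True)

# Track how long a request took (in milliseconds) as a histogram metric.
statsd.histogram("checkout.request.duration", 420, tags=["endpoint:/checkout"])
```

Because these calls are fire-and-forget UDP packets, they do not block the application even if the Agent is temporarily unreachable.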
Application Performance Monitoring (APM)
In Datadog, APM includes a variety of features that provide in-depth insight into the overall performance and behaviour of applications and systems. APM instrumentation is automatic for some programming languages and frameworks, which enables the user to start monitoring their applications easily without adding any instrumentation code. All that is required is the Datadog Agent and the tracing libraries, which handle the instrumentation themselves.
The homepage for Datadog APM displays a list of applications as above.
Additionally, APM captures distributed traces, which provide a detailed view of the flow of requests through an application stack. These traces are represented through the use of flame graphs, which can be utilised to determine performance bottlenecks and pinpoint the cause of latency in different parts of an application. The traces also allow a user to see the flow of requests between microservices. This provides an understanding of how requests flow through different components and can consequently highlight any dependencies between these components, allowing troubleshooting to be easier as a request’s timeline can be followed from one microservice to another.
Within distributed traces, Datadog includes granular information about database queries, execution paths, timings, and more. This level of visibility allows a user to gain a variety of insights into how code behaves in different production environments, helping to determine the exact cause of any errors or latency.
The service summary page of an application.
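While automatic instrumentation covers many frameworks out of the box, application-specific steps can also be captured as custom spans with the ddtrace library. The sketch below is only illustrative; the operation, service, and tag names are made up.

```python
from ddtrace import tracer

# Rough sketch: wrap an application-specific step in a custom span so it shows
# up in the distributed trace and its flame graph. Names are hypothetical.
def process_order(order_id: str) -> None:
    with tracer.trace("order.process", service="checkout") as span:
        span.set_tag("order.id", order_id)
        # ... application logic: validate the order, charge the card, persist ...

process_order("ord-1234")
```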
Datadog also collects metrics involved with the resource use and performance of an application such as CPU usage, memory usage, latency, and error rates. These can be used in conjunction with traces to analyse and make note of performance bottlenecks or high error rates to optimise a service.
Another draw of Datadog's APM is that it includes alerting capabilities. A user can easily set up alerts for when metrics surpass a specified threshold or when anomalies are detected in the application metrics. This can be done directly through the Datadog UI, demonstrating its ease of use. Alerts allow teams to respond faster to incidents or issues in an application, as the real-time metrics can trigger an alert instantly.
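Although alerts are usually created through the UI, they can also be defined programmatically. The following is a rough sketch using the datadogpy client; the API/app keys, query, threshold, and notification handle are all placeholders.

```python
from datadog import initialize, api

# Rough sketch: create a metric monitor via the Datadog API with the datadogpy
# client. Keys, the query, the threshold, and the @-handle are placeholders.
initialize(api_key="<DATADOG_API_KEY>", app_key="<DATADOG_APP_KEY>")

api.Monitor.create(
    type="metric alert",
    query="avg(last_5m):avg:trace.flask.request.errors{service:checkout} > 5",
    name="Checkout error rate is high",
    message="Error rate above threshold, please investigate. @slack-checkout-alerts",
    tags=["service:checkout", "env:prod"],
)
```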
Unified Service Tagging
Datadog provides “unified service tagging”, which essentially means that metrics, traces, and logs can be linked together through three tags that Datadog has reserved: service, env (environment), and version. In addition, other custom tags can be created by a user to help navigate search results when filtering through metrics, traces, or logs. Having this means of navigation is helpful because users can troubleshoot more efficiently through faster data retrieval. It also helps an organisation set and maintain a consistent standard for which tags to use and how to label them. Furthermore, unified service tagging is relatively simple to configure for a service, reducing the barrier to entry for newer Datadog users.
A more detailed look into a specific trace. It shows the tags set and the flame graph bar for the trace requests.
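As a small, hedged sketch of how the reserved tags tie telemetry together: they are commonly supplied through the DD_ENV, DD_SERVICE, and DD_VERSION environment variables so the tracing library and Agent apply them consistently, and they can also be attached explicitly to custom metrics. The service name and version below are made up.

```python
import os
from datadog import initialize, statsd

# Small sketch: attach the three reserved tags (env, service, version) to a
# custom metric so it lines up with traces and logs from the same service.
# The fallback values are illustrative only.
UNIFIED_TAGS = [
    f"env:{os.getenv('DD_ENV', 'prod')}",
    f"service:{os.getenv('DD_SERVICE', 'checkout')}",
    f"version:{os.getenv('DD_VERSION', '1.4.2')}",
]

initialize(statsd_host="127.0.0.1", statsd_port=8125)
statsd.increment("checkout.completed", tags=UNIFIED_TAGS)
```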
Real-User Monitoring (RUM)
Real User Monitoring (RUM) is a Datadog feature that provides observability for the frontend of an application. To activate it, the user injects some configuration into the application code so that performance data can be collected.
Datadog's RUM capability monitors the performance of an application by collecting metrics such as page load times, network latency, error rates, and user actions (e.g. clicks and scrolls). These metrics can then be segmented by categories such as browser, device type, or location, allowing an individual to analyse trends and improve performance for those segments.
The general layout for a RUM dashboard.
In addition to these general metrics, a useful feature of RUM is geographical insights. This provides a map view displaying where the users interacting with the application are located, which is valuable because performance issues can be addressed for specific regions.
Service-Level Objectives
To understand Service Level Objectives (SLOs), we must first cover what Service Level Indicators (SLIs) are. SLIs are quantitative measurements of a service's performance, such as latency, availability, and error rate. SLOs are specific targets for SLIs set by a business, generally expressed as a percentage over a period of time. For example, a business may aim to achieve a latency of less than 2 seconds for checkout requests 99% of the time.
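To make that example concrete, here is a small worked sketch (with made-up request latencies) of how such a latency SLI could be computed and checked against the 99% target:

```python
# Small worked sketch with made-up latencies (in seconds) for checkout requests.
latencies = [0.4, 1.1, 0.9, 2.6, 0.7, 1.8, 0.5, 3.2, 0.6, 1.2]

threshold_seconds = 2.0   # target: a checkout request should finish in under 2 s
slo_target = 0.99         # 99% of requests should meet the threshold

good_requests = sum(1 for latency in latencies if latency < threshold_seconds)
sli = good_requests / len(latencies)  # the measured indicator

print(f"SLI: {sli:.2%}, target: {slo_target:.0%}, met: {sli >= slo_target}")
# With 8 of 10 requests under 2 s, the SLI is 80% and the 99% SLO is missed.
```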
The setup for SLOs is fairly straightforward and can be done through Datadog's user interface: a user selects the SLO status page from the menu and follows the prompts. Datadog also allows teams to create alerts when performance falls below or does not meet the defined SLOs, so that businesses can respond to incidents promptly.
Dashboards
Datadog dashboards provide a visual representation of data to help users track trends and monitor key performance indicators (KPIs). This allows a business to stay on top of the overall performance and health of an application and to maintain an understanding of the impact of incidents and of the fixes enacted by their application teams.
New dashboards can be configured by selecting “New Dashboard” from the Dashboard List page and then customised to a specific layout. Alternatively, a benefit of Datadog is that it provides out-of-the-box dashboards for various technologies and integrations. These can be cloned and used as a guideline for customising the dashboard to match the business's goals.
An example of an out-of-the-box dashboard. This is for an Oracle Database.
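Dashboards can also be created programmatically. The snippet below is a rough sketch using the datadogpy client; the keys, titles, and metric query are placeholders rather than a recommended setup.

```python
from datadog import initialize, api

# Rough sketch: create a simple timeseries dashboard via the Datadog API with
# the datadogpy client. Keys, titles, and the query are placeholders.
initialize(api_key="<DATADOG_API_KEY>", app_key="<DATADOG_APP_KEY>")

api.Dashboard.create(
    title="Checkout service overview",
    description="Key health metrics for the checkout service.",
    layout_type="ordered",
    widgets=[
        {
            "definition": {
                "type": "timeseries",
                "title": "Average request latency",
                "requests": [{"q": "avg:trace.flask.request.duration{service:checkout}"}],
            }
        }
    ],
)
```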
Limitations
Integrations
While the setup for Datadog is relatively simple, it is restricted by the technologies it can cover. For example, a service running on IBM mainframes cannot be monitored through Datadog, even with the custom metrics it offers. Additionally, from experience on a previous project, most services that ran on Windows PowerShell had to be monitored through custom metrics and could not collect standard APM metrics, resulting in higher costs than a basic integration.
Cost
The largest disadvantage of Datadog is its cost to run. As previously mentioned, Windows PowerShell applications required custom metrics for effective monitoring. Datadog bills custom metrics based on their cardinality, meaning each unique combination of metric name and tag values counts as a separate custom metric, so reporting more unique combinations increases the cost. Without proactive management, these costs can therefore grow quickly.
Logging can also pose a cost concern if it isn't properly managed, because all log data sent to Datadog is ingested and billed. Even if the logs are never actively used by the team, it still costs money for Datadog to process, store, and index them. For companies with usage-based products, this cost can escalate rapidly once log volumes exceed the agreed commitments and the company has to pay on-demand rates.
Additionally, the pricing model is quite complex to navigate, as there are numerous products whose costs are described in different ways. For example, one product may be priced per 1K events while another could be per GB or even per 100K events. This can make it difficult to translate usage into cost and to predict how much a particular application will spend on Datadog.
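To illustrate why mixed units make estimation awkward, the toy calculation below combines three differently priced products. The unit prices are entirely made up for illustration and are not Datadog's.

```python
# Toy illustration of translating usage into cost across mixed pricing units.
# All unit prices below are made up and are NOT Datadog's actual pricing.
rum_sessions = 2_500_000          # billed per 1K sessions in this example
ingested_log_gb = 800             # billed per GB
indexed_log_events = 40_000_000   # billed per 1M events

price_per_1k_sessions = 1.50      # hypothetical
price_per_gb_ingested = 0.10      # hypothetical
price_per_1m_indexed = 1.70       # hypothetical

total = (
    rum_sessions / 1_000 * price_per_1k_sessions
    + ingested_log_gb * price_per_gb_ingested
    + indexed_log_events / 1_000_000 * price_per_1m_indexed
)
print(f"Estimated monthly cost: ${total:,.2f}")
```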
Conclusion
Datadog's ease of use and consolidation of metrics gives an individual access to a vast amount of data to improve the performance and reliability of a service, making it a very good option. To manage the cost, a user will likely need to spend some time pinpointing areas of possible cost wastage in order to maximise the benefits that Datadog has to offer. Of the various observability products to choose from, Datadog is one of the strongest on offer.