OpenTelemetry from a bird’s eye view: a few noteworthy parts of the project
OpenTelemetry provides you with a set of tools, integrations, APIs, and SDKs in different languages to more easily increase the observability of your application. We figured that, since we’re working on an OpenTelemetry agent extension called Mesmer, we could show you the project from a developer’s perspective and point you to the parts that could be interesting, especially if you’re just getting started with the project. Let’s dig in!
What is Telemetry Data?
Just to have a common understanding of the term: telemetry data contains all sorts of information on the use and performance of applications and application components. Analyzing it can give you the answers to the following (example) questions:
- How often do we use a particular feature? Is the feature popular or maybe no one uses it?
- Is my production application even working, or is it broken and losing us money?
- What is the HTTP request rate for a particular endpoint? Can the application handle the demand?
- How long does an SQL query take? Does it require optimization?
- What are the steps my application takes to fulfil a request? How long does each step take?
- Are there any errors reported in the application? What’s the reason for them?
- … (insert your own diagnostic-related question here)
Being able to answer questions of this sort can bring a lot of value to your application. No wonder telemetry-providing solutions are popular. The answers can help not only the developers who wrote the application but also their managers, and they can even shift the business direction of the company.
What are examples of tools for collecting the Telemetry Data?
Telemetry data can be categorized into three main types of signals: metrics, traces, and logs. There are plenty of tools that let you collect and display these signals. For example:
- Prometheus. This is a tool for collecting metrics and storing them in a time-series database. It also lets you set alerts based on the gathered data, query it, and display the time series in a human-readable form.
- Jaeger or Zipkin. These are tools for collecting and displaying distributed traces.
- Graylog or Logstash. Their goal is to collect logs into one central place so that you can query them easily.
All of these tools have their own APIs, are installed differently, and don’t have a standard set of conventions, naming, etc. This can become a maintenance problem when you need all of the signals. So there’s a need to mitigate the burden of working with multiple tools for telemetry collection and here’s where the OpenTelemetry project jumps in.
How can OpenTelemetry help?
First of all – OpenTelemetry is not a telemetry backend like the tools above. You can think of it as being “one layer above”. It’s a set of tools that aims to facilitate the use of the backends. How does it make using such tools easier?
OpenTelemetry aims to be the lingua franca when it comes to collecting telemetry data. The project’s goal is to provide ways to collect metrics, traces, and logs in multiple programming languages in the same way. So when you decide to use OpenTelemetry, the promise is that the same naming, concepts, abilities, conventions, etc. will be there in every language you use in your project. It brings universality to the telemetry world (in a way, similar to Kubernetes trying to bring universality to the world of application deployment or Apache Beam to the world of big data processing).
Something worth stressing is that the APIs and SDKs for all the signals are independent of the backends you use for collecting the signals. You don’t have to worry about tight coupling with the solution you choose for storing the telemetry data (e.g. Prometheus). The same goes for the setup: in your project, you always set up OpenTelemetry the same way and there are standard ways for connecting with the telemetry backends downstream (described below).
Next, OpenTelemetry gives you two modes of operation: you can either use the OpenTelemetry API to instrument your application manually, or you can use the automatic instrumentation that has already been implemented for some languages. With the former, you have to write some code to collect the telemetry data, but it’s you who decides what specific data is collected. With the latter, at least on the JVM platform, you just add a Java Agent to your application. The agent detects the libraries you use and enriches them with metrics and traces (using bytecode manipulation techniques), so a set of standard signals is collected for you. If you happen to use libraries already instrumented by OpenTelemetry, you literally just need to add the agent to your application and set up the telemetry data collection to have the data displayed in the backend of your choice.
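To make the difference between the two modes more tangible, here is a minimal sketch in plain Python. It does not use the OpenTelemetry API: `record`, `instrument`, and `HttpClient` are hypothetical names invented for this illustration, and runtime patching of a class only stands in as an analogy for what the Java Agent does with bytecode manipulation.

```python
import functools
import time

# Toy in-memory "backend"; a real setup would hand these values to an SDK.
recorded = []


def record(name, value):
    """Store one measurement as a (name, value) pair."""
    recorded.append((name, value))


# Manual instrumentation: the application author decides what to record.
def checkout(cart):
    record("checkout.items", len(cart))
    return sum(cart)


# A stand-in for a third-party library that knows nothing about telemetry.
class HttpClient:
    def get(self, url):
        return "response from " + url


def instrument(cls, method_name):
    """Wrap cls.method_name so every call is timed, without touching its source."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            record(cls.__name__ + "." + method_name + ".duration", elapsed)

    setattr(cls, method_name, wrapper)


# Automatic instrumentation: an "agent" patches the library at startup.
instrument(HttpClient, "get")

total = checkout([5, 7])
body = HttpClient().get("http://example.test")
```

In the manual case the measurement lives in the application code. In the automatic case neither the application nor the library source changes, which is the property that makes the agent approach attractive.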
What is the OpenTelemetry Specification and why should I care?
The OpenTelemetry project does not start with code and tools. To be able to develop universal concepts in any language it is crucial to have a single source of truth. That’s what the OpenTelemetry Specification is for. It holds a language-agnostic description of all the cross-language requirements and their data models.
For example: from the specification repository you will get to know how to implement metrics support according to the OpenTelemetry standards. You will get to know what the main metric concepts mean (such as Meter Provider, Measurement, or Counter), what they consist of, what the design goals are, etc. Every time you want to understand a monitoring-related concept that is not language-specific, you can be sure it’s described in the specification repository. It’s also a great starting point when it comes to learning in general.
The community can propose changes to the specification, and every proposal has to go through a review process. The specification is versioned, and every specification document has a status assigned according to strictly defined rules. All of this effort is necessary to provide a standard modus operandi so there’s as little chaos as possible.
Besides documenting the cross-cutting concepts, the repository holds a Compliance Matrix which shows which language supports what parts of the specification. Be sure to check it out whenever you wish to adopt OpenTelemetry.
OpenTelemetry APIs and SDKs
Based on the Specification, the APIs and SDKs are implemented. There’s a noteworthy distinction between the two:
- APIs consist of all the abstractions used for instrumentation, clearly decoupled from their actual implementations. The APIs do not contain the working functionality (they are only there to define what is going to be collected).
- SDKs consist of all the parts that actually implement the APIs and provide the working functionality for collecting and exporting all the signal data.
Such decoupling is important because of the following reasons:
- Third-party tool and library developers import only the API and should not know anything about the final implementations. They need only minimal knowledge to add OpenTelemetry to their product (that’s good!).
- It’s the final library/tool user (application owner) that chooses and configures the SDK according to their needs. In particular, they can even choose not to add the SDK to their app, thus opting out of the OpenTelemetry signals provided by the library author.
- There are multiple ways of doing the same thing expressed by the API. For example, one can decide to export OpenTelemetry data directly to Prometheus (by using the Prometheus exporter SDK) or via the OTLP protocol (by using the OTLP exporter SDK).
- The API and SDK parts can also have different maturity levels. For example, the Java Metrics SDK was announced as Stable quite recently. To see the maturity levels for all APIs and SDKs make sure you take a look at the specific component repository.
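The decoupling described above can be sketched with a toy model in plain Python. This is not the real OpenTelemetry API: every class and function name below (`Counter`, `NoopMeterProvider`, `InMemoryMeterProvider`, `set_meter_provider`, `get_counter`) is hypothetical and chosen only to mirror the idea that a library codes against an abstraction while the application owner decides whether a working implementation is installed.

```python
# --- "API": abstractions only, no working functionality ---------------------
class Counter:
    def add(self, amount):
        raise NotImplementedError


class NoopCounter(Counter):
    def add(self, amount):
        pass  # deliberately does nothing


class NoopMeterProvider:
    def counter(self, name):
        return NoopCounter()


_provider = NoopMeterProvider()  # default when no "SDK" is installed


def set_meter_provider(provider):
    global _provider
    _provider = provider


def get_counter(name):
    return _provider.counter(name)


# --- "SDK": a working implementation chosen by the application owner --------
class InMemoryCounter(Counter):
    def __init__(self):
        self.value = 0

    def add(self, amount):
        self.value += amount


class InMemoryMeterProvider:
    def __init__(self):
        self.counters = {}

    def counter(self, name):
        if name not in self.counters:
            self.counters[name] = InMemoryCounter()
        return self.counters[name]


# --- Library code: depends on the "API" only --------------------------------
def handle_request():
    get_counter("requests.handled").add(1)
    return "ok"


handle_request()  # no SDK installed: the call is a harmless no-op

sdk = InMemoryMeterProvider()
set_meter_provider(sdk)  # the application owner opts in
handle_request()
handle_request()
requests_seen = sdk.counters["requests.handled"].value
```

Note that `handle_request` never changes: the same library code is silent without an implementation and starts producing data once one is installed.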
The Exporters and SDK autoconfiguration
An important part of the SDK is the exporters. After collecting the telemetry signals from your application, either directly (using the manual instrumentation approach) or indirectly (using the auto-instrumentation), you have to actually emit them. To do that, you have to use an SDK exporter and configure it to send data to a particular destination. Such a destination can be a telemetry backend of your choice (such as Prometheus, New Relic or Jaeger) or an OpenTelemetry collector (explained below).
For the JVM platform, the OpenTelemetry contributors have made developers’ lives even easier by creating SDK Autoconfiguration. This is a tool I totally recommend instead of configuring the exporter and the SDK manually. It comes with sensible defaults so it might be that you only have to alter a few properties and you’re good to go.
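The "sensible defaults plus a few overrides" idea can be sketched as follows. This is a toy model in plain Python, not the autoconfiguration module itself: the keys follow the `OTEL_*` environment variable naming used by OpenTelemetry, but the default values below are placeholders for illustration and `resolve_config` is a hypothetical helper, so check the documentation of your SDK version for the real defaults.

```python
import os

# Placeholder defaults for illustration only; real defaults depend on the
# SDK version you use.
DEFAULTS = {
    "OTEL_SERVICE_NAME": "unknown_service",
    "OTEL_TRACES_EXPORTER": "otlp",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
}


def resolve_config(env=None):
    """Return the defaults, overridden by whatever the environment provides."""
    env = os.environ if env is None else env
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}


# Only two properties are altered; everything else falls back to the defaults.
config = resolve_config({
    "OTEL_SERVICE_NAME": "checkout",
    "OTEL_METRICS_EXPORTER": "prometheus",
})
```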
You can export the signals to the backend of your choice directly from the OpenTelemetry exporter in your application. This is in fact the quickest way to get value from OpenTelemetry since such a setup is quite straightforward. This is, however, not the only option. You can also export the telemetry data from your application to a tool called “the collector”.
The OpenTelemetry collector allows you to receive, process, and export telemetry data in a vendor-agnostic way. The goal here is to unify telemetry data processing so that you don’t have to set up communication with multiple telemetry tools separately. You just configure one collector that is able to communicate with all the tools, and run it alongside your application (e.g. in a Docker container). Your telemetry signals are then forwarded for you in the protocols each downstream backend understands.
Another great benefit of the collector is that it provides you with some extra functionality you normally might not have, such as retrying, batching or encryption.
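To give a feeling for what batching and retrying mean in practice, here is a minimal sketch in plain Python. It is not how the collector is implemented: `BatchingExporter` and `flaky_send` are hypothetical names, the retry has no backoff or timeout, and the batch size of 3 is an arbitrary choice for the example.

```python
class BatchingExporter:
    """Buffer items and send them in batches, retrying failed sends."""

    def __init__(self, send, max_batch_size=3, max_attempts=3):
        self._send = send
        self._max_batch_size = max_batch_size
        self._max_attempts = max_attempts
        self._buffer = []

    def export(self, item):
        self._buffer.append(item)
        if len(self._buffer) >= self._max_batch_size:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        for attempt in range(1, self._max_attempts + 1):
            try:
                self._send(batch)
                return
            except ConnectionError:
                # Give up only after the last attempt has failed.
                if attempt == self._max_attempts:
                    raise


sent_batches = []
calls = {"count": 0}


def flaky_send(batch):
    """Simulated backend that fails on the very first call only."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("backend unavailable")
    sent_batches.append(list(batch))


exporter = BatchingExporter(flaky_send)
for point in [1, 2, 3, 4]:
    exporter.export(point)
exporter.flush()  # send whatever is still buffered
```

Four data points result in two outgoing batches instead of four separate sends, and the temporary backend failure is invisible to the code that produced the data.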
You typically configure an OpenTelemetry collector with a YAML file that consists of the following sections:
- Receivers: these allow you to declare in what format and where you want to receive the telemetry data from.
- Processors: these can be run on the data you have just received to transform it.
- Exporters: these allow you to send the data to the backend of your choice.
- Extensions: some additional functionalities of the collector that are not strictly related to the telemetry data processing.
After you declare the above, you can form the telemetry data processing pipelines in the “service” section.
An example collector configuration could look like this (see the comments for clarification):
receivers: # This section declares what protocols and entry points will be used by the collector
  otlp: # usually, when using only OpenTelemetry in your app, you receive the telemetry data in OTLP format...
    protocols:
      grpc:
  kafka: # ... however that doesn't mean you can't have other sources of data, such as kafka in this case.
    protocol_version: 2.0.0
exporters: # This section declares where the data will go
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: promexample
  logging: # note you can simply log the data to the collector's console window for debugging purposes
    loglevel: debug
  jaeger:
    endpoint: jaeger-all-in-one:14250
processors: # Additional processing can be added
  batch: # better compresses the data and reduces the number of outgoing connections required for transmitting data
    timeout: 5s # Send the batch after 5s no matter what
  memory_limiter: # prevents "out of memory" errors
    check_interval: 1s
    limit_mib: 4000
  attributes: # you can also manipulate attributes
    actions:
      - key: my.custom.attribute
        action: insert
        value: foobar
extensions:
  health_check: # this will allow the status of the collector to be checked
  pprof: # for collector performance profiling
    endpoint: :1888
service: # Grande finale: defines what you will actually use. Construct the data processing pipelines from receivers, processors and exporters
  extensions: [ pprof, health_check ]
  pipelines:
    metrics:
      receivers: [ otlp, kafka ]
      processors: [ batch, attributes, memory_limiter ] # the metrics attributes will be processed by the collector here...
      exporters: [ prometheus, logging ]
    traces:
      receivers: [ otlp ]
      processors: [ batch, memory_limiter ] # ...but we skipped processing the attributes for traces
      exporters: [ jaeger, logging ]
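The way the pipelines in the configuration above route data can be modelled with a few lines of plain Python. This is only a toy model under simplifying assumptions, not collector code: `insert_attribute`, `pipelines`, and `receive` are hypothetical names, the exporters are plain lists, and only the `attributes` processor is modelled (the `insert` action adds a key only when it is not already present).

```python
def insert_attribute(key, value):
    """Build a processor that adds key=value unless the key already exists."""
    def process(item):
        attributes = dict(item.get("attributes", {}))
        attributes.setdefault(key, value)
        return {**item, "attributes": attributes}
    return process


# Stand-ins for the exporters: each pipeline just appends to a list.
metrics_out, traces_out = [], []

# Mirrors the two pipelines: metrics get the attribute, traces do not.
pipelines = {
    "metrics": {
        "processors": [insert_attribute("my.custom.attribute", "foobar")],
        "exporters": [metrics_out.append],
    },
    "traces": {
        "processors": [],
        "exporters": [traces_out.append],
    },
}


def receive(signal, item):
    """Run one received item through the pipeline declared for its signal."""
    pipeline = pipelines[signal]
    for process in pipeline["processors"]:
        item = process(item)
    for export in pipeline["exporters"]:
        export(item)


receive("metrics", {"name": "http.requests", "value": 1})
receive("traces", {"name": "GET /orders"})
```

The point of the model is that declaring a component does nothing by itself: only components referenced in a pipeline take part in processing, and each signal type has its own chain.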
The OpenTelemetry project is part of the Cloud Native Computing Foundation, a foundation that hosts multiple Open Source projects (for example Kubernetes or Prometheus) and gathers developers from around the world to work on them. OpenTelemetry was accepted by the foundation in May 2019 and is currently at the “incubating” maturity level. OpenTelemetry itself has a moderately large community of developers: the GitHub organization lists 135 people, but of course they are not the only contributors.
To reach the community, you can use the Slack channel, or the mailing lists – see this link to get access. From my experience, the community is quite responsive but this may vary depending on the language you use.
We have touched on some basic terms when it comes to telemetry in general. We have very briefly looked at the values and goals of the project and at how they are put into practice, with some hints included along the way. Although OpenTelemetry is still being developed and is still rough around the edges here and there, make sure that you check it out as a candidate for your next project. It might be that it already fulfils your requirements in a stable way and will take some of the monitoring burden off your shoulders.
Extra: what is an OpenTelemetry Agent Extension?
At the beginning of this article, I mentioned Mesmer, which is an OpenTelemetry agent extension but I didn’t explain what it is. A longer explanation can be found in my previous blog post. In short, it is “a plugin” that you can attach to a JVM OpenTelemetry Agent to enrich it with new functionalities. In the case of the Mesmer project, the “new functionalities” are new metrics that you normally would not have when running your app with a bare OpenTelemetry Agent. Currently, this extension mostly supports Akka but more metrics are in the pipeline. If this sounds useful to you, be sure to check it out too!