Observability

Observability [1] refers to the ability to understand how your software is working, generally by analyzing the telemetry that it generates.

An ambient mesh enables observability of the performance and health of the services that use it through telemetry data generated by ztunnel and waypoint proxies.

Ambient mesh telemetry

Telemetry data is generated by ztunnel and waypoint proxies, when enabled, in three primary categories (or signals):

  • Metrics. Istio generates a set of service metrics based on the four “golden signals” of monitoring (latency, traffic, errors, and saturation).
  • Access logs. As traffic flows into a service within a mesh, Istio can generate a full record of each request, including source and destination metadata. This information enables operators to audit service behavior down to the individual workload level.
  • Distributed traces. Istio generates distributed trace spans for each service, providing operators with a detailed understanding of call flows and service dependencies.

Metrics

Metrics provide a way of monitoring and understanding behavior in aggregate. The two layers of ambient mesh generate different metrics:

  • ztunnel generates TCP metrics for all service traffic.
  • Waypoint proxies generate metrics for all traffic, including request and response metrics for HTTP, HTTP/2, and gRPC traffic.

To monitor service behavior, Istio generates metrics for all service traffic into, out of, and within an Istio service mesh. These metrics provide information on behaviors such as the overall volume of traffic, the error rates within the traffic, and the response times for requests.
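As a rough illustration of the error-rate idea, the following Python sketch builds a PromQL expression over Istio's standard `istio_requests_total` counter. The helper name `error_rate_query`, the example service name, and the 5-minute window are assumptions made for illustration. In an ambient mesh, HTTP request metrics like this one would come from a waypoint proxy, since ztunnel reports TCP metrics only.

```python
def error_rate_query(destination_service: str, window: str = "5m") -> str:
    """Build a PromQL expression for the share of requests answered with a 5xx.

    Hypothetical helper for illustration. Only the metric and label names
    (istio_requests_total, destination_service, response_code) come from Istio.
    """
    selector = f'destination_service="{destination_service}"'
    # Per-second rate of requests whose response code starts with 5.
    errors = f'sum(rate(istio_requests_total{{{selector},response_code=~"5.."}}[{window}]))'
    # Per-second rate of all requests to the same destination.
    total = f'sum(rate(istio_requests_total{{{selector}}}[{window}]))'
    return f"{errors} / {total}"


# Example service name is an assumption, not taken from a real cluster.
query = error_rate_query("reviews.default.svc.cluster.local")
print(query)
```

The resulting string can be pasted into the Prometheus expression browser or used as a Grafana panel query.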

In addition to monitoring the behavior of services within a mesh, it is also important to monitor the behavior of the mesh itself. Istio components export metrics on their own internal behaviors to provide insight on the health and function of the mesh control plane.

Access logs

Access logs provide a way to monitor and understand behavior from the perspective of an individual workload instance.

Ambient mesh can generate access logs for service traffic in a configurable set of formats, providing operators with full control of the how, what, when and where of logging.
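To give a sense of how per-workload records can be consumed, here is a minimal Python sketch that splits a log line made of `key=value` pairs into a dictionary. The sample line and its field names are made up for illustration; the real fields depend on which proxy wrote the line and on the configured log format.

```python
import re

# Matches key=value pairs where the value is either a double-quoted string
# or a bare token without whitespace.
_PAIR = re.compile(r'([\w.]+)=("[^"]*"|\S+)')


def parse_fields(line: str) -> dict[str, str]:
    """Return the key=value pairs found in a log line, with quotes removed."""
    return {key: value.strip('"') for key, value in _PAIR.findall(line)}


# Illustrative line only: the field names here are assumptions.
sample = (
    'src.addr=10.0.0.5:41234 src.workload="client-abc" '
    'dst.addr=10.0.0.9:8080 dst.service="reviews.default.svc.cluster.local" '
    'bytes_sent=120 bytes_recv=340 duration="3ms"'
)

fields = parse_fields(sample)
print(fields["src.workload"], "->", fields["dst.service"])
```

A parser like this is enough for ad hoc inspection; for anything longer lived, structured (JSON) logging and a log pipeline are usually a better fit.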

Distributed traces

Distributed tracing provides a way to monitor and understand behavior by following individual requests as they flow through a mesh. Traces empower mesh operators to understand service dependencies and the sources of latency within their service mesh.

Ambient mesh supports distributed tracing through waypoint proxies. The proxies automatically generate trace spans on behalf of the applications they proxy, requiring only that the applications forward the appropriate request context.
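Forwarding the request context usually means copying a small set of headers from each incoming request onto the outgoing calls it triggers. The sketch below shows that step in Python, assuming the application has its headers available as a plain dictionary. The function name `propagate_trace_headers` is hypothetical, and the header list covers the W3C Trace Context and B3 conventions plus `x-request-id`; which of them matter depends on the tracing backend in use.

```python
# Headers commonly forwarded for trace context: Envoy's request ID,
# W3C Trace Context, and B3.
TRACE_HEADERS = (
    "x-request-id",
    "traceparent",
    "tracestate",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
)


def propagate_trace_headers(incoming: dict[str, str]) -> dict[str, str]:
    """Pick the trace-context headers out of an incoming request's headers.

    Header names are compared case-insensitively and returned in lowercase.
    """
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


# Example incoming headers (values are illustrative).
incoming = {
    "Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "X-Request-Id": "abc-123",
    "Authorization": "Bearer secret",
}
outgoing = propagate_trace_headers(incoming)
print(outgoing)
```

The returned dictionary can be merged into the headers of any outbound HTTP call; unrelated headers such as `Authorization` are deliberately left out. Applications that already use an OpenTelemetry SDK typically get this propagation from the SDK instead.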

Observability with and without waypoints

ztunnel only has access to L4 data. In many cases this provides sufficient observability: you can see traffic flowing between Service A and Service B, but you cannot see the HTTP method or use distributed tracing. If you need more observability for a particular use case, you can deploy a waypoint to help diagnose a problem, and remove it afterwards.

Support for HTTP observability in ztunnel is available in Gloo Mesh, an enterprise distribution of ambient mesh.

OpenTelemetry and using a collector

OpenTelemetry is the standard for open source observability.

Istio’s logs and metrics are published in standard formats, which can be scraped with Prometheus or similar tooling. Waypoint proxies support sending traces to a number of different backends, including those that accept the OTLP format.

For production use, we recommend an OpenTelemetry collector, which offers maximum flexibility for performing transformations and allows authorization to backends to be handled in one place, rather than in each proxy.

Dashboards

Telemetry data can be analyzed using dashboards.

  • Kiali is an observability console for Istio with service mesh configuration and validation capabilities. It helps you understand the structure and health of your service mesh by monitoring traffic flow to infer the topology and report errors.

  • Gloo Mesh, an enterprise distribution of ambient mesh, comes with an insights engine that automatically analyzes your Istio setups for health issues. Then, Gloo shares these issues along with recommendations to harden your Istio setups. The insights give you a checklist to address issues that might otherwise be hard to detect across your environment.

  • Istio publishes dashboards for Grafana which use Prometheus metrics to show the health of Istio components as well as services running in the mesh.

Dive into observability

Explore the following sections to learn about ambient mesh:


  1. The OpenTelemetry project offers a primer on the concept of observability.