HTTP observability in ztunnel

Support for HTTP observability in ztunnel is available in Gloo Mesh, an enterprise distribution of ambient mesh.

Gloo Mesh includes an enhanced version of ztunnel that can provide HTTP observability directly, so telemetry is available even when workloads are not using waypoint proxies. If you are deploying a waypoint exclusively to get HTTP observability, use ztunnel’s metrics instead, which come with substantially lower overhead.

Configuration

Depending on how you installed Gloo Mesh, you may already have HTTP observability enabled. To check the current configuration:

$ istioctl ztunnel-config all -ojson | jq .config.l7Config
{
  "access_log": {
    "enabled": true,
    "skip_connection_log": false
  },
  "enabled": true,
  "metrics": {
    "enabled": true
  },
  "tracing": {
    "enabled": false,
    "otlp_endpoint": "http://gloo-telemetry-collector.gloo-mesh:4317"
  }
}

This configuration is also logged during ztunnel startup.

If the output of the command above is null (that is, you don’t see any l7Config entries), ensure you are using Gloo Mesh images.
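As a rough illustration of how to read this output, the sketch below treats a telemetry type as active only when both the top-level enabled flag and the type's own enabled flag are true, mirroring the rule described for l7Telemetry.enabled. The function name effective_l7_telemetry is hypothetical, and applying that rule to the runtime dump is an assumption rather than documented behavior.

```python
import json

# Keys as they appear in the l7Config output shown above.
TELEMETRY_TYPES = ("access_log", "metrics", "tracing")


def effective_l7_telemetry(raw_json: str) -> dict:
    """Return which telemetry types are effectively on for an l7Config dump.

    Assumption: a type is active only if the global flag and its own flag
    are both true.
    """
    config = json.loads(raw_json)
    if config is None:
        # jq prints "null" when no l7Config entry exists.
        return {name: False for name in TELEMETRY_TYPES}
    globally_enabled = bool(config.get("enabled", False))
    return {
        name: globally_enabled and bool(config.get(name, {}).get("enabled", False))
        for name in TELEMETRY_TYPES
    }


# The sample is the output shown above.
sample = """
{
  "access_log": {"enabled": true, "skip_connection_log": false},
  "enabled": true,
  "metrics": {"enabled": true},
  "tracing": {
    "enabled": false,
    "otlp_endpoint": "http://gloo-telemetry-collector.gloo-mesh:4317"
  }
}
"""

result = effective_l7_telemetry(sample)
print(result)
```

In practice the JSON string would come from the istioctl and jq pipeline shown above instead of an inline sample.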

To enable access logs and metrics with default settings:

$ kubectl set env ds/ztunnel -n istio-system L7_ENABLED=true

To customize the telemetry configuration in full, set the following values during installation:

  • l7Telemetry.enabled: Globally enables or disables HTTP telemetry. A telemetry type is active only when both this option and that type's own option are enabled.
  • l7Telemetry.metrics.enabled: Enables or disables HTTP metrics.
  • l7Telemetry.accessLog.enabled: Enables or disables HTTP access logs.
  • l7Telemetry.accessLog.skipConnectionLog: If enabled, a connection that carried only HTTP requests has its TCP connection log emitted at the ‘debug’ level, which is typically disabled, so in practice it is not logged. This is particularly useful for short-lived connections, where logging both TCP connections and HTTP requests creates excessive noise. If the connection does not carry HTTP, the TCP connection event is always logged. If disabled (the default), both HTTP requests and TCP connections are logged.
  • l7Telemetry.distributedTracing.enabled: Enables or disables HTTP tracing.
  • l7Telemetry.distributedTracing.otlpEndpoint: The OTLP endpoint to send traces to. For example: http://opentelemetry-collector:4317
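To show how these dotted names relate to each other, here is a minimal sketch that expands them into a nested structure and prints it as JSON. It assumes the values are supplied through a Helm-style values file, where each dot denotes one level of nesting; the source does not state the installation mechanism, so treat that as an assumption. The helper name nest and the chosen settings are illustrative only.

```python
import json


def nest(flat: dict) -> dict:
    """Expand dotted keys such as 'a.b.c' into nested dictionaries."""
    nested = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Illustrative choices: metrics and access logs on, tracing off,
# and TCP connection logs suppressed for HTTP-only connections.
flat_values = {
    "l7Telemetry.enabled": True,
    "l7Telemetry.metrics.enabled": True,
    "l7Telemetry.accessLog.enabled": True,
    "l7Telemetry.accessLog.skipConnectionLog": True,
    "l7Telemetry.distributedTracing.enabled": False,
    "l7Telemetry.distributedTracing.otlpEndpoint": "http://opentelemetry-collector:4317",
}

values = nest(flat_values)
print(json.dumps(values, indent=2))
```

Because JSON is a subset of YAML, the printed document could be saved and used wherever a YAML values file is accepted, if that assumption about the installation method holds for your setup.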

Logs

HTTP access logs have mostly the same format as TCP access logs, with a few differences:

  • HTTP logs are logged per HTTP request, while TCP logs are per connection.

  • While TCP logs have the bytes_sent and bytes_recv attributes, HTTP logs have method, path, protocol, response_code, host, and user_agent. For example: method=GET path="/productpage" protocol=HTTP1 response_code=200 host="productpage:9080" user_agent="curl/8.10.1"

  • In addition to logging HTTP requests, ztunnel adds TLS information for TLS traffic, using the tls.sni and tls.alpn attributes. For example: tls.sni=example.com tls.alpn=h2

Because ztunnel does not terminate TLS from applications, traffic is classified as either TLS or HTTP. HTTPS traffic is seen only as TLS.
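The attribute examples above can be parsed with a few lines of code. The following is a minimal sketch, assuming the attributes appear as space-separated key=value pairs with optional double quotes, as in the examples; a full log line may carry other fields, and values with unbalanced quotes would make shlex raise an error. The function names and the classification rule are illustrative, not part of ztunnel.

```python
import shlex


def parse_log_attributes(line: str) -> dict:
    """Split a log line into a dict of key=value attributes."""
    attributes = {}
    # shlex.split honors double quotes, so path="/a b" stays one token.
    for token in shlex.split(line):
        key, separator, value = token.partition("=")
        if separator:
            attributes[key] = value
    return attributes


def classify(attributes: dict) -> str:
    """Guess the log type from the attributes that are present."""
    if "method" in attributes:
        return "http"
    if "tls.sni" in attributes or "tls.alpn" in attributes:
        return "tls"
    return "tcp"


http_line = (
    'method=GET path="/productpage" protocol=HTTP1 response_code=200 '
    'host="productpage:9080" user_agent="curl/8.10.1"'
)
tls_line = "tls.sni=example.com tls.alpn=h2"

http_attributes = parse_log_attributes(http_line)
tls_attributes = parse_log_attributes(tls_line)
print(classify(http_attributes), http_attributes["path"], http_attributes["response_code"])
print(classify(tls_attributes), tls_attributes["tls.sni"])
```

All values are kept as strings; convert response_code to an integer if you need numeric comparisons.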

Metrics

In addition to TCP level metrics, ztunnel will also provide the following HTTP metrics:

  • istio_requests_total: Indicates the total count of HTTP requests. The response_code label distinguishes the result.
  • istio_request_duration_milliseconds: Indicates the distribution of the duration of each HTTP request.
  • istio_request_bytes: Indicates the distribution of HTTP request sizes.
  • istio_response_bytes: Indicates the distribution of HTTP response sizes.
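As an example of using the response_code label, the sketch below sums istio_requests_total samples per response code from Prometheus text-format output and derives the share of 5xx responses. The sample values are made up for illustration, and real samples carry more labels than shown here. The simple regular expression assumes label values do not contain a closing brace.

```python
import re
from collections import Counter

# Matches lines such as: istio_requests_total{response_code="200"} 90
SAMPLE_LINE = re.compile(r'^istio_requests_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def requests_by_response_code(exposition: str) -> Counter:
    """Sum istio_requests_total samples grouped by response_code."""
    totals = Counter()
    for line in exposition.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if not match:
            continue  # comments, blank lines, and other metrics
        labels = dict(LABEL.findall(match.group("labels")))
        totals[labels.get("response_code", "unknown")] += float(match.group("value"))
    return totals


def server_error_ratio(totals: Counter) -> float:
    """Fraction of requests whose response code starts with 5."""
    total = sum(totals.values())
    errors = sum(count for code, count in totals.items() if code.startswith("5"))
    return errors / total if total else 0.0


# Hypothetical scrape output, for illustration only.
sample = """
# TYPE istio_requests_total counter
istio_requests_total{response_code="200"} 90
istio_requests_total{response_code="404"} 5
istio_requests_total{response_code="503"} 5
"""

totals = requests_by_response_code(sample)
ratio = server_error_ratio(totals)
print(dict(totals), ratio)
```

Counters are cumulative, so for a rate over time you would compare two scrapes or let your metrics backend compute the rate.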

Performance and safety

ztunnel telemetry is specifically designed to be highly performant and safe.

Capturing telemetry does not modify requests at all. If a request cannot be parsed (for example, because it is invalid HTTP or because of a bug), the connection is not affected. Telemetry collection is, however, disabled for the remainder of that connection.

Request processing occurs after requests are forwarded, which keeps it off the critical path, and typically takes ~100ns per request. Even under heavy load, the observed overhead in request latency and throughput is less than 1%.
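To put the ~100ns figure in perspective, the following back-of-the-envelope sketch estimates how much of one CPU core telemetry processing would consume at different request rates. The request rates are hypothetical, and the model assumes a flat 100ns per request with all processing on a single core, which is a simplification.

```python
# Typical per-request processing time quoted above (~100ns).
PER_REQUEST_SECONDS = 100e-9


def telemetry_cpu_fraction(requests_per_second: float) -> float:
    """Fraction of one core spent on telemetry at a given request rate."""
    return requests_per_second * PER_REQUEST_SECONDS


# Hypothetical request rates, for illustration only.
for rps in (1_000, 10_000, 100_000):
    fraction = telemetry_cpu_fraction(rps)
    print(f"{rps:>7} req/s -> {fraction:.4%} of one core")
```

Because this work happens after the request is forwarded, the estimate describes CPU cost rather than added request latency.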