Outlier detection

Documentation

Resilience

Outlier detection

Outlier detection is supported at the edge of a cluster using a gateway, or when a workload is enrolled in the waypoint layer.

Outlier detection is a form of passive health checking. That is, unlike health checks that proactively monitor endpoints, with outlier detection the “checking” is performed in the context of a request.

When the proxy detects that a specific workload is not healthy, it can stop sending it requests, for a specified period. We say the workload is ejected. Configuring outlier detection involves specifying both aspects:

What constitutes an unhealthy workload. For example, a specific number of consecutive 5xx errors over a specific time window.
Parameters that govern the ejection algorithm, including the ejection duration, and conditions that override a decision to perform an ejection. For example, if too few instances remain in the load balancing pool.

In the following scenario, you will explore a simple example whereby a single replica is ejected, to make things easy to test. The subsequent fault injection scenario provides a second, more realistic example involving multiple replicas.

Setup

Set up a cluster

You should have a running Kubernetes cluster with Istio installed in ambient mode. Ensure your default namespace is added to the ambient mesh:

$ kubectl label ns default istio.io/dataplane-mode=ambient

Deploy a waypoint

If you don’t already have a waypoint installed for the default namespace, install one:

$ istioctl waypoint apply -n default --enroll-namespace --wait

For more information on using waypoints, see Configuring waypoint proxies.

Turn on waypoint logging

Waypoint access logging is off by default, and can be turned on using Istio’s Telemetry API.

So that you can inspect the logs and see requests returning a 503 response code, turn on logging for the waypoint:

$ kubectl apply -f - <<EOF
---
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: enable-access-logging
  namespace: default
spec:
  accessLogging:
    - providers:
      - name: envoy
EOF

Deploy sample services

To test outlier detection, you will deploy a service, httpbin, and a curl client.

$ kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.26/samples/httpbin/httpbin.yaml
$ kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.26/samples/curl/curl.yaml

Configure outlier detection

There are two main aspects to configuring outlier detection:

How to tell when a workload is considered unhealthy
How long should the workload be quarantined

Outlier detection uses the term “ejection” to indicate the removal of a workload from the pool of load-balancing endpoints that a waypoint uses. The ejection time follows an algorithm based on a “base” ejection time, and a multiplier, which is increased or decreased as a function of the health of the application.

Configuration of outlier detection is done in a DestinationRule.

Specify the threshold for considering a workload unhealthy to be three consecutive 500 (5xx) errors, and set the base ejection time to 15 seconds:

$ kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 3
      baseEjectionTime: 15s
      maxEjectionPercent: 100
EOF

ℹ️

We have set the configuration field maxEjectionPercent to 100 to support a demo scenario, whereby a single deployed replica can be ejected, even though it constitutes 100% of the workloads backing the httpbin service — something you probably do not want to do in a production setting.

The feature can be tested by calling a failing endpoint three times in succession to trigger ejection.

Test outlier detection

In one terminal, tail the waypoint’s logs:

$ kubectl logs --follow deploy/waypoint

In a second terminal, run the below commands, and confirm each assertion:

A call to httpbin should succeed (no failures, no ejections yet):
```
$ kubectl exec deploy/curl -- curl -s httpbin:8000/json
```

Trigger ejection by sending three consecutive failing calls:

$ for i in {1..3}; do kubectl exec deploy/curl -- curl -s httpbin:8000/status/500; done

A call to httpbin should fail, assuming the request is sent within ejection period (15s):

$ kubectl exec deploy/curl -- curl -s httpbin:8000/json

Note the UH response flag (No Healthy Upstream) in the logs:

[2024-12-07T22:17:20.424Z] "GET /json HTTP/1.1" 503 UH no_healthy_upstream - "-" 0 19 0 - "-" "curl/8.11.0" "fd944d8d-d8fb-4aa7-a0b8-1941dd467af9" "httpbin:8000" "-" inbound-vip|8000|http|httpbin.default.svc.cluster.local - 10.43.203.172:8000 10.42.0.9:34678 - default

The same message is shown in the response:

no healthy upstream

This is validation that our lone workload has been ejected, since no endpoints remain in the proxy’s load balancing pool.

A call to httpbin should succeed after ejection time elapses (wait > 15s):
```
$ kubectl exec deploy/curl -- curl -s httpbin:8000/json
```

Triggering more ejections on the workload will cause a multiplier to increase, and ejection times will increase. Similarly, after a period of the workload remaining healthy, the multiplier will decrease.

Configure metrics collection

By default, Istio configures Envoy to record a minimal set of statistics to reduce the overall CPU and memory footprint. Outlier detection metrics are not included by default, and you must configure Istio to include them.

Using a Helm values file to configure Istio’s global mesh settings, you can include metrics that pertain to circuit breaking and outlier detection:

---
meshConfig:
  defaultConfig:
    # enable stats for circuit breakers, request retries, upstream connections, and request timeouts globally:
    proxyStatsMatcher:
      inclusionRegexps:
        - ".*outlier_detection.*"
        - ".*upstream_rq_retry.*"
        - ".*upstream_rq_pending.*"
        - ".*upstream_cx_.*"
      inclusionSuffixes:
        - "upstream_rq_timeout"

Save the above content to a file named mesh-config.yaml.

To configure metrics collection, run the helm upgrade command for the istiod chart, providing the additional configuration file as an argument:

$ helm upgrade istiod istio/istiod --namespace istio-system \
  --set profile=ambient \
  --values mesh-config.yaml \
  --wait

The waypoint must then be restarted in order to pick up the updated configuration:

$ kubectl rollout restart deploy waypoint

You are ready to observe some metrics.

Deploy Prometheus and display the metrics

Deploy Prometheus to the cluster:

$ kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.26/samples/addons/prometheus.yaml

See Configure and view metrics for more information on configuring and viewing metrics.

Once Prometheus is deployed and ready, connect to its dashboard:

$ istioctl dashboard prometheus

Look for metrics such as envoy_cluster_outlier_detection_ejections_enforced_total and envoy_cluster_outlier_detection_ejections_enforced_consecutive_5xx.

Outlier detection - total ejections counter

Through these and other outlier-detection related metrics, development teams can be kept aware of ejections, a possible indication of issues with their applications.

Outlier detection events

Envoy can be configured to log outlier-detection specific events. In Istio, use the helm configuration field global.proxy.outlierLogPath (set to, say, /dev/stdout) to turn on the feature.

ℹ️

This feature was made configurable for waypoints only recently. Be sure to use the latest version of Istio.

Below are some example messages showing an ejection event caused by consecutive 500-type errors, and a subsequent corresponding “uneject” event returning the workload to the pool:

{
  "type": "CONSECUTIVE_5XX",
  "timestamp": "2024-12-07T19:30:13.724Z",
  "cluster_name": "inbound-vip|8000|http|httpbin.default.svc.cluster.local",
  "upstream_url": "envoy://connect_originate/10.42.0.8:8080",
  "action": "EJECT",
  "num_ejections": 1,
  "enforced": true,
  "eject_consecutive_event": {}
}
{
  "type": "CONSECUTIVE_5XX",
  "timestamp": "2024-12-07T19:30:37.384Z",
  "secs_since_last_action": "23",
  "cluster_name": "inbound-vip|8000|http|httpbin.default.svc.cluster.local",
  "upstream_url": "envoy://connect_originate/10.42.0.8:8080",
  "action": "UNEJECT",
  "num_ejections": 1,
  "enforced": false
}

Clean up

Delete the Prometheus deployment:

$ kubectl delete -f https://raw.githubusercontent.com/istio/istio/release-1.26/samples/addons/prometheus.yaml

Delete the DestinationRule:

$ kubectl delete destinationrule httpbin

Deprovision the waypoint:

$ istioctl waypoint delete -n default waypoint

Deprovision the sample applications:

$ kubectl delete -f https://raw.githubusercontent.com/istio/istio/release-1.26/samples/httpbin/httpbin.yaml
$ kubectl delete -f https://raw.githubusercontent.com/istio/istio/release-1.26/samples/curl/curl.yaml

Circuit breakers Fault injection