Fault injection
One way to harden a system is to actively exercise failure scenarios in production. This practice is known as chaos engineering or chaos testing, and was popularized by Netflix and its suite of open-source chaos testing projects.
Why wait for a failure to find out how the system reacts? We are better off scheduling failures during working hours, when teams are available to deal with the situation if something goes wrong.
We can use chaos testing to verify a variety of failure scenarios:
- Does a circuit breaker actually function? Does it properly fall back when an upstream service is unavailable?
- Does failover actually work? When a service is unavailable locally, do we see apps fall back to a remote service?
- Do retries succeed in masking transient errors in an upstream service?
- Does outlier detection actually function? When workloads get into an unhealthy state, are they quarantined and are calls routed to remaining healthy instances?
- Does autoscaling trigger when load crosses a specified threshold?
- How does latency affect the availability of my system?
- Do I have enough spare capacity to handle the request load when a specified percentage of workload instances are down or unavailable (either for an upgrade or due to a partial outage)?
Istio provides a mechanism to inject both latencies and errors by instrumenting the waypoints that are in the path between microservices.
The Kubernetes Gateway API does not currently support fault injection, and retries are only available in the Experimental channel. Therefore, you will configure fault injection with Istio’s legacy VirtualService resource. The routing features of the API are not used in these examples, and mixing VirtualService routing with HTTPRoute objects is not recommended.
Let us explore a fault injection scenario.
About the scenario
Bookinfo consists of the set of microservices productpage, details, reviews, and ratings. productpage aggregates information that it obtains from details and reviews.
The scenario below takes advantage of the differences between the multiple implementations of the reviews service, specifically the fact that reviews-v2 and reviews-v3 call the upstream ratings service while reviews-v1 does not.
Setup
Set up a cluster
You should have a running Kubernetes cluster with Istio installed in ambient mode. Ensure your default namespace is added to the ambient mesh:
$ kubectl label ns default istio.io/dataplane-mode=ambient
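To confirm that the namespace is now part of the ambient mesh, you can check its labels:
$ kubectl get namespace default --show-labels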
Configure metrics collection
Use a Helm values file to configure Istio’s global mesh settings to include metrics for circuit breaking and outlier detection:
meshConfig:
  defaultConfig:
    # enable stats for circuit breakers, request retries, upstream connections, and request timeouts globally:
    proxyStatsMatcher:
      inclusionRegexps:
        - ".*outlier_detection.*"
        - ".*upstream_rq_retry.*"
        - ".*upstream_rq_pending.*"
        - ".*upstream_cx_.*"
      inclusionSuffixes:
        - "upstream_rq_timeout"
Save the above content to a file named mesh-config.yaml.
Run the helm upgrade command for the istiod chart, providing the additional configuration file as an argument:
$ helm upgrade istiod istio/istiod --namespace istio-system \
--set profile=ambient \
--values mesh-config.yaml \
--wait
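If you want to verify that the new settings took effect, one option is to inspect the mesh configuration stored in the istio configmap (assuming a default, unrevisioned installation) and look for the proxyStatsMatcher entry:
$ kubectl get configmap istio -n istio-system -o yaml | grep -A 6 proxyStatsMatcher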
Deploy a waypoint
If you don’t already have a waypoint installed for the default namespace, install one:
$ istioctl waypoint apply -n default --enroll-namespace --wait
For more information on using waypoints, see Configuring waypoint proxies.
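To confirm the waypoint was provisioned, you can list the Kubernetes Gateway API resource that istioctl creates for it (named waypoint by default; gtw is the Gateway API short name):
$ kubectl get gtw waypoint -n default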
Turn on waypoint logging
Waypoint access logging is off by default, and can be turned on using Istio’s Telemetry API.
To be able to see that some requests return a 503 response code, turn on logging for the waypoint:
$ kubectl apply -f - <<EOF
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: enable-access-logging
  namespace: default
spec:
  accessLogging:
    - providers:
        - name: envoy
EOF
Deploy sample services
To explore fault injection, you will use Istio’s Bookinfo sample application and a curl client.
$ kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/bookinfo/platform/kube/bookinfo.yaml
$ kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/curl/curl.yaml
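Wait for the sample workloads to be ready before proceeding:
$ kubectl wait --for=condition=ready pod --all --timeout=120s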
Configure a delay
Configure a latency of four seconds in calls to the ratings service:
$ kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - fault:
      delay:
        fixedDelay: 4s
        percentage:
          value: 100.0
    route:
    - destination:
        host: ratings
EOF
Call the reviews service several times:
$ kubectl exec deploy/curl -- curl reviews:9080/reviews/123
Observe latency when calling the reviews service when the selected endpoint is either reviews-v2 or reviews-v3, and no delay when reviews-v1 handles the request.
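If you want to see the injected latency directly from the client, you can have curl report the total request time, for example:
$ kubectl exec deploy/curl -- curl -s -o /dev/null -w "%{time_total}\n" reviews:9080/reviews/123
Requests served by reviews-v2 or reviews-v3 should take a little over four seconds, while those served by reviews-v1 return quickly.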
Tail the waypoint’s logs. You should see a “DI” (Delay Injected) Envoy response flag when the waypoint proxies calls to the ratings service.
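One way to follow these logs, assuming the waypoint deployment provisioned earlier (named waypoint):
$ kubectl logs -f deploy/waypoint
A matching access log entry looks like this: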
[2024-12-11T15:46:47.143Z] "GET /ratings/0 HTTP/1.1" 200 DI via_upstream - "-" 0 48 4001 1 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36" "4fa44f13-b508-4a99-8911-e4787acd0f33" "ratings:9080" "envoy://connect_originate/10.42.0.31:9080" inbound-vip|9080|http|ratings.default.svc.cluster.local envoy://internal_client_address/ 10.43.177.113:9080 10.42.0.29:47910 - -
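This scenario injects only delays, but the same fault block in a VirtualService can inject errors instead. As an illustrative sketch that is not applied in this task, a fault that aborts half of the calls to ratings with an HTTP 500 would look roughly like this:
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - fault:
      abort:
        httpStatus: 500
        percentage:
          value: 50.0
    route:
    - destination:
        host: ratings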
Configure a timeout in the client
Use a VirtualService to configure a timeout of one second when calling the reviews service:
$ kubectl apply -f - <<EOF
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
    timeout: 1s
EOF
Observe a 504 “Gateway Timeout” response when calling reviews when the v2 or v3 endpoints are selected:
$ kubectl exec deploy/curl -- curl -s -v reviews:9080/reviews/123
Can we configure both a timeout and retry policy such that when calls to v2 or v3 take too long, we fall back to v1?
Configure clients to retry
In a worst-case scenario:
- A request could be routed to v2 and fail with a 504 after waiting 1s.
- After an interval between retries, the first retry attempt could be routed to v3, and also fail with a 504 after another second elapses.
- A second retry attempt will choose v1 (failed endpoints are excluded when retrying), which will succeed.
In this worst-case scenario, the client will in total be waiting somewhere north of two seconds.
Here is a VirtualService configuration that specifies a per-try timeout of 1s with two retry attempts, retrying on gateway errors (the gateway-error condition includes 504 responses):
$ kubectl apply -f - <<EOF
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
    retries:
      attempts: 2
      perTryTimeout: 1s
      retryOn: gateway-error
EOF
Test this configuration by sending 10 requests to productpage and noting that:
- All requests succeed
- All reviews are returned by reviews-v1
- Some requests return immediately; others take longer, due to one or two retry attempts
$ for i in {1..10}; do
kubectl exec deploy/curl -- curl -s productpage:9080/productpage | grep reviews-
done
The above configuration works, but is not optimal.
Augment with outlier detection
It would be better to use outlier detection to eject the v2 and v3 workloads, which will automatically cause requests to be sent to v1, and minimize retries:
$ kubectl apply -f - <<EOF
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    outlierDetection:
      consecutiveGatewayErrors: 1
      baseEjectionTime: 15s
      maxEjectionPercent: 100
EOF
In one terminal, monitor ejections through metrics and the watch command:
$ watch "kubectl exec deploy/waypoint -- pilot-agent request GET stats | grep ejections_enforced_total"
Send another 10 requests, and note how most return without delay, due to ejections causing all requests to be routed to the healthy endpoint, reviews-v1:
$ for i in {1..10}; do
kubectl exec deploy/curl -- curl -s productpage:9080/productpage | grep reviews-
done
Also note that the number of outlier detection ejections has gone up:
Every 2.0s: kubectl exec deploy/waypoint -- pilot-agent request GET stats | grep ejections_enforced_total
cluster.inbound-vip|9080|http|reviews.default.svc.cluster.local;.outlier_detection.ejections_enforced_total: 2
Finally, return the ratings service to regular latency:
$ kubectl delete virtualservice ratings
After about 15 seconds, the ejected workloads will be returned to the load balancing pool.
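If you want to confirm that no endpoints remain ejected, you can check the waypoint’s active ejections gauge, which should drop back to zero:
$ kubectl exec deploy/waypoint -- pilot-agent request GET stats | grep ejections_active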
Send a series of requests to productpage:
$ for i in {1..10}; do
kubectl exec deploy/curl -- curl -s productpage:9080/productpage | grep reviews-
done
The output will show that we are back to the normal state where all three versions of the reviews service are handling requests.
Clean up
Delete the DestinationRule and VirtualServices:
$ kubectl delete destinationrule reviews
$ kubectl delete virtualservice reviews
$ kubectl delete virtualservice ratings
Deprovision the waypoint:
$ istioctl waypoint delete -n default waypoint
Deprovision the sample applications:
$ kubectl delete -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/bookinfo/platform/kube/bookinfo.yaml
$ kubectl delete -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/curl/curl.yaml
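If you also want to undo the remaining setup steps, delete the access logging configuration and remove the ambient label from the namespace:
$ kubectl delete telemetry enable-access-logging
$ kubectl label ns default istio.io/dataplane-mode-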