Fault injection
One way to harden a system is to actively exercise failure scenarios in production. This practice is known as chaos engineering or chaos testing, and was popularized by Netflix and its suite of open-source chaos testing projects.
Why wait for a failure to find out how the system reacts? We are better off scheduling failures during working hours, when teams are available to deal with the situation if something goes wrong.
We can use chaos testing to verify a variety of failure scenarios:
- Does a circuit breaker actually function? Does it properly fall back when an upstream service is unavailable?
- Does failover actually work? When a service is unavailable locally, do we see apps fall back to a remote service?
- Do retries succeed in masking transient errors in an upstream service?
- Does outlier detection actually function? When workloads get into an unhealthy state, are they quarantined and are calls routed to remaining healthy instances?
- Does autoscaling trigger when load crosses a specified threshold?
- How does latency affect the availability of my system?
- Do I have enough spare capacity to handle the request load when a specified percentage of workload instances are down or unavailable (either for an upgrade or due to a partial outage)?
Istio provides a mechanism to inject both latencies and errors by instrumenting the waypoints that are in the path between microservices.
The Kubernetes Gateway API does not currently support fault injection, and retries are only available in the Experimental channel. Therefore, you will configure fault injection with Istio’s legacy VirtualService resource. The routing features of the API are not used in these examples, and mixing VirtualService routing with HTTPRoute objects is not recommended.
Let us explore a fault injection scenario.
About the scenario
Bookinfo consists of the set of microservices productpage, details, reviews, and ratings. productpage aggregates information that it obtains from details and reviews.
The scenario below takes advantage of the differences between the multiple implementations of the reviews service, specifically the fact that reviews-v2 and reviews-v3 call the upstream ratings service while reviews-v1 does not.
Setup
Set up a cluster
You should have a running Kubernetes cluster with Istio installed in ambient mode. Ensure your default namespace is added to the ambient mesh:
$ kubectl label ns default istio.io/dataplane-mode=ambient
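To confirm that the namespace is now part of the ambient mesh, you can check its labels:
$ kubectl get namespace default --show-labels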
Configure metrics collection
Use a Helm values file to configure Istio’s global mesh settings to include metrics for circuit breaking and outlier detection:
meshConfig:
  defaultConfig:
    # enable stats for circuit breakers, request retries, upstream connections, and request timeouts globally:
    proxyStatsMatcher:
      inclusionRegexps:
        - ".*outlier_detection.*"
        - ".*upstream_rq_retry.*"
        - ".*upstream_rq_pending.*"
        - ".*upstream_cx_.*"
      inclusionSuffixes:
        - "upstream_rq_timeout"
Save the above content to a file named mesh-config.yaml.
Run the helm upgrade command for the istiod chart, providing the additional configuration file as an argument:
$ helm upgrade istiod istio/istiod --namespace istio-system \
--set profile=ambient \
--values mesh-config.yaml \
--wait
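If you want to verify that the new settings took effect, one option is to inspect the mesh configuration stored in the istio configmap (assuming a default, unrevisioned installation) and look for the proxyStatsMatcher entry:
$ kubectl get configmap istio -n istio-system -o yaml | grep -A 6 proxyStatsMatcher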
Deploy a waypoint
If you don’t already have a waypoint installed for the default namespace, install one:
$ istioctl waypoint apply -n default --enroll-namespace --wait
For more information on using waypoints, see Configuring waypoint proxies.
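To confirm the waypoint was provisioned, you can list the Kubernetes Gateway API resource that istioctl creates for it (named waypoint by default; gtw is the Gateway API short name):
$ kubectl get gtw waypoint -n default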
Turn on waypoint logging
Waypoint access logging is off by default, and can be turned on using Istio’s Telemetry API.
To be able to see that some requests return a 503 response code, turn on logging for the waypoint:
$ kubectl apply -f - <<EOF
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: enable-access-logging
  namespace: default
spec:
  accessLogging:
    - providers:
        - name: envoy
EOF
Deploy sample services
To explore fault injection, you will use Istio’s Bookinfo sample application and a curl client.
$ kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/bookinfo/platform/kube/bookinfo.yaml
$ kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/curl/curl.yaml
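Wait for the sample workloads to be ready before proceeding:
$ kubectl wait --for=condition=ready pod --all --timeout=120s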
Configure a delay
Configure a latency of four seconds in calls to the ratings service:
$ kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - fault:
      delay:
        fixedDelay: 4s
        percentage:
          value: 100.0
    route:
    - destination:
        host: ratings
EOF
Call the reviews service several times:
$ kubectl exec deploy/curl -- curl reviews:9080/reviews/123
Observe latency when calling the reviews service when the selected endpoint is either reviews-v2 or reviews-v3, and no delay when reviews-v1 handles the request.
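If you want to see the injected latency directly from the client, you can have curl report the total request time, for example:
$ kubectl exec deploy/curl -- curl -s -o /dev/null -w "%{time_total}\n" reviews:9080/reviews/123
Requests served by reviews-v2 or reviews-v3 should take a little over four seconds, while those served by reviews-v1 return quickly.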
Tail the waypoint’s logs. You should see a “DI” (Delay Injected) Envoy response flag when the waypoint proxies calls to the ratings service.
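One way to follow these logs, assuming the waypoint deployment provisioned earlier (named waypoint):
$ kubectl logs -f deploy/waypoint
A matching access log entry looks like this: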
[2024-12-11T15:46:47.143Z] "GET /ratings/0 HTTP/1.1" 200 DI via_upstream - "-" 0 48 4001 1 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36" "4fa44f13-b508-4a99-8911-e4787acd0f33" "ratings:9080" "envoy://connect_originate/10.42.0.31:9080" inbound-vip|9080|http|ratings.default.svc.cluster.local envoy://internal_client_address/ 10.43.177.113:9080 10.42.0.29:47910 - -
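This scenario injects only delays, but the same fault block in a VirtualService can inject errors instead. As an illustrative sketch that is not applied in this task, a fault that aborts half of the calls to ratings with an HTTP 500 would look roughly like this:
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: ratings
spec:
  hosts:
  - ratings
  http:
  - fault:
      abort:
        httpStatus: 500
        percentage:
          value: 50.0
    route:
    - destination:
        host: ratings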
Configure a timeout in the client
Use a VirtualService to configure a timeout of one second when calling the reviews service:
$ kubectl apply -f - <<EOF
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
    timeout: 1s
EOF
Observe a 504 “Gateway Timeout” response when calling reviews when the v2 or v3 endpoints are selected:
$ kubectl exec deploy/curl -- curl -s -v reviews:9080/reviews/123
Can we configure both a timeout and retry policy such that when calls to v2 or v3 take too long, we fall back to v1?
Configure clients to retry
In a worst-case scenario:
- A request could be routed to v2 and fail with a 504 after waiting 1s.
- After an interval between retries, the first retry attempt could be routed to v3, and also fail with a 504 after another second elapses.
- A second retry attempt will choose v1 (failed endpoints are excluded when retrying), which will succeed.
In this worst-case scenario, the client will in total be waiting somewhere north of two seconds.
Here is a VirtualService configuration that specifies a per-try timeout of 1s with two retry attempts, retrying on gateway errors (the gateway-error condition includes 504 responses):
$ kubectl apply -f - <<EOF
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
    retries:
      attempts: 2
      perTryTimeout: 1s
      retryOn: gateway-error
EOF
Test this configuration by sending 10 requests to productpage and noting that:
- All requests succeed
- All reviews are returned by reviews-v1
- Some requests return immediately; others take longer, due to one or two retry attempts
$ for i in {1..10}; do
kubectl exec deploy/curl -- curl -s productpage:9080/productpage | grep reviews-
done
The above configuration works, but is not optimal.
Augment with outlier detection
It would be better to use outlier detection to eject the v2 and v3 workloads, which will automatically cause requests to be sent to v1, and minimize retries:
$ kubectl apply -f - <<EOF
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    outlierDetection:
      consecutiveGatewayErrors: 1
      baseEjectionTime: 15s
      maxEjectionPercent: 100
EOF
In one terminal, monitor ejections through metrics and the watch command:
$ watch "kubectl exec deploy/waypoint -- pilot-agent request GET stats | grep ejections_enforced_total"
Send another 10 requests, and note how most return without delay, due to ejections causing all requests to be routed to the healthy endpoint, reviews-v1:
$ for i in {1..10}; do
kubectl exec deploy/curl -- curl -s productpage:9080/productpage | grep reviews-
done
Also note that the number of outlier detection ejections has gone up:
Every 2.0s: kubectl exec deploy/waypoint -- pilot-agent request GET stats | grep ejections_enforced_total
cluster.inbound-vip|9080|http|reviews.default.svc.cluster.local;.outlier_detection.ejections_enforced_total: 2
Finally, return the ratings service to regular latency:
$ kubectl delete virtualservice ratings
After about 15 seconds, the ejected workloads will be returned to the load balancing pool.
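If you want to confirm that no endpoints remain ejected, you can check the waypoint’s active ejections gauge, which should drop back to zero:
$ kubectl exec deploy/waypoint -- pilot-agent request GET stats | grep ejections_active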
Send a series of requests to productpage:
$ for i in {1..10}; do
kubectl exec deploy/curl -- curl -s productpage:9080/productpage | grep reviews-
done
The output will show that we are back to the normal state where all three versions of the reviews service are handling requests.
Clean up
Delete the DestinationRule and VirtualServices:
$ kubectl delete destinationrule reviews
$ kubectl delete virtualservice reviews
$ kubectl delete virtualservice ratings
Deprovision the waypoint:
$ istioctl waypoint delete -n default waypoint
Deprovision the sample applications:
$ kubectl delete -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/bookinfo/platform/kube/bookinfo.yaml
$ kubectl delete -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/curl/curl.yaml
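If you also want to undo the remaining setup steps, delete the access logging configuration and remove the ambient label from the namespace:
$ kubectl delete telemetry enable-access-logging
$ kubectl label ns default istio.io/dataplane-mode-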