Troubleshooting

Install and setup issues

Before doing anything else, make sure you read and follow

Failure to correctly configure your environment will result in issues.

ztunnel is not capturing my traffic

First, check the pod for the ambient.istio.io/redirection annotation. This indicates that the CNI node agent has enabled redirection.

$ kubectl get pods shell-5b7cf9f6c4-npqgz -oyaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    ambient.istio.io/redirection: enabled

If the annotation is missing: the pod was not enrolled in the mesh.

  1. Check the logs of the istio-cni-node pod on the same node as the pod for errors. Errors during enablement may be blocking the pod from getting traffic from ztunnel.
  2. Check the logs of the istio-cni-node pod on the same node to verify it has ambient enabled. The pod should log AmbientEnabled: true during startup. If this is false, ensure you installed Istio with --set profile=ambient.
  3. Check the pod is actually configured to have ambient enabled. The criteria are as follows:
    • The pod OR namespace must have the istio.io/dataplane-mode=ambient label set
    • The pod must not have the sidecar.istio.io/status annotation set (which is added automatically when a sidecar is injected)
    • The pod must not have istio.io/dataplane-mode=none set.
    • The pod must not have spec.hostNetwork=true.
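The criteria above can be sketched as a small check. This is an illustrative sketch only, not Istio's implementation: the function name and its arguments are hypothetical, and the inputs are assumed to be plain dictionaries taken from the pod and namespace metadata.

```python
DATAPLANE_LABEL = "istio.io/dataplane-mode"
SIDECAR_ANNOTATION = "sidecar.istio.io/status"


def ambient_blockers(pod_labels, namespace_labels, pod_annotations, host_network):
    """Return the reasons a pod fails the criteria listed above (empty if none)."""
    reasons = []
    pod_mode = pod_labels.get(DATAPLANE_LABEL)
    ns_mode = namespace_labels.get(DATAPLANE_LABEL)
    # The pod OR the namespace must opt in to ambient.
    if pod_mode != "ambient" and ns_mode != "ambient":
        reasons.append("neither pod nor namespace has istio.io/dataplane-mode=ambient")
    # A sidecar-injected pod carries this annotation.
    if SIDECAR_ANNOTATION in pod_annotations:
        reasons.append("sidecar.istio.io/status annotation is set")
    # An explicit per-pod opt-out.
    if pod_mode == "none":
        reasons.append("pod has istio.io/dataplane-mode=none")
    if host_network:
        reasons.append("pod has spec.hostNetwork=true")
    return reasons


# A pod in an ambient-labeled namespace that opted out with a pod label.
print(ambient_blockers({DATAPLANE_LABEL: "none"}, {DATAPLANE_LABEL: "ambient"}, {}, False))
```

An empty list means none of the four criteria rule the pod out, so the CNI logs from steps 1 and 2 are the next place to look.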

If the annotation is present: this means Istio claims it enabled redirection for the pod, but apparently it isn’t working.

  1. Check the iptables rules in the pod. Run a debug shell and run iptables-save. You should see something like below:
# iptables-save
# Generated by iptables-save v1.8.10 on Wed Sep 25 22:06:16 2024
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:ISTIO_OUTPUT - [0:0]
:ISTIO_PRERT - [0:0]
-A PREROUTING -j ISTIO_PRERT
-A OUTPUT -j ISTIO_OUTPUT
-A ISTIO_OUTPUT -d 169.254.7.127/32 -p tcp -m tcp -j ACCEPT
-A ISTIO_OUTPUT -p tcp -m mark --mark 0x111/0xfff -j ACCEPT
-A ISTIO_OUTPUT ! -d 127.0.0.1/32 -o lo -j ACCEPT
-A ISTIO_OUTPUT ! -d 127.0.0.1/32 -p tcp -m mark ! --mark 0x539/0xfff -j REDIRECT --to-ports 15001
-A ISTIO_PRERT -s 169.254.7.127/32 -p tcp -m tcp -j ACCEPT
-A ISTIO_PRERT ! -d 127.0.0.1/32 -p tcp -m tcp ! --dport 15008 -m mark ! --mark 0x539/0xfff -j REDIRECT --to-ports 15006

The exact contents may vary, but if there is anything relating to Istio here, it means iptables rules are installed.
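When comparing many pods, the "anything relating to Istio" check can be scripted. The sketch below assumes the iptables-save output has already been captured as text (for example from a debug shell). The function name is hypothetical, and it only looks for chain declarations whose names start with ISTIO_, as in the sample above.

```python
def istio_chains(iptables_save_output):
    """Return the names of declared chains that start with ISTIO_."""
    chains = []
    for line in iptables_save_output.splitlines():
        # Chain declarations look like ":ISTIO_OUTPUT - [0:0]".
        if line.startswith(":ISTIO_"):
            chains.append(line[1:].split()[0])
    return chains


SAMPLE_RULES = """\
*nat
:PREROUTING ACCEPT [0:0]
:ISTIO_OUTPUT - [0:0]
:ISTIO_PRERT - [0:0]
-A PREROUTING -j ISTIO_PRERT
-A OUTPUT -j ISTIO_OUTPUT
"""

print(istio_chains(SAMPLE_RULES))
```

An empty result suggests redirection rules were never installed in that pod's network namespace, despite the annotation.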

  2. Check if ztunnel is running within the pod network. This can be done with netstat -ntl. You should see listeners on a few Istio ports (15001, 15006, etc):
$ netstat -ntl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.1:15053         0.0.0.0:*               LISTEN
tcp6       0      0 ::1:15053               :::*                    LISTEN
tcp6       0      0 :::15001                :::*                    LISTEN
tcp6       0      0 :::15006                :::*                    LISTEN
tcp6       0      0 :::15008                :::*                    LISTEN
  3. Check the logs of ztunnel. When sending traffic, you should see logs like info access connection complete .... Note that these are logged when connections are closed, not when they are opened, so you may not see logs for your application if it uses long-lived connections.
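The netstat check above can also be automated. The following is a sketch that assumes the netstat -ntl output is available as text. The helper names are hypothetical, and the default expected ports (15001, 15006, 15008) are simply the ones visible in the sample output, not an authoritative list.

```python
def listening_ports(netstat_output):
    """Extract the TCP ports in LISTEN state from `netstat -ntl` output."""
    ports = set()
    for line in netstat_output.splitlines():
        parts = line.split()
        # Data rows: Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(parts) >= 6 and parts[0].startswith("tcp") and parts[5] == "LISTEN":
            # The port follows the last ':' ("127.0.0.1:15053", ":::15001").
            ports.add(int(parts[3].rsplit(":", 1)[1]))
    return ports


def missing_ztunnel_ports(netstat_output, expected=(15001, 15006, 15008)):
    """Return the expected ports that have no listener, sorted."""
    return sorted(set(expected) - listening_ports(netstat_output))


SAMPLE_NETSTAT = """\
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.1:15053         0.0.0.0:*               LISTEN
tcp6       0      0 :::15001                :::*                    LISTEN
tcp6       0      0 :::15006                :::*                    LISTEN
tcp6       0      0 :::15008                :::*                    LISTEN
"""

print(missing_ztunnel_ports(SAMPLE_NETSTAT))
```

A non-empty result means ztunnel is not listening inside the pod network on at least one of the expected ports.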

Pod fails to run with Failed to create pod sandbox

For pods in the mesh, Istio will run a CNI plugin during the pod ‘sandbox’ creation. This configures the networking rules. This may intermittently fail, in which case Kubernetes will automatically retry.

This can fail for a few reasons:

  • no ztunnel connection: this indicates that the CNI plugin is not connected to ztunnel. Ensure ztunnel is running on the same node and is healthy.
  • failed to add IP ... to ipset istio-inpod-probes: exist: this indicates Istio attempted to add the workload’s IP to the ipset, but an entry for that IP already existed. This can be caused by a race condition in the Kubernetes IP allocation, in which case a retry can resolve the issue. On Istio 1.22.3 and older, a bug prevented recovery; please upgrade if you are affected. Other occurrences of this may be a bug.

ztunnel fails with failed to bind to address [::1]:15053: Cannot assign requested address

This is fixed in Istio 1.23.1+, please upgrade. See issue.

ztunnel fails with failed to bind to address [::1]:15053: Address family not supported

This indicates your kernel does not support IPv6. IPv6 support can be turned off by setting IPV6_ENABLED=false on ztunnel.

ztunnel traffic issues

Understanding logs

When troubleshooting traffic issues, the first step should always be to analyze the access logs in ztunnel. Note that there may be two ztunnel pods involved in a request (the source and destination), so it’s useful to look at both sides.

By default, an access log is written when each connection completes. Connection-opening logs are available at debug level (see Setting log level).

An example log looks like:

2024-09-25T22:08:30.213996Z     info    access  connection complete     src.addr=10.244.0.33:50676 src.workload="shell-5b7cf9f6c4-7hfkc" src.namespace="default" src.identity="spiffe://cluster.local/ns/default/sa/default" dst.addr=10.244.0.29:15008 dst.hbone_addr=10.96.99.218:80 dst.service="echo.default.svc.cluster.local" dst.workload="waypoint-66f44865c4-l7btm" dst.namespace="default" dst.identity="spiffe://cluster.local/ns/default/sa/waypoint" direction="outbound" bytes_sent=67 bytes_recv=518 duration="2ms"
  • The src/dst addr, workload, namespace, and identity represent the information about the source and destination of the traffic. Not all information will be available for all traffic:
    • identity will only be set when mTLS is used.
    • dst.namespace and dst.workload will not be present when traffic is sent to an unknown destination (passthrough traffic)
  • dst.service represents the destination service, if the call was to a service. This is not always the case, as an application can reach a Pod directly.
  • dst.hbone_addr is set when using mTLS. In this case, hbone_addr represents the target of the traffic, while dst.addr represents the actual address we connected to (for the tunnel).
  • bytes_sent and bytes_recv indicate how many bytes were transferred during the connection.
  • duration indicates how long the connection was open
  • error, if present, indicates the connection had an error, and why.

In the above log, you can see that while the dst.service is echo, the dst.workload (and dst.addr) are for waypoint-.... This implies the traffic was sent to a waypoint proxy.
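Because the access log is a sequence of key=value pairs, it is easy to pull the fields apart when scanning many lines. The parser below is a minimal sketch: the function name is hypothetical, and it assumes quoted values never contain an escaped double quote.

```python
import re

# key=value, where the value is either "quoted" or a bare token without spaces.
FIELD_RE = re.compile(r'([A-Za-z_][\w.]*)=(?:"([^"]*)"|(\S+))')


def parse_access_log(line):
    """Return the key=value fields of a ztunnel access log line as a dict of strings."""
    fields = {}
    for match in FIELD_RE.finditer(line):
        key, quoted, bare = match.groups()
        fields[key] = quoted if quoted is not None else bare
    return fields


SAMPLE_LOG = (
    '2024-09-25T22:08:30.213996Z info access connection complete '
    'src.addr=10.244.0.33:50676 src.workload="shell-5b7cf9f6c4-7hfkc" '
    'dst.addr=10.244.0.29:15008 dst.hbone_addr=10.96.99.218:80 '
    'dst.service="echo.default.svc.cluster.local" '
    'dst.workload="waypoint-66f44865c4-l7btm" '
    'direction="outbound" bytes_sent=67 bytes_recv=518 duration="2ms"'
)

fields = parse_access_log(SAMPLE_LOG)
print(fields["dst.service"], "->", fields["dst.workload"], "via", fields["dst.addr"])
```

Comparing dst.service with dst.workload in the parsed fields reproduces the observation above: the service is echo, but the connection went to a waypoint pod.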

Traffic timeout with ztunnel

Traffic is blocked, and ztunnel logs errors like the following:

error   access  connection complete      direction="outbound" bytes_sent=0 bytes_recv=0 duration="10002ms" error="io error: deadline has elapsed"
error   access  connection complete      direction="outbound" bytes_sent=0 bytes_recv=0 duration="10002ms" error="connection timed out, maybe a NetworkPolicy is blocking HBONE port 15008: deadline has elapsed"
  • For the connection timed out error, this means the connection could not be established. This may be due to networking issues reaching the destination. A very common cause (hence the log) is to have a NetworkPolicy or other firewall rule blocking port 15008. Istio mTLS traffic is tunneled over port 15008, so this must be enabled (both on ingress and egress).
  • For the more generic errors like io error: deadline has elapsed, this generally has the same root causes as above. However, if traffic works without ambient, it is unlikely to be a typical firewall rule, as the traffic should be sent identically as without ambient enabled. This likely indicates an incompatibility with your Kubernetes setup.
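To check whether a given NetworkPolicy leaves port 15008 open, its rules can be inspected programmatically. The sketch below assumes the policy was loaded as a dictionary (for example from kubectl get networkpolicy -o json). The helper names are hypothetical, and the check is deliberately simplified: it ignores pod and namespace selectors and treats named ports as non-matching.

```python
HBONE_PORT = 15008


def rule_allows_port(rule, port=HBONE_PORT):
    """True if a single ingress/egress rule permits TCP traffic on `port`."""
    ports = rule.get("ports")
    if not ports:
        return True  # a rule without `ports` matches every port
    for entry in ports:
        if entry.get("protocol", "TCP") != "TCP":
            continue
        start = entry.get("port")
        if start is None:
            return True  # no port given: all ports for this protocol
        # Numeric port, optionally the start of a range ending at endPort.
        if isinstance(start, int) and start <= port <= entry.get("endPort", start):
            return True
    return False


def direction_allows_hbone(policy, direction):
    """direction is "ingress" or "egress"; None means the policy does not restrict it."""
    spec = policy.get("spec", {})
    # Kubernetes defaults: Ingress always, Egress only if egress rules exist.
    default_types = ["Ingress"] + (["Egress"] if "egress" in spec else [])
    if direction.capitalize() not in spec.get("policyTypes", default_types):
        return None
    return any(rule_allows_port(rule) for rule in spec.get(direction) or [])


POLICY = {
    "spec": {
        "policyTypes": ["Ingress", "Egress"],
        "ingress": [{"ports": [{"port": 8080}]}],
        "egress": [{"ports": [{"port": 15008, "protocol": "TCP"}]}],
    }
}
print(direction_allows_hbone(POLICY, "ingress"), direction_allows_hbone(POLICY, "egress"))
```

NetworkPolicies are additive, so a False result for one policy only matters if no other policy selecting the same pods allows the port.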

Readiness probes fail with ztunnel

After enabling ambient mode, pod readiness probes fail. For example, you may see something like below:

  Warning  Unhealthy               92s (x6 over 4m2s)   kubelet                  Readiness probe failed: Get "http://1.1.1.1:8080/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Ambient mode intends to not capture or impact any readiness probe traffic. It does this by applying a SNAT rule in the host, to rewrite any traffic from kubelet as coming from 169.254.7.127, then skipping redirection for any traffic matching this pattern.
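The in-pod half of this mechanism is visible in the iptables-save output shown earlier: rules that ACCEPT traffic to or from 169.254.7.127 before any redirect. As a sketch (the function name is hypothetical, and it inspects only the pod's rules, not the SNAT rule on the host), those rules can be located like this:

```python
PROBE_SNAT_IP = "169.254.7.127"


def probe_bypass_rules(iptables_save_output):
    """Return the rules that ACCEPT traffic to or from the kubelet probe address."""
    rules = []
    for line in iptables_save_output.splitlines():
        tokens = line.split()
        if f"{PROBE_SNAT_IP}/32" in tokens and tokens[-2:] == ["-j", "ACCEPT"]:
            rules.append(line)
    return rules


SAMPLE_POD_RULES = """\
-A ISTIO_OUTPUT -d 169.254.7.127/32 -p tcp -m tcp -j ACCEPT
-A ISTIO_OUTPUT ! -d 127.0.0.1/32 -o lo -j ACCEPT
-A ISTIO_PRERT -s 169.254.7.127/32 -p tcp -m tcp -j ACCEPT
-A ISTIO_PRERT ! -d 127.0.0.1/32 -p tcp -m tcp ! --dport 15008 -m mark ! --mark 0x539/0xfff -j REDIRECT --to-ports 15006
"""

for rule in probe_bypass_rules(SAMPLE_POD_RULES):
    print(rule)
```

If these rules are present in the pod but probes still fail, the host-side rewrite is the more likely culprit.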

Readiness probe failures that start when enabling ambient mode typically indicate an environmental issue with this traffic rewrite.

For instance:

Traffic fails with timed out waiting for workload from xds

When traffic is sent from a pod, ztunnel must first get information about the pod from Istiod (over the XDS protocol). If it fails to do so after 5s, it will reject the connection with this error.

Istiod is generally expected to return information substantially sooner than 5s. If this error happens intermittently, it may indicate that istiod is not responding that quickly. This could be caused by istiod being overloaded, or by modifications that increase PILOT_DEBOUNCE_AFTER (which can slow down updates).

If the issue happens persistently, it is likely a bug; please file an issue on the Istio repo.

Traffic fails with unknown source

This indicates ztunnel was unable to identify the source of traffic. In Istio 1.23, ztunnel would attempt to map the source IP of traffic to a known workload. If the workload has multiple network interfaces, this may prevent ztunnel from making this association.

Istio 1.24+ does not require this mapping.

Traffic fails with no healthy upstream

This indicates traffic to a Service had no applicable backends.

We can see how ztunnel views the Service’s health:

$ istioctl zc services
NAMESPACE    SERVICE NAME         SERVICE VIP    WAYPOINT ENDPOINTS
default      echo                 10.96.99.1     None     3/4

This indicates there are 4 endpoints for the service, but 1 was unhealthy.
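With many services, it can help to filter this output down to the rows where healthy endpoints fall short of the total. The sketch below assumes the output was captured as text and that each data row has exactly five whitespace-separated columns, as in the sample. The function name is hypothetical.

```python
def unhealthy_services(zc_services_output):
    """Return (namespace, service, healthy, total) for rows with healthy < total."""
    rows = []
    for line in zc_services_output.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) != 5:
            continue  # not a data row in the layout assumed here
        namespace, name, _vip, _waypoint, endpoints = parts
        healthy, total = (int(n) for n in endpoints.split("/"))
        if healthy < total:
            rows.append((namespace, name, healthy, total))
    return rows


SAMPLE_SERVICES = """\
NAMESPACE    SERVICE NAME         SERVICE VIP    WAYPOINT ENDPOINTS
default      echo                 10.96.99.1     None     3/4
default      other                10.96.99.2     None     2/2
"""

print(unhealthy_services(SAMPLE_SERVICES))
```

A row with zero healthy endpoints is the one that corresponds to the no healthy upstream error.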

Next we can look at how Kubernetes views the service:

$ kubectl get endpointslices
NAME          ADDRESSTYPE   PORTS    ENDPOINTS                           AGE
echo-v76p9    IPv4          8080     10.244.0.20,10.244.0.36 + 1 more... 7h50m

Here, we also see 3 endpoints.

If Kubernetes shows zero healthy endpoints, it indicates there is not an issue in the Istio setup, but rather the service is actually unhealthy. Check to ensure its labels select the expected workloads, and that those pods are marked as “ready”.

If this is seen for the kubernetes service, this may be fixed in Istio 1.23+ and Istio 1.22.3+.

If this is seen for hostNetwork pods, or other scenarios where multiple workloads have the same IP address, this may be fixed in Istio 1.24+.

Traffic fails with http status: ...

ztunnel acts as a TCP proxy, and does not parse users’ HTTP traffic at all. So it may be confusing that ztunnel reports an HTTP error.

This is the result of the tunneling protocol (“HBONE”) ztunnel uses, which runs over HTTP CONNECT. An error like this indicates ztunnel was able to establish an HBONE connection, but the stream was rejected.

When communicating to another ztunnel, this may be caused by various issues:

  • 400 Bad Request: the request was entirely invalid; this may indicate a bug
  • 401 Unauthorized: request was rejected by AuthorizationPolicy rules
  • 503 Service Unavailable: the destination is not available

When communicating with a waypoint proxy (Envoy), there is a wider range of response codes possible. 401 for AuthorizationPolicy rejection and 503 as a general catch-all are common.
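For quick triage of many log lines, the ztunnel-to-ztunnel codes above can be turned into a lookup. This is a sketch under one assumption: that the error text contains "http status: " followed by a three-digit code. The function name is hypothetical, and the explanations are the ones listed above.

```python
import re

# Causes listed above for ztunnel-to-ztunnel traffic.
ZTUNNEL_STATUS_CAUSES = {
    400: "the request was entirely invalid; this may indicate a bug",
    401: "the request was rejected by AuthorizationPolicy rules",
    503: "the destination is not available",
}


def explain_http_status(error):
    """Return (code, cause) for an 'http status: NNN' error, or None if absent."""
    match = re.search(r"http status: (\d{3})", error)
    if match is None:
        return None
    code = int(match.group(1))
    return code, ZTUNNEL_STATUS_CAUSES.get(code, "no cause listed for this code")


print(explain_http_status("http status: 401 Unauthorized"))
```

Codes outside this table are more likely to come from a waypoint, where the range of possible responses is wider.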

Traffic fails with connection closed due to connection drain

When ztunnel shuts down an instance of a proxy, it will close any outstanding connections. This will be preceded by a log like inpod::statemanager pod delete request, shutting down proxy for the pod.

This can happen:

  • If the Pod is actually deleted. In this case, the connections are generally already closed, though.
  • If ztunnel itself is shutting down.
  • If the pod was un-enrolled from ambient mode.

ztunnel logs HBONE ping timeout/error and ping timeout

These logs can be ignored. They are removed in Istio 1.23.1+. See issue for details.

ztunnel is not sending egress traffic to waypoints

Consider a ServiceEntry such as:

apiVersion: networking.istio.io/v1
kind: ServiceEntry
metadata:
  name: example.com
  labels:
    istio.io/use-waypoint: my-waypoint
spec:
  hosts:
  - example.com
  ports:
  - number: 80
    name: http
    protocol: HTTP
  resolution: DNS

Unlike a typical Service, this will not necessarily have two components needed for traffic capture to work:

  • It will not have a stable Service IP address known to Istio (example.com may have many, changing, IPs).
  • We do not have DNS set up to return such a stable IP address, if one did exist.

Istio has two features to resolve these issues:

  • values.pilot.env.PILOT_ENABLE_IP_AUTOALLOCATE=true enables a controller that will allocate an IP address for the ServiceEntry and write it into the object. You can view it in the ServiceEntry itself:
    status:
      addresses:
      - host: example.com
        value: 240.240.0.3
      - host: example.com
        value: 2001:2::3
  • values.cni.ambient.dnsCapture=true enables ztunnel to handle DNS, which allows it to respond with the above IP addresses in response to a query to example.com. Note that you will need to restart client workloads after changing this setting.

Together, this enables egress traffic to traverse a waypoint. To troubleshoot this:

  1. Ensure the ServiceEntry has an IP address in the status.
  2. Check the pod is getting this IP address in DNS lookups.
  3. Check whether this IP shows up as the destination IP address in ztunnel.
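The first two steps can be sketched in code. The example assumes the ServiceEntry has been loaded as a dictionary (for example from kubectl get serviceentry -o json) and that the IPs returned to the client by DNS were collected separately. Both function names are hypothetical.

```python
import ipaddress


def allocated_addresses(service_entry, host):
    """Return the auto-allocated addresses recorded in status.addresses for `host`."""
    addresses = service_entry.get("status", {}).get("addresses", [])
    return [ipaddress.ip_address(a["value"]) for a in addresses if a.get("host") == host]


def dns_uses_allocated(resolved_ips, service_entry, host):
    """True if every IP the client resolved is one of the allocated addresses."""
    allocated = set(allocated_addresses(service_entry, host))
    return bool(resolved_ips) and all(
        ipaddress.ip_address(ip) in allocated for ip in resolved_ips
    )


SERVICE_ENTRY = {
    "status": {
        "addresses": [
            {"host": "example.com", "value": "240.240.0.3"},
            {"host": "example.com", "value": "2001:2::3"},
        ]
    }
}

print(allocated_addresses(SERVICE_ENTRY, "example.com"))
print(dns_uses_allocated(["240.240.0.3"], SERVICE_ENTRY, "example.com"))
```

If the pod resolves example.com to a public IP instead of an allocated one, DNS capture is not taking effect for that workload.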

Waypoint issues

Traffic is not going through the waypoint

First, we will want to see some signs that indicate traffic is traversing a waypoint:

  1. Requests sent to the waypoint will generally go through Envoy’s HTTP processing, which will mutate the request. For example, by default headers will be translated to lowercase and a few Envoy headers are injected:

    x-envoy-upstream-service-time: 2
    server: istio-envoy
    x-envoy-decorator-operation: echo.default.svc.cluster.local:80/*

    Note this is not always the case, as traffic may be set as TCP.

  2. Waypoint access logs, if enabled, will log each request. See here to enable access logs.

  3. ztunnel access logs, if enabled, will log each request. See here for an example log to a waypoint.

Traffic can be sent to a service or directly to a workload. While sending to a service is typical, see the ztunnel access logs to identify the type of traffic. Similarly, a waypoint can be associated with a service, a workload, or both. Mismatches between these can cause the waypoint to not be utilized.

Cilium with bpf-lb-sock requires bpf-lb-sock-hostns-only to be set, or all traffic will be incorrectly treated as direct-to-workload traffic. (issue).

Next, we can check if ztunnel is configured to send to a waypoint:

$ istioctl zc services
NAMESPACE    SERVICE NAME         SERVICE VIP  WAYPOINT ENDPOINTS
default      echo                 10.96.0.1    waypoint 1/1
default      no-waypoint          10.96.0.2    None     1/1
$ istioctl zc workloads
NAMESPACE  POD NAME                     ADDRESS     NODE  WAYPOINT     PROTOCOL
default    echo-79dcbf57cc-l2cdp        10.244.0.1  node  None         HBONE
default    product-59896bc9f7-kp4lb     10.244.0.2  node  waypoint     HBONE

This indicates the echo Service and the product-59896bc9f7-kp4lb Pod are bound to waypoint. If ztunnel is configured to use the waypoint for the destination but traffic isn’t going to the waypoint, it is likely traffic is actually going to the wrong destination. Check the ztunnel access logs to verify the destination service/workload and ensure it matches.
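To list waypoint bindings for many workloads at once, the workloads table can be parsed. This is a sketch that assumes the six-column layout of the sample output, with a literal None meaning no waypoint. The function name is hypothetical.

```python
def waypoint_bindings(zc_workloads_output):
    """Map "namespace/pod" to its waypoint name, or None when unbound."""
    bindings = {}
    for line in zc_workloads_output.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) != 6:
            continue  # not a data row in the layout assumed here
        namespace, pod, _address, _node, waypoint, _protocol = parts
        bindings[f"{namespace}/{pod}"] = None if waypoint == "None" else waypoint
    return bindings


SAMPLE_WORKLOADS = """\
NAMESPACE  POD NAME                     ADDRESS     NODE  WAYPOINT     PROTOCOL
default    echo-79dcbf57cc-l2cdp        10.244.0.1  node  None         HBONE
default    product-59896bc9f7-kp4lb     10.244.0.2  node  waypoint     HBONE
"""

for workload, waypoint in waypoint_bindings(SAMPLE_WORKLOADS).items():
    print(workload, "->", waypoint)
```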

If the waypoint shows as None, ztunnel isn’t programmed to use a waypoint.

  1. Check the status on the object. This should give an indication whether it was attached to the waypoint or not. (Note: this is available in 1.24+, and currently only on Service and ServiceEntry)

    $ kubectl get svc echo -oyaml
    status:
      conditions:
      - lastTransitionTime: "2024-09-25T19:28:16Z"
        message: Successfully attached to waypoint default/waypoint
        reason: WaypointAccepted
        status: "True"
        type: istio.io/WaypointBound
    
  2. Check what resources have been configured to use a waypoint:

    $ kubectl get namespaces -L istio.io/use-waypoint
    NAME                           STATUS   AGE    USE-WAYPOINT
    namespace/default              Active   1h     waypoint
    namespace/istio-system         Active   1h
    

    You will want to look at namespaces in all cases, services and serviceentries for service cases, and pods and workloadentries for workload cases.

    This label must be set to associate a resource with a waypoint.

  3. If the label is present, this may be caused by the waypoint being missing or unhealthy. Check the Gateway objects and ensure the waypoint is deployed.

    $ kubectl get gateways.gateway.networking.k8s.io
    NAME       CLASS            ADDRESS   PROGRAMMED   AGE
    waypoint   istio-waypoint             False        17s
    

    The above shows an example of a waypoint that is deployed, but is not healthy. A waypoint will not be enabled until it becomes healthy at least once. If it is not healthy, check the status for more information.

    If the Gateway isn’t present at all, deploy one!
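The status check from step 1 can be sketched as follows. It assumes the Service or ServiceEntry was loaded as a dictionary (for example from kubectl get svc echo -o json) and looks for the istio.io/WaypointBound condition shown above. The function name is hypothetical.

```python
def waypoint_bound(obj):
    """Return (bound, message) based on the istio.io/WaypointBound status condition."""
    for condition in obj.get("status", {}).get("conditions", []):
        if condition.get("type") == "istio.io/WaypointBound":
            return condition.get("status") == "True", condition.get("message", "")
    return False, "no istio.io/WaypointBound condition found"


SERVICE = {
    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2024-09-25T19:28:16Z",
                "message": "Successfully attached to waypoint default/waypoint",
                "reason": "WaypointAccepted",
                "status": "True",
                "type": "istio.io/WaypointBound",
            }
        ]
    }
}

print(waypoint_bound(SERVICE))
```

A missing condition is reported as not bound. On versions before 1.24, or on resource types that do not carry this status, that result is inconclusive.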

Common information

Running a debug shell

Most pods have low privileges and few debug tools available. For some diagnostics it’s helpful to run an ephemeral container with elevated privileges and utilities. The istio/base image can be used for this, along with kubectl debug --profile sysadmin.

For example:

$ kubectl debug --image istio/base --profile sysadmin --attach -t -i shell-5b7cf9f6c4-npqgz

Setting log level

To view the current log level, run:

$ istioctl zc log ztunnel-cqg6c
ztunnel-cqg6c.istio-system:
current log level is info

To set the log level:

$ istioctl zc log ztunnel-cqg6c --level=info,access=debug
ztunnel-cqg6c.istio-system:
current log level is hickory_server::server::server_future=off,access=debug,info
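The level string reported above is a comma-separated list of target=level overrides plus a bare default level. A small sketch of how to split it is shown below. The function name is hypothetical, and it assumes any entry without "=" is the default level.

```python
def parse_log_filter(spec):
    """Split a filter such as "access=debug,info" into (default_level, {target: level})."""
    default_level = None
    targets = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "=" in part:
            target, level = part.split("=", 1)
            targets[target] = level
        else:
            default_level = part  # assumed to be the default level
    return default_level, targets


default_level, targets = parse_log_filter(
    "hickory_server::server::server_future=off,access=debug,info"
)
print(default_level, targets)
```

Here the access target is at debug, which is what enables the connection-opening logs, while everything else stays at info.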

To set at ztunnel pod startup, configure the environment variable:

$ kubectl -n istio-system set env ds/ztunnel RUST_LOG=info