Istio: What happens when control plane is down?


Hi folks!

I ran some experiments on Istio, taking down control plane components and observing what happens to the applications and the service mesh. Below you’ll find my notes.

Pilot

Pilot is responsible for Istio’s traffic management features, and also for keeping all sidecars updated with the latest mesh configuration.

When Pilot starts, it listens on ports 15010 (gRPC) and 8080 (legacy HTTP).

When the application sidecar (Envoy, Istio-Proxy) starts, it connects to pilot.istio-system:15010, gets its initial configuration and keeps the connection open.
Whenever Pilot detects a change in the mesh (it monitors the Kubernetes resources), it pushes the new configuration to the sidecars over this gRPC connection.

– If Pilot goes down, the gRPC connection between Pilot and the sidecar is lost, and the sidecars keep trying to reconnect to Pilot indefinitely.
– Traffic is not affected while Pilot is down, because all the configuration pushed to the sidecar lives in the sidecar’s memory (see the sketch below).
– Changes in the mesh (such as new pods, rules, services, etc.) won’t reach the sidecars, because Pilot is not there to watch for changes and forward them.
– Once Pilot is back, the sidecars reconnect to it (because they are always trying to) and grab the latest mesh config.
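
To make the “config lives in sidecar memory” point concrete, here is a minimal Go sketch (not Istio source code) of a sidecar that keeps routing from its last-known, in-memory config while retrying the Pilot connection forever. The MeshConfig and Sidecar types, the one-second retry interval and the fetch stub are illustrative assumptions; the real sidecar holds a long-lived gRPC (ADS) stream to pilot.istio-system:15010 rather than polling.

```go
// Conceptual sketch only: a sidecar that serves traffic from its last-known,
// in-memory config while it retries the control plane connection indefinitely.
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type MeshConfig struct {
	Version string // stands in for the last config snapshot the sidecar accepted
}

type Sidecar struct {
	mu     sync.RWMutex
	config MeshConfig // lives only in the sidecar's memory
}

// Route serves traffic from the in-memory config, so it keeps working
// even while Pilot is unreachable.
func (s *Sidecar) Route(req string) string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return fmt.Sprintf("routed %q using config %s", req, s.config.Version)
}

// watchPilot retries indefinitely; whenever a fetch succeeds, the new config
// replaces the cached one. A real sidecar keeps a long-lived gRPC stream to
// pilot.istio-system:15010 instead of polling like this.
func (s *Sidecar) watchPilot(fetch func() (MeshConfig, error)) {
	for {
		if cfg, err := fetch(); err == nil {
			s.mu.Lock()
			s.config = cfg
			s.mu.Unlock()
		}
		time.Sleep(1 * time.Second) // keep trying forever, Pilot up or not
	}
}

func main() {
	sc := &Sidecar{config: MeshConfig{Version: "v1"}}

	// Simulate Pilot being down: every fetch fails, yet routing still works.
	go sc.watchPilot(func() (MeshConfig, error) {
		return MeshConfig{}, errors.New("pilot.istio-system:15010 unreachable")
	})

	fmt.Println(sc.Route("GET /products"))
}
```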

Mixer Policy

Mixer Policy enforces policy decisions (such as quotas and access control) on traffic in the mesh.

Mixer reads its configuration on startup and also monitors Kubernetes for changes; once new config is detected, Mixer loads it into memory.

Sidecars call the Mixer Policy pod to check every request that targets the application service.

If the Mixer Policy pod is down, all requests to the service fail with a “503 UNAVAILABLE: no healthy upstream” error, because the sidecar couldn’t reach the Policy pod.

In Istio 1.1 there’s a new global setting (policyCheckFailOpen) that allows a “fail open” behavior, i.e., if the Mixer Policy pod is not reachable, requests succeed instead of failing with a 503 error. By default this setting is false, i.e., “fail closed”.
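
Here is a minimal sketch of the fail-open vs. fail-closed behavior described above. The policyCheck callback is a stand-in for the sidecar’s call to the Mixer Policy pod, and the policyCheckFailOpen variable mirrors the Istio 1.1 setting; this is an illustration of the behavior, not Istio’s actual implementation.

```go
// Sketch: what the sidecar does with a request when the Policy pod is unreachable,
// depending on the fail-open setting.
package main

import (
	"errors"
	"fmt"
	"net/http"
)

var policyCheckFailOpen = false // Istio 1.1 default: fail closed

// checkRequest runs the policy check and maps its outcome to a response status.
func checkRequest(policyCheck func() error) (status int, msg string) {
	err := policyCheck()
	switch {
	case err == nil:
		return http.StatusOK, "policy check passed"
	case policyCheckFailOpen:
		return http.StatusOK, "policy pod unreachable, failing open"
	default:
		return http.StatusServiceUnavailable, "503 UNAVAILABLE: no healthy upstream"
	}
}

func main() {
	// Simulate the Mixer Policy pod being down.
	down := func() error { return errors.New("no healthy upstream") }

	fmt.Println(checkRequest(down)) // 503: fail closed (the default)

	policyCheckFailOpen = true
	fmt.Println(checkRequest(down)) // 200: fail open
}
```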

While Mixer is down, any changes we make in the mesh (adding rules, changing config, etc.) won’t take effect on the apps until Mixer is up again.

Mixer Telemetry

Mixer Telemetry receives telemetry information from the sidecars and provides it to the add-ons (adapters).

Sidecars call the Telemetry pod after each request completes, providing telemetry information to the adapters (Prometheus, etc.). They do that in batches of 100 requests or 1 second (in the default configuration), whichever comes first, to avoid excessive calls to the Telemetry pod.

If the Telemetry pod is down, sidecars log an error (to the pod’s stderr) and discard the telemetry information. Requests are not affected, unlike when the Policy pod is down. Once the Telemetry pod is up again, it starts receiving telemetry information from the sidecars.
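
A rough Go sketch of that batching behavior: flush after 100 reports or after 1 second, whichever comes first, and log and drop the batch if the Telemetry pod can’t be reached. The reportLoop function and the send callback are illustrative assumptions, not Istio’s actual code.

```go
// Sketch: batch telemetry reports and drop them (with an error on stderr)
// when the Telemetry pod is unreachable; the request itself is unaffected.
package main

import (
	"errors"
	"log"
	"time"
)

func reportLoop(reports <-chan string, send func([]string) error) {
	const maxBatch = 100
	const flushEvery = time.Second

	batch := make([]string, 0, maxBatch)
	ticker := time.NewTicker(flushEvery)
	defer ticker.Stop()

	flush := func() {
		if len(batch) == 0 {
			return
		}
		if err := send(batch); err != nil {
			// Telemetry pod is down: log to stderr and discard the batch.
			log.Printf("dropping %d telemetry reports: %v", len(batch), err)
		}
		batch = batch[:0]
	}

	for {
		select {
		case r, ok := <-reports:
			if !ok {
				flush()
				return
			}
			batch = append(batch, r)
			if len(batch) >= maxBatch {
				flush() // size threshold reached
			}
		case <-ticker.C:
			flush() // time threshold reached
		}
	}
}

func main() {
	reports := make(chan string)
	go func() {
		for i := 0; i < 5; i++ {
			reports <- "request completed"
		}
		close(reports)
	}()
	// Simulate the Telemetry pod being unreachable.
	reportLoop(reports, func(b []string) error { return errors.New("telemetry unreachable") })
}
```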

Other notes

It’s worth noting that Istio allows a custom installation of its control plane components. For example, if you don’t need policy enforcement, you can disable Mixer Policy entirely. This modularity is getting better in Istio 1.1; for more info on that, check out the docs.

Also, Pilot, Mixer Policy and Mixer Telemetry work fine in an HA setup, with multiple replicas running at the same time. In fact, the default configuration ships with a HorizontalPodAutoscaler that ranges from 1 to 5 replicas for those pods. (Curious? See this and this.)
