Istio, mTLS, debugging a 503 error


Hello folks. In this blog post I’ll share with you a problem that I had while trying out the Circuit Breaking task in the Istio documentation. I’ll walk through all the steps I took while troubleshooting this issue, and hopefully it will be useful for someone out there. It certainly was for me: I learned a bit more about Istio in the process.

The steps of the task are quite simple:
1) Install a couple of pods (one serving httpbin + one having curl to communicate with the httpbin service)
2) Create a DestinationRule object so that calls to httpbin service are limited (Circuit Breaking)
Pretty straightforward, no? So, let the fun begin.

Let’s install the httpbin and helper pods:

$ kubectl create ns foo
$ kubectl apply -f <(istioctl kube-inject -f samples/httpbin/httpbin.yaml) -n foo
$ kubectl apply -f <(istioctl kube-inject -f samples/sleep/sleep.yaml) -n foo

$ kubectl -n foo get pod,svc
NAME                           READY     STATUS    RESTARTS   AGE
pod/httpbin-6bbb775889-wcp45   2/2       Running   0          35s
pod/sleep-5b597748b4-77kj5     2/2       Running   0          35s

NAME              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/httpbin   ClusterIP   10.105.25.98   <none>        8000/TCP   36s
service/sleep     ClusterIP   10.111.0.72    <none>        80/TCP     35s

From the helper pod, let’s invoke the httpbin service, using curl:

$ kubectl -n foo exec -it -c sleep sleep-5b597748b4-77kj5 -- curl http://httpbin:8000/get
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Content-Length": "0", 
    "Host": "httpbin:8000", 
    "User-Agent": "curl/7.35.0", 
    "X-B3-Sampled": "1", 
    "X-B3-Spanid": "b5d006d3d9bf1f4d", 
    "X-B3-Traceid": "b5d006d3d9bf1f4d", 
    "X-Request-Id": "970b84b2-999b-990c-91b4-b6c8d2534e77"
  }, 
  "origin": "127.0.0.1", 
  "url": "http://httpbin:8000/get"
}

So far, so good. Next step in the Circuit Breaking task is to add a DestinationRule:

$ cat <<EOF | kubectl -n foo apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutiveErrors: 1
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 100
EOF
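
A quick aside, not part of the original task: before retrying the service, I like to double-check that the object actually landed in the namespace:

# Sanity check (not in the task): confirm the rule exists and inspect what the API server stored
$ kubectl -n foo get destinationrule httpbin -o yaml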

Now let’s retry the httpbin service:

$ kubectl -n foo exec -it -c sleep sleep-5b597748b4-77kj5 -- curl http://httpbin:8000/get
upstream connect error or disconnect/reset before headers

Oops, something went wrong. Let’s make curl more verbose:

$ kubectl -n foo exec -it -c sleep sleep-5b597748b4-77kj5 -- curl -v http://httpbin:8000/get
* Hostname was NOT found in DNS cache
*   Trying 10.105.235.142...
* Connected to httpbin (10.105.235.142) port 8000 (#0)
> GET /get HTTP/1.1
> User-Agent: curl/7.35.0
> Host: httpbin:8000
> Accept: */*
> 
< HTTP/1.1 503 Service Unavailable
< content-length: 57
< content-type: text/plain
< date: Tue, 28 Aug 2018 12:26:54 GMT
* Server envoy is not blacklisted
< server: envoy
< 
* Connection #0 to host httpbin left intact
upstream connect error or disconnect/reset before headers

Hmmm, error 503… Why? According to the Circuit Breaking task, it should work fine. We only added a rule that sets the maximum number of TCP connections to 1, and indeed, with the curl command above we spawned just one connection. So, what’s wrong?

The first thing that came to my mind was to check Envoy’s statistics, to verify whether circuit breaking was in effect:

$ kubectl -n foo exec -it -c istio-proxy sleep-5b597748b4-77kj5 -- curl localhost:15000/stats | grep httpbin | grep pending
cluster.outbound|8000||httpbin.foo.svc.cluster.local.upstream_rq_pending_active: 0
cluster.outbound|8000||httpbin.foo.svc.cluster.local.upstream_rq_pending_failure_eject: 0
cluster.outbound|8000||httpbin.foo.svc.cluster.local.upstream_rq_pending_overflow: 0
cluster.outbound|8000||httpbin.foo.svc.cluster.local.upstream_rq_pending_total: 5

So, the value 0 for the upstream_rq_pending_overflow counter confirms that no call has been trapped by circuit breaking.

Explanation of the above command: the Istio sidecar (the Envoy container, named istio-proxy) locally exposes port 15000, which serves an HTTP administrative interface offering a few utilities, such as printing statistics about the services the proxy knows about.
So, in the command above we ran curl (curl localhost:15000/stats) inside the sidecar container (-c istio-proxy) of the helper pod (sleep-5b597748b4-77kj5), filtered the output for the service we want to investigate (| grep httpbin), and then filtered for the circuit breaker pending counters (| grep pending).
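
As an aside (this is not part of the original task), the admin interface also exposes a /clusters endpoint which, as far as I know, lists the circuit breaking thresholds that were pushed to the sidecar, so you can confirm that maxConnections and friends actually made it into Envoy; take the exact grep pattern below as a guess:

# Assumed extra check: look for the httpbin cluster limits in the Envoy admin /clusters output
$ kubectl -n foo exec -it -c istio-proxy sleep-5b597748b4-77kj5 -- curl -s localhost:15000/clusters | grep httpbin | grep max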

In order to confirm that the guilty party is the DestinationRule we just created, let’s delete it and try again:

$ kubectl -n foo delete DestinationRule httpbin
destinationrule.networking.istio.io "httpbin" deleted
$ kubectl -n foo exec -it -c sleep sleep-5b597748b4-77kj5 -- curl -v http://httpbin:8000/get
...
< HTTP/1.1 200 OK
...

Adding it back (applying the same YAML again) and calling the service once more:

...
< HTTP/1.1 503 Service Unavailable
...

So, it seems the DestinationRule is the bad guy here. But why? We need to investigate a bit more. Hey! What if we check the Envoy (istio-proxy sidecar) logs? Let’s do it:

$ kubectl -n foo logs -c istio-proxy sleep-5b597748b4-77kj5 -f
# In other terminal, issue the curl command (kubectl -n foo exec -it -c sleep sleep-5b597748b4-77kj5 -- curl -v http://httpbin:8000/get)
# Then we see in the log:
[2018-08-28T13:06:56.454Z] "GET /get HTTP/1.1" 503 UC 0 57 0 - "-" "curl/7.35.0" "19095d07-320a-9be0-8ba5-e0d08cf58f52" "httpbin:8000" "172.17.0.14:8000"

That doesn’t help much. The UC flag in that log entry means the upstream connection was terminated, so Envoy is generating the 503 itself because it cannot complete the call to the server. Let’s check the logs on the server side (httpbin):

$ kubectl -n foo logs -c istio-proxy httpbin-94fdb8c79-h9zrq -f
# In other terminal, issue the curl command (kubectl -n foo exec -it -c sleep sleep-5b597748b4-77kj5 -- curl -v http://httpbin:8000/get)
# Log is empty...

What? We don’t see anything in the log output. It’s like the request is not reaching the server. So, what to do now?… Hey! What if we could increase the log verbosity? Maybe the request is arriving but just isn’t being logged? Let’s see.

Remember what I said above about Envoy’s port 15000 being exposed locally in the service pod? We used it to grab statistics. Now let’s take a look at what else it offers:

$ kubectl -n foo exec -it -c istio-proxy httpbin-94fdb8c79-h9zrq -- curl http://localhost:15000/help
admin commands are:
  /: Admin home page
  /certs: print certs on machine
...
  /logging: query/change logging levels
...

Hey! It seems we found what we were looking for: /logging. Let’s use it:

$ kubectl -n foo exec -it -c istio-proxy httpbin-94fdb8c79-h9zrq -- curl http://localhost:15000/logging?level=trace
active loggers:
  admin: trace
...

The command above set all Envoy loggers to the trace level, the finest one. For more information about this administrative interface, check out the Envoy docs. Now, let’s retry retrieving the Envoy server log; hopefully with the trace level we will get something (in fact, we get lots of logs!):

$ kubectl -n foo logs -c istio-proxy httpbin-94fdb8c79-h9zrq -f
# In other terminal, issue the curl command (kubectl -n foo exec -it -c sleep sleep-5b597748b4-77kj5 -- curl -v http://httpbin:8000/get)
# Now we see in the logs (I filtered some content not relevant)
[debug][filter] external/envoy/source/extensions/filters/listener/original_dst/original_dst.cc:18] original_dst: New connection accepted
[debug][main] external/envoy/source/server/connection_handler_impl.cc:217] [C31] new connection
[trace][connection] external/envoy/source/common/network/connection_impl.cc:389] [C31] socket event: 2
[trace][connection] external/envoy/source/common/network/connection_impl.cc:457] [C31] write ready
[debug][connection] external/envoy/source/common/ssl/ssl_socket.cc:111] [C31] handshake error: 2
[trace][connection] external/envoy/source/common/network/connection_impl.cc:389] [C31] socket event: 3
[trace][connection] external/envoy/source/common/network/connection_impl.cc:457] [C31] write ready
[debug][connection] external/envoy/source/common/ssl/ssl_socket.cc:111] [C31] handshake error: 1
[debug][connection] external/envoy/source/common/ssl/ssl_socket.cc:139] [C31] SSL error: 268435612:SSL routines:OPENSSL_internal:HTTP_REQUEST
[debug][connection] external/envoy/source/common/network/connection_impl.cc:133] [C31] closing socket: 0

Wow, now that seems interesting! We can see that the request is indeed reaching the server, but it’s failing with a TLS handshake error and Envoy is closing the connection. The question now is: why a handshake error? Why is SSL involved at all?
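
(A quick housekeeping note before we move on: trace is extremely verbose, so once you have captured what you need it’s a good idea to turn the loggers back down. I believe the default level is info, and the same /logging endpoint should also accept a single logger name as a parameter; treat the exact calls below as assumptions for your Envoy version.)

# Assumed: restore all loggers to the default level
$ kubectl -n foo exec -it -c istio-proxy httpbin-94fdb8c79-h9zrq -- curl 'http://localhost:15000/logging?level=info'
# Assumed: or raise only the connection logger instead of all of them
$ kubectl -n foo exec -it -c istio-proxy httpbin-94fdb8c79-h9zrq -- curl 'http://localhost:15000/logging?connection=trace'

Now, back to that handshake error.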

When we speak of SSL in the context of Istio, we think of mutual TLS (mTLS). So I went to the Istio docs looking for something relevant to my problem, and reading the security-related tasks opened my eyes!

I found out that I had installed Istio with mutual TLS activated!

Let’s do some checking:

$ kubectl get MeshPolicy default -o yaml
apiVersion: authentication.istio.io/v1alpha1
kind: MeshPolicy
metadata: ...
spec:
  peers:
  - mtls: {}

$ kubectl -n istio-system get DestinationRule default -o yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata: ...
spec:
  host: '*.local'
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL

The outputs above show that mTLS is enabled in the cluster; these objects only exist when mTLS is on.
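
Depending on your Istio version, istioctl can also report, per service, whether the client and the server agree on TLS. I’m not certain of the exact syntax and output columns across releases (some versions also want a client pod name), but in the 1.0 line something like the command below should flag the mismatch once our plaintext-only DestinationRule is applied:

# Assumed syntax for Istio 1.0; expect a CONFLICT-style status for httpbin while our rule lacks a tls block
$ istioctl authn tls-check httpbin.foo.svc.cluster.local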

OK, looking back at my installation scripts, I realized that I really messed up and installed Istio with mTLS on. However, the question is still there: why is the httpbin service failing? Knowing that mTLS is active in the mesh, and reading the documentation, it’s not hard to deduce that the server is expecting a TLS connection while the client is issuing a plain-text one. So the question changes again: why is the client (sleep pod) connecting to the server (httpbin pod) using plain text?

Again, the documentation gives us the answer. The way mTLS works in Istio is simple: there is a default DestinationRule object (called “default”, as we can see in the command above) that instructs all traffic in the mesh to go through TLS. However, when we created our own DestinationRule for the purpose of the Circuit Breaking task, we overrode that default configuration with our own, which has no TLS at all! This is stated in the TLS documentation for Istio:

Don’t forget that destination rules are also used for non-auth reasons such as setting up canarying, but the same order of precedence applies. So if a service requires a specific destination rule for any reason – for example, for a configuration load balancer – the rule must contain a similar TLS block with ISTIO_MUTUAL mode, as otherwise it will override the mesh- or namespace-wide TLS settings and disable TLS.

So, it’s now clear what we should do: modify the DestinationRule for the Circuit Breaking task to include a TLS block (the tls section at the end of the spec below):

$ cat <<EOF | kubectl -n foo apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutiveErrors: 1
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 100
    tls:
      mode: ISTIO_MUTUAL
EOF
destinationrule.networking.istio.io/httpbin configured

Now let’s try our httpbin service:

$ kubectl -n foo exec -it -c sleep sleep-5b597748b4-77kj5 -- curl -v http://httpbin:8000/get
...
< HTTP/1.1 200 OK
...
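
For an extra confirmation that traffic is now really flowing over mutual TLS, a trick borrowed from the Istio docs is to ask httpbin to echo the request headers back: when the sidecars talk mTLS, the server side injects an X-Forwarded-Client-Cert header into the request it forwards to the application:

$ kubectl -n foo exec -it -c sleep sleep-5b597748b4-77kj5 -- curl -s http://httpbin:8000/headers
# Look for an "X-Forwarded-Client-Cert" entry in the JSON output;
# its presence means the request went through the mTLS tunnel.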

\o/

Now I can continue with my Circuit Breaking tutorial!

Some lessons learned:
– Confirm whether you are using mTLS or not; enabling it opens the door to obscure errors 🙂
– DestinationRules have an order of precedence: more specific ones override global ones
– We can sometimes make good use of the sidecar’s administrative interface (local port 15000)
– Always read the docs 🙂

See you next time!
