Istio, mTLS, debugging a 503 error

Hello folks. In this blog post I’ll share with you a problem that I had while trying out the Circuit Breaking tutorial in the Istio documentation. I’ll walk through all the steps I took while troubleshooting this issue, and hopefully it will be useful for someone out there. It was useful for me, at least: I learned a bit more about Istio in the process.

The steps of the task are quite simple:
1) Install a couple of pods (one serving httpbin + one with curl, to communicate with the httpbin service)
2) Create a DestinationRule object so that calls to the httpbin service are limited (Circuit Breaking)
Pretty straightforward, no? So, let the fun begin.

Let’s install the httpbin and helper pods:

$ kubectl create ns foo
$ kubectl apply -f <(istioctl kube-inject -f samples/httpbin/httpbin.yaml) -n foo
$ kubectl apply -f <(istioctl kube-inject -f samples/sleep/sleep.yaml) -n foo

$ kubectl -n foo get pod,svc
NAME                           READY     STATUS    RESTARTS   AGE
pod/httpbin-6bbb775889-wcp45   2/2       Running   0          35s
pod/sleep-5b597748b4-77kj5     2/2       Running   0          35s

NAME              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/httpbin   ClusterIP   10.105.25.98   <none>        8000/TCP   36s
service/sleep     ClusterIP   10.111.0.72    <none>        80/TCP     35s

From the helper pod, let’s invoke the httpbin service, using curl:

$ kubectl -n foo exec -it -c sleep sleep-5b597748b4-77kj5 -- curl http://httpbin:8000/get
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Content-Length": "0", 
    "Host": "httpbin:8000", 
    "User-Agent": "curl/7.35.0", 
    "X-B3-Sampled": "1", 
    "X-B3-Spanid": "b5d006d3d9bf1f4d", 
    "X-B3-Traceid": "b5d006d3d9bf1f4d", 
    "X-Request-Id": "970b84b2-999b-990c-91b4-b6c8d2534e77"
  }, 
  "origin": "127.0.0.1", 
  "url": "http://httpbin:8000/get"
}

So far, so good. Next step in the Circuit Breaking task is to add a DestinationRule:

$ cat <<EOF | kubectl -n foo apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutiveErrors: 1
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 100
EOF
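
Just to double-check that the object was created as expected, we can ask Kubernetes for it back (the output should echo the trafficPolicy we just applied):

$ kubectl -n foo get destinationrule httpbin -o yaml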

Now let’s retry the httpbin service:

$ kubectl -n foo exec -it -c sleep sleep-5b597748b4-77kj5 -- curl http://httpbin:8000/get
upstream connect error or disconnect/reset before headers

Oops, something went wrong. Let’s make curl more verbose:

$ kubectl -n foo exec -it -c sleep sleep-5b597748b4-77kj5 -- curl -v http://httpbin:8000/get
* Hostname was NOT found in DNS cache
*   Trying 10.105.235.142...
* Connected to httpbin (10.105.235.142) port 8000 (#0)
> GET /get HTTP/1.1
> User-Agent: curl/7.35.0
> Host: httpbin:8000
> Accept: */*
> 
< HTTP/1.1 503 Service Unavailable
< content-length: 57
< content-type: text/plain
< date: Tue, 28 Aug 2018 12:26:54 GMT
* Server envoy is not blacklisted
< server: envoy
< 
* Connection #0 to host httpbin left intact
upstream connect error or disconnect/reset before headers

Hmmm, a 503 error… Why? According to the Circuit Breaking task, it should work fine. We only added a rule that sets the maximum number of TCP connections to 1, and indeed, with the curl command above we spawned just one connection. So, what’s wrong?

The first thing that came to my mind was to issue a command to verify whether circuit breaking was in effect:

$ kubectl -n foo exec -it -c istio-proxy sleep-5b597748b4-77kj5 -- curl localhost:15000/stats | grep httpbin | grep pending
cluster.outbound|8000||httpbin.foo.svc.cluster.local.upstream_rq_pending_active: 0
cluster.outbound|8000||httpbin.foo.svc.cluster.local.upstream_rq_pending_failure_eject: 0
cluster.outbound|8000||httpbin.foo.svc.cluster.local.upstream_rq_pending_overflow: 0
cluster.outbound|8000||httpbin.foo.svc.cluster.local.upstream_rq_pending_total: 5

So, the value 0 for the upstream_rq_pending_overflow counter confirms that no call has been trapped by circuit breaking.

Explanation of the above command:
The Istio sidecar (the Envoy container, named istio-proxy) locally exposes port 15000, which is accessible via HTTP and offers some utilities, such as printing statistics about the service.
So, in the command above we ran curl (curl localhost:15000/stats) inside the sidecar container (-c istio-proxy) of the helper pod (sleep-5b597748b4-77kj5), filtering the output by the service we want to investigate (| grep httpbin) and then by the circuit breaker pending state (| grep pending).
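
As a side note, the same admin interface can also show which circuit-breaker thresholds Envoy actually picked up from our DestinationRule. Something along these lines should work (the exact format of the /clusters output varies between Envoy versions, so take the expected line below as an approximation):

$ kubectl -n foo exec -it -c istio-proxy sleep-5b597748b4-77kj5 -- curl -s localhost:15000/clusters | grep httpbin | grep max_connections
# Expect something similar to:
# outbound|8000||httpbin.foo.svc.cluster.local::default_priority::max_connections::1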

In order to confirm that the DestinationRule is the culprit, let’s delete it and try again:

$ kubectl -n foo delete DestinationRule httpbin
destinationrule.networking.istio.io "httpbin" deleted
$ kubectl -n foo exec -it -c sleep sleep-5b597748b4-77kj5 -- curl -v http://httpbin:8000/get
...
< HTTP/1.1 200 OK
...

Adding it back:

...
< HTTP/1.1 503 Service Unavailable
...

So, it seems the DestinationRule is the bad guy here. But why? We need to investigate a bit more. Hey! What if we check the Envoy (istio-proxy sidecar) logs? Let’s do it:

$ kubectl -n foo logs -c istio-proxy sleep-5b597748b4-77kj5 -f
# In another terminal, issue the curl command (kubectl -n foo exec -it -c sleep sleep-5b597748b4-77kj5 -- curl -v http://httpbin:8000/get)
# Then we see in the log:
[2018-08-28T13:06:56.454Z] "GET /get HTTP/1.1" 503 UC 0 57 0 - "-" "curl/7.35.0" "19095d07-320a-9be0-8ba5-e0d08cf58f52" "httpbin:8000" "172.17.0.14:8000"

It doesn’t help much. The log tells us that Envoy returned a 503 after an upstream connection problem (the UC response flag stands for upstream connection termination). So, let’s check the logs for the server side (httpbin):

$ kubectl -n foo logs -c istio-proxy httpbin-94fdb8c79-h9zrq -f
# In another terminal, issue the curl command (kubectl -n foo exec -it -c sleep sleep-5b597748b4-77kj5 -- curl -v http://httpbin:8000/get)
# Log is empty...

What? We don’t see anything in the log output. It’s as if the request is not reaching the server. So, what to do now?… Hey! What if we could increase the log verbosity? Maybe the request is coming in but it’s just not being logged at the current level? Let’s see.

Remember I mentioned above that Envoy’s port 15000 is exposed locally to the service pod? We used it to grab statistics. Let’s take a look at it to find out what else it offers:

$ kubectl -n foo exec -it -c istio-proxy httpbin-94fdb8c79-h9zrq -- curl http://localhost:15000/help
admin commands are:
  /: Admin home page
  /certs: print certs on machine
...
  /logging: query/change logging levels
...

Hey! It seems we found what we were looking for: /logging. Let’s use it:

$ kubectl -n foo exec -it -c istio-proxy httpbin-94fdb8c79-h9zrq -- curl http://localhost:15000/logging?level=trace
active loggers:
  admin: trace
...

The command above sets all Envoy loggers to the trace level, the finest one. For more information about this administrative interface, check out the Envoy docs. Now, let’s retry retrieving the Envoy server logs; hopefully with the trace level we will get something (in fact, we get lots of logs!):

$ kubectl -n foo logs -c istio-proxy httpbin-94fdb8c79-h9zrq -f
# In another terminal, issue the curl command (kubectl -n foo exec -it -c sleep sleep-5b597748b4-77kj5 -- curl -v http://httpbin:8000/get)
# Now we see in the logs (I filtered out some irrelevant content):
[debug][filter] external/envoy/source/extensions/filters/listener/original_dst/original_dst.cc:18] original_dst: New connection accepted
[debug][main] external/envoy/source/server/connection_handler_impl.cc:217] [C31] new connection
[trace][connection] external/envoy/source/common/network/connection_impl.cc:389] [C31] socket event: 2
[trace][connection] external/envoy/source/common/network/connection_impl.cc:457] [C31] write ready
[debug][connection] external/envoy/source/common/ssl/ssl_socket.cc:111] [C31] handshake error: 2
[trace][connection] external/envoy/source/common/network/connection_impl.cc:389] [C31] socket event: 3
[trace][connection] external/envoy/source/common/network/connection_impl.cc:457] [C31] write ready
[debug][connection] external/envoy/source/common/ssl/ssl_socket.cc:111] [C31] handshake error: 1
[debug][connection] external/envoy/source/common/ssl/ssl_socket.cc:139] [C31] SSL error: 268435612:SSL routines:OPENSSL_internal:HTTP_REQUEST
[debug][connection] external/envoy/source/common/network/connection_impl.cc:133] [C31] closing socket: 0

Wow, now that seems interesting! We can see that the request is indeed reaching the server, but it’s failing due to a TLS handshake error and Envoy is closing the connection. The question now is: why a handshake error? Why is SSL involved at all?

When speaking of SSL in the context of Istio, we think of mutual TLS. So I went to the Istio docs, trying to find something relevant to my problem, and reading the security tutorial task opened my eyes!

I found out that I had installed Istio with mutual TLS enabled!

Let’s do some checking:

$ kubectl get MeshPolicy default -o yaml
apiVersion: authentication.istio.io/v1alpha1
kind: MeshPolicy
metadata: ...
spec:
  peers:
  - mtls: {}

$ kubectl -n istio-system get DestinationRule default -o yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata: ...
spec:
  host: '*.local'
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL

The outputs above show that mTLS is enabled in the cluster: those objects only exist when mTLS is on.
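
By the way, the /certs endpoint we saw earlier in the /help output gives another hint: it prints the certificates loaded by the sidecar, and with mTLS on, the Istio-issued certificates should show up there. I’m only suggesting it as an extra sanity check (the output is quite verbose):

$ kubectl -n foo exec -it -c istio-proxy httpbin-94fdb8c79-h9zrq -- curl -s localhost:15000/certs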

OK, looking back at my installation scripts, I realized that I really messed up and installed Istio with mTLS on. However, the question is still there: why is the httpbin service failing? Knowing that mTLS is active in the mesh, and reading the documentation, it’s not hard to deduce that the server is expecting a TLS connection while the client is issuing a plain-text one. So the question changes again: why is the client (sleep pod) connecting to the server (httpbin pod) using plain text?

Again, looking at the documentation we find the answer. The way mTLS works in Istio is simple: there is a default DestinationRule object (called “default”, as we can see in the command above) that instructs all traffic in the mesh to go through TLS. However, when we created our own DestinationRule for the purpose of the Circuit Breaking task, we overrode that default configuration with our own, which has no TLS block at all! This is stated in the TLS documentation for Istio:

Don’t forget that destination rules are also used for non-auth reasons such as setting up canarying, but the same order of precedence applies. So if a service requires a specific destination rule for any reason – for example, for a configuration load balancer – the rule must contain a similar TLS block with ISTIO_MUTUAL mode, as otherwise it will override the mesh- or namespace-wide TLS settings and disable TLS.
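
In hindsight, there is an even quicker way to spot this kind of client/server TLS mismatch: if I remember correctly, istioctl (around Istio 1.0) has an authn tls-check subcommand that compares the server-side authentication policy with the client-side DestinationRule for a service and flags conflicts. The exact syntax may differ depending on your Istio version, but something like this should do it:

$ istioctl authn tls-check httpbin.foo.svc.cluster.local
# A CONFLICT status here would point straight at the offending DestinationRule.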

So, it’s clear now what we should do: modify the DestinationRule for the Circuit Breaking task to include the TLS block (the last two lines of the spec below):

$ cat <<EOF | kubectl -n foo apply -f -
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutiveErrors: 1
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 100
    tls:
      mode: ISTIO_MUTUAL
EOF
destinationrule.networking.istio.io/httpbin configured

Now let’s try our httpbin service:

$ kubectl -n foo exec -it -c sleep sleep-5b597748b4-77kj5 -- curl -v http://httpbin:8000/get
...
< HTTP/1.1 200 OK
...

\o/

Now I can continue with my Circuit Breaking tutorial!

Some lessons learned:
– Check whether you are using mTLS or not; enabling it opens the door to obscure errors 🙂
– DestinationRules have an order of precedence: more specific ones override global ones
– We can sometimes make good use of the sidecar’s administrative interface (local port 15000); see one more tip below
– Always read the docs 🙂
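
That tip about the administrative interface deserves one last note: after a debugging session like this, it’s a good idea to put the Envoy log level back, since trace is extremely verbose. As far as I know, info is the default level, so something like this should restore it:

$ kubectl -n foo exec -it -c istio-proxy httpbin-94fdb8c79-h9zrq -- curl http://localhost:15000/logging?level=info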

See you next time!

Istio, Service Mesh, Updates

Hello folks! This blog has been so quiet lately… I just wanted to give you an update on what I am doing these days. So…

Istio Logo

Istio project

Around December 2017 / January 2018 I switched teams at Red Hat and started working with Istio. Istio is a tool/platform that helps us deliver microservices, in a number of different aspects. I intend to write more posts on that. Meanwhile, you can find out more about Istio on their website. Trust me, it’s a wonderful tool to take a look at if you’re somehow involved in, or planning to enter, this microservices world, be you dev or ops :).

So far our work has been mainly about understanding the benefits of Istio when used on top of Kubernetes and Openshift (Red Hat’s improved and supported version of Kubernetes). This means that we have been mostly involved in downstream tasks, like creating CentOS and RHEL RPM packages and container images, although we are already contributing to the upstream, community effort. The plan is to get much more involved in the community soon.

Events

Last December I attended Kubecon US in Austin, Texas. It was my first immersion in Istio, given there was an IstioCon, a pre-event of Kubecon with a bunch of talks and workshops. Last May I attended Kubecon EU, in Copenhagen, Denmark. All talks are freely available on the CNCF channel on YouTube. Check them out! It is definitely great material for keeping up with the latest and greatest stuff happening out there, not only in the Istio world, but in the whole Kubernetes ecosystem.

Last April I attended a great Brazilian conference – The Developers Conference (TDC) – in Florianópolis and spoke about the container ecosystem – alternatives to Docker (CRI-O, Buildah, Podman, etc.).

In July I’ll be speaking at TDC again, this time in São Paulo. I’ll give two talks about Istio and one on the container ecosystem, mostly the same ones I gave at the previous TDC in Florianópolis. This event is already such a success that the organizers had to duplicate some tracks in order to handle so many attendees. Therefore, my talks about Istio on the microservices track (Introduction to Service Mesh with Istio) and the devops track (How Istio facilitates A/B deployments on micro-services) will be duplicated and I will give two of each, for a total of 5 talks! I hope to have enough swag for so many talks!

In August I’ll be traveling to Boston, MA to attend the very first Devconf to be held in the US! I’ll be speaking about Istio there as well.

There is more to come, I’d guess, given the amount of hype and momentum that Istio and service mesh are gaining everywhere.

Plans

So, my plan is to blog regularly about the cool stuff we are doing around Istio, and how it can be used to improve microservice deployments, making admins’ and developers’ lives easier!

See you soon!

Fedora 26 not connecting to wireless

This is a quick hint in case you suffer from the same issue I had while installing Fedora 26 (alpha).

The installer didn’t manage to connect to my wireless router, a D-Link one. More specifically, it was not getting an IP address from the router. Some problem with DHCP, it seems.

If that’s the case, open a terminal (press the Windows key, then type “Terminal” and hit Enter), and type the following command:

sudo sh -c 'echo "send dhcp-client-identifier = hardware;" > /etc/dhcp/dhclient.conf'

Then reconnect to your wireless network. After the first boot – when the system is installed – repeat the command above, just once.
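
To confirm the trick worked, you can check whether the wireless interface got an IPv4 address. The interface name below (wlp2s0) is just an example; yours will likely differ:

nmcli device show wlp2s0 | grep IP4.ADDRESS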

By the way, this is not a new issue. I experienced it on Fedora 25, but curiously, at the time the installer (the live system, actually) worked fine; just the installed system suffered from it. Now, with F26, it happens from the beginning. Here’s the bugzilla entry: https://bugzilla.redhat.com/show_bug.cgi?id=1154200.

Hope that helps, happy Fedora!

GNOME Release Party – Success! Hands on soon?

Hey guys, last Saturday we had our [first?] GNOME Release Party here in São Paulo, Brazil. It was very, very productive and most importantly, fun!

Georges already blogged about it, check it out!

Also, we had a feeling that another event, with a more hacking-oriented focus, would be interesting. So, I set up a meetup in order to see if there are enough people interested in such a hands-on event 🙂

Updates on Timezone extension

Hey, this is a quick post to show the improvements the Timezone extension for GNOME Shell has received since it was born a couple of weeks ago.

  • Support for avatars from Gravatar.com and Libravatar.
  • Support for GitHub profiles. We fetch people’s name, city and avatar from there.
  • Ability to specify a location for the people.json file. We even support remote files (e.g.: http://example.com/people.json). This way a whole team can share this file and use a single copy of it!

Here’s a simple people.json file showing the new features:

[
  {
    "github": "jwendell",
    "tz": "America/Sao_Paulo"
  },
  {
    "name": "Niel",
    "gravatar": "niel@example.com",
    "city": "Cape Town",
    "tz": "Africa/Johannesburg"
  }
]

Screenshots:

Summary at the bottom. Clicking will open the preferences dialog
Preferences dialog

Timezone extension for GNOME Shell

I’d like to share with you my first GNOME Shell extension, Timezone: It helps people like me, who work in a distributed (global) team and need to know where and when their coworkers are.

screenshot

It reads a JSON file containing all people data and renders them into a nice overview organized by timezone. See all relevant information on its page: https://github.com/jwendell/gnome-shell-extension-timezone.
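
For reference, a minimal people.json could look something like the snippet below. The entries are just an illustration; check the project page for the full list of supported fields:

[
  {
    "name": "Alice",
    "city": "São Paulo",
    "tz": "America/Sao_Paulo"
  },
  {
    "name": "Bob",
    "city": "Berlin",
    "tz": "Europe/Berlin"
  }
]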

Please install it, test it and report any issues you find. I hope it’s as useful to you as it is to me 🙂

New job: Red Hat

Hello!

It’s with great pleasure that I announce that since last December I have been working at Red Hat! I’ve already updated my LinkedIn profile but forgot to blog about it 🙂 .

It’s no secret to anyone that I have always admired Red Hat and that I think it is one of the best places to work for people like me, who love open source. So, I consider this a big step in my life and career!

First day in the São Paulo office

I’m working in the Cloud Enablement team, which is responsible for bringing Middleware products (JBoss, etc.) to Red Hat’s PaaS, Openshift. That’s called xPaaS. It’s an entirely new world for me: Docker, Kubernetes, Openshift, JBoss, etc… and at the same time it’s a very exciting moment.

I’m still living in São Paulo, but working from home now, which gives me some opportunities 🙂 .

Last week at Red Hat headquarters in Raleigh

In other news, I’m now an Emeritus Member of the GNOME Foundation. This means it has been quite a while since I made any substantial contribution to the project. Sad 🙁 . The good news is that, working at Red Hat, I’m closer than ever to contributing to open source projects! I hope to be back in the game soon 🙂

Hey, talking about cloud… Old but gold:

Systemd journal: What does “systemd-journal: Suppressed N messages from /system.slice/…” mean?

Under pressure, my system was generating lots of messages like this:

# systemctl -l status systemd-journald.service
Jun 24 13:47:23 localhost systemd-journal[155]: Suppressed 15460 messages from /system.slice/...

This means the system is generating lots of messages and the journal is configured to drop some of them. This is called rate limiting, and it is useful to avoid overloading the logging system.

Sometimes, however, you just want to get all the messages, or perhaps increase these limits. This can be achieved by setting the variables RateLimitInterval and RateLimitBurst in the config file /etc/systemd/journald.conf (see the example below).

Quoting the journald.conf man page: “To turn off any kind of rate limiting, set either value to 0.”
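
As an example, a journald.conf that relaxes the limits could look like this. The numbers below are arbitrary; tune them to your needs (or set one of the values to 0 to disable rate limiting entirely):

[Journal]
RateLimitInterval=30s
RateLimitBurst=10000

Then restart the journal daemon so it picks up the new settings:

# systemctl restart systemd-journald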