Tuesday, May 15, 2018

Service isolation in Zuul

These days I am working on a company-wide API gateway. The idea is to hide all microservices behind this gateway, which we are building on Netflix Zuul. When building this kind of company-wide infrastructure, a typical question is how to isolate resources between multiple users. In the database and virtualization fields, this is called multi-tenancy.

Why do we need service isolation in Zuul?

This is a GitHub issue about Zuul:

Q:
We are planning using Zuul in production as an API gateway. I am not sure the behavior of Zuul in this specific scenario.
Suppose there are two back-end services A and B sitting behind Zuul. Service A is a slow one but with a lot of traffic. What will happen to the clients visiting Service B through Zuul?
On the clients' side, Service B will be unavailable because Service A slow down the Zuul so there is no resource to handle the request for Service B. I'm not sure about this, and any advice or experience will be much appreciated.

A:
At the end of the day, yes if Zuul runs out of threads to handle incoming requests because a service downstream is behaving badly, yes you wont be able to proxy any other requests to other services. However that is what we integrate Hystrix with Zuul so that you can protect Zuul from a situation like this. You could also do things like adjust timeout values to make sure that slow requests timeout after an appropriate amount of time.


In a nutshell, we need to reduce the blast radius and optimize our resource usage since the API gateway is a critical path of the whole system.

Failure isolation

According to the response of the community, if we are using the Ribbon-based routing filter in Zuul, we can easily isolate any failing microservice from other services. 
Fig 1. Failure isolation when service B is dead

When every HTTP request is wrapped by a Ribbon command (which is a Hystrix command under the hood) and a microservice is dead, Hystrix does not let every request run into a timeout. Instead, it opens the circuit for this microservice and keeps it open until the microservice is alive again.
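To make the open/closed behavior concrete, here is a deliberately simplified circuit breaker in plain Java. It is not Hystrix code: Hystrix decides based on an error percentage over a rolling window, while this sketch just counts consecutive failures. The class name `SimpleCircuitBreaker` and its parameters are made up for illustration.

```java
// Simplified illustration only; NOT how Hystrix is implemented.
class SimpleCircuitBreaker {
    private final int failureThreshold; // consecutive failures before the circuit opens
    private final long openMillis;      // how long the circuit stays open before a trial request
    private int consecutiveFailures = 0;
    private long openedAt = -1;         // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    // Closed circuit: always allow. Open circuit: allow a trial only after the wait time.
    synchronized boolean allowRequest(long nowMillis) {
        if (openedAt < 0) {
            return true;
        }
        return nowMillis - openedAt >= openMillis;
    }

    // A successful call closes the circuit and resets the failure count.
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    // Enough failures in a row open (or re-open) the circuit.
    synchronized void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = nowMillis;
        }
    }
}
```

While the circuit is open, requests for the dead service fail immediately instead of holding a worker thread until the timeout.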

Resource isolation

According to the response of the Spring community, once every HTTP request is wrapped by a Hystrix command with a proper Hystrix command key, we can isolate failures between microservices. But Hystrix cannot help us isolate resources between microservices.

What does resource mean in our context? 

Every system has limited resources. In our case, Zuul is a web application running in a web container. Most web containers, such as Tomcat, Jetty, and Undertow, maintain a worker thread pool to handle HTTP requests. During the life cycle of an HTTP request, a worker thread is dedicated to this request. We treat the worker thread pool as our main resource in Zuul, and we want to distribute this resource among multiple microservices based on some algorithm and configuration.

Zuul itself doesn't differentiate HTTP requests between microservices, and we don't want to end up with an overcomplicated custom thread pool. Instead, we use JDK semaphores to isolate resource usage between microservices.

The idea is as follows. Suppose we have N threads in the worker thread pool and 3 microservices behind Zuul. To share these N threads evenly between the 3 services, we set up 3 semaphores, each with N/3 permits. Every HTTP request for a microservice needs to acquire the corresponding semaphore before being routed. If there are too many requests for one microservice, Zuul simply returns HTTP 429 (Too Many Requests) without actually routing the request.
Fig 2. Resource isolation when service B is overflowed
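The semaphore idea can be sketched in a few lines of plain Java. This is a minimal sketch, not our actual filter: the class `ServiceIsolation`, the `route` method, and the `Supplier` standing in for the real routing call are all hypothetical names, and the wiring into a Zuul filter is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper, not part of Zuul: one semaphore per microservice.
class ServiceIsolation {
    static final int TOO_MANY_REQUESTS = 429;

    private final Map<String, Semaphore> semaphores = new ConcurrentHashMap<>();
    private final int permitsPerService;

    ServiceIsolation(int permitsPerService) {
        this.permitsPerService = permitsPerService;
    }

    // Returns the status of the routing call, or 429 if the service has used up its share.
    int route(String serviceId, Supplier<Integer> routingCall) {
        Semaphore semaphore =
                semaphores.computeIfAbsent(serviceId, id -> new Semaphore(permitsPerService));
        // tryAcquire never blocks, so a saturated service is rejected right away
        // instead of waiting on a worker thread.
        if (!semaphore.tryAcquire()) {
            return TOO_MANY_REQUESTS;
        }
        try {
            return routingCall.get();
        } finally {
            // Always give the permit back, even if routing throws.
            semaphore.release();
        }
    }
}
```

The important choice here is `tryAcquire()` over `acquire()`: a blocking acquire would park the worker thread, which is exactly the resource we are trying to protect.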

To make full use of the hardware, we also introduce another factor called the over-sell-factor, so that the permits of all semaphores sum to over-sell-factor × thread-number.
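As a rough sketch of the arithmetic, the permits per service under an even split could be computed like this. The class and method names are hypothetical, and the rounding (round down, but never below one permit) is an assumption for this example, not necessarily what we run in production.

```java
// Hypothetical helper for illustration.
class PermitCalculator {
    // Even split of (overSellFactor × workerThreads) permits across all services.
    static int fairPermits(int workerThreads, double overSellFactor, int serviceCount) {
        if (workerThreads <= 0 || serviceCount <= 0 || overSellFactor <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        int totalPermits = (int) Math.floor(workerThreads * overSellFactor);
        // Assumption: integer division rounds down, and each service gets at least one permit.
        return Math.max(1, totalPermits / serviceCount);
    }
}
```

For example, with 16 worker threads, an assumed over-sell-factor of 1.5, and 2 services, there are 24 permits in total, so each service gets 12.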

Test

To validate the service isolation, I did a simple test in our QA environment. I set up 2 microservices and one Zuul instance with 16 worker threads. I used ApacheBench (ab) to run the load test with these commands:
  • ab -n 10000 -c 20  http://service-a.elasticbeanstalk.com/info
  • ab -n 10000 -c 9 http://service-b.elasticbeanstalk.com/info
Both service-a and service-b are CNAMEs that point to Zuul.
So the concurrency is 20 for service A and 9 for service B. We ran the commands at the same time, and here is the result. According to the ApacheBench output, some requests for service A failed because of the isolation, while all the requests for service B succeeded.


Fig 3. Result with service isolation on

Work in the future

Currently, we have two strategies to distribute resources: a fair one and a weighted one. Ideally, this distribution should adjust dynamically according to the current status and load of the system.
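For the weighted strategy, one possible way to split the permits is proportionally to a per-service weight. The sketch below is only an illustration under that assumption; the class `WeightedPermits`, the integer weights, and the rounding are hypothetical, not a description of our configuration format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration.
class WeightedPermits {
    // Splits totalPermits across services in proportion to their weights.
    static Map<String, Integer> distribute(int totalPermits, Map<String, Integer> weights) {
        int weightSum = weights.values().stream().mapToInt(Integer::intValue).sum();
        if (totalPermits <= 0 || weightSum <= 0) {
            throw new IllegalArgumentException("total permits and weight sum must be positive");
        }
        Map<String, Integer> permits = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : weights.entrySet()) {
            // Assumption: round down, but never give a service fewer than one permit.
            int share = totalPermits * entry.getValue() / weightSum;
            permits.put(entry.getKey(), Math.max(1, share));
        }
        return permits;
    }
}
```

The fair strategy is the special case where every service has the same weight.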

About me

I am Sicheng. I started working as a back-end engineer in Berlin 8 months ago. Before that, I was a senior engineer on the middleware technology team at Alibaba in China.
