How we use ELB for automatic failover

AWS to the rescue once again

Elastic Load Balancers (ELB) have existed in Amazon Web Services (AWS) for quite some time now. As you might guess, they are commonly used to balance incoming HTTP(S) traffic across a collection of instances running your app in EC2.

ELBs can perform health checks

ELBs can do much more than just traffic balancing. They can also perform health checks on their target EC2 instances, and detect whether or not an instance and the app running on it are healthy. This is important for the load balancer to know, because if anything is wrong with an instance, the load balancer can intelligently stop routing traffic to it until the issue is resolved.

What is a health check?

The health check itself is simply an HTTP(S) GET request to a port on which the application is listening. If the application is functioning correctly, it responds to the health check with a 200 status code. If something is wrong with the app, instance, or network, the check will receive either a non-200 status code or no response at all. In that case the check fails, and depending on the configuration, the instance may be deemed unhealthy.
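As a concrete illustration, here is a minimal health check listener using Python's standard library. The port (8080) and path (/healthz) are assumptions; use whatever your ELB's health check is configured to request.

```python
# Minimal sketch of a health check endpoint (assumed port 8080, path /healthz).
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthCheckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # App is healthy: respond 200 so the ELB marks the instance In Service.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"OK")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the frequent health check probes out of the logs

def start_health_server(port=8080):
    """Serve health checks on a background thread; returns the server object."""
    server = HTTPServer(("0.0.0.0", port), HealthCheckHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In a real deployment the ELB issues this GET on its configured interval; anything other than a timely 200 counts as a failed check.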

Automatic failover with Auto Scaling

Here’s where things get powerful. We can combine ELB health checks with Auto Scaling groups to identify failing instances and cycle them out automatically, with zero downtime.

Each main app we run is deployed to its own cluster of instances managed by an Auto Scaling Group (ASG) and an ELB. A “cluster” doesn’t necessarily mean there is a large number of instances in the group. Instead, it’s a group of instances running the same app that has the ability to automatically change its instance count when required, such as when workload changes or instances fail.

In short, the ASG controls how many instances are part of the cluster, and when to change that instance count.

When the ASG boots an instance, it verifies the health of that instance to make sure it has booted and is functioning correctly. We configure the ASG so it adds its instances to the ELB and then uses the ELB’s health checks for verifying instance health. There are two outcomes of this verification:
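A minimal sketch of this wiring using boto3 might look like the following. All of the names (load balancer, launch template, subnets, port, path) are illustrative assumptions, not values from this article:

```python
# Hypothetical sketch: point a Classic ELB's health check at the app,
# then create an ASG that trusts the ELB's verdict ("ELB" health check type)
# instead of the default EC2 status checks.
import boto3

elb = boto3.client("elb")
autoscaling = boto3.client("autoscaling")

# 1. Tell the ELB how to probe each instance.
elb.configure_health_check(
    LoadBalancerName="app-elb",
    HealthCheck={
        "Target": "HTTP:8080/healthz",  # GET /healthz on port 8080
        "Interval": 30,                 # probe every 30 seconds
        "Timeout": 5,                   # fail a probe after 5 seconds of silence
        "UnhealthyThreshold": 2,        # two failed probes -> Out of Service
        "HealthyThreshold": 2,          # two passed probes -> In Service
    },
)

# 2. Create the ASG, attach it to the ELB, and use ELB health checks.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="app-asg",
    LaunchTemplate={"LaunchTemplateName": "app-template", "Version": "$Latest"},
    MinSize=2,
    MaxSize=4,
    LoadBalancerNames=["app-elb"],
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,  # seconds after boot before health counts
    VPCZoneIdentifier="subnet-aaaa,subnet-bbbb",
)
```

Setting HealthCheckType to "ELB" is the key step: it is what lets a failed app-level health check, not just a failed instance, trigger replacement.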

Health check passed

The instance passes the health check. The ELB marks it as “In Service” and the ASG keeps it running. If the ELB is receiving traffic, it starts sending requests to the instance at this point.

Health check failed

The instance fails the health check. Something has gone wrong with the instance or the app on it, so the ELB marks it as “Out of Service”, and once the health check grace period has elapsed, the ASG shuts it down and replaces it with a new one.

Even after entering service, the instance continues to be monitored. If it starts failing health checks later, the ELB marks it as unhealthy, stops routing traffic to it, and waits for the ASG to replace it.

Creating good health checks

The health check can be as simple as responding to the ELB with a 200 status code. That bare minimum is often sufficient, but much of the benefit of having a dedicated health check endpoint is being able to run more sophisticated checks inside the app itself.

Many of our apps have external dependencies such as connections to databases, servers and APIs. So each time the ELB pings the health check endpoint, we run a series of checks on all of the app’s essential dependencies, and respond with a 200 only if everything checks out.
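One way to structure such an endpoint is to fold every dependency check into the status code. This is a hypothetical sketch, not the article's actual code; the check functions (database, API, etc.) are stand-ins you would replace with real connection pings:

```python
# Hypothetical sketch: aggregate dependency checks into one health verdict.
def check_all(checks):
    """Run every dependency check; return (ok, list_of_failures).

    `checks` maps a dependency name to a zero-argument callable that
    returns True when the dependency is reachable. An exception in a
    check counts as a failure too.
    """
    failures = []
    for name, check in checks.items():
        try:
            if not check():
                failures.append(name)
        except Exception:
            failures.append(name)
    return (len(failures) == 0, failures)

def health_status(checks):
    """Map dependency check results onto an HTTP status code and body."""
    ok, failures = check_all(checks)
    if ok:
        return 200, "OK"
    return 503, "failing: " + ", ".join(failures)
```

Responding with a 503 when any dependency is down means the ELB pulls the instance out of rotation even though the app process itself is still up.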

Protip: ELB health checks can monitor any kind of app

Your app doesn’t have to be an HTTP server for this to work!

We also use this clustering technique on non-server apps, such as message queue workers, task schedulers and backend processing apps. For these types of apps, we embed an HTTP server that listens on a port, and configure the cluster in exactly the same way: create an ELB, point the health check to that port, and use that health check in the ASG. The only difference is the ELB isn’t actually being used to load balance incoming HTTP traffic. Instead, it’s there purely to run health checks.
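For a non-HTTP app, the embedded listener can run on a daemon thread alongside the main work loop. This is an illustrative sketch under assumed names; the port, handler, and worker loop are not the article's actual code:

```python
# Sketch: embed a health check listener in a queue worker (a non-HTTP app).
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Always report healthy here; a real worker would run its
        # dependency checks before answering.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"OK")

    def log_message(self, *args):
        pass  # keep ELB probes out of the worker's logs

def start_health_listener(port=8080):
    """Serve ELB health checks on a daemon thread, leaving the worker free."""
    server = HTTPServer(("0.0.0.0", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def run_worker(jobs, handle, health_port=8080):
    """Hypothetical worker loop: process jobs until a None sentinel arrives."""
    start_health_listener(health_port)
    while True:
        job = jobs.get()
        if job is None:
            break
        handle(job)
```

The worker never serves user traffic, but because it answers the ELB's probes, the same ASG-plus-ELB machinery can detect a hung or crashed worker and cycle it out.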
