Minimizing downtime on Amazon AWS

by Gloria Quintanilla

As we argued in another article, being fast is the secret to scalability. Automation makes you speedy, because it eliminates as many commodity tasks as possible. So, we decided to share a few tricks about increasing your uptime with automation.

Downtime is bad. It moves your focus away from creating your awesome product to the arduous task of fixing broken things. It is a complete waste of time, whether it is because of unexpected outages, crashes or your own software update procedure.

Our cloud printing company Peecho runs on Amazon Web Services. Every week, we deploy multiple new versions of our entire system. Still, our Pingdom statistics show a 99.96% uptime over the past year. The following write-up by Marcel Panse describes how we minimize downtime on AWS, based on a few best practices and an automated deployment procedure for instances within an auto-scaling group.

Lightning zaps your cluster

Divine intervention is hard to predict. Volcanoes erupt, nuclear power plants flood and sometimes even Amazon goes down. Luckily, the AWS infrastructure is split into self-sufficient chunks to become more robust during catastrophes.

Most apps live in a single AWS region. For example, everything Peecho runs is hosted in the European AWS region. However, to ensure maximum uptime in case of emergencies, all our applications run in at least two different availability zones. This is called a multi-AZ approach. Availability zones are like separate hosting co-locations within a single region. If one of them burns to cinders, the others should still work. This means that your multi-AZ app stays available if somebody accidentally trips over a wire - or when lightning strikes.

Cloud servers die, too

Even in the dark dungeons of Amazon, physical servers eventually wear out and die. Such an unfortunate event may cause one of your running instances to get terminated as collateral damage. If you applied the rule of running at least two instances under a load balancer, you are fine. When one instance dies, the other instance will still be accepting connections.

The killed instance can be resurrected manually, but that is hardly scalable and at least rather annoying if it happens at night. The good news is that you may be able to continue sleeping after all. This superior peace of mind can be achieved by using the really cool AWS auto-scaling feature to automatically launch a new EC2 instance if the number of active instances drops below a certain threshold.

You can find everything you need to know about auto-scaling in the excellent book Programming Amazon EC2 by Jurg van Vliet and Flavia Paganelli. It is a must-read before attempting this kind of stuff.

Automated updates with auto-scaling

Catastrophes do not happen that often. In almost all software projects, scheduled outages account for most of the downtime. The popularity of iterative development only makes it worse. In the eyes of your users, force majeure events may let you escape liability - but we find scheduled failures really hard to explain to our marketing department. There is no other option than to automate the deployment procedure.

For starters, you need at least two servers in a load-balanced set-up. This way, you can deploy your new code to server A, while server B keeps running - and deploy server B once A is back and ready with the new version. Again, you can simply use a load balancer to achieve this.

Now, it gets interesting. To leverage the AWS services fully and be as cost-efficient as possible, our cloud system is elastic. That means it scales up and down on demand using auto-scaling, and there is no way to predict the number of active "machines" at any given moment: it could be two, or ten, or a hundred - and then back to two. In the case of our processing engines, we even like to stick to zero as a minimum number, but that is an entirely different story.

This elasticity complicates automated software updates without downtime. However, if you follow the next steps, you can get the update automation up and running.

  • Create a self-provisioning AMI;
  • Create a template;
  • Configure a scaling group;
  • Programmatically achieve remote version listing;
  • Programmatically achieve remote deployment;
  • Secure it.

As an example case, we will take a look at the Printcloud administration interface - the console that controls our system for routing print orders to production facilities. It is a relatively simple web application, running on a standard CentOS Linux AMI and created with Java, Spring, Hibernate and jQuery.

Creating a self-provisioning AMI

The EC2 instances need to be self-provisioning to accomplish automated updates within an auto-scaling group. This means that when an instance is created by the auto-scaling feature, it automatically updates itself to the correct version. The instance should be able to start, download the latest version, deploy it locally and start everything without user interaction. Therefore, we run an Ant deployment script every time one of our servers starts.

The script needs to know which environment it is supposed to deploy to. In our case, that is a choice between the test, acceptance and production environments. To this end, the script first retrieves an environment property called sys.env. This property is entered in the user-data field when launching a new instance. You can retrieve it from Ant using a URL:

<property url="http://169.254.169.254/latest/user-data"/>

This 169.254.169.254 IP address is an internal reference that you can use from any of your instances to retrieve instance-specific metadata like the instance-id or user-data.

Next up is finding out which version of the software to download and deploy. We store the current version number in an S3 bucket, because it is easily accessible from all servers. The version number sits in a one-line properties file containing something like version.number=123. You can load this file from Ant the same way as before, using the property URL tag - just replace the URL with the complete S3 URL. Take note: the file should be publicly accessible, but keep it read-only.

When you have both the version number and the environment property, you can download the correct build from your build server - we use Atlassian Bamboo. Another good practice is to create an environment-independent WAR file and separate zip files with the environment-specific configurations. This way, you don't have to create multiple WAR files, which are slow to build and take up lots of space, too.

In short, this is what the deployment script does.

  • Download latest version and configuration from Bamboo;
  • Stop the Tomcat server;
  • Clean webapps and work directories;
  • Unzip the download in the correct Tomcat folders;
  • Start the Tomcat server.

Creating a template

After your server has become self-provisioning, you can create the actual AMI. Be careful: if you create an AMI from the AWS console, it will restart your server. Instead, you could use ElasticFox, a Firefox plugin, to create an AMI from a running instance. Use the option to create the AMI without restarting the instance. You could name the resulting AMI something like 'website-project-2011-08-30'.
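
If you would rather script this step, the same thing can be done with the AWS Java SDK; a minimal sketch, assuming awsCredentials holds your access and secret key and using a placeholder instance id:

// Create an AMI from the running instance without rebooting it.
AmazonEC2Client ec2Client = new AmazonEC2Client(awsCredentials);
ec2Client.setEndpoint("ec2.eu-west-1.amazonaws.com");

CreateImageRequest createImageRequest = new CreateImageRequest()
	.withInstanceId("i-12345678")                 // the running, self-provisioned instance
	.withName("website-project-2011-08-30")
	.withNoReboot(true);                          // do not restart the instance
String imageId = ec2Client.createImage(createImageRequest).getImageId();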

When the AMI is ready, you can start creating a template for it. We normally use Ylastic to do this, but note that it is a paid service. It is worth it, though. For the geeks, it is also possible to create a template from the command line, again explained in the book by Jurg and Flavia. When creating the template, you have to specify the user data. For example, create a template named 'test-website-project' with sys.env=test as its user data.
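
For reference, here is a sketch of creating such a template (a launch configuration, in AWS terms) with the AWS Java SDK instead of Ylastic; the AMI id and instance type are placeholders, and awsCredentials is the same credentials object as before:

// Create a launch configuration (the "template") that starts the self-provisioning AMI
// with sys.env=test as user data.
AmazonAutoScalingClient autoScalingClient = new AmazonAutoScalingClient(awsCredentials);
autoScalingClient.setEndpoint("autoscaling.eu-west-1.amazonaws.com");

CreateLaunchConfigurationRequest launchConfigurationRequest = new CreateLaunchConfigurationRequest()
	.withLaunchConfigurationName("test-website-project")
	.withImageId("ami-12345678")
	.withInstanceType("m1.small")
	.withUserData(new String(Base64.encodeBase64("sys.env=test".getBytes()))); // the API expects base64-encoded user data
autoScalingClient.createLaunchConfiguration(launchConfigurationRequest);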

The auto-scaling group

Next, you need to create an auto-scaling group. Again, we use the Ylastic interface to do this, but it can be done from the command line as well. We specified the scaling group with a minimum of 2 servers and a maximum of 2 servers. This means that when a server gets terminated, a new instance is launched automatically - using the template you created before. If, for some reason, there are too many instances, one gets terminated. You can do lots of really cool stuff with these auto-scaling rules, like scaling up and down along with traffic or CPU usage across the instances under the load balancer.
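
This step can be scripted too; a sketch of the same 2-to-2 scaling group with the AWS Java SDK, reusing the autoScalingClient from above and with placeholder names for the group and the load balancer:

// Create the scaling group: two availability zones, a minimum and maximum of two instances,
// attached to the load balancer so new instances start serving traffic automatically.
CreateAutoScalingGroupRequest autoScalingGroupRequest = new CreateAutoScalingGroupRequest()
	.withAutoScalingGroupName("test-website-project-group")
	.withLaunchConfigurationName("test-website-project")
	.withAvailabilityZones("eu-west-1a", "eu-west-1b")
	.withMinSize(2)
	.withMaxSize(2)
	.withLoadBalancerNames("website-project-elb");
autoScalingClient.createAutoScalingGroup(autoScalingGroupRequest);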

Deploying

Potentially, we can now have any number of running instances in a single auto-scaling group. Let's assume that these instances need to be updated to the next version of the software. Of course, manually deploying the new version over SSH does not scale, and simply terminating all instances so they relaunch with the new version would cause an outage. Also, the update procedure should normally take place sequentially, to prevent downtime. We decided to build a system where we can deploy to all instances automatically, one by one, directly from our admin user interface.

Load balancer setup

Let's assume that you enter the Printcloud user interface. Because of the load balancer, you normally don't know exactly which instance you are connected to. So, every instance should be able to perform the update sequence. A single monitoring instance in a master-slave model is inadvisable, because that would introduce a single point of failure. Luckily, Amazon gives us an API to collect the data we need from anywhere, and a cool Java SDK to make it even easier.

Configure the AWS load balancer client as follows.

<bean id="awsCredentials" class="com.amazonaws.auth.BasicAWSCredentials">
  <constructor-arg value="${platform.aws.aws_access_key}" />
  <constructor-arg value="${platform.aws.aws_secret_key}" />
</bean>

<bean id="amazonElasticLoadBalancingClient"
	class="com.amazonaws.services.elasticloadbalancing.AmazonElasticLoadBalancingClient">
  <constructor-arg ref="awsCredentials" />
  <property name="endpoint" value="elasticloadbalancing.eu-west-1.amazonaws.com"/>
</bean>

Here is the code to list all instances in a load balancer.

DescribeLoadBalancersRequest describeLoadBalancersRequest 
	= new DescribeLoadBalancersRequest().withLoadBalancerNames(loadBalancerName);
DescribeLoadBalancersResult describeLoadBalancers 
	= elbClient.describeLoadBalancers(describeLoadBalancersRequest);
LoadBalancerDescription loadBalancerDescription 
	= describeLoadBalancers.getLoadBalancerDescriptions().get(0);
List<Instance> instances = loadBalancerDescription.getInstances();

Use the Instance object to fetch the public DNS attribute.

DescribeInstancesRequest describeInstancesRequest 
	= new DescribeInstancesRequest().withInstanceIds(instance.getInstanceId());
DescribeInstancesResult describeInstances 
	= ec2Client.describeInstances(describeInstancesRequest);
String publicDnsName 
	= describeInstances.getReservations().get(0).getInstances().get(0).getPublicDnsName();
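
The ec2Client used here is a regular AmazonEC2Client from the same SDK; a minimal sketch of setting it up, assuming the same awsCredentials as the load-balancing client (it could just as well be wired as a Spring bean, like the client above):

// The EC2 client is built the same way as the load-balancing client, with the regional endpoint.
AmazonEC2Client ec2Client = new AmazonEC2Client(awsCredentials);
ec2Client.setEndpoint("ec2.eu-west-1.amazonaws.com");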

Version listing

Basically, the load balancer supplies the list of all instances running behind it. After retrieving the public DNS of each instance, you need to figure out which version it is running. To make that possible, a globally available version property is injected during the build phase and exposed over HTTP. To create a list of all instance versions in the load balancer, a URL requesting the version is called on each instance. An example: http://public-DNS-of-instance.com/webapp/version.
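
On the instance side, such a version URL can be a trivial Spring MVC controller; a sketch, assuming the build injects a version.number property that Spring resolves through a property placeholder (all names are illustrative):

import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;

@Controller
public class VersionController {

	// Injected from a properties file that the build fills in, e.g. version.number=123.
	@Value("${version.number}")
	private String versionNumber;

	// Answers GET http://public-DNS-of-instance.com/webapp/version with the running version.
	@RequestMapping("/version")
	@ResponseBody
	public String version() {
		return versionNumber;
	}
}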

The upgrade mechanism

There are two scenarios. The first is when there are no database changes, or all database changes are backwards compatible. In that case, the update will not break any running code, so all instances can be updated one by one. This sequence results in zero downtime.

The second scenario involves a database change that would break the currently running code, like a column name change. In this case, all instances should be updated at once to minimize downtime. This will result in about 10 to 30 seconds of downtime, mostly depending on the start-up time of the Tomcat server.

We can trigger the appropriate upgrade scenario directly from our admin interface. The admin interface knows which servers to deploy to, so our code triggers the instances to upgrade according to the scenario you choose. It triggers another instance to upgrade by simply calling a URL on that instance, like http://public-DNS-of-instance.com/webapp/upgrade.

By the way, we use Spring's excellent RestTemplate to perform the GET and POST requests on these URLs.
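
To give an idea, here is a sketch of what the admin-side trigger could look like for both scenarios, reusing the public DNS names collected earlier; it assumes the upgrade URL kicks off the Ant script and returns right away, and that the version URL becomes reachable again once Tomcat is back (class and method names are illustrative):

import java.util.List;
import org.springframework.web.client.RestTemplate;

public class ClusterDeployer {

	private final RestTemplate restTemplate = new RestTemplate();

	// Scenario one: backwards-compatible changes - upgrade the instances one by one,
	// waiting until each instance reports the new version before moving on to the next.
	public void rollingUpgrade(List<String> publicDnsNames, String targetVersion) throws InterruptedException {
		for (String dns : publicDnsNames) {
			restTemplate.getForObject("http://" + dns + "/webapp/upgrade", String.class);
			while (!targetVersion.equals(currentVersion(dns))) {
				Thread.sleep(5000); // give the Ant script and Tomcat some time
			}
		}
	}

	// Scenario two: a breaking database change - trigger all instances right away
	// and accept the short outage while they restart together.
	public void upgradeAllAtOnce(List<String> publicDnsNames) {
		for (String dns : publicDnsNames) {
			restTemplate.getForObject("http://" + dns + "/webapp/upgrade", String.class);
		}
	}

	// The version URL fails while Tomcat is restarting, so treat errors as "not there yet".
	private String currentVersion(String dns) {
		try {
			return restTemplate.getForObject("http://" + dns + "/webapp/version", String.class);
		} catch (RuntimeException e) {
			return null;
		}
	}
}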

Security

Both the version URL and the upgrade URL should be protected. If you don't do this, evil hackers could start upgrading your servers for you - although that seems like a nice gesture, we prefer to do it ourselves.

To secure the upgrade features, we added two more parameters to both URLs: a timestamp and a hash. The timestamp is just a simple UNIX timestamp. The receiving instance checks that the timestamp is not older than a couple of minutes, so a URL is only usable for a few minutes. To make sure nobody can fiddle with the timestamp, we hash it with a secret key that is only available on the instances themselves. The receiving instance computes the hash of the received timestamp with its own secret key, and that hash should of course match the hash it received.
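
A sketch of how such a check could look, assuming an HMAC-SHA256 hash over the timestamp with a shared secret and the commons-codec Hex encoder (key and constant names are illustrative):

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import org.apache.commons.codec.binary.Hex;

public class UpgradeUrlSecurity {

	private static final long MAX_AGE_MILLIS = 5 * 60 * 1000;                  // URLs expire after five minutes
	private static final String SECRET_KEY = "only-known-on-the-instances";    // placeholder secret

	// Called by the admin interface to build the hash parameter for a given timestamp.
	public static String sign(long timestamp) throws Exception {
		Mac mac = Mac.getInstance("HmacSHA256");
		mac.init(new SecretKeySpec(SECRET_KEY.getBytes("UTF-8"), "HmacSHA256"));
		return Hex.encodeHexString(mac.doFinal(String.valueOf(timestamp).getBytes("UTF-8")));
	}

	// Called by the receiving instance before it accepts a version or upgrade request.
	public static boolean isValid(long timestamp, String receivedHash) throws Exception {
		boolean fresh = System.currentTimeMillis() - timestamp < MAX_AGE_MILLIS;
		return fresh && sign(timestamp).equals(receivedHash);
	}
}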

That's all, folks

People like shiny stuff with colors and such, so to wrap things up, here is a screenshot of our pretty admin interface deployment section.

We wish you eternal uptime.