Monorail, Our Continuous Deployment System

Iván Guardado
Audiense Engineering
Feb 7, 2017 · 7 min read


One thing that has become clear in the software industry over the last few years is that the earlier you put code into production, the better. Early deployment saves development time by getting early feedback from clients and users, and avoids wasting time on work that turns out not to be needed.

There are many things that can go wrong in a deployment: an inconsistent static file cache, a file that isn’t uploaded to the right server, a service that isn’t restarted, and so on. However, a deployment should not be feared: treat it with great care and invest time in it to build the most automated system you can and reduce the possibility of human error.

A deployment should not be feared: treat it with great care and invest time in it

Audiense has always deployed to production multiple times per day and we’ve been improving our automatic deployment system as the complexity of our architecture has grown. Our latest creation is Monorail, a Continuous Deployment (CD) system which adapts perfectly to our development workflow.

The most important thing about a CD system is not simply having an automatic deployment mechanism but having a flow effective enough to deploy code to production both quickly and without errors. What’s the point of having a fantastic unattended deployment system if we then aren’t able to deploy more than once per week?

Workflow before Monorail

These were the steps a task passed through from being ready to develop to being in production:

It’s a typical flow that probably looks familiar to you. Although it worked well for us and allowed us to deploy multiple times per day, there were also some red flags that wasted our time and were prone to introduce errors into the platform.

The QA process consisted of a Pull Request that a teammate had to review and accept, an automatically launched suite of tests, and finally, when everything was ready, local testing by our CTO, who gave the green light for production. This last step was the obvious bottleneck and, because there wasn’t always time to test thoroughly, it sometimes let bugs reach production.

Another red flag was the deployment process itself, or more specifically, the way a deployment was launched. Launching one required configuring all the services affected by the code being deployed, which in turn required a vast knowledge of the application code and the ability to infer all of its dependencies. To give you an idea, at that time we had about 50 EC2 instances receiving deployments and 27 different services running, each with its own NodeJS version. All of this had to be configured manually by our CTO! How crazy is that?

Definition of Awesome

As we work remotely, we usually meet up in our Córdoba office every three months to align our goals and make decisions about how to improve our platform and workflow. One thing we do is review our definition of awesome, which is basically a board where we note the things that would be awesome to have and the next steps needed to reach them.

During one of these meetups, our CTO thought the time had come to give a bit of loving care and attention to our deployment process to remove the red flags and to make us all a lot happier and relaxed. This was his wish list for what he thought would be an awesome deployment system:

  • Bulletproof code review and QA process
  • Commit to the base branch is the same as deploying
  • Automatic versioning
  • Automatically infer where to deploy
  • Automatically detect what issues are to be deployed
  • Block the train if there are unresolved production dependencies
  • Deploy train every 30 minutes
  • Automatically notify issues deployed and their collaborators
  • Everyone is responsible for their own deploy and its follow-up
  • Generate public release notes based on issue labels
  • Plan B to force deploy

And this is how Monorail was born.

Workflow changes

As already mentioned, the most important thing is not just having the most automated deployment system possible but adapting the development flow so that any issue can pass from the ready status to production without risk. To achieve this we knew we’d have to make some changes:

QA Improvements

One of the things that worried us most was how to make sure that bugs didn’t reach production. In Monorail we wanted a commit to the base branch to mean that the code would go into production with the next train, so we needed to be very sure before merging our branches.

In order to test a development branch in an environment close to production, without having to deal with the typical problems of a local environment, our systems team performed a bit of black magic to give us a Slack command that, in a short time, provides us with an EC2 instance with the branch code installed. This allowed us to include a QA team to test our work thoroughly and in depth in these on-demand sandboxes. One of our red flags was now suddenly much less red!
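The details of that black magic are beyond this post, but as a rough sketch of the idea (not our actual implementation), a hypothetical /sandbox Slack command backed by the AWS SDK could look something like this:

```typescript
// Sketch of a hypothetical "/sandbox <branch>" Slack command that boots an EC2
// instance with the branch installed. The command name, AMI, instance type and
// install script are placeholders, not the real setup.
import express from "express";
import { EC2Client, RunInstancesCommand } from "@aws-sdk/client-ec2";

const app = express();
app.use(express.urlencoded({ extended: true })); // Slack slash commands post form data

const ec2 = new EC2Client({ region: "eu-west-1" });

app.post("/slack/sandbox", async (req, res) => {
  const branch = (req.body.text || "").trim(); // text typed after the slash command
  if (!branch) {
    res.json({ text: "Usage: /sandbox <branch-name>" });
    return;
  }

  // Cloud-init script that checks out the branch and starts the app (placeholder repo).
  const userData = Buffer.from(
    [
      "#!/bin/bash",
      `git clone --branch ${branch} https://github.com/example/app.git /srv/app`,
      "cd /srv/app && npm install && npm start",
    ].join("\n")
  ).toString("base64");

  const result = await ec2.send(
    new RunInstancesCommand({
      ImageId: "ami-0123456789abcdef0", // placeholder base image
      InstanceType: "t3.small",
      MinCount: 1,
      MaxCount: 1,
      UserData: userData,
      TagSpecifications: [
        { ResourceType: "instance", Tags: [{ Key: "sandbox-branch", Value: branch }] },
      ],
    })
  );

  const instanceId = result.Instances?.[0]?.InstanceId;
  res.json({ text: `Sandbox for ${branch} is booting (instance ${instanceId}).` });
});

app.listen(3000);
```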

Ownership

This is a very important point when you have a continuous deployment system. Every developer on our team must:

  1. Ensure that their tasks pass through all the defined workflow steps.
  2. Ensure that their tasks can be safely deployed.
  3. Ensure that their tasks are deployed correctly and track progress.

If you don’t have trust in your developers then nothing much of quality can be achieved.

At the end of the day it’s all about having trust in every member of your team. If you think that with the team you currently work with you could never apply this kind of workflow because they don’t give you that kind of confidence, then you should probably change the way you hire. If you don’t have trust in your developers then nothing much of quality can be achieved.

Identifying where to deploy

Without doubt this was one of the most difficult problems to solve. Despite knowing that the future lies with microservices, we currently have a monolithic application (although it’s true that within it there are different services with completely separate entry points). Consequently, it’s not always easy to know which services are affected by a change in one of our data models. After giving this much thought, we concluded that it was better to stay away from an automatic approach and make this the responsibility of the developer. Our philosophy of making many small deployments every day also helps here: the smaller the changes, the easier it is to identify which services might be affected.

Finally, we decided that the way to identify the services affected by a code branch was to specify them using GitHub labels applied to the Pull Request. This can then be reviewed as one more step in the QA process and ensures we always deploy to the right services.
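As an illustration of what that can look like in code (a sketch, not Monorail’s actual implementation), resolving deploy targets from the labels of a Pull Request with the Octokit client might be as simple as:

```typescript
// Sketch: resolve deploy targets from a Pull Request's labels, assuming a
// hypothetical "deploy:<service>" label convention (e.g. "deploy:api").
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

async function deployTargets(owner: string, repo: string, prNumber: number): Promise<string[]> {
  // A Pull Request is also an issue, so its labels are available through the issues API.
  const { data: labels } = await octokit.rest.issues.listLabelsOnIssue({
    owner,
    repo,
    issue_number: prNumber,
  });

  return labels
    .map((label) => label.name)
    .filter((name) => name.startsWith("deploy:"))
    .map((name) => name.slice("deploy:".length));
}

// A PR labelled "deploy:api" and "deploy:workers" would resolve to ["api", "workers"].
```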

Finally our train arrived

Once we had implemented the changes we needed in our workflow, we were ready to start developing the tool that would make our CTO’s dream a reality and make our deployments automatic, with as little human intervention as possible.

The reason we call it Monorail, and refer to it as a deployment train, is inspired by a video about the Spotify engineering culture. In the video they describe how their deployment train passes every so often, takes on board all the developed features and deploys them automatically. In our case the train passes hourly from 10 am to 7 pm every weekday.
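We won’t go into how the scheduling is done, but as a minimal sketch, a weekday train window like ours boils down to a single cron rule (node-cron here is just an illustrative choice, not what Monorail actually uses):

```typescript
// Sketch only: an hourly deploy-train window, weekdays from 10:00 to 19:00.
import cron from "node-cron";

cron.schedule("0 10-19 * * 1-5", () => {
  // Here the train would collect the merged work and trigger the deployment.
  console.log("Deploy train departing:", new Date().toISOString());
});
```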

Monorail consists of a series of scripts that are integrated into our existing deployment system, and a private service to communicate with GitHub. These are some of its functions that make our day-to-day life a lot easier:

It checks the status of the Pull Requests to ensure they all have their deploy labels, that they have passed the QA process and that there is nothing that must be executed in production before merging (this is identified using labels we call ‘deploy notes’).
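To give an idea of how such a gate could be written (this is a sketch with assumed label names and approval criteria, not Monorail’s exact code), checking a single Pull Request might look like this:

```typescript
// Sketch of the per-PR gate: deploy labels present, QA/review approved, and no
// pending "deploy notes". Label names and the approval criterion are assumptions.
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

async function canBoardTrain(owner: string, repo: string, prNumber: number): Promise<string[]> {
  const blockers: string[] = [];

  const { data: pr } = await octokit.rest.pulls.get({ owner, repo, pull_number: prNumber });
  const labelNames = pr.labels.map((label) => label.name);

  // 1. The PR must declare which services it deploys to.
  if (!labelNames.some((name) => name.startsWith("deploy:"))) {
    blockers.push("missing deploy labels");
  }

  // 2. It must have passed QA, approximated here by an approved review.
  const { data: reviews } = await octokit.rest.pulls.listReviews({ owner, repo, pull_number: prNumber });
  if (!reviews.some((review) => review.state === "APPROVED")) {
    blockers.push("QA/review not approved");
  }

  // 3. A "deploy notes" label means something must run in production first: block the train.
  if (labelNames.includes("deploy-notes")) {
    blockers.push("unresolved deploy notes");
  }

  return blockers; // an empty list means the PR can board the next train
}
```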

Fifteen minutes before the train arrives, Monorail notifies us via Slack of all the GitHub issues that are about to be deployed, to which services, and with which version of NodeJS.
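The notification side can be as simple as posting to a Slack incoming webhook; this sketch assumes a webhook and message format of its own rather than Monorail’s actual integration:

```typescript
// Sketch: announce the upcoming train via a Slack incoming webhook.
interface TrainItem {
  issue: string;       // e.g. "#1234 Fix avatar cache"
  services: string[];  // taken from the PR's deploy labels
  nodeVersion: string; // NodeJS version the target services run on
}

async function announceTrain(items: TrainItem[]): Promise<void> {
  const lines = items.map(
    (item) => `• ${item.issue} → ${item.services.join(", ")} (NodeJS ${item.nodeVersion})`
  );

  await fetch(process.env.SLACK_WEBHOOK_URL as string, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      text: `Deploy train leaves in 15 minutes with:\n${lines.join("\n")}`,
    }),
  });
}
```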

After each deployment, Slack again informs us of the issues deployed and mentions their collaborators so they can track the updates in production.

It automatically carries out version bumping and creates a new release on GitHub with all the information about the affected issues.
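Creating the release itself comes down to a single GitHub API call; the following sketch assumes a vX.Y.Z tag scheme and a plain-text release body, which are illustrative choices rather than Monorail’s exact format:

```typescript
// Sketch: bump the patch version and publish a GitHub release listing the deployed issues.
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

async function publishRelease(owner: string, repo: string, issues: string[]): Promise<string> {
  // Derive the next version from the latest release tag, e.g. v1.4.2 -> v1.4.3.
  const { data: latest } = await octokit.rest.repos.getLatestRelease({ owner, repo });
  const [major, minor, patch] = latest.tag_name.replace(/^v/, "").split(".").map(Number);
  const nextTag = `v${major}.${minor}.${patch + 1}`;

  const { data: release } = await octokit.rest.repos.createRelease({
    owner,
    repo,
    tag_name: nextTag,
    name: nextTag,
    body: ["Deployed issues:", ...issues.map((issue) => `- ${issue}`)].join("\n"),
  });

  return release.html_url;
}
```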

As specified in the original wish list, we can easily force Monorail into action whenever needed using a Slack command.

Recap

Perhaps, after all this explanation of how we changed our internal workflow to include Monorail, it still isn’t entirely clear what the daily routine of a developer at Audiense looks like from the moment a task is started until it is in production. It’s as simple as:

  1. Code
  2. Launch the Sandbox server for the branch
  3. Create the Pull Request on GitHub, apply the right labels, link to the sandbox and reference the issue.
  4. Wait for QA
  5. Merge
  6. Repeat!
