Growing a Rails Application: How We Made Deploy Fast Again
TL;DR: We brought our deploy time down from 10 minutes to 50 seconds.
When I joined PagerDuty over a year ago, our application consisted of essentially a single Rails site. We’ve changed the architecture of our system since then to be more distributed and service oriented, but there is still an ever-growing Rails app at the center of it all to manage user preferences and schedules.
As is too often the case with Rails, the application had become very large and this started to cause multiple problems; deploy time in particular was causing me grief. In the beginning, deploying code to production would take roughly 30 seconds. It then got to a point where deploys could take anywhere between 6 and 10 minutes.
This was a problem because 1) it massively slowed down our development and 2) deploys weren’t fun anymore.
We’ve put some effort into bringing our deploy time down and we would like to share with you what we learned and how we did it.
The Stack
We currently use:
- Ruby on Rails 3.2.8
- CoffeeScript & SASS compiled by the Rails’ asset pipeline
- Capistrano 2.9.0
- Ruby 1.9.3
First, Measure Everything
The first step to optimizing any code is to actually measure where you’re wasting time. We customized the default config of Capistrano to help provide us with a clear sense of what was taking so long.
We published most of our reusable Capistrano recipes and you can take advantage of them here: https://github.com/PagerDuty/pd-cap-recipes
One of these extensions appends a performance report at the end of each Capistrano run; the full code is in that repository.
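If you want to build something similar yourself, the core idea is to wrap Capistrano 2’s task execution and record how long each task takes. Here is a minimal sketch of that approach; it is not the actual pd-cap-recipes code, and the method and variable names are illustrative:

# Loaded from config/deploy.rb, where Capistrano 2 is already available.
# Time every task by wrapping execute_task (alias_method keeps this
# Ruby 1.9-friendly, since Module#prepend is not available there).
Capistrano::Configuration.class_eval do
  alias_method :execute_task_without_timing, :execute_task

  def execute_task(task)
    started = Time.now
    execute_task_without_timing(task)
  ensure
    @task_timings ||= []
    @task_timings << [task.fully_qualified_name, (Time.now - started).round]
  end

  def print_performance_report
    puts "** Performance Report"
    (@task_timings || []).each { |name, secs| puts "** #{name} #{secs}s" }
  end
end

Hooking print_performance_report into an exit handler (or an after-deploy callback) is all that is left to get a report like the one below.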
Here’s what a performance report used to look like.
** Performance Report
** ==========================================================
** production 0s
** multistage:ensure 0s
** git:validate_branch_is_tag 25s
** hipchat:trigger_notification 0s
** db:check_for_pending_migrations 2s
** deploy
** ..deploy:update
** ....hipchat:set_client 0s
** ....hipchat:notify_deploy_started 18s
** ....deploy:update_code
** ......db:symlink 3s
** ......newrelic:symlink 3s
** ......bundle:install 4s
** ......deploy:assets:symlink 1s
** ......deploy:finalize_update 4s
** ......deploy:assets:precompile 230s
** ....deploy:update_code 264s
** ....deploy:symlink
** ......git:update_tag_for_stage 3s
** ....deploy:symlink 5s
** ..deploy:update 288s
** ..deploy:cleanup 3s
** ..newrelic:notice_deployment 2s
** ..deploy:restart 1s
** ..unicorn:app:restart 1s
** ..deploy:bg_task_restart 0s
** ..deploy:bg_task_stop 4s
** ..deploy:bg_task_start 24s
** ..bluepill:rolling_stop_start 124s
** ..deploy:cron_update 2s
** ..deploy_test:web_trigger 14s
** ..cap_gun:email 0s
** ..hipchat:notify_deploy_finished 0s
** deploy 470s
With this report, it was much easier for me to tell what was taking a long time and what could be optimized.
The following is a breakdown of each slow Capistrano recipe and what we did to make it faster.
Sanity Checks
At PagerDuty, we always deploy git tags rather than raw revisions. The git:validate_branch_is_tag task is a sanity check to validate that the SHA we’re deploying is actually a git tag. Why was it taking a lengthy 25 seconds? We realized we had never deleted old tags, so the repository had accumulated a huge tag list; simply pruning the old tags sped this up to 4 seconds.
This improvement is not the most significant or interesting, but it shows the usefulness of the performance report. Without it, it was difficult to see that this task was taking longer than needed since the 25s were lost in the noise of the Capistrano output.
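For illustration, the check itself can be a very small Capistrano task. This is a hedged sketch rather than the actual pd-cap-recipes implementation; it assumes git 1.7.10+ for --points-at and Capistrano 2’s real_revision variable:

namespace :git do
  desc "Abort the deploy unless the revision being deployed is a tag"
  task :validate_branch_is_tag do
    # Enumerating tags is the part that gets slow once thousands of stale
    # tags pile up, which is why pruning them made such a difference.
    tags = `git tag --points-at #{real_revision}`.split("\n")
    abort "Refusing to deploy #{real_revision}: it is not a tag" if tags.empty?
  end
end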
Assets
The PagerDuty website is very asset-heavy. We have a lot of CoffeeScript and SASS code that needs to get compiled to JavaScript and CSS, as well as many 3rd party libraries (e.g. Backbone.js, jQuery) that get compressed on each deploy.
Rails handles all of this for us, but this process is fairly slow.
It used to take 200+ seconds to compile and bundle everything. But we realized, by looking at our deploy history, that only a small fraction of deploys actually modify the assets, so there should be no need to recompile everything each time. Rails is pretty specific about where assets can be stored. By combining this knowledge with source control, we can determine whether asset recompilation is needed.
The interesting code is this:
def assets_dirty?
  r = safe_current_revision
  return true if r.nil?
  # Diff everything after the currently deployed revision up to HEAD
  from = source.next_revision(r)
  asset_changing_files = ["vendor/assets/", "app/assets/", "lib/assets", "Gemfile", "Gemfile.lock"]
  asset_changing_files = asset_changing_files.select do |f|
    File.exists? f
  end
  # Assets are dirty if any commit in that range touches an asset path
  capture("cd #{latest_release} && #{source.local.log(from, source.local.head)} #{asset_changing_files.join(" ")} | wc -l").to_i > 0
end
If any files in the directories that can contain assets have changed, we consider the assets dirty and recompile them. In our case, this only happens on a small minority of deploys, so this yields a very significant speedup.
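With assets_dirty? in place, the precompile step can bail out early when nothing changed. Here is a sketch of that wiring, assuming you override Capistrano’s standard deploy:assets:precompile task; the actual recipe in pd-cap-recipes may differ in the details:

namespace :deploy do
  namespace :assets do
    task :precompile, :roles => :web, :except => { :no_release => true } do
      if assets_dirty?
        # Compile assets only when source control says they changed
        run "cd #{latest_release} && #{rake} RAILS_ENV=#{rails_env} assets:precompile"
      else
        logger.info "Skipping asset precompile, no asset changes detected"
      end
    end
  end
end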
Background Jobs
The other slow part is restarting the background workers. These workers perform various tasks in the PagerDuty infrastructure, including actually sending alerts to our users.
The slowest task was bluepill:rolling_stop_start. Bluepill is a process manager that restarts workers if they crash or consume too much CPU or memory.
These workers are fairly slow to start and, since they are critical to our notification pipeline, we don’t want to shut them all off at once and lose the ability to send alerts for a few seconds. What we used to do was partition all of our machines into 3 groups and restart the worker processes one group at a time.
This was a synchronous and very slow process.
We realized that there was no reason to do this synchronously during the deploy. As long as the processes restarted correctly, we didn’t need to wait on them. To help, we started using Monit, which we have found to be a robust and powerful solution.
The catch with Monit is that it runs on each host but is unaware of the other hosts, so our rolling deploy strategy needed to be updated. Now, instead of partitioning the servers themselves, we partition the processes on each host. So if we have 3 worker processes running on a host, we shut down one of the old ones and start a new one; once the new one is running, we repeat this for each remaining old process.
In the unlikely event that the restart fails, Monit is hooked into our monitoring infrastructure and we get paged to resolve the issue.
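To give a feel for it, here is a hypothetical sketch of triggering the per-host rolling restart from Capistrano. The process names, the worker count, and the exact Monit invocation are assumptions for illustration, not our production recipe, and the real flow waits for each new worker to come up before touching the next one:

namespace :deploy do
  task :bg_task_rolling_restart, :roles => :worker do
    worker_count = fetch(:worker_count, 3)
    worker_count.times do |i|
      # Ask Monit to bounce one worker at a time; Monit brings the new
      # process up asynchronously, and monitoring pages us if it fails.
      run "sudo monit restart bg_worker_#{i}"
    end
  end
end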
Tests
The last task I wanted to optimize was the deploy_test:web_trigger task. This task acts as a smoke test for our deploys. It creates a new PagerDuty incident and assigns it to the deployer. The deployer makes sure the phone call gets through and that it can resolve the incident.
This was slow because the test script needs to load the entire Rails environment. The fix was again to not do things synchronously. Using screen, we can easily run this script in the background.
namespace :deploy_test do
  desc 'Create an incident for a service with an escalation policy that will call the user who just deployed'
  task "web_trigger", :roles => :test, :on_error => :continue do
    username = `git config user.username`.strip
    run "cd #{current_path} && RAILS_ENV=#{rails_env} ./script/deploy/test_incident.sh #{username}", :pty => true
  end
end
#!/bin/bash
screen -m -d bundle exec rails runner -e $RAILS_ENV script/deploy/test_incident.rb $1
The final results
** Performance Report
** ==========================================================
** production 0s
** git:validate_branch_is_tag 4s
** hipchat:trigger_notification 0s
** db:check_for_pending_migrations 2s
** deploy
** ..deploy:update
** ....hipchat:set_client 0s
** ....hipchat:notify_deploy_started 1s
** ....deploy:update_code
** ......db:symlink 1s
** ......newrelic:symlink 1s
** ......bundle:install 4s
** ......deploy:assets:symlink 0s
** ......deploy:finalize_update 1s
** ......deploy:assets:precompile
** ........deploy:assets:cdn_deploy 0s
** ......deploy:assets:precompile 0s
** ....deploy:update_code 24s
** ....deploy:symlink
** ......git:update_tag_for_stage 8s
** ....deploy:symlink 9s
** ..deploy:update 35s
** ..deploy:cleanup 1s
** ..newrelic:notice_deployment 5s
** ..deploy:restart 0s
** ..deploy:bg_task_default_action 0s
** ..deploy_test:web_trigger 0s
** ..cap_gun:email 1s
** ..hipchat:notify_deploy_finished 0s
** deploy 46s
** ==========================================================
So we’ve brought our deploy time back under a minute. These are solid improvements that make it easier for developers to deploy and thus encourage them to deploy more often.
The Future
One thing I am still working on, and that is not fully solved, is asset compilation time: you still have to add many minutes to the deploy when the assets have changed. I can think of a few ways to improve this. First, Rails wastes a lot of time compiling vendor assets (jQuery, for example) that are already available pre-minified; skipping those would reduce the compile time, but would require changing how the asset pipeline works.
The other solution would be to have our continuous integration server monitor our git repository for asset changes and build them asynchronously. The deploy script could then simply copy the compiled assets from the CI server to our CDN, which should be much faster. Also, if a single machine is responsible for compiling the assets, it can keep a cache of the compiled version of each file and skip any file that hasn’t changed.
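The final report above already shows a deploy:assets:cdn_deploy hook for this. As a speculative sketch of what pushing pre-built assets to a CDN origin could look like, assuming s3cmd and an illustrative bucket name rather than our actual setup:

namespace :deploy do
  namespace :assets do
    task :cdn_deploy, :roles => :web do
      # Sync the fingerprinted files under public/assets to the CDN origin;
      # only changed files are transferred, so most deploys stay fast.
      run "cd #{latest_release} && s3cmd sync --acl-public public/assets/ s3://example-assets-bucket/assets/"
    end
  end
end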
Conclusion
Our deployments are back under control. The main lessons are:
- Profile your deployments to find why they are slow
- Don’t do work when you don’t need to
- Do as many things asynchronously as possible
- Use monitoring to ensure asynchronous tasks succeed in a timely manner