Before diving deeper into any one area of the Signal tech stack in later articles, I wanted to give an overview of how our technology is built, laid out, deployed, and managed. I love working in our environment because I feel it strikes a great balance between flexibility and standardization, which lets us get a lot done in relatively short time spans.
First and foremost, we employ a service-oriented approach to our architecture. We have a few large-ish Java applications which handle the bulk of our real-time workload, but even the biggest are still designed to do “one thing really well”. As of this post, we have 47 separate “apps we can build” (not counting the libraries and shared components) with our general-purpose, Fabric-based Python builder, and a quick glance through that list shows that 20 of them are back-end RESTful HTTP applications or front-end UI applications. Together, these 20 applications drive the bulk of the production traffic we see, with the remaining pieces filling support roles like infrastructure management, asynchronous replication, monitoring, and reporting.
As Eric mentioned in the first post, although we predominantly use Amazon’s EC2 for our production environment, we don’t rely on many of Amazon’s other tools. This is because we like to think that our solutions aren’t Amazon-specific. To stay true to that notion, we deploy our externally-available staging environment to Rackspace’s public cloud, and our internal-only development environment to our own self-maintained OpenStack cluster. Each developer uses VirtualBox to run “Signal in a VirtualBox”, where every application and backend database can be run on a single VM and managed in the same manner as our other environments: using Puppet.
We have a beefy physical Jenkins box with dozens of jobs where we continuously test each push to master, including individual unit tests, whole application smoke tests with seeded data, and finally behavior tests using Cucumber and Selenium nodes running on various versions of three major browsers. In a future blog post, we’ll describe our testing setup in more detail and what we’d like to do to improve it.
We currently deploy most of our code to production weekly. These are zero-downtime rolling deploys which currently take about an hour across all four of our global regions. All of the deployment code is, again, written with Fabric. It expects each of our web apps to expose a special health-check method, which it uses to confirm that each process and its underlying databases are working properly before moving on to the next one.
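The rolling pattern described above can be sketched in a few lines of Python. This is only an illustration, not our actual Fabric code: `rolling_deploy`, `DeployError`, and the `deploy` and `is_healthy` callables are hypothetical names, and in practice the two callables would wrap the real deployment task and an HTTP call to each app’s health-check method.

```python
import time
from typing import Callable, Iterable, List


class DeployError(Exception):
    """Raised when a host stays unhealthy after being updated."""


def _wait_until_healthy(
    host: str,
    is_healthy: Callable[[str], bool],
    max_checks: int,
    delay_seconds: float,
) -> bool:
    """Poll the health check a bounded number of times."""
    for attempt in range(max_checks):
        if is_healthy(host):
            return True
        if attempt < max_checks - 1:
            time.sleep(delay_seconds)
    return False


def rolling_deploy(
    hosts: Iterable[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_checks: int = 5,
    delay_seconds: float = 0.0,
) -> List[str]:
    """Update hosts one at a time and stop at the first unhealthy one.

    Hosts that were never reached keep running the old version, which is
    what keeps a failed rollout from taking the whole service down.
    """
    updated: List[str] = []
    for host in hosts:
        deploy(host)
        if not _wait_until_healthy(host, is_healthy, max_checks, delay_seconds):
            raise DeployError(f"{host} failed its health check; stopping the rollout")
        updated.append(host)
    return updated
```

Passing the deploy step and the health check in as functions keeps the loop itself independent of any one provider or application, and makes it easy to exercise with fakes.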
Instead of dealing with each cloud provider’s API separately, we built a Python app using Flask and libcloud which gives us a common set of endpoints regardless of which provider it’s querying. We use the ‘manifests’ these endpoints return to generate load balancing configs, deployment routines, monitoring dashboards, and more.
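To make the idea of a manifest concrete, here is a minimal sketch of turning a provider-neutral node list into a load balancer config. Everything in it is an assumption made for illustration: the `Node` fields, the `build_manifest` and `render_upstream` helpers, the port, and the nginx-style output format are all hypothetical, and the Flask and libcloud layer that would actually collect the nodes is left out.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Node:
    """Provider-neutral description of one machine (hypothetical schema)."""
    name: str
    provider: str
    region: str
    app: str
    private_ip: str


def build_manifest(nodes: Iterable[Node]) -> Dict[str, List[Node]]:
    """Group nodes by application, sorted by name so output is stable."""
    manifest: Dict[str, List[Node]] = {}
    for node in sorted(nodes, key=lambda n: n.name):
        manifest.setdefault(node.app, []).append(node)
    return manifest


def render_upstream(app: str, nodes: Iterable[Node], port: int = 8080) -> str:
    """Render an nginx-style upstream block for one application."""
    lines = [f"upstream {app} {{"]
    lines += [
        f"    server {n.private_ip}:{port};  # {n.name} ({n.provider})"
        for n in nodes
    ]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    nodes = [
        Node("web-2", "rackspace", "ord", "api", "10.0.0.2"),
        Node("web-1", "ec2", "us-east-1", "api", "10.0.0.1"),
    ]
    manifest = build_manifest(nodes)
    print(render_upstream("api", manifest["api"]))
```

Because the config generator only sees `Node` records, it does not need to know or care which cloud a machine lives in, and the same manifest could feed a deployment routine or a dashboard.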
Speaking of monitoring, we have a deep pipeline utilizing Graphite, Logstash, Elasticsearch, Kibana, and too many home-grown dashboards to keep track of. The largest of these uses various front-end rendering libraries like D3, Rickshaw, and dygraphs to reduce the amount of data we pull from each Graphite server and shorten the time it takes to show the graphs.
We’re very proud of the fact that we’re language- and database-agnostic when it comes to solutions to new problems, but we try to KISS as much as possible, too. Not everything we do is perfect, of course, and throughout future posts we’ll explore both the parts we love and the parts we’d like to improve, along with some ideas we have for making them better and some things we’ve tried which didn’t work out so well.