It’s Better Than Bad, It’s GOOD!

July 27, 2015

Let’s talk about logs. Everybody logs, but what do we do with them? Do they sit on your machines, dutifully being compressed and rotated, hoping that someone will eventually look at them? Are they greedy, taking all the disk space they can grab until someone remembers to clean them up?

I’d like to discuss some of the challenges we have had with logging here at Signal, and how we went about making things better. So please join me for this wild log ride.

There are many really good articles on setting up the ELK stack, so I won’t go into that here; instead, I want to focus on the actual transport of the log data. I see a lot of people concentrating on the end result and not giving much thought to how the data gets there.

Don’t get me wrong; not all logs are created equal, and building resiliency and redundancy into the logging infrastructure is not something that everyone needs or wants. You really need to weigh the value of what is in those logs, and the impact of not having that information, when deciding how much infrastructure you will need.

For us, these were the main things we were looking for after outgrowing our “ssh into this server and check its logs” workflow:

  • Reliable – Does it deal with disconnections and downstream disruption?
  • Flexible – Can I use it for all our forwarding needs?
  • Low utilization/minimal requirements – Please don’t “Love him and squeeze him and name him George.”
  • Minimal deployment/maintenance work – It should just work.

There are many ways to accomplish this, and while this is by no means an exhaustive list, we did consider quite a few methods:

  • application sends directly to elasticsearch
  • application sends to logstash
  • logstash on the box
  • rsyslog
  • bash
  • custom snippet in language of choice

Pros/Cons of each way

Application/Instance => Elasticsearch

figure_1

Pros:

  1. Simple design
  2. Fewer dependencies
  3. Less infrastructure

Cons:

  1. Application messages need to be JSON
  2. Need to update code for any logging changes
  3. Need to handle failures within the app
  4. Can’t easily handle services that we don’t control (apache, mysql, etc)
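To make the first and third cons concrete: shipping directly means the application (or a logging library inside it) does the equivalent of the following for every log event, and has to cope on its own when the request fails. The host, index, and field names here are just illustrative:

    # index one log event directly into elasticsearch (host, index, and fields are made up)
    curl -XPOST 'http://elasticsearch.example.com:9200/logs-2015.07.27/app_log' -d '{
      "@timestamp": "2015-07-27T12:34:56.789Z",
      "level":      "ERROR",
      "host":       "app01",
      "message":    "something went sideways"
    }'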

Application/Instance => Logstash => Elasticsearch

figure_2

Pros:

  1. Allows for filtering and mutation of messages
  2. Can accept any type of message

Cons:

  1. Still no guarantee of message delivery from source to logstash
  2. Need to update code for some logstash changes
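In the simplest case this just means the application (or anything else) writes JSON lines to a TCP port that logstash is listening on. With a hypothetical tcp input (json_lines codec) on port 5544, even netcat can play the part of the application, which also illustrates the first con: if the connection fails, that line is simply gone.

    # push one JSON line at a hypothetical logstash tcp input listening on port 5544
    echo '{"@timestamp":"2015-07-27T12:34:56.789Z","level":"INFO","message":"hello logstash"}' \
      | nc logstash.example.com 5544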

Application/Instance => Local Logstash => Elasticsearch

figure_3

Pros:

  1. Allows for filtering and mutation of messages
  2. Can accept any type of message

Cons:

  1. Need to update code for some logstash changes
  2. Running a potentially heavy java process
  3. Logstash configuration management across many instances

Application/Instance => Local Logstash Forwarder => Logstash => Elasticsearch

figure_4

Pros:

  1. Allows for filtering and mutation of messages
  2. Encrypted connection to logstash

Cons:

  1. Only watches files
  2. Another process to maintain

Application/Instance => Rsyslog => Logstash => Elasticsearch

figure_5

Pros:

  1. Allows for filtering and formatting of messages at source
  2. Encrypted connection to logstash
  3. Can queue messages locally if downstream is unavailable
  4. Part of the base Ubuntu install – most likely already running on your system
  5. Can forward vanilla syslog messages and app logs

Cons:

  1. Configuration management
  2. Can hold on to the filehandle for a file it is tailing, which may necessitate a SIGHUP to make it let go
  3. Can replay messages on restart if not configured properly (see example config and note below)

What we ended up with:

  • Applications write their logs locally as logstash-style JSON-formatted files
  • rsyslog tails these files and forwards them to logstash. For some of our higher-throughput systems, like load balancers, we write to syslog, which forwards directly to logstash so we don’t take the I/O hit of writing to disk first
  • round-robin DNS for syslog => logstash
  • haproxy on logstash boxes
  • logstash => local haproxy => elasticsearch clusters (a rough sketch of the logstash side follows the diagram below)

figure_6
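As a rough sketch of what the logstash side of this might look like (the ports, tag, and local haproxy address are illustrative, and the output options in particular vary between logstash versions):

    input {
      # messages forwarded by rsyslog: tailed application logs and vanilla syslog
      syslog {
        port => 5514
      }
    }

    filter {
      # the tailed application logs (tagged "myapp" by rsyslog) are already JSON
      if [program] == "myapp" {
        json {
          source => "message"
        }
      }
    }

    output {
      # the local haproxy balances across the elasticsearch cluster
      elasticsearch {
        host     => "127.0.0.1"
        port     => 9200
        protocol => "http"
      }
    }

Pointing logstash at a local haproxy means its configuration never has to change as elasticsearch nodes come and go.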

 

Example rsyslog config for tailing a file
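Something along these lines, using rsyslog’s legacy directive format; the paths, tag, queue sizes, and destination are illustrative rather than our exact values:

    # load the file-tailing module
    $ModLoad imfile

    # tail the application's JSON log file
    $InputFileName /var/log/myapp/myapp.json.log
    $InputFileTag myapp:
    $InputFileStateFile stat-myapp-json
    $InputFileSeverity info
    $InputFileFacility local7
    $InputRunFileMonitor

    # spool to disk if logstash (or the network) is unavailable
    $ActionQueueType LinkedList
    $ActionQueueFileName myapp_fwd
    $ActionQueueMaxDiskSpace 1g
    $ActionQueueSaveOnShutdown on
    $ActionResumeRetryCount -1

    # forward the tailed messages to logstash over TCP
    local7.* @@logstash.example.com:5514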


Note: Make sure to define $InputFileStateFile so that rsyslog will be able to resume where it left off.


Sizing

As I have said before, there is no “right” answer for this, and you will have to decide what works best in your environment. Our production environment is primarily Amazon EC2, so one of the things we really focus on is sizing, since it directly relates to cost. It’s very easy to just throw a bigger instance at things, but one of the core tenets of our infrastructure is horizontal scalability. We would much rather have more small things than fewer large ones.

Elasticsearch

  • Size: 3 – 5 m1.xlarges. These instances offer a nice balance of memory, compute, and disk.
  • Index Volume: 19 million per day average per region
  • Retention: 45 days total, 30 days online
  • Dedicated r3.xlarge for a query node in our bigger regions

Logstash

  • Size: 2 c3.larges. These instances are our go-to generic compute box.
  • HAproxy runs locally

Kibana

  • Runs on a VM in our office, one “deployment” per region

figure_7


Future Plans

One of the next iterations of this infrastructure will involve utilizing Apache Kafka as part of the pipeline. This will give us a more fault-tolerant buffer for the messages, along with the ability not only to replay them but also to have multiple consumers processing the data.

figure_8

We’re also excited to start storing more ad hoc data in elasticsearch. We’re starting to put ‘events’ in there from various processes, such as deploys and alert conditions caught by our monitoring tools. With this data we can annotate our grafana dashboards and watch for correlations between system metrics and events, just as other companies have shown, which is very powerful for troubleshooting.
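For example, a deploy script might drop a document like this into an events index; the index and field names here are hypothetical:

    # record a deploy 'event' in elasticsearch (index and fields are hypothetical)
    curl -XPOST 'http://elasticsearch.example.com:9200/events-2015.07.27/event' -d '{
      "@timestamp": "2015-07-27T18:00:00Z",
      "type":       "deploy",
      "service":    "api",
      "version":    "1.42.0",
      "region":     "us-east-1"
    }'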

Conclusion

Every environment is different, and the decisions you make should be based on the specifics of your infrastructure. Not trying to build a battleship the first go-round, and just concentrating on solving a single problem, can really help get things moving in the right direction. In the end, whether your infrastructure is big or small, local or cloud, important or not so important, don’t forget that it’s not just about where your logs end up, but how they got there.

Andy Peckys

Andy Peckys is a Sr. DevOps Engineer at Signal, where he helps manage and grow a global cloud infrastructure. With experience in high-throughput, low-latency computing, he helps Signal continue to expand its capacity to handle the ever-growing amount of traffic generated by our customers. Prior to Signal, Andy held positions at DRW Trading and JPMorgan Chase. In his free time he wrangles 6 kids and loves to bowl.
