February 04, 2012

Posted by John

Tagged gauges and pusher

Keep 'Em Separated

Note: If you end up enjoying this post, you should do two things: sign up for Pusher and then subscribe to Destroy All Software screencasts. I’m not telling you to do this because I get referrals; I just really like both services.

For those who do not know, Gauges currently uses Pusher.com for flinging around all the live traffic.

Every track request to Gauges sends a request to Pusher. We do this using EventMachine in a thread, as I have previously written about.
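
That setup looks roughly like this. It is a sketch, not our exact code: trigger_async is the pusher gem’s EventMachine-backed trigger, and notify_live plus the channel and event naming are made up for illustration.

require 'pusher'
require 'eventmachine'

# Boot the EM reactor in a background thread so the app server's
# main thread can keep serving requests.
Thread.new { EM.run } unless EM.reactor_running?

# Called on every track request. trigger_async returns immediately;
# the actual HTTP request to Pusher happens inside the EM reactor.
def notify_live(gauge_id, payload)
  EM.next_tick do
    Pusher[gauge_id.to_s].trigger_async('new_track', payload)
  end
end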

The Problem

The downside of this is that once you get to the point we were at (thousands of requests a minute), there are so many Pusher notifications to send (thousands a minute) that the EM thread starts stealing a lot of time from the main request thread. You end up with random slow requests that have one to five seconds of “uninstrumented” time. Definitely not a happy scaler does this make.

In the past, we had talked about keeping track of which gauges were actually being watched and only sending a notification for those, but never actually did anything about it.

The Solution

Recently, Pusher added web hooks on channel occupy and channel vacate. This, combined with a growing number of slow requests, was just the motivation I needed to come up with a solution.
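
For context, each hook is a JSON body with a millisecond timestamp and a batch of events. Parsed, it looks roughly like this (shape per Pusher’s webhook docs; the channel names here are made up):

# A parsed webhook body. Pusher batches events together and
# stamps the batch with a time in milliseconds.
{
  "time_ms" => 1328371200000,
  "events"  => [
    { "name" => "channel_occupied", "channel" => "gauge-4f2c" },
    { "name" => "channel_vacated",  "channel" => "gauge-8d1e" }
  ]
}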

We (@bkeepers and I) started by mapping a simple route to a class.

class PusherApp < BaseApp
  post '/pusher/ping' do
    webhook = Pusher::WebHook.new(request)
    if webhook.valid?
      PusherPing.receive(webhook)
      'ok'
    else
      status 401
      'invalid'
    end
  end
end

Using a simple class method like this moves all the logic out of the route and into a place that is easier to test. The receive method iterates over the events and runs each ping individually.

class PusherPing
  def self.receive(webhook)
    webhook.events.each do |event|
      new(event, webhook.time).run
    end
  end
end
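
That separation also pays off in tests: you can drive receive with a stubbed webhook instead of a full rack request. A minimal sketch, assuming mocha-style stubs and a made-up assertion at the end:

# Test sketch; names are hypothetical.
def test_receive_runs_a_ping_for_each_event
  event   = {'name' => 'channel_occupied', 'channel' => 'gauge-1'}
  webhook = stub(:events => [event], :time => Time.now.utc)
  PusherPing.receive(webhook)
  # assert the gauge behind 'gauge-1' is now marked occupied...
end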

At first, we had something like this for each PusherPing instance.

class PusherPing
  def initialize(event, time)
    @event         = event || {}
    @time          = time
    @event_name    = @event['name']
    @event_channel = @event['channel']
  end

  def run
    case @event_name
    when 'channel_occupied'
      occupied
    when 'channel_vacated'
      vacated
    end
  end

  def occupied
    update(@time)
  end

  def vacated
    update(nil)
  end

  def update(value)
    # update the gauge in the
    # db with the value
  end
end
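
The update body is elided above. Assuming a MongoMapper-style Gauge model and that the channel name maps straight to a gauge id (both assumptions on my part), it might look something like:

# Hypothetical implementation of update and friends.
def update(value)
  # Set (or clear, when value is nil) the occupied_at timestamp.
  Gauge.set({:_id => gauge_id}, :occupied_at => value)
end

def gauge_id
  @event_channel # assumes channels are named after gauge ids
end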

We pushed out the change so we could start marking gauges as occupied. We then forced a browser refresh, which effectively vacated and re-occupied all gauges people were watching.

Once we knew the occupied state of each gauge was correct, we added the code to only send the request to Pusher on track if a gauge was occupied.

Deploy. Celebrate. Booyeah.

The New Problem

Then, less than a day later, we realized that Pusher doesn’t guarantee the order of events. Imagine someone vacating and then occupying a gauge, but us receiving the occupy first and then the vacate.

This situation would mean that live tracking would never turn on for the gauge. Indeed, it started happening to a few people, who quickly let us know.

The New Solution

We figured it was better to send a few extra notifications than never send any, so we decided to “occupy” gauges on our own when people loaded up the Gauges dashboard.

We started in and quickly realized the error of our ways in the PusherPing. Having the database calls tied directly to the PusherPing class meant that we had two options:

  1. Use the PusherPing class to occupy a gauge when the dashboard loads, which just felt wrong.
  2. Re-write it to separate the occupying and vacating of a gauge from the PusherPing class.

Since we are good little developers, we went with 2. We created a GaugeOccupier class that looks like this:

class GaugeOccupier
  attr_reader :ids

  def initialize(*ids)
    @ids = ids.flatten.compact.uniq
  end

  def occupy(time=Time.now.utc)
    update(time)
  end

  def vacate
    update(nil)
  end

private

  def update(value)
    return if @ids.blank?
    # do the db updates
  end
end
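
The db update is elided again; with multiple ids it boils down to one multi-document update, something like this sketch (same MongoMapper-style assumptions as before):

# Hypothetical: one update across all of the ids at once.
def update(value)
  return if @ids.blank?
  Gauge.set({:_id => {'$in' => @ids}}, :occupied_at => value)
end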

We tested that class on its own quite quickly and refactored the PusherPing to use it.

class PusherPing
  def run
    case @event_name
    when 'channel_occupied'
      GaugeOccupier.new(gauge_id).occupy(@time)
    when 'channel_vacated'
      GaugeOccupier.new(gauge_id).vacate
    end
  end
end

Boom. PusherPing now worked the same and we had a way to “occupy” gauges separate from the PusherPing. We added the occupy logic to the correct point in our app like so:

ids = gauges.map { |gauge| gauge.id }
GaugeOccupier.new(ids).occupy

At this point, we were now “occupied” more than “vacated”, which is good. However, you may have noticed that we still had an issue: someone loads the dashboard, we occupy the gauge, but then we receive a delayed, or what I will now refer to as “stale”, hook.

To fix the stale hook issue, we added a bit of logic to the PusherPing class to detect staleness and simply ignore the ping if it is stale. For example, if loading the dashboard occupies a gauge at 12:00:05 and a vacate hook stamped 12:00:01 arrives after that, the hook is stale and gets ignored.

class PusherPing
  def run
    return if stale?
    # do occupy/vacate
  end

  def stale?
    return false if gauge.occupied_at.blank?
    gauge.occupied_at > @time
  end
end
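
The gauge method is just a lookup for the channel’s gauge; hypothetically, and again assuming channel names map to gauge ids:

# Hypothetical lookup, memoized since stale? may be called
# alongside the occupy/vacate work.
def gauge
  @gauge ||= Gauge.find(@event_channel)
end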

Closing Thoughts

This is by no means a perfect solution. There are still other holes. For example, a gauge could be occupied by us after we receive a vacate hook from Pusher, and then stay in an “occupied” state, sending notifications that no one is looking for.

To fix that issue, we can add a cleanup cron or something that occasionally gets all occupied channels from pusher and vacates gauges that are not in the list.
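
Sketched out, that cleanup might look like this. It assumes the pusher gem’s Pusher.get helper for the REST channels endpoint and a hypothetical occupied scope on Gauge:

# Hypothetical cleanup task, run from cron. Ask Pusher which
# channels are currently occupied, then vacate the rest.
occupied  = Pusher.get('/channels')[:channels].keys.map(&:to_s)
stale_ids = Gauge.occupied.map { |gauge| gauge.id.to_s } - occupied
GaugeOccupier.new(stale_ids).vacate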

We decided it wasn’t worth the time. We pushed out the occupy fix and are now reaping the benefits of sending about 1/6th of the Pusher requests we were sending before. This means our EventMachine thread is doing less work, which gives the main thread more time to process requests.

You might think us crazy for sending hundreds of http requests in a thread that shares time with the main request thread, but it is actually working quite well.

We know that some day we will have to move this to a queue and an external process that processes the queue, but that day is not today. Instead, we can focus on the next round of features that will blow people’s socks off.

2 Comments

  1. > You might think us crazy for sending hundreds of http requests in a thread that shares time with the main request thread…

    We do the same thing and it works like magic. Certainly not an end-all solution, but it’s a great solution for the time being. Thanks for the inspiration on this idea :)

  2. Hey John, that was really beautifully explained by you.
     Yes, there are some issues with gauges stealing instrumented time because of slow requests.
     I think you figured out a good solution to the problem.
     Best of luck.
