Multiple Domain Page Caching

The other day Brandon Wright emailed me about the following tweet:

Just deployed full page caching on Harmony. Our log file stopped spinning by which made me happy and sad.

Routing

It might seem like black magic, but it isn’t all that hard. The front side of Harmony is not like a typical Rails app: we have multiple domains pointed at Harmony, and the paths are not known up front, so they can’t go in the routes file. To get everything headed to a controller, the last route in our file is this:

map.dispatch '*path', :controller => 'the', :action => 'dispatch'

This uses Rails route globbing to send every path to an action named dispatch in a controller dubiously named “the” (because it made us laugh). From there, we determine whether we can find the site and whether the site has an item (page, link, blog, post, etc.) that matches the path.
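As a rough sketch of that lookup (not Harmony’s actual code), the dispatch step boils down to two questions: which site owns this host, and does that site have an item at this path? The Site and Item structs, the SITES table, and find_item below are hypothetical stand-ins for the real models.

```ruby
# Hypothetical stand-ins for the real models, kept in memory for the example.
Item = Struct.new(:path, :title)
Site = Struct.new(:host, :items) do
  # Return the item whose path matches the requested one, or nil.
  def item_for(path)
    items.find { |item| item.path == path }
  end
end

SITES = [
  Site.new('railstips.org', [Item.new('/dude/', 'Dude')]),
  Site.new('example.com',   [Item.new('/about/', 'About')])
]

# First find the site by host, then the item by path.
# Returns nil when either lookup fails (which would become a 404).
def find_item(host, path)
  site = SITES.find { |s| s.host == host.downcase }
  return nil unless site

  site.item_for(path)
end

item = find_item('railstips.org', '/dude/')
puts item ? "render #{item.title}" : 'render 404'
```

The important part is that the host is part of the lookup, so the same path can mean different things on different domains.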

Caching

Somewhere down the rabbit hole we render that item based on its Liquid template, immediately after which we call something like this:

cache_item(@item, contents)

# which looks kind of like this
def cache_item(item, contents)
  # gone for brevity
  
  FileUtils.mkdir_p(File.dirname(item.page_cache_path))
  File.open(item.page_cache_path, 'w+') { |f| f.puts(contents) }
end

*We could have used caches_page in Rails, but we are already using that, without the HTTP host in the path, for asset and theme file caching, so it was easier to just roll our own.

All cache_item does is ensure that the directory exists and then write the contents of what we are about to send back to the browser into a file. Really nothing fancy. So what does item.page_cache_path look like? For a site like railstips.org and a path of /dude/, we end up with the following cache path:

#{RAILS_ROOT}/public/cache/railstips.org/dude/index.html
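Here is a minimal sketch of how a path like that could be built and written. The page_cache_path and write_cache helpers are hypothetical (the real method lives on the item and was cut for brevity above), and the rule that paths ending in a slash get index.html while other paths are used as-is is my assumption based on the rewrite rules further down.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: map a host and request path to a cache file.
# Paths ending in a slash get index.html appended; paths with an
# extension (say /feed.atom) are used as they are.
def page_cache_path(root, host, path)
  relative = path.end_with?('/') ? File.join(path, 'index.html') : path
  File.join(root, 'public', 'cache', host, relative)
end

# Same shape as cache_item: make sure the directory exists, then write.
def write_cache(cache_path, contents)
  FileUtils.mkdir_p(File.dirname(cache_path))
  File.open(cache_path, 'w') { |f| f.write(contents) }
end

# Use a temporary directory in place of RAILS_ROOT for the demo.
Dir.mktmpdir do |root|
  path = page_cache_path(root, 'railstips.org', '/dude/')
  write_cache(path, '<h1>Dude</h1>')
  puts path.sub(root, '#{RAILS_ROOT}')
  puts File.read(path)
end
```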

Note the use of the domain in the cache path. Since we have that, we can use Apache rewrites along with conditions to tell Apache to check if a cached file exists based on the host. If it does, we serve that file and if it doesn’t, we just hit Rails, cache the file, and return the response. We use Moonshine for our deployments so all we need to do is set the Passenger page cache directory like this:

:passenger:
  :page_cache_directory: '/cache/%{HTTP_HOST}'

When we deploy, this sets up the following Apache rewrite rules:

# Rewrite to check for Rails non-html cached pages (i.e. xml, json, atom, etc)
RewriteCond  %{THE_REQUEST} ^(GET|HEAD)
RewriteCond  %{DOCUMENT_ROOT}/cache/%{HTTP_HOST}%{REQUEST_URI} -f
RewriteRule  ^(.*)$ /cache/%{HTTP_HOST}$1 [QSA,L]

# Rewrite to check for Rails cached html page
RewriteCond  %{THE_REQUEST} ^(GET|HEAD)
RewriteCond  %{DOCUMENT_ROOT}/cache/%{HTTP_HOST}%{REQUEST_URI}index.html -f
RewriteRule  ^(.*)$ /cache/%{HTTP_HOST}$1index.html [QSA,L]

Note that in the RewriteRule, we include the HTTP_HOST, which when visiting railstips.org, would be railstips.org.
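If the rewrite syntax is hard to read, the same decision can be written in Ruby. This is only a sketch of what the two rule blocks above check, not something that runs in production; cached_file_for is a hypothetical name, and it ignores the GET/HEAD condition.

```ruby
require 'fileutils'
require 'tmpdir'

# Mimics the two RewriteCond file checks: first an exact file match
# (foo.json, feed.atom), then the same path with index.html appended.
# Returns the cached file to serve, or nil when the request should go to Rails.
def cached_file_for(document_root, http_host, request_uri)
  base = File.join(document_root, 'cache', http_host)
  [request_uri, "#{request_uri}index.html"].each do |candidate|
    file = File.join(base, candidate)
    return file if File.file?(file)
  end
  nil
end

Dir.mktmpdir do |document_root|
  cached = File.join(document_root, 'cache', 'railstips.org', 'dude', 'index.html')
  FileUtils.mkdir_p(File.dirname(cached))
  File.write(cached, '<h1>Dude</h1>')

  # Same path, different hosts: only the host that owns the file gets a hit.
  p cached_file_for(document_root, 'railstips.org', '/dude/') == cached
  p cached_file_for(document_root, 'example.com', '/dude/')
end
```

Because the host is part of the directory, two sites can both have a /dude/ page without stepping on each other.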

One URL to Rule Them All

The key to this being effective is having only one true URL for each page. We do this right now by redirecting www to no-www and ensuring that each page has a trailing slash. First, no-www.

# no www
RewriteCond %{HTTP_HOST} ^www\.(.*)$ [NC]
RewriteRule ^(.*)$ http://%1$1 [R=301,L]

Next, we ensure that there is always a trailing slash when needed. This means that /foo redirects to /foo/ and foo.json just stays as foo.json.

RewriteCond  %{THE_REQUEST} ^(GET|HEAD)
RewriteCond %{REQUEST_URI} !^/admin/
RewriteRule ^(.*/[^/\.]+)$ $1/ [R]

Ensuring that each page has one URL is better for search engines and analytics. You don’t end up with split page rank for the same page (with and without slash) and the same thing is true for pageviews.
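To make the two redirects concrete, here is a sketch of the same logic in Ruby. The canonical_redirect method is a hypothetical name, it skips the GET/HEAD check, and it always builds an http:// URL, so treat it as an illustration of the patterns rather than a replacement for the Apache rules.

```ruby
# Returns the URL to redirect to, or nil when the request is already canonical.
# Mirrors the two rules: strip a leading "www.", then add a trailing slash
# to extensionless paths outside /admin/.
def canonical_redirect(host, path)
  # no www (case-insensitive, like the [NC] flag)
  if (www = host.match(/\Awww\.(.*)\z/i))
    return "http://#{www[1]}#{path}"
  end

  # trailing slash: the last segment must be non-empty and contain no dot
  needs_slash = path.match(%r{\A(.*/[^/.]+)\z})
  if needs_slash && !path.start_with?('/admin/')
    return "http://#{host}#{needs_slash[1]}/"
  end

  nil
end

[
  ['www.railstips.org', '/dude/'],
  ['railstips.org', '/foo'],
  ['railstips.org', '/foo/'],
  ['railstips.org', '/foo.json'],
  ['railstips.org', '/admin/items']
].each do |host, path|
  puts "#{host}#{path} => #{canonical_redirect(host, path).inspect}"
end
```

Only the first two requests get redirected. The rest are already in their one true form, or are admin paths that the rule leaves alone.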

Cache Clearing

Now that I’ve explained a bit how we do the caching, I’ll mention quickly how we clear it. As they say, cache expiration and naming are the two hardest things to do in programming. We opted for the simplest solution that would work for now.

I made a simple site cache clearer module that I include in any model that can affect a site on the front side. It looks something like this.

module SiteCacheClearer
  def self.included(model)
    model.after_save    :clear_item_cache
    model.after_destroy :clear_item_cache
  end
  
  def clear_item_cache
    site.clear_item_cache if site.present?
  end
end

# To use
class Item
  include MongoMapper::Document
  include SiteCacheClearer
end

All it does is remove the entire site’s cache whenever the model is saved or destroyed. Like I said, nothing fancy. It doesn’t check if the thing is published. It doesn’t work out which pages the thing is actually shown on and remove only those. It just blows away the cache when things change.
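The site side of that call isn’t shown here, but since every host gets its own cache directory, it could be as small as a recursive delete. The CachedSite class below is a hypothetical stand-in, and the guard against a blank host is my own addition.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical stand-in for the site model: only the cache-clearing bits.
class CachedSite
  attr_reader :host

  def initialize(host, cache_root)
    @host = host
    @cache_root = cache_root
  end

  # One directory per host, so a site's whole cache lives in one place.
  def cache_directory
    File.join(@cache_root, host)
  end

  # Blow away everything cached for this site and nothing else.
  def clear_item_cache
    return if host.to_s.empty? # never delete the shared cache root by accident

    FileUtils.rm_rf(cache_directory)
  end
end

Dir.mktmpdir do |cache_root|
  railstips = CachedSite.new('railstips.org', cache_root)
  other     = CachedSite.new('example.com', cache_root)

  [railstips, other].each do |site|
    FileUtils.mkdir_p(File.join(site.cache_directory, 'dude'))
    File.write(File.join(site.cache_directory, 'dude', 'index.html'), 'cached')
  end

  railstips.clear_item_cache
  puts File.exist?(railstips.cache_directory)
  puts File.exist?(other.cache_directory)
end
```

After the clear, the next request for any page on that site misses the file check, hits Rails, and gets cached again.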

Someday we’ll definitely do something more advanced like a reference-based cache where only the pages that need to be blown away are, but this is working great for now. Hope this is helpful to someone.

The main thing to remember is to use the host and make sure there is only one way to get to the resource.

So what does this all mean to our read heavy application? Well, we end up with Scout graphs like this:

Harmony Page Caching

The blue is Apache requests and the orange is Rails requests. Notice that as our Apache requests go up, our Rails requests stay pretty steady.

12 Comments

Brandon Wright
Jan 22, 2010

So well explained, thanks! I can’t wait to give Harmony a try!
John Nunemaker
Jan 22, 2010

@Brandon Wright: You’re welcome!
PabloC
Jan 22, 2010

Great post John, thanks. Don’t you think that NGINX is a better alternative to Apache? Especially with a Rails backend and for serving static files.
Are you using Apache for a reason?
John Nunemaker
Jan 24, 2010

@PabloC: I have never had any problems with Apache or any need for anything different. I have used Nginx before on other projects, but I have no feelings either for or against it. Apache is in the default Railsmachine stack and they manage our hosting so we just went with it. I’m sure we could switch to Nginx if we wanted/needed to.
courtenay
Jan 24, 2010

Hah, technoweenie solved this in 2006 for mephisto. How’s that shiny new wheel design? :)

http://agilewebdevelopment.com/plugins/referenced_page_caching
PabloC
Jan 24, 2010

Thanks for your feedback!
John Nunemaker
Jan 24, 2010

@courtenay: Oh, trust me. I don’t intend on reinventing the wheel when I do the reference-based cache. :) I’ll probably start with something similar to mephisto and go from there.
Ken Mayer
Jan 25, 2010

We have a similar situation. One thing we discovered along the way is that the ActionController::Request object has 2 methods: #host and #domain. The first returns whatever was in the HTTP_HOST header, but #domain() will return the domain without any subdomains (you can specify the tld length as an argument to the method).

This came in quite handy when we wanted to map *.domain.dom to just domain.dom.
John Nunemaker
Jan 25, 2010

@Ken Mayer: Nice. Good to know if I run into a situation where I need that.
Ben Hunsaker
Feb 03, 2010

You explained this so insanely simple that even I feel like I could do this and I have been learning Ruby on Rails for a couple of months now. Great job!
Steven
Apr 13, 2010

Great info. Thank you for SiteCacheClearer, I thought about using something of this kind sometime ago. I liked the simplicity of your implementation
Jeremy Lecour
May 10, 2010

Hi,

I’d like to do some page caching the same way, but I can’t figure out how to store the page caches in a directory other than the public path. Do you have any pointers on how to do this?

Thanks