KYTCR Part IV: Resource Reporting
Ok. Back on track for keeping your tag cloud running. We talked about backups and short/long term problem resolution, so now it’s on to resource reporting. I see resource reporting as more of a macro-level thing. What I’m talking about are things like memory, CPU and filesystem usage, Apache processes and volume, and MySQL queries and throughput. These stats tell you how your application is running and where the possible rough spots are (spots that need tweaking).
There are a lot of tools out there to get the job done, but the one that I picked is Munin, so that’s what I’ll talk about today. From the Munin website:
Munin the monitoring tool surveys all your computers and remembers what it saw. It presents all the information in graphs through a web interface. Its emphasis is on plug and play capabilities. After completing a installation a high number of monitoring plugins will be playing with no more effort.
That pretty much explains it. My setup is one small VPS that I use for version control and as a Munin server. I guess before I go too far into that I should explain some quick theory. Munin has a node/server setup. That means you can have all your servers act as nodes, which simply make information available to a configured server. The server is the machine that goes out on a schedule (every five minutes by default), grabs the information from the nodes and collects it into one place. This is nice because I can get a quick daily, weekly, monthly or yearly overview of all the nodes at once on one page. I only keep track of about seven, so the page is not ridiculously full of information.
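To make the node/server idea a bit more concrete, here is a minimal sketch of asking a node for data by hand, which is roughly what the server does on each run. It assumes munin-node is listening on its default port 4949, that nc is installed, and that the machine you run it from is allowed to connect. The helper names munin_ask and munin_values are made up for this example.

```shell
# Send one or more commands to a munin-node and print its replies.
# Assumes the default port (4949) and that nc is installed.
munin_ask() {
  host="$1"; shift
  printf '%s\n' "$@" quit | nc "$host" 4949
}

# Turn the "field.value number" lines of a fetch reply into "field number".
munin_values() {
  awk '$1 ~ /\.value$/ { sub(/\.value$/, "", $1); print $1, $2 }'
}

# Examples (they need a running node, so they are left commented out):
#   munin_ask localhost list
#   munin_ask localhost 'fetch load' | munin_values
```

The list command shows which plugins the node offers, and fetch returns the current numbers for one of them. The server just does this for every node and stores the results for graphing.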
My gauge for easy installation is: if I can do it, anyone can do it. I’m far from a Linux guru, trust me. On Ubuntu, it’s nearly as easy as
sudo apt-get install munin
and then a few tweaks. On CentOS, once you get RPMForge and Dag hooked up, installation is a simple
sudo yum install munin
and, again, a couple of tweaks and you are good to go. I’ll list some links at the end that got me going.
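The tweaks vary by distro and version, but they usually come down to two things: telling the server which nodes to poll, and telling each node which server may connect. Here is a rough sketch, assuming the config files live in /etc/munin (the usual package location). The host name, the addresses and the helper name munin_host_stanza are placeholders, not values from my setup.

```shell
# Print a munin.conf host entry for one node.
# $1 = name to show in the graphs, $2 = address the server should poll.
munin_host_stanza() {
  printf '[%s]\n    address %s\n    use_node_name yes\n' "$1" "$2"
}

# On the Munin server (placeholder name and address):
#   munin_host_stanza web1.example.com 10.0.0.11 | sudo tee -a /etc/munin/munin.conf
#
# On each node, /etc/munin/munin-node.conf needs an allow line holding the
# server's IP written as a regular expression, for example:
#   allow ^10\.0\.0\.5$
# followed by a restart of munin-node.
```

Printing the entry rather than writing it directly makes it easy to eyeball before it goes into the real config file.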
So I’ll be the first to admit that Munin could use some UI love, but the most important thing is seeing the data in a graph, and it accomplishes that well enough. Below are a few samples of real graphs (all daily) pulled from Conductor, our homegrown Rails multi-site website management tool at the University of Notre Dame.
Apache Access: This is helpful just to see the wax and wane of traffic on your site. I find it interesting that most of the sites in Conductor go nearly dormant at night. Leads me to believe we have mostly U.S. traffic (which makes sense).
File System: I should have something set up to warn me automatically when file system usage gets high, but I’m so low right now that I don’t worry about it too much.
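For the record, Munin can do that warning itself through threshold lines in the host's entry in munin.conf. Here is a sketch, assuming the df plugin names its fields after the device path with non-alphanumeric characters turned into underscores. The helper name df_thresholds and the 80/90 percent values are made up for illustration. Running munin-run df on the node shows the real field names, which can differ for long device names.

```shell
# Print warning/critical threshold lines for one filesystem, ready to be
# placed under that host's [name] entry in munin.conf.
# $1 = device path, $2 = warning percent, $3 = critical percent.
df_thresholds() {
  # Assumed naming rule: /dev/sda1 becomes _dev_sda1
  field=$(printf '%s' "$1" | tr -c 'A-Za-z0-9_' '_')
  printf '    df.%s.warning %s\n    df.%s.critical %s\n' \
    "$field" "$2" "$field" "$3"
}

# Example:
#   df_thresholds /dev/sda1 80 90
```

The thresholds only color the graphs and overview page on their own. Getting an email out of them also needs a contact defined in munin.conf.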
Memory Usage: This is really important for Rails apps. They are known to use a fair share of memory and you need to keep an eye on this for leaks. If you see steady growth, you probably have some issues. One thing to note from the spike above is that it is caused each night by a backup task that I have to shuffle my database, assets and themes off-site. I need to use rsync or something but it is working and who has time, right? Plus if things get out of control, I have monit to get things back in order.
MySQL Queries: Obviously, more traffic equals more queries, so this graph tends to show the same trends as Apache access. I think it’s pretty interesting to see the number of queries per second, so I tend to check it out.
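If you want a rough queries-per-second number without a graph, you can take two readings of MySQL's Questions counter and divide the difference by the time between them. This is a back-of-the-envelope sketch, not necessarily how the Munin plugin computes its graph, and the helper name qps is made up.

```shell
# Average queries per second between two readings of a counter.
# $1 = first reading, $2 = second reading, $3 = seconds between them.
qps() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.2f\n", (b - a) / s }'
}

# Taking a reading (needs a MySQL login, so it is left commented out):
#   mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Questions'" | awk '{ print $2 }'
#
# Example with made-up readings taken five minutes apart:
#   qps 1000 1600 300
```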
Real Life Ways Munin Helped Me
So I show this to give a valid example of how Munin helped me. I avoided premature optimization when building Conductor, and just focused on how the app worked. Once I got Munin tracking all this stuff, I noticed that it seemed like a lot of queries, for the amount of traffic we had.
I did some tailing of log files, tweaking of queries and some htaccess trickery, and you can see the dramatic drop. That drop was not a drop in traffic. In fact, traffic and the number of sites in Conductor have only risen, but the queries are still down quite a bit, simply due to a bit of analysis (which was prompted by the queries per second graph). Also, ignore the blank spots. :) We had some issues where server reboots were required, and I forgot to set Munin to start on boot and also forgot to check whether it was running. Like I said, I’m not a server admin. I’m like you (unless you are a server admin).
Another way that Munin helped out was by showing me the number of slow queries. This particular slow query, again, happens when the memory jumps as Ruby tries to FTP a crap load of backups off-site. This was helpful because it prompted me to do more research on indexes and analyzing queries. I was noticing a lot of slow queries as the tally of pages grew in Conductor. I did some grepping of the slow query log, based on what Munin was telling me, then some EXPLAIN’in in MySQL, and quickly added two or three indexes, which fixed all the issues (more on that in another post).
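The grepping can be sketched roughly like this. It assumes the classic slow query log layout, where a "# Query_time: ..." comment line comes before each statement and ends with the rows-examined count, and where statements start with an uppercase keyword, as the ones Rails generates do. The helper name slow_queries and the log path in the example are made up, so adjust them for your server.

```shell
# List statements from a MySQL slow query log that took at least N seconds,
# slowest first. $1 = log file, $2 = minimum seconds (defaults to 1).
slow_queries() {
  file="$1"; min="${2:-1}"
  awk -v min="$min" '
    # Remember the timing details from the comment line before each statement.
    /^# Query_time:/ { qt = $3; rows = $NF; want = (qt + 0 >= min + 0); next }
    # Skip the other comment lines (time stamp, user and host).
    /^#/ { next }
    # Print the first real statement that follows a slow enough timing line.
    want && /^(SELECT|UPDATE|DELETE|INSERT)/ {
      print qt "s", rows " rows", $0
      want = 0
    }
  ' "$file" | sort -rn
}

# Example (the path is a guess and differs between installs):
#   slow_queries /var/log/mysql/mysql-slow.log 5
```

A statement that shows up here with a huge rows-examined count is a good candidate for pasting into the mysql client with EXPLAIN in front of it, to see whether an index is being used.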
These are just two examples, and I’m sure I could come up with more but I think I got the point across. Munin is pretty easy to install, has lots of documentation/articles on the net and has come in really handy for me.
- How to install Munin on CentOS
- How to get RPMForge and DAG on CentOS
- Password protecting your Munin install
- FiveRuns Server and Rails Application Monitoring
The ‘Keep Your Tag Cloud Running’ Series
- Part 1: Backups
- Part 2: Short Term Problem Analysis and Resolution
- Part 3: Automating Problem Resolution
- Part 4: Resource Reporting
- Part 5: Log rotation and analysis
- Part 6: Slow Queries and Indexing
- Part 7: General Closing Thoughts