KYTCR Part II: Short Term Problem Analysis and Resolution
Again, I would like to repeat that this is not necessarily de facto or best practice; I’m just putting out what is working for me. If you think I’m stupid, feel free to say so, but post an example of a better way or your comment will be deleted. Onward.
Your app is up and running. You have regularly scheduled backups. Oh crap. 503 Proxy Error. Downstream upstream. Clients and users are getting pissed. What do you do? No, calling ghostbusters is not an option. Most likely you are running low on memory or a process has died.
Before I was using Monit (which we’ll discuss in part 3), these are steps I would take to figure out why things went down. First you have to get in your server so fire up terminal or iTerm and ssh in (be sure to check out I Can Has Command Line? for ssh shortcuts and other command line tips).
Once you have ssh’d in, check the processes (ps aux) and see what is running (or not running) and what is pissing off your server.
From the ps aux output you can see the columns of information that are given to you. Most likely you will get a long list of processes, but don’t worry, that is normal. There are a few important things to look at. First, scan up and down the list in the %CPU and %MEM columns. Is anything over 20-25%? If so, there is a good chance that process is causing some issues. You can kill it if you want (as long as you know how to start it back up).
sudo kill 206 will do the trick, where 206 is replaced with the process id from the ps aux PID column. It is a good idea to check your processes every now and then when things are running fine so you have an idea of what is typical. That way, when something is wrong, you’ll instantly notice what is different and you can start your troubleshooting there.
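When the list is long, sorting it first saves a lot of squinting. A couple of one-liners I’d try (assuming a procps-style ps where %CPU and %MEM are columns 3 and 4):

```shell
# Five hungriest processes by %CPU (column 3), then by %MEM (column 4):
ps aux | sort -rnk 3 | head -5
ps aux | sort -rnk 4 | head -5
# Tip: ps aux | grep '[m]ongrel' matches your mongrels without matching the grep itself.
```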
The other column to note in this output is COMMAND, which is typically what was called to start up the process. For example, the COMMAND output for apache processes often looks something like this:
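(The original example didn’t survive here; a typical apache COMMAND entry, with the path varying by distro, is something like:)

```
/usr/sbin/httpd -k start
```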
Likewise, mongrel instance COMMAND output often looks like this:
/usr/bin/ruby /usr/bin/mongrel_rails start -d -e production -p 8000 -a 127.0.0.1 -P log/mongrel.8000.pid -c /var/www/apps/myapp/current --user username --group groupname
If your mongrel conf file has four instances set and you only see three when you run the following:
ps aux | grep ruby
That means that one of your instances of mongrel has crashed. If you have them running on 8000, 8001, 8002, and 8003 and 8002 is missing, you don’t need to restart all the dogs, you can just start the one. Simply copy the command of one of the running mongrels and replace the port with the port of the instance that crashed. So for example, if port 8002 is not showing up in your ps aux output, copy the port 8000 command, replace 8000 with 8002 near -p and -P and hit enter. You can ps aux | grep ruby again to make sure that it has in fact started up.
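The copy-the-command dance above can be scripted. A sketch, with the expected ports and the ps lines hard-coded as stand-ins for a live server where 8002 has crashed (on a real box you’d capture actual ps aux output instead, as shown in the comment):

```shell
# On a live box you'd grab the running ports with something like:
#   running=$(ps aux | grep '[m]ongrel_rails' | grep -o -e '-p [0-9]*' | awk '{print $2}')
expected="8000 8001 8002 8003"

# Stand-in ps output: the 8002 instance is missing.
sample_ps='ruby /usr/bin/mongrel_rails start -d -e production -p 8000 -P log/mongrel.8000.pid
ruby /usr/bin/mongrel_rails start -d -e production -p 8001 -P log/mongrel.8001.pid
ruby /usr/bin/mongrel_rails start -d -e production -p 8003 -P log/mongrel.8003.pid'

running=$(echo "$sample_ps" | grep -o -e '-p [0-9]*' | awk '{print $2}')

down=""
for port in $expected; do
  echo "$running" | grep -qx "$port" || down="$down $port"
done
echo "down:$down"   # → down: 8002
```

From there you could feed each port in $down back into a mongrel_rails start command.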
If you see a ton of apache workers, there is a good chance that it’s time to pimp your setup a bit. Most likely, in my experience, you have a ton of requests coming in and apache is overloading one or two of your mongrel instances while nearly forgetting about the others. Apache’s mod_proxy_balancer is easy to set up, but it seems not to care whether a mongrel instance is slow or unresponsive and keeps firing requests at it. If you have KeepAlive set to on in apache and a mongrel instance gets overloaded, you’ll end up with a crapload of httpd workers. Anyway…those are a few things that can go wrong.
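For what it’s worth, the relevant knobs live in your apache config. A rough sketch of the kind of setup involved — the ports, balancer name, and retry value here are assumptions for illustration, not from my actual config:

```apache
# httpd.conf (sketch; names and ports assumed)
KeepAlive Off

<Proxy balancer://mongrel_cluster>
  BalancerMember http://127.0.0.1:8000 retry=10
  BalancerMember http://127.0.0.1:8001 retry=10
  BalancerMember http://127.0.0.1:8002 retry=10
  BalancerMember http://127.0.0.1:8003 retry=10
</Proxy>

ProxyPass / balancer://mongrel_cluster/
ProxyPassReverse / balancer://mongrel_cluster/
```

The retry parameter tells mod_proxy_balancer how many seconds to wait before sending traffic back to a member that errored out.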
Sometimes, if things are running slow or stopped and all my instances of mongrel are still running, a simple restart of them and an ensuing restart of apache will clear things up. On my machine, I set up aliases in my bash profile to allow for easy restarting of the entire cluster and apache:
alias restart_web='sudo /etc/init.d/httpd restart'
alias restart_app='sudo mongrel_rails cluster::restart -C /etc/mongrel_cluster/myapp.conf'
Your config and restart commands may be different, but you get the idea. You don’t want to have to think when something goes down. Typically, a restart will fix things so make it really easy to do.
Before you go cowboy and start restarting all your dogs, you might want to check out your memory and file system usage. To see how much memory you are using, you can use free -m.
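The original screenshot of the output is gone, so here is a reconstruction for the 254 MB box the next paragraph describes — the totals match the numbers discussed there, but the buffers/cached breakdown is invented. The awk line is a little sketch for pulling out the one figure that matters:

```shell
# Sample `free -m` output (values reconstructed; buffers/cached split invented):
sample='             total       used       free     shared    buffers     cached
Mem:           254        243         11          0         25         80
-/+ buffers/cache:        138        116
Swap:          511          0        511'

# On a live box: free -m | awk '/buffers\/cache/ {print $3}'
actually_used=$(echo "$sample" | awk '/buffers\/cache/ {print $3}')
echo "$actually_used"   # → 138
```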
The top row used value (243) will always be very close to the top row total value (254). This is because linux uses any spare memory to cache disk blocks (or so I’ve read online). The important number is the buffers/cache used value (138): that is how much memory your applications are actually using. For best performance, it should stay comfortably under your total memory (254). If your buffers/cache used value (138) climbs toward your total memory plus your swap, you are going to get out of memory errors, which I’m assuming would be really bad. :)
Another way to analyze memory is with vmstat (which we’ll be graphing in part 4).
There is no way I can explain vmstat better than this Rimuhosting article so I’ll direct quote:
The first row shows your server averages. The si (swap in) and so (swap out) columns show if you have been swapping (i.e. needing to dip into ‘virtual’ memory) in order to run your server’s applications. The si/so numbers should be 0 (or close to it). Numbers in the hundreds or thousands indicate your server is swapping heavily. This consumes a lot of CPU and other server resources and you would get a very (!) significant benefit from adding more memory to your server.
Some other columns of interest: The r (runnable) b (blocked) and w (waiting) columns help see your server load. Waiting processes are swapped out. Blocked processes are typically waiting on I/O. The runnable column is the number of processes trying to do something. These numbers combine to form the ‘load’ value on your server. Typically you want the load value to be one or less per CPU in your server.
The bi (bytes in) and bo (bytes out) column show disk I/O (including swapping memory to/from disk) on your server.
The us (user), sy (system) and id (idle) show the amount of CPU your server is using. The higher the idle value, the better.
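In the standard vmstat layout, si and so are columns 7 and 8, so they’re easy to pull out with awk. A sketch, with a made-up sample line standing in for live output (on a real box you’d pipe in vmstat 5 2 instead):

```shell
# Stand-in for `vmstat` output; on a live box: vmstat 5 2 | awk 'NR>2 {print $7, $8}'
sample='procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0  12000   6500  90000    0    0     5    12   40   90  5  2 92  1'

echo "$sample" | awk 'NR>2 {print "si="$7, "so="$8}'   # → si=0 so=0
```

Anything much above zero in either column means the box is swapping and could use more memory.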
df and du
You’ll also want to keep an eye on total disk usage (until we start graphing it in part 4), so get familiar with df and du. df reports usage for whole filesystems and du reports the size of particular directories. Use the -h option with either to get more human-readable file sizes.
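A quick sketch of both in action — the throwaway directory and blob file exist only to give du something concrete to measure:

```shell
# df: usage per mounted filesystem, human-readable sizes
df -h

# du: total size of one directory tree, demonstrated on a throwaway dir
dir=$(mktemp -d)
dd if=/dev/zero of="$dir/blob" bs=1024 count=100 2>/dev/null   # ~100 KB file
du -sh "$dir"
rm -r "$dir"
```

On a real server you’d point du at something like your app or log directories instead.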
Your best friend in times of crisis is a level head. Remember…outages happen. Computers mess up. You mess up. The most important thing is that you stay calm and attack the problem logically. In the year or so that I’ve been managing rails apps, I haven’t run into too many problems that a restart of the mongrel cluster and apache wouldn’t fix.
Did I miss anything? Have a better idea? Didn’t understand something? Leave comments below.
The ‘Keep Your Tag Cloud Running’ Series
- Part 1: Backups
- Part 2: Short Term Problem Analysis and Resolution
- Part 3: Automating Problem Resolution
- Part 4: Resource Reporting
- Part 5: Log rotation and analysis
- Part 6: Slow Queries and Indexing
- Part 7: General Closing Thoughts