Monday, July 23, 2012

Automated VM cloning with PowerCLI

Most small businesses cannot afford the high-performance storage area networks (SANs) that make traditional redundancy options such as high availability and fault tolerance possible. Despite this, the APIs available to administrators of virtualized infrastructure using direct attached storage (DAS) make it possible to recreate many of the benefits of high availability.

High Availability on SAN vs DAS

A single server failure in a virtualized environment can mean that many applications and services become unavailable simultaneously; for small organizations, this can be particularly damaging. High availability with SANs minimizes the downtime of applications and services when a host fails by keeping virtual machine (VM) storage off the host and on the SAN. VMs on a failed host can then be automatically restarted on hosts with excess capacity. This of course requires the SAN infrastructure itself to be highly redundant, adding to the already expensive and complex nature of SANs.

Alternatively, direct attached storage (DAS) is very cost-effective, performant, and well understood. By using software to automate the snapshotting and cloning of VMs over traditional gigabit Ethernet from host to host, we can create a "poor man's" high availability system.

It's important for administrators to understand that there is a very real window of data loss that can range from hours to days depending on the number of systems backed up and hardware in use. However, for many small businesses who may not have trustworthy backups, automated cloning is an excellent step forward.

Automated cloning with VMware's PowerCLI

Although End Point is primarily an open source shop, my introduction to virtualization was with VMware. For automation and scripting, PowerCLI, the PowerShell-based command line interface for vSphere, is the platform on which we will build. The process, sketched in the example below, is as follows:

  • A scheduled task executes the backup script.
  • The script deletes all old backups to free space.
  • It reads a CSV of VMs to be backed up, along with the target host and datastore for each.
  • For each VM, it takes a snapshot and clones it to the destination.
  • It collects data on cloning failures and emails a report.

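To make that flow concrete, here is a minimal PowerCLI sketch of those steps. It is not the code from the repository, just an illustration; the vCenter address, CSV path and column names (Name, TargetHost, TargetDatastore), backup naming scheme, and mail settings are all hypothetical and would need to be adapted.

# Hypothetical sketch of the cloning workflow (not the repository code)
Connect-VIServer -Server vcenter.example.com   # assumes pass-through authentication

# Delete all old backups to free space
Get-VM -Name "*-backup" | Remove-VM -DeletePermanently -Confirm:$false

# Read CSV of VMs to be backed up, with target host and datastore for each
$targets  = Import-Csv "C:\scripts\vm_backup_list.csv"
$failures = @()

foreach ($row in $targets) {
    $vm = Get-VM -Name $row.Name
    # Quiesced snapshot for a consistent source (requires VMware Tools in the guest)
    $snapshot = New-Snapshot -VM $vm -Name "backup-$(Get-Date -Format yyyyMMdd)" -Quiesce
    try {
        # Clone the VM to the destination host and datastore
        New-VM -Name "$($row.Name)-backup" -VM $vm `
               -VMHost (Get-VMHost $row.TargetHost) `
               -Datastore (Get-Datastore $row.TargetDatastore) | Out-Null
    } catch {
        $failures += "$($row.Name): $_"
    } finally {
        Remove-Snapshot -Snapshot $snapshot -Confirm:$false
    }
}

# Collect data on cloning failures and email a report
if ($failures.Count -gt 0) {
    Send-MailMessage -To "admin@example.com" -From "backups@example.com" `
                     -Subject "VM clone failures" -Body ($failures -join "`n") `
                     -SmtpServer "mail.example.com"
}
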
I have created a public GitHub repository for the code and called it powercli_cloner.

Currently, it's fairly customized around the needs of the particular client it was implemented for, so there is much room for generalization and improvement. One area of improvement is immediately obvious: only delete a backup after successfully replacing it. Also, the script must be run as a Windows user with administrator vSphere privileges, as it assumes pass-through authentication is in place. This is probably best for keeping credentials out of plain text. The script should be run during non-peak hours, especially if you have I/O-intensive workloads.
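
As a rough sketch of that first improvement (continuing the hypothetical loop above, so $row and $vm are assumed), the script could clone to a temporary name and only remove the previous backup once the new clone exists:

# Sketch: replace the old backup only after a successful clone
$temp = New-VM -Name "$($row.Name)-backup-new" -VM $vm `
               -VMHost (Get-VMHost $row.TargetHost) `
               -Datastore (Get-Datastore $row.TargetDatastore)
if ($temp) {
    Get-VM -Name "$($row.Name)-backup" -ErrorAction SilentlyContinue |
        Remove-VM -DeletePermanently -Confirm:$false
    Set-VM -VM $temp -Name "$($row.Name)-backup" -Confirm:$false
}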

Hopefully this tool can provide opportunities to develop backup and disaster recovery procedures that are flexible, cost-effective, and simple. I'd welcome pull requests and other suggestions for improvement.

Tuesday, July 17, 2012

Changing Passenger's Nginx Timeouts

It may frighten you to know that there are applications which take longer than Passenger's default timeout of 10 minutes. Well, it's true. And yes, those application owners know they have bigger fish to fry. But when a customer needs that report run *today*, being able to lengthen a timeout is a welcome stopgap.

Tracing the timeout

There are many different layers at which a timeout can occur, although these may not be immediately obvious to your users. Typically they receive a 504 and an ugly "Gateway Time-out" message from Nginx. Reviewing the Nginx error logs at both the reverse proxy and the application server, you might see a message like this:

upstream timed out (110: Connection timed out) while reading response header from upstream

If you're seeing this message on the reverse proxy, the solution is fairly straightforward: update the proxy_read_timeout setting in your nginx.conf and restart.
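
For reference, that reverse proxy change might look something like the following; the upstream name and the timeout value are illustrative, not recommendations.

# Inside the relevant server or location block of the reverse proxy's nginx.conf
location / {
    proxy_pass http://app_servers;   # hypothetical upstream
    proxy_read_timeout 900s;         # give the upstream up to 15 minutes to respond
}

However, it's more likely you've already tried that and found it ineffective. If you expand your reading of the Nginx error, you might notice another clue.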

upstream timed out (110: Connection timed out) while reading response header from upstream, 
upstream: "passenger://unix:/tmp/passenger.3940/master/helper_server.sock:"

This is the kind of error message you'd see on the Nginx application server when a Passenger process takes longer than the default timeout of 10 minutes. If you're seeing this message, it'd be wise to review the Rails logs to get a sense of how long the process actually takes to complete, so you can make a sane adjustment to the timeout. It's also good to see which task is actually taking so long, so you can eventually offload the job into the background.
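
For example, a quick grep of the production log gives a rough picture of request durations (the log path and line format vary by Rails version and environment):

# Rails 3-style log lines include "Completed ... in <N>ms"
grep "Completed" log/production.log | tail -n 20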

Changing the Passenger Nginx module's timeout

If you're unable to address the slow Rails process problem and must extend the length of the timeout, you'll need to modify the Passenger gem's Nginx configuration. Start by locating the Passenger gem's Nginx config with locate nginx/Configuration.c, then edit the following lines:

ngx_conf_merge_msec_value(conf->upstream.read_timeout,
                              prev->upstream.read_timeout, 60000);

Replace the 60000 value with your desired timeout in milliseconds. Then run sudo passenger-install-nginx-module to recompile Nginx and restart it.
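
Put together, the whole procedure is short; the 30 minute value and the init script path below are only examples.

locate nginx/Configuration.c              # find the Passenger gem's copy of the Nginx module config
# change the merge default from 60000 to, say, 1800000 (30 minutes in milliseconds):
#   ngx_conf_merge_msec_value(conf->upstream.read_timeout,
#                             prev->upstream.read_timeout, 1800000);
sudo passenger-install-nginx-module       # recompile Nginx against the modified module
sudo /etc/init.d/nginx restart            # or however Nginx is managed on your system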

Improving Error Pages

Another lesson worth addressing here is that Nginx error pages are ugly and unhelpful. Even if you have a Rails plugin like exception_notification installed, these kinds of Nginx errors will be missed unless you use the error_page directive. In other applications I've set up explicit routes to test that exception_notification properly sends an email, by creating a controller action that simply raises an error. Using Nginx's error_page directive, you can call an exception controller action and pass useful information along to yourself, as well as present the user with a consistent error experience.
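
As a rough sketch (the URI and location are hypothetical, and this assumes Passenger is already serving the application), that wiring might look like:

# Hand gateway errors to the Rails app instead of Nginx's built-in error page
error_page 502 504 /errors/gateway_timeout;

location = /errors/gateway_timeout {
    internal;               # reachable only via error_page, not directly by users
    passenger_enabled on;   # let the Rails exception controller render the page and notify
}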