Thursday, October 18, 2012

Case Sensitive MySQL Searches

MySQL's support for case sensitive search is explained somewhat opaquely in the aptly titled Case Sensitivity in String Searches documentation. In short, it explains that by default, MySQL won't treat strings as case sensitive when executing a statement such as:

SELECT first_name FROM contacts WHERE first_name REGEXP '^[a-z]';

This simple search for contacts whose first name starts with a lowercase letter will return *all* contacts, because in MySQL's default character set (latin1) and its default collation (latin1_swedish_ci), upper and lowercase letters compare as equal. The documentation for both MySQL and PostgreSQL has lengthy discussions on the topic.
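
You can see the case folding directly in a trivial comparison:

SELECT 'a' = 'A';         -- returns 1 under the default latin1_swedish_ci collation
SELECT BINARY 'a' = 'A';  -- returns 0, because the byte values differ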

Enough with the backstory, how do I perform case sensitive searches?!

The docs say to convert the string representation to a binary one. This allows "comparisons [to] use the numeric values of the bytes in the operands". Let's see it in action:

SELECT first_name FROM contacts WHERE BINARY(first_name) REGEXP '^[a-z]';

There are other strategies available, such as changing the collation used for the comparison with the COLLATE clause. This would likely work better for cases where you have many columns to compare.

SELECT first_name FROM contacts WHERE first_name REGEXP '^[a-z]' COLLATE latin1_bin;

You can even go so far as to have MySQL switch character sets and collations. But you do have to do this for each database, each table, and each column you need to convert. Not terribly fun.
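
For reference, a sketch of what that conversion looks like at each level; the database, table, and column names (and the VARCHAR length) are illustrative:

ALTER DATABASE my_app DEFAULT CHARACTER SET latin1 COLLATE latin1_bin;
ALTER TABLE contacts CONVERT TO CHARACTER SET latin1 COLLATE latin1_bin;
ALTER TABLE contacts MODIFY first_name VARCHAR(255) CHARACTER SET latin1 COLLATE latin1_bin;

Once a column carries a binary collation, the REGEXP search above is case sensitive without any per-query BINARY or COLLATE clauses.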

Friday, October 12, 2012

Do this now on all your production Rails app servers:

ps aux | grep Rails

The first column in the results of that command shows which user runs your Rails and Passenger processes. If this is a privileged user (a sudoer, or worse yet a password-less sudoer), then this article is for you.

Assumptions Check

There are several different strategies for modifying which user your Rails app runs as. By default, Passenger runs your application as the owner of config/environment.rb. For some, simply changing the ownership of this file is sufficient, but in some cases we may want to force Passenger to always use a particular user.
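
If changing ownership is all you need, it's a one-liner (the application path here is illustrative):

sudo chown rails-app:rails-app /var/www/my_app/current/config/environment.rb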

This article assumes you are running nginx compiled with Passenger support and that you have configured an unprivileged user named rails-app. This configuration has been tested with nginx version 0.7.67 and Passenger version 2.2.15. (Dated I know, but now that you can't find the docs for these old versions, this article is extra helpful.)

Modifying nginx.conf

The changes required in nginx are very straightforward.

# Added in the main, top-level section
user rails-app;
 
# Added in the appropriate http section among your other Passenger related options
passenger_user_switching off;
passenger_default_user rails-app;

The first directive tells nginx to run its worker processes as the rails-app user. It's not completely clear to me why this was required, but failing to include it resulted in the following error. Bonus points to anyone who can help me understand this one.

[error] 1085#0: *1 connect() to unix:/tmp/passenger.1064/master/helper_server.sock failed (111: Connection refused) while connecting to upstream, client: XXX, server: XXX, request: "GET XXX HTTP/1.0", upstream: "passenger://unix:/tmp/passenger.1064/master/helper_server.sock:", host: "XXX"

The second directive, passenger_user_switching off, tells Passenger to ignore the ownership of config/environment.rb and instead use the user specified in the passenger_default_user directive. Pretty straightforward!

Log File Permissions Gotcha

Presumably you're not storing your production log files in your app's log directory, but instead in /var/log/app_name, using logrotate to archive and compress your logs nightly. Make sure you update the logrotate configuration to create the new log files with the appropriate user. Additionally, make sure you change the ownership of the current log file so that Passenger can write your application's logs!
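
A hedged sketch of both pieces, with an illustrative log path, schedule, and ownership:

# /etc/logrotate.d/app_name
/var/log/app_name/production.log {
    daily
    rotate 30
    compress
    delaycompress
    create 0640 rails-app rails-app
}

# and fix up the log file that already exists so the app can keep writing to it
sudo chown rails-app:rails-app /var/log/app_name/production.log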

Monday, July 23, 2012

Automated VM cloning with PowerCLI

Most small businesses cannot afford the high performance storage area networks (SANs) that make traditional redundancy options such as high availability and fault tolerance possible. Despite this, the APIs available to administrators of virtualized infrastructure using direct attached storage (DAS) make it possible to recreate many of the benefits of high availability.

High Availability on SAN vs DAS

A single server failure in a virtualized environment can mean many applications and services become unavailable simultaneously; for small organizations, this can be particularly damaging. High availability with SANs minimizes the downtime of applications and services when a host fails by keeping virtual machine (VM) storage off the host and on the SAN. VMs on a failed host can then be automatically restarted on hosts with excess capacity. This of course requires the SAN infrastructure itself to be highly redundant, adding to the already expensive and complex nature of SANs.

Alternatively, direct attached storage (DAS) is very cost effective, performant, and well understood. By using software to automate the snapshot and cloning of VMs via traditional gigabit Ethernet from host to host, we can create a "poor man's" high availability system.

It's important for administrators to understand that there is a very real window of data loss that can range from hours to days depending on the number of systems backed up and hardware in use. However, for many small businesses who may not have trustworthy backups, automated cloning is an excellent step forward.

Automated cloning with VMware's PowerCLI

Although End Point is primarily an open source shop, my introduction to virtualization was with VMware. For automation and scripting, PowerCLI, the PowerShell-based command line interface for vSphere, is the platform on which we will build. The process is as follows:

  • A scheduled task executes the backup script.
  • Delete all old backups to free space.
  • Read CSV of VMs to be backed up and the target host and datastore.
  • For each VM, snapshot and clone to destination.
  • Collect data on cloning failures and email report.

I have created a public GitHub repository for the code and called it powercli_cloner.

Currently, it's fairly customized around the needs of the particular client it was implemented for, so there is much room for generalization and improvement. One area of improvement is immediately obvious: only delete a backup after successfully replacing it. Also, the script must be run as a Windows user with administrator vSphere privileges, as the script assumes pass-through authentication is in place. This is probably best for keeping credentials out of plain text. The script should be run during non-peak hours, especially if you have I/O intensive workloads.

Hopefully this tool can provide opportunities to develop backup and disaster recovery procedures that are flexible, cost-effective, and simple. I'd welcome pull requests and other suggestions for improvement.

Tuesday, July 17, 2012

Changing Passenger's Nginx Timeouts

It may frighten you to know that there are applications which take longer than Passenger's default timeout of 10 minutes. Well, it's true. And yes, those application owners know they have bigger fish to fry. But when a customer needs that report run *today*, being able to lengthen a timeout is a welcome stopgap.

Tracing the timeout

There are many different layers at which a timeout can occur, although these may not be immediately obvious to your users. Typically they receive a 504 and an ugly "Gateway Time-out" message from Nginx. Reviewing the Nginx error logs at both the reverse proxy and the application server, you might see a message like this:

upstream timed out (110: Connection timed out) while reading response header from upstream

If you're seeing this message on the reverse proxy, the solution is fairly straightforward: update the proxy_read_timeout setting in your nginx.conf and restart. However, it's more likely you've already tried that and found it ineffective. If you read the full Nginx error, you might notice another clue.

upstream timed out (110: Connection timed out) while reading response header from upstream, 
upstream: "passenger://unix:/tmp/passenger.3940/master/helper_server.sock:"

This is the kind of error message you'd see on the Nginx application server when a Passenger process takes longer than the default timeout of 10 minutes. If you're seeing this message, it'd be wise to review the Rails logs to get a sense for how long this process actually takes to complete so you can make a sane adjustment to the timeout. Additionally, it's good to see what task is actually taking so long so you can offload the job into the background eventually.

Changing nginx-passenger module's timeout

If you're unable to address the slow Rails process problem and must extend the length of the time out, you'll need to modify the Passenger gem's Nginx configuration. Start by locating the Passenger gem's Nginx config with locate nginx/Configuration.c and edit the following lines:

ngx_conf_merge_msec_value(conf->upstream.read_timeout,
                              prev->upstream.read_timeout, 60000);
Replace the 60000 value with your desired timeout in milliseconds. Then run sudo passenger-install-nginx-module to recompile nginx and restart.
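
Spelled out as a sketch (the gem path shown is only an example; it varies with your Ruby and Passenger installation):

locate nginx/Configuration.c
# e.g. /usr/lib/ruby/gems/1.8/gems/passenger-2.2.15/ext/nginx/Configuration.c
# edit the 60000 value above, then rebuild the module and restart nginx:
sudo passenger-install-nginx-module
sudo /etc/init.d/nginx restart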

Improving Error Pages

Another lesson worth addressing here is that Nginx error pages are ugly and unhelpful. Even if you have a Rails plugin like exception_notification installed, these kinds of Nginx errors will be missed unless you use the error_page directive. In other applications I've set up explicit routes to test that exception_notification properly sends an email, by creating a controller action that simply raises an error. Using Nginx's error_page directive, you can call an exception controller action and pass useful information along to yourself, as well as present the user with a consistent error experience.
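
A minimal sketch, assuming a hypothetical /exceptions/504 route in your Rails app whose action notifies you and renders a friendly page:

# nginx.conf (excerpt); /exceptions/504 is a route you would add yourself
error_page 504 /exceptions/504;

That way a timeout still lands in your exception notification email instead of dying silently at the proxy, and the user sees your branded error page.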

Monday, June 25, 2012

Simple Example of Dependency Injection with Rails

Today I came across a great opportunity to illustrate dependency injection in a simple context. I had a Rails partial that was duplicated across two subclasses. The partial was responsible for displaying options to create a new record from the data of the current record. It also offered two types of copy, shallow and deep. The shallow copy used a button to POST data, while the deep copy offered a form with some additional options. The only difference between the partials was the path to post data to. Let's see this in code.

#app/views/fun_event/_copy_options.html.erb
button_to(t("create_and_edit_shallow_copy"), fun_event_path(:from_event => @event.id, :return => true), :id => "shallow_copy_btn")
 
form_tag(fun_event_path(:return => true)) do
  #form code
end
 
 
#app/views/boring_event/_copy_options.html.erb
button_to(t("create_and_edit_shallow_copy"), boring_event_path(:from_event => @event.id, :return => true), :id => "shallow_copy_btn")
 
form_tag(boring_event_path(:return => true)) do
  #form code
end

The first, failed iteration

To remove the duplication, I passed a path option into the partial, replacing the specific route helpers with a generic one.

#app/views/fun_events/copy.html.erb
<%= render :partial => "events/copy_options", :locals => { :event_path => fun_event_path } %>
 
 
#app/views/boring_events/copy.html.erb
<%= render :partial => "events/copy_options", :locals => { :event_path => boring_event_path } %>
 
#app/views/events/_copy_options.html.erb
button_to(t("create_and_edit_shallow_copy"), event_path(:from_event => @event.id, :return => true), :id => "shallow_copy_btn")
 
form_tag(event_path(:return => true)) do
  #form code
end

Can you guess where this led?

undefined method `event_path' for ActionView::Base:0xd6acf18

Dude! Inject the dependency!

Obviously the event_path variable I was passing was a string, not a method. I needed the method so I could pass in the appropriate arguments to construct the URL I needed. Had there not been two different calls to the routes, I would likely have just passed in the string needed in each context. But in this case, I was forced to think outside the box. Here's what I ended up with.

#app/views/fun_events/copy.html.erb
<%= render :partial => "events/copy_options", :locals => { :event_path => method(:fun_event_path) } %>
 
 
#app/views/boring_events/copy.html.erb
<%= render :partial => "events/copy_options", :locals => { :event_path => method(:boring_event_path) } %>
 
#app/views/events/_copy_options.html.erb
button_to(t("create_and_edit_shallow_copy"), event_path.call(:from_event => @event.id, :return => true), :id => "shallow_copy_btn")
 
form_tag(event_path.call(:return => true)) do
  #form code
end

The changes are really quite subtle, but we use Object's method method to pass the reference to the method we want to call, and simply pass in the arguments when needed. Mind == Blown
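
If Object#method is new to you, here's a sketch of what's actually handed to the partial (using the fun_event_path helper from above; this runs in a view or helper context where route helpers exist):

path_builder = method(:fun_event_path)  # a Method object, not a String
path_builder.call(:return => true)      # identical to fun_event_path(:return => true)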

Tuesday, May 8, 2012

Inherit an Application by Rewriting the Test Suite

One of my first tasks at End Point was to inherit a production application from the lead developer who was no longer going to be involved. It was a fairly complex domain model and had passed through many developers' hands on a tight client budget. Adding to the challenge was the absence of any active development; it's difficult to "own" an application which you're not able to make changes to or work with users directly. Moreover, we had a short amount of time; the current developer was leaving in just 30 days. I needed to choose an effective strategy to understand and document the system on a budget.

Taking Responsibility

At the time I was reading Robert C. Martin's The Clean Coder, which makes a case for the importance of taking responsibility as a "Professional Software Developer". He defines responsibility for code in the broadest of terms.

Drawing from the Hippocratic oath may seem arrogant, but what better source is there? And, indeed, doesn't it make sense that the first responsibility, and first goal, of an aspiring professional is to use his or her powers for good?

From there he continues to expound in his declarative style about how to do no harm to the function and structure of the code. What struck me most about this was his conclusions about the necessity of testing. The only way to do no harm to function is to know your code works as expected. The only way to know your code works is with automated tests. The only way to do no harm to structure is by "flexing it" regularly.

The fundamental assumption underlying all software projects is that software is easy to change. If you violate this assumption by creating inflexible structures, then you undercut the economic model that the entire industry is based on.

In short: You must be able to make changes without exorbitant costs.

The only way to prove that your software is easy to change is to make easy changes to it. Always check in a module cleaner than when you checked it out. Always make some random act of kindness to the code whenever you see it.

Why do most developers fear to make continuous changes to their code? They are afraid they'll break it! Why are they afraid to break it? Because they don't have tests.

It all comes back to the tests. If you have an automated suite of tests that covers virtually 100% of the code, and if that suite of tests can be executed quickly on a whim, then you simply will not be afraid to change the code.

Test Suite for the Triple Win

Fortunately, there was a fairly large test suite in place for the application, but as is common with budget-constrained projects, the tests didn't track the code. There were hundreds of unit tests, but they weren't even executable at first. After just a few hours of cleaning out tests for classes which no longer existed, I found about half of the 500 unit tests passed. As I worked through repairing the tests, I was learning the business rules, classes, and domain of the application, all without touching "production" code (win). These tests were the documentation that future developers could use to understand the expected behavior of the system (double win). While rebuilding the tests, I got to document bugs, deprecation warnings, performance issues, and general code quality issues (triple win).

By the end of my 30 day transition, I had 500+ passing unit tests that were more complete and flexible than before. Additionally, I added 100+ integration tests which allowed me to exercise the application at a higher level. Not only was I taking responsibility for the code, I was documenting important issues for the client and myself. This helped the client feel I had done my job transitioning responsibilities. That trust leaves the door open to further development, which means a better system over the long haul.

Tuesday, May 1, 2012

Profile Ruby with ruby-prof and KCachegrind

This week I was asked to isolate some serious performance problems in a Rails application. I went down quite a few paths to determine how to best isolate the issue. In this post I want to document what tools worked most quickly to help find offending code.

Benchmarks

Before any work begins on speeding things up, we need to set a performance baseline so we can know if we are improving things, and by how much. This is done with Ruby's Benchmark module and Rails' benchmark helper.

The Rails guides would have you set up performance tests, but I found this cumbersome on the Rails 2.3.5 application I was dealing with. Initial attempts to set it up were unfruitful, taking time away from the task at hand. In my case, the process of setting up the test environment to reflect the production environment was prohibitively expensive, but if you can automate the benchmarks, do it. If not, use the logs to measure your performance, and keep track in a spreadsheet. Whether benchmarking manually or automatically, you'll want to keep some kind of log of the results, with notes about what changed in each iteration.

Isolating the Problem

As always, start with your logs. In Rails, you get some basic performance information for free. Profiling code slows down runtime a lot. By reviewing the logs you can hopefully make a first cut at what needs to be profiled, reducing already long profile runs. For example, instead of having to profile an entire controller method, by reading the logs you might notice that it's just a particular partial which is rendering slowly.

Taking a baseline benchmark

Once you've got a sense of where the pain is, it's easy to get a benchmark for that slow code as a baseline.

module SlowModule
  def slow_method
    benchmark "SlowModule#slow_method" do
      #my slow code
    end
  end
end

Look to your log files to see the results. If for some reason you're outside your Rails environment, you can use Ruby's Benchmark module directly.

require 'benchmark'

elapsed = Benchmark.measure do
  # slow code
end
puts "CPU: %.1f ms, wall clock: %.1f ms" % [elapsed.total * 1000, elapsed.real * 1000]

This reports both CPU (process) time and wall-clock time in milliseconds, giving you a precise measurement to compare against.

Profiling with ruby-prof

First, set up ruby-prof. Once installed, you can add these kinds of blocks around your code.

require 'ruby-prof'

module SlowModule
  def slow_method
    benchmark "SlowModule#slow_method" do
      RubyProf.start
      # your slow code here
      results = RubyProf.stop
      File.open("#{RAILS_ROOT}/tmp/slow_method_profile_#{Time.now.to_i}", 'w') do |file|
        RubyProf::CallTreePrinter.new(results).print(file)
      end
    end
  end
end

Keep in mind that profiling code will really slow things down. Make sure to collect your baseline both with profiling and without, so you're making an apples-to-apples comparison.

By default ruby-prof measures process time, which is the time used by a process between any two moments. It is unaffected by other processes concurrently running on the system. You can review the ruby-prof README for other types of measurements including memory usage, object allocations and garbage collection time.
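
Switching what ruby-prof measures is a one-line change before RubyProf.start; for example, to track memory instead of process time (assuming your ruby-prof version and Ruby build support it):

RubyProf.measure_mode = RubyProf::MEMORY  # alternatives include WALL_TIME, ALLOCATIONS, GC_TIME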

If you choose to measure any of these options, make sure your Ruby installation has the tools a profiler needs to collect data. Please see the Rails guides for guidance on compiling and patching Ruby.

Interpreting the Data with KCachegrind

At this point you should have a general sense of what code is slow, having reviewed the logs. You've got a benchmark log set up with baseline measurements to compare to. If you're going to benchmark while you're profiling, make sure your baseline includes the profiling code; it will be much slower! Remember, we want an apples-to-apples comparison! You're ready to start profiling and identifying the root source of the performance problems.

After manually or automatically running your troubled code with the profiling block above, you can open up the output from ruby-prof and quickly find it not to be human friendly. Fortunately, KCachegrind turns that mess into something very useful. I found that my Ubuntu installation had a package for it already built, so installation was a breeze. Hopefully things are as easy for you. Next, simply open your result files and start reviewing the results.

The image above shows what's called a "call graph", with the percentages representing the relative amount of time each method uses for the duration of the profile run. The CacheableTree#children method calls Array#collect and takes up more than 90% of the runtime. The subsequent child calls are relatively modest in proportion. It's clear we can't modify Array#collect, so let's look at CacheableTree#children.

module CacheableTree
  def children(element = @root_element)
    full_set.collect { |node| node if node.parent_id == element.id }.compact
  end
end

Defined elsewhere, full_set is an array of Ruby objects. This is a common performance pitfall in Rails: collecting data by looping through arrays works well with a small data set, but quickly becomes painful with a large one. It turned out in this case that full_set had 4200+ elements. Worse yet, the children method was being called recursively on each of them. Yikes!

At this point I had to decide how to optimize. I could go for broke, completely break the API, and try to clean up the mess, or I could see if I could collect the data more quickly some other way. I looked at how full_set was defined and found I could modify that query to return a subset of elements rather easily.

module CacheableTree
  def children(element = @root_element)
    FormElement.find_by_sql(...) #the details aren't important
  end
end

By collecting the data directly via a SQL call, I was able to cut my benchmark by about 20%. Not bad for a single line change! Let's see what the next profile told us.

The above is another view of the profile KCachegrind provides. It's essentially the same information, but in table format. There were a few indicators that my optimization was helpful:

  • The total process_time cost had dropped
  • The amount of time spent in each function seemed to better distributed - I didn't have a single method soaking up all the process time
  • Most of the time was spent in code that wasn't mine!

Although we still saw 66% of the time spent in the children method, we could also see that 61% of it was spent in ActiveRecord::Base. Effectively, I had pushed the 'slowness' down the stack, which tends to mean better performance. Of course, there were LOTS of database calls being made. Perhaps some caching could help reduce the number of calls being made.

module CacheableTree
  def children(element = @root_element)
    @children ||= {}
    @children[element] ||= FormElement.find_by_sql(...) #the details aren't important
  end
end

This is called memoization and lets us reuse this expensive method's results within the page load. This change took another 10% off the clock against the baseline. Yay!

Knowing When to Stop

Performance optimization can be really fun, especially once all the infrastructure is in place. However, unless you have unlimited budget and time, you have to know when to stop. For a few lines of code changed, the client would see ~30% performance improvement. It was up to them to decide how much further to take it.

If allowed, my next step would be to make use of the application's existing dependence on Redis and add the Redis-Cacheable gem. It allows you to marshal Ruby objects in and out of a Redis server. The application already makes extensive use of caching, and this page was no exception, but when the user modified the page in a way that expired the cache, we would hit this expensive method again, unnecessarily. Based on the call graph above, we could eliminate another ~66% of the call time, and perhaps, by pre-warming this cache, minimize the chances of the user experiencing the pain of slow browsing!

Friday, April 20, 2012

Deconstructing an OO Blog Design in Ruby 1.9

I've become interested in Avdi Grimm's new book Objects on Rails; however, I found the code to be terse. Avdi is an expert Rubyist and he makes extensive use of Ruby 1.9 with minimal explanation. In all fairness, he lobbies you to buy Peter Cooper's Ruby 1.9 Walkthrough. Instead of purchasing the videos, I wanted to try to deconstruct the code myself.

In his first chapter featuring code, Mr. Grimm creates a Blog and Post class. For those of you who remember the original Rails blog demo, the two couldn't look more different.

Blog#post_source

In an effort to encourage Rails developers to think about relationships between classes beyond ActiveRecord::Relation, he creates his own interface for defining how a Blog should interact with a "post source".

class Blog
  # ...
  attr_writer :post_source
   
  private
  def post_source
    @post_source ||= Post.public_method(:new)
  end
end

The code above defines the Blog class and makes post_source= available via the attr_writer method. Additionally, it defines the attribute reader as a private method. The idea is that a private method can be changed without breaking the class's API. If we decide we want a new default Post source, we can do it safely.

The magic of this code is in defining the post source as a class's method, in this case Post.public_method(:new). The #public_method method is defined by Ruby's Object class and is similar to the #method method. In short, it gives us a way of not directly calling Post.new, but instead referring to the method that's responsible for creating new posts. This is logical if you remember that the name of this method is #post_source.
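
A tiny sketch of the difference between calling the method and capturing it:

Post.new                           # builds a post immediately
source = Post.public_method(:new)  # a Method object wrapping Post.new
source.call                        # builds a post later, equivalent to Post.new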

Now let's look how he puts post_source into action.

class Blog
  # ...
  def new_post
    post_source.call.tap do |p|
      p.blog = self
    end
  end
  # ...
end

During my first reading, it wasn't clear at all what was going on here, but if we remember that post_source is responsible for returning the method we need to "call", we know that post_source.call is equivalent to Post.new. For the sake of clarity-while-learning for those not familiar with post_source.call, let's substitute it with something more readable so we can understand how tap is being employed.

class Blog
  # ...
  def new_post
    Post.new.tap do |p|
      p.blog = self
    end
  end
end

The tap method is available to all Ruby objects and serves as a way to have a block "act on" its receiver and then return that same object. Per the docs, "the primary purpose of this method is to 'tap into' a method chain, in order to perform operations on intermediate results within the chain". For some examples of using tap, see MenTaLguY's post on Eavesdropping on Expressions. As he says in his post, "you can insert your [code] just about anywhere without disturbing the flow of data". Neat.

In this case, it's being used to tap into the process of creating a new blog post and define the blog to which that post belongs. Because tap returns the object it modifies, #new_post returns the post now assigned to the blog.

Bringing it All Together

Avdi's approach may seem cumbersome at first, and it is compared to "the Rails way." But in general, that's the whole point of Objects on Rails: to challenge you to see beyond a generic solution to a problem (in this case defining relationships between classes) so you can build more flexible solutions. Let's see some interesting things we might be able to do with this more flexible Blog class. We can imagine this same Blog class being able to handle posts from all sorts of different sources. Let's see if we can get creative.

class EmailPost < ActionMailer::Base
  def receive(message)
    @blog = Blog.find_by_owner_email(message.from)
    @blog.post_source = EmailPost.public_method(:new)
    @email_post = @blog.new_post(params[:email_post])
    @email_post.publish
  end
end

With this little snippet, we're able to use the Blog class to process a different sort of post. We simply let the blog know the method to call when we want a new post and pass along the arguments we'd expect. Let's see if we can think of something else that's creative.

require 'feedzirra'
# execute regularly with a cronjob call like: curl -d "blog_id=1&url=http://somefeed.com" http://myblog.com/feed_poster
 
class FeedPostersController < ApplicationController
  def create
    @feed = Feedzirra::Feed.fetch_and_parse(params[:url])
    @blog = Blog.find(params[:blog_id])
    @blog.post_source = FeedPost.public_method(:new)
    @feed.entries.each do |entry|
      @blog.new_post(entry)
    end
  end
end

We could imagine the FeedPost.new method being the equivalent of a retweet for your blog using an RSS feed! Try having the Blog class do this with an ActiveRecord association! It seems to me the Blog class might need to get a bit more complex to support all these Post sources, which makes post_source.call.tap look pretty good!

Tuesday, April 17, 2012

Monitoring cronjob exit codes with Nagios

If you're like me, you've got cronjobs that make email noise if there is an error. While email based alerts are better than nothing, it'd be best to integrate this kind of monitoring into Nagios. This article will break down how to monitor the exit codes from cronjobs with Nagios.

Tweaking our cronjob

The monitoring plugin depends on being able to read some sort of log output file which includes an exit code. The plugin also assumes that the log will be truncated with every run. Here's an example of a cronjob entry which meets those requirements:

rsync source dest > /var/log/important_rsync_job.log 2>&1; echo "Exit code: $?" >> /var/log/important_rsync_job.log

So let's break down a couple of the more interesting points in this command:

  • The single > redirects stdout to the log file and truncates the log every time the job runs
  • 2>&1 then sends stderr to the same place as stdout (our log file); note it must come after the > redirection
  • $? returns the exit code of the last run command
  • Notice the double >> which will append to the log file our exit code

Setting up the Nagios plugin

The check_exit_code plugin is available on GitHub, and couldn't be easier to set up. Simply specify the log file to monitor and the frequency with which it should be updated. Here's the usage statement:

Check log output has been modified within t minutes and contains "Exit code: 0".
        If "Exit code" is not found, assume command is running.
        Check assumes the log is truncated after each command is run.
         
        --help      shows this message
        --version   shows version information
 
        -f          path to log file
        -t          Time in minutes which a log file can be unmodified before raising CRITICAL alert
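
Tied to the rsync job above, an illustrative invocation looks like this (the 90-minute window is arbitrary):

check_exit_code -f /var/log/important_rsync_job.log -t 90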

The check makes sure the log file has been updated within t minutes because we want to check that our cronjob is not only running successfully, but running regularly. Perhaps this should be an optional parameter, or this should be called check_cronjob_exit_code, but for right now, it's getting the job done and cutting back on email noise.

Sunday, March 18, 2012

Check JSON responses with Nagios

As the developer's love affair with JSON continues to grow, the need to monitor successful JSON output does as well. I wanted a Nagios plugin which would do a few things:

  • Confirm the content-type of the response header was "application/json"
  • Decode the response to verify it is parsable JSON
  • Optionally, verify the JSON response against a data file

Verify content of JSON response

For the most part, Perl's LWP::UserAgent class makes short work of the first requirement. Using $response->header("content-type") the plugin is able to check the content-type easily. Next up, we use the JSON module's decode function to see if we can successfully decode $response->content.

Optionally, we can give the plugin an absolute path to a file containing a Perl hash, which is iterated through in an attempt to find corresponding key/value pairs in the decoded JSON response. For each key/value in the hash it doesn't find in the JSON response, it will append the expected and actual results to the output string, exiting with a critical status. Currently there's no way to check that a key/value does not appear in the response, but feel free to make a pull request on check_json on my GitHub page.
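
A data file is then just a hash of the keys you expect, something along these lines (the file name and keys are purely illustrative; see the plugin's README for the exact format it expects):

# /etc/nagios/expected_api_response.pl
{
    status      => 'ok',
    api_version => 2,
}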

Thursday, March 15, 2012

Check HTTP redirects with Nagios

Oftentimes there are critical page redirects on a site that you may want to monitor. It can be as simple as making sure your checkout page redirects from HTTP to HTTPS, or perhaps you have valuable old URLs which Google has been indexing and you want to make sure those redirects remain in place for your PageRank. Whatever your reason for checking HTTP redirects with Nagios, you'll find there are a few scripts available, but none (that I found) able to follow more than one redirect. For example, let's suppose we have a redirect chain that looks like this:

http://myshop.com/cart >> http://www.myshop.com/cart >> https://www.myshop.com/cart

Following multiple redirects

In my travels, I found check_http_redirect on Nagios Exchange. It was a well designed plugin, written by Eugene Kovalenja in 2009 and licensed under GPLv2. After experimenting with the plugin, I found it was unable to traverse multiple redirects. Fortunately, Perl's LWP::UserAgent class provides a nifty little option called max_redirect. By revising Eugene's work, I've exposed additional command arguments that help control how many redirects to follow. Here's a summary of usage:

-U          URL to retrieve (http or https)
        -R          URL that must be equal to Header Location Redirect URL
        -t          Timeout in seconds to wait for the URL to load. If the page fails to load,
                    check_http_redirect will exit with UNKNOWN state (default 60)
        -c          Depth of redirects to follow (default 10)
        -v          Print redirect chain

If check_http_redirect is unable to find any redirects to follow, or any of the redirects results in a 4xx or 5xx status code, the plugin will report a critical state code and the nature of the problem. Additionally, if the number of redirects exceeds the depth of redirects to follow as specified in the command arguments, it will notify you of this and exit with an unknown state code. An OK status will be returned only if the redirects result in a successful response from a URL which is a regex match against the value specified in the -R argument.
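
Using the example chain from above, an invocation might look like this (values are illustrative; flags as per the usage summary):

check_http_redirect -U http://myshop.com/cart -R https://www.myshop.com/cart -t 30 -c 5 -v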

The updated check_http_redirect plugin is available on my GitHub page along with several other Nagios plugins I'll write about in the coming weeks. Pull requests welcome, and thank you to Eugene for his original work on this plugin.

Thursday, March 1, 2012

IPv6 Tunnels with Debian/Ubuntu behind NAT

As part of End Point's preparation for World IPv6 Launch Day, I was asked to get my IPv6 certification from Hurricane Electric. It's a fun little game-based learning program which had me set up an IPv6 tunnel. IPv6 tunnels provide IPv6 connectivity to folks whose ISP or hosting provider doesn't currently support IPv6, by "tunneling" it over IPv4. The process for creating a tunnel is straightforward enough, but there were a few configuration steps I felt could be better explained.

After creating a tunnel, Hurricane Electric kindly provides a summary of your configuration and offers example configurations for several different operating systems and routers. Below is my configuration summary and the example generated by Hurricane Electric.

However, the changes made by entering these commands won't survive a restart. For Debian/Ubuntu users, an update to /etc/network/interfaces does the trick.

#/etc/network/interfaces
auto he-ipv6
iface he-ipv6 inet6 v4tunnel
  address 2001:470:4:9ae::2
  netmask 64
  endpoint 209.51.161.58
  local 204.8.67.188
  ttl 225
  gateway 2001:470:4:9ae::1

Firewall Configuration

If you're running UFW the updates to /etc/default/ufw are very straightforward. Simply change the IPV6 directive to yes. Restart the firewall and your network interfaces and you should be able to ping6 ipv6.google.com. I also recommend hitting http://test-ipv6.com/ for a detailed configuration test.
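
In other words, assuming the he-ipv6 interface name from the tunnel stanza above:

# /etc/default/ufw
IPV6=yes

sudo ufw disable && sudo ufw enable        # restart the firewall
sudo ifdown he-ipv6 && sudo ifup he-ipv6   # bounce the tunnel interface
ping6 ipv6.google.com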

Behind NAT

If you're behind a NAT, the configuration needs to be tweaked a bit. First, you'll want to set up a static IP address behind your router. If your router supports forwarding more than just TCP/UDP, you'll want to forward protocol 41 (the IPv6-in-IPv4 encapsulation protocol, NOT port 41), which is responsible for IPv6 tunneling over IPv4, to your static address. If you've got a consumer-grade router that doesn't support this, you'll just have to put your machine in the DMZ, thus putting your computer "in front" of your router's firewall. Please make sure you are running a local software firewall if you choose this option.

After handling the routing of protocol 41, there is one small configuration change to /etc/network/interfaces. You must change your tunnel's local address from your public IP address, to your private NATed address. Here is an example configuration including both the static IP configuration and the updated tunnel configuration.

#/etc/network/interfaces
auto eth0
iface eth0 inet static
  address 192.168.0.50
  netmask 255.255.255.0
  gateway 192.168.0.1
 
auto he-ipv6
iface he-ipv6 inet6 v4tunnel
  address 2001:470:4:9ae::2
  netmask 64
  endpoint 209.51.161.58
  local 192.168.0.50
  ttl 225
  gateway 2001:470:4:9ae::1

Don't forget to restart your networking interfaces after these changes. I found a good ol' restart was helpful as well, but of course, we don't have this luxury in production, so be careful!

Checking IPv6

If you're reading this article, you're probably responsible for several hosts. For a gentle reminder of which of your sites don't yet have IPv6 set up, I recommend checking out IPvFoo for Chrome or 4or6 for Firefox. These tools make it easy to see which of your sites are ready for World IPv6 Launch Day!

Getting Help

Hurricane Electric provides really great support for their IPv6 tunnel service (which is completely free). Simply email ipv6@he.net and provide them with some useful information such as:

cat /etc/network/interfaces
netstat -nrA inet6  (these are your IPv6 routing tables)
cat /etc/default/ufw
relevant router configurations

I was very impressed to get a response from a competent person in 15 minutes! Sadly, there is one downside to using this tunnel; IRC is not allowed.

Due to an increase in IRC abuse, new non-BGP tunnels now have IRC blocked by default. If you are a Sage, you can re-enable IRC by visiting the tunnel details page for that specific tunnel and selecting the 'Unblock IRC' option. Existing tunnels have not been filtered.

I guess ya gotta earn it to use IRC over your tunnel. Good luck!

Saturday, January 21, 2012

MySQL replication monitoring on Ubuntu 10.04 with Nagios and NRPE

If you're using MySQL replication, then you're probably counting on it for some fairly important need. Monitoring via something like Nagios is generally considered a best practice. This article assumes you've already got your Nagios server setup and your intention is to add a Ubuntu 10.04 NRPE client. This article also assumes the Ubuntu 10.04 NRPE client is your MySQL replication master, not the slave. The OS of the slave does not matter.

Getting the Nagios NRPE client setup on Ubuntu 10.04

At first it wasn't clear which packages were appropriate to install. I was initially misled by the naming of the nrpe package, but found the correct packages to be:

sudo apt-get install nagios-nrpe-server nagios-plugins

The NRPE configuration is stored in /etc/nagios/nrpe.cfg, while the plugins are installed in /usr/lib/nagios/plugins/ (or lib64). The installation of this package will also create a user nagios which does not have login permissions. After the packages are installed the first step is to make sure that /etc/nagios/nrpe.cfg has some basic configuration.

Make sure you note the server port (defaults to 5666) and open it on any firewalls you have running. (I got hung up because I forgot I have both a software and hardware firewall running!) Also make sure the server_address directive is commented out; you wouldn't want to only listen locally in this situation. I recommend limiting incoming hosts by using your firewall of choice.
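
The relevant excerpt of /etc/nagios/nrpe.cfg ends up looking something like this (the allowed_hosts value is illustrative, and your firewall should remain the primary gatekeeper):

server_port=5666
# leave server_address commented out so NRPE listens on all interfaces
#server_address=127.0.0.1
# your Nagios server's address
allowed_hosts=192.168.0.10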

Choosing what NRPE commands you want to support

Further down in the configuration, you'll see lines like command[check_users]=/usr/lib/nagios/plugins/check_users -w 5 -c 10. These are the commands you plan to offer the Nagios server to monitor. Review the contents of /usr/lib/nagios/plugins/ to see what's available and feel free to add what you feel is appropriate. Well-designed plugins should give you a usage statement if you execute them from the command line. Otherwise, you may need to open your favorite editor and dig in!

After verifying you've got your NRPE configuration completed and made sure to open the appropriate ports on your firewall(s), let's restart the NRPE service:

service nagios-nrpe-server restart

This would also be an appropriate time to confirm that the nagios-nrpe-server service is configured to start on boot. I prefer the chkconfig package to help with this task, so if you don't already have it installed:

sudo apt-get install chkconfig
chkconfig | grep nrpe
 
# You should see...
nagios-nrpe-server     on
 
# If you don't...
chkconfig nagios-nrpe-server on

Pre flight check - running check_nrpe

Before going any further, log into your Nagios server and run check_nrpe to make sure you can execute at least one of the commands you chose to support in nrpe.cfg. This way, if there are any issues, they're obvious now, before we've started modifying your Nagios server configuration. The location of your check_nrpe binary may vary, but the syntax is the same:

check_nrpe -H host_of_new_nrpe_client -c command_name

If your command outputs something useful and expected, you're on the right track. A common error you might see: Connection refused by host. Here's a quick checklist:

  • Did you start the nagios-nrpe-server service?
  • Run netstat -lunt on the NRPE client to make sure the service is listening on the right address and ports.
  • Did you open the appropriate ports on all your firewall(s)?
  • Is there NAT translation which needs configuration?

Adding the check_mysql_replication plugin

There is a lot of noise out there on Google for Nagios plugins which offer MySQL replication monitoring. I wrote the following one using ideas pulled from several existing plugins. It is designed to run on the MySQL master server, check the master's log position and then compare it to the slave's log position. If there is a difference in position, the alert is considered Critical. Additionally, it checks the slave's reported status, and if it is not "Waiting for master to send event", the alert is also considered critical. You can find the source for the plugin at my Github account under the project check_mysql_replication. Pull that source down into your plugins directory (/usr/lib/nagios/plugins/ (or lib64)) and make sure the permissions match the other plugins.

With the plugin now in place, add a command to your nrpe.cfg.

command[check_mysql_replication]=sudo /usr/lib/nagios/plugins/check_mysql_replication.sh -H <slave_host_address>

At this point you may be saying, WAIT! How will the user running this command (nagios) have login credentials to the MySQL server? Thankfully we can create a home directory for that nagios user, and add a .my.cnf configuration with the appropriate credentials.

usermod -d /home/nagios nagios #set home directory
mkdir /home/nagios
chmod 755 /home/nagios
chown nagios:nagios /home/nagios
 
# create /home/nagios/.my.cnf with your preferred editor with the following:
[client]
user=example_replication_username
password=replication_password
 
chmod 600 /home/nagios/.my.cnf
chown nagios:nagios /home/nagios/.my.cnf

This would again be an appropriate place to run a pre-flight check with check_nrpe from your Nagios server to make sure this configuration works as expected. But first we need to add this command to the sudoers file.

nagios ALL= NOPASSWD: /usr/lib/nagios/plugins/check_mysql_replication.sh

Wrapping Up

At this point, you should run another check_nrpe command from your server and see the replication monitoring report. If not, go back and check these steps carefully; there are lots of gotchas, and permissions and file ownership are easily overlooked. With this in place, just add the NRPE client using the existing templates you have for your Nagios servers and make sure the monitoring is reporting as expected.

Saturday, January 14, 2012

Using Disqus and Ruby on Rails

Recently, I posted about how to import comments from a Ruby on Rails app to Disqus. This is a follow-up to that post, outlining the implementation of Disqus in a Ruby on Rails site. Disqus provides what it calls Universal Code which can be added to any site. This universal code is just JavaScript, which asynchronously loads the Disqus thread based on one of two unique identifiers Disqus uses.

Disqus in a development environment

Before we get started, I'd recommend that you have two Disqus "sites"; one for development and one for production. This will allow you to see real content and experiment with how things will really behave once you're in production. Ideally, your development server would be publicly accessible to allow you to fully use the Disqus moderation interface, but it isn't required. Simply register another Disqus site, and make sure that you have your shortname configured by environment. Feel free to use whatever method you prefer for defining these kinds of application preferences. If you're looking for an easy way, consider checking out my article on Working with Constants in Ruby. It might look something like this:

# app/models/article.rb
 
DISQUS_SHORTNAME = Rails.env == "development" ? "dev_shortname".freeze : "production_shortname".freeze

Disqus Identifiers

Each time you load the universal code, you need to specify a few configuration variables so that the correct thread is loaded:

  • disqus_shortname: tells Disqus which website account (called a forum on Disqus) this system belongs to.
  • disqus_identifier: tells Disqus how to uniquely identify the current page.
  • disqus_url: tells Disqus the location of the page for permalinking purposes.

Let's create a Rails partial to set up these variables for us, so we can easily call up the appropriate comment thread.

# app/views/disqus/_thread.html.erb
# assumes you've passed in the local variable 'article' into this partial
 
<div id="disqus_thread"></div>
<script type="text/javascript">
 
    var disqus_shortname = '<%= Article::DISQUS_SHORTNAME %>';
    var disqus_identifier = '<%= article.id %>';
    var disqus_url = '<%= url_for(article, :only_path => false) %>';
 
    /* * * DON'T EDIT BELOW THIS LINE * * */
    (function() {
        var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
        dsq.src = 'http://' + disqus_shortname + '.disqus.com/embed.js';
        (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
    })();
</script>

The above code will populate the div#disqus_thread with the correct content based on your disqus_identifier. By setting up a single partial that will always render your threads, it becomes very easy to adjust this code if needed.

Disqus Identifier Gotcha

We found during our testing a surprising and unexpected behavior in how Disqus associates a thread with a URL. In our application, the landing page was designed to show the newest article as well as the Disqus comments thread. We found that once a new article was posted, the comments from the previous article were still shown! It seems Disqus ignored the unique disqus_identifier we had specified and instead associated the thread with the landing page URL. In our case, a simple routing change allowed us to forward the user to the unique URL for that content and thread. In your case, there may not be such an easy workaround, so be certain you include both the disqus_identifier and disqus_url JavaScript configuration variables above to minimize the assumptions Disqus will make. When at all possible, always use unique URLs for displaying Disqus comments.

Comment Counters

Often an index page will want to display a count of how many comments are in a particular thread. Disqus uses the same asynchronous approach to loading comment counts. Comment counts are shown by adding code such as the following where you want to display your count:

# HTML
<a href="http://example.com/article1.html#disqus_thread"
   data-disqus-identifier="<%=@article.id%>">
This will be replaced by the comment count
</a>
 
# Rails helper
<%= link_to "This will be replaced by the comment count",
    article_path(@article, :anchor => "disqus_thread"),
    :"data-disqus-identifier" => @article.id %>

At first this seemed strange, but it is the exact same pattern used to display the thread. It would likely be best to remove the link text so nothing is shown until the comment count is loaded, but I felt that for my example, having some meaning in the text would aid understanding. Additionally, you'll need to add the following JavaScript to your page.

# app/view/disqus/_comment_count_javascript.html.erb
# add once per page, just above </body>
 
<script type="text/javascript">
    
    var disqus_shortname = '<%= Article::DISQUS_SHORTNAME %>';
 
    /* * * DON'T EDIT BELOW THIS LINE * * */
    (function () {
        var s = document.createElement('script'); s.async = true;
        s.type = 'text/javascript';
        s.src = 'http://' + disqus_shortname + '.disqus.com/count.js';
        (document.getElementsByTagName('HEAD')[0] || document.getElementsByTagName('BODY')[0]).appendChild(s);
    }());
</script>

Disqus recommends adding it just before the closing </body> tag. You only need to add this code ONCE per page, even if you're planning on showing multiple comment counts on a page. You will need this code on any page with a comment count, so I do recommend putting it in a partial. If you wanted, you could even include it in a layout.

Styling Comment Counts

Disqus provides extensive CSS documentation for its threads, but NONE for its comment counters. In our application, we had some very particular style requirements for these comment counts. I found that in Settings > Appearance, I could add HTML tags around the comment count output.

This allowed me to style the comment counts as needed, although these fields are pretty small, so make sure to compress your HTML as much as possible.