utf8-cleaner gem protects your Rails app

TL;DR: Add our utf8-cleaner Ruby gem to your Rails app and kiss request encoding errors goodbye!

Our Rails apps send us error reports by email every time a request results in a 500 error. Sometimes these are legitimate bugs, and we can roll out a fix to our code. However, sometimes they're just the result of a misbehaving user agent (browser or search engine bot).

Several of our Rails apps, including this site, were reporting errors like "invalid byte sequence in UTF-8". In our apps, this was almost always caused by a parameter in the request URL called "Result" whose value was a pile of URL-encoded, non-ASCII characters. We never figured out where this parameter was coming from, but we could tell that the encoded characters were not valid UTF-8 characters when we tried to decode them.
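To make the failure concrete, here is a minimal sketch in plain Ruby. The parameter value is made up for illustration (it is not the actual "Result" payload we saw); %FF decodes to the byte 0xFF, which can never appear in well-formed UTF-8.

```ruby
require "uri"

# Hypothetical parameter value, not the real "Result" payload.
raw = "caf%FF"

# Decoding succeeds, but the result is tagged as UTF-8 while holding
# a byte sequence that is not valid UTF-8.
decoded = URI.decode_www_form_component(raw)

puts decoded.encoding        # UTF-8
puts decoded.valid_encoding? # false
```

Nothing fails at decode time. The error tends to surface later, when some code needs to interpret the string as characters rather than bytes.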

Rails, helpful framework that it is, does a bunch of work to parse incoming parameters before they get handed to your application code. Unfortunately, it does not provide a built-in mechanism for handling badly encoded characters. However, there is a very convenient place to plug in code before the request gets to Rails, and that is Rack middleware. Middleware was mysterious to me for a while, but it's actually pretty simple conceptually. Rack middleware is a stack of software that processes each incoming request and each outgoing response. It lives between your web server and your framework. Because of where it's located, it's the perfect place to clean up those funky request parameters before Rails can choke on them.
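If middleware is still mysterious to you, a tiny example may help. A Rack middleware is just an object that is built with the next app in the stack and responds to call(env). The sketch below is not our gem's code; the class name RequestTagger and the "example.tagged" key are invented for illustration, and the "app" is a bare lambda standing in for Rails.

```ruby
# Hypothetical middleware, named for illustration only.
class RequestTagger
  def initialize(app)
    @app = app # the next middleware (or the framework) in the stack
  end

  def call(env)
    # Anything done here happens before the wrapped app sees the request.
    env["example.tagged"] = true
    @app.call(env)
  end
end

# A trivial Rack-style app: takes the env hash, returns [status, headers, body].
app = lambda do |env|
  [200, { "content-type" => "text/plain" }, ["tagged: #{env["example.tagged"]}"]]
end

stack = RequestTagger.new(app)
status, _headers, body = stack.call({ "PATH_INFO" => "/" })

puts status     # 200
puts body.first # tagged: true
```

Because the middleware runs first, the wrapped app only ever sees the modified env. That same ordering is what lets a cleaner fix up parameters before Rails parses them.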

Because we were seeing these problems in multiple apps, we wanted to be able to easily share the fix. Ruby's gem packaging system was an obvious fit. We created a gem called utf8-cleaner using the simple bundle gem command.

We built our middleware, using test-driven development to keep our design clean and ensure that we were solving the right problems. We used the String class's encoding methods where possible, but we also had to do some more manual processing of the encoded characters because we were dealing with such malformed input. In the process, we learned way more about UTF-8 encoding than we ever wanted to!
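To give a flavor of the two kinds of cleaning, here is a rough sketch. It is not the gem's actual implementation: the method names are hypothetical, and the percent-encoded variant is deliberately coarse (it drops a whole run of %XX escapes if the run does not decode to valid UTF-8, rather than salvaging the valid parts).

```ruby
# Raw bytes: lean on String's own encoding methods.
# String#scrub replaces invalid byte sequences; here we replace them with nothing.
def scrub_raw(value)
  value.dup.force_encoding(Encoding::UTF_8).scrub("")
end

# Percent-encoded text: the string itself is plain ASCII, so String#scrub
# has nothing to fix. Instead, decode each run of %XX escapes to bytes and
# keep the run only if those bytes form valid UTF-8.
def drop_invalid_percent_runs(value)
  value.gsub(/(?:%\h\h)+/) do |run|
    bytes = [run.delete("%")].pack("H*")
    bytes.force_encoding(Encoding::UTF_8).valid_encoding? ? run : ""
  end
end

puts scrub_raw("caf\xFF!")                                  # caf!
puts drop_invalid_percent_runs("Result=%FF%FE&q=caf%C3%A9") # Result=&q=caf%C3%A9
```

The second method shows why the built-in encoding methods were not enough on their own: a URL full of %XX escapes is perfectly valid as a string, and only becomes a problem once something decodes it.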

Once the middleware was doing its job, we needed to tie it into Rails. Enter Railties, which let your code hook into various parts of Rails, including the initialization process. Our Railtie simply inserts our middleware at the front of the middleware stack. This ensures that we'll be able to clean up the nasty parameters before they hit any other Ruby code that might choke on them.
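A Railtie for this purpose can be very small. The sketch below uses a hypothetical ExampleCleaner module with a pass-through middleware standing in for the real one, and an initializer name we made up. It only defines the Railtie when Rails is already loaded, so the file is harmless in a plain Rack app.

```ruby
module ExampleCleaner
  # Stand-in for the real cleaning middleware; passes requests through untouched.
  class Middleware
    def initialize(app)
      @app = app
    end

    def call(env)
      @app.call(env)
    end
  end

  # Only hook into Rails when Rails is actually present.
  if defined?(Rails::Railtie)
    class Railtie < Rails::Railtie
      initializer "example_cleaner.insert_middleware" do |app|
        # Index 0 puts the middleware at the very front of the stack.
        app.config.middleware.insert_before(0, ExampleCleaner::Middleware)
      end
    end
  end
end
```

In a Rails app, running rake middleware lists the stack in order, which is a quick way to confirm that a middleware ended up where you expected.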

With the Railtie in place, we tested our new gem locally by using the :path option in our app's Gemfile. By doing so, we were able to work out any integration issues before we pushed the gem up to rubygems.org with the beautifully simple rake release command provided by Bundler.

Because Rails and Rack provide so many integration points, using utf8-cleaner is as easy as adding it to your Gemfile and running bundle install. It currently cleans HTTP_REFERER, PATH_INFO, QUERY_STRING, REQUEST_PATH, REQUEST_URI, and HTTP_COOKIE, but the design should allow it to clean other parts of the request (e.g. a POST request body) pretty easily.
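For reference, the Gemfile entry looks something like this. The commented-out line shows the :path form for pointing Bundler at a local checkout while developing a gem; the relative path there is hypothetical and depends on where your checkout lives.

```ruby
# Gemfile
source "https://rubygems.org"

# Released version from rubygems.org:
gem "utf8-cleaner"

# Local checkout while developing the gem (hypothetical path):
# gem "utf8-cleaner", path: "../utf8-cleaner"
```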

We'd love to hear your feedback on this gem! It's quieted down our error reports, significantly increasing our signal-to-noise ratio. We hope you'll find it useful!