January 15, 2010

Our “go” url shortening/keyword service.


By David McKelvey

 

Ever wondered why our URL shortening service is different from others, and how it works?

Our URL shortening service is designed first for readability and marketing use. You can of course use the shortened URLs in email and other electronic distribution (and there are good reasons to do so), but they are designed foremost to be memorable links that carry people from print, audio, or video to the web.

That said, here’s how it works. I’ve talked before about our search engine, which is a Rails app that uses Thinking Sphinx and Sphinx to glean content from our database-driven site. Our shortener/keyword service is actually part of that search web application.

Associations
This portion of the web application works by associating keywords or keyword combinations with a URL via one of two methods: generic or redirect. A generic association is used for terms like “green”, or for terms that aren’t specific enough on their own, like “admissions”. (We have three admissions offices.) A redirect association is used where uniqueness is assured, and it also functions at the go.lclark.edu subdomain.

The main distinction is that a redirect keyword/combo must be specific enough not to have multiple potential destinations. This is primarily why the service is completely managed: we can’t allow people to just choose their own keyword/combo, because we have to weigh the potential future use of a term. (I’ve planned a request service into the next generation of this app, but most people just email requests now. Feel free to ask about this service.)

Associations work as follows: a single URL can have more than one redirect keyword/combo and more than one generic keyword/combo. Conversely, a generic keyword/combo can have more than one URL, but a redirect keyword/combo can have only one, so there is never any confusion about which URL to use. These associations are used for two services: the go service (redirect method) and the recommended results service in our search (redirect or generic method).
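
As a rough sketch of those rules (in Python for illustration, not the Rails code the app actually uses), the two association types might be modeled as below. The class and method names are hypothetical, and any URLs you feed it are placeholders.

```python
from collections import defaultdict


class Associations:
    """In-memory sketch of the two association types (all names hypothetical)."""

    def __init__(self):
        self.redirects = {}                # keyword/combo -> exactly one URL
        self.generics = defaultdict(list)  # keyword/combo -> any number of URLs

    def add_redirect(self, keyword, url):
        # A redirect keyword/combo must resolve to a single destination.
        existing = self.redirects.get(keyword)
        if existing is not None and existing != url:
            raise ValueError(f"{keyword!r} already redirects to {existing}")
        self.redirects[keyword] = url

    def add_generic(self, keyword, url):
        # A generic keyword/combo may point at several URLs.
        if url not in self.generics[keyword]:
            self.generics[keyword].append(url)

    def redirect_for(self, keyword):
        """What the go service needs: one URL, or None if nothing matches."""
        return self.redirects.get(keyword)

    def recommended_for(self, keyword):
        """What search needs: the redirect (if any) plus all generic matches."""
        urls = []
        target = self.redirects.get(keyword)
        if target is not None:
            urls.append(target)
        for url in self.generics.get(keyword, []):
            if url not in urls:
                urls.append(url)
        return urls
```

A term like “admissions” would get several `add_generic` calls, one per office, while a second `add_redirect` for an existing keyword with a different URL is refused.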

go.lclark.edu
Essentially, the go.lclark.edu service is a single PHP index file that uses cURL to query the search engine for a matching redirect (Rails includes a REST interface by default). It sends the URI string after the host name to the web application, which returns either a redirect URI or a 404 error. If the PHP file gets a 404, it redirects the user to the search engine with the same string as the query, to find other potential results. (This is good for simple mistyped keyword/combos, as they’d still get a good result.)
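
The flow is simple enough to sketch. This is Python for illustration, not the actual PHP file, and `resolve` and `lookup_redirect` are hypothetical names. The lookup function stands in for the cURL call to the search app and returns `None` where the app would answer 404. The fallback URL pattern is borrowed from the search example link in this post; that the fallback uses exactly this pattern is an assumption.

```python
from urllib.parse import quote_plus

# Assumed fallback pattern, modeled on the search example link in this post.
SEARCH_URL = "http://search.lclark.edu/pages?search="


def resolve(path, lookup_redirect):
    """Return the URL the go service should send the visitor to.

    `lookup_redirect` is a stand-in for the HTTP request to the search
    app: it takes the string and returns a URL, or None for a 404.
    """
    term = path.lstrip("/")
    target = lookup_redirect(term)
    if target is not None:
        return target
    # No redirect matched: hand the same string to the search engine as the query.
    return SEARCH_URL + quote_plus(term)
```

Passing the lookup in as a function keeps the sketch free of network calls; a dictionary's `get` method is enough to try it out.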

For matching a redirect, the uri string is canonicalized as follows:

  1. forced lowercase
  2. slashes, underscores, spaces and pluses are converted to spaces (essentially keyword separators; the separator could be anything)
  3. except for dashes (which remain as dashes), all other punctuation is stripped
  4. keywords are sorted alphabetically

This canonicalization means that the end user can choose both the type of separator to publicize and the order of the terms. So the following are equivalent:

  • new/media
  • new_media
  • new+media
  • media/New

(I’ve thought about utilizing camelCase too, but haven’t done that yet, as the readability doesn’t match the other methods.)
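
The four canonicalization steps can be sketched in a few lines. This is Python for illustration rather than the app's own code; the function name, the exact regular expressions, and the choice to join keywords with a single space are all assumptions.

```python
import re


def canonicalize(uri_string):
    """Reduce a go path to a canonical keyword string (hypothetical helper)."""
    text = uri_string.lower()              # 1. force lowercase
    text = re.sub(r"[/_+\s]+", " ", text)  # 2. separators become spaces
    text = re.sub(r"[^\w\s-]", "", text)   # 3. drop punctuation except dashes
    keywords = sorted(text.split())        # 4. sort keywords alphabetically
    return " ".join(keywords)
```

With this sketch, `new/media`, `new_media`, `new+media` and `media/New` all reduce to the same string, `media new`, which is what makes the separator and the term order a free choice for whoever publicizes the link.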

Recommended Results
The generic keyword/combos (along with the redirects) also drive the recommended results in our search engine. (Try <http://search.lclark.edu/pages?search=english> for an example.) So the time we spend managing this resource pays off twice.

Have questions or thoughts about this? Do let us know. I’ve considered putting this up on GitHub as an open source tool. Comment and let us know if that would be of interest.