Search is hard

Most websites and web applications (like most applications of any type, ever) are glorified database front-ends. It’s easy to retrieve entries if you know exactly what you’re looking for, such as a specific Flickr user or an exact phrase from a blog post.

Websites usually implement search with “dumb” algorithms, such as MySQL LIKE or FULLTEXT queries, or plain substring matching. But what if you want to find everything on a website about, say, music? Wouldn’t an entry including “song” or “RIAA” be appropriate? None of the dumb algorithms would return it, because they only perform literal yes-or-no tests: “Return all posts whose body contains the word ‘music’.”
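To make the limitation concrete, here is a minimal sketch of that kind of literal matching in Python. The sample posts and the naive_search helper are invented for illustration; they stand in for a LIKE-style query and are not taken from any real site.

```python
# Hypothetical sample posts, invented for illustration only.
POSTS = [
    {"id": 1, "body": "I love music, especially jazz."},
    {"id": 2, "body": "This song has been stuck in my head all week."},
    {"id": 3, "body": "The RIAA filed another lawsuit today."},
]


def naive_search(posts, term):
    """Return ids of posts whose body contains `term` as a substring.

    Roughly what a LIKE '%term%' query does: a case-insensitive,
    purely literal test with no notion of meaning.
    """
    needle = term.lower()
    return [post["id"] for post in posts if needle in post["body"].lower()]


print(naive_search(POSTS, "music"))  # [1]
```

Posts 2 and 3 are arguably about music too, but neither contains the literal string, so neither comes back.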

Linguistic search is hard, and the dumb algorithms don’t perform any linguistic analysis. Suppose you were building a search feature for a music review and discussion website. Neither when parsing a search query nor when building its index can the database tell that:

  • song is equivalent to songs
  • singing is equivalent to sing and sang, and is related to song
  • Cracker is not equivalent to crack, crackers, or cracking
  • tune, track, hit, and number might all mean song in the context of music
  • record and album are equivalent, while LP is similar
  • An entry that mentions the RIAA, Metallica, and radio is related to music even if it doesn’t contain the word music
  • An entry containing Green Day is not the same as an entry containing green and day
  • A user searching for REM probably means R.E.M.
  • Maroon Five is equivalent to Maroon 5
  • the should usually be ignored, but the phrase the artist might mean Prince, while artist alone doesn’t
  • Live is a band, but live is a type of recording, and a user searching for live songs could mean either
  • Many people can’t spell
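
A few of the simpler items above (plurals, synonyms, punctuation, number words, stop words, and misspellings) can be roughly approximated with hand-written tables. The sketch below is only an illustration under heavy assumptions: the STOPWORDS, SYNONYMS, and VOCABULARY tables and both function names are made up, and the plural rule is deliberately crude. Only Python's standard re and difflib modules are used.

```python
import difflib
import re

# Tiny hand-written tables: assumptions for illustration, not real data.
STOPWORDS = {"the", "a", "an", "of"}
SYNONYMS = {"tune": "song", "track": "song", "record": "album", "five": "5"}
VOCABULARY = ["song", "album", "music", "maroon", "5", "rem", "live"]


def normalize_token(token):
    """Map one raw token to a canonical form, or None to drop it."""
    # Lowercase and strip punctuation, so "R.E.M." becomes "rem".
    token = re.sub(r"[^a-z0-9]", "", token.lower())
    if not token or token in STOPWORDS:
        return None
    # Crude plural stripping: "songs" -> "song". Real stemmers do far more.
    if token.endswith("s") and len(token) > 3:
        token = token[:-1]
    token = SYNONYMS.get(token, token)
    # Fuzzy-match unknown tokens against the known vocabulary.
    if token not in VOCABULARY:
        close = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.75)
        if close:
            token = close[0]
    return token


def normalize_query(query):
    """Split a query on whitespace and normalize each token."""
    tokens = (normalize_token(raw) for raw in query.split())
    return [t for t in tokens if t is not None]


print(normalize_query("The Songs"))    # ['song']
print(normalize_query("Maroon Five"))  # ['maroon', '5']
print(normalize_query("R.E.M."))       # ['rem']
```

Even this toy gets the hard cases wrong. It would happily turn Crackers into Cracker, it always throws away the so it can never recognize the artist, it splits Green Day into two unrelated tokens, and it knows nothing about context. Every fix means another table or another special case.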

I could go on like this forever. You get the idea.

The fundamental problem is that good search needs to figure out what you mean, not necessarily what you said.

A lesson to young Web 2.0 programmers: Don’t try to build your own search engine. It will definitely suck, and you’ll never finish it. Go get a commercial one, or use a free alternative if you can’t afford that.