Search is hard

By Marco on July 26, 2007

Most websites and web applications (like most applications of any type, ever) are glorified database front-ends. It’s easy to retrieve entries if you know exactly what you’re looking for, such as a specific Flickr user or an exact phrase from a blog post.

Websites usually implement search with “dumb” algorithms, such as MySQL LIKE, FULLTEXT, or plain substring matching. But what if you want to find everything on a website about, say, music? Wouldn’t an entry including “song” or “RIAA” be appropriate? None of the dumb algorithms would return it, because they only do binary logic tests: “Return all posts whose body contains the word ‘music’.”
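To make that concrete, here is a minimal Python sketch of a “dumb” search. The sample posts and the dumb_search function are made up for illustration, and the substring test only roughly approximates a SQL filter like body LIKE '%music%'.

```python
# Hypothetical sample posts; the ids and bodies are invented for illustration.
POSTS = [
    {"id": 1, "body": "I love music and play it all day."},
    {"id": 2, "body": "This song has been stuck in my head for a week."},
    {"id": 3, "body": "The RIAA filed another lawsuit today."},
]


def dumb_search(posts, term):
    """Return ids of posts whose body contains `term`, ignoring case.

    This is a plain substring test, roughly what a LIKE '%term%' filter does.
    """
    needle = term.lower()
    return [post["id"] for post in posts if needle in post["body"].lower()]


print(dumb_search(POSTS, "music"))  # [1]
```

Posts 2 and 3 are plainly about music, but neither contains the literal string, so the search never sees them.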

Linguistic search is hard, and the dumb algorithms don’t perform any linguistic analysis. Suppose you were building a search feature for a music review and discussion website. Whether it’s reading a search query or building an index, the database can’t tell that:

“song” is equivalent to “songs”
“singing” is equivalent to “sing” and “sang”, and is related to “song”
“Cracker” is not equivalent to “crack”, “crackers”, or “cracking”
“tune”, “track”, “hit”, and “number” might all mean “song” in the context of music
“record” and “album” are equivalent, while “LP” is similar
An entry that mentions the RIAA, Metallica, and radio is related to music even if it doesn’t contain the word “music”
An entry containing “Green Day” is not the same as an entry containing “green” and “day”
A user searching for “REM” probably means “R.E.M.”
“Maroon Five” is equivalent to “Maroon 5”
“the” should be ignored, but “the artist” might mean Prince, while “artist” alone doesn’t
“Live” is a band, but “live” is a type of recording, and a user searching for “live songs” could mean either
Many people can’t spell
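A few of these cases can be patched with hand-written rules, and a rough Python sketch shows how quickly that gets messy. Everything here is an assumption for illustration: the tiny lookup tables, the normalize and guess_artist helpers, and the 0.8 similarity cutoff. The only real library call is difflib.get_close_matches from the Python standard library.

```python
import difflib
import re

# Hypothetical hand-written tables; a real site would need far larger ones.
SYNONYMS = {"tune": "song", "track": "song", "album": "record"}
NUMBER_WORDS = {"five": "5"}
KNOWN_ARTISTS = ["rem", "metallica", "green day", "maroon 5"]


def normalize(text):
    """Lowercase, drop periods, then map number words and synonyms."""
    text = text.lower().replace(".", "")
    words = re.findall(r"[a-z0-9]+", text)
    words = [NUMBER_WORDS.get(word, word) for word in words]
    words = [SYNONYMS.get(word, word) for word in words]
    return " ".join(words)


def guess_artist(query):
    """Return the closest known artist name, or None if nothing is close."""
    matches = difflib.get_close_matches(
        normalize(query), KNOWN_ARTISTS, n=1, cutoff=0.8
    )
    return matches[0] if matches else None


print(normalize("R.E.M."))        # rem
print(normalize("Maroon Five"))   # maroon 5
print(guess_artist("Metalica"))   # metallica
```

Each rule fixes one case and breaks another: mapping “track” to “song” is wrong on a post about track and field, and nothing here can tell Live the band from a live recording.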

I could go on like this forever. You get the idea.

The fundamental problem is that good search needs to figure out what you mean, not necessarily what you said.

A lesson to young Web 2.0 programmers: Don’t try to build your own search engine. It will definitely suck, and you’ll never finish it. Go get a commercial one, or use a free alternative if you can’t afford that.

Marco.org
