The Future: Speech Recognition

There's a discussion on The Forums about the future of computers and related technology. Specifically, what will we see 20 years from now?

Whenever this question comes up in a more general sense, someone suggests flying cars. Well, we know how well those work. When the question is about computers, someone always suggests that the keyboard will no longer be necessary and that we'll simply speak to our computers to enter text and commands.

There is absolutely no way that this will ever be a reality.

In computer science, we have easy problems and hard problems. Easy problems include searching through a large data set for a particular record or adding realtime effects to digital video. While these might stress hardware, we can produce satisfactory results (there is usually only one correct answer) and we have a pretty good idea of what algorithms are necessary. But calling a problem hard doesn't mean that it takes longer to solve or that it needs a faster CPU. It usually means that it can't be solved effectively with limited resources, given the algorithms that we've developed so far. Sometimes, as with playing a perfect game of chess, there aren't enough atoms in the universe to store the information required to compute the perfect answer. We rarely recategorize a problem from hard to easy, and we've known about most hard problems for at least 40 years.
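To put rough numbers on the chess example, here is a back-of-the-envelope sketch. It assumes the commonly cited estimates of about 35 legal moves per position, about 80 half-moves per game, and around 10^80 atoms in the observable universe. None of these figures are exact, and the function name is just mine for illustration.

```python
import math

# Rough, commonly cited estimates (assumptions, not exact figures).
BRANCHING_FACTOR = 35   # average legal moves per chess position
PLIES_PER_GAME = 80     # half-moves in a typical game
ATOMS_EXPONENT = 80     # ~10^80 atoms in the observable universe


def game_tree_exponent(branching: int, plies: int) -> float:
    """Return x such that branching ** plies == 10 ** x."""
    return plies * math.log10(branching)


exponent = game_tree_exponent(BRANCHING_FACTOR, PLIES_PER_GAME)
print(f"Possible chess games: roughly 10^{exponent:.0f}")
print(f"Atoms in the universe: roughly 10^{ATOMS_EXPONENT}")
print(f"Shortfall: a factor of about 10^{exponent - ATOMS_EXPONENT:.0f}")
```

Even if a single atom could somehow store an entire game, under these assumptions we would still run out of atoms long before we ran out of games. A faster CPU doesn't help with that.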

For most hard problems, we've invented relatively easy ways to generate approximate answers. They're not perfect, but they're acceptable in many applications.

Speech recognition is a hard problem, but we've had speech-to-text software for over a decade. It has never been perfect, but the best software can achieve approximately 90-95% accuracy during dictation. Easy, right? And if a 100 MHz Pentium could be 90% accurate in 1996, a 3.6 GHz Pentium 4 in 2005 should manage 100%, right?
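A quick calculation shows why 90-95% is less impressive than it sounds. The sketch below assumes that every word is recognized independently with the same accuracy, which is a simplification (real errors tend to cluster). The 100-word email and 20-word sentence are just example lengths.

```python
def expected_errors(word_accuracy: float, word_count: int) -> float:
    """Expected number of misrecognized words in a passage."""
    return (1.0 - word_accuracy) * word_count


def chance_of_clean_passage(word_accuracy: float, word_count: int) -> float:
    """Probability that every word is right, assuming independent errors."""
    return word_accuracy ** word_count


for accuracy in (0.90, 0.95):
    errors = expected_errors(accuracy, 100)
    clean = chance_of_clean_passage(accuracy, 20)
    print(f"{accuracy:.0%} accuracy: ~{errors:.0f} wrong words per 100, "
          f"{clean:.0%} chance that a 20-word sentence is error-free")
```

Under those assumptions, even the 95% case leaves about five corrections in a 100-word email, and most 20-word sentences contain at least one mistake.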

Unfortunately, this isn't the case. Speech recognition software can do one thing well: distinguish between a limited set of commands. Many annoying call-center phone menus now ask you to speak commands or numbers. But it's a lot easier to tell the difference between "support" and "sales" than to distinguish between every word in a language when the words are strung together into long sentences without pauses between them. That's how we speak.
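The gap between a small command set and an open vocabulary can be illustrated with a toy text matcher. This is only an analogy: it compares spellings using Python's standard difflib module rather than analyzing audio, and the word lists and the garbled input "suport" are made up for the example.

```python
import difflib

# Hypothetical vocabularies, for illustration only.
PHONE_MENU = ["support", "sales"]
OPEN_VOCABULARY = PHONE_MENU + ["sport", "spurt", "report", "export", "sort"]


def candidates(heard: str, vocabulary: list[str]) -> list[str]:
    """Return the vocabulary words that look similar to the garbled input."""
    return difflib.get_close_matches(
        heard, vocabulary, n=len(vocabulary), cutoff=0.6
    )


heard = "suport"  # a made-up, slightly garbled input
print("Phone menu:", candidates(heard, PHONE_MENU))
print("Open vocabulary:", candidates(heard, OPEN_VOCABULARY))
```

With two options, the garbled input has only one plausible match. With even a slightly larger vocabulary, several words become plausible, and something else has to break the tie. A real language has tens of thousands of words to choose from.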

We haven't made any significant progress in the last 10 years, for a few key reasons:

  • We're not stressing existing hardware at all. Hardware is not the limiting factor.
  • We haven't made any progress in algorithm development. We're still using the same recognition methods and getting the same results.
  • There's very little demand in the market to improve the existing products.

If all three of these remain true, I don't see how anything will have noticeably changed 20 years from now. It's easy to predict that the first two probably won't change, but what about market demand?

Most computers are used near other people. Can you imagine how your school lab or office would sound if everyone were talking to their computers? You wouldn't be able to get anything done, and everyone around you would have to listen to you saying "Forward the selected email to Mom with the following comment: Hey Mom, this is hilarious, pass it on! Hope the cows are doing well. Love, Marco. Send email. No, backspace, backspace, backspace. Send."

And you'd get wonderful emails. "Hey Marco, could you take a look at this? I think I'm having a. Hello? Yes, this is Dan. Can I call you back in a few minutes? Thanks, problem with the Clusty toolbar saving its preferences under the new Firefox build. Dash dash Dan send email."

Computer interaction must be kept separate from your other communication. Otherwise, the two can't happen simultaneously without sacrificing the quality of at least one.

Plus, much computer interaction bears no similarity to spoken or written language. Can you imagine reading a URL aloud to your computer, or trying to dictate code in a programming language?

Like flying cars, speech recognition as a replacement for keyboards is complete fantasy.