Potential of speech recognition technologies
Speech
recognition has been a part of software engineering for some time, but the
technologies (and software systems) behind it are still at a very nascent
stage. MOHAN BABU writes about the potential of speech recognition
technologies, which can make our lives much easier.
During the
recent drama, when the Washington DC serial killers (snipers) tried to
communicate with police using voice synthesizers, the police in turn used
sophisticated voice recognition software to decipher the speech patterns.
This set me thinking about the potential of voice recognition technologies
and I decided to dig around a bit.
Imagine this
scenario: I walk into my office and, instead of flipping the switch to boot
my PC, instruct it by voice to start up. After that I start talking to “it” as
an executive would dictate to a secretary, and expect the system
to “magically” do all that a secretary would, including formatting
documents, creating presentations, printing them, sending e-mails, etc. In
most offices, an executive simply gives the secretary clear spoken
instructions, so visionaries would have us believe that computers could
someday take over a secretary’s role and turn voice commands into tasks
that they perform. That day is still a long way off, at least with the
current technologies and software. However, voice recognition systems and
software are already here and are used routinely by businesses to interact
with customers.
Speech
recognition allows users to provide input to applications by voice instead
of clicking a mouse, typing on a keyboard or pressing keys on a phone keypad.
Many airlines and transportation companies regularly use voice recognition
software systems to handle customer calls, connecting callers to a human
operator only when absolutely necessary. This makes the system streamlined and
cost-effective. Such systems typically work like this: users call the
toll-free number and are prompted by a voice greeting. For instance, if I were
to call an airline’s flight arrival/departure system, I would be prompted with
a voice greeting and asked to say the flight number (say DL 1106). The system
would recognise what I said and repeat it to confirm, after which it would
take me through a series of options till I got the information I wanted.
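
The call flow described above can be sketched in a few lines of Python. This is only an illustration, not how any airline's system is actually built: the function names, the prompts and the flight table are all hypothetical, and the caller's speech is assumed to have already been converted to text by a recognition engine.

```python
import re

# Hypothetical flight data; a real system would query the airline's database.
FLIGHTS = {"DL1106": "on time"}


def normalise_flight_number(recognised_text):
    """Turn recognised text such as 'DL 1106' into the key 'DL1106'."""
    cleaned = re.sub(r"[^A-Za-z0-9]", "", recognised_text).upper()
    # Assumed format: two letters followed by one to four digits.
    return cleaned if re.fullmatch(r"[A-Z]{2}\d{1,4}", cleaned) else None


def flight_status_dialog(utterances):
    """Run one call and return the prompts the system would speak.

    `utterances` is what the caller said, already converted to text.
    """
    replies = iter(utterances)
    prompts = ["Welcome. Please say your flight number."]

    flight = normalise_flight_number(next(replies, ""))
    if flight is None or flight not in FLIGHTS:
        # Fall back to a human operator only when the system cannot cope.
        prompts.append("Sorry, I did not understand. Transferring you to an operator.")
        return prompts

    # Repeat what was recognised so the caller can confirm it.
    prompts.append(f"You said flight {flight}. Is that correct?")
    if next(replies, "").strip().lower() != "yes":
        prompts.append("Transferring you to an operator.")
        return prompts

    prompts.append(f"Flight {flight} is {FLIGHTS[flight]}.")
    return prompts
```

Calling `flight_status_dialog(["DL 1106", "yes"])` walks through the greeting, the confirmation and the final answer, while an unrecognisable reply ends with a transfer to an operator.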
Most speech
recognition processing is performed by software systems written around
components called speech recognition engines. The speech recognition
engine’s primary function is to process spoken input and translate it into
text that an application can understand. Behind the scenes, voice
recognition systems are built around complex software engines, with the call
dialogues typically scripted using VoiceXML technologies. Most commercially
deployed voice recognition systems are speaker-independent: they require
little knowledge of the speaker’s voice characteristics and do not need to
be “trained” on the voice or accent of individual speakers. Instead, such
systems are designed around menu-driven prompt architectures. Even
prompt-driven speech recognition systems need powerful engines to handle the
grammar of the application, that is, the words and phrases a caller is
likely to say. For instance, a banking VoiceXML system will need to handle
common banking terms such as debit, credit, account, transaction,
etc.
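
To make the idea of an application grammar concrete, here is a minimal Python sketch that maps whatever the engine “heard” onto the closest word in a small banking vocabulary. It is a simplification made for illustration: a real engine compares sounds against the grammar rather than spellings, and the word list, the function name and the similarity cutoff are all assumptions.

```python
import difflib

# Hypothetical grammar: the only words this banking application accepts.
BANKING_GRAMMAR = ["debit", "credit", "account", "transaction"]


def match_to_grammar(heard, grammar=BANKING_GRAMMAR, cutoff=0.6):
    """Return the grammar word closest to the recognised text, or None.

    `cutoff` is an assumed similarity threshold between 0 and 1; anything
    less similar than that is treated as outside the grammar.
    """
    candidates = difflib.get_close_matches(
        heard.strip().lower(), grammar, n=1, cutoff=cutoff
    )
    return candidates[0] if candidates else None
```

With this sketch, a slightly garbled result such as "acount" still resolves to "account", while input that resembles nothing in the grammar returns `None`, at which point the application can prompt the caller again.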
If speech
recognition is such a convenient interface for communicating with computer
systems, you might be wondering why it hasn’t taken off in a big way. The
reasons are many, including the following:
* Nascent technologies and software: Even though speech recognition has
been a part of software engineering for a while, the technologies (and
software systems) behind it are still at a very nascent stage.
* Grammatical and language issues: Even assuming the systems being
developed are going to recognise only one language, say English, the grammar
and pronunciation of English words vary from region to region. For instance,
in US English, the word “the” has at least two pronunciations: “thee” and
“thuh”.
* Usage and accents: Indians speaking English are going to sound different
from the British, Americans or continental Europeans. A system should be
sophisticated enough to understand the different accents.
Even though VoiceXML technologies are at a nascent stage, they hold promise
for a country like India, where a sizeable part of our population
is illiterate or semi-literate. Voice-enabled computer kiosks
will help us leapfrog the learning curve and bring computer
usage to the masses. Needless to say, there are problems that
are going to be unique to India, such as the prevalence of many
languages and scores of dialects. Systems designed to “talk”
to an auto-driver in Salem (Tamil Nadu) may not work for a
farmer in Bhatinda (Punjab). However, just as language has
not been a showstopper for Satyam in rolling out web portals
catering to people from different regions, it should not prevent
Indian entrepreneurs from thinking outside the box.