Lecture
ASR: what is it?
Voice keys
A bit of history
Analyst Forecasts
Main difficulties
Modern ASR market offerings
Aculab
Babear SDK Version 3.0
Loquendo ASR
Lumenvox
Nuance
SPIRIT
Voiceware
How much better it used to be! Calling a help desk, you could talk to a live operator and perhaps even ask her out on a date. Now a pleasant but inanimate female voice at the other end of the line offers to press 1 to receive such-and-such information, 2 to be connected with so-and-so, 3 to reach the menu, and so on. Increasingly, access to information is controlled by a system rather than by a human. There is a logic to this: the monotonous, uninteresting work is done by a machine, not a person. And for the user the procedure of obtaining information is simplified: dial a certain sequence of digits and you receive the information you need.
How does such a system work? Let's try to figure it out.
The two main types of speech recognition software are:
• voice (speech) navigators for controlling software and hardware, also called command-recognition programs;
• dictation programs for entering text and numeric data.
Voice navigators control programs, to some extent replacing the keyboard and mouse. They have a small dictionary (100-300 words). Some may work with continuous speech and do not require training.
Let us note at once that we will not consider text-to-speech and speech-to-text systems, that is, systems that convert text into spoken speech and vice versa. We confine ourselves to automatic command recognition systems, or voice navigators.
Automatic speech recognition (ASR) systems are an element of speech processing whose purpose is to provide a convenient dialogue between user and machine. In the broadest sense, these are systems that perform phonemic decoding of the acoustic speech signal for messages pronounced in a free style by an arbitrary speaker, without regard to domain orientation or restrictions on vocabulary size. In the narrow sense, an ASR system solves particular problems by imposing certain restrictions on the requirements for recognizing naturally sounding speech in its classical sense. Thus the range of ASR systems extends from simple stand-alone devices and children's toys able to recognize or synthesize individually pronounced words, numbers, city names, personal names and so on, all the way to sophisticated systems for recognizing and synthesizing natural-sounding speech, used, for example, as a secretarial assistant (IBM VoiceType Simply Speaking Gold).
As the main component of any friendly interface between machine and human, ASR can be embedded in a variety of applications: voice control systems, voice access to information resources, computer-aided language learning, help systems, and access to information through voice verification/identification.
ASR is very useful as a means of searching and sorting recorded audio and video data. Speech recognition is also used for entering information, which is especially convenient when a person's eyes or hands are busy. ASR allows people working in stressful conditions (doctors in hospitals, workers on the job, drivers) to use a computer to obtain or enter the information they need.
ASR is typically used in telephone applications, embedded systems (voice-dialing systems, handheld computers, in-car control, etc.) and multimedia applications (language-learning systems).
Automatic speaker recognition systems are sometimes called voice keys. These are usually biometric systems for authorized access either to information or to physical objects. Two types of such systems should be distinguished: verification systems and identification systems. During verification, the user first presents his code, that is, declares his identity in one way or another, and then pronounces a password or some arbitrary phrase aloud. The system checks whether the given voice matches the reference recalled from the computer's memory for the presented code.
During identification, no preliminary claim of identity is made. In this case the given voice is compared with all the stored references, and only then is it determined who the speaker is. Today there are many approaches and methods for implementing such systems, and as a rule they all differ from one another: there are as many varieties as there are developers. The same can be said of speech recognition systems. The characteristics of specific speech and speaker recognition systems can therefore be judged only with the help of special test databases.
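The difference between the two regimes can be sketched in a few lines. Assume, purely for illustration, that each enrolled user is represented by a single averaged feature vector, a "voiceprint" (real systems use far richer models, but the control flow is the same): verification compares one claimed template against a threshold, while identification searches all templates.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical enrolled voiceprints: speaker id -> averaged feature vector.
templates = {
    "alice": [0.9, 0.1, 0.3],
    "bob":   [0.2, 0.8, 0.5],
}

def verify(claimed_id, features, threshold=0.85):
    """Verification: the user claims an identity; compare only that template."""
    return cosine(templates[claimed_id], features) >= threshold

def identify(features):
    """Identification: no claim is made; pick the closest of all templates."""
    return max(templates, key=lambda sid: cosine(templates[sid], features))
```

Note the asymmetry: verification is a yes/no decision against a single reference, so its cost does not grow with the number of users, while identification must scan the whole enrollment database.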
United States, late 1960s. "Three," said Walter Cronkite, host of the popular-science program "The 21st Century," during a demonstration of the latest developments in speech recognition. The computer recognized the word as "four." "Idiot," muttered Walter. "That word is not in the dictionary," the computer replied.
Although the first developments in speech recognition date back to the 1920s, the first system was created only in 1952 by Bell Laboratories (today it is part of Lucent Technologies). And the first commercial system was created even later: in 1960, IBM announced the development of such a system, but the program never entered the market.
Then, in the 1970s, Eastern Airlines in the United States installed a speaker-dependent baggage-dispatch system: the operator named the destination, and the luggage was sent on its way. However, because of the number of errors, the system never made it past its trial period.
After that, development in this area proceeded sluggishly, if at all. Even in the 1980s there were very few real commercial applications of speech recognition systems.
Today, not dozens but hundreds of research teams in scientific and educational institutions, as well as in large corporations, are working in this direction, as can be judged from international forums of scientists and experts in speech technology such as ICASSP, EuroSpeech and ICPhS. The results of this work, which, as they say, has swept the whole world, can hardly be overestimated.
For several years now, voice navigators, or command recognition systems, have been successfully used in various fields of activity. For example, the OmniTouch call center supplied to the Vatican by Alcatel served the events held as part of the celebration of the 2000th anniversary of Christ. A pilgrim who called the call center voiced his question, and the automatic speech recognition system "listened" to it. If the system determined that the question concerned a common topic, such as the schedule of events or hotel addresses, a pre-recorded answer was played. If the question needed clarification, a speech menu was offered in which one of the items had to be selected by voice. If the recognition system determined that there was no pre-recorded answer to the question, the pilgrim was connected to a human operator.
In Sweden, an automated telephone help desk using Philips speech recognition software opened not long ago. During the first month of operation of the Autosvar service, which launched without any official announcement, 200,000 clients used it. The caller dials a specific number and, after the automatic secretary answers, names the section of the information directory that interests him.
The new service is aimed mainly at private clients, who prefer it because of its significantly lower cost. Autosvar is the first service of its kind in Europe (in the United States, trials of a similar service at AT&T began last December).
Here are some examples of using this technology in the USA.
Realtors often use the services of Newport Wireless. When a realtor drives down a street and sees a "For Sale" sign near a house, he calls Newport Wireless and asks for information about the house with such-and-such a number on such-and-such a street. An answering machine with a pleasant female voice tells him the size of the house, the date of construction and the owners. All this information is in the Newport Wireless database; the realtor has only to pass it on to the client. The subscription fee is about $30 per month.
Julie, Amtrak's virtual agent, has served rail passengers since October 2001. She provides train schedule, arrival and departure information by phone and also books tickets. Julie is a product of SpeechWorks software and Intervoice hardware. She has raised passenger satisfaction by 45%: 13 out of 50 clients now get all the information they need from Julie's "mouth." Previously Amtrak used a touch-tone help system, but the satisfaction rate was lower then: only 9 out of 50 clients.
Amtrak acknowledges that Julie paid for herself ($4 million) within 12-18 months, since she made it possible to avoid hiring a whole team of employees. And British Airways saves $1.5 million a year using technology from Nuance Communications, which likewise automates its help desk.
Recently, Sony Computer Entertainment America presented Socom, the first video game in which players can give verbal orders to soldiers, such as "Deploy grenades." The $60 game uses ScanSoft technology. Last year 450,000 copies were sold, making Socom the company's undisputed sales leader.
In expensive cars such as the Infiniti and Jaguar, voice control of the dashboard has been used for several years: the radio, climate control and navigation system understand the owner's voice and obey it without question. But now voice recognition technology is starting to appear in mid-range cars as well. Since 2003, the Honda Accord has offered an embedded IBM speech recognizer called ViaVoice as part of a $2,000 navigation system. According to the supplier, one in five Honda Accord buyers chooses the model with the voice navigation system.
Even in medicine, voice recognition technology has found its place. Devices for examining the stomach that obey the doctor's voice have already been developed. True, according to experts these devices are still imperfect, responding slowly to the doctor's commands, but better things lie ahead. In Memphis, the VA Medical Center has invested $277,000 in the Dragon program, which allows doctors and nurses to dictate information into a computer database. Soon, perhaps, no one will have to struggle to decipher a doctor's handwriting in a medical file.
Hundreds of large companies already use voice recognition technology in their products or services; among them are AOL, FedEx, Honda, Sony, Sprint, T. Rowe Price, United Airlines and Verizon. According to experts, the voice technology market reached about $695 million in 2002, 10% more than in 2001.
United Airlines introduced an automatic directory service in 1999. Automatic telephone call handling systems are operated by companies such as the investment bank Charles Schwab & Co. and the retailer Sears, Roebuck. American wireless operators (AT&T Wireless and Sprint PCS) have been using such programs for more than a year to provide voice-dialing services. And although America still leads in the number of call centers of this type, the benefits of speech recognition systems have recently begun to be appreciated in Europe as well. For example, the Swiss railway service already offers its German-speaking passengers services similar to those of United Airlines.
Today, speech recognition is considered one of the most promising technologies in the world. According to forecasts by the American research company Cahners In-Stat, the global market for speech recognition software will grow from $200 million to $2.7 billion by 2005. According to Datamonitor, the voice technology market will grow by an average of 43% per year: from $650 million in 2000 to $5.6 billion in 2006 (Fig. 1). Experts at the CNN media corporation named speech recognition one of the eight most promising technologies of the year. And IDC analysts claim that by 2005 speech recognition will crowd all other speech technologies out of the market (Fig. 2).
The main problem in developing ASR systems is the variability in pronunciation of the same word by different people, and by the same person in different situations. This does not trouble a human listener, but it can confuse a computer. In addition, the input signal is affected by numerous factors such as ambient noise, reflections, echo and channel noise. Matters are complicated by the fact that the noise and distortion are not known in advance, so the system cannot be tuned to them before it starts working.
Nevertheless, more than half a century of work on various ASR systems has borne fruit. Virtually any modern system can operate in several modes. First, it can be speaker-dependent or speaker-independent. A speaker-dependent system requires special training by a specific user in order to recognize accurately what he says. To train the system, the user pronounces several specified words or phrases, which the system analyzes, storing the results. This mode is usually used in dictation systems, where a single user works with the system.
A speaker-independent system can be used by anyone without a training procedure. This mode is typical where training is impossible, for example in telephone applications. Obviously, the recognition accuracy of a speaker-dependent system is higher than that of a speaker-independent one. A speaker-independent system, however, is easier to use: it can serve an unlimited number of users and requires no training.
Second, systems are divided into those that work only with isolated commands and those that can recognize connected speech. Recognizing connected speech is significantly harder than recognizing separately pronounced words. For example, moving from isolated-word recognition to connected-speech recognition with a 1000-word dictionary raises the error rate from 3.1% to 8.7%, and processing takes three times as long.
The isolated-command mode is the easiest and least resource-intensive. In this mode the user pauses after each word, clearly marking the word boundaries, so the system does not need to find the beginning and end of each word within a phrase. The system compares the recognized word with the patterns in its dictionary, and the most probable model is adopted. This type of recognition is widely used in telephony in place of the usual DTMF methods.
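The template-comparison step can be illustrated with a minimal sketch (not any vendor's actual algorithm) using dynamic time warping (DTW), a classic technique for comparing an utterance to stored word templates while tolerating differences in speaking rate. The one-dimensional "feature tracks" below are invented; real systems compare sequences of spectral feature vectors.

```python
def dtw(seq_a, seq_b):
    """Dynamic-time-warping distance between two feature sequences.
    Stretching or compressing either sequence in time is cheap, so
    fast and slow pronunciations of the same word still match well."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(seq_a[i - 1] - seq_b[j - 1])
            # Best of: skip a frame in a, skip a frame in b, or advance both.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

# Hypothetical one-dimensional feature tracks for two dictionary words.
dictionary = {
    "one": [1.0, 1.2, 3.0, 3.1, 1.0],
    "two": [0.2, 0.3, 0.2, 2.5, 2.4],
}

def recognize(utterance):
    """Adopt the dictionary word whose template is closest to the utterance."""
    return min(dictionary, key=lambda w: dtw(dictionary[w], utterance))
```

Because every word is matched against every template, the cost of this approach grows with dictionary size, which is one reason isolated-command systems keep their dictionaries small.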
The continuous-speech mode is more natural and convenient for the user. Here the system itself must find the word boundaries in a phrase. This mode demands far more system resources and memory, and recognition accuracy is lower than in the previous mode. Why? There are several reasons. First, in continuous speech words are pronounced less precisely than in "PIN mode," where each word is spoken separately. Second, speaking rate varies even for a single person: he may think, hesitate, forget a word. Colloquial speech is full of filler words: "well," "uh," "you see." Moreover, word boundaries are often blurred and not clearly articulated, which complicates the system's job.
Additional variations in speech also arise from arbitrary intonation, stress, lax structure of phrases, pauses, repetitions, etc.
At the junction of continuous and isolated pronunciation, a keyword-search mode has appeared. In this mode the ASR system finds a predefined word or group of words in the general stream of speech. Where can this be used? For example, in listening devices that switch on and start recording when certain words occur in speech, or in electronic help desks. Having received a free-form query, the system extracts the meaningful words and, once it has recognized them, returns the required information.
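The keyword-search idea is easy to sketch once the audio has been turned into a stream of recognized words (the hard part, which we take as given here). The query and keyword set below are invented examples:

```python
def spot_keywords(recognized_tokens, keywords):
    """Keyword spotting over a stream of recognized words: report each
    keyword hit with its position in the stream, ignoring everything else."""
    keyset = {k.lower() for k in keywords}
    return [(i, tok) for i, tok in enumerate(recognized_tokens)
            if tok.lower() in keyset]

# A hypothetical free-form query: only the meaningful words trigger a lookup;
# fillers like "well" and "uh" simply fall through.
hits = spot_keywords(
    "well uh what is the schedule for trains to boston".split(),
    keywords={"schedule", "trains", "boston"},
)
```

In a help-desk application, the extracted pairs would then be mapped to a database query (schedule + trains + destination), which is exactly the behavior described above.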
The size of the dictionary is an important characteristic of an ASR system. Obviously, the larger the dictionary, the higher the probability that the system will make a mistake. Many modern systems allow dictionaries to be supplemented with new words as needed, or new dictionaries to be loaded. A typical error rate for a speaker-independent system with isolated command pronunciation is about 1% for a 100-word dictionary, 3% for a 600-word dictionary and 10% for an 8000-word dictionary.
Various companies are represented on the market today. Let us consider some of them.
Aculab
Recognition accuracy is 97%.
A speaker-independent system. Its developers analyzed databases for many languages in order to account for all the variations of speech that arise with age, voice, gender and accent. Proprietary algorithms provide speech recognition regardless of the equipment (headset, microphone) and channel characteristics.
The system supports the ability to create additional dictionaries that take into account the features of pronunciation and accents. This is especially useful in cases where the system is used by people whose pronunciation is very different from the usual.
The system supports the most common languages: British and American English, French, German, Italian, North American Spanish. A dictionary can be configured in any of these languages, but several languages cannot be used simultaneously within a single dictionary.
The product is available on Windows NT / 2000, Linux, and Sun SPARC Solaris.
Babear SDK Version 3.0
A speaker-independent system that requires no training for a specific user. Adaptation to the user takes place during operation and yields better recognition results. Automatic adjustment to voice activity allows speech to be recognized in very noisy environments, such as in a car. Words not in the dictionary are not detected; keyword search is available. The system can be configured to work either with a small dictionary (isolated pronunciation of commands) or with a large one (connected speech).
The system supports the following languages: British and American English, Spanish, German, French, Danish, Swedish, Turkish, Greek, Icelandic and Arabic.
The system runs on Windows 98 (SE) / NT 4.0 / 2000 / CE, Mac OS X and Linux.
Loquendo ASR
A speaker-independent system optimized for telephony. It can recognize isolated words and connected speech and search for keywords (dictionary of up to 500 words). The large dictionary and the flexibility of the system make it possible to build user-friendly applications.
It supports 12 languages, including the most common European languages (Italian, Spanish, British and American English, French, German, Greek, Swedish, etc.).
It is part of the Loquendo Speech Suite product along with a text-to-speech system and Loquendo VoiceXML Interpreter, which supports the use of different voices and languages.
The system runs on MS Windows NT/2000, UNIX and Linux.
Lumenvox
A speaker-independent system that requires no training, although after adaptation to a specific user the results improve considerably: recognition accuracy exceeds 90%.
It supports various audio formats (u-law 8 kHz, PCM 8 kHz, PCM 16 kHz), has no strict hardware requirements, and runs on Windows NT/2000/XP and Linux.
System requirements (Windows based):
• Windows NT 4.0 with Service Pack 6a, Windows 2000 or Windows XP Pro;
• Intel Pentium III 800 MHz or higher;
• The minimum memory size is 512 MB.
System requirements (based on Red Hat Linux):
• Red Hat Linux 7.2;
• Intel Pentium III 800 MHz or higher;
• Memory size 256 MB;
• Disk size 17 MB (after decompression).
Nuance
According to the manufacturer, the system is optimized for minimal consumption of memory and other system resources. Recognition accuracy is up to 96% and remains high even in a noisy room.
The system is capable of self-training and adjusts to each user.
It runs on Windows 2000 and Linux.
The language can be anything: the dictionary is compiled to the client's specific requirements and includes whatever words, in whatever language, the client specified when configuring the system. A dictionary may even contain words from different languages, so that without changing any settings the system can recognize words in, say, both Chinese and Finnish, provided they were entered into the dictionary beforehand. Thus this system can work with any language, whereas other systems are limited to a specific set of languages.
SPIRIT
This automatic speech recognition system provides high-quality recognition even in very noisy environments. It can easily be configured for one of two modes: recognition of phrases with a fixed number of commands (isolated pronunciation, "PIN-code mode") or recognition of phrases with an arbitrary number of commands (continuous pronunciation, "connected-speech mode"). Keyword search is possible. The solution works under additive non-stationary noise; the required signal-to-noise ratio is down to 0 dB in PIN-code mode and down to +15 dB in connected-speech mode.
Recognition delay is 0.2 s. Acoustic channel parameters: bandwidth of 300-3500 Hz. Adaptation to the acoustic environment uses noise fragments with a total length of at least 3 s.
For PIN-code mode:
• dictionary: 50 commands;
• probability of correct recognition: 95-99% at SNR (signal-to-noise ratio) of 0-6 dB;
• required acoustic conditions: additive wide-band stationary noise with SNR >= 15 dB.
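Since the specifications above are all stated in terms of SNR, it may help to recall how the signal-to-noise ratio in decibels is computed from sample values. The sketch below assumes two separate recordings, one of clean signal and one of noise:

```python
import math

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels:
    10 * log10(mean signal power / mean noise power).
    An SNR of 0 dB means signal and noise have equal power."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(p_signal / p_noise)
```

For example, a signal with 100 times the noise power has an SNR of 20 dB, while equal powers give 0 dB, the hardest condition the PIN-code mode above claims to tolerate.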
For connected-speech mode:
• dictionary: 12 words/digits;
• probability of correctly recognizing a chain of words: 98-99%.
Specificity: adaptation to arbitrary noise.
SPIRIT's automatic speech recognition system is available as a PC application for MS Windows or as assembler code. At the customer's request, the solution can be ported to any DSP or RISC platform.
Voiceware
The system can work in both speaker-dependent and speaker-independent modes, so special training for a specific user is not required.
Provides high recognition accuracy and real-time operation, even in a noisy environment.
The system recognizes connected speech and sequences of numbers.
Words not in the dictionary and extraneous noise are not perceived, and meaningless filler words such as "uh" and "well" are discarded.
New words can be added to the dictionary.
The system automatically adjusts to the tone, pronunciation and other speech features of the user.
VoiceWare supports American English and Korean; Chinese and Japanese are in development.
The system runs on Windows 95/98 / NT 4.0, UNIX, and Linux.