How one team turned the dream of speech recognition into a reality

When Françoise joined Google in 2005, speech recognition was an experimental field, still in its first decade of using statistical modeling to assess the accuracy of translation. "We started working on speech recognition at Google as a small research effort. You have to remember that this was before smartphones or any of these kinds of technologies were on the market, so the potential of speech recognition the way it is used now was not quite as obvious yet."

But Google's ability to quickly process large data sets via MapReduce, combined with new data collection methods and the need to provide voice search on mobile phones, pushed this 'small research effort' front and center, fast.

The Google speech team launched its first smartphone app in 2008, followed by the desktop-based Voice Search in June 2011, supporting only American English queries at first. Their original algorithms to interpret voice input into text were trained from models based on speech patterns from GOOG-411, a quirky experimental speech recognition product launched in 2007 to look up phone numbers in the United States or Canada.

Over the next two years, the team expanded Voice Search to support 50 additional languages so that millions of people all over the world could use this tool. Using the same software structure, they built different statistical models and provided language-specific training for each model based on language-specific anonymized word databases collected from Google products. For example, anonymized text data from Google.com.br would be used to train the Brazilian Portuguese language model, while data from Google.com.mx would be used for Mexican Spanish.

Yet soon a new problem surfaced in the speech recognition models: with increasing frequency, Voice Search started returning recurring nonsense words instead of what users had said. "We saw these odd words popping up in several languages. I think the first one we noticed was in Korean and it stood out immediately because it was actually in Latin characters: 'pp'. And what could that possibly be, something like pfpf, right? Or, the term 'kdkdkd' started coming up in English, which didn’t make ANY sense."

The nonsense words were Mystery One, quickly followed by Mystery Two: the speech recognizer started misspelling common words more and more frequently. For instance, users would say 'probably', and the output would be rendered as 'probly.' As a native French speaker, Françoise was curious about the French recognition model output as well. "I was horrified. It didn’t look like French… the results often looked like something written by someone who was either totally ignorant of French or maybe had never learned how to write. I specifically noticed that the speech system had problems with hyphenated words, which are quite common in French. There are a lot of small town names such as 'Châlons-en-Champagne', which would appear as something like 'Chalonsenchampagne.' Common French first names such as 'Jean-François' would be outputted as 'Jeanfrancois'. That is, of course, if it properly identified the phonemes in the first place, which it often didn't. So you would probably get something more like 'Chalossechempain' or 'Jeafracua'."

The recognition system has three separate models: the acoustic model, the pronunciation model, and the language model. These three models are trained separately, but are then composed into one gigantic search graph. Essentially, speech recognition is taking an audio waveform, pushing it through this search graph, and letting it find the path of least resistance, that is, the word sequence that has the maximum likelihood.
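In the standard textbook formulation, which the article itself does not spell out, this maximum-likelihood search can be sketched as a single expression. The symbols X (the audio), Q (a phoneme sequence), and W (a word sequence) are notation introduced here for illustration only.

```latex
\hat{W} = \arg\max_{W} \; \max_{Q} \; P(X \mid Q)\, P(Q \mid W)\, P(W)
```

Read left to right, P(X | Q) plays the role of the acoustic model, P(Q | W) the pronunciation model, and P(W) the language model. Taking the best phoneme sequence Q, rather than summing over all of them, is a common approximation and is assumed here.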

The acoustic model

The acoustic model identifies the phonemes that are most likely to be present in an audio sample. It takes a waveform, chunks it into small time-segments, implements a frequency analysis, and outputs a probability distribution over all the triphone-states for that particular input (triphone-states model the slight variances in a phone’s waveform based on whether it’s at the beginning, middle, or end of a sound sequence in addition to factoring in variance based on the context of other proximal phones). The waveform frequency vector, matched with a probability distribution, thus identifies which phonemes are more likely than others to be contained in that audio sample, thereby delivering a sequence of phonemes that exist in that input over time.

Example: the user says "Spa", and the acoustic model narrows down its guesses about the phones it heard. The probable phones for the first sound are S or F, for the second P or B, and for the third A or E. It assigns likelihoods: a 60% chance of S, a 40% chance of F, and so on.
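To make the example concrete, here is a minimal Python sketch of the kind of output described above: a probability distribution per sound, expanded into ranked candidate sequences. Only the 60/40 split for the first sound comes from the example. Every other number, and the names `posteriors` and `ranked_sequences`, are invented for illustration. A real acoustic model works on triphone-states over many short time frames, not on three letter-like slots.

```python
from itertools import product

# Hypothetical per-position phone probabilities for the utterance "spa".
# Only the 60/40 split in the first slot comes from the article's example;
# the remaining numbers are made up for illustration.
posteriors = [
    {"s": 0.6, "f": 0.4},
    {"p": 0.7, "b": 0.3},
    {"a": 0.8, "e": 0.2},
]


def ranked_sequences(posteriors):
    """Return every phone sequence with its probability, most likely first.

    Positions are treated as independent, which is a simplification.
    """
    scored = []
    for combo in product(*(slot.items() for slot in posteriors)):
        phones = tuple(phone for phone, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        scored.append((phones, prob))
    return sorted(scored, key=lambda item: item[1], reverse=True)


candidates = ranked_sequences(posteriors)
for phones, prob in candidates[:3]:
    print("-".join(phones), round(prob, 3))
```

With these made-up numbers, "s-p-a" ranks first, but "f-p-a" and "s-b-a" are still in the running, which is why the later models are needed to rule them out.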

The pronunciation model

Simultaneously, the pronunciation model kicks into effect. Effectively, it takes the phonemic probability distributions from the acoustic model, checks them against a massive lexicon defining valid sequences of phonemes for the words of a specific language, and restricts the possible phoneme sequences to ones that make sense in that language.

The acoustic model has identified some likely phonemes. The pronunciation model says that "s-p" is a likely combination in English, but "s-b", "f-b", and "f-p" are not found in the English lexicon. The combinations "S-p-a" and "S-p-e" are both found in the English lexicon, while "S-b-a", "F-b-a", and "F-p-a" are eliminated.
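The filtering step in this example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the team's implementation: the two-entry `LEXICON`, the word "spay" chosen to stand for "S-p-e", the candidate probabilities, and the function name `filter_by_lexicon` are all hypothetical.

```python
# Toy lexicon mapping phone sequences to words. The entries are
# illustrative assumptions, not real dictionary data.
LEXICON = {
    ("s", "p", "a"): "spa",
    ("s", "p", "e"): "spay",
}


def filter_by_lexicon(candidates, lexicon):
    """Keep only phone sequences that spell a known word, then renormalize."""
    kept = [(lexicon[phones], prob) for phones, prob in candidates if phones in lexicon]
    total = sum(prob for _, prob in kept)
    if total == 0:
        return []
    return [(word, prob / total) for word, prob in kept]


# Made-up acoustic candidates: (phone sequence, probability).
candidates = [
    (("s", "p", "a"), 0.336),
    (("f", "p", "a"), 0.224),
    (("s", "b", "a"), 0.144),
    (("f", "b", "a"), 0.096),
    (("s", "p", "e"), 0.084),
]

words = filter_by_lexicon(candidates, LEXICON)
for word, prob in words:
    print(word, round(prob, 2))
```

Sequences such as "f-p-a" disappear even though the acoustic scores alone ranked them highly, because no entry in the lexicon supports them.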

That list can’t possibly hold all the valid phonemes of all the words for every language in the world, so the model has a statistical engine which can instantly generate estimated pronunciations based on an (also estimated!) orthography for words that it has never seen. It tries to match this with words on the list, unless it is sure it heard a distinct new word, in which case it will learn the new word. These new words typically have a very low probability score, so they are almost never picked again unless it is the closest match out of all possible words in the lexicon.

Someone searches for "Hobbit", and the phones "H-o-b-i-t" are identified. The model is pretty sure it heard "h-o-b-i-t", not "h-a-b-i-t". This matches possible phonemic combinations in English (unlike something like "Hbpit", which isn't possible), so it OKs the new word. It tries to guess the pronunciation/orthography of this new word, "Hobbit". Now, if others search for Hobbit, the model will find Hobbit. But if someone searches for Hpbit, it will also find Hobbit, since that is the most similar word that is valid in the lexicon.

The language model

Combining the acoustic and pronunciation models, we have audio coming in and words coming out. But that's not quite specific enough to provide reliable Voice Search, because you cannot just string any word together with any other word: there are word combinations that are more reasonable than others. Enter the language model, the third component of the recognition system. It calculates the frequencies of all word sequences from one to five words long and thereby constrains the possible word sequences that can be formed out of the two aforementioned models to ones that are sensible combinations in the language. The final search algorithm will then pick the valid word sequence that has the highest frequency of occurrence in the language.

Example: the user says, "My dog ran away". The acoustic and pronunciation models identify various valid possibilities: My or Mai, Dog or Dock, Ran or Ram, Away or A Whey. The language model looks at the combinations and figures that it has seen "my dog ran away" much more frequently than "Mai dock ram a whey" or "my dock ran away," so it constrains the output to that combination.
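The same example can be sketched as a toy frequency-based language model in Python. This is a deliberately crude illustration: it counts only two-word sequences (the system described above counts sequences of up to five words), and the `CORPUS`, the scoring rule, and all names are assumptions made for the sketch.

```python
from collections import Counter
from itertools import product

# Tiny made-up corpus standing in for the anonymized query logs.
CORPUS = [
    "my dog ran away",
    "my dog ran away today",
    "my dog ran home",
    "the boat left the dock",
    "mai tai recipe",
]


def bigram_counts(corpus):
    """Count adjacent word pairs, with <s> marking the start of a query."""
    counts = Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split()
        counts.update(zip(words, words[1:]))
    return counts


def score(words, counts):
    """Sum of bigram counts along the sequence (higher means more familiar)."""
    padded = ["<s>"] + list(words)
    return sum(counts[pair] for pair in zip(padded, padded[1:]))


counts = bigram_counts(CORPUS)

# Alternatives per position, as handed over by the first two models.
slots = [["my", "mai"], ["dog", "dock"], ["ran", "ram"], ["away", "a whey"]]
candidates = [" ".join(choice).split() for choice in product(*slots)]

best = max(candidates, key=lambda words: score(words, counts))
print(" ".join(best))  # my dog ran away
```

Note that the model has no notion of correct spelling: whatever appears most often in the corpus wins, which matters for what follows.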

These three models create a huge search graph, through which waveforms can be pushed to create near instantaneous text output. So where, in this massive and complex system, would we find the reason for the rise in obvious misspellings? And is that different from what's causing the rise in nonsense words?

Google is very flexible, it understands users even if they mistype.

The models are all built on a very large corpus of data—e.g. all searches typed into YouTube or Google Maps, or simply entered into Google.com. Let's pause for a moment and reflect on the grammar, spelling and sentence structure of most typed search queries.

Françoise explains: "Google is very flexible, it understands users even if they mistype, so many people don't bother spelling correctly or typing special characters like hyphens." This is great for users, but not so great for the language database, which gets much of its data from statistical analyses of anonymized aggregate search queries. If a misspelled word appears with enough frequency, the probability of the recognition model selecting, approving, and pairing the misspelled version of a word with the phonemic sequences grows higher and higher. When the audio input’s phonemic sequence ("Hobbit") and orthographical representation do not have any high-probability matches in the word list, the speech recognition system broadens its search to find a similar phonemic sequence (the aforementioned hpbit, or a misspelled version of "habit" such as "hobit"), and if that misspelled term appears often enough in the search logs, it can be chosen by the system. On top of this, people often swallow certain syllables or phonemes when they say words (they 'probably' say 'probly'). Thus, the pronunciation component is mapping mispronounced phonemic sequences with higher and higher frequency. At some point these mispronounced versions are frequent enough to become the default, dismissing the correct pronunciation as error or noise. Ergo, the recognition system has a mispronounced word that vaguely matches a misspelled word, and happily outputs these nonsense terms instead of what the user meant.

Trains and retraining the recognition system

And what about 'pp' and 'kdkdkd', which aren't explained by mispronunciations? It turns out these nonsense terms were being caused by background noise. "So for instance, with 'kdkdkd' we found in our data noise samples what sounded like the sound of a train on the tracks and then found that the audio segments were often occurring while people were in trains," said Françoise. "We realized that these must have been the train tracks whose faint, low-quality audio samples were picked up by the system and identified phonemically as 'kdkdkd.' The recognition model then tried to match this to a word, and came up with a low probability term 'kdkdkd' from some random series of nonsense typos that it may have logged at some point, which it added to the lexicon. The language model probably rejected this initially, but with the feedback loops from recognition result logs, it slowly began seeing 'kdkdkd' as a legitimate word with high enough frequency to actually pick it." Speech recognition errors were originating from a potent combination of user typos, machine learning, and pronunciation model-forced matches to background noise which, over time, trained the algorithms with misspelled words and bad pronunciation.
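The feedback loop described here can be illustrated with a small deterministic simulation in Python. It is a sketch under invented assumptions, not a model of the real system: the starting counts, the number of queries per round, and the `acoustic_bias` parameter (a made-up stand-in for how much better the junk form matches noisy or swallowed audio) are all hypothetical.

```python
def simulate_feedback(correct=900.0, junk=100.0, rounds=5,
                      queries_per_round=1000.0, acoustic_bias=2.0):
    """Track the junk term's share of the log as outputs are fed back in.

    Each round, the recognizer picks the junk term in proportion to its
    log frequency, weighted by acoustic_bias. Its outputs are then added
    to the log that the next round is based on.
    """
    shares = []
    for _ in range(rounds):
        weighted_junk = junk * acoustic_bias
        p_junk = weighted_junk / (weighted_junk + correct)
        junk += queries_per_round * p_junk
        correct += queries_per_round * (1.0 - p_junk)
        shares.append(junk / (junk + correct))
    return shares


shares = simulate_feedback()
for round_number, share in enumerate(shares, start=1):
    print(round_number, round(share, 3))
```

In this toy setup the junk term starts at 10% of the log and its share rises every round, because each round's output becomes part of the next round's training data. Setting `acoustic_bias` to 1.0 keeps the share roughly flat, which suggests why the combination of noisy audio and logged output is what drives the drift.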

"The pronunciation model finds those butchered words with higher and higher frequency, such that the language model now sees them as having high probability AND matching the acoustic input," said Françoise. "And now, everything lines up and the system very happily keeps outputting junk, which is then logged again, which is then fed back into the training for the model etc. etc. etc."

"Breaking those feedback loops has been a tremendous engineering effort, because we have to go through the entire system to figure out exactly where these points of failure are. Plus, we have to find smart ways of initializing all of these models and fixing the errors without losing any performance or speed. I definitely don’t want to wait two seconds after saying something for my phone to transcribe it, would you?"

While the scale of available data was crucial at the start of the project to begin the kind of statistical matching necessary for accuracy, in this case the sheer size of the corpus worked against Françoise and the team. "It is extremely hard to fix these errors, because in a field like speech recognition where everything is statistical, we often use previous models, run them through the data, and get better and better data. Then, we fold this data into the next generation model. So at some point, being able to say, 'I just want to throw away everything I've learned and restart from scratch', that's really hard. I mean, it would be like saying, in the context of a search engine, I'm gonna throw away the whole index because I've cleaned up something. But we are doing it, extirpating errors every time they appear now that we finally discovered the root of the problem."

Each challenge solved ads/adds/adze to/too/two what we no/know

Yet more challenges remain to be solved for Françoise and her team. Speech recognition is already in use on products such as Google Search, Google Now, Android, and Google Voice, but just a short distance away from Françoise's desk, teams are incorporating voice as the primary mode of device interaction for groundbreaking technologies such as Google Glass, Google Watch, and Self-Driving Cars. This brings new challenges: "In Google Glass the microphone is no longer in front of your mouth and maybe you're walking in the street and there's a whole bunch of noise getting in there," she explains. "It will not be able to find good results using our current algorithms, whose database of sounds and words are trained on data from front-facing microphone inputs. Likewise, if you talk to a self-driving car or Google Watch, the microphone is much further away and in a totally different position. We will need to find new ways of adapting our speech recognition algorithms to these environments."

And for Françoise, the size and scope of these challenges match her passion for speech recognition and machine learning at Google. "Speech recognition is a fantastic technology, it's a toy store for me. Even though I've been working on it for more than 20 years, I still have fun every single day when I walk in because there’s always a new problem to think about and to solve. And of course, witnessing a technology you worked on from the beginning blow up the way speech recognition has and seeing it affect millions of people every day... that's a big satisfaction."

Yet what makes her so excited to be working on her passion at a place like Google? "Speech recognition, to work well, requires a lot of data and a lot of machines. And Google is one of the only places where these two things really come together on such a large scale. The whole chain from the product to the users and that constant feedback provides a pace for machine learning and language modeling that is unprecedented. I don't think any other place would be in such a position to do that."

If I take my ambitious long term goal, which is to fix speech recognition and have it working really well for everybody on Earth, I think we can get there. I cannot imagine another place that could do it quite as well as Google.

Françoise is a Research Scientist at Google, where she heads a team of software engineers, researchers, and linguists working on the next frontiers of speech recognition technology.
