I have often been asked what PLS (the Pronunciation Lexicon Specification) is and why it exists, so I thought it would be worth reviewing why web-based standards are a good thing for the voice/speech industry and then looking at PLS and how it fits.
Why are today’s voice standards web-based?
W3C began creating voice markup languages in 1999, when it started the work that would eventually lead to VoiceXML 2.0 and 2.1. Two groups were interested in this: W3C itself and the speech recognition and synthesis industry.
W3C was interested because of a strong desire for the content of the Web to be accessible to “anyone, anywhere, anytime, using any device” (see W3C’s Ubiquitous Web Domain). Because of the broad success of HTML as a language for creating visual user interfaces, it seemed logical to extend that notion to the creation of auditory user interfaces (voice interfaces) that would work well with the various languages developed by W3C for representing content.
The speech recognition and synthesis industry (voice industry) was also interested in standardization. Before VoiceXML and its related languages, including PLS, each vendor of speech recognition and synthesis technology had its own proprietary interface for controlling the recognizer or synthesizer. This slowed overall adoption of voice technologies because a) authors of voice applications had to learn multiple APIs and b) the differences from one API to another made it difficult to switch vendors.
One of the most amazing benefits of VoiceXML and its related markup languages was the introduction of the web model of programming. Just as with HTML and the World Wide Web, application files could be distributed around the world. With HTML, a visual browser on your desktop computer converts the markup into text for you to read and buttons for you to click; with VoiceXML, a voice browser turns VoiceXML pages into spoken text and listens to what you say. The primary implementation difference is that the voice browser lives in a computer network rather than on your desktop, and it is accessed via the phone.
Because of this XML-based language (VoiceXML) and the web development model, companies adopting voice technology could now make use of their existing web infrastructure for document caching, integration with business logic and back-end databases, and server reliability and availability. They could also draw on the growing number of programmers familiar with the web programming model and markup languages such as XML and HTML.
Where does PLS fit?
Let’s start from the top down. VoiceXML is a markup language for developing voice applications. VoiceXML makes use of speech recognition and speech synthesis.
Before going further, I need to briefly explain how a speech recognizer works:
The speech recognizer makes use of a grammar, a lexicon (or dictionary), and acoustic models. The grammar is a file that lists what words to listen for, in what order — for example, “I am flying from Boston”.
The lexicon (or dictionary) is a file that describes how each legal word is pronounced – that’s how it knows that “B o s t o n” is pronounced “Boston” and not “Poughkeepsie”.
The acoustic models describe the mapping between pronunciation symbols and the actual sounds that we hear — one model for “ae”, one for “k”, one for “uh”, and so on.
So when a speech recognizer listens to someone speaking, it uses all three of these pieces of information to convert the sounds the person makes into a set of pronunciation symbols, from those symbols to words, and from the words to a sentence. While acoustic models are a closely guarded secret that differentiates one speech recognition vendor from another, the other two pieces of information are a bit easier to standardize.
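To make the three pieces concrete, here is a minimal sketch in Python of the last two steps: going from pronunciation symbols to words (the lexicon) and from words to a sentence (the grammar). It assumes the acoustic models have already turned the audio into symbols. The symbol names, the toy lexicon and the helper functions are all invented for illustration and do not come from any real recognizer.

```python
# Toy lexicon: word -> pronunciation symbols (symbols are made up for this sketch).
LEXICON = {
    "I": ["ay"],
    "am": ["ae", "m"],
    "flying": ["f", "l", "ay", "ih", "ng"],
    "from": ["f", "r", "ah", "m"],
    "Boston": ["b", "ao", "s", "t", "ah", "n"],
}

# Toy grammar: the word sequences the recognizer is listening for.
GRAMMAR = {("I", "am", "flying", "from", "Boston")}


def symbols_to_words(symbols, lexicon):
    """Return one segmentation of the symbol stream into lexicon words, or None."""
    if not symbols:
        return []
    for word, pron in lexicon.items():
        n = len(pron)
        if symbols[:n] == pron:
            rest = symbols_to_words(symbols[n:], lexicon)
            if rest is not None:
                return [word] + rest
    return None


def recognize(symbols):
    """Map symbols to words with the lexicon, then accept only grammar-legal sentences."""
    words = symbols_to_words(symbols, LEXICON)
    if words is not None and tuple(words) in GRAMMAR:
        return " ".join(words)
    return None


# Pretend the acoustic models produced this symbol stream from the audio.
heard = ["ay", "ae", "m", "f", "l", "ay", "ih", "ng",
         "f", "r", "ah", "m", "b", "ao", "s", "t", "ah", "n"]
print(recognize(heard))
```

A real recognizer scores many competing hypotheses probabilistically rather than looking for one exact match, but the division of labor between lexicon and grammar is the same.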
W3C already has a standard for specifying grammars, called the Speech Recognition Grammar Specification, or SRGS. The new Pronunciation Lexicon Specification “fills the gap” by providing a standard way to create pronunciation dictionaries.
Why were pronunciation dictionaries non-standard?
I alluded to this above. Since acoustic models were (and still are) private, before there was a standard way to specify grammars, each speech recognition vendor had its own dictionary format, its own language for specifying how words were to be pronounced. Often vendors used different pronunciation symbol sets, since each vendor’s symbol set was designed to match its private set of acoustic models. For example, the vowel sound in “cat” could be represented using the symbol “ae” or “aaa”, or anything else a vendor wanted.
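As a rough illustration of the symbol-set problem, the sketch below writes “cat” in two made-up vendor notations and translates both into a shared alphabet, here the International Phonetic Alphabet (IPA). Only the symbols “ae” and “aaa” come from the example above; the vendor tables and the `to_ipa` helper are hypothetical.

```python
# Hypothetical vendor symbol sets, each mapped to IPA. Only "ae" and "aaa"
# come from the text; the rest of each table is invented for illustration.
VENDOR_A_TO_IPA = {"k": "k", "ae": "æ", "t": "t"}
VENDOR_B_TO_IPA = {"k": "k", "aaa": "æ", "t": "t"}


def to_ipa(symbols, table):
    """Translate a vendor-specific pronunciation into an IPA string."""
    return "".join(table[s] for s in symbols)


cat_a = to_ipa(["k", "ae", "t"], VENDOR_A_TO_IPA)
cat_b = to_ipa(["k", "aaa", "t"], VENDOR_B_TO_IPA)

# Both vendor spellings describe the same sounds once a common alphabet is used.
print(cat_a, cat_b, cat_a == cat_b)
```

Without a common alphabet, a dictionary written for one vendor cannot be read by another, even when both describe exactly the same sounds.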
There’s another reason too.
Speech synthesizers use pronunciation lexicons as well, but with slightly different formats. In brief, here’s how a synthesizer works:
A speech synthesizer converts written text into sounds to be spoken. To do this it uses an SSML document and one or more dictionaries (lexicons). The Speech Synthesis Markup Language (SSML) is a language that allows an author to change how text is spoken — for example, by marking some text as sentences and some as paragraphs, by telling the synthesizer when to change voices, or even by telling the synthesizer exactly how to pronounce a certain word. The lexicon documents used by a synthesizer, just like for a speech recognizer, describe how words are to be pronounced.
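For readers who have not seen SSML, here is a small sketch of what such a document can look like, wrapped in a few lines of Python that parse it with the standard library. The sentence text and the IPA transcription of “Boston” are illustrative assumptions; the element names (`speak`, `p`, `s`, `voice`, `phoneme`) and the namespace are those defined by SSML 1.0.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"

# An illustrative SSML document: one paragraph, two sentences,
# a voice change, and an explicit pronunciation for one word.
SSML = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <p>
    <s>I am flying from <phoneme alphabet="ipa" ph="ˈbɔstən">Boston</phoneme>.</s>
    <s><voice gender="female">Have a pleasant trip.</voice></s>
  </p>
</speak>"""

root = ET.fromstring(SSML)
ns = {"ssml": SSML_NS}

# Count the marked-up sentences and collect any inline pronunciation overrides.
sentences = root.findall(".//ssml:s", ns)
overrides = {el.text: el.get("ph") for el in root.iter(f"{{{SSML_NS}}}phoneme")}

print(len(sentences), overrides)
```

The inline `phoneme` element handles a single occurrence of a word; a lexicon document does the same job for every occurrence at once.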
So there were these two reasons for differing pronunciation dictionaries (lexicons): different vendors used different pronunciation symbols, and recognizers and synthesizers used slightly different formats.
How does PLS help?
Above I described why there were differences in the pronunciation formats before PLS. What PLS does is this:
- First, it provides a single, standard XML-based language for describing pronunciations, both for speech recognizers and for speech synthesizers.
- Second, it requires support for IPA, the International Phonetic Alphabet. This alphabet is a standard symbol set for representing pronunciations of all the languages of the world.
With PLS it is now possible to write one lexicon document that can be used by any speech recognizer and/or any speech synthesizer that supports it. One document for all of your pronunciations, independent of your voice technology vendor.
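Here is a minimal sketch of such a lexicon document, together with a few lines of Python that read it into a word-to-pronunciation table. The element names (`lexicon`, `lexeme`, `grapheme`, `phoneme`), the `alphabet="ipa"` attribute and the namespace come from PLS 1.0; the two entries, their IPA transcriptions and the `load_pronunciations` helper are my own illustrative assumptions.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

# An illustrative PLS document with two entries (transcriptions are approximate).
LEXICON = """<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>Boston</grapheme>
    <phoneme>ˈbɔstən</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>Poughkeepsie</grapheme>
    <phoneme>pəˈkɪpsi</phoneme>
  </lexeme>
</lexicon>"""


def load_pronunciations(pls_text):
    """Parse a PLS document into {written form: [pronunciations]}."""
    root = ET.fromstring(pls_text)
    ns = {"pls": PLS_NS}
    table = {}
    for lexeme in root.findall("pls:lexeme", ns):
        phonemes = [p.text for p in lexeme.findall("pls:phoneme", ns)]
        # A lexeme may list several spellings; each gets all of its pronunciations.
        for grapheme in lexeme.findall("pls:grapheme", ns):
            table.setdefault(grapheme.text, []).extend(phonemes)
    return table


pronunciations = load_pronunciations(LEXICON)
print(pronunciations["Boston"])
```

In practice you would not parse the file yourself; the recognizer or synthesizer loads it. The point is that the same file can be handed to either one.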
Although it’s still new at this point, I believe this specification will be widely supported in a couple of years.