Quite often, the topic of how a developer should construct alphabetical “spell-out” grammars, or how best to build an alphanumeric recognition grammar, is posed to the support team at Voxeo. Many a posting to our VoiceXML developer forums has touched on this subject, but until now we haven’t really delved into precise detail to explain exactly why this is such a challenge.
“Alphabetical recognition is a challenge?” you ask? You bet it is, if you want any semblance of accurate recognition results. And when we mix alpha characters and numeric characters within the same utterance string, we are really looking at a difficult grammar to tune to the point where it is usable.
So what’s the big deal, anyhow?
The inherent problem with spelled-input recognition is best illustrated by a simple anecdote:
Imagine that you are at a restaurant on a busy Friday evening, waiting for your table. In the lobby, people are chatting, children are cavorting about, and harried workers are trying to seat the flood of diners. At the same time, the friends joining you for dinner call to say that they are lost, and ask for directions to the restaurant. Amidst all the background chatter, clinking glasses, and the rest of the noisy distractions, how many times do you have to repeat “From I-95, get off at exit 76B, and then take a left at Montana Street” before your buddy accurately understands what you are saying? In this worst-case scenario, you may, at best, have to repeat yourself only once. And even if the restaurant were dead empty and as silent as a tomb, the chance of your pal mishearing “exit 76B” as “exit 763” or something similar is not just plausible, but highly likely.
The root of the problem with alpha grammars, and even more so with alpha-numeric grammars, is the staggering ambiguity between like-sounding matches: “B” sounds like “C”, sounds like “Z”, sounds like “E”, sounds like “three”, and “M” sounds like “N”, sounds like “ten”… you get the picture. And this is for a *single character match* only. To further illustrate the challenge we face, consider the fact that a 1-character alphabet grammar has only 26 possible results, but a 7-character grammar has 26^7 possibilities, which is over eight BILLION. As you can imagine, the number of possible results for an alpha string of arbitrary length is simply staggering.
Suggestions for alphabetical voice reco: Alternative Options
Firstly, constructing a user-defined alphabet grammar is something that we don’t recommend attempting for “spell anything” applications, as the plain, unvarnished truth is that today’s voice recognition technology is simply not up to the task. To be certain, ASR technology has improved dramatically over the past few years, but not so much as to let us spell, or say, just any old utterance and expect accurate match results. In many cases, a Statistical Language Model (SLM) grammar will do the job, assuming that you expect your callers to provide certain categories of input, such as a first name, a city name, or a state name.
While this isn’t the time or place to cover SLM grammars in depth, a brief summary should explain the strengths of these pre-compiled, pre-tuned grammars. In the context of spelling, SLM grammars are essentially designed to fill in the blanks when we have partial input, using predetermined logic that is tailored to the input context or category. For instance, assume that we have an SLM first-name grammar active (note that these are available when using the Prophecy + Nuance platform on the evolution.voxeo.com portal), and the caller’s spelled utterance reads like the string below, where unrecognizable utterance fragments are represented by question marks:
“C O R ? E L I ? S”
Using the pre-tuned logic that is part of the SLM grammar, the ASR will determine that there are no first-name matches reading “Coraelias”, “Corbelibs”, and so on. It will decide that the only first name matching this pattern, with some fragments of the utterance missing, is “Cornelius”. That is the gist of how SLM grammars work, and if your project allows you to use somewhat narrower categories for the utterances you want to recognize, then using a predefined SLM grammar, or even crafting your own SLM grammar, is a better way to go than trying to build a flat-file alphabetical SRGS file.
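To give a feel for how little markup this takes on the VoiceXML side, here is a minimal sketch of a field referencing a platform-provided SLM grammar. Note that the grammar URI `builtin:grammar/firstname` is a hypothetical placeholder; the actual URI or file name for an SLM grammar varies by platform, so check your platform documentation.

```xml
<!-- Sketch only: "builtin:grammar/firstname" is a hypothetical placeholder
     for a platform SLM grammar; the real URI is platform-specific. -->
<field name="first_name">
  <grammar src="builtin:grammar/firstname"/>
  <prompt>Please spell your first name.</prompt>
  <filled>
    <!-- the SLM grammar returns the whole name, not the raw letters -->
    <prompt>I heard <value expr="first_name"/>.</prompt>
  </filled>
</field>
```

The point is that the hard work of filling in missing letters lives inside the pre-tuned grammar itself, not in your VoiceXML document.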
One of the most common tasks for alphabetical grammars seems to be the capture of names or street addresses, and if this is your case, there is a very accurate add-on service that can handle the task rather nicely. The TargusInfo feature allows developers to access one or both of these services:
* Name & Address lookups based on Caller ID
* Pre-tuned name & full address grammars
These services are remarkable in terms of Caller ID-to-address accuracy, and the name and address grammars are top-notch, quite acceptable for full-scale enterprise deployments as well. The only caveats are that the service is limited to the United States, and that a per-transaction fee applies in a production capacity. However, we can honor developer requests to test-drive this service by allowing 30 or so hits at no charge. Developers interested in this service can log in to their evolution.voxeo.com accounts and create an account ticket requesting access to see just how good it is. And trust me on this one: you’ll be mightily impressed, and more importantly, so will your callers.
If you gotta do it…
In the event that neither the SLM grammar option nor the TargusInfo option fits the bill for your IVR project, you may well be forced to craft a flat-file alpha grammar using W3C-compliant SRGS/SISR syntax. If you do fall into this category, we can give you some advice on doing so, with the full disclaimer that Results May Vary, and that 100% recognition accuracy using this methodology is Science Fiction, at least for the time being.
* Start small by testing one-character strings so that you can tune and tweak utterance values in the grammar.
* Track user utterances and confidence scores via the “application.lastresult$” shadow variables, both for post-test analysis and as a basis for deciding what needs to be tuned.
* Leverage VXML 2.1 utterance recording via the “recordutterance” property, and save off all user recording data for post-call analysis.
* Flesh out utterance values by phonetically sounding them out: for instance, “a” could be represented by utterance values such as “a”, “ay”, and “eh”, all mapping to the same result.
* Try to get as broad a user base as possible for testing, else you run the risk of tuning your grammar to a small subset of user speech patterns. If you have but a single grammar tester who happens to have a Deep South accent, then the tuned grammar will likely not be much good to callers in New York, or our friends in the UK.
* After each round of changes that you apply to your grammars, test them thoroughly, analyze the results, and then test them again. Then test once more just to be sure of your results.
* Careful use of grammar weighting can really save the day for like-sounding characters. The chance of a user uttering “E” is much higher than “Z”, but be very careful when applying weights: it is possible to go overboard and weight your grammar too heavily in favor of one particular letter, which will then skew your recognition results and accuracy.
* Consider using n-best post-processing when overall recognition confidence scores fall below a certain threshold: it’s much better to take the extra step to get confirmed accuracy than to assume wrongly.
* For utterance strings that are fixed in length, implementing a mixed-initiative dialog can be an excellent tactic to cut down on the ambiguity that skyrockets as the string length grows. This can be a tricky project to get right, but it is one that is well worth the development effort.
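To make the phonetic-variant and weighting tips above concrete, here is a sketch of an SRGS/SISR fragment in which several spoken forms all map to the same letter, and in which like-sounding letters carry gentle weights. The weight values shown are illustrative starting points, not tuned numbers; only real-world testing will tell you what they should be for your callers.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of an alpha grammar rule: phonetic spellings collapse to one
     letter via SISR tags, and weights are only gently skewed. -->
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US"
         version="1.0" root="letter" tag-format="semantics/1.0">
  <rule id="letter" scope="public">
    <one-of>
      <!-- "a", "ay", and "eh" all yield the same semantic result -->
      <item>a <tag>out="A";</tag></item>
      <item>ay <tag>out="A";</tag></item>
      <item>eh <tag>out="A";</tag></item>
      <!-- "e" is far more common in spelled input than "z", so it gets
           a modest boost; overdoing the skew will hurt accuracy -->
      <item weight="1.2">e <tag>out="E";</tag></item>
      <item weight="0.8">z <tag>out="Z";</tag></item>
    </one-of>
  </rule>
</grammar>
```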
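The shadow-variable and utterance-recording tips can be combined in one small VXML 2.1 sketch. The grammar file name `alpha.grxml` and the server endpoint `save_audio.php` are hypothetical names used purely for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch: log recognition details and save the caller's audio for
     post-call tuning. alpha.grxml and save_audio.php are placeholders. -->
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <property name="recordutterance" value="true"/>
  <form id="spell">
    <field name="letter">
      <grammar src="alpha.grxml" type="application/srgs+xml"/>
      <prompt>Say the first letter.</prompt>
      <filled>
        <!-- shadow variables: what was heard, and how confident the ASR was -->
        <log>utterance=<value expr="application.lastresult$[0].utterance"/>
             confidence=<value expr="application.lastresult$[0].confidence"/></log>
        <!-- the caller's actual audio, captured via recordutterance -->
        <var name="callerAudio" expr="application.lastresult$.recording"/>
        <submit next="save_audio.php" namelist="callerAudio"
                method="post" enctype="multipart/form-data"/>
      </filled>
    </field>
  </form>
</vxml>
```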
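And here is a sketch of the n-best suggestion: keep a few hypotheses around, and when the top confidence score is low, fall back to confirmation rather than assuming. The 0.60 threshold is an illustrative value that you would tune per application, and the confirmation flow here is deliberately simplified.

```xml
<!-- Sketch: retain three recognition hypotheses; when confidence on the
     best one is low, offer the runner-up before re-collecting the field.
     The 0.60 cutoff is an assumed, tunable value. -->
<property name="maxnbest" value="3"/>
<field name="letter">
  <grammar src="alpha.grxml"/>
  <filled>
    <if cond="application.lastresult$[0].confidence &lt; 0.60
              &amp;&amp; application.lastresult$.length &gt; 1">
      <prompt>Did you say
        <value expr="application.lastresult$[0].utterance"/> or
        <value expr="application.lastresult$[1].utterance"/>?</prompt>
      <!-- clear the field so the form re-collects it on the next pass -->
      <clear namelist="letter"/>
    </if>
  </filled>
</field>
```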
Next TechTip: In the next certified tech tip from the Voxeo support team, we will illustrate our last suggestion in detail. That’s right: we will take on the task of posting and dissecting a mixed-initiative dialog, and the associated alphanumeric grammar, that can accept Canadian postal code input. As we stated before in no-nonsense terms, this is possibly one of the hardest, if not *the* hardest, things that a developer can attempt to do reliably. But as you are well aware, the Voxeo team is quite fearless, and doesn’t respect the concept of “impossible”.
Till next time,
Director of Customer Support
Related links:
* Statistical Language Model Grammars
* Nuance Grammar Developer's Guide
* Mixed-Initiative Dialog Tutorial
* SRGS Grammar Specification: Grammar Weighting
* SISR Grammar Specification
* VXML 2.0 Specification: The lastresult$ Array
* VXML 2.1 Specification: Utterance Recording