This post is part of a series exploring voice applications and VoiceXML through the eyes of a web developer. For the rest of the series, see the index.
If you want to follow along with these examples, you should create a free VoiceXML hosting account in Evolution. Complete instructions were in the first installment of the series.
Last time out, I created a simple Hello World VoiceXML app that answers an incoming call and speaks some text. Now what if we want to add some interactivity and let the caller talk to the application?
Unlike some of the telephony services out there, Voxeo performs speech recognition. Our engine allows someone to punch buttons on their touch tone keypad (known as DTMF, for Dual Tone Multi-Frequency) or to speak to the application using natural language. Why ask your customers to listen to a menu of pizza toppings and remember which number to press when you can just let them say the names of the toppings?
Throughout this series, I’m building an application for Strato Pizza, a fictional pizza chain. In this installment, I’ll ask the caller which topping they’d like. For now, I’m only letting them order a one-topping pizza. Then I’m going to hang up.
The first step in adding either voice recognition or DTMF input is to add an input field to your document. In HTML, if you want your user to give you information, you use input tags inside a form tag. In VoiceXML, you use <field> elements inside a <form> element. Fields have names and, just like in HTML, you can use those field names to get the values input by the caller. The field name must be a valid JavaScript variable name (so no spaces or dots in the name), and cannot start with an underscore (“_”) or end in a dollar sign (“$”).
Here’s what my form field looks like for asking the caller which pizza topping they’d like.
<form>
  <field name="topping">
    What topping would you like on your pizza?
  </field>
</form>
In my first application, I used <prompt> to speak the text and had to put that prompt element inside a <block> element. Here, I don’t need a block element, because form fields can live directly inside forms. I also don’t need to use prompt – the contents of my field will be spoken to the user and then the application will wait for their response.
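For comparison, here is a minimal sketch of the same field with the question wrapped in an explicit <prompt> element. It should behave the same way as the bare text version. The explicit form is mainly useful when you want to set prompt attributes such as bargein, which is not needed here.
<form>
  <field name="topping">
    <!-- Equivalent to placing the text directly inside the field -->
    <prompt>
      What topping would you like on your pizza?
    </prompt>
  </field>
</form>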
For speech recognition to work, I need to provide a list of what the caller is likely to say using a grammar. The grammar lets the speech recognition engine pick out what the user said. Essentially I’m telling the recognition engine which words and phrases to listen for.
A grammar can have a list of single words, can allow compound words (like “extra cheese”), and can even have synonyms so it understands that Ham and Canadian Bacon are the same thing.
Grammars go inside the body of a <grammar> element. Because you might be using reserved XML characters in your grammar, it’s a good idea to place this inside a CDATA section. The attribute type specifies the MIME type of the grammar file and is required. Grammar file? That sounds like I can use an external file for my grammars. I’ll look into external files in a later installment of this series. For now, I’m using an inline grammar with a type of text/gsl.
<grammar type="text/gsl">
  <![CDATA[
    ;Lines starting with a semicolon are comments.
    ;Match one of the enclosed terms
    [
      ;Terms are separated by a space
      pepperoni olives sausage anchovies
      ;They can also be on separate lines.
      ;Each line is recognized as a separate term
      onions
      peppers
      ;Parentheses require all of the enclosed terms
      ;to be matched. A logical AND
      (extra cheese) (roasted garlic)
      ;Square brackets are the same as OR
      [mushrooms portobello]
      ;You can mix AND & OR together
      [ham (canadian bacon)]
    ]
  ]]>
</grammar>
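GSL is not the only grammar format. As a point of comparison, here is a rough sketch of a few of the same toppings written in the W3C's XML grammar format (SRGS) instead. This assumes your platform accepts inline SRGS grammars. The rule name "topping" and the shortened topping list are only for illustration.
<grammar type="application/srgs+xml" root="topping" version="1.0" xml:lang="en-US">
  <rule id="topping">
    <!-- one-of plays the role of GSL's square brackets -->
    <one-of>
      <item>pepperoni</item>
      <item>olives</item>
      <!-- A multi-word item plays the role of GSL's parentheses -->
      <item>extra cheese</item>
      <item>ham</item>
      <item>canadian bacon</item>
    </one-of>
  </rule>
</grammar>
The rest of this post sticks with the GSL version above.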
This grammar applies only to the pizza toppings field, so I’m putting the grammar element inside the “topping” field. There are other places it can go, but I’ll show those in a later installment. Putting these together, you get:
<form>
  <field name="topping">
    What topping would you like on your pizza?
    <grammar type="text/gsl">
      <![CDATA[
        ;Lines starting with a semicolon are comments.
        ;Match one of the enclosed terms
        [
          ;Terms are separated by a space
          pepperoni olives sausage anchovies
          ;They can also be on separate lines.
          ;Each line is recognized as a separate term
          onions
          peppers
          ;Parentheses require all of the enclosed terms
          ;to be matched. A logical AND
          (extra cheese) (roasted garlic)
          ;Square brackets are the same as OR
          [mushrooms portobello]
          ;You can mix AND & OR together
          [ham (canadian bacon)]
        ]
      ]]>
    </grammar>
  </field>
</form>
Now when someone calls, they can speak their topping and the application will understand it – as long as their topping fits within the grammar I’ve defined. There’s single word toppings like “pepperoni” and “onions” as well as multiple word toppings like “extra cheese.” Because I’ve put parentheses around “extra cheese” the recognizer won’t match if the caller says simply “cheese”. Callers have a tendency to say things you might not expect, like asking for “canadian bacon” instead of just “ham”, so the grammar can handle synonym terms as well.
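One quick way to check what the recognizer heard is to read the field back to the caller. The sketch below uses the standard <filled> and <value> elements with a shortened grammar. The confirmation wording is made up for illustration, and the exact value stored in the field can vary with the platform and the grammar.
<form>
  <field name="topping">
    What topping would you like on your pizza?
    <grammar type="text/gsl">
      <![CDATA[
        [pepperoni olives (extra cheese)]
      ]]>
    </grammar>
    <!-- Runs once the caller says something that matches the grammar -->
    <filled>
      <prompt>
        One <value expr="topping"/> pizza, coming up.
      </prompt>
    </filled>
  </field>
</form>
Note that expr="topping" refers to the field by its name, which is why the name has to be a valid JavaScript variable name.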
What if a caller asks for a topping that Strato Pizza doesn’t offer? If Barbara calls up Strato and asks for her favorite potato pizza, my application should know what to do with her request.
On a web form, you generally perform some validation on your form submissions to make sure the user entered what you expected. In VoiceXML, I can use the <nomatch> element to handle the caller saying something that doesn’t match the grammar I supplied. Inside the nomatch element, I add a <reprompt/> element to replay the question.
<!-- The caller said something that was not defined in our grammar -->
<nomatch>
  I did not recognize that topping. Please try again.
  <reprompt/>
</nomatch>
In a voice application, I have another type of validation to perform, one that doesn’t happen on the web. In a web form, I can present the user with a form and wait all day for them to fill it out and hit the submit button. But in a voice application, if the caller doesn’t respond after I ask a question, I probably want to ask them again. For this, I can use the <noinput> element to determine what to do when a caller is silent in response to a question. In my noinput I’m going to ask the question again using the reprompt element.
<!-- The caller was silent, restart the field -->
<noinput>
  I did not hear anything. Please try again.
  <reprompt/>
</noinput>
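How long the platform waits in silence before triggering noinput is controlled by the standard VoiceXML timeout property. Here is a minimal sketch, assuming five seconds is a reasonable wait for a pizza order. The value and the shortened grammar are only illustrative, and platform defaults vary.
<form>
  <field name="topping">
    <!-- Wait up to five seconds for the caller to start speaking -->
    <property name="timeout" value="5s"/>
    What topping would you like on your pizza?
    <grammar type="text/gsl">
      <![CDATA[
        [pepperoni olives (extra cheese)]
      ]]>
    </grammar>
    <noinput>
      I did not hear anything. Please try again.
      <reprompt/>
    </noinput>
  </field>
</form>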
These two validation elements go inside the form field, just like my grammar did. So now my field looks like this:
<form>
  <field name="topping">
    What topping would you like on your pizza?
    <grammar type="text/gsl">
      <![CDATA[
        ;Lines starting with a semicolon are comments.
        ;Match one of the enclosed terms
        [
          ;Terms are separated by a space
          pepperoni olives sausage anchovies
          ;They can also be on separate lines.
          ;Each line is recognized as a separate term
          onions
          peppers
          ;Parentheses require all of the enclosed terms
          ;to be matched. A logical AND
          (extra cheese) (roasted garlic)
          ;Square brackets are the same as OR
          [mushrooms portobello]
          ;You can mix AND & OR together
          [ham (canadian bacon)]
        ]
      ]]>
    </grammar>
    <!-- The caller was silent, restart the field -->
    <noinput>
      I did not hear anything. Please try again.
      <reprompt/>
    </noinput>
    <!-- The caller said something that was not defined in our grammar -->
    <nomatch>
      I did not recognize that topping. Please try again.
      <reprompt/>
    </nomatch>
  </field>
</form>
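As written, a caller who keeps asking for potato will hear the same message forever. One possible refinement is the standard count attribute, which selects a different handler as the number of failed attempts grows. This is a sketch only: the extra wording, the three-attempt limit, and the shortened grammar are assumptions for illustration.
<form>
  <field name="topping">
    What topping would you like on your pizza?
    <grammar type="text/gsl">
      <![CDATA[
        [pepperoni onions (extra cheese)]
      ]]>
    </grammar>
    <!-- First miss: simple retry -->
    <nomatch count="1">
      I did not recognize that topping. Please try again.
      <reprompt/>
    </nomatch>
    <!-- Second miss: give the caller some hints -->
    <nomatch count="2">
      Sorry, I still did not catch that. You can say things like pepperoni, onions, or extra cheese.
      <reprompt/>
    </nomatch>
    <!-- Third miss and beyond: give up politely and end the application -->
    <nomatch count="3">
      I am having trouble understanding. Please call back later. Goodbye.
      <exit/>
    </nomatch>
  </field>
</form>
The same count attribute works on noinput, so silent callers can be handled the same way.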
Now my application is able to find out what sort of pizza a caller would like and can handle mistakes, distracted callers, and toppings I don’t have. Adding this to my greeting from the last post, I have:
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1">
  <form>
    <block>
      <prompt>
        Thanks for calling Strato Pizza.
      </prompt>
    </block>
    <field name="topping">
      What topping would you like on your pizza?
      <grammar type="text/gsl">
        <![CDATA[
          ;Lines starting with a semicolon are comments.
          ;Match one of the enclosed terms
          [
            ;Terms are separated by a space
            pepperoni olives sausage anchovies
            ;They can also be on separate lines.
            ;Each line is recognized as a separate term
            onions
            peppers
            ;Parentheses require all of the enclosed terms
            ;to be matched. A logical AND
            (extra cheese) (roasted garlic)
            ;Square brackets are the same as OR
            [mushrooms portobello]
            ;You can mix AND & OR together
            [ham (canadian bacon)]
          ]
        ]]>
      </grammar>
      <!-- The caller was silent, restart the field -->
      <noinput>
        I did not hear anything. Please try again.
        <reprompt/>
      </noinput>
      <!-- The caller said something that was not defined in our grammar -->
      <nomatch>
        I did not recognize that topping. Please try again.
        <reprompt/>
      </nomatch>
    </field>
  </form>
</vxml>
The next requirement for my application is to collect the caller’s phone number so Strato can call if there’s a problem with the order. I’ll take a look at that tomorrow in my next blog post.