SIV for VoiceXML 3.0: Language and Application Design Considerations
Ken Rehor Cisco Systems, Inc.
krehor@cisco.com
March 05, 2009
VoiceXML Application Architecture
[Diagram: PSTN / VoIP; VoIP Gateway; VoiceXML Server (ASR, TTS, SIV, Audio, DTMF); VoiceXML over HTTP; VoiceXML Application; IP; Verification Application; SIV engine; VP DB]
[Diagram: PSTN / VoIP; VoiceXML Server; VoiceXML over HTTP; VoiceXML Application; IP; Verification Application; SIV engine; VP DB. VoiceXML features shown: <subdialog>, recordutterance, <record>]
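The <subdialog> path named above could look roughly like the following VoiceXML 2.1 sketch. It is only an illustration: the URL, the claimed_id parameter, and the decision return value are all assumptions, and the called document would need to declare a matching <var name="claimed_id"/> and end with <return namelist="decision"/>.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="login">
    <!-- Hypothetical: the identity claim would normally come from an earlier dialog step -->
    <var name="claimed_id" expr="'12345'"/>

    <!-- Assumed URL: a VoiceXML document served by the verification application -->
    <subdialog name="siv" src="http://verifier.example.com/verify.vxml">
      <param name="claimed_id" expr="claimed_id"/>
      <filled>
        <!-- 'decision' is an assumed property returned by the subdialog -->
        <if cond="siv.decision == 'accepted'">
          <prompt>Thank you. Your voice has been verified.</prompt>
        <else/>
          <prompt>Sorry, we could not verify your voice.</prompt>
        </if>
      </filled>
    </subdialog>
  </form>
</vxml>
```

With this arrangement the calling application never touches the audio or the voice model; it only sees whatever the verification dialog chooses to return.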
<form id="verify">
  <!-- could use external grammar -->
  <record name="utterance" maxtime="5s">
    <prompt>Say this digit sequence: one two three four five.</prompt>
    <noinput>I didn't hear anything, please try again.</noinput>
  </record>
  <block>
    <submit next="check_utterance.pl" enctype="multipart/form-data"
            method="post" namelist="utterance"/>
  </block>
</form>
<form id="verify">
  <field type="digits">
    <prompt>Say this digit sequence: one two three four five.</prompt>
    <filled>
      <!-- if spoken digits match expected response, then process voice model -->
    </filled>
  </field>
</form>
<form id="verify">
  <property name="recordutterance" value="true"/>
  <field type="digits">
    <prompt>Say this digit sequence: one two three four five.</prompt>
    <filled>
      <!-- if spoken digits match expected response, then process voice model -->
    </filled>
  </field>
</form>
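The comment inside <filled> above leaves the actual steps open. One possible way to fill it in with VoiceXML 2.1 is sketched below, assuming the field is named spoken (a hypothetical name), that the expected sequence is compared as the literal string '12345', and that the same check_utterance.pl script from the <record> example receives the audio.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="verify">
    <!-- Ask the platform to keep the audio captured during recognition -->
    <property name="recordutterance" value="true"/>
    <field name="spoken" type="digits">
      <prompt>Say this digit sequence: one two three four five.</prompt>
      <filled>
        <if cond="spoken == '12345'">
          <!-- spoken$.recording refers to the utterance recorded for this field -->
          <var name="utterance" expr="spoken$.recording"/>
          <submit next="check_utterance.pl" method="post"
                  enctype="multipart/form-data" namelist="utterance"/>
        <else/>
          <prompt>That was not the sequence I asked for.</prompt>
          <!-- Clearing the field makes the form revisit it -->
          <clear namelist="spoken"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

Compared with the <record> version, the same utterance serves both the ASR check and the verification step, so the caller only has to speak once.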
Architecture options carry security implications
[Diagram: PSTN network; VoiceXML browser; MRCP Server; SIV Engine; ASR Engine; TTS Engine; Voice Web Application Server; Authentication Web Service; Other Application Web Services; Voice template database; Web Service interface. Data exchanged: .wav, <vxml>, <grxml>, <ssml>, voice template]
Voice DEFF may be used between SIV components and services
[Diagram: PSTN network; VoiceXML browser (<record>); MRCP Client; MRCP Server; ASR Engine; TTS Engine; SIV Engine; Voice Web Application Server; Voice template database. Data exchanged: audio, .wav, <vxml>, <grxml>, <ssml>, voice template]
VoiceXML browser records the utterance and forwards to app server (typical scenario for VoiceXML 2.0/2.1)
Voice templates managed and stored locally by SIV engine
Note: DTMF processing not shown
Audio stream vs. buffers: Streaming handled by RTP? Buffers may be handled by audio recorder function. Part
[Diagram: PSTN network; VoiceXML browser (<record>); MRCP Client; MRCP Server; ASR Engine; TTS Engine; Voice Web Application Server; IP; Service Provider with SIV Engine and Voice Web Application Server; Voice template database. Data exchanged: audio, .wav, <vxml>, <grxml>, <ssml>, voice template]
VoiceXML browser records the utterance and forwards to app server (typical scenario for VoiceXML 2.0/2.1)
Voice templates managed and stored locally by SIV engine
Note: DTMF processing not shown
[Diagram: PSTN network; VoiceXML browser; MRCP Client; MRCP Server; SIV Engine; ASR Engine; TTS Engine; Voice Web Application Server; Voice template database. Data exchanged: audio, .wav, <vxml>, <grxml>, <ssml>, voice template]
Voice templates managed and stored locally by SIV engine
Note: DTMF processing not shown
Audio stream vs. buffers: Streaming handled by RTP? Buffers may be handled by audio recorder function. Part
[Diagram: PSTN network; VoiceXML browser; MRCP Client; MRCP Server; SIV Engine; ASR Engine; TTS Engine; Voice Web Application Server; Voice template database. Data exchanged: audio, .wav, <vxml>, <grxml>, <ssml>, voice template]
SIV engine managed by MRCP server
SIV database managed by app server
Voice model transmission managed by engine or MRCP Server
Voice templates retrieved from database by app server
Note: DTMF processing not shown
[Diagram: PSTN network; VoiceXML browser; MRCP Client; MRCP Server; SIV Engine; ASR Engine; TTS Engine; Voice Web Application Server; Voice template database. Data exchanged: audio, .wav, <vxml>, <grxml>, <ssml>, voice template]
SIV engine managed by MRCP server
SIV database managed by app server
Voice model transmission managed by VoiceXML browser
Voice templates retrieved from database by app server
Voice templates managed and stored locally by SIV engine
Note: DTMF processing not shown
– ASR and biometric engines
– Simultaneously
– Switch on a per <field> or verification basis
[Diagram: Resource Controller (an object with semantics similar to form item) and Resources. Output side: Prompt queue; SSML/media player; labels SSML, FA, audio, DTMF, Mark data; commands from controllers: Stop, Play; events: mark, error, done. Input side: Input, Input 2, Input 3, …, each receiving audio (audio recorder, recognition device(s), verification, etc.); commands: Stop, Play, Barge-in on/off, Add grammar(), Add voiceprint(); events: done]
Inputs are all session-level
Recording types to consider:
(e.g. mixed-initiative recording)
[Diagram: VoiceXML Session; SIV dialog (shown three times); Verification Session]
– Environment
– Properties
– Attributes
– Voice models
– Results specified as an EMMA result
– Errors/info
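To make "results specified as an EMMA result" more concrete, here is a minimal sketch of what a verification outcome might look like in EMMA 1.0. The slide does not say which annotations would be used, so treat every choice here as an assumption: emma:function="verification" and emma:confidence are standard EMMA annotations, while the <claim> and <result> payload elements and all values are purely illustrative.

```xml
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="siv1"
                       emma:function="verification"
                       emma:medium="acoustic"
                       emma:mode="voice"
                       emma:confidence="0.92">
    <!-- Application-specific payload; element names and values are assumptions -->
    <claim>12345</claim>
    <result>verified</result>
  </emma:interpretation>
</emma:emma>
```

Using the same container as ASR results would let a VoiceXML application handle recognition and verification outcomes through one result path, with errors and informational messages reported separately.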
[Diagram: PSTN / VoIP; VoiceXML Browser; VoiceXML over HTTP; VoiceXML Application; IP; Verification Application; SIV engine; VP DB. VoiceXML features shown: recordutterance, <record>. Interfaces shown: BIAS (Web Service), BioAPI]
[Diagram: PSTN / VoIP; VoiceXML Browser; VoiceXML over HTTP; VoiceXML Application; IP; Verification Application; SIV engine; VP DB. VoiceXML features shown: <subdialog>, recordutterance, <record>]
[Diagram: PSTN / VoIP; VoiceXML Browser; VoiceXML over HTTP; VoiceXML Application; VP DB; SIV engine. Interfaces shown: BioAPI, MRCP, etc.]
[Diagram: PSTN / VoIP; VoiceXML Browser; VoiceXML over HTTP; VoiceXML Application; IP; Verification Application; VP DB; two SIV engines. VoiceXML features shown: <subdialog>. Interfaces shown: BioAPI, MRCP, etc.]
[Diagram: PSTN / VoIP; VoiceXML Browser; VoiceXML over HTTP; VoiceXML Application; IP; Verification Application; VP DB; two SIV engines. VoiceXML features shown: <subdialog>, recordutterance, <record>. Interfaces shown: BIAS (Web Service)]
[Diagram: PSTN / VoIP; VoiceXML Browser; VoiceXML over HTTP; VoiceXML Application; IP; Verification Application; VP DB; two SIV engines. VoiceXML features shown: <subdialog>, recordutterance, <record>]
SIV engines controlled directly by VoiceXML: Pros and Cons
processing
– More than one input engine, e.g. ASR+Verify
– Speed/responsiveness
– Accuracy
– Aligned with other input resources
– Don't have to buy, install, maintain SIV engine
– Shared resource on VXML platform
Service Providers
– Shared resource
– Ease of deployment
– Enhanced service offering
solution
– Enables developers to do 'bad' things
– Enables developers to do 'bad' things
– Voice models, engine capabilities, results/errors, etc. are all proprietary
– MRCPv2 not sufficient; need more features (MRCPv3?)
[Diagram: VoiceXML browser (Resource Controller; VoiceXML document/script; Input review); Voice Web Application (<vxml>; application-specific decisions). Output: TTS Engine (<ssml>); Audio player (.wav). Input: ASR Engine (<grxml>, results); DTMF Engine (<grxml>, results); SIV Engine (voice model, results); Audio recorder (.wav)]