 
Speech-Based Interaction

Using Speech as a “Natural” Data Type

Speech as Input
  - Chief decision: Recognition versus Raw Data
  - Recognition
    - Translate into other information (words)
    - Must deal with errors
    - Useful for either human or machine consumption of results
  - Raw Data
    - For use “as data” (not commands) for human consumption
    - Often linked with other context (time) in capture applications
Speech as Output
  - Main issues: length of presentation time, lack of persistence, etc.

Issues in Speech as Input

  - Perfect recognition of speech (or semantic understanding of any kind of audio) is difficult to achieve
  - Challenge: How would you begin?
    - Segmentation
    - Syntax

Interesting features in speech

  - Pauses between phrases as well…
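
Pauses are one of the easier features to pull out of raw audio. A minimal sketch, assuming the audio is already a list of float samples: it marks runs of low-energy frames as pauses. The function name `find_pauses`, the frame length, the energy threshold, and the minimum run length are all illustrative assumptions, not values from the slides.

```python
def find_pauses(samples, frame_len=4, threshold=0.1, min_frames=2):
    """Return (start, end) sample indices of runs of low-energy frames."""
    pauses = []
    run_start = None  # index of the first quiet frame in the current run
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len  # mean squared amplitude
        if energy < threshold:
            if run_start is None:
                run_start = i
        else:
            # A loud frame ends the run; keep it only if it was long enough.
            if run_start is not None and i - run_start >= min_frames:
                pauses.append((run_start * frame_len, i * frame_len))
            run_start = None
    # Handle a pause that lasts until the end of the signal.
    if run_start is not None and n_frames - run_start >= min_frames:
        pauses.append((run_start * frame_len, n_frames * frame_len))
    return pauses


# Speech, then silence, then speech again.
signal = [1.0] * 8 + [0.0] * 8 + [1.0] * 8
print(find_pauses(signal))
```

The gaps between the returned pauses are candidate phrase segments, which is one rough way to begin on the segmentation problem.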

Issues

  - Use of open air microphones & speakers can result in undesired audio
    - ambient noise
    - audio feedback
  - Challenge: allow developers to easily add/use functions in their applications
    - Noise reduction
    - Enhance audio quality
    - Echo cancellation

Noise Reduction

  f(t) -> [Noise Filter] -> f’(t)

  - Random noise is hard to predict
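
As an illustration of the f(t) to f’(t) box, here is a minimal sketch of one very simple noise filter, a centered moving average. This is an assumed stand-in for the "Noise Filter" stage, not the method the slides have in mind; averaging damps random sample-to-sample noise but also blurs the signal, and practical noise reduction is more involved.

```python
def moving_average(signal, window=3):
    """Smooth `signal` with a centered moving average (a simple low-pass filter)."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo = max(0, i - half)                  # clip the window at the edges
        hi = min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


noisy = [0, 3, 0]             # a single noisy spike
print(moving_average(noisy))  # the spike is spread out and reduced
```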

Echo Cancellation

  f(t) -> [Echo Canceller] -> f’(t)

  - Software and hardware exist, but are hard for developers to easily add to an application
  - Random noise is hard to predict, but echoes are not so random...
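
Echoes are "not so random" because the system knows what it played through the speaker. A minimal sketch, assuming the echo is a single copy of the speaker signal with a known delay (in samples) and a known gain: subtract that copy from the microphone signal. The function name and parameters are illustrative; real echo cancellers have to estimate delay and gain adaptively, which this sketch does not do.

```python
def cancel_echo(mic, speaker, delay, gain):
    """Subtract a delayed, scaled copy of `speaker` from `mic`.

    Assumes len(speaker) >= len(mic) and a single echo path.
    """
    out = []
    for t, m in enumerate(mic):
        # The echo at time t is what the speaker played `delay` samples ago.
        echo = gain * speaker[t - delay] if t >= delay else 0.0
        out.append(m - echo)
    return out


speaker = [1.0, 2.0, 0.0, 0.0]   # what the system played
voice = [0.0, 0.0, 4.0, 0.0]     # what the user actually said
# The microphone hears the voice plus the speaker output, delayed by 1 and halved.
mic = [0.0, 0.5, 5.0, 0.0]
print(cancel_echo(mic, speaker, delay=1, gain=0.5))
```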

More Issues

  - It is still difficult to:
    - grab
    - chunk (segment)
    - store
    - search/index/grep
    - playback (think about the pain of automated phone menus...)
  - Challenge: provide support for handling audio in manner similar to text

Most Straightforward Speech Interface

  - Voice menu systems
    - System speaks list of possibilities then waits for you to select one
    - Minor improvement: you can jump in whenever you hear the item you want
  - Why are these so painful?
    - Hierarchy -- very wide and deep makes for a big search space
    - Often no easy way to jump around in the tree
    - “Where you are” matters, but there’s no way to know “where you are” other than just hearing the menu again
    - Presentation time -- reading of long lists of options
  - There are good points:
    - You know what you can do at any given time
  - Triumph of ease of implementation over imagination
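
To make the presentation-time cost concrete, here is a rough sketch that models a voice menu as a nested dictionary and adds up how many seconds of spoken options a caller hears on the way to one item. The menu contents, the two seconds per option, and the function name `listening_time` are all made-up assumptions for illustration.

```python
def listening_time(menu, path, seconds_per_item=2.0, barge_in=True):
    """Seconds of menu speech heard while following `path` through `menu`."""
    total = 0.0
    node = menu
    for choice in path:
        options = list(node)  # dicts keep insertion order, i.e. spoken order
        # With barge-in the caller stops listening once the wanted option is read;
        # without it the whole list is read before a choice is accepted.
        heard = options.index(choice) + 1 if barge_in else len(options)
        total += heard * seconds_per_item
        node = node[choice]
    return total


# Hypothetical two-level phone menu.
menu = {
    "billing": {"balance": {}, "pay": {}},
    "support": {"hours": {}, "agent": {}, "outage": {}},
}
print(listening_time(menu, ["support", "agent"]))
print(listening_time(menu, ["support", "agent"], barge_in=False))
```

Every extra level or option adds to the total, which is why wide and deep hierarchies are slow to use by voice.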

Audio Features

  - Think of these as “degrees of freedom” of speech as an input device
  - Pauses
    - Analogy to mouse up/down/drag?
  - Who is speaking?
    - Turn-taking
  - How is someone speaking?
    - Prosody, affect
  - What is being said?
    - Recognition of words

Case Study: Speech Acts

  - Big idea: move away from voice as a replacement for menus (easy to implement but painful to use), toward more conversational interfaces
  - “Designing SpeechActs: Issues in Speech User Interfaces,” Yankelovich, Levow, Marx, CHI’95

  Mail:
    SpeechActs: You have 14 new messages and 47 old messages. The first new message is from Eric Baatz regarding "report completed."
    User: Let me hear it.
    SpeechActs: "The first draft is ready for your comments. Eric."
    User: Reply and include the current message.
    SpeechActs: Preparing message to Eric Baatz. Begin recording after the tone. When finished, pause for several seconds.
    User: Eric, I'll get those to you this afternoon.
    SpeechActs: Say cancel, send, or review.
    User: Send.
    SpeechActs: Message sent. What now?
    User: Next message.
    SpeechActs: New message two from Stuart Adams, subject "Can we meet at 3:00 today?"
    User: Switch to calendar...

  Other commands:
    What do I have tomorrow?
    What about Bob?
    What did he have last Wednesday?
    And next Thursday?
    What was Paul doing three days after Labor Day?
    What's the weather in Seattle?
    How about Texas?
    I'd like the extended forecast for Boston.

Speech Acts

  - How is this an improvement over voice menu systems?
    - No formal hierarchy -- so no need for commands to navigate it
    - “Where you are” doesn’t matter so much, so no need to fret over how to present it
    - Presentation time -- minimizes output from the system, focusing on content rather than commands or context
    - Conversational -- takes advantage of implicit contextual cues in the workflow, mimicking the way human conversation works
  - Bad points?
    - You may not know what you have to say in order to control the system (not as explicit as in menus)

Speech Acts Design Challenges

  - Simulating Conversation
    - Avoid prompting wherever possible
    - Build context around subdialogs
    - Output prosodics: system asks “huh?”
    - Pacing: people often have to speak more slowly when talking to machines; need a way to “barge in” to machine output
  - Transforming GUIs into SUIs
    - Vocabulary: need wide, domain-dependent vocabulary
    - Information organization: how to present content like email messages, flags, message numbers, etc., with consistency and w/o overwhelming the user
    - Information flow: speech “dialog boxes” (force users into a small set of choices) don’t fit well into conversational style (Users ignore or may produce unexpected answers: “Do you have the time?” not always answered by yes/no)

Speech Acts Design Challenges (cont’d)

  - Recognition errors
    - Rejection errors (utterance not recognized) are frustrating. Can yield a “brick wall” of “I don’t understand” messages. Solution: provide progressive assistance
    - Substitution errors are damaging. Don’t want to verify every utterance. Approach: commands that present data are verified implicitly; commands that destroy data or cannot be undone are verified explicitly
    - Insertion errors (background audio picked up as commands or data). Solution: key to turn off recognizer
  - The Nature of Speech
    - Lack of visual feedback. Users feel less in control; users can be faced with silence if they don’t do anything; long pauses in conversations are uncomfortable so users may feel a need to respond quickly; less information transmitted to the user at one time
    - Speed and persistence: although speech is easy for humans to produce, it is hard to consume. Also not persistent: easy to forget, no on-screen reminder.
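
The two error-handling ideas above, verifying only risky commands explicitly and giving progressively more help after repeated rejections, can be sketched as follows. This is an assumed illustration and not the SpeechActs implementation: the command names in `DESTRUCTIVE` and the prompt wording in `ASSISTANCE` are hypothetical.

```python
# Hypothetical commands that destroy data or cannot be undone.
DESTRUCTIVE = {"delete", "send"}

# Hypothetical prompts, from terse to most helpful.
ASSISTANCE = [
    "What?",
    "Sorry?",
    "Sorry. Please rephrase.",
    "I didn't understand. Say 'help' for a list of commands.",
]


def verification_mode(command):
    """Explicit confirmation for risky commands, implicit for the rest."""
    return "explicit" if command in DESTRUCTIVE else "implicit"


def rejection_prompt(consecutive_rejections):
    """Pick a prompt that gives more help the more often recognition has failed."""
    idx = min(consecutive_rejections, len(ASSISTANCE)) - 1
    return ASSISTANCE[max(idx, 0)]


print(verification_mode("read"))    # implicit: just read the message back
print(verification_mode("delete"))  # explicit: ask "are you sure?" first
print(rejection_prompt(1))
print(rejection_prompt(5))
```

Escalating the prompt avoids the “brick wall” of identical “I don’t understand” messages while keeping the first retry short.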

Speech Acts Summary

  - SpeechActs shows the challenges in doing speech “right” (as opposed to just voice menus)
    - Speech as input
    - Speech as output
    - Real recognition
  - Other systems that address the same set of challenges:
    - Voice Notes (MIT): speech as data, plus input and output
  - There are other uses of speech that don’t involve so much hard (recognition and design) work though
  - Case studies:
    - Suede (Berkeley): faking “working” speech for UI design
    - Personal audio loop (GT): uninterpreted audio UI for human consumption
    - Family Intercom (GT): uninterpreted audio UI for human consumption

Case Study: Suede

  - Toolkit for prototyping speech interfaces
  - http://guir.berkeley.edu/projects/suede/

Case Study: Personal Audio Loop

  - Application which continuously buffers the user’s last 15 minutes of audio
    - “What were we talking about…?”
    - “What was that phone number I heard?”
  - Features above are used to speed up audio playback when skimming for point of access
    - compressed or discarded in some cases
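
Keeping only the last 15 minutes of audio is a natural fit for a fixed-size ring buffer. A minimal sketch, assuming audio arrives as fixed-length frames at a known rate; the class name `AudioLoop` and the 50 frames per second default are illustrative assumptions, not details of the actual Personal Audio Loop system.

```python
from collections import deque


class AudioLoop:
    """Keep only the most recent `seconds` of audio frames."""

    def __init__(self, seconds=15 * 60, frames_per_second=50):
        self.frames_per_second = frames_per_second
        # A deque with maxlen silently drops the oldest frame when full.
        self._frames = deque(maxlen=seconds * frames_per_second)

    def record(self, frame):
        self._frames.append(frame)

    def last(self, seconds):
        """Return the frames from the most recent `seconds` (or fewer if not yet buffered)."""
        n = seconds * self.frames_per_second
        return list(self._frames)[-n:] if n > 0 else []


# Tiny demo: a 2-second loop at 2 frames per second holds 4 frames.
loop = AudioLoop(seconds=2, frames_per_second=2)
for frame in range(10):
    loop.record(frame)
print(loop.last(1))  # the two most recent frames
```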

Case Study: The Family Intercom

  - Use location sensing in context-aware environment to connect people in different places in a conversation

The Family Intercom (Ubicomp 2001)

  Mom: “I want to talk to Jamie.”
  (context: “He is alone in his room.”)
  Mom: “Jamie, have you finished your homework?”
  Son: “How do I do this math homework?”

The Family Intercom (Ubicomp 2001)

  Son: “What is this little two above the number?”
  …
  Mom: “Power of 2. When you finish, come set the dinner table. Bye.”

Resources

  - Java Speech API:
    - Recognition and synthesis
    - http://java.sun.com/products/java-media/speech/
  - FreeTTS:
    - A Java port of a very high quality speech synthesis package
    - http://freetts.sourceforge.net/docs/index.php