Speech-Based Interaction Using Speech as a Natural Data Type

SLIDE 1

Speech-Based Interaction

SLIDE 2

Using Speech as a “Natural” Data Type

Speech as Input

Chief decision: Recognition versus Raw Data

Recognition

Translate into other information (words)

Must deal with errors

Useful for either human or machine consumption of results

Raw Data

For use “as data” (not commands) for human consumption

Often linked with other context (time) in capture applications

Speech as Output

Main issues: length of presentation time, lack of persistence, etc.

SLIDE 3

Issues in Speech as Input

Perfect recognition of speech (or semantic understanding of any kind of audio) is difficult to achieve

Challenge: How would you begin?

  • Segmentation
  • Syntax

SLIDE 4

Interesting features in speech

Pauses between phrases as well…

SLIDE 5

Issues

Use of open-air microphones & speakers can result in undesired audio

  • ambient noise
  • audio feedback

Challenge: allow developers to easily add/use functions in their applications

  • Noise reduction
  • Enhance audio quality
  • Echo cancellation

SLIDE 6

Noise Reduction

Random noise is hard to predict

[Diagram: f(t) → Noise Filter → f’(t)]
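The filter box in the diagram can be illustrated with the simplest possible smoother. This is a minimal sketch, assuming a moving-average (low-pass) filter stands in for the noise filter; the slides do not name a specific technique, and the function name and `window` parameter are hypothetical.

```python
def moving_average(samples, window=5):
    """Smooth f(t) into f'(t) by averaging each sample with its recent past.

    Assumption: noise is zero-mean and faster-changing than the signal,
    so averaging over a short window attenuates it.
    """
    smoothed = []
    running = 0.0
    for i, s in enumerate(samples):
        running += s
        if i >= window:
            # Drop the sample that just left the window.
            running -= samples[i - window]
        # Early samples are averaged over however many exist so far.
        smoothed.append(running / min(i + 1, window))
    return smoothed


# A constant signal of 1.0 with alternating +/-0.5 "noise" on top.
noisy = [1.5, 0.5, 1.5, 0.5, 1.5, 0.5]
print(moving_average(noisy, window=2))
```

A wider window removes more noise but also blurs fast changes in the speech itself, which is one reason truly random noise remains hard to handle.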

SLIDE 7

Echo Cancellation

Software and hardware exist, but are hard for developers to easily add to application

Random noise is hard to predict, but echoes are not so random...

[Diagram: f(t) → Echo Canceller → f’(t)]
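Because an echo is a delayed, attenuated copy of what the loudspeaker just played, it can be estimated from that signal and subtracted from the microphone input. The sketch below uses a normalized LMS (NLMS) adaptive filter, which is one common approach and not something the slides specify. The function name, tap count, step size, and synthetic echo path are all assumptions for illustration.

```python
import random


def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `far_end` from `mic`.

    Returns the residual signal and the learned filter weights.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate  # what is left after removing the echo
        norm = sum(h * h for h in history) + eps
        # NLMS update: step toward weights that would have cancelled e.
        weights = [w + mu * e * h / norm for w, h in zip(weights, history)]
        residual.append(e)
    return residual, weights


# Synthetic test: the microphone hears only an echo of the far-end signal.
rng = random.Random(0)
far_end = [rng.gauss(0.0, 1.0) for _ in range(2000)]
echo_path = [0.0, 0.6, 0.3, 0.1]  # hypothetical room response
mic = [
    sum(echo_path[k] * far_end[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(far_end))
]
residual, weights = nlms_echo_cancel(far_end, mic)
```

In a real application the microphone also carries the local talker, and adaptation is usually paused while that person speaks; this sketch leaves that out.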

SLIDE 8

More Issues

It is still difficult to:

  • grab
  • chunk (segment)
  • store
  • search/index/grep
  • playback (think about the pain of automated phone menus...)

Challenge: provide support for handling audio in a manner similar to text
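One way to make audio behave more like text is to chunk it at pauses, much as text is split at whitespace. The following is a minimal sketch, assuming a simple frame-energy threshold decides what counts as silence; the function name, frame size, threshold, and minimum pause length are hypothetical values chosen for illustration.

```python
def segment_on_pauses(samples, frame=4, threshold=0.01, min_pause_frames=2):
    """Return (start, end) sample ranges of non-silent chunks.

    A chunk ends once `min_pause_frames` consecutive frames fall below
    the energy threshold. Trailing samples that do not fill a whole
    frame are ignored.
    """
    segments = []
    start = None
    silent_run = 0
    n_frames = len(samples) // frame
    for f in range(n_frames):
        chunk = samples[f * frame:(f + 1) * frame]
        energy = sum(s * s for s in chunk) / frame
        if energy >= threshold:
            if start is None:
                start = f * frame
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                # Close the chunk where the silence began.
                segments.append((start, (f + 1 - silent_run) * frame))
                start = None
                silent_run = 0
    if start is not None:
        segments.append((start, (n_frames - silent_run) * frame))
    return segments


# Two bursts of "speech" separated by a pause.
demo = [0.5] * 8 + [0.0] * 8 + [0.5] * 4 + [0.0] * 4
print(segment_on_pauses(demo))
```

The resulting ranges can then be stored, indexed, or skipped over individually during playback.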

SLIDE 10

Most Straightforward Speech Interface

Voice menu systems

System speaks list of possibilities then waits for you to select one

Minor improvement: you can jump in whenever you hear the item you want

Why are these so painful?

Hierarchy -- very wide and deep makes for a big search space

Often no easy way to jump around in the tree

“Where you are” matters, but there’s no way to know “where you are” other than just hearing the menu again

Presentation time -- reading of long lists of options

There are good points:

You know what you can do at any given time

Triumph of ease of implementation over imagination
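The cost of a wide, deep hierarchy can be put in rough numbers. This is a back-of-the-envelope sketch, assuming every menu has the same number of options and each option takes a fixed time to read aloud; the function name and the two-second figure are assumptions, not measurements.

```python
def menu_cost(breadth, depth, seconds_per_option=2.0):
    """Estimate the size of a voice menu tree and the worst-case listening time.

    Worst case: the caller hears every option at every level before
    choosing the last one.
    """
    leaves = breadth ** depth
    worst_case_seconds = breadth * depth * seconds_per_option
    return leaves, worst_case_seconds


# Six options per menu, three levels deep.
print(menu_cost(6, 3))  # (216, 36.0)
```

Barge-in shortens the average case, but the caller still has to traverse every level of the tree.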

SLIDE 11

Audio Features

Think of these as “degrees of freedom” of speech as an input device

Pauses

Analogy to mouse up/down/drag?

Who is speaking?

Turn-taking

How is someone speaking?

Prosody, affect

What is being said?

Recognition of words

SLIDE 12

Case Study: SpeechActs

Big idea: move away from voice as a replacement for menus (easy to implement but painful to use), toward more conversational interfaces

“Designing SpeechActs: Issues in Speech User Interfaces,” Yankelovich, Levow, Marx, CHI’95

Mail:

SpeechActs: You have 14 new messages and 47 old messages. The first new message is from Eric Baatz regarding "report completed."

User: Let me hear it.

SpeechActs: "The first draft is ready for your comments. Eric."

User: Reply and include the current message.

SpeechActs: Preparing message to Eric Baatz. Begin recording after the tone. When finished, pause for several seconds.

User: Eric, I'll get those to you this afternoon.

SpeechActs: Say cancel, send, or review.

User: Send.

SpeechActs: Message sent. What now?

User: Next message.

SpeechActs: New message two from Stuart Adams, subject "Can we meet at 3:00 today?"

User: Switch to calendar...

Other commands:

What do I have tomorrow?

What about Bob?

What did he have last Wednesday?

And next Thursday?

What was Paul doing three days after Labor Day?

What's the weather in Seattle?

How about Texas?

I'd like the extended forecast for Boston.
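Follow-ups such as "What about Bob?" or "And next Thursday?" only make sense if the system remembers what the previous question was about. This is a minimal sketch of that idea, assuming a dictionary of context slots where each utterance overrides only the slots it mentions. The slot names and the `resolve` helper are hypothetical, and the slides do not describe how SpeechActs actually tracks context.

```python
def resolve(context, **mentioned):
    """Fill in an elliptical query from the running conversation context.

    Slots the user did not mention keep their previous values.
    """
    updated = dict(context)
    updated.update({k: v for k, v in mentioned.items() if v is not None})
    return updated


ctx = {"app": "calendar", "person": "me", "day": "today"}
ctx = resolve(ctx, day="tomorrow")        # "What do I have tomorrow?"
ctx = resolve(ctx, person="Bob")          # "What about Bob?"
print(ctx)  # person is now Bob, day is still tomorrow
ctx = resolve(ctx, day="last Wednesday")  # "What did he have last Wednesday?"
```

The same pattern would cover the weather queries, with a location slot carried from "Seattle" to "Texas".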

SLIDE 13

SpeechActs

How is this an improvement over voice menu systems?

No formal hierarchy -- so no need for commands to navigate it

“Where you are” doesn’t matter so much, so no need to fret over how to present it

Presentation time -- minimizes output from the system, focusing on content rather than commands or context

Conversational -- takes advantage of implicit contextual cues in the workflow, mimicking the way human conversation works

Bad points?

You may not know what you have to say in order to control the system (not as explicit as in menus)

SLIDE 14

SpeechActs Design Challenges

Simulating Conversation

Avoid prompting wherever possible

Build context around subdialogs

Output prosodics: system asks “huh?”

Pacing: people often have to speak more slowly when talking to machines; need a way to “barge in” to machine output

Transforming GUIs into SUIs

Vocabulary: need wide, domain-dependent vocabulary

Information organization: how to present content like email messages, flags, message numbers, etc., with consistency and w/o overwhelming the user

Information flow: speech “dialog boxes” (force users into a small set of choices) don’t fit well into conversational style (Users ignore or may produce unexpected answers: “Do you have the time?” not always answered by yes/no)

SLIDE 15

SpeechActs Design Challenges (cont’d)

Recognition errors

Rejection errors (utterance not recognized) are frustrating. Can yield “brick wall” of “I don’t understand” messages. Solution: provide progressive assistance

Substitution errors are damaging. Don’t want to verify every utterance. Approach: commands that present data are verified implicitly; commands that destroy data or cannot be undone are verified explicitly

Insertion errors (background audio picked up as commands or data). Solution: key to turn off recognizer
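The first two strategies above can be sketched as simple policies. This is an illustrative sketch only: the prompt wording, the set of irreversible commands, and both function names are hypothetical and are not taken from the SpeechActs paper.

```python
# Hypothetical prompts; each one gives a little more help than the last.
ASSISTANCE = [
    "What?",
    "Sorry, please rephrase.",
    "I didn't understand. Try a short command such as 'next message'.",
]

# Hypothetical set of commands treated as destructive or irreversible.
IRREVERSIBLE = {"delete", "send"}


def rejection_prompt(consecutive_rejections):
    """Progressive assistance: escalate help with each rejection in a row."""
    level = min(consecutive_rejections, len(ASSISTANCE)) - 1
    return ASSISTANCE[max(level, 0)]


def needs_explicit_confirmation(command):
    """Only commands that cannot be undone are confirmed out loud.

    Everything else is verified implicitly by simply presenting the result.
    """
    return command in IRREVERSIBLE


print(rejection_prompt(1))
print(needs_explicit_confirmation("delete"))
```

Resetting the rejection counter after any successful recognition keeps the short prompt as the common case.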

The Nature of Speech

Lack of visual feedback: users feel less in control; users can be faced with silence if they don’t do anything; long pauses in conversations are uncomfortable, so users may feel a need to respond quickly; less information is transmitted to the user at one time

Speed and persistence: although speech is easy for humans to produce, it is hard to consume. It is also not persistent: easy to forget, no on-screen reminder.

SLIDE 16

SpeechActs Summary

SpeechActs shows the challenges in doing speech “right” (as opposed to just voice menus)

Speech as input

Speech as output

Real recognition

Other systems that address the same set of challenges:

Voice Notes (MIT): speech as data, plus input and output

There are other uses of speech that don’t involve so much hard (recognition and design) work though

Case studies:

Suede (Berkeley): faking “working” speech for UI design

Personal audio loop (GT): uninterpreted audio UI for human consumption

Family Intercom (GT): uninterpreted audio UI for human consumption

SLIDE 17

Case Study: Suede

  • Toolkit for prototyping speech interface
  • http://guir.berkeley.edu/projects/suede/


Case Study: Personal Audio Loop

  • Application which continuously buffers user’s last 15 minutes of audio
  • “What were we talking about…?”
  • “What was that phone number I heard?”
  • Features above are used to speed up audio playback when skimming for point of access
  • compressed or discarded in some cases
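The continuous buffering described above maps naturally onto a fixed-size ring buffer. This is a minimal sketch, assuming audio arrives as lists of samples; the `AudioLoop` class and its method names are hypothetical and say nothing about how the actual Personal Audio Loop was built.

```python
from collections import deque


class AudioLoop:
    """Keep only the most recent `seconds` of audio; older samples fall off."""

    def __init__(self, sample_rate, seconds):
        self.sample_rate = sample_rate
        # deque with maxlen silently discards the oldest items when full.
        self.buffer = deque(maxlen=sample_rate * seconds)

    def record(self, samples):
        self.buffer.extend(samples)

    def last(self, seconds):
        """Return up to the last `seconds` of buffered audio, oldest first."""
        n = min(len(self.buffer), self.sample_rate * seconds)
        return list(self.buffer)[len(self.buffer) - n:]


# Tiny numbers for illustration: 2 samples per second, 3-second loop.
loop = AudioLoop(sample_rate=2, seconds=3)
loop.record(range(10))
print(loop.last(1))  # [8, 9]
```

For a 15-minute loop the same class would be created with `seconds=15 * 60` and the real capture sample rate.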

SLIDE 22

Case Study: The Family Intercom

Use location sensing in context-aware environment to connect people in different places in a conversation

SLIDE 23

The Family Intercom (Ubicomp 2001)

[Figure: speech bubbles, speakers labeled “Mom” and “son”]

  • “How do I do this math homework?”
  • “I want to talk to Jamie.”
  • “Jamie, have you finished your homework?”
  • “He is alone in his room.”

SLIDE 24

The Family Intercom (Ubicomp 2001)

[Figure: speech bubbles, one speaker labeled “son”]

  • “What is this little two above the number?”
  • “… Power of 2. When you finish, come set the dinner table. Bye.”

SLIDE 25

Resources

Java Speech API:

Recognition and synthesis

http://java.sun.com/products/java-media/speech/

FreeTTS:

A Java port of a very high quality speech synthesis package:

http://freetts.sourceforge.net/docs/index.php
