Building Ubiquitous and Robust Speech and Natural Language Interfaces – PowerPoint PPT Presentation

SLIDE 1

Building Ubiquitous and Robust Speech and Natural Language Interfaces

Gary Geunbae Lee, Ph.D., Professor

  • Dept. CSE, POSTECH
SLIDE 2

IUI 2007 tutorial

Contents

  • PART-I: Statistical Speech/Language Processing (60min)

– Natural Language Processing – short intro

– Automatic Speech Recognition – (Spoken) Language Understanding

  • PART-II: Technology of Spoken Dialog Systems (80min)

– Spoken Dialog Systems – Dialog Management – Dialog Studio – Information Access Dialog – Emotional & Context-sensitive Chatbot – Multi-modal Dialog – Conversational Text-to-Speech

  • PART-III: Statistical Machine Translation (40min)

– Statistical Machine Translation – Phrase-based SMT – Speech Translation

SLIDE 3

Ubiquitous computing

  • Ubiquitous computing: network + sensor + computing
  • Pervasive computing
  • Third paradigm computing
  • Calm technology
  • Invisible computing
  • “I, Robot”-style interface – human language + hologram
SLIDE 4

Ubiquitous computer interface?

  • Computer – robot, home appliances, audio, telephone, fax machine, toaster, coffee machine, etc. (every object)

  • Universal speech interface project (CMU)
  • VoiceBox commercial systems
  • Telematics Dialog Interface (POSTECH, LG, DiQuest)
SLIDE 5

Example Domain

  • Tele-service
  • Car-navigation
  • Home networking
  • Robot interface

SLIDE 6

What’s hard – ambiguities, ambiguities, at all different levels

  • Lots of ambiguities

John stopped at the donut store on his way home from work. He thought a coffee was good every few hours. But it turned out to be too expensive there. [from J. Eisner lecture note]

  • donut: To get a donut (doughnut; spare tire) for his car?
  • Donut store: store where donuts shop? or is run by donuts? or looks like a big donut? or made of donut?
  • From work: Well, actually, he stopped there from hunger and exhaustion, not just from work.
  • Every few hours: That’s how often he thought it? Or that’s for coffee?
  • it: the particular coffee that was good every few hours? the donut store? the situation?
  • Too expensive: too expensive for what? what are we supposed to conclude about what John did?

SLIDE 7

Structural vs. Statistical: Technology innovation through dialectic

  • Statistical analysis: data-driven, empirical, connectionist; the speech community
  • Structural analysis: rule-driven, rational, symbolic; the NLU, Chomskian, Schankian, AI community

SLIDE 8

Structural NLP

  • grammar rules + lexicons

– Grammatical category (POS, syntactic category) – unification features (connectivity, agreements, semantics..)

  • chart parsing
  • compositional semantics
  • Limitation: enormous ambiguity

– “List the sales of the products produced in 1973 with the products produced in 1972” ==> 455 parses (Martin et al., 1981)

SLIDE 9

Statistical NLP

  • Grammar’s role? – estimating which word sequence is legal

– Pr(w1, w2, …, wn) = pr(w1) · pr(w2|w1) · pr(w3|w1, w2) · … · pr(wn|w1, …, wn-1)
– pr(w2|w1) = count(w1 w2) / count(w1) [MLE]
– E.g.) the (big, pig) dog
– Shannon game – predicting the next word given a word sequence

  • Language modeling – probability matrix

– Language model evaluation – cross entropy: −Σ pr(w1,n) log prM(w1,n)
– When prM(w1,n) = pr(w1,n), the cross entropy is minimized and the language model M is perfect
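As a minimal sketch of the MLE bigram estimate above (my own illustration; the toy corpus is invented, not from the tutorial):

```python
from collections import Counter

def bigram_mle(corpus):
    """pr(w2|w1) = count(w1 w2) / count(w1), estimated from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent          # sentence-start marker
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    def pr(w1, w2):
        return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return pr

corpus = [["the", "big", "dog"], ["the", "pig"], ["the", "big", "pig"]]
pr = bigram_mle(corpus)
print(pr("the", "big"))   # 2/3: "the" occurs 3 times, "the big" twice
```

The same counts drive the Shannon game: predicting the next word is just argmax over pr(·|w1).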

SLIDE 10

Contents

  • PART-I: Statistical Speech/Language Processing

– Natural Language Processing – short intro – Automatic Speech Recognition – (Spoken) Language Understanding

  • PART-II: Technology of Spoken Dialog Systems

– Spoken Dialog Systems – Dialog Management – Dialog Studio – Information Access Dialog – Emotional & Context-sensitive Chatbot – Multi-modal Dialog – Conversational Text-to-Speech

  • PART-III: Statistical Machine Translation

– Statistical Machine Translation – Phrase-based SMT – Speech Translation

SLIDE 11

The Noisy Channel Model

  • Automatic speech recognition (ASR) is a process by which an acoustic

speech signal is converted into a set of words [Rabiner et al., 1993]

  • The noisy channel model [Lee et al., 1996]

– Acoustic input is considered a noisy version of a source sentence

[Diagram: source sentence (“Where is the bus stop?”) → Noisy Channel → noisy sentence → Decoder → guess at the original sentence (“Where is the bus stop?”)]

SLIDE 12

The Noisy Channel Model

  • What is the most likely sentence out of all sentences in the language L

given some acoustic input O?

  • Treat acoustic input O as sequence of individual observations

– O = o1,o2,o3,…,ot

  • Define a sentence as a sequence of words:

– W = w1,w2,w3,…,wn

Ŵ = argmax_{W∈L} P(W | O)
  = argmax_{W∈L} P(O | W) P(W) / P(O)     (Bayes rule)
  = argmax_{W∈L} P(O | W) P(W)            (“golden rule”)
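As a toy illustration of the golden rule (my own sketch; the candidate sentences and score tables are invented stand-ins for a real acoustic model and language model):

```python
import math

def noisy_channel_decode(observation, candidates, acoustic_logprob, lm_logprob):
    """Pick the sentence W maximizing log P(O|W) + log P(W) over the candidates."""
    return max(candidates, key=lambda w: acoustic_logprob(observation, w) + lm_logprob(w))

# Hypothetical toy scores -- real systems use HMM acoustic models and n-gram LMs.
acoustic = {("obs", "where is the bus stop"): 0.6, ("obs", "wear is the bus stop"): 0.7}
lm = {"where is the bus stop": 0.3, "wear is the bus stop": 0.001}
best = noisy_channel_decode("obs", list(lm),
                            lambda o, w: math.log(acoustic[(o, w)]),
                            lambda w: math.log(lm[w]))
print(best)   # where is the bus stop
```

Note how the language model overrules the slightly better acoustic score of the homophone “wear”.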

SLIDE 13

Speech Recognition Architecture Meets Noisy Channel

[Diagram: speech signals → Feature Extraction → Decoding (Acoustic Model + Pronunciation Model + Language Model) → word sequence “Where is the bus stop?”. Network construction builds the models offline: Speech DB → HMM Estimation (acoustic model), lexicon → G2P (pronunciation model), Text Corpora → LM Estimation (language model)]

Ŵ = argmax_{W∈L} P(O | W) P(W)

SLIDE 14

Network Construction

[Diagram: phone-level acoustic models (I, L, S, A, M), a pronunciation model for the words 일, 이, 삼, 사 (one, two, three, four), and a language model giving word-transition probabilities P(일|x), P(이|x), P(삼|x), P(사|x) are composed into one search network with intra-word, between-word, and word transitions from start to end; the LM is applied at word transitions]

Search Network

  • Expanding every word to state level, we get a search network [Demuynck

et al., 1997]

SLIDE 15

References (1/2)

  • L. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer. 1986. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. ICASSP, pp.49–52.
  • C. Beaujard and M. Jardino. 1999. Language modeling based on automatic word concatenations. Proceedings of the 8th European Conference on Speech Communication and Technology, vol.4, pp.1563–1566.
  • K. Demuynck, J. Duchateau, and D. V. Compernolle. 1997. A static lexicon network representation for cross-word context dependent phones. Proceedings of the 5th European Conference on Speech Communication and Technology, pp.143–146.
  • T. J. Hazen, I. L. Hetherington, H. Shu, and K. Livescu. 2002. Pronunciation modeling using a finite-state transducer representation. Proceedings of the ISCA Workshop on Pronunciation Modeling and Lexicon Adaptation, pp.99–104.
  • M. Mohri, F. Pereira, and M. Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech and Language, vol.16, no.1, pp.69–88.

SLIDE 16

References (2/2)

  • B. H. Juang, S. E. Levinson, and M. M. Sondhi. 1986. Maximum likelihood estimation for multivariate mixture observations of Markov chains. IEEE Transactions on Information Theory, vol.32, no.2, pp.307–309.
  • C. H. Lee, F. K. Soong, and K. K. Paliwal. 1996. Automatic Speech and Speaker Recognition: Advanced Topics. Kluwer Academic Publishers.
  • K. K. Paliwal. 1992. Dimensionality reduction of the enhanced feature set for the HMM-based speech recognizer. Digital Signal Processing, vol.2, pp.157–173.
  • L. R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, vol.77, no.2, pp.257–286.
  • L. R. Rabiner and B. H. Juang. 1993. Fundamentals of Speech Recognition. Prentice-Hall.
  • S. J. Young, N. H. Russell, and J. H. S. Thornton. 1989. Token passing: a simple conceptual model for connected speech recognition systems. Technical Report CUED/F-INFENG/TR.38, Cambridge University Engineering Department.
  • S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland. 1996. The HTK Book. Entropic Cambridge Research Lab., Cambridge, UK.
SLIDE 17

Contents

  • PART-I: Statistical Speech/Language Processing

– Natural Language Processing – short intro – Automatic Speech Recognition – (Spoken) Language Understanding

  • PART-II: Technology of Spoken Dialog Systems

– Spoken Dialog Systems – Dialog Management – Dialog Studio – Information Access Dialog – Emotional & Context-sensitive Chatbot – Multi-modal Dialog – Conversational Text-to-Speech

  • PART-III: Statistical Machine Translation

– Statistical Machine Translation – Phrase-based SMT – Speech Translation

SLIDE 18

Spoken Language Understanding (SLU)

  • Spoken language understanding maps natural language speech to a frame structure encoding its meaning [Wang et al., 2005]
  • What’s the difference between NLU and SLU?

– Robustness: noisy and ungrammatical spoken language
– Domain dependence: deeper domain-specific semantics (e.g. Person vs. Cast)
– Dialog: dialog-history dependent, utterance-by-utterance analysis

  • Traditional approach: natural language to SQL conversion

[A typical ATIS system (from [Wang et al., 2005]): speech → ASR → text → SLU → semantic frame → SQL generation → SQL → database → response]

SLIDE 19

Semantic Representation

  • Semantic frame (frame and slot/value structure) [Gildea and Jurafsky, 2002]

– An intermediate semantic representation to serve as the interface between user and dialog system
– Each frame contains several typed components called slots. The type of a slot specifies what kind of fillers it is expecting.

“Show me flights from Seattle to Boston”

<frame name='ShowFlight' type='void'>
  <slot type='Subject'>FLIGHT</slot>
  <slot type='Flight'>
    <slot type='DCity'>SEA</slot>
    <slot type='ACity'>BOS</slot>
  </slot>
</frame>

Hierarchical form: ShowFlight → Subject: FLIGHT; Flight → Departure_City: SEA, Arrival_City: BOS

Semantic representation on the ATIS task; XML format and hierarchical representation [Wang et al., 2005]

SLIDE 20

Knowledge-based Systems

  • Knowledge-based systems:

– Developers write a syntactic/semantic grammar – A robust parser analyzes the input text with the grammar – Without a large amount of training data

  • Previous works

– MIT: TINA (natural language understanding) [Seneff, 1992]
– CMU: PHOENIX [Pellom et al., 1999]
– SRI: GEMINI [Dowding et al., 1993]

  • Disadvantages

1) Grammar development is an error-prone process
2) It takes multiple rounds to fine-tune a grammar
3) Combined linguistic and engineering expertise is required to construct a grammar with good coverage and optimized performance
4) Such a grammar is difficult and expensive to maintain

SLIDE 21

Statistical Systems

  • Statistical SLU approaches:

– The system can automatically learn from example sentences with their corresponding semantics
– The annotations are much easier to create and do not require specialized knowledge

  • Previous works

– Microsoft: HMM/CFG composite model [Wang et al., 2005] – AT&T: CHRONUS (Finite-state transducers) [Levin and Pieraccini, 1995] – Cambridge Univ: Hidden vector state model [He and Young, 2005] – Postech: Semantic frame extraction using statistical classifiers [Eun et al.,

2004; Eun et al., 2005; Jeong and Lee, 2006]

  • Disadvantages

1) Data-sparseness problem: the system requires a large corpus
2) Lack of domain knowledge

SLIDE 22

Reducing the Effort of Human Annotation

  • Active + Semi-supervised learning for SLU [Tur et al., 2005]

– Use raw data, and divide them into two sets Sraw = Sactive + Ssemi

[Loop: start from a small labeled set and raw data; train a model; predict each raw sample and estimate confidence; if confidence < threshold, the sample goes to active learning (human labeling); if above the threshold, the machine-labeled sample is kept (semi-supervised); both augment the training data and the model is retrained]
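A sketch of the routing step in this loop (my own simplification; `model_confidence` stands in for a trained classifier’s confidence score, and the utterances are invented):

```python
def split_for_labeling(samples, model_confidence, threshold=0.8):
    """Route raw samples: low confidence -> human annotation (active learning),
    high confidence -> keep the machine label (semi-supervised learning)."""
    s_active, s_semi = [], []
    for x in samples:
        (s_semi if model_confidence(x) > threshold else s_active).append(x)
    return s_active, s_semi

confidence = {"what is on tv": 0.92, "uh find me uh": 0.35, "record channel seven": 0.88}
active, semi = split_for_labeling(list(confidence), confidence.get)
print(active)   # ['uh find me uh']
```

Only the low-confidence utterance is sent to a human, which is where the annotation savings come from.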

SLIDE 23

Semantic Frame Extraction

[Overall architecture for the semantic analyzer: information source → feature extraction / selection → dialog act identification + frame-slot extraction + relation extraction → unification]

I like DisneyWorld. Domain: Chat Dialog Act: Statement Main Action: Like Object.Location=DisneyWorld

Examples of semantic frame structure

  • Semantic Frame Extraction (~ Information Extraction Approach)

1) Dialog act / Main action Identification ~ Classification 2) Frame-Slot Object Extraction ~ Named Entity Recognition 3) Object-Attribute Attachment ~ Relation Extraction – 1) + 2) + 3) ~ Unification

How to get to DisneyWorld? Domain: Navigation Dialog Act: WH-question Main Action: Search Object.Location.Destination=DisneyWorld

SLIDE 24

Frame-Slot Object Extraction

Sequence labeling inference: Conditional Random Fields [Lafferty et al. 2001]

[Linear-chain CRF graph: labels y_{t−1}, y_t, y_{t+1} connected in a chain, each conditioned on observations x_{t−1}, x_t, x_{t+1}]

CRF = undirected graphical model

  • Frame-Slot Extraction ~ NER = Sequence Labeling Problem
  • A probabilistic model
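For reference, the linear-chain CRF named above defines the conditional probability of a label sequence y given observations x as (standard form from Lafferty et al., 2001, not reproduced from the slide):

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\!\Big(\sum_{t=1}^{T}\sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x, t)\Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big(\sum_{t=1}^{T}\sum_{k} \lambda_k\, f_k(y'_{t-1}, y'_t, x, t)\Big)
```

where the f_k are feature functions over adjacent labels and the observation sequence, and the λ_k are their learned weights.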
SLIDE 25

Long-distance Dependency in NER

  • “fly from denver to chicago on dec. 10th 1999” → dec = DEPART.MONTH
  • “return from denver to chicago on dec. 10th 1999” → dec = RETURN.MONTH

Feature Gain

  • A Solution: Trigger-Induced CRF [Jeong and Lee, 2006]

– The basic idea is to add only bundles of (trigger) features that increase the log-likelihood of the training data
– The gain of candidate (trigger) features is measured using the Kullback-Leibler divergence

SLIDE 26

References (1/2)

  • J. Dowding, J. M. Gawron, D. Appelt, J. Bear, L. Cherny, R. Moore, and D. Moran. 1993. Gemini: A natural language system for spoken-language understanding. ACL, pp.54–61.
  • J. Eun, C. Lee, and G. G. Lee. 2004. An information extraction approach for spoken language understanding. ICSLP.
  • J. Eun, M. Jeong, and G. G. Lee. 2005. A multiple classifier-based concept-spotting approach for robust spoken language understanding. Interspeech 2005–Eurospeech.
  • D. Gildea and D. Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.
  • Y. He and S. Young. 2005. Semantic processing using the Hidden Vector State model. Computer Speech and Language, 19(1):85–106.
  • M. Jeong and G. G. Lee. 2006. Exploiting non-local features for spoken language understanding. COLING/ACL.
  • J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. ICML.

SLIDE 27

References (2/2)

  • E. Levin and R. Pieraccini. 1995. CHRONUS, the next generation. Proceedings of the 1995 ARPA Spoken Language Systems Technical Workshop, pp.269–271, Austin, Texas.
  • B. Pellom, W. Ward, and S. Pradhan. 2000. The CU Communicator: An architecture for dialogue systems. ICSLP.
  • R. E. Schapire, M. Rochery, M. Rahim, and N. Gupta. 2002. Incorporating prior knowledge into boosting. ICML, pp.538–545.
  • S. Seneff. 1992. TINA: a natural language system for spoken language applications. Computational Linguistics, 18(1):61–86.
  • G. Tur, D. Hakkani-Tur, and R. E. Schapire. 2005. Combining active and semi-supervised learning for spoken language understanding. Speech Communication, 45:171–186.
  • Y. Wang, L. Deng, and A. Acero. 2005. Spoken language understanding: An introduction to the statistical framework. IEEE Signal Processing Magazine, 27(5).

SLIDE 28

Contents

  • PART-I: Statistical Speech/Language Processing

– Natural Language Processing – short intro – Automatic Speech Recognition – (Spoken) Language Understanding

  • PART-II: Technology of Spoken Dialog Systems

– Spoken Dialog Systems – Dialog Management – Dialog Studio – Information Access Dialog – Emotional & Context-sensitive Chatbot – Multi-modal Dialog – Conversational Text-to-Speech

  • PART-III: Statistical Machine Translation

– Statistical Machine Translation – Phrase-based SMT – Speech Translation

SLIDE 29

  • Dialog for EPG (POSTECH)
  • Unified Chatting and Goal-oriented Dialog (POSTECH)

SLIDE 30

Spoken Dialog System

[Pipeline: User Speech → ASR → Recognized Sentence → SLU → Semantic Meaning → DM → System Action → RG → System Speech; models and rules back each component]

  • User speech: “I need a flight from Washington DC to Denver roundtrip”
  • Semantic meaning (SLU): ORIGIN_CITY: WASHINGTON; DESTINATION_CITY: DENVER; FLIGHT_TYPE: ROUNDTRIP
  • System action (DM): GET DEPARTURE_DATE
  • System speech (RG): “Which date do you want to fly from Washington to Denver?”

SLIDE 31

VoiceXML-based System

  • What is VoiceXML?

– The HTML(XML) of the voice web. [W3C, working draft] – The open standard markup language for voice application

  • Can do

– Rapid implementation and management
– Integration with the World Wide Web
– Mixed-initiative dialog
– Telephone push-button (DTMF) input
– Simple dialog implementation solution

  • VoiceXML dialogs are built from

– <menu>, <form> (similar to a slot-and-filler system)

  • Limiting the user’s responses

– Verification and help for invalid responses
– Good speech recognition accuracy

SLIDE 32

Example – <Form>

<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="login">
    <field name="phone_number" type="phone">
      <prompt>Please say your complete phone number</prompt>
    </field>
    <field name="pin_code" type="digits">
      <prompt>Please say your PIN code</prompt>
    </field>
    <block>
      <submit next="http://www.example.com/servlet/login"
              namelist="phone_number pin_code"/>
    </block>
  </form>
</vxml>

Browser: Please say your complete phone number
User: 800-555-1212
Browser: Please say your PIN code
User: 1 2 3 4

SLIDE 33

Frame-based Approach

  • Frame-based system [McTear, 2004]

– Asks the user questions to fill slots in a template in order to perform a task (form-filling task) – Permits the user to respond more flexibly to the system’s prompts (as in Example 2.) – Recognizes the main concepts in the user’s utterance

Example 1)

  • System: What is your destination?
  • User: London.
  • System: What day do you want to travel?
  • User: Friday.

Example 2)

  • System: What is your destination?
  • User: London on Friday around 10 in the morning.
  • System: I have the following connection …

SLIDE 34

Agent-Based Approach

  • Properties [Allen et al., 1996]

– Complex communication using unrestricted natural language – Mixed-Initiative – Co-operative problem solving – Theorem proving, planning, distributed architectures – Conversational agents

  • An example
  • System attempts to provide a more co-operative response that might

address the user’s needs.

User: I’m looking for a job in the Calais area. Are there any servers?
System: No, there aren’t any employment servers for Calais. However, there is an employment server for Pas-de-Calais and an employment server for Lille. Are you interested in one of these?

SLIDE 35

Galaxy Communicator Framework

  • The Galaxy Communicator software infrastructure is a distributed,

message-based, hub-and-spoke infrastructure optimized for constructing spoken dialog systems. [Bayer et al., 2001]

  • An open source architecture for constructing dialog systems
  • History: originated in the MIT Galaxy system; developed and maintained by MITRE

– Message-passing protocol
– Hub-and-clients architecture

SLIDE 36

References (1/2)

  • J. F. Allen, B. Miller, E. Ringger, and T. Sikorski. 1996. A robust system for natural spoken dialogue. ACL.
  • S. Bayer, C. Doran, and B. George. 2001. Dialogue interaction with the DARPA Communicator infrastructure: The development of useful software. HLT Research.
  • R. Cole, editor. 1997. Survey of the State of the Art in Human Language Technology. Cambridge University Press, New York, NY, USA.
  • G. Ferguson and J. F. Allen. 1998. TRIPS: An integrated intelligent problem-solving assistant. AAAI, pp.26–30.
  • K. Komatani, F. Adachi, S. Ueno, T. Kawahara, and H. Okuno. 2003. Flexible spoken dialogue system based on user models and dynamic generation of VoiceXML scripts. SIGDIAL.
  • S. Larsson and D. Traum. 2000. Information state and dialogue management in the TRINDI Dialogue Move Engine Toolkit. Natural Language Engineering, 6(3-4).
  • S. Lang, M. Kleinehagenbrock, S. Hohenner, J. Fritsch, G. A. Fink, and G. Sagerer. 2003. Providing the basis for human-robot interaction: A multi-modal attention system for a mobile robot. ICMI, pp.28–35.

SLIDE 37

References (2/2)

  • E. Levin, R. Pieraccini, and W. Eckert. 2000. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing, 8(1):11–23.
  • C. Lee, S. Jung, J. Eun, M. Jeong, and G. G. Lee. 2006. A situation-based dialogue management using dialogue examples. ICASSP.
  • M. Walker, L. Hirschman, and J. Aberdeen. 2000. Evaluation for DARPA Communicator spoken dialogue systems. LREC.
  • M. F. McTear. 2004. Spoken Dialogue Technology. Springer.
  • I. O’Neil, P. Hanna, X. Liu, D. Greer, and M. McTear. 2005. Implementing advanced spoken dialog management in Java. Speech Communication, 54(1):99–124.
  • B. Pellom, W. Ward, and S. Pradhan. 2000. The CU Communicator: An architecture for dialogue systems. ICSLP.
  • A. Rudnicky, E. Thayer, P. Constantinides, C. Tchou, R. Shern, K. Lenzo, W. Xu, and A. Oh. 1999. Creating natural dialogs in the Carnegie Mellon Communicator system. Eurospeech, vol.4, pp.1531–1534.
  • W3C. Voice Extensible Markup Language (VoiceXML) Version 2.0, Working Draft. http://www.w3c.org/TR/voicexml20/

SLIDE 38

Contents

  • PART-I: Statistical Speech/Language Processing

– Natural Language Processing – short intro – Automatic Speech Recognition – (Spoken) Language Understanding

  • PART-II: Technology of Spoken Dialog Systems

– Spoken Dialog Systems – Dialog Management – Dialog Studio – Information Access Dialog – Emotional & Context-sensitive Chatbot – Multi-modal Dialog – Conversational Text-to-Speech

  • PART-III: Statistical Machine Translation

– Statistical Machine Translation – Phrase-based SMT – Speech Translation

SLIDE 39

The Role of Dialog Management

  • For example, in the flight reservation system

– System: Welcome to the Flight Information Service. Where would you like to travel to?
– Caller: I would like to fly to London on Friday, arriving around 9 in the morning.
– System: There is a flight that departs at 7:45 a.m. and arrives at 8:50 a.m.

In order to process this utterance, the system has to engage in the following processes:
1) Recognize the words that the caller said. (Speech Recognition)
2) Assign a meaning to these words. (Language Understanding)
3) Determine how the utterance fits into the dialog so far and decide what to do next. (Dialog Management)

SLIDE 40

Information State Update Approach – Rule-based DM (Larsson and Traum, 2000)

  • A method of specifying a dialogue theory that makes it straightforward

to implement

  • Consisting of following five constituents

– Information Components

– Including aspects of common context
– (e.g., participants, common ground, linguistic and intentional structure, obligations and commitments, beliefs, intentions, user models, etc.)

– Formal Representations

– How to model the information components
– (e.g., as lists, sets, typed feature structures, records, etc.)

SLIDE 41

Information State Approach

– Dialogue Moves

– Trigger the update of the information state – Be correlated with externally performed actions

– Update Rules

– Govern the updating of the information state

– Update Strategy

– For deciding which rules to apply at a given point from the set of applicable ones
SLIDE 42

Example Dialogue

SLIDE 43

Example Dialogue

SLIDE 44

  • A Tree Branching for Every Possible Situation

– It can become very complex.

Start → {Information + Origin | + Destination | + Origin + Dest. | + Date | + Origin + Date | + Origin + Dest. + Date | + Dest. + Date} → Flight # → {Flight # + Date | Flight # + Information} → Flight # + Reservation

The Hand-crafted Dialog Model is Not Domain Portable

SLIDE 45

An Optimization Problem

  • Dialog Management as an Optimization Problem

– Optimization Goal

– Achieve the application goal while minimizing a cost function (= objective function)

– In general: minimize the number of user-system turns and DB accesses until all slots are filled

– Simple Example : Month and Day Problem

– Designing a dialog system that gets a correct date (month and day) from a user through the shortest possible interaction
– Objective function: C = ωi·(# interactions) + ωe·(# errors) + ωf·(# unfilled slots)

  • How to Mathematically Formalize?

– Markov Decision Process (MDP)

SLIDE 46

Mathematical Formalization

  • Markov Decision Process (MDP) (Levin et al 2000)

– Problems with cost (or reward) objective function are well modeled as Markov Decision Process. – The specification of a sequential decision problem for a fully observable environment that satisfies the Markov Assumption and yields additive rewards.

[Loop: Dialog Manager → dialog action (prompts, queries, etc.) → Environment (user, external DB, or other servers) → dialog state + cost (turn, error, DB access, etc.) → back to the Dialog Manager]

SLIDE 47

Month and Day Example

The optimal strategy is the one that minimizes the cost.

Strategy 1: say “Good bye.” immediately (ask nothing).
Strategy 2: ask the open question “Which date?” (month + day at once, error rate P1), then “Good bye.”
Strategy 3: ask “Which month?” and “Which day?” separately (error rate P2 each), then “Good bye.”

C1 = 1·ωi + 2·ωf
C2 = 2·ωi + 2·P1·ωe
C3 = 3·ωi + 2·P2·ωe

Strategy 1 is optimal if ωi + P2·ωe − ωf > 0 (the recognition error rate is too high, so it is cheaper not to ask).
Strategy 3 is optimal if 2·(P1 − P2)·ωe − ωi > 0 (P1 is much higher than P2, which justifies the cost of a longer interaction).
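The three strategy costs can be compared numerically; a small sketch under invented weights (ωi = 1, ωe = 10, ωf = 20) and error rates (P1 = 0.3 for the open date question, P2 = 0.05 for the constrained ones):

```python
def dialog_costs(w_i, w_e, w_f, p1, p2):
    """Expected costs of the three month/day strategies."""
    return {
        "strategy1": 1 * w_i + 2 * w_f,        # give up: 1 turn, 2 unfilled slots
        "strategy2": 2 * w_i + 2 * p1 * w_e,   # one open "Which date?" question
        "strategy3": 3 * w_i + 2 * p2 * w_e,   # separate month and day questions
    }

costs = dialog_costs(w_i=1.0, w_e=10.0, w_f=20.0, p1=0.3, p2=0.05)
print(min(costs, key=costs.get))   # strategy3
```

With these toy numbers the extra turn of Strategy 3 is cheaper than the open question’s higher error rate, which is exactly the trade-off the optimality conditions above express.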

SLIDE 48

POMDP (Young 2002)

  • Partially Observable Markov Decision Process (POMDP)

– POMDP extends Markov Decision Process by removing the requirement that the system knows its current state precisely.

– Instead, the system makes observations about the outside world which give incomplete information about the true current state.

– Belief state: a distribution over MDP states, maintained because the exact state is unknown.

[MDP vs POMDP: an MDP tracks the current state s exactly and receives reward r(s, a); a POMDP maintains a belief b(s) over states and receives ρ(b, a)]

Belief update after action a and observation o:

b′(s′) = p(s′ | o, a, b) = p(o | s′, a) · Σ_{s∈S} p(s′ | s, a) · b(s) / p(o | a, b)

Reward over belief states:

ρ(b, a) = Σ_{s∈S} b(s) · r(s, a)
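The belief update above, as a toy two-state sketch (the state names, transition table, and observation probabilities are invented for illustration):

```python
def belief_update(belief, action, obs, trans, obs_model):
    """b'(s') = p(o|s',a) * sum_s p(s'|s,a) * b(s), then normalize by p(o|a,b)."""
    states = list(belief)
    new_b = {s2: obs_model[(obs, s2, action)]
                 * sum(trans[(s2, s, action)] * belief[s] for s in states)
             for s2 in states}
    z = sum(new_b.values())                     # normalizer p(o|a,b)
    return {s: v / z for s, v in new_b.items()}

belief = {"want_month": 0.5, "want_day": 0.5}
trans = {(s2, s, "ask"): float(s2 == s) for s2 in belief for s in belief}  # static user goal
obs_model = {("said_month", "want_month", "ask"): 0.9,
             ("said_month", "want_day", "ask"): 0.1}
b2 = belief_update(belief, "ask", "said_month", trans, obs_model)
print(b2["want_month"])   # ~0.9
```

One ambiguous observation shifts the belief without ever committing to a single state, which is what lets a POMDP dialog manager hedge against recognition errors.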

SLIDE 49

Example-based Dialog Model Learning (Lee et al 2006)

  • Example-Based Dialog Modeling

– Automatically modeling from dialog corpus

– Example-based techniques using dialog example database (DEDB). – This model is simple and domain portable.

– DEDB Indexing and Searching

– Query key : user intention, semantic frames, discourse history.

– Tie-breaking

– Utterance similarity Measure

– Lexico-Semantic Similarity : Normalized edit distance – Discourse History Similarity : Cosine similarity
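The two tie-breaking measures can be sketched as follows (my own implementation; the token lists and slot-filling vectors are illustrative):

```python
import math

def normalized_edit_distance(a, b):
    """Levenshtein distance over token sequences, normalized to [0, 1]."""
    m, n = len(a), len(b)
    d = [[i + j if i * j == 0 else 0 for j in range(n + 1)] for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return d[m][n] / max(m, n, 1)

def cosine_similarity(u, v):
    """Cosine similarity between two slot-filling vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

print(normalized_edit_distance(["a", "b"], ["a", "c"]))      # 0.5
print(cosine_similarity([1, 0, 1, 0, 0], [1, 0, 0, 1, 0]))   # ≈ 0.5
```

In the example-based model, the edit distance compares lexico-semantic token sequences while the cosine compares discourse-history vectors; the candidate with the best combined score wins the tie-break.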

SLIDE 50

  • Indexing and Querying

– Semantic-based indexing for the dialog example database

– A lexical-based example database needs many more examples.
– The SLU result is the most important index key.

– Automatic indexing from the dialog corpus.

Example-based Dialog Modeling

Indexing key (input: user utterance → output: system concept):

  Utterance: 그럼 SBS 드라마는 언제 하지? (Then when is the SBS drama showing?)
  Dialog Act: Wh-question
  Main Action: Search_start_time
  Component Slots: [channel = SBS, genre = drama]
  Discourse History: [1,0,1,0,0,0,0,0,0]
  System Action: Inform(date, start_time, program)

SLIDE 51

Example-based Dialog Modeling

  • Tie-breaking

– Lexico-Semantic Representation – Utterance Similarity Measure

Current user utterance: 그럼 SBS 드라마는 언제 하지? (Then when is the SBS drama showing?)
  Component slots: [channel = SBS, genre = 드라마 (drama)]
  Lexico-semantic representation: 그럼 [channel] [genre] 는 언제 하지
  Slot-filling vector: [1,0,1,0,0,0,0,0,0]

Retrieved example: [date] [genre] 는 몇 시에 하니
  Slot-filling vector: [1,0,0,1,0,0,0,0,0]

Tie-breaking compares lexico-semantic similarity and discourse history similarity.

SLIDE 52

Strategy of Example-based Dialog Modeling

[Flow: a Dialogue Corpus is automatically indexed (with help from a Domain Expert) into the Dialogue Example DB. At run time, the user’s utterance is mapped to a semantic frame and user intention; together with the discourse history these generate a query; retrieval returns candidate dialogue examples; tie-breaking by utterance similarity (lexico-semantic similarity + discourse history similarity) selects the best dialogue example, which determines the system responses]

SLIDE 53

EPG DEDB, EPG Dialog Corpus

EPG Expert

Discourse History Stack

  • previous user utterance
  • previous dialog act and

semantic frame

  • previous slot-filling vector

Frame-Slot Extraction (EPG) / Dialog Act Identification / Discourse Inference — USER: What is on TV now?

  • Agent = Task
  • Domain = EPG
  • Dialog Act = Wh-question
  • Main Action= Search_Program
  • Start_Time = now

Retrieved Dialog Examples

  • Calculate utterance

similarity

System Response

EPG Meta-Rule

XML Rule Parser

  • When no example is

retrieved, meta-rules are used.

Domain Spotter Agent Spotter

SYSTEM: “XXX” is on SBS, …

Web Contents

Database Manager

TV Schedule Database

Multi-domain/genre Dialog Expert

SLIDE 54

References

  • S. Larsson and D. Traum, “Information state and dialogue management in the TRINDI

Dialogue Move Engine Toolkit”, Natural Language Engineering, vol. 6, no. 3-4, pp. 323- 340, 2000

  • E. Levin, R. Pieraccini, and W. Eckert, “A stochastic model of human-machine interaction

for learning dialog strategies”, IEEE Transactions on Speech and Audio Processing, vol. 8, no. 1, pp. 11-23, 2000.

  • Steve Young. Talking to Machine (Statistically speaking), ICASSP 2002 , Denver
  • I. Lane and T. Kawahara. 2006. Verification of speech recognition results incorporating

in-domain confidence and discourse coherence measures, IEICE Transactions on Information and Systems, 89(3):931-938.

  • C. Lee, S. Jung, J. Eun, M. Jeong, and G. G. Lee. 2006. A Situation-based Dialogue

Management using Dialogue Examples. ICASSP.

  • C. Lee, S.Jung, M. Jeong, and G. G. Lee. 2006. Chat and Goal-Oriented Dialog

Together: A Unified Example-based Architecture for Multi-Domain Dialog Management, Proceedings of the IEEE/ACL 2006 workshop on spoken language technology (SLT), Aruba.

  • D. Litman and S. Pan. 1999. Empirically evaluating an adaptable spoken dialogue system.

International Conference on User Modeling, pp. 55-64.

  • M. F. McTear, Spoken Dialogue Technology, Springer, 2004.
  • I. O'Neill, P. Hanna, X. Liu, D. Greer, and M. McTear. 2005. Implementing advanced spoken dialog management in Java. Science of Computer Programming, 54(1):99-124.

  • M. Walker, D. Litman, C. Kamm, and A. Abella. 1997. PARADISE: A general

framework for evaluating spoken dialogue agents. ACL/EACL, pp271-280.

slide-55
SLIDE 55


Contents

  • PART-I: Statistical Speech/Language Processing

– Natural Language Processing – short intro – Automatic Speech Recognition – (Spoken) Language Understanding

  • PART-II: Technology of Spoken Dialog Systems

– Spoken Dialog Systems – Dialog Management – Dialog Studio – Information Access Dialog – Emotional & Context-sensitive Chatbot – Multi-modal Dialog – Conversational Text-to-Speech

  • PART-III: Statistical Machine Translation

– Statistical Machine Translation – Phrase-based SMT – Speech Translation

slide-56
SLIDE 56


  • Motivation

– The biggest obstacle to deploying dialog systems in practice is that

"System maintenance is difficult!"

– Practical dialog systems need:

– Easy and fast dialog modeling to handle new patterns of dialog
– Easy construction of new information sources
– A TV-guide domain needs a new TV schedule every day
– Reduced human effort for maintenance
– All dialog components should stay synchronized!
– Easy tutoring of the system
– Semi-automatic learning is necessary; humans can't teach everything.

  • Previous work

– Rapid application development; CSLU Toolkit [CSLU Toolkit]
– Semantic grammar design & management; SGStudio [Wang and Acero, 2005]
– Helping non-experts develop a user interface; SUEDE [Anoop et al., 2001]

Dialog Workbench/Studio

slide-57
SLIDE 57


Dialog Workbench

  • Dialog Studio [Jung et al., 2006]

– A dialog workbench system for example-based spoken dialog systems
– Capabilities:

– Tutor the dialog system by adding & editing dialog examples
– Synchronize all dialog components (ASR + SLU + DM + information access)
– Provide semi-automatic learning
– Reduce human effort for building and maintaining dialog systems

– Key idea

– Generate possible dialog candidates from a corpus
– Predict the dialog tagging information using the current model
– A human approves or disapproves each candidate.

slide-58
SLIDE 58


Issue – "Human Effort Reduction"

  • New dialog example tagging

– Can be supported by the system using old models.

– The Dialog Utterance Pool (DUP) automatically generates candidate instances.

– The administrator can audit the DUP and modify the instances.

– ASR and SLU models are then retrained automatically.

Workflow: a new dialog utterance arrives → the old dialog manager tries to handle it → the result is displayed → a human audits & modifies the result.

Dialog Example Editing

(Figure: the Dialog Utterance Pool of automatically generated example candidates feeds new corpus generation, example-DB indexing, and recommendation generation; after audit & modify, the ASR model, SLU model, and example-based DM model are updated.)

slide-59
SLIDE 59


POSTECH Dialog Studio Demo

slide-60
SLIDE 60


References (1/2)

  • S. J. Cox, and S. Dasmahapatra. 2000. A semantically-based confidence

measure for speech recognition. In Proc. of the ICSLP 2000, Beijing.

  • J. Eun, C. Lee, and G. G. Lee. 2004. An information extraction

approach for spoken language understanding. In: Proc. of the ICSLP, Jeju Korea.

  • T. J. Hazen, J. Polifroni, and S. Seneff. 2002. Recognition confidence

scoring and its use in speech language understanding systems. Computer Speech and Language, vol. 16, no. 1, pp. 49–67.

  • T. J. Hazen, T. Burianek, J. Polifroni, and S. Seneff. 2000. Recognition

confidence scoring for use in speech understanding systems. In Proc. of the the ISCA ASR2000 Tutorial and Research Workshop, Paris.

  • H. Jiang. 2005. Confidence measures for speech recognition. Speech

Communication, vol. 45, no. 4, pp. 455–470.

  • S. Jung, C. Lee, and G. G. Lee. 2006. Three-Phase Verification for

Spoken Dialog System. In Proc. IUI.

slide-61
SLIDE 61


References (2/2)

  • M. McTear, I. O’Neill, P. Hanna, and X. Liu. 2005. Handling errors and

determining confirmation strategies - an object-based approach. Speech Communication, vol. 45, no. 3, pp. 249–269.

  • I. O'Neill, P. Hanna, X. Liu, D. Greer, and M. McTear. 2005. Implementing advanced spoken dialogue management in Java. Science of Computer Programming, vol. 54, no. 1, pp. 99–124.
  • T. Paek, and E. Horvitz. 2000. Conversation as action under uncertainty.

In Proc. of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 455-464.

  • A. Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. Dissertation, University of Pennsylvania.

  • F. Torres, L. F. Hurtado, F. García, E. Sanchis, and E. Segarra. 2005. Error handling in a stochastic dialog system through confidence measures. Speech Communication, vol. 45, no. 3, pp. 211–229.
slide-62
SLIDE 62


References

  • K. S. Anoop, R.K. Scott, J. Chen, A. Landay, and C. Chen, 2001.

SUEDE: Iterative, Informal Prototyping for Speech Interfaces. Video poster in Extended Abstracts of Human Factors in Computing Systems: CHI, Seattle, WA, pp. 203-204.

  • S. Jung, C. Lee, G. G. Lee. 2006. Dialog Studio: An Example Based

Spoken Dialog System Development Workbench, Dialogs on dialog: Multidisciplinary Evaluation of Advanced Speech-based Interactive Systems, Interspeech2006-ICSLP satellite workshop

  • Y. Wang, and A. Acero. 2005. SGStudio: Rapid Semantic Grammar

Development for Spoken Language Understanding. Proceedings of the Eurospeech Conference. Lisbon, Portugal.

  • CSLU Toolkit, http://cslu.cse.ogi.edu/toolkit/
slide-63
SLIDE 63


Contents

  • PART-I: Statistical Speech/Language Processing

– Natural Language Processing – short intro – Automatic Speech Recognition – (Spoken) Language Understanding

  • PART-II: Technology of Spoken Dialog Systems

– Spoken Dialog Systems – Dialog Management – Dialog Studio – Information Access Dialog – Emotional & Context-sensitive Chatbot – Multi-modal Dialog – Conversational Text-to-Speech

  • PART-III: Statistical Machine Translation

– Statistical Machine Translation – Phrase-based SMT – Speech Translation

slide-64
SLIDE 64


Information Access Dialog

Information Sources Dialog Manager Question Query Answer Result

slide-65
SLIDE 65


Information Access Agent RDB Access Module Question Answering Module Relational Database WEB

Information Sources

slide-66
SLIDE 66


Building Relational DB from Unstructured Data

  • A Relational DB Model is Equivalent to an Entity-Relationship Model
  • We can build an ER Model with the Information Extraction Approach

– Named-Entity Recognition (NER)
– Relation Extraction

(Figure: WEB → information extraction → Relational Database)

slide-67
SLIDE 67


  • Named-Entity Recognition (NER)

– A task that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, etc. [Chinchor, 1998]

Hillary Clinton [Person] moved to New York [Geo-Political Entity] last year.

Named-Entity Recognition
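The input/output of the NER task can be sketched with a toy gazetteer lookup emitting BIO labels. Real NER systems (e.g., the CRF-based systems cited later in this tutorial) learn from rich features; the gazetteer and all names here are assumptions for illustration only:

```python
# A minimal gazetteer-based named-entity tagger emitting BIO labels.
GAZETTEER = {
    ("hillary", "clinton"): "PERSON",
    ("new", "york"): "GPE",  # geo-political entity
}

def tag_bio(tokens):
    """Return one BIO label per token: B-TYPE, I-TYPE, or O."""
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for entry, etype in GAZETTEER.items():
        n = len(entry)
        for i in range(len(tokens) - n + 1):
            if tuple(lowered[i:i + n]) == entry:
                labels[i] = "B-" + etype
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + etype
    return labels
```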

slide-68
SLIDE 68


Relation Extraction

  • Relation Extraction

– A task that detects and classifies relations between named entities

Hillary Clinton [Person] moved to New York [Geo-Political Entity] last year. → AT.Residence

slide-69
SLIDE 69


Question Answering

  • Question Answering System for Information Access Dialog System

– SiteQ [Lee et al., 2001; Lee and Lee, 2002]
– Search answers, not documents

(Pipeline figure: Question → POS tagging → answer type identification / query formation → document retrieval → dynamic answer passage selection → answer finding → answer justification → Answer)

slide-70
SLIDE 70


References (1/2)

  • C. Blaschke, L. Hirschman, and A.Yeh. 2004. BioCreative Workshop.
  • N. Chinchor. 1998. Overview of MUC-7/MET-2, MUC-7.
  • N. Kambhatla. 2004. Combining lexical, syntactic and semantic features with

Maximum Entropy models for extracting relations. ACL.

  • E. Kim, Y. Song, C. Lee, K. Kim, G. G. Lee, B. Yi, and J. Cha. 2006. Two-

phase learning for biological event extraction and verification. ACM TALIP 5(1):61-73

  • J. Kim, T. Ohta, Y. Tsuruoka, and Y. Tateisi. 2003. GENIA corpus - a

semantically annotated corpus for bio-textmining, Bioinformatics, Vol 19 Suppl.1, pp. 180-182.

  • J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields:

probabilistic models for segmenting and labelling sequence data. ICML.

  • G. G. Lee, J. Seo, S. Lee, H. Jung, B. H. Cho, C. Lee, B. Kwak, J. Cha, D.

Kim, J. An, H. Kim, and K. Kim. 2001. SiteQ: Engineering High Performance QA system Using Lexico-Semantic Pattern Matching and Shallow NLP. TREC-10.

slide-71
SLIDE 71


References (2/2)

  • S. Lee and G. G. Lee. 2002. SiteQ/J: A question answering system for Japanese. NTCIR Workshop 3 Meeting: evaluation of information retrieval, automatic text summarization and question answering, QA tasks.

  • A. McCallum, and W. Li. 2003. Early results for named entity recognition with

conditional random fields, feature induction and web-enhanced lexicons, CoNLL.

  • S. Soderland. 1999. Learning information extraction rules for semi-structured

and free text. Machine Learning, 34, 233-72

  • Y. Song, E. Kim, G. G. Lee, and B. Yi. 2005. POSBIOTM-NER: a trainable

biomedical named-entity recognition system. Bioinformatics, 21 (11): 2794- 2796.

  • G. Zhou, J. Su, J. Zhang, M. Zhang. 2005. Exploring Various Knowledge in

Relation Extraction. ACL.

slide-72
SLIDE 72


Contents

  • PART-I: Statistical Speech/Language Processing

– Natural Language Processing – short intro – Automatic Speech Recognition – (Spoken) Language Understanding

  • PART-II: Technology of Spoken Dialog Systems

– Spoken Dialog Systems – Dialog Management – Dialog Studio – Information Access Dialog – Emotional & Context-sensitive Chatbot – Multi-modal Dialog – Conversational Text-to-Speech

  • PART-III: Statistical Machine Translation

– Statistical Machine Translation – Phrase-based SMT – Speech Translation

slide-73
SLIDE 73


POSTECH Chatbot Demo

slide-74
SLIDE 74


Emotion Recognition

  • Emotion Recognition
  • Why is Emotion Recognition important in dialog systems?

– Emotion is a part of User Context.

– It has been recognized as one of the most significant factors in how people communicate with each other. [T. Polzin, 2000]

– Application: Affective HCI (Human-Computer Interface)

– Home networking, intelligent robots, chatbots, …

"I feel blue today."
"Do you need a cheer-up music?"
"What up?"

slide-75
SLIDE 75


Traditional Emotion Recognition

USER: I am very happy.

(Figure: facial expression, speech, and text inputs pass through facial expression analysis, speech analysis, and linguistic analysis; a classifier for the final emotion decision outputs an emotion hypothesis.)

slide-76
SLIDE 76


Emotional Categories

  • Emotional Speech DB

– Positive: confident, encouraging, friendly, happy, interested
– Negative: angry, anxious, bored, frustrated, sad, fear
– Neutral
– Ex) EPSaT (Emotional Prosody Speech and Transcription), SiTEC DB

  • Call Center

– Positive, Non-positive
– Anger, Fear, Satisfaction, Excuse, Neutral
– Ex) HMIHY, stock exchange customer service center

  • Tutor System

– Positive, Negative, Neutral
– Ex) ITSpoke

  • Chat Messenger

– Neutral, Happy, Sad, Surprise, Afraid, Disgusted, Bored, …

slide-77
SLIDE 77


Emotional Features

  • Speech-to-Emotion

– Acoustic correlates of speech prosody, such as pitch, energy, and speaking rate, have been used for recognizing emotions.

– In general, the features extracted from speech play a significant role in recognizing emotion.

Feature-Set / Description:
– Acoustic-Prosodic: fundamental frequency (F0): max, min, mean, standard deviation; energy: max, min, mean, standard deviation; speaking rate: voiced frames / total frames
– Pitch Contour: ToBI contour, nuclear pitch accent, phrase + boundary tones
– Voice Quality: spectral tilt

slide-78
SLIDE 78


Emotional Features

  • Text-to-Emotion

– Basic Idea

– People tend to use specific words to express their emotions in spoken dialogs.

– Because they have learned how some words are related to the corresponding emotions.

– Psychologists have tried to identify the language of emotions by asking people to list the English words that describe specific emotions.

– They identified emotional keywords in spoken language. – These keywords are highly domain dependent.

Feature-Set / Description:
– Lexical: n-grams (unigram, bigram, trigram); non-speech human noise: laughter and sighs; filled pauses: "Oh"; emotional keywords
– Pragmatic: dialog act, user identifier
– Context: past observations in the previous user turns

slide-79
SLIDE 79


Classifier

  • Basic Algorithm within Emotion Recognizer

– It is very similar to text categorization and topic detection.

– Most emotion detection systems use the same basic algorithms as text categorization or topic detection.

  • Emotional Keyword Extraction

– Emotional Salience

– Using mutual information

– TF*IDF (Cosine Normalization)

  • Emotion Classifier

– Feature-based classification combining several features

– Using statistical classification methodologies:

– Linear discriminant classifier (LDC) – k-nearest neighbor (k-NN) – Support vector machine (SVM)

TF*IDF (cosine normalization):

w_{i,j} = [ tf_{i,j} · log(N/n_i) ] / sqrt( Σ_{i=1}^{t} [ tf_{i,j} · log(N/n_i) ]² )

Emotional salience:

sal(w_n) = I(E; W = w_n) = Σ_{j=1}^{k} P(e_j | w_n) · i(w_n, e_j)
slide-80
SLIDE 80


Advanced ChatBot

  • Context-Sensitive ChatBot

– Most chatbots cannot reflect the discourse context of previous user turns. [ALICE, 2002; B. Shawar, 2002]

Context-Free ChatBot:
USER: Did you watch a Korean team's game yesterday?
SYSTEM: What game?
USER: How come you didn't know that?
SYSTEM: Well, I don't know. (obscure answering)

Context-Sensitive ChatBot:
USER: Did you watch a Korean team's game yesterday?
SYSTEM: What game?
USER: How come you didn't know that?
SYSTEM: I am not interested in sports. (more intelligent answering)

What's better?

slide-81
SLIDE 81


Discourse History in ChatBot

USER: I had a date last week. (Dialog Act = statement-non-opinion, Main Goal = perform, Agent = chat, Domain = friend, date = last week, event = date)
SYSTEM: Did you like her?
USER: Well, I think so, but... (Dialog Act = statement-opinion, Main Goal = judge, Agent = chat, Domain = friend)
SYSTEM: But what?
USER: I don't know her real feelings. (Dialog Act = statement-opinion, Main Goal = know, Agent = chat, Domain = friend, theme = real feeling)
SYSTEM: Why? What happened?

Context-Free Index Keys:
– Utterance 1: DA = statement-non-opinion, MA = perform, DOMAIN = friend, Date = 1, Theme = 1
– Utterance 2: DA = statement-opinion, MA = judge, DOMAIN = friend
– Utterance 3: DA = statement-opinion, MA = know, DOMAIN = friend, Theme = 1

Discourse Coherence

P(DA_t, MA_t | DA_{t-1}, MA_{t-1}) = P(statement-opinion, judge | statement-non-opinion, perform) > θ

Context-Sensitive Index Key

– Utterance 1: Previous Semantics = "<s>,<s>", Previous Keyword = "date"; DA = statement-non-opinion, MA = perform, DOMAIN = friend, Date = 1, Theme = 1
– Utterance 2: Previous Semantics = "statement-non-opinion,perform", Previous Keyword = "date", Scenario Session = "2"; DA = statement-opinion, MA = judge, DOMAIN = friend
– Utterance 3: Previous Semantics = "statement-opinion,judge", Previous Keyword = "NULL", Scenario Session = "2"; DA = statement-opinion, MA = know, DOMAIN = friend

Abstraction of previous user turn
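The discourse-coherence test P(DA_t, MA_t | DA_{t-1}, MA_{t-1}) > θ can be sketched with a bigram table over adjacent (dialog act, main action) pairs. The probabilities below are illustrative, not estimated from a real corpus:

```python
# Coherence check over adjacent (dialog act, main action) pairs.
# Bigram probabilities here are made-up illustrations.
BIGRAM = {
    (("statement-non-opinion", "perform"), ("statement-opinion", "judge")): 0.42,
    (("statement-non-opinion", "perform"), ("wh-question", "search")): 0.06,
}

def is_coherent(prev, curr, theta=0.1):
    """True if the current (DA, MA) pair is a likely continuation of
    the previous one, i.e. P(curr | prev) > theta."""
    return BIGRAM.get((prev, curr), 0.0) > theta
```

An unseen pair defaults to probability 0 and is treated as incoherent; a real system would smooth these estimates.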

slide-82
SLIDE 82


References

  • ALICE. 2002. A.L.I.C.E, A.I. Foundation. http://www.alicebot.org/
  • L. Holzman and W. Pottenger, 2003. Classification of Emotions in Internet

Chat: An Application of Machine Learning Using Speech Phonemes, Technical Report LU-CSE-03-002, Lehigh University.

  • J. Liscombe, 2006. Detecting and Responding to Emotion in Speech:

Experiments in Three Domains, Ph.D. Thesis Proposal, Columbia University

  • D. Litman and K. Forbes-Riley, 2005. Recognizing student emotions and

attitudes on the basis of utterances in spoken tutoring dialogues with both human and computer tutors, Speech Communication, 48(5):559-590.

  • C. M. Lee and S. S. Narayanan. 2005. Toward Detecting Emotions in Spoken

Dialogs, IEEE Transactions on Speech and Audio Processing, 13(2):293-303.

  • T. Polzin and A. Waibel. 2000. Emotion-sensitive human-computer interfaces.

the ISCA Workshop on Speech and Emotion.

  • B. Shawar and E. Atwell, 2002. A comparison between Alice and Elizabeth

chatbot systems. School of Computing Research Report, University of Leeds

  • X. Zhe and A. Boucouvalas, 2002. Text-to-Emotion Engine for Real Time

Internet Communication, CSNDSP.

slide-83
SLIDE 83


Contents

  • PART-I: Statistical Speech/Language Processing

– Natural Language Processing – short intro – Automatic Speech Recognition – (Spoken) Language Understanding

  • PART-II: Technology of Spoken Dialog Systems

– Spoken Dialog Systems – Dialog Management – Dialog Studio – Information Access Dialog – Emotional & Context-sensitive Chatbot – Multi-modal Dialog – Conversational Text-to-Speech

  • PART-III: Statistical Machine Translation

– Statistical Machine Translation – Phrase-based SMT – Speech Translation

slide-84
SLIDE 84


POSTECH multimodal Dialog System Demo

slide-85
SLIDE 85


Multi-Modal Dialog

  • Task performance and user preference for multi-modal over speech

interfaces [Oviatt et al., 1997]

– 10% faster task completion, – 23% fewer words, – 35% fewer task errors, – 35% fewer spoken disfluencies

"What is a decent Japanese restaurant near here?"

Hard to represent using only a single modality!

slide-86
SLIDE 86


Multi-Modal Dialog

  • Components of multi-modal dialog system [Chai et al., 2002]

(Figure: speech, gesture, and facial expression each pass through uni-modal understanding (spoken language understanding, gesture understanding), producing uni-modal interpretation frames; a multimodal integrator performs multi-modal understanding and reference analysis, and its multi-modal interpretation frame feeds discourse understanding in the dialog manager.)

slide-87
SLIDE 87


References (1/2)

  • R. A. Bolt, 1980, “Put that there: Voice and gesture at the graphics

interface,” Computer Graphics Vol. 14, no. 3, 262-270.

  • J. Chai, S. Pan, M. Zhou, and K. Houck, 2002, Context-based

Multimodal Understanding in Conversational Systems. Proceedings of the Fourth International Conference on Multimodal Interfaces (ICMI).

  • J. Chai, P. Hong, and M. Zhou, 2004, A Probabilistic Approach to

Reference Resolution in Multimodal User Interfaces. Proceedings of 9th International Conference on Intelligent User Interfaces (IUI-04), 70-77.

  • J. Chai, Z. Prasov, J. Blaim, and R. Jin., 2005, Linguistic Theories in

Efficient Multimodal Reference Resolution: an Empirical Investigation. Proceedings of the 10th International Conference on Intelligent User Interfaces (IUI-05), 43-50.

  • P.R. Cohen, M. Johnston, D.R. McGee, S.L. Oviatt, J.A. Pittman, I.

Smith, L. Chen, and J. Clow, 1997, "QuickSet: Multimodal Interaction for Distributed Applications," Intl. Multimedia Conference, 31-40.

slide-88
SLIDE 88


References (2/2)

  • H. Holzapfel, K. Nickel, and R. Stiefelhagen, 2004, Implementation and Evaluation of a Constraint-Based Multimodal Fusion System for Speech and 3D Pointing Gestures, Proceedings of the International Conference on Multimodal Interfaces (ICMI).
  • M. Johnston, 1998. Unification-based multimodal parsing. Proceedings of the International Joint Conference of the Association for Computational Linguistics and the International Committee on Computational Linguistics, 624-630.

  • M. Johnston, and S. Bangalore. 2000. Finite-state multimodal parsing

and understanding. Proceedings of COLING-2000.

  • M. Johnston, S. Bangalore, G. Vasireddy, A. Stent, P. Ehlen, M. Walker, S. Whittaker, and P. Maloor. 2002. MATCH: An architecture for multimodal dialogue systems. In Proceedings of ACL-2002.
  • S. L. Oviatt, A. DeAngeli, and K. Kuhn, 1997, Integration and synchronization of input modes during multimodal human-computer interaction. In Proceedings of Conference on Human Factors in Computing Systems: CHI '97.

slide-89
SLIDE 89


Contents

  • PART-I: Statistical Speech/Language Processing

– Natural Language Processing – short intro – Automatic Speech Recognition – (Spoken) Language Understanding

  • PART-II: Technology of Spoken Dialog Systems

– Spoken Dialog Systems – Dialog Management – Dialog Studio – Information Access Dialog – Emotional & Context-sensitive Chatbot – Multi-modal Dialog – Conversational Text-to-Speech

  • PART-III: Statistical Machine Translation

– Statistical Machine Translation – Phrase-based SMT – Speech Translation

slide-90
SLIDE 90


POSTECH conversational TTS demo Korean (Dialog)

slide-91
SLIDE 91


  • Text-to-speech system [M. Beutnagel, et al., 1999; J. Schroeter, 2005]

– Front end

– Text normalization: take raw text and convert things like numbers and abbreviations into their written-out word equivalents. – Linguistic analysis: POS-tagging, grapheme-to-phoneme conversion – Prosody generation: pitch, duration, intensity, pause

– Back end

– Unit selection: select the most similar units in the speech DB to produce the actual sound output

Conversational Text-to-Speech

(Pipeline figure: text → text normalization → linguistic analysis → prosody generation → symbolic linguistic representation → back end (unit selection, synthesis) → speech)
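The front end's text normalization step, converting numbers and abbreviations into written-out words, can be sketched as below. The abbreviation table and the digit-by-digit number reading are simplifying assumptions; real normalizers also handle ordinals, dates, currency, and more:

```python
import re

# Hypothetical, tiny normalization tables for illustration.
ABBREV = {"Dr.": "Doctor", "St.": "Street"}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def spell_number(digits):
    # Toy expansion: read digits out one by one.
    return " ".join(ONES[int(d)] for d in digits)

def normalize(text):
    # Expand abbreviations, then spell out every digit run.
    for abbr, full in ABBREV.items():
        text = text.replace(abbr, full)
    return re.sub(r"\d+", lambda m: spell_number(m.group()), text)
```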

slide-92
SLIDE 92


  • Given an alphabet of spelling symbols (graphemes) and an alphabet of

phonetic symbols (phonemes), a mapping should be achieved transliterating strings of graphemes into strings of phonemes [W.

Daelemans, et al., 1996]

  • Alignment

(Alignment example: graphemes and phonemes are aligned one-to-one, with null symbols "_" inserted where a grapheme has no phoneme or a phoneme has no grapheme.)

<Rule Generation> Alignment → Rule extraction → Rule pruning → Rule association (with a Dictionary)
<G2P Conversion> Input text → Text normalizer → Canonical form of graphemes → Phonemes

Multilingual Grapheme-to-Phoneme Conversion

slide-93
SLIDE 93


  • Predicting break index from POS tagged/syntax analyzed sentence
  • Break index [J. Lee, et al., 2002]

– No break: a phrase-internal word boundary or a juncture smaller than a word boundary
– Minor break: a minimal phrasal juncture such as an AP (accentual phrase) boundary
– Major break: a strong phrasal juncture such as an IP (intonational phrase) boundary

Break Index Prediction

(Pipeline: POS tag sequence → probabilistic break index prediction using a trigram model (wtag wtag break wtag) → break-index-tagged POS tag sequence → C4.5 decision tree for error correction → corrected break-index-tagged POS tag sequence)

slide-94
SLIDE 94


  • Using C4.5 (decision tree)
  • Assume that linguistic information and lexical information influence the tone of a syllable

  • IP tone label prediction [K. E. Dusterhoff, et al., 1999]

– Assign one tone among “L%”, “H%”, “LH%”, “HL%”, “LHL%” and “HLH%” tone to the last syllable of IP – Features

– POS, punctuation type, the length of phrase, onset, nucleus, coda

  • AP tone label prediction

– Assign one tone among “L” and “H” tone to each syllable of AP – Features

– POS, the length of phrase, the location in prosodic phrase

Pitch Prediction using K-ToBI

slide-95
SLIDE 95


  • Index of units: pitch, duration, position in syllable, neighboring phones
  • Half-diphone synthesis [A. J. Hunt, 1996; A. Conkie, 1999]

– The diphone cuts the units at the points of relative stability (the center of a phonetic realization), rather than at the volatile phone-phone transition, where so-called coarticulatory effects appear.

Unit Selection

slide-96
SLIDE 96


References (1/2)

  • M. Beutnagel, A. Conkie, J. Schroeter, Y. Stylianou, and A. Syrdal. 1999. The AT&T Next-Gen TTS System. Joint Meeting of ASA, EAA, and DAGA.

  • A. Conkie. 1999. Robust Unit Selection System for Speech Synthesis.

Joint Meeting of ASA, EAA, and DAGA.

  • W. Daelemans. 1996. Language-Independent Data-Oriented Grapheme-

to-Phoneme Conversion. Progress in Speech Synthesis, Springer Verlag, pp77-90.

  • K. E. Dusterhoff, A. W. Black, and P. Taylor. 1999. Using decision

trees within the tilt intonation model to predict f0 contours. Eurospeech- 99.

  • A. J. Hunt, and A. W. Black. 1996. Unit Selection in a concatenation

speech synthesis system using a large speech database. ICASSP-96, vol. 1, pp 373-376.

slide-97
SLIDE 97


  • S. Kim. 2000. K-ToBI (Korean ToBI) Labelling Conventions. UCLA

Working Papers in Phonetics 99.

  • S. Kim, J. Lee, B. Kim, and G. G. Lee. 2006. Incorporating Second-

Order Information Into Two-Step Major Phrase Break Prediction for Korean. ICSLP-06

  • J. Lee, B. Kim, and G. G. Lee. 2002. Automatic Corpus-based Tone

and Break-Index Prediction using K-ToBI Representation. ACM transactions on Asian language information processing (TALIP), Vol 1, Issue 3, pp207-224.

  • J. Lee, S. Kim, and G. G. Lee. 2006. Grapheme-to-Phoneme

Conversion Using Automatically Extracted Associative Rules for Korean TTS System. ICSLP-06

  • J. Schroeter. 2005. Electrical Engineering Handbook, pp16(1)-16(12).

References (2/2)

slide-98
SLIDE 98


Contents

  • PART-I: Statistical Speech/Language Processing

– Natural Language Processing – short intro – Automatic Speech Recognition – (Spoken) Language Understanding

  • PART-II: Technology of Spoken Dialog Systems

– Spoken Dialog Systems – Dialog Management – Dialog Studio – Information Access Dialog – Emotional & Context-sensitive Chatbot – Multi-modal Dialog – Conversational Text-to-Speech

  • PART-III: Statistical Machine Translation

– Statistical Machine Translation – Phrase-based SMT – Speech Translation

slide-99
SLIDE 99


Statistical Machine Translation POSTECH Statistical MT System Demo Korean-English Japanese-Korean Speech-to-Speech

slide-100
SLIDE 100


SMT Task

  • SMT: Statistical Machine Translation
  • Task:

– Translate a sentence in a language into another language – using statistical features of data.

나는 생각한다, 고로 나는 존재한다. → I think, thus I am.

P(I | 나는) = 0.7, P(me | 나는) = 0.2, …
P(think | 생각하다) = 0.5, P(think | 생각) = 0.4, …

slide-101
SLIDE 101


The Machine Translation Pyramid

(Pyramid figure: foreign sentence → foreign syntax → foreign semantics → Interlingua → native semantics → native syntax → native sentence.)

An interlingua-based system requires syntactic analysis, semantic analysis, language generation, and so on; that is, all other NLP techniques and linguistics.

Interlingua based system

Interlingua based system

slide-102
SLIDE 102


SMT in the Machine Translation Pyramid

(Same pyramid figure as the previous slide.)

A statistical system requires nothing but data and statistics; it does not require any other NLP techniques or linguistics.

Statistical system

slide-103
SLIDE 103


Statistical Model

  • Statistical Modeling

(Figure: statistical analysis of a Korean-English parallel text yields the translation model P(k | e); statistical analysis of English text yields the language model P(e). Decoding turns Korean into "broken English" and then into fluent English.)

slide-104
SLIDE 104


Statistical Model

  • Fundamental models

– Language Model

– Makes English fluently

– Translation Model

– Makes translation correctly

– Decoding Algorithm

– Finds best sentence

(Figure: input → decoding algorithm, combining the translation model and the language model → output)

e_best = argmax_e P(e | k) = argmax_e P(k | e) · P(e)
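The noisy-channel decision rule e_best = argmax_e P(k | e) · P(e) can be sketched over an explicit candidate list. All probabilities and names below are illustrative assumptions:

```python
def decode(k, candidates, tm, lm):
    """Pick the English candidate e maximizing P(k|e) * P(e).
    tm: dict (k, e) -> translation probability; lm: dict e -> P(e)."""
    return max(candidates, key=lambda e: tm.get((k, e), 0.0) * lm.get(e, 0.0))
```

Note how the language model overrules a decent translation probability: a disfluent candidate with high P(k|e) still loses if P(e) is tiny.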

slide-105
SLIDE 105


Translation Model

  • Give a probability to a word/phrase pair

– For a given word/phrase, list all the possible translations. – For a good translation, give high probability – For a poor translation, give low probability

  • Independence assumption

– Word translations are independent of one another. – Probability of a sentence translation = product of word translation probabilities

P(K | E) = Π_i P(k_i | e_i)
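Under the independence assumption, the sentence-level translation probability is just a product of word probabilities. A minimal sketch assuming, for illustration only, a one-to-one monotone alignment:

```python
import math

def sentence_tm_prob(k_words, e_words, word_tm):
    """P(K|E) = prod_i P(k_i | e_i).
    word_tm: dict (korean_word, english_word) -> probability.
    Assumes a one-to-one, monotone alignment for simplicity."""
    assert len(k_words) == len(e_words)
    return math.prod(word_tm.get((k, e), 0.0)
                     for k, e in zip(k_words, e_words))
```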

slide-106
SLIDE 106


Decoding

  • Search space – exponential to the length of sentence

– Pruning reduces the search space – Threshold pruning & beam search algorithms

(Hypothesis expansion example:
No word translated: f: -----, P = 1.0
One word translated: "I", f: *----, P = 0.5; "think", f: -*---, P = 0.4
Two words translated: "think", f: **---, P = 0.25; "am", f: *---*, P = 0.13)
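The expansion of partial hypotheses with coverage bitmaps can be sketched as a tiny stack decoder with beam pruning. The toy lexicon and probabilities are assumptions, and no language or distortion model is applied:

```python
import heapq

def beam_decode(src_words, tm, beam=3):
    """Each hypothesis is (prob, coverage bitmask, english words).
    stacks[n] holds hypotheses covering n source words; beam pruning
    keeps only the top `beam` hypotheses per stack."""
    stacks = [[] for _ in range(len(src_words) + 1)]
    stacks[0] = [(1.0, 0, ())]
    for n in range(len(src_words)):
        for prob, cov, out in stacks[n]:
            for i, k in enumerate(src_words):
                if cov & (1 << i):
                    continue  # source word already translated
                for e, p in tm.get(k, {}).items():
                    stacks[n + 1].append((prob * p, cov | (1 << i), out + (e,)))
        stacks[n + 1] = heapq.nlargest(beam, stacks[n + 1])  # beam pruning
    return max(stacks[-1])[2] if stacks[-1] else ()
```

Without a language model, hypotheses that emit the same words in different orders tie on probability, so only the set of output words is determined here.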

slide-107
SLIDE 107


Evaluation

  • BLEU score

– Most famous metric – Range 0~1. – Higher score means better translation

BLEU = BP · exp( Σ_{n=1}^{N} w_n · log p_n )

BP = 1 if c > r; BP = e^{(1 - r/c)} if c ≤ r

where:
– BP: brevity penalty, a factor related to the length of the candidate translation
– p_n: n-gram precision, ignoring duplicate counts
– N: maximum n-gram order
– w_n: weight for each n-gram order
– c: length of the candidate translation
– r: length of the reference sentence
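The BLEU formula can be implemented directly for a single reference with uniform weights w_n = 1/N. This sketch follows the brevity-penalty convention on this slide (BP = 1 when c > r, else e^{1 - r/c}) and returns 0 when any n-gram precision is zero:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    c_toks, r_toks = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(c_toks[i:i + n]) for i in range(len(c_toks) - n + 1))
        r_ngrams = Counter(tuple(r_toks[i:i + n]) for i in range(len(r_toks) - n + 1))
        # clipped counts: each candidate n-gram credited at most as often
        # as it occurs in the reference (ignoring duplicates)
        clipped = sum(min(cnt, r_ngrams[g]) for g, cnt in c_ngrams.items())
        if clipped == 0:
            return 0.0  # any zero n-gram precision drives BLEU to 0
        precisions.append(clipped / sum(c_ngrams.values()))
    c, r = len(c_toks), len(r_toks)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) / max_n for p in precisions))
```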

slide-108
SLIDE 108


Contents

  • PART-I: Statistical Speech/Language Processing

– Natural Language Processing – short intro – Automatic Speech Recognition – (Spoken) Language Understanding

  • PART-II: Technology of Spoken Dialog Systems

– Spoken Dialog Systems – Dialog Management – Dialog Studio – Information Access Dialog – Emotional & Context-sensitive Chatbot – Multi-modal Dialog – Conversational Text-to-Speech

  • PART-III: Statistical Machine Translation

– Statistical Machine Translation – Phrase-based SMT – Speech Translation

slide-109
SLIDE 109


IBM Model

  • Model 1

– Source length only dependent on target length – Assume uniform probability for position alignment – Source word only dependent on aligned word

  • Model 2

– Target position depends on the source position

  • Model 3

– Add Fertility Model

  • Model 4

– Model re-ordering of phrases – deficient: alignment can generate source positions outside of length

  • Model 5

– Removes the deficiency of Model 4
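To make the Model 1 assumptions concrete, here is a toy EM trainer for the lexical translation table t(s|t) — a minimal sketch on made-up data, not GIZA++, and omitting the NULL word.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """Toy IBM Model 1 EM trainer.
    pairs: list of (source_tokens, target_tokens); returns t[(s, w)] = t(s|w)."""
    t_vocab = {w for _, tgt in pairs for w in tgt}
    t = defaultdict(lambda: 1.0 / len(t_vocab))   # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in pairs:                    # E-step: collect expected counts
            for s in src:
                norm = sum(t[(s, w)] for w in tgt)
                for w in tgt:
                    c = t[(s, w)] / norm
                    count[(s, w)] += c
                    total[w] += c
        for (s, w) in count:                      # M-step: re-estimate t(s|w)
            t[(s, w)] = count[(s, w)] / total[w]
    return t

pairs = [(["das", "Haus"], ["the", "house"]),
         (["das", "Buch"], ["the", "book"]),
         (["ein", "Buch"], ["a", "book"])]
t = train_model1(pairs)
print(t[("das", "the")] > t[("das", "house")])  # EM pulls "das" toward "the"
```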

SLIDE 110

GIZA++

  • GIZA

– Part of the SMT toolkit EGYPT
– A word alignment tool
– An implementation of IBM Model 4

  • GIZA++

– An extension of GIZA
– Adds Model 5, the HMM alignment model, …

SLIDE 111

Phrase-based SMT

  • Pharaoh: [Philipp Koehn, 2003]

– An implementation of statistical phrase-based machine translation
– Phrase:

– Not a syntactic phrase
– A sequence of contiguous words

– Still SMT, but the translation unit is the phrase

SLIDE 112

Pharaoh Overview

  • Based on noisy channel model (Typical SMT )
  • Language model p(e) replaced with

$$p(e) = p_{LM}(e) \cdot \omega^{\mathrm{length}(e)}$$

– A word cost ω is introduced to adjust the output length
– ω > 1: prefer longer translations
– ω = 1: indifferent to length
– ω < 1: prefer shorter translations
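In log space the word cost simply adds length(e) · log(ω) to the LM score, which the following sketch illustrates; the scores, lengths, and the function name are illustrative assumptions.

```python
import math

def scored_lm(log_p_lm, length, omega):
    """LM score with word cost: log p_LM(e) + length(e) * log(omega)."""
    return log_p_lm + length * math.log(omega)

# hypothetical hypotheses: (log LM probability, output length)
short = (-4.0, 3)
long_ = (-6.0, 6)

# with omega > 1 the longer hypothesis wins despite its worse LM score
print(scored_lm(*long_, omega=2.0) > scored_lm(*short, omega=2.0))
```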

SLIDE 113

Pharaoh Overview

  • Translation model p(f|e) replaced with

$$p(\bar{f}_1^I \mid \bar{e}_1^I) = \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, d(a_i - b_{i-1})$$

  • The input sentence f is segmented into a sequence of I phrases $\bar{f}_1^I$

– Translation occurs phrase by phrase, and each phrase translation is assumed independent
– A distortion probability d() is introduced:

$$d(a_i - b_{i-1}) = \alpha^{|a_i - b_{i-1} - 1|}$$

– a_i: start position of the foreign phrase that was translated into the i-th English phrase
– b_{i-1}: end position of the foreign phrase that was translated into the (i-1)-th English phrase
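The product above can be computed directly; this sketch scores a given phrase segmentation, with the phrase probabilities φ and the value of α chosen purely for illustration.

```python
def distortion(a_i, b_prev, alpha=0.5):
    """d(a_i - b_{i-1}) = alpha ** |a_i - b_{i-1} - 1|."""
    return alpha ** abs(a_i - b_prev - 1)

def translation_score(phrases, alpha=0.5):
    """Product over phrases of phi(f_i|e_i) * d(a_i - b_{i-1}).
    phrases: list of (phi, start_pos, end_pos) in source order; b_0 = 0."""
    score, b_prev = 1.0, 0
    for phi, a_i, b_i in phrases:
        score *= phi * distortion(a_i, b_prev, alpha)
        b_prev = b_i
    return score

# monotone segmentation: each phrase starts right after the previous one ends,
# so every distortion factor is alpha ** 0 = 1
mono = [(0.5, 1, 2), (0.4, 3, 3)]
print(translation_score(mono))  # 0.5 * 0.4 = 0.2
```

Reordering the same phrases, e.g. `[(0.5, 3, 4), (0.4, 1, 2)]`, yields nonzero distortion distances and therefore a lower score.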

SLIDE 114

Pharaoh Training

  • Alignment, Intersection and union

[Figure: GIZA++ alignment matrices for "생맥주 한 잔 주세요 ." ↔ "A Draft Beer , Please ." — the K→E and E→K directional results, their intersection, and the heuristic union]
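Symmetrizing the two directional alignments is just a set operation on the link sets, as this small sketch shows (the toy link sets are illustrative).

```python
def symmetrize(ke_alignment, ek_alignment):
    """Intersection and union of two directional word alignments,
    each a set of (source_index, target_index) links."""
    inter = ke_alignment & ek_alignment   # high precision: links both agree on
    union = ke_alignment | ek_alignment   # high recall: links from either direction
    return inter, union

ke = {(0, 0), (1, 1), (2, 2)}
ek = {(0, 0), (1, 1), (2, 3)}
inter, union = symmetrize(ke, ek)
print(sorted(inter))  # [(0, 0), (1, 1)]
```

Heuristics such as grow-diag-final start from the intersection and selectively add links from the union; that refinement is omitted here.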

SLIDE 115

Pharaoh Training

  • Learning all phrase pairs that are consistent with the word alignment
  • (A Draft | 생맥주) ( Beer | 한 잔 ) (, | 주) (Please | 세요) (. | .)
  • (A Draft Beer | 생맥주 한 잔) (Beer , | 한 잔 주) (, Please | 주 세요) (Please . | 세요 .)
  • (A Draft Beer , | 생맥주 한 잔 주 ) ( Beer , Please | 한 잔 주 세요 ) ( , Please | 주 세요 .)
  • (A Draft Beer , Please | 생맥주 한 잔 주 세요 ) ( Beer , Please . | 한 잔 주 세요 .)
  • (A Draft Beer , Please . | 생맥주 한 잔 주 세요 . )

[Figure: intersected word alignment matrix for "생맥주 한 잔 주세요 ." ↔ "A Draft Beer , Please ."]
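The extraction of alignment-consistent phrase pairs can be sketched as below; this is a simplified version of the consistency check, and the extension over unaligned words used in practice is omitted.

```python
def extract_phrases(alignment, src_len, tgt_len, max_len=4):
    """Extract all phrase pairs consistent with the word alignment:
    no alignment link may cross the phrase-pair boundary."""
    extracted = []
    for s1 in range(src_len):
        for s2 in range(s1, min(s1 + max_len, src_len)):
            # target span covered by links from the source span [s1, s2]
            tgts = [t for s, t in alignment if s1 <= s <= s2]
            if not tgts:
                continue
            t1, t2 = min(tgts), max(tgts)
            # consistency: no link from outside [s1, s2] into [t1, t2]
            if all(s1 <= s <= s2 for s, t in alignment if t1 <= t <= t2):
                extracted.append(((s1, s2), (t1, t2)))
    return extracted

# toy monotone alignment for a 3-word sentence pair
align = {(0, 0), (1, 1), (2, 2)}
phrase_pairs = extract_phrases(align, 3, 3)
print(len(phrase_pairs))  # 6: three single-word, two two-word, one three-word span
```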

SLIDE 116

Techniques to improve

  • Pre-processing

– Normalize the input text into an “easy to translate” form
– Reordering, tagging, paraphrasing, …

Foreign → [Normalization] → Normalized Foreign → [Translation] → Native

SLIDE 117

Techniques to improve

  • Post-processing

– The translation may contain some errors
– Perform error-correction decoding
– Correct some trivial errors, e.g. with a morpheme connectivity check

Foreign → [Translation] → Native with errors → [Error correction] → Native

SLIDE 118

Add POS tag

  • Approach

– Add part-of-speech (POS) tags to the training data

  • Effect

– Distinguish some of the homonyms – Change spacing unit

  • Why useful?

– Automatic POS tagging is available for many languages
– The spacing unit becomes a unit of meaning

SLIDE 119

Delete Useless words

  • Approach

– For some language pairs, certain words are useless for translation
– Delete these words to help word alignment

  • Effects

– Reduce the number of misaligned pairs

  • Example: Korean-English translation

– English: the, a, an, -es (Korean tends not to mark number on nouns)
– Korean: some kinds of post-positions (은, 는, 이, 가, 을, 를, …) (English has no case markers)
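The deletion step is a simple token filter applied before word alignment, as this sketch shows; the deletion lists are illustrative, not exhaustive.

```python
# Strip tokens that have no counterpart in the other language (toy word lists).
EN_DELETE = {"the", "a", "an"}                    # Korean rarely marks definiteness
KO_DELETE = {"은", "는", "이", "가", "을", "를"}   # English has no case markers

def delete_useless(tokens, delete_set):
    """Remove words that only confuse the word alignment."""
    return [w for w in tokens if w not in delete_set]

print(delete_useless("give me a draft beer".split(), EN_DELETE))
# ['give', 'me', 'draft', 'beer']
```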

SLIDE 120

Using Dictionary

  • Approach

– Just append the dictionary to the end of the parallel corpus

  • Effects

– Adds one count for each correct phrase pair in the dictionary
– Increases the coverage of the vocabulary

  • Why useful?

– A dictionary is usually easy to obtain
– Often already built for the web or other applications
– Adding a dictionary gives a significant improvement

SLIDE 121

Dividing Language Model

[Figure: the translation model P(k|e) links Korean to (possibly broken) English; Language Model 1 is trained on the English side of the Korean/English bilingual text, Language Model 2 on separate English text, and the system selects which LM to use]

SLIDE 122

Contents

  • PART-I: Statistical Speech/Language Processing

– Natural Language Processing – short intro – Automatic Speech Recognition – (Spoken) Language Understanding

  • PART-II: Technology of Spoken Dialog Systems

– Spoken Dialog Systems – Dialog Management – Dialog Studio – Information Access Dialog – Emotional & Context-sensitive Chatbot – Multi-modal Dialog – Conversational Text-to-Speech

  • PART-III: Statistical Machine Translation

– Statistical Machine Translation – Phrase-based SMT – Speech Translation

SLIDE 123

Speech Translation

  • ASR

– Automatic Speech Recognizer
– Generates text from a given speech signal

  • TTS

– Text-To-Speech
– Synthesizes speech from a given text

  • Speech Translation Task

– Translates a speech signal in one language into another language
– Combines ASR, TTS and machine translation

SLIDE 124

Combining ASR, TTS and SMT

  • Cascading approach

– Connect ASR, SMT and TTS in a cascading manner
– The ASR result becomes the input to the SMT system
– The translation result from the SMT system becomes the input to the TTS system
– Simple!

Original Speech → [ASR] → Recognized Text → [SMT] → Translated Text → [TTS] → Translated Speech

SLIDE 125

Combining ASR, TTS and SMT

[Figure: n-best combination — the speech signal is recognized into ASR results 1…n; each is translated by the SMT system, all candidates are scored, and the highest-scoring translation is passed to TTS]
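Selecting the best candidate over the n-best list can be sketched as below; the interpolation weight `lam`, the toy scores, and the lookup-table translator are all illustrative assumptions.

```python
# Translate every ASR hypothesis, then pick the candidate with the best
# combined ASR + translation score (log domain, linear interpolation).

def best_translation(nbest, translate, lam=0.5):
    """nbest: list of (asr_log_score, text);
    translate: text -> (smt_log_score, translation)."""
    best = None
    for asr_score, text in nbest:
        smt_score, translation = translate(text)
        combined = lam * asr_score + (1 - lam) * smt_score
        if best is None or combined > best[0]:
            best = (combined, translation)
    return best[1]

# toy scores: the 2nd-best ASR hypothesis yields a much better translation
nbest = [(-1.0, "recognise A"), (-1.5, "recognise B")]
table = {"recognise A": (-5.0, "trans A"), "recognise B": (-0.5, "trans B")}
print(best_translation(nbest, table.get))  # 'trans B'
```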

SLIDE 126

References (1/2)

  • P.F. Brown, S.A. Della Pietra, V.J. Della Pietra and R.L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, vol. 19, no. 2, pages 263-311.
  • C. Callison-Burch, P. Koehn and M. Osborne. 2006. Improved Statistical Machine Translation Using Paraphrases. In Proceedings of NAACL.
  • M. Collins, P. Koehn and I. Kucerova. 2005. Clause Restructuring for Statistical Machine Translation. ACL.
  • P. Koehn, F.J. Och and D. Marcu. 2003. Statistical Phrase-Based Translation. In Proceedings of HLT, pages 127-133.
  • P. Koehn. 2004. Pharaoh: a Beam Search Decoder for Phrase-Based SMT. In Proceedings of AMTA, pages 115-124.
  • P. Koehn. 2004. Pharaoh, User Manual and Description for Version 1.2. http://www.isi.edu/licensed-sw/pharaoh/

SLIDE 127

References (2/2)

  • P. Koehn, A. Axelrod, A. Birch Mayne, C. Callison-Burch, M. Osborne and D. Talbot. 2005. Edinburgh System Description for the 2005 IWSLT Speech Translation Evaluation. IWSLT.
  • J. Lee, D. Lee and G.G. Lee. 2006. Improving Phrase-based Korean-English Statistical Machine Translation. ICSLP-06.
  • F.J. Och and H. Ney. 2000. Improved Statistical Alignment Models. 38th Annual Meeting of the ACL, pages 440-447.
  • K. Papineni, S. Roukos, T. Ward and W.-J. Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. 40th Annual Meeting of the ACL, pages 311-318. Philadelphia, PA, Jul.
  • R. Zhang and G. Kikui. 2006. Integration of Speech Recognition and Machine Translation: Speech Recognition Word Lattice Translation. Speech Communication, vol. 48, issues 3-4.

SLIDE 128

Thanks To

  • Minwoo Jung

  • Cheongjae Lee
  • SangKeun Jung
  • Seungwon Kim
  • Jinsik Lee
  • Jonghun Lee
  • Kyungdeok Kim
  • Sukwhan Kim
  • DonghHyeon Lee
  • HyungJong Noh
  • And others…
SLIDE 129

Thank you! Any questions?