Natural Language Generation and Dialog System Evaluation
EE596/LING580 -- Conversational Artificial Intelligence Hao Cheng University of Washington
Microsoft XiaoIce
A big conversation corpus
A: How old are you
B: I am eight
A: What’s your name ?
B: I am john
A: How do you like CS224n?
B: I cannot hate it more.
A: How do you like Jiwei ?
B: He’s such a Jerk !!!!!
A new input : What’s your age ?
Slide from Jiwei Li, Lecture at CS224S / LINGUIST285 “Spoken Language Processing”
(Ritter et al., 2010)
Slide borrowed from Michel Galley
Source (input message): how are you ?
Target (response): I’m fine . EOS
[Figure: encoder-decoder. The encoder reads the source; the decoder takes EOS I’m fine . as input and produces the target.]
(Sutskever et al., 2014; Jean et al., 2014; Luong et al., 2015)
[Animation built up over successive slides: the encoder reads "how are you ?"; decoding then starts from eos and emits I’m, fine, ., EOS one token per step, each step taking the previously generated token as input.]
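The step-by-step decoding shown in the animation can be sketched as a greedy loop. This is a minimal illustration, not the model from the cited papers: `toy_encode`, `toy_step`, and `TOY_TABLE` are hypothetical stand-ins for a trained encoder and decoder, and the toy decoder looks only at the previous token, whereas a real one also conditions on the encoded source and its hidden state.

```python
from typing import Callable, Dict, List, Tuple

EOS = "EOS"


def greedy_decode(
    source: List[str],
    encode: Callable[[List[str]], Tuple[str, ...]],
    step: Callable[[Tuple[str, ...], str], Dict[str, float]],
    max_len: int = 20,
) -> List[str]:
    """Generate a response one token at a time, feeding back the last output."""
    context = encode(source)      # encoding phase: read the whole input message
    prev = EOS                    # decoding starts from the EOS symbol
    output: List[str] = []
    for _ in range(max_len):
        scores = step(context, prev)        # scores for each candidate next token
        prev = max(scores, key=scores.get)  # greedy choice: highest-scoring token
        output.append(prev)
        if prev == EOS:                     # stop once the decoder emits EOS
            break
    return output


# Hypothetical stand-ins for a trained model (assumption, for illustration only).
def toy_encode(tokens: List[str]) -> Tuple[str, ...]:
    return tuple(tokens)


TOY_TABLE = {
    "EOS": {"I'm": 0.9, "fine": 0.1},
    "I'm": {"fine": 0.8, ".": 0.2},
    "fine": {".": 0.7, "EOS": 0.3},
    ".": {"EOS": 1.0},
}


def toy_step(context: Tuple[str, ...], prev: str) -> Dict[str, float]:
    # The toy ignores `context`; a real decoder would not.
    return TOY_TABLE[prev]


response = greedy_decode("how are you ?".split(), toy_encode, toy_step)
print(" ".join(response))
```

Greedy selection is only one possible decoding strategy; beam search or sampling would replace the `max` line.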
Input: what is moral ?
Response: what empowered humanity
Input: what is immoral ?
Response: the fact that you have a child .
Input: what is the purpose of existence ?
Response: to find out what happens when we get to the planet earth .
Input: what do you think about bill gates ?
Response: He’s a good man
project (King et al., 1996)
(SQALE) project (Young et al., 1997)
environment
project (Jacquemin et al., 2000)
DialogFlow, Microsoft BotFramework & LUIS, Rasa, …
at https://breakend.github.io/DialogDatasets/)
collection and user studies
activities in actual situations
usability laboratory
answers
compared and elaborated
metrics are regarded as objective
agreement (e.g., the Cohen’s kappa coefficient you computed in Lab 3)
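As a reminder of how inter-annotator agreement is quantified, here is a minimal sketch of Cohen's kappa for two annotators. The labels are made up for illustration, and returning 1.0 when chance agreement is already 1.0 is a convention chosen for this sketch (the coefficient is undefined in that case).

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently
    # according to their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # convention for this sketch; kappa is undefined here
    return (observed - expected) / (1.0 - expected)


# Made-up ratings of four system responses by two annotators.
annotator_1 = ["good", "good", "bad", "bad"]
annotator_2 = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.5
```

Here the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is therefore (0.75 - 0.5) / (1 - 0.5) = 0.5.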
time, the number of help requests/barge-ins/repair utterances, correction rate, timeouts, …
transaction success, …
through questionnaires and personal interviews
corresponding reference responses, e.g., perplexity, BLEU, METEOR, ROUGE
can be equally acceptable
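To see why word-overlap metrics can penalize an acceptable response, the sketch below computes clipped n-gram precision, the core ingredient of BLEU. It is deliberately simplified: one reference, no brevity penalty, and no combination across n-gram orders, so it is not full BLEU. The example sentences are made up.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """Clipped n-gram precision of a candidate against a single reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0  # candidate is shorter than n tokens
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


reference = "i am eight".split()
exact = "i am eight".split()
paraphrase = "eight years old".split()  # also a fine answer to "how old are you"

p_exact = modified_precision(exact, reference, 1)
p_para_1 = modified_precision(paraphrase, reference, 1)
p_para_2 = modified_precision(paraphrase, reference, 2)
print(p_exact, p_para_1, p_para_2)
```

The paraphrase shares only one unigram and no bigram with the reference, so it scores far lower than the exact match even though both answer the question.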
information flow, and semantic coherence)
related to task success and task costs
satisfaction using a set of features
Task success
Efficiency cost
Quality cost
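One common way to combine these three quantities is a PARADISE-style performance function: a weighted, normalized task-success term minus weighted, normalized cost terms. The sketch below assumes that form; the surviving slide text does not spell out the formula. The weights `alpha` and `weights`, the cost names, and the dialog logs are all made up, and in practice the weights would be fitted by regressing user satisfaction on the normalized features.

```python
from statistics import mean, pstdev
from typing import Dict, List


def zscores(values: List[float]) -> List[float]:
    """Z-score normalization so features on different scales are comparable."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(v - mu) / sigma for v in values]


def performance(
    task_success: List[float],
    costs: Dict[str, List[float]],
    alpha: float,
    weights: Dict[str, float],
) -> List[float]:
    """Per-dialog score: alpha * N(success) - sum_i w_i * N(cost_i)."""
    norm_success = zscores(task_success)
    norm_costs = {name: zscores(vals) for name, vals in costs.items()}
    scores = []
    for i in range(len(task_success)):
        penalty = sum(weights[name] * norm_costs[name][i] for name in costs)
        scores.append(alpha * norm_success[i] - penalty)
    return scores


# Made-up logs for four dialogs (assumption, for illustration only).
task_success = [1.0, 0.0, 1.0, 0.0]
costs = {
    "turns": [10.0, 20.0, 10.0, 20.0],  # an efficiency cost
    "repairs": [0.0, 2.0, 0.0, 2.0],    # a quality cost
}
scores = performance(task_success, costs, alpha=0.5,
                     weights={"turns": 0.3, "repairs": 0.2})
print([round(s, 3) for s in scores])
```

Dialogs that succeed with few turns and no repairs score highest; failed, long, repair-heavy dialogs score lowest.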
does not allow barge-ins), a straightforward comparison is not possible.
annotation and analysis of the collected data
for such a complex task
(Möller, 2009)
system.
novices to experts (Walker et al., 2000)
(Möller, 2005)
correlation is relatively weak
the correlation is as low as the number of total turns
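A correlation of this kind can be computed directly from conversation logs. Below is a minimal sketch of the Pearson correlation coefficient between user ratings and the number of turns. The numbers are invented for illustration and are not results from the lecture; a rank correlation such as Spearman's may suit ordinal ratings better.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Invented data: turns per conversation and the user's 1-5 rating.
num_turns = [5, 12, 20, 8, 15]
ratings = [2, 4, 3, 3, 4]
r = pearson_r(num_turns, ratings)
print(round(r, 3))
```

Values near 0 indicate a weak linear relationship, and values near 1 or -1 a strong one.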
higher than those classified as unhappy.
rate the conversation higher
and socialbot response, conversation duration, number of turns, and mean response time
appropriateness