BERTScore: Evaluating Text Generation with BERT (Varsha Kishore, PowerPoint presentation)

SLIDE 1

BERTScore: Evaluating Text Generation with BERT

Tianyi Zhang, Felix Wu, Kilian Q. Weinberger, Yoav Artzi, Varsha Kishore

SLIDE 2

Source: ich liebe es (translate)
Possible outputs: I like it; I like; I am like; I love it; I am loving it

SLIDE 3

Source: ich liebe es (translate)
Candidate: I like it
Reference: I love it
Metric: 0.88 / 1.00
Other possible outputs: I like; I am like; I am loving it

SLIDE 4

Text Generation Evaluation Metrics

N-gram matching approaches: BLEU (Papineni et al., 2002), METEOR (Banerjee & Lavie, 2005), ROUGE (Lin, 2004), chrF (Popović, 2015)
Embedding-based metrics: MEANT 2.0 (Lo, 2017), YiSi-1 (Lo et al., 2018), BERTScore

SLIDE 5

BLEU: N-gram Matching

Reference: The weather is cold today
Candidate 1: The weather is sunny today
Candidate 2: It is freezing today

BLEU cannot identify synonyms, so it gives the higher score to Candidate 1, even though Candidate 2 is much closer in meaning to the reference.
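To see the failure concretely, here is a minimal sketch of BLEU-style clipped ("modified") n-gram precision. It is only one ingredient of BLEU: the full metric also combines several n-gram orders with a geometric mean and applies a brevity penalty. Lowercasing plus whitespace tokenization and the helper names `ngrams` and `modified_ngram_precision` are simplifying assumptions for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_ngram_precision(candidate, reference, n):
    """Clipped n-gram precision for one candidate/reference pair (hypothetical helper)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clip each candidate n-gram count by how often it appears in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0


reference = "The weather is cold today"
candidates = {
    "Candidate 1": "The weather is sunny today",
    "Candidate 2": "It is freezing today",
}
for name, cand in candidates.items():
    p1 = modified_ngram_precision(cand, reference, 1)
    p2 = modified_ngram_precision(cand, reference, 2)
    print(f"{name}: unigram={p1:.2f} bigram={p2:.2f}")
# Candidate 1: unigram=0.80 bigram=0.50
# Candidate 2: unigram=0.50 bigram=0.00
```

Exact string matching rewards "sunny" sharing its context with the reference and gives no credit for "freezing" versus "cold".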

SLIDE 6

BERTScore: an evaluation metric that uses BERT embeddings

SLIDE 7

BERT
Transformer model pre-trained on masked language modeling and next sentence prediction
Generates token embeddings that reflect each token's context

SLIDE 8

BERTScore

Reference: the weather is cold today
Candidate: it is freezing today

Step 1, contextual embedding: encode every token of the reference and of the candidate with BERT
Step 2, pairwise cosine similarity: compare every reference token with every candidate token
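The pairwise step can be sketched in plain Python. The 3-dimensional vectors below are made up for illustration (real BERT embeddings are much larger and come from the model), and `cosine` and `pairwise_cosine` are hypothetical helper names, not part of the bert_score package.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_cosine(ref_vecs, cand_vecs):
    """Matrix S with S[i][j] = cosine(reference token i, candidate token j)."""
    return [[cosine(r, c) for c in cand_vecs] for r in ref_vecs]


reference = ["the", "weather", "is", "cold", "today"]
candidate = ["it", "is", "freezing", "today"]
# Hypothetical toy "contextual" embeddings, invented for this sketch.
ref_vecs = [[0.9, 0.1, 0.2], [0.2, 0.8, 0.3], [0.7, 0.3, 0.1],
            [0.1, 0.2, 0.9], [0.3, 0.9, 0.1]]
cand_vecs = [[0.8, 0.2, 0.3], [0.7, 0.3, 0.2], [0.2, 0.3, 0.9],
             [0.3, 0.8, 0.2]]

sim = pairwise_cosine(ref_vecs, cand_vecs)
# One row per reference token, one column per candidate token.
for tok, row in zip(reference, sim):
    print(f"{tok:>8}", " ".join(f"{s:.2f}" for s in row))
```

The result is a 5 x 4 matrix: one row for each reference token and one column for each candidate token.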

SLIDE 9

Greedy Matching

Reference: the weather is cold today
Candidate: it is freezing today

SLIDE 10

Greedy Matching

Precision: match each word in the candidate to its most similar word in the reference
Recall: match each word in the reference to its most similar word in the candidate
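Greedy matching reduces the similarity matrix to two averages. This is a minimal sketch assuming a matrix `sim` with reference tokens as rows and candidate tokens as columns. The layout, the helper name `greedy_precision_recall`, and the matrix values are illustrative choices, not taken from the slides. Each token is matched independently, so a single token can be the best match for several others.

```python
def greedy_precision_recall(sim):
    """Greedy matching on a similarity matrix (hypothetical helper).

    sim[i][j] is the cosine similarity between reference token i
    and candidate token j.
    """
    # Recall: each reference token (row) takes its most similar candidate token.
    recall = sum(max(row) for row in sim) / len(sim)
    # Precision: each candidate token (column) takes its most similar reference token.
    n_cand = len(sim[0])
    precision = sum(max(row[j] for row in sim) for j in range(n_cand)) / n_cand
    return precision, recall


# Invented 2 x 3 matrix: 2 reference tokens, 3 candidate tokens.
sim = [
    [0.9, 0.2, 0.4],
    [0.1, 0.6, 0.5],
]
p, r = greedy_precision_recall(sim)
print(f"precision={p:.3f} recall={r:.3f}")
# precision=0.667 recall=0.750
```

Column maxima give precision and row maxima give recall, so transposing the matrix would swap the two scores.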

SLIDE 14

Greedy Matching

Best-match cosine similarity for each token:
Reference tokens (recall): 0.713, 0.515, 0.858, 0.796, 0.913
Candidate tokens (precision): 0.913, 0.796, 0.858, 0.713

SLIDE 16

Greedy Matching: Aggregate

Precision = (0.913 + 0.796 + 0.858 + 0.713) / 4 = 0.820
Recall = (0.713 + 0.515 + 0.858 + 0.796 + 0.913) / 5 = 0.759

SLIDE 17

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
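As a worked check, here are the two aggregate scores from Slide 16 (0.759 and 0.820) plugged into the formula. F1 is symmetric in precision and recall, so the order of the two values does not change the result:

```latex
F_1 = 2 \cdot \frac{0.820 \times 0.759}{0.820 + 0.759}
    = \frac{1.24476}{1.579}
    \approx 0.788
```

As a harmonic mean, F1 is pulled toward the smaller of the two scores.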
SLIDE 18

Reference: the weather is cold today
Candidate: it is freezing today
Pipeline: contextual embedding, pairwise cosine similarity, F1 score

SLIDE 19

Evaluation: WMT Translation Benchmark

Reference | Candidate | Human | Metric
The weather is cold today. | It is freezing today. | 0.85 | 0.77
The garden is nice. | The garden was pretty. | 0.71 | 0.77
I like apples very much. | I love apples. | 0.79 | 0.80

Compute the correlation between the human scores and the metric scores.
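The slide does not name the correlation statistic, so this sketch assumes Pearson's r and computes it by hand. The helper `pearson` is hypothetical. Three sentence pairs are far too few for a meaningful correlation. The values are only the slide's illustrative scores.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative per-sentence scores from the slide.
human = [0.85, 0.71, 0.79]
metric = [0.77, 0.77, 0.80]
print(f"Pearson r = {pearson(human, metric):.3f}")
# Pearson r = 0.082
```

A real benchmark computes this over many segments or systems. A better metric is one whose scores rise and fall with the human judgments.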

SLIDE 20

Correlation Study
[Figure: correlation with human judgments (y-axis) for each language pair (x-axis: Czech-English, German-English, English-Czech, English-German), comparing BLEU, ITER, YiSi-1, RUSE, and BERTScore F1]

SLIDE 21

4 tasks, 8 languages, 363 systems

SLIDE 22

Download here: https://pypi.org/project/bert-score/
Or just: pip install bert_score
GitHub