[PPT] - Argument Retrieval in Project Debater Yufang Hou IBM Research PowerPoint Presentation

SLIDE 1

Argument Retrieval in Project Debater

Yufang Hou

IBM Research Europe, Dublin

SLIDE 2

1997: First computer to defeat a world champion in Chess (Deep Blue)
2011: First computer to defeat best human Jeopardy! players (Watson)
2019: First computer to successfully debate champion debaters (Project Debater)

IBM Research: History of Grand Challenges

SLIDE 3

Motion: We should subsidize preschool

Selected from test set based on assessment of chances to have a meaningful debate

Format: Oxford style debating

Fully automatic debate

No human intervention

Segments from a Live Debate (San Francisco, Feb 11th 2019)
Expert human debater: Mr. Harish Natarajan

SLIDE 4

Project Debater: Media Exposure

Hundreds of press articles in all leading newspapers

100 Million people reached

Millions of video views

2.1 Billion social media impressions

SLIDE 5

Full Live Debate, Feb-2019

https://www.youtube.com/watch?v=m3u-1yttrVw&t=2469s

“The Debater” Documentary

https://www.youtube.com/watch?v=7pHaNMdWGsk&t=1383s

SLIDE 6

Outline

- System overview
- Argument retrieval in Project Debater
- Some retrospective thoughts

SLIDE 7

Current Publications Highlight Various Aspects of the System

SLIDE 8

Publications and Datasets are available at - https://www.research.ibm.com/artificial-intelligence/project-debater/research/

SLIDE 9

Outline

- System overview
- Argument retrieval in Project Debater
- Some retrospective thoughts

SLIDE 10

Lippi and Torroni, IJCAI, 2015

Al-Khatib et al, NAACL 2016; Wachsmuth et al, Argument-Mining Workshop, 2017, …

Stab and Gurevych, EMNLP 2014; Stab et al, NAACL 2018, …

Recent reviews

Five years of argument mining: a data-driven analysis, Cabrio and Villata, IJCAI, 2018

Argumentation Mining, Stede and Schneider, Synthesis Lectures on HLT, 2018

Argument Mining: A Survey, Lawrence and Reed, CL, 2019

Related Work

SLIDE 11

Context Dependent Claim Detection, Levy et al, COLING 2014.

Show Me Your Evidence - an Automatic Method for Context Dependent Evidence Detection, Rinott et al, EMNLP 2015.

Wikipedia Stage

SLIDE 12

Wikipedia Claim/Evidence Labeled Data – Labeling Process

Controversial Topic → Select Wikipedia Articles → Find Claim Candidates per Article → Confirm/Reject Each Claim Candidate → Find Candidate Evidence per Claim → Confirm/Reject Each Candidate Evidence

- 5 In-house Annotators Per Stage
- Exhaustive annotation
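
The confirm/reject stages rely on several annotators per candidate. A minimal sketch of how five votes might be combined, assuming a simple majority rule; the slides do not state the actual aggregation rule, and `aggregate_votes` and `min_agree` are hypothetical names:

```python
from collections import Counter

def aggregate_votes(votes, min_agree=3):
    """Return 'confirm' or 'reject' for one candidate.

    votes: list of 'confirm'/'reject' labels, one per annotator.
    min_agree: assumed majority threshold (3 of 5); not taken from the slides.
    """
    counts = Counter(votes)
    return "confirm" if counts["confirm"] >= min_agree else "reject"

# Toy votes for two claim candidates (illustrative data only).
candidate_votes = {
    "claim_a": ["confirm", "confirm", "reject", "confirm", "reject"],
    "claim_b": ["reject", "reject", "confirm", "reject", "reject"],
}
labels = {cid: aggregate_votes(v) for cid, v in candidate_votes.items()}
```

Only candidates that reach the threshold would move on to the next stage of the pipeline.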

Wikipedia Stage

SLIDE 13

Wikipedia Claim/Evidence Labeled Data - Results

- 58 Controversial Topics selected from Debatabase
- 547 relevant Wikipedia articles carefully labeled by in-house team
  - E.g., Ban the sale of Violent Video Games for Children
- 2.6K Claims & 4.5K Evidence that support/contest the claims
  - Evidence length varies from one sentence to a whole paragraph
  - Three types of Evidence: Study, Expert, and Anecdotal
- Pre-defined train/dev/test split

Wikipedia Stage

SLIDE 14

System Design for Argument Mining

Wikipedia Stage

Topic → Topic Analysis → Document Level IR → Claim Detection → Evidence Detection

Topic: We should subsidize preschool

Retrieve documents that directly address the topic and are likely to contain argumentative text segments

Simple logistic regression model with lots of carefully designed features

GrASP: Rich Patterns for Argumentation Mining, Shnarch et al., EMNLP 2017

Static train/dev/test datasets
Moderate success over a range of test topics
Only positive instances are annotated
Limited coverage
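
To make "logistic regression with carefully designed features" concrete, here is a minimal sketch in plain Python. The two features, the cue-word list, and the toy sentences are illustrative assumptions, not the feature set used in the actual system:

```python
import math

# Hypothetical cue words; the slides do not list the real feature set.
CLAIM_MARKERS = {"that", "should", "argue", "believe"}

def features(sentence, topic_terms):
    """Map a sentence to [bias, topic-term density, claim-marker density]."""
    tokens = sentence.lower().split()
    n = max(len(tokens), 1)
    return [
        1.0,
        sum(t in topic_terms for t in tokens) / n,
        sum(t in CLAIM_MARKERS for t in tokens) / n,
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Probability that the sentence is argumentative under weights w."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def train(X, y, lr=1.0, epochs=2000):
    """Stochastic gradient ascent on the log-likelihood."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, label in zip(X, y):
            error = label - predict(w, x)
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
    return w

topic = {"preschool"}
sentences = [
    ("experts argue that preschool should be subsidized", 1),
    ("many believe that preschool improves outcomes", 1),
    ("the preschool opened in 1990", 0),
    ("the weather was nice yesterday", 0),
]
X = [features(s, topic) for s, _ in sentences]
y = [label for _, label in sentences]
w = train(X, y)
scores = [predict(w, x) for x in X]
```

The third sentence mentions the topic but carries no claim markers, so a topic-only filter would keep it while the trained model scores it low.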

SLIDE 15

Corpus wide argument mining - a working solution, Ein-Dor et al, AAAI 2020.

VLC (Very Large Corpus) Stage

SLIDE 16

Sentence Level (SL) strategy, vs. Document Level used before

SCALE

~240 train/dev topics & ~100 test topics

~200,000 sentences carefully annotated for train/dev → Retrospective Labeling Paradigm

~10,000,000,000 Sentences - Reporting results over a massive corpus

Closer than ever to a working solution

Main Distinction from Prev. Work

VLC (Very Large Corpus) Stage

SLIDE 17

System Architecture

Controversial Topic → Queries → Massive Corpus (~10B Sentences) → Retrieved Sentences → Ranking Model (BERT) → High-precision Evidence Set

VLC (Very Large Corpus) Stage

Support flexible patterns to retrieve argumentative sentences
- Topic terms
- Evidence connectors
- Sentiment lexicon
- NER

Iteratively Collected Labeled-Data

Retrieve 12,000 sentences per evidence type per topic

Starting with LR from Rinott et al, EMNLP 2015

Retrospective Labeling Paradigm

An infrastructure that supports quick dynamic experiments and monitors annotation quality
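
A rough sketch of how a sentence-level query could combine topic terms, evidence connectors, and a sentiment lexicon. The lexicons and the `matches_query` helper are hypothetical, the NER component is omitted, and the real query language is not described in the slides:

```python
import re

# Illustrative lexicons; the system's actual resources are not given here.
EVIDENCE_CONNECTORS = {"according to", "studies show", "research suggests"}
SENTIMENT_LEXICON = {"harmful", "beneficial", "dangerous", "effective"}

def contains_word(text, word):
    """Whole-word match, so 'school' does not match inside 'preschool'."""
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None

def matches_query(sentence, topic_terms, require_sentiment=False):
    """True if the sentence mentions the topic and an evidence connector."""
    text = sentence.lower()
    has_topic = any(contains_word(text, t) for t in topic_terms)
    has_connector = any(c in text for c in EVIDENCE_CONNECTORS)
    has_sentiment = any(contains_word(text, w) for w in SENTIMENT_LEXICON)
    if require_sentiment and not has_sentiment:
        return False
    return has_topic and has_connector

corpus = [
    "According to a 2015 study, preschool is beneficial for later achievement.",
    "The preschool is located next to the park.",
    "Studies show that coffee is effective against fatigue.",
]
retrieved = [s for s in corpus if matches_query(s, ["preschool"])]
```

Queries of this kind only narrow the corpus to plausible candidates; the ranking model then decides which retrieved sentences are kept.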

SLIDE 18

Collecting labeled data poses a two-fold challenge -
Low prior of positive examples
Annotation through crowd requires expertise – simple guidelines, careful monitoring…
BTW - Kappa of ~0.4 is actually quite good

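
A small worked example of why a kappa near 0.4 can coexist with high raw agreement when positives are rare. This sketch assumes Cohen's kappa for two annotators; the slides do not say which kappa variant was measured, and the labels below are made up:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items.

    Undefined (division by zero) when chance agreement is exactly 1.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)

# Ten candidates with few positives (1 = evidence, 0 = not evidence).
annotator_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
annotator_b = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
kappa = cohens_kappa(annotator_a, annotator_b)
```

The two annotators agree on 8 of 10 items, but chance agreement is already 0.68 because most items are negative, so kappa comes out at 0.375.
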
Developing corpus-wide argument mining poses another challenge
Imagine ~2,000 new predictions every week… → Associated infrastructure is a must
Retrospective labeling of top predictions is a natural and effective solution
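
Retrospective labeling can be sketched as a loop that annotates only the current model's top predictions and then retrains. Everything below is a toy stand-in under stated assumptions: `score`, `retrain`, the seed words, and the gold labels are hypothetical, and the real system's models and annotation pipeline are not specified in the slides:

```python
def retrospective_labeling(pool, score, annotate, retrain, rounds=1, top_k=2):
    """Label the model's top-k predictions each round, then retrain.

    pool: unlabeled sentences; score(model, s) -> number;
    annotate(s) -> bool (stands in for crowd annotation);
    retrain(labeled) -> new model.
    """
    labeled = {}
    model = retrain(labeled)
    for _ in range(rounds):
        candidates = [s for s in pool if s not in labeled]
        ranked = sorted(candidates, key=lambda s: score(model, s), reverse=True)
        for s in ranked[:top_k]:
            labeled[s] = annotate(s)
        model = retrain(labeled)
    return model, labeled

SEED_WORDS = {"study", "studies"}  # assumed starting vocabulary

def retrain(labeled):
    """Toy 'model': seed words plus all words seen in confirmed sentences."""
    words = set(SEED_WORDS)
    for sentence, is_positive in labeled.items():
        if is_positive:
            words |= set(sentence.lower().split())
    return words

def score(model, sentence):
    """Count how many words of the sentence the model knows."""
    return sum(w in model for w in sentence.lower().split())

gold = {
    "studies show donors have fewer heart attacks": True,
    "a study links donation to lower blood pressure": True,
    "students are the main donors": False,
    "the clinic opens at nine": False,
}
model, labeled = retrospective_labeling(list(gold), score, gold.get, retrain)
```

Because only top-ranked predictions are sent for annotation, labeling effort concentrates where positives are likely, which matters when the prior of positive examples is low.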

How to Collect Labeled Data?

VLC (Very Large Corpus) Stage

SLIDE 19

Why is Evidence Detection Hard?

Motion: Blood donation should be mandatory
According to studies, blood donors are 88 percent less likely to suffer a heart attack…
Statistics … show that students are the main blood donors contributing about 80 percent…

REJECTED CONFIRMED

SLIDE 20

[Figure: axes are Number of candidates and Macro-Average Precision]

Results

Results by various BERT Models over a massive corpus of ~10B sentences

BA baselines: BlendNet, Attention based bidirectional LSTM model [Shnarch et al. (2018)]

High precision

Wide coverage with diverse evidence (highly similar sentences are removed)
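
The metric on the plot can be illustrated as precision over the top-k candidates of each topic, averaged across topics. This is a minimal sketch that treats "number of candidates" as k; the topic names and labels are invented for illustration:

```python
def precision_at_k(ranked_labels, k):
    """Fraction of positives among the top-k ranked candidates."""
    top = ranked_labels[:k]
    return sum(top) / len(top) if top else 0.0

def macro_average_precision(per_topic_ranked_labels, k):
    """Average precision@k over topics, each topic weighted equally."""
    scores = [precision_at_k(labels, k) for labels in per_topic_ranked_labels.values()]
    return sum(scores) / len(scores)

# 1 = annotator confirmed the candidate as evidence, 0 = rejected.
ranked = {
    "topic_a": [1, 1, 0, 1],
    "topic_b": [1, 0, 0, 0],
}
p_at_2 = macro_average_precision(ranked, 2)
p_at_4 = macro_average_precision(ranked, 4)
```

Macro-averaging keeps a topic with many retrievable sentences from dominating the score, since every topic contributes one precision value.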

VLC (Very Large Corpus) Stage

SLIDE 21

Outline

- System overview
- Argument retrieval in Project Debater
- Some retrospective thoughts

SLIDE 22

Modeling human dilemmas

Modeling the world of human controversy and enabling the system to suggest principled arguments

Listening comprehension

Identify key claims hidden in long continuous spoken language
Compare to personal assistants: simple short commands

Data-driven speech writing and delivery

Digest massive corpora
Write a well-structured speech
Deliver with clarity and purpose

Challenges to Consider while developing a Live Debate System

Argument retrieval is the first step to build such a system

SLIDE 23

The Problem: Many things need to succeed simultaneously and many things can go wrong…

SLIDE 24

Many things can go wrong… / Examples

Getting the stance wrong means you support your opponent…

Drifting from the topic – from Physical Education to Sex Education and back…

The system is only as good as its corpus → … global warming will lead malaria virus to creep into hilly areas…

SLIDE 25

Progress over time / Improvement in Precision of Detecting Claims

Document level IR
Corpus: Wikipedia
Exhaustive labelling of positive instances
LR + Rich features

Sentence level IR
Very Large Corpus: 400 million articles (50 times larger than Wikipedia)
Retrospective labelling
BERT fine-tuning

Attention-based Bi-LSTM with weak supervision
Sentence level IR
Flexible query
Very large corpus
Retrospective labelling

SLIDE 26

Beyond Project Debater

Computational Argumentation

Argument retrieval
Argument Unit Identification
Argument Relation Prediction
Argument(ation) Quality
Argument Generation
…

Social NLP

Sentiment
Persuasiveness
Social bias
Framing
Fact verification
…

Discourse and Pragmatics

Argumentative discourse
Argumentative coherence
…

Dialogue System
Natural Language Generation
Text Summarization

Computational argumentation is emerging as an interesting research area

“Argument mining” is the new keyword in the list of topics in recent *ACL conferences

SLIDE 27


Thanks! Q&A