The BeSt Eval at the 2017 NIST TAC KBP BeSt: Evaluating Mind - PowerPoint PPT Presentation

The BeSt Eval at the 2017 NIST TAC KBP

BeSt: Evaluating Mind Reading People in real world: Barack Events in real world: what, who, Obama, when, … Marine Le Pen, CharlesInParis, … There was a Dislike demonstration against gay marriage Non- Committed in Paris yesterday. ? Belief CharlesInParis: I ! was at the pro- Committed Belief family demo yesterday! Participant Infer: It was such a good Like Like show of support for CharlesInParis French values! So different from the alleged riots…

BeSt Eval • BeSt Eval organized by the DEFT BeSt group – Albany, Columbia, Cornell, GWU, IHMC, LDC, MITRE, NIST, Pittsburgh • Task: Evaluate addition of belief and sentiment to existing KB objects (EREs) – Sources: Entities, – Targets: Entites (sentiment only), relations and events (EREs) – Want to evaluate KB population, not text tagging – Want to exclude ERE KBP tasks from belief and sentiment tasks • Allows component-level research improvements and system development • First evaluation to cover both belief and sentiment

BeSt Eval: The Role of ERE Annotation • Assume ERE annotation as input – ERE annotation (LDC): straightforward representation of entities, relations and events in KB with pointers to mentions in text • Distinction between object vs. object mention • Currently no cross-document co-reference in LDC gold or predicted ERE data, so analysis is one document at a time – If cross-document co-reference is available, nothing changes for evaluation framework – Most systems would not change given cross- document co-reference

BeSt Eval Tasks 24 conditions: - 2 cognitive attitudes (belief and sentiment) - 3 languages - 2 conditions (gold ERE and predicted ERE) - 2 genres Because of important differences in data, each condition is very different

Training Data: Same as for BeSt Eval 2016 English All data Discussion Forums (%) Newswire (%) Train 157K words 89% 11% Evaluation 2016 88K words 52% 48% Evaluation 2017 95K words 65% 35% Spanish All data Discussion Forums (%) Newswire (%) Train 79K words 100% 0% Evaluation 2016 67K words 61% 39% Evaluation 2017 89K words 66% 34% Chinese All data Discussion Forums (%) Newswire (%) Train 133K words 100% 0% Evaluation 2016 122K words 65% 35%

English Training Data: Belief vs. Sentiment Disc. Forums vs. Newswire Percentage of targets that have: All data Discussion Forums Newswire Sentiment from any source 18.9% Sentiment from author 16.3% Sentiment from other source 2.6% Belief from any source Belief from author Belief from other source

Data: Belief vs. Sentiment Disc. Forums vs. Newswire Percentage of targets that have: All data Discussion Forums Newswire Sentiment from any source 18.9% 21.2% 6.8% Sentiment from author 16.3% Sentiment from other source 2.6% Belief from any source Belief from author Belief from other source

Data: Belief vs. Sentiment Disc. Forums vs. Newswire Percentage of targets that have: All data Discussion Forums Newswire Sentiment from any source 18.9% 21.2% 6.8% Sentiment from author 16.3% 19.0% 1.8% Sentiment from other source 2.6% 2.2% 5.0% Belief from any source Belief from author Belief from other source

Data: Belief vs. Sentiment Disc. Forums vs. Newswire Percentage of targets that have: All data Discussion Forums Newswire Sentiment from any source 18.9% 21.2% 6.8% Sentiment from author 16.3% 19.0% 1.8% Sentiment from other source 2.6% 2.2% 5.0% Belief from any source 100% 100% 100% Belief from author 94.3% 99.3% 79.2% Belief from other source 13.7% 9.3% 26.6% Note: Belief includes “NA” tag which was not included in evaluation

Evaluation Script Eval script written at Columbia based on community consensus • Goal: evaluate accuracy of links added to KB • – Not focused on text annotation (except for Provenance) Target must be correct • Partial credit • – For incorrect source – If value of sentiment (pos, neg) or of belief (CB, NCB, ROB) is wrong – For target “provenance”, two conditions: • At least one span in list must be correct (WHAT WE USED) • Score weighted by the F-measure of predicted mentions against correct mentions • “At-least-one” condition gets pretty consistently 2% better scores than the weighted approach, with no change in order of system results

Participation • Participation increased over 2016, but still low – Hard and new problem – Decided not to advertise

BeSt Eval Participants Belief English Spanish Chinese Gold Predicted Gold Predicted Gold Predicted ERE ERE ERE ERE ERE ERE DF NW DF NW DF NW DF NW DF NW DF NW Baseline Albany --- --- --- --- --- --- --- --- --- --- --- --- Chinese Ac. Sci. X X X X X X X X X X X X Columbia/GWU X X X X X X X X X X X X Cornell/Mich --- --- --- --- --- --- --- --- X X X X Jaén Sinai X X --- --- --- --- --- --- --- --- --- --- IBM Dublin --- --- --- --- --- --- --- --- --- --- --- --- Best Perform.

BeSt Eval Participants Belief: Top Performers English Spanish Chinese Gold Predicted Gold Predicted Gold Predicted ERE ERE ERE ERE ERE ERE DF NW DF NW DF NW DF NW DF NW DF NW Baseline X X Albany --- --- --- --- --- --- --- --- --- --- --- --- Chinese Ac. Sci. X X X X X X X X X X X X Columbia/GWU X X X X X X X X X X X X Cornell/Mich --- --- --- --- --- --- --- --- X X X X Jaén Sinai X X --- --- --- --- --- --- --- --- --- --- IBM Dublin --- --- --- --- --- --- --- --- --- --- --- --- Best Perform. 78 63 1 1 78 68 0 1 82 64 0 0

BeSt Eval Participants Sentiment English Spanish Chinese Gold Predicted Gold Predicted Gold Predicted ERE ERE ERE ERE ERE ERE DF NW DF NW DF NW DF NW DF NW DF NW Baseline Albany X X X X --- --- --- --- --- --- --- --- Chinese Ac. Sci. X X X X X X X X X X X X Columbia/GWU X X X X X X X X X X X X Cornell/Mich --- --- --- --- --- --- --- --- X X X X Jaén Sinai --- --- --- --- --- --- --- --- --- --- --- --- IBM Dublin X X X X --- --- --- --- --- --- --- --- Best perform.

BeSt Eval Participants Sentiment: Top Performers English Spanish Chinese Gold Predicted Gold Predicted Gold Predicted ERE ERE ERE ERE ERE ERE DF NW DF NW DF NW DF NW DF NW DF NW Baseline X X X X X Albany X X X X --- --- --- --- --- --- --- --- Chinese Ac. Sci. X X X X X X X X X X X X Columbia/GWU X X X X X X X X X X X X Cornell/Mich --- --- --- --- --- --- --- --- X X X X Jaén Sinai --- --- --- --- --- --- --- --- --- --- --- --- IBM Dublin X X X X --- --- --- --- --- --- --- --- Best perform. 25 12 8 3 21 11 7 3 28 14 4 2

Observations • Predicted ERE is hard – Same in 2016 • Will continue to emerge as a central topic of research in NLP as we move towards deep understanding of language – Cognitive science – Pragmatics – Discourse & dialog

Many Thanks • NIST (Hoa Dang) for organizing the evaluation • LDC (Jennifer Tracey and Michael Arrigo) for annotations – The corpora will continue to be used • DARPA (Boyan Onyshkevych) for funding the research • All teams that participated in planning the evaluation • All teams that participated in the evaluation

The BeSt Eval at the 2017 NIST TAC KBP BeSt: Evaluating Mind - PowerPoint PPT Presentation

The BeSt Eval at the 2017 NIST TAC KBP BeSt: Evaluating Mind Reading People in real world: Barack Events in real world: what, who, Obama, when, Marine Le Pen, CharlesInParis, There was a Dislike demonstration against gay marriage

The BeSt Eval at the 2016 NIST TAC KBP Overview BeSt Eval Task

Overview of Event Nugget Track TAC KBP 2016 Teruko Mitamura Zhengzhong Liu Eduard Hovy

KBP 2017 Cold Start KB Construction and Slot Filling Hoa Dang Shahzad Rajput U.S. National

Events Detection, Coreference and Sequencing: Whats next? Overview of TAC KBP 2017 Event

TAC 2018 Streaming Multimedia KBP Pilot Hoa Trang Dang National Institute of Standards and

The Columbia-GWU System at the 2016 TAC KBP BeSt Evaluation Owen Rambow, Tao Yu, Axinia Radeva,

Text Analysis Conference TAC 2016 Sponsored by: Hoa Trang Dang National Institute of Standards

TAC KBP 2016 Linguistic Resources: Event Arguments (EA), Event Nuggets (EN) and Belief/Sentiment

TREC, TAC, takeoffs, tacks, tasks, and titillations for 2009 Ian Soboroff, NIST

Event Detection and Coreference TAC KBP 2015 Sean Monahan, Michael Mohler, Marc Tomlinson Amy

Overview of 2015 TAC KBP Event Nugget Tasks Teruko Mitamura Zhengzhong Liu Eduard Hovy

Stanford-UBC at TAC-KBP Eneko Agirre , Angel Chang, Dan Jurafsky, Christopher Manning, Valentin

NIST Gaithersburgs Approach to a Solar PV Array Project John.R.Bollinger@nist.gov 2 NIST

Federal Computer Security Managers Forum Meeting September 10, 2018 NIST Gaithersburg NIST

FEDERAL COMPUTER SECURITY MANAGERS FORUM MEETING FEBRUARY 6, 2020 NIST WEST SQUARE NIST

NIST Trustworthy Email Project High Assurance Domain Project Scott Rose, NIST scottr@nist.gov

FREE YOUR MIND FOUNDATION PRESENTATION FREE YOUR MIND FOUNDATION ORGANIGRAM Johan Peijen (Riki)

Supporting My Child In Reading March 5, 2020 Flora Vista Elementary Keri Hargrave M.Ed Reading

Richard Steele How to improve your childs reading at home Monday 3 rd February 2020 So- Why

The Increasing Importance of Ecosystem Services Pavan Sukhdev Study Leader TEEB CEO GIST

CS214:presentation Zero-sum Game & von Neumann minimax theorem Group:Second++ What is a

A Model Approach toward the Management of Future Uncertainties Shunsuke Mori (Tokyo University of

Generative Adversarial Networks Ian Goodfellow Research Scientist GPU Technology Conference

NextGeneration Neighborhoods NH & RA Summer Institute July 2018 1 NextGeneration

The BeSt Eval at the 2017 NIST TAC KBP BeSt: Evaluating Mind - PowerPoint PPT Presentation

The BeSt Eval at the 2017 NIST TAC KBP BeSt: Evaluating Mind Reading People in real world: Barack Events in real world: what, who, Obama, when, Marine Le Pen, CharlesInParis, There was a Dislike demonstration against gay marriage

The BeSt Eval at the 2016 NIST TAC KBP Overview BeSt Eval Task

Overview of Event Nugget Track TAC KBP 2016 Teruko Mitamura Zhengzhong Liu Eduard Hovy

KBP 2017 Cold Start KB Construction and Slot Filling Hoa Dang Shahzad Rajput U.S. National

Events Detection, Coreference and Sequencing: Whats next? Overview of TAC KBP 2017 Event

TAC 2018 Streaming Multimedia KBP Pilot Hoa Trang Dang National Institute of Standards and

The Columbia-GWU System at the 2016 TAC KBP BeSt Evaluation Owen Rambow, Tao Yu, Axinia Radeva,

Text Analysis Conference TAC 2016 Sponsored by: Hoa Trang Dang National Institute of Standards

TAC KBP 2016 Linguistic Resources: Event Arguments (EA), Event Nuggets (EN) and Belief/Sentiment

TREC, TAC, takeoffs, tacks, tasks, and titillations for 2009 Ian Soboroff, NIST

Event Detection and Coreference TAC KBP 2015 Sean Monahan, Michael Mohler, Marc Tomlinson Amy

Overview of 2015 TAC KBP Event Nugget Tasks Teruko Mitamura Zhengzhong Liu Eduard Hovy

Stanford-UBC at TAC-KBP Eneko Agirre , Angel Chang, Dan Jurafsky, Christopher Manning, Valentin

NIST Gaithersburgs Approach to a Solar PV Array Project John.R.Bollinger@nist.gov 2 NIST

Federal Computer Security Managers Forum Meeting September 10, 2018 NIST Gaithersburg NIST

FEDERAL COMPUTER SECURITY MANAGERS FORUM MEETING FEBRUARY 6, 2020 NIST WEST SQUARE NIST

NIST Trustworthy Email Project High Assurance Domain Project Scott Rose, NIST scottr@nist.gov

FREE YOUR MIND FOUNDATION PRESENTATION FREE YOUR MIND FOUNDATION ORGANIGRAM Johan Peijen (Riki)

Supporting My Child In Reading March 5, 2020 Flora Vista Elementary Keri Hargrave M.Ed Reading

Richard Steele How to improve your childs reading at home Monday 3 rd February 2020 So- Why

The Increasing Importance of Ecosystem Services Pavan Sukhdev Study Leader TEEB CEO GIST

CS214:presentation Zero-sum Game &amp; von Neumann minimax theorem Group:Second++ What is a

A Model Approach toward the Management of Future Uncertainties Shunsuke Mori (Tokyo University of

Generative Adversarial Networks Ian Goodfellow Research Scientist GPU Technology Conference

NextGeneration Neighborhoods NH &amp; RA Summer Institute July 2018 1 NextGeneration

CS214:presentation Zero-sum Game & von Neumann minimax theorem Group:Second++ What is a

NextGeneration Neighborhoods NH & RA Summer Institute July 2018 1 NextGeneration