SLIDE 1

Evaluation measures in NLP

Zdeněk Žabokrtský

October 30, 2020

NPFL070 Language Data Resources

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

SLIDE 2

Outline

  • Evaluation goals and basic principles
  • Selected good practices in experiment evaluation
  • Selected task-specific measures
  • Big picture
  • Final remarks

SLIDE 3

Evaluation goals and basic principles

SLIDE 4

Basic goals of evaluation

  • The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system.
  • A side effect: definition of proper evaluation criteria is one way to precisely specify an NLP problem (don’t underestimate this!).
  • Need for evaluation metrics and evaluation data (actually, evaluation is one of the main reasons why we need language data resources).

SLIDE 5

Automatic vs. manual evaluation

  • automatic
  • comparing the system’s output with the gold standard output and evaluating e.g. the percentage of correctly predicted answers
  • the cost of producing the gold standard data…
  • …but then easily repeatable without additional cost
  • manual
  • manual evaluation is performed by human judges, who are instructed to estimate the quality of a system based on a number of criteria
  • for many NLP problems, the definition of a gold standard (the “ground truth”) can prove impossible (e.g., when inter-annotator agreement is insufficient)

SLIDE 6

Intrinsic vs. extrinsic evaluation

  • Intrinsic evaluation
  • considers an isolated NLP system and characterizes its performance
  • Extrinsic evaluation
  • considers the NLP system as a component in a more complex setting

SLIDE 7

The simplest case: accuracy

  • accuracy – just a percentage of correctly predicted instances

accuracy = correctly predicted instances / all instances × 100 %

  • correctly predicted = identical with human decisions as stored in manually annotated data

  • an example – a part-of-speech tagger:
  • an expert (trained annotator) chooses the correct POS value for each word in a corpus
  • the annotated data is split into two parts
  • the first part is used for training a POS tagger
  • the trained tagger is applied to the second part
  • POS accuracy = the ratio of tokens with correctly predicted POS values (see the sketch below)
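A minimal sketch of this accuracy computation in Python; the function name and the example tag lists are illustrative assumptions, not from the slides:

# accuracy of a POS tagger over parallel lists of gold and predicted tags
def accuracy(gold_tags, predicted_tags):
    assert len(gold_tags) == len(predicted_tags)
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags) * 100  # in per cent

print(accuracy(["NOUN", "VERB", "DET"], ["NOUN", "VERB", "NOUN"]))  # 66.66...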

SLIDE 8

Selected good practices in experiment evaluation

SLIDE 9

Experiment design

  • the simplest loop:
  • 1. train a model using the training data
  • 2. evaluate the model using the evaluation data
  • 3. improve the model
  • 4. goto 1
  • What’s wrong with it?

SLIDE 10

A better data division

  • the more iterations of the loop, the fuzzier the distinction between the training and evaluation phases
  • in other words, evaluation data effectively becomes training data after repeated evaluations (the more evaluations, the worse)
  • but if you “spoil” all the data by using it for training, then there’s nothing left for an ultimate evaluation
  • this problem should be foreseen: divide the data into three portions, not two (see the sketch below):
  • training data
  • development data (devset, devtest, tuning set)
  • evaluation data (etest) – to be used only once (in a very long time)
  • sometimes even more complicated schemes are needed (e.g. for choosing hyperparameter values)
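A minimal sketch of such a three-way split; the 80/10/10 ratio, the fixed seed, and the function name are illustrative assumptions, not prescribed by the slides:

import random

def three_way_split(instances, dev_frac=0.1, test_frac=0.1, seed=42):
    data = list(instances)
    random.Random(seed).shuffle(data)
    n_dev = int(len(data) * dev_frac)
    n_test = int(len(data) * test_frac)
    dev = data[:n_dev]
    test = data[n_dev:n_dev + n_test]    # etest: touch only for the final evaluation
    train = data[n_dev + n_test:]
    return train, dev, test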

SLIDE 11

K-fold cross-validation

  • used especially in the case of very small data, when an actual train/test division could have a huge impact on the measured quantities
  • averaging over several training/test divisions, usually K=10
  • 1. partition the data into K roughly equally-sized subsamples
  • 2. perform cyclically K iterations:
  • use K − 1 subsamples for training
  • use 1 subsample for testing
  • 3. compute the arithmetic average of the K results
  • 4. more reliable results, especially if you face very small or in some sense highly diverse data (see the sketch below)
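A minimal sketch of the K-fold loop; train_fn and eval_fn are hypothetical callables standing in for whatever model and metric are used:

def cross_validate(data, train_fn, eval_fn, k=10):
    folds = [data[i::k] for i in range(k)]        # K roughly equally-sized subsamples
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train)                   # train on K - 1 subsamples
        scores.append(eval_fn(model, test))       # test on the remaining one
    return sum(scores) / k                        # arithmetic average of the K results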

SLIDE 12

Interpreting the results of evaluation

  • You run an experiment, you evaluate it, you get some evaluation result, i.e. some number …

  • …is the number good or bad?
  • No easy answer
  • However, you can interpret the results with respect to expected upper and lower bounds.

SLIDE 13

Lower and upper bounds

  • the fact that the domain of accuracy is 0–100 % does not imply that any number within this interval is a reasonable result
  • the performance of a system under study is always expected to lie inside the interval given by
  • lower bound – result of a baseline solution (a less complex or even trivial system whose performance is supposed to be easily surpassed)
  • upper bound – result of an oracle experiment, or the level of agreement between human annotators

SLIDE 14

Baseline

  • usually there’s a sequence of gradually more complex (and thus more successful) methods
  • in the case of classification-like tasks:
  • random-choice baseline
  • most-frequent-value baseline
  • some intermediate solution, e.g. a “unigram model” in POS tagging (see the sketch below):

argmax_tag P(tag | word)

  • the most ambitious baseline: the previous state-of-the-art result
  • in structured tasks:
  • again, start with predictions based on simple rules
  • examples for dependency parsing:
  • attach each node below its left neighbor
  • attach each node below the nearest verb
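A minimal sketch of the “unigram model” baseline for POS tagging; falling back to the globally most frequent tag for unseen words is an assumption of this sketch, not stated on the slide:

from collections import Counter, defaultdict

def train_unigram_baseline(tagged_words):         # [(word, tag), ...]
    by_word = defaultdict(Counter)
    all_tags = Counter()
    for word, tag in tagged_words:
        by_word[word][tag] += 1
        all_tags[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in by_word.items()}  # argmax_tag P(tag | word)
    fallback = all_tags.most_common(1)[0][0]
    return lexicon, fallback

def tag_words(words, lexicon, fallback):
    return [lexicon.get(w, fallback) for w in words]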

SLIDE 15

Oracle experiment

  • the purpose of oracle experiments – to find a performance upper bound (from the viewpoint of our evaluated component)
  • oracle
  • an imaginary entity that makes the best possible decisions
  • however, respecting the other limitations of the experiment setup
  • naturally, oracle experiments make sense only in cases in which the “other limitations” imply that even the oracle’s performance is less than 100 %
  • example – an oracle experiment in POS tagging (see the sketch below):
  • for each word, collect all its possible POS tags from a hand-annotated corpus
  • on testing data, choose the correct POS tag whenever it is available in the set of possible tags for the given word
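A minimal sketch of that oracle experiment; counting words unseen in the annotated corpus as errors is an assumption of this sketch:

from collections import defaultdict

def oracle_accuracy(train_tagged, test_tagged):   # [(word, gold_tag), ...]
    possible = defaultdict(set)
    for word, tag in train_tagged:
        possible[word].add(tag)
    # the oracle picks the gold tag whenever it is among the word's known tags
    correct = sum(1 for word, gold in test_tagged if gold in possible[word])
    return correct / len(test_tagged) * 100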

SLIDE 16

Noise in annotations

  • recall that the “ground truth” (= correct answers for given task instances) is usually delivered by human experts (annotators)
  • So let’s divide the data among annotators, collect it back with their annotations, and we are done… No, this is too naive!
  • Annotation is “spoiled” by various sources of inconsistencies:
  • It might take a while for an annotator to grasp all annotation instructions and become stable in decision making.
  • Gradually gathered annotation experience typically leads to improvements of (esp. a more detailed specification of) the annotation instructions.
  • But still, different annotators might make different decisions, even under the same instructions.
  • …and of course many other errors due to speed, insufficient concentration, work-ethics issues etc., like with any other work.
  • It is absolutely essential to quantify inter-annotator agreement.

SLIDE 17

Inter-annotator agreement (IAA) measure

  • IAA is supposed to tell us how well human experts perform when given a specific task (to measure the reliability of manual annotations)
  • In the simplest case: accuracy measured on data from one annotator, with the other annotator treated as a virtual gold standard (symmetric)
  • Slightly more complex: using F-measure instead of accuracy (still symmetric if F1).
  • But is e.g. IAA=0.8 good enough, or not?
  • We should consider a baseline for IAA (which is the agreement level gained without any real decision making).

SLIDE 18

Inter-annotator agreement – agreement by chance

Example:

  • two annotators making classifications into two classes, A and B
  • 1st annotator: 80% A, 20% B
  • 2nd annotator: 85% A, 15% B
  • probability of agreement by chance: 0.8·0.85 + 0.2·0.15 = 71%

desired measure: 1 if they agree in all decisions, 0 if their agreement is equal to agreement by chance

SLIDE 19

Cohen’s Kappa

  • takes into account the agreement by chance

κ = (P_a − P_e) / (1 − P_e)

  • P_a = relative observed agreement between the two annotators (i.e., probability of agreement)
  • P_e = probability of agreement by chance
  • scale from −1 to +1 (negative kappa unexpected but possible)
  • interpretation still unclear, but at least we have abstracted away from the by-chance agreement baseline
  • conventional interpretation: …0.40–0.59 weak agreement, 0.60–0.79 moderate agreement, 0.80–0.90 strong agreement… (see the sketch below)
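A minimal sketch computing Cohen's kappa from two annotators' label sequences (hypothetical inputs); the chance agreement P_e is estimated from each annotator's own label distribution, as in the example on the previous slide:

from collections import Counter

def cohens_kappa(ann1, ann2):
    n = len(ann1)
    p_a = sum(a == b for a, b in zip(ann1, ann2)) / n           # observed agreement
    c1, c2 = Counter(ann1), Counter(ann2)
    p_e = sum((c1[label] / n) * (c2[label] / n)                 # agreement by chance
              for label in set(c1) | set(c2))
    return (p_a - p_e) / (1 - p_e)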

SLIDE 20

Rounding

  • In most cases, we present results in the decimal system.
  • Inevitably, we express two quantities when presenting any single number:
  • the value itself
  • and our certainty about the value, rendered by using a specific number of significant digits.
  • Writing more digits than justified by the experiment setup is a bad habit!
  • You are misleading the reader if you say that the error rate of your system is 42.8571% when it has made 3 errors in 7 task instances (you indicate more certainty about the result than justified).
  • In short, the number of significant digits is linked to the experiment setting and reflects its result uncertainty.

SLIDE 21

Basic rules for rounding

Actually very simple:

  • Multiplication/division – the number of significant digits in an answer should equal the least number of significant digits in any one of the numbers being multiplied/divided.
  • Addition/subtraction – the number of decimal places (not significant digits) in the answer should be the same as the least number of decimal places in any of the numbers being added or subtracted. (See the rounding sketch below.)
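A minimal sketch of rounding to a chosen number of significant digits; the helper name and the choice of digits for the 3-errors-in-7-instances example are illustrative assumptions:

import math

def round_sig(x, sig):
    # round x to `sig` significant digits
    if x == 0:
        return 0.0
    return round(x, sig - int(math.floor(math.log10(abs(x)))) - 1)

print(round_sig(3 / 7 * 100, 1))   # 40.0 - honest about the uncertainty
print(round_sig(3 / 7 * 100, 6))   # 42.8571 - suggests far more certainty than justified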

SLIDE 22

Selected task-specific measures

SLIDE 23

Accuracy revisited

accuracy = correctly predicted instances / all instances × 100 %

Easy if

  • the number of task instances is known
  • there is exactly one correct answer for each instance
  • our system gives exactly one answer for each instance
  • all errors are equally wrong

But what if not?

SLIDE 24

Precision and recall

  • precision – if our system gives a prediction, what is its average quality

precision = correct answers given / all answers given × 100 %

  • recall – what average proportion of all possible correct answers is given by our system (see the sketch below)

recall = correct answers given / all possible correct answers × 100 %
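A minimal sketch for the common case where the system outputs a set of answers compared against a gold set (entity extraction is an assumed example task):

def precision_recall(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    correct = predicted & gold
    precision = len(correct) / len(predicted) * 100 if predicted else 0.0
    recall = len(correct) / len(gold) * 100 if gold else 0.0
    return precision, recall

print(precision_recall({"Prague", "Brno", "Paris"}, {"Prague", "Brno", "Ostrava"}))
# (66.66..., 66.66...)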

SLIDE 25

Precision and recall – new issues

  • computing precision and recall → we are in 2D now
  • but in most cases we want to be able to order all systems along a single scale
  • in simple words: in 2D it’s sometimes hard to say what is better and what is worse
  • example: is it better to have a system with P=0.8 and R=0.2, or P=0.9 and R=0.1 ?
  • thus we want to get back to 1D again
  • a possible solution: F-measure

SLIDE 26

Get back to 1D: F-measure

  • F-measure = weighted harmonic mean of P and R
  • (note: if two quantities are to be weighted, we need just one weighting parameter X, and the other one can be computed as 1 − X)

F_β = (β² + 1) · precision · recall / (β² · precision + recall)

  • exercise: show how this formula follows from the formula for the harmonic mean
  • usually weighted evenly (= no β needed), as in the sketch below:

F_1 = 2 · precision · recall / (precision + recall)
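A minimal sketch of the F-measure; the two example P/R pairs echo the question from the previous slide and are illustrative only:

def f_measure(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_measure(0.8, 0.2))   # 0.32
print(f_measure(0.9, 0.1))   # 0.18 - F1 prefers the more balanced system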

SLIDE 27

Precision at K

  • in some situations, recall is close-to-impossible to estimate
  • example: how many relevant web pages for a given query are there on the Internet?
  • impossible to hand-check
  • useless anyway, because no users are interested in reading them all
  • precision at 10 (P10 for short) in information retrieval (see the sketch below)
  • proportion of the top 10 documents that are relevant
  • disadvantage: disregards the ordering of hits within those top 10 documents
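A minimal sketch of P@K; ranked_docs (the system's ranked result list) and relevant (the set of documents judged relevant) are hypothetical inputs:

def precision_at_k(ranked_docs, relevant, k=10):
    top_k = ranked_docs[:k]
    return sum(doc in relevant for doc in top_k) / k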

SLIDE 28

Word error rate (WER)

  • a common metric for speech recognition
  • typically, the number of recognized words can be quite different from the number of true words

WER = (S + D + I) / (S + D + C)

  • S – substituted words, D – deleted words, I – inserted words, C – correct words
  • exercise: one substitution is equivalent to one deletion plus one insertion; what should we do with that? (see the sketch below)
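A minimal sketch that computes WER as the word-level Levenshtein distance (S + D + I) divided by the reference length (S + D + C); the example sentences are made up:

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2/6 = 0.33...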

SLIDE 29

BLEU (bilingual evaluation understudy)

BLEU = BP · exp( (1/4) · Σ_{n=1..4} log p_n )

BP = min(1, exp(1 − r/c))

  • BP = brevity penalty (multiplicative!)
  • p_n = n-gram precision (reference-clipped counts)
  • r = total length (#words) of the reference
  • c = total length of the candidate translation

  • a common measure in machine translation
  • measures similarity between a machine’s output and a human translation
  • criticism
  • todo
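A minimal sentence-level sketch of BLEU with reference-clipped n-gram counts up to 4-grams and the brevity penalty; it assumes a single reference and no smoothing, so it is an illustration rather than a replacement for a standard implementation such as sacreBLEU:

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0                                  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))   # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)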

SLIDE 30

Global view on NLP evaluation: a compromise is often needed

  • more complex evaluation measures are often designed as a compromise (a trade-off) between two or more criteria pushing in different directions
  • precision against recall in F-measure
  • n-gram precision against brevity penalty in BLEU
  • in manual evaluation of machine translation: fluency against adequacy

SLIDE 31

Big picture

SLIDE 32

Evaluation needs a larger context

  • An example evaluation setup before you start measuring the performance of your new and shiny method developed for a given NLP task, using a given dataset and a given evaluation measure:

[Figure: a 0 %–100 % scale marked with baseline 1 (a naive one), baseline 2 (less naive), baseline 3 (a sophisticated one), the current state of the art, the oracle score, and the inter-annotator agreement.]

SLIDE 33

Evaluation needs a larger context, cont.

[Figure: the same 0 %–100 % scale with the baselines, the current state of the art, the oracle score, and the inter-annotator agreement; an arrow asks “What if I am here?”]

SLIDE 34

Evaluation needs a larger context, cont.

[Figure: the same 0 %–100 % scale with the baselines, the current state of the art, the oracle score, and the inter-annotator agreement; an arrow asks “What if I am here?”]

  • Comments: outperforming an oracle is impossible by definition. Search for a bug in your evaluation script.

SLIDE 35

Evaluation needs a larger context, cont.

[Figure: the same 0 %–100 % scale with the baselines, the current state of the art, the oracle score, and the inter-annotator agreement; an arrow asks “What if I am here?”]

SLIDE 36

Evaluation needs a larger context, cont.

[Figure: the same 0 %–100 % scale with the baselines, the current state of the art, the oracle score, and the inter-annotator agreement; an arrow asks “What if I am here?”]

  • Comments: Bad luck. Maybe the structure of your ML model does not fit the task well, and/or its hyper-parameters are not set properly, and/or the performance is spoiled by underfitting or overfitting. Revise the code to avoid bugs and apply standard ML diagnostics (e.g. examine learning curves). Until you beat the baselines, it’s unfortunately not worth publishing.

SLIDE 37

Evaluation needs a larger context, cont.

[Figure: the same 0 %–100 % scale with the baselines, the current state of the art, the oracle score, and the inter-annotator agreement; an arrow asks “What if I am here?”]

SLIDE 38

Evaluation needs a larger context, cont.

[Figure: the same 0 %–100 % scale with the baselines, the current state of the art, the oracle score, and the inter-annotator agreement; an arrow asks “What if I am here?”]

  • Comments: This is possible but suspicious, as humans still outperform machines in quality measures in most NLP tasks nowadays. Didn’t you e.g. mix training and evaluation data by mistake?

SLIDE 39

Evaluation needs a larger context, cont.

[Figure: the same 0 %–100 % scale with the baselines, the current state of the art, the oracle score, and the inter-annotator agreement; an arrow asks “What if I am here?”]

SLIDE 40

Evaluation needs a larger context, cont.

[Figure: the same 0 %–100 % scale with the baselines, the current state of the art, the oracle score, and the inter-annotator agreement; an arrow asks “What if I am here?”]

  • Comments: Congrats, it seems you’ve just established a new state of the art! Hurry up and publish it! But first double-check that the whole experimental setup is correct and fair (again, didn’t some pieces of information from the evaluation portion penetrate the training data? Didn’t you use the evaluation data too many times?)

SLIDE 41

Evaluation needs a larger context, cont.

[Figure: the same 0 %–100 % scale with the baselines, the current state of the art, the oracle score, and the inter-annotator agreement; an arrow asks “What if I am here?”]

SLIDE 42

Evaluation needs a larger context, cont.

[Figure: the same 0 %–100 % scale with the baselines, the current state of the art, the oracle score, and the inter-annotator agreement; an arrow asks “What if I am here?”]

  • Comments: This is by far the most usual experimental outcome. You outperformed all the baselines, which is nice, but you did not beat the current champion, which is …usual. Depending on details, your approach might still be worth using by others, e.g. because of its speed, its robustness in under-resourced scenarios, or its freer license policy.

SLIDE 43

Final remarks

SLIDE 44

Frequent problems with golden-data-based evaluation in NLP

  • Unclear “ground truth” – what if even human annotators disagree?
  • Even worse, sometimes the search space is so huge that creating reasonably representative hand-annotated data is virtually impossible (e.g. a path through a man-machine dialogue, if more dialogue turns are considered).
  • Unclear whether all errors should be treated equally (e.g. attaching a verb’s argument in dependency parsing seems more important than attaching a punctuation mark); if weighting is needed, then we always risk arbitrariness.
  • A rather recent problem: for some tasks, the quality of state-of-the-art solutions is already above that of average annotators (i.e., even hand-annotated gold data might not be gold enough).

SLIDE 45

Take-home messages

  • using data for evaluation – perhaps the most important role of data in NLP
  • the existence of data allows exact measurement, like e.g. in physics,
  • however, unlike in physics, evaluation measures in NLP are almost always based on various assumptions (even using a plain percentage in the simplest cases implies that we assume all errors to be equally severe, which is unrealistic)
  • unlike in physics, evaluation results achieved by the same methods might differ considerably across datasets/genres/languages…
  • because of this limited interpretability, in NLP evaluation it is not enough just to present a final value of the measured quantity – you should always anchor it in the bigger picture!
