

SLIDE 1

Statistical Significance Tests in NLP

Natural Language Processing VU (706.230) - Andi Rexha

26/03/2020

SLIDE 2

Agenda

NLP Tasks

  • Presentation of tasks
  • Evaluation metrics
  • Advantages and drawbacks of the metrics

Significance Tests

  • Types of testing
  • Metric to test types
  • Decision tree for applying statistical tests

SLIDE 3

NLP Tasks & Metrics

SLIDE 4

  • Binary classification:

  ○ Given a sentence, classify it as positive or negative (e.g. cats vs. dogs)
  ○ Results presented via a confusion matrix:

  • Why do we need a metric?

  ○ To compare different algorithms
  ○ Accuracy = (TP+TN)/(TP+TN+FP+FN)
  ○ What is the problem with accuracy?
    ■ Word split
    ■ Sentence split
    ■ Paragraphs, etc.
  ○ Is there any other information that we can collect?

Text Classification

SLIDE 5

  • If data contains 99 dogs and 1 cat:

  ○ Classify everything as dogs
  ○ Achieves 99% accuracy
  ○ Not classifying as we want

  • Precision, Recall and F-measure:

  ○ Precision = TP / (TP + FP)
  ○ Recall = TP / (TP + FN)
  ○ F1 = 2 · Precision · Recall / (Precision + Recall)

Text Classification (2)
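The class-imbalance example above can be checked numerically. A minimal sketch (the function name is illustrative; the counts correspond to the 99-dogs/1-cat classifier that predicts "dog" for everything, treating "cat" as the positive class):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 99 dogs, 1 cat; the classifier predicts "dog" for everything:
# TP = 0, FP = 0, FN = 1, TN = 99.
tp, fp, fn, tn = 0, 0, 1, 99
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 0.99: looks great
p, r, f1 = precision_recall_f1(tp, fp, fn)   # all 0.0: reveals the failure
```

Accuracy rewards the degenerate classifier, while precision/recall on the minority class expose it.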

SLIDE 6

  • Sentiment analysis:

○ Annotators evaluate a piece of text with a sentiment from 1-5

  • Sentence semantic similarity:

○ Given two sentences, annotators assign a semantic similarity score from 1-10

  • How to measure the correlation between the human annotations and the algorithm's output?

  ○ Correlation:
    ■ A statistical technique that shows how related two random variables are

Correlation Tasks

SLIDE 7

  • Two most used correlations:

  ○ Pearson correlation:
    ■ r = cov(X, Y) / (σ_X · σ_Y)
  ○ Spearman correlation:
    ■ The Pearson correlation between the ranked variables

Correlation Metrics
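Both correlations can be sketched in a few lines of plain Python (function names are illustrative; Spearman is implemented exactly as the slide defines it, as Pearson over ranks, with ties sharing their average rank):

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance normalized by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(xs):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))
```

A perfectly monotone but non-linear relation (e.g. y = x³) gets Spearman 1.0 while Pearson stays below 1, which is the practical difference between the two.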

SLIDE 8

  • Generating the next token of a sequence
  • Usually based on the collection of co-occurrence of words in a window:

  ○ Statistics are collected and the next word is predicted based on this information
  ○ Mainly, models the probability of the sequence:
    ■ P(w_1, …, w_N) = ∏_i P(w_i | w_1, …, w_{i-1})

  • We need to identify a metric that assesses how well the language model works:

○ How well a probability model predicts a sample!

Language model

SLIDE 9

  • Metric: Perplexity

  ○ We want to model a probability distribution P
    ■ How close is our distribution Q to the real one?

  • We can approximate by testing against drawn samples in P

  ○ Perplexity(Q) = (∏_{i=1}^{N} 1 / Q(w_i))^{1/N}
  ○ The larger the perplexity, the worse
  ○ The N-th root is a normalizing factor

Language model (2)
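A minimal sketch of the metric (the function name is illustrative): perplexity is the exponential of the average negative log-probability the model assigns to the observed tokens, which is numerically safer than multiplying the probabilities directly.

```python
import math

def perplexity(probs):
    """Perplexity of a model on a sample, given the model's probability
    for each observed token: exp of the mean negative log-probability.
    Equivalent to the N-th root of the inverse product of probabilities."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

# Sanity check: a uniform model over a vocabulary of size V
# has perplexity exactly V ("as confused as V-way guessing").
vocab_size = 50
uniform = [1 / vocab_size] * 100
```

The uniform-model identity is a useful intuition: perplexity V means the model is, on average, as uncertain as a fair V-sided die.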

SLIDE 10

  • Translates one piece of text to the other:

○ Usually sentence/phrase based

  • Evaluation metric:

  ○ BLEU: BiLingual Evaluation Understudy
    ■ Accuracy cannot be used because we cannot have the "exact" same translation
    ■ Usually sentences are compared with very good quality reference translations
    ■ Averages the proportion of candidate words present in any of the reference translations
    ■ Returns [0, 1], with 1 reflecting a very good translation

Machine Translation

SLIDE 11

  • Precision on word level alone doesn't work:

  ○ Precision here is 1
  ○ Naive solution:
    ■ Number of words occurring in both, divided by the max number of words
    ■ What if the translation has the words in reversed order?
    ■ To overcome this problem, match n-grams:

  • Average the precision on each n from 1 to 4
  • Penalize if the translation is shorter than one of the references

BLEU
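The steps above can be sketched at sentence level (a simplified sketch, not the full corpus-level BLEU; function names are illustrative, and the brevity penalty here uses the reference closest in length):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it appears in any single reference (stops 'the the the'
    from scoring perfectly)."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, references, max_n=4):
    """Geometric mean of modified 1..4-gram precisions, times a brevity
    penalty that punishes candidates shorter than the closest reference."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * geo_mean
```

The clipping step is exactly the fix for the "Precision here is 1" failure mode on the previous slide, and the n-gram matching penalizes reordered words.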

SLIDE 12

  • Deep Parsing (Dependency Parsing):

  ○ Parses the sentence into its grammatical structure
  ○ "Head" - "Dependent" form
  ○ It is a directed acyclic graph (mostly implemented as a tree)
  ○ Example:
    ■ "There are different examples that we might use!"

Deep Parsing

SLIDE 13

  • Attachment score:

○ The percentage of the words that have the correct head

  • Two measures for the dependency parsing:

  ○ LAS: Labeled Attachment Score
    ■ A word is considered correct if both its head and its label are correct
  ○ UAS: Unlabeled Attachment Score
    ■ A word is considered correct if its identified head is correct (without necessarily having the correct label)

Deep Parsing (2)
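Both scores reduce to per-word comparisons. A minimal sketch, assuming each parse is given as a list of (head index, label) pairs, one per word (the representation and the example labels are illustrative):

```python
def attachment_scores(gold, predicted):
    """UAS and LAS for one sentence.

    Each parse is a list of (head_index, label) pairs, one per word.
    UAS checks only the head; LAS checks head and label together.
    """
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / n
    las = sum(g == p for g, p in zip(gold, predicted)) / n
    return uas, las

# Three-word sentence: the parser finds every head but mislabels one arc,
# so UAS is perfect while LAS drops.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
```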

SLIDE 14

  • Given a long text, the goal is to find a smaller textual representation:

○ Usually a list of phrases or of sentences

  • Evaluation:

  ○ ROUGE: Recall-Oriented Understudy for Gisting Evaluation
  ○ Tries to evaluate the summary (or translation) based on a list of references
  ○ ROUGE-N: overlap of n-grams between the system and reference summaries

Summarization

SLIDE 15

  • ROUGE-1 refers to the overlap of unigrams (each word) between the system and reference summaries

  • ROUGE-2 refers to the overlap of bigrams between the system and references
  • Other Rouge variants:

  ○ ROUGE-L: measures the longest matching sequence of words using LCS
  ○ ROUGE-S: measures matching ordered word pairs (skip-grams), allowing for arbitrary gaps

Rouge
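ROUGE-N in its recall-oriented form is short enough to sketch directly (an illustrative single-reference sketch; the full metric also reports precision and F-score and handles multiple references):

```python
from collections import Counter

def rouge_n(system, reference, n):
    """ROUGE-N recall: fraction of reference n-grams that also appear
    in the system summary (counts are clipped per n-gram)."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref_grams = grams(reference)
    sys_grams = grams(system)
    overlap = sum(min(c, sys_grams[g]) for g, c in ref_grams.items())
    total = sum(ref_grams.values())
    return overlap / total if total else 0.0
```

Note the denominator is the reference length in n-grams, which is what makes ROUGE recall-oriented, in contrast to BLEU's precision orientation.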

SLIDE 16

  • METEOR (Metric for Evaluation of Translation with Explicit ORdering):

  ○ Uses sentences as basic units
  ○ Based on Precision and Recall, with more emphasis on Recall
  ○ Adds some other features: stemming + synonyms
  ○ Creates an alignment between the two sentences
    ■ How good is the alignment?

METEOR

SLIDE 17

  • Calculating the precision:

  ○ Unigram Precision and Recall:
    ■ P = m / w_t and R = m / w_r
    ■ Where m is the number of common unigrams
    ■ w_t is the number of unigrams in the candidate translation
    ■ w_r is the number of unigrams in the reference translation
    ■ Combined as F_mean = 10 · P · R / (R + 9 · P)
  ○ Longer n-gram matches are used to compute a penalty p:
    ■ p = 0.5 · (c / u_m)³
    ■ Where c is the minimum number of matching chunks
    ■ u_m is the number of mapped unigrams
    ■ Final score: M = F_mean · (1 − p)

METEOR (2)
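A simplified sketch of the computation (exact unigram matches only, no stemming or synonym modules; the greedy alignment and the function name are illustrative):

```python
def meteor_sketch(candidate, reference):
    """Simplified METEOR: greedily align exact unigram matches, compute
    the recall-weighted harmonic mean F_mean = 10PR/(R + 9P), then apply
    the fragmentation penalty 0.5 * (chunks / matches)^3."""
    matches = []          # (candidate_index, reference_index) pairs
    used_ref = set()
    for i, tok in enumerate(candidate):
        for j, ref_tok in enumerate(reference):
            if j not in used_ref and tok == ref_tok:
                matches.append((i, j))
                used_ref.add(j)
                break
    m = len(matches)
    if m == 0:
        return 0.0
    precision = m / len(candidate)
    recall = m / len(reference)
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Chunks: maximal runs of matches contiguous in both sentences.
    chunks = 1
    for (c1, r1), (c2, r2) in zip(matches, matches[1:]):
        if c2 != c1 + 1 or r2 != r1 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)
```

Even a perfect translation scores slightly below 1 here, because one chunk out of m matches still yields a small nonzero penalty.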

SLIDE 18

Metrics on Papers

Papers ACL + TACL

SLIDE 19

NLP Tasks & Metrics

End of Module

SLIDE 20

Statistical Significance Tests

SLIDE 21

  • How it is done:

  ○ Define a dependent variable (the evaluation measure)
  ○ Run the state of the art and the current algorithm on a benchmark
  ○ Find out that the current algorithm performs better on the benchmark

  • What if we have an improvement (i.e. accuracy) of 1%?

  ○ Is it statistically significant?
  ○ We should apply a statistical significance test to the results

Current Status

SLIDE 22

  • Given two algorithms A and B, a dataset X and a measure M:

  ○ M(Alg, X) is the value of the evaluation measure for algorithm Alg on dataset X
  ○ So the difference between the two algorithms is:
    ■ δ(X) = M(A, X) − M(B, X)
  ○ The statistical hypotheses:
    ■ H0: δ(X) ≤ 0 vs. H1: δ(X) > 0

Significance Tests

SLIDE 23

  • The p-value for rejecting/accepting the null hypothesis:

  ○ The probability, under the null hypothesis, of observing a difference at least as extreme as the one measured

  • Types of error (as learned in the class)

  ○ Type 1 error:
    ■ The null hypothesis is rejected when it is actually true
  ○ Type 2 error:
    ■ The null hypothesis is not rejected, although it should be

Significance Tests (2)

SLIDE 24

  • Parametric vs. Non-Parametric

  ○ If the distribution is known, a parametric test is better:
    ■ Lower probability of making type 2 errors
  ○ Non-parametric tests don't make any assumption about the distribution:
    ■ Less powerful
    ■ More sound if the distribution isn't known

Significance Tests (3)

SLIDE 25

  • Null-hypothesis:

○ Sample comes from a normal distribution

  • Shapiro-Wilk:

  ○ Checks how well the sample's order statistics match those expected under normality (as visualized in a Q-Q plot)

  • Kolmogorov-Smirnov:

  ○ Compares the shapes of two distributions (not only the normal)
  ○ Uses the maximum absolute difference between the two cumulative distribution functions

  • Anderson-Darling:

  ○ Tests whether a sample comes from a given distribution; also based on the cumulative distribution function

Normality Test
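The Kolmogorov-Smirnov statistic described above is simple to compute directly. A minimal one-sample sketch against the standard normal (the function names are illustrative; real usage would standardize the sample first and look up critical values):

```python
import math

def normal_cdf(x):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def ks_statistic(sample):
    """One-sample Kolmogorov-Smirnov statistic against the standard normal:
    the maximum absolute difference between the empirical CDF and the
    theoretical CDF, checked on both sides of each step."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = normal_cdf(x)
        d = max(d, abs((i + 1) / n - cdf), abs(i / n - cdf))
    return d
```

A sample centered far from 0 drives the statistic toward 1, while a roughly standard-normal-looking sample keeps it small.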

SLIDE 26

  • Parametric tests assume that the data are drawn from a known distribution:

  ○ With defined parameters
  ○ Usually normally distributed

  • Paired Student’s t-test:

  ○ Typically applied to Accuracy, UAS and LAS:
    ■ Compute the mean number of correct predictions per sample
    ■ Based on the Central Limit Theorem: the sum of independently drawn variables follows a Normal Distribution
  ○ Also Pearson's correlation (with n − 2 degrees of freedom)

Parametric Test
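The paired t statistic is computed on the per-sample score differences. A minimal sketch (the function name is illustrative; the resulting t is compared against the t distribution with n − 1 degrees of freedom):

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Paired Student's t statistic on per-sample score differences:
    t = mean(d) / (sd(d) / sqrt(n)), with n - 1 degrees of freedom."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)
```

Pairing matters: the test uses the variance of the differences, not of the two score lists separately, so consistent small advantages can still be highly significant.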

SLIDE 27

  • Two dimensions:

  ○ Statistical power
  ○ Computational complexity

  • Two families:

  ○ Sample free:
    ■ Doesn't consider the values of the evaluation measure
  ○ Sample based:
    ■ Considers the values of the evaluation measures and samples from them

Non Parametric Tests

SLIDE 28

  • Sign test:

  ○ Tests whether matched pair samples are drawn from distributions with equal medians
  ○ Counts the number of examples on which algorithm A is better than algorithm B
    ■ Just checks the sign (not how big or small the differences are)
  ○ The null hypothesis states that, given a new pair of measurements, each is equally likely to be larger than the other
  ○ Expects 50% of A to be larger than B, and vice versa
  ○ Requires independently and identically distributed (i.i.d.) variables

Non Parametric - Sample Free
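Because only the signs matter, the exact test reduces to a binomial tail probability. A minimal sketch (the function name is illustrative; ties are assumed to be dropped beforehand, one common convention):

```python
from math import comb

def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact sign test: under H0 each pair is a fair coin flip,
    so the number of A-wins follows Binomial(n, 0.5)."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of a split at least as lopsided as the observed one.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, A winning all 10 of 10 comparisons gives p = 2/1024 ≈ 0.002, while a 5-5 split gives p = 1.0.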

SLIDE 29

  • McNemar’s test

  ○ For paired nominal observations (binary labels)
  ○ The test is applied to a 2 × 2 contingency table:
    ■ Tabulates the outcomes of the two algorithms
  ○ The null hypothesis states that the marginal probability of each outcome (label one or label two) is the same for both algorithms
  ○ Used for binary classification

  • The Cochran’s Q test:

  ○ Generalizes McNemar's test to more than two classes
  ○ Asks which algorithm performs better on each example

Non Parametric - Sample Free (2)
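The McNemar statistic uses only the discordant cells of the contingency table, i.e. the examples the two algorithms disagree on. A minimal sketch (function name and table layout are illustrative; the statistic is compared against the chi-squared distribution with 1 degree of freedom):

```python
def mcnemar_statistic(table):
    """McNemar's chi-squared statistic from a 2x2 contingency table
    [[both correct, only A correct], [only B correct, both wrong]].
    Only the discordant counts b and c enter the statistic."""
    b = table[0][1]   # A correct, B wrong
    c = table[1][0]   # B correct, A wrong
    return (b - c) ** 2 / (b + c)

# A is right on 15 examples where B is wrong, and wrong on 5 where B is right.
stat = mcnemar_statistic([[50, 15], [5, 30]])   # (15-5)^2 / 20 = 5.0
```

The cells where both algorithms agree carry no evidence about which one is better, which is why they are ignored.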

SLIDE 30

  • Wilcoxon signed-rank test

  ○ Compares two matched samples
  ○ Null hypothesis:
    ■ The differences follow a symmetric distribution around 0
  ○ The absolute values of the differences are ranked
    ■ Each rank then gets the sign of its difference
  ○ The Wilcoxon test then sums these signed ranks
  ○ Used in NLP for the UAS measure

Non Parametric - Sample Free (3)
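The ranking-and-summing procedure can be sketched directly (an illustrative simplification: zero differences are dropped, and tied absolute differences get consecutive rather than averaged ranks; the resulting W is then compared against Wilcoxon critical-value tables):

```python
def wilcoxon_w(scores_a, scores_b):
    """Wilcoxon signed-rank statistic: rank the absolute differences
    (zeros dropped), reattach the signs, sum positive and negative
    ranks separately, and return the smaller sum W."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    ordered = sorted(diffs, key=abs)
    w_plus = w_minus = 0.0
    for rank, d in enumerate(ordered, start=1):
        if d > 0:
            w_plus += rank
        else:
            w_minus += rank
    return min(w_plus, w_minus)
```

Unlike the sign test, this uses the magnitudes of the differences through their ranks, so large wins count for more than narrow ones.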

SLIDE 31

  • Pitman’s permutation test:

  ○ Null hypothesis:
    ■ The labelling (which algorithm produced which output) doesn't matter
  ○ Calculates the value of the test statistic under all possible labellings (permutations) of the test set
  ○ Intuitively:
    ■ Makes a run over all the possible outcomes

  • Checks whether the resulting p-value is below the chosen significance level

  ○ Used in literature:
    ■ In the context of Machine Translation

Non Parametric - Sample Based
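For paired outputs, each "relabelling" amounts to swapping A's and B's result on some subset of examples, i.e. flipping the sign of some differences. A minimal exhaustive sketch (function name illustrative; feasible only for small n, which is why random sampling of permutations is used in practice):

```python
from itertools import product

def permutation_test(scores_a, scores_b):
    """Exact paired permutation test: under H0 the A/B labels are
    exchangeable within each pair, so every pattern of sign flips of
    the differences is equally likely. Returns the one-sided p-value."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs)
    count = 0
    total = 0
    for signs in product([1, -1], repeat=len(diffs)):
        total += 1
        if sum(s * d for s, d in zip(signs, diffs)) >= observed:
            count += 1
    return count / total
```

With n paired examples there are 2^n relabellings, so the exhaustive version is only practical for small test sets.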

SLIDE 32

  • Paired bootstrap tests

  ○ Null hypothesis:
    ■ The two algorithms perform the same; the statistic is recomputed on test sets resampled with replacement (hence "bootstrap")
  ○ Intuitively:
    ■ Builds many test sets of randomly picked examples, with repetition

  • Checks whether the resulting p-value is below the chosen significance level

  ○ Used in literature:
    ■ Machine Translation, Semantic Parsing, Text Summarization

Non Parametric - Sample Based (2)
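A minimal sketch of one common simplified variant (function name and the "advantage vanishes" criterion are illustrative; published formulations differ in exactly which tail they count):

```python
import random

def paired_bootstrap_p(scores_a, scores_b, n_resamples=1000, seed=0):
    """Simplified paired bootstrap: resample the per-example score
    differences with replacement and report how often A's advantage
    over B disappears (resampled mean difference <= 0)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    worse = 0
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            worse += 1
    return worse / n_resamples
```

Unlike the exhaustive permutation test, cost is controlled by the number of resamples rather than 2^n, which is why the bootstrap scales to realistic test-set sizes.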

SLIDE 33

  • Dependent observations:

  ○ The tests above assume that:
    ■ Data are drawn independently
    ■ Which isn't necessarily true

  • Significance on cross validation:

○ The folds are not independent

Open Issues

SLIDE 34

Decision Tree for Statistical Significance

SLIDE 35

Thank you for your attention!