Statistical Significance Tests in NLP
Natural Language Processing VU (706.230) - Andi Rexha
26/03/2020
Agenda:
○ NLP Tasks
○ Significance Tests
○ Presentation of tasks
○ Types of testing
○ Evaluation
NLP Tasks
○ Given a piece of text, classify it into one of two classes (e.g., positive vs. negative sentiment, or cats vs. dogs)
■ The unit being classified may vary: words, sentences, paragraphs, etc.
○ Results are presented via a confusion matrix
○ Used to compare different algorithms
○ Accuracy = (TP+TN)/(TP+TN+FP+FN)
○ What is the problem with accuracy?
○ Is there any other information that we can collect?
○ With an imbalanced dataset (say, 99% dogs), classify everything as dogs
○ Receive 99% accuracy
○ Yet the classifier is not classifying the way we want: it never finds a cat
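The point above can be made concrete. A minimal sketch with hypothetical toy labels (99 dogs, 1 cat): accuracy looks excellent, while precision and recall on the minority class expose the degenerate classifier.

```python
# Hypothetical toy data: 99 "dog" examples, 1 "cat" example.
# A degenerate classifier that always predicts "dog" still reaches 99% accuracy.
gold = ["dog"] * 99 + ["cat"]
pred = ["dog"] * 100  # classify everything as dogs

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def precision_recall(gold, pred, positive):
    # Precision and recall with respect to one class of interest.
    tp = sum(g == p == positive for g, p in zip(gold, pred))
    fp = sum(p == positive and g != positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

print(accuracy(gold, pred))                 # 0.99
print(precision_recall(gold, pred, "cat"))  # (0.0, 0.0): the imbalance is exposed
```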
○ Annotators evaluate a piece of text with a sentiment score from 1 to 5
○ Given two sentences, annotators assign a semantic-similarity score from 1 to 10
○ Correlations:
■ A statistical technique that shows how strongly two random variables are related
○ Pearson correlation: r = cov(X, Y) / (σ_X · σ_Y)
○ Spearman correlation:
■ It is the Pearson correlation between the ranked variables: r_s = r(rank(X), rank(Y))
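Both coefficients are easy to sketch in plain Python (a minimal illustration, not a library API): Spearman is literally Pearson applied to the rank-transformed data, as the slide says.

```python
import math

def pearson(x, y):
    # r = cov(x, y) / (std(x) * std(y))
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    # 1-based ranks, averaging over ties.
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    # Spearman = Pearson on the ranked variables.
    return pearson(ranks(x), ranks(y))
```

On a monotonic but non-linear relation (e.g. y = x^2 for positive x), Spearman is exactly 1 while Pearson is slightly below 1, which is the practical difference between them.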
○ Statistics are collected and the next word is predicted based on this information
○ Mainly, it models the probability of the sequence: P(w_1, …, w_n) = ∏_i P(w_i | w_1, …, w_{i−1})
○ Perplexity: how well a probability model predicts a sample
○ We want to model a probability distribution P
■ How close is our model distribution Q to the real one?
○ Perplexity(Q) = Q(w_1, …, w_N)^(−1/N)
○ The larger the perplexity, the worse the model
○ The N-th root is a normalizing factor (a per-word average)
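A small sketch of the formula, assuming we already have the per-token probabilities q(w_i) that the model assigned to the sample (the values below are hypothetical):

```python
import math

def perplexity(probs):
    """Perplexity of a sequence from the model's per-token probabilities q(w_i):
    PP = (prod q_i)^(-1/N) = exp(-(1/N) * sum(log q_i))."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

# A model that assigns probability 1/4 to every token has perplexity 4,
# regardless of sequence length: the N-th root normalizes it away.
print(perplexity([0.25] * 10))  # 4.0
```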
○ Usually sentence/phrase based
○ BLEU: BiLingual Evaluation Understudy
■ Accuracy cannot be used because there is no single "exact" correct translation
■ Candidate sentences are compared with high-quality reference translations
■ Based on the fraction of candidate n-grams present in any of the reference translations (modified precision)
■ Returns a value in [0, 1], with 1 reflecting a very good translation
○ Naive unigram precision can be 1 even for a bad candidate
○ Naive solution:
■ Number of words occurring in both, divided by the number of candidate words
■ What if the candidate contains the right words in reversed order?
■ To overcome this problem, n-gram matches are used
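The clipped ("modified") n-gram precision that BLEU builds on can be sketched as follows; the degenerate candidate below is a hypothetical illustration of why clipping and higher-order n-grams are needed.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    # Clip each candidate n-gram count by its maximum count in any reference,
    # so repeating a matched word cannot inflate precision.
    cand = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

# Without clipping, "the the the the" would score precision 1 against a
# reference containing "the"; clipping brings it down to 1/4.
print(modified_precision("the the the the".split(),
                         ["the cat is here".split()], 1))  # 0.25
```

Reversed word order shows why unigrams are not enough: "cat the" vs. reference "the cat" has unigram precision 1.0 but bigram precision 0.0.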
○ Parses the sentence into its grammatical structure
○ "Head"-"Dependent" form
○ It is a directed acyclic graph (mostly implemented as a tree)
○ Example:
■ "There are different examples that we might use!"
○ The percentage of words that have the correct head
○ LAS: Labeled Attachment Score
■ A word is considered correct if both its head and its dependency label are correct
○ UAS: Unlabeled Attachment Score
■ A word is considered correct if its identified head is correct (without necessarily having the correct label)
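Both scores reduce to simple counting over (head, label) pairs. A minimal sketch with hypothetical gold and predicted parses:

```python
def uas_las(gold, pred):
    """gold/pred: one (head, label) pair per word.
    UAS: fraction of words with the correct head.
    LAS: fraction with correct head AND correct dependency label."""
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return uas, las

# Hypothetical 3-word sentence; the parser gets every head right
# but mislabels the third dependency, so UAS = 1.0 while LAS = 2/3.
gold = [(2, "det"), (0, "root"), (2, "obj")]
pred = [(2, "det"), (0, "root"), (2, "nmod")]
print(uas_las(gold, pred))
```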
○ Usually a list of phrases or sentences
○ ROUGE: Recall-Oriented Understudy for Gisting Evaluation
○ Tries to evaluate a summary (or translation) against a list of references
○ ROUGE-N: overlap of n-grams between the system and reference summaries
○ ROUGE-L: measures the longest matching sequence of words, using the longest common subsequence (LCS)
○ ROUGE-S: measures matching word pairs that keep their order, allowing arbitrary gaps (skip-bigrams)
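ROUGE-N as described is a recall-oriented n-gram overlap. A minimal sketch for a single reference (the sentences are hypothetical):

```python
from collections import Counter

def rouge_n(candidate, reference, n):
    """Recall-oriented ROUGE-N for one reference: clipped n-gram overlap
    divided by the number of n-grams in the REFERENCE (not the candidate,
    which is what makes it recall rather than precision)."""
    def grams(toks):
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(ref.values()), 1)

# A short summary covering half the reference's unigrams scores 0.5.
print(rouge_n("the cat sat".split(),
              "the cat sat on the mat".split(), 1))  # 0.5
```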
○ Uses sentences as basic units
○ Based on precision and recall, with more emphasis on recall:
■ F_mean = 10 · P · R / (R + 9 · P)
○ Adds further features: stemming + synonym matching
○ Creates an alignment between the two sentences
■ How good is the alignment?
○ Unigram precision and recall: P = m / w_t, R = m / w_r
■ where m is the number of matched unigrams
■ w_t is the number of unigrams in the candidate translation
■ w_r is the number of unigrams in the reference translation
○ Longer matches are rewarded through a chunk penalty p:
■ p = 0.5 · (c / u_m)^3
■ where c is the minimum number of contiguous matching chunks
■ u_m is the number of mapped unigrams
○ Final score: METEOR = F_mean · (1 − p)
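Putting the formulas together, a sketch that computes METEOR from the four counts (assuming the standard 9:1 recall weighting and penalty weight 0.5 from the original paper; the alignment that produces the counts is not shown):

```python
def meteor(m, w_t, w_r, chunks):
    """METEOR from counts: m = matched unigrams, w_t = unigrams in the
    candidate, w_r = unigrams in the reference, chunks = number of
    contiguous matched chunks in the (minimal) alignment."""
    if m == 0:
        return 0.0
    p, r = m / w_t, m / w_r
    f_mean = 10 * p * r / (r + 9 * p)   # recall weighted 9x over precision
    penalty = 0.5 * (chunks / m) ** 3   # fewer, longer chunks => smaller penalty
    return f_mean * (1 - penalty)

# A perfect 5-word match in a single chunk still pays a tiny penalty:
print(meteor(5, 5, 5, 1))  # 0.996
```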
Papers ACL + TACL
End of Module
Significance Tests
○ Choose a dependent variable (an evaluation measure)
○ Run the state of the art and the current algorithm on a benchmark
○ Find that the current algorithm performs better on the benchmark
○ But is the difference statistically significant?
○ We should apply a statistical significance test to the results
○ M(Alg, X) is the value of the evaluation measure for algorithm Alg on dataset X
○ So the difference between the two algorithms is:
■ δ(X) = M(A, X) − M(B, X)
○ The statistical hypotheses:
■ H0: δ(X) ≤ 0 vs. H1: δ(X) > 0
○ Type 1 error:
■ The null hypothesis is rejected when it is actually true (false positive)
○ Type 2 error:
■ The null hypothesis is not rejected although it is actually false (false negative)
○ If the distribution is known, a parametric test is better:
■ Lower probability of making type 2 errors (higher statistical power)
○ Non-parametric tests make no assumption about the distribution:
■ Less powerful
■ More sound if the distribution isn't known
○ Tests whether a sample comes from a normal distribution
■ Checks whether the variance estimated from the QQ plot and the sample variance are similar
○ Tests the difference between the shapes of two distributions (not only normal ones)
○ Based on the maximum absolute difference between the two cumulative distribution functions
○ Tests whether a given sample comes from a given distribution, also using the cumulative distribution function
○ Assume a distribution with defined parameters
○ Usually assume normally distributed data
■ Typically applied to Accuracy, UAS and LAS
○ By the Central Limit Theorem, the sum of independently drawn variables follows a normal distribution
■ This also covers Pearson's correlation (with n − 2 degrees of freedom)
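The classic parametric choice for paired per-example scores is the Student's t-test. A sketch computing the paired t statistic directly (the p-value is then read from a t-distribution with n − 1 degrees of freedom, e.g. via `scipy.stats.ttest_rel`; the scores below are hypothetical):

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """Paired t statistic on per-example scores of two systems:
    t = mean(d) / (stdev(d) / sqrt(n)), where d are the pairwise
    differences. Returns (t, degrees of freedom = n - 1)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    return mean(d) / (stdev(d) / math.sqrt(n)), n - 1

# Hypothetical per-example scores for systems A and B:
t, df = paired_t_statistic([2, 4, 6, 8], [1, 3, 5, 9])
print(t, df)  # 1.0 3
```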
○ Trade-off: statistical power vs. computational complexity
○ Sampling-free:
■ Doesn't consider the individual values of the evaluation measure
○ Sampling-based:
■ Considers the values of the evaluation measure and samples from them
○ Tests whether matched-pair samples are drawn from distributions with equal medians
○ Counts the number of examples on which algorithm A is better than algorithm B
■ It only checks the sign of the difference (not its magnitude)
○ The null hypothesis states that, given a new pair of measurements:
■ Each is equally likely to be larger than the other
○ So we expect A to be larger than B on 50% of the pairs, and vice versa
○ Requires identically and independently distributed (i.i.d.) variables
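Since only signs matter, the exact test reduces to a Binomial(n, 0.5) tail probability. A sketch of the two-sided sign test on hypothetical per-example scores:

```python
from math import comb

def sign_test(a, b):
    """Two-sided exact sign test on paired scores. Under H0 each non-tied
    pair is equally likely to favour A or B, so the number of '+' signs
    follows Binomial(n, 0.5)."""
    plus = sum(x > y for x, y in zip(a, b))
    minus = sum(x < y for x, y in zip(a, b))
    n = plus + minus          # ties are dropped
    k = min(plus, minus)
    # Two-sided tail: probability of a split at least this lopsided.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Hypothetical: A wins on 9 of 10 examples, loses on 1.
print(sign_test([1] * 9 + [0], [0] * 9 + [1]))  # ~0.0215
```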
○ For paired nominal observations (binary labels)
○ The test is applied to a 2 × 2 contingency table:
■ It tabulates the outcomes of the two algorithms
○ The null hypothesis states that the marginal probability of each outcome (label one or label two) is the same for both algorithms
○ Used for binary classification
○ A generalization of McNemar's test handles multiple classes:
■ It compares which algorithm performs better on each example
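A sketch of McNemar's statistic from per-example correctness (the labels below are hypothetical). Only the discordant cells of the 2 × 2 table enter the formula:

```python
def mcnemar_statistic(gold, pred_a, pred_b):
    """McNemar's chi-squared statistic. From the 2x2 table of the two
    systems' correctness, only the discordant cells matter:
    b = A right / B wrong, c = A wrong / B right.
    chi2 = (b - c)^2 / (b + c), compared against a chi-squared
    distribution with 1 degree of freedom."""
    b = sum(pa == g != pb for g, pa, pb in zip(gold, pred_a, pred_b))
    c = sum(pb == g != pa for g, pa, pb in zip(gold, pred_a, pred_b))
    return (b - c) ** 2 / (b + c)

# Hypothetical: on 10 examples, A alone is right on 8, B alone on 2.
gold   = [1] * 10
pred_a = [1] * 8 + [0] * 2
pred_b = [0] * 8 + [1] * 2
print(mcnemar_statistic(gold, pred_a, pred_b))  # 3.6
```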
○ Compares two matched samples
○ Null hypothesis:
■ The differences follow a symmetric distribution around 0
○ The absolute differences are ranked
■ Each rank then gets the sign of its difference
○ The Wilcoxon statistic sums these signed ranks
○ Used in NLP, e.g., for the UAS measure
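The ranking-and-summing procedure can be sketched as follows, returning min(W+, W−), the statistic usually looked up in Wilcoxon tables (a minimal illustration on hypothetical scores):

```python
def wilcoxon_statistic(a, b):
    """Wilcoxon signed-rank statistic: rank the absolute differences
    (zeros dropped, ties get the average rank), attach each difference's
    sign to its rank, and return min(W+, W-)."""
    d = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    rank = [0.0] * len(d)
    i = 0
    while i < len(d):                       # average ranks over ties in |d|
        j = i
        while j + 1 < len(d) and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            rank[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(rank[i] for i in range(len(d)) if d[i] > 0)
    w_minus = sum(rank[i] for i in range(len(d)) if d[i] < 0)
    return min(w_plus, w_minus)

# Hypothetical differences d = [1, -2, 3]: ranks 1, 2, 3 => W+ = 4, W- = 2.
print(wilcoxon_statistic([2, 0, 4], [1, 2, 1]))  # 2.0
```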
○ Null hypothesis: the two systems behave the same
■ Calculates the value of the test statistic under all possible labellings (permutations) of the test set
○ Intuitively:
■ It runs over all the possible reassignments of the outputs
○ Used in the literature:
■ In the context of Machine Translation
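For paired data, the exact permutation (randomization) test enumerates all 2^n swaps of each example's pair of scores. A sketch (feasible only for small n; in practice a random subset of permutations is sampled instead):

```python
import itertools

def permutation_test(a, b):
    """Exact paired permutation test on the difference in sums/means.
    Under H0 the two systems are exchangeable, so each example's pair of
    scores can be swapped; the p-value is the fraction of the 2^n sign
    assignments whose difference is at least as extreme as the observed one."""
    d = [x - y for x, y in zip(a, b)]
    observed = abs(sum(d))
    count = total = 0
    for signs in itertools.product((1, -1), repeat=len(d)):
        total += 1
        if abs(sum(s * v for s, v in zip(signs, d))) >= observed:
            count += 1
    return count / total

# Hypothetical: A beats B on every example; only the 2 all-same-sign
# assignments out of 2^3 = 8 are as extreme, so p = 0.25.
print(permutation_test([5, 6, 7], [1, 1, 1]))  # 0.25
```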
○ Null hypothesis: as in the permutation test
■ Calculates the value of the test statistic under resampled test sets, drawn with replacement (hence "bootstrap")
○ Intuitively:
■ It builds many pseudo test sets of randomly picked examples, with repetition
○ Used in the literature:
■ Machine Translation, Semantic Parsing, Text Summarization
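A sketch of a simplified paired bootstrap: resample the per-example score differences with replacement and count how often system A's advantage disappears (hypothetical scores; real use needs thousands of resamples and often the centring refinement of the full procedure):

```python
import random

def bootstrap_test(a, b, n_samples=10000, seed=0):
    """Simplified paired bootstrap on per-example scores: resample the
    score differences with replacement; the p-value estimate is the
    fraction of pseudo test sets in which A shows no advantage over B."""
    rng = random.Random(seed)
    d = [x - y for x, y in zip(a, b)]
    not_better = 0
    for _ in range(n_samples):
        sample = [rng.choice(d) for _ in d]     # pseudo test set, with repetition
        if sum(sample) <= 0:
            not_better += 1
    return not_better / n_samples

# Hypothetical: A is better on every example, so no resample can favour B.
print(bootstrap_test([2, 3, 4, 5], [1, 2, 3, 4], n_samples=200))  # 0.0
```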
○ A hypothesis behind applying these tests:
■ The data points are drawn independently
■ This isn't necessarily true (e.g., sentences from the same document are correlated)
○ In cross-validation, the folds are not independent