

SLIDE 1

Statistical Significance Tests in NLP

Natural Language Processing VU (706.230) - Andi Rexha

26/03/2020

SLIDE 2

Agenda

NLP Tasks

  • Presentation of tasks
  • Evaluation metrics
  • Advantages and drawbacks of the metrics

Significance Tests

  • Types of testing
  • Metric to test types
  • Decision tree for applying statistical tests

SLIDE 3

NLP Tasks & Metrics

SLIDE 4

  • Binary classification:

  ○ Given a sentence, classify it as positive or negative (e.g. cats vs. dogs)
  ○ Results presented via a confusion matrix:

  • Why do we need a metric?

  ○ To compare different algorithms
  ○ Accuracy = (TP+TN)/(TP+TN+FP+FN)
  ○ What is the problem with accuracy?
    ■ Word split
    ■ Sentence split
    ■ Paragraphs, etc.
  ○ Is there any other information that we can collect?

Text Classification

SLIDE 5

  • If data contains 99 dogs and 1 cat:

  ○ Classify everything as dogs
  ○ Achieves 99% accuracy
  ○ Not classifying as we want

  • Precision, Recall and F-measure:

  ○ Precision = TP / (TP + FP)
  ○ Recall = TP / (TP + FN)
  ○ F1 = 2 · Precision · Recall / (Precision + Recall)

Text Classification (2)
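The class-imbalance example above can be checked numerically. A minimal sketch (the function name is illustrative; the counts correspond to the 99-dogs/1-cat classifier that predicts "dog" for everything, treating "cat" as the positive class):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 99 dogs, 1 cat; the classifier predicts "dog" for everything:
# TP = 0, FP = 0, FN = 1, TN = 99.
tp, fp, fn, tn = 0, 0, 1, 99
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 0.99: looks great
p, r, f1 = precision_recall_f1(tp, fp, fn)   # all 0.0: reveals the failure
```

Accuracy rewards the degenerate classifier, while precision/recall on the minority class expose it.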

SLIDE 6

  • Sentiment analysis:

○ Annotators evaluate a piece of text with a sentiment from 1-5

  • Sentence semantic similarity:

○ Given two sentences, annotators assign a semantic similarity score from 1-10

  • How to measure the correlation between the human annotations and the algorithm's output?

  ○ Correlation:
    ■ A statistical technique that shows how related two random variables are

Correlation Tasks

SLIDE 7

  • Two most used correlations:

  ○ Pearson correlation:
    ■ r = cov(X, Y) / (σ_X · σ_Y)
  ○ Spearman correlation:
    ■ The Pearson correlation between the ranked variables

Correlation Metrics
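Both correlations can be sketched in a few lines of plain Python (function names are illustrative; Spearman is implemented exactly as the slide defines it, as Pearson over ranks, with ties sharing their average rank):

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance normalized by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(xs):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))
```

A perfectly monotone but non-linear relation (e.g. y = x³) gets Spearman 1.0 while Pearson stays below 1, which is the practical difference between the two.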

SLIDE 8

  • Generating the next token of a sequence
  • Usually based on the collection of co-occurrence of words in a window:

  ○ Statistics are collected and the next word is predicted based on this information
  ○ Mainly, models the probability of the sequence:
    ■ P(w_1, …, w_N) = ∏_i P(w_i | w_1, …, w_{i-1})

  • We need to identify a metric that assesses how well the language model works:

○ How well a probability model predicts a sample!

Language model

SLIDE 9

  • Metric: Perplexity

  ○ We want to model a probability distribution P
    ■ How close is our distribution Q to the real one?

  • We can approximate by testing against drawn samples in P

  ○ Perplexity(Q) = (∏_{i=1}^{N} 1 / Q(w_i))^{1/N}
  ○ The larger the perplexity, the worse
  ○ The N-th root is a normalizing factor

Language model (2)
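A minimal sketch of the metric (the function name is illustrative): perplexity is the exponential of the average negative log-probability the model assigns to the observed tokens, which is numerically safer than multiplying the probabilities directly.

```python
import math

def perplexity(probs):
    """Perplexity of a model on a sample, given the model's probability
    for each observed token: exp of the mean negative log-probability.
    Equivalent to the N-th root of the inverse product of probabilities."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

# Sanity check: a uniform model over a vocabulary of size V
# has perplexity exactly V ("as confused as V-way guessing").
vocab_size = 50
uniform = [1 / vocab_size] * 100
```

The uniform-model identity is a useful intuition: perplexity V means the model is, on average, as uncertain as a fair V-sided die.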

SLIDE 10

  • Translates one piece of text to the other:

○ Usually sentence/phrase based

  • Evaluation metric:

  ○ BLEU: BiLingual Evaluation Understudy
    ■ Accuracy cannot be used because we cannot have the "exact" same translation
    ■ Usually sentences are compared with very good quality reference translations
    ■ Averages the proportion of candidate words present in any of the reference translations
    ■ Returns [0, 1], with 1 reflecting a very good translation

Machine Translation

SLIDE 11

  • Precision on word level alone doesn't work:

  ○ Precision here is 1
  ○ Naive solution:
    ■ Number of words occurring in both, divided by the max number of words
    ■ What if the translation has the words in reversed order?
    ■ To overcome this problem, match n-grams:

  • Average the precision on each n from 1 to 4
  • Penalize if the translation is shorter than one of the references

BLEU
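The steps above can be sketched at sentence level (a simplified sketch, not the full corpus-level BLEU; function names are illustrative, and the brevity penalty here uses the reference closest in length):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it appears in any single reference (stops 'the the the'
    from scoring perfectly)."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, references, max_n=4):
    """Geometric mean of modified 1..4-gram precisions, times a brevity
    penalty that punishes candidates shorter than the closest reference."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * geo_mean
```

The clipping step is exactly the fix for the "Precision here is 1" failure mode on the previous slide, and the n-gram matching penalizes reordered words.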

SLIDE 12

  • Deep Parsing (Dependency Parsing):

  ○ Parses the sentence into its grammatical structure
  ○ "Head" - "Dependent" form
  ○ It is a directed acyclic graph (mostly implemented as a tree)
  ○ Example:
    ■ "There are different examples that we might use!"

Deep Parsing

SLIDE 13

  • Attachment score:

○ The percentage of the words that have the correct head

  • Two measures for the dependency parsing:

  ○ LAS: Labeled Attachment Score
    ■ A word is considered correct if both its head and its label are correct
  ○ UAS: Unlabeled Attachment Score
    ■ A word is considered correct if its identified head is correct (without necessarily having the correct label)

Deep Parsing (2)
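Both scores reduce to per-word comparisons. A minimal sketch, assuming each parse is given as a list of (head index, label) pairs, one per word (the representation and the example labels are illustrative):

```python
def attachment_scores(gold, predicted):
    """UAS and LAS for one sentence.

    Each parse is a list of (head_index, label) pairs, one per word.
    UAS checks only the head; LAS checks head and label together.
    """
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / n
    las = sum(g == p for g, p in zip(gold, predicted)) / n
    return uas, las

# Three-word sentence: the parser finds every head but mislabels one arc,
# so UAS is perfect while LAS drops.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
```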

SLIDE 14

  • Given a long text, the goal is to find a smaller textual representation:

○ Usually a list of phrases or of sentences

  • Evaluation:

  ○ ROUGE: Recall-Oriented Understudy for Gisting Evaluation
  ○ Tries to evaluate the summary (or translation) based on a list of references
  ○ ROUGE-N: overlap of n-grams between the system and reference summaries

Summarization

SLIDE 15

  • ROUGE-1 refers to the overlap of unigrams (each word) between the system and reference summaries

  • ROUGE-2 refers to the overlap of bigrams between the system and references
  • Other Rouge variants:

  ○ ROUGE-L: measures the longest matching sequence of words using LCS
  ○ ROUGE-S: measures matching ordered word pairs (skip-grams), allowing for arbitrary gaps

Rouge
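ROUGE-N in its recall-oriented form is short enough to sketch directly (an illustrative single-reference sketch; the full metric also reports precision and F-score and handles multiple references):

```python
from collections import Counter

def rouge_n(system, reference, n):
    """ROUGE-N recall: fraction of reference n-grams that also appear
    in the system summary (counts are clipped per n-gram)."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref_grams = grams(reference)
    sys_grams = grams(system)
    overlap = sum(min(c, sys_grams[g]) for g, c in ref_grams.items())
    total = sum(ref_grams.values())
    return overlap / total if total else 0.0
```

Note the denominator is the reference length in n-grams, which is what makes ROUGE recall-oriented, in contrast to BLEU's precision orientation.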

SLIDE 16

  • METEOR (Metric for Evaluation of Translation with Explicit ORdering):

  ○ Uses sentences as basic units
  ○ Based on Precision and Recall, with more emphasis on Recall
  ○ Adds some other features: stemming + synonyms
  ○ Creates an alignment between the two sentences
    ■ How good is the alignment?

METEOR

SLIDE 17

  • Calculating the precision:

  ○ Unigram Precision and Recall:
    ■ P = m / w_t and R = m / w_r
    ■ Where m is the number of common unigrams
    ■ w_t is the number of unigrams in the candidate translation
    ■ w_r is the number of unigrams in the reference translation
    ■ Combined as F_mean = 10 · P · R / (R + 9 · P)
  ○ Longer n-gram matches are used to compute a penalty p:
    ■ p = 0.5 · (c / u_m)³
    ■ Where c is the minimum number of matching chunks
    ■ u_m is the number of mapped unigrams
    ■ Final score: M = F_mean · (1 − p)

METEOR (2)
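A simplified sketch of the computation (exact unigram matches only, no stemming or synonym modules; the greedy alignment and the function name are illustrative):

```python
def meteor_sketch(candidate, reference):
    """Simplified METEOR: greedily align exact unigram matches, compute
    the recall-weighted harmonic mean F_mean = 10PR/(R + 9P), then apply
    the fragmentation penalty 0.5 * (chunks / matches)^3."""
    matches = []          # (candidate_index, reference_index) pairs
    used_ref = set()
    for i, tok in enumerate(candidate):
        for j, ref_tok in enumerate(reference):
            if j not in used_ref and tok == ref_tok:
                matches.append((i, j))
                used_ref.add(j)
                break
    m = len(matches)
    if m == 0:
        return 0.0
    precision = m / len(candidate)
    recall = m / len(reference)
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Chunks: maximal runs of matches contiguous in both sentences.
    chunks = 1
    for (c1, r1), (c2, r2) in zip(matches, matches[1:]):
        if c2 != c1 + 1 or r2 != r1 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)
```

Even a perfect translation scores slightly below 1 here, because one chunk out of m matches still yields a small nonzero penalty.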

SLIDE 18

Metrics on Papers

Papers ACL + TACL

SLIDE 19

NLP Tasks & Metrics

End of Module

SLIDE 20

Statistical Significance Tests

SLIDE 21

  • How it is done:

  ○ Define a dependent variable (the evaluation measure)
  ○ Run the state of the art and the current algorithm on a benchmark
  ○ Find out that the current algorithm performs better on the benchmark

  • What if we have an improvement (i.e. accuracy) of 1%?

  ○ Is it statistically significant?
  ○ We should apply a statistical significance test to the results

Current Status

SLIDE 22

  • Given two algorithms A and B, a dataset X and a measure M:

  ○ M(Alg, X) is the value of the evaluation measure for algorithm Alg on dataset X
  ○ So the difference between the two algorithms is:
    ■ δ(X) = M(A, X) − M(B, X)
  ○ The statistical hypotheses:
    ■ H0: δ(X) ≤ 0 vs. H1: δ(X) > 0

Significance Tests

SLIDE 23

  • The p-value for rejecting/accepting the null hypothesis:

  ○ The probability, under the null hypothesis, of observing a difference at least as extreme as the one measured

  • Types of error (as learned in the class)

  ○ Type 1 error:
    ■ The null hypothesis is rejected when it is actually true
  ○ Type 2 error:
    ■ The null hypothesis is not rejected, although it should be

Significance Tests (2)

SLIDE 24

  • Parametric vs. Non-Parametric

  ○ If the distribution is known, a parametric test is better:
    ■ Lower probability of making type 2 errors
  ○ Non-parametric tests don't make any assumption about the distribution:
    ■ Less powerful
    ■ More sound if the distribution isn't known

Significance Tests (3)

SLIDE 25

  • Null-hypothesis:

○ Sample comes from a normal distribution

  • Shapiro-Wilk:

  ○ Checks how well the sample's order statistics match those expected under normality (as visualized in a Q-Q plot)

  • Kolmogorov-Smirnov:

  ○ Compares the shapes of two distributions (not only the normal)
  ○ Uses the maximum absolute difference between the two cumulative distribution functions

  • Anderson-Darling:

  ○ Tests whether a sample comes from a given distribution; also based on the cumulative distribution function

Normality Test
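The Kolmogorov-Smirnov statistic described above is simple to compute directly. A minimal one-sample sketch against the standard normal (the function names are illustrative; real usage would standardize the sample first and look up critical values):

```python
import math

def normal_cdf(x):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def ks_statistic(sample):
    """One-sample Kolmogorov-Smirnov statistic against the standard normal:
    the maximum absolute difference between the empirical CDF and the
    theoretical CDF, checked on both sides of each step."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = normal_cdf(x)
        d = max(d, abs((i + 1) / n - cdf), abs(i / n - cdf))
    return d
```

A sample centered far from 0 drives the statistic toward 1, while a roughly standard-normal-looking sample keeps it small.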

SLIDE 26

  • Parametric tests assume that the data are drawn from a known distribution:

  ○ With defined parameters
  ○ Usually normally distributed

  • Paired Student’s t-test:

  ○ Typically applied to Accuracy, UAS and LAS:
    ■ Compute the mean number of correct predictions per sample
    ■ Based on the Central Limit Theorem: the sum of independently drawn variables follows a Normal Distribution
  ○ Also Pearson's correlation (with n − 2 degrees of freedom)

Parametric Test
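The paired t statistic is computed on the per-sample score differences. A minimal sketch (the function name is illustrative; the resulting t is compared against the t distribution with n − 1 degrees of freedom):

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Paired Student's t statistic on per-sample score differences:
    t = mean(d) / (sd(d) / sqrt(n)), with n - 1 degrees of freedom."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)
```

Pairing matters: the test uses the variance of the differences, not of the two score lists separately, so consistent small advantages can still be highly significant.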

SLIDE 27

  • Two dimensions:

  ○ Statistical power
  ○ Computational complexity

  • Two families:

  ○ Sample free:
    ■ Doesn't consider the values of the evaluation measure
  ○ Sample based:
    ■ Considers the values of the evaluation measures and samples from them

Non Parametric Tests

SLIDE 28

  • Sign test:

  ○ Tests whether matched pair samples are drawn from distributions with equal medians
  ○ Counts the number of examples on which algorithm A is better than algorithm B
    ■ Just checks the sign (not how big or small the differences are)
  ○ The null hypothesis states that, given a new pair of measurements, each is equally likely to be larger than the other
  ○ Expects 50% of A to be larger than B, and vice versa
  ○ Requires independently and identically distributed (i.i.d.) variables

Non Parametric - Sample Free
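Because only the signs matter, the exact test reduces to a binomial tail probability. A minimal sketch (the function name is illustrative; ties are assumed to be dropped beforehand, one common convention):

```python
from math import comb

def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact sign test: under H0 each pair is a fair coin flip,
    so the number of A-wins follows Binomial(n, 0.5)."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of a split at least as lopsided as the observed one.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, A winning all 10 of 10 comparisons gives p = 2/1024 ≈ 0.002, while a 5-5 split gives p = 1.0.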

SLIDE 29

  • McNemar’s test

  ○ For paired nominal observations (binary labels)
  ○ The test is applied to a 2 × 2 contingency table:
    ■ Tabulates the outcomes of the two algorithms
  ○ The null hypothesis states that the marginal probability of each outcome (label one or label two) is the same for both algorithms
  ○ Used for binary classification

  • The Cochran’s Q test:

  ○ Generalizes McNemar's test to more than two classes
  ○ Asks which algorithm performs better on each example

Non Parametric - Sample Free (2)
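The McNemar statistic uses only the discordant cells of the contingency table, i.e. the examples the two algorithms disagree on. A minimal sketch (function name and table layout are illustrative; the statistic is compared against the chi-squared distribution with 1 degree of freedom):

```python
def mcnemar_statistic(table):
    """McNemar's chi-squared statistic from a 2x2 contingency table
    [[both correct, only A correct], [only B correct, both wrong]].
    Only the discordant counts b and c enter the statistic."""
    b = table[0][1]   # A correct, B wrong
    c = table[1][0]   # B correct, A wrong
    return (b - c) ** 2 / (b + c)

# A is right on 15 examples where B is wrong, and wrong on 5 where B is right.
stat = mcnemar_statistic([[50, 15], [5, 30]])   # (15-5)^2 / 20 = 5.0
```

The cells where both algorithms agree carry no evidence about which one is better, which is why they are ignored.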

SLIDE 30

  • Wilcoxon signed-rank test

  ○ Compares two matched samples
  ○ Null hypothesis:
    ■ The differences follow a symmetric distribution around 0
  ○ The absolute values of the differences are ranked
    ■ Each rank then gets the sign of its difference
  ○ The Wilcoxon test then sums these signed ranks
  ○ Used in NLP for the UAS measure

Non Parametric - Sample Free (3)
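The ranking-and-summing procedure can be sketched directly (an illustrative simplification: zero differences are dropped, and tied absolute differences get consecutive rather than averaged ranks; the resulting W is then compared against Wilcoxon critical-value tables):

```python
def wilcoxon_w(scores_a, scores_b):
    """Wilcoxon signed-rank statistic: rank the absolute differences
    (zeros dropped), reattach the signs, sum positive and negative
    ranks separately, and return the smaller sum W."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    ordered = sorted(diffs, key=abs)
    w_plus = w_minus = 0.0
    for rank, d in enumerate(ordered, start=1):
        if d > 0:
            w_plus += rank
        else:
            w_minus += rank
    return min(w_plus, w_minus)
```

Unlike the sign test, this uses the magnitudes of the differences through their ranks, so large wins count for more than narrow ones.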

SLIDE 31

  • Pitman’s permutation test:

  ○ Null hypothesis:
    ■ The labelling (which algorithm produced which output) doesn't matter
  ○ Calculates the value of the test statistic under all possible labellings (permutations) of the test set
  ○ Intuitively:
    ■ Makes a run over all the possible outcomes

  • Checks whether the resulting p-value is below the chosen significance level

  ○ Used in literature:
    ■ In the context of Machine Translation

Non Parametric - Sample Based
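For paired outputs, each "relabelling" amounts to swapping A's and B's result on some subset of examples, i.e. flipping the sign of some differences. A minimal exhaustive sketch (function name illustrative; feasible only for small n, which is why random sampling of permutations is used in practice):

```python
from itertools import product

def permutation_test(scores_a, scores_b):
    """Exact paired permutation test: under H0 the A/B labels are
    exchangeable within each pair, so every pattern of sign flips of
    the differences is equally likely. Returns the one-sided p-value."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs)
    count = 0
    total = 0
    for signs in product([1, -1], repeat=len(diffs)):
        total += 1
        if sum(s * d for s, d in zip(signs, diffs)) >= observed:
            count += 1
    return count / total
```

With n paired examples there are 2^n relabellings, so the exhaustive version is only practical for small test sets.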

SLIDE 32

  • Paired bootstrap tests

  ○ Null hypothesis:
    ■ The two algorithms perform the same; the statistic is recomputed on test sets resampled with replacement (hence "bootstrap")
  ○ Intuitively:
    ■ Builds many test sets of randomly picked examples, with repetition

  • Checks whether the resulting p-value is below the chosen significance level

  ○ Used in literature:
    ■ Machine Translation, Semantic Parsing, Text Summarization

Non Parametric - Sample Based (2)
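A minimal sketch of one common simplified variant (function name and the "advantage vanishes" criterion are illustrative; published formulations differ in exactly which tail they count):

```python
import random

def paired_bootstrap_p(scores_a, scores_b, n_resamples=1000, seed=0):
    """Simplified paired bootstrap: resample the per-example score
    differences with replacement and report how often A's advantage
    over B disappears (resampled mean difference <= 0)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    worse = 0
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            worse += 1
    return worse / n_resamples
```

Unlike the exhaustive permutation test, cost is controlled by the number of resamples rather than 2^n, which is why the bootstrap scales to realistic test-set sizes.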

SLIDE 33

  • Dependent observations:

  ○ The tests above assume that:
    ■ Data are drawn independently
    ■ Which isn't necessarily true

  • Significance on cross validation:

○ The folds are not independent

Open Issues

SLIDE 34

Decision Tree for Statistical Significance

SLIDE 35

Thank you for your attention!