

  1. Statistical Significance Tests in NLP Natural Language Processing VU (706.230) - Andi Rexha 26/03/2020

  2. Agenda
● NLP Tasks
  ○ Presentation of tasks
  ○ Evaluation metrics
  ○ Advantages and drawbacks of the metrics
● Significance Tests
  ○ Types of testing
  ○ Metric to test types
  ○ Decision tree for applying statistical tests

  3. NLP Tasks & Metrics

  4. Text Classification
● Binary classification:
  ○ Given a sentence, classify it as positive or negative (e.g. cats vs. dogs)
  ○ Results are presented via a confusion matrix
  ○ Is there any other information that we can collect?
    ■ Word split
    ■ Sentence split
    ■ Paragraphs, etc.
● Why do we need a metric?
  ○ To compare different algorithms
  ○ Accuracy = (TP + TN) / (TP + TN + FP + FN)
  ○ What is the problem with accuracy?
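The accuracy formula on the slide can be sketched in a few lines of Python; the counts below are invented for illustration, not taken from the slides.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN), from the confusion matrix."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion-matrix counts for a cats-vs-dogs classifier:
print(accuracy(tp=40, tn=45, fp=5, fn=10))  # 0.85
```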

  5. Text Classification (2)
● If the data contain 99 dogs and 1 cat:
  ○ Classify everything as a dog
  ○ Receive 99% accuracy
  ○ But the classifier is not doing what we want
● Precision, Recall and F-measure:
  ○ Precision = TP / (TP + FP)
  ○ Recall = TP / (TP + FN)
  ○ F1 = 2 · Precision · Recall / (Precision + Recall)
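A minimal sketch of the three measures, with guards for empty denominators. On the slide's degenerate 99-dogs example, the minority class ("cat") gets precision, recall and F1 of 0 even though accuracy is 99%:

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and F1; zero when a denominator is empty."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Classifier that labels everything "dog": for class "cat", tp=0, fp=0, fn=1.
print(precision_recall_f1(tp=0, fp=0, fn=1))  # (0.0, 0.0, 0.0)
```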

  6. Correlation Tasks
● Sentiment analysis:
  ○ Annotators rate a piece of text with a sentiment from 1-5
● Sentence semantic similarity:
  ○ Given two sentences, annotators assign a semantic similarity from 1-10
● How do we relate the human annotations to the algorithm's output?
  ○ Correlation:
    ■ A statistical technique that shows how related two random variables are

  7. Correlation Metrics
● The two most used correlations:
  ○ Pearson correlation:
    ■ r = cov(X, Y) / (σ_X · σ_Y)
  ○ Spearman correlation:
    ■ The Pearson correlation between the ranked variables
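Both correlations can be sketched in plain Python; this version skips tie handling in the ranking step (real analyses would use `scipy.stats.pearsonr` / `spearmanr`):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson r: covariance divided by the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman: Pearson on the ranks (no tie correction in this sketch)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(xs), ranks(ys))
```

A monotonic but nonlinear relation (e.g. y = x²) gives Spearman exactly 1 while Pearson stays below 1, which is why both are reported.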

  8. Language Model
● Generating the next token of a sequence
● Usually based on co-occurrence counts of words in a window:
  ○ Statistics are collected and the next word is predicted based on this information
  ○ Mainly, it models the probability of the sequence:
    ■ P(w_1, ..., w_n) = ∏_i P(w_i | w_1, ..., w_{i-1})
● We need a metric that assesses how well the language model works:
  ○ How well the probability model predicts a sample!

  9. Language Model (2)
● Metric: Perplexity
  ○ We want to model a probability distribution P
    ■ How close is our distribution Q to the real one?
  ○ We can approximate this by testing against samples drawn from P:
    ■ PP(Q) = (∏_{i=1}^{N} 1 / Q(w_i))^{1/N}
  ○ The larger the perplexity, the worse the model
  ○ The N-th root is a normalizing factor

  10. Machine Translation
● Translates a piece of text from one language to another:
  ○ Usually sentence/phrase based
● Evaluation metric:
  ○ BLEU: BiLingual Evaluation Understudy
    ■ Accuracy cannot be used because there is no single “exact” translation
    ■ Sentences are usually compared against high-quality reference translations
    ■ Averages the words present in any of the reference translations
    ■ Returns a value in [0, 1], with 1 reflecting a very good translation

  11. BLEU
● Precision on the word level alone doesn't work:
  ○ Naive solution: number of words occurring in both, divided by the number of words in the candidate
  ○ What if the translation has the words in reversed order? Precision here is still 1
● To overcome this problem, match n-grams:
  ○ Average the precision for each n from 1 to 4
  ○ Penalize if the translation is shorter than the references
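The two ideas on this slide (clipped n-gram precision for n = 1..4, plus a brevity penalty) can be sketched as a simplified sentence-level BLEU. Real BLEU is computed at corpus level and handles zero counts with smoothing, so treat this only as an illustration:

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        if not cand:
            return 0.0
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        p = clipped / sum(cand.values())
        if p == 0:
            return 0.0
        precisions.append(p)
    # Brevity penalty against the closest reference length.
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    bp = 1.0 if len(candidate) >= ref_len else exp(1 - ref_len / len(candidate))
    return bp * exp(sum(log(p) for p in precisions) / max_n)
```

A candidate identical to its reference scores 1.0; a reversed-order candidate loses most of its higher-order n-gram matches, which is exactly the failure mode unigram precision misses.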

  12. Deep Parsing
● Deep Parsing (Dependency Parsing):
  ○ Parses the sentence into its grammatical structure
  ○ “Head” - “Dependent” form
  ○ It is a directed acyclic graph (mostly implemented as a tree)
  ○ Example:
    ■ “There are different examples that we might use!”

  13. Deep Parsing (2)
● Attachment score:
  ○ The percentage of words that have the correct head
● Two measures for dependency parsing:
  ○ LAS: Labeled Attachment Score
    ■ A word is considered correct if both its head and its label are correct
  ○ UAS: Unlabeled Attachment Score
    ■ A word is considered correct if its identified head is correct (without necessarily having the correct label)
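The two attachment scores can be sketched directly from their definitions; the per-word `(head_index, label)` representation below is one common convention, assumed here for illustration:

```python
def attachment_scores(gold, pred):
    """gold, pred: one (head_index, dependency_label) pair per word.
    UAS counts words whose head is correct; LAS additionally requires
    the dependency label to match."""
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return uas, las

# Both heads correct, but one label wrong: UAS = 1.0, LAS = 0.5.
gold = [(0, "root"), (1, "det")]
pred = [(0, "root"), (1, "nsubj")]
print(attachment_scores(gold, pred))  # (1.0, 0.5)
```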

  14. Summarization
● Given a long text, the goal is to find a shorter textual representation:
  ○ Usually a list of phrases or sentences
● Evaluation:
  ○ ROUGE: Recall-Oriented Understudy for Gisting Evaluation
  ○ Evaluates the summary (or translation) against a list of references
  ○ ROUGE-N: overlap of N-grams between the system and reference summaries

  15. ROUGE
● ROUGE-1 refers to the overlap of unigrams (single words) between the system and reference summaries
● ROUGE-2 refers to the overlap of bigrams between the system and references
● Other ROUGE variants:
  ○ ROUGE-L: measures the longest matching sequence of words using the LCS (longest common subsequence)
  ○ ROUGE-S: measures matching ordered word pairs while allowing for arbitrary gaps (skip-bigrams)
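A minimal recall-oriented ROUGE-N sketch against a single reference (the full metric also defines precision/F variants and multi-reference aggregation):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Recall-oriented ROUGE-N: overlapping n-grams divided by the
    number of n-grams in the reference summary."""
    def grams(toks):
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref, cand = grams(reference), grams(candidate)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

# 3 of the reference's 6 unigrams are covered by the candidate summary:
print(rouge_n("the cat sat".split(),
              "the cat sat on the mat".split(), n=1))  # 0.5
```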

  16. METEOR
● METEOR (Metric for Evaluation of Translation with Explicit ORdering):
  ○ Uses sentences as basic units
  ○ Based on Precision and Recall, with more emphasis on Recall
  ○ Adds some other features: stemming + synonyms
  ○ Creates an alignment between the two sentences
    ■ How good is the alignment?

  17. METEOR (2)
● Calculating the score:
  ○ Unigram Precision and Recall:
    ■ P = m / w_t and R = m / w_r, combined as F_mean = 10PR / (R + 9P)
    ■ where m is the number of common unigrams
    ■ w_t is the number of unigrams in the candidate translation
    ■ w_r is the number of unigrams in the reference translation
  ○ Longer n-gram matches are used to compute a penalty p:
    ■ p = 0.5 · (c / u_m)³
    ■ where c is the minimum number of matching chunks
    ■ u_m is the number of mapped unigrams
  ○ Final score: M = F_mean · (1 − p)
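The standard METEOR combination (recall-weighted harmonic mean times the chunk penalty) can be sketched given precomputed alignment counts; the alignment itself (with stemming and synonyms) is the hard part and is assumed done:

```python
def meteor_score(m, w_t, w_r, chunks):
    """m: matched unigrams; w_t / w_r: unigrams in candidate / reference;
    chunks: minimum number of contiguous matched chunks (c).
    Standard combination: F_mean = 10PR/(R + 9P), p = 0.5*(c/m)^3,
    score = F_mean * (1 - p)."""
    P, R = m / w_t, m / w_r
    f_mean = 10 * P * R / (R + 9 * P)
    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)

# A perfect match aligned as a single chunk still pays a tiny penalty:
print(meteor_score(m=6, w_t=6, w_r=6, chunks=1))  # ~0.9977
```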

  18. Metrics on Papers
● (Chart: usage of evaluation metrics in ACL + TACL papers)

  19. NLP Tasks & Metrics - End of Module

  20. Statistical Significance Tests

  21. Current Status
● How it is done today:
  ○ Find a dependent variable (an evaluation measure)
  ○ Run the state of the art and the current algorithm on a benchmark
  ○ Find out that the current algorithm performs better on the benchmark
● What if the improvement (e.g. in accuracy) is only 1%?
  ○ Is it statistically significant?
  ○ We should apply statistical significance tests to the results

  22. Significance Tests
● Given two algorithms A and B, a dataset X and a measure M:
  ○ M(Alg, X) is the value of the evaluation measure for an algorithm on dataset X
  ○ So the difference between the two algorithms is:
    ■ δ(X) = M(A, X) − M(B, X)
  ○ The statistical hypotheses:
    ■ H₀: δ(X) ≤ 0 vs. H₁: δ(X) > 0

  23. Significance Tests (2)
● The p-value for rejecting/accepting the null hypothesis:
  ○ The probability, under H₀, of observing a difference at least as extreme as the observed δ(X)
● Types of error (as learned in class):
  ○ Type 1 error:
    ■ The null hypothesis is rejected when it is actually true
  ○ Type 2 error:
    ■ The null hypothesis is not rejected, although it should be

  24. Significance Tests (3)
● Parametric vs. non-parametric:
  ○ If the distribution is known, a parametric test is better:
    ■ Lower probability of making type 2 errors
  ○ Non-parametric tests don't make any assumptions about the distribution:
    ■ Less powerful
    ■ More sound if the distribution isn't known

  25. Normality Tests
● Null hypothesis:
  ○ The sample comes from a normal distribution
● Shapiro-Wilk:
  ○ Checks whether the variance implied by the ordered sample (as in a Q-Q plot) agrees with the sample variance
● Kolmogorov-Smirnov:
  ○ Tests between the shapes of two distributions (not only the normal)
  ○ Uses the maximum absolute difference between two cumulative distribution functions
● Anderson-Darling:
  ○ Tests whether a sample comes from a given distribution; also based on the cumulative distribution function
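The Kolmogorov-Smirnov idea is the easiest of the three to sketch: compare the empirical CDF of the sample against the CDF of a reference normal distribution and take the largest gap (in practice `scipy.stats.kstest` does this and also supplies the p-value):

```python
from math import erf, sqrt

def ks_statistic_normal(sample, mu, sigma):
    """One-sample Kolmogorov-Smirnov statistic against N(mu, sigma^2):
    the maximum absolute difference between the empirical CDF and
    the normal CDF, checked on both sides of each step."""
    xs = sorted(sample)
    n = len(xs)
    cdf = lambda x: 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, abs((i + 1) / n - f), abs(f - i / n))
    return d
```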

  26. Parametric Tests
● Parametric tests assume that the data are drawn from a known distribution:
  ○ With defined parameters
  ○ Usually normally distributed
● Paired Student's t-test:
  ○ Typically applied to Accuracy, UAS and LAS:
    ■ Compute the mean number of correct predictions per sample
  ○ Based on the Central Limit Theorem:
    ■ The sum of independently drawn variables follows a normal distribution
  ○ Also Pearson's correlation (with n − 2 degrees of freedom)
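The paired t-statistic itself is a one-liner over the per-sample score differences; turning it into a p-value needs the t-distribution CDF, for which one would normally call `scipy.stats.ttest_rel` rather than the bare statistic sketched here:

```python
from math import sqrt

def paired_t_statistic(a, b):
    """a, b: per-instance scores of two systems on the SAME test instances.
    Returns t = mean(d) / (sd(d) / sqrt(n)) for the differences d_i = a_i - b_i.
    Compare |t| against the t-distribution with n-1 degrees of freedom."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # unbiased sample variance
    return mean / sqrt(var / n)

# Illustrative scores only: four paired measurements, system A ahead on all.
print(paired_t_statistic([2, 3, 4, 5], [1, 1, 2, 2]))  # ~4.899
```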

  27. Non-Parametric Tests
● Two dimensions:
  ○ Statistical power
  ○ Computational complexity
● Two families:
  ○ Sampling-free:
    ■ Doesn't consider the values of the evaluation measure
  ○ Sampling-based:
    ■ Considers the values of the evaluation measure and samples from them
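A standard example of the sampling-based family in NLP is the paired bootstrap: resample test instances with replacement and count how often the difference between the two systems disappears. The function below is an illustrative sketch (names, defaults and the mean metric are my choices, not from the slides):

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Paired bootstrap resampling over per-instance scores of systems A and B.
    Returns the fraction of resamples where A does NOT beat B: a rough
    one-sided p-value for 'A is better than B'."""
    rng = random.Random(seed)
    n = len(scores_a)
    mean = lambda s: sum(s) / len(s)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample instances
        delta = mean([scores_a[i] for i in idx]) - mean([scores_b[i] for i in idx])
        if delta <= 0:
            not_better += 1
    return not_better / n_resamples
```

With a system A that wins on most paired instances, nearly every resample preserves the gap and the returned fraction is close to zero.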
