Natural Language Processing CSCI 4152/6509 — Lecture 11 IR Measures and Text Mining

Instructor: Vlado Keselj
Time and date: 09:35–10:25, 30-Jan-2020
Location: Dunn 135

CSCI 4152/6509, Vlado Keselj Lecture 11 1 / 18


Previous Lecture

• Extracting n-grams and Perl list operators
• More examples in n-gram collection with Perl
• Using the Ngrams module
• Elements of Information Retrieval
• Vector space model


Cosine Similarity Measure

\mathrm{sim}(q, d) = \frac{\sum_{i=1}^{m} w_{i,q}\, w_{i,d}}{\sqrt{\sum_{i=1}^{m} w_{i,q}^{2}} \cdot \sqrt{\sum_{i=1}^{m} w_{i,d}^{2}}} = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}| \cdot |\vec{d}|}

[Figure: vectors q and d in a 3-dimensional space with axes x, y, z; the angle α between them satisfies cos α = sim(d, q)]
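As an illustrative sketch (not part of the slides), the cosine measure can be computed directly from sparse term-weight vectors, represented here as Python dicts mapping terms to weights:

```python
import math

def cosine_similarity(q, d):
    """Cosine similarity between two sparse term-weight vectors (dicts term -> weight)."""
    # Dot product over terms that appear in the query vector
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)
```

Identical vectors score 1, and vectors with no shared terms score 0, matching the geometric reading of cos α above.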


Side Note: Lucene and IR Book

• Lucene search engine: http://lucene.apache.org
• Open-source, written in Java
• Uses the vector space model
• Another interesting link: the Introduction to IR on-line book, which covers text classification well:
  http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html


IR Evaluation: Precision and Recall

Precision is the percentage of true positives out of all returned documents; i.e.,

P = \frac{TP}{TP + FP}

Recall is the percentage of true positives out of all relevant documents in the collection; i.e.,

R = \frac{TP}{TP + FN}


F-measure

F-measure is a weighted harmonic mean between Precision and Recall:

F = \frac{(\beta^{2} + 1)\, P R}{\beta^{2} P + R}

We usually set β = 1, in which case we have:

F = \frac{2 P R}{P + R}
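The three measures above can be written as small helper functions (an illustrative sketch, not from the slides; the function names are hypothetical):

```python
def precision(tp, fp):
    """True positives over all returned documents: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """True positives over all relevant documents: TP / (TP + FN)."""
    return tp / (tp + fn)

def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
```

With β = 1 the general formula reduces to 2PR / (P + R), as on the slide.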


Precision-Recall Curve

A more appropriate way to evaluate a ranked list of retrieved documents is the Precision-Recall curve. It connects the (recall, precision) points obtained from the sets of the 1, 2, . . . top-ranked documents on the list. It typically looks as follows:

[Figure: typical precision-recall curve; precision (P) on the vertical axis and recall (R) on the horizontal axis, both ranging from 0 to 1]


Precision-Recall Curve Example

Results returned by a search engine:

1. relevant
2. relevant
3. relevant
4. not relevant
5. relevant
6. not relevant
7. relevant
8. not relevant
9. not relevant
10. relevant
11. not relevant
12. not relevant


Task 1: Precision, Recall and F-measure

Assuming that the total number of relevant documents in the collection is 8, calculate precision, recall, and F-measure (β = 1) for the returned 12 results.
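One way to check the answer (the slide leaves the calculation as an exercise; the numbers below follow from the example result list, which contains 6 relevant documents among the 12 returned):

```python
# Relevance of the 12 returned results (True = relevant), from the example slide
results = [True, True, True, False, True, False,
           True, False, False, True, False, False]
total_relevant = 8  # relevant documents in the whole collection

tp = sum(results)        # 6 relevant documents were returned
p = tp / len(results)    # precision = 6/12
r = tp / total_relevant  # recall    = 6/8
f = 2 * p * r / (p + r)  # F-measure with beta = 1
print(p, r, f)           # 0.5 0.75 0.6
```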


Task 2: Precision-Recall Curve

Task: Draw the precision-recall curve for these results.
First step: form the sets of the first n returned documents and look at their relevance:

◮ Set 1: {R} (R = 0.125, P = 1)
◮ Set 2: {R, R} (R = 0.25, P = 1)
◮ Set 3: {R, R, R} (R = 0.375, P = 1)
◮ Set 4: {R, R, R, NR} (R = 0.375, P = 0.75)
◮ Set 5: {R, R, R, NR, R} (R = 0.5, P = 0.8)
◮ . . . etc.
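This first step can be automated; a sketch (reusing the example result list, with a hypothetical helper name) that produces the (recall, precision) point after every prefix of the ranked list:

```python
def pr_points(results, total_relevant):
    """(recall, precision) after each prefix of a ranked result list."""
    points, tp = [], 0
    for k, rel in enumerate(results, start=1):
        if rel:
            tp += 1
        points.append((tp / total_relevant, tp / k))
    return points

results = [True, True, True, False, True, False,
           True, False, False, True, False, False]
points = pr_points(results, 8)
# First five points match Sets 1-5 above:
# (0.125, 1.0), (0.25, 1.0), (0.375, 1.0), (0.375, 0.75), (0.5, 0.8)
```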


Precision-Recall Curve

[Figure: precision-recall curve for the 12 example results; precision (P) vs. recall (R), with points labeled 1 through 12]


Task 3: Interpolated Precision-Recall Curve

Task: Draw the interpolated Precision-Recall curve.
Formula:

\mathrm{IntPrec}(r) = \max_{k:\, R(k) \ge r} P(k)

Based on the previous task:

◮ 0 ≤ r ≤ R4 = 3/8 = 0.375 ⇒ IntPrec(r) = 1
◮ R4 < r ≤ R6 = 4/8 = 0.5 ⇒ IntPrec(r) = 0.8
◮ R6 < r ≤ R9 = 5/8 = 0.625 ⇒ IntPrec(r) = 5/7 ≈ 0.714285714
◮ R9 < r ≤ R12 = 6/8 = 0.75 ⇒ IntPrec(r) = 0.6
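The interpolation formula can be sketched in code (illustrative, not from the slides; it reuses the example result list to build the (recall, precision) points):

```python
def interpolated_precision(points, r):
    """IntPrec(r): max precision over all cutoffs whose recall is at least r."""
    return max(p for rec, p in points if rec >= r)

# (recall, precision) after each prefix of the example result list
results = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
points, tp = [], 0
for k, rel in enumerate(results, start=1):
    tp += rel
    points.append((tp / 8, tp / k))

print(interpolated_precision(points, 0.3))   # 1.0
print(interpolated_precision(points, 0.5))   # 0.8
print(interpolated_precision(points, 0.75))  # 0.6
```

The printed values reproduce the piecewise-constant levels computed on the slide.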


Interpolated Precision-Recall Curve

[Figure: interpolated precision-recall curve for the 12 example results; a step function of precision (P) over recall (R), with points labeled 1 through 12]


Some Other Similar Measures

Fallout:

\mathrm{Fallout} = \frac{FP}{FP + TN}

Specificity:

\mathrm{Specificity} = \frac{TN}{TN + FP}

Sensitivity:

\mathrm{Sensitivity} = \frac{TP}{TP + FN} \;(= R)

Sensitivity and Specificity are useful in classification and in contexts such as medical tests.
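These measures follow the same pattern as precision and recall (illustrative helper functions, not from the slides):

```python
def fallout(fp, tn):
    """Fraction of non-relevant documents that were returned: FP / (FP + TN)."""
    return fp / (fp + tn)

def specificity(tn, fp):
    """Fraction of non-relevant documents correctly not returned: TN / (TN + FP)."""
    return tn / (tn + fp)

def sensitivity(tp, fn):
    """Identical to recall: TP / (TP + FN)."""
    return tp / (tp + fn)
```

Note that fallout and specificity are complements: they always sum to 1.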


Some Text Mining Tasks

• Text Classification
• Text Clustering
• Information Extraction
• And some new and less prominent tasks:
  ◮ Text Visualization
  ◮ Filtering tasks, Event Detection
  ◮ Terminology Extraction


Text Classification

• Also known as Text Categorization
• Additional reading: Manning and Schütze, Ch. 16: Text Categorization
• Problem definition: classify a document into a class (category) of documents
• Typical approach: use Machine Learning to learn a classification model from previously labeled documents
• An example of supervised learning
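As a rough illustration of the typical supervised approach, here is a generic word-count, Naive-Bayes-style model with add-one smoothing. This is not an algorithm given in the slides; the training data, class labels, and function names are hypothetical:

```python
import math
from collections import Counter

def train(labeled_docs):
    """Learn per-class word counts from (text, label) pairs."""
    counts, totals, vocab = {}, Counter(), set()
    for text, label in labeled_docs:
        words = text.lower().split()
        counts.setdefault(label, Counter()).update(words)
        totals[label] += len(words)
        vocab.update(words)
    return counts, totals, vocab

def classify(text, model):
    """Pick the class with the highest smoothed log-likelihood of the words."""
    counts, totals, vocab = model
    words = text.lower().split()
    def log_score(label):
        # Add-one smoothing; uniform class priors are assumed for simplicity
        return sum(math.log((counts[label][w] + 1) / (totals[label] + len(vocab)))
                   for w in words)
    return max(counts, key=log_score)

model = train([("cheap pills buy now", "spam"),
               ("meeting agenda attached", "ham"),
               ("buy cheap now", "spam"),
               ("project meeting notes", "ham")])
print(classify("buy pills", model))      # spam
print(classify("meeting notes", model))  # ham
```

The point is only the workflow: labeled documents go in, a model comes out, and new documents are classified by the learned model.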


Types of Text Classification

• topic categorization
• sentiment classification
• authorship attribution and plagiarism detection
• authorship profiling (e.g., age and gender detection)
• spam detection and e-mail classification
• encoding and language identification
• automatic essay grading
• More specialized example: dementia detection using spontaneous speech


Creating Text Classifiers

• Can be created manually
  ◮ typically a rule-based classifier
  ◮ example: detect or count occurrences of some words, phrases, or strings
• Another approach: make programs that learn to classify
  ◮ in other words, classifiers are generated based on labeled data
  ◮ supervised learning
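A minimal sketch of the manual, rule-based approach described above (the trigger-word list and threshold are hypothetical; a real classifier would use carefully curated rules):

```python
# Hypothetical trigger words for a toy spam rule
SPAM_WORDS = {"free", "winner", "viagra", "lottery"}

def rule_based_is_spam(text, threshold=2):
    """Flag a message if it contains at least `threshold` trigger words."""
    hits = sum(1 for w in text.lower().split() if w in SPAM_WORDS)
    return hits >= threshold
```

Unlike the learned classifiers above, nothing here is trained: the rules are written by hand, which is simple but hard to maintain and adapt.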