Overview Introduction to Information Retrieval Text classification - PowerPoint PPT Presentation

Overview Introduction to Information Retrieval Text classification http://informationretrieval.org 1 IIR 13: Text Classification & Naive Bayes Naive Bayes 2 Hinrich Sch¨ utze Evaluation of TC 3 Institute for Natural Language Processing, Universit¨ at Stuttgart 2008.06.10 4 NB independence assumptions 1 / 54 2 / 54 Outline Relevance feedback In relevance feedback, the user marks a number of documents as relevant/nonrelevant. 1 Text classification We then use this information to return better search results. This is a form of text classification. Naive Bayes 2 Two “classes”: relevant, nonrelevant For each document, decide whether it is relevant or Evaluation of TC 3 nonrelevant The problem space relevance feedback belongs to is called classification. 4 NB independence assumptions The notion of classification is very general and has many applications within and beyond information retrieval. 3 / 54 4 / 54

Another TC task: spam filtering From: ‘‘’’ <takworlld@hotmail.com> Subject: real estate is the only way... gem oalvgkay Anyone can buy real estate with no money down Stop paying rent TODAY ! There is no need to spend hundreds or even thousands for From information retrieval to text similar courses I am 22 years old and I have already purchased 6 properties using the classification: methods outlined in this truly INCREDIBLE ebook. Change your life NOW ! ================================================= Click Below to order: standing queries – Google Alerts http://www.wholesaledaily.com/sales/nmd.htm ================================================= How would you write a program that would automatically detect and delete this type of message? 5 / 54 6 / 54 Formal definition of TC: Training Formal definition of TC: Application/Testing Given: A document space X Documents are represented in this space, typically some type of high-dimensional space. A fixed set of classes C = { c 1 , c 2 , . . . , c J } The classes are human-defined for the needs of an application Given: a description d ∈ X of a document Determine: γ ( d ) ∈ C , (e.g., spam vs. non-spam). that is, the class that is most appropriate for d A training set D of labeled documents with each labeled document � d , c � ∈ X × C Using a learning method or learning algorithm, we then wish to learn a classifier γ that maps documents to classes: γ : X → C 7 / 54 8 / 54

Topic classification γ ( d ′ ) = China regions industries subject areas classes: poultry sports UK China coffee elections d ′ Many search engine functionalities are based first congestion Olympics feed roasting recount diamond training test private London Beijing chicken beans votes baseball Chinese set: set: on classification. airline tourism pate arabica seat forward Parliament Big Ben Great Wall ducks robusta run-off soccer Examples? Windsor Mao bird flu Kenya TV ads team the Queen communist turkey harvest campaign captain 9 / 54 10 / 54 Applications of text classification in IR Classification methods: 1. Manual Language identification (classes: English vs. French etc.) The automatic detection of spam pages (spam vs. nonspam, example: googel.org) The automatic detection of sexually explicit content (sexually Manual classification was used by Yahoo in the beginning of explicit vs. not) the web. Also: ODP, PubMed Sentiment detection: is a movie or product review positive or Very accurate if job is done by experts negative (positive vs. negative) Consistent when the problem size and team is small Topic-specific or vertical search – restrict search to a Manual classification is difficult and expensive to scale. “vertical” like “related to health” (relevant to vertical vs. not) → We need automatic methods for classification. Machine-learned ranking function in ad hoc retrieval (relevant vs. nonrelevant) Semantic Web: Automatically add semantic tags for non-tagged text (e.g., for each paragraph: relevant to a vertical like health or not) 11 / 54 12 / 54

Classification methods: 2. Rule-based A Verity topic (a complex classification rule) Our Google Alerts example was rule-based classification. There are “IDE” type development enviroments for writing very complex rules efficiently. (e.g., Verity) Often: Boolean combinations (as in Google Alerts) Accuracy is very high if a rule has been carefully refined over time by a subject expert. Building and maintaining rule-based classification systems is expensive. 13 / 54 14 / 54 Classification methods: 3. Statistical/Probabilistic Outline 1 Text classification As per our definition of the classification problem – text classification as a learning problem Supervised learning of a the classification function γ and its Naive Bayes 2 application to classifying new documents We will look at a couple of methods for doing this: Naive Bayes, Rocchio, kNN Evaluation of TC 3 No free lunch: requires hand-classified training data But this manual classification can be done by non-experts. 4 NB independence assumptions 15 / 54 16 / 54

The Naive Bayes classifier Maximum a posteriori class The Naive Bayes classifier is a probabilistic classifier. We compute the probability of a document d being in a class c as follows: Our goal is to find the “best” class. The best class in Naive Bayes classification is the most likely � P ( c | d ) ∝ P ( c ) P ( t k | c ) or maximum a posteriori (MAP) class c map : 1 ≤ k ≤ n d ˆ ˆ � ˆ P ( t k | c ) is the conditional probability of term t k occurring in a c map = arg max P ( c | d ) = arg max P ( c ) P ( t k | c ) c ∈ C c ∈ C document of class c 1 ≤ k ≤ n d P ( t k | c ) as a measure of how much evidence t k contributes We write ˆ P for P since these values are estimates from the that c is the correct class. training set. P ( c ) is the prior probability of c . If a document’s terms do not provide clear evidence for one class vs. another, we choose the one that has a higher prior probability. 17 / 54 18 / 54 Taking the log Naive Bayes classifier Classification rule: Multiplying lots of small probabilities can result in floating [ log ˆ � log ˆ c map = arg max P ( c ) + P ( t k | c )] point underflow. c ∈ C 1 ≤ k ≤ n d Since log( xy ) = log( x ) + log( y ), we can sum log probabilities instead of multiplying probabilities. Simple interpretation: Each conditional parameter log ˆ P ( t k | c ) is a weight that Since log is a monotonic function, the class with the highest indicates how good an indicator t k is for c . score does not change. The prior log ˆ P ( c ) is a weight that indicates the relative So what we usually compute in practice is: frequency of c . The sum of log prior and term weights is then a measure of [log ˆ � log ˆ c map = arg max P ( c ) + P ( t k | c )] how much evidence there is for the document being in the c ∈ C class. 1 ≤ k ≤ n d We select the class with the most evidence. Questions? 19 / 54 20 / 54

Naive Bayes classifier Naive Bayes classifier Classification rule: Classification rule: [ log ˆ � log ˆ [ log ˆ � log ˆ c map = arg max P ( c ) + P ( t k | c )] c map = arg max P ( c ) + P ( t k | c )] c ∈ C c ∈ C 1 ≤ k ≤ n d 1 ≤ k ≤ n d Simple interpretation: Simple interpretation: Each conditional parameter log ˆ Each conditional parameter log ˆ P ( t k | c ) is a weight that P ( t k | c ) is a weight that indicates how good an indicator t k is for c . indicates how good an indicator t k is for c . The prior log ˆ The prior log ˆ P ( c ) is a weight that indicates the relative P ( c ) is a weight that indicates the relative frequency of c . frequency of c . The sum of log prior and term weights is then a measure of The sum of log prior and term weights is then a measure of how much evidence there is for the document being in the how much evidence there is for the document being in the class. class. We select the class with the most evidence. We select the class with the most evidence. Questions? Questions? 21 / 54 22 / 54 Naive Bayes classifier Naive Bayes classifier Classification rule: Classification rule: [ log ˆ � log ˆ [ log ˆ � log ˆ c map = arg max P ( c ) + P ( t k | c )] c map = arg max P ( c ) + P ( t k | c )] c ∈ C c ∈ C 1 ≤ k ≤ n d 1 ≤ k ≤ n d Simple interpretation: Simple interpretation: Each conditional parameter log ˆ Each conditional parameter log ˆ P ( t k | c ) is a weight that P ( t k | c ) is a weight that indicates how good an indicator t k is for c . indicates how good an indicator t k is for c . The prior log ˆ The prior log ˆ P ( c ) is a weight that indicates the relative P ( c ) is a weight that indicates the relative frequency of c . frequency of c . The sum of log prior and term weights is then a measure of The sum of log prior and term weights is then a measure of how much evidence there is for the document being in the how much evidence there is for the document being in the class. class. We select the class with the most evidence. We select the class with the most evidence. Questions? Questions? 23 / 54 24 / 54

Overview Introduction to Information Retrieval Text classification - PowerPoint PPT Presentation

Overview Introduction to Information Retrieval Text classification http://informationretrieval.org 1 IIR 13: Text Classification & Naive Bayes Naive Bayes 2 Hinrich Sch utze Evaluation of TC 3 Institute for Natural Language

Information Retrieval Introducing Information Retrieval and Web Search Information Retrieval

CS54701: Information Retrieval CS-54701 Information Retrieval Retrieval Models: Language models

XML Retrieval XML Retrieval XML Retrieval XML Retrieval DB/IR in DB/IR in Theory Theory Web

Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar

Information Retrieval CS276: Information Retrieval and Web Search Text Classification 1 Chris

CS54701: Information Retrieval CS-54701 Information Retrieval Luo Si Department of Computer

Information Retrieval Introducing Information Retrieval and Web Search

Retrieval by Content Part 2: Text Retrieval Term Frequency and Inverse Document Frequency

Model Divergence Retrieval LM, session 10 CS6200: Information Retrieval Slides by: Jesse

Retrieval Models: Outline CS490W: Web I nformation Search & Management Retrieval Models

Retrieval by Content Image Retrieval Image Retrieval Problem Large Image and video data sets

Introduction Information Retrieval Indian Statistical Institute Information Retrieval (ISI)

Accessing XML content: An information retrieval perspective Mounia Lalmas mounia@acm.org 1

Information Retrieval CS4611 Professor M. P. Schellekens Assistant: Ang Gao Slides adapted from

Information Retrieval CS276 Information Retrieval and Web Search Christopher

Information Retrieval CS276: Information Retrieval and Web Search Pandu

HAZARD HAZARD ASSESSMENT ASSESSMENT adpc Asian Disaster Preparedness Center Hazard assessment

Research and Grid activities Research and Grid activities in Laboratory MSI of IFI in Laboratory

Computational Linguistics I CMSC 723 / LING 723 / INST 725 M ARINE C ARPUAT marine@cs.umd.edu

The Insurance Institute of London CII CPD accredited - demonstrates the quality of an event and

Q1 2019 Results 1 May 2019 Cautionary statement regarding forward-looking statements This

RETHINKING THE FUTURE OF SPACES POST THE GREAT LOCKDOWN 22 APRIL 2020 The Catalyst

https://xkcd.com/838/ Data Breaches This years study analyzed 524 breaches that occurred

Projections of Mandelbrot percolations Micha Rams 1 Kroly Simon 2 1 Institute of Mathematics

Overview Introduction to Information Retrieval Text classification - PowerPoint PPT Presentation

Overview Introduction to Information Retrieval Text classification http://informationretrieval.org 1 IIR 13: Text Classification & Naive Bayes Naive Bayes 2 Hinrich Sch utze Evaluation of TC 3 Institute for Natural Language

Information Retrieval Introducing Information Retrieval and Web Search Information Retrieval

CS54701: Information Retrieval CS-54701 Information Retrieval Retrieval Models: Language models

XML Retrieval XML Retrieval XML Retrieval XML Retrieval DB/IR in DB/IR in Theory Theory Web

Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar

Information Retrieval CS276: Information Retrieval and Web Search Text Classification 1 Chris

CS54701: Information Retrieval CS-54701 Information Retrieval Luo Si Department of Computer

Information Retrieval Introducing Information Retrieval and Web Search

Retrieval by Content Part 2: Text Retrieval Term Frequency and Inverse Document Frequency

Model Divergence Retrieval LM, session 10 CS6200: Information Retrieval Slides by: Jesse

Retrieval Models: Outline CS490W: Web I nformation Search &amp; Management Retrieval Models

Retrieval by Content Image Retrieval Image Retrieval Problem Large Image and video data sets

Introduction Information Retrieval Indian Statistical Institute Information Retrieval (ISI)

Accessing XML content: An information retrieval perspective Mounia Lalmas mounia@acm.org 1

Information Retrieval CS4611 Professor M. P. Schellekens Assistant: Ang Gao Slides adapted from

Information Retrieval CS276 Information Retrieval and Web Search Christopher

Information Retrieval CS276: Information Retrieval and Web Search Pandu

HAZARD HAZARD ASSESSMENT ASSESSMENT adpc Asian Disaster Preparedness Center Hazard assessment

Research and Grid activities Research and Grid activities in Laboratory MSI of IFI in Laboratory

Computational Linguistics I CMSC 723 / LING 723 / INST 725 M ARINE C ARPUAT marine@cs.umd.edu

The Insurance Institute of London CII CPD accredited - demonstrates the quality of an event and

Q1 2019 Results 1 May 2019 Cautionary statement regarding forward-looking statements This

RETHINKING THE FUTURE OF SPACES POST THE GREAT LOCKDOWN 22 APRIL 2020 The Catalyst

https://xkcd.com/838/ Data Breaches This years study analyzed 524 breaches that occurred

Projections of Mandelbrot percolations Micha Rams 1 Kroly Simon 2 1 Institute of Mathematics

Retrieval Models: Outline CS490W: Web I nformation Search & Management Retrieval Models