SLIDE 1
An Introduction to Text Classification

Jörg Steffen, DFKI
steffen@dfki.de
24.10.2011

SLIDE 2

Overview

  • Application Areas
  • Rule-Based Approaches
  • Statistical Approaches
      – Naive Bayes
      – Vector-Based Approaches
          – Rocchio
          – K-nearest Neighbors
          – Support Vector Machine
  • Evaluation Measures
  • Evaluation Corpora
  • N-Gram Based Classification
SLIDE 3

Example Application Scenario

  • Bertelsmann “Der Club” uses text classification to assign incoming emails to a category, e.g.
      – change of bank details
      – change of address
      – delivery inquiry
      – cancellation of membership
  • Emails are forwarded to the responsible editor
  • Advantages
      – decreased response time
      – more flexible resource management
      – happy customers ☺

SLIDE 4

Other Application Areas

  • Spam filtering
  • Language identification
  • News topic classification
  • Authorship attribution
  • Genre classification
  • Email surveillance
SLIDE 5

Rule-based Classification Approaches

  • Rules use the Boolean operators AND, OR and NOT
  • Example rule: if an email contains “address change” or “new address”, assign it to the category “address changes” (a minimal sketch follows below)
  • Rules are organized as a decision tree
      – nodes represent rules that route the document to a subtree
      – documents traverse the tree top-down
      – leaves represent categories
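
To make this concrete, here is a minimal sketch of such a rule set in Python; the categories and keywords are invented for illustration and not part of the original slides:

```python
# Minimal sketch of a rule-based email classifier. Categories and
# keywords are hypothetical; the first matching rule wins, mirroring
# a top-down walk through a decision tree.
RULES = [
    ("address changes", ["address change", "new address"]),
    ("cancellations",   ["cancel", "terminate membership"]),
]

def classify(email: str) -> str:
    text = email.lower()
    for category, keywords in RULES:
        if any(kw in text for kw in keywords):  # OR over the keywords
            return category
    return "unknown"  # no rule fired: absolute assignment, no confidence

print(classify("Please send mail to my new address ..."))  # address changes
```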

SLIDE 6

Rule-based Classification Approaches

  • Advantages
      – transparent: easy to understand, easy to modify, easy to expand
  • Disadvantages
      – building the rule set is complex and time-consuming
      – the intelligence is not in the system but with the system designer
      – not adaptive
      – only absolute assignments, no confidence values
  • Statistical classification approaches solve some of these disadvantages

SLIDE 7

Hybrid Approaches

  • Use statistics to automatically create decision trees, e.g. ID3 or CART (see the sketch after this list)
  • Idea: identify the feature of the training data with the highest information content
      – most valuable for differentiating between categories
      – it establishes the top-level node of the decision tree
      – the procedure is recursively applied to the subtrees
  • Advanced approaches “tune” the decision tree
      – merging of nodes
      – pruning of branches
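
As a hedged illustration (not from the original slides), scikit-learn can induce such a tree over bag-of-words features; criterion="entropy" selects splits by information gain, as in ID3. The toy texts are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

texts = ["please update my address", "I moved to a new address",
         "cancel my membership", "I want to quit the club"]
labels = ["address changes", "address changes",
          "cancellations", "cancellations"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)                 # word-by-document counts

tree = DecisionTreeClassifier(criterion="entropy")  # split by information gain
tree.fit(X, labels)

print(tree.predict(vectorizer.transform(["here is my new address"])))
```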

SLIDE 8

Statistical Classification Approaches

  • Advantages
      – work with probabilities, which allows thresholds
      – adaptive
  • Disadvantage
      – require a set of training documents annotated with a category
  • Most popular
      – Naive Bayes
      – Rocchio
      – K-nearest neighbor
      – Support Vector Machines (SVM)

SLIDE 9

Linguistic Preprocessing

  • Remove HTML/XML tags and stop words (a code sketch follows this list)
  • Perform word stemming
  • Replace all synonyms of a word with a single representative, e.g. { car, machine, automobile } → car
  • Compound analysis (for German texts), e.g. split “Hausboot” into “Haus” and “Boot”
  • The set of remaining words is called the “feature set”
  • Documents are treated as a “bag of words”
  • The importance of linguistic preprocessing increases with
      – the number of categories
      – a lack of training data
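
A minimal preprocessing sketch along these lines, assuming NLTK is installed and its stop-word list has been downloaded; the synonym map is a hypothetical stand-in for a real thesaurus such as WordNet:

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
STEM = PorterStemmer()
SYNONYMS = {"machine": "car", "automobile": "car"}  # hypothetical synonym map

def preprocess(text: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", text)            # strip HTML/XML tags
    tokens = re.findall(r"[a-z]+", text.lower())    # tokenize
    tokens = [SYNONYMS.get(t, t) for t in tokens if t not in STOP]
    return [STEM.stem(t) for t in tokens]           # stemmed bag of words

print(preprocess("<p>I drive my automobile to work</p>"))
```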

SLIDE 10

Naive Bayes

  • Based on Thomas Bayes' theorem from the 18th century
  • Idea: use the training data to estimate the probability that a new, unclassified document belongs to each of the categories $c_1, \ldots, c_K$
  • A document is treated as a set of words, $d = \{w_1, \ldots, w_M\}$
  • Bayes' theorem:

    $$P(c_j \mid d) = \frac{P(c_j)\, P(d \mid c_j)}{P(d)}$$

  • Since $P(d)$ is the same for every category, it can be dropped for ranking; with the word-independence assumption this simplifies to

    $$P(c_j \mid d) \propto P(c_j) \prod_{i=1}^{M} P(w_i \mid c_j)$$

SLIDE 11

Naive Bayes

  • The following estimates can be computed from the training documents:

    $$P(c_j) = \frac{N_j}{N} \qquad\qquad P(w_i \mid c_j) = \frac{1 + N_{ij}}{M + \sum_{k=1}^{M} N_{kj}}$$

    where
      – $N$ is the total number of training documents
      – $N_j$ is the number of training documents for category $c_j$
      – $N_{ij}$ is the number of times word $w_i$ occurred within documents of category $c_j$
      – $M$ is the total number of words in the feature set
  • The “+1” in the numerator (Laplace smoothing) prevents unseen words from forcing the product to zero (a sketch follows below)
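
Putting the last two slides together, a from-scratch sketch of Naive Bayes training and ranking on an invented toy corpus; log probabilities avoid numerical underflow on long documents:

```python
import math
from collections import Counter, defaultdict

train = [("change of address form", "address"),
         ("my new address is below", "address"),
         ("please cancel my membership", "cancel")]

docs_per_cat = Counter(cat for _, cat in train)   # N_j
word_counts = defaultdict(Counter)                # N_ij
for text, cat in train:
    word_counts[cat].update(text.split())
vocab = {w for c in word_counts.values() for w in c}
M, N = len(vocab), len(train)

def rank(doc: str):
    scores = {}
    for cat in docs_per_cat:
        total = sum(word_counts[cat].values())
        score = math.log(docs_per_cat[cat] / N)             # log P(c_j)
        for w in doc.split():                               # log P(w_i | c_j)
            score += math.log((1 + word_counts[cat][w]) / (M + total))
        scores[cat] = score
    return sorted(scores, key=scores.get, reverse=True)     # category ranking

print(rank("cancel my address"))
```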

SLIDE 12

Naive Bayes

  • The result is a ranking of categories
  • Adaptive
      – the probabilities can be updated with each correctly classified document
  • Naive Bayes is used very effectively in adaptive spam filters
  • But why “naive”?
      – it assumes word independence (the bag-of-words model)
      – this is generally not true for word occurrences in documents
  • Conclusion: text classification can be done by just counting words

SLIDE 13

Documents as Vectors

  • Some classification approaches are based on vector models
  • Developed by Gerard Salton in the 1960s
  • Documents have to be represented as vectors
  • Example
      – the vector space for the two documents “I walk” and “I drive” has three dimensions, one for each unique word
      – “I walk” → (1, 1, 0)
      – “I drive” → (1, 0, 1)
  • A collection of documents is represented by a word-by-document matrix $A = (a_{ik})$, where each entry represents the occurrences of word $i$ in document $k$ (a sketch follows below)
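
A small sketch of the example above, building the word-by-document matrix $A = (a_{ik})$ with scikit-learn; the token pattern keeps the one-letter token “I”:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I walk", "I drive"]
vectorizer = CountVectorizer(lowercase=False, token_pattern=r"\S+")
A = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['I' 'drive' 'walk']
print(A.toarray().T)                       # rows = words, columns = documents
```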

SLIDE 14

Weight of Words in Document Vectors

  • Boolean weighting

    $$a_{ik} = \begin{cases} 1 & \text{if } f_{ik} > 0 \\ 0 & \text{otherwise} \end{cases}$$

  • Word frequency weighting

    $$a_{ik} = f_{ik}$$

  • tf.idf weighting considers the distribution of words over the training corpus (a sketch follows below)

    $$a_{ik} = f_{ik} \times \log\left(\frac{N}{n_i}\right)$$

    where $n_i$ is the number of training documents that contain at least one occurrence of word $i$
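
A direct sketch of the tf.idf formula above on two invented toy documents:

```python
import math

docs = [["walk", "walk", "drive"], ["drive", "park"]]
N = len(docs)
n = {}                       # document frequency n_i per word
for doc in docs:
    for w in set(doc):
        n[w] = n.get(w, 0) + 1

def tfidf(doc):
    # a_ik = f_ik * log(N / n_i)
    return {w: doc.count(w) * math.log(N / n[w]) for w in set(doc)}

print(tfidf(docs[0]))        # 'drive' scores 0: it occurs in every document
```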

SLIDE 15

Run Length Encoding

  • Vectors representing documents contain almost only zeros, since only a fraction of the total words of a corpus appear in a single document
  • Run-length encoding is used to compress such vectors: store a sequence of n repetitions of the same value v as nv (a sketch follows below)
  • Example:

    WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW

    is stored as 12W1B12W3B24W1B14W
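
A minimal run-length encoder reproducing the slide's example:

```python
from itertools import groupby

def rle(seq: str) -> str:
    # groupby collapses each run of identical values into (value, run)
    return "".join(f"{len(list(run))}{v}" for v, run in groupby(seq))

s = "W" * 12 + "B" + "W" * 12 + "BBB" + "W" * 24 + "B" + "W" * 14
print(rle(s))  # -> 12W1B12W3B24W1B14W
```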

SLIDE 16

Dimensionality Reduction

  • Large training corpora contain hundreds of thousands of unique words, even after linguistic preprocessing
  • The result is a high-dimensional feature space
  • Processing is extremely costly in computational terms
  • Use feature selection to remove non-informative words from documents
      – document frequency thresholding
      – information gain
      – $\chi^2$ statistic

SLIDE 17

Document Frequency Thresholding

  • Compute the document frequency for each word in the training corpus
  • Remove words whose document frequency is less than a predetermined threshold
  • These words are non-informative or not influential for classification performance

SLIDE 18

Information Gain

  • Measure for each word how much its presence or absence in a document contributes to category prediction
  • Remove words whose information gain is less than a predetermined threshold

    $$IG(w) = -\sum_{j=1}^{K} P(c_j) \log P(c_j) \;+\; P(w) \sum_{j=1}^{K} P(c_j \mid w) \log P(c_j \mid w) \;+\; P(\bar{w}) \sum_{j=1}^{K} P(c_j \mid \bar{w}) \log P(c_j \mid \bar{w})$$

SLIDE 19

Information Gain

  • The probabilities are estimated from document counts (a sketch follows below):

    $$P(c_j) = \frac{N_j}{N} \qquad P(w) = \frac{N_w}{N} \qquad P(\bar{w}) = \frac{N_{\bar{w}}}{N} \qquad P(c_j \mid w) = \frac{N_{jw}}{N_w} \qquad P(c_j \mid \bar{w}) = \frac{N_{j\bar{w}}}{N_{\bar{w}}}$$

    where
      – $N$ is the total number of documents
      – $N_j$ is the number of documents in category $c_j$
      – $N_w$ / $N_{\bar{w}}$ is the number of documents containing / not containing $w$
      – $N_{jw}$ / $N_{j\bar{w}}$ is the number of documents in category $c_j$ containing / not containing $w$
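
A sketch computing $IG(w)$ from these document counts; the two-category toy counts are invented:

```python
import math

def ig(N, N_j, N_w, N_jw):
    """Information gain of word w.

    N    : total number of documents
    N_j  : docs per category (list)
    N_w  : docs containing w
    N_jw : docs per category containing w (list)
    """
    def plogp(p):
        return p * math.log(p) if p > 0 else 0.0
    N_wbar = N - N_w
    total = -sum(plogp(nj / N) for nj in N_j)                 # -sum P(c)logP(c)
    total += (N_w / N) * sum(plogp(njw / N_w) for njw in N_jw)
    total += (N_wbar / N) * sum(plogp((nj - njw) / N_wbar)
                                for nj, njw in zip(N_j, N_jw))
    return total

# 10 docs, 5 per category; w appears in 4 docs of category 1 only
print(ig(N=10, N_j=[5, 5], N_w=4, N_jw=[4, 0]))
```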

SLIDE 20

χ² Statistic

  • Measures the dependence between words and categories
  • Define the measure as

    $$\chi^2(w, c_j) = \frac{N \times (N_{jw} N_{\bar{j}\bar{w}} - N_{j\bar{w}} N_{\bar{j}w})^2}{(N_{jw} + N_{j\bar{w}}) \times (N_{\bar{j}w} + N_{\bar{j}\bar{w}}) \times (N_{jw} + N_{\bar{j}w}) \times (N_{j\bar{w}} + N_{\bar{j}\bar{w}})}$$

    averaged over categories as

    $$\chi^2(w) = \sum_{j=1}^{K} P(c_j)\, \chi^2(w, c_j)$$

  • The result is a word ranking
  • Select the top section as the feature set

SLIDE 21

Rocchio

  • Uses centroid vectors to represent categories
  • The centroid vector is the average of all document vectors of a category
  • Centroid vectors are calculated in the training phase
  • To classify a new document, just calculate the distance of its vector to the centroid vector of each category
  • Use cosine similarity as the distance measure (a sketch follows below)

    $$\cos(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\; \sqrt{\sum_i y_i^2}}$$
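
A compact Rocchio sketch on invented toy vectors: average each category's document vectors into a centroid, then pick the category whose centroid has the highest cosine similarity to the new document:

```python
import numpy as np

train = {  # category -> document vectors (one row per document)
    "red":   np.array([[1.0, 0.0, 1.0], [2.0, 0.0, 1.0]]),
    "green": np.array([[0.0, 2.0, 0.0], [0.0, 1.0, 1.0]]),
}
centroids = {c: v.mean(axis=0) for c, v in train.items()}  # training phase

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def classify(doc):
    return max(centroids, key=lambda c: cosine(doc, centroids[c]))

print(classify(np.array([1.0, 0.0, 0.5])))  # -> red
```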

SLIDE 22

Rocchio

[Figure: category centroid vectors and the vector of a new document]

SLIDE 23

Rocchio

  • Advantages
      – fast training phase
      – small models
      – fast classification
  • Disadvantage
      – precision drops with an increasing number of categories

SLIDE 24

K-nearest Neighbors

  • Similar to Rocchio
  • Check the k nearest neighbor vectors of a new document vector (a sketch follows this list)
  • The value of k is determined empirically
  • Define “nearest” using a similarity measure, e.g. Euclidean distance or cosine similarity
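
A hedged scikit-learn sketch on invented vectors; weights="distance" corresponds to the weighted-sum voting scheme shown two slides further on, while weights="uniform" would give plain majority voting:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 0], [2, 1], [0, 2], [1, 3], [2, 3]])
y = ["red", "red", "green", "green", "green"]

knn = KNeighborsClassifier(n_neighbors=5, metric="cosine",
                           weights="distance")
knn.fit(X, y)  # no real training phase: k-NN just stores the vectors

print(knn.predict(np.array([[1, 1]])))
```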

SLIDE 25

1-nearest Neighbor

  • Assign the new document the category of its nearest neighbor

SLIDE 26

K-nearest Neighbors

  • Majority voting scheme
      – k = 1: majority for red
      – k = 5: majority for green
      – k = 10: even votes for both

SLIDE 27

K-nearest Neighbors

  • Weighted-sum voting scheme for k = 5
  • Neighbors are given weights according to their nearness

    [Figure: five neighbors with weights 8, 2, 2, 6, 1]

      – weighted sum for red: 14
      – weighted sum for green: 5

SLIDE 28

K-nearest Neighbors

  • Advantages
      – no training phase required
      – good scalability if the number of categories increases
  • Disadvantages
      – large models for large training sets
      – requires a lot of memory
      – slow classification

SLIDE 29

Support Vector Machine

  • For each pair of categories, find a decision surface (hyperplane) in the vector space that separates the document vectors of the two categories
  • Usually, there are many possible separating hyperplanes
  • Find the “best” one: the maximum-margin hyperplane
      – equal distance to both document sets
      – the margin between the hyperplane and the document sets is at a maximum
  • Training result for each pair of categories: the vectors closest to the hyperplane, the support vectors
  • Classification: calculate the distance of the document vector to the support vectors (a sketch follows below)
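
A hedged sketch with scikit-learn on invented vectors; a linear-kernel SVC finds the maximum-margin hyperplane and exposes the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 4.0], [5.0, 3.0]])
y = ["cat_a", "cat_a", "cat_b", "cat_b"]

svm = SVC(kernel="linear")   # finds the maximum-margin hyperplane
svm.fit(X, y)

print(svm.support_vectors_)  # the vectors closest to the hyperplane
print(svm.predict(np.array([[1.5, 0.5]])))  # -> cat_a
```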

SLIDE 30

Support Vector Machine

  • More than one hyperplane separates the document vectors of the two categories

SLIDE 31

Support Vector Machine

  • Find the maximum-margin hyperplane
  • Vectors at the margins are called support vectors
SLIDE 32

Support Vector Machine

  • Advantages
      – only the support vectors are required to classify new documents
      – small models
      – feature selection can be omitted
      – no overfitting
          – when given too much training data, other classification approaches may only return correct classifications for the training documents
          – avoiding this is the main advantage of SVMs over other vector-based approaches
  • Disadvantage
      – very complex training (an optimization problem)

SLIDE 33

Classification Evaluation

  • Possible results of a binary classification:

                    truly YES               truly NO
    system YES      true positives (TP)     false positives (FP)
    system NO       false negatives (FN)    true negatives (TN)

SLIDE 34

Evaluation Measures

  • Precision
      – the percentage of documents assigned to the category that actually belong to it

    $$\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$$

  • Recall
      – the percentage of documents belonging to the category that were found

    $$\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$$

SLIDE 35

Evaluation Measures

  • Precision and recall are misleading when examined alone
  • There is always a tradeoff between precision and recall
      – an increase in recall often comes with a decrease in precision
      – if precision and recall are tuned to have the same value, that value is called the break-even point
  • The F-measure combines precision and recall in one value (a sketch follows below)

    $$F_\beta = \frac{(1 + \beta^2) \times \text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}}$$

      – $\beta$ allows different weighting of precision and recall
      – for equal weighting, $\beta = 1$
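
A direct sketch of these measures computed from invented confusion-matrix counts:

```python
def evaluate(tp: int, fp: int, fn: int, beta: float = 1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

print(evaluate(tp=80, fp=20, fn=40))  # -> (0.8, 0.666..., 0.727...)
```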

SLIDE 36

Evaluation Corpora

  • To compare different classification approaches, a common set of data is required
  • Popular evaluation corpora
      – Reuters-21578 collection
      – 20-Newsgroups corpus
  • Evaluation corpora are usually split up into a training corpus and a test corpus
  • Beware: you can score top precision and recall values if you test your classification approach on the training data!

SLIDE 37

Reuters-21578 Collection

  • Collected from the Reuters newswire in 1987
  • Contains 12,902 news articles from 135 different categories
  • Documents have up to 14 categories assigned
  • The average is 1.24 categories per document
  • Default split
      – 9,603 training documents
      – 3,299 test documents

SLIDE 38

20-Newsgroups-Corpus

  • Consists of newsgroup articles from 20 different newsgroups
  • Some newsgroups are closely related, e.g. alt.atheism and talk.religion.misc
  • Contains 20,000 articles, 1,000 for each newsgroup
  • Corpus size: 36 MB
  • Average article size: 2 KB
  • The newsgroup header of the articles has been removed
SLIDE 39

What is the best classification approach?

  • This depends on the application scenario and the data
  • “Hard” facts are easy to model with rules
  • “Soft” facts are better modeled with statistics
  • If there is little or no training data, statistics don't work
  • Among the statistical approaches, the ranking is
      1. SVM
      2. K-nearest neighbors
      3. Rocchio
      4. Naive Bayes
  • In real life, rule-based and statistical approaches are often combined to get the best results
SLIDE 40

N-Gram Based Multilingual and Robust Document Classification

SLIDE 41

Memphis Project Overview

SLIDE 42

The MediAlert Service

  • Domain: book announcements
  • Sources: internet sites of book shops and publishers in English, German and Italian
  • Classification task: assign a topic to each book announcement, e.g. Biographies, Film, Music, Sports, Travel, Health, Food
  • Classification challenges:
      – informal texts with an open-ended vocabulary
      – content in several languages
      – spelling mistakes and missing case distinction

SLIDE 43

Character-Level N-Grams

  • The MEMPHIS classifier is based on character-level n-grams instead of terms
  • Example (a sketch follows this list)
      – “Well, this is an example!”
      – 3-grams: “Wel”, “ell”, “ll,”, “l, ”, “, t”, “ th”, “thi”, “his”, … “le!”
  • Advantages of character-level n-grams
      – no linguistic preprocessing necessary
      – language independent
      – very robust
      – less sparse data
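
A one-line extraction sketch reproducing the example above:

```python
def char_ngrams(text: str, n: int = 3) -> list[str]:
    # slide a window of n characters over the text
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("Well, this is an example!"))
# -> ['Wel', 'ell', 'll,', 'l, ', ', t', ' th', 'thi', 'his', ...]
```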

SLIDE 44

Model Training

  • Training requires a corpus of documents
  • Each training document must be tagged with one or more categories
  • For each category, a statistical model is created
  • Each model contains conditional probabilities based on character-level n-gram frequencies counted in the training documents
  • The models are independent of each other
SLIDE 45

Model Training

  • A document is a character sequence $s = c_1, \ldots, c_N$
  • Maximum likelihood estimate:

    $$P(c_i \mid c_{i-n+1}, \ldots, c_{i-1}) = \frac{\#(c_{i-n+1}, \ldots, c_i)}{\#(c_{i-n+1}, \ldots, c_{i-1})}$$

  • Example (a sketch follows below):

    $$P(\text{d} \mid \text{win}) = \frac{\#(\text{wind})}{\#(\text{win})}$$
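
A sketch of this estimate: count n-grams and their (n−1)-character prefixes in a category's training text, which is invented here:

```python
from collections import Counter

def train_model(text: str, n: int = 4):
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    prefixes = Counter(text[i:i + n - 1] for i in range(len(text) - n + 2))
    return ngrams, prefixes

def prob(ngrams, prefixes, gram: str) -> float:
    # P(last char | first n-1 chars) = #(gram) / #(prefix)
    return ngrams[gram] / prefixes[gram[:-1]] if prefixes[gram[:-1]] else 0.0

ngrams, prefixes = train_model("the wind and the window")
print(prob(ngrams, prefixes, "wind"))  # #(wind) / #(win)
```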

SLIDE 46

Document Classification

  • Based on Bayesian decision theory
  • For each model, predict the probability of the test document using the chain rule of probability:

    $$P(c_1, \ldots, c_N) = \prod_{i=1}^{N} P(c_i \mid c_1, \ldots, c_{i-1})$$

  • Approximation in n-gram models:

    $$P(c_i \mid c_1, \ldots, c_{i-1}) \approx P(c_i \mid c_{i-n+1}, \ldots, c_{i-1})$$

  • The result is a ranking of categories derived from the probability of the test document under each model (a sketch follows below)
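
A sketch of the ranking step, reusing train_model and prob from the previous sketch; the log of the document probability is summed n-gram by n-gram, and a crude probability floor stands in for the smoothing discussed on the next slides:

```python
import math

def log_score(doc: str, model, n: int = 4) -> float:
    ngrams, prefixes = model
    score = 0.0
    for i in range(len(doc) - n + 1):
        p = prob(ngrams, prefixes, doc[i:i + n])
        score += math.log(p) if p > 0 else math.log(1e-9)  # crude floor
    return score

models = {cat: train_model(text)          # invented training texts
          for cat, text in [("weather", "the wind and the window"),
                            ("sports",  "the winner wins the game")]}
ranking = sorted(models, key=lambda c: log_score("winds", models[c]),
                 reverse=True)
print(ranking)                            # category ranking
```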

SLIDE 47

Sparse Data Problem

  • N-grams in test documents that are unseen in training get zero probability
  • As a consequence, the probability of the whole test document becomes zero
  • No matter how much training data there is, there can always be unseen n-grams in some test documents
  • Solution: probability smoothing (a minimal sketch follows)
      – assign a non-zero probability to unseen n-grams
      – to keep a valid model, reduce the probability of known n-grams and reserve some room in the probability space for unseen n-grams
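
As a minimal illustration of the idea only: additive (Laplace) smoothing, which is far simpler than the techniques on the next slide, gives every n-gram a non-zero probability while shrinking the mass of seen n-grams. It reuses the ngrams/prefixes counters from the earlier sketch:

```python
# The +1 reserves probability mass for unseen n-grams; vocab_size is
# the number of possible final characters.
def smoothed_prob(ngrams, prefixes, gram: str, vocab_size: int) -> float:
    return (ngrams[gram] + 1) / (prefixes[gram[:-1]] + vocab_size)
```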

SLIDE 48

Smoothing Techniques

  • Several smoothing techniques have been adapted for character-level n-grams, yielding backoff models and interpolated models:
      – Katz smoothing
      – Simple Good-Turing smoothing
      – Absolute smoothing
      – Kneser-Ney smoothing
      – Modified Kneser-Ney smoothing

SLIDE 49

Whitespace Stripping

  • A non-linguistic preprocessing step (a sketch follows below)
  • Strip all whitespace
  • Convert all characters to lower case
  • To preserve word-border information, the first character of each word is kept upper case
  • Example: LIFE STORIES: Profiles from the New Yorker → LifeStories:ProfilesFromTheNewYorker
  • Improves the average $F_1$-measure by up to 5%
  • Results in larger models
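
A sketch of this transform reproducing the slide's example:

```python
def strip_whitespace(text: str) -> str:
    # capitalize each token and concatenate without spaces
    return "".join(tok[0].upper() + tok[1:].lower() for tok in text.split())

print(strip_whitespace("LIFE STORIES: Profiles from the New Yorker"))
# -> LifeStories:ProfilesFromTheNewYorker
```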

SLIDE 50

20-Newsgroups Evaluation Results

[Figure: average F1-measure (y-axis, 0.70–0.95) over training size (x-axis, 10%–90%) for 2-grams, 3-grams, 4-grams and 5-grams]

SLIDE 51

Linguistic Resources

  • Amazon corpora
      – 1,000 docs per category
      – English (13 MB) and German (10 MB)
      – acquired using the Amazon web service
  • Other English corpora:
      – Randomhouse.com (3,000 docs, 4 MB)
      – Powells.com (8,000 docs, 7 MB)
  • Other German corpora:
      – Bol.de (1,200 docs, 1 MB)
      – Buecher.de (2,300 docs, 2 MB)

SLIDE 52

Evaluation

  • Classification parameters
      – smoothing technique
      – n-gram length
      – mono-lingual vs. multi-lingual models
  • Setting:
      – split the corpus randomly into training docs (80%) and test docs (20%)
      – performance reported as the average $F_1$-measure of 10 runs

SLIDE 53

Smoothing Techniques

[Figure: F1-measure (0.912–0.926) for Katz, Good-Turing, Absolute-BO, Absolute-IP, Kneser-Ney and Modified Kneser-Ney smoothing]

SLIDE 54

Mono-Lingual Models

[Figure: F1-measure (0.74–0.94) over n-gram length (2-grams to 5-grams) for the German and English Amazon corpora]

SLIDE 55

Multi-Lingual Models

[Figure: F1-measure (0.5–0.95) over n-gram length (2-grams to 5-grams) for the mixed, German and English Amazon corpora]

SLIDE 56

Conclusions

  • Classification using character-level n-grams performs very well at assigning topics to multi-lingual, informal documents
  • The approach is robust enough to allow multi-lingual models