Learning for Categorization




Categorization (review)
• Given:
  – A description of an instance, x ∈ X, where X is the instance language or instance space.
  – A fixed set of categories: C = {c_1, c_2, …, c_n}
• Determine:
  – The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.

Learning for Categorization
• A training example is an instance x ∈ X, paired with its correct category c(x): <x, c(x)>, for an unknown categorization function, c.
• Given a set of training examples, D.
• Find a hypothesized categorization function, h(x), such that:
  ∀ <x, c(x)> ∈ D : h(x) = c(x)   (consistency)
  (A small consistency check on the sample problem below is sketched after these slides.)

Sample Category Learning Problem
• Instance language: <size, color, shape>
  – size ∈ {small, medium, large}
  – color ∈ {red, blue, green}
  – shape ∈ {square, circle, triangle}
• C = {positive, negative}
• D:
  Example  Size   Color  Shape     Category
  1        small  red    circle    positive
  2        large  red    circle    positive
  3        small  red    triangle  negative
  4        large  blue   circle    negative

General Learning Issues
• Many hypotheses are usually consistent with the training data.
• Bias
  – Any criteria other than consistency with the training data that is used to select a hypothesis.
• Classification accuracy (% of instances classified correctly).
  – Measured on independent test data.
• Training time (efficiency of the training algorithm).
• Testing time (efficiency of subsequent classification).

Generalization
• Hypotheses must generalize to correctly classify instances not in the training data.
• Simply memorizing training examples is a consistent hypothesis that does not generalize.
• Occam's razor:
  – Finding a simple hypothesis helps ensure generalization.
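To make the consistency condition concrete, here is a minimal Python sketch (not from the slides; the hypothesis h below is just one hypothetical rule) that checks whether a candidate categorization function agrees with every training example in the sample problem above.

```python
# Hypothetical sketch: checking hypothesis consistency on the toy dataset above.
# D pairs each instance <size, color, shape> with its correct category c(x).
D = [
    (("small", "red", "circle"), "positive"),
    (("large", "red", "circle"), "positive"),
    (("small", "red", "triangle"), "negative"),
    (("large", "blue", "circle"), "negative"),
]

def h(x):
    """One simple candidate hypothesis: red circles are positive."""
    size, color, shape = x
    return "positive" if (color == "red" and shape == "circle") else "negative"

def is_consistent(hypothesis, examples):
    """True iff hypothesis(x) == c(x) for every <x, c(x)> in the training data."""
    return all(hypothesis(x) == c_x for x, c_x in examples)

print(is_consistent(h, D))  # True: h agrees with every training example
```

Note that many other hypotheses (including one that simply memorizes D) are equally consistent; choosing among them is exactly the bias and generalization issue raised above.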

Text Categorization
• Assigning documents to a fixed set of categories, e.g.:
• Web pages
  – Categories in search (see microsoft.com)
  – Yahoo-like classification
• Newsgroup messages
  – Recommending
  – Spam filtering
• News articles
  – Personalized newspaper
• Email messages
  – Routing
  – Prioritizing
  – Folderizing
  – Spam filtering

Learning for Text Categorization
• Hard to construct text categorization functions.
• Learning Algorithms:
  – Bayesian (naïve)
  – Neural network
  – Relevance Feedback (Rocchio)
  – Rule based (C4.5, Ripper, Slipper)
  – Nearest Neighbor (case based)
  – Support Vector Machines (SVM)

Using Relevance Feedback (Rocchio)
• Use standard TF/IDF weighted vectors to represent text documents (normalized by maximum term frequency).
• For each category, compute a prototype vector by summing the vectors of the training documents in the category.
• Assign test documents to the category with the closest prototype vector based on cosine similarity.

Rocchio Text Categorization Algorithm (Training)
Assume the set of categories is {c_1, c_2, …, c_n}
For i from 1 to n let p_i = <0, 0, …, 0>  (init. prototype vectors)
For each training example <x, c(x)> ∈ D
  Let d be the frequency-normalized TF/IDF term vector for doc x
  Let i = j such that (c_j = c(x))
  (sum all the document vectors in c_i to get p_i)
  Let p_i = p_i + d

Rocchio Text Categorization Algorithm (Test)
Given test document x
Let d be the TF/IDF weighted term vector for x
Let m = –2  (init. maximum cosSim)
For i from 1 to n:  (compute similarity to prototype vector)
  Let s = cosSim(d, p_i)
  if s > m
    let m = s
    let r = c_i  (update most similar class prototype)
Return class r

Illustration of Rocchio Text Categorization
(figure only in the original slides)
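The Rocchio training pseudocode above amounts to one sparse-vector sum per category. The following is a rough Python sketch under the assumption that training data arrives as (document, category) pairs and that a tfidf_vector helper (not defined in the slides) produces a frequency-normalized, sparse {term: weight} mapping.

```python
from collections import defaultdict

def rocchio_train(D, tfidf_vector):
    """Rocchio training sketch: sum the TF/IDF vectors of each category's training documents.

    D            -- iterable of (document, category) training pairs
    tfidf_vector -- assumed helper returning a sparse {term: weight} dict for a document
    Returns a dict mapping each category c_i to its prototype vector p_i (a sparse dict).
    """
    prototypes = defaultdict(lambda: defaultdict(float))   # p_i initialized to zero vectors
    for doc, category in D:
        d = tfidf_vector(doc)                              # frequency-normalized TF/IDF vector for doc
        for term, weight in d.items():                     # p_i = p_i + d (sparse vector addition)
            prototypes[category][term] += weight
    return {c: dict(vec) for c, vec in prototypes.items()}
```

Note that the prototypes are left as raw sums, matching the Rocchio Properties slide later: cosine similarity is insensitive to vector length, so no averaging is needed.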

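The test phase can be sketched the same way: compute cosine similarity between the test document's vector and every prototype, and return the most similar class. The cosSim of the pseudocode is written out here as cos_sim; both functions are illustrative, not a reference implementation.

```python
import math

def cos_sim(d, p):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * p.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    return dot / (norm_d * norm_p) if norm_d and norm_p else 0.0

def rocchio_classify(doc, prototypes, tfidf_vector):
    """Return the category whose prototype vector is most cosine-similar to the document."""
    d = tfidf_vector(doc)
    best_class, best_sim = None, -2.0                      # m = -2 initializes the maximum cosSim
    for category, p in prototypes.items():
        s = cos_sim(d, p)
        if s > best_sim:                                   # update most similar class prototype
            best_sim, best_class = s, category
    return best_class
```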
Rocchio Properties
• Does not guarantee a consistent hypothesis.
• Forms a simple generalization of the examples in each class (a prototype).
• The prototype vector does not need to be averaged or otherwise normalized for length, since cosine similarity is insensitive to vector length.
• Classification is based on similarity to class prototypes.

Rocchio Anomaly
• Prototype models have problems with polymorphic (disjunctive) categories.
(figure only in the original slides)

Rocchio Time Complexity
• Note: the time to add two sparse vectors is proportional to the minimum number of non-zero entries in the two vectors.
• Training time: O(|D| (L_d + |V_d|)) = O(|D| L_d), where L_d is the average length of a document in D and |V_d| is the average vocabulary size for a document in D.
• Test time: O(L_t + |C| |V_t|), where L_t is the average length of a test document and |V_t| is the average vocabulary size for a test document.
  – Assumes lengths of the p_i vectors are computed and stored during training, allowing cosSim(d, p_i) to be computed in time proportional to the number of non-zero entries in d (i.e. |V_t|).

Nearest-Neighbor Learning Algorithm
• Learning is just storing the representations of the training examples in D.
• Testing instance x:
  – Compute similarity between x and all examples in D.
  – Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or category prototypes.
• Also called:
  – Case-based
  – Memory-based
  – Lazy learning

K Nearest-Neighbor
• Using only the closest example to determine the categorization is subject to errors due to:
  – A single atypical example.
  – Noise (i.e. error) in the category label of a single training example.
• A more robust alternative is to find the k most-similar examples and return the majority category of these k examples.
• The value of k is typically odd to avoid ties; 3 and 5 are most common.

Similarity Metrics
• The nearest-neighbor method depends on a similarity (or distance) metric.
• Simplest for a continuous m-dimensional instance space is Euclidean distance.
• Simplest for an m-dimensional binary instance space is Hamming distance (the number of feature values that differ).
• For text, cosine similarity of TF-IDF weighted vectors is typically most effective.
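As a sketch of the k-nearest-neighbor decision rule itself (hypothetical code, with the similarity function left pluggable): "training" just stores D, and testing ranks the stored examples by similarity and takes a majority vote over the top k.

```python
from collections import Counter

def knn_classify(x, D, similarity, k=3):
    """k-nearest-neighbor sketch.

    x          -- test instance
    D          -- list of (instance, category) pairs; "learning" is just storing this list
    similarity -- assumed function similarity(a, b): larger value = more similar
    k          -- number of neighbors; odd values such as 3 or 5 help avoid ties
    """
    ranked = sorted(D, key=lambda pair: similarity(x, pair[0]), reverse=True)
    top_k = [category for _, category in ranked[:k]]
    return Counter(top_k).most_common(1)[0][0]       # majority category of the k neighbors
```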

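The two distance metrics named on the Similarity Metrics slide can be wrapped as similarity functions for the knn_classify sketch above by negating them (a hedged example; for text, the cos_sim helper from the Rocchio sketch plays the same role).

```python
import math

# Negated distances act as similarities (higher value = more similar), so they
# plug directly into knn_classify above. For text, use cos_sim on TF-IDF vectors.

def euclidean_similarity(a, b):
    """Continuous m-dimensional instances: negative Euclidean distance."""
    return -math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def hamming_similarity(a, b):
    """Binary (or categorical) instances: negative count of differing feature values."""
    return -sum(ai != bi for ai, bi in zip(a, b))
```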
3 Nearest Neighbor Illustration (Euclidean Distance)
(figure only in the original slides)

K Nearest Neighbor for Text
Training:
  For each training example <x, c(x)> ∈ D
    Compute the corresponding TF-IDF vector, d_x, for document x
Test instance y:
  Compute TF-IDF vector d for document y
  For each <x, c(x)> ∈ D
    Let s_x = cosSim(d, d_x)
  Sort examples, x, in D by decreasing value of s_x
  Let N be the first k examples in D  (get most similar neighbors)
  Return the majority class of examples in N
(A Python sketch of this procedure appears after these slides.)

Illustration of 3 Nearest Neighbor for Text
(figure only in the original slides)

3NN on Rocchio Anomaly
• Nearest Neighbor handles polymorphic categories better.
(figure only in the original slides)

Nearest Neighbor Time Complexity
• Training time: O(|D| L_d) to compose the TF-IDF vectors.
• Testing time: O(L_t + |D| |V_t|) to compare to all training vectors.
  – Assumes lengths of the d_x vectors are computed and stored during training, allowing cosSim(d, d_x) to be computed in time proportional to the number of non-zero entries in d (i.e. |V_t|).
• Testing time can be high for large training sets.

Nearest Neighbor with Inverted Index
• Determining the k nearest neighbors is the same as determining the k best retrievals using the test document as a query to a database of training documents.
• Use standard VSR inverted index methods to find the k nearest neighbors.
• Testing time: O(B |V_t|), where B is the average number of training documents in which a test-document word appears.
• Therefore, overall classification is O(L_t + B |V_t|).
  – Typically B << |D|.
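The K Nearest Neighbor for Text pseudocode translates almost line for line; this sketch reuses the assumed tfidf_vector helper and the cos_sim function from the Rocchio sketches above, so it is illustrative rather than standalone.

```python
from collections import Counter

# tfidf_vector and cos_sim are the assumed helpers from the Rocchio sketches above.

def knn_text_train(D, tfidf_vector):
    """'Training' just computes and stores a TF-IDF vector d_x for every training document x."""
    return [(tfidf_vector(doc), category) for doc, category in D]

def knn_text_classify(doc, stored, tfidf_vector, k=3):
    """Rank stored examples by cosine similarity to the test document, then majority-vote over the top k."""
    d = tfidf_vector(doc)
    ranked = sorted(stored, key=lambda pair: cos_sim(d, pair[0]), reverse=True)
    neighbors = ranked[:k]                         # N: the k most similar training documents
    return Counter(category for _, category in neighbors).most_common(1)[0][0]
```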

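Finally, the inverted-index variant scores only those training documents that share at least one term with the test document, which is where the O(B |V_t|) testing time comes from. The layout below (postings of (doc_id, weight) plus precomputed vector lengths, built over the output of knn_text_train) is one plausible realization sketched here, not the slides' own specification.

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(stored):
    """Build term -> [(doc_id, weight)] postings over the stored TF-IDF vectors,
    and precompute each vector's length (as the time-complexity slides assume)."""
    index = defaultdict(list)
    lengths = []
    for doc_id, (vec, _category) in enumerate(stored):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
        lengths.append(math.sqrt(sum(w * w for w in vec.values())))
    return index, lengths

def knn_inverted_classify(doc, stored, index, lengths, tfidf_vector, k=3):
    """Score only training documents sharing a term with the test document, then vote over the top k."""
    d = tfidf_vector(doc)
    scores = defaultdict(float)
    for term, weight in d.items():                 # touches roughly B postings per test-document term
        for doc_id, doc_weight in index.get(term, []):
            scores[doc_id] += weight * doc_weight
    d_len = math.sqrt(sum(w * w for w in d.values()))
    for doc_id in scores:                          # turn dot products into cosine similarities
        scores[doc_id] /= (d_len * lengths[doc_id]) or 1.0
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return Counter(stored[doc_id][1] for doc_id in top).most_common(1)[0][0]
```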
