
Text Categorization
CSE 454: Information Retrieval (lecture slides)


CSE 454 Information Retrieval: Course Overview
• Systems Foundation: Networking & Clusters
• Synchronization & Monitors
• Crawler Architecture
• Inverted Indices
• Precision vs Recall
• Text Categorization
• Info Extraction
• Ecommerce, Web Services
• P2P, Security, Datamining, Semantic Web
• Case Studies: Nutch, Google, Altavista

Why is Learning Possible?
• Experience alone never justifies any conclusion about any unseen instance.
• Learning occurs when PREJUDICE meets DATA!
(figure: "Learning a FOO")

Bias
• The nice word for prejudice is "bias".
• What kind of hypotheses will you consider?
– What is the allowable range of functions you use when approximating?
• What kind of hypotheses do you prefer?

A Learning Problem
(figure slide)

Some Typical Bias: The World is Simple
• Occam's razor: "It is needless to do more when less will suffice" (William of Occam, died 1349 of the Black Plague)
• MDL: Minimum Description Length
• Concepts can be approximated by conjunctions of predicates, by linear functions, or by short decision trees

Hypothesis Spaces
(figure slides)

Terminology
(figure slide)

Two Strategies for ML
• Restriction bias: use prior knowledge to specify a restricted hypothesis space.
– Naïve Bayes
• Preference bias: use a broad hypothesis space, but impose an ordering on the hypotheses.
– Decision Trees

Key Issues for ML
(figure slide)

Framework for Learning Algos
(figure slide)

Categorization (review)
• Given:
– A description of an instance, x ∈ X, where X is the instance language or instance space.
– A fixed set of categories: C = {c_1, c_2, …, c_n}
• Determine:
– The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.

Learning for Categorization
• A training example is an instance x ∈ X paired with its correct category c(x): <x, c(x)>, for an unknown categorization function c.
• Given a set of training examples, D.
• Find a hypothesized categorization function h(x) such that:
∀ <x, c(x)> ∈ D : h(x) = c(x)   (Consistency)

Sample Category Learning Problem
• Instance language: <size, color, shape>
– size ∈ {small, medium, large}
– color ∈ {red, blue, green}
– shape ∈ {square, circle, triangle}
• C = {positive, negative}
• D:

Example  Size   Color  Shape     Category
1        small  red    circle    positive
2        large  red    circle    positive
3        small  red    triangle  negative
4        large  blue   circle    negative

More to the Point
• C(X) = true if X is a Webcam page
• Features: words on the page, ….
• Hypothesis Language

Text Categorization
• Assigning documents to a fixed set of categories.
• Applications:
– Web pages: categories in search (see microsoft.com), Yahoo-like classification
– Newsgroup messages / news articles: recommending, personalized newspaper
– Email messages: routing, prioritizing, folderizing, spam filtering

Generalization
• Hypotheses must generalize to correctly classify instances not in the training data.
– Simply memorizing training examples gives a consistent hypothesis that does not generalize.
• Occam's razor: finding a simple hypothesis helps ensure generalization.
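The consistency condition above, h(x) = c(x) for every training example, can be illustrated on the <size, color, shape> toy data. This is a minimal sketch; the hypothesis h and the function names are my own illustrations, not from the slides.

```python
# Hypothetical sketch: check whether a candidate hypothesis is
# consistent with the training set D from the slides.

D = [  # <size, color, shape> paired with category c(x)
    (("small", "red", "circle"), "positive"),
    (("large", "red", "circle"), "positive"),
    (("small", "red", "triangle"), "negative"),
    (("large", "blue", "circle"), "negative"),
]

def h(x):
    """One candidate hypothesis: 'red circles are positive'."""
    size, color, shape = x
    return "positive" if (color == "red" and shape == "circle") else "negative"

def consistent(h, D):
    """h is consistent iff h(x) == c(x) for every <x, c(x)> in D."""
    return all(h(x) == c_x for x, c_x in D)

print(consistent(h, D))  # this hypothesis fits all four examples
```

Note that many other hypotheses are also consistent with these four examples, which is exactly why a bias is needed to choose among them.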

General Learning Issues
• Many hypotheses are often consistent with the training data.
• Bias
– Any criterion other than consistency with the training data that is used to select a hypothesis.
• Classification accuracy
– % of instances classified correctly.
– Measured on independent test data.
• Training time
– Efficiency of the training algorithm.
• Testing time
– Efficiency of subsequent classification.

Learning for Text Categorization
• Manual development of text categorization functions is difficult.
• Learning algorithms:
– Bayesian (naïve)
– Neural network
– Relevance feedback (Rocchio)
– Rule based (C4.5, Ripper, Slipper)
– Nearest neighbor (case based)
– Support vector machines (SVM)

Using Relevance Feedback (Rocchio)
• Adapt relevance feedback for text categorization.
• Use standard TF/IDF weighted vectors to represent text documents (normalized by maximum term frequency).
• For each category, compute a prototype vector by summing the vectors of the training documents in the category.
• Assign test documents to the category with the closest prototype vector, based on cosine similarity.

Rocchio Text Categorization Algorithm (Training)
Assume the set of categories is {c_1, c_2, …, c_n}
For i from 1 to n: let p_i = <0, 0, …, 0>   (init. prototype vectors)
For each training example <x, c(x)> ∈ D:
  Let d = the frequency-normalized TF/IDF term vector for doc x
  Let i = j : (c_j = c(x))   (sum all the document vectors in c_i to get p_i)
  Let p_i = p_i + d

Rocchio Text Categorization Algorithm (Test)
Given test document x:
  Let d be the TF/IDF weighted term vector for x
  Let m = –2   (init. maximum cosSim)
  For i from 1 to n:   (compute similarity to each prototype vector)
    Let s = cosSim(d, p_i)
    If s > m:
      Let m = s
      Let r = c_i   (update most similar class prototype)
  Return class r

Illustration of Rocchio Text Categorization
(figure slide)
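The Rocchio training and test pseudocode above can be sketched in Python. This is a simplified illustration, not the slides' exact setup: it uses raw term-count vectors in place of the frequency-normalized TF/IDF weighting, and the toy spam/ham data and function names are my own.

```python
import math
from collections import Counter, defaultdict

def term_vector(doc):
    """Sparse term vector for a document (here: plain term counts,
    standing in for the TF/IDF weighting described on the slides)."""
    return Counter(doc.lower().split())

def cos_sim(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rocchio_train(D):
    """Sum the vectors of each class's training docs into a prototype p_i."""
    prototypes = defaultdict(Counter)
    for doc, label in D:
        prototypes[label] += term_vector(doc)
    return prototypes

def rocchio_classify(prototypes, doc):
    """Return the class whose prototype is closest by cosine similarity."""
    d = term_vector(doc)
    return max(prototypes, key=lambda c: cos_sim(d, prototypes[c]))

D = [("cheap pills buy now", "spam"),
     ("meeting agenda attached", "ham"),
     ("buy cheap watches", "spam"),
     ("project meeting notes", "ham")]
protos = rocchio_train(D)
print(rocchio_classify(protos, "buy pills"))  # -> spam
```

As the slides note, the prototypes need no length normalization here, because cosine similarity already ignores vector length.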

Rocchio Properties
• Does not guarantee a consistent hypothesis.
• Forms a simple generalization of the examples in each class (a prototype).
• The prototype vector does not need to be averaged or otherwise normalized for length, since cosine similarity is insensitive to vector length.
• Classification is based on similarity to class prototypes.

Rocchio Time Complexity
• Note: the time to add two sparse vectors is proportional to the minimum number of non-zero entries in the two vectors.
• Training time: O(|D| (L_d + |V_d|)) = O(|D| L_d), where L_d is the average length of a document in D and |V_d| is the average vocabulary size for a document in D.
• Test time: O(L_t + |C| |V_t|), where L_t is the average length of a test document and |V_t| is the average vocabulary size for a test document.
– Assumes the lengths of the p_i vectors are computed and stored during training, allowing cosSim(d, p_i) to be computed in time proportional to the number of non-zero entries in d (i.e. |V_t|).

Nearest-Neighbor Learning
• Learning is just storing the representations of the training examples in D.
• Testing instance x:
– Compute the similarity between x and all examples in D.
– Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or category prototypes.
• Also called:
– Case-based
– Memory-based
– Lazy learning

K Nearest-Neighbor Algorithm
• Using only the closest example to determine the categorization is subject to errors due to:
– a single atypical example;
– noise (i.e. error) in the category label of a single training example.
• A more robust alternative is to find the k most similar examples and return the majority category of these k examples.
• The value of k is typically odd to avoid ties; 3 and 5 are most common.

3 Nearest Neighbor Illustration (Euclidean Distance)
(figure slide: points in a 2-D instance space)

Similarity Metrics
• The nearest-neighbor method depends on a similarity (or distance) metric.
• The simplest for a continuous m-dimensional instance space is Euclidean distance.
• The simplest for an m-dimensional binary instance space is Hamming distance (the number of feature values that differ).
• For text, cosine similarity of TF-IDF weighted vectors is typically most effective.
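The metrics and the k-NN majority vote above can be sketched as follows; the function names and the toy 2-D data are my own illustrations, not from the slides.

```python
import math

def euclidean(x, y):
    """Distance for continuous m-dimensional instances."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming(x, y):
    """Number of differing feature values, for binary instances."""
    return sum(a != b for a, b in zip(x, y))

def knn_classify(train, x, k, dist):
    """k-NN: majority label among the k nearest training examples."""
    neighbors = sorted(train, key=lambda ex: dist(ex[0], x))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

# Two clusters of labeled points in a continuous 2-D instance space.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((4.0, 4.0), "B"), ((4.2, 3.9), "B")]
print(knn_classify(train, (1.1, 1.0), k=3, dist=euclidean))  # -> A
```

With k=1 a single mislabeled point near the query could flip the answer; with k=3 the two correct neighbors outvote it, which is the robustness argument the slide makes.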

Illustration of 3 Nearest Neighbor for Text
(figure slide)

K Nearest Neighbor for Text
Training:
  For each training example <x, c(x)> ∈ D:
    Compute the corresponding TF-IDF vector, d_x, for document x
Test instance y:
  Compute the TF-IDF vector d for document y
  For each <x, c(x)> ∈ D:
    Let s_x = cosSim(d, d_x)
  Sort the examples x in D by decreasing value of s_x
  Let N be the first k examples in D   (get most similar neighbors)
  Return the majority class of the examples in N

Rocchio Anomaly
• Prototype models have problems with polymorphic (disjunctive) categories.
• Cause: the strong bias of the Rocchio learner.

3 Nearest Neighbor Comparison
• Nearest neighbor tends to handle polymorphic categories better.

Nearest Neighbor Time Complexity
• Training time: O(|D| L_d) to compose the TF-IDF vectors.
• Testing time: O(L_t + |D| |V_t|) to compare to all training vectors.
– Assumes the lengths of the d_x vectors are computed and stored during training, allowing cosSim(d, d_x) to be computed in time proportional to the number of non-zero entries in d (i.e. |V_t|).
• Testing time can be high for large training sets.

Nearest Neighbor with Inverted Index
• Determining the k nearest neighbors is the same as determining the k best retrievals using the test document as a query to a database of training documents.
• Use standard VSR inverted-index methods to find the k nearest neighbors.
• Testing time: O(B |V_t|), where B is the average number of training documents in which a test-document word appears.
• Therefore, overall classification is O(L_t + B |V_t|).
– Typically B << |D|.
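The k-NN-for-text pseudocode can be sketched in Python with TF-IDF vectors and cosine similarity. This is a simplified illustration under my own assumptions: the IDF statistics are recomputed over the training set plus the test document, the brute-force scan is used rather than the inverted-index speedup, and the toy data and names are mine.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors for a list of tokenized documents.
    Simplification: IDF is computed over whatever list is passed in."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cos_sim(a, b):
    """Cosine similarity of two sparse TF-IDF vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_text(train_docs, labels, test_doc, k=3):
    """Score every training doc against the test doc, take the top k,
    and return their majority class, as in the slide's pseudocode."""
    vecs = tfidf_vectors(train_docs + [test_doc])
    d, train_vecs = vecs[-1], vecs[:-1]
    ranked = sorted(range(len(labels)), key=lambda i: -cos_sim(d, train_vecs[i]))
    top = [labels[i] for i in ranked[:k]]
    return max(set(top), key=top.count)

docs = ["cheap pills buy now".split(), "meeting agenda attached".split(),
        "buy cheap watches now".split(), "project meeting agenda".split()]
labels = ["spam", "ham", "spam", "ham"]
print(knn_text(docs, labels, "buy pills now".split(), k=3))  # -> spam
```

The brute-force loop here is the O(L_t + |D| |V_t|) test cost from the slides; replacing it with an inverted-index lookup would touch only the B documents sharing a term with the query.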
