  1. MACHINE LEARNING Introduction Alessandro Moschitti Department of Computer Science and Information Engineering University of Trento Email: moschitti@disi.unitn.it

  2. Course Schedule (revised) • 27 April, 9:30-12:30, Garda (Introduction to Machine Learning: Decision Trees and Bayesian Classifiers) • 2 May, 14:30-18:30, Ofek (Introduction to Statistical Learning Theory, Vector Space Model) • 4 May, 9:30-12:30, Ofek (Linear Classifiers) • 28 May, 9:30-12:30, Ofek (VC Dimension, Perceptron and Support Vector Machines) • 29 May, 9:30-12:30, Garda (Kernel Methods for NLP Applications)

  3. Lectures • Introduction to ML • Decision Trees • Bayesian Classifiers • Vector Spaces • Vector Space Categorization • Feature design, selection and weighting • Document representation • Category learning: Rocchio and KNN • Measuring performance • From binary to multi-class classification

  4. Lectures • PAC Learning • VC Dimension • Perceptron • Vector Space Model • Representer Theorem • Support Vector Machines (SVMs) • Hard/Soft Margin (Classification) • Regression and Ranking

  5. Lectures • Kernel Methods • Theory and algebraic properties • Linear, polynomial and Gaussian kernels • Kernel construction • Kernels for structured data • Sequence and tree kernels • Structured output

  6. Reference Book + some articles

  7. Today • Introduction to Machine Learning • Vector Spaces

  8. Why Learn Functions Automatically? • Anything is a function • From the planets' motion • To the input/output actions in your computer • Any problem could then be solved automatically

  9. More concretely • Given the user requirements (input/output relations), we write programs • The different cases are typically handled with if-then rules applied to the input variables • What happens when millions of variables are present and/or the values are not reliable (e.g. noisy data)? • Machine learning writes the program (the rules) for you

  10. What is Statistical Learning? • Statistical methods: algorithms that learn relations in the data from examples • Simple relations are expressed by pairs of variables: ⟨x_1, y_1⟩, ⟨x_2, y_2⟩, ..., ⟨x_n, y_n⟩ • Learn f such that we can evaluate y* for a new value x*, i.e. ⟨x*, f(x*)⟩ = ⟨x*, y*⟩

  11. You have already tackled the learning problem [scatter plot of y vs. x]

  12. Linear Regression [linear fit of y vs. x]

  13. Degree 2 [quadratic fit of y vs. x]

  14. Higher degree [higher-degree polynomial fit of y vs. x]

  15. Machine Learning Problems • Overfitting • How do we deal with millions of variables instead of only two? • How do we deal with real-world objects instead of real values?
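The regression slides above (12-15) can be replayed in a few lines of code. The following is a minimal sketch, not part of the original slides, assuming NumPy is available: it fits polynomials of degree 1, 2 and 9 to a small noisy sample and shows that the highest-degree fit drives the training error to (near) zero by fitting the noise, which is exactly the overfitting problem named above.

# Minimal sketch (not from the slides): polynomial fits of increasing degree.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 2 * x + 0.3 * rng.standard_normal(x.size)   # noisy linear relation

for degree in (1, 2, 9):
    coeffs = np.polyfit(x, y, deg=degree)        # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    train_error = np.mean((y - y_hat) ** 2)
    print(f"degree {degree}: training MSE = {train_error:.4f}")
# The degree-9 polynomial reaches (near-)zero training error by fitting the
# noise: low training error does not mean good predictions on a new x*.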

  16. Learning Models • Real values: regression • Finite and integer values: classification • Binary classifiers: 2 classes, e.g. f(x) → {cats, dogs}

  17. Decision Trees

  18. Decision Tree (Dogs vs. Cats) [tree diagram; the exact branch layout is not fully recoverable] Internal nodes test "Taller than 50 cm?", "Short hair?", "Mustaches?" with yes/no branches; the leaves output Dog or Cat
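As a concrete illustration, here is a hypothetical Python rendering of the tree sketched above. The branch routing could not be fully recovered from the slide, so the version below is only one plausible reading, not the original diagram.

# Hypothetical hand-written version of the slide's dog/cat tree; the routing
# is one plausible reading of the (partially garbled) diagram, not the original.
def classify(taller_than_50cm: bool, short_hair: bool, has_mustaches: bool) -> str:
    if taller_than_50cm:
        return "dog"                 # cats are rarely taller than 50 cm
    if not short_hair:
        return "dog"                 # long hair -> dog in this toy tree
    return "cat" if has_mustaches else "dog"

print(classify(taller_than_50cm=False, short_hair=True, has_mustaches=True))  # cat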

  19. Mustaches or Whiskers • They are an important orientation tool for both dogs and cats • All dogs and cats have them ⟾ their presence is not a good feature • We may use their length instead • What about mustaches?

  20. Mustaches?

  21. END

  22. Entropy-based feature selection • Entropy of the class distribution P(C_i): H(S) = Σ_i P(C_i) · log(1/P(C_i)) • It measures how uniform the distribution is • Given sets S_1, ..., S_n obtained by partitioning S with respect to a feature, the overall entropy combines the subset entropies H(S_1), ..., H(S_n), as in the examples below

  23. Example: cats and dogs classification (S_0) • p(dog) = p(cat) = 4/8 = ½ • H(S_0) = 2 · (½ · log_2 2) = 1
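The value H(S_0) = 1 can be checked with a small helper. The code below is an illustrative sketch (not from the slides) that computes the base-2 entropy of a class-count vector.

# Assumed helper: base-2 entropy of a class distribution given raw counts.
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

print(entropy([4, 4]))   # 1.0 -> H(S_0) for p(dog) = p(cat) = 1/2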

  24. Has the animal more than 6 siblings? (S_0 split into S_1, S_2) • p(dog) = p(cat) = 2/4 = ½ in both subsets • H(S_1) = H(S_2) = ¼ · [½ · log_2(2) · 2] = 0.25 • All(S_1, S_2) = 2 · 0.25 = 0.5

  25. Does the animal have short hair? (S_0 split into S_1, S_2) • p(dog) = 1/4, p(cat) = 3/4 • H(S_2) = H(S_1) = ¼ · [(1/4) · log_2(4) + (3/4) · log_2(4/3)] = ¼ · [½ + 0.31] = ¼ · 0.81 = 0.20 • All(S_1, S_2) = 0.20 · 2 = 0.40 (note that |S_1| = |S_2|)

  26. Follow-up • The hair-length feature is better than the number of siblings, since 0.40 is lower than 0.50 • Test all the features • Choose the best one • Recurse with a new feature on each of the subsets induced by the best feature (see the sketch below)
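The sketch below illustrates the greedy selection step just described. It scores each candidate split by the standard weighted entropy Σ_i (|S_i|/|S|) · H(S_i); the absolute scores therefore differ from the slides' 0.5 and 0.40 by a constant factor, but the ranking is the same: the short-hair split beats the siblings split.

# Greedy step: score each candidate split by the weighted entropy of the
# subsets it induces and keep the lowest (standard |S_i|/|S| weighting).
from math import log2

def entropy(counts):
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

def split_entropy(subsets):
    # subsets: list of per-subset class-count lists, e.g. [[2, 2], [2, 2]]
    total = sum(sum(s) for s in subsets)
    return sum((sum(s) / total) * entropy(s) for s in subsets)

splits = {
    "more than 6 siblings": [[2, 2], [2, 2]],   # (dogs, cats) in S_1 and S_2
    "short hair":           [[1, 3], [3, 1]],
}
for name, subsets in splits.items():
    print(name, round(split_entropy(subsets), 2))   # 1.0 vs. 0.81
best = min(splits, key=lambda name: split_entropy(splits[name]))
print("best feature:", best)                        # short hair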

  27. Probabilistic Classifier

  28. Probability (1) • Let Ω be a sample space and β a collection of subsets of Ω • β is a collection of events • A probability function P is defined as P: β → [0, 1]

  29. Definition of Probability • P is a function that associates each event E with a number P(E), called the probability of E, such that: 1) 0 ≤ P(E) ≤ 1; 2) P(Ω) = 1; 3) P(E_1 ∨ E_2 ∨ ... ∨ E_n ∨ ...) = Σ_{i=1..∞} P(E_i) whenever E_i ∧ E_j = ∅ for all i ≠ j

  30. Finite Partition and Uniform Distribution • Given a partition E_1, ..., E_n of Ω into n events, each with probability 1/n, and an event E that is a union of some of them, we can evaluate its probability as:
     P(E) = P(E ∧ E_tot) = P(E ∧ (E_1 ∨ E_2 ∨ ... ∨ E_n)) = Σ_{E_i ⊂ E} P(E ∧ E_i) = Σ_{E_i ⊂ E} P(E_i) = Σ_{E_i ⊂ E} 1/n = |{i : E_i ⊂ E}| / n = Target Cases / All Cases
     For example, the probability of rolling an even number with a fair die is 3/6 = ½.

  31. Conditional Probability • P(A | B) is the probability of A given B • B is the piece of information that we know • The following rule holds: P(A | B) = P(A ∧ B) / P(B)

  32. Independence • A and B are independent iff: P(A | B) = P(A) and P(B | A) = P(B) • If A and B are independent: P(A | B) = P(A ∧ B) / P(B) = P(A), hence P(A ∧ B) = P(A) · P(B)

  33. Bayes' Theorem • P(A | B) = P(B | A) · P(A) / P(B) • Proof: P(A | B) = P(A ∧ B) / P(B) (definition of conditional probability); P(B | A) = P(A ∧ B) / P(A) (definition of conditional probability), so P(A ∧ B) = P(B | A) · P(A); substituting, P(A | B) = P(B | A) · P(A) / P(B)
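A quick numeric sanity check of the theorem, on a made-up joint distribution over two binary events (all numbers below are assumptions for illustration only):

# Toy check: P(A|B) computed from the joint table equals P(B|A)P(A)/P(B).
joint = {            # P(A=a, B=b)
    (True, True): 0.24, (True, False): 0.06,
    (False, True): 0.14, (False, False): 0.56,
}
p_b = sum(p for (a, b), p in joint.items() if b)
p_a = sum(p for (a, b), p in joint.items() if a)
p_a_given_b = joint[(True, True)] / p_b          # definition of conditional prob.
p_b_given_a = joint[(True, True)] / p_a
print(p_a_given_b, p_b_given_a * p_a / p_b)      # both ≈ 0.6316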

  34. Bayesian Classifier • Given a set of categories {c_1, c_2, ..., c_n} • Let E be a description of the example to classify • The category of E can be derived using the following probability:
     P(c_i | E) = P(c_i) · P(E | c_i) / P(E)
     Σ_{i=1..n} P(c_i | E) = Σ_{i=1..n} P(c_i) · P(E | c_i) / P(E) = 1
     P(E) = Σ_{i=1..n} P(c_i) · P(E | c_i)

  35. Bayesian Classifier (cont.) • We need to compute: the prior probability P(c_i) and the conditional probability P(E | c_i) • P(c_i) can be estimated from the training set D: given n_i examples of type c_i in D, P(c_i) = n_i / |D| • Suppose that an example is represented by m features: E = e_1 ∧ e_2 ∧ ... ∧ e_m • The number of possible descriptions E is exponential in m, so there are not enough training examples to estimate P(E | c_i) directly

  36. Naïve Bayes Classifiers • The features are assumed to be independent given the category c_i: P(E | c_i) = P(e_1 ∧ e_2 ∧ ... ∧ e_m | c_i) = Π_{j=1..m} P(e_j | c_i) • This allows us to estimate only P(e_j | c_i) for each feature and category

  37. An example of the Naïve Bayes Classifier • C = {Allergy, Cold, Healthy} • e_1 = sneeze; e_2 = cough; e_3 = fever • E = {sneeze, cough, ¬fever}
     Prob             Healthy  Cold  Allergy
     P(c_i)           0.9      0.05  0.05
     P(sneeze | c_i)  0.1      0.9   0.9
     P(cough | c_i)   0.1      0.8   0.7
     P(fever | c_i)   0.01     0.7   0.4

  38. An example of the Naïve Bayes Classifier (cont.) • E = {sneeze, cough, ¬fever}, probabilities as in the table above
     P(Healthy | E) = (0.9)(0.1)(0.1)(0.99)/P(E) = 0.0089/P(E)
     P(Cold | E) = (0.05)(0.9)(0.8)(0.3)/P(E) = 0.01/P(E)
     P(Allergy | E) = (0.05)(0.9)(0.7)(0.6)/P(E) = 0.019/P(E)
     P(E) = 0.0089 + 0.01 + 0.019 = 0.0379
     P(Healthy | E) = 0.23, P(Cold | E) = 0.26, P(Allergy | E) = 0.50
     • The most probable category is Allergy
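The computation above can be replayed in a few lines. The sketch below uses only the table values from the slides (the variable names are mine); with unrounded intermediate products the normalized posteriors come out as roughly 0.23, 0.28 and 0.49, close to the slide's rounded 0.23, 0.26 and 0.50, and Allergy remains the most probable category.

# Replay of the sneeze/cough/no-fever example with the slide's table values.
priors = {"Healthy": 0.9, "Cold": 0.05, "Allergy": 0.05}
likelihood = {                      # P(feature | class)
    "sneeze": {"Healthy": 0.1,  "Cold": 0.9, "Allergy": 0.9},
    "cough":  {"Healthy": 0.1,  "Cold": 0.8, "Allergy": 0.7},
    "fever":  {"Healthy": 0.01, "Cold": 0.7, "Allergy": 0.4},
}
evidence = {"sneeze": True, "cough": True, "fever": False}

scores = {}
for c, prior in priors.items():
    score = prior
    for feature, present in evidence.items():
        p = likelihood[feature][c]
        score *= p if present else (1 - p)     # use 1 - P for absent features
    scores[c] = score                          # P(c) * P(E | c)

p_e = sum(scores.values())                     # ≈ 0.0386 (slide rounds to 0.0379)
for c, s in scores.items():
    print(c, round(s / p_e, 2))                # ≈ Healthy 0.23, Cold 0.28, Allergy 0.49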

  39. Probability Estimation • Estimate counts from the training data • Let n_i be the number of examples in c_i • Let n_ij be the number of examples of c_i containing the feature e_j; then P(e_j | c_i) = n_ij / n_i • Problem: the data set may still be too small • For a rare feature e_k we may have P(e_k | c_i) = 0 for all c_i

  40. Smoothing • Probabilities are estimated even for feature values not observed in the data • Laplace smoothing: each feature has an a priori probability p • We assume that the feature has been observed in a virtual sample of size m: P(e_j | c_i) = (n_ij + m·p) / (n_i + m)
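A minimal sketch of the smoothed estimate, with made-up counts, showing how a feature never observed with a class no longer receives zero probability:

# Smoothed estimate P(e_j | c_i) = (n_ij + m*p) / (n_i + m); counts are made up.
def smoothed_prob(n_ij: int, n_i: int, p: float, m: int) -> float:
    return (n_ij + m * p) / (n_i + m)

print(smoothed_prob(n_ij=0, n_i=50, p=0.1, m=10))   # 1/60 ≈ 0.0167
print(0 / 50)                                        # unsmoothed estimate: 0.0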

  41. Naïve Bayes for text classification • "Bag of words" model • The examples are documents labeled with categories • Features: the vocabulary V = {w_1, w_2, ..., w_m} • P(w_j | c_i) is the probability of observing w_j in category c_i • We use Laplace smoothing with a uniform prior (p = 1/|V|) and m = |V| • That is, each word is assumed to appear exactly once in each category

  42. Training (version 1) • V is built from all training documents D • For each category c_i ∈ C: let D_i be the subset of documents of D in c_i ⇒ P(c_i) = |D_i| / |D|; let n_i be the total number of words in D_i; for each w_j ∈ V, let n_ij be the count of w_j in D_i ⇒ P(w_j | c_i) = (n_ij + 1) / (n_i + |V|)

  43. Testing • Given a test document X • Let n be the number of words of X • The assigned category is: argmax_{c_i ∈ C} P(c_i) · Π_{j=1..n} P(a_j | c_i), where a_j is the word at the j-th position in X
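Putting the two slides together, the following is a compact sketch of "version 1" training and the argmax test rule. The tiny corpus and category names are made-up assumptions; log-probabilities replace the raw product to avoid numerical underflow on long documents, which does not change the argmax.

# Sketch of Naive Bayes text classification with Laplace smoothing
# P(w_j | c_i) = (n_ij + 1) / (n_i + |V|), on an illustrative toy corpus.
from collections import Counter, defaultdict
from math import log

train_docs = [                               # (category, tokenized document)
    ("sport", "great match great goal".split()),
    ("sport", "the team won the match".split()),
    ("politics", "the minister won the vote".split()),
]

vocab = {w for _, doc in train_docs for w in doc}          # V
docs_per_class = Counter(c for c, _ in train_docs)         # |D_i|
word_counts = defaultdict(Counter)                          # n_ij
for c, doc in train_docs:
    word_counts[c].update(doc)

def predict(tokens):
    best_class, best_score = None, float("-inf")
    for c in docs_per_class:
        n_i = sum(word_counts[c].values())                  # total words in c_i
        score = log(docs_per_class[c] / len(train_docs))    # log P(c_i)
        for w in tokens:
            score += log((word_counts[c][w] + 1) / (n_i + len(vocab)))  # smoothed
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict("great goal".split()))                        # -> sport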

  44. Part I: Abstract View of Statistical Learning Theory
