ARTIFICIAL INTELLIGENCE: Supervised learning: classification



  1. Utrecht University, INFOB2KI 2019-2020, The Netherlands. ARTIFICIAL INTELLIGENCE, Supervised learning: classification. Lecturer: Silja Renooij. These slides are part of the INFOB2KI Course Notes, available from www.cs.uu.nl/docs/vakken/b2ki/schema.html


  3. Requirements. Supervised learning algorithms for classification learn the relation between class labels (the things to predict) and feature/attribute values (observable things). Various algorithms use probabilistic relations between class and features; the required probabilities can be assessed from the data using frequency counting.

  4. When to play tennis? Example dataset D: N=14 cases, 4 attributes (Outlook, Temperature, Humidity, Wind) and 1 class variable (PlayTennis).

  5. Frequency counting. Our example class variable PlayTennis (PT) has 2 possible values (yes, no); feature Temperature has 3 values (hot, mild, cool). With N=14 cases, frequency counting yields the following prior probabilities for the class labels: 9 of the 14 examples are positive, so p(PT=yes) = 9/14; the remaining 5 are negative, so p(PT=no) = 5/14. Similarly, the conditional probabilities for the features given the class can be determined. E.g. given PT=yes, out of those 9 cases 2 were in hot conditions, 4 in mild and 3 in cool, so p(Temp=hot | PT=yes) = 2/9, p(Temp=mild | PT=yes) = 4/9 and p(Temp=cool | PT=yes) = 3/9. (A counting sketch follows below.)
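As an illustration (not part of the original slides), here is a minimal Python sketch of frequency counting. The 14 rows below are the classic PlayTennis dataset (Quinlan/Mitchell); the slide's own data table is not reproduced in this text, so the rows are an assumption, but they are consistent with every count quoted on these slides.

```python
# Hedged sketch (not from the slides): frequency counting on the classic
# 14-case PlayTennis dataset, which matches the counts quoted on slides 5 and 9.
# Columns: Outlook, Temperature, Humidity, Wind, PlayTennis.
from collections import Counter
from fractions import Fraction

data = [
    ("sunny",    "hot",  "high",   "weak",   "no"),   # D1
    ("sunny",    "hot",  "high",   "strong", "no"),   # D2
    ("overcast", "hot",  "high",   "weak",   "yes"),  # D3
    ("rain",     "mild", "high",   "weak",   "yes"),  # D4
    ("rain",     "cool", "normal", "weak",   "yes"),  # D5
    ("rain",     "cool", "normal", "strong", "no"),   # D6
    ("overcast", "cool", "normal", "strong", "yes"),  # D7
    ("sunny",    "mild", "high",   "weak",   "no"),   # D8
    ("sunny",    "cool", "normal", "weak",   "yes"),  # D9
    ("rain",     "mild", "normal", "weak",   "yes"),  # D10
    ("sunny",    "mild", "normal", "strong", "yes"),  # D11
    ("overcast", "mild", "high",   "strong", "yes"),  # D12
    ("overcast", "hot",  "normal", "weak",   "yes"),  # D13
    ("rain",     "mild", "high",   "strong", "no"),   # D14
]
N = len(data)  # 14

# Prior p(PT): count the class labels and divide by N.
class_counts = Counter(row[-1] for row in data)
prior = {c: Fraction(k, N) for c, k in class_counts.items()}
print(prior)  # p(PT=yes) = 9/14, p(PT=no) = 5/14

# Conditional p(Temperature | PT=yes): count Temperature values among the 'yes' cases.
temp_counts_yes = Counter(row[1] for row in data if row[-1] == "yes")
p_temp_given_yes = {t: Fraction(k, class_counts["yes"]) for t, k in temp_counts_yes.items()}
print(p_temp_given_yes)  # hot: 2/9, mild: 4/9, cool: 3/9, as on slide 5
```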

  6. Supervised learning of the naive Bayes classifier. [Slide figure: a naive Bayes classifier, updated given the forecast.]

  7. Naive Bayes classifier: learning. A naive Bayes classifier specifies:
     - a class variable C
     - feature variables F1,…,Fn
     - a prior distribution p(C) (probabilities sum to one!)
     - conditional distributions p(Fi | C) (probabilities sum to one for each value C=c)
     The distributions p(C) and p(Fi | C) can be 'learned' from data, e.g. with the simple approach of frequency counting. A more sophisticated approach also learns the 'structure' of the model, i.e. determines which features to include; this requires a performance measure (e.g. accuracy).

  8. Naive Bayes classifier: use. A naive Bayes classifier predicts a most likely value c for class C given observed features Fi = fi from
     p(C=c | f1,…,fn) = 1/Z · p(C=c) · Π_i p(Fi=fi | C=c),
     where 1/Z = 1/p(F1,…,Fn) is a normalisation constant. This formula is based on Bayes' rule, p(A|B) = p(B|A)p(A)/p(B), and the naive assumption that all n feature variables are independent given the class variable. (The derivation is written out below.)
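For reference (not spelled out on the slide), the two derivation steps behind this formula, written in LaTeX; no symbols are used beyond those already defined above.

```latex
% Derivation of the classification rule (standard; Z = p(F_1,...,F_n)).
\begin{align*}
p(C \mid F_1,\dots,F_n)
  &= \frac{p(F_1,\dots,F_n \mid C)\, p(C)}{p(F_1,\dots,F_n)}
     && \text{Bayes' rule}\\
  &= \frac{1}{Z}\, p(C)\prod_{i=1}^{n} p(F_i \mid C)
     && \text{naive independence given } C\\
\hat{c}
  &= \arg\max_{c}\; p(C=c)\prod_{i=1}^{n} p(F_i = f_i \mid C=c)
     && 1/Z \text{ is the same for every } c
\end{align*}
```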

  9. Learn NBC - example. The model 'structure' is fixed; we just need the probabilities from the data. Class variable: PlayTennis (PT). Feature variables: Outlook (O), Temperature (T), Humidity (H), Wind (W). Class priors P(C) = P(PlayTennis): p(PlayTennis=yes) = 9/14, p(PT=no) = 5/14. Conditionals p(Fi | C), all based on frequency counting:

                    PT=yes   PT=no
        O=sunny      2/9      3/5
        O=overcast   4/9      0
        O=rain       3/9      2/5
        T=hot        2/9      2/5
        T=mild       4/9      2/5
        T=cool       3/9      1/5
        H=high       3/9      4/5
        H=normal     6/9      1/5
        W=weak       6/9      2/5
        W=strong     3/9      3/5

  10. Classify with NBC - example. Using the class priors and the conditional probability table from slide 9, classify the 'instance' e = <O=sunny, T=hot, H=normal, W=weak>:
      p(PT=yes | e) = 1/Z · 9/14 · 2/9 · 2/9 · 6/9 · 6/9 = 1/Z · 0.01411
      p(PT=no | e)  = 1/Z · 5/14 · 3/5 · 2/5 · 1/5 · 2/5 = 1/Z · 0.00686
      Since 0.01411 > 0.00686, the classifier predicts PT=yes. (A numeric check follows below.)
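A minimal Python check of this computation (not from the slides); the priors and the relevant conditionals are hard-coded from the slide 9 table.

```python
# Hedged sketch (not from the slides): numeric check of the slide 10 computation.
prior = {"yes": 9/14, "no": 5/14}
likelihood = {  # p(observed feature value | PT=c) for e = <O=sunny, T=hot, H=normal, W=weak>
    "yes": {"O=sunny": 2/9, "T=hot": 2/9, "H=normal": 6/9, "W=weak": 6/9},
    "no":  {"O=sunny": 3/5, "T=hot": 2/5, "H=normal": 1/5, "W=weak": 2/5},
}

score = {}
for c in prior:
    s = prior[c]
    for p in likelihood[c].values():
        s *= p
    score[c] = s
print(score)                      # yes: ~0.01411, no: ~0.00686 (unnormalised, i.e. up to 1/Z)
print(max(score, key=score.get))  # 'yes'

# Normalising by Z = sum of the scores gives the posterior; the winner is unchanged.
Z = sum(score.values())
print({c: s / Z for c, s in score.items()})  # yes: ~0.673, no: ~0.327
```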

  11. NBC properties:
      - NBC learning is complete (probabilistic: it can handle inconsistencies in the data).
      - NBC learning is not optimal (the independence assumptions are unrealistic, so the class posterior is often unreliable; yet prediction of the most likely value is often accurate).
      - Time and space complexity: the independence assumptions strongly reduce dimensionality.
      - NBC can overfit on the training data (especially with a large number of features).
      - NBC has been further optimized: TAN/FAN/KDB.

  12. Decision tree learning. Supervised learning of a decision tree classifier by means of 'splitting' on attributes: 1. What is that? 2. How to split? (ID3)

  13. Example data set: when to play tennis, again.

  14. Decision tree splits - I. Let's start building the tree from scratch: we first need to decide on which attribute to make a decision. Say we select 'Humidity' (NB using ID3, this choice will be made by the algorithm); we split the data according to the attribute's values:
      Humidity = high:   D1, D2, D3, D4, D8, D12, D14
      Humidity = normal: D5, D6, D7, D9, D10, D11, D13
      (A splitting sketch follows below.)
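As an illustration (not part of the slides), a minimal sketch of the split operation; the Humidity value of each case is taken from the partition shown above, and the function name `split` is an assumption.

```python
# Hedged sketch (not from the slides): splitting the cases D1..D14 on Humidity.
from collections import defaultdict

humidity = {
    "D1": "high",   "D2": "high",   "D3": "high",   "D4": "high",
    "D5": "normal", "D6": "normal", "D7": "normal", "D8": "high",
    "D9": "normal", "D10": "normal", "D11": "normal", "D12": "high",
    "D13": "normal", "D14": "high",
}

def split(cases, value_of):
    """Group case ids by the value the chosen attribute takes for each case."""
    groups = defaultdict(list)
    for case in cases:
        groups[value_of[case]].append(case)
    return dict(groups)

print(split(humidity.keys(), humidity))
# {'high': [D1, D2, D3, D4, D8, D12, D14], 'normal': [D5, D6, D7, D9, D10, D11, D13]}
```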

  15. Decision tree splits - II. Now let's split the first subset (Humidity=high: D1, D2, D3, D4, D8, D12, D14) using attribute 'Wind':
      Humidity = high, Wind = strong: D2, D12, D14
      Humidity = high, Wind = weak:   D1, D3, D4, D8
      (The subset Humidity = normal: D5, D6, D7, D9, D10, D11, D13 is still to be split.)

  16. Decision tree splits - III. Now let's split the subset Humidity=high & Wind=strong (D2, D12, D14) using attribute 'Outlook':
      Outlook = sunny:    No
      Outlook = overcast: Yes
      Outlook = rain:     No
      The entire subset is now classified.

  17. Decision tree splits - IV. Now let's split the subset Humidity=high & Wind=weak (D1, D3, D4, D8) using attribute 'Outlook':
      Outlook = sunny:    No
      Outlook = overcast: Yes
      Outlook = rain:     Yes

  18. Decision tree splits - V. Now let's split the subset Humidity=normal (D5, D6, D7, D9, D10, D11, D13) using 'Outlook':
      Outlook = sunny:    Yes
      Outlook = overcast: Yes
      Outlook = rain:     D5, D6, D10 (still to be split)

  19. Decision tree splits - VI. Now let's split the subset Humidity=normal & Outlook=rain (D5, D6, D10) using 'Wind':
      Wind = weak:   Yes
      Wind = strong: No
      All leaves are now classified.

  20. Final decision tree. Note: the decision tree can be expressed as a set of if-then-else sentences, or, in case of binary outcomes, as a logical formula (here, the disjunction of the paths that lead to Yes):
      (humidity=high ∧ wind=strong ∧ outlook=overcast) ∨
      (humidity=high ∧ wind=weak ∧ outlook=overcast) ∨
      (humidity=high ∧ wind=weak ∧ outlook=rain) ∨
      (humidity=normal ∧ outlook=sunny) ∨
      (humidity=normal ∧ outlook=overcast) ∨
      (humidity=normal ∧ outlook=rain ∧ wind=weak)
      [Slide figure: the complete tree, with Humidity at the root, Wind and then Outlook below the high branch, and Outlook (with Wind below rain) below the normal branch. An if-then-else rendering follows below.]
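As the note above suggests, the tree can be written directly as if-then-else sentences. A minimal sketch (not from the slides); the function name `play_tennis` and the lowercase string values are assumptions for illustration.

```python
# Hedged sketch (not from the slides): the final tree of slide 20 as nested if/else.
def play_tennis(outlook, humidity, wind):
    """Follow the tree: Humidity at the root, then Wind/Outlook as on slide 20."""
    if humidity == "high":
        if wind == "strong":
            return "Yes" if outlook == "overcast" else "No"
        else:  # wind == "weak"
            return "Yes" if outlook in ("overcast", "rain") else "No"
    else:  # humidity == "normal"
        if outlook in ("sunny", "overcast"):
            return "Yes"
        else:  # outlook == "rain"
            return "Yes" if wind == "weak" else "No"

# The 'unseen' instance of slides 21-22; Temperature is not used by this tree.
print(play_tennis(outlook="sunny", humidity="normal", wind="weak"))  # -> Yes
```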

  21. Classifying with decision trees. Now classify the instance <O=sunny, T=hot, H=normal, W=weak> with the final tree: follow the branch Humidity=normal, then Outlook=sunny.

  22. Classifying with decision trees (cont.). The path Humidity=normal, Outlook=sunny ends in a Yes leaf, so the instance is classified as PlayTennis=yes. Note that this was an 'unseen' instance (not in the data).

  23. Alternative decision trees. Another tree can be built from the same data, using different attributes; in fact, we can build quite a large number of (unique) decision trees. So which attribute should we choose at each branch?

  24. ID3: an entropy-based decision tree learner.

  25. Entropy. A measure of the disorder or randomness in a closed system with variable(s) of interest S:
      Entropy(S) = − Σ_{i=1..n} p_i · log2(p_i)
      where n = |S| is the number of values of S and p_i is the probability of the i-th value.
      - Convention: 0 · log2(0) = 0.
      - For a degenerate distribution, the entropy will be 0 (why?).
      - For a uniform distribution, the entropy will be log2(n) (= 1 for a binary-valued variable).
      - Recall: log2(x) = log_b(x) / log_b(2) for any base-b logarithm.

  26. Entropy: example. In our system we have one variable of interest (S = PlayTennis), with 2 possible values (yes, no), so n = |S| = 2. Let p+ = p(PT=yes) and p− = p(PT=no); we again use frequency counting to establish these probabilities from the data. Recall: 9 out of N=14 examples are positive, so p+ = 9/14; the other 5 are negative, so p− = 5/14. Then
      Entropy(PlayTennis) = − p+ · log2(p+) − p− · log2(p−)
                          = −(9/14) · log2(9/14) − (5/14) · log2(5/14) = 0.940
      (A numeric check follows below.)
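A minimal Python check of this value (not from the slides), using the entropy definition from slide 25.

```python
# Hedged sketch (not from the slides): numeric check of Entropy(PlayTennis).
from math import log2

def entropy(probabilities):
    """-sum p*log2(p) in bits, with the convention 0*log2(0) = 0."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([9/14, 5/14]))  # ~0.9403, the slide's 0.940
print(entropy([1.0, 0.0]))    # degenerate distribution -> 0.0
print(entropy([0.5, 0.5]))    # uniform binary distribution -> 1.0
```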

  27. Conditional & expected entropy. The conditional entropy Entropy(S | X) is the entropy we expect in a system S when another variable X is given; it is the expected value of the entropy over the possible values x of X:
      Entropy(S | X) = Σ_x p(X=x) · Entropy(S | X=x)
      where, for a specific value x of X:
      Entropy(S | X=x) = − Σ_{i=1..n} p(s_i | x) · log2(p(s_i | x)),  with n = |S|.
      NB We will use the following short-hand notations (!):
      - Entropy(S_X) for Entropy(S | X) = conditional entropy
      - Entropy(S_x) for Entropy(S | X=x) = entropy given a specific x
      (A sketch computing a conditional entropy from the data follows below.)
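As an illustration (not from the slides), a sketch of Entropy(PlayTennis | Wind); the per-value class counts are read off the table on slide 9 (Wind=weak: 6 yes / 2 no; Wind=strong: 3 yes / 3 no), and the resulting value 0.892 is a computed check rather than a number quoted on the slides.

```python
# Hedged sketch (not from the slides): Entropy(PT | Wind), using the class counts
# per Wind value implied by slide 9's table (weak: 6 yes / 2 no, strong: 3 yes / 3 no).
from math import log2

def entropy(probabilities):
    return -sum(p * log2(p) for p in probabilities if p > 0)

wind_counts = {"weak": (6, 2), "strong": (3, 3)}       # {value: (#yes, #no)}
N = sum(yes + no for yes, no in wind_counts.values())  # 14 cases in total

cond_entropy = 0.0
for value, (yes, no) in wind_counts.items():
    total = yes + no
    # p(Wind = value) * Entropy(PT | Wind = value)
    cond_entropy += (total / N) * entropy([yes / total, no / total])

print(cond_entropy)  # ~0.892, less than Entropy(PT) = 0.940: observing Wind is
                     # expected to reduce the disorder in PlayTennis
```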
