Utrecht University, The Netherlands, INFOB2KI 2019-2020
ARTIFICIAL INTELLIGENCE
Supervised learning: classification
Lecturer: Silja Renooij
These slides are part of the INFOB2KI Course Notes, available from www.cs.uu.nl/docs/vakken/b2ki/schema.html
Requirements
Supervised learning algorithms for classification learn the relation between class labels (the things to predict) and feature/attribute values (observable things). Various algorithms use probabilistic relations between class and features; the required probabilities can be assessed from the data using frequency counting.
When to play tennis?
Example dataset D: N=14 cases, 4 attributes, 1 class variable
Frequency counting
Our example class variable PlayTennis (PT) has 2 possible values (yes, no); feature Temperature has 3 values (hot, mild, cool). With N=14 cases, frequency counting gives the following prior probabilities for the class labels:
9 out of N=14 examples are positive, so p(PT=yes) = 9/14
5 of these 14 are negative, so p(PT=no) = 5/14
Similarly, the conditional probabilities for the features given the class can be determined. E.g. given PT=yes, we find that out of the 9 cases, 2 were in hot conditions, 4 in mild and 3 in cool:
p(Temp=hot | PT=yes) = 2/9
p(Temp=mild | PT=yes) = 4/9
p(Temp=cool | PT=yes) = 3/9
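These counts can be reproduced mechanically. A minimal sketch in Python, using exact fractions; the list of (Temperature, PlayTennis) pairs below is constructed to match the counts on this slide (2/4/3 for yes; 2/2/1 for no, per the conditional table later in these notes), not copied verbatim from the dataset:

```python
from collections import Counter
from fractions import Fraction

# (Temperature, PlayTennis) pairs matching the slide's counts:
# 9 positive cases (2 hot, 4 mild, 3 cool) and 5 negative cases.
cases = ([("hot", "yes")] * 2 + [("mild", "yes")] * 4 + [("cool", "yes")] * 3
         + [("hot", "no")] * 2 + [("mild", "no")] * 2 + [("cool", "no")] * 1)

N = len(cases)
class_counts = Counter(pt for _, pt in cases)

# Prior probabilities: relative frequency of each class label.
prior = {c: Fraction(k, N) for c, k in class_counts.items()}

# Conditional probabilities p(Temp=t | PT=c): counts within each class.
pair_counts = Counter(cases)
cond = {(t, c): Fraction(pair_counts[(t, c)], class_counts[c])
        for (t, c) in pair_counts}

print(prior["yes"], prior["no"])   # 9/14 5/14
print(cond[("hot", "yes")])        # 2/9
```

Using Fraction rather than floats keeps the probabilities exact, so they can be compared directly with the fractions on the slides.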
Naïve Bayes classifier
Supervised learning of a naive Bayes classifier
Naive Bayes classifier: learning
A naive Bayes classifier specifies:
- a class variable C
- feature variables F1,…,Fn
- a prior distribution p(C) (probabilities sum to one!)
- conditional distributions p(Fi | C) (probabilities sum to one for each C=c)
Distributions p(C) and p(Fi | C) can be 'learned' from data, e.g. by the simple approach of frequency counting. A more sophisticated approach also learns the 'structure' of the model, i.e. determines which features to include; this requires a performance measure (e.g. accuracy).
Naive Bayes classifier: use
A naive Bayes classifier predicts a most likely value c for class C given observed features Fi = fi from:

p(C=c | f1,…,fn) = 1/Z * p(C=c) * ∏i p(Fi=fi | C=c)

where 1/Z = 1/p(F1,…,Fn) is a normalisation constant. This formula is based on Bayes' rule, p(A|B) = p(B|A)p(A)/p(B), and the naive assumption that all n feature variables are independent given the class variable.
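The formula translates directly into a scoring function. A hedged sketch (function names and the dictionary layout are our own, not from the course notes):

```python
from math import prod

def nb_scores(priors, conditionals, instance):
    """Unnormalised posterior scores p(c) * prod_i p(f_i | c), one per class.

    priors: dict class -> p(class)
    conditionals: dict (feature, value, class) -> p(value | class)
    instance: dict feature -> observed value
    """
    return {c: p * prod(conditionals[(f, v, c)] for f, v in instance.items())
            for c, p in priors.items()}

def nb_predict(priors, conditionals, instance):
    # The constant 1/Z cancels in the argmax, so no normalisation is needed
    # to find the most likely class.
    scores = nb_scores(priors, conditionals, instance)
    return max(scores, key=scores.get)
```

Note the design choice: since Z is the same for every class, comparing unnormalised scores already identifies the most likely value.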
Learn NBC - example
Model 'structure' is fixed; we just need the probabilities from the data.
Class variable: PlayTennis (PT)
Feature variables: Outlook (O), Temp. (T), Humidity (H), Wind (W)

Class priors: p(PT=yes) = 9/14, p(PT=no) = 5/14

Conditionals p(Fi | C):   PT=yes   PT=no
O=sunny                    2/9      3/5
O=overcast                 4/9      0
O=rain                     3/9      2/5
T=hot                      2/9      2/5
T=mild                     4/9      2/5
T=cool                     3/9      1/5
H=high                     3/9      4/5
H=normal                   6/9      1/5
W=weak                     6/9      2/5
W=strong                   3/9      3/5

All probabilities are based on frequency counting.
Classify with NBC - example
Using the priors and conditionals from the previous slide, classify 'instance' e = <O=sunny, T=hot, H=normal, W=weak>:
p(PT=yes | e) = 1/Z * 9/14 * 2/9 * 2/9 * 6/9 * 6/9 = 1/Z * 0.01411
> p(PT=no | e) = 1/Z * 5/14 * 3/5 * 2/5 * 1/5 * 2/5 = 1/Z * 0.00686
so the classifier predicts PT=yes.
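The arithmetic on this slide can be checked in a few lines; the fractions below are taken directly from the slide's tables:

```python
# Unnormalised scores for e = <O=sunny, T=hot, H=normal, W=weak>:
# prior * p(O=sunny|c) * p(T=hot|c) * p(H=normal|c) * p(W=weak|c)
score_yes = (9 / 14) * (2 / 9) * (2 / 9) * (6 / 9) * (6 / 9)
score_no = (5 / 14) * (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)

print(round(score_yes, 5))  # 0.01411
print(round(score_no, 5))   # 0.00686

# 1/Z is the same for both classes, so it cancels in the comparison.
prediction = "yes" if score_yes > score_no else "no"
```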
NBC Properties
- NBC learning is complete (probabilistic: can handle inconsistencies in the data)
- NBC learning is not optimal (unrealistic independence assumptions; the class posterior is often unreliable, yet prediction of the most likely value is accurate)
- Time and space complexity: the independence assumptions strongly reduce dimensionality
- NBC can overfit on the training data (especially with a large number of features)
- NBC has been further optimised: TAN/FAN/KDB
Decision tree learning
Supervised learning of a decision tree classifier by means of 'splitting' on attributes:
1. What is that?
2. How to split? (ID3)
Example data set: when to play tennis, again
Decision Tree splits - I
Let's start building the tree from scratch. We first need to decide on which attribute to make a decision. Say we select "Humidity" (1); split the data according to the attribute's values:

Humidity = normal: D5, D6, D7, D9, D10, D11, D13
Humidity = high:   D1, D2, D3, D4, D8, D12, D14

(1) NB using ID3, this choice will be made by the algorithm…
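Splitting a dataset on an attribute is just partitioning the cases by that attribute's value. A minimal sketch; the case-to-humidity mapping below is the one implied by this slide's split:

```python
from collections import defaultdict

# Humidity value per case, consistent with the split shown on the slide.
humidity = {"D1": "high", "D2": "high", "D3": "high", "D4": "high",
            "D5": "normal", "D6": "normal", "D7": "normal", "D8": "high",
            "D9": "normal", "D10": "normal", "D11": "normal",
            "D12": "high", "D13": "normal", "D14": "high"}

def split_on(attribute_of, cases):
    """Partition case ids into subsets, one per attribute value."""
    subsets = defaultdict(list)
    for case in cases:
        subsets[attribute_of[case]].append(case)
    return dict(subsets)

subsets = split_on(humidity, [f"D{i}" for i in range(1, 15)])
print(subsets["high"])  # ['D1', 'D2', 'D3', 'D4', 'D8', 'D12', 'D14']
```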
Decision Tree splits - II
Now let's split the first subset (H=high: D1, D2, D3, D4, D8, D12, D14) using attribute "Wind":

Humidity = normal: D5, D6, D7, D9, D10, D11, D13
Humidity = high → Wind
  Wind = strong: D2, D12, D14
  Wind = weak:   D1, D3, D4, D8
Decision Tree splits - III
Now let's split the subset H=high & W=strong (D2, D12, D14) using attribute "Outlook":
  Outlook = sunny (D2): No
  Outlook = overcast (D12): Yes
  Outlook = rain (D14): No
The entire subset is now classified.
Decision Tree splits - IV
Now let's split the subset H=high & W=weak (D1, D3, D4, D8) using attribute "Outlook":
  Outlook = sunny (D1, D8): No
  Outlook = overcast (D3): Yes
  Outlook = rain (D4): Yes
Decision Tree splits – V
Now let's split the subset H=normal (D5, D6, D7, D9, D10, D11, D13) using "Outlook":
  Outlook = sunny (D9, D11): Yes
  Outlook = overcast (D7, D13): Yes
  Outlook = rain: D5, D6, D10 (not yet classified)
Decision Tree splits – VI
Now let's split the subset H=normal & O=rain (D5, D6, D10) using "Wind":
  Wind = weak (D5, D10): Yes
  Wind = strong (D6): No
All cases are now classified.
Final Decision Tree
Humidity
  normal → Outlook
    sunny → Yes
    overcast → Yes
    rain → Wind
      weak → Yes
      strong → No
  high → Wind
    strong → Outlook
      sunny → No
      overcast → Yes
      rain → No
    weak → Outlook
      sunny → No
      overcast → Yes
      rain → Yes

Note: the decision tree can be expressed as an expression of if-then-else sentences, or, in case of binary outcomes, a logical formula. Here PlayTennis = yes exactly when:
(humidity=high ∧ wind=strong ∧ outlook=overcast)
∨ (humidity=high ∧ wind=weak ∧ outlook=overcast)
∨ (humidity=high ∧ wind=weak ∧ outlook=rain)
∨ (humidity=normal ∧ outlook=sunny)
∨ (humidity=normal ∧ outlook=overcast)
∨ (humidity=normal ∧ outlook=rain ∧ wind=weak)
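As the note says, the tree is just nested if-then-else statements. A direct transcription in Python; the leaf labels follow this slide's logical formula, which lists exactly the six paths that lead to Yes:

```python
def play_tennis(humidity, outlook, wind):
    """The final decision tree as nested if-then-else statements.

    Returns 'Yes' exactly for the six conjunctions in the slide's formula.
    """
    if humidity == "normal":
        if outlook in ("sunny", "overcast"):
            return "Yes"
        # outlook == "rain": decide on wind
        return "Yes" if wind == "weak" else "No"
    # humidity == "high"
    if wind == "strong":
        return "Yes" if outlook == "overcast" else "No"
    # wind == "weak"
    return "No" if outlook == "sunny" else "Yes"

print(play_tennis("normal", "sunny", "weak"))  # Yes
```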
Classifying with Decision Trees
Now classify instance <O=sunny, T=hot, H=normal, W=weak>: follow the branch Humidity=normal, then Outlook=sunny, which leads to the leaf Yes.
Note that this was an 'unseen' instance (not in the data).
Alternative Decision Trees
Another tree can be built from the same data, using different attributes. In fact, we can build quite a large number of (unique) decision trees… So which attribute should we choose to split on at each branch?
ID3: an entropy-based decision tree learner
Entropy
A measure of the disorder or randomness in a closed system with variable of interest S:

Entropy(S) = − Σ_{i=1..n} p_i log2 p_i

where n = |S| is the number of values of S and p_i is the probability of the i-th value. Convention: 0 log2 0 = 0.
- For a degenerate distribution, the entropy will be 0 (why?)
- For a uniform distribution, the entropy will be log2 n (= 1 for a binary-valued variable)
Recall: log2 x = log_b x / log_b 2 for any base-b logarithm.
Entropy: example
In our system we have one variable of interest (S = PlayTennis), with 2 possible values (yes, no), so n = |S| = 2. Let p+ = p(PT=yes) and p− = p(PT=no); we again use frequency counting to establish these probabilities from the data. Recall:
9 out of N=14 examples are positive, so p+ = 9/14
5 of these 14 are negative, so p− = 5/14
Entropy(PlayTennis) = −p+ log2 p+ − p− log2 p−
= −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
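The value 0.940 can be verified numerically:

```python
from math import log2

def entropy(probabilities):
    """Entropy in bits: -sum p log2 p, with the convention 0 log2 0 = 0."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

h = entropy([9 / 14, 5 / 14])
print(f"{h:.3f}")  # 0.940
```

The guard `if p > 0` implements the 0 log2 0 = 0 convention, so degenerate distributions like [1.0, 0.0] correctly give entropy 0.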
Conditional & Expected Entropy
The conditional entropy Entropy(S | X) is the entropy we expect in a system S when another variable X is given; it is the expected value of the entropy over the possible values x of X:

Entropy(S | X) = Σ_x p(X=x) · Entropy(S | X=x)

where for a specific value x of X:

Entropy(S | X=x) = − Σ_{i=1..n} p(s_i | x) log2 p(s_i | x), with n = |S|

NB We will use the following short-hand notations (!):
• Entropy(S_X) for Entropy(S | X) = conditional entropy
• Entropy(S_x) for Entropy(S | X=x) = entropy given a specific x
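Combined with the entropy definition, this gives the information gain that ID3 uses to choose a split attribute. A sketch using the Humidity split from the earlier slides; the per-subset class counts (H=high: 3 yes / 4 no, H=normal: 6 yes / 1 no) follow from the conditional table p(H=high | yes) = 3/9, p(H=high | no) = 4/5, etc.:

```python
from math import log2

def entropy(probs):
    # -sum p log2 p, skipping zero probabilities (0 log2 0 = 0).
    return -sum(p * log2(p) for p in probs if p > 0)

def conditional_entropy(splits, total):
    """Entropy(S|X) = sum over values x of p(x) * Entropy(S|X=x).

    splits: list of (n_pos, n_neg) class counts, one pair per value x of X.
    """
    return sum((pos + neg) / total
               * entropy([pos / (pos + neg), neg / (pos + neg)])
               for pos, neg in splits)

N = 14
h_prior = entropy([9 / 14, 5 / 14])                     # ~0.940
# Humidity: high -> 3 yes / 4 no, normal -> 6 yes / 1 no
h_given_humidity = conditional_entropy([(3, 4), (6, 1)], N)
gain = h_prior - h_given_humidity                       # ~0.15
```

Information gain is simply the expected drop in entropy from making the split; ID3 picks the attribute with the largest gain.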