SLIDE 1

ARTIFICIAL INTELLIGENCE

Lecturer: Silja Renooij

Supervised learning: classification

Utrecht University The Netherlands

These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

INFOB2KI 2019-2020

SLIDE 3

Requirements

Supervised learning algorithms for classification learn the relation between

  • class‐labels (the things to predict) and
  • feature/attribute values (observable things).

Various algorithms use probabilistic relations between class and features. The required probabilities can be assessed from the data using frequency counting.

SLIDE 4

When to play tennis?

Example dataset D: N=14 cases, 4 attributes, 1 class variable (the standard PlayTennis dataset; the values below are consistent with all counts used on the later slides):

       Outlook   Temp  Humidity  Wind    PlayTennis
  D1   sunny     hot   high      weak    no
  D2   sunny     hot   high      strong  no
  D3   overcast  hot   high      weak    yes
  D4   rain      mild  high      weak    yes
  D5   rain      cool  normal    weak    yes
  D6   rain      cool  normal    strong  no
  D7   overcast  cool  normal    strong  yes
  D8   sunny     mild  high      weak    no
  D9   sunny     cool  normal    weak    yes
  D10  rain      mild  normal    weak    yes
  D11  sunny     mild  normal    strong  yes
  D12  overcast  mild  high      strong  yes
  D13  overcast  hot   normal    weak    yes
  D14  rain      mild  high      strong  no

SLIDE 5

Frequency counting

Our example class variable PlayTennis (PT) has 2 possible values (yes, no); feature Temperature has 3 values (hot, mild, cool). With N=14 cases, frequency counting gives the following prior probabilities for the class labels:

  • 9 out of N=14 examples are positive ⇒ p(PT=yes) = 9/14
  • 5 out of these 14 are negative ⇒ p(PT=no) = 5/14

Similarly, the conditional probabilities for the features given the class can be determined. E.g. given PT=yes, we find that out of the 9 cases, 2 were in hot conditions, 4 in mild and 3 in cool:

  p(Temp=hot  | PT=yes) = 2/9
  p(Temp=mild | PT=yes) = 4/9
  p(Temp=cool | PT=yes) = 3/9
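The counting is easy to mechanise. A minimal sketch, assuming the (Temperature, PlayTennis) pairs of the 14 cases as listed on the data slide; `Fraction` keeps the results as exact ratios like 9/14:

```python
from collections import Counter
from fractions import Fraction

# (Temperature, PlayTennis) for the 14 cases of the example dataset
cases = [("hot", "no"), ("hot", "no"), ("hot", "yes"), ("mild", "yes"),
         ("cool", "yes"), ("cool", "no"), ("cool", "yes"), ("mild", "no"),
         ("cool", "yes"), ("mild", "yes"), ("mild", "yes"), ("mild", "yes"),
         ("hot", "yes"), ("mild", "no")]

N = len(cases)
class_counts = Counter(pt for _, pt in cases)          # {"yes": 9, "no": 5}
prior = {pt: Fraction(c, N) for pt, c in class_counts.items()}
# p(PT=yes) = 9/14, p(PT=no) = 5/14

pair_counts = Counter(cases)
cond = {(t, pt): Fraction(pair_counts[(t, pt)], class_counts[pt])
        for t in ("hot", "mild", "cool") for pt in ("yes", "no")}
# e.g. p(Temp=hot | PT=yes) = 2/9
```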

SLIDE 6

Naïve Bayes classifier

Supervised learning of naive Bayes classifier

updated given forecast….

SLIDE 7

Naive Bayes classifier: learning

A naive Bayes classifier specifies

  • a class variable C
  • feature variables F1,…,Fn
  • a prior distribution p(C); probabilities sum to one (!)
  • conditional distributions p(Fi|C); probabilities sum to one for each value C=c

Distributions p(C) and p(Fi|C) can be ‘learned’ from data, e.g. with a simple approach: frequency counting.

A more sophisticated approach also learns the ‘structure’ of the model, i.e. determines which features to include ⇒ requires a performance measure (e.g. accuracy).

SLIDE 8

Naive Bayes classifier: use

A naive Bayes classifier predicts a most likely value c for class C given observed features Fi = fi from:

  c = argmax_c p(C=c | f1,…,fn) = argmax_c (1/Z) · p(C=c) · ∏i p(Fi=fi | C=c)

where 1/Z = 1/p(F1,…,Fn) is a normalisation constant. This formula is based on

  • Bayes’ rule: p(A|B) = p(B|A)·p(A)/p(B)
  • the naive assumption that all n feature variables are independent given the class variable.
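The prediction rule (prior times the product of conditionals, maximised over class values) fits in a few lines. A sketch with illustrative names; note that 1/Z is the same for every class value, so the argmax can skip the normalisation entirely:

```python
from math import prod

def nb_predict(prior, cond, evidence):
    """Most likely class value given observed feature values.

    prior:    {class_value: p(C=c)}
    cond:     {(feature, value, class_value): p(F=f | C=c)}
    evidence: {feature: observed value}
    1/Z cancels in the argmax, so unnormalised scores suffice.
    """
    def score(c):
        return prior[c] * prod(cond[(f, v, c)] for f, v in evidence.items())
    return max(prior, key=score)

# the PlayTennis numbers from the example slides
prior = {"yes": 9/14, "no": 5/14}
cond = {("O", "sunny", "yes"): 2/9, ("O", "sunny", "no"): 3/5,
        ("T", "hot", "yes"): 2/9, ("T", "hot", "no"): 2/5,
        ("H", "normal", "yes"): 6/9, ("H", "normal", "no"): 1/5,
        ("W", "weak", "yes"): 6/9, ("W", "weak", "no"): 2/5}
prediction = nb_predict(prior, cond,
                        {"O": "sunny", "T": "hot", "H": "normal", "W": "weak"})
```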

SLIDE 9

Learn NBC - example

Model ‘structure’ is fixed; we just need probabilities from the data.

Class variable: PlayTennis
Feature variables: Outlook, Temp., Humidity, Wind

Class priors: P(C) = P(PlayTennis) = { p(PlayTennis=yes) = 9/14, p(PT=no) = 5/14 }

Conditionals p(Fi|C):

              PT=yes   PT=no
  O=sunny     2/9      3/5
  O=overcast  4/9      0/5
  O=rain      3/9      2/5
  T=hot       2/9      2/5
  T=mild      4/9      2/5
  T=cool      3/9      1/5
  H=high      3/9      4/5
  H=normal    6/9      1/5
  W=weak      6/9      2/5
  W=strong    3/9      3/5


Probabilities based on frequency counting.

SLIDE 10

Classify with NBC - example

Class variable: PT; feature variables: O, T, H, W
Class priors: { p(PT=yes) = 9/14, p(PT=no) = 5/14 }

Conditionals p(Fi|C):

              PT=yes   PT=no
  O=sunny     2/9      3/5
  O=overcast  4/9      0/5
  O=rain      3/9      2/5
  T=hot       2/9      2/5
  T=mild      4/9      2/5
  T=cool      3/9      1/5
  H=high      3/9      4/5
  H=normal    6/9      1/5
  W=weak      6/9      2/5
  W=strong    3/9      3/5

Classify ‘instance’ e = <O=sunny, T=hot, H=normal, W=weak>:

  p(PT=yes | e) = 1/Z · 9/14 · 2/9 · 2/9 · 6/9 · 6/9 = 1/Z · 0.01411
                > p(PT=no | e) = 1/Z · 5/14 · 3/5 · 2/5 · 1/5 · 2/5 = 1/Z · 0.00686
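The two competing (unnormalised) scores for this instance can be checked directly:

```python
# prior * O=sunny * T=hot * H=normal * W=weak, for each class value
p_yes = 9/14 * 2/9 * 2/9 * 6/9 * 6/9
p_no  = 5/14 * 3/5 * 2/5 * 1/5 * 2/5

print(round(p_yes, 5), round(p_no, 5))  # 0.01411 0.00686
print(p_yes > p_no)                     # True -> classify as PT=yes
```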

SLIDE 11

NBC Properties

  • NBC learning is complete (probabilistic: it can handle inconsistencies in the data)
  • NBC learning is not optimal (unrealistic independence assumptions ⇒ class posterior often unreliable; yet accurate prediction of the most likely value)
  • Time and space complexity: the independence assumptions strongly reduce dimensionality
  • NBC can overfit on the training data (especially with a large number of features)
  • NBC has been further optimized ⇒ TAN/FAN/KDB

SLIDE 12

Decision tree learning

Supervised learning of a decision tree classifier by means of ‘splitting’ on attributes:
  1. What is that?
  2. How to split? (ID3)

SLIDE 13

Example data set: when to play tennis, again

SLIDE 14

Decision Tree splits I

Let’s start building the tree from scratch ⇒ we first need to decide on which attribute to make a decision. Let’s say[1] we selected “Humidity”; split the data according to the attribute’s values:

[Tree]
Humidity
  high:   D1,D2,D3,D4,D8,D12,D14
  normal: D5,D6,D7,D9,D10,D11,D13


[1] NB: using ID3, this choice will be made by the algorithm…

SLIDE 15

Decision Tree splits - II

Now let’s split the first subset (H=high) D1,D2,D3,D4,D8,D12,D14 using attribute “Wind”:

[Tree]
Humidity
  high: Wind
    strong: D2,D12,D14
    weak:   D1,D3,D4,D8
  normal: D5,D6,D7,D9,D10,D11,D13

slide-16
SLIDE 16

Decision Tree splits - III

Now let’s split the subset H=high & W=strong (D2,D12,D14) using attribute “Outlook”

[Tree]
Humidity
  high: Wind
    strong: Outlook (sunny: No, overcast: Yes, rain: No)
    weak:   D1,D3,D4,D8
  normal: D5,D6,D7,D9,D10,D11,D13

entire subset classified

SLIDE 17

Decision Tree splits - IV

Now let’s split the subset H=high & W=weak (D1,D3,D4,D8) using attribute “Outlook”:

[Tree]
Humidity
  high: Wind
    strong: Outlook (sunny: No, overcast: Yes, rain: No)
    weak:   Outlook (sunny: No, overcast: Yes, rain: Yes)
  normal: D5,D6,D7,D9,D10,D11,D13

SLIDE 18

Decision Tree splits – V

Now let’s split the subset H= normal (D5,D6,D7,D9,D10,D11,D13) using “Outlook”

[Tree]
Humidity
  high: Wind
    strong: Outlook (sunny: No, overcast: Yes, rain: No)
    weak:   Outlook (sunny: No, overcast: Yes, rain: Yes)
  normal: Outlook (sunny: Yes, overcast: Yes, rain: D5,D6,D10)

SLIDE 19

Decision Tree splits – VI

Now let’s split subset H=normal & O=rain (D5,D6,D10) using “Wind”

[Tree]
Humidity
  high: Wind
    strong: Outlook (sunny: No, overcast: Yes, rain: No)
    weak:   Outlook (sunny: No, overcast: Yes, rain: Yes)
  normal: Outlook (sunny: Yes, overcast: Yes, rain: Wind (strong: No, weak: Yes))

SLIDE 20

Final Decision Tree

Note: the decision tree can be expressed as a set of if‐then‐else sentences, or – in case of binary outcomes – as a logical formula. For PlayTennis = yes:

  (humidity=high ∧ wind=strong ∧ outlook=overcast)
  ∨ (humidity=high ∧ wind=weak ∧ outlook=overcast)
  ∨ (humidity=high ∧ wind=weak ∧ outlook=rain)
  ∨ (humidity=normal ∧ outlook=sunny)
  ∨ (humidity=normal ∧ outlook=overcast)
  ∨ (humidity=normal ∧ outlook=rain ∧ wind=weak)

[Tree]
Humidity
  high: Wind
    strong: Outlook (sunny: No, overcast: Yes, rain: No)
    weak:   Outlook (sunny: No, overcast: Yes, rain: Yes)
  normal: Outlook (sunny: Yes, overcast: Yes, rain: Wind (strong: No, weak: Yes))

SLIDE 21

Classifying with Decision Trees

Now classify instance <O=sunny, T=hot, H=normal, W=weak> = ???

[Tree]
Humidity
  high: Wind
    strong: Outlook (sunny: No, overcast: Yes, rain: No)
    weak:   Outlook (sunny: No, overcast: Yes, rain: Yes)
  normal: Outlook (sunny: Yes, overcast: Yes, rain: Wind (strong: No, weak: Yes))

SLIDE 22

Classifying with Decision Trees

Now classify instance <O=sunny, T=hot, H=normal, W=weak>: following the tree, H=normal ⇒ Outlook=sunny ⇒ Yes.

Note that this was an ‘unseen’ instance (not in data).

SLIDE 23

Alternative Decision Trees

Another tree can be built from the same data, using different attributes. We can build quite a large number of (unique) decision trees… So which attribute should we choose at the branches?

SLIDE 24

ID3: an entropy-based decision tree learner

SLIDE 25

Entropy

A measure of the disorder or randomness in a closed system with variable(s) of interest S:

  Entropy(S) = − Σ_{i=1}^{n} p_i · log2 p_i

where n = |S| is the number of values of S.

  • Convention: 0 log2 0 = 0
  • For a degenerate distribution, the entropy will be 0 (why?)
  • For a uniform distribution, the entropy will be log2 n (= 1 for a binary‐valued variable)
  • Recall: log2 x = logb x / logb 2 for any base‐b logarithm

SLIDE 26

Entropy: example

In our system we have 1 variable of interest (S = PlayTennis), with 2 possible values (yes, no) ⇒ n = |S| = 2. Let p+ = p(PT=yes) and p− = p(PT=no); we again use frequency counting to establish these probabilities from the data; recall:

  • 9 out of N=14 examples are positive ⇒ p+ = 9/14
  • 5 out of these 14 are negative ⇒ p− = 5/14

⇒ Entropy(PlayTennis) = − p+ log2 p+ − p− log2 p−
                       = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
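The computation can be checked with a small helper that also illustrates the degenerate and uniform special cases:

```python
from math import log2

def entropy(probs):
    """Entropy(S) = -sum p_i * log2(p_i), with the convention 0*log2(0) = 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

h_pt = entropy([9/14, 5/14])   # ~0.940, the PlayTennis prior
h_uni = entropy([0.5, 0.5])    # 1.0: uniform binary distribution (log2 2)
h_deg = entropy([1.0, 0.0])    # 0.0: degenerate distribution
```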

SLIDE 27

Conditional & Expected Entropy

Conditional entropy Entropy(S | X) is the entropy we expect in a system S when another variable X is given; it is the expected value of the entropy over the possible values x of X:

  Entropy(S | X) = Σ_x p(x) · Entropy(S | X = x)

where, for a specific value x of X:

  Entropy(S | X = x) = − Σ_{i=1}^{n} p(s_i | x) · log2 p(s_i | x), with n = |S|

NB we will use the following short‐hand notations (!):

  • Entropy(SX) for Entropy(S | X) = conditional entropy
  • Entropy(Sx) for Entropy(S | X = x) = entropy given specific x

SLIDE 28

Conditional Entropy - example

For example, we can evaluate the attribute “Temperature”, which has 3 values: hot, mild, cool. So we need to consider 3 subsystems: Shot, Smild, Scool.

For each subsystem, probabilities are assessed from a subset of the data D:

  Dhot  = {D1,D2,D3,D13}            ⇒ p(hot)  = 4/14
  Dmild = {D4,D8,D10,D11,D12,D14}   ⇒ p(mild) = 6/14
  Dcool = {D5,D6,D7,D9}             ⇒ p(cool) = 4/14

Now first compute the entropy in the subsystems: Entropy(Shot), Entropy(Smild), Entropy(Scool). We can then evaluate each attribute by calculating how much change it makes in entropy.

SLIDE 29

Conditional Entropy example II

  • Dhot = {D1(−),D2(−),D3(+),D13(+)}
    ⇒ p+|hot = 0.5 and p−|hot = 0.5
    ⇒ Entropy(Shot) = − 0.5 log2 0.5 − 0.5 log2 0.5 = 1

  • Dmild = {D4(+),D8(−),D10(+),D11(+),D12(+),D14(−)}
    ⇒ p+|mild = 0.666 and p−|mild = 0.333
    ⇒ Entropy(Smild) = − 0.666 log2 0.666 − 0.333 log2 0.333 = 0.918

  • Dcool = {D5(+),D6(−),D7(+),D9(+)}
    ⇒ p+|cool = 0.75 and p−|cool = 0.25
    ⇒ Entropy(Scool) = − 0.75 log2 0.75 − 0.25 log2 0.25 = 0.811

SLIDE 30

Conditional Entropy example III

The conditional entropy after splitting on “Temperature” now is:

  Entropy(STemperature)
    = p(hot)·Entropy(Shot) + p(mild)·Entropy(Smild) + p(cool)·Entropy(Scool)
    = (4/14)·1 + (6/14)·0.918 + (4/14)·0.811 = 0.9108

Okay: but does this mean we should split on this attribute??
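This weighted sum can be recomputed from the class distributions in the three subsystems; note the slide works with entropies rounded to three decimals (0.918, 0.811), which gives 0.9108 rather than the exact ~0.9111:

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# (p(x), class distribution in subsystem) for each Temperature value
subsystems = [
    (4/14, [2/4, 2/4]),  # hot:  D1(-),D2(-),D3(+),D13(+)
    (6/14, [4/6, 2/6]),  # mild: 4 positive, 2 negative
    (4/14, [3/4, 1/4]),  # cool: 3 positive, 1 negative
]
cond_entropy = sum(p_x * entropy(dist) for p_x, dist in subsystems)
print(round(cond_entropy, 4))  # ~0.9111
```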

slide-31
SLIDE 31

Information Gain

We now define the Gain (reduction in entropy) of splitting on attribute X as:

  Gain(S, X) = Entropy(S) − Entropy(S | X)

  • Information gain is always a non‐negative value! (Why?)
  • If Entropy(SX) = 0, then all cases in SX are correctly classified

⇒ split on the attribute with the smallest conditional entropy; equivalently: split on the attribute with the highest gain.

SLIDE 32

Information Gain - example

The gain of splitting on “Temperature” is: Gain(S, Temp) = 0.940 − 0.9108 = 0.029. Compute the gain of splitting for all other attributes:

  • Gain(S, Outlook)  = 0.246
  • Gain(S, Humidity) = 0.151
  • Gain(S, Wind)     = 0.048

We therefore split on Outlook and repeat the process for:

  • S → Ssunny with D → Dsunny
  • S → Sovercast with D → Dovercast
  • S → Srain with D → Drain
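These gains can be recomputed end-to-end. A sketch, with the 14 cases written out as tuples; the values are taken from the standard PlayTennis dataset, which is consistent with all counts on the earlier slides:

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c/n * log2(c/n) for c in Counter(labels).values())

# (Outlook, Temp, Humidity, Wind, PlayTennis) for D1..D14
data = [
    ("sunny", "hot", "high", "weak", "no"),        ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),     ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"), ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),  ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),  ("rain", "mild", "high", "strong", "no"),
]
labels = [row[-1] for row in data]
attrs = {"Outlook": 0, "Temp": 1, "Humidity": 2, "Wind": 3}

def gain(attr):
    groups = defaultdict(list)
    for row in data:
        groups[row[attrs[attr]]].append(row[-1])
    cond = sum(len(g)/len(data) * entropy(g) for g in groups.values())
    return entropy(labels) - cond

for a in attrs:
    print(a, round(gain(a), 3))
# Outlook 0.247, Temp 0.029, Humidity 0.152, Wind 0.048 (slide values are rounded)
```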

SLIDE 33

ID3 (Decision Tree Algorithm)

Building a decision tree with the ID3 algorithm

  1. Start from an empty node
  2. Select the attribute with the most information gain
  3. Split: create the subsystems (children) for each value of the selected attribute
  4. For each associated subset of the data: if not all elements belong to the same class, then repeat steps 2–3 for the subset
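The four steps map directly onto a short recursive function. A minimal sketch, not the course's reference implementation: rows are (feature-tuple, label) pairs, attributes are feature indices, and "most gain" is implemented as "least conditional entropy", which is equivalent:

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c/n * log2(c/n) for c in Counter(labels).values())

def id3(rows, attrs):
    """rows: list of (feature_tuple, label); attrs: indices of still-unused features."""
    labels = [lbl for _, lbl in rows]
    if len(set(labels)) == 1 or not attrs:           # step 4: subset is pure (or no attributes left)
        return Counter(labels).most_common(1)[0][0]  # leaf: (majority) class label

    def cond_entropy(a):                             # Entropy(S|X) for attribute index a
        groups = defaultdict(list)
        for feats, lbl in rows:
            groups[feats[a]].append(lbl)
        return sum(len(g)/len(rows) * entropy(g) for g in groups.values())

    best = min(attrs, key=cond_entropy)              # step 2: max gain = min conditional entropy
    children = defaultdict(list)                     # step 3: one child per attribute value
    for feats, lbl in rows:
        children[feats[best]].append((feats, lbl))
    rest = [a for a in attrs if a != best]
    return (best, {v: id3(sub, rest) for v, sub in children.items()})
```

The returned tree is a nested (attribute_index, {value: subtree}) structure with class labels at the leaves.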

SLIDE 34

Domain for ID3 example

[Figure: a robot amid obstacles.] The robot can turn left & right, and move forward.

SLIDE 35

Cases for ID3 example

Data set S:

      LeftSensor  RightSensor  ForwardSensor  BackSensor  PreviousAction  Action
  X1  Obstacle    Free         Obstacle       Free        MoveForward     TurnRight
  X2  Free        Free         Obstacle       Free        TurnLeft        TurnLeft
  X3  Free        Obstacle     Free           Free        MoveForward     MoveForward
  X4  Free        Obstacle     Free           Obstacle    TurnLeft        MoveForward
  X5  Obstacle    Free         Free           Free        TurnRight       MoveForward
  X6  Free        Free         Free           Obstacle    TurnRight       MoveForward

SLIDE 36

ID3 Example

Entropy(S) = − 1/6·log2(1/6) − 1/6·log2(1/6) − 4/6·log2(4/6) = 1.25

Entropy(SLeftSensor)     = 2/6·Entropy(SLS=obstacle) + 4/6·Entropy(SLS=free) = 2/6·1 + 4/6·0.811 = 0.874
Entropy(SRightSensor)    = 2/6·Entropy(SRS=obstacle) + 4/6·Entropy(SRS=free) = 2/6·0 + 4/6·1.5 = 1
Entropy(SForwardSensor)  = 2/6·Entropy(SFS=obstacle) + 4/6·Entropy(SFS=free) = 2/6·1 + 4/6·0 = 0.333
Entropy(SBackSensor)     = 2/6·Entropy(SBS=obstacle) + 4/6·Entropy(SBS=free) = 2/6·0 + 4/6·1.5 = 1
Entropy(SPreviousAction) = 2/6·Entropy(SPA=MoveForw) + 2/6·Entropy(SPA=TurnL) + 2/6·Entropy(SPA=TurnR) = 2/6·1 + 2/6·1 + 2/6·0 = 0.666

Gain(S, LeftSensor)     = 1.25 − 0.874 = 0.376
Gain(S, RightSensor)    = 1.25 − 1     = 0.25
Gain(S, ForwardSensor)  = 1.25 − 0.333 = 0.917
Gain(S, BackSensor)     = 1.25 − 1     = 0.25
Gain(S, PreviousAction) = 1.25 − 0.666 = 0.584

Select ForwardSensor
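The selection can be verified by recomputing the gains from the six cases in the table (exact values come out slightly higher than the slide's, which rounds Entropy(S) to 1.25):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c/n * log2(c/n) for c in Counter(labels).values())

# X1..X6: (Left, Right, Forward, Back, PreviousAction) -> Action
cases = [
    (("Obs", "Free", "Obs", "Free", "MF"), "TR"),
    (("Free", "Free", "Obs", "Free", "TL"), "TL"),
    (("Free", "Obs", "Free", "Free", "MF"), "MF"),
    (("Free", "Obs", "Free", "Obs", "TL"), "MF"),
    (("Obs", "Free", "Free", "Free", "TR"), "MF"),
    (("Free", "Free", "Free", "Obs", "TR"), "MF"),
]
labels = [a for _, a in cases]

def gain(i):
    groups = defaultdict(list)
    for feats, a in cases:
        groups[feats[i]].append(a)
    cond = sum(len(g)/len(cases) * entropy(g) for g in groups.values())
    return entropy(labels) - cond

names = ["LeftSensor", "RightSensor", "ForwardSensor", "BackSensor", "PreviousAction"]
for i, name in enumerate(names):
    print(name, round(gain(i), 3))
# ForwardSensor has the largest gain (~0.918), so it is selected
```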

SLIDE 37

Decision Tree ID3 Example

[Tree]
ForwardSensor
  free:     MoveForward
  obstacle: {X1, X2} = S’

Entropy(S’) = −1/2·log2(1/2) − 1/2·log2(1/2) = 1   (X1: Action = TR; X2: Action = TL)

Entropy(S’LeftSensor)     = 1/2·Entropy(S’LS=obstacle) + 1/2·Entropy(S’LS=free) = 1/2·0 + 1/2·0 = 0 ⇒ Gain = 1 − 0 = 1
Entropy(S’RightSensor)    = 1·Entropy(S’RS=free) = 1·1 = 1 ⇒ Gain = 1 − 1 = 0
Entropy(S’BackSensor):    exactly the same ⇒ Gain = 1 − 1 = 0
Entropy(S’PreviousAction) = 1/2·Entropy(S’PA=MoveForw) + 1/2·Entropy(S’PA=TurnL) = 1/2·0 + 1/2·0 = 0 ⇒ Gain = 1 − 0 = 1

Select either LeftSensor or PreviousAction, depending on the execution order.

SLIDE 38

Decision Tree ID3 Example

Two resulting trees (one per choice of second split):

[Tree 1]
ForwardSensor
  free:     MoveForward
  obstacle: LeftSensor
              obstacle: TurnRight (X1)
              free:     TurnLeft  (X2)

[Tree 2]
ForwardSensor
  free:     MoveForward
  obstacle: PreviousAction
              MoveForward: TurnRight (X1)
              TurnLeft:    TurnLeft  (X2)

SLIDE 39

ID3 preference bias example I

Babylon 5 universe. Data set S:

      Race     Name      BeenToB5  GoodPerson
  D1  Minbari  Delenn    Yes       Yes
  D2  Minbari  Draal     Yes       Yes
  D3  Human    Morden    Yes       No
  D4  Narn     G’Kar     Yes       Yes
  D5  Human    Sheridan  Yes       Yes

p_yes = 0.8, p_no = 0.2
Entropy(S) = − 0.2·log2 0.2 − 0.8·log2 0.8 = 0.72

Split on Race:

  Dminbari = {D1(+),D2(+)} ⇒ Entropy(Sminbari) = 0
  Dhuman   = {D3(−),D5(+)} ⇒ Entropy(Shuman) = 1
  Dnarn    = {D4(+)}       ⇒ Entropy(Snarn) = 0

Entropy(SRace) = 2/5·0 + 2/5·1 + 1/5·0 = 2/5
⇒ Gain(S, Race) = 0.72 − 2/5 = 0.32

SLIDE 40

ID3 preference bias example II

Babylon 5 universe, same data set S (p_yes = 0.8, p_no = 0.2; Entropy(S) = 0.72).

Split on Name:

  DDelenn = {D1(+)} ⇒ Entropy(SDelenn) = 0
  DDraal  = {D2(+)} ⇒ Entropy(SDraal) = 0
  DMorden = {D3(−)}, DG’Kar = {D4(+)}, DSheridan = {D5(+)}: the entropies of all subsets are 0

⇒ Entropy(SName) = 0 ⇒ Gain(S, Name) = 0.72 − 0 = 0.72

SLIDE 41

ID3: Preference Bias

[Tree]
Name
  Delenn: Yes | Draal: Yes | Morden: No | G’Kar: Yes | Sheridan: Yes

ID3 prefers some trees over others:
  ‐ it favors shorter trees over longer ones
  ‐ it selects trees that place the attributes with the highest information gain closest to the root

Its bias is solely a consequence of the ordering of hypotheses by its search strategy.

SLIDE 42

ID3: Overfitting (illustrated)

  • Suppose we receive an additional data point

SLIDE 43

Extra data: effect on our tree

NB in the previous tree, instance <O=sunny, . , H=normal, . > was classified as PlayTennis = yes…
SLIDE 44

Effects of ID3 Overfitting

  • Trees may grow to include irrelevant attributes (e.g., Date, Color, etc.)
  • Noisy examples may add spurious nodes to the tree

SLIDE 45

ID3 Properties

  • ID3 is complete for consistent(!) training data
  • ID3 is not optimal (greedy hill‐climbing approach ⇒ no guarantees)
  • ID3 can overfit on the training data (accuracy of the learned model = prediction on a test set)
  • Use of information gain ⇒ preference bias
  • Continuous data: many more places to split an attribute ⇒ time‐consuming search for the best split
  • ID3 has been further optimized ⇒ e.g. C4.5 and C5.0
  • ID3 for iterative online learning: ID4

SLIDE 46

Which is more powerful?
