Introduction to Artificial Intelligence: Classification Algorithms, Decision Trees and Overfitting - PowerPoint PPT Presentation

SLIDE 1

Introduction to Artificial Intelligence

Classification Algorithms: Decision Trees and Overfitting

Miłosz Kadziński

Institute of Computing Science Poznan University of Technology, Poland www.cs.put.poznan.pl/mkadzinski/iai

SLIDE 2

Classification

Aim: create a model and use it to classify new data, i.e., predict a discrete, nominal value (category/class). Input: a fixed set of class or category labels C = {C1, C2, …, Cn}; a training set of m hand-labeled objects {(O1, Cj1), …, (Om, Cjm)}; an object Ox ∈ X to be classified. Output: a learned classifier that predicts class c(Ox) for Ox, where c(Ox) ∈ C and c is a function whose domain is X and whose range is C.

Classification algorithms: decision trees, Bayesian, distance-based, neural networks, Support Vector Machines, association-based, genetic.


SLIDE 3

Examples of Classification (1)

Classification: predicting the class as a function of the values of other attributes.

Medical diagnosis and treatment effectiveness: An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc.) of newly admitted patients. A decision has to be taken whether to put the patient in an intensive-care unit. Due to the high cost of the ICU, we need to predict high-risk patients and discriminate them from low-risk patients.

Credit risk assessment and fraud detection: A credit card company typically receives hundreds of thousands of applications for new cards. An application contains information regarding several different attributes, such as annual salary, outstanding debts, age, etc. The problem is to categorize applications into those with good credit, bad credit, or those that fall into a gray area.

SLIDE 4

Examples of Classification (2)

Classification: predicting the class as a function of the values of other attributes.

Industrial applications: A nuclear fuel processing company wishes to improve the yield of its factories. In one such factory, uranium hexafluoride gas is converted into uranium dioxide pellets. Six processing steps are needed to do the conversion. There are 30 controllable variables (e.g., pressure, flow rates, temperature). Engineers note that yield is high on some days and low on others. How can they control the variables to produce high yield on all days?

Star classification: Astronomers have been cataloguing distant objects in the sky using long-exposure images. The objects need to be labeled as star, galaxy, nebula, etc. The data is highly noisy, and the images are very faint. The cataloguing can take decades to complete. How can physicists automate the cataloguing process and improve its effectiveness?

SLIDE 5

Examples of Classification (3) – Our Running Example

Objects / instances / items (rows) are described by attributes / characteristics / features and a decision attribute (classification; class = yes or class = no):

    Income   AI student   Sex      Buy iPhone XI?
1   medium   yes          male     yes
2   medium   yes          female   yes
3   high     yes          female   yes
4   low      yes          male     no
5   low      yes          female   no
6   low      no           female   no
7   medium   no           male     no
8   medium   no           female   no

Aim: represent the classification with a decision tree, e.g., if income = medium and AI student = yes and sex = male then yes.
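To make the running example concrete, one possible encoding of this table as plain Python data is sketched below; the attribute names and the (attributes, class) pair layout are illustrative assumptions rather than part of the slides, and the later sketches reuse this format.

    # The "Buy iPhone XI?" running example as (attributes, class) pairs.
    # Attribute names and this representation are illustrative assumptions.
    training_set = [
        ({"income": "medium", "ai_student": "yes", "sex": "male"},   "yes"),
        ({"income": "medium", "ai_student": "yes", "sex": "female"}, "yes"),
        ({"income": "high",   "ai_student": "yes", "sex": "female"}, "yes"),
        ({"income": "low",    "ai_student": "yes", "sex": "male"},   "no"),
        ({"income": "low",    "ai_student": "yes", "sex": "female"}, "no"),
        ({"income": "low",    "ai_student": "no",  "sex": "female"}, "no"),
        ({"income": "medium", "ai_student": "no",  "sex": "male"},   "no"),
        ({"income": "medium", "ai_student": "no",  "sex": "female"}, "no"),
    ]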

SLIDE 6

Decision Tree – How to Interpret It?

Abbreviated table (I = income: L/M/H; AI = AI student: Y/N; S = sex: M/F; iPh? = buy iPhone XI?: Y/N):

    I  AI  S  iPh?
1   M  Y   M  Y
2   M  Y   F  Y
3   H  Y   F  Y
4   L  Y   M  N
5   L  Y   F  N
6   L  N   F  N
7   M  N   M  N
8   M  N   F  N

[Decision tree: the root tests ”income” (low → N, high → Y, medium → a node testing ”AI stud.”: yes → Y, no → N).]
The root and internal nodes (vertices) test an attribute; a leaf (node) assigns a classification. A branch corresponds to an attribute value and leads to a node at the lower level; nodes at the same depth form a level. Directed tree = a directed acyclic graph whose underlying undirected graph is a tree (a graph in which any two vertices are connected by exactly one path). A classification tree needs to represent a division of the set of objects into classes: the internal nodes are the means of performing such a division, and the leaves are the classes.

SLIDE 7

Decision Tree vs. Decision Rules

A decision tree can alternatively be represented in the form of decision rules: each path leading from the root to some leaf corresponds to a rule.

if income = low then buy = no
if income = high then buy = yes
if (income = medium and AI student = yes) then buy = yes
if (income = medium and AI student = no) then buy = no

For each class, the decision tree represents a disjunction (∨) of conjunctions (∧) of constraints on the values of attributes:

yes (Y): (income = high) ∨ (income = medium ∧ AI student = yes)
no (N): (income = low) ∨ (income = medium ∧ AI student = no)

Both the rules and the con-/disjunctions can possibly be simplified (see no (N)).
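The rule view above translates directly into conditional statements. A minimal sketch, assuming the attribute encoding introduced on the previous slide (the function name is illustrative):

    def classify_by_rules(obj):
        """Decision rules read off the tree: one rule per root-to-leaf path."""
        if obj["income"] == "low":
            return "no"
        if obj["income"] == "high":
            return "yes"
        # income == "medium": the tree tests "AI student" next
        return "yes" if obj["ai_student"] == "yes" else "no"

    print(classify_by_rules({"income": "medium", "ai_student": "yes", "sex": "male"}))  # yes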

SLIDE 8

Decision Tree – Use for Classification

Classification of new objects (with unknown classification) is done by traversing the tree: start from the root; verify the value of the attribute associated with the current node; move to the next node following the branch corresponding to the object’s attribute value; repeat until a leaf indicating a class assignment is reached. Some attributes (see sex (S)) may have no influence on the buying decision (according to the constructed decision tree): in classifying any object, the tree may not use all attributes in the table.

[Decision tree: income = low → N {4, 5, 6}; income = high → Y {3}; income = medium → AI student: yes → Y {1, 2}, no → N {7, 8}]

New objects to classify (start at the root):

    I  AI  S  iPh?
9   H  N   M  ?
10  M  N   F  ?
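The traversal procedure can be sketched with the tree stored as a nested structure; this ("attribute", {value: subtree}) representation is an assumption made for illustration, not notation used in the lecture.

    # A decision tree as nested tuples/dicts: ("attribute", {value: subtree_or_leaf}).
    tree = ("income", {
        "low":    "no",
        "high":   "yes",
        "medium": ("ai_student", {"yes": "yes", "no": "no"}),
    })

    def classify(tree, obj):
        """Start at the root and follow the branch matching the object's value until a leaf."""
        while isinstance(tree, tuple):        # internal node: (attribute, branches)
            attribute, branches = tree
            tree = branches[obj[attribute]]   # move along the branch for the object's value
        return tree                           # leaf: the class label

    # New objects 9 and 10 from the slide ("sex" is never tested by this tree).
    print(classify(tree, {"income": "high",   "ai_student": "no", "sex": "male"}))    # yes
    print(classify(tree, {"income": "medium", "ai_student": "no", "sex": "female"}))  # no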

SLIDE 9

Decision Tree – Characteristics

Decision trees are one of the most widely used and practical methods of inference based on decision examples. Learned functions are represented as decision trees (or if-then-else rules). The hypothesis space is expressive, including disjunctions (disjunctive hypotheses). Training data samples may be noisy: the method is robust to errors in the training data. Instances can be described by attribute-value pairs. A sequence of conditions conditioning decision making is used in many domains. The trees are intuitive, easily understandable, and human readable. The most famous decision tree algorithms: ID3 (Iterative Dichotomiser 3), C4 and C4.5 (successors of ID3), CART (Classification and Regression Tree).

SLIDE 10

ID3 – The Famous Algorithm for Learning Decision Trees

ID3 is a basic algorithm for learning decision trees (DTs). Given a training set of examples, the algorithm performs a search in the space of decision trees. The construction of the tree is top-down (root → leaves). The algorithm is greedy.

John Ross Quinlan ID3, 1986

ID3:
1. A ← select the “best” attribute for the next node.
2. For each value of A create a branch and a descendant node (working leaf).
3. Partition the training examples to the leaf nodes according to the attribute value of the branch.
4. If all training examples in a given leaf are perfectly classified (the same value of the target attribute = class) or there are no attributes left, stop and create the leaf node indicating the respective class.
5. Otherwise, iterate over the new successor leaf nodes (recursively construct a sub-tree for each partition using ID3).
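A compact sketch of the recursive procedure listed above, choosing the “best” attribute by information gain (defined formally on later slides). The data format follows the earlier illustrative encoding; this is a teaching sketch under those assumptions, not Quinlan's original implementation.

    import math
    from collections import Counter

    def entropy(examples):
        counts = Counter(label for _, label in examples)
        total = len(examples)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(examples, attribute):
        total = len(examples)
        remainder = 0.0
        for value in {obj[attribute] for obj, _ in examples}:
            subset = [(o, l) for o, l in examples if o[attribute] == value]
            remainder += len(subset) / total * entropy(subset)
        return entropy(examples) - remainder

    def id3(examples, attributes):
        labels = [label for _, label in examples]
        if len(set(labels)) == 1:              # perfectly classified: create a leaf
            return labels[0]
        if not attributes:                     # no attributes left: majority-class leaf
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: information_gain(examples, a))
        branches = {}
        for value in {obj[best] for obj, _ in examples}:
            subset = [(o, l) for o, l in examples if o[best] == value]
            branches[value] = id3(subset, [a for a in attributes if a != best])
        return (best, branches)

Applied to the table from slide 5 (encoded as before), id3(training_set, ["income", "ai_student", "sex"]) reproduces the income / AI student tree constructed on the next two slides.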

SLIDE 11

ID3 – Example (1)

Start from an empty tree. Let us assume ”income” is the “best” attribute for the next node (why? see later). For each value of ”income” create a new branch and descendant node: low → {4 (N), 5 (N), 6 (N)}; high → {3 (Y)}; medium → {1 (Y), 2 (Y), 7 (N), 8 (N)}. The low and high branches are perfectly classified: create leaf nodes indicating the respective classes (N and Y). The medium branch is imperfectly classified and other attributes are still available: iterate over this successor leaf node and construct a sub-tree for this partition.

SLIDE 12

ID3 – Example (2)

Let us assume ”AI student” is the “best” decision attribute for the next node. We don’t have an empty tree anymore: expand what needs to be expanded, i.e., the node holding {1 (Y), 2 (Y), 7 (N), 8 (N)}. For each value of ”AI student” create a new branch and descendant node: yes → {1 (Y), 2 (Y)}; no → {7 (N), 8 (N)}. Perfect classification: create the leaf nodes indicating the respective classes (Y and N). All training examples are now perfectly classified.

HOW TO SELECT THE BEST DECISION ATTRIBUTE FOR A GIVEN NODE?

SLIDE 13

ID3 vs. Occam's Razor (1)

The ID3 algorithm assumes that a good decision tree is the simplest decision tree: preferring simplicity and avoiding unnecessary assumptions. Occam's Razor was first articulated by the medieval English logician and Franciscan friar William of Occam in 1324: ”It is vain to do with more what can be done with less”; ”Entities should not be multiplied without necessity”; ”The simplest solution is most likely the right one”. The term razor refers to the act of shaving away unnecessary assumptions to get to the simplest explanation. An inspiration for Einstein, Euler, and Planck: ”Everything should be kept as simple as possible, but not simpler”.

SLIDE 14

ID3 vs. Occam's Razor (2)

We should always accept the simplest answer that correctly fits our data: the simplest decision tree that covers all examples should be the least likely to include unnecessary constraints. ID3 selects a property to test at the current node of the tree and uses this test to partition the set of examples. Because the order of tests is critical to constructing a simple tree, ID3 relies heavily on its criterion for selecting the test at the root of each sub-tree.

SLIDE 15

How to Select The Best Attribute? Information Gain!

A statistical property called information gain measures how well a given attribute separates the training examples. Information gain uses the notion of entropy, commonly used in information theory: information gain = expected reduction of entropy. Entropy measures the impurity of a collection of examples; the higher the information gain, the more effective the attribute is in classifying the training data.
How pure is the whole set {1(Y), 2(Y), 3(Y), 4(N), 5(N), 6(N), 7(N), 8(N)} (3/8 Y, 5/8 N)? How pure are the sets obtained by splitting on ”AI student” (yes → {1(Y), 2(Y), 3(Y), 4(N), 5(N)}, 3/5 Y, 2/5 N; no → {6(N), 7(N), 8(N)}, 0/3 Y, 3/3 N) or on ”sex” (male → {1(Y), 4(N), 7(N)}, 1/3 Y, 2/3 N; female → {2(Y), 3(Y), 5(N), 6(N), 8(N)}, 2/5 Y, 3/5 N)?

SLIDE 16

Entropy – Binary Classification (1)

Entropy(S) = Ent(S) = – pY·log2 pY – pN·log2 pN   [0·log2 0 = 0]

Entropy measures the impurity of a collection of examples and the amount of information in a random variable; it depends on the distribution of the random variable. S is a collection of training examples; pY is the proportion of positive examples (from class Y) in S; pN is the proportion of negative examples (from class N) in S. Note: the log2 of a number < 1 is negative. First focus: binary classification (two classes: positive, negative).

(Figure: the entropy function. Since 0 ≤ pY, pN ≤ 1, we have 0 ≤ entropy ≤ 1 for binary classification.)
Reference values: log2 1 = 0, log2 1/2 = -1, log2 1/4 = -2, log2 1/8 = -3, log2 1/16 = -4.

fair: Ent(S) = – 0.5·log2 0.5 – 0.5·log2 0.5 = – 0.5·(-1) – 0.5·(-1) = 1
unfair: Ent(S) = – 0.7·log2 0.7 – 0.3·log2 0.3 = – 0.7·(-0.515) – 0.3·(-1.737) = 0.882
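A minimal numerical check of the fair/unfair examples above (plain Python; the function name is an assumption):

    import math

    def binary_entropy(p_yes):
        """Ent(S) = -pY*log2(pY) - pN*log2(pN), with 0*log2(0) taken as 0."""
        p_no = 1.0 - p_yes
        return -sum(p * math.log2(p) for p in (p_yes, p_no) if p > 0)

    print(binary_entropy(0.5))  # 1.0    (fair: maximal impurity)
    print(binary_entropy(0.7))  # ~0.881 (unfair: the slide rounds to 0.882)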

SLIDE 17

Entropy – Binary Classification (2)

Entropy(S) = Ent(S) = – pY·log2 pY – pN·log2 pN   [0·log2 0 = 0]

The whole set {1(Y), 2(Y), 3(Y), 4(N), 5(N), 6(N), 7(N), 8(N)}: 3/8 Y, 5/8 N.
Splitting by ”AI student”: yes → {1(Y), 2(Y), 3(Y), 4(N), 5(N)} (3/5 Y, 2/5 N); no → {6(N), 7(N), 8(N)} (0/3 Y, 3/3 N).
Splitting by a fictive attribute: val1 → {1(Y), 2(Y), 3(Y), 6(N), 7(N), 8(N)} (3/6 Y, 3/6 N); val2 → {4(N), 5(N)} (0/2 Y, 2/2 N).

Ent([3Y, 5N]) = – 3/8·log2(3/8) – 5/8·log2(5/8) = – 3/8·(-1.415) – 5/8·(-0.678) = 0.954
Ent([3Y, 2N]) = – 3/5·log2(3/5) – 2/5·log2(2/5) = 0.971
Ent([0Y, 3N]) = – 0/3·log2(0/3) – 3/3·log2(3/3) = 0
Ent([3Y, 3N]) = – 3/6·log2(3/6) – 3/6·log2(3/6) = 1
Ent([0Y, 2N]) = – 0/2·log2(0/2) – 2/2·log2(2/2) = 0

Purity: [0Y, 3N] and [0Y, 2N] are pure (Ent = 0); [3Y, 3N] is extremely impure (Ent = 1); the whole set [3Y, 5N] and [3Y, 2N] are very impure.

SLIDE 18

Entropy – Multi-Class Classification

For binary classification (a two-valued random variable):
Ent(S) = – pY·log2 pY – pN·log2 pN
For classification into n classes:
Ent(S) = – ∑i=1,…,n pi·log2 pi = – p1·log2 p1 – p2·log2 p2 – … – pn·log2 pn
Three example collections for classification into classes C1, C2, and C3:
(a) {1(C1), 2(C1), 3(C1), 4(C2), 5(C2), 6(C2), 7(C3), 8(C3), 9(C3)}: 3/9 C1, 3/9 C2, 3/9 C3
(b) {1(C1), 2(C1), 3(C1), 4(C1), 5(C1), 6(C1), 7(C1), 8(C2), 9(C3)}: 7/9 C1, 1/9 C2, 1/9 C3
(c) {1(C1), 2(C1), 3(C1), 4(C1), 5(C1), 6(C1), 7(C1), 8(C1), 9(C1)}: 9/9 C1, 0/9 C2, 0/9 C3

(a) Ent(S) = – 3/9·log2 3/9 – 3/9·log2 3/9 – 3/9·log2 3/9 = 1.585
(b) Ent(S) = – 7/9·log2 7/9 – 1/9·log2 1/9 – 1/9·log2 1/9 = 0.986
(c) Ent(S) = – 9/9·log2 9/9 – 0/9·log2 0/9 – 0/9·log2 0/9 = 0

Collection (a) is extremely impure; collection (c) is pure.
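The same kind of check for the n-class formula, applied to the three collections above:

    import math

    def entropy(proportions):
        """Ent(S) = -sum_i p_i * log2(p_i); zero proportions contribute nothing."""
        return -sum(p * math.log2(p) for p in proportions if p > 0)

    print(entropy([3/9, 3/9, 3/9]))  # ~1.585 (extremely impure)
    print(entropy([7/9, 1/9, 1/9]))  # ~0.986
    print(entropy([9/9, 0/9, 0/9]))  # 0.0    (pure)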

SLIDE 19

Interpretation of Entropy – Information Theory

Information theory provides the information content of a message. When the data source produces a low-probability value (i.e., when a low-probability event occurs), the event carries more "information" ("surprisal"); when the data source produces a high-probability value, the event carries less "information". Entropy specifies the average/expected length (in bits) of the message needed to transmit the outcome of a random variable. We may think of a decision tree as conveying information about the classification of examples in the decision table; the information content of the tree is computed from the probabilities of the different classifications.

Claude Shannon 1948

This value clearly depends on the probability distribution. The amount of information conveyed by each event is a random variable whose expected value is the information entropy.

SLIDE 20

Conditional Entropy

The amount of information needed to complete the tree is defined as a weighted sum of the information content of each sub-tree.
Consider a set of examples S with nS examples and an attribute A with p values {v1, v2, …, vp}; Sj, j = 1, …, p, contains the nSj examples having value vj for attribute A. Answering the question ”what is the value of attribute A?” separates S into subsets {S1, S2, …, Sp}.
Conditional entropy of dividing set S with respect to attribute A:
Ent(S, A) = Ent(S|A) = ∑j=1,…,p (nSj/nS)·Ent(Sj)
Example, splitting by ”AI student”: yes (5/8) → {1(Y), 2(Y), 3(Y), 4(N), 5(N)} (3/5 Y, 2/5 N), Ent(S1AI) = 0.971; no (3/8) → {6(N), 7(N), 8(N)} (0/3 Y, 3/3 N), Ent(S2AI) = 0.
Ent(S, AI student) = 5/8·Ent(S1AI) + 3/8·Ent(S2AI) = 5/8·0.971 + 3/8·0 = 0.607
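A sketch of this weighted sum with the class counts per attribute value passed in explicitly (the list-of-counts representation is an assumption):

    import math

    def entropy(class_counts):
        total = sum(class_counts)
        return -sum(c / total * math.log2(c / total) for c in class_counts if c > 0)

    def conditional_entropy(partition):
        """Ent(S, A) = sum_j (nSj / nS) * Ent(Sj); partition = class counts per value of A."""
        n_s = sum(sum(counts) for counts in partition)
        return sum(sum(counts) / n_s * entropy(counts) for counts in partition)

    # Splitting the Buy iPhone table by "AI student": yes -> [3 Y, 2 N], no -> [0 Y, 3 N]
    print(conditional_entropy([[3, 2], [0, 3]]))  # ~0.607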

SLIDE 21

Conditional Entropy – Example

Conditional entropy of dividing set S with respect to attribute A: Ent(S, A) = Ent(S|A) = ∑j=1,…,p (nSj/nS)·Ent(Sj)
Splitting by ”income”: low (3/8) → {4(N), 5(N), 6(N)} (0/3 Y, 3/3 N), Ent(S1in) = 0; high (1/8) → {3(Y)} (1/1 Y, 0/1 N), Ent(S2in) = 0; medium (4/8) → {1(Y), 2(Y), 7(N), 8(N)} (2/4 Y, 2/4 N), Ent(S3in) = 1.
Splitting by ”sex”: male (3/8) → {1(Y), 4(N), 7(N)} (1/3 Y, 2/3 N), Ent(S1sex) = 0.918; female (5/8) → {2(Y), 3(Y), 5(N), 6(N), 8(N)} (2/5 Y, 3/5 N), Ent(S2sex) = 0.971.
Ent(S, income) = 3/8·Ent(S1in) + 1/8·Ent(S2in) + 4/8·Ent(S3in) = 3/8·0 + 1/8·0 + 4/8·1 = 0.5
Ent(S, AI student) = 5/8·Ent(S1AI) + 3/8·Ent(S2AI) = 5/8·0.971 + 3/8·0 = 0.607
Ent(S, sex) = 3/8·Ent(S1sex) + 5/8·Ent(S2sex) = 3/8·0.918 + 5/8·0.971 = 0.951

SLIDE 22

Information Gain

Information gain is the expected reduction in entropy caused by partitioning the examples/instances in set S on attribute A:
Gain(S, A) = Ent(S) - Ent(S, A)
Ent(S) is the total information of the table; Ent(S, A) is the amount of information needed to complete the classification after performing the test. The higher the information gain, the more effective the attribute is in classifying the training data (split on the attribute with the highest information gain).
Ent(S) = 0.954; Ent(S, income) = 0.5; Ent(S, AI student) = 0.607; Ent(S, sex) = 0.951
Gain(S, income) = 0.954 - 0.5 = 0.454
Gain(S, AI student) = 0.954 - 0.607 = 0.347
Gain(S, sex) = 0.954 - 0.951 = 0.003
DECISION: split on ”income” (low → {4 (N), 5 (N), 6 (N)}: N; high → {3 (Y)}: Y; medium → {1 (Y), 2 (Y), 7 (N), 8 (N)}: to be expanded).
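Putting entropy and conditional entropy together for the first split, a sketch that reproduces the three gains above (the class-count representation is an assumption):

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    def gain(total_counts, partition):
        """Gain(S, A) = Ent(S) - sum_j (nSj / nS) * Ent(Sj)."""
        n_s = sum(total_counts)
        remainder = sum(sum(c) / n_s * entropy(c) for c in partition)
        return entropy(total_counts) - remainder

    s = [3, 5]  # 3 Y, 5 N in the whole table
    print(gain(s, [[0, 3], [1, 0], [2, 2]]))  # income:     ~0.454
    print(gain(s, [[3, 2], [0, 3]]))          # AI student: ~0.347
    print(gain(s, [[1, 2], [2, 3]]))          # sex:        ~0.003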

SLIDE 23

Information Gain – Example

Consider the ”income = medium” partition (examples 1, 2, 7, 8):

    I  AI  S  iPh?
1   M  Y   M  Y
2   M  Y   F  Y
7   M  N   M  N
8   M  N   F  N

Ent(S) = Ent([2Y, 2N]) = – 2/4·log2(2/4) – 2/4·log2(2/4) = 1
Splitting by ”AI student”: yes (2/4) → {1(Y), 2(Y)} (2/2 Y, 0/2 N), Ent(S1AI) = 0; no (2/4) → {7(N), 8(N)} (0/2 Y, 2/2 N), Ent(S2AI) = 0; Ent(S, AI student) = 2/4·0 + 2/4·0 = 0
Splitting by ”sex”: male (2/4) → {1(Y), 7(N)} (1/2 Y, 1/2 N), Ent(S1sex) = 1; female (2/4) → {2(Y), 8(N)} (1/2 Y, 1/2 N), Ent(S2sex) = 1; Ent(S, sex) = 2/4·1 + 2/4·1 = 1
Gain(S, AI student) = 1 – 0 = 1; Gain(S, sex) = 1 – 1 = 0
The entropy ”before” is the same for both candidates, so select the attribute for which the conditional entropy ”after” is smaller. DECISION: split on ”AI student”. Next step? None: the resulting tree (income = low → N; income = high → Y; income = medium → AI student: yes → Y {1, 2}, no → N {7, 8}) classifies all training examples perfectly.

SLIDE 24

Greater Running Example (1)

You are stranded on a desert island, with only a pile of records and a book of poetry, and so have no way to determine which of the many types of fruit available are safe to eat. They are of various colours and sizes, some have hairy skins and others are smooth, some have hard flesh and others are soft. After a great deal of stomach ache, you compile the data shown in the table. This is pretty tedious to look up each time you feel hungry; is there some kind of pattern to it?

    skin    colour  size   flesh  decision
4   hairy   green   large  soft   safe
7   smooth  brown   small  hard   safe
…   …       …       …      …      …

SLIDE 25

Greater Running Example (2)

    skin    colour  size   flesh  decision
1   hairy   brown   large  hard   safe
2   hairy   green   large  hard   safe
3   smooth  red     large  soft   dangerous
4   hairy   green   large  soft   safe
5   hairy   red     small  hard   safe
6   smooth  red     small  hard   safe
7   smooth  brown   small  hard   safe
8   hairy   green   small  soft   dangerous
9   smooth  green   small  hard   dangerous
10  hairy   red     large  hard   safe
11  smooth  brown   large  soft   safe
12  smooth  green   small  soft   dangerous
13  hairy   red     small  soft   safe
14  smooth  red     large  hard   dangerous
15  smooth  red     small  hard   safe
16  hairy   green   small  hard   dangerous

Abbreviations: sk = skin (H = hairy, S = smooth), co = colour (B = brown, G = green, R = red), si = size (L = large, S = small), fl = flesh (H = hard, S = soft), de = decision (S = safe, D = dangerous).

Aim: construct a decision tree using ID3.

SLIDE 26

How to Select The Best Attribute? Example (1)

(Using the table from slide 25.) 6 out of 16 examples are dangerous and 10 out of 16 are safe:
Ent(de) = – 6/16·log2(6/16) – 10/16·log2(10/16) = 0.954
For ”skin”, 8 out of 16 examples have skin = hairy (6 safe, 2 dangerous) and 8 have skin = smooth (4 safe, 4 dangerous):
Ent(de, sk) = 8/16·(– 6/8·log2(6/8) – 2/8·log2(2/8)) + 8/16·(– 4/8·log2(4/8) – 4/8·log2(4/8)) = 0.905
Ent(de, co) = 3/16·(– 3/3·log2(3/3) – 0/3·log2(0/3)) + 6/16·(– 2/6·log2(2/6) – 4/6·log2(4/6)) + 7/16·(– 5/7·log2(5/7) – 2/7·log2(2/7)) = 0.722
Ent(de, si) = 7/16·(– 5/7·log2(5/7) – 2/7·log2(2/7)) + 9/16·(– 5/9·log2(5/9) – 4/9·log2(4/9)) = 0.935
Ent(de, fl) = 10/16·(– 7/10·log2(7/10) – 3/10·log2(3/10)) + 6/16·(– 3/6·log2(3/6) – 3/6·log2(3/6)) = 0.925

DECISION: split on ”colour” (the conditional entropy for ”colour” is the smallest; the respective information gain is the greatest).
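A sketch that recomputes these conditional entropies directly from the fruit table (the compact string encoding of the rows is an assumption):

    import math
    from collections import Counter

    def entropy(labels):
        total = len(labels)
        return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

    def conditional_entropy(rows, attribute_index):
        """Weighted entropy of the decision after splitting on one attribute column."""
        total = len(rows)
        result = 0.0
        for value in {r[attribute_index] for r in rows}:
            subset = [r[-1] for r in rows if r[attribute_index] == value]
            result += len(subset) / total * entropy(subset)
        return result

    # One row per example; the characters are skin, colour, size, flesh, decision.
    rows = [
        "HBLHS", "HGLHS", "SRLSD", "HGLSS", "HRSHS", "SRSHS", "SBSHS", "HGSSD",
        "SGSHD", "HRLHS", "SBLSS", "SGSSD", "HRSSS", "SRLHD", "SRSHS", "HGSHD",
    ]
    print(round(entropy([r[-1] for r in rows]), 3))            # 0.954
    for name, i in [("skin", 0), ("colour", 1), ("size", 2), ("flesh", 3)]:
        print(name, round(conditional_entropy(rows, i), 3))
    # skin 0.906, colour 0.722, size 0.935, flesh 0.926 -> split on "colour" (smallest)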

SLIDE 27

How to Select The Best Attribute? Example (2)

After splitting on ”colour”: brown → {1(S), 7(S), 11(S)} (a safe leaf); green → {2(S), 4(S), 8(D), 9(D), 12(D), 16(D)}; red → {5(S), 6(S), 10(S), 13(S), 15(S), 3(D), 14(D)}. Task: recursively construct a sub-tree for each impure partition.
For the ”green” partition (2 safe out of 6):
Ent(de) = – 2/6·log2(2/6) – 4/6·log2(4/6) = 0.918
Ent(de, sk) = 4/6·(– 2/4·log2(2/4) – 2/4·log2(2/4)) + 2/6·(– 0/2·log2(0/2) – 2/2·log2(2/2)) = 0.667
Ent(de, si) = 2/6·(– 2/2·log2(2/2) – 0/2·log2(0/2)) + 4/6·(– 0/4·log2(0/4) – 4/4·log2(4/4)) = 0
Ent(de, fl) = 3/6·(– 1/3·log2(1/3) – 2/3·log2(2/3)) + 3/6·(– 1/3·log2(1/3) – 2/3·log2(2/3)) = 0.918
DECISION: split on ”size”.

SLIDE 28

Decision Tree – Properties

Final tree (after some iterations):
colour = brown → S {1(S), 7(S), 11(S)}
colour = green → size: large → S {2(S), 4(S)}; small → D {8(D), 9(D), 12(D), 16(D)}
colour = red → size: small → S {5(S), 6(S), 13(S), 15(S)}; large → skin: hairy → S {10(S)}; smooth → D {3(D), 14(D)}
Levels 0 to 3: no. of levels = 4; depth = no. of levels – 1 = 3; no. of leaves = 6.
The search maintains a single current hypothesis. There is no backtracking, so no guarantee of optimality. It uses all the available examples (not incremental). It may terminate earlier, accepting noisy classes.

SLIDE 29

Dealing with Continuous-Valued Attributes

So far we have used discrete values for the attributes and for the outcome. Given a continuous-valued attribute A, dynamically create a new Boolean attribute Ac: Ac = true if A > th, false if A ≤ th. How to determine the threshold value th? Example: sort the examples according to ”income”:
Income: 50 52 55 65 68 70 76 80
Buy?:   No No No Yes Yes No Yes Yes
Determine candidate thresholds by averaging consecutive values where there is a change in classification: 60, 69, 73. Evaluate the candidate thresholds (attributes) according to information gain: Income > 60, Income > 69, and Income > 73. The best threshold determines the new attribute, which competes with the other attributes. Once selected, the same attribute, though with a different threshold, can still be used on a way from the root to some leaf.
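A small sketch of the candidate-threshold generation described above (the function name is an assumption):

    def candidate_thresholds(values, labels):
        """Average consecutive attribute values where the class label changes."""
        pairs = sorted(zip(values, labels))
        return [(a + b) / 2
                for (a, la), (b, lb) in zip(pairs, pairs[1:])
                if la != lb]

    income = [50, 52, 55, 65, 68, 70, 76, 80]
    buy    = ["No", "No", "No", "Yes", "Yes", "No", "Yes", "Yes"]
    print(candidate_thresholds(income, buy))  # [60.0, 69.0, 73.0]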

SLIDE 30

Split Information – What’s the Problem with Infor. Gain?

Natural bias of information gain: it favours attributes with many possible values. Consider the attribute Phone number in the Buy iPhone example: it has the highest information gain, since it perfectly separates the training data, and it would be selected at the root, resulting in a very broad tree. Such a tree is very good on the training data but would perform poorly in predicting unknown instances: the partition is too specific, and too many small classes are generated.
Consider a set of examples S with nS examples and an attribute A with p values {v1, v2, …, vp}; Sj, j = 1, …, p, contains the nSj examples having value vj for attribute A. Answering the question ”what is the value of attribute A?” separates S into subsets {S1, S2, …, Sp}.
SplitInformation(S, A) = – ∑j=1,…,p (nSj/nS)·log2(nSj/nS)
SplitInformation measures the entropy of S with respect to the values of A: the focus is on the sizes of the subsets that are obtained (not on the classes).

SLIDE 31

Split Information – Example

Phone number splits the examples into n subsets (n is large): SplitInformation(S, Phone no.) = − (1/n·log2 1/n) − … − (1/n·log2 1/n) = − log2 1/n = log2 n. Compare with an attribute A that splits the data into two even subsets: SplitInformation(S, A) = − (1/2·log2 1/2) − (1/2·log2 1/2) = − (-1/2) − (-1/2) = 1. The more uniformly dispersed the data, the higher the split information.
For the Buy iPhone example (income: low (3/8), high (1/8), medium (4/8); AI student: yes (5/8), no (3/8)):
SplitInformation(buy, income) = − (3/8·log2 3/8) − (1/8·log2 1/8) − (4/8·log2 4/8) = 1.406
SplitInformation(buy, AI student) = − (5/8·log2 5/8) − (3/8·log2 3/8) = 0.954

SLIDE 32

Gain Ratio

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
GainRatio penalizes attributes that split the examples into many small subsets (Gain: high values desired; SplitInformation: low values desired).
Gain(S, income) = 0.454; Gain(S, AI student) = 0.347; Gain(S, sex) = 0.003
SplitInformation(S, income) = 1.406; SplitInformation(S, AI student) = 0.954; SplitInformation(S, sex) = 0.954
GainRatio(S, income) = 0.323; GainRatio(S, AI student) = 0.364; GainRatio(S, sex) = 0.0031
Based on the information gain, we selected ”income” for the first split; based on the gain ratio, we would select ”AI student” for the first split.
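A sketch that reproduces the split information and gain ratio values used here (subset sizes are passed in explicitly; the names are assumptions):

    import math

    def split_information(subset_sizes):
        """SplitInformation(S, A) = -sum_j (nSj/nS) * log2(nSj/nS)."""
        n_s = sum(subset_sizes)
        return -sum(n / n_s * math.log2(n / n_s) for n in subset_sizes if n > 0)

    def gain_ratio(gain, subset_sizes):
        return gain / split_information(subset_sizes)

    print(round(split_information([3, 1, 4]), 3))  # income:     1.406
    print(round(split_information([5, 3]), 3))     # AI student: 0.954
    print(round(gain_ratio(0.454, [3, 1, 4]), 3))  # 0.323
    print(round(gain_ratio(0.347, [5, 3]), 3))     # 0.364
    print(round(gain_ratio(0.003, [5, 3]), 4))     # sex: 0.0031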

SLIDE 33

Overfitting (1)

Building trees that “adapt too much” to the training examples may lead to “overfitting”. Consider the error of a hypothesis (model, classifier) h over: the training data D: errorD(h) = empirical error; the entire distribution X of the data: errorX(h) = expected error. Hypothesis h overfits the training data if there is an alternative hypothesis h' ∈ H that behaves better over unseen data, i.e., errorD(h) < errorD(h') and errorX(h') < errorX(h). For such a pair of hypotheses, h is better on the training data but worse overall (on new data/testing).

SLIDE 34

Overfitting (2)

Overfitting: an induced tree may overfit the training data: too many branches/nodes, some of which may reflect anomalies due to noise or outliers, and poor accuracy for unseen samples.
(Figure, example of overfitting in decision tree learning: as the number of nodes grows, accuracy on the training set keeps increasing while accuracy on the test set eventually drops.)
Two strategies: stop growing the tree earlier (forward pruning or pre-pruning); allow the tree to overfit the data, and then post-prune the tree.

SLIDE 35

Reduced-Error Pruning – Post-pruning

Each node is a candidate for pruning. Pruning consists in removing the sub-tree rooted in a node: the node becomes a leaf and is assigned the most common (majority) classification. Nodes are removed only if the resulting tree performs no worse on the validation set. Nodes are pruned iteratively: at each iteration the node whose removal most increases accuracy on the validation set is pruned, and pruning stops when no pruning increases accuracy.
(Figure: the unpruned fruit tree makes no error (0/16) on the learning set; after pruning the ”skin” sub-tree into a leaf assigned the majority class, the tree makes 1 error (1/16) on the learning set.)
Example scenarios of pruning (accuracy on the validation set):
Scenario 1: unpruned 80%, pruned 85% → accept pruning
Scenario 2: unpruned 80%, pruned 70% → leave unpruned

SLIDE 36

Cost-Complexity Pruning – Post-pruning

In reality, data is imperfect and we do not want to reflect all cases in the tree: avoid overly complex trees and generalize better. An ”ideal” tree should be error-free and simple (small).
Cost-complexity: minimize E/N + α·L = error on the learning set + α·L, where E is the number of classification errors, N is the number of learning examples, L is the number of leaves, and α is a trade-off coefficient (how much one leaf less allows in new error; how much one error more is worth in pruned leaves).
Example for α = 2/16 (the fruit trees from the previous slide): unpruned tree: error = E/N = 0/16, leaves = L = 6, cost = 0/16 + 2/16·6 = 0.75; pruned tree: error = E/N = 1/16, leaves = L = 5, cost = 1/16 + 2/16·5 = 0.6875. The tree with the smaller cost is more preferred.
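The cost computation for the two trees above, written out as a small function (a sketch; the signature is an assumption):

    def cost(errors, n_examples, leaves, alpha):
        """Cost-complexity criterion: E/N + alpha * L."""
        return errors / n_examples + alpha * leaves

    print(cost(0, 16, 6, 2/16))  # unpruned: 0.75
    print(cost(1, 16, 5, 2/16))  # pruned:   0.6875 -> preferred (lower cost)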

SLIDE 37

C4 – ID3 with Cost-Complexity Pruning

C4 = ID3 + cost-complexity pruning. C4 considers all possible cuts and computes the errors on the learning set. Example for α = 1/16, with six candidate trees ranging from the full fruit tree down to a single leaf:
Tree 1 (full tree): error = 0/16, leaves = 6, cost = 0/16 + 1/16·6 = 0.375
Tree 2: error = 1/16, leaves = 5, cost = 0.375
Tree 3: error = 2/16, leaves = 5, cost = 0.4375
Tree 4: error = 2/16, leaves = 4, cost = 0.375
Tree 5: error = 4/16, leaves = 3, cost = 4/16 + 1/16·3 = 0.4375
Tree 6 (single leaf): error = 6/16, leaves = 1, cost = 0.4375
In practice (for larger data sets), α is assumed to be much smaller.

SLIDE 38

Dealing with Inconsistencies

Let us assume a new example 9 (income = low, AI student = yes, sex = male, buy = yes) is added to the table from slide 5, implying inconsistency of the data: it conflicts with example 4 (low, yes, male, no). The tree grown by ID3 from the nine examples now contains a leaf holding the inconsistent pair {4 (N), 9 (Y)}, so the error on the learning set is 1/9; the tree has 5 leaves.
There is no possibility to construct a perfect tree. What about stopping its construction earlier?

SLIDE 39

Pre-Pruning (1)

Stop growing the tree earlier (forward pruning or pre-pruning): prevent the constructed tree from growing too much if some stopping condition is met; the node is then turned into a leaf labeled with the majority class. Different quality measures can be used for evaluating the partition in a given node: stop if the vast majority of examples in the node belong to one class; stop if the (relative) number of examples is too small; stop if the (relative) information gain is too small.
Example (stop when the number of examples from one class ≥ threshold = 66%): in the node holding {1(Y), 2(Y), 3(Y), 4(N), 5(N), 9(Y)}, four of the six examples are Y, so stop expanding the tree and create a leaf with the majority class Y. The unpruned variant (i.e., what would be obtained if there were no stop) would further split this node on income and sex.
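The first stopping condition (a majority share in a node) can be sketched as a simple check on the class distribution (the names and the threshold default are assumptions):

    from collections import Counter

    def should_stop(labels, threshold=2/3):
        """Stop expanding if the majority class covers at least `threshold` of the node."""
        majority_class, count = Counter(labels).most_common(1)[0]
        return count / len(labels) >= threshold, majority_class

    # Node {1(Y), 2(Y), 3(Y), 4(N), 5(N), 9(Y)}: 4 of the 6 examples are Y.
    print(should_stop(["Y", "Y", "Y", "N", "N", "Y"]))  # (True, 'Y') -> create a leaf Y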

SLIDE 40

Pre-Pruning (2)

Stop if adding the next node does not improve the misclassification error (error = 1 - accuracy) on the learning set by at least some pre-defined threshold. In the example below the threshold is set to 0, so the tree construction is stopped as soon as the misclassification error is not improved. At every stage the working hypothesis assigns the majority class to each unexpanded node:
1. A single node assigning the majority class N to all of {1(Y), 2(Y), 3(Y), 4(N), 5(N), 6(N), 7(N), 8(N), 9(Y)}: error = 4/9 → continue.
2. Split on AI student: no → N {6(N), 7(N), 8(N)}; yes → majority Y over {1(Y), 2(Y), 3(Y), 4(N), 5(N), 9(Y)}: error = 2/9 → continue.
3. Split the AI student = yes node on income: high → Y {3(Y)}; medium → Y {1(Y), 2(Y)}; low → majority N over {4(N), 5(N), 9(Y)}: error = 1/9 → continue.
4. Splitting the income = low node on sex (female → N {5(N)}; male → {4(N), 9(Y)}) still gives error = 1/9: no improvement → stop (finish earlier).

SLIDE 41

Summary (1)

Given the decision tree shown in the figure, which tests condition attributes A, B, and C and assigns a decision (class Y or N), answer the following questions:
I) Indicate the root, the internal nodes, and the leaf nodes.
II) Compute the number of levels and leaves as well as the tree depth.
III) Formulate the underlying decision rules for class Y and N.
IV) Formulate the disjunctions of conjunctions for class Y and N.
V) Classify the two examples with the descriptions provided in the table:

   A  B  C  D
1  Z  E  L  ?
2  Z  G  L  ?

SLIDE 42

Summary (2)

Given the information table below, referring to condition attributes A, B, and C and a decision attribute D (class Y or N), answer the following questions:

   A  B  C  D
1  X  E  K  Y
2  X  E  L  Y
3  X  F  L  Y
4  X  G  K  N
5  X  G  L  N
6  Z  E  L  N
7  Z  G  L  N
8  Z  E  K  N

I) Calculate the entropy for n = 2 and p1 = p2 = 0.5, and for any n with pi = 1/n for i = 1, …, n.
II) Knowing that:
Ent(D) = – 3/8·log2(3/8) – 5/8·log2(5/8) = 0.955
Ent(D, A) = 5/8·(– 3/5·log2(3/5) – 2/5·log2(2/5)) + 3/8·(– 0·log2(0) – 1·log2(1)) = 0.607
Ent(D, C) = 3/8·(– 1/3·log2(1/3) – 2/3·log2(2/3)) + 5/8·(– 2/5·log2(2/5) – 3/5·log2(3/5)) = 0.951
compute InformationGain(D, A) and InformationGain(D, C).
III) Compute Ent(D, B) and InformationGain(D, B).
IV) Which attribute would be selected for the top-level split? Draw the tree after this split.

SLIDE 43

Summary (3)

Given the information table (see Summary (2)) and the decision tree after the top-level split shown below, consider examples 1, 2, 6, and 8 in the left leaf, and solve the following tasks:

B = E → {1(Y), 2(Y), 6(N), 8(N)};  B = F → {3(Y)}: Y;  B = G → {4(N), 5(N), 7(N)}: N

I) Knowing that:
Ent(D) = – 2/4·log2(2/4) – 2/4·log2(2/4) = 1
Ent(D, C) = 2/4·(– 1/2·log2(1/2) – 1/2·log2(1/2)) + 2/4·(– 1/2·log2(1/2) – 1/2·log2(1/2)) = 1
compute InformationGain(D, C).
II) Compute Ent(D, A) and InformationGain(D, A).
III) Which attribute would be selected for the split in the left node? Draw the decision tree. Is it a complete final tree? Has it used all attributes? Why was one of them not needed?

SLIDE 44

Summary (4)

Given the information table (see Summary (2)), referring to condition attributes A, B, and C and a decision attribute D (class Y or N), answer the following questions:
I) Knowing that:
Ent(D) = – 3/8·log2(3/8) – 5/8·log2(5/8) = 0.955
Ent(A) = – 3/8·log2(3/8) – 5/8·log2(5/8) = 0.955
Ent(B) = – 4/8·log2(4/8) – 1/8·log2(1/8) – 3/8·log2(3/8) = 1.406
InformationGain(D, A) = 0.955 – 0.607 = 0.348
InformationGain(D, B) = 0.955 – 0.500 = 0.455
InformationGain(D, C) = 0.955 – 0.951 = 0.004
compute GainRatio(D, A) and GainRatio(D, B).
II) Compute Ent(C) and GainRatio(D, C).
III) Which attribute would be selected for the top-level split based on GainRatio? Draw the tree after this split. Subsequently, continue with the calculations so as to make the decision tree complete.

SLIDE 45

Summary (5)

Given the information table (see Summary (2)) referring to condition attributes A, B, and C and a decision attribute D (class Y or N), including an additional example 9 (A = Z, B = G, C = L, D = Y) implying inconsistency of the data, the following tree has been obtained after the first iteration using ID3 with information gain:

B = E → {1(Y), 2(Y), 6(N), 8(N)};  B = F → {3(Y)}: Y;  B = G → {4(N), 5(N), 7(N), 9(Y)}

I) Assuming that the tree is pre-pruned if the share of examples from a single class in a given node is at least 2/3, which nodes would be expanded and which not (what class would be assigned to the respective leaf/leaves)? Draw the complete tree.

SLIDE 46

Summary (6)

Given the information table (see Summary (2)) referring to condition attributes A, B, and C and a decision attribute D (class Y or N), including an additional example 9 (A = Z, B = G, C = L, D = Y) implying inconsistency of the data, the following tree has been obtained after the first iteration using ID3 with information gain:

B = E → {1(Y), 2(Y), 6(N), 8(N)};  B = F → {3(Y)}: Y;  B = G → {4(N), 5(N), 7(N), 9(Y)}

II) Assuming that the tree is pre-pruned if the expansion of a node does not improve the classification accuracy on the learning set, which nodes would be expanded and which not (what class would be assigned to the respective leaf/leaves)? Draw the complete tree.

SLIDE 47

Summary (7)

Given the information table (see Summary (2)) referring to condition attributes A, B, and C and a decision attribute D (class Y or N), including an additional example 9 (A = Z, B = G, C = L, D = Y) implying inconsistency of the data, the following tree has been obtained using ID3 with information gain, with classification error 1/9 on the learning set:

B = F → Y {3(Y)}; B = E → A: A = X → Y {1(Y), 2(Y)}, A = Z → N {6(N), 8(N)}; B = G → A: A = X → N {4(N), 5(N)}, A = Z → N {7(N), 9(Y)} (1 error)

I) Knowing that for the original unpruned tree: cost = 1/9 + 0.05·5 = 0.361, and for the simplest tree composed of a single node assigning all examples to N: cost = 4/9 + 0.05·1 = 0.494, consider the remaining three possible trees using C4 driven by cost-complexity pruning parameterized with α = 0.05. What are their costs? Which tree would be finally selected using such post-pruning?
