  1. Artificial Intelligence: Representation and Problem Solving, 15-381. April 17, 2007. Probabilistic Learning. Michael S. Lewicki, Carnegie Mellon. Reminder: no class on Thursday (spring carnival).

  2. Recall the basic algorithm for learning decision trees:
  1. start with the whole training data
  2. select the attribute or value along a dimension that gives the "best" split, using information gain or another criterion
  3. create child nodes based on the split
  4. recurse on each child using that child's data, until a stopping criterion is reached: all examples have the same class, the amount of data is too small, or the tree has grown too large
  Does this capture probabilistic relationships? (A minimal sketch of this recursion appears after this slide.)

  Quantifying the certainty of decisions. Suppose that instead of a yes-or-no answer we want some estimate of how strongly we believe a loan applicant is a credit risk. This might be useful if we want some flexibility in adjusting our decision criteria, e.g. if we are willing to take more risk when times are good, or if we want to examine cases we believe are higher risks more carefully.

  Predicting credit risk:

    <2 years at current job? | missed payments? | defaulted?
    N                        | N                | N
    Y                        | N                | Y
    N                        | N                | N
    N                        | N                | N
    N                        | Y                | Y
    Y                        | N                | N
    N                        | Y                | N
    N                        | Y                | Y
    Y                        | N                | N
    Y                        | N                | N
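  To make the recursion above concrete, here is a minimal Python sketch of the learner, assuming categorical attributes and information gain as the split criterion. The names (information_gain, build_tree, min_size) are illustrative, not from the lecture.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting `rows` on attribute index `attr`."""
    split_entropy = 0.0
    for v in set(r[attr] for r in rows):
        sub = [l for r, l in zip(rows, labels) if r[attr] == v]
        split_entropy += len(sub) / len(labels) * entropy(sub)
    return entropy(labels) - split_entropy

def build_tree(rows, labels, min_size=2):
    # Stopping criteria from the slide: one class left, or too little data.
    if len(set(labels)) == 1 or len(rows) < min_size:
        return Counter(labels).most_common(1)[0][0]   # leaf = majority class
    # Step 2: pick the attribute giving the "best" split by information gain.
    attr = max(range(len(rows[0])), key=lambda a: information_gain(rows, labels, a))
    # Steps 3-4: create a child per attribute value and recurse on its data.
    children = {}
    for v in set(r[attr] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[attr] == v]
        children[v] = build_tree([rows[i] for i in idx], [labels[i] for i in idx], min_size)
    return (attr, children)
```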

  3. The mushroom data. Or suppose we wanted to know how likely a mushroom is to be safe to eat. Do decision trees give us that information?

  Mushroom data (first 15 records):

    #  | EDIBLE?   | CAP-SHAPE | CAP-SURFACE
    1  | edible    | flat      | fibrous
    2  | poisonous | convex    | smooth
    3  | edible    | flat      | fibrous
    4  | edible    | convex    | scaly
    5  | poisonous | convex    | smooth
    6  | edible    | convex    | fibrous
    7  | poisonous | flat      | scaly
    8  | poisonous | flat      | scaly
    9  | poisonous | convex    | fibrous
    10 | poisonous | convex    | fibrous
    11 | poisonous | flat      | smooth
    12 | edible    | convex    | smooth
    13 | poisonous | knobbed   | scaly
    14 | poisonous | flat      | smooth
    15 | poisonous | flat      | fibrous

  Fisher's Iris data: [scatter plot of petal width (cm) vs. petal length (cm), showing the Iris setosa, Iris versicolor, and Iris virginica clusters.] In which example would you be more confident about the class? Decision trees provide a classification but not uncertainty.

  4. The general classification problem. Given data, we want to learn a model that can correctly classify novel observations.
  - Desired output: a binary classification vector y = {y_1, ..., y_K}, where y_i = 1 if x ∈ C_i (class i) and 0 otherwise.
  - Model (e.g. a decision tree): defined by M parameters, θ = {θ_1, ..., θ_M}.
  - Data: a set of T observations, D = {x_1, ..., x_T}, each an N-dimensional vector (binary, discrete, or continuous), x_i = {x_1, ..., x_N}_i.
  How do we approach this probabilistically?

  The answer to all questions of uncertainty: apply Bayes' rule to infer the most probable class given the observation:

    p(C_k | x) = p(x | C_k) p(C_k) / p(x) = p(x | C_k) p(C_k) / Σ_j p(x | C_j) p(C_j)

  This is the answer, but what does it mean? How do we specify the terms? p(C_k) is the prior probability of the different classes, and p(x | C_k) is the data likelihood, i.e. the probability of x given class C_k. How should we define it? (A small numeric sketch follows.)
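  As a small illustration of Bayes' rule above, this sketch normalizes prior-weighted likelihoods into a posterior over classes; the numeric values are invented for the example.

```python
def posterior(likelihoods, priors):
    """p(C_k | x) = p(x | C_k) p(C_k) / sum_j p(x | C_j) p(C_j)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)              # p(x), the normalizing constant
    return [j / evidence for j in joint]

# Two classes, equal priors, likelihoods 0.3 vs. 0.1:
print(posterior([0.3, 0.1], [0.5, 0.5]))   # -> [0.75, 0.25]
```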

  5. What classifier would give "optimal" performance? Consider the iris data again. How would we minimize the number of future misclassifications? We would need to know the true distribution of the classes; assume they follow a Gaussian distribution. The number of samples in each class is the same (50), so assume p(C_k) is equal for all classes. Because p(x) is also the same for all classes, we have:

    p(C_k | x) = p(x | C_k) p(C_k) / p(x) ∝ p(x | C_k) p(C_k)

  [Figure: the class-conditional densities p(petal length | C_2) and p(petal length | C_3), shown above the petal width vs. petal length scatter plot.]

  Where do we put the boundary? (A likelihood-comparison sketch appears after this slide.)
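  A sketch of the comparison this slide sets up: with equal priors and a shared p(x), classifying x reduces to comparing the class-conditional densities. The Gaussian means and standard deviations below are illustrative guesses for the versicolor (C_2) and virginica (C_3) petal lengths, not values from the lecture.

```python
import math

def gauss_pdf(x, mu, sigma):
    """Gaussian density N(x; mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

mu2, s2 = 4.3, 0.5   # assumed p(petal length | C_2), versicolor
mu3, s3 = 5.5, 0.6   # assumed p(petal length | C_3), virginica

x = 5.0
# Equal priors and a common p(x) cancel, so compare likelihoods directly:
label = "C3" if gauss_pdf(x, mu3, s3) > gauss_pdf(x, mu2, s2) else "C2"
print(label)   # -> C3 with these assumed parameters
```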

  6. Where do we put the boundary? [Figure: the two class-conditional densities p(petal length | C_2) and p(petal length | C_3) with a candidate decision boundary between them. R_23 is the region in which C_2 is misclassified as C_3; R_32 is the region in which C_3 is misclassified as C_2.] Shifting the boundary trades off the two errors.

  7. Where do we put the boundary? The misclassification error is defined by

    p(error) = ∫_{R_32} p(C_3 | x) dx + ∫_{R_23} p(C_2 | x) dx

  which in our case is proportional to the data likelihood. [Figure: with the boundary shifted, there is a region where p(C_3 | x) > p(C_2 | x), but we are still classifying that region as C_2!] (A numeric sketch of this trade-off follows.)
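  To see the trade-off numerically, the sketch below computes the two error masses as the boundary shifts, using the same assumed Gaussian class-conditionals as before (illustrative parameters, with SciPy supplying the tail integrals).

```python
from scipy.stats import norm

mu2, s2 = 4.3, 0.5   # assumed p(x | C_2)
mu3, s3 = 5.5, 0.6   # assumed p(x | C_3)

def errors(boundary):
    r23 = norm.sf(boundary, mu2, s2)    # C_2 mass beyond the boundary (labeled C_3)
    r32 = norm.cdf(boundary, mu3, s3)   # C_3 mass below the boundary (labeled C_2)
    return r23, r32

for b in (4.6, 4.9, 5.2):
    r23, r32 = errors(b)
    # With equal priors p(C_2) = p(C_3) = 1/2, the total error is the average.
    print(f"boundary={b}: R23={r23:.3f}  R32={r32:.3f}  total={(r23 + r32) / 2:.3f}")
```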

  8. The optimal decision boundary. The misclassification error is minimized at the point where p(C_3 | x) = p(C_2 | x):

    p(x | C_3) p(C_3) / p(x) = p(x | C_2) p(C_2) / p(x)
    ⇒ p(x | C_3) = p(x | C_2)

  since the priors are equal and p(x) cancels. Note: this assumes we have only two classes. [Figure: the posteriors p(C_2 | petal length) and p(C_3 | petal length) cross at the optimal decision boundary, plotted with the likelihoods p(petal length | C_2) and p(petal length | C_3).] (A numeric root-finding sketch follows.)

  Bayesian classification for more complex models. Recall Bayes' rule for the class posterior:

    p(C_k | x) = p(x | C_k) p(C_k) / p(x) = p(x | C_k) p(C_k) / Σ_j p(x | C_j) p(C_j)

  How do we define the data likelihood p(x | C_k), i.e. the probability of x given class C_k?
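  A minimal root-finding sketch for the boundary condition p(x | C_3) = p(x | C_2), again under the assumed Gaussian parameters (not lecture values):

```python
from scipy.optimize import brentq
from scipy.stats import norm

mu2, s2 = 4.3, 0.5   # assumed p(x | C_2)
mu3, s3 = 5.5, 0.6   # assumed p(x | C_3)

# The optimal boundary is where the two class-conditional densities are equal.
f = lambda x: norm.pdf(x, mu3, s3) - norm.pdf(x, mu2, s2)
boundary = brentq(f, mu2, mu3)   # the sign change lies between the two means
print(f"optimal boundary ~ {boundary:.2f}")
```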

  9. Defining a probabilistic classification model. How would we define the credit risk problem?
  - Classes: C_1 = "defaulted", C_2 = "didn't default".
  - Data: x = {"<2 years", "missed payments"}.
  - Prior (from the data): p(C_1) = 3/10, p(C_2) = 7/10.
  - Likelihood: p(x_1, x_2 | C_1) = ? and p(x_1, x_2 | C_2) = ? How would we determine these?

  Defining a probabilistic model by counting. The "prior" is obtained by counting the number of records in each class:

    p(C_k = k) = Count(C_k = k) / #records

  The likelihood is obtained the same way:

    p(x = v | C_k = k) = Count(x = v ∧ C_k = k) / Count(C_k = k)
    p(x_1 = v_1, ..., x_N = v_N | C_k = k) = Count(x_1 = v_1 ∧ ... ∧ x_N = v_N ∧ C_k = k) / Count(C_k = k)

  This is the maximum likelihood estimate (MLE) of the probabilities. (A counting sketch over the credit-risk table follows.)
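  A counting sketch over the ten-row credit-risk table from earlier in the lecture (columns: <2 years at job, missed payments, defaulted). The helper names are mine, but the counts reproduce the slide's priors.

```python
from collections import Counter

rows = [
    ("N", "N", "N"), ("Y", "N", "Y"), ("N", "N", "N"), ("N", "N", "N"),
    ("N", "Y", "Y"), ("Y", "N", "N"), ("N", "Y", "N"), ("N", "Y", "Y"),
    ("Y", "N", "N"), ("Y", "N", "N"),
]

# Prior: p(C_k) = Count(C = k) / #records
class_counts = Counter(r[2] for r in rows)
print({k: v / len(rows) for k, v in class_counts.items()})   # {'N': 0.7, 'Y': 0.3}

# Joint likelihood by counting: p(x1, x2 | C_k) = Count(x1 ∧ x2 ∧ C_k) / Count(C_k)
def likelihood(x1, x2, c):
    return sum(1 for r in rows if r == (x1, x2, c)) / class_counts[c]

print(likelihood("Y", "N", "Y"))   # p(<2yrs=Y, missed=N | defaulted) = 1/3
```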
