SLIDE 1

Finding Explanations

Instead of finding structure in a data set, we now focus on methods that find explanations for an unknown dependency within the data.

Given: Dataset D = {(x_i, Y_i) | i = 1, ..., n} with n tuples

x: object description
Y: target attribute
  nominal: classification problem
  numerical: regression problem

Data analysis is
supervised (because we know the desired outcome) and
descriptive (because we care about explanation)

Compendium slides for “Guide to Intelligent Data Analysis”, Springer 2011. © Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn and Iris Adä.
SLIDE 2

Bayes Classifiers

Given: Dataset D = {(x_i, Y_i) | i = 1, ..., n} with n tuples

x: object description
Y: nominal target attribute ⇒ classification problem

Bayes classifiers express their model in terms of simple probabilities. They provide a “gold standard” for evaluating other learning algorithms: any other model should perform at least as well as the naïve Bayes classifier.

Suggestion

Before trying to apply more complex models, a quick look at a Bayes classifier can be helpful to get a feeling for realistic accuracy expectations and simple dependencies in the data.

SLIDE 3

Bayes’ theorem

P(h|E) = P(E|h) · P(h) / P(E)

Interpretation

The probability P(h|E) that a hypothesis h is true given that event E has occurred can be derived from:

P(h): the probability of the hypothesis h itself
P(E): the probability of the event E
P(E|h): the conditional probability of the event E given the hypothesis h
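As a sketch, Bayes' theorem is a one-liner; the probabilities below are made up purely for illustration.

```python
def posterior(p_e_given_h, p_h, p_e):
    """Bayes' theorem: P(h|E) = P(E|h) * P(h) / P(E)."""
    return p_e_given_h * p_h / p_e

# hypothetical values: P(E|h) = 0.9, P(h) = 0.2, P(E) = 0.3
p = posterior(0.9, 0.2, 0.3)  # ≈ 0.6
```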

SLIDE 4

Choosing Hypotheses

We want the most probable hypothesis h ∈ H for a given event E.

Maximum a posteriori hypothesis:

h_MAP = argmax_{h ∈ H} P(h|E) = argmax_{h ∈ H} P(E|h) · P(h) / P(E) = argmax_{h ∈ H} P(E|h) · P(h)

Maximum likelihood

If we assume that every hypothesis h ∈ H is equally probable a priori (P(h_i) = P(h_j) for all h_i, h_j ∈ H), we can simplify the equation further and get the maximum likelihood hypothesis:

h_ML = argmax_{h ∈ H} P(E|h)
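The two argmax rules differ only in whether the prior is multiplied in; a minimal sketch over a made-up two-hypothesis space, chosen so that the MAP and ML answers disagree:

```python
# Hypothetical hypothesis space with priors P(h) and likelihoods P(E|h):
priors = {"h1": 0.1, "h2": 0.9}
likelihoods = {"h1": 0.8, "h2": 0.2}

# P(E) is the same for every h, so it can be dropped from the argmax.
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])  # "h2"
h_ml = max(priors, key=lambda h: likelihoods[h])               # "h1"
```

Here the strong prior on h2 overrides its weaker likelihood, so h_MAP ≠ h_ML.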

SLIDE 5

Bayes classifiers

The probability P(h) can be estimated easily from a given data set D:

P(h) = (no. of data objects from class h) / (no. of data objects)

In principle, the probability P(E|h) could be determined analogously based on the values of the attributes A_1, ..., A_m, i.e. the attribute vector E = (a_1, ..., a_m):

P(E|h) = (no. of data objects from class h with values (a_1, ..., a_m)) / (no. of data objects from class h)

SLIDE 6

Bayes classifiers

Problem

For 10 nominal attributes A_1, ..., A_10, each having three possible values, we would need 3^10 = 59049 data objects to have at least one example per combination. Therefore, the computation is carried out under the (naïve, unrealistic) assumption that the attributes A_1, ..., A_m are independent given the class, i.e.

P(E = (a_1, ..., a_m)|h) = P(a_1|h) · ... · P(a_m|h) = ∏_{a_i ∈ E} P(a_i|h)

Each P(a_i|h) can be computed easily:

P(a_i|h) = (no. of data objects from class h with A_i = a_i) / (no. of data objects from class h)
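Estimating P(a_i|h) is plain counting; a minimal sketch over a made-up nominal dataset (rows of attribute values with the class label in the last position):

```python
def p_attr_given_class(rows, attr_index, value, cls):
    """Estimate P(A_i = a_i | h) as a relative frequency within class h."""
    class_rows = [r for r in rows if r[-1] == cls]
    return sum(1 for r in class_rows if r[attr_index] == value) / len(class_rows)

# made-up rows: (A1, A2, class)
rows = [("a", "x", "c1"), ("a", "y", "c1"), ("b", "x", "c2")]
p = p_attr_given_class(rows, 0, "a", "c1")  # 2/2 = 1.0
```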

SLIDE 7

Naïve Bayes classifier

Given: a data set with only nominal attributes. Based on the values a_1, ..., a_m of the attributes A_1, ..., A_m, a prediction for the value of the class attribute H should be derived:

For each class h ∈ H compute the likelihood L(h|E) under the assumption that the A_1, ..., A_m are independent given the class:

L(h|E) = ∏_{a_i ∈ E} P(a_i|h) · P(h)

Assign E to the class h ∈ H with the highest likelihood:

pred(E) = argmax_{h ∈ H} L(h|E)

This Bayes classifier is called naïve because of the (conditional) independence assumption for the attributes A_1, ..., A_m. Although this assumption is unrealistic in most cases, the classifier often yields good results when not too many attributes are correlated.

SLIDE 8

Example

Given the dataset D:

ID  Height  Weight  Long hair  Sex
 1  m       n       n          m
 2  s       l       y          f
 3  t       h       n          m
 4  s       n       y          f
 5  t       n       y          f
 6  s       l       n          f
 7  s       h       n          m
 8  m       n       n          f
 9  m       l       y          f
10  t       n       n          m

we want to predict the sex (male or female) of a person x with the following attribute values: x = (Height = tall, Weight = low, Long hair = yes)

SLIDE 9

Example

We need to calculate

L(Sex = m|Height = t, Weight = l, Long hair = y)
  = P(Height = t|Sex = m) · P(Weight = l|Sex = m) · P(Long hair = y|Sex = m) · P(Sex = m)

and

L(Sex = f|Height = t, Weight = l, Long hair = y)
  = P(Height = t|Sex = f) · P(Weight = l|Sex = f) · P(Long hair = y|Sex = f) · P(Sex = f).

SLIDE 10

Example

P(Height = t|Sex = m)

(dataset D as on Slide 8)

SLIDE 11

Example

P(Height = t|Sex = m)

(dataset D as on Slide 8)

SLIDE 12

Example

P(Height = t|Sex = m) = 2/4 = 1/2

(dataset D as on Slide 8)

SLIDE 13

Example

P(Weight = l|Sex = m) = 0/4 = 0

(dataset D as on Slide 8)

SLIDE 14

Example

P(Long hair = y|Sex = m) = 0/4 = 0

(dataset D as on Slide 8)

SLIDE 15

Example

P(Sex = m) = 4/10 = 2/5

(dataset D as on Slide 8)

SLIDE 16

Example

L(Sex = m|Height = t, Weight = l, Long hair = y) = 2/4 · 0/4 · 0/4 · 4/10 = 1/2 · 0 · 0 · 2/5 = 0

⇒ the likelihood of person x being male is 0.

(dataset D as on Slide 8)

SLIDE 17

Example

P(Height = t|Sex = f)

(dataset D as on Slide 8)

SLIDE 18

Example

P(Height = t|Sex = f)

(dataset D as on Slide 8)

SLIDE 19

Example

P(Height = t|Sex = f) = 1/6

(dataset D as on Slide 8)

SLIDE 20

Example

P(Weight = l|Sex = f) = 3/6 = 1/2

(dataset D as on Slide 8)

SLIDE 21

Example

P(Long hair = y|Sex = f) = 4/6 = 2/3

(dataset D as on Slide 8)

SLIDE 22

Example

P(Sex = f) = 6/10 = 3/5

(dataset D as on Slide 8)

SLIDE 23

Example

L(Sex = f|Height = t, Weight = l, Long hair = y) = 1/6 · 3/6 · 4/6 · 6/10 = 1/6 · 1/2 · 2/3 · 3/5 = 1/30 > 0

⇒ the likelihood of person x being female is 1/30.

(dataset D as on Slide 8)

SLIDE 24

Example

L(Sex = f|Height = t, Weight = l, Long hair = y) = 1/30
L(Sex = m|Height = t, Weight = l, Long hair = y) = 0

⇒ person x = (Height = tall, Weight = low, Long hair = yes) is classified as female (f).

Notice

The data set D does not contain any object with this combination of values. ⇒ A full Bayes classifier would not be able to classify this object.
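The numbers above can be checked mechanically; a sketch using exact fractions over the dataset D from Slide 8:

```python
from fractions import Fraction

# dataset D from the slides: (Height, Weight, Long hair, Sex)
D = [("m","n","n","m"), ("s","l","y","f"), ("t","h","n","m"),
     ("s","n","y","f"), ("t","n","y","f"), ("s","l","n","f"),
     ("s","h","n","m"), ("m","n","n","f"), ("m","l","y","f"),
     ("t","n","n","m")]

def L(x, sex):
    """Naive Bayes likelihood L(Sex = sex | x) with exact fractions."""
    class_rows = [r for r in D if r[3] == sex]
    l = Fraction(len(class_rows), len(D))    # P(Sex = sex)
    for i, a in enumerate(x):                # P(a_i | Sex = sex)
        l *= Fraction(sum(1 for r in class_rows if r[i] == a), len(class_rows))
    return l

x = ("t", "l", "y")
print(L(x, "f"))  # 1/30
print(L(x, "m"))  # 0
```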

SLIDE 25

More examples

Input (m, n, n): L(m|...) = 1/4 · 2/4 · 4/4 · 4/10 = 1/20, L(f|...) = 2/6 · 3/6 · 2/6 · 6/10 = 1/30 ⇒ class m

The object (m, n, n) is classified as m although the data set contains two such objects, one from class m and one from class f. The main impact comes from the attribute Long hair = n, which has probability 1 in class m but a low probability in class f.

(dataset D as on Slide 8)

SLIDE 26

More examples

Input (t, h, n): L(m|...) = 2/4 · 2/4 · 4/4 · 4/10 = 1/10, L(f|...) = 1/6 · 0/6 · 2/6 · 6/10 = 0 ⇒ class m

Input (t, h, y): L(m|...) = 2/4 · 2/4 · 0/4 · 4/10 = 0, L(f|...) = 1/6 · 0/6 · 4/6 · 6/10 = 0 ⇒ class ?

The object (t, h, y) cannot be classified since the likelihood is zero for both classes.

(dataset D as on Slide 8)

SLIDE 27

Laplace correction

If a single likelihood is zero, then the overall likelihood is zero automatically, even then when the other likelihoods are high. Input L(m| . . .) L(f| . . .) Class (t, h, y)

2 4 · 2 4 · 0 4 · 4 10 = 0 1 6 · 0 6 · 4 6 · 6 10 = 0

? A solution is the usage of the Laplace correction γ: P(y) = ny n ⇒ ˆ P(y) = γ + ny γ · |dom(Y )| + n P(x|y) = nhx ny ⇒ ˆ P(x|y) = γ + nyx γ · |dom(X)| + ny n no. of data ny no. of data from class y nyx no. of data from class y with value x for attribute X dom(X) no. of distinct values in X

SLIDE 28

Laplace correction

Example

Laplace correction for P(Height = . . . |Sex = m) with γ = 1:

ˆP(s|m) = (γ + n_{m,s}) / (γ · |dom(Height)| + n_m) = (1 + 1) / (1 · 3 + 4) = 2/7

Height  #  # + γ  P    ˆP
s       1  2      1/4  2/7
m       1  2      1/4  2/7
t       2  3      2/4  3/7

Notice

γ = 0: maximum likelihood estimation
Common choices: γ = 1 or γ = 1/2
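The corrected estimate is a one-line formula; a sketch (the function name is mine) reproducing the table above with exact fractions:

```python
from fractions import Fraction

def laplace_estimate(n_yx, n_y, dom_size, gamma=1):
    """Laplace-corrected estimate (gamma + n_yx) / (gamma * |dom(X)| + n_y)."""
    return Fraction(gamma + n_yx, gamma * dom_size + n_y)

# Height counts among the 4 males of the example: s:1, m:1, t:2; |dom(Height)| = 3
assert laplace_estimate(1, 4, 3) == Fraction(2, 7)  # s (and likewise m)
assert laplace_estimate(2, 4, 3) == Fraction(3, 7)  # t
```

Note that the corrected probabilities still sum to 1 over the domain, and no estimate is exactly zero anymore.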

SLIDE 29

Naïve Bayes classifier: Implementation

The frequency counting should be carried out once, when the naïve Bayes classifier is constructed. The probability distributions for the single attributes should be stored in a table. When the naïve Bayes classifier is applied to new data, only the corresponding values in the table need to be multiplied.

SLIDE 30

Treatment of missing values

During learning: The missing values are simply not counted for the frequencies of the corresponding attribute. During classification: Only the probabilities (likelihoods) of those attributes are multiplied for which a value is available.

SLIDE 31

Numerical attributes

Assume a normal distribution for a numerical attribute X:

f(x | y) = 1 / (√(2π) · σ_X|y) · exp( −(x − µ_X|y)² / (2σ²_X|y) )

Estimation of the mean value:

ˆµ_X|y = (1/n_y) · Σ_{i=1..n} τ(y_i = y) · x_i[X]

Estimation of the variance:

ˆσ²_X|y = (1/n′_y) · Σ_{i=1..n} τ(y_i = y) · (x_i[X] − ˆµ_X|y)²

n′_y = n_y: maximum likelihood estimation
n′_y = n_y − 1: unbiased estimation
τ(y_i = y) = 1 if y_i = y, else 0
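The estimators above can be sketched directly; for one class, the indicator τ just selects that class's values, so the code takes the already-filtered values as input (function names are mine).

```python
import math

def gaussian_mle(values, unbiased=True):
    """Mean and variance estimates for one numerical attribute, one class."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / ((n - 1) if unbiased else n)
    return mu, var

def normal_density(x, mu, var):
    """f(x|y) for the fitted class-conditional normal distribution."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

mu, var = gaussian_mle([1.0, 2.0, 3.0])  # mu = 2.0, var = 1.0 (unbiased)
```

In a naïve Bayes classifier these densities replace the P(a_i|h) factors for numerical attributes.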

SLIDE 32

Example

100 data points, 2 classes
Small squares: mean values
Inner ellipses: one standard deviation
Outer ellipses: two standard deviations
Classes overlap: classification is not perfect

(Figure: naïve Bayes classifier)

SLIDE 33

Naïve Bayes classifier: Iris data

150 data points, 3 classes: Iris setosa (red), Iris versicolor (green), Iris virginica (blue)
Shown: 2 out of 4 attributes (sepal length, sepal width, petal length, petal width): petal length (horizontal) and petal width (vertical)
6 misclassifications on the training data (with all 4 attributes)

(Figure: naïve Bayes classifier)

SLIDE 34

Example

20 data points, 2 classes
Small squares: mean values
Inner ellipses: one standard deviation
Outer ellipses: two standard deviations
Attributes are not conditionally independent given the class

(Figure: naïve Bayes classifier)

SLIDE 35

Full Bayes classifiers

Restricted to metric/numeric attributes (only the class is nominal/symbolic).

Simplifying assumption: each class can be described by a multivariate normal distribution

f(x_M | y) = 1 / √((2π)^m · |Σ_XM|y|) · exp( −(1/2) · (x_M − µ_XM|y)ᵀ · Σ⁻¹_XM|y · (x_M − µ_XM|y) )

X_M: set of metric attributes
x_M: attribute vector
µ_XM|y: mean value vector for class y
Σ_XM|y: covariance matrix for class y

Intuitively

Each class has a bell-shaped probability density.

SLIDE 36

Full Bayes classifiers

Estimation of probabilities:

Estimation of the (class-conditional) mean value vector:

ˆµ_XM|y = (1/n_y) · Σ_{i=1..n} τ(y_i = y) · x_i[X_M]

x_i[X_M]: attribute vector of data point i restricted to the metric attributes X_M

Estimation of the (class-conditional) covariance matrix:

ˆΣ_XM|y = (1/n′_y) · Σ_{i=1..n} τ(y_i = y) · (x_i[X_M] − ˆµ_XM|y)(x_i[X_M] − ˆµ_XM|y)ᵀ

n′_y = n_y: maximum likelihood estimation
n′_y = n_y − 1: unbiased estimation
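For one class, these estimators reduce to a mean vector and an outer-product sum; a sketch over plain lists of points (the function name is mine, the example points are made up):

```python
def full_bayes_estimates(points, unbiased=True):
    """Class-conditional mean vector and covariance matrix for one class."""
    n, m = len(points), len(points[0])
    mu = [sum(p[j] for p in points) / n for j in range(m)]
    denom = (n - 1) if unbiased else n
    cov = [[sum((p[j] - mu[j]) * (p[k] - mu[k]) for p in points) / denom
            for k in range(m)] for j in range(m)]
    return mu, cov

mu, cov = full_bayes_estimates([(0.0, 0.0), (2.0, 2.0)])
# mu = [1.0, 1.0]; cov = [[2.0, 2.0], [2.0, 2.0]] (perfectly correlated pair)
```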

SLIDE 37

Naïve vs. full Bayes classifiers

(Figures: naïve Bayes classifier, full Bayes classifier)

Notice

Naïve Bayes classifiers for numerical data are equivalent to full Bayes classifiers with diagonal covariance matrices.

SLIDE 38

Full Bayes classifier: Iris data

150 data points, 3 classes: Iris setosa (red), Iris versicolor (green), Iris virginica (blue)
Shown: 2 out of 4 attributes (sepal length, sepal width, petal length, petal width): petal length (horizontal) and petal width (vertical)
2 misclassifications on the training data (with all 4 attributes)

(Figure: full Bayes classifier)

SLIDE 39

Summary

Pros:

Gold standard for comparison with other classifiers
High classification accuracy in many applications
The classifier can easily be adapted to new training objects
Integration of domain knowledge

Cons:

The conditional probabilities may not be available
The independence assumptions might not hold for the data set
