SLIDE 1

Data mining

Machine Intelligence Thomas D. Nielsen September 2008

Data mining September 2008 1 / 37

SLIDE 2

What is Data Mining?

?

Introduction Data mining September 2008 2 / 37

SLIDE 4

What is Data Mining?

!

Introduction Data mining September 2008 2 / 37

SLIDE 5

What is Data Mining?

Data Mining in practice

Introduction Data mining September 2008 3 / 37

SLIDE 6

What is Data Mining?

Data Mining in practice: real-life data is preprocessed before an off-the-shelf algorithm is adapted to it.

Introduction Data mining September 2008 3 / 37

SLIDE 7

What is Data Mining?

Data Mining in practice: real-life data is preprocessed, an off-the-shelf algorithm is adapted to it, and the result is evaluated and iterated.

Introduction Data mining September 2008 3 / 37

SLIDE 8

What is Data Mining?

Data Mining in practice: real-life data is preprocessed (data/domain-specific operations), an off-the-shelf algorithm is adapted to it (general algorithmic methods), and the result is evaluated and iterated.

Introduction Data mining September 2008 3 / 37

SLIDE 9

What is Data Mining?

An overview
  • Supervised Learning: labeled data; classification; predictive modeling.
  • Unsupervised Learning: unlabeled data; clustering; descriptive modeling; rule mining, association analysis.

Introduction Data mining September 2008 4 / 37

SLIDE 10

Classification

A high-level view: input → Classifier → Spam (yes/no)

Classification Data mining September 2008 5 / 37

SLIDE 11

Classification

A high-level view

Attributes SubAllCap (yes/no), TrustSend (yes/no), InvRet (yes/no), Body'adult' (yes/no), Body'zambia' (yes/no) → Classifier → Spam (yes/no)

Classification Data mining September 2008 5 / 37

SLIDE 12

Classification

A high-level view: Cell-1 (1..64), Cell-2 (1..64), Cell-3 (1..64), ..., Cell-324 (1..64) → Classifier → Symbol (A..Z, 0..9)

Classification Data mining September 2008 5 / 37

SLIDE 13

Classification

Labeled Data: rows are Instances (Cases, Examples), columns are Attributes (Features, Predictor Variables), and the last column is the Class variable (Target variable).

  SubAllCap  TrustSend  InvRet  ...  B'zambia'  Spam
  y          n          n       ...  n          y
  n          n          n       ...  n          n
  n          y          n       ...  n          y
  n          n          n       ...  n          n
  ...

  Cell-1  Cell-2  Cell-3  ...  Cell-324  Symbol
  1       1       4       ...  12        B
  1       1       1       ...  3         1
  34      37      43      ...  22        Z
  ...

(In principle, any attribute can become the designated class variable.)

Classification Data mining September 2008 6 / 37

SLIDE 14

Classification

Classification in general
Attributes: variables A1, A2, . . . , An (discrete or continuous).
Class variable: variable C. Always discrete: states(C) = {c1, . . . , cl} (the set of class labels).
A (complete data) classifier is a mapping C : states(A1, . . . , An) → states(C).
A classifier able to handle incomplete data provides mappings C : states(Ai1, . . . , Aik) → states(C) for subsets {Ai1, . . . , Aik} of {A1, . . . , An}.
A classifier partitions the attribute-value space (also: instance space) into subsets labelled with class labels.

Classification Data mining September 2008 7 / 37
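To make the mapping view concrete, here is a minimal Python sketch of a complete-data classifier as an ordinary function from attribute values to a class label; the attribute names are reused from the spam example above, but the rule itself is invented purely for illustration.

```python
def classify(sub_all_cap, trust_send):
    """A mapping from the attribute-value space {yes,no} x {yes,no} to states(C)."""
    # The rule implicitly partitions the instance space into two labelled regions.
    if sub_all_cap == "yes" and trust_send == "no":
        return "spam"
    return "no spam"

print(classify("yes", "no"), classify("no", "yes"))   # spam / no spam
```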

SLIDE 15

Classification

Iris dataset

Measurements of petal width/length and sepal width/length for 150 flowers of 3 different species of Iris.

First reported in: Fisher, R.A., "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7 (1936).

  SL   SW   PL   PW   Species
  5.1  3.5  1.4  0.2  Setosa
  4.9  3.0  1.4  0.2  Setosa
  6.3  2.9  6.0  2.1  Virginica
  6.3  2.5  4.9  1.5  Versicolor
  ...

(SL/SW = sepal length/width, PL/PW = petal length/width; Species is the class variable.)

Classification Data mining September 2008 8 / 37

SLIDE 16

Classification

Labeled data in instance space:

Classification Data mining September 2008 9 / 37

SLIDE 17

Classification

Labeled data in instance space (Setosa, Versicolor, Virginica), with the partition defined by a classifier.

Classification Data mining September 2008 9 / 37

SLIDE 18

Classification

Decision Regions
  • Axis-parallel linear: e.g. decision trees
  • Piecewise linear: e.g. Naive Bayes
  • Nonlinear: e.g. neural networks

Classification Data mining September 2008 10 / 37

SLIDE 19

Classification

Classifiers differ in . . .
  • Model space: types of partitions and their representation.
  • How they compute the class label corresponding to a point in instance space (the actual classification task).
  • How they are learned from data.

Some important types of classifiers:
  • Decision trees
  • Naive Bayes classifier
  • Other probabilistic classifiers (TAN, . . . )
  • Neural networks
  • K-nearest neighbors

Classification Data mining September 2008 11 / 37

SLIDE 20

Decision Trees

Example Attributes: height ∈ [0, 2.5], sex ∈ {m, f}. Class labels: {tall, short}.

[Figure: partition of the instance space and its representation by a decision tree: the root splits on sex, then on height with thresholds 1.8 and 1.7, leading to leaves labeled short and tall.]

Decision trees Data mining September 2008 12 / 37

SLIDE 21

Decision Trees

A decision tree is a tree

  • whose internal nodes are labeled with attributes
  • whose leaves are labeled with class labels
  • edges going out from a node labeled with attribute A are labeled with subsets of states(A),

such that all labels combined form a partition of states(A).

Possible partitions:
  states(A) = R: ]−∞, 2.3[, [2.3, ∞[ or ]−∞, 1.9[, [1.9, 3.5[, [3.5, ∞[
  states(A) = {a, b, c}: {a}, {b}, {c} or {a, b}, {c}

Decision trees Data mining September 2008 13 / 37

SLIDE 22

Decision Trees

Decision tree classification Each point in the instance space is sorted into a leaf by the decision tree. It is classified according to the class label at that leaf.

[Figure: the instance [m, 1.85] is sorted down the tree and lands in a leaf labeled tall, so C([m, 1.85]) = tall.]

Decision trees Data mining September 2008 14 / 37
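A minimal Python sketch of this procedure on the height/sex example, representing the tree as nested tuples; the assignment of the 1.8 and 1.7 thresholds to the m and f branches is my reading of the figure, and all names are illustrative.

```python
tree = ("sex", {
    "m": ("height<1.8", {True: "short", False: "tall"}),
    "f": ("height<1.7", {True: "short", False: "tall"}),
})

def classify(node, instance):
    """Sort the instance into a leaf and return the class label found there."""
    if isinstance(node, str):                      # leaf: a class label
        return node
    attribute, children = node
    if attribute.startswith("height<"):            # numeric test against a threshold
        branch = instance["height"] < float(attribute.split("<")[1])
    else:                                          # discrete test on the attribute value
        branch = instance[attribute]
    return classify(children[branch], instance)

print(classify(tree, {"sex": "m", "height": 1.85}))   # tall
```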

SLIDE 23

Decision Trees

Learning a decision tree In general, we look for a small decision tree with minimal classification error over the data set (a1, c1), (a2, c2), . . . , (an, cn).

[Figure: two decision trees over attributes A and B with leaves labeled c1/c2: a small "good" tree and a larger "bad" tree fitting the same data.]

Note: if the data is noise-free, i.e. there are no instances (ai, ci), (aj, cj) with ai = aj and ci ≠ cj, then there always exists a decision tree with zero classification error.

Decision trees Data mining September 2008 15 / 37

SLIDE 24

Decision Trees

The ID3 algorithm [figure: partially constructed tree with root A (branches t/f), a leaf, and an open node X]

Decision trees Data mining September 2008 16 / 37

SLIDE 25

Decision Trees

The ID3 algorithm [figure: partially constructed tree with open nodes]

Top-down construction of the decision tree. For an "open" node X:
  • Let D(X) be the instances that can reach X.
  • If all instances agree on the class c, then label X with c and make it a leaf.
  • Otherwise, find the best attribute A and partition of states(A), replace X with A, and make an outgoing edge from A for each member of the partition.

Decision trees Data mining September 2008 16 / 37
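A compact Python sketch of this top-down construction for discrete attributes, using one branch per observed attribute value and the expected-entropy score defined on the following slides; all names are illustrative, and no pruning or extra termination heuristics are included.

```python
from collections import Counter
import math

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def id3(data, attributes):
    """data: list of (attribute_dict, class_label) pairs reaching this node."""
    labels = [c for _, c in data]
    if len(set(labels)) == 1 or not attributes:
        # all instances agree on the class (or nothing left to split on):
        # label the node with the (majority) class and make it a leaf
        return Counter(labels).most_common(1)[0][0]

    def expected_entropy(attr):
        groups = {}
        for a, c in data:
            groups.setdefault(a[attr], []).append(c)
        return sum(len(g) / len(data) * entropy(g) for g in groups.values())

    best = min(attributes, key=expected_entropy)      # best attribute = lowest expected entropy
    children = {}
    for value in {a[best] for a, _ in data}:           # one outgoing edge per observed value
        subset = [(a, c) for a, c in data if a[best] == value]
        children[value] = id3(subset, [x for x in attributes if x != best])
    return (best, children)

data = [({"A": "t", "B": "t"}, "c1"), ({"A": "t", "B": "f"}, "c1"),
        ({"A": "f", "B": "t"}, "c2"), ({"A": "f", "B": "f"}, "c2")]
print(id3(data, ["A", "B"]))   # splits on A only: ('A', {'t': 'c1', 'f': 'c2'})
```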

SLIDE 26

Decision Trees

Notes: The exact algorithm is formulated as a recursive procedure. One can modify the algorithm by providing weaker conditions for termination (necessary for noisy data):

  • If <some other termination condition applies>, turn X into a leaf with <most appropriate class label>.

Decision trees Data mining September 2008 17 / 37

SLIDE 27

Decision Trees

Scoring new partitions

[Figure: a partially constructed tree with root B (branches t/f), a leaf c1, and an open node X with the instances D(X) that reach it.]

Decision trees Data mining September 2008 18 / 37

SLIDE 28

Decision Trees

Scoring new partitions

[Figure: the open node X is replaced by a candidate attribute A with partition a1, a2, a3 of states(A), splitting D(X) into D(X1), D(X2), D(X3).]

For each candidate attribute A with partition a1, a2, a3 of states(A):
  • Let p_i(c) be the relative frequency of class label c in D(Xi).
  • Measure of uniformity of the class label distribution in D(Xi) (entropy):

      H_Xi := − Σ_{c ∈ states(C)} p_i(c) · log2 p_i(c)

  • Score of the new partition (negative expected entropy):

      Score(A, a1, a2, a3) := − Σ_{i=1..3} (|D(Xi)| / |D(X)|) · H_Xi

Decision trees Data mining September 2008 18 / 37
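The two quantities above, written as a short Python sketch: the entropy of the class-label distribution in a block D(Xi), and the score of a candidate partition as the negative expected entropy over its blocks.

```python
from collections import Counter
import math

def entropy(labels):
    """H = - sum over c of p(c) * log2 p(c), with p(c) the relative frequency of c."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def score(blocks):
    """blocks: the class labels in D(X_1), ..., D(X_k) for one candidate partition."""
    total = sum(len(b) for b in blocks)
    # negative expected entropy: 0 is best (all blocks pure), more negative is worse
    return -sum(len(b) / total * entropy(b) for b in blocks)

# D(X) split by a three-valued attribute A into D(X_1), D(X_2), D(X_3):
print(score([["c1", "c1", "c1"], ["c1", "c2"], ["c2", "c2", "c2"]]))   # -0.25
```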

SLIDE 29

Decision Trees

Searching for partitions
When trying attribute A, look for the partition of states(A) with the highest score. In practice we can try all choices for A, but cannot try all partitions of states(A). Therefore:

  • For states(A) = R: only consider partitions of the form ]−∞, r[, [r, ∞[, and pick the threshold r with minimal expected entropy. Example:
      A: 1 3 4 6 10 12 17 18 22 25
      C: y y y n n  y  y  y  n  n
  • For states(A) = {a1, . . . , ak}: only consider the partition {a1}, . . . , {ak}.

Decision trees Data mining September 2008 19 / 37
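A sketch of the threshold search for a continuous attribute, applied to the A/C example above: every midpoint between consecutive attribute values is tried as a cut point r, and the one with minimal expected entropy is kept. Names are illustrative.

```python
from collections import Counter
import math

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (r, expected_entropy) for the best split ]-inf, r[ , [r, inf[."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        r = (pairs[i - 1][0] + pairs[i][0]) / 2          # candidate cut point
        left = [c for v, c in pairs if v < r]
        right = [c for v, c in pairs if v >= r]
        exp_ent = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or exp_ent < best[1]:
            best = (r, exp_ent)
    return best

A = [1, 3, 4, 6, 10, 12, 17, 18, 22, 25]
C = ["y", "y", "y", "n", "n", "y", "y", "y", "n", "n"]
print(best_threshold(A, C))   # (20.0, 0.649...): split between 18 and 22
```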

SLIDE 30

Decision Trees

Decision boundaries revisited

Decision trees Data mining September 2008 20 / 37

SLIDE 31

Attributes with many values

The expected entropy measure favors attributes with many values: for example, an attribute Date (with the possible dates as states) will have a very low expected entropy but is unable to generalize!

One approach to avoiding this problem is to select attributes based on the GainRatio:

    GainRatio(D, A) = score(D, A) / H_A
    H_A = − Σ_{a ∈ states(A)} p(a) log2 p(a),

where p(a) is the relative frequency of A = a in D.

Decision trees Data mining September 2008 21 / 37
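A sketch of the gain-ratio idea in Python. One hedge: the numerator below is the usual C4.5 information gain (H(C) minus the expected entropy) rather than the slide's score(D, A); either way, dividing by H_A penalises attributes with very many values, such as Date.

```python
from collections import Counter
import math

def entropy(items):
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())

def gain_ratio(attribute_values, class_labels):
    """attribute_values[i] and class_labels[i] belong to instance i."""
    total = len(class_labels)
    groups = {}
    for a, c in zip(attribute_values, class_labels):
        groups.setdefault(a, []).append(c)
    # information gain = H(C) - expected entropy after the split
    gain = entropy(class_labels) - sum(
        len(g) / total * entropy(g) for g in groups.values())
    h_a = entropy(attribute_values)            # split information H_A
    return gain / h_a if h_a > 0 else 0.0

labels = ["y", "y", "n", "n"]
print(gain_ratio(["d1", "d2", "d3", "d4"], labels))   # many-valued attribute: 1.0 / 2.0 = 0.5
print(gain_ratio(["a", "a", "b", "b"], labels))       # two-valued attribute:  1.0 / 1.0 = 1.0
```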

SLIDE 32

Decision Trees

Overfitting Constructing a classifier with zero classification error on the training data may lead to overfitting of the data: [figure: an overly complex partition of the Iris data (Setosa, Versicolor, Virginica)].

Decision trees Data mining September 2008 22 / 37

SLIDE 33

Decision Trees

  • Complex models will represent properties of the training data very precisely.
  • The training data may contain some peculiar properties that are not representative of the domain.
  • The model will then not perform optimally in classifying future instances.

[Figure: classification error as a function of model size, for training data and for future data.]

Decision trees Data mining September 2008 23 / 37

SLIDE 34

Decision Trees

Pruning To prevent overfitting, extensions of ID3 (C4.5, C5.0) add a pruning step after the tree construction:

  • Data is split into training data and validation data
  • The decision tree is learned using the training data only
  • Pruning: for each internal node X, replace the subtree rooted at X with a leaf labelled with some c ∈ states(C) if this reduces the classification error on the validation data.

Decision trees Data mining September 2008 24 / 37
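A sketch of this reduced-error pruning step on the nested-tuple trees used in the earlier sketches: bottom-up, each subtree is replaced by a single-label leaf whenever that strictly reduces the error on the validation data reaching it. The tree encoding and all names are illustrative.

```python
from collections import Counter

def classify(node, instance):
    """Sort an instance down a nested-tuple tree (attribute, {value: subtree})."""
    while not isinstance(node, str):
        attribute, children = node
        node = children[instance[attribute]]
    return node

def error(node, data):
    """Number of misclassified (attribute_dict, class_label) pairs."""
    return sum(1 for a, c in data if classify(node, a) != c)

def prune(node, validation):
    """Replace a subtree by a leaf c if that reduces the validation error."""
    if isinstance(node, str) or not validation:
        return node
    attribute, children = node
    pruned = (attribute, {
        v: prune(t, [(a, c) for a, c in validation if a[attribute] == v])
        for v, t in children.items()})
    leaf = Counter(c for _, c in validation).most_common(1)[0][0]   # best single label
    return leaf if error(leaf, validation) < error(pruned, validation) else pruned

tree = ("A", {"t": ("B", {"t": "c1", "f": "c2"}), "f": "c2"})
validation = [({"A": "t", "B": "t"}, "c1"), ({"A": "t", "B": "f"}, "c1"),
              ({"A": "f", "B": "t"}, "c2")]
print(prune(tree, validation))   # ('A', {'t': 'c1', 'f': 'c2'}): the B-subtree becomes a leaf
```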

SLIDE 35

Overfitting

Model Tuning with a Test Set

[Figure: the data is split into a training set and a test set; a model is learned on the training data, applied to the test data, and tuned; the final model is then learned from all the data.]

  • Models can be adjusted or tuned (e.g. pruning subtrees, setting model parameters).
  • Tuning can be an iterative process that requires repeated evaluations on the test set.
  • A final model is learned using all the data.
  • Problem: part of the data is "wasted" as a test set.

Decision trees Data mining September 2008 25 / 37

SLIDE 36

Overfitting

Cross Validation
Partition the data into n subsets or folds (typically n = 10). For each setting of the tuning parameter:

  for i = 1 to n:
      learn a model using folds 1, . . . , i − 1, i + 1, . . . , n as training data
      measure performance on fold i
  model performance = average performance on the n test sets

  • Choose the parameter setting with the best performance.
  • Learn the final model with the chosen parameter setting using the whole available data.

Cross validation is also used for the final evaluation of a learned model.

Decision trees Data mining September 2008 26 / 37
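A sketch of n-fold cross validation with placeholder learn/performance functions; any learner can be plugged in, e.g. a decision-tree learner with different pruning settings or k-NN with different values of k.

```python
def cross_validate(data, n_folds, learn, performance):
    """Average performance of `learn` over n_folds train/test splits."""
    folds = [data[i::n_folds] for i in range(n_folds)]          # n roughly equal subsets
    scores = []
    for i in range(n_folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = learn(train)
        scores.append(performance(model, folds[i]))
    return sum(scores) / n_folds

# Usage sketch (learn, performance and settings are placeholders):
# best = max(settings, key=lambda s: cross_validate(data, 10, lambda d: learn(d, s), performance))
# final_model = learn(data, best)
```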

SLIDE 37

Decision Trees

Pros and Cons
  + Easy to interpret
  + Efficient learning methods
  − Greedy "one variable at a time" construction does not utilize possible correlations between attributes
  − Difficulties with missing data (but the ID3 algorithm can be extended to deal with missing data)

Decision trees Data mining September 2008 27 / 37

SLIDE 38

K Nearest Neighbor

Thomas D. Nielsen September 2008

k Nearest Neighbor K Nearest Neighbor September 2008 28 / 37

SLIDE 39

K Nearest Neighbor

Labeled training data in instance space (class labels: red, green, blue)

k Nearest Neighbor K Nearest Neighbor September 2008 29 / 37

SLIDE 40

K Nearest Neighbor

Labeled training data in instance space (class labels: red, green, blue) x A new instance x should be classified.

k Nearest Neighbor K Nearest Neighbor September 2008 29 / 37

SLIDE 41

K Nearest Neighbor

Labeled training data in instance space (class labels: red, green, blue) x A new instance x should be classified. The nearest neighbor is green, hence x is classified as green.

k Nearest Neighbor K Nearest Neighbor September 2008 29 / 37

SLIDE 42

K Nearest Neighbor

Labeled training data in instance space (class labels: red, green, blue) x A new instance x should be classified. The nearest neighbor is green, hence x is classified as green. Two of x’s three nearest neighbors are red, hence x is classified as red.

k Nearest Neighbor K Nearest Neighbor September 2008 29 / 37

SLIDE 43

K Nearest Neighbor: Distance Measures

Distance Measures in Instance Space
Some classification and almost all clustering methods require a distance measure d(a1, a2) between any pair a1 = (a1,1, . . . , a1,k), a2 = (a2,1, . . . , a2,k) of instances. Common distance measures are:

(I) For instances with continuous attributes A1, . . . , Ak:
  • d2(a1, a2) = √( Σ_{j=1..k} (a1,j − a2,j)² )   (Euclidean or L2 distance)
  • d1(a1, a2) = Σ_{j=1..k} |a1,j − a2,j|   (Manhattan or L1 distance)
  • d∞(a1, a2) = max{ |a1,j − a2,j| : j = 1, . . . , k }   (L∞ distance)

(II) For instances with binary attributes A1, . . . , Ak:
  • d(a1, a2) = |{ j : a1,j ≠ a2,j }|   (Hamming or edit distance)

k Nearest Neighbor K Nearest Neighbor September 2008 30 / 37
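The four measures as a small Python sketch, for instances given as equal-length tuples of attribute values; the Hamming distance counts the attributes on which the two instances differ.

```python
import math

def d2(a1, a2):                     # Euclidean / L2
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a1, a2)))

def d1(a1, a2):                     # Manhattan / L1
    return sum(abs(x - y) for x, y in zip(a1, a2))

def d_inf(a1, a2):                  # L-infinity
    return max(abs(x - y) for x, y in zip(a1, a2))

def hamming(a1, a2):                # number of attributes on which a1 and a2 differ
    return sum(1 for x, y in zip(a1, a2) if x != y)

print(d2((0, 0), (3, 4)), d1((0, 0), (3, 4)), d_inf((0, 0), (3, 4)))  # 5.0 7 4
print(hamming((1, 0, 1, 1), (1, 1, 1, 0)))                            # 2
```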

SLIDE 44

K Nearest Neighbor: Distance Measures

(II) For instances with discrete attributes A1, . . . , Ak:

    d(a1, a2) = Σ_{j=1..k} dj(a1,j, a2,j)

where dj is a separately defined distance function for attribute Aj, e.g.

  dj       low  medium  high
  low       0     1      2
  medium    1     0      1
  high      2     1      0

  dj       red  blue  green
  red       0     1     1
  blue      1     0     1
  green     1     1     0

If all attributes have the 0-1 distance (right matrix), then this is the same as the edit distance.

k Nearest Neighbor K Nearest Neighbor September 2008 31 / 37
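A sketch of the general discrete case: one distance function per attribute, summed over the attributes; with the 0-1 function this reduces to the Hamming distance. The table below encodes the low/medium/high example, and all names are illustrative.

```python
ordinal = {("low", "medium"): 1, ("medium", "low"): 1,
           ("medium", "high"): 1, ("high", "medium"): 1,
           ("low", "high"): 2, ("high", "low"): 2}

def zero_one(x, y):
    """The 0-1 distance from the right-hand table."""
    return 0 if x == y else 1

def table_distance(table):
    """Turn a symmetric distance table into a per-attribute distance function dj."""
    return lambda x, y: 0 if x == y else table[(x, y)]

def distance(a1, a2, per_attribute):
    """d(a1, a2) = sum over attributes of dj(a1_j, a2_j)."""
    return sum(dj(x, y) for dj, x, y in zip(per_attribute, a1, a2))

print(distance(("low", "red"), ("high", "red"),
               [table_distance(ordinal), zero_one]))   # 2 + 0 = 2
```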

SLIDE 45

K Nearest Neighbor: Distance Measures

Normalization
Continuous attributes: using the Euclidean distance on continuous attributes may cause one attribute to dominate the distance measure, e.g. Ak = height in inches, Al = income in $. Methods for providing a "common scale" for all attributes:

Min-Max Normalization: replace Ai with

    (Ai − min(Ai)) / (max(Ai) − min(Ai))

(min(Ai), max(Ai) are the min/max values of Ai appearing in the data).

[Figure: original values of two attributes A1, A2 mapped to normalized values in [0, 1].]

k Nearest Neighbor K Nearest Neighbor September 2008 32 / 37
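A sketch of min-max normalization of one attribute column to [0, 1]; the example values for height and income are made up.

```python
def min_max(column):
    """Map each value to (x - min) / (max - min), using the min/max seen in the data."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

heights = [58, 62, 70, 75]                    # e.g. height in inches
incomes = [20_000, 45_000, 120_000, 60_000]   # e.g. income in $
print(min_max(heights))                       # [0.0, 0.235..., 0.705..., 1.0]
print(min_max(incomes))                       # both attributes now live on the same scale
```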

SLIDE 46

K Nearest Neighbor: Distance Measures

Z-score Standardization: replace Ai with

    (Ai − mean(Ai)) / sd(Ai)

where

    mean(Ai) = (1/n) · Σ_{j=1..n} aj,i
    sd(Ai) = √( (1/(n−1)) · Σ_{j=1..n} (aj,i − mean(Ai))² )    (the standard deviation of Ai)

[Figure: original values of two attributes A1, A2 mapped to standardized values.]

k Nearest Neighbor K Nearest Neighbor September 2008 33 / 37
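A sketch of z-score standardization of one attribute column, using the sample standard deviation (the 1/(n−1) form above); the example values are made up.

```python
import math

def z_score(column):
    """Map each value to (x - mean) / standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]

print(z_score([20_000, 45_000, 120_000, 60_000]))
# the standardized column has mean 0 and standard deviation 1
```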

SLIDE 47

K Nearest Neighbor Classifier

Model = (Training) Data
Required: distance function on instances. Model = labeled training data (a1, c1), . . . , (aN, cN). Classify new instance anew as follows:

  • Let (aj1, cj1), . . . , (ajK, cjK) be the K training instances whose attributes are closest to anew.
  • Define C(anew) as the class label that occurs most frequently among cj1, . . . , cjK.

k Nearest Neighbor K Nearest Neighbor September 2008 34 / 37
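A sketch of the K-nearest-neighbour classifier itself: the model is just the labelled training data together with a distance function; the tiny two-class data set is made up.

```python
from collections import Counter
import math

def euclidean(a1, a2):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a1, a2)))

def knn_classify(training, new_instance, k, distance=euclidean):
    """training: list of (attribute_tuple, class_label) pairs."""
    neighbours = sorted(training, key=lambda ac: distance(ac[0], new_instance))[:k]
    labels = [c for _, c in neighbours]
    return Counter(labels).most_common(1)[0][0]     # most frequent label among the k

training = [((1.0, 1.0), "red"), ((1.2, 0.8), "red"),
            ((5.0, 5.0), "green"), ((5.2, 4.8), "green")]
print(knn_classify(training, (1.1, 1.1), k=3))      # red
```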

SLIDE 48

K Nearest Neighbor Classifier

Dependence on K
Decision regions (approximately) for 1-nearest neighbor (left) and 5-nearest neighbor (right). There is a possibility of overfitting for small values of K. Cross-validation can be used to find a suitable value for K.

k Nearest Neighbor K Nearest Neighbor September 2008 35 / 37

SLIDE 49

K Nearest Neighbor Classifier

Weighted voting We can give a higher weight to neighbors close to x than to neighbors far away. Calculate a weight for label c:

    v(c) = Σ_{i=1..k : ci = c} 1 / d(x, ai)

and label x with the class having the highest weight.

k Nearest Neighbor K Nearest Neighbor September 2008 36 / 37
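A sketch of the distance-weighted vote: each of the k nearest neighbours contributes 1/d(x, ai) to its own label. In the made-up example below, plain 3-NN would answer red (two red neighbours against one green), but the single very close green neighbour wins the weighted vote.

```python
from collections import defaultdict
import math

def euclidean(a1, a2):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a1, a2)))

def weighted_knn(training, x, k, distance=euclidean):
    neighbours = sorted(training, key=lambda ac: distance(ac[0], x))[:k]
    votes = defaultdict(float)
    for a, c in neighbours:
        votes[c] += 1.0 / distance(a, x)        # assumes x does not coincide with a neighbour
    return max(votes, key=votes.get)

training = [((0.0, 0.0), "red"), ((0.2, 0.1), "red"), ((1.0, 1.0), "green")]
print(weighted_knn(training, (0.9, 0.9), k=3))   # green
```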

SLIDE 50

K Nearest Neighbor Classifier

Pros and Cons
  + Can represent complex decision boundaries
  + Trivial to "learn"
  − High memory requirement (but one can sometimes just use a subset of the data)
  − Classification time increases with the size of the training data
  − Does not explain the data
  − Dependence on an appropriate distance function

k Nearest Neighbor K Nearest Neighbor September 2008 37 / 37