SLIDE 1


Classification

• Basic concepts
• Decision tree
• Naïve Bayesian classifier
• Model evaluation
• Support Vector Machines
• Regression
• Neural Networks and Deep Learning
• Lazy Learners (k Nearest Neighbors)
• Bayesian Belief Networks

slide-2
SLIDE 2

Classification: Discriminant Function

Decision Tree

Linear Functions: $g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$

Nonlinear Functions

slide-3
SLIDE 3

Linear Discriminant Function

• $g(\mathbf{x})$ is a linear function:

$$g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$$

[Figure: points in the $(x_1, x_2)$ plane separated into the half-spaces $\mathbf{w}^T \mathbf{x} + b < 0$ and $\mathbf{w}^T \mathbf{x} + b > 0$]

• A hyper-plane in the feature space

• Unit normal vector of the hyper-plane: $\mathbf{n} = \dfrac{\mathbf{w}}{\|\mathbf{w}\|}$

slide-4
SLIDE 4

Linear Discriminant Function

• How would you classify these points using a linear discriminant function in order to minimize the error rate?

[Figure: points labeled +1 and −1 in the $(x_1, x_2)$ plane with several candidate separating lines]

• Infinite number of answers!


slide-7
SLIDE 7

Linear Discriminant Function

• How would you classify these points using a linear discriminant function in order to minimize the error rate?

[Figure: points labeled +1 and −1 in the $(x_1, x_2)$ plane with several candidate separating lines]

• Infinite number of answers!
• Which one is the best?

slide-8
SLIDE 8

Large Margin Linear Classifier

• The linear discriminant function (classifier) with the maximum margin is the best.

• The margin is the width by which the boundary could be increased before hitting a data point (the "safe zone").

• Why is it the best?
  • Robust to outliers, and thus strong generalization ability.

[Figure: points labeled +1 and −1 in the $(x_1, x_2)$ plane, with the separating hyper-plane and its margin]

slide-9
SLIDE 9

Large Margin Linear Classifier

• Given a set of data points $\{(\mathbf{x}_i, y_i)\}, \; i = 1, 2, \ldots, n$, where

$$\text{For } y_i = +1: \; \mathbf{w}^T \mathbf{x}_i + b > 0, \qquad \text{For } y_i = -1: \; \mathbf{w}^T \mathbf{x}_i + b < 0$$

• With a scale transformation on both $\mathbf{w}$ and $b$, the above is equivalent to

$$\text{For } y_i = +1: \; \mathbf{w}^T \mathbf{x}_i + b \geq 1, \qquad \text{For } y_i = -1: \; \mathbf{w}^T \mathbf{x}_i + b \leq -1$$

[Figure: points labeled +1 and −1 in the $(x_1, x_2)$ plane]

slide-10
SLIDE 10

Large Margin Linear Classifier

• For the boundary points (the support vectors):

$$\mathbf{w}^T \mathbf{x}^+ + b = 1, \qquad \mathbf{w}^T \mathbf{x}^- + b = -1$$

• The margin width is:

$$M = (\mathbf{x}^+ - \mathbf{x}^-) \cdot \mathbf{n} = (\mathbf{x}^+ - \mathbf{x}^-) \cdot \frac{\mathbf{w}}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}$$

[Figure: the margin between the support vectors $\mathbf{x}^+$ and $\mathbf{x}^-$ in the $(x_1, x_2)$ plane]

slide-11
SLIDE 11

Large Margin Linear Classifier

• Formulation:

$$\text{maximize } \frac{2}{\|\mathbf{w}\|}$$

such that

$$\text{For } y_i = +1: \; \mathbf{w}^T \mathbf{x}_i + b \geq 1, \qquad \text{For } y_i = -1: \; \mathbf{w}^T \mathbf{x}_i + b \leq -1$$

slide-12
SLIDE 12

Large Margin Linear Classifier

• Formulation:

$$\text{minimize } \frac{1}{2}\|\mathbf{w}\|^2$$

such that

$$\text{For } y_i = +1: \; \mathbf{w}^T \mathbf{x}_i + b \geq 1, \qquad \text{For } y_i = -1: \; \mathbf{w}^T \mathbf{x}_i + b \leq -1$$

slide-13
SLIDE 13

Large Margin Linear Classifier

• Formulation (constraints combined):

$$\text{minimize } \frac{1}{2}\|\mathbf{w}\|^2$$

such that

$$y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1$$

slide-14
SLIDE 14

Solving the Optimization Problem

• The solution has the form:

$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i = \sum_{i \in SV} \alpha_i y_i \mathbf{x}_i$$

• Get $b$ from $y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 = 0$, where $\mathbf{x}_i$ is a support vector.

[Figure: support vectors $\mathbf{x}^+$ and $\mathbf{x}^-$ lying on the margin boundaries]

slide-15
SLIDE 15

Solving the Optimization Problem

• The linear discriminant function is:

$$g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b = \sum_{i \in SV} \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$$

• Notice it relies on a dot product between the test point $\mathbf{x}$ and the support vectors $\mathbf{x}_i$.

• Also keep in mind that solving the optimization problem involved computing the dot products $\mathbf{x}_i^T \mathbf{x}_j$ between all pairs of training points.
slide-16
SLIDE 16

Classification: Discriminant Function

Decision Tree

Linear Functions: $g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$

Nonlinear Functions

slide-17
SLIDE 17

Non-linear SVMs: Feature Space

 General idea: the original input space can be mapped to

some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt

slide-18
SLIDE 18

Nonlinear SVMs: The Kernel Trick

• With this mapping, our discriminant function is now:

$$g(\mathbf{x}) = \mathbf{w}^T \varphi(\mathbf{x}) + b = \sum_{i \in SV} \alpha_i y_i \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}) + b$$

• No need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing.

• A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j)$$

slide-19
SLIDE 19

Nonlinear SVMs: The Kernel Trick

• Examples of commonly-used kernel functions:

  • Linear kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$

  • Polynomial kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^T \mathbf{x}_j)^p$

  • Gaussian (Radial-Basis Function, RBF) kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left(-\dfrac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$
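To make the kernel definitions concrete, here is a minimal sketch of the three kernels as plain NumPy functions; the function names and the default values of $p$ and $\sigma$ are illustrative assumptions, not from the slides:

```python
import numpy as np

def linear_kernel(xi, xj):
    # K(xi, xj) = xi . xj
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, p=3):
    # K(xi, xj) = (1 + xi . xj)^p
    return (1.0 + np.dot(xi, xj)) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    # K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
    diff = xi - xj
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```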

slide-20
SLIDE 20

Summary: Support Vector Machine

• 1. Large Margin Classifier

  • Better generalization ability & less over-fitting

• 2. The Kernel Trick

  • Map data points to a higher-dimensional space in order to make them linearly separable.
  • Since only the dot product is used, we do not need to represent the mapping explicitly.

slide-21
SLIDE 21

Classification

• Basic concepts
• Decision tree
• Naïve Bayesian classifier
• Model evaluation
• Support Vector Machines
• Regression
• Neural Networks and Deep Learning
• Lazy Learners (k Nearest Neighbors)
• Bayesian Belief Networks

slide-22
SLIDE 22
slide-23
SLIDE 23

Regression

• Given input/output samples (X, y), we learn a function f such that y = f(X), which can be used on new data.

• Classification: y is discrete (class labels).
• Regression: y is continuous, e.g. linear regression.

[Figure: scatter plot with x as the independent variable (input) and y as the dependent variable (output)]

slide-24
SLIDE 24

Linear Regression

• Linear regression: use a linear function to model the relationship between a dependent (target) variable y and explanatory variables X.

• Simple linear regression: one explanatory variable.

[Figure: scatter plot of Y versus X with a fitted line]

slide-25
SLIDE 25

Linear Regression

How do you determine which line 'fits best'?

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

[Figure: scatter plot of Y versus X with several candidate lines]

slide-26
SLIDE 26

Least Squares

• 'Best fit' can be defined by a cost function: the difference between actual Y values and predicted Y values (the residual).

• Least squares minimizes the sum of the squared differences (errors), the SSE:

$$Y = \beta_0 + \beta_1 X + \varepsilon, \qquad \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
slide-27
SLIDE 27

Linear Regression

Data: Inputs are continuous vectors of length K. Outputs are continuous scalars.

Prediction: Output is a linear function of the inputs (we assume $x_1$ is 1, so its weight acts as the intercept).

Learning: find the parameters that minimize some objective function.
slide-28
SLIDE 28

Least Squares

Learning: find the parameters that minimize some objective function.

We minimize the sum of the squares. Why? It reduces the distance between the true measurements and the predicted hyperplane (a line in 1D).

slide-29
SLIDE 29

Learning the parameters

• Closed form: set partial derivatives equal to zero and solve for the parameters
• Gradient descent (GD)
• Stochastic gradient descent (SGD)

slide-30
SLIDE 30

Derivation of Parameters (1)

• Least squares: find the $\beta$ that minimize the objective function, the squared error (SSE):

$$SSE = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

• Setting $\partial SSE / \partial \beta_0 = 0$ gives $\sum_i (y_i - \beta_0 - \beta_1 x_i) = 0$, i.e. $n\bar{y} - n\beta_0 - n\beta_1 \bar{x} = 0$, so

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

slide-31
SLIDE 31

Derivation of Parameters (2)

• Least squares: find the $\beta$ that minimize the objective function, the squared error (SSE).

• Setting $\partial SSE / \partial \beta_1 = 0$ gives $\sum_i x_i (y_i - \beta_0 - \beta_1 x_i) = 0$; substituting $\beta_0 = \bar{y} - \beta_1 \bar{x}$ and solving yields

$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{SS_{xy}}{SS_{xx}}$$
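A minimal sketch of this closed-form fit in Python/NumPy; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Closed-form least-squares fit of y = b0 + b1 * x."""
    x_bar, y_bar = x.mean(), y.mean()
    ss_xy = np.sum((x - x_bar) * (y - y_bar))   # SS_xy
    ss_xx = np.sum((x - x_bar) ** 2)            # SS_xx
    b1 = ss_xy / ss_xx                          # slope: beta_1 = SS_xy / SS_xx
    b0 = y_bar - b1 * x_bar                     # intercept: beta_0 = y_bar - beta_1 * x_bar
    return b0, b1

# Example usage
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
b0, b1 = fit_simple_linear_regression(x, y)
```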

slide-32
SLIDE 32

Learning the parameters

• Closed form: set partial derivatives equal to zero and solve for the parameters
• Gradient descent (GD)
  • Start with initial values and gradually move towards the minimal loss function value based on the gradient
• Stochastic gradient descent (SGD)

slide-33
SLIDE 33

Gradient Descent Illustration

slide-34
SLIDE 34

Gradient Descent

Gradient of the objective function (i.e., the vector of partial derivatives):

slide-35
SLIDE 35

Gradient Descent

• Convergence: we could check whether the L2 norm of the gradient is below some small tolerance. Alternatively, whether the reduction in the objective function from one iteration to the next is small.

slide-36
SLIDE 36

Partial Derivatives for Linear Reg.

42

slide-37
SLIDE 37

Partial Derivatives for Linear Reg.

43

slide-38
SLIDE 38

Gradient Descent for Least Squares

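The update equations on this slide are images that did not survive extraction; as a stand-in, here is a minimal gradient-descent sketch for least-squares linear regression. The learning rate, iteration count, and names are illustrative assumptions:

```python
import numpy as np

def gradient_descent_least_squares(X, y, lr=0.01, n_iters=1000):
    """X: (n, K) design matrix with a leading column of 1s; y: (n,) targets."""
    n, K = X.shape
    w = np.zeros(K)
    for _ in range(n_iters):
        residual = X @ w - y                 # predicted minus true values
        grad = (2.0 / n) * (X.T @ residual)  # gradient of the mean squared error
        w -= lr * grad                       # step against the gradient
    return w
```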

slide-39
SLIDE 39

Stochastic Gradient Descent (SGD)

Update the parameters based on the gradient of random data samples. Often preferred over (batch) gradient descent because it gets close to the minimum much faster.
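A minimal sketch of the stochastic variant, again with illustrative names and hyper-parameters:

```python
import numpy as np

def sgd_least_squares(X, y, lr=0.01, n_epochs=10, seed=0):
    """Stochastic gradient descent: update after each randomly chosen sample."""
    rng = np.random.default_rng(seed)
    n, K = X.shape
    w = np.zeros(K)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            residual = X[i] @ w - y[i]       # error on a single sample
            w -= lr * 2.0 * residual * X[i]  # gradient step for that sample only
    return w
```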

slide-40
SLIDE 40

Non-Linear basis function

• So far we only used the observed values $x_1, x_2, \ldots$. However, linear regression can be applied in the same way to functions of these values.

• E.g., add new variables $x_1^2$ and $x_1 x_2$, so each example becomes: $x_1, x_2, \ldots, x_1^2, x_1 x_2$.

• As long as these functions can be directly computed from the observed values, the parameters are still linear in the data and the problem remains a multivariate linear regression problem:

$$y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_k x_k$$

slide-41
SLIDE 41

Non-linear basis functions

What type of functions can we use? A few common examples:

• Polynomial: $\varphi_j(x) = x^j$ for $j = 0 \ldots n$
• Gaussian: $\varphi_j(x) = \dfrac{(x - \mu_j)^2}{2\sigma_j^2}$
• Sigmoid: $\varphi_j(x) = \dfrac{1}{1 + \exp(-s_j x)}$
• Logs: $\varphi_j(x) = \log(x + 1)$

slide-42
SLIDE 42

General linear regression problem

• Using our new notation for the basis functions, linear regression can be written as

$$y = \sum_{j=0}^{n} w_j \varphi_j(\mathbf{x})$$

• where $\varphi_j(\mathbf{x})$ can be either $x_j$ for multivariate regression or one of the non-linear basis functions we defined,
• and $\varphi_0(\mathbf{x}) = 1$ for the intercept term.
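As a concrete, hedged illustration of this general form, the sketch below builds a polynomial design matrix $\Phi$ and reuses an ordinary least-squares solve; the helper names are mine, not the slides':

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Phi[i, j] = phi_j(x_i) = x_i**j, with phi_0(x) = 1 for the intercept."""
    return np.vander(x, N=degree + 1, increasing=True)

def fit_basis_regression(x, y, degree=3):
    """Least-squares fit of y = sum_j w_j * phi_j(x)."""
    Phi = polynomial_design_matrix(x, degree)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

# Example usage on noisy sinusoidal data
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)
w = fit_basis_regression(x, y, degree=3)
```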

slide-43
SLIDE 43

0th Order Polynomial

n=10

slide-44
SLIDE 44

1st Order Polynomial

slide-45
SLIDE 45

3rd Order Polynomial

slide-46
SLIDE 46

9th Order Polynomial

slide-47
SLIDE 47

Over-fitting

Root-Mean-Square (RMS) Error:

slide-48
SLIDE 48

Polynomial Coefficients

slide-49
SLIDE 49

Regularization

Penalize large coefficient values:

$$J_{X,y}(\mathbf{w}) = \frac{1}{2} \sum_i \left( y_i - \sum_j w_j \varphi_j(\mathbf{x}_i) \right)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$$
slide-50
SLIDE 50

Regularization:


slide-51
SLIDE 51

Over Regularization:

slide-52
SLIDE 52

Polynomial Coefficients

[Table: polynomial coefficient values for no regularization vs. increasing amounts of regularization]

slide-53
SLIDE 53

LASSO

  • Adds an L1 regularizer to Linear Regression


slide-54
SLIDE 54

Interpretability

• Coefficients suggest importance/correlation with the output.
  • A large positive coefficient implies that the output will increase when this input is increased (positively correlated).
  • A large negative coefficient implies that the output will decrease when this input is increased (negatively correlated).
  • A small or 0 coefficient suggests that the input is uncorrelated with the output (at least to first order).
• Linear regression can be used to find the best "indicators".

slide-55
SLIDE 55

Regression for Classification

• Given input/output samples (X, y), we learn a function f such that y = f(X), which can be used on new data.

• Classification: y is discrete (class labels).
• Regression: y is continuous, e.g. linear regression.

[Figure: y versus x for regression, and a 0/1-labeled version of the same data for classification]

slide-56
SLIDE 56

From real value to discrete value


slide-57
SLIDE 57

From real value to discrete value


slide-58
SLIDE 58

From real value to discrete value


Non-differentiable

slide-59
SLIDE 59

Logistic Regression

Data: Inputs are continuous vectors of length K. Outputs are discrete valued.

Prediction: Output is a logistic function of the linear function of the inputs.

slide-60
SLIDE 60

Classification: Discriminant Function

Decision Tree Linear Functions Nonlinear Functions

slide-61
SLIDE 61

Logistic Regression

Data: Inputs are continuous vectors of length K. Outputs are discrete valued.

Prediction: Output is a logistic function of the linear function of the inputs.

Learning: find the parameters that minimize some objective function.
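As a hedged sketch of the prediction step described above (a sigmoid applied to a linear score); the names and the 0.5 decision threshold are illustrative choices:

```python
import numpy as np

def sigmoid(z):
    # logistic function: 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w):
    """P(y = 1 | x) = sigmoid(w . x) for each row of X."""
    return sigmoid(X @ w)

def predict_label(X, w, threshold=0.5):
    # discrete output: class 1 when the probability exceeds the threshold
    return (predict_proba(X, w) >= threshold).astype(int)
```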
slide-62
SLIDE 62

Recall: Least Squares for Linear Regression

Learning: find the parameters that minimize some objective function.

We minimize the sum of the squares. Why? It reduces the distance between the true measurements and the predicted values.

slide-63
SLIDE 63

Maximum Likelihood for Logistic Regression

Learning: find the parameters that maximize the log likelihood of observing the data.

slide-64
SLIDE 64

Review: Derivative Rules


slide-65
SLIDE 65

Stochastic Gradient Descent (Ascent)

• Partial derivative with one training example (x, y)
• Stochastic gradient descent update
• Gradient descent update
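The update equations on this slide are images that did not survive extraction; as a stand-in, here is a minimal sketch of the standard stochastic gradient-ascent update for the logistic-regression log likelihood. The learning rate, epoch count, and names are my assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, lr=0.1, n_epochs=20, seed=0):
    """y contains 0/1 labels; maximize the log likelihood one sample at a time."""
    rng = np.random.default_rng(seed)
    n, K = X.shape
    w = np.zeros(K)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            error = y[i] - sigmoid(X[i] @ w)  # (y - p) factor of the per-sample gradient
            w += lr * error * X[i]            # gradient *ascent* on the log likelihood
    return w
```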

slide-66
SLIDE 66

Regression Summary

• Regression methods
  • Linear regression
  • Logistic regression
• Optimization
  • Gradient descent
  • Stochastic gradient descent
• Regularization

slide-67
SLIDE 67

Classification

• Basic concepts
• Decision tree
• Naïve Bayesian classifier
• Model evaluation
• Support Vector Machines
• Regression
• Neural Networks and Deep Learning
• Lazy Learners (k Nearest Neighbors)
• Bayesian Belief Networks

slide-68
SLIDE 68

Deep Learning: MIT Technology Review - 10 Breakthrough Technologies 2013


slide-69
SLIDE 69


slide-70
SLIDE 70

Applications

• Image recognition
• Speech recognition
• Natural language processing

slide-71
SLIDE 71

Classification: Discriminant Function

Decision Tree Linear Functions Nonlinear Functions

slide-72
SLIDE 72

Neural Network and Deep Learning

• A neural network is a multi-layer structure of connected input/output units (artificial neurons).

• Learning is done by adjusting the weights $w_{ij}$ so as to predict the correct class label of the input tuples.

• Deep learning uses more layers than shallow learning.

[Figure: input vector X feeding an input layer, a hidden layer, and an output layer that emits the output vector]

slide-73
SLIDE 73

Artificial Neuron – Perceptron


Perceptron

slide-74
SLIDE 74

Neuron: A Hidden/Output Layer Unit

• An n-dimensional input vector x is mapped into variable y by means of a scalar product and a nonlinear function mapping.

• The inputs to the unit are outputs from the previous layer. They are multiplied by their corresponding weights to form a weighted sum, which is added to the bias $\mu_k$ associated with the unit. Then a nonlinear activation function $f$ is applied to it.

$$\text{For example: } \; y = \operatorname{sign}\!\left( \sum_{i=0}^{n} w_i x_i + \mu_k \right)$$

[Figure: inputs $x_0, x_1, \ldots, x_n$ weighted by $w_0, w_1, \ldots, w_n$, summed with the bias, and passed through the activation function $f$ to produce the output y]
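A minimal sketch of this single unit in code, using the sign activation from the example above; the names are illustrative:

```python
import numpy as np

def neuron_output(x, w, bias, activation=np.sign):
    """One unit: weighted sum of the inputs plus bias, passed through an activation."""
    weighted_sum = np.dot(w, x) + bias
    return activation(weighted_sum)

# Example usage
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.2])
y = neuron_output(x, w, bias=0.1)   # returns sign(w.x + bias)
```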

slide-75
SLIDE 75

From Neuron to Neural Network

• The input layer corresponds to the attributes measured for each training tuple.

• They are then weighted and fed simultaneously to the hidden layers.

• The weighted outputs of the last hidden layer are input to the output layer, which emits the network's prediction.

• From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function.

slide-76
SLIDE 76

Neural Networks

• A family of parametric, non-linear, and hierarchical representation-learning functions

slide-77
SLIDE 77

Learning Weights (and Bias) in the Network

• If a small change in a weight (or bias) causes only a small change in the output, we could modify the weights and biases gradually to train the network.

• Does the perceptron work?

slide-78
SLIDE 78

Artificial Neuron – Perceptron to sigmoid neuron

87

Perceptron sigmoid neuron

slide-79
SLIDE 79

Popular Activation Functions

• Tanh: $\tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$

• ReLU (Rectified Linear Unit): $g(y) = \max(0, y)$

slide-80
SLIDE 80


slide-81
SLIDE 81

Learning by Backpropagation

Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer, hence "backpropagation".

Steps:
• Initialize weights to small random numbers, along with associated biases
• Propagate the inputs forward (by applying the activation function)
• Backpropagate the error (by updating weights and biases)
• Terminating condition (when the error is very small, etc.)

slide-82
SLIDE 82

Stochastic Gradient Descent

• Gradient Descent (batch GD)
  • The cost gradient is based on the complete training set; this can be costly and take longer to converge to the minimum.

• Stochastic Gradient Descent (SGD, iterative or online GD)
  • Update the weights after each training sample.
  • The gradient based on a single training sample is a stochastic approximation of the true cost gradient.
  • Converges faster, but the path towards the minimum may zig-zag.

• Mini-Batch Gradient Descent (MB-GD)
  • Update the weights based on a small group of training samples.

slide-83
SLIDE 83

Training the neural network

Training data (Fields -> class):
  1.4  2.7  1.9  ->  0
  3.8  3.4  3.2  ->  0
  6.4  2.8  1.7  ->  1
  4.1  0.1  0.2  ->  0
  etc. …

slide-84
SLIDE 84

Training data (as above). Initialise with random weights.

slide-85
SLIDE 85

Training data (as above). Feed an example through to get an output: inputs 1.4, 2.7, 1.9 produce output 0.8.

slide-86
SLIDE 86

Training data (as above). Compare with the target output: target 0, output 0.8, error 0.8.

slide-87
SLIDE 87

Training data (as above). Adjust the weights based on the error (output 0.8, error 0.8).

slide-88
SLIDE 88

Training data (as above). And so on: inputs 6.4, 2.8, 1.7 with target 1 produce output 0.9, error -0.1. Repeat this thousands, maybe millions of times, each time taking a random training instance and making slight weight adjustments (stochastic gradient descent).

slide-89
SLIDE 89

Learning for neural networks

• Shallow networks
• Deep networks with multiple layers (deep learning)

slide-90
SLIDE 90

Feature detectors

slide-91
SLIDE 91

Hidden layer units become self-organised feature detectors

[Figure: weights from 63 input pixels into one hidden unit, with a few strong weights and low/zero weights elsewhere]

slide-92
SLIDE 92

What does this unit detect?

[Figure: the same weight pattern] It will send a strong signal for a horizontal line in the top row, ignoring everywhere else.

slide-93
SLIDE 93

What features might you expect a good NN to learn, when trained with data like this?

slide-94
SLIDE 94


vertical lines

slide-95
SLIDE 95


Horizontal lines

slide-96
SLIDE 96


Small circles

slide-97
SLIDE 97


Small circles

slide-98
SLIDE 98

Hierarchical Feature Learning

[Figure: units in the first layer detect lines in specific positions; higher-level detectors combine them (horizontal line, "RHS vertical line", "upper loop", etc.)]

slide-99
SLIDE 99

Hierarchical Feature Learning

• Deep learning (a.k.a. representation learning) seeks to learn rich hierarchical representations (i.e. features) automatically through multiple stages of a feature learning process.

[Figure: pipeline from low-level features to mid-level features to high-level features to a trainable classifier producing the output; feature visualization of a convolutional net trained on ImageNet (Zeiler and Fergus, 2013)]

slide-100
SLIDE 100

Deep Learning Architectures

• Commonly used architectures:
  • Convolutional neural networks
  • Recurrent neural networks

slide-101
SLIDE 101

Convolutional Neural Network

• Input can have very high dimensions.
  • Using a fully-connected neural network would need a large number of parameters.

• CNNs are a special type of neural network using shared weights and local connections.
  • The number of parameters needed by CNNs is much smaller.

• Example: 200x200 image
  a) fully connected, 40,000 hidden units => 1.6 billion parameters
  b) CNN, 5x5 filters, 100 filters => 2,500 parameters

slide-102
SLIDE 102

Building-blocks for CNN’s

• Images are segmented into sub-regions.
• Each sub-region yields a feature map, representing its feature.
• Feature maps are trained with neurons (shared weights).
• Feature maps of a larger region are combined.

slide-103
SLIDE 103

CNN Architecture: Convolutional Layer

• The convolutional layer consists of a set of filters.
• Each filter covers a spatially small portion of the input data.
• Each filter is convolved across the dimensions of the input data (a dot product at each position), producing a multidimensional feature map.
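A minimal sketch of the "dot product at each position" idea for a single 2D filter (a naive valid-mode convolution); the names are illustrative:

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide one filter over the image; each output value is a dot product."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kH, j:j + kW]
            out[i, j] = np.sum(patch * kernel)   # dot product of patch and filter
    return out
```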

slide-104
SLIDE 104

Full CNN

118

pooling pooling

slide-105
SLIDE 105

Recurrent Neural Networks

• Standard neural networks (also convolutional networks):
  • Assume input examples are vectors of fixed length (e.g., an image) and produce a fixed-size vector as output (e.g., probabilities of different classes).
  • These models use a fixed number of computational steps (e.g., the number of layers in the model).

• Recurrent neural networks:
  • Model data with temporal or sequential structure
  • Allow varying lengths of inputs and outputs

slide-106
SLIDE 106

Recurrent Neural Networks

• Recurrent neural networks are networks with loops in them, allowing information to persist.

• At time t, the network takes some input x_t and outputs a value h_t. A loop allows information to be passed from one step of the network to the next.

slide-107
SLIDE 107

Neural Network as a Classifier

Weaknesses
• Long training time
• Requires a large amount of training data
• Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network

Strengths
• Successful on an array of real-world data, e.g., hand-written letters
• High tolerance to noisy data
• Well-suited for continuous-valued inputs and outputs
• Algorithms are inherently parallel

slide-108
SLIDE 108

Deep learning frameworks

Framework survey: https://www.microway.com/hpc-tech-tips/deep-learning-frameworks-survey-tensorflow-torch-theano-caffe-neon-ibm-machine-learning-stack/

TensorFlow playground: http://playground.tensorflow.org/

slide-109
SLIDE 109

Classification

• Basic concepts
• Decision tree
• Naïve Bayesian classifier
• Model evaluation
• Support Vector Machines
• Regression
• Neural Networks and Deep Learning
• Lazy Learners (k Nearest Neighbors)

slide-110
SLIDE 110

Lazy vs. Eager Learning

• Lazy vs. eager learning
  • Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple
  • Eager learning (the previously discussed methods): given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify
• Lazy: less time in training but more time in predicting
• Accuracy
  • Lazy methods effectively use a richer hypothesis space, since they use many local linear functions
  • Eager: must commit to a single hypothesis (global function) that covers the entire instance space

slide-111
SLIDE 111

K-nearest neighbor method

• Majority vote within the k nearest neighbors

[Figure: a new point whose single nearest neighbor is positive (K = 1: positive) but whose three nearest neighbors are mostly negative (K = 3: negative)]
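A minimal sketch of k-NN majority voting with Euclidean distance; the names and the tie-breaking behaviour of Counter.most_common are choices of this sketch, not of the slides:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                    # indices of the k closest points
    votes = Counter(y_train[nearest])                      # y_train assumed to be a NumPy array
    return votes.most_common(1)[0][0]
```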

slide-112
SLIDE 112

K-nearest neighbor method

• Distance-weighted nearest neighbor algorithm
  • Weight the contribution of each of the k neighbors according to their distance to the query $\mathbf{x}_q$:

$$w_i \equiv \frac{1}{d(\mathbf{x}_q, \mathbf{x}_i)^2}$$

slide-113
SLIDE 113


slide-114
SLIDE 114

Definition of Nearest Neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor neighborhoods around a query point X]

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.

slide-115
SLIDE 115


1 nearest-neighbor

Voronoi Diagram

slide-116
SLIDE 116

Nearest Neighbor Classification…

• Choosing the value of k:
  • If k is too small, the classifier is sensitive to noise points
  • If k is too large, the neighborhood may include points from other classes

slide-117
SLIDE 117

Similarity/distance between data objects

Data objects can be viewed:
• as points: distance between points
• as vectors: cosine between vectors
• as random variables: correlation
• as sets: Jaccard distance between sets
• as strings: Hamming distance

slide-118
SLIDE 118

Distance between two data points

• Euclidean distance:
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

• Manhattan distance:
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

• Minkowski distance:
$$d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$$

[Data matrix: row i is $(x_{i1}, \ldots, x_{if}, \ldots, x_{ip})$]
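A minimal sketch of these three distances in NumPy; Minkowski's q is a parameter:

```python
import numpy as np

def manhattan(xi, xj):
    return np.sum(np.abs(xi - xj))

def euclidean(xi, xj):
    return np.sqrt(np.sum((xi - xj) ** 2))

def minkowski(xi, xj, q=3):
    # q = 1 gives Manhattan, q = 2 gives Euclidean
    return np.sum(np.abs(xi - xj) ** q) ** (1.0 / q)
```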

slide-119
SLIDE 119

Distance between two attribute values

• To compute the per-attribute distance $|x_{if} - x_{jf}|$ for attribute f:

• f is numeric (interval or ratio scale)
  • Use $|x_{if} - x_{jf}|$; scaling issues -> normalization

• f is ordinal
  • Mapping by rank: $z_{if} = \dfrac{r_{if} - 1}{M_f - 1}$, then compare the mapped values

• f is nominal
  • Mapping function: distance = 0 if $x_{if} = x_{jf}$, or 1 otherwise

• Hamming distance (edit distance) for strings

slide-120
SLIDE 120

Normalization of attributes

Scale attribute values to fall within a small, specified range.

• Min-max normalization: from $[\min_A, \max_A]$ to $[new\_min_A, new\_max_A]$

$$v' = \frac{v - \min_A}{\max_A - \min_A} (new\_max_A - new\_min_A) + new\_min_A$$

  • Ex. Let income range [$12,000, $98,000] be normalized to [0.0, 1.0]. Then $73,600 is mapped to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$

• Z-score normalization (μ: mean, σ: standard deviation):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

  • Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
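A minimal sketch of both normalizations; the function names are illustrative:

```python
import numpy as np

def min_max_normalize(v, new_min=0.0, new_max=1.0):
    """Map values from [v.min(), v.max()] to [new_min, new_max]."""
    v_min, v_max = v.min(), v.max()
    return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

def z_score_normalize(v):
    """Map values to (v - mean) / std."""
    return (v - v.mean()) / v.std()

# Example usage
income = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])
print(min_max_normalize(income))   # 73,600 maps to ~0.716
print(z_score_normalize(income))
```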

slide-121
SLIDE 121

Weighted distance

• Assign weights to different attributes.
• If $w_i$ is the inverse variance, it's a form of Mahalanobis distance.
• Supervised metric learning: learn the weights $w_i$ using labeled data.

slide-122
SLIDE 122

Euclidean distance

• Euclidean distance may not be meaningful (counter-intuitive) for high-dimensional data, e.g. user movie ratings.

[Example: rating vectors such as (3, 3, 3, …) vs. (1, 1, 1, …) compared with (0, 3, 0, 3, …) vs. (1, 1, 1, …)]

slide-123
SLIDE 123

Similarity/distance between data objects

Data objects can be viewed:
• as points: distance between points
• as vectors: cosine between vectors
• as random variables: correlation
• as sets: Jaccard distance between sets
• as strings: Hamming distance

slide-124
SLIDE 124

Cosine similarity between two vectors

• Cosine measure, ranging from -1 to 1:

$$\cos(X_i, X_j) = \frac{X_i \cdot X_j}{\|X_i\| \, \|X_j\|}$$
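A minimal sketch of the cosine measure:

```python
import numpy as np

def cosine_similarity(xi, xj):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return np.dot(xi, xj) / (np.linalg.norm(xi) * np.linalg.norm(xj))
```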

slide-125
SLIDE 125

Cosine similarity

• Cosine similarity is:
  • Invariant to multiplicative scaling
  • Not invariant to additive scaling

[Example: pairs of rating vectors that differ only by a multiplicative factor have the same cosine similarity]

slide-126
SLIDE 126

Correlation between two random variables (numerical data)

• Correlation coefficient (also called Pearson's product-moment coefficient):

$$r_{A,B} = \frac{\sum (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (a_i b_i) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum(a_i b_i)$ is the sum of the AB dot-product.

• $r_{A,B} > 0$: A and B are positively correlated (A's values increase as B's do)
• $r_{A,B} = 0$: independent
• $r_{A,B} < 0$: negatively correlated

slide-127
SLIDE 127

Visualization of Correlation

Scatter plots showing the Pearson correlation from –1 to 1.

slide-128
SLIDE 128

Data object as a set

• For transaction data, document data
• Shared items are more important to consider

[Example: two long binary item vectors that share only a few 1s]

slide-129
SLIDE 129

Jaccard distance between two sets

• The Jaccard similarity of two sets is the size of their intersection divided by the size of their union: sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|

• Jaccard distance: d(C1, C2) = 1 − |C1 ∩ C2| / |C1 ∪ C2|

(J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)

[Example: 3 items in the intersection and 8 in the union give Jaccard similarity = 3/8 and Jaccard distance = 5/8]
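A minimal sketch using Python sets:

```python
def jaccard_similarity(c1, c2):
    """|C1 intersect C2| / |C1 union C2| for two sets."""
    return len(c1 & c2) / len(c1 | c2)

def jaccard_distance(c1, c2):
    return 1.0 - jaccard_similarity(c1, c2)

# Example: 3 shared items out of 8 total -> similarity 3/8, distance 5/8
a = {1, 2, 3, 4, 5, 6}
b = {4, 5, 6, 7, 8}
print(jaccard_similarity(a, b), jaccard_distance(a, b))
```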

slide-130
SLIDE 130

Similarity/distance between data objects

Data objects can be viewed:
• as points: distance between points
• as vectors: cosine between vectors
• as random variables: correlation
• as sets: Jaccard distance between sets
• as strings: Hamming distance

slide-131
SLIDE 131

Supervised Learning

• Decision tree
• Naïve Bayesian classifier
• Support Vector Machines
• Regression
• Neural Networks and Deep Learning
• Lazy Learners (k Nearest Neighbors)

slide-132
SLIDE 132

Semi-Supervised Classification

• Supervised learning: learning from labeled data
• Labeled data can be rare or expensive; unlabeled data are much easier to get
• Semi-supervised learning: use both labeled and unlabeled data
  • Self-training
  • Co-training
• Active learning: iterative supervised learning

slide-133
SLIDE 133

Self-training

• Build a classifier using the labeled data
• Repeatedly use it to label the unlabeled data; the examples with the most confident label predictions are added to the set of labeled data
• May reinforce errors
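A hedged sketch of the self-training loop just described, assuming a scikit-learn-style classifier exposing fit, predict_proba, and classes_; the confidence threshold and round limit are illustrative choices:

```python
import numpy as np

def self_training(model, X_labeled, y_labeled, X_unlabeled,
                  confidence=0.95, max_rounds=10):
    """Repeatedly fit on labeled data, then absorb confidently labeled examples."""
    X_lab, y_lab = X_labeled.copy(), y_labeled.copy()
    X_unlab = X_unlabeled.copy()
    for _ in range(max_rounds):
        if len(X_unlab) == 0:
            break
        model.fit(X_lab, y_lab)
        proba = model.predict_proba(X_unlab)
        conf_mask = proba.max(axis=1) >= confidence          # most confident predictions
        if not conf_mask.any():
            break
        new_labels = model.classes_[proba[conf_mask].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[conf_mask]])       # grow the labeled set
        y_lab = np.concatenate([y_lab, new_labels])
        X_unlab = X_unlab[~conf_mask]                        # remove absorbed examples
    return model
```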

slide-134
SLIDE 134

Co-training

[Blum&Mitchell’98] Many problems have two different sources of info you can use to determine label. E.g., classifying webpages: can use words on page or words on links pointing to the page.

[Figure: a webpage "My Advisor: Prof. Avrim Blum" viewed through two feature sets: x1 = link info, x2 = text info, x = link info & text info]

slide-135
SLIDE 135

Co-training

• Learn a separate classifier for each view using the labeled data.
• Iteratively use each classifier on the unlabeled data to construct additional labeled training data for the other classifier.

slide-136
SLIDE 136

Active Learning

• Use a query function to select one or more tuples from the unlabeled data and request labels from an oracle (a human annotator).
• The newly labeled samples are added to the labeled data to train the model.
• Evaluated through learning curves: accuracy as a function of the number of instances queried.

slide-137
SLIDE 137

Summary

• Supervised learning: classification and regression
  • Decision tree
  • Naïve Bayesian classifier
  • Support Vector Machines
  • Linear regression and logistic regression
  • Neural Networks and Deep Learning
  • Lazy Learners (k Nearest Neighbors)
• Ensemble methods
• Model evaluation and selection
• Semi-supervised learning