Classification
Basic concepts Decision tree Naïve Bayesian classifier Model evaluation Support Vector Machines Regression Neural Networks and Deep Learning Lazy Learners (k Nearest Neighbors) Bayesian Belief Networks
Classification methods can be grouped into decision trees, linear functions, and nonlinear functions.
Linear discriminant function: g(x) is a linear function, g(x) = w^T x + b.
[Figure: in the (x1, x2) plane, the line g(x) = 0 separates the region where w^T x + b > 0 from the region where w^T x + b < 0.]
g(x) = 0 defines a hyper-plane in the feature space.
Unit normal vector of the hyper-plane: n = w / ||w||.
How would you classify these data points (denotes +1, denotes -1) with a linear discriminant function?
[Figure: several candidate separating lines in the (x1, x2) plane.]
There are an infinite number of answers! Which one is the best?
The linear discriminant function (classifier) with the maximum margin is the best: the margin acts as a "safe zone".
Margin is defined as the width that the boundary could be increased by before hitting a data point.
Why is it the best? It is robust to outliers and thus has strong generalization ability.
[Figure: the margin around the separating line in the (x1, x2) plane; points denote +1 and -1.]
Given a set of data points {(x_i, y_i)}:
  for y_i = +1: w^T x_i + b > 0
  for y_i = -1: w^T x_i + b < 0
With a scale transformation on both w and b, the above is equivalent to:
  for y_i = +1: w^T x_i + b >= 1
  for y_i = -1: w^T x_i + b <= -1
For the boundary points x+ and x- (one on each margin line): w^T x+ + b = 1 and w^T x- + b = -1.
The margin width is:
  M = (x+ - x-) . n = (x+ - x-) . w / ||w|| = 2 / ||w||
[Figure: margin between the two boundary lines in the (x1, x2) plane; points denote +1 and -1.]
The data points that the margin pushes up against are called support vectors.
Formulation (maximize the margin):
  maximize 2 / ||w||
  such that for y_i = +1: w^T x_i + b >= 1, and for y_i = -1: w^T x_i + b <= -1
Equivalent formulation:
  minimize (1/2) ||w||^2
  such that y_i (w^T x_i + b) >= 1 for all i
The solution has the form:
  w = sum over i in SV of a_i y_i x_i    (the sum runs only over the support vectors)
  b can be obtained from any support vector x_k using y_k (w^T x_k + b) = 1
The linear discriminant function is:
  g(x) = w^T x + b = sum over i in SV of a_i y_i x_i^T x + b
Notice it relies on a dot product between the test point x and the support vectors x_i.
Also keep in mind that solving the optimization problem involves computing the dot products x_i^T x_j between all pairs of training points.
Nonlinear functions (beyond decision trees and linear functions).
General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable:
  Φ: x → φ(x)
This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
With this mapping, our discriminant function is now:
  g(x) = w^T φ(x) + b = sum over i in SV of a_i y_i φ(x_i)^T φ(x) + b
No need to know this mapping explicitly, because we only use the dot product of feature vectors.
A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:
  K(x_i, x_j) = φ(x_i)^T φ(x_j)
Examples of commonly-used kernel functions:
  Linear kernel: K(x_i, x_j) = x_i^T x_j
  Polynomial kernel: K(x_i, x_j) = (1 + x_i^T x_j)^p
  Gaussian (Radial-Basis Function, RBF) kernel: K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2))
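A minimal NumPy sketch of the three kernels above (the function names and parameter defaults are our own, not from the slides):

```python
import numpy as np

def linear_kernel(xi, xj):
    # K(xi, xj) = xi^T xj
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, p=2):
    # K(xi, xj) = (1 + xi^T xj)^p
    return (1.0 + np.dot(xi, xj)) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    # K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
    diff = xi - xj
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

xi = np.array([1.0, 2.0])
xj = np.array([0.5, -1.0])
print(linear_kernel(xi, xj), polynomial_kernel(xi, xj), rbf_kernel(xi, xj))
```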
Summary of SVM:
1. Large margin classifier: better generalization ability and less over-fitting.
2. The kernel trick: map data points to a higher-dimensional space in which they become linearly separable. Since only the dot product is used, we do not need to compute the mapping explicitly.
Basic concepts Decision tree Naïve Bayesian classifier Model evaluation Support Vector Machines Regression Neural Networks and Deep Learning Lazy Learners (k Nearest Neighbors) Bayesian Belief Networks
Given input/output samples (X, y), we learn a function/model to map X to y.
Classification: y is discrete (class labels). Regression: y is continuous, e.g. linear regression.
[Figure: simple linear regression. x: independent variable (input); y: dependent variable (output).]
'Best fit' can be defined by a cost function. Least squares minimizes the sum of the squared errors (SSE):
  SSE = sum from i=1 to n of (y_i - ŷ_i)^2 = sum from i=1 to n of (y_i - (w_0 + w_1 x_i))^2
In the multivariate case, ŷ = w^T x (we assume x_1 is 1, so its weight serves as the intercept).
Closed form: set partial derivatives equal to zero and solve.
Gradient descent (GD).
Stochastic gradient descent (SGD).
Least squares: find w_0 and w_1 that minimize the sum of squared errors
  SSE = sum from i=1 to n of (y_i - ŷ_i)^2 = sum from i=1 to n of (y_i - (w_0 + w_1 x_i))^2
Setting the partial derivatives with respect to w_0 and w_1 to zero gives the closed-form solution:
  ŵ_1 = sum_i (x_i - x̄)(y_i - ȳ) / sum_i (x_i - x̄)^2 = SS_xy / SS_xx
  ŵ_0 = ȳ - ŵ_1 x̄
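A minimal NumPy sketch of this closed-form solution (variable names and the toy data are our own):

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Least-squares fit of y = w0 + w1 * x using the closed-form solution."""
    x_bar, y_bar = x.mean(), y.mean()
    ss_xy = np.sum((x - x_bar) * (y - y_bar))   # SS_xy
    ss_xx = np.sum((x - x_bar) ** 2)            # SS_xx
    w1 = ss_xy / ss_xx
    w0 = y_bar - w1 * x_bar
    return w0, w1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
print(fit_simple_linear_regression(x, y))   # roughly (0.15, 1.94)
```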
Closed form: set partial derivatives equal to zero and solve.
Gradient descent (GD): start with initial values and gradually move in the direction that decreases the cost.
Stochastic gradient descent (SGD): update the parameters after each training example.
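A sketch of the gradient-descent alternative for linear regression (the learning rate, iteration count, and data are arbitrary choices, not from the slides):

```python
import numpy as np

def gd_linear_regression(X, y, lr=0.01, n_iters=1000):
    """Batch gradient descent for y ~ X @ w, where X includes a column of 1s."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        residual = X @ w - y          # predictions minus targets
        grad = X.T @ residual / n     # gradient of (1 / 2n) * SSE
        w -= lr * grad                # move against the gradient
    return w

X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
y = np.array([2.1, 3.9, 6.2, 7.8])
print(gd_linear_regression(X, y, lr=0.05, n_iters=5000))  # approaches the closed-form answer
```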
So far we only used the observed values x1, x2, … However, linear regression can be applied in the same way to functions of these values.
E.g.: add new variables x1^2 and x1*x2, so each example becomes: x1, x2, …, x1^2, x1*x2.
As long as these functions can be directly computed from the observed values, the model remains linear in the parameters:
  y = w_0 + w_1 z_1 + w_2 z_2 + … + w_k z_k, where each z_j is a (possibly nonlinear) function of the inputs.
What type of functions can we use? A few common examples: powers such as x^2, products of inputs, and other fixed transformations.
Using our new notation for the basis functions, linear regression becomes:
  y = sum from j=0 to k of w_j φ_j(x)
where φ_j(x) can be either x_j (for multivariate regression) or a nonlinear basis function, and φ_0(x) = 1 for the intercept term.
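A sketch of the idea: expand each input through basis functions and run ordinary least squares on the expanded features (a simple polynomial expansion; the helper name and toy data are our own):

```python
import numpy as np

def polynomial_basis(x, degree):
    # phi_0(x) = 1, phi_1(x) = x, ..., phi_degree(x) = x^degree
    return np.column_stack([x ** j for j in range(degree + 1)])

x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)

Phi = polynomial_basis(x, degree=3)
# Ordinary least squares on the expanded features: w = argmin ||Phi w - y||^2
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)        # coefficients of the cubic fit
print(Phi @ w)  # fitted values
```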
[Example: fitting a model with n = 10 basis terms to a small data set. Without regularization the learned coefficients can grow extremely large, a symptom of over-fitting; regularization penalizes large coefficients.]
Coefficients suggest importance of, and correlation with, the output:
A large positive coefficient implies that the output increases when the input increases.
A large negative coefficient implies that the output decreases when the input increases.
A small or zero coefficient suggests that the input has little influence on the output.
Linear regression can thus be used to find the best 'indicator' variables.
From regression to classification: given input/output samples (X, y), we learn a function/model to map X to y.
Classification: y is discrete (class labels). Regression: y is continuous, e.g. linear regression.
[Figure: x: independent variable (input); y: dependent variable (output), taking values 0/1 for classification.]
A hard 0/1 decision output is non-differentiable, so it cannot be fit directly by gradient-based optimization.
Logistic regression instead models the probability of the positive class as a sigmoid (logistic) function of a linear combination of the inputs.
Decision Tree Linear Functions Nonlinear Functions
The partial derivative of the cost with one training example (x, y) leads to the update rules, where h_w(x) is the model's prediction:
  Stochastic Gradient Descent update (after each example): w := w + η (y - h_w(x)) x
  Gradient Descent update (after a full pass): w := w + η sum_i (y_i - h_w(x_i)) x_i
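A minimal sketch of both update rules, instantiated for logistic regression (the choice of the sigmoid model, learning rate, and epoch count are our own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, lr=0.1, n_epochs=100):
    """Stochastic gradient descent: update w after every single example (x, y)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for i in range(n):
            # gradient of the log-likelihood for one example
            w += lr * (y[i] - sigmoid(X[i] @ w)) * X[i]
    return w

def gd_logistic_regression(X, y, lr=0.1, n_epochs=100):
    """Batch gradient descent: one update per pass, using all examples."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        w += lr * X.T @ (y - sigmoid(X @ w)) / n
    return w
```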
Regression methods: linear regression, logistic regression.
Optimization: gradient descent, stochastic gradient descent.
Regularization.
Basic concepts Decision tree Naïve Bayesian classifier Model evaluation Support Vector Machines Regression Neural Networks and Deep Learning Lazy Learners (k Nearest Neighbors) Bayesian Belief Networks
Image recognition, speech recognition, natural language processing.
Decision Tree Linear Functions Nonlinear Functions
[Figure: an input vector X feeds the input layer, followed by a hidden layer and an output layer that produces the output vector.]
A neural network is a set of connected input/output units in which each connection has a weight associated with it.
Learning is done by adjusting the weights so that the network predicts the correct class labels of the input tuples.
Deep learning uses more hidden layers.
Perceptron
An n-dimensional input vector x is mapped into a variable y by means of a scalar product and a nonlinear function mapping.
The inputs to a unit are the outputs from the previous layer. They are multiplied by their corresponding weights to form a weighted sum, which is added to the bias associated with the unit. Then a nonlinear activation function is applied to it.
For example:
  y = sign(sum from i=0 to n of w_i x_i - μ_k)
where μ_k is the bias (threshold) of the unit.
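A minimal sketch of a single perceptron unit computing the weighted sum plus bias and applying the sign activation (values and names are our own):

```python
import numpy as np

def perceptron_output(x, w, bias):
    # y = sign(sum_i w_i * x_i + bias)
    return np.sign(np.dot(w, x) + bias)

x = np.array([3.0, 1.0, 2.0])     # input vector
w = np.array([0.2, -0.5, 0.4])    # weights
bias = -0.3
print(perceptron_output(x, w, bias))   # +1.0, since 0.6 - 0.5 + 0.8 - 0.3 = 0.6 > 0
```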
The input layer corresponds to the attributes measured for each training tuple.
The inputs are then weighted and fed simultaneously to the hidden layers.
The weighted outputs of the last hidden layer are input to the output layer, which emits the network's prediction.
From a statistical point of view, networks perform nonlinear regression: Given enough hidden units and enough training samples, they can closely approximate any function
Neural networks are a family of parametric, non-linear, and hierarchical learning functions.
If a small change in a weight (or bias) causes only a small change in the output, we can gradually adjust the weights and biases to make the network behave more like we want.
Does the perceptron work?
Activation functions: perceptron (step function) vs. sigmoid neuron, σ(z) = 1 / (1 + e^(-z)).
Other common activations: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)) and the ReLU g(y) = max(0, y).
Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer; hence "backpropagation".
Steps:
1. Initialize the weights to small random numbers, along with the associated biases.
2. Propagate the inputs forward (by applying the activation functions).
3. Backpropagate the error (by updating the weights and biases).
4. Terminating condition (when the error is very small, etc.).
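A minimal sketch of these steps for a network with one hidden layer and sigmoid activations, trained on a toy XOR problem (the architecture, learning rate, and epoch count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)        # XOR targets

# Step 1: initialize weights and biases to small random numbers
W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)
lr = 1.0

for epoch in range(5000):
    # Step 2: propagate the inputs forward
    h = sigmoid(X @ W1 + b1)            # hidden layer activations
    out = sigmoid(h @ W2 + b2)          # network output

    # Step 3: backpropagate the error (squared-error loss)
    d_out = (out - y) * out * (1 - out)           # delta at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)            # delta at the hidden layer
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

# Step 4: here we simply stop after a fixed number of epochs and inspect the error
print(np.round(out, 2))   # typically approaches [[0], [1], [1], [0]]; may need more epochs for some seeds
```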
Gradient Descent (batch GD): the cost gradient is based on the complete training set, so each update can be slow for large data sets.
Stochastic Gradient Descent (SGD, iterative or online GD): update the weights after each training sample; the gradient based on a single training sample is a noisy approximation of the true gradient. Converges faster, but the path towards the minimum may zig-zag.
Mini-Batch Gradient Descent (MB-GD): update the weights based on a small group of training samples, a compromise between batch GD and SGD.
Shallow networks vs. deep networks with multiple layers (deep learning).
Early layers detect lines in specific positions; higher-level detectors combine them (horizontal line, right-hand-side vertical line, upper loop, etc.).
Deep learning (a.k.a. representation learning) seeks to learn increasingly abstract feature representations automatically rather than relying on hand-designed features.
Pipeline: low-level features, mid-level features, high-level features, trainable classifier.
[Figure: feature visualization of a convolutional net trained on ImageNet (Zeiler and Fergus, 2013).]
Commonly used architectures: convolutional neural networks (CNNs), recurrent neural networks (RNNs).
Inputs can have very high dimensions, so a fully-connected neural network would need a very large number of parameters.
CNNs are a special type of neural network using shared weights and local connections, so the number of parameters needed by CNNs is much smaller.
Example: 200x200 image. (a) Fully connected with 40,000 hidden units => 1.6 billion parameters. (b) CNN with 100 filters of size 5x5 => 2,500 parameters.
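The parameter counts in this example can be checked directly (the fully connected figure counts only input-to-hidden weights; bias terms are ignored):

```python
# 200x200 input image
n_inputs = 200 * 200

# (a) fully connected: every input connects to each of 40,000 hidden units
fc_params = n_inputs * 40_000
print(fc_params)          # 1_600_000_000  (1.6 billion weights)

# (b) CNN: 100 filters, each 5x5, shared across all spatial positions
cnn_params = 5 * 5 * 100
print(cnn_params)         # 2_500 weights
```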
Images are segmented into sub-regions; each sub-region yields a feature map representing its features. Feature maps are computed by neurons with learned weights, and feature maps of larger regions are combined.
Shared weights: the convolutional layer consists of a set of filters. Each filter covers a spatially small portion of the input and is convolved across the dimensions of the input, producing a feature map.
[Figure: pooling layers downsample the feature maps.]
Standard neural networks (including convolutional networks): assume input examples are vectors of fixed length (e.g., fixed-size images) and use a fixed number of computational steps.
Recurrent neural networks: model data with temporal or sequential structure and allow varying lengths of inputs and outputs.
Recurrent neural networks are networks with loops in them, allowing information to persist.
At time t, the network takes some input x_t and outputs a value h_t. A loop allows information to be passed from one step of the network to the next.
Weaknesses: long training time; requires a large amount of training data; poor interpretability (it is difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network).
Strengths: successful on an array of real-world data (e.g., hand-written letters); high tolerance to noisy data; well suited for continuous-valued inputs and outputs; the algorithms are inherently parallel.
Deep learning frameworks survey: https://www.microway.com/hpc-tech-tips/deep-learning-frameworks-survey-tensorflow-torch-theano-caffe-neon-ibm-machine-learning-stack/
TensorFlow playground: http://playground.tensorflow.org/
Basic concepts Decision tree Naïve Bayesian classifier Model evaluation Support Vector Machines Regression Neural Networks and Deep Learning Lazy Learners (k Nearest Neighbors)
Lazy vs. eager learning
Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple.
Eager learning (the previously discussed methods): given a set of training tuples, constructs a classification model before receiving new data to classify.
Efficiency: lazy methods spend less time in training but more time in predicting.
Accuracy: a lazy method effectively uses a richer hypothesis space, since it uses many local functions to form an implicit global approximation to the target function; an eager method must commit to a single hypothesis (global approximation).
Majority vote within the k nearest neighbors.
[Figure: a new example classified by its neighbors; k = 1: positive, k = 3: negative.]
Distance-weighted nearest neighbor algorithm: weight the contribution of each of the k neighbors according to its distance to the query point x_q, giving greater weight to closer neighbors, e.g. w = 1 / d(x_q, x_i)^2.
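A minimal sketch of k-NN classification with optional 1/d^2 distance weighting (function names and toy data are our own):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, weighted=False):
    """Classify x_query by (weighted) majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_query, axis=1)      # Euclidean distances
    nearest = np.argsort(dists)[:k]                         # indices of the k closest points
    if not weighted:
        return Counter(y_train[nearest]).most_common(1)[0][0]
    votes = {}
    for i in nearest:
        w = 1.0 / (dists[i] ** 2 + 1e-12)                   # weight closer neighbors more
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.0], [4.2, 3.9]])
y_train = np.array(["+", "+", "-", "-"])
print(knn_predict(X_train, y_train, np.array([1.5, 1.0]), k=3))   # "+"
```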
[Figure: the neighborhood of a query point x under (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor.]
Choosing the value of k:
If k is too small, the classifier is sensitive to noise points.
If k is too large, the neighborhood may include points from other classes.
How to measure closeness depends on how we view the data objects:
as points: distance between points
as vectors: cosine between vectors
as random variables: correlation
as sets: Jaccard distance between sets
as strings: Hamming distance
Data matrix: n objects, each described by p attributes (x_i1, …, x_ip).
Euclidean distance:
  d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + … + |x_ip - x_jp|^2)
Manhattan distance:
  d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + … + |x_ip - x_jp|
Minkowski distance:
  d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + … + |x_ip - x_jp|^q)^(1/q)
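A sketch of the three distances, written directly from the formulas above (the Minkowski form covers the other two as special cases):

```python
import numpy as np

def minkowski(xi, xj, q):
    # d(i, j) = (sum_f |x_if - x_jf|^q)^(1/q)
    return np.sum(np.abs(xi - xj) ** q) ** (1.0 / q)

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 0.0, 3.0])

print(minkowski(xi, xj, q=2))   # Euclidean distance: sqrt(9 + 4 + 0) ~ 3.606
print(minkowski(xi, xj, q=1))   # Manhattan distance: 3 + 2 + 0 = 5
print(minkowski(xi, xj, q=3))   # Minkowski distance with q = 3
```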
To compute the distance contribution of attribute f between objects i and j:
f is numeric (interval or ratio scale): scaling issues, so normalize first; then use |x_if - x_jf|.
f is ordinal: map the value to its rank r_if in {1, …, M_f} and scale to [0, 1] with z_if = (r_if - 1) / (M_f - 1); then treat it as numeric.
f is nominal: use a simple mapping function, e.g. contribution 0 if x_if = x_jf and 1 otherwise.
Strings: Hamming distance (edit distance).
Normalization: scale attribute values to fall within a small, specified range.
Min-max normalization, from [min_A, max_A] to [new_min_A, new_max_A]:
  v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
Example: let income range from $12,000 to $98,000 and be normalized to [0.0, 1.0]; then $73,600 is mapped to (73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0) + 0 = 0.716.
Z-score normalization (μ: mean, σ: standard deviation):
  v' = (v - μ_A) / σ_A
Example: let μ = 54,000 and σ = 16,000; then 73,600 maps to (73,600 - 54,000) / 16,000 = 1.225.
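Both normalizations, reproducing the income example above (a minimal sketch; function names are our own):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # map [min_A, max_A] linearly onto [new_min, new_max]
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score_normalize(v, mu, sigma):
    # shift by the mean and scale by the standard deviation
    return (v - mu) / sigma

print(min_max_normalize(73_600, 12_000, 98_000))      # 0.716
print(z_score_normalize(73_600, 54_000, 16_000))      # 1.225
```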
Weighted distance: assign weights w_i to different attributes. If w_i is the inverse variance of attribute i, this amounts to a standardized (z-score style) distance.
Supervised metric learning: learn the weights w_i from labeled data.
Euclidean distance may not be meaningful for all types of data (e.g., high-dimensional, sparse document vectors), which motivates other similarity measures.
as points: distance between points
as vectors: cosine between vectors
as random variables: correlation
as sets: Jaccard distance between sets
as strings: Hamming distance
Cosine measure (ranges from -1 to 1):
  cos(X_i, X_j) = (X_i . X_j) / (||X_i|| ||X_j||)
Cosine similarity is invariant to multiplicative scaling but variant to additive scaling.
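A sketch of the cosine measure, illustrating both properties (multiplying a vector by a positive constant leaves the cosine unchanged; adding a constant does not). The toy vectors are our own:

```python
import numpy as np

def cosine_similarity(xi, xj):
    # cos(Xi, Xj) = (Xi . Xj) / (||Xi|| * ||Xj||)
    return np.dot(xi, xj) / (np.linalg.norm(xi) * np.linalg.norm(xj))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 1.0, 1.0])

print(cosine_similarity(a, b))          # baseline value
print(cosine_similarity(3 * a, b))      # same value: invariant to multiplicative scaling
print(cosine_similarity(a + 5, b))      # different value: variant to additive scaling
```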
Correlation coefficient (also called Pearson's product-moment coefficient):
  r_{A,B} = sum_i (a_i - Ā)(b_i - B̄) / (n σ_A σ_B) = (sum_i a_i b_i - n Ā B̄) / (n σ_A σ_B)
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations, and sum_i a_i b_i is the sum of the AB cross-product.
r_{A,B} > 0: A and B are positively correlated (A's values increase as B's do).
r_{A,B} = 0: uncorrelated (no linear relationship).
r_{A,B} < 0: negatively correlated.
Scatter plots showing the Pearson correlation from –1 to 1.
For transaction data and document data, the shared items are the most important to consider.
The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:
  sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
Jaccard distance: d(C1, C2) = 1 - sim(C1, C2)
(Source: Mining of Massive Datasets, http://www.mmds.org)
Example: 3 items in the intersection and 8 in the union give Jaccard similarity = 3/8 and Jaccard distance = 5/8.
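The 3/8 example can be reproduced with Python sets (a minimal sketch; the element names are our own):

```python
def jaccard_similarity(c1, c2):
    # |C1 ∩ C2| / |C1 ∪ C2|
    return len(c1 & c2) / len(c1 | c2)

c1 = {"a", "b", "c", "d", "e", "f"}
c2 = {"d", "e", "f", "g", "h"}

sim = jaccard_similarity(c1, c2)   # 3 in the intersection, 8 in the union
print(sim, 1 - sim)                # 0.375 (= 3/8) and 0.625 (= 5/8)
```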
as points: distance between points
as vectors: cosine between vectors
as random variables: correlation
as sets: Jaccard distance between sets
as strings: Hamming distance
Decision tree Naïve Bayesian classifier Support Vector Machines Regression Neural Networks and Deep Learning Lazy Learners (k Nearest Neighbors)
Supervised learning: learning from labeled data. Labeled data can be rare or expensive, while unlabeled data is often plentiful.
Semi-supervised learning: use both labeled and unlabeled data; approaches include self-training and co-training.
Active learning: iterative supervised learning in which the learner selects which examples to have labeled.
Self-training: build a classifier using the labeled data. Repeatedly use it to label the unlabeled data, add the most confidently labeled examples to the labeled set, and retrain.
Caveat: self-training may reinforce its own errors.
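A sketch of the self-training loop using a generic scikit-learn-style classifier (the confidence threshold, round limit, and choice of logistic regression are assumptions, not from the slides):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_labeled, y_labeled, X_unlabeled, threshold=0.95, max_rounds=10):
    """Repeatedly label the unlabeled pool and absorb the most confident predictions."""
    clf = LogisticRegression()
    for _ in range(max_rounds):
        clf.fit(X_labeled, y_labeled)                     # build a classifier on the labeled data
        if len(X_unlabeled) == 0:
            break
        proba = clf.predict_proba(X_unlabeled)            # label the unlabeled data
        confident = proba.max(axis=1) >= threshold        # keep only high-confidence predictions
        if not confident.any():
            break
        X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])
        y_labeled = np.concatenate([y_labeled, clf.classes_[proba[confident].argmax(axis=1)]])
        X_unlabeled = X_unlabeled[~confident]             # note: errors absorbed here can be reinforced
    return clf
```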
[Example: classifying web pages (e.g., "My Advisor") using two views: x1 is the link information, x2 is the text information, and x combines link and text information.]
Co-training: learn a separate classifier for each view using the labeled data.
Iteratively use each classifier on the unlabeled data to produce confidently labeled examples that are added to the other classifier's training set.
Active learning: use a query function to select one or more tuples from the unlabeled data and ask an oracle (e.g., a human annotator) to label them.
The newly labeled samples are added to the labeled data to retrain the model.
Evaluated through learning curves: accuracy as a function of the number of instances queried.
Supervised learning: classification and regression
Decision tree Naïve Bayesian classifier Support Vector Machines Linear regression and logistic regression Neural Networks and Deep Learning Lazy Learners (k Nearest Neighbors)
Ensemble methods Model evaluation and selection Semi-supervised learning