SLIDE 1

Lecture 23: Final Exam Review

  • Dr. Chengjiang Long

Computer Vision Researcher at Kitware Inc. Adjunct Professor at RPI. Email: longc3@rpi.edu

SLIDE 2

Final Project Presentation Agenda on May 1st

No. | Start time | Duration | Project name | Authors
1 | 1:30pm | 00:10:00 | Handwritten digits recognition | Kimberly Oakes
2 | 1:40pm | 00:10:00 | Character Recognition | Xiangyang Mou, Tong Jian
3 | 1:50pm | 00:10:00 | Hand-drawn recognition | Deniz Koyuncu
4 | 2:00pm | 00:10:00 | Human Face Recognition | Chao-Ting Hsueh, Huaiyuan Chu, Yilin Zhu
5 | 2:10pm | 00:10:00 | Head Pose Estimation | Lisa Chen
6 | 2:20pm | 00:10:00 | Facial expressions | Cameron Mine
7 | 2:30pm | 00:10:00 | Kickstarter: succeed or fail? | Jeffrey Chen and Steven Sperazza
8 | 2:40pm | 00:10:00 | Tragedy of Titanic: can a person on board survive or not? | Ziyi Wang, Dewei Hu
9 | 2:50pm | 00:10:00 | Classifying groceries by image using CNN | Rui Li, Yan Wang
10 | 3:00pm | 00:10:00 | Neural Style Transfer for Video | Sarthak Chatterjee and Ashraful Islam
11 | 3:10pm | 00:10:00 | Feature selection | Zijun Cui

SLIDE 3

Guideline for the final project presentation

  • Briefly review the importance, the problem you solved, and the objective of your project - 1-3 slides (you can reuse some of your proposal slides).

  • The details of the solution you used in the project - at least 2 slides.

  • Experiment part - at least 3 slides.
  • detailed information about the data set.
  • data split.
  • details about feature extraction.
  • hyper-parameter selection.
  • comparison results.
  • discussion based on your experimental observations.
  • Conclusion and future work - 1 slide.
  • Share with the others the mistakes you encountered and the lessons you learned while completing the final project - 1 slide.

  • List the references. - 1 slide

8-10 min presentation, including Q&A. I recommend using as many informative figures as possible to share what you did with your classmates.

SLIDE 4

Guideline for the final project report (May 2)

  • Title
  • Abstract
  • Introduction (can include related work)
  • Related work (if not included in the introduction section, list it as a separate section)

  • Techniques
  • Experiments (describe the data set in detail, the data split, feature extraction, and hyper-parameter selection; show comparison results and discuss your experimental observations).

  • Conclusion and Future work.
  • References.
SLIDE 5

Pattern recognition design cycle

Collect data → Select features → choose a model: classifier model (train classifier, evaluate classifier), regression model (train regressor, evaluate regressor), or clustering model (train mixture model, evaluate clustering).

SLIDE 6

Pattern recognition design cycle

Collect data → Select features → Classifier model → Train classifier → Evaluate classifier. Classification models covered: ensemble classification models, Hidden Markov Models, multi-layer neural networks, Convolutional Neural Networks.

SLIDE 7

The Bagging Algorithm

  • B bootstrap samples
  • From which we derive:
  • The aggregate classifier becomes:
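The formulas on this slide did not survive extraction; a standard statement of bagging, with notation assumed rather than taken from the slide, is:

\[
D^{*b} \sim \text{Bootstrap}(D), \qquad \hat{f}^{*b} = \text{train}(D^{*b}), \qquad b = 1,\dots,B
\]
\[
\hat{f}_{\text{bag}}(x) = \operatorname*{arg\,max}_{y} \sum_{b=1}^{B} \mathbb{1}\!\left[\hat{f}^{*b}(x) = y\right]
\]

i.e., each bootstrap sample trains one classifier, and the aggregate classifier takes a majority vote.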
SLIDE 8

Example (1)

SLIDE 9

Example (2)

Testing set

SLIDE 10

Random Forest

SLIDE 11

Training and Information Gain

SLIDE 12

Classification Forest: Ensemble Model

SLIDE 13

AdaBoost

  • Constructing Dt

where Zt is a normalization constant
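The construction itself did not extract; the standard AdaBoost weight update (notation assumed) is:

\[
D_{t+1}(i) = \frac{D_t(i)\,\exp\!\big(-\alpha_t\, y_i\, h_t(x_i)\big)}{Z_t},
\qquad
Z_t = \sum_{i} D_t(i)\,\exp\!\big(-\alpha_t\, y_i\, h_t(x_i)\big),
\qquad
D_1(i) = \tfrac{1}{m}.
\]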

SLIDE 14

The AdaBoost Algorithm

SLIDE 15

The AdaBoost Algorithm

SLIDE 16

Analyzing training error

  • What αt to choose for hypothesis ht?
  • If each weak learner ht is slightly better than random

guessing (εt < 0.5), then the training error of AdaBoost decays exponentially fast in the number of rounds T.

SLIDE 17

What αt to choose for hypothesis ht?

  • For boolean target function, this is accomplished by

[Freund & Schapire ’97]:

  • We can tighten this bound greedily, by choosing αt and

ht on each iteration to minimize Zt.
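The formulas referenced here are not recoverable from the extraction; in the standard analysis (Freund & Schapire '97), the training error is bounded by the product of the normalizers, and the greedy choice of αt (with εt the weighted error of ht) is:

\[
\frac{1}{m}\sum_{i=1}^{m}\mathbb{1}\big[H(x_i)\neq y_i\big] \;\le\; \prod_{t=1}^{T} Z_t,
\qquad
\alpha_t = \tfrac{1}{2}\ln\!\frac{1-\varepsilon_t}{\varepsilon_t},
\qquad
Z_t = 2\sqrt{\varepsilon_t(1-\varepsilon_t)}.
\]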


SLIDE 19

Dumb classifiers made Smart

  • Training error of final classifier is bounded by:
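The bound itself did not extract; the standard statement, with γt = 1/2 − εt the edge of weak learner ht, is:

\[
\text{training error}(H) \;\le\; \prod_{t=1}^{T} 2\sqrt{\varepsilon_t(1-\varepsilon_t)}
\;\le\; \exp\!\Big(-2\sum_{t=1}^{T}\gamma_t^2\Big).
\]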
SLIDE 20

Hidden Markov Models

  • Parameters – stationary/homogeneous Markov model (independent of time t)
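The parameter definitions did not extract; a stationary HMM is conventionally parameterized as (notation assumed):

\[
\pi_i = p(q_1 = s_i), \qquad
a_{ij} = p(q_{t+1} = s_j \mid q_t = s_i), \qquad
b_j(o) = p(o_t = o \mid q_t = s_j),
\]

where the transition probabilities a_{ij} and emission probabilities b_j(o) do not depend on t.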

SLIDE 21

Three main problems in HMMs

SLIDE 22

HMM Algorithms

  • Evaluation

– What is the probability of the observed sequence? Forward Algorithm

  • Decoding

– What is the probability that the third roll was loaded given the observed sequence? Forward-Backward Algorithm
– What is the most likely die sequence given the observed sequence? Viterbi Algorithm

  • Learning

– Under what parameterization is the observed sequence most probable? Baum-Welch Algorithm (EM)
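As a concrete reference for the evaluation problem, here is a minimal forward-algorithm sketch in Python; the variable names and the fair/loaded-die numbers are illustrative assumptions, not values from the slides.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: returns p(obs | model).
    pi: (N,) initial state probabilities
    A:  (N, N) transition probabilities, A[i, j] = p(s_j | s_i)
    B:  (N, M) emission probabilities,  B[i, k] = p(o_k | s_i)
    obs: sequence of observation indices
    """
    alpha = pi * B[:, obs[0]]              # initialization: alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # induction step
    return alpha.sum()                     # termination: sum over final states

# Toy fair/loaded die model (numbers are illustrative)
pi = np.array([0.5, 0.5])                  # states: 0 = fair, 1 = loaded
A = np.array([[0.95, 0.05],
              [0.10, 0.90]])
B = np.vstack([np.full(6, 1 / 6),          # fair die
               [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])  # loaded die favors "6"
print(forward(pi, A, B, [5, 5, 0, 5]))     # die faces 1..6 encoded as indices 0..5
```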

SLIDE 23

Notations

SLIDE 24

Forward Probability

SLIDE 25

Forward Algorithm

SLIDE 26-30

Forward Algorithm Example (worked example carried across slides 26-30; these slides contain only figures)

SLIDE 31

Viterbi Algorithm

SLIDE 32

Viterbi Algorithm
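The algorithm itself is presented in figures that are not recoverable here; as a reference, a minimal Viterbi sketch in the same (assumed) notation as the forward-algorithm code above:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return the most likely state sequence for the observations (decoding)."""
    N, T = len(pi), len(obs)
    delta = np.log(pi) + np.log(B[:, obs[0]])     # log-probability of the best path per state
    psi = np.zeros((T, N), dtype=int)             # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)       # scores[i, j]: best path ending in i, then i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]                  # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```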

SLIDE 33-42

Viterbi Algorithm Example (worked example carried across slides 33-42; these slides contain only figures)

SLIDE 43

Network structures

  • Feed-forward networks:

– single-layer perceptrons – multi-layer perceptrons

  • Feed-forward networks implement functions, have no

internal state

SLIDE 44

Feed-forward example

  • Feed-forward network = a parameterized family of

nonlinear functions:

  • Adjusting weights changes the function: do learning

this way!
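The worked example is in the slide figure; a common form of this example (assumed here, standard textbook notation) for a 2-input, 2-hidden-unit, 1-output network is:

\[
a_5 = g\big(W_{3,5}\, a_3 + W_{4,5}\, a_4\big)
    = g\big(W_{3,5}\, g(W_{1,3}x_1 + W_{2,3}x_2) + W_{4,5}\, g(W_{1,4}x_1 + W_{2,4}x_2)\big),
\]

i.e., the output is a nested composition of the activation function g, parameterized by the weights W, and changing the weights changes the function being computed.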

SLIDE 45

Backpropagation learning algorithm BP

  • Solution to credit assignment problem in MLP. Rumelhart,

Hinton and Williams (1986) (though actually invented earlier in a PhD thesis relating to economics)

  • BP has two phases:

Forward pass phase: computes the ‘functional signal’ by feed-forward propagation of the input pattern signals through the network.

Backward pass phase: computes the ‘error signal’ by propagating the error backwards through the network, starting at the output units (where the error is the difference between the actual and desired output values).

SLIDE 46

Conceptually: Forward Activity - Backward Error

Output node i, hidden node j, input node k. Link between hidden node j and output node i: Wji. Link between input node k and hidden node j: Wkj.

SLIDE 47

Back-propagation derivation

  • The squared error on a single example is defined as

where the sum is over the nodes in the output layer.
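Written out (notation assumed), the definition referenced above is:

\[
E = \tfrac{1}{2}\sum_{i}\big(y_i - a_i\big)^2,
\]

where a_i is the activation of output node i and y_i the corresponding target value.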

SLIDE 48

Back-propagation derivation contd.

SLIDE 49

Back-propagation Learning

  • Output layer: same as the single-layer perceptron

where

  • Hidden layer: back-propagation the error from the output

layer.

  • Update rules for the weights in the hidden layers.

(Most neuroscientists deny that back-propagation occurs in the brain)
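The update rules themselves did not extract; a standard form for activation function g with learning rate η (notation assumed, using the Wji/Wkj link naming from the earlier slide) is:

\[
\Delta_i = (y_i - a_i)\, g'(\mathrm{in}_i), \qquad
W_{j,i} \leftarrow W_{j,i} + \eta\, a_j\, \Delta_i \quad \text{(output layer)},
\]
\[
\Delta_j = g'(\mathrm{in}_j)\sum_{i} W_{j,i}\, \Delta_i, \qquad
W_{k,j} \leftarrow W_{k,j} + \eta\, a_k\, \Delta_j \quad \text{(hidden layer)}.
\]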

SLIDE 50

Learning Algorithm: Backpropagation

  • The pictures below illustrate how the signal propagates through the network. Symbols w(xm)n represent the weights of the connections between network input xm and neuron n in the input layer. Symbols on represent the output signal of neuron n.

SLIDE 51

Learning Algorithm: Backpropagation

SLIDE 52

Learning Algorithm: Backpropagation

SLIDE 53

Learning Algorithm: Backpropagation

  • Propagation of signals through the hidden layer.

Symbols wmn represent weights of connections between output of neuron m and input of neuron n in the next layer.

SLIDE 54

Learning Algorithm: Backpropagation

SLIDE 55

Learning Algorithm: Backpropagation

  • Propagation of signals through the output layer.
SLIDE 56

Learning Algorithm: Backpropagation

  • In the next step of the algorithm, the output signal of the network y is compared with the desired output value (the target), which is found in the training data set. The difference is called the error signal of the output-layer neuron.

SLIDE 57

Learning Algorithm: Backpropagation

  • The idea is to propagate the error signal (computed in a single step) back to all neurons whose output signals were inputs to the neuron being discussed.


SLIDE 59

Learning Algorithm: Backpropagation

  • The weight coefficients wmn used to propagate errors back are the same as those used when computing the output value; only the direction of data flow is reversed (signals are propagated from outputs to inputs, layer after layer). This technique is used for all network layers. If the propagated errors come from several neurons, they are added. The illustration is below:


SLIDE 62

Learning Algorithm: Backpropagation

  • Once the error signal for each neuron has been computed, the weight coefficients of each neuron's input connections may be modified.

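To tie the walkthrough on slides 50-63 together, here is a minimal single-hidden-layer backpropagation step in Python; the sigmoid activation, layer sizes, and learning rate are illustrative assumptions, not values from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                     # one input pattern (3 features)
t = np.array([1.0])                        # desired output (target)
W1 = rng.normal(scale=0.1, size=(4, 3))    # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(1, 4))    # hidden -> output weights
eta = 0.1                                  # learning rate

# Forward pass: propagate the "functional signal"
h = sigmoid(W1 @ x)                        # hidden activations
y = sigmoid(W2 @ h)                        # network output

# Backward pass: propagate the "error signal"
delta_out = (t - y) * y * (1 - y)              # output-layer error
delta_hid = (W2.T @ delta_out) * h * (1 - h)   # hidden-layer error (same weights, reversed flow)

# Weight updates once every error signal is known
W2 += eta * np.outer(delta_out, h)
W1 += eta * np.outer(delta_hid, x)
```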

SLIDE 64

Classification

  • Classification with a traditional pipeline: input → preprocessing for feature extraction (f1 … fn) → classification → output.
  • Classification with a convolutional neural network: input → feature extraction → shift and distortion invariance classification → output.

SLIDE 65

CNN’s Topology

Feature extraction layer = convolution layer (C); shift and distortion invariance layer = pooling layer (P); these layers produce feature maps.

SLIDE 66

Feature extraction

  • Shared weights: all neurons in a feature share the

same weights (but not the biases).

  • In this way all neurons detect the same feature at

different positions in the input image.

  • Reduce the number of free parameters.


SLIDE 67

Putting it all together

SLIDE 68

Intuition behind Deep Neural Nets

  • The final layer outputs a probability distribution of

categories.

SLIDE 69

Joint training architecture overview

SLIDE 70

Lots of pretrained ConvNets

  • Caffe models: https://github.com/BVLC/caffe/wiki/Model-Zoo
  • TensorFlow models:

https://github.com/tensorflow/models/tree/master/research/slim

  • PyTorch models: https://github.com/Cadene/pretrained-models.pytorch


SLIDE 71

Disadvantages

  • From a memory and capacity standpoint the CNN is

not much bigger than a regular two layer network.

  • At runtime the convolution operations are

computationally expensive and take up about 67% of the time.

  • CNNs are about 3X slower than their fully connected equivalents (size-wise).

SLIDE 72

Disadvantages

  • Convolution operation
  • 4 nested loops (2 loops over the input image & 2 loops over the kernel)
  • Small kernel size
  • makes the inner loops very inefficient, as they frequently JMP.
  • Cache-unfriendly memory access
  • Back-propagation requires both row-wise and column-wise access to the input and kernel images.
  • 2D images are represented in a row-wise (serialized) order.
  • Column-wise access to the data can result in a high rate of cache misses in the memory subsystem.
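As a concrete illustration of the 4-nested-loop structure described above, a minimal (illustrative, assumed) implementation in Python; production frameworks avoid exactly this form because of the cache behaviour noted on the slide.

```python
import numpy as np

def conv2d_naive(image, kernel):
    """Valid 2D cross-correlation with 4 nested loops:
    2 loops over output positions + 2 loops over the kernel."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):           # loop over output rows
        for j in range(out.shape[1]):       # loop over output cols
            acc = 0.0
            for u in range(kH):             # loop over kernel rows
                for v in range(kW):         # loop over kernel cols
                    acc += image[i + u, j + v] * kernel[u, v]
            out[i, j] = acc
    return out

img = np.arange(36, dtype=float).reshape(6, 6)
k = np.ones((3, 3)) / 9.0                   # 3x3 box filter
print(conv2d_naive(img, k).shape)           # (4, 4)
```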

SLIDE 73

Activation Functions

SReLU (Shift Rectified Linear Unit)

max(-1, x)

SLIDE 74

In practice

  • Use ReLU. Be careful with your learning rates
  • Try out Leaky ReLU / Maxout / ELU
  • Try out tanh but don’t expect much
  • Don’t use sigmoid
SLIDE 75

Mini-batch SGD

  • Loop:
  • 1. Sample a batch of data
  • 2. Forward prop it through the graph, get loss
  • 3. Backprop to calculate the gradients
  • 4. Update the parameters using the gradient
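A minimal sketch of this loop for a linear model with squared loss; the model, batch size, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)     # toy dataset
w, lr, batch = np.zeros(5), 0.01, 32

for step in range(200):
    idx = rng.choice(len(X), size=batch, replace=False)      # 1. sample a batch of data
    Xb, yb = X[idx], y[idx]
    pred = Xb @ w                                            # 2. forward prop, get the loss
    loss = np.mean((pred - yb) ** 2)
    grad = 2 * Xb.T @ (pred - yb) / batch                    # 3. "backprop": gradient of the loss
    w -= lr * grad                                           # 4. update parameters using the gradient
```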
SLIDE 76

Overview of gradient descent optimization algorithms

Link: http://ruder.io/optimizing-gradient-descent/

SLIDE 77

Which Optimizer to Use?

  • If your input data is sparse, then you are likely to achieve the best results using one of the adaptive learning-rate methods.
  • RMSprop is an extension of Adagrad that deals with its radically diminishing learning rates. Adam, finally, adds bias-correction and momentum to RMSprop. Insofar, RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances.
  • Experiments show that bias-correction helps Adam slightly outperform RMSprop towards the end of optimization as gradients become sparser. Insofar, Adam might be the best overall choice.
  • Interestingly, many recent papers use SGD without momentum and a simple learning rate annealing schedule. As has been shown, SGD usually manages to find a minimum, but it might take significantly longer than with some of the optimizers, is much more reliant on a robust initialization and annealing schedule, and may get stuck in saddle points rather than local minima.
  • If you care about fast convergence and train a deep or complex neural network, you should choose one of the adaptive learning-rate methods.

SLIDE 78

Learning rate

  • SGD, SGD+Momentum, Adagrad, RMSProp, Adam

all have learning rate as a hyperparameter.

SLIDE 79

L-BFGS

  • Usually works very well in full batch, deterministic

mode.

  • i.e. if you have a single, deterministic f(x) then L-BFGS will

probably work very nicely

  • Does not transfer very well to mini-batch setting.
  • Gives bad results. Adapting L-BFGS to large-scale,

stochastic setting is an active area of research

  • In practice:
  • Adam is a good default choice in most cases
  • If you can afford to do full batch updates then try out L-BFGS (and don't forget to disable all sources of noise)

SLIDE 80

Regularization: Dropout

  • “randomly set some neurons to zero in the forward

pass”

[Srivastava et al., 2014]

SLIDE 81

Regularization: Dropout

  • Wait a second… How could this possibly be a good

idea?

Another interpretation: Dropout is training a large ensemble of models (that share parameters). Each binary mask is one model and gets trained on only ~one datapoint.

SLIDE 82

At test time….

  • Ideally:
  • want to integrate out all the

noise

  • Monte Carlo approximation:
  • do many forward passes with

different dropout masks, average all predictions

SLIDE 83

At test time….

  • Can in fact do this with a single forward pass!

(approximately)

  • Leave all input neurons turned on (no dropout).

Q: Suppose that with all inputs present at test time the output of this neuron is x. What would its output be during training time, in expectation? (e.g. if p = 0.5)
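A minimal sketch of the idea in Python (layer sizes are illustrative; p denotes the keep probability, as in the slide's p = 0.5 example): a random binary mask zeroes activations during training, so the expected activation is p·x, and the single test-time forward pass compensates by scaling.

```python
import numpy as np

p = 0.5                                   # probability of keeping a unit
rng = np.random.default_rng(0)
h = rng.normal(size=100)                  # some layer's activations

# Training: random binary mask in the forward pass
mask = rng.random(h.shape) < p
h_train = h * mask                        # E[h_train] = p * h  (half of x when p = 0.5)

# Test: keep all units, scale by p so expectations match training
h_test = h * p

# ("Inverted dropout" instead divides by p at training time and leaves test-time untouched.)
```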

SLIDE 84

At test time….

  • Can in fact do this with a single forward pass!

(approximately)

  • Leave all input neurons turned on (no dropout).
SLIDE 86

Pattern recognition design cycle

Collect data → Select features → Regression model → Train regressor → Evaluate regressor. Regression models covered: linear regression, support vector regression, logistic regression, Convolutional Neural Networks.

SLIDE 87

Linear Regression

  • Given data with n-dimensional variables and one real-valued target variable, where

  • The objective: Find a function f that returns the best fit.
  • To find the best fit, we minimize the sum of squared

errors -> Least square estimation

  • The solution can be found by solving
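The closed-form solution referenced above did not survive extraction; for a design matrix X and target vector y (notation assumed) it is the normal equation:

\[
w^* = \operatorname*{arg\,min}_{w} \|Xw - y\|^2 = (X^\top X)^{-1} X^\top y.
\]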
SLIDE 88

Linear Regression

To avoid over-fitting, a regularization term can be introduced (minimizing the magnitude of w).
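In the same assumed notation, the L2-regularized (ridge) solution is:

\[
w^* = \operatorname*{arg\,min}_{w} \|Xw - y\|^2 + \lambda \|w\|^2 = (X^\top X + \lambda I)^{-1} X^\top y.
\]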

SLIDE 89

Support Vector Regression

  • Find a function, f(x), with at most ε-deviation from

the target y
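The objective did not extract; the standard linear ε-insensitive SVR primal (notation assumed) is:

\[
\min_{w,b}\; \tfrac{1}{2}\|w\|^2
\quad \text{s.t.} \quad
\big|y_i - (w^\top x_i + b)\big| \le \varepsilon \;\; \forall i,
\]

and the soft-margin version on the following slides adds slack variables ξ_i, ξ_i^* for points that violate the ε-tube.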

SLIDE 90

Support Vector Regression

SLIDE 91

Soft margin

SLIDE 92

Logistic Regression

SLIDE 93

Logistic Regression Objective Function

  • Can’t just use squared loss as in linear regression

– Using the logistic regression model results in a non-convex optimization

SLIDE 94

Deriving the Cost Function via Maximum Likelihood Estimation

SLIDE 95

Deriving the Cost Function via Maximum Likelihood Estimation

SLIDE 96

Regularized Logistic Regression

  • We can regularize logistic regression exactly as before
SLIDE 97

Another Interpretation

  • Equivalently, logistic regression assumes that
  • In other words, logistic regression assumes that the log-odds is a linear function of x
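Written out (notation assumed):

\[
\log\frac{p(y=1\mid x)}{p(y=0\mid x)} = \theta^\top x
\quad\Longleftrightarrow\quad
p(y=1\mid x) = \frac{1}{1+e^{-\theta^\top x}}.
\]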
SLIDE 98

DNN Regression

  • For a two-layer MLP:
  • The network weights are adjusted to minimize an output cost function
SLIDE 99

Idea #1: Localization as Regression

SLIDE 100

Simple Recipe for Classification + Localization

  • Step 2: Attach new fully-connected “regression head”

to the network

SLIDE 101

Simple Recipe for Classification + Localization

  • Step 3: Train the regression head only with SGD and

L2 loss

SLIDE 102

Pattern recognition design cycle

Collect data → Select features → Clustering model → Train mixture model → Evaluate clustering. Clustering models covered: K-means algorithm, hierarchical clustering algorithm, Gaussian mixture model.

SLIDE 103

Clustering evaluation

Clustering is hard to evaluate. In most applications, expert judgements are still the key.

SLIDE 104

Data Clustering - Formal Definition

  • Given a set of N unlabeled examples D = x1, x2, ..., xN in

a d-dimensional feature space, D is partitioned into a number of disjoint subsets Dj’s:

  • A partition is denoted by:

and the problem of data clustering is thus formulated as where f(·) is formulated according to a given criterion.
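The formulas did not extract; a standard way to write the definition (notation assumed) is:

\[
D = \bigcup_{j=1}^{K} D_j, \qquad D_j \cap D_{j'} = \emptyset \;\; (j \neq j'), \qquad
\mathcal{C}^* = \operatorname*{arg\,min}_{\mathcal{C}}\, f(\mathcal{C}),
\]

where C = {D_1, ..., D_K} denotes a partition and f(·) is the chosen clustering criterion.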

SLIDE 105

K-means

SLIDE 106

Pros and cons of K-means

Weaknesses:

  • The user needs to specify the value of K.
  • Applicable only when a mean is defined.
  • The algorithm is sensitive to the initial seeds.
  • The algorithm is sensitive to outliers.
  • Outliers are data points that are very far away from other data points; they could be errors in the data recording or special data points with very different values.
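A minimal K-means sketch in Python, which also makes the listed weaknesses concrete (K, the initial seeds, and the squared-Euclidean mean are all explicit choices; the toy data is illustrative):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]      # sensitive to these initial seeds
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)                           # assign each point to nearest center
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                        for k in range(K)])                 # recompute means (outliers pull them)
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=0, size=(50, 2)), rng.normal(loc=5, size=(50, 2))])
labels, centers = kmeans(X, K=2)
```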

SLIDE 107

Hierarchical Clustering

  • Up to now, considered “flat” clustering
  • For some data, hierarchical clustering is more

appropriate than “flat” clustering

  • Hierarchical clustering
SLIDE 108

Hierarchical Clustering:

  • Hierarchical cluster representations: Venn diagram; dendrogram (binary tree).

SLIDE 109

Hierarchical Clustering

  • Algorithms for hierarchical clustering can be divided

into two types:

  • 1. Agglomerative (bottom-up) procedures: start with n singleton clusters and form the hierarchy by merging the most similar clusters.
  • 2. Divisive (top-down) procedures: start with all samples in one cluster and form the hierarchy by splitting the “worst” clusters.

SLIDE 110

Divisive Hierarchical Clustering

  • Any “flat” algorithm which produces a fixed number of clusters can be used (e.g., set c = 2).

SLIDE 111

Agglomerative Hierarchical Clustering

  • Initialize with each example in a singleton cluster; while there is more than 1 cluster:
  • 1. find the 2 nearest clusters
  • 2. merge them
  • Four common ways to measure cluster distance (see the sketch below)
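The four distance measures are covered on the next slides (single, complete, average, and mean linkage); as a sketch, assuming SciPy is available, the whole procedure is:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(30, 2))
Z = linkage(X, method="single")    # "single" = nearest neighbor; also "complete", "average", "centroid"
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
```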
SLIDE 112

Single Linkage or Nearest Neighbor

  • Agglomerative clustering with minimum distance
  • generates minimum spanning tree
  • encourages growth of elongated clusters
  • disadvantage: very sensitive to noise
SLIDE 113

Complete Linkage or Farthest Neighbor

  • Agglomerative clustering with maximum distance
  • Encourages compact clusters
  • Does not work well if elongated clusters present
SLIDE 114

Average and Mean Agglomerative Clustering

  • Agglomerative clustering is more robust under the

average or the mean cluster distance

  • Mean distance is cheaper to compute than the average

distance

  • Unfortunately, there is not much to say about

agglomerative clustering theoretically, but it does work reasonably well in practice

SLIDE 115

Agglomerative vs. Divisive

  • Agglomerative is faster to compute, in general
  • Divisive may be less “blind” to the global structure of the data
SLIDE 116

Mixture Density Model

  • Model data with density model.
  • To generate a sample from distribution p(x|θ)
  • first select j with probability p(cj)
  • then generate x according to probability law p(x|cj, θj)
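The density itself did not extract; the mixture model being described is:

\[
p(x\mid\theta) = \sum_{j=1}^{K} p(c_j)\, p(x\mid c_j, \theta_j), \qquad \sum_{j=1}^{K} p(c_j) = 1.
\]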
SLIDE 117

The General GMM Assumption

SLIDE 118

ML Estimation for Mixture Density

  • Can use Maximum Likelihood estimation for a mixture

density; need to estimate

  • As in the supervised case, form the log-likelihood function

SLIDE 119

Expectation Maximization Algorithm

  • EM is an algorithm for ML parameter estimation when the

data has missing values. It is used when

  • 1. data is incomplete (has missing values)
  • some features are missing for some samples due to data corruption, partial survey responses, etc.
  • This scenario is very useful
  • 2. Suppose data X is complete, but p(X|θ) is hard to optimize. Suppose further that by introducing certain hidden variables U, whose values are missing, it becomes easier to optimize the “complete” likelihood function p(X,U|θ). Then EM is useful.
  • This scenario is useful for mixture density estimation, and is the subject of our lecture today.
  • Notice that after we introduce artificial (hidden) variables U with missing values, case 2 is completely equivalent to case 1.

SLIDE 120

EM: Joint Likelihood

  • Let , and
  • The complete likelihood is
  • If we actually observed Z, the log likelihood ln[p(X,Z|θ)]

would be trivial to maximize with respect to θ and

  • The problem, of course, is that the values of Z are missing,

since we made it up (that is Z is hidden)

SLIDE 121

EM Algorithm

  • EM solution is to iterate
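The iteration formulas did not extract; for a Gaussian mixture (a standard instance, notation assumed) the two steps are:

E-step (responsibilities):
\[
\gamma_{nj} = \frac{\pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}{\sum_{k} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}.
\]

M-step (parameter updates), with N_j = Σ_n γ_nj:
\[
\mu_j = \frac{1}{N_j}\sum_n \gamma_{nj}\, x_n, \qquad
\Sigma_j = \frac{1}{N_j}\sum_n \gamma_{nj}\,(x_n-\mu_j)(x_n-\mu_j)^\top, \qquad
\pi_j = \frac{N_j}{N}.
\]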
SLIDE 122

EM Algorithm and K-means

  • k-means can be derived from EM algorithm
  • Setting mixing parameters equal for all classes,
  • If we let , then
  • so at the E step, for each current mean, we find all points closest

to it and form new clusters

  • at the M step, we compute the new means inside current clusters
slide-123
SLIDE 123
  • C. Long

Lecture 23 May 6, 2018 123