Applied Machine Learning: Some basic concepts
Siamak Ravanbakhsh
COMP 551 (Winter 2020)


SLIDE 2

Objectives

  • learning as representation, evaluation and optimization
  • k-nearest neighbours for classification
  • curse of dimensionality
  • manifold hypothesis
  • overfitting & generalization
  • cross validation
  • no free lunch theorem
  • inductive bias

SLIDE 3

A useful perspective on ML

from: Domingos, Pedro M. "A few useful things to know about machine learning." Commun. ACM 55.10 (2012): 78-87.

Learning = Representation + Evaluation + Optimization

Let's focus on classification.

  • Representation (the model, or hypothesis space): the space of functions to choose from; it is determined by how we represent/define the learner.
  • Evaluation (the objective / cost / loss / score function): the criterion for picking the best model.
  • Optimization: the procedure for finding the best model.

SLIDE 8

Digits dataset

input: x^(n) ∈ {0, …, 255}^(28×28)
label: y^(n) ∈ {0, …, 9}

n ∈ {1, …, N} indexes the training instance (sometimes we drop the superscript (n)); 28×28 is the size of the input image in pixels.

vectorization: x → vec(x) ∈ R^784, pretending intensities are real numbers; the input dimension is D = 784.

note: this ignores the spatial arrangement of pixels, but good enough for now.

image: https://medium.com/@rajatjain0807/machine-learning-6ecde3bfd2f4
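The vectorization step above can be sketched with NumPy; a minimal sketch, where the 28×28 array is random stand-in data rather than an actual MNIST digit:

```python
import numpy as np

# stand-in for one 28x28 grayscale digit with intensities in {0, ..., 255}
rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(28, 28))

# vectorization: x -> vec(x) in R^784, treating intensities as real numbers
vec_x = x.astype(np.float64).reshape(-1)

print(vec_x.shape)  # (784,)
```

Reshaping is lossless (the original image can be recovered with `vec_x.reshape(28, 28)`), but as the slide notes, downstream methods that treat the 784 entries as an unordered vector discard the spatial arrangement.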

SLIDE 12

Nearest neighbour classifier

training: do nothing
test: predict the label of a new test instance by finding the closest image in the training set and copying its label

this needs a measure of distance, e.g., the Euclidean distance:

∣∣x − x′∣∣₂ = √( ∑_{d=1}^D (x_d − x′_d)² )

the Voronoi diagram shows the decision boundaries (in this example D=2; we can't visualize D=784). the test instance shown will be classified as 6.
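The nearest-neighbour rule under the Euclidean distance can be sketched as follows; the 2-D points and labels are made-up toy data for illustration:

```python
import numpy as np

def nearest_neighbour_predict(X_train, y_train, x_new):
    # Euclidean distance from x_new to every training instance
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # training did nothing; at test time, copy the label of the closest instance
    return y_train[np.argmin(dists)]

# toy training set: D=2, two classes (labelled 0 and 6)
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 6, 6])

print(nearest_neighbour_predict(X_train, y_train, np.array([5.5, 4.0])))  # 6
```

All work happens at prediction time: the "model" is simply the stored training set plus the distance function.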

SLIDE 16

The Voronoi diagram

each colour shows all points closer to the corresponding training instance than to any other instance (images from wiki).

Euclidean distance: ∣∣x − x′∣∣₂ = √( ∑_{d=1}^D (x_d − x′_d)² )

Manhattan distance: ∣∣x − x′∣∣₁ = ∑_{d=1}^D ∣x_d − x′_d∣
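The two distances can be compared directly on a small example (a sketch; the points are arbitrary):

```python
import numpy as np

def euclidean(x, xp):
    # ||x - x'||_2 = sqrt( sum_d (x_d - x'_d)^2 )
    return np.sqrt(((x - xp) ** 2).sum())

def manhattan(x, xp):
    # ||x - x'||_1 = sum_d |x_d - x'_d|
    return np.abs(x - xp).sum()

x  = np.array([0.0, 0.0])
xp = np.array([3.0, 4.0])
print(euclidean(x, xp))  # 5.0
print(manhattan(x, xp))  # 7.0
```

Swapping the metric changes which training instance is "closest", and hence the shape of the Voronoi cells and the decision boundaries.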

SLIDE 18

K-nearest neighbours

training: do nothing
test: predict the label by finding the K closest instances and estimating the probability of class c from the K-nearest neighbours:

p(y = c ∣ x^new) = (1/K) ∑_{x′ ∈ KNN(x^new)} I(y′ = c)

example, K = 9: six of the nine closest instances to the new test instance are 6s, so p(y = 6 ∣ x^new) = 6/9.

example with C=3, D=2, K=10 (figure): training data, prob. of class 1, prob. of class 2.
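The class-probability formula above can be sketched in a few lines; the training points, the query, and the function name are toy choices for illustration:

```python
import numpy as np

def knn_class_probs(X_train, y_train, x_new, K, num_classes):
    # find the K closest training instances under the Euclidean distance
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    knn_idx = np.argsort(dists)[:K]
    # p(y = c | x_new) = (1/K) * sum over the KNN of I(y' = c)
    counts = np.bincount(y_train[knn_idx], minlength=num_classes)
    return counts / K

# toy data: C=2 classes in D=2
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])

probs = knn_class_probs(X_train, y_train, np.array([4.5, 4.5]), K=4, num_classes=2)
print(probs)  # [0.25 0.75]
```

With K=4, three of the four nearest neighbours of the query carry label 1, so the estimate is p(y=1 ∣ x^new) = 3/4; the predicted class is the argmax of these probabilities.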

SLIDE 22

K-nearest neighbours

  • a lazy learner: no training phase; estimation happens locally when a query comes in, which is useful for fast-changing datasets
  • a non-parametric method (the name is a misnomer): the number of model parameters grows with the data
slide-23
SLIDE 23

high dimensions are unintuitive! assuming a uniform distribution

Curse of dimensionality Curse of dimensionality

x ∈ [0, 1]D

5 . 1

slide-24
SLIDE 24

high dimensions are unintuitive! assuming a uniform distribution

Curse of dimensionality Curse of dimensionality

x ∈ [0, 1]D

N (total #training instances) grows expoentially with D (dimensions) suppose we want to maintain #samples per sub-cube of side 1/3

need exponentially more instances for K-NN

5 . 1

slide-25
SLIDE 25

high dimensions are unintuitive! assuming a uniform distribution need exponentially more instances for K-NN

Curse of dimensionality Curse of dimensionality

x ∈ [0, 1]D

Another way to see this

s s

fraction of data in the neighbourhood

5 . 2

slide-26
SLIDE 26

Curse of dimensionality Curse of dimensionality

high dimensions are unintuitive! assuming a uniform distribution need exponentially more instances for K-NN all instances have similar distances x ∈ [0, 1]D

5 . 3

slide-27
SLIDE 27

Curse of dimensionality Curse of dimensionality

D = 3

(2r)D

DΓ(D/2) 2r π

D D/2

high dimensions are unintuitive! assuming a uniform distribution need exponentially more instances for K-NN all instances have similar distances x ∈ [0, 1]D

5 . 3

slide-28
SLIDE 28

Curse of dimensionality Curse of dimensionality

D = 3

(2r)D

DΓ(D/2) 2r π

D D/2

lim

=

D→∞ volum( ) volum( )

most of the volume is close to the corners most pairwise disstances are similar

high dimensions are unintuitive! assuming a uniform distribution need exponentially more instances for K-NN all instances have similar distances x ∈ [0, 1]D

5 . 3

slide-29
SLIDE 29

a "conceptual" visualization of the same example

# corners and the mass in the corners grows quickly

image: Zaki's book on Data Mining and Analysis

Curse of dimensionality Curse of dimensionality

high dimensions are unintuitive! assuming a uniform distribution need exponentially more instances for K-NN all instances have similar distances x ∈ [0, 1]D

5 . 4
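The vanishing sphere-to-cube volume ratio can be checked numerically from the closed-form volume of a D-ball (a sketch using the formula on the slide; `ball_to_cube_volume_ratio` is an illustrative helper name):

```python
from math import gamma, pi

def ball_to_cube_volume_ratio(D, r=0.5):
    # volume of a D-ball of radius r: 2 * pi^(D/2) * r^D / (D * Gamma(D/2))
    ball = 2 * pi ** (D / 2) * r ** D / (D * gamma(D / 2))
    cube = (2 * r) ** D  # enclosing cube of side 2r
    return ball / cube

for D in (2, 3, 10, 100):
    print(D, ball_to_cube_volume_ratio(D))
```

The ratio falls from π/4 ≈ 0.785 at D=2 to roughly 10⁻⁷⁰ at D=100: a uniformly drawn point almost surely lies outside the inscribed ball, i.e., near the corners.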

SLIDE 30

Manifold hypothesis

real-world data is often far from uniform. the manifold hypothesis: real data lies close to the surface of a low-dimensional manifold.

example: ambient (data) dimension D = 3, manifold dimension 2.

MNIST digit classification results: D = 784 is the number of pixels, but for K-NN it is the manifold dimension that matters, so K-NN can be competitive.

SLIDE 33

Model selection

K is a hyper-parameter: a model parameter that is not learned by the algorithm.

example (figure): training data, and the most likely class under K=1 and K=5.

SLIDE 36

Overfitting

how to pick the best K?

first attempt: pick the K that gives the "best results" on the training set, e.g., the misclassification error

∑_n I( argmax_y p(y ∣ x^(n)) ≠ y^(n) )

bad idea! we can overfit the training data and have bad performance on new instances (example figure: error as a function of K).

SLIDE 40

Generalization

what we care about is generalization.

expected loss: the performance of the algorithm on unseen data. how to estimate this?

validation set: a subset of the available data not used for training; the performance on the validation set approximates the expected error.

k-fold cross validation (CV): partition the data into k folds; use k−1 folds for training and 1 for validation; average the validation error over all folds.

leave-one-out CV: the extreme case of k = N.
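The k-fold procedure can be sketched around the KNN rule from earlier slides; a minimal sketch, with made-up, well-separated toy blobs (so the reported error should be low), not a tuned implementation:

```python
import numpy as np

def k_fold_cv_error(X, y, K_neighbours, k_folds=5, seed=0):
    # average validation misclassification error over k folds
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k_folds)
    errors = []
    for i in range(k_folds):
        val = folds[i]                                    # 1 fold for validation
        train = np.concatenate(folds[:i] + folds[i + 1:])  # k-1 folds for training
        wrong = 0
        for j in val:
            # KNN prediction: majority class among the K closest training points
            d = np.sqrt(((X[train] - X[j]) ** 2).sum(axis=1))
            nn = train[np.argsort(d)[:K_neighbours]]
            pred = np.bincount(y[nn]).argmax()
            wrong += pred != y[j]
        errors.append(wrong / len(val))
    return float(np.mean(errors))

# toy data: two well-separated Gaussian blobs, 20 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(k_fold_cv_error(X, y, K_neighbours=3))
```

To pick the hyper-parameter K, one would call this for each candidate K and keep the value with the lowest cross-validated error, then measure final performance on a held-out test set.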

SLIDE 45

Train-validation-test split

we often use a 3-way split of the data (e.g., an 80%-10%-10% split):

  • training set: to train the model
  • validation set (aka development set): for hyper-parameter tuning
  • test set: for final evaluation

we can use k-fold cross validation with the train+validation set.
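The 3-way split can be sketched with a shuffled index array; `train_val_test_split` is an illustrative helper name, not a library function:

```python
import numpy as np

def train_val_test_split(N, frac=(0.8, 0.1, 0.1), seed=0):
    # shuffle the indices, then cut into 80% / 10% / 10% pieces
    idx = np.random.default_rng(seed).permutation(N)
    n_train = int(frac[0] * N)
    n_val = int(frac[1] * N)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = train_val_test_split(100)
print(len(train), len(val), len(test))  # 80 10 10
```

Shuffling before cutting matters: if the data are ordered by class, a contiguous split would put entire classes into a single partition.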

SLIDE 46

No free lunch

there is no single algorithm that performs well on all classes of problems.

consider any two binary classifiers (A and B): if the labels are produced by a random (binary) function, A and B have the same average performance (test accuracy) over all possible problems.

image: https://community.alteryx.com/t5/Data-Science-Blog/There-is-No-Free-Lunch-in-Data-Science/ba-p/347402

SLIDE 48

Inductive Bias

how is learning possible at all? because the world is not random; there are regularities, so induction is possible!

ML algorithms need to make assumptions about the problem: this is their inductive bias.

the strength and correctness of these assumptions are important for good performance (related to the bias-variance trade-off that we will discuss later).

examples:

  • the manifold hypothesis in K-NN (and many other methods)
  • close-to-linear dependencies in linear regression
  • conditional independence and causal structure in probabilistic graphical models

SLIDE 53

Summary

  • ML algorithms involve a choice of model, objective and optimization
  • we saw the K-NN method for classification
  • curse of dimensionality: exponentially more data is needed in higher dimensions; the manifold hypothesis to the rescue!
  • what we care about is the generalization of ML algorithms, estimated using cross validation
  • there ain't no such thing as a free lunch
  • the choice of inductive bias is important for good generalization