Introduction to Machine Learning, Lecture 1: Introduction and Linear Regression (PowerPoint PPT presentation)



SLIDE 1

University College London

Introduction to Machine Learning

Iasonas Kokkinos

Iasonas.kokkinos@gmail.com

Lecture 1: Introduction and Linear Regression

SLIDE 2

Lecture outline

  • Introduction to the course
  • Introduction to Machine Learning
  • Least squares

SLIDE 3

Machine Learning

Principles, methods, and algorithms for learning and prediction based on past evidence.
Goal: machines that perform a task based on experience, instead of explicitly coded instructions.
Why?

  • Crucial component of every intelligent/autonomous system
  • Important for a system’s adaptability
  • Important for a system’s generalization capabilities
  • Attempt to understand human learning

SLIDE 4

Machine Learning variants

  • Supervised

– Classification – Regression

  • Unsupervised

– Clustering – Dimensionality Reduction

  • Weakly supervised/semi-supervised

Some data supervised, some unsupervised

  • Reinforcement learning

Supervision: sparse reward for a sequence of decisions

SLIDE 5

Classification

  • Based on our experience, should we give a loan to this customer?

– Binary decision: yes/no

Decision boundary

SLIDE 6

Classification examples

  • Digit Recognition
  • Spam Detection
  • Face detection

SLIDE 7

Decision boundary

Face

Background

`Faceness function’: classifier

SLIDE 8

  • Scan window over image

– Multiple scales – Multiple orientations

  • Classify window as either:

– Face
– Non-face

(Diagram: image window → classifier → face / non-face)

Test time: deploy the learned function

SLIDE 9

Machine Learning variants

  • Supervised

– Classification – Regression

  • Unsupervised

– Clustering – Dimensionality Reduction

  • Weakly supervised

Some data supervised, some unsupervised

  • Reinforcement learning

Supervision: reward for a sequence of decisions

SLIDE 10

Regression

  • Output: Continuous

– E.g. price of a car based on years, mileage, condition,…

SLIDE 11

Computer vision example

  • Human pose estimation: from image to vector-valued pose estimate

SLIDE 12

Machine Learning variants

  • Supervised

– Classification – Regression

  • Unsupervised

– Clustering – Dimensionality Reduction

  • Weakly supervised

Some data supervised, some unsupervised

  • Reinforcement learning

Supervision: reward for a sequence of decisions

SLIDE 13

Clustering

  • Break a set of data into coherent groups

– Labels are `invented’

SLIDE 14

Clustering examples

  • Spotify recommendations

SLIDE 15

Clustering examples

  • Image segmentation

SLIDE 16

Machine Learning variants

  • Supervised

– Classification – Regression

  • Unsupervised

– Clustering – Dimensionality Reduction

  • Weakly supervised

Some data supervised, some unsupervised

  • Reinforcement learning

Supervision: reward for a sequence of decisions

SLIDE 17

Dimensionality reduction & manifold learning

  • Find a low-dimensional representation of high-dimensional data

– Continuous outputs are `invented’

SLIDE 18

Example of nonlinear manifold: faces

x_1,  x_2,  (x_1 + x_2)/2

Average of two faces is not a face

SLIDE 19

Moving along the learned face manifold

Trajectory along the “male” dimension; trajectory along the “young” dimension. Lample et al., Fader Networks, NIPS 2017.

SLIDE 20

Machine Learning variants

  • Supervised

– Classification – Regression

  • Unsupervised

– Clustering – Dimensionality Reduction

  • Weakly supervised/semi-supervised

Partially supervised

  • Reinforcement learning

Supervision: reward for a sequence of decisions

SLIDE 21

Weakly supervised learning: only part of the supervision signal

Supervision signal: “motorcycle”
Inferred: localization information

SLIDE 22

Weakly supervised learning: only part of the supervision signal

Supervision signal: “motorcycle”
Inferred: localization information

SLIDE 23

Semi-supervised learning: only part of the data labelled

Labelled data
Labelled + unlabelled data

SLIDE 24

Machine Learning variants

  • Supervised

– Classification – Regression

  • Unsupervised

– Clustering – Dimensionality Reduction

  • Weakly supervised/semi-supervised learning

Some data supervised, some unsupervised

  • Reinforcement learning

Supervision: reward for a sequence of decisions

SLIDE 25

Reinforcement learning

  • Agent interacts with environment repeatedly

– Take actions, based on state
– (Occasionally) receive rewards
– Update state
– Repeat

  • Goal: maximize cumulative reward (see the sketch below)
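
A minimal Python sketch of this interaction loop on a made-up one-dimensional toy environment; the environment, the random policy, and all names here are illustrative assumptions, not from the lecture:

```python
import random

# Toy environment: the state is a position on a line; the reward is sparse
# (only given when the agent reaches position +5).
state = 0
total_reward = 0
for step in range(100):
    action = random.choice([-1, +1])   # take an action (here: a random policy)
    state += action                    # the environment updates the state
    reward = 1 if state == 5 else 0    # (occasionally) receive a reward
    total_reward += reward             # goal: maximize cumulative reward

print(total_reward)
```

A real reinforcement-learning method would replace the random choice with a policy that is improved using the observed rewards.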

SLIDE 26

Reinforcement learning examples

  • Beat human champions in games
  • Robotics

Backgammon (1990s); Go (2015)

SLIDE 27

Focus of first part: supervised learning

  • Supervised

– Classification – Regression

  • Unsupervised

– Clustering – Dimensionality Reduction, Manifold Learning

  • Weakly supervised

Some data supervised, some unsupervised

  • Reinforcement learning

Supervision: reward for a sequence of decisions

SLIDE 28

Classification: yes/no decision

SLIDE 29

Regression: continuous output

SLIDE 30

What we want to learn: a function

  • Input-output mapping

y = f_w(x)

SLIDE 31

What we want to learn: a function

  • Input-output mapping

(Annotations: x = input, f = method, w = parameters, y = prediction)

y = f_w(x)

SLIDE 32

What we want to learn: a function

(Annotations: x = input, f = method, w = parameters, y = prediction)

x ∈ R: calculus
x ∈ R^D: vector calculus
Machine learning: can also work with discrete inputs, strings, trees, graphs, …

y = f_w(x)

SLIDE 33

What we want to learn: a function

(Annotations: x = input, f = method, w = parameters, y = prediction)

Classification: y ∈ {0, 1}
Regression: y ∈ R

y = f_w(x)

SLIDE 34

What we want to learn: a function

Method examples: linear classifiers, neural networks, decision trees, ensemble models, probabilistic classifiers, …

y = f_w(x)

SLIDE 35

Example of method: K-nearest neighbor classifier

(Figure panels: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor)

– Compute distance to other training records
– Identify the K nearest neighbors
– Take a majority vote

(See the sketch below.)
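
A minimal NumPy sketch of the three steps above; the toy dataset and the `knn_predict` helper name are illustrative, not from the slides:

```python
import numpy as np

def knn_predict(x_query, X_train, y_train, k=3):
    """Classify x_query by a majority vote among its k nearest training points."""
    # 1. Compute the distance to every training record
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # 2. Identify the K nearest neighbors
    nearest = np.argsort(dists)[:k]
    # 3. Take a majority vote over their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Illustrative 2D training set with labels 0/1
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(np.array([0.8, 0.9]), X_train, y_train, k=3))  # prints 1
```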

SLIDE 36

Training data for NN classifier (in R^2)

SLIDE 37

1-nn classifier prediction (in R^2)

SLIDE 38

3-nn classifier prediction

SLIDE 39

Method example: decision tree

Machine learning: can also work with discrete inputs, strings, trees, graphs, …

SLIDE 40

Method example: decision tree

SLIDE 41

Method example: decision tree

What is the depth of the decision tree for this problem?

SLIDE 42

Method example: linear classifier

(Axes: feature coordinate i, feature coordinate j)

SLIDE 43

Method example: neural network

SLIDE 44

Method example: neural network

SLIDE 45

Method example: neural network

SLIDE 46

We have two centuries of material to cover!

The first clear and concise exposition of the method of least squares was published by Legendre in 1805. The technique is described as an algebraic procedure for fitting linear equations to data, and Legendre demonstrates the new method by analyzing the same data as Laplace for the shape of the earth. The value of Legendre's method of least squares was immediately recognized by leading astronomers and geodesists of the time. Source: https://en.wikipedia.org/wiki/Least_squares

SLIDE 47

What we want to learn: a function

  • Input-output mapping

(Annotations: x = input, f = method, w = parameters, y = prediction)

Parameters: w ∈ R or w ∈ R^K

y = f_w(x) = f(x; w)

SLIDE 48

Assumption: linear function

y = f_w(x) = f(x; w) = w^T x,   with x ∈ R^D, w ∈ R^D

Inner product:   w^T x = ⟨w, x⟩ = Σ_{d=1}^D w_d x_d
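
A one-line NumPy check of this inner product; the vectors here are illustrative:

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])   # illustrative parameter vector
x = np.array([1.0, 2.0, 3.0])    # illustrative input
y = np.dot(w, x)                 # w^T x = sum over d of w_d * x_d
print(y)                         # 0.5*1 - 1*2 + 2*3 = 4.5
```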

SLIDE 49

Reminder: linear classifier

(Axes: feature coordinate i, feature coordinate j)

Each data point x_t has a class label:  y_t = +1 (positive) or −1 (negative)

Decision rule:
x classified positive if w · x + b ≥ 0
x classified negative if w · x + b < 0

SLIDE 50

(Axes: feature coordinate i, feature coordinate j)

Each data point x_t has a class label:  y_t = +1 (positive) or −1 (negative)

Decision rule:
x classified positive if w · x + b ≥ 0
x classified negative if w · x + b < 0

Question: which one?
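
A minimal sketch of the decision rule above; the weight vector w and the bias b are illustrative choices, i.e. one of the many possible boundaries the slide asks about:

```python
import numpy as np

def linear_classify(x, w, b):
    """Return +1 if w·x + b >= 0, and -1 otherwise (the rule on the slide)."""
    return 1 if np.dot(w, x) + b >= 0 else -1

w = np.array([1.0, -2.0])   # illustrative weight vector
b = 0.5                     # illustrative bias
print(linear_classify(np.array([2.0, 0.0]), w, b))   # +1 (positive side)
print(linear_classify(np.array([0.0, 2.0]), w, b))   # -1 (negative side)
```

Learning amounts to choosing w and b from the labelled training points; the following slides do the analogous thing for regression using least squares.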

SLIDE 51

Linear regression in 1D

SLIDE 52

Linear regression in 1D

Training set: input–output pairs  S = {(x^i, y^i)},  i = 1, …, N,  with x^i ∈ R, y^i ∈ R

SLIDE 53

Linear regression in 1D

y^i = w_0 + w_1 x^i_1 + ε^i
    = w_0 x^i_0 + w_1 x^i_1 + ε^i,   with x^i_0 = 1 for all i
    = w^T x^i + ε^i

SLIDE 54

Sum of squared errors criterion

Model:  y^i = w^T x^i + ε^i

Loss function: sum of squared errors, expressed as a function of the two variables (w_0, w_1):

L(w) = Σ_{i=1}^N (ε^i)^2
L(w_0, w_1) = Σ_{i=1}^N [ y^i − (w_0 x^i_0 + w_1 x^i_1) ]^2

Question: what is the best (or least bad) value of w?
Answer: least squares

SLIDE 55

Calculus 101

(Figure: graph of f(x), with a point x* marked)

SLIDE 56

Calculus 101

(Figure: graph of f(x))   x* = argmax_x f(x)

SLIDE 57

Condition for maximum: derivative is zero

(Figure: graph of f(x))   x* = argmax_x f(x)

SLIDE 58

Condition for maximum: derivative is zero

x* = argmax_x f(x),   f′(x*) = 0

SLIDE 59

Condition for minimum: derivative is zero

x* = argmin_x f(x),   f′(x*) = 0

SLIDE 60

Vector calculus 101

For a 2D function f(x): graph, isocontours f(x) = c, gradient field

Gradient:  ∇f(x) = [ ∂f/∂x_1 , ∂f/∂x_2 ]^T

At a minimum of the function:  ∇f(x) = 0

SLIDE 61

Back to least squares…

Model:  y^i = w^T x^i + ε^i    (superscript i: training sample; subscript d: feature dimension)

Loss function: sum of squared errors, expressed as a function of the two variables (w_0, w_1):

L(w) = Σ_{i=1}^N (ε^i)^2
L(w_0, w_1) = Σ_{i=1}^N [ y^i − (w_0 x^i_0 + w_1 x^i_1) ]^2

Question: what is the best (or least bad) value of w?
Answer: least squares

SLIDE 62

Fitting a line

L(w_0, w_1) = Σ_{i=1}^N [ y^i − (w_0 x^i_0 + w_1 x^i_1) ]^2

∂L(w_0, w_1)/∂w_0 = Σ_{i=1}^N ∂/∂w_0 [ y^i − (w_0 x^i_0 + w_1 x^i_1) ]^2
                  = Σ_{i=1}^N 2 [ y^i − (w_0 x^i_0 + w_1 x^i_1) ] (−x^i_0)
                  = −2 Σ_{i=1}^N ( y^i x^i_0 − w_0 x^i_0 x^i_0 − w_1 x^i_1 x^i_0 )

Setting ∂L(w_0, w_1)/∂w_0 = 0
  ⇔  Σ_{i=1}^N y^i x^i_0 = w_0 Σ_{i=1}^N x^i_0 x^i_0 + w_1 Σ_{i=1}^N x^i_1 x^i_0

SLIDE 63

Fitting a line, continued

∂L(w_0, w_1)/∂w_0 = 0   ⇔   Σ_{i=1}^N y^i x^i_0 = w_0 Σ_{i=1}^N x^i_0 x^i_0 + w_1 Σ_{i=1}^N x^i_1 x^i_0

∂L(w_0, w_1)/∂w_1 = 0   ⇔   Σ_{i=1}^N y^i x^i_1 = w_0 Σ_{i=1}^N x^i_0 x^i_1 + w_1 Σ_{i=1}^N x^i_1 x^i_1

2 linear equations, 2 unknowns

SLIDE 64

Fitting a line, continued

Σ_{i=1}^N y^i x^i_0 = w_0 Σ_{i=1}^N x^i_0 x^i_0 + w_1 Σ_{i=1}^N x^i_1 x^i_0
Σ_{i=1}^N y^i x^i_1 = w_0 Σ_{i=1}^N x^i_0 x^i_1 + w_1 Σ_{i=1}^N x^i_1 x^i_1

As a 2×2 system of equations:

[ Σ_i y^i x^i_0 ]   [ Σ_i x^i_0 x^i_0   Σ_i x^i_0 x^i_1 ] [ w_0 ]
[ Σ_i y^i x^i_1 ] = [ Σ_i x^i_0 x^i_1   Σ_i x^i_1 x^i_1 ] [ w_1 ]

That’s it!

SLIDE 65

Fitting a line, continued

2×2 system of equations:

[ Σ_i y^i x^i_0 ]   [ Σ_i x^i_0 x^i_0   Σ_i x^i_0 x^i_1 ] [ w_0 ]
[ Σ_i y^i x^i_1 ] = [ Σ_i x^i_0 x^i_1   Σ_i x^i_1 x^i_1 ] [ w_1 ]

Or, without summations:   X^T y = X^T X w,   where

y = [ y^1, …, y^N ]^T,   X has rows (x^i_0, x^i_1), i = 1, …, N

Solution:   w = (X^T X)^{-1} X^T y
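
A minimal NumPy sketch of this closed-form solution for the 1D line fit; the synthetic data and every name below are illustrative, not from the slides:

```python
import numpy as np

# Synthetic 1D data, roughly y = 2 + 3x plus noise (illustrative)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(30)

# Design matrix with x_0 = 1 (intercept column) and x_1 = x
X = np.column_stack([np.ones_like(x), x])        # shape (N, 2)

# Normal equations X^T y = X^T X w, i.e. w = (X^T X)^{-1} X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)                                         # approximately [2, 3]
```

Solving the 2×2 normal equations with `np.linalg.solve` yields the same w as forming the explicit inverse, but is numerically preferable.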

SLIDE 66

Linear regression in 1D

SLIDE 67

Linear regression in 2D (or ND)

SLIDE 68

Least squares solution for linear regression

Dimensions:  (N×1) = (N×D)(D×1) + (N×1),   where D is the problem dimension and N the training set size

[ y^1 ]   [ x^1_1 … x^1_D ] [ w_1 ]   [ ε^1 ]
[ y^2 ] = [ x^2_1 … x^2_D ] [ w_2 ] + [ ε^2 ]
[  ⋮  ]   [        ⋮       ] [  ⋮  ]   [  ⋮  ]
[ y^N ]   [ x^N_1 … x^N_D ] [ w_D ]   [ ε^N ]

SLIDE 69

Least squares solution for linear regression

y = Xw + ε

SLIDE 70

Least squares solution for linear regression

Loss function:

L(w) = Σ_{i=1}^N (y^i − w^T x^i)^2 = Σ_{i=1}^N (ε^i)^2

L(w) = [ ε^1  ε^2  …  ε^N ] [ ε^1  ε^2  …  ε^N ]^T

SLIDE 71

Least squares solution for linear regression

Loss function:

L(w) = Σ_{i=1}^N (y^i − w^T x^i)^2 = Σ_{i=1}^N (ε^i)^2
     = [ ε^1  ε^2  …  ε^N ] [ ε^1  ε^2  …  ε^N ]^T
     = ε^T ε

with  y = Xw + ε

SLIDE 72

Generalized linear regression

x → φ(x) = [ φ_1(x), …, φ_M(x) ]^T

SLIDE 73

1D Example: 2nd degree polynomial fitting

φ(x) = [ 1, x, x^2 ]^T,   ⟨w, φ(x)⟩ = w_0 + w_1 x + w_2 x^2

SLIDE 74

1D example: K-th degree polynomial fitting

φ(x) = [ 1, x, …, x^K ]^T,   ⟨w, φ(x)⟩ = w_0 + w_1 x + … + w_K x^K

SLIDE 75

2D example: second-order polynomials

x = (x_1, x_2)

φ(x) = [ 1, x_1, x_2, x_1^2, x_2^2, x_1 x_2 ]^T

⟨w, φ(x)⟩ = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2
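
A small sketch of this second-order feature map; the helper name `phi_2d` and the numbers are illustrative:

```python
import numpy as np

def phi_2d(x):
    """Second-order polynomial features of x = (x1, x2), as on the slide."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

w = np.array([0.5, 1.0, -1.0, 0.2, 0.3, -0.4])   # illustrative parameters w_0 ... w_5
x = np.array([2.0, 3.0])
print(np.dot(w, phi_2d(x)))                      # <w, phi(x)>, still linear in w
```

The prediction is a nonlinear function of x but remains linear in the parameters w, which is why the same least-squares machinery still applies.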

SLIDE 76

Reminder: linear regression

Loss function:  L(w) = Σ_{i=1}^N (y^i − w^T x^i)^2 = Σ_{i=1}^N (ε^i)^2

[ y^1 ]   [ x^1_1 … x^1_D ] [ w_1 ]   [ ε^1 ]
[ y^2 ] = [ x^2_1 … x^2_D ] [ w_2 ] + [ ε^2 ]
[  ⋮  ]   [        ⋮       ] [  ⋮  ]   [  ⋮  ]
[ y^N ]   [ x^N_1 … x^N_D ] [ w_D ]   [ ε^N ]

SLIDE 77

Reminder: linear regression

Loss function:  L(w) = Σ_{i=1}^N (y^i − w^T x^i)^2 = Σ_{i=1}^N (ε^i)^2

[ y^1 ]   [ (x^1)^T ] [ w_1 ]   [ ε^1 ]
[ y^2 ] = [ (x^2)^T ] [ w_2 ] + [ ε^2 ]
[  ⋮  ]   [    ⋮    ] [  ⋮  ]   [  ⋮  ]
[ y^N ]   [ (x^N)^T ] [ w_D ]   [ ε^N ]

SLIDE 78

Generalized linear regression

Feature map:  φ(x) : R^D → R^M

Loss function:  L(w) = Σ_{i=1}^N ( y^i − w^T φ(x^i) )^2 = Σ_{i=1}^N (ε^i)^2

Dimensions:  (N×1) = (N×M)(M×1) + (N×1)

[ y^1 ]   [ φ(x^1)^T ] [ w_1 ]   [ ε^1 ]
[ y^2 ] = [ φ(x^2)^T ] [ w_2 ] + [ ε^2 ]
[  ⋮  ]   [     ⋮    ] [  ⋮  ]   [  ⋮  ]
[ y^N ]   [ φ(x^N)^T ] [ w_M ]   [ ε^N ]

SLIDE 79

Least squares solution for linear regression

Minimize:  L(w) = ε^T ε,   with  y = Xw + ε,   X = [ (x^1)^T ; (x^2)^T ; … ; (x^N)^T ]

w* = (X^T X)^{-1} X^T y

SLIDE 80

Least squares solution for generalized linear regression

Minimize:  L(w) = ε^T ε,   with  y = Φw + ε,   Φ = [ φ(x^1)^T ; φ(x^2)^T ; … ; φ(x^N)^T ]

w* = (Φ^T Φ)^{-1} Φ^T y
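
A minimal sketch combining a feature map with this closed-form solution, on a 1D quadratic example; the data, the polynomial degree, and all names are illustrative assumptions:

```python
import numpy as np

def phi(x, K=2):
    """Polynomial feature map phi(x) = [1, x, ..., x^K] for a scalar x."""
    return np.array([x**k for k in range(K + 1)])

# Synthetic data from y = 1 - 2x + 0.5x^2 plus noise (illustrative)
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=50)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.1 * rng.standard_normal(50)

Phi = np.vstack([phi(xi) for xi in x])            # N x M matrix with rows phi(x^i)^T
w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)  # w* = (Phi^T Phi)^{-1} Phi^T y
print(w_star)                                     # approximately [1, -2, 0.5]
```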

SLIDE 81

2D example: second-order polynomials

x = (x_1, x_2)

φ(x) = [ 1, x_1, x_2, x_1^2, x_2^2, x_1 x_2 ]^T

⟨w, φ(x)⟩ = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2

SLIDE 82

5D Example: fourth-order polynomials in 5D

x = (x_1, …, x_5)

φ(x) = [ 1, x_1, …, x_5, …, (x_1 x_2 x_3 x_4 x_5)^4 ]^T

15625 dimensions => 15625 parameters

SLIDE 83

What was happening before: approximations

Training set:  S = {(x^i, y^i)},  i = 1, …, N

If N > D (e.g. 30 points, 2 dimensions) we have more equations than unknowns: an overdetermined system! The input–output relations can only hold approximately:

y^1 ≈ w_0 x^1_0 + w_1 x^1_1 + … + w_D x^1_D
y^2 ≈ w_0 x^2_0 + w_1 x^2_1 + … + w_D x^2_D
⋮
y^N ≈ w_0 x^N_0 + w_1 x^N_1 + … + w_D x^N_D

SLIDE 84

What is happening now: overfitting

Training set:  S = {(x^i, y^i)},  i = 1, …, N

If N < D (e.g. 30 points, 15625 dimensions) we have more unknowns than equations: an underdetermined system! The input–output equations hold exactly, but we are simply memorizing the data:

y^1 = w_0 x^1_0 + w_1 x^1_1 + … + w_D x^1_D
y^2 = w_0 x^2_0 + w_1 x^2_1 + … + w_D x^2_D
⋮
y^N = w_0 x^N_0 + w_1 x^N_1 + … + w_D x^N_D

SLIDE 85

Overfitting, in images

(Figure: classification and regression examples of model fits, with one fit labelled “just right”)

SLIDE 86

Tuning the model’s complexity

A flexible model approximates the target function well on the training set, but can “overtrain” and perform poorly on the test set (“variance”). A rigid model's performance is more predictable on the test set, but the model may not be good even on the training set (“bias”).

SLIDE 87

Regularization: keeping it simple

In high dimensions there are too many solutions to the same problem. Regularization: prefer the least complex among them. How? Penalize complexity.

SLIDE 88

How to control complexity?

Observation: the problem started with high-dimensional embeddings.
Guess: the number of dimensions relates to “complexity”. But what if we force the classifier not to use all of the parameters? (Week 4: we will guess again!)
Intuition: with many parameters, we can fit anything.
Idea: penalize the use of large parameter values.
How do we measure “large”? How do we enforce small values?

SLIDE 89

How do we measure “large”?

Method parameters: a D-dimensional vector  w = [w_1, w_2, …, w_D]

“Large” vector: measured by a vector norm.

L2 (“Euclidean”) norm:   ||w||_2 = sqrt( Σ_{d=1}^D w_d^2 ) = sqrt( ⟨w, w⟩ )

L1 (“Manhattan”) norm:   ||w||_1 = Σ_{d=1}^D |w_d|

Lp norm, p > 1:          ||w||_p = ( Σ_{d=1}^D |w_d|^p )^{1/p}
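
A quick NumPy check of these definitions; the example vector is illustrative:

```python
import numpy as np

w = np.array([3.0, -4.0, 1.0])              # illustrative parameter vector

l2 = np.sqrt(np.sum(w**2))                  # Euclidean (L2) norm
l1 = np.sum(np.abs(w))                      # Manhattan (L1) norm
p = 3
lp = np.sum(np.abs(w)**p) ** (1.0 / p)      # general Lp norm

print(l2, np.linalg.norm(w, 2))             # both about 5.099
print(l1, np.linalg.norm(w, 1))             # both 8.0
print(lp, np.linalg.norm(w, 3))             # both about 4.51
```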

SLIDE 90

Regularized linear regression

Linear regression: minimize the model error (“data fidelity”)

L(w) = ε^T ε,   with residual vector  ε = y − Φw

Complexity term (regularizer):  R(w) = ||w||_2^2 = w^T w

New objective:  L(w) = ε^T ε + λ w^T w      (data fidelity + λ · complexity)

λ: a scalar that remains to be determined; the minimum over w also remains to be determined.

SLIDE 91

Least squares solution

L(w) = ε^T ε = (y − Xw)^T (y − Xw) = y^T y − 2 y^T X w + w^T X^T X w

Condition for minimum:  ∇L(w*) = 0
  ⇔  −2 X^T y + 2 X^T X w* = 0
  ⇔  w* = (X^T X)^{-1} X^T y

SLIDE 92

Ridge regression: L2-regularized linear regression

L(w) = ε^T ε + λ w^T w
     = y^T y − 2 y^T X w + w^T X^T X w + λ w^T I w      (first three terms as before, for linear regression; I is the identity matrix)
     = y^T y − 2 y^T X w + w^T (X^T X + λ I) w

Condition for minimum:  ∇L(w*) = 0
  ⇔  −2 X^T y + 2 (X^T X + λ I) w* = 0
  ⇔  w* = (X^T X + λ I)^{-1} X^T y
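
A minimal sketch of this ridge estimator; the data and the values of λ are illustrative:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """w* = (X^T X + lam * I)^{-1} X^T y, the L2-regularized least-squares solution."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Illustrative data: 20 points in 5 dimensions
rng = np.random.default_rng(2)
X = rng.standard_normal((20, 5))
w_true = np.array([1.0, 0.0, -2.0, 0.5, 0.0])
y = X @ w_true + 0.1 * rng.standard_normal(20)

print(ridge_fit(X, y, lam=0.0))   # lambda = 0 recovers ordinary least squares
print(ridge_fit(X, y, lam=1.0))   # larger lambda shrinks the weights towards zero
```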

SLIDE 93

Ridge regression, continued

Regularizer:  R(w) = ||w||_2^2 = w^T w

New objective:  L(w) = ε^T ε + λ w^T w      (data fidelity + λ · complexity)

λ: a scalar “hyperparameter” that remains to be determined; we just determined the minimum over w.

Note: direct minimization of the objective with respect to λ would lead to λ = 0.

SLIDE 94

Bias-Variance tradeoff as a function of λ

(Figure: bias-variance tradeoff as a function of λ, with a “sweet spot” in between)

SLIDE 95

Selecting λ with cross-validation

  • Cross-validation technique
– Exclude part of the training data from parameter estimation
– Use it only to predict the test error
  • K-fold cross-validation: K splits, average the K errors
  • Use cross-validation for different values of the λ parameter
– Pick the value that minimizes the cross-validation error

Least glorious, most effective of all methods. (A minimal sketch follows below.)
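
A minimal sketch of K-fold cross-validation for choosing λ, reusing a ridge estimator of the kind shown earlier; the fold count, the candidate λ grid, the data, and all helper names are illustrative assumptions, not part of the slides:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """w* = (X^T X + lam * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, K=5):
    """Average held-out squared error over K folds for a given lambda."""
    N = len(y)
    folds = np.array_split(np.arange(N), K)
    errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(N), val_idx)
        w = ridge_fit(X[train_idx], y[train_idx], lam)   # estimate w without the held-out fold
        errors.append(np.mean((y[val_idx] - X[val_idx] @ w) ** 2))
    return np.mean(errors)

# Illustrative data and candidate values of lambda
rng = np.random.default_rng(3)
X = rng.standard_normal((60, 10))
y = X @ rng.standard_normal(10) + 0.5 * rng.standard_normal(60)

lambdas = [0.01, 0.1, 1.0, 10.0]
best_lam = min(lambdas, key=lambda lam: cv_error(X, y, lam))
print(best_lam)   # the value with the smallest cross-validation error
```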