SLIDE 1

Functions and Data Fitting

COMPSCI 371D — Machine Learning

SLIDE 2

Outline

1. Functions
2. Features
3. Polynomial Fitting: Univariate
   - Least Squares Fitting
   - Choosing a Degree
4. Polynomial Fitting: Multivariate
5. Limitations of Polynomials
6. The Curse of Dimensionality

SLIDE 3

Functions

Functions Everywhere

  • SPAM
    A = {all possible emails}, Y = {true, false}
    f : A → Y, with y = f(a) ∈ Y for a ∈ A
  • Virtual Tennis
    A = {all possible video frames} ⊆ R^d, Y = {body configurations} ⊆ R^e
  • Also medical diagnosis, speech recognition, movie recommendation
  • Predictor = Regressor or Classifier

SLIDE 4

Functions

Classic and ML

  • Classic:
    • Design features by hand
    • Design f by hand
  • ML (see the sketch after this list):
    • Define A, Y
    • Collect T_a = {(a_1, y_1), . . . , (a_N, y_N)} ⊂ A × Y
    • Choose F
    • Design λ : {all possible T_a} → F
    • Train: f = λ(T_a)
    • Hopefully, y ≈ f(a) now and forever
  • Technical point: A can be anything, and is too difficult to work with directly.
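A minimal Python sketch of this pipeline (the learner, toy data, and predictor here are stand-ins invented for illustration, not the course's method):

```python
# Hypothetical sketch: the learner lambda maps a training set T_a to a
# predictor f in F. Here F = {constant predictors}; lambda returns the
# predictor that always outputs the majority label seen in T_a.
def lam(T_a):
    labels = [y for (a, y) in T_a]
    return lambda a: max(set(labels), key=labels.count)

# T_a = {(a_1, y_1), ..., (a_N, y_N)}: emails paired with spam labels
T_a = [("win money now", True),
       ("meeting at noon", False),
       ("free prize inside", True)]

f = lam(T_a)             # train: f = lambda(T_a)
print(f("cheap pills"))  # True; hopefully y ≈ f(a) on new emails too
```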

SLIDE 5

Features

Features

  • From A to X ⊆ R^d:
    x = φ(a)
    y = h(x) = h(φ(a)) = f(a)
    h : X ⊆ R^d → Y ⊆ R^e
    H ⊆ {X → Y}
    T = {(x_1, y_1), . . . , (x_N, y_N)} ⊂ X × Y
  • Just numbers! (A hypothetical φ is sketched below.)
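As a concrete illustration of φ for text, one could count occurrences of a fixed vocabulary of words; the vocabulary and names below are made up for the example:

```python
import numpy as np

# Hypothetical feature map phi: an email string a -> x in R^d,
# one coordinate per vocabulary word (here d = 4).
VOCAB = ["free", "money", "meeting", "prize"]

def phi(a: str) -> np.ndarray:
    words = a.lower().split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)

x = phi("free money claim your free prize")
print(x)  # [2. 1. 0. 1.] -- just numbers
```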

SLIDE 6

Features

Features for SPAM

d = 20,000. The map φ is also useful for making d smaller or the features more informative.

SLIDE 7

Features

Fitting and Learning

  • Loss ℓ : Y × Y → R^+, evaluated as ℓ(y, h(x))
  • Empirical Risk (ER): average loss on T (see the sketch after this list)
  • Fitting and Learning:
    • Given T ⊂ X × Y with X ⊆ R^d and a hypothesis space H = {h : X → Y}
    • Fitting: choose h ∈ H to minimize the ER over T
    • Learning: choose h ∈ H to minimize some risk over previously unseen (x, y)
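The ER transcribes directly into code; a sketch with a placeholder hypothesis, loss, and toy data (all invented for illustration):

```python
import numpy as np

def empirical_risk(h, T, loss):
    """L_T(h): average of loss(y_n, h(x_n)) over the training set T."""
    return np.mean([loss(y, h(x)) for (x, y) in T])

# Quadratic loss and a toy hypothesis h(x) = 2x on made-up data
quadratic = lambda y, y_hat: (y - y_hat) ** 2
T = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
print(empirical_risk(lambda x: 2 * x, T, quadratic))  # 0.02
```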

SLIDE 8

Features

Summary

  • Features insulate ML from the vagaries of the original domain
  • The loss function insulates ML from price (cost) considerations
  • The Empirical Risk (ER) averages the loss incurred by h over T
  • ER measures the average performance of h
  • A learner picks an h ∈ H that minimizes some risk
  • Data fitting minimizes the ER and stops there
  • ML wants h to do well tomorrow as well
  • The risk for ML is therefore computed over a bigger set

SLIDE 9

Polynomial Fitting: Univariate

Data Fitting: Univariate Polynomials

h : R → R
h(x) = c_0 + c_1 x + . . . + c_k x^k with c_i ∈ R for i = 0, . . . , k and c_k ≠ 0

  • The definition of the structure of h defines the hypothesis space H
  • T = {(x_1, y_1), . . . , (x_N, y_N)} ⊂ R × R
  • Quadratic loss ℓ(y, ŷ) = (y − ŷ)^2
  • ER: L_T(h) = (1/N) Σ_{n=1}^{N} ℓ(y_n, h(x_n))
  • Choosing h is the same as choosing c = [c_0, . . . , c_k]^T (see the sketch after this list)
  • L_T is a quadratic function of c
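Since h is determined by its coefficient vector c, a hypothesis can be stored and evaluated as just that vector. A NumPy sketch (np.polyval expects coefficients in highest-degree-first order, the reverse of c here):

```python
import numpy as np

c = np.array([0.5, -1.0, 2.0])     # [c_0, c_1, c_2]: h(x) = 0.5 - x + 2x^2

def h(x, c):
    """Evaluate h(x) = c_0 + c_1 x + ... + c_k x^k."""
    return np.polyval(c[::-1], x)  # reversed: polyval wants c_k first

print(h(2.0, c))                   # 0.5 - 2 + 8 = 6.5
```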

SLIDE 10

Polynomial Fitting: Univariate

Rephrasing the Loss

N L_T(h) = \sum_{n=1}^{N} [y_n - h(x_n)]^2
         = \sum_{n=1}^{N} \left\{ y_n - [c_0 + c_1 x_n + \dots + c_k x_n^k] \right\}^2

         = \left\| \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}
              - \begin{bmatrix}
                  1 & x_1 & \dots & x_1^k \\
                  \vdots & \vdots & & \vdots \\
                  1 & x_N & \dots & x_N^k
                \end{bmatrix}
                \begin{bmatrix} c_0 \\ \vdots \\ c_k \end{bmatrix}
           \right\|^2
         = \| b - A c \|^2
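The matrix A above is a Vandermonde matrix, which NumPy builds directly. A numeric sketch with made-up data, checking that N·L_T(h) = ‖b − Ac‖²:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])        # x_1 .. x_N
b = np.array([1.0, 2.0, 5.0, 10.0])       # y_1 .. y_N
k = 2

A = np.vander(x, k + 1, increasing=True)  # rows [1, x_n, ..., x_n^k]
c = np.array([1.0, 0.0, 1.0])             # h(x) = 1 + x^2

print(np.sum((b - A @ c) ** 2))           # N * L_T(h) = 0.0: exact fit
```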

SLIDE 11

Polynomial Fitting: Univariate

Linear System in c

c_0 + c_1 x_n + \dots + c_k x_n^k = y_n \quad \text{for } n = 1, \dots, N

A c = b, \quad\text{where}\quad
A = \begin{bmatrix}
      1 & x_1 & \dots & x_1^k \\
      \vdots & \vdots & & \vdots \\
      1 & x_N & \dots & x_N^k
    \end{bmatrix}
\quad\text{and}\quad
b = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}

  • Where are the unknowns?
  • Why is this linear?

SLIDE 12

Polynomial Fitting: Univariate Least Squares Fitting

Least Squares

A c = b \qquad b \overset{?}{\in} \operatorname{range}(A)

\hat{c} \in \arg\min_c \| A c - b \|^2

Thus, we are minimizing the empirical risk L_T(h) (with the quadratic loss) over the training set (a numeric sketch follows).
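In code, ĉ comes from a least-squares solver such as np.linalg.lstsq. A sketch on synthetic data (the true coefficients and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 20)
b = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * rng.standard_normal(20)  # noisy y_n

A = np.vander(x, 3, increasing=True)            # degree k = 2
c_hat, *_ = np.linalg.lstsq(A, b, rcond=None)   # minimizes ||Ac - b||^2
print(c_hat)                                    # close to [1.0, 2.0, -0.5]
```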

SLIDE 13

Polynomial Fitting: Univariate Choosing a Degree

Choosing a Degree

[Figure: three fits to the same training data, for degrees k = 1, k = 3, and k = 9]

  • Underfitting, overfitting, interpolation (compare the errors in the sketch below)
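A sketch that reproduces the comparison on synthetic data (the "true" function, noise, and sample sizes are invented): the training error keeps falling as k grows, while the error on previously unseen points does not:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(x)                          # made-up "true" function
x_tr = rng.uniform(1, 5, 10)
y_tr = f(x_tr) + 0.1 * rng.standard_normal(10)   # 10 training points
x_te = rng.uniform(1, 5, 100)                    # previously unseen points
y_te = f(x_te) + 0.1 * rng.standard_normal(100)

for k in (1, 3, 9):
    c = np.polyfit(x_tr, y_tr, k)    # least squares; k = 9 interpolates
    err_tr = np.mean((y_tr - np.polyval(c, x_tr)) ** 2)
    err_te = np.mean((y_te - np.polyval(c, x_te)) ** 2)
    print(f"k={k}: train {err_tr:.4f}  unseen {err_te:.4f}")
```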

SLIDE 14

Polynomial Fitting: Multivariate

Data Fitting: Multivariate Polynomials

  • The story is not very different:

h(x) = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_1^2 + c_4 x_1 x_2 + c_5 x_2^2

  • Polynomial of degree 2

A = \begin{bmatrix}
      1 & x_{11} & x_{12} & x_{11}^2 & x_{11} x_{12} & x_{12}^2 \\
      \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
      1 & x_{N1} & x_{N2} & x_{N1}^2 & x_{N1} x_{N2} & x_{N2}^2
    \end{bmatrix}
\quad\text{and}\quad
b = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}

  • The rest is the same (a sketch of the feature-building step follows below)
  • Why are we not done?
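Building the degree-2 feature matrix A by hand for d = 2, as a sketch (for general d and k, a utility like sklearn's PolynomialFeatures automates this):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # rows x_n = (x_n1, x_n2)

def degree2_features(X):
    """Rows [1, x1, x2, x1^2, x1*x2, x2^2], matching A above."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack(
        [np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

print(degree2_features(X))
# [[ 1.  1.  2.  1.  2.  4.]
#  [ 1.  3.  4.  9. 12. 16.]]
```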

SLIDE 15

Limitations of Polynomials

Counting Monomials

  • Monomial of degree k' ≤ k:

    x_1^{k_1} \cdots x_d^{k_d} \quad\text{where}\quad k_1 + \dots + k_d = k'

  • How many are there?

    m(d, k') = \binom{d + k'}{k'}

  • (See an Appendix for a proof; a numeric check follows below)
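Since m(d, k') is a binomial coefficient, it can be evaluated exactly, e.g. with Python's math.comb; the example values here are mine:

```python
from math import comb

def m(d, k):
    """Number of monomials of degree at most k in d variables."""
    return comb(d + k, k)

print(m(1, 9))    # 10 coefficients for a univariate degree-9 polynomial
print(m(2, 2))    # 6, matching the bivariate degree-2 example
print(m(100, 5))  # 96560646: modest d and k already explode
```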

SLIDE 16

Limitations of Polynomials

Asymptotics: Too Many Monomials

m(d, k) = \binom{d+k}{k} = \frac{(d+k)!}{d! \, k!} = \frac{(d+k)(d+k-1) \cdots (d+1)}{k!}

  • k fixed: O(d^k);  d fixed: O(k^d)
  • When k is O(d), look at m(d, d): m(d, d) is O(4^d / \sqrt{d}) (numeric check below)
  • Except when k = 1 or d = 1, growth is polynomial (with a typically large power) or exponential (if k and d grow together)
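A quick numeric sketch of the m(d, d) asymptotics: m(d, d) = C(2d, d), and by Stirling's approximation the ratio to 4^d/√d approaches 1/√π:

```python
from math import comb, sqrt, pi

# m(d, d) = C(2d, d) ~ 4^d / sqrt(pi * d) by Stirling's approximation,
# so the ratio below should settle near 1/sqrt(pi) ≈ 0.5642.
for d in (5, 10, 20, 40):
    print(d, comb(2 * d, d) / (4 ** d / sqrt(d)))
print("limit:", 1 / sqrt(pi))
```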

SLIDE 17

The Curse of Dimensionality

The Curse of Dimensionality

  • A large d is trouble regardless of H
  • We want T to be “representative”
  • “Filling” R^d with N samples:
    X = [0, 1]^2 ⊂ R^2: 10 bins per dimension, 10^2 bins total
    X = [0, 1]^d ⊂ R^d: 10 bins per dimension, 10^d bins total
  • d is often in the hundreds or thousands (SPAM: d ≈ 20,000)
  • There are about 10^80 atoms in the universe
  • We will always have too few data points (see the arithmetic below)
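The bin-counting argument as plain arithmetic, as a sketch (the dataset size N is an arbitrary stand-in):

```python
# 10 bins per dimension means 10^d bins in total, so even one sample
# per bin needs 10^d points; compare ~10^80 atoms in the universe.
for d in (2, 3, 10, 80, 100):
    print(f"d={d}: 10^{d} bins, so 10^{d} samples for one per bin")

N = 10**9   # an arbitrary, generous dataset size
d = 100
print(f"with N=10^9, fraction of bins occupied <= {N / 10.0**d:.1e}")
```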
