slide-1
SLIDE 1 Review of Lecture 17

• Occam's Razor
  The simplest model that fits the data is also the most plausible.
  complexity of h ← complexity of H
  unlikely event ← significant if it happens

• Sampling bias
  [Figure: input distribution P(x) over x, with the training and testing regions in different parts of the axis]

• Data snooping
  [Figure: Day vs. Cumulative Profit %, comparing "no snooping" and "snooping"]
slide-2
SLIDE 2 Learning From Data

Yaser S. Abu-Mostafa
California Institute of Technology

Lecture 18: Epilogue

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, May 31, 2012
slide-3
SLIDE 3 Outline

• The map of machine learning
• Bayesian learning
• Aggregation methods
• Acknowledgments

slide-4
SLIDE 4 It's a jungle out there

[Word cloud of course topics: stochastic gradient descent, nonlinear transformation, overfitting, data snooping, Occam's razor, perceptrons, data contamination, error measures, cross validation, linear models, types of learning, kernel methods, logistic regression, training versus testing, VC dimension, linear regression, deterministic noise, noisy targets, bias-variance tradeoff, RBF, SVM, weight decay, regularization, soft-order constraint, sampling bias, neural networks, exploration versus exploitation, weak learners, Gaussian processes, active learning, graphical models, decision trees, ensemble learning, Bayesian prior, collaborative filtering, clustering, hidden Markov models, distribution-free, ordinal regression, Boltzmann machines, no free lunch, mixture of experts, Q learning, learning curves, semi-supervised learning, is learning feasible?]

slide-5
SLIDE 5 The map

[Diagram of the field:
THEORY: VC, bias-variance, complexity, bayesian
PARADIGMS: supervised, reinforcement, unsupervised, online, active
TECHNIQUES: models (linear, neural networks, SVM, nearest neighbors, RBF, gaussian processes, SVD, graphical models); methods (regularization, validation, aggregation, input processing)]

slide-6
SLIDE 6 Outline

• The map of machine learning
• Bayesian learning
• Aggregation methods
• Acknowledgments

slide-7
SLIDE 7 Probabilistic approach

[Learning diagram: unknown target distribution P(y | x) on X → Y (target function f plus noise), unknown input distribution P(x), data set D = (x1, y1), . . . , (xN, yN), hypothesis set H, learning algorithm A, final hypothesis g ≈ f]

Extend the probabilistic role to all components

P(D | h = f) decides which h (likelihood)

How about P(h = f | D)?

slide-8
SLIDE 8 The prior

P(h = f | D) requires an additional probability distribution:

P(h = f | D) = P(D | h = f) P(h = f) / P(D) ∝ P(D | h = f) P(h = f)

P(h = f) is the prior

P(h = f | D) is the posterior

Given the prior, we have the full distribution
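To make this concrete, here is a minimal sketch of Bayes' rule over a discrete hypothesis set; the three hypotheses, the uniform prior, and the likelihood values are hypothetical numbers chosen for illustration, not from the lecture.

```python
import numpy as np

def posterior(prior, likelihood):
    """P(h = f | D) ∝ P(D | h = f) P(h = f), normalized to sum to 1."""
    unnormalized = likelihood * prior
    return unnormalized / unnormalized.sum()  # the sum plays the role of P(D)

# Hypothetical example: three candidate hypotheses with a uniform prior
# P(h = f), and assumed likelihoods P(D | h = f) of the data under each h.
prior = np.array([1 / 3, 1 / 3, 1 / 3])
likelihood = np.array([0.10, 0.40, 0.05])
print(posterior(prior, likelihood))  # approximately [0.18, 0.73, 0.09]
```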

slide-9
SLIDE 9 Example of a prior

Consider a perceptron: h is determined by w = w0, w1, · · · , wd

A possible prior on w: each wi is independent, uniform over [−1, 1]

This determines the prior over h: P(h = f)

Given D, we can compute P(D | h = f)

Putting them together, we get P(h = f | D) ∝ P(h = f) P(D | h = f)
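As a numerical sketch of this slide, one can sample weight vectors from the uniform prior and weight each sample by its likelihood. The data set and the noisy-label likelihood model (each label agrees with sign(w·x) with probability 0.9) are assumptions added for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 2, 20  # input dimension and data set size

# Hypothetical data set D generated by a hidden target perceptron.
X = np.hstack([np.ones((N, 1)), rng.uniform(-1, 1, (N, d))])  # x0 = 1
w_target = rng.uniform(-1, 1, d + 1)
y = np.sign(X @ w_target)

# Sample candidate hypotheses from the prior: each w_i uniform over [-1, 1].
W = rng.uniform(-1, 1, (10000, d + 1))  # one w = (w0, w1, ..., wd) per row

# Assumed likelihood P(D | h = f): each label matches sign(w·x) w.p. 0.9.
agree = np.sign(X @ W.T) == y[:, None]  # N x samples agreement matrix
k = agree.sum(axis=0)
likelihood = 0.9 ** k * 0.1 ** (N - k)

# Posterior weights: P(h = f | D) ∝ P(h = f) P(D | h = f), prior uniform.
post = likelihood / likelihood.sum()
print("posterior-mean w:", post @ W)  # compare with w_target
```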

slide-10
SLIDE 10 A prior is an assumption

Even the most neutral prior:

x is unknown → x is random
[Figure: uniform P(x) over [−1, 1]]

The true equivalent would be:

x is unknown → x is random
[Figure: a delta function δ(x − a) concentrated at the unknown value a]

slide-11
SLIDE 11 If we knew the prior . . .

we could compute P(h = f | D) for every h ∈ H

⇒ we can find the most probable h given the data
   we can derive E(h(x)) for every x
   we can derive the error bar for every x
   we can derive everything in a principled way
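A minimal sketch of what this buys in practice, assuming a sampled hypothesis set W with posterior weights post as in the sketch on slide 9; the three hypotheses and their posterior below are hypothetical numbers.

```python
import numpy as np

def predictive(x, W, post):
    """Posterior mean E(h(x)) and an error bar (one std) at input x."""
    h_x = np.sign(W @ x)                     # each hypothesis's output at x
    mean = post @ h_x                        # E(h(x)) under P(h = f | D)
    std = np.sqrt(post @ (h_x - mean) ** 2)  # spread of h(x) over the posterior
    return mean, std

# Tiny self-contained example: three hypotheses, posterior [0.2, 0.7, 0.1].
W = np.array([[0.5, 1.0], [0.2, -1.0], [-0.9, 0.3]])
post = np.array([0.2, 0.7, 0.1])
print(predictive(np.array([1.0, 0.4]), W, post))  # (-0.6, 0.8)
```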

slide-12
SLIDE 12 When is Bayesian learning justified?

1. The prior is valid
   trumps all other methods

2. The prior is irrelevant
   just a computational catalyst

slide-13
SLIDE 13 Outline

• The map of machine learning
• Bayesian learning
• Aggregation methods
• Acknowledgments

slide-14
SLIDE 14 What is aggregation?

Combining different solutions h1, h2, · · · , hT that were trained on D:

[Figure: several trained solutions feeding one combined output]

Regression: take an average
Classification: take a vote

a.k.a. ensemble learning and boosting
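A minimal sketch of the two combination rules, assuming the trained solutions h1, . . . , hT are available as functions; the toy hypotheses below are hypothetical.

```python
import numpy as np

def aggregate_regression(hs, x):
    """Regression: take an average of the T hypotheses' outputs."""
    return np.mean([h(x) for h in hs])

def aggregate_classification(hs, x):
    """Classification (labels in {-1, +1}): take a majority vote."""
    return np.sign(sum(h(x) for h in hs))

hs = [lambda x: x + 0.1, lambda x: x - 0.3, lambda x: 1.1 * x]
print(aggregate_regression(hs, 2.0))  # average of 2.1, 1.7, 2.2
print(aggregate_classification([lambda x: 1, lambda x: -1, lambda x: 1], 0.0))
```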

slide-15
SLIDE 15 Different from 2-layer learning

In a 2-layer model, all units learn jointly:

[Figure: training data → one learning algorithm training all units together]

In aggregation, they learn independently, then get combined:

[Figure: training data → independent learning algorithms whose outputs are combined afterward]

slide-16
SLIDE 16 Two types of aggregation

1. After the fact: combines existing solutions
   Example. Netflix teams merging → blending

2. Before the fact: creates solutions to be combined
   Example. Bagging: resampling D

[Figure: resampled training data sets, each fed to a learning algorithm]
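A minimal sketch of the "before the fact" route via bagging: bootstrap-resample D, train one solution per resample, and aggregate. The least-squares base learner and the data set are hypothetical stand-ins; for a linear model, averaging the weight vectors is the same as averaging the hypotheses themselves (for nonlinear models one would average outputs instead).

```python
import numpy as np

rng = np.random.default_rng(0)

def bagging(X, y, T=25):
    """Train T solutions on bootstrap resamples of D; aggregate by averaging."""
    hypotheses = []
    N = len(y)
    for _ in range(T):
        idx = rng.integers(0, N, N)  # resample D with replacement
        w = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        hypotheses.append(w)
    return np.mean(hypotheses, axis=0)  # averaged regression solution

# Hypothetical noisy linear data set D.
X = np.hstack([np.ones((50, 1)), rng.uniform(-1, 1, (50, 1))])
y = 2.0 * X[:, 1] + 0.3 * rng.standard_normal(50)
print(bagging(X, y))  # roughly [0, 2]
```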

slide-17
SLIDE 17 Decorrelation - boosting

Create h1, · · · , ht, · · · sequentially: make ht decorrelated with the previous h's:

[Figure: training data reweighted before each learning round]

Emphasize points in D that were misclassified

Choose weight of ht based on Ein(ht)
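One standard instantiation of this scheme is AdaBoost; the slide does not pin down the exact formulas, so the emphasis update and the choice of each ht's weight from Ein(ht) below follow the usual AdaBoost form, and the decision-stump weak learner is a hypothetical stand-in.

```python
import numpy as np

def stump(X, y, p):
    """Hypothetical weak learner: best decision stump on feature 0."""
    best, best_err = (1, 0.0), np.inf
    for thr in X[:, 0]:
        for s in (1, -1):
            err = p @ (s * np.sign(X[:, 0] - thr + 1e-12) != y)
            if err < best_err:
                best_err, best = err, (s, thr)
    s, thr = best
    return lambda Z: s * np.sign(Z[:, 0] - thr + 1e-12)

def boost(X, y, T=10):
    N = len(y)
    p = np.full(N, 1 / N)  # emphasis on each point of D
    hs, alphas = [], []
    for _ in range(T):
        h = stump(X, y, p)  # h_t trained on the emphasized data
        miss = h(X) != y
        e_in = np.clip(p @ miss, 1e-10, 1 - 1e-10)  # weighted E_in(h_t)
        alpha = 0.5 * np.log((1 - e_in) / e_in)     # weight of h_t from E_in
        p *= np.exp(np.where(miss, alpha, -alpha))  # emphasize mistakes
        p /= p.sum()
        hs.append(h)
        alphas.append(alpha)
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hs)))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (100, 1))
y = np.sign(X[:, 0])       # hypothetical separable target
g = boost(X, y)
print((g(X) == y).mean())  # training accuracy
```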

slide-18
SLIDE 18 Blending - after the fact

For regression,

h1, h2, · · · , hT → g(x) = Σ_{t=1}^{T} αt ht(x)

Principled choice of αt's: minimize the error on an aggregation data set (pseudo-inverse)

Some αt's can come out negative

Most valuable ht in the blend?

Uncorrelated ht's help the blend
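A minimal sketch of this principled choice of the αt's: least squares on an aggregation set via the pseudo-inverse. The three solutions ht and the data set below are hypothetical, and nothing in the fit constrains the αt's to be positive.

```python
import numpy as np

def blend_weights(hs, X_agg, y_agg):
    """Minimize the squared blend error on the aggregation set."""
    H = np.column_stack([h(X_agg) for h in hs])  # N x T matrix of h_t(x_n)
    return np.linalg.pinv(H) @ y_agg             # alpha minimizing ||H a - y||

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 200)
y = np.sin(np.pi * X) + 0.1 * rng.standard_normal(200)

# Three hypothetical trained solutions h_1, h_2, h_3.
hs = [lambda x: x, lambda x: x ** 3, lambda x: np.ones_like(x)]
alpha = blend_weights(hs, X, y)
print(alpha)  # some alpha_t can come out negative
g = lambda x: sum(a * h(x) for a, h in zip(alpha, hs))
print(np.mean((g(X) - y) ** 2))  # blend's squared error on the same set
```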

slide-19
SLIDE 19 Outline

• The map of machine learning
• Bayesian learning
• Aggregation methods
• Acknowledgments

slide-20
SLIDE 20 Course content

Professor Malik Magdon-Ismail, RPI
Professor Hsuan-Tien Lin, NTU

slide-21
SLIDE 21 Course staff

Carlos Gonzalez (Head TA)
Ron Appel
Costis Sideris
Doris Xin

slide-22
SLIDE 22 Filming, production, and infrastructure

Leslie Maxfield and the AMT staff
Rich Fagen and the IMSS staff

slide-23
SLIDE 23 Caltech support

IST: Mathieu Desbrun
E&AS Division: Ares Rosakis and Mani Chandy
Provost's Office: Ed Stolper and Melany Hunt

slide-24
SLIDE 24 Many others

Caltech TA's and staff members
Caltech alumni and Alumni Association
Colleagues all over the world

slide-25
SLIDE 25

To the fond memory of
Faiza A. Ibrahim