Overfitting, Bayesian Regression, and Gaussian Processes - PowerPoint PPT Presentation

SLIDE 1 Overfitting and Underfitting
  • Underfitting vs. overfitting
SLIDE 2 Bayesian Regression
  • Distribution over functions:
    $y_n = f(x_n) + \varepsilon_n$, $f \sim p(f)$, $\varepsilon_n \sim p(\varepsilon)$, $n = 1, \ldots, N$
  • Linear basis function model:
    $f(x_n) := w^\top \phi(x_n) = \sum_{d=1}^{D} w_d \phi_d(x_n)$
  • In matrix form $y = \Phi w$, where $\Phi$ is the $N \times D$ design matrix with rows $\phi(x_n)^\top$, $w$ is $D \times 1$, and $y$ is $N \times 1$

SLIDE 3 Example: Polynomial Basis
  • $y_n = w^\top \phi(x_n) + \varepsilon_n$, $w \sim p(w)$, $\varepsilon_n \sim p(\varepsilon)$, $n = 1, \ldots, N$
  • Reduces to linear regression when $\phi$ is the identity
  • Polynomial basis: $\phi_d(x_n) = x_n^d$
  • Prior on the weights $w$ implies a prior on functions $f$
SLIDE 4 Ridge Regression ($L_2$ Regularization)
  • Objective:
    $\mathcal{L}(w) = \sum_{n=1}^{N} \big(y_n - w^\top \phi(x_n)\big)^2 + \lambda \sum_{d=1}^{D} w_d^2$
  • Example fits shown for $\lambda = 1$ and $\lambda = 10$
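The ridge objective above has a closed-form minimizer. As a minimal sketch (the polynomial data and the $\lambda$ values are illustrative, matching the slide's $\lambda = 1$ and $\lambda = 10$ examples):

```python
import numpy as np

def poly_features(x, degree):
    """Design matrix Phi with columns x^0, x^1, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(Phi, y, lam):
    """Minimize sum_n (y_n - w^T phi(x_n))^2 + lam * sum_d w_d^2.

    The closed-form minimizer is w = (Phi^T Phi + lam I)^{-1} Phi^T y.
    """
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

# Illustrative data: noisy sine observations.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(30)

Phi = poly_features(x, degree=9)
w_small = ridge_fit(Phi, y, lam=1.0)    # lambda = 1
w_large = ridge_fit(Phi, y, lam=10.0)   # lambda = 10: shrinks weights harder
```

Larger $\lambda$ pulls the weight vector toward zero, which is the regularization effect the slide's two fits illustrate.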
SLIDE 5 Ridge Regression: Probabilistic Interpretation
  • Recall the objective: $\sum_{n=1}^{N} \big(y_n - w^\top \phi(x_n)\big)^2 + \lambda \sum_{d=1}^{D} w_d^2$
  • Model: $y_n = w^\top \phi(x_n) + \varepsilon_n$, $w_d \sim \mathrm{Norm}(0, s^2)$, $\varepsilon_n \sim \mathrm{Norm}(0, \sigma^2)$
  • Maximum a posteriori (MAP) estimation:
    $\arg\max_w \log p(w \mid y) = \arg\max_w \log p(y, w) = \arg\max_w \log p(y \mid w) + \log p(w)$
SLIDE 6 Ridge Regression: Probabilistic Interpretation
  • Model: $y_n = w^\top \phi(x_n) + \varepsilon_n$, $w_d \sim \mathrm{Norm}(0, s^2)$, $\varepsilon_n \sim \mathrm{Norm}(0, \sigma^2)$,
    equivalently $y_n \sim \mathrm{Norm}(w^\top \phi(x_n), \sigma^2)$
  • MAP estimation: $\hat{w} = \arg\max_w p(w \mid y) = \arg\max_w \log p(y \mid w) + \log p(w)$
  • $\log p(y \mid w) = \sum_{n=1}^{N} \log \mathrm{Norm}(y_n \mid w^\top \phi(x_n), \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \big(y_n - w^\top \phi(x_n)\big)^2 + \text{const}$
  • $\log p(w) = \sum_{d=1}^{D} \log \mathrm{Norm}(w_d \mid 0, s^2) = -\frac{1}{2 s^2} \sum_{d=1}^{D} w_d^2 + \text{const}$
SLIDE 7 Ridge Regression
  • Model: $y_n = w^\top \phi(x_n) + \varepsilon_n$, $w_d \sim \mathrm{Norm}(0, s^2)$, $\varepsilon_n \sim \mathrm{Norm}(0, \sigma^2)$
  • $\arg\max_w \log p(w \mid y)$ recovers the ridge objective with $\lambda = \sigma^2 / s^2$
  • $\lambda \to 0$ as $s \to \infty$ (uninformative prior)
  • $\lambda$ decreases with more precise observations (smaller $\sigma^2$)
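A quick numerical illustration of $\lambda = \sigma^2/s^2$ (the design matrix and observations below are arbitrary, chosen only to exercise the formula): two different $(\sigma^2, s^2)$ pairs with the same ratio yield the same MAP estimate.

```python
import numpy as np

def map_estimate(Phi, y, sigma2, s2):
    """MAP weights for y_n ~ Norm(w^T phi(x_n), sigma2), w_d ~ Norm(0, s2):
    w = (Phi^T Phi + (sigma2/s2) I)^{-1} Phi^T y."""
    lam = sigma2 / s2
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

rng = np.random.default_rng(1)
Phi = rng.standard_normal((50, 4))  # illustrative design matrix
y = rng.standard_normal(50)         # illustrative observations

w1 = map_estimate(Phi, y, sigma2=0.1, s2=1.0)   # lambda = 0.1
w2 = map_estimate(Phi, y, sigma2=0.4, s2=4.0)   # same ratio, same estimate
```

The estimate depends on the noise and prior variances only through their ratio, which is exactly the slide's regularization strength $\lambda$.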
SLIDE 8 Posterior Predictive Distribution
  • Posterior predictive for a new $y^*$ given previous observations $y$:
    $p(y^* \mid y) = \int df\, p(y^* \mid f)\, p(f \mid y)$
  [Figure: predictive fits for small $\lambda$ vs. large $\lambda$]
SLIDE 9 Posterior Predictive Distribution
  • Idea: incorporate uncertainty about $f$ into predictions of $f^*$ and $y^*$
  [Figure: posterior predictive at a new input $x^*$ for small $\lambda$ vs. large $\lambda$]
SLIDE 10 Gaussian Processes
  • Informal definition of "nonparametric": the complexity of $f$ is determined from the data
  • Formal view: a nonparametric distribution on functions $f \sim p(f \mid y_{1:N})$
  • $\mu(x)$: mean function; $k(x, x')$: covariance function
  • Prior: $f \sim \mathrm{GP}(\mu, k)$
  • Likelihood: $y_n \mid f \sim \mathrm{Norm}(f(x_n), \sigma^2)$
  • Posterior: $p(f \mid y_{1:N}) = \dfrac{p(y_{1:N} \mid f)\, p(f)}{p(y_{1:N})}$, where the denominator is the marginal likelihood
SLIDE 11 Gaussian Processes
  • Practical view: generalization of the multivariate normal to function values
  • Prior on function values at a finite set of points $X_{1:M} := (x_1, \ldots, x_M)$:
    $\bar{f} = \big(f(x_1), \ldots, f(x_M)\big)$, $\bar{f} \sim \mathrm{Norm}(\bar{\mu}, \Sigma)$
  • Likelihood: $y_n \mid \bar{f} \sim \mathrm{Norm}(f_n, \sigma^2)$
  • Predictive: $p(y^* \mid y) = \dfrac{p(y^*, y)}{p(y)}$
SLIDE 12 Joint Distribution on Function Values
  • $\begin{pmatrix} y \\ f^* \end{pmatrix} \sim \mathrm{Norm}\left( \begin{pmatrix} \mu(X) \\ \mu(X^*) \end{pmatrix}, \begin{pmatrix} k(X,X) + \sigma^2 I & k(X, X^*) \\ k(X^*, X) & k(X^*, X^*) \end{pmatrix} \right)$
    with block sizes $N \times N$, $N \times M$, $M \times N$, $M \times M$
  • $\mu(X) = \big(\mu(x_1), \ldots, \mu(x_N)\big)$ and $[k(X, X')]_{nm} = k(x_n, x'_m)$
  • $y = \bar{f} + \bar{\varepsilon}$ with $\bar{f} \sim \mathrm{Norm}\big(\mu(X), k(X,X)\big)$ and $\bar{\varepsilon} \sim \mathrm{Norm}(0, \sigma^2 I)$,
    so $y \sim \mathrm{Norm}\big(\mu(X), k(X,X) + \sigma^2 I\big)$
SLIDE 13 Properties of Multivariate Normals
  • Probability density for jointly Gaussian variables $z \in \mathbb{R}^N$, $f \in \mathbb{R}^M$:
    $p(z, f) = \mathrm{Norm}\left( \begin{pmatrix} a \\ b \end{pmatrix}, \begin{pmatrix} A & C \\ C^\top & B \end{pmatrix} \right)$
    with blocks $A$ ($N \times N$), $C$ ($N \times M$), $B$ ($M \times M$)
  • Marginals: $p(z) = \mathrm{Norm}(a, A)$, $p(f) = \mathrm{Norm}(b, B)$
  • Conditional: $p(f \mid z) = \mathrm{Norm}\big(b + C^\top A^{-1}(z - a),\; B - C^\top A^{-1} C\big)$
SLIDE 14 Computing the Predictive Distribution
  • Write the joint as
    $\begin{pmatrix} y \\ f^* \end{pmatrix} \sim \mathrm{Norm}\left( \begin{pmatrix} a \\ b \end{pmatrix}, \begin{pmatrix} A & C \\ C^\top & B \end{pmatrix} \right)$
    with $a = \mu(X)$, $b = \mu(X^*)$, $A = k(X,X) + \sigma^2 I$, $B = k(X^*, X^*)$, $C = k(X, X^*)$
  • Predictive distribution on new values:
    $f^* \mid y \sim \mathrm{Norm}\big(b + C^\top A^{-1}(y - a),\; B - C^\top A^{-1} C\big)$
    $\phantom{f^* \mid y} \sim \mathrm{Norm}\big(\mu(X^*) + k(X^*,X)\,(k(X,X) + \sigma^2 I)^{-1}(y - \mu(X)),\; k(X^*,X^*) - k(X^*,X)\,(k(X,X) + \sigma^2 I)^{-1}\,k(X,X^*)\big)$
SLIDE 15 Computing the Predictive Distribution
  • In practice: assume $\mu(x) = 0$, so that
    $\begin{pmatrix} y \\ f^* \end{pmatrix} \sim \mathrm{Norm}\left( 0, \begin{pmatrix} k(X,X) + \sigma^2 I & k(X, X^*) \\ k(X^*, X) & k(X^*, X^*) \end{pmatrix} \right)$
  • Predictive distribution on new values:
    $f^* \mid y \sim \mathrm{Norm}\big(B^\top - C^\top A^{-1} \bar{0},\; \ldots\big)$ reduces to
    $f^* \mid y \sim \mathrm{Norm}\big(k(X^*,X)\,(k(X,X) + \sigma^2 I)^{-1} y,\; k(X^*,X^*) - k(X^*,X)\,(k(X,X) + \sigma^2 I)^{-1}\,k(X,X^*)\big)$
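The zero-mean predictive formulas above translate directly into code. A minimal sketch of GP regression, where the squared-exponential kernel, its lengthscale, and the toy data are all assumptions for illustration:

```python
import numpy as np

def se_kernel(x1, x2, ell=0.5):
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * ell ** 2))

def gp_predict(x_train, y, x_test, sigma2=0.01, ell=0.5):
    """Zero-mean GP predictive:
    f* | y ~ Norm(k(X*,X) A^{-1} y, k(X*,X*) - k(X*,X) A^{-1} k(X,X*)),
    with A = k(X,X) + sigma^2 I."""
    A = se_kernel(x_train, x_train, ell) + sigma2 * np.eye(len(x_train))
    K_star = se_kernel(x_test, x_train, ell)          # k(X*, X)
    mean = K_star @ np.linalg.solve(A, y)
    cov = se_kernel(x_test, x_test, ell) - K_star @ np.linalg.solve(A, K_star.T)
    return mean, cov

# Illustrative data: noisy sine observations.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(20)
x_test = np.linspace(0, 1, 50)

mean, cov = gp_predict(x_train, y, x_test)
```

The diagonal of `cov` gives the predictive variance at each test input; it shrinks near the training points, which is the uncertainty behavior the earlier slides motivate.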
SLIDE 16 Gaussian Processes
  • Practical view: generalization of the multivariate normal to function values
  • Prior on function values at a finite set of points $X_{1:M} := (x_1, \ldots, x_M)$:
    $\bar{f} = \big(f(x_1), \ldots, f(x_M)\big)$, $\bar{f} \sim \mathrm{Norm}(\bar{\mu}, \Sigma)$
  • Likelihood: $y_n \mid \bar{f} \sim \mathrm{Norm}(f_n, \sigma^2)$
  • Can compute the predictive $p(y^* \mid y) = \dfrac{p(y^*, y)}{p(y)}$ using linear algebra
SLIDE 17 Gaussian Processes: Interpretation
  • Bayesian ridge regression (with features): $y \sim \mathrm{Norm}\big(\Phi(X) w, \sigma^2 I\big)$
    with $\Phi(X)$ of size $N \times D$, $w$ of size $D \times 1$, and $w \sim \mathrm{Norm}(m_0, S_0)$
    ($S_0$ does not have to be diagonal)
  • Mean: $\mathbb{E}[y] = \Phi(X)\, \mathbb{E}[w] = \Phi(X)\, m_0 = \mu(X)$
  • Covariance: $\mathrm{Cov}[y] = \mathrm{Cov}[\Phi(X) w] + \sigma^2 I = \Phi(X)\, S_0\, \Phi(X)^\top + \sigma^2 I = k(X,X) + \sigma^2 I$
    with factor sizes $N \times D$, $D \times D$, $D \times N$
  • Entrywise: $\mathrm{Cov}[y_n, y_m] = \phi(x_n)^\top S_0\, \phi(x_m) = k(x_n, x_m)$
SLIDE 18 Gaussian Processes: Interpretation
  • Bayesian ridge regression (with features): $y \sim \mathrm{Norm}\big(\Phi(X) w, \sigma^2 I\big)$, $w \sim \mathrm{Norm}(m_0, S_0)$
    ($S_0$ does not have to be diagonal)
  • Equivalent Gaussian process: $y \sim \mathrm{Norm}\big(f(X), \sigma^2 I\big)$, $f \sim \mathrm{GP}\big(\mu(x), k(x, x')\big)$ with
    $\mu(x) = \phi(x)^\top m_0$ and $k(x, x') = \phi(x)^\top S_0\, \phi(x')$
  • Computational advantage: use the kernel trick to evaluate inner products in high-dimensional feature spaces
SLIDE 19 Gaussian Processes vs Ridge Regression
  • $y \sim \mathrm{Norm}\big(f(X), \sigma^2 I\big)$, $f \sim \mathrm{GP}\big(\mu(x), k(x,x')\big)$, $\mu(x) = \phi(x)^\top m_0$, $k(x,x') = \phi(x)^\top S_0\, \phi(x')$
  • Posterior predictive mean (derive this in the homework; take $m_0 = 0$, $S_0 = s^2 I$):
    $\mathbb{E}[y^* \mid y] = k(X^*, X)\,\big(k(X,X) + \sigma^2 I\big)^{-1} y$
    $= \Phi(X^*)\, S_0\, \Phi(X)^\top \big(\Phi(X)\, S_0\, \Phi(X)^\top + \sigma^2 I\big)^{-1} y$ — complexity $O(N^3)$
    $= \Phi(X^*)\, \big(\Phi(X)^\top \Phi(X) + \tfrac{\sigma^2}{s^2} I\big)^{-1} \Phi(X)^\top y$ — complexity $O(D^3)$
  • Conclusion: the GP posterior mean is equivalent to ridge regression
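The $O(N^3)$ and $O(D^3)$ forms above can be checked numerically. A sketch with randomly generated feature matrices (the data and sizes are illustrative), using the feature kernel $k(x,x') = s^2\,\phi(x)^\top \phi(x')$, i.e. $S_0 = s^2 I$ and $m_0 = 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D = 30, 7, 5
sigma2, s2 = 0.1, 2.0

Phi = rng.standard_normal((N, D))       # Phi(X), illustrative features
Phi_star = rng.standard_normal((M, D))  # Phi(X*)
y = rng.standard_normal(N)

# GP form: invert an N x N matrix, O(N^3).
K = s2 * Phi @ Phi.T                    # k(X, X) = Phi S_0 Phi^T
K_star = s2 * Phi_star @ Phi.T          # k(X*, X)
gp_mean = K_star @ np.linalg.solve(K + sigma2 * np.eye(N), y)

# Ridge form: invert a D x D matrix, O(D^3).
lam = sigma2 / s2
ridge_mean = Phi_star @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)
```

The two means agree to numerical precision; which form is cheaper depends on whether $N$ or $D$ is larger.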
SLIDE 20 Regularization in Kernel-based Regression
  • Choice of kernel function $k(x_n, x_m) = \sum_{d,e} \phi_d(x_n)\, S_{0,de}\, \phi_e(x_m)$ implies the choice of regularization $\lambda$
  [Figure: kernel-regression fits for small $\lambda$ vs. large $\lambda$]
SLIDE 21 Choosing Kernel Hyperparameters (Source: Carl Rasmussen)
  • Squared exponential: $k(x, x') = \exp\!\left(-\dfrac{(x - x')^2}{2\ell^2}\right)$
  • Large $\ell$ means stronger regularization
SLIDE 22 Matérn Kernels (Source: Tom Rainforth, PhD Thesis)
  • Idea: a prior over $k$-times differentiable functions, with $k = \lceil \nu \rceil - 1$
  • $k_\nu(x, x') = \sigma_f^2\, \dfrac{2^{1-\nu}}{\Gamma(\nu)} \left( \dfrac{\sqrt{2\nu}\, \lVert x - x' \rVert}{\ell} \right)^{\!\nu} K_\nu\!\left( \dfrac{\sqrt{2\nu}\, \lVert x - x' \rVert}{\ell} \right)$
    where $\Gamma$ is the gamma function and $K_\nu$ is a modified Bessel function
  • Common choices: $\nu = 3/2, 5/2, \ldots$
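For the half-integer values $\nu = 3/2$ and $\nu = 5/2$ named on the slide, the Bessel-function form reduces to closed-form expressions. A minimal sketch (parameter values are illustrative):

```python
import numpy as np

def matern32(r, ell=1.0, sigma2=1.0):
    """Matern nu = 3/2: sigma^2 (1 + sqrt(3) r / l) exp(-sqrt(3) r / l)."""
    z = np.sqrt(3.0) * np.asarray(r) / ell
    return sigma2 * (1.0 + z) * np.exp(-z)

def matern52(r, ell=1.0, sigma2=1.0):
    """Matern nu = 5/2: sigma^2 (1 + sqrt(5) r / l + 5 r^2 / (3 l^2)) exp(-sqrt(5) r / l)."""
    z = np.sqrt(5.0) * np.asarray(r) / ell
    return sigma2 * (1.0 + z + z ** 2 / 3.0) * np.exp(-z)

r = np.linspace(0.0, 3.0, 7)   # distances |x - x'|
k32, k52 = matern32(r), matern52(r)
```

Both start at $k(0) = \sigma_f^2$ and decay with distance; the general-$\nu$ form with $K_\nu$ would typically be evaluated via a special-functions library.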
SLIDE 23 Basis Kernels (Source: David Duvenaud, PhD Thesis)
SLIDE 24 Combinations of Kernels (Source: David Duvenaud, PhD Thesis)
  • Sum: $k(x, x') = k_1(x, x') + k_2(x, x')$
  • Product: $k(x, x') = k_1(x, x') \cdot k_2(x, x')$
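Sums and products of valid kernels are again valid kernels. A small sketch checking this numerically on a grid (the squared-exponential and periodic kernels, and their hyperparameters, are illustrative choices): the resulting Gram matrices remain positive semi-definite.

```python
import numpy as np

def se(x1, x2, ell):
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * ell ** 2))

def periodic(x1, x2, ell=1.0, p=0.5):
    d = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-2 * np.sin(np.pi * d / p) ** 2 / ell ** 2)

x = np.linspace(0, 2, 40)
K_sum = se(x, x, ell=0.5) + periodic(x, x)    # k1 + k2
K_prod = se(x, x, ell=0.5) * periodic(x, x)   # k1 * k2 (elementwise)

min_eigs = [np.linalg.eigvalsh(K).min() for K in (K_sum, K_prod)]
```

The product case follows from the Schur product theorem (elementwise products of PSD matrices are PSD); the sum case is immediate.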
SLIDE 25 Gaussian Processes: Limitations
  • Limitation 1: computational complexity. The predictive
    $f^* \mid y \sim \mathrm{Norm}\big(k(X^*,X)\,(k(X,X) + \sigma^2 I)^{-1} y,\; k(X^*,X^*) - k(X^*,X)\,(k(X,X) + \sigma^2 I)^{-1}\,k(X,X^*)\big)$
    requires inverting an $N \times N$ matrix
  • Limitation 2: dimensionality. Gaussian processes work best when $x_n \in \mathbb{R}^D$ with $1 \le D \le 10$
    (intuition: smoothness assumptions are not helpful in high $D$ because of the curse of dimensionality)
SLIDE 26 Bayesian Regression and Gaussian Processes
  • Bayesian regression: $\hat{f} = \arg\max_f p(f \mid y)$
  • Ridge regression is MAP estimation with a Gaussian prior on the weights
  • Regularization $\lambda = \sigma^2 / s^2$ increases with observation noise $\sigma^2$ and decreases with weight variance $s^2$
  • Gaussian processes: $p(y^* \mid y) = \int df\, p(y^* \mid f)\, p(f \mid y)$
  • Provide an estimate of uncertainty in predictions
  • Posterior mean is equivalent to ridge regression
  • $O(N^3)$ vs $O(D^3)$ complexity