Probability and Statistics for Computer Science (PowerPoint presentation)


SLIDE 1

Probability and Statistics for Computer Science

“All models are wrong, but some models are useful” --- George Box

Hongye Liu, Teaching Assistant Prof, CS361, UIUC, 4.14.2020 (Credit: wikipedia)

SLIDE 2

Last time

Stochastic Gradient Descent
Naïve Bayesian Classifier

SLIDE 3

An example of Naive Bayes training

Training data:

X(1)   X(2)   y
3.5    10     1
1.0    8      1
0.0    10     -1
-3.0   14     -1

Modeling P(x(1)|y) as normal:

P(x(1)|y = 1):  μ_MLE = (3.5 + 1.0)/2 = 2.25,  σ_MLE = 1.25
P(x(1)|y = −1): μ_MLE = −1.5,  σ_MLE = 1.5

Modeling P(x(2)|y) as Poisson:

P(x(2)|y = 1):  λ_MLE = (10 + 8)/2 = 9
P(x(2)|y = −1): λ_MLE = 12

Modeling P(y) as Bernoulli:

P(y = 1) = 2/4 = 0.5
P(y = −1) = 0.5
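As a sketch of how these MLE values fall out of the table (per-class sample mean and variance for the normal model, per-class sample mean for the Poisson rate, class frequency for the prior), the following Python is an illustration added here, not part of the original slides:

```python
import math

# Training data from the slide: (x1, x2, y) per row
data = [(3.5, 10, 1), (1.0, 8, 1), (0.0, 10, -1), (-3.0, 14, -1)]

def fit_naive_bayes(rows):
    """Per-class MLE: normal for x(1), Poisson for x(2), plus the class prior."""
    params = {}
    for label in (1, -1):
        x1 = [r[0] for r in rows if r[2] == label]
        x2 = [r[1] for r in rows if r[2] == label]
        mu = sum(x1) / len(x1)
        sigma = math.sqrt(sum((v - mu) ** 2 for v in x1) / len(x1))  # MLE divides by n
        lam = sum(x2) / len(x2)                                      # Poisson rate MLE
        prior = len(x1) / len(rows)
        params[label] = (mu, sigma, lam, prior)
    return params

params = fit_naive_bayes(data)
print(params[1])   # (2.25, 1.25, 9.0, 0.5), matching the slide
print(params[-1])  # (-1.5, 1.5, 12.0, 0.5)
```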

SLIDE 4

Classification example:

argmax_y [ Σ_{j=1}^{d} log P(x(j)|y) + log P(y) ]

For a new feature vector x = [x1, x2, …], i.e. x = [3, 9] in the example

SLIDE 5

Classification example:

argmax_y [ Σ_{j=1}^{d} log P(x(j)|y) + log P(y) ]

For a new feature vector x = [x1, x2, …], i.e. x = [3, 9] in the example

g(y) is the log-posterior score above, evaluated at y = 1 and at y = −1; the label with the larger score is the prediction.
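The decision rule above can be sketched in Python; the parameters are the MLE values from the training slide, and the g(y) comparison is the argmax over the two labels (an illustration, not code from the slides):

```python
import math

# Trained MLE parameters from the earlier slide: (mu, sigma, lam, prior) per class
params = {1: (2.25, 1.25, 9.0, 0.5), -1: (-1.5, 1.5, 12.0, 0.5)}

def log_normal(x, mu, sigma):
    # log of the normal density N(x; mu, sigma^2)
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_poisson(k, lam):
    # log of the Poisson pmf; lgamma(k+1) = log(k!)
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def predict(x1, x2):
    # argmax_y  sum_j log P(x(j)|y) + log P(y)
    def g(y):
        mu, sigma, lam, prior = params[y]
        return log_normal(x1, mu, sigma) + log_poisson(x2, lam) + math.log(prior)
    return max((1, -1), key=g)

print(predict(3, 9))  # predicts y = 1 for the slide's x = [3, 9]
```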

SLIDE 6

Example of Naïve Bayesian Model

“Bag of words” Naive Bayesian models for document class

Classes such as X-windows vs. MS-windows; a document is represented as a bag-of-words bit vector, where each column is a word

SLIDE 7

What about the decision boundary?

Not explicit as in the case of a decision tree
This method is parametric, generative
The model was specified with parameters to generate the label for test data

SLIDE 8

Pros and Cons of Naïve Bayesian Classifier

Pros:
Simple approach
Good accuracy
Good for high dimensional data

Cons:
The assumption of conditional independence of features
No explicit decision boundary
Sometimes has numerical issues

SLIDE 9

Content

Naïve Bayesian Classifier (cont.)
Linear regression
  The problem
  The least square solution
  The training and prediction
  The R-squared for the evaluation of the fit

SLIDE 10

Some popular topics in Ngram

SLIDE 11

Regression models are Machine learning methods

Regression models have been around for a while
Dr. Kevin Murphy’s Machine Learning book has 3+ chapters on regression

SLIDE 12

Content

Linear regression
  The problem
  The least square solution
  The training and prediction
  The R-squared for the evaluation of the fit

SLIDE 13

Wait, have we seen the linear regression before?

SLIDE 14

It’s about Relationship between data features

Example: does the Height of people relate to people’s weight?

x: HEIGHT,  y: WEIGHT

SLIDE 15

Chicago social economic census

The census included 77 communities in Chicago
The census evaluated the average hardship index of the residents
The census evaluated the following parameters for each community:

  • PERCENT_OF_HOUSING_CROWDED
  • PERCENT_HOUSEHOLD_BELOW_POVERTY
  • PERCENT_AGED_16p_UNEMPLOYED
  • PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA
  • PERCENT_AGED_UNDER_18_OR_OVER_64
  • PER_CAPITA_INCOME

Given a new community and its parameters, can you predict its average hardship index with all these parameters?
SLIDE 16

Chicago social economic census

The scatter plots and the k-means clusters
Take a log of the income, for it shows a better trend

SLIDE 17

The regression problem

Given a set of feature vectors x_i, where each has a numerical label y_i, we want to train a model that can map unlabeled vectors to numerical values

We can think of regression as fitting a line (or curve, or hyperplane, etc.) to data

Regression is like classification except that the prediction target is a real-valued number, not a class label. (Predicting a class label can be considered a special case of regression)

SLIDE 18

Some terminology

Suppose the dataset {(x, y)} consists of N labeled items (x_i, y_i)

If we represent the dataset as a table:

x(1)  x(2)  y
1     3     0
2     3     2
3     6     5

The d columns representing the x(j) (i.e., {x}) are called explanatory variables
The numerical column y is called the dependent variable

SLIDE 19

Variables of the Chicago census

[1] "PERCENT_OF_HOUSING_CROWDED"
[2] "PERCENT_HOUSEHOLDS_BELOW_POVERTY"
[3] "PERCENT_AGED_16p_UNEMPLOYED"
[4] "PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA"
[5] "PERCENT_AGED_UNDER_18_OR_OVER_64"
[6] "PER_CAPITA_INCOME"
[7] "HardshipIndex"

SLIDE 20

Which is the dependent variable in the census example?

A. "PERCENT_OF_HOUSING_CROWDED"
B. "PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA"
C. "HardshipIndex"
D. "PERCENT_AGED_UNDER_18_OR_OVER_64"

SLIDE 21

Linear model

We begin by modeling y as a linear function of the x(j) plus randomness:

y = x(1)β1 + x(2)β2 + ... + x(d)βd + ξ

where ξ is a zero-mean random variable that represents model error

In vector notation:

y = x^T β + ξ

where β is the d-dimensional vector of coefficients that we train

x(1)  x(2)  y
1     3     0
2     3     2
3     6     5

SLIDE 22

Each data item gives an equation

The model: y = x^T β + ξ = x(1)β1 + x(2)β2 + ξ

Training data:

x(1)  x(2)  y
1     3     0
2     3     2
3     6     5

SLIDE 23

Which together form a matrix equation

The model: y = x^T β + ξ = x(1)β1 + x(2)β2 + ξ

Training data:

x(1)  x(2)  y
1     3     0
2     3     2
3     6     5

[0]   [1 3]          [ξ1]
[2] = [2 3] [β1]  +  [ξ2]
[5]   [3 6] [β2]     [ξ3]

SLIDE 24

Which together form a matrix equation

The model: y = x^T β + ξ = x(1)β1 + x(2)β2 + ξ

Training data:

x(1)  x(2)  y
1     3     0
2     3     2
3     6     5

[0]   [1 3]          [ξ1]
[2] = [2 3] [β1]  +  [ξ2]
[5]   [3 6] [β2]     [ξ3]

y = X · β + e

SLIDE 25

Q. What’s the dimension of matrix X?

A. N × d
B. d × N
C. N × N
D. d × d
SLIDE 26

Training the model is to choose β

Given a training dataset {(x, y)}, we want to fit a model y = x^T β + ξ

Define

y = [y1, ..., yN]^T,   X = [x1^T; ...; xN^T],   e = [ξ1, ..., ξN]^T

To train the model, we need to choose β that makes e small in the matrix equation

y = X · β + e

SLIDE 27

Training using least squares

In the least squares method, we aim to minimize ‖e‖²:

‖e‖² = ‖y − Xβ‖² = (y − Xβ)^T (y − Xβ)

Differentiating with respect to β and setting to zero:

X^T X β − X^T y = 0

If X^T X is invertible, the least squares estimate of the coefficient is:

β̂ = (X^T X)^{-1} X^T y
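As a quick numerical check of the closed-form solution on the running example, here is a NumPy sketch (solving the normal equations directly; np.linalg.lstsq would work equally well):

```python
import numpy as np

# Training data from the slides: rows of X are [x(1), x(2)], y is the label column
X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
y = np.array([0.0, 2.0, 5.0])

# Normal equations: (X^T X) beta = X^T y; X^T X is invertible here
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # approximately [2.0, -1/3]
```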

SLIDE 28
‖e‖² = (y − Xβ)^T (y − Xβ)
     = y^T y − 2 β^T X^T y + β^T X^T X β        (using (AB)^T = B^T A^T)

Useful derivatives involving vectors/matrices (a, b vectors, A a matrix):

∂(a^T A a)/∂a = (A + A^T) a
∂(b^T a)/∂a = b

So ∂(β^T X^T X β)/∂β = 2 X^T X β (since X^T X is symmetric), and ∂(β^T X^T y)/∂β = X^T y.
Note ‖e‖² is a scalar, so all the terms above are scalars.

∂‖e‖²/∂β = −2 X^T y + 2 X^T X β = 0
⇒ X^T X β = X^T y
⇒ β̂ = (X^T X)^{-1} X^T y
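To sanity-check the derivative in this derivation, one can compare the analytic gradient 2(X^T X β − X^T y) against a central finite difference at an arbitrary β (an added illustration, not from the slides; the point β = (0.5, −0.7) is arbitrary):

```python
import numpy as np

X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
y = np.array([0.0, 2.0, 5.0])

def sq_error(beta):
    r = y - X @ beta
    return r @ r  # ||y - X beta||^2

def analytic_grad(beta):
    # From the derivation: d||e||^2/dbeta = 2 X^T X beta - 2 X^T y
    return 2 * (X.T @ X @ beta - X.T @ y)

beta = np.array([0.5, -0.7])
h = 1e-6
numeric = np.array([
    (sq_error(beta + h * e) - sq_error(beta - h * e)) / (2 * h)
    for e in np.eye(2)
])
print(numeric)             # central finite-difference gradient
print(analytic_grad(beta)) # matches up to floating point
```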
SLIDE 29

Convex set and convex function

If a set is convex, any line connecting two points in the set is completely included in the set

A convex function: the area above the curve is convex

f(λx + (1 − λ)y) < λf(x) + (1 − λ)f(y)

The least square function is convex

Credit: Dr. Kevin Murphy
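The convexity claim can be spot-checked numerically for the least squares objective on the example data (a sketch with randomly chosen points; the inequality holds because the Hessian 2 X^T X is positive semidefinite):

```python
import numpy as np

X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
y = np.array([0.0, 2.0, 5.0])

def f(beta):
    r = y - X @ beta
    return r @ r  # the least squares objective ||y - X beta||^2

rng = np.random.default_rng(0)
for _ in range(100):
    b1, b2 = rng.normal(size=2), rng.normal(size=2)
    lam = rng.uniform()
    # Convexity: f(lam*b1 + (1-lam)*b2) <= lam*f(b1) + (1-lam)*f(b2)
    assert f(lam * b1 + (1 - lam) * b2) <= lam * f(b1) + (1 - lam) * f(b2) + 1e-9
print("convexity inequality held on 100 random pairs")
```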

SLIDE 30

What’s the dimension of matrix XTX?

A. N × d
B. d × N
C. N × N
D. d × d

SLIDE 31

What’s the dimension of matrix XTX?

A. N × d
B. d × N
C. N × N
D. d × d

Answer: D. d × d
SLIDE 32

Is this statement true?

If the matrix XTX does NOT have zero valued eigenvalues, it is invertible.

A. TRUE
B. FALSE

(X^T X is symmetric and d × d; its determinant is the product of its eigenvalues, so no zero eigenvalue means det(X^T X) ≠ 0.)
SLIDE 33

Is this statement true?

If the matrix XTX does NOT have zero valued eigenvalues, it is invertible.

A. TRUE
B. FALSE

Answer: A. TRUE
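The determinant/eigenvalue argument can be checked numerically on the example X (an added sketch; `eigvalsh` is NumPy's eigenvalue routine for symmetric matrices):

```python
import numpy as np

X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
XtX = X.T @ X  # symmetric, d x d

# det(X^T X) is the product of its eigenvalues, so having no zero
# eigenvalue implies a nonzero determinant, i.e. X^T X is invertible.
eigvals = np.linalg.eigvalsh(XtX)
print(eigvals)
print(np.prod(eigvals), np.linalg.det(XtX))  # equal up to floating point
```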

SLIDE 34

Training using least squares example

Model: y = x^T β + ξ = x(1)β1 + x(2)β2 + ξ

Training data:

x(1)  x(2)  y
1     3     0
2     3     2
3     6     5

β̂ = (X^T X)^{-1} X^T y = [2, −1/3]^T, i.e. β1 = 2, β2 = −1/3

SLIDE 35

Prediction

If we train the model coefficients β, we can predict y^p from x0

In the model y = x(1)β1 + x(2)β2 + ξ with β = [2, −1/3]^T:

The prediction for x0 = [2, 1]^T is y^p_0 = x0^T β = 2·2 + 1·(−1/3) = 11/3
The prediction for x0 = [0, 0]^T is y^p_0 = 0
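The prediction arithmetic can be reproduced in a couple of lines (a sketch; the query point x0 = [2, 1]^T is the reading of the partly garbled slide):

```python
import numpy as np

beta = np.array([2.0, -1.0 / 3.0])  # coefficients trained on the example data

x0 = np.array([2.0, 1.0])           # query point from the slide
y_pred = x0 @ beta                  # 2*2 + 1*(-1/3) = 11/3
print(y_pred)

# With no constant offset the zero vector is always mapped to zero,
# which motivates the offset term on the next slide:
print(np.zeros(2) @ beta)
```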

SLIDE 36

A linear model with constant offset

The problem with the model y = x(1)β1 + x(2)β2 + ξ is: it always predicts y^p = 0 if the input vector x0 = 0

Let’s add a constant offset β0 to the model:

y = β0 + x(1)β1 + x(2)β2 + ξ

SLIDE 37

Training and prediction with constant offset

The model: y = β0 + x(1)β1 + x(2)β2 + ξ = x^T β + ξ

Training data (with a constant column of 1s):

1  x(1)  x(2)  y
1  1     3     0
1  2     3     2
1  3     6     5

β̂ = (X^T X)^{-1} X^T y = [−3, 2, 1/3]^T

For x0 = [0, 0]^T, the augmented vector is [1, 0, 0]^T, so the prediction is y^p_0 = −3

SLIDE 38

Variance of the linear regression model

The least squares estimate satisfies this property:

var({y_i}) = var({x_i^T β̂}) + var({ξ_i})

The random error is uncorrelated to the least square solution of the linear combination of explanatory variables.
(Recall: var[X + Y] = var[X] + var[Y] if X and Y are uncorrelated.)
SLIDE 39

Variance of the linear regression model: proof

The least squares estimate satisfies this property:

var({y_i}) = var({x_i^T β̂}) + var({ξ_i})

Proof:
y = X · β̂ + e
var[y] = (1/N)(y − ȳ)^T (y − ȳ)
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)] + [e − ē])^T ([Xβ̂ − mean(Xβ̂)] + [e − ē])

SLIDE 40

Variance of the linear regression model: proof

The least squares estimate satisfies this property:

var({y_i}) = var({x_i^T β̂}) + var({ξ_i})

Proof:
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)] + [e − ē])^T ([Xβ̂ − mean(Xβ̂)] + [e − ē])

SLIDE 41

Variance of the linear regression model: proof

The least squares estimate satisfies this property:

var({y_i}) = var({x_i^T β̂}) + var({ξ_i})

Proof:
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)] + [e − ē])^T ([Xβ̂ − mean(Xβ̂)] + [e − ē])
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)]^T [Xβ̂ − mean(Xβ̂)] + 2[e − ē]^T [Xβ̂ − mean(Xβ̂)] + [e − ē]^T [e − ē])

(Note: var[y] is a scalar; for a scalar c we have c = c^T, so the two cross terms are equal.)
SLIDE 42

Variance of the linear regression model: proof

The least squares estimate satisfies this property:

var({y_i}) = var({x_i^T β̂}) + var({ξ_i})

Proof:
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)] + [e − ē])^T ([Xβ̂ − mean(Xβ̂)] + [e − ē])
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)]^T [Xβ̂ − mean(Xβ̂)] + 2[e − ē]^T [Xβ̂ − mean(Xβ̂)] + [e − ē]^T [e − ē])

Because ē = 0;  e^T 1 = 0;  e^T X β̂ = 0

SLIDE 43

Variance of the linear regression model: proof

The least squares estimate satisfies this property:

var({y_i}) = var({x_i^T β̂}) + var({ξ_i})

Proof:
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)] + [e − ē])^T ([Xβ̂ − mean(Xβ̂)] + [e − ē])
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)]^T [Xβ̂ − mean(Xβ̂)] + 2[e − ē]^T [Xβ̂ − mean(Xβ̂)] + [e − ē]^T [e − ē])

Because ē = 0, e^T 1 = 0, and e^T X β̂ = 0 (due to the least squares minimization), the cross term vanishes:

var[y] = (1/N)([Xβ̂ − mean(Xβ̂)]^T [Xβ̂ − mean(Xβ̂)] + [e − ē]^T [e − ē])

SLIDE 44

Variance of the linear regression model: proof

The least squares estimate satisfies this property:

var({y_i}) = var({x_i^T β̂}) + var({ξ_i})

Proof:
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)] + [e − ē])^T ([Xβ̂ − mean(Xβ̂)] + [e − ē])
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)]^T [Xβ̂ − mean(Xβ̂)] + 2[e − ē]^T [Xβ̂ − mean(Xβ̂)] + [e − ē]^T [e − ē])

Because ē = 0, e^T 1 = 0, and e^T X β̂ = 0 (due to the least squares minimization), the cross term vanishes:

var[y] = (1/N)([Xβ̂ − mean(Xβ̂)]^T [Xβ̂ − mean(Xβ̂)] + [e − ē]^T [e − ē])
       = var[X β̂] + var[e]

SLIDE 45

Evaluating models using R-squared

The least squares estimate satisfies this property:

var({y_i}) = var({x_i^T β̂}) + var({ξ_i})

This property gives us an evaluation metric called R-squared:

R² = var({x_i^T β̂}) / var({y_i})

We have 0 ≤ R² ≤ 1, with a larger value meaning a better fit.
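The variance decomposition and the resulting R² can be illustrated on synthetic data (everything below, including the dataset and coefficients, is made up for the check; the constant column is needed so that ē = 0):

```python
import numpy as np

# Synthetic example (not from the slides): offset model with noisy labels
rng = np.random.default_rng(1)
N = 50
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])  # constant column first
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.7, size=N)

beta = np.linalg.solve(X.T @ X, X.T @ y)  # least squares fit
fitted = X @ beta
e = y - fitted

def var(v):
    return np.mean((v - v.mean()) ** 2)

r_squared = var(fitted) / var(y)
print(var(y), var(fitted) + var(e))  # equal up to floating point
print(r_squared)                     # lies between 0 and 1
```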

SLIDE 46

Q: What is R-squared if there is only one explanatory variable in the model?

SLIDE 47

Q: What is R-squared if there is only one explanatory variable in the model?

R-squared would be the correlation coefficient squared (textbook pgs 43-44)
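This identity can be checked numerically: fit a one-variable model with an offset and compare var(fitted)/var(y) to the squared correlation coefficient (synthetic data, added for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=40)
y = 3.0 * x + 1.0 + rng.normal(size=40)

# One explanatory variable plus a constant offset
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
fitted = X @ beta

def var(v):
    return np.mean((v - v.mean()) ** 2)

r_squared = var(fitted) / var(y)
corr = np.corrcoef(x, y)[0, 1]
print(r_squared, corr ** 2)  # agree up to floating point
```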

SLIDE 48

R-squared examples

SLIDE 49

Comparing our example models

y = x(1)β1 + x(2)β2 + ξ, with β̂ = [2, −1/3]^T:

x(1)  x(2)  y  x^T β̂
1     3     0  1
2     3     2  3
3     6     5  4

y = β0 + x(1)β1 + x(2)β2 + ξ, with β̂ = [−3, 2, 1/3]^T:

1  x(1)  x(2)  y  x^T β̂
1  1     3     0  0
1  2     3     2  2
1  3     6     5  5

SLIDE 50

Linear regression model for the Chicago census data

SLIDE 51

Residual is normally distributed?

The Q-Q plot of the residuals suggests they are roughly normally distributed

SLIDE 52

Prediction for another community

[1] "PERCENT_OF_HOUSING_CROWDED": 4.7
[2] "PERCENT_HOUSEHOLDS_BELOW_POVERTY": 19.7
[3] "PERCENT_AGED_16p_UNEMPLOYED": 12.9
[4] "PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA": 19.5
[5] "PERCENT_AGED_UNDER_18_OR_OVER_64": 33.5
[6] "PER_CAPITA_INCOME": log(28202)

Predicted hardship index: 41.46038
Note: the maximum of the hardship index in the training data is 98, the minimum is 1

SLIDE 53

The clusters of the Chicago communities: clusters and hardship

[Two t-SNE plots: heatmap of clusters of community; hardship index of communities]

SLIDE 54

The clusters of the Chicago communities: per capita income and hardship

[Two t-SNE plots: heatmap of PER_CAPITA_INCOME (log scale); hardship index of communities]
SLIDE 55

The clusters of the Chicago communities: without diploma and hardship

[Two t-SNE plots: PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA; hardship index of communities]

SLIDE 56

Assignments

Read Chapter 13 of the textbook
Next time: More on linear regression

SLIDE 57

[Whiteboard notes: Bayes’ rule. P(θ|D) = P(D|θ) P(θ) / P(D), where P(D) = Σ_θ P(D|θ) P(θ).]
SLIDE 58

Additional References

✺ Robert V. Hogg, Elliot A. Tanis and Dale L. Zimmerman, “Probability and Statistical Inference”
✺ Kevin Murphy, “Machine Learning: A Probabilistic Perspective”

SLIDE 59

See you next time

See You!