Probability and Statistics for Computer Science (PowerPoint presentation)


SLIDE 1

Probability and Statistics for Computer Science

“All models are wrong, but some models are useful” --- George Box

Hongye Liu, Teaching Assistant Prof, CS361, UIUC, 4.14.2020 (Credit: wikipedia)

SLIDE 2

Last time

Stochastic Gradient Descent
Naïve Bayesian Classifier

SLIDE 3

An example of Naive Bayes training

Training data:

X(1)   X(2)   y
3.5    10     1
1.0    8      1
0.0    10     -1
-3.0   14     -1

Modeling P(x(1)|y) as normal:

P(x(1)|y = 1):  μ_MLE = (3.5 + 1.0)/2 = 2.25,  σ_MLE = 1.25
P(x(1)|y = −1): μ_MLE = −1.5,  σ_MLE = 1.5

Modeling P(x(2)|y) as Poisson:

P(x(2)|y = 1):  λ_MLE = (10 + 8)/2 = 9
P(x(2)|y = −1): λ_MLE = 12

Modeling P(y) as Bernoulli:

P(y = 1) = 2/4 = 0.5
P(y = −1) = 0.5
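As a sketch of how these MLE values fall out of the table (per-class sample mean and variance for the normal model, per-class sample mean for the Poisson rate, class frequency for the prior), the following Python is an illustration added here, not part of the original slides:

```python
import math

# Training data from the slide: (x1, x2, y) per row
data = [(3.5, 10, 1), (1.0, 8, 1), (0.0, 10, -1), (-3.0, 14, -1)]

def fit_naive_bayes(rows):
    """Per-class MLE: normal for x(1), Poisson for x(2), plus the class prior."""
    params = {}
    for label in (1, -1):
        x1 = [r[0] for r in rows if r[2] == label]
        x2 = [r[1] for r in rows if r[2] == label]
        mu = sum(x1) / len(x1)
        sigma = math.sqrt(sum((v - mu) ** 2 for v in x1) / len(x1))  # MLE divides by n
        lam = sum(x2) / len(x2)                                      # Poisson rate MLE
        prior = len(x1) / len(rows)
        params[label] = (mu, sigma, lam, prior)
    return params

params = fit_naive_bayes(data)
print(params[1])   # (2.25, 1.25, 9.0, 0.5), matching the slide
print(params[-1])  # (-1.5, 1.5, 12.0, 0.5)
```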

SLIDE 4

Classification example:

argmax_y [ Σ_{j=1}^{d} log P(x(j)|y) + log P(y) ]

For a new feature vector x = [x1, x2, …], i.e. x = [3, 9] in the example

SLIDE 5

Classification example:

argmax_y [ Σ_{j=1}^{d} log P(x(j)|y) + log P(y) ]

For a new feature vector x = [x1, x2, …], i.e. x = [3, 9] in the example

g(y) is the log-posterior score above, evaluated at y = 1 and at y = −1; the label with the larger score is the prediction.
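The decision rule above can be sketched in Python; the parameters are the MLE values from the training slide, and the g(y) comparison is the argmax over the two labels (an illustration, not code from the slides):

```python
import math

# Trained MLE parameters from the earlier slide: (mu, sigma, lam, prior) per class
params = {1: (2.25, 1.25, 9.0, 0.5), -1: (-1.5, 1.5, 12.0, 0.5)}

def log_normal(x, mu, sigma):
    # log of the normal density N(x; mu, sigma^2)
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_poisson(k, lam):
    # log of the Poisson pmf; lgamma(k+1) = log(k!)
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def predict(x1, x2):
    # argmax_y  sum_j log P(x(j)|y) + log P(y)
    def g(y):
        mu, sigma, lam, prior = params[y]
        return log_normal(x1, mu, sigma) + log_poisson(x2, lam) + math.log(prior)
    return max((1, -1), key=g)

print(predict(3, 9))  # predicts y = 1 for the slide's x = [3, 9]
```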

SLIDE 6

Example of Naïve Bayesian Model

“Bag of words” Naive Bayesian models for document class

Classes such as X-windows vs. MS-windows; a document is represented as a bag-of-words bit vector, where each column is a word

SLIDE 7

What about the decision boundary?

Not explicit as in the case of a decision tree
This method is parametric, generative
The model was specified with parameters to generate the label for test data

SLIDE 8

Pros and Cons of Naïve Bayesian Classifier

Pros:
Simple approach
Good accuracy
Good for high dimensional data

Cons:
The assumption of conditional independence of features
No explicit decision boundary
Sometimes has numerical issues

SLIDE 9

Content

Naïve Bayesian Classifier (cont.)
Linear regression
  The problem
  The least square solution
  The training and prediction
  The R-squared for the evaluation of the fit

SLIDE 10

Some popular topics in Ngram

SLIDE 11

Regression models are Machine learning methods

Regression models have been around for a while
Dr. Kevin Murphy’s Machine Learning book has 3+ chapters on regression

SLIDE 12

Content

Linear regression
  The problem
  The least square solution
  The training and prediction
  The R-squared for the evaluation of the fit

SLIDE 13

Wait, have we seen the linear regression before?

SLIDE 14

It’s about Relationship between data features

Example: does the Height of people relate to people’s weight?

x: HEIGHT,  y: WEIGHT

SLIDE 15

Chicago social economic census

The census included 77 communities in Chicago
The census evaluated the average hardship index of the residents
The census evaluated the following parameters for each community:

  • PERCENT_OF_HOUSING_CROWDED
  • PERCENT_HOUSEHOLD_BELOW_POVERTY
  • PERCENT_AGED_16p_UNEMPLOYED
  • PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA
  • PERCENT_AGED_UNDER_18_OR_OVER_64
  • PER_CAPITA_INCOME

Given a new community and its parameters, can you predict its average hardship index with all these parameters?
SLIDE 16

Chicago social economic census

The scatter plots and the k-means clusters
Take a log of the income, for it shows a better trend

SLIDE 17

The regression problem

Given a set of feature vectors x_i, where each has a numerical label y_i, we want to train a model that can map unlabeled vectors to numerical values

We can think of regression as fitting a line (or curve, or hyperplane, etc.) to data

Regression is like classification except that the prediction target is a real-valued number, not a class label. (Predicting a class label can be considered a special case of regression)

SLIDE 18

Some terminology

Suppose the dataset {(x, y)} consists of N labeled items (x_i, y_i)

If we represent the dataset as a table:

x(1)  x(2)  y
1     3     0
2     3     2
3     6     5

The d columns representing the x(j) (i.e., {x}) are called explanatory variables
The numerical column y is called the dependent variable

SLIDE 19

Variables of the Chicago census

[1] "PERCENT_OF_HOUSING_CROWDED"
[2] "PERCENT_HOUSEHOLDS_BELOW_POVERTY"
[3] "PERCENT_AGED_16p_UNEMPLOYED"
[4] "PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA"
[5] "PERCENT_AGED_UNDER_18_OR_OVER_64"
[6] "PER_CAPITA_INCOME"
[7] "HardshipIndex"

SLIDE 20

Which is the dependent variable in the census example?

A. "PERCENT_OF_HOUSING_CROWDED"
B. "PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA"
C. "HardshipIndex"
D. "PERCENT_AGED_UNDER_18_OR_OVER_64"

SLIDE 21

Linear model

We begin by modeling y as a linear function of the x(j) plus randomness:

y = x(1)β1 + x(2)β2 + ... + x(d)βd + ξ

where ξ is a zero-mean random variable that represents model error

In vector notation:

y = x^T β + ξ

where β is the d-dimensional vector of coefficients that we train

x(1)  x(2)  y
1     3     0
2     3     2
3     6     5

SLIDE 22

Each data item gives an equation

The model: y = x^T β + ξ = x(1)β1 + x(2)β2 + ξ

Training data:

x(1)  x(2)  y
1     3     0
2     3     2
3     6     5

SLIDE 23

Which together form a matrix equation

The model: y = x^T β + ξ = x(1)β1 + x(2)β2 + ξ

Training data:

x(1)  x(2)  y
1     3     0
2     3     2
3     6     5

[0]   [1 3]          [ξ1]
[2] = [2 3] [β1]  +  [ξ2]
[5]   [3 6] [β2]     [ξ3]

SLIDE 24

Which together form a matrix equation

The model: y = x^T β + ξ = x(1)β1 + x(2)β2 + ξ

Training data:

x(1)  x(2)  y
1     3     0
2     3     2
3     6     5

[0]   [1 3]          [ξ1]
[2] = [2 3] [β1]  +  [ξ2]
[5]   [3 6] [β2]     [ξ3]

y = X · β + e

SLIDE 25

Q. What’s the dimension of matrix X?

A. N × d
B. d × N
C. N × N
D. d × d
SLIDE 26

Training the model is to choose β

Given a training dataset {(x, y)}, we want to fit a model y = x^T β + ξ

Define

y = [y1, ..., yN]^T,   X = [x1^T; ...; xN^T],   e = [ξ1, ..., ξN]^T

To train the model, we need to choose β that makes e small in the matrix equation

y = X · β + e

SLIDE 27

Training using least squares

In the least squares method, we aim to minimize ‖e‖²:

‖e‖² = ‖y − Xβ‖² = (y − Xβ)^T (y − Xβ)

Differentiating with respect to β and setting to zero:

X^T X β − X^T y = 0

If X^T X is invertible, the least squares estimate of the coefficient is:

β̂ = (X^T X)^{-1} X^T y
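As a quick numerical check of the closed-form solution on the running example, here is a NumPy sketch (solving the normal equations directly; np.linalg.lstsq would work equally well):

```python
import numpy as np

# Training data from the slides: rows of X are [x(1), x(2)], y is the label column
X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
y = np.array([0.0, 2.0, 5.0])

# Normal equations: (X^T X) beta = X^T y; X^T X is invertible here
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # approximately [2.0, -1/3]
```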

SLIDE 28
‖e‖² = (y − Xβ)^T (y − Xβ)
     = y^T y − 2 β^T X^T y + β^T X^T X β        (using (AB)^T = B^T A^T)

Useful derivatives involving vectors/matrices (a, b vectors, A a matrix):

∂(a^T A a)/∂a = (A + A^T) a
∂(b^T a)/∂a = b

So ∂(β^T X^T X β)/∂β = 2 X^T X β (since X^T X is symmetric), and ∂(β^T X^T y)/∂β = X^T y.
Note ‖e‖² is a scalar, so all the terms above are scalars.

∂‖e‖²/∂β = −2 X^T y + 2 X^T X β = 0
⇒ X^T X β = X^T y
⇒ β̂ = (X^T X)^{-1} X^T y
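To sanity-check the derivative in this derivation, one can compare the analytic gradient 2(X^T X β − X^T y) against a central finite difference at an arbitrary β (an added illustration, not from the slides; the point β = (0.5, −0.7) is arbitrary):

```python
import numpy as np

X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
y = np.array([0.0, 2.0, 5.0])

def sq_error(beta):
    r = y - X @ beta
    return r @ r  # ||y - X beta||^2

def analytic_grad(beta):
    # From the derivation: d||e||^2/dbeta = 2 X^T X beta - 2 X^T y
    return 2 * (X.T @ X @ beta - X.T @ y)

beta = np.array([0.5, -0.7])
h = 1e-6
numeric = np.array([
    (sq_error(beta + h * e) - sq_error(beta - h * e)) / (2 * h)
    for e in np.eye(2)
])
print(numeric)             # central finite-difference gradient
print(analytic_grad(beta)) # matches up to floating point
```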
SLIDE 29

Convex set and convex function

If a set is convex, any line connecting two points in the set is completely included in the set

A convex function: the area above the curve is convex

f(λx + (1 − λ)y) < λf(x) + (1 − λ)f(y)

The least square function is convex

Credit: Dr. Kevin Murphy
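The convexity claim can be spot-checked numerically for the least squares objective on the example data (a sketch with randomly chosen points; the inequality holds because the Hessian 2 X^T X is positive semidefinite):

```python
import numpy as np

X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
y = np.array([0.0, 2.0, 5.0])

def f(beta):
    r = y - X @ beta
    return r @ r  # the least squares objective ||y - X beta||^2

rng = np.random.default_rng(0)
for _ in range(100):
    b1, b2 = rng.normal(size=2), rng.normal(size=2)
    lam = rng.uniform()
    # Convexity: f(lam*b1 + (1-lam)*b2) <= lam*f(b1) + (1-lam)*f(b2)
    assert f(lam * b1 + (1 - lam) * b2) <= lam * f(b1) + (1 - lam) * f(b2) + 1e-9
print("convexity inequality held on 100 random pairs")
```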

SLIDE 30

What’s the dimension of matrix XTX?

A. N × d
B. d × N
C. N × N
D. d × d

SLIDE 31

What’s the dimension of matrix XTX?

A. N × d
B. d × N
C. N × N
D. d × d

Answer: D. d × d
SLIDE 32

Is this statement true?

If the matrix XTX does NOT have zero valued eigenvalues, it is invertible.

A. TRUE
B. FALSE

(X^T X is symmetric and d × d; its determinant is the product of its eigenvalues, so no zero eigenvalue means det(X^T X) ≠ 0.)
SLIDE 33

Is this statement true?

If the matrix XTX does NOT have zero valued eigenvalues, it is invertible.

A. TRUE
B. FALSE

Answer: A. TRUE
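The determinant/eigenvalue argument can be checked numerically on the example X (an added sketch; `eigvalsh` is NumPy's eigenvalue routine for symmetric matrices):

```python
import numpy as np

X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
XtX = X.T @ X  # symmetric, d x d

# det(X^T X) is the product of its eigenvalues, so having no zero
# eigenvalue implies a nonzero determinant, i.e. X^T X is invertible.
eigvals = np.linalg.eigvalsh(XtX)
print(eigvals)
print(np.prod(eigvals), np.linalg.det(XtX))  # equal up to floating point
```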

SLIDE 34

Training using least squares example

Model: y = x^T β + ξ = x(1)β1 + x(2)β2 + ξ

Training data:

x(1)  x(2)  y
1     3     0
2     3     2
3     6     5

β̂ = (X^T X)^{-1} X^T y = [2, −1/3]^T, i.e. β1 = 2, β2 = −1/3

SLIDE 35

Prediction

If we train the model coefficients β, we can predict y^p from x0

In the model y = x(1)β1 + x(2)β2 + ξ with β = [2, −1/3]^T:

The prediction for x0 = [2, 1]^T is y^p_0 = x0^T β = 2·2 + 1·(−1/3) = 11/3
The prediction for x0 = [0, 0]^T is y^p_0 = 0
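The prediction arithmetic can be reproduced in a couple of lines (a sketch; the query point x0 = [2, 1]^T is the reading of the partly garbled slide):

```python
import numpy as np

beta = np.array([2.0, -1.0 / 3.0])  # coefficients trained on the example data

x0 = np.array([2.0, 1.0])           # query point from the slide
y_pred = x0 @ beta                  # 2*2 + 1*(-1/3) = 11/3
print(y_pred)

# With no constant offset the zero vector is always mapped to zero,
# which motivates the offset term on the next slide:
print(np.zeros(2) @ beta)
```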

SLIDE 36

A linear model with constant offset

The problem with the model y = x(1)β1 + x(2)β2 + ξ is: it always predicts y^p = 0 if the input vector x0 = 0

Let’s add a constant offset β0 to the model:

y = β0 + x(1)β1 + x(2)β2 + ξ

SLIDE 37

Training and prediction with constant offset

The model: y = β0 + x(1)β1 + x(2)β2 + ξ = x^T β + ξ

Training data (with a constant column of 1s):

1  x(1)  x(2)  y
1  1     3     0
1  2     3     2
1  3     6     5

β̂ = (X^T X)^{-1} X^T y = [−3, 2, 1/3]^T

For x0 = [0, 0]^T, the augmented vector is [1, 0, 0]^T, so the prediction is y^p_0 = −3

SLIDE 38

Variance of the linear regression model

The least squares estimate satisfies this property:

var({y_i}) = var({x_i^T β̂}) + var({ξ_i})

The random error is uncorrelated to the least square solution of the linear combination of explanatory variables.
(Recall: var[X + Y] = var[X] + var[Y] if X and Y are uncorrelated.)
SLIDE 39

Variance of the linear regression model: proof

The least squares estimate satisfies this property:

var({y_i}) = var({x_i^T β̂}) + var({ξ_i})

Proof:
y = X · β̂ + e
var[y] = (1/N)(y − ȳ)^T (y − ȳ)
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)] + [e − ē])^T ([Xβ̂ − mean(Xβ̂)] + [e − ē])

SLIDE 40

Variance of the linear regression model: proof

The least squares estimate satisfies this property:

var({y_i}) = var({x_i^T β̂}) + var({ξ_i})

Proof:
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)] + [e − ē])^T ([Xβ̂ − mean(Xβ̂)] + [e − ē])

SLIDE 41

Variance of the linear regression model: proof

The least squares estimate satisfies this property:

var({y_i}) = var({x_i^T β̂}) + var({ξ_i})

Proof:
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)] + [e − ē])^T ([Xβ̂ − mean(Xβ̂)] + [e − ē])
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)]^T [Xβ̂ − mean(Xβ̂)] + 2[e − ē]^T [Xβ̂ − mean(Xβ̂)] + [e − ē]^T [e − ē])

(Note: var[y] is a scalar; for a scalar c we have c = c^T, so the two cross terms are equal.)
SLIDE 42

Variance of the linear regression model: proof

The least squares estimate satisfies this property:

var({y_i}) = var({x_i^T β̂}) + var({ξ_i})

Proof:
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)] + [e − ē])^T ([Xβ̂ − mean(Xβ̂)] + [e − ē])
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)]^T [Xβ̂ − mean(Xβ̂)] + 2[e − ē]^T [Xβ̂ − mean(Xβ̂)] + [e − ē]^T [e − ē])

Because ē = 0;  e^T 1 = 0;  e^T X β̂ = 0

SLIDE 43

Variance of the linear regression model: proof

The least squares estimate satisfies this property:

var({y_i}) = var({x_i^T β̂}) + var({ξ_i})

Proof:
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)] + [e − ē])^T ([Xβ̂ − mean(Xβ̂)] + [e − ē])
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)]^T [Xβ̂ − mean(Xβ̂)] + 2[e − ē]^T [Xβ̂ − mean(Xβ̂)] + [e − ē]^T [e − ē])

Because ē = 0, e^T 1 = 0, and e^T X β̂ = 0 (due to the least squares minimization), the cross term vanishes:

var[y] = (1/N)([Xβ̂ − mean(Xβ̂)]^T [Xβ̂ − mean(Xβ̂)] + [e − ē]^T [e − ē])

SLIDE 44

Variance of the linear regression model: proof

The least squares estimate satisfies this property:

var({y_i}) = var({x_i^T β̂}) + var({ξ_i})

Proof:
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)] + [e − ē])^T ([Xβ̂ − mean(Xβ̂)] + [e − ē])
var[y] = (1/N)([Xβ̂ − mean(Xβ̂)]^T [Xβ̂ − mean(Xβ̂)] + 2[e − ē]^T [Xβ̂ − mean(Xβ̂)] + [e − ē]^T [e − ē])

Because ē = 0, e^T 1 = 0, and e^T X β̂ = 0 (due to the least squares minimization), the cross term vanishes:

var[y] = (1/N)([Xβ̂ − mean(Xβ̂)]^T [Xβ̂ − mean(Xβ̂)] + [e − ē]^T [e − ē])
       = var[X β̂] + var[e]

SLIDE 45

Evaluating models using R-squared

The least squares estimate satisfies this property:

var({y_i}) = var({x_i^T β̂}) + var({ξ_i})

This property gives us an evaluation metric called R-squared:

R² = var({x_i^T β̂}) / var({y_i})

We have 0 ≤ R² ≤ 1, with a larger value meaning a better fit.
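The variance decomposition and the resulting R² can be illustrated on synthetic data (everything below, including the dataset and coefficients, is made up for the check; the constant column is needed so that ē = 0):

```python
import numpy as np

# Synthetic example (not from the slides): offset model with noisy labels
rng = np.random.default_rng(1)
N = 50
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])  # constant column first
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.7, size=N)

beta = np.linalg.solve(X.T @ X, X.T @ y)  # least squares fit
fitted = X @ beta
e = y - fitted

def var(v):
    return np.mean((v - v.mean()) ** 2)

r_squared = var(fitted) / var(y)
print(var(y), var(fitted) + var(e))  # equal up to floating point
print(r_squared)                     # lies between 0 and 1
```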

SLIDE 46

Q: What is R-squared if there is only one explanatory variable in the model?

SLIDE 47

Q: What is R-squared if there is only one explanatory variable in the model?

R-squared would be the correlation coefficient squared (textbook pgs 43-44)
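This identity can be checked numerically: fit a one-variable model with an offset and compare var(fitted)/var(y) to the squared correlation coefficient (synthetic data, added for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=40)
y = 3.0 * x + 1.0 + rng.normal(size=40)

# One explanatory variable plus a constant offset
X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
fitted = X @ beta

def var(v):
    return np.mean((v - v.mean()) ** 2)

r_squared = var(fitted) / var(y)
corr = np.corrcoef(x, y)[0, 1]
print(r_squared, corr ** 2)  # agree up to floating point
```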

SLIDE 48

R-squared examples

SLIDE 49

Comparing our example models

y = x(1)β1 + x(2)β2 + ξ, with β̂ = [2, −1/3]^T:

x(1)  x(2)  y  x^T β̂
1     3     0  1
2     3     2  3
3     6     5  4

y = β0 + x(1)β1 + x(2)β2 + ξ, with β̂ = [−3, 2, 1/3]^T:

1  x(1)  x(2)  y  x^T β̂
1  1     3     0  0
1  2     3     2  2
1  3     6     5  5

SLIDE 50

Linear regression model for the Chicago census data

SLIDE 51

Residual is normally distributed?

The Q-Q plot of the residuals suggests they are roughly normally distributed

SLIDE 52

Prediction for another community

[1] "PERCENT_OF_HOUSING_CROWDED": 4.7
[2] "PERCENT_HOUSEHOLDS_BELOW_POVERTY": 19.7
[3] "PERCENT_AGED_16p_UNEMPLOYED": 12.9
[4] "PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA": 19.5
[5] "PERCENT_AGED_UNDER_18_OR_OVER_64": 33.5
[6] "PER_CAPITA_INCOME": log(28202)

Predicted hardship index: 41.46038
Note: the maximum of the hardship index in the training data is 98, the minimum is 1

SLIDE 53

The clusters of the Chicago communities: clusters and hardship

[Two t-SNE plots: heatmap of clusters of community; hardship index of communities]

SLIDE 54

The clusters of the Chicago communities: per capita income and hardship

[Two t-SNE plots: heatmap of PER_CAPITA_INCOME (log scale); hardship index of communities]
SLIDE 55

The clusters of the Chicago communities: without diploma and hardship

[Two t-SNE plots: PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA; hardship index of communities]

SLIDE 56

Assignments

Read Chapter 13 of the textbook
Next time: More on linear regression

SLIDE 57

[Whiteboard notes: Bayes’ rule. P(θ|D) = P(D|θ) P(θ) / P(D), where P(D) = Σ_θ P(D|θ) P(θ).]
SLIDE 58

Additional References

✺ Robert V. Hogg, Elliot A. Tanis and Dale L. Zimmerman, “Probability and Statistical Inference”
✺ Kevin Murphy, “Machine Learning: A Probabilistic Perspective”

SLIDE 59

See you next time

See You!