SLIDE 1

Probability and Statistics for Computer Science

“All models are wrong, but some models are useful” — George Box

Hongye Liu, Teaching Assistant Professor, CS361, UIUC, 11.17.2020. Credit: Wikipedia

SLIDE 2

Last time

  • Stochastic Gradient Descent
  • Naïve Bayesian Classifier
  • Regression

SLIDE 3

Some popular topics in Ngram

SLIDE 4

Objectives

  • Linear regression definition
  • The least squares solution
  • Training and prediction
  • R-squared for evaluating the fit

SLIDE 5

Regression models are machine learning methods

Regression models have been around for a while.

  • Dr. Kevin Murphy’s machine learning book has 3+ chapters on regression.

SLIDE 6

The regression problem

  • Classification: given features x(1), …, x(d), predict a discrete class label.
  • Regression: given features x(1), …, x(d), predict a continuous value y (e.g. 0.5, 1.56).

SLIDE 7

Chicago social economic census

The census included 77 communities in Chicago. The census evaluated the average hardship index of the residents, and the following parameters for each community:
  • PERCENT_OF_HOUSING_CROWDED
  • PERCENT_HOUSEHOLDS_BELOW_POVERTY
  • PERCENT_AGED_16p_UNEMPLOYED
  • PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA
  • PERCENT_AGED_UNDER_18_OR_OVER_64
  • PER_CAPITA_INCOME

Given a new community and its parameters, can you predict its average hardship index?
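To make the setup concrete, here is a minimal sketch of how one might load such a table and split it into explanatory variables and the dependent variable. The filename census.csv is a hypothetical placeholder, not a file distributed with the course.

```python
# Hypothetical sketch: load the census table and separate X from y.
import pandas as pd

df = pd.read_csv("census.csv")   # hypothetical filename
explanatory = [
    "PERCENT_OF_HOUSING_CROWDED",
    "PERCENT_HOUSEHOLDS_BELOW_POVERTY",
    "PERCENT_AGED_16p_UNEMPLOYED",
    "PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA",
    "PERCENT_AGED_UNDER_18_OR_OVER_64",
    "PER_CAPITA_INCOME",
]
X = df[explanatory].to_numpy()        # one row per community
y = df["HardshipIndex"].to_numpy()    # the value we want to predict
```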
SLIDE 8

Wait, have we seen the linear regression before?

Correlation

[Scatter plot: the correlation between two features, with a fitted line]

SLIDE 9

It’s about relationships between data features

Example: Is the height of people related to their weight?

x: HEIGHT, y: WEIGHT

SLIDE 10

Some terminology

Suppose the dataset consists of N labeled items {(xi, yi)}.

If we represent the dataset as a table, the d columns x(1), …, x(j), …, x(d) are called explanatory variables, and the numerical column {y} is called the dependent variable.

Example table:

  x(1)  x(2)  y
   1     3    0
   2     3    2
   3     6    5

SLIDE 11

Variables of the Chicago census

[1] "PERCENT_OF_HOUSING_CROWDED" [2]"PERCENT_HOUSEHOLDS_BELOW_POVERTY" [3] "PERCENT_AGED_16p_UNEMPLOYED" [4]"PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DI PLOMA" [5] "PERCENT_AGED_UNDER_18_OR_OVER_64" [6]"PER_CAPITA_INCOME" [7] "HardshipIndex"

slide-12
SLIDE 12

Which is the dependent variable in the census example?

  • A. "PERCENT_OF_HOUSING_CROWDED"
  • B. "PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA”
  • C. "HardshipIndex”
  • D. "PERCENT_AGED_UNDER_18_OR_OVER_64"

e

slide-13
SLIDE 13

Linear model

We begin by modeling y as a linear function of the explanatory variables, plus randomness:

y = x(1)β1 + x(2)β2 + ... + x(d)βd + ξ

where ξ is a zero-mean random variable that represents model error.

In vector notation:

y = xTβ + ξ,  with xT = [x(1), x(2), ..., x(d)]

where β is the d-dimensional vector of coefficients that we train.
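As a sanity check on the definition, here is a small sketch that generates synthetic data exactly as the model describes: pick a β, draw x vectors, and add zero-mean noise ξ. The coefficient and noise values are arbitrary choices for illustration.

```python
# Simulate N items from the model y = x^T beta + xi, with d = 2.
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 2
beta = np.array([1.0, -2.0])            # arbitrary "true" coefficients
X = rng.normal(size=(N, d))             # rows are the vectors x^T
xi = rng.normal(scale=0.1, size=N)      # zero-mean model error
y = X @ beta + xi                       # one equation per data item
```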
SLIDE 14

Each data item gives an equation

The model: y = xTβ + ξ = x(1)β1 + x(2)β2 + ξ

Training data:

  x(1)  x(2)  y
   1     3    0
   2     3    2
   3     6    5

Each row gives one equation:

  0 = 1·β1 + 3·β2 + ξ1
  2 = 2·β1 + 3·β2 + ξ2
  5 = 3·β1 + 6·β2 + ξ3

SLIDE 15

Which together form a matrix equation

The model: y = xTβ + ξ = x(1)β1 + x(2)β2 + ξ

Stacking the three training equations:

  ⎡0⎤   ⎡1 3⎤ ⎡β1⎤   ⎡ξ1⎤
  ⎢2⎥ = ⎢2 3⎥ ⎣β2⎦ + ⎢ξ2⎥
  ⎣5⎦   ⎣3 6⎦        ⎣ξ3⎦

where the ξi are zero-mean: E[ξi] = 0.

SLIDE 16

Which together form a matrix equation

The same system in compact form:

y = X · β + e

where y stacks the dependent values, the rows of X are the xiT, and e stacks the model errors.

SLIDE 17
Q. What’s the dimension of matrix X?

  • A. N × d
  • B. d × N
  • C. N × N
  • D. d × d

SLIDE 18

Training the model is to choose β

Given a training dataset {(xi, yi)}, we want to fit the model y = xTβ + ξ.

Define

  y = [y1, ..., yN]T,  X = [x1T; ...; xNT],  e = [ξ1, ..., ξN]T

To train the model, we need to choose β that makes e small in the matrix equation

  y = X · β + e

The squared error ‖e‖2 is the loss (cost) function; minimizing it (least squares) is equivalent to MLE under Gaussian error (textbook pg 309).

SLIDE 19

Training using least squares

In the least squares method, we aim to minimize the squared norm of the error:

  ‖e‖2 = ‖y − Xβ‖2 = (y − Xβ)T(y − Xβ)

Differentiating with respect to β and setting the result to zero gives the normal equations:

  XTXβ − XTy = 0

If XTX is invertible, the least squares estimate of the coefficients is:

  β̂ = (XTX)−1XTy

Equivalently, β̂ = argminβ ‖y − Xβ‖2.
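A direct translation of the estimate into NumPy might look like the sketch below. Solving the normal equations with linalg.solve is numerically preferable to forming the inverse of XTX explicitly.

```python
# Least squares estimate: solve X^T X beta = X^T y for beta.
import numpy as np

def least_squares(X, y):
    # Assumes X^T X is invertible (no zero eigenvalues; see below).
    return np.linalg.solve(X.T @ X, X.T @ y)
```

np.linalg.lstsq(X, y, rcond=None) computes the same estimate more robustly via a factorization.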
SLIDE 20

XTX

XT is d × N and X is N × d, so XTX is d × d.

XTX is symmetric and real-valued; its eigenvalues λi are real and satisfy λi ≥ 0 (it is positive semi-definite).
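These properties are easy to check numerically; a quick sketch with an arbitrary simulated X:

```python
# X^T X is d x d, symmetric, and positive semi-definite.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))         # any N x d matrix
G = X.T @ X
print(G.shape)                        # (2, 2): d x d
print(np.allclose(G, G.T))            # True: symmetric
print(np.linalg.eigvalsh(G))          # real eigenvalues, all >= 0
```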
SLIDE 21

Derivation of the least squares solution

  ‖e‖2 = (y − Xβ)T(y − Xβ)
       = yTy − βTXTy − yTXβ + βTXTXβ
       = yTy − 2βTXTy + βTXTXβ    (yTXβ = βTXTy since both are scalars)

Useful derivatives involving vectors and matrices (a, b are vectors; A is a square matrix):

  ∂(aTAa)/∂a = (A + AT)a
  ∂(bTa)/∂a = b

Therefore, since XTX is symmetric ((XTX)T = XTX):

  ∂‖e‖2/∂β = −2XTy + (XTX + (XTX)T)β = −2XTy + 2XTXβ

Setting this to zero:

  XTXβ = XTy  ⇒  β̂ = (XTX)−1XTy    (here y is a vector)

SLIDE 22

Derivation of the least squares solution (continued)

From the normal equations:

  XTy − XTXβ̂ = 0  ⇒  XT(y − Xβ̂) = 0  ⇒  XTe = 0    (a d × 1 vector of zeros)

Transposing gives eTX = 0, and hence eTXβ̂ = 0:

the residual e is orthogonal to (uncorrelated with) the fitted values Xβ̂.
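This orthogonality can be confirmed numerically on the toy training table from the earlier slides:

```python
# Check the first-order condition X^T e = 0 on the toy training data.
import numpy as np

X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
y = np.array([0.0, 2.0, 5.0])
beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta
print(X.T @ e)    # ~ [0, 0]: residual orthogonal to the columns of X
```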
SLIDE 23

Least squares loss function

The loss ‖e‖2 decomposes into one term per data item:

  ‖e‖2 = Σj Qj(β),  where Qj(β) = (xjTβ − yj)2

(as in the final project)
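Because the loss is a sum of per-item terms Qj(β), it has exactly the form stochastic gradient descent (from last lecture) expects. A minimal sketch on the toy data above, with an arbitrary small learning rate:

```python
# SGD on the least squares loss: at each step, follow the gradient of a
# single randomly chosen term Q_j(beta) = (x_j^T beta - y_j)^2.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
y = np.array([0.0, 2.0, 5.0])

beta = np.zeros(2)
for step in range(20000):
    j = rng.integers(len(y))                    # pick one item at random
    grad = 2.0 * (X[j] @ beta - y[j]) * X[j]    # gradient of Q_j
    beta -= 0.01 * grad                         # small constant step
print(beta)   # fluctuates near the least squares solution [2, -1/3]
```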
SLIDE 24

Convex set and convex function

If a set is convex, any line segment connecting two points in the set is completely included in the set.

A convex function is one whose epigraph (the area above the curve) is convex:

  f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)  for all λ ∈ [0, 1]

The least squares loss function is convex.

Credit: Dr. Kevin Murphy

SLIDE 25

What’s the dimension of matrix XTX?

  • A. N × d
  • B. d × N
  • C. N × N
  • D. d × d

X is N × d and XT is d × N, so XTX is d × d: answer D. (Here d is the number of features, i.e. explanatory variables.)

SLIDE 26

Is this statement true?

"If the matrix XTX does NOT have zero-valued eigenvalues, it is invertible."

  • A. TRUE
  • B. FALSE

TRUE: the determinant of XTX is the product of its eigenvalues, so if λi ≠ 0 for all i, then det(XTX) ≠ 0 and XTX is invertible.

SLIDE 27

Training using least squares example

Model: y = xTβ + ξ = x(1)β1 + x(2)β2 + ξ

Training data:

  x(1)  x(2)  y
   1     3    0
   2     3    2
   3     6    5

β̂ = (XTX)−1XTy gives β1 = 2, β2 = −1/3.
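The same numbers fall out of a few lines of NumPy, as a check of the worked example:

```python
# Reproduce the worked example: beta = (X^T X)^{-1} X^T y.
import numpy as np

X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
y = np.array([0.0, 2.0, 5.0])
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)   # [2.0, -0.3333...]: beta1 = 2, beta2 = -1/3
```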
SLIDE 28

Prediction

If we have trained the model coefficients β, we can predict yp from a new input x0:

  yp = x0Tβ

In the model y = x(1)β1 + x(2)β2 + ξ with β = (2, −1/3)T, the prediction for x0 = (2, 1)T is

  yp = 2 · 2 + 1 · (−1/3) = 4 − 1/3 = 11/3
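Prediction is a single inner product; continuing the example in NumPy:

```python
# Predict y for a new input x0 using the trained coefficients.
import numpy as np

beta = np.array([2.0, -1.0 / 3.0])   # trained coefficients from above
x0 = np.array([2.0, 1.0])
print(x0 @ beta)                     # 3.6667 = 11/3
```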
SLIDE 29

A linear model with constant offset

The problem with the model

  y = x(1)β1 + x(2)β2 + ξ

is that its prediction is forced through the origin: when every x(j) is 0, the predicted y is 0.

Let's add a constant offset β0 to the model:

  y = β0 + x(1)β1 + x(2)β2 + ξ

SLIDE 30

Training and prediction with constant offset

The model: y = β0 + x(1)β1 + x(2)β2 + ξ = xTβ + ξ, where xT = [1, x(1), x(2)].

Training data (with the constant column):

  1  x(1)  x(2)  y
  1   1     3    0
  1   2     3    2
  1   3     6    5

β̂ = (XTX)−1XTy = (−3, 2, 1/3)T

For a new input x0, the prediction is yp = (1, x0T) β̂.
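In code, the offset is handled by prepending a column of ones to X, so β0 is estimated like any other coefficient. A sketch on the toy data; the new-input value x0 is reused from the earlier prediction slide purely for illustration:

```python
# Least squares with a constant offset: prepend a column of ones.
import numpy as np

X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
y = np.array([0.0, 2.0, 5.0])
X1 = np.column_stack([np.ones(len(y)), X])      # rows are [1, x(1), x(2)]
beta = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(beta)                                     # [-3, 2, 0.3333...]

x0 = np.array([2.0, 1.0])                       # illustrative new input
print(np.concatenate(([1.0], x0)) @ beta)       # -3 + 2*2 + 1/3 = 4/3
```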

SLIDE 31

Comparing our example models

Without the offset, y = x(1)β1 + x(2)β2 + ξ with β̂ = (2, −1/3)T:

  x(1)  x(2)  y   xTβ̂
   1     3    0    1
   2     3    2    3
   3     6    5    4

With the offset, y = β0 + x(1)β1 + x(2)β2 + ξ with β̂ = (−3, 2, 1/3)T:

  1  x(1)  x(2)  y   xTβ̂
  1   1     3    0    0
  1   2     3    2    2
  1   3     6    5    5

The offset model fits this training data exactly.

SLIDE 32

Variance of the linear regression model

The least squares estimate satisfies this property:

  var({yi}) = var({xiTβ̂}) + var({ξi})

The random error is uncorrelated with the least squares fitted values, i.e. with the linear combination xiTβ̂ of the explanatory variables (recall XTe = 0).

SLIDE 33

Variance of the linear regression model: proof

The least squares estimate satisfies:

  var({yi}) = var({xiTβ̂}) + var({ξi})

Proof idea: since y = Xβ̂ + e,

  var(y) = var(Xβ̂) + var(e) + 2 cov(Xβ̂, e)

and cov(Xβ̂, e) = 0 because the residual is uncorrelated with the fitted values.

SLIDE 34

Variance of the linear regression model: proof

The least squares estimate satisfies:

  var({yi}) = var({xiTβ̂}) + var({ξi})

Proof: subtracting means componentwise (write m for the mean of the fitted values Xβ̂ and ē for the mean residual),

  var[y] = (1/N) ([Xβ̂ − m] + [e − ē])T ([Xβ̂ − m] + [e − ē])
         = (1/N) ([Xβ̂ − m]T[Xβ̂ − m] + 2[e − ē]T[Xβ̂ − m] + [e − ē]T[e − ē])

Because eT1 = 0 (so ē = 0) and eTXβ̂ = 0, the cross term vanishes:

  var[y] = (1/N) ([Xβ̂ − m]T[Xβ̂ − m] + [e − ē]T[e − ē]) = var[Xβ̂] + var[e]
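The decomposition can be checked numerically; note the model must include the offset so that the residuals have zero mean. A sketch with arbitrary simulated values:

```python
# Check var(y) = var(X beta) + var(e) for a least squares fit with offset.
import numpy as np

rng = np.random.default_rng(1)
N = 200
X1 = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X1 @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=N)

beta = np.linalg.solve(X1.T @ X1, X1.T @ y)
y_hat = X1 @ beta
e = y - y_hat
print(np.var(y), np.var(y_hat) + np.var(e))   # equal up to rounding
```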
SLIDE 35

Evaluating models using R-squared

The least squares estimate satisfies var({yi}) = var({xiTβ̂}) + var({ξi}). This property gives us an evaluation metric called R-squared:

  R2 = var({xiTβ̂}) / var({yi})

We have 0 ≤ R2 ≤ 1, with a larger value meaning a better fit.
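Computed directly from its definition, again on arbitrary simulated data:

```python
# R^2 = var(fitted values) / var(y) for a least squares fit with offset.
import numpy as np

rng = np.random.default_rng(1)
X1 = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X1 @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=200)

beta = np.linalg.solve(X1.T @ X1, X1.T @ y)
R2 = np.var(X1 @ beta) / np.var(y)
print(R2)    # between 0 and 1; larger means a better fit
```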
SLIDE 36

Q: What is R-squared if there is only one explanatory variable in the model?

Hint: with d = 1, X is N × 1; R2 turns out to be r2, where r is the correlation coefficient.

SLIDE 37

Q: What is R-squared if there is only one explanatory variable in the model?

In standard coordinates (both variables normalized to mean 0 and variance 1), the fitted line is ŷ = r x̂, so

  var({ŷi}) = r2 var({x̂i}) = r2  and  var({yi}) = 1

  R2 = var({ŷi}) / var({yi}) = r2

SLIDE 38

Q: What is R-squared if there is only one explanatory variable in the model?

R-squared would be the correlation coefficient squared (textbook pgs 43-44).
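This is easy to verify numerically for a single explanatory variable plus offset, with arbitrary simulated data:

```python
# With one explanatory variable plus an offset, R^2 equals r^2.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.5 * x + rng.normal(scale=0.5, size=100)

X1 = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X1.T @ X1, X1.T @ y)
R2 = np.var(X1 @ beta) / np.var(y)
r = np.corrcoef(x, y)[0, 1]
print(R2, r**2)   # the two values agree up to rounding
```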
SLIDE 39

R-squared examples

[Scatter plot examples showing fits with different correlation values, e.g. r ≈ 0.8]

SLIDE 40

Linear regression model for the Chicago census data

(Degrees of freedom: N − d*, where d* = number of explanatory variables + 1 for the intercept.)

SLIDE 41

Residual is normally distributed?

The Q-Q plot of the residuals {ei} is roughly a straight line, so the residuals are roughly normally distributed. Also E[ei] = 0, and corr(e, Xβ̂) is small.

[Q-Q plot: sample quantiles of the residuals against theoretical normal quantiles]
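One way to draw such a Q-Q plot, assuming the residuals are available as an array; here stand-in values are simulated, since the actual residuals come from the census fit:

```python
# Q-Q plot of residuals against the normal distribution.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
e = rng.normal(size=77)          # stand-in for the 77 fitted residuals

stats.probplot(e, dist="norm", plot=plt)   # roughly a line => roughly normal
plt.title("Q-Q plot of residuals")
plt.show()
```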

SLIDE 42

Prediction for another community

[1] "PERCENT_OF_HOUSING_CROWDED" [2]"PERCENT_HOUSEHOLDS_BELOW_POVERTY " [3] "PERCENT_AGED_16p_UNEMPLOYED" [4]"PERCENT_AGED_25p_WITHOUT_HIGH_SC HOOL_DIPLOMA" [5] "PERCENT_AGED_UNDER_18_OR_OVER_64" [6]"PER_CAPITA_INCOME" 4.7 19.7 12.9 19.5 33.5 Log(28202) Predicted hardship index: 41.46038 Note: maximum of hardship index in the training data is 98, minimum is 1
slide-43
SLIDE 43

The clusters of the Chicago communities: clusters and hardship

[Two t-SNE panels (tSNE1 vs tSNE2): left, communities colored by cluster (1-6); right, communities colored by hardship index (25-75).]

SLIDE 44

The clusters of the Chicago communities: per capita income and hardship

[Two t-SNE panels: left, heatmap of PER_CAPITA_INCOME (log scale); right, hardship index of communities.]

SLIDE 45

The clusters of the Chicago communities: without diploma and hardship

[Two t-SNE panels: left, PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA; right, hardship index of communities.]

SLIDE 46

Assignments

Read Chapter 13 of the textbook.

Next time: more on linear regression.

SLIDE 47

Additional References

✺ Robert V. Hogg, Elliot A. Tanis and Dale L. Zimmerman, "Probability and Statistical Inference"

✺ Kevin Murphy, "Machine Learning: A Probabilistic Perspective"

SLIDE 48

See you next time!