SLIDE 1

Probability and Statistics for Computer Science

“All models are wrong, but some models are useful” — George Box

Hongye Liu, Teaching Assistant Professor, CS361, UIUC, 11.17.2020. Credit: Wikipedia

SLIDE 2

Last time

  • Stochastic Gradient Descent
  • Naïve Bayesian Classifier
  • Regression

SLIDE 3

Some popular topics in Ngram

SLIDE 4

Objectives

  • Linear regression definition
  • The least squares solution
  • Training and prediction
  • R-squared for evaluating the fit

SLIDE 5

Regression models are machine learning methods

Regression models have been around for a while.

  • Dr. Kevin Murphy’s machine learning book has 3+ chapters on regression.

SLIDE 6

The regression problem

  • Classification: given features x(1), …, x(d), predict a discrete class label.
  • Regression: given features x(1), …, x(d), predict a continuous value y (e.g. 0.5, 1.56).

SLIDE 7

Chicago social economic census

The census included 77 communities in Chicago. The census evaluated the average hardship index of the residents, and the following parameters for each community:
  • PERCENT_OF_HOUSING_CROWDED
  • PERCENT_HOUSEHOLDS_BELOW_POVERTY
  • PERCENT_AGED_16p_UNEMPLOYED
  • PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA
  • PERCENT_AGED_UNDER_18_OR_OVER_64
  • PER_CAPITA_INCOME

Given a new community and its parameters, can you predict its average hardship index?
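To make the setup concrete, here is a minimal sketch of how one might load such a table and split it into explanatory variables and the dependent variable. The filename census.csv is a hypothetical placeholder, not a file distributed with the course.

```python
# Hypothetical sketch: load the census table and separate X from y.
import pandas as pd

df = pd.read_csv("census.csv")   # hypothetical filename
explanatory = [
    "PERCENT_OF_HOUSING_CROWDED",
    "PERCENT_HOUSEHOLDS_BELOW_POVERTY",
    "PERCENT_AGED_16p_UNEMPLOYED",
    "PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA",
    "PERCENT_AGED_UNDER_18_OR_OVER_64",
    "PER_CAPITA_INCOME",
]
X = df[explanatory].to_numpy()        # one row per community
y = df["HardshipIndex"].to_numpy()    # the value we want to predict
```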
SLIDE 8

Wait, have we seen the linear regression before?

Correlation

[Scatter plot: the correlation between two features, with a fitted line]

SLIDE 9

It’s about relationships between data features

Example: Is the height of people related to their weight?

x: HEIGHT, y: WEIGHT

SLIDE 10

Some terminology

Suppose the dataset consists of N labeled items {(xi, yi)}.

If we represent the dataset as a table, the d columns x(1), …, x(j), …, x(d) are called explanatory variables, and the numerical column {y} is called the dependent variable.

Example table:

  x(1)  x(2)  y
   1     3    0
   2     3    2
   3     6    5

SLIDE 11

Variables of the Chicago census

[1] "PERCENT_OF_HOUSING_CROWDED" [2]"PERCENT_HOUSEHOLDS_BELOW_POVERTY" [3] "PERCENT_AGED_16p_UNEMPLOYED" [4]"PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DI PLOMA" [5] "PERCENT_AGED_UNDER_18_OR_OVER_64" [6]"PER_CAPITA_INCOME" [7] "HardshipIndex"

slide-12
SLIDE 12

Which is the dependent variable in the census example?

  • A. "PERCENT_OF_HOUSING_CROWDED"
  • B. "PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA”
  • C. "HardshipIndex”
  • D. "PERCENT_AGED_UNDER_18_OR_OVER_64"

e

slide-13
SLIDE 13

Linear model

We begin by modeling y as a linear function of the explanatory variables, plus randomness:

y = x(1)β1 + x(2)β2 + ... + x(d)βd + ξ

where ξ is a zero-mean random variable that represents model error.

In vector notation:

y = xTβ + ξ,  with xT = [x(1), x(2), ..., x(d)]

where β is the d-dimensional vector of coefficients that we train.
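As a sanity check on the definition, here is a small sketch that generates synthetic data exactly as the model describes: pick a β, draw x vectors, and add zero-mean noise ξ. The coefficient and noise values are arbitrary choices for illustration.

```python
# Simulate N items from the model y = x^T beta + xi, with d = 2.
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 2
beta = np.array([1.0, -2.0])            # arbitrary "true" coefficients
X = rng.normal(size=(N, d))             # rows are the vectors x^T
xi = rng.normal(scale=0.1, size=N)      # zero-mean model error
y = X @ beta + xi                       # one equation per data item
```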
SLIDE 14

Each data item gives an equation

The model: y = xTβ + ξ = x(1)β1 + x(2)β2 + ξ

Training data:

  x(1)  x(2)  y
   1     3    0
   2     3    2
   3     6    5

Each row gives one equation:

  0 = 1·β1 + 3·β2 + ξ1
  2 = 2·β1 + 3·β2 + ξ2
  5 = 3·β1 + 6·β2 + ξ3

SLIDE 15

Which together form a matrix equation

The model: y = xTβ + ξ = x(1)β1 + x(2)β2 + ξ

Stacking the three training equations:

  ⎡0⎤   ⎡1 3⎤ ⎡β1⎤   ⎡ξ1⎤
  ⎢2⎥ = ⎢2 3⎥ ⎣β2⎦ + ⎢ξ2⎥
  ⎣5⎦   ⎣3 6⎦        ⎣ξ3⎦

where the ξi are zero-mean: E[ξi] = 0.

SLIDE 16

Which together form a matrix equation

The same system in compact form:

y = X · β + e

where y stacks the dependent values, the rows of X are the xiT, and e stacks the model errors.

SLIDE 17
Q. What’s the dimension of matrix X?

  • A. N × d
  • B. d × N
  • C. N × N
  • D. d × d

SLIDE 18

Training the model is to choose β

Given a training dataset {(xi, yi)}, we want to fit the model y = xTβ + ξ.

Define

  y = [y1, ..., yN]T,  X = [x1T; ...; xNT],  e = [ξ1, ..., ξN]T

To train the model, we need to choose β that makes e small in the matrix equation

  y = X · β + e

The squared error ‖e‖2 is the loss (cost) function; minimizing it (least squares) is equivalent to MLE under Gaussian error (textbook pg 309).

SLIDE 19

Training using least squares

In the least squares method, we aim to minimize the squared norm of the error:

  ‖e‖2 = ‖y − Xβ‖2 = (y − Xβ)T(y − Xβ)

Differentiating with respect to β and setting the result to zero gives the normal equations:

  XTXβ − XTy = 0

If XTX is invertible, the least squares estimate of the coefficients is:

  β̂ = (XTX)−1XTy

Equivalently, β̂ = argminβ ‖y − Xβ‖2.
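A direct translation of the estimate into NumPy might look like the sketch below. Solving the normal equations with linalg.solve is numerically preferable to forming the inverse of XTX explicitly.

```python
# Least squares estimate: solve X^T X beta = X^T y for beta.
import numpy as np

def least_squares(X, y):
    # Assumes X^T X is invertible (no zero eigenvalues; see below).
    return np.linalg.solve(X.T @ X, X.T @ y)
```

np.linalg.lstsq(X, y, rcond=None) computes the same estimate more robustly via a factorization.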
SLIDE 20

XTX

XT is d × N and X is N × d, so XTX is d × d.

XTX is symmetric and real-valued; its eigenvalues λi are real and satisfy λi ≥ 0 (it is positive semi-definite).
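These properties are easy to check numerically; a quick sketch with an arbitrary simulated X:

```python
# X^T X is d x d, symmetric, and positive semi-definite.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))         # any N x d matrix
G = X.T @ X
print(G.shape)                        # (2, 2): d x d
print(np.allclose(G, G.T))            # True: symmetric
print(np.linalg.eigvalsh(G))          # real eigenvalues, all >= 0
```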
SLIDE 21

Derivation of the least squares solution

  ‖e‖2 = (y − Xβ)T(y − Xβ)
       = yTy − βTXTy − yTXβ + βTXTXβ
       = yTy − 2βTXTy + βTXTXβ    (yTXβ = βTXTy since both are scalars)

Useful derivatives involving vectors and matrices (a, b are vectors; A is a square matrix):

  ∂(aTAa)/∂a = (A + AT)a
  ∂(bTa)/∂a = b

Therefore, since XTX is symmetric ((XTX)T = XTX):

  ∂‖e‖2/∂β = −2XTy + (XTX + (XTX)T)β = −2XTy + 2XTXβ

Setting this to zero:

  XTXβ = XTy  ⇒  β̂ = (XTX)−1XTy    (here y is a vector)

SLIDE 22

Derivation of the least squares solution (continued)

From the normal equations:

  XTy − XTXβ̂ = 0  ⇒  XT(y − Xβ̂) = 0  ⇒  XTe = 0    (a d × 1 vector of zeros)

Transposing gives eTX = 0, and hence eTXβ̂ = 0:

the residual e is orthogonal to (uncorrelated with) the fitted values Xβ̂.
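This orthogonality can be confirmed numerically on the toy training table from the earlier slides:

```python
# Check the first-order condition X^T e = 0 on the toy training data.
import numpy as np

X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
y = np.array([0.0, 2.0, 5.0])
beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta
print(X.T @ e)    # ~ [0, 0]: residual orthogonal to the columns of X
```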
SLIDE 23

Least squares loss function

The loss ‖e‖2 decomposes into one term per data item:

  ‖e‖2 = Σj Qj(β),  where Qj(β) = (xjTβ − yj)2

(as in the final project)
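Because the loss is a sum of per-item terms Qj(β), it has exactly the form stochastic gradient descent (from last lecture) expects. A minimal sketch on the toy data above, with an arbitrary small learning rate:

```python
# SGD on the least squares loss: at each step, follow the gradient of a
# single randomly chosen term Q_j(beta) = (x_j^T beta - y_j)^2.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
y = np.array([0.0, 2.0, 5.0])

beta = np.zeros(2)
for step in range(20000):
    j = rng.integers(len(y))                    # pick one item at random
    grad = 2.0 * (X[j] @ beta - y[j]) * X[j]    # gradient of Q_j
    beta -= 0.01 * grad                         # small constant step
print(beta)   # fluctuates near the least squares solution [2, -1/3]
```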
SLIDE 24

Convex set and convex function

If a set is convex, any line segment connecting two points in the set is completely included in the set.

A convex function is one whose epigraph (the area above the curve) is convex:

  f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)  for all λ ∈ [0, 1]

The least squares loss function is convex.

Credit: Dr. Kevin Murphy

SLIDE 25

What’s the dimension of matrix XTX?

  • A. N × d
  • B. d × N
  • C. N × N
  • D. d × d

X is N × d and XT is d × N, so XTX is d × d: answer D. (Here d is the number of features, i.e. explanatory variables.)

SLIDE 26

Is this statement true?

"If the matrix XTX does NOT have zero-valued eigenvalues, it is invertible."

  • A. TRUE
  • B. FALSE

TRUE: the determinant of XTX is the product of its eigenvalues, so if λi ≠ 0 for all i, then det(XTX) ≠ 0 and XTX is invertible.

SLIDE 27

Training using least squares example

Model: y = xTβ + ξ = x(1)β1 + x(2)β2 + ξ

Training data:

  x(1)  x(2)  y
   1     3    0
   2     3    2
   3     6    5

β̂ = (XTX)−1XTy gives β1 = 2, β2 = −1/3.
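The same numbers fall out of a few lines of NumPy, as a check of the worked example:

```python
# Reproduce the worked example: beta = (X^T X)^{-1} X^T y.
import numpy as np

X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
y = np.array([0.0, 2.0, 5.0])
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)   # [2.0, -0.3333...]: beta1 = 2, beta2 = -1/3
```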
SLIDE 28

Prediction

If we have trained the model coefficients β, we can predict yp from a new input x0:

  yp = x0Tβ

In the model y = x(1)β1 + x(2)β2 + ξ with β = (2, −1/3)T, the prediction for x0 = (2, 1)T is

  yp = 2 · 2 + 1 · (−1/3) = 4 − 1/3 = 11/3
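Prediction is a single inner product; continuing the example in NumPy:

```python
# Predict y for a new input x0 using the trained coefficients.
import numpy as np

beta = np.array([2.0, -1.0 / 3.0])   # trained coefficients from above
x0 = np.array([2.0, 1.0])
print(x0 @ beta)                     # 3.6667 = 11/3
```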
SLIDE 29

A linear model with constant offset

The problem with the model

  y = x(1)β1 + x(2)β2 + ξ

is that its prediction is forced through the origin: when every x(j) is 0, the predicted y is 0.

Let's add a constant offset β0 to the model:

  y = β0 + x(1)β1 + x(2)β2 + ξ

SLIDE 30

Training and prediction with constant offset

The model: y = β0 + x(1)β1 + x(2)β2 + ξ = xTβ + ξ, where xT = [1, x(1), x(2)].

Training data (with the constant column):

  1  x(1)  x(2)  y
  1   1     3    0
  1   2     3    2
  1   3     6    5

β̂ = (XTX)−1XTy = (−3, 2, 1/3)T

For a new input x0, the prediction is yp = (1, x0T) β̂.
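In code, the offset is handled by prepending a column of ones to X, so β0 is estimated like any other coefficient. A sketch on the toy data; the new-input value x0 is reused from the earlier prediction slide purely for illustration:

```python
# Least squares with a constant offset: prepend a column of ones.
import numpy as np

X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
y = np.array([0.0, 2.0, 5.0])
X1 = np.column_stack([np.ones(len(y)), X])      # rows are [1, x(1), x(2)]
beta = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(beta)                                     # [-3, 2, 0.3333...]

x0 = np.array([2.0, 1.0])                       # illustrative new input
print(np.concatenate(([1.0], x0)) @ beta)       # -3 + 2*2 + 1/3 = 4/3
```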

SLIDE 31

Comparing our example models

Without the offset, y = x(1)β1 + x(2)β2 + ξ with β̂ = (2, −1/3)T:

  x(1)  x(2)  y   xTβ̂
   1     3    0    1
   2     3    2    3
   3     6    5    4

With the offset, y = β0 + x(1)β1 + x(2)β2 + ξ with β̂ = (−3, 2, 1/3)T:

  1  x(1)  x(2)  y   xTβ̂
  1   1     3    0    0
  1   2     3    2    2
  1   3     6    5    5

The offset model fits this training data exactly.

SLIDE 32

Variance of the linear regression model

The least squares estimate satisfies this property:

  var({yi}) = var({xiTβ̂}) + var({ξi})

The random error is uncorrelated with the least squares fitted values, i.e. with the linear combination xiTβ̂ of the explanatory variables (recall XTe = 0).

SLIDE 33

Variance of the linear regression model: proof

The least squares estimate satisfies:

  var({yi}) = var({xiTβ̂}) + var({ξi})

Proof idea: since y = Xβ̂ + e,

  var(y) = var(Xβ̂) + var(e) + 2 cov(Xβ̂, e)

and cov(Xβ̂, e) = 0 because the residual is uncorrelated with the fitted values.

SLIDE 34

Variance of the linear regression model: proof

The least squares estimate satisfies:

  var({yi}) = var({xiTβ̂}) + var({ξi})

Proof: subtracting means componentwise (write m for the mean of the fitted values Xβ̂ and ē for the mean residual),

  var[y] = (1/N) ([Xβ̂ − m] + [e − ē])T ([Xβ̂ − m] + [e − ē])
         = (1/N) ([Xβ̂ − m]T[Xβ̂ − m] + 2[e − ē]T[Xβ̂ − m] + [e − ē]T[e − ē])

Because eT1 = 0 (so ē = 0) and eTXβ̂ = 0, the cross term vanishes:

  var[y] = (1/N) ([Xβ̂ − m]T[Xβ̂ − m] + [e − ē]T[e − ē]) = var[Xβ̂] + var[e]
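The decomposition can be checked numerically; note the model must include the offset so that the residuals have zero mean. A sketch with arbitrary simulated values:

```python
# Check var(y) = var(X beta) + var(e) for a least squares fit with offset.
import numpy as np

rng = np.random.default_rng(1)
N = 200
X1 = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X1 @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=N)

beta = np.linalg.solve(X1.T @ X1, X1.T @ y)
y_hat = X1 @ beta
e = y - y_hat
print(np.var(y), np.var(y_hat) + np.var(e))   # equal up to rounding
```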
SLIDE 35

Evaluating models using R-squared

The least squares estimate satisfies var({yi}) = var({xiTβ̂}) + var({ξi}). This property gives us an evaluation metric called R-squared:

  R2 = var({xiTβ̂}) / var({yi})

We have 0 ≤ R2 ≤ 1, with a larger value meaning a better fit.
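Computed directly from its definition, again on arbitrary simulated data:

```python
# R^2 = var(fitted values) / var(y) for a least squares fit with offset.
import numpy as np

rng = np.random.default_rng(1)
X1 = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X1 @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=200)

beta = np.linalg.solve(X1.T @ X1, X1.T @ y)
R2 = np.var(X1 @ beta) / np.var(y)
print(R2)    # between 0 and 1; larger means a better fit
```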
SLIDE 36

Q: What is R-squared if there is only one explanatory variable in the model?

Hint: with d = 1, X is N × 1; R2 turns out to be r2, where r is the correlation coefficient.

SLIDE 37

Q: What is R-squared if there is only one explanatory variable in the model?

In standard coordinates (both variables normalized to mean 0 and variance 1), the fitted line is ŷ = r x̂, so

  var({ŷi}) = r2 var({x̂i}) = r2  and  var({yi}) = 1

  R2 = var({ŷi}) / var({yi}) = r2

SLIDE 38

Q: What is R-squared if there is only one explanatory variable in the model?

R-squared would be the correlation coefficient squared (textbook pgs 43-44).
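This is easy to verify numerically for a single explanatory variable plus offset, with arbitrary simulated data:

```python
# With one explanatory variable plus an offset, R^2 equals r^2.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.5 * x + rng.normal(scale=0.5, size=100)

X1 = np.column_stack([np.ones_like(x), x])
beta = np.linalg.solve(X1.T @ X1, X1.T @ y)
R2 = np.var(X1 @ beta) / np.var(y)
r = np.corrcoef(x, y)[0, 1]
print(R2, r**2)   # the two values agree up to rounding
```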
SLIDE 39

R-squared examples

[Scatter plot examples showing fits with different correlation values, e.g. r ≈ 0.8]

SLIDE 40

Linear regression model for the Chicago census data

(Degrees of freedom: N − d*, where d* = number of explanatory variables + 1 for the intercept.)

SLIDE 41

Residual is normally distributed?

The Q-Q plot of the residuals {ei} is roughly a straight line, so the residuals are roughly normally distributed. Also E[ei] = 0, and corr(e, Xβ̂) is small.

[Q-Q plot: sample quantiles of the residuals against theoretical normal quantiles]
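One way to draw such a Q-Q plot, assuming the residuals are available as an array; here stand-in values are simulated, since the actual residuals come from the census fit:

```python
# Q-Q plot of residuals against the normal distribution.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
e = rng.normal(size=77)          # stand-in for the 77 fitted residuals

stats.probplot(e, dist="norm", plot=plt)   # roughly a line => roughly normal
plt.title("Q-Q plot of residuals")
plt.show()
```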

SLIDE 42

Prediction for another community

[1] "PERCENT_OF_HOUSING_CROWDED" [2]"PERCENT_HOUSEHOLDS_BELOW_POVERTY " [3] "PERCENT_AGED_16p_UNEMPLOYED" [4]"PERCENT_AGED_25p_WITHOUT_HIGH_SC HOOL_DIPLOMA" [5] "PERCENT_AGED_UNDER_18_OR_OVER_64" [6]"PER_CAPITA_INCOME" 4.7 19.7 12.9 19.5 33.5 Log(28202) Predicted hardship index: 41.46038 Note: maximum of hardship index in the training data is 98, minimum is 1
slide-43
SLIDE 43

The clusters of the Chicago communities: clusters and hardship

[Two t-SNE panels (tSNE1 vs tSNE2): left, communities colored by cluster (1-6); right, communities colored by hardship index (25-75).]

SLIDE 44

The clusters of the Chicago communities: per capita income and hardship

[Two t-SNE panels: left, heatmap of PER_CAPITA_INCOME (log scale); right, hardship index of communities.]

SLIDE 45

The clusters of the Chicago communities: without diploma and hardship

[Two t-SNE panels: left, PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA; right, hardship index of communities.]

SLIDE 46

Assignments

Read Chapter 13 of the textbook.

Next time: more on linear regression.

SLIDE 47

Additional References

✺ Robert V. Hogg, Elliot A. Tanis and Dale L. Zimmerman, "Probability and Statistical Inference"

✺ Kevin Murphy, "Machine Learning: A Probabilistic Perspective"

SLIDE 48

See you next time!