
Probability and Statistics for Computer Science - PowerPoint PPT Presentation



  1. Probability and Statistics for Computer Science. "All models are wrong, but some models are useful" --- George Box. Credit: Wikipedia. Hongye Liu, Teaching Assistant Prof, CS361, UIUC, 11.17.2020

  2. Last time: Stochastic Gradient Descent; Naïve Bayesian Classifier. (Classification so far; today we turn to regression.)

  3. Some popular topics in Ngram

  4. Objectives: the definition of linear regression; the least squares solution; training and prediction; evaluating the fit with R-squared.

  5. Regression models are machine learning methods. Regression models have been around for a while. Dr. Kevin Murphy's Machine Learning book has 3+ chapters on regression.

  6. The regression problem: given a dataset {(x, y)} of d-dimensional inputs x and outputs y, classification predicts a discrete label for a new x, whereas regression predicts a real-valued y^p for a new x.

  7. Chicago socioeconomic census. The census included 77 communities in Chicago and evaluated the average hardship index of the residents. It also recorded the following parameters for each community: PERCENT_OF_HOUSING_CROWDED, PERCENT_HOUSEHOLDS_BELOW_POVERTY, PERCENT_AGED_16p_UNEMPLOYED, PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA, PERCENT_AGED_UNDER_18_OR_OVER_64, PER_CAPITA_INCOME. Given a new community and all these parameters, can you predict its average hardship index?

  8. Wait, have we seen linear regression before? (Recall correlation and the scatter plot of two variables x and y.)

  9. It's about the relationship between data features. Example: is the height of people related to their weight? x: HEIGHT, y: WEIGHT.

  10. Some terminology. Suppose the dataset consists of N labeled items {(x_i, y_i)}. If we represent the dataset as a table, the d columns representing x^(j) are called explanatory variables, and the numerical column y is called the dependent variable. Example:
      x^(1)  x^(2)  y
      1      3      0
      2      3      2
      3      6      5

  11. Variables of the Chicago census:
      [1] "PERCENT_OF_HOUSING_CROWDED"
      [2] "PERCENT_HOUSEHOLDS_BELOW_POVERTY"
      [3] "PERCENT_AGED_16p_UNEMPLOYED"
      [4] "PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA"
      [5] "PERCENT_AGED_UNDER_18_OR_OVER_64"
      [6] "PER_CAPITA_INCOME"
      [7] "HardshipIndex"

  12. Which is the dependent variable in the census example? A. "PERCENT_OF_HOUSING_CROWDED"  B. "PERCENT_AGED_25p_WITHOUT_HIGH_SCHOOL_DIPLOMA"  C. "HardshipIndex"  D. "PERCENT_AGED_UNDER_18_OR_OVER_64"  (Answer: C.)

  13. Linear model. We begin by modeling y as a linear function of the x^(j) plus randomness:
      y = x^(1) β_1 + x^(2) β_2 + ... + x^(d) β_d + ξ
      where ξ is a zero-mean random variable that represents model error. In vector notation, with x^T = [x^(1) ... x^(d)]:
      y = x^T β + ξ
      where β is the d-dimensional vector of coefficients that we train. Example data:
      x^(1)  x^(2)  y
      1      3      0
      2      3      2
      3      6      5
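
To make the model concrete, here is a minimal sketch (assuming NumPy; the coefficient values, noise scale, and sample size are made up for illustration, not taken from the slides) of generating data from y = x^T β + ξ:

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 100, 2
    beta_true = np.array([2.0, -0.5])     # hypothetical coefficients
    X = rng.uniform(0, 10, size=(N, d))   # N data items, d explanatory variables
    xi = rng.normal(0.0, 1.0, size=N)     # zero-mean random model error
    y = X @ beta_true + xi                # y_i = x_i^T beta + xi_i for every item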

  14. Each data item gives an equation. The model: y = x^T β + ξ = x^(1) β_1 + x^(2) β_2 + ξ. Each row of the training data gives one equation:
      x^(1)  x^(2)  y
      1      3      0        0 = 1·β_1 + 3·β_2 + ξ_1
      2      3      2        2 = 2·β_1 + 3·β_2 + ξ_2
      3      6      5        5 = 3·β_1 + 6·β_2 + ξ_3

  15. Which together form a matrix equation. The model y = x^T β + ξ = x^(1) β_1 + x^(2) β_2 + ξ, applied to each row of the training data, stacks into one matrix equation:
      [0; 2; 5] = [1 3; 2 3; 3 6] [β_1; β_2] + [ξ_1; ξ_2; ξ_3]

  16. Which together form a matrix equation (continued). In compact form the training data give
      y = X · β + e
      with y = [0; 2; 5], X = [1 3; 2 3; 3 6], and e = [ξ_1; ξ_2; ξ_3].

  17. Q. What is the dimension of the matrix X? A. N × d  B. d × N  C. N × N  D. d × d  (Answer: A, N × d.)

  18. Training the model is to choose β. Given a training dataset {(x, y)}, we want to fit the model y = x^T β + ξ. Define
      y = [y_1; ...; y_N],  X = [x_1^T; ...; x_N^T],  e = [ξ_1; ...; ξ_N].
      To train the model, we need to choose the β that makes e small in the matrix equation y = X · β + e. Two routes to the same answer: ① least squares, ② MLE (see the textbook's loss-function discussion, pg. 309).

  19. Training using least squares. In the least squares method, we aim to minimize the loss (cost)
      ||e||^2 = ||y − Xβ||^2 = (y − Xβ)^T (y − Xβ).
      Differentiating with respect to β and setting the derivative to zero gives the normal equations
      X^T X β − X^T y = 0.
      If X^T X is invertible, the least squares estimate of the coefficients is
      β̂ = argmin_β ||e||^2 = (X^T X)^{-1} X^T y.
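
A minimal sketch of this closed-form estimate, assuming NumPy and using np.linalg.solve on the normal equations instead of forming the inverse explicitly (a standard numerical choice, not something the slides prescribe):

    import numpy as np

    def least_squares(X, y):
        """Solve the normal equations X^T X beta = X^T y for beta."""
        return np.linalg.solve(X.T @ X, X.T @ y)

    # Toy training data from the slides
    X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
    y = np.array([0.0, 2.0, 5.0])
    beta_hat = least_squares(X, y)        # approximately [2.0, -0.3333], i.e. (2, -1/3)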

  20. Dimensions of X^T X: X is N × d and X^T is d × N, so X^T X is d × d, a real symmetric matrix. Its eigenvalues are real and nonnegative (λ_i ≥ 0).

  21. Derivation of the least squares solution.
      ||e||^2 = (y − Xβ)^T (y − Xβ) = y^T y − β^T X^T y − y^T X β + β^T X^T X β.
      Useful vector/matrix derivatives (A a square matrix, a and b vectors):
      ∂(a^T A a)/∂a = (A + A^T) a,   ∂(b^T a)/∂a = b.
      Since y^T X β is a scalar, y^T X β = β^T X^T y, so ∂(y^T X β)/∂β = ∂(β^T X^T y)/∂β = X^T y. Since X^T X is symmetric, ∂(β^T X^T X β)/∂β = (X^T X + (X^T X)^T) β = 2 X^T X β. (Note that every term in ||e||^2 is a scalar.)
      Setting ∂||e||^2/∂β = 0:  −2 X^T y + 2 X^T X β = 0  ⇒  X^T X β = X^T y  ⇒  β̂ = (X^T X)^{-1} X^T y.

  22. Derivation of the least squares solution (continued). From X^T y − X^T X β̂ = 0 (X^T is d × N, e is N × 1):
      X^T (y − X β̂) = X^T e = 0,  equivalently  e^T X = 0 (a 1 × d row of zeros),
      so  e^T X β̂ = 0:  the residual e is orthogonal to, i.e. uncorrelated with, the fitted values X β̂.
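
A quick numerical check of this orthogonality property on the slides' toy data (a sketch assuming NumPy):

    import numpy as np

    X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
    y = np.array([0.0, 2.0, 5.0])
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    e = y - X @ beta_hat                  # residual vector
    print(X.T @ e)                        # ~ [0, 0]: e is orthogonal to every column of X
    print(e @ (X @ beta_hat))             # ~ 0: e is uncorrelated with the fitted values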

  23. Loss function (least squares). The loss decomposes into a sum of per-item terms:
      ||e||^2 = Σ_{j=1}^{N} Q_j(β),  with  Q_j(β) = (x_j^T β − y_j)^2  and  ∂Q_j/∂β = 2 (x_j^T β − y_j) x_j.
      This per-item form is exactly what stochastic gradient descent uses (see the sketch below), and it reappears in the final project.
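
Because the loss is a sum of per-item terms Q_j(β), it can also be minimized with the stochastic gradient descent method from last lecture. A sketch under assumed settings (the learning rate and step count are arbitrary choices, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
    y = np.array([0.0, 2.0, 5.0])

    beta = np.zeros(2)
    eta = 0.01                            # constant learning rate (an arbitrary choice)
    for step in range(20000):
        j = rng.integers(len(y))          # pick one item uniformly at random
        grad_Qj = 2.0 * (X[j] @ beta - y[j]) * X[j]   # gradient of Q_j(beta)
        beta -= eta * grad_Qj
    # beta ends up near the closed-form solution (2, -1/3), up to SGD noise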

  24. Convex set and convex function. If a set is convex, any line segment connecting two points in the set is completely included in the set. A convex function is one whose epigraph (the area above the curve) is convex; equivalently, for λ ∈ [0, 1],
      f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y).
      The least squares loss function is convex. Credit: Dr. Kevin Murphy.
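
A small numerical spot-check of the convexity inequality for the least squares loss on the toy data, using random points and random λ (illustration only, not a proof):

    import numpy as np

    rng = np.random.default_rng(1)
    X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
    y = np.array([0.0, 2.0, 5.0])

    def loss(beta):
        r = y - X @ beta
        return r @ r                      # ||y - X beta||^2

    for _ in range(1000):
        a, b = rng.normal(size=2), rng.normal(size=2)
        lam = rng.uniform()
        lhs = loss(lam * a + (1 - lam) * b)
        rhs = lam * loss(a) + (1 - lam) * loss(b)
        assert lhs <= rhs + 1e-9          # convexity inequality holds at every sampled point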

  25. What is the dimension of the matrix X^T X? A. N × d  B. d × N  C. N × N  D. d × d  (Answer: D; X is N × d and X^T is d × N, so X^T X is d × d, where d is the number of features, i.e. explanatory variables.)

  26. Is this statement true? "If the matrix X^T X does NOT have zero-valued eigenvalues, it is invertible." A. TRUE  B. FALSE  (Answer: TRUE; if λ_i ≠ 0 for all i, then det(X^T X) = Π_i λ_i ≠ 0, so X^T X is invertible.)

  27. Training using least squares: example. Model: y = x^T β + ξ = x^(1) β_1 + x^(2) β_2 + ξ. With the training data
      x^(1)  x^(2)  y
      1      3      0
      2      3      2
      3      6      5
      the estimate is β̂ = (X^T X)^{-1} X^T y = [2; −1/3], i.e. β̂_1 = 2 and β̂_2 = −1/3.

  28. Prediction. Once we have trained the model coefficients β̂, we can predict y^p_0 for a new input x_0:
      y^p_0 = x_0^T β̂.
      In the model y = x^(1) β_1 + x^(2) β_2 + ξ with β̂ = [2; −1/3]:
      the prediction for x_0 = [2; 1] is y^p_0 = 2·2 + 1·(−1/3) = 4 − 1/3 = 11/3;
      the prediction for x_0 = [0; 0] is y^p_0 = 0.
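
In code, prediction is just an inner product with the trained coefficients. A sketch on the same toy data (assuming NumPy; x_0 = (2, 1) is the worked example as reconstructed above):

    import numpy as np

    X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
    y = np.array([0.0, 2.0, 5.0])
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (2, -1/3)

    x0 = np.array([2.0, 1.0])
    print(x0 @ beta_hat)                  # 11/3 ~= 3.667
    print(np.zeros(2) @ beta_hat)         # 0.0: without an offset, x = 0 always predicts 0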

  29. A linear model with constant offset. The problem with the model y = x^(1) β_1 + x^(2) β_2 + ξ is that when x = 0 the prediction is forced to be y = 0. Let's add a constant offset β_0 to the model:
      y = β_0 + x^(1) β_1 + x^(2) β_2 + ξ.

  30. Training and prediction with constant offset. The model: y = β_0 + x^(1) β_1 + x^(2) β_2 + ξ = x^T β + ξ, where x = [1; x^(1); x^(2)] now includes a constant 1. Training data with the added column of ones:
      1  x^(1)  x^(2)  y
      1  1      3      0
      1  2      3      2
      1  3      6      5
      β̂ = (X^T X)^{-1} X^T y = [−3; 2; 1/3].
      For x_0 = [0; 0] (augmented to [1; 0; 0]), the prediction is y^p_0 = β̂_0 = −3.
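
A sketch of training the offset model by prepending a column of ones to the data matrix (assuming NumPy):

    import numpy as np

    X = np.array([[1.0, 3.0], [2.0, 3.0], [3.0, 6.0]])
    y = np.array([0.0, 2.0, 5.0])

    X1 = np.column_stack([np.ones(len(y)), X])     # prepend the constant-offset column of ones
    beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)
    print(beta_hat)                       # [-3, 2, 1/3]

    x0 = np.array([1.0, 0.0, 0.0])        # augmented x_0 = (0, 0)
    print(x0 @ beta_hat)                  # -3.0: the prediction at x = 0 is now the offset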

  31. Comparing our example models.
      Without offset: y = x^(1) β_1 + x^(2) β_2 + ξ, β̂ = [2; −1/3]:
      x^(1)  x^(2)  y   x^T β̂
      1      3      0   1
      2      3      2   3
      3      6      5   4
      With offset: y = β_0 + x^(1) β_1 + x^(2) β_2 + ξ, β̂ = [−3; 2; 1/3]:
      x^(1)  x^(2)  y   x^T β̂
      1      3      0   0
      2      3      2   2
      3      6      5   5
      The offset model fits this training data exactly.

  32. Variance of the linear regression model. The least squares estimate satisfies this property:
      var({y_i}) = var({x_i^T β̂}) + var({ξ_i}).
      The random error is uncorrelated with the least squares fit, which is a linear combination of the explanatory variables (recall y = X β̂ + e with β̂ = (X^T X)^{-1} X^T y).

  33. Variance of the linear regression model: proof (idea). Writing y = X β̂ + e,
      var(y) = var(X β̂) + var(e) + 2 cov(X β̂, e),
      and cov(X β̂, e) = 0, which gives var({y_i}) = var({x_i^T β̂}) + var({ξ_i}).

  34. Variance of the linear regression model: proof.
      var[y] = (1/N) ([Xβ̂ − mean(Xβ̂)] + [e − ē])^T ([Xβ̂ − mean(Xβ̂)] + [e − ē])
             = (1/N) ([Xβ̂ − mean(Xβ̂)]^T [Xβ̂ − mean(Xβ̂)] + 2 [e − ē]^T [Xβ̂ − mean(Xβ̂)] + [e − ē]^T [e − ē]).
      Because e^T X = 0 and ē = 0 (e^T 1 = 0; both follow from the least squares minimization when the model includes a constant offset), the cross term vanishes, so
      var[y] = (1/N) ([Xβ̂ − mean(Xβ̂)]^T [Xβ̂ − mean(Xβ̂)] + [e − ē]^T [e − ē]) = var[Xβ̂] + var[e].
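
A numerical check of this variance decomposition (a sketch assuming NumPy; the data are simulated under made-up coefficients, and np.var uses the same 1/N normalization as the proof):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200
    # Simulated data with an offset column; the true coefficients are hypothetical
    X = np.column_stack([np.ones(N), rng.uniform(0, 10, size=(N, 2))])
    y = X @ np.array([-3.0, 2.0, 1 / 3]) + rng.normal(0.0, 1.0, size=N)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    fitted = X @ beta_hat
    e = y - fitted
    print(np.var(y), np.var(fitted) + np.var(e))   # the two values agree up to float error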
