Machine Learning - Regressions
Amir H. Payberah
payberah@kth.se 07/11/2018
The Course Web Page: https://id2223kth.github.io

Let's Start with an Example: The Housing Price Example
◮ Given a dataset of m houses:

Living area | No. of bedrooms | Price
2104        | 3               | 400
1600        | 3               | 330
2400        | 3               | 369
...         | ...             | ...

◮ Predict the prices of other houses as a function of the size of their living area and number of bedrooms.
◮ Each house i in the training set is represented by a feature vector x(i) and a label y(i):

x(1) = [2104, 3]⊺,  x(2) = [1600, 3]⊺,  x(3) = [2400, 3]⊺, . . .

X = [x(1)⊺; x(2)⊺; x(3)⊺; . . .] =
  2104  3
  1600  3
  2400  3
  ...

y = [400, 330, 369, . . .]⊺

◮ x(i) ∈ R²: x1(i) is the living area, and x2(i) is the number of bedrooms of the ith house in the training set.
◮ Predict the prices of other houses ŷ as a function of the size of their living areas x1 and the number of bedrooms x2, i.e., ŷ = f(x1, x2).
◮ E.g., what is ŷ, if x1 = 4000 and x2 = 4?
◮ As an initial choice: ŷ = fw(x) = w1x1 + w2x2
◮ Our goal: to build a system that takes input x ∈ Rⁿ and predicts output ŷ ∈ R.
◮ In linear regression, the output ŷ is a linear function of the input x:

ŷ = fw(x) = w1x1 + w2x2 + · · · + wnxn
ŷ = w⊺x

• ŷ: the predicted value
◮ Linear regression often has one additional parameter, called the intercept b:

ŷ = w⊺x + b

◮ Instead of adding the bias parameter b, we can augment x with an extra entry that is always set to 1:

ŷ = fw(x) = w0x0 + w1x1 + w2x2 + · · · + wnxn, where x0 = 1
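As a small illustration of this augmentation, the sketch below prepends a column of ones to a feature matrix using Breeze (the same library used in the worked example later); the values are the housing data from above and the variable names are illustrative.

import breeze.linalg._

// a 2-feature matrix: (living area, number of bedrooms) for three houses
val features = DenseMatrix((2104.0, 3.0), (1600.0, 3.0), (2400.0, 3.0))
// prepend a column of ones, so that w0 acts as the intercept b
val ones = DenseMatrix.ones[Double](features.rows, 1)
val X = DenseMatrix.horzcat(ones, features)
// each row of X is now (1, x1, x2), i.e., x0 = 1 for every instance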
◮ Parameters w ∈ Rⁿ are values that control the behavior of the model.
◮ w are a set of weights that determine how each feature affects the prediction:
• If a feature xi receives a positive weight wi, increasing xi increases the prediction ŷ.
• If wi is negative, increasing xi decreases ŷ.
• If wi is zero, xi has no effect on ŷ.
◮ A reasonable model should make ŷ close to y, at least for the training dataset.
◮ Residual: the difference between the dependent variable y and the predicted value ŷ:

r(i) = y(i) − ŷ(i)
◮ Cost function J(w): measures how close each prediction ŷ(i) is to the corresponding y(i).

J(w) = MSE(w) = (1/m) Σi=1..m (ŷ(i) − y(i))² = E[(ŷ − y)²] = (1/m)||ŷ − y||₂²
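A minimal plain-Scala sketch of this cost function, assuming the predictions and labels are given as sequences (the function and variable names are illustrative):

// mean squared error over m instances: (1/m) * sum of squared residuals
def mse(yHat: Seq[Double], y: Seq[Double]): Double =
  yHat.zip(y).map { case (p, t) => (p - t) * (p - t) }.sum / y.length

// e.g., predictions (410, 320, 370) against labels (400, 330, 369) give (100 + 100 + 1) / 3 = 67
mse(Seq(410.0, 320.0, 370.0), Seq(400.0, 330.0, 369.0))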
◮ We want to choose w so as to minimize J(w).
◮ Two approaches to find w:
• Analytically, by solving the normal equation.
• Iteratively, by gradient descent.
◮ The first derivative of f(x), written f′(x), gives the slope of the tangent line to the function at the point x.
◮ E.g., f(x) = x² ⇒ f′(x) = 2x
◮ If f(x) is increasing, then f′(x) > 0
◮ If f(x) is decreasing, then f′(x) < 0
◮ If f(x) is at a local minimum/maximum, then f′(x) = 0
◮ What if a function has multiple arguments, e.g., f(x1, x2, · · · , xn)?
◮ Partial derivative: the derivative with respect to a particular argument.
• ∂f/∂x1: the derivative with respect to x1
• ∂f/∂x2: the derivative with respect to x2
◮ ∂f/∂xi shows how much the function f will change, if we change xi.
◮ Gradient: the vector of all partial derivatives for a function f.

∇xf(x) = [∂f/∂x1, ∂f/∂x2, . . . , ∂f/∂xn]⊺
◮ What is the gradient of f(x1, x2, x3) = x1 − x1x2 + x3²?

∇xf(x) = [∂/∂x1(x1 − x1x2 + x3²), ∂/∂x2(x1 − x1x2 + x3²), ∂/∂x3(x1 − x1x2 + x3²)]⊺ = [1 − x2, −x1, 2x3]⊺
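To make the gradient concrete, the sketch below checks this result numerically with central finite differences (plain Scala; the test point and step size are illustrative and not part of the original slides):

// f(x1, x2, x3) = x1 - x1*x2 + x3^2
def f(x: Array[Double]): Double = x(0) - x(0) * x(1) + x(2) * x(2)

// approximate each partial derivative as (g(x + h*ei) - g(x - h*ei)) / (2h)
def numericalGradient(g: Array[Double] => Double, x: Array[Double], h: Double = 1e-5): Array[Double] =
  x.indices.map { i =>
    val xPlus = x.clone;  xPlus(i) += h
    val xMinus = x.clone; xMinus(i) -= h
    (g(xPlus) - g(xMinus)) / (2 * h)
  }.toArray

// at x = (1, 2, 3) the analytic gradient (1 - x2, -x1, 2*x3) is (-1, -1, 6)
numericalGradient(f, Array(1.0, 2.0, 3.0))   // ≈ Array(-1.0, -1.0, 6.0)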
◮ To minimize J(w), we can simply solve for where its gradient is 0: ∇wJ(w) = 0

ŷ = w⊺x

X = [x(1)⊺; x(2)⊺; . . . ; x(m)⊺], i.e., row i of X is [x1(i), x2(i), · · · , xn(i)]

ŷ = [ŷ(1), ŷ(2), . . . , ŷ(m)]⊺

ŷ = w⊺X⊺, or equivalently ŷ = Xw
Setting ∇wJ(w) = 0 and solving for w:

J(w) = (1/m)||ŷ − y||₂²
∇w (1/m)||Xw − y||₂² = 0
⇒ ∇w (Xw − y)⊺(Xw − y) = 0
⇒ ∇w (w⊺X⊺Xw − 2w⊺X⊺y + y⊺y) = 0
⇒ 2X⊺Xw − 2X⊺y = 0
⇒ w = (X⊺X)⁻¹X⊺y
◮ Now consider the full training set of five houses:

Living area | No. of bedrooms | Price
2104        | 3               | 400
1600        | 3               | 330
2400        | 3               | 369
1416        | 2               | 232
3000        | 4               | 540

◮ Predict the value of ŷ, when x1 = 4000 and x2 = 4.
◮ We should find w0, w1, and w2 in ŷ = w0 + w1x1 + w2x2.
◮ w = (X⊺X)⁻¹X⊺y
X =
  1  2104  3
  1  1600  3
  1  2400  3
  1  1416  2
  1  3000  4

y = [400, 330, 369, 232, 540]⊺

import breeze.linalg._

// Breeze's DenseMatrix(rows, cols, data) fills the matrix column by column:
// a column of ones (for the intercept), the living areas, and the bedroom counts
val X = new DenseMatrix(5, 3, Array(1.0, 1.0, 1.0, 1.0, 1.0,
  2104.0, 1600.0, 2400.0, 1416.0, 3000.0,
  3.0, 3.0, 3.0, 2.0, 4.0))
val y = new DenseVector(Array(400.0, 330.0, 369.0, 232.0, 540.0))
X⊺X =
  5      10520     15
  10520  23751872  33144
  15     33144     47

val Xt = X.t
val XtX = Xt * X
(X⊺X)⁻¹ ≈
   4.90366455e+00   7.48766737e−04  −2.09302326e+00
   7.48766737e−04   2.75281889e−06  −2.18023256e−03
  −2.09302326e+00  −2.18023256e−03   2.22674419e+00

val XtXInv = inv(XtX)
X⊺y = [1871, 4203712, 5921]⊺

val Xty = Xt * y
w = (X⊺X)⁻¹X⊺y ≈ [−7.04346018e+01, 6.38433756e−02, 1.03436047e+02]⊺

val w = XtXInv * Xty
◮ Predict the value of ŷ, when x1 = 4000 and x2 = 4.

ŷ = −7.04346018e+01 + 6.38433756e−02 × 4000 + 1.03436047e+02 × 4 ≈ 599

// the test example is augmented with a leading 1 for the intercept term
val test = new DenseVector(Array(1.0, 4000.0, 4.0))
val yHat = w dot test   // inner product w⊺x ≈ 599
case class house(x1: Long, x2: Long, y: Long)

val trainData = Seq(house(2104, 3, 400), house(1600, 3, 330), house(2400, 3, 369),
  house(1416, 2, 232), house(3000, 4, 540)).toDF
val testData = Seq(house(4000, 4, 0)).toDF

// assemble the raw columns x1 and x2 into a single feature vector column
import org.apache.spark.ml.feature.VectorAssembler
val va = new VectorAssembler().setInputCols(Array("x1", "x2")).setOutputCol("features")
val train = va.transform(trainData)
val test = va.transform(testData)

// solver "normal" uses the normal equation instead of iterative optimization
import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression().setFeaturesCol("features").setLabelCol("y").setSolver("normal")
val lrModel = lr.fit(train)
lrModel.transform(test).show
◮ The computational complexity of inverting X⊺X is O(n³).
◮ But this equation is linear in the number of instances in the training set, i.e., it is O(m).
◮ Gradient descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems.
◮ The idea: tweak parameters iteratively in order to minimize a cost function.
◮ Suppose you are lost in the mountains in a dense fog.
◮ You can only feel the slope of the ground below your feet.
◮ A strategy to get to the bottom of the valley is to go downhill in the direction of the steepest slope.
◮ Choose a starting point, e.g., fill w with random values.
◮ If the stopping criterion is true, return the current solution; otherwise continue.
◮ Find a descent direction: a direction in which the function value decreases near the current point.
◮ Determine the step size: the length of a step in the given direction.

◮ Gradient descent thus needs three ingredients: a stopping criterion, a descent direction, and a step size (learning rate).
◮ Stopping criterion: at the minimum of the cost function, the gradient has to be zero.
∇wJ(w) = 0
◮ Descent direction: the direction in which the function value decreases near the current point.
◮ Find the direction of descent (slope).
◮ Example: J(w) = w², so ∂J(w)/∂w = 2w, which is −2 at w = −1.
◮ Follow the opposite direction of the slope.
◮ Learning rate: the length of the steps.
◮ If it is too small: many iterations to converge.
◮ If it is too high: the algorithm might diverge.
◮ Goal: find w that minimizes J(w) = (1/m) Σi=1..m (w⊺x(i) − y(i))².
◮ Start at a random point, and repeat the following steps until the stopping criterion is satisfied:
1. Determine the descent direction: −∂J(w)/∂wj
2. Update each parameter: wj(next) = wj − η ∂J(w)/∂wj
(should be done for all parameters simultaneously)
◮ Batch gradient descent
◮ Stochastic gradient descent
◮ Mini-batch gradient descent
[https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3]
◮ Repeat the following steps, until the stopping criterion is satisfied:

wj(next) = wj − η ∂J(w)/∂wj, for all parameters wj

J(w) = (1/m) Σi=1..m (w⊺x(i) − y(i))²

∂J(w)/∂wj = (2/m) Σi=1..m (w⊺x(i) − y(i)) xj(i)

∇wJ(w) = [∂J(w)/∂w0, ∂J(w)/∂w1, . . . , ∂J(w)/∂wn]⊺ = (2/m) X⊺(Xw − y)
◮ The algorithm is called Batch Gradient Descent, because at each step the calculations are over the full training set X.
◮ As a result, it is slow on very large training sets, i.e., large m.
◮ But it scales well with the number of features n. (A minimal code sketch follows below.)
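A minimal Breeze sketch of batch gradient descent for this cost function, assuming a design matrix X (with the leading column of ones) and a label vector y as in the earlier example; the learning rate and iteration count are illustrative, and in practice the features would be standardized first:

import breeze.linalg._

// batch gradient descent: every step uses the full training set
// gradient of J(w) = (1/m)||Xw - y||^2 is (2/m) X^T (Xw - y)
def batchGradientDescent(X: DenseMatrix[Double], y: DenseVector[Double],
                         eta: Double, numIters: Int): DenseVector[Double] = {
  val m = X.rows.toDouble
  var w = DenseVector.zeros[Double](X.cols)
  for (_ <- 0 until numIters) {
    val gradient = (X.t * (X * w - y)) * (2.0 / m)
    w = w - gradient * eta          // update all parameters simultaneously
  }
  w
}

// with unscaled living areas the learning rate must be tiny to avoid divergence
val wGD = batchGradientDescent(X, y, eta = 1e-8, numIters = 100000)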
◮ Back to the housing example: ŷ = w0 + w1x1 + w2x2, with the same X (including the leading column of ones) and y as before.
∂J(w)/∂w0 = (2/m) Σi=1..m (w⊺x(i) − y(i)) x0(i)
= (2/5) [(w0 + 2104w1 + 3w2 − 400) + (w0 + 1600w1 + 3w2 − 330) + (w0 + 2400w1 + 3w2 − 369) + (w0 + 1416w1 + 2w2 − 232) + (w0 + 3000w1 + 4w2 − 540)]
∂J(w)/∂w1 = (2/m) Σi=1..m (w⊺x(i) − y(i)) x1(i)
= (2/5) [2104(w0 + 2104w1 + 3w2 − 400) + 1600(w0 + 1600w1 + 3w2 − 330) + 2400(w0 + 2400w1 + 3w2 − 369) + 1416(w0 + 1416w1 + 2w2 − 232) + 3000(w0 + 3000w1 + 4w2 − 540)]
∂J(w)/∂w2 = (2/m) Σi=1..m (w⊺x(i) − y(i)) x2(i)
= (2/5) [3(w0 + 2104w1 + 3w2 − 400) + 3(w0 + 1600w1 + 3w2 − 330) + 3(w0 + 2400w1 + 3w2 − 369) + 2(w0 + 1416w1 + 2w2 − 232) + 4(w0 + 3000w1 + 4w2 − 540)]
w0(next) = w0 − η ∂J(w)/∂w0
w1(next) = w1 − η ∂J(w)/∂w1
w2(next) = w2 − η ∂J(w)/∂w2
◮ Batch gradient descent problem: it is slow, because it uses the whole training set to compute the gradients at every step.
◮ Stochastic gradient descent computes the gradients based on only a single instance.
◮ The algorithm is much faster, but less regular than batch gradient descent.
◮ Instead of decreasing gently, the cost function bounces up and down, decreasing only on average.
◮ With this randomness, the algorithm can never settle at the minimum.
◮ One solution is simulated annealing: start with a large learning rate, then make it smaller and smaller.
◮ Learning schedule: the function that determines the learning rate at each step. (See the sketch below.)
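A minimal Breeze sketch of stochastic gradient descent with a decaying learning rate, again assuming the X and y from the earlier example; the learning schedule and its constants are illustrative:

import breeze.linalg._
import scala.util.Random

// stochastic gradient descent: every step uses one randomly chosen instance
def stochasticGradientDescent(X: DenseMatrix[Double], y: DenseVector[Double],
                              numEpochs: Int): DenseVector[Double] = {
  val m = X.rows
  var w = DenseVector.zeros[Double](X.cols)
  // learning schedule: the rate shrinks as the step counter t grows
  def learningRate(t: Int): Double = 1.0 / (t + 1e8)
  var t = 0
  for (_ <- 0 until numEpochs; _ <- 0 until m) {
    val i = Random.nextInt(m)                 // pick a single random training instance
    val xi = X(i, ::).t                       // its feature row as a column vector
    val gradient = xi * (2.0 * ((w dot xi) - y(i)))
    w = w - gradient * learningRate(t)
    t += 1
  }
  w
}

val wSGD = stochasticGradientDescent(X, y, numEpochs = 50)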
◮ The same housing example with stochastic gradient descent: ŷ = w0 + w1x1 + w2x2, with the same X and y as before.
∂J(w)/∂w0 = (2/m)(w⊺x(i) − y(i)) x0(i) = (2/5) [(w0 + 1600w1 + 3w2 − 330)]
∂J(w)/∂w1 = (2/m)(w⊺x(i) − y(i)) x1(i) = (2/5) [1416(w0 + 1416w1 + 2w2 − 232)]
∂J(w)/∂w2 = (2/m)(w⊺x(i) − y(i)) x2(i) = (2/5) [3(w0 + 2104w1 + 3w2 − 400)]

(each gradient is computed from a single, randomly picked training instance)
w0(next) = w0 − η ∂J(w)/∂w0
w1(next) = w1 − η ∂J(w)/∂w1
w2(next) = w2 − η ∂J(w)/∂w2
◮ Batch gradient descent: at each step, it computes the gradients based on the full training set.
◮ Stochastic gradient descent: at each step, it computes the gradients based on just a single instance.
◮ Mini-batch gradient descent: at each step, it computes the gradients based on small random sets of instances called mini-batches. (A minimal sketch follows below.)
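A minimal Breeze sketch of mini-batch gradient descent in the same setting; the batch size, learning rate, and iteration count are illustrative:

import breeze.linalg._
import scala.util.Random

// mini-batch gradient descent: every step uses a small random subset of instances
def miniBatchGradientDescent(X: DenseMatrix[Double], y: DenseVector[Double],
                             batchSize: Int, eta: Double, numIters: Int): DenseVector[Double] = {
  val m = X.rows
  var w = DenseVector.zeros[Double](X.cols)
  for (_ <- 0 until numIters) {
    val batch = Random.shuffle((0 until m).toList).take(batchSize)
    val gradient = DenseVector.zeros[Double](X.cols)
    for (i <- batch) {                        // accumulate the gradient over the mini-batch
      val xi = X(i, ::).t
      gradient += xi * (2.0 * ((w dot xi) - y(i)))
    }
    w = w - gradient * (eta / batchSize)
  }
  w
}

val wMB = miniBatchGradientDescent(X, y, batchSize = 2, eta = 1e-8, numIters = 100000)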
// load a dataset in libsvm format: columns "label" and "features"
val data = spark.read.format("libsvm").load("data.txt")

import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression().setMaxIter(10)
val lrModel = lr.fit(data)

println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

val trainingSummary = lrModel.summary
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
◮ Split the data into a training set and a test set.
◮ Use the training set when training a machine learning model.
◮ Use the test set to measure the accuracy of the model.

val data = spark.read.format("libsvm").load("data.txt")
val Array(trainDF, testDF) = data.randomSplit(Array(0.8, 0.2))
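A sketch of measuring the error on the held-out test set with Spark ML, assuming the default column names produced by the libsvm loader ("label" and "features"):

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.evaluation.RegressionEvaluator

// fit on the training split only, then evaluate on the unseen test split
val lrModel = new LinearRegression().fit(trainDF)
val predictions = lrModel.transform(testDF)
val rmse = new RegressionEvaluator().setMetricName("rmse").evaluate(predictions)
println(s"Test RMSE: $rmse")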
◮ Generalization: make a model that performs well on test data.
◮ Challenges: underfitting and overfitting.
◮ The test error is defined as the expected value of the error on the test set:

MSE = (1/k) Σi=1..k (ŷ(i) − y(i))² = E[(ŷ − y)²], where k is the number of instances in the test set.

◮ A model's test error can be expressed as the sum of bias and variance:

E[(ŷ − y)²] = Bias[ŷ, y]² + Var[ŷ] + ε²
◮ Bias: the expected deviation from the true value of the function.

Bias[ŷ, y] = E[ŷ] − y

◮ A high-bias model is most likely to underfit the training data.
◮ Underfitting happens when the model is too simple to learn the underlying structure of the data.
◮ Variance: how much a model changes if you train it on a different training set.

Var[ŷ] = E[(ŷ − E[ŷ])²]

◮ A high-variance model is most likely to overfit the training data.
◮ Overfitting happens when the model is too complex relative to the amount and noisiness of the training data.
◮ Assume a model with two parameters w0 (intercept) and w1 (slope): ŷ = w0 + w1x
◮ They give the learning algorithm two degrees of freedom.
◮ We tweak both w0 and w1 to adapt the model to the training data.
◮ If we forced w0 = 0, the algorithm would have only one degree of freedom and would have a much harder time fitting the data properly.
◮ Increasing a model's degrees of freedom will typically increase its variance and reduce its bias.
◮ Decreasing its degrees of freedom increases its bias and reduces its variance.
◮ This is why it is called a tradeoff.
[https://ml.berkeley.edu/blog/2017/07/13/tutorial-4]
◮ One way to reduce the risk of overfitting is to have fewer degrees of freedom.
◮ Regularization is a technique to reduce the risk of overfitting.
◮ For a linear model, regularization is achieved by constraining the weights of the model:

J(w) = MSE(w) + λR(w)
◮ Lasso regression (l1): the penalty term λ Σi=1..n |wi| is added to the cost function:

J(w) = MSE(w) + λ Σi=1..n |wi|

◮ Ridge regression (l2): the penalty term λ Σi=1..n wi² is added to the cost function:

J(w) = MSE(w) + λ Σi=1..n wi²

◮ ElasticNet: a middle ground between l1 and l2 regularization:

J(w) = MSE(w) + αλ Σi=1..n |wi| + (1 − α)λ Σi=1..n wi²
◮ If α = 0: l2 regularization
◮ If α = 1: l1 regularization
◮ For α in (0, 1): a combination of l1 and l2 regularization

import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression().setElasticNetParam(0.8)
val lrModel = lr.fit(data)
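In Spark ML the regularization strength λ corresponds to regParam and α to elasticNetParam, so the three variants above can be selected as follows (the values 0.1 and 0.5 are illustrative):

import org.apache.spark.ml.regression.LinearRegression

val ridge      = new LinearRegression().setRegParam(0.1).setElasticNetParam(0.0)  // l2 only
val lasso      = new LinearRegression().setRegParam(0.1).setElasticNetParam(1.0)  // l1 only
val elasticNet = new LinearRegression().setRegParam(0.1).setElasticNetParam(0.5)  // mix of l1 and l2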
◮ Hyperparameters are settings that we can use to control the behavior of a learning algorithm.
◮ The values of hyperparameters are not adapted by the learning algorithm itself.
◮ We do not learn the hyperparameters.
◮ To find hyperparameters, we need a validation set of examples that the training algorithm does not observe.
◮ We construct the validation set from the training data (not the test data).
◮ We split the training data into two disjoint subsets:
• One subset is used to learn the parameters.
• The other (the validation set) is used to estimate the generalization error during or after training, allowing the hyperparameters to be updated accordingly.
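A minimal sketch of carving a validation set out of the training data from the earlier split (the 75/25 ratio is illustrative):

// hold out part of the training data; the test set is not touched here
val Array(trainSet, validationSet) = trainDF.randomSplit(Array(0.75, 0.25))
// fit candidate models (e.g., different regParam values) on trainSet,
// compare their errors on validationSet, and keep the best hyperparameters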
◮ Cross-validation: a technique to avoid wasting too much training data in validation sets.
◮ The training set is split into complementary subsets.
◮ Each model is trained against a different combination of these subsets and validated against the remaining parts.
◮ Once the model type and hyperparameters have been selected, a final model is trained using these hyperparameters on the full training set, and the generalization error is measured on the test set.
◮ CrossValidator optimizes hyperparameters in algorithms and model selection.
◮ It requires the following items:
• An Estimator, i.e., the algorithm or pipeline to tune.
• A set of ParamMaps, i.e., the grid of parameter values to search over.
• An Evaluator, i.e., the metric that measures how well a fitted model does on held-out data.
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LinearRegression()

// construct a grid of parameters to search over:
// this grid has 2 x 2 = 4 parameter settings for CrossValidator to choose from
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.elasticNetParam, Array(0.0, 1.0))
  .build()

// numFolds = 3 => (2 x 2) x 3 = 12 different models being trained
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainDF)
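After fitting, the cross-validated model can be applied to the held-out test data and the winning hyperparameters inspected; a sketch, assuming the trainDF/testDF split from before:

// apply the model with the best cross-validated score to the test set
cvModel.transform(testDF).show()

// inspect the hyperparameters of the best model
import org.apache.spark.ml.regression.LinearRegressionModel
val best = cvModel.bestModel.asInstanceOf[LinearRegressionModel]
println(s"regParam: ${best.getRegParam}, elasticNetParam: ${best.getElasticNetParam}")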
◮ Linear regression model: ŷ = w⊺x
◮ Generalization
◮ Hyperparameters and cross-validation
◮ Ian Goodfellow et al., Deep Learning (Ch. 4, 5)
◮ Aurélien Géron, Hands-On Machine Learning (Ch. 2, 4)
◮ Matei Zaharia et al., Spark - The Definitive Guide (Ch. 27)