Introduction to Machine Learning: Linear Regression


SLIDE 1

Introduction to Machine Learning
Linear Regression

  • Prof. Andreas Krause

Learning and Adaptive Systems (las.ethz.ch)

SLIDE 2

Basic Supervised Learning Pipeline

[Pipeline diagram] Training data (emails labeled "spam" / "ham") → Learning method → Model \hat{f} : X \to Y → Prediction on test data (labels "?").

Model fitting happens on the training data; prediction/generalization happens on the test data.

SLIDE 3

Regression

Regression is an instance of supervised learning. Goal: predict real-valued labels (possibly vectors). Examples:

  X                        Y
  Flight route             Delay (minutes)
  Real estate objects      Price
  Customer & ad features   Click-through probability

SLIDE 4

Running example: Diabetes

[Efron et al ‘04]

Features X:

  • Age
  • Sex
  • Body mass index
  • Average blood pressure
  • Six blood serum measurements (S1-S6)

Label (target) Y: a quantitative measure of disease progression

SLIDE 5

Regression

Goal: learn a real-valued mapping f : \mathbb{R}^d \to \mathbb{R}

[Scatter plot of data points over axes x and y]

SLIDE 6

Important choices in regression

  • What types of functions f should we consider?
  • How should we measure goodness of fit?

[Two example plots of candidate fits f(x) over x]

SLIDE 7

Example: linear regression

[Scatter plot of data points over axes x and y]

SLIDE 8

Homogeneous representation

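The formulas on this slide are not preserved in the transcript; a standard statement of the homogeneous representation (a sketch, not the slide's exact notation) absorbs the offset b into the weight vector by appending a constant 1 to each input:

f(x) = w^\top x + b = \tilde{w}^\top \tilde{x}, \qquad \tilde{w} = (w_1, \dots, w_d, b)^\top, \quad \tilde{x} = (x_1, \dots, x_d, 1)^\top

This way, a linear model with offset can be treated as a purely linear (homogeneous) model in d+1 dimensions.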

SLIDE 9

Quantifying goodness of fit

Data: D = \{(x_1, y_1), \dots, (x_n, y_n)\}, with x_i \in \mathbb{R}^d and y_i \in \mathbb{R}

[Scatter plot of data points over axes x and y]

SLIDE 10

Least-squares linear regression optimization

[Legendre 1805, Gauss 1809]

Given a data set D = \{(x_1, y_1), \dots, (x_n, y_n)\}, how do we find the optimal weight vector?

\hat{w} = w^* = \arg\min_w \sum_{i=1}^n (y_i - w^\top x_i)^2
SLIDE 11

Method 1: Closed form solution

The least-squares problem

\hat{w} = \arg\min_w \sum_{i=1}^n (y_i - w^\top x_i)^2

can be solved in closed form:

\hat{w} = (X^\top X)^{-1} X^\top y

Hereby, X \in \mathbb{R}^{n \times d} is the design matrix whose rows are the x_i^\top, and y = (y_1, \dots, y_n)^\top is the vector of labels.
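As a minimal sketch (not from the slides; data and variable names are illustrative), the closed form can be computed in NumPy. Solving the normal equations with np.linalg.solve is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

# Illustrative data: n samples with d features
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))          # design matrix, rows are x_i^T
w_true = np.array([1.0, -2.0, 0.5])  # ground-truth weights for the toy data
y = X @ w_true + 0.1 * rng.normal(size=n)

# w* = (X^T X)^{-1} X^T y, computed by solving the normal equations
# X^T X w = X^T y rather than inverting X^T X
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to w_true
```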
SLIDE 12

How to solve? Example: Scikit Learn

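The code on this slide is not preserved; a minimal scikit-learn version, assuming the diabetes running example (which ships with sklearn.datasets), might look like this:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Running example: diabetes data [Efron et al '04]
X, y = load_diabetes(return_X_y=True)

# Ordinary least-squares linear regression
model = LinearRegression()
model.fit(X, y)

print(model.coef_)       # learned weight vector w
print(model.intercept_)  # learned offset b
```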

SLIDE 13

Demo

[Plot: disease progression vs. body mass index]

SLIDE 14

Method 2: Optimization

The objective function

\hat{R}(w) = \sum_i (y_i - w^\top x_i)^2

is convex!
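One standard way to verify the convexity claim (an argument added here, not shown on the slide): writing the objective in matrix form, its Hessian is positive semi-definite,

\hat{R}(w) = \|y - Xw\|_2^2, \qquad \nabla^2 \hat{R}(w) = 2\, X^\top X \succeq 0,

since v^\top (2 X^\top X) v = 2 \|Xv\|_2^2 \ge 0 for every v.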

SLIDE 15

Gradient Descent

Start at an arbitrary w_0 \in \mathbb{R}^d. For t = 1, 2, \dots do:

w_{t+1} = w_t - \eta_t \nabla \hat{R}(w_t)

Hereby, \eta_t is called the learning rate.
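A minimal sketch of this update for the squared-loss objective (names and defaults are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(X, y, eta=0.001, num_steps=1000):
    """Gradient descent on R(w) = sum_i (y_i - w^T x_i)^2."""
    w = np.zeros(X.shape[1])           # arbitrary starting point w_0
    for _ in range(num_steps):
        grad = -2 * X.T @ (y - X @ w)  # gradient of the squared loss
        w = w - eta * grad             # update: w_{t+1} = w_t - eta_t * grad
    return w
```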

SLIDE 16

Convergence of gradient descent

Under mild assumptions, if the step size is sufficiently small, gradient descent converges to a stationary point (gradient = 0). For convex objectives, it therefore finds the optimal solution!

In the case of the squared loss, gradient descent with constant step size ½ converges linearly.

SLIDE 17

Computing the gradient

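The worked computation on this slide is not preserved in the transcript; the standard derivation for the squared loss is:

\nabla \hat{R}(w) = \sum_{i=1}^n \nabla_w (y_i - w^\top x_i)^2 = -2 \sum_{i=1}^n (y_i - w^\top x_i)\, x_i = 2\, X^\top (Xw - y)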

SLIDE 18

Demo: Gradient descent


SLIDE 19

Choosing a stepsize

What happens if we choose a poor stepsize?

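The slide's demo is not reproduced here; the following toy experiment (all values illustrative) shows the two failure modes on a one-dimensional least-squares problem: too large a step diverges, too small a step barely moves:

```python
def run_gd(eta, num_steps=20):
    """Gradient descent on R(w) = (1 - w)^2, whose minimizer is w* = 1."""
    w = 0.0
    for _ in range(num_steps):
        grad = -2 * (1.0 - w)  # dR/dw
        w -= eta * grad
    return w

print(run_gd(eta=0.4))    # converges quickly toward w* = 1
print(run_gd(eta=1.1))    # too large: iterates oscillate and blow up
print(run_gd(eta=0.001))  # too small: barely moves in 20 steps
```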

SLIDE 20

Adaptive step size

We can update the step size adaptively. For example:
1) Via line search (optimizing the step size in every step)
2) "Bold driver" heuristic: if the function value decreases, increase the step size; if it increases, decrease the step size (see the sketch below).
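The update rules themselves are cut off in the transcript; a common form of the bold-driver heuristic (the factors 1.1 and 0.5 are conventional choices, not taken from the slide) is sketched below. In practice, the step that increased the objective is often also rejected.

```python
def bold_driver_step(eta, loss_prev, loss_curr, inc=1.1, dec=0.5):
    """Adapt the step size based on whether the objective improved."""
    if loss_curr < loss_prev:
        return eta * inc  # function decreased: increase step size
    return eta * dec      # function increased: decrease step size
```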


SLIDE 21

Demo: Gradient Descent for Linear Regression


SLIDE 22

Gradient descent vs closed form

Why would one ever consider performing gradient descent when it is possible to find a closed-form solution?

  • Computational complexity (see below)
  • We may not need an optimal solution
  • Many problems don't admit a closed-form solution
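To make the computational-complexity point concrete (standard operation counts, not stated on the slide): forming and solving the normal equations costs on the order of

\mathcal{O}(nd^2 + d^3)

while one gradient-descent iteration costs only \mathcal{O}(nd), which matters when the dimension d is large.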


SLIDE 23

Other loss functions

So far, we measured goodness of fit via the squared error. Many other loss functions are possible (and sensible!):
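Two common alternatives (standard examples, not listed in this transcript) are the absolute (L1) loss and the Huber loss, which are less sensitive to outliers than the squared error:

\ell_{\mathrm{abs}}(y, \hat{y}) = |y - \hat{y}|, \qquad
\ell_{\delta}(y, \hat{y}) = \begin{cases} \tfrac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le \delta \\ \delta\,|y - \hat{y}| - \tfrac{1}{2}\delta^2 & \text{otherwise} \end{cases}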
