
slide-1
SLIDE 1

Machine Learning

Lecture 2, Justin Pearson¹, 2020

¹http://user.it.uu.se/~justin/Teaching/MachineLearning/index.html

1 / 39

slide-2
SLIDE 2

Today’s plan

• A (review) of elementary calculus.
• Gradient descent for optimisation.
• Linear regression as gradient descent.
• An exact method for linear regression.
• Features and non-linear features.
• Looking at your model.
• Introduction to regularisation.

2 / 39

slide-3
SLIDE 3

Gradients and Derivatives

Given a differentiable function $f$, what does the derivative $\frac{d}{dx} f(x) = f'(x)$ tell us?

3 / 39

slide-4
SLIDE 4

Tangent Line²

The slope of the tangent line is equal to the first derivative of the function at that point.

²Picture from https://commons.wikimedia.org/wiki/File:Tangent_to_a_curve.svg

4 / 39

slide-5
SLIDE 5

Gradients Taylor Expansion

For a reasonably well-behaved function $f$, the Taylor expansion about a point $x_0$ is the following:

$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2!} f''(x_0)(x - x_0)^2 + \frac{1}{3!} f'''(x_0)(x - x_0)^3 + \cdots$$

Close to $x_0$ the higher-order terms become smaller and smaller. Thus we could say that around a point $x_0$

$$f(x) \approx f(x_0) + f'(x_0)(x - x_0).$$

5 / 39

slide-6
SLIDE 6

Gradients

What happens when $\frac{d}{dx} f(x) = 0$? We are at a minimum, a maximum, or an inflection point. To check that it is a true minimum we must check that $f''(x) > 0$.

6 / 39

slide-7
SLIDE 7

Gradient Descent

If you are at a point and you move in the direction of the negative gradient, then you decrease the value of the function. Think of standing on a hill: the negative gradient points along the direction of steepest descent.

7 / 39

slide-8
SLIDE 8

Gradient Descent - One variable

Given a learning rate $\alpha$ and an initial guess $x_0$:

    x ← x0;
    while not converged do
        x ← x − α f′(x);
    end

Question: what happens when $\alpha$ is very small, and what happens if $\alpha$ is too large?
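As an aside (not on the original slide), here is a minimal sketch of this loop in Python; the example function, tolerance, and learning rate are illustrative choices:

    # One-dimensional gradient descent: repeatedly step against the derivative.
    def gradient_descent(df, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
        x = x0
        for _ in range(max_iter):
            step = alpha * df(x)
            x -= step
            if abs(step) < tol:  # "converged": the update has become tiny
                break
        return x

    # Example: f(x) = (x - 3)^2 has derivative 2(x - 3) and its minimum at x = 3.
    print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))  # ~3.0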

8 / 39

slide-9
SLIDE 9

Minima


The red function on the left has only one minimum, while the function on the right has multiple local minima.

9 / 39

slide-10
SLIDE 10

Minima

Gradient descent is only guaranteed to find the global minimum if there is only one. If there are many local minima, you can restart the algorithm from another initial guess and hope that you converge to a smaller local minimum. Even so, gradient descent is a widely used optimisation method in machine learning.

10 / 39

slide-11
SLIDE 11

Partial derivatives

How do you differentiate functions of multiple parameters? For example,

$$f(x, y) = xy + y^2 + x^2 y.$$

We can compute partial derivatives. The expression $\frac{\partial f(x,y)}{\partial x}$ is the derivative with respect to $x$, where the other variables ($y$ in this case) are treated as constants. So

$$\frac{\partial f(x,y)}{\partial x} = y + 0 + 2xy.$$
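If you have sympy available, you can check such partial derivatives mechanically (a side note, not from the slides):

    import sympy as sp

    x, y = sp.symbols('x y')
    f = x*y + y**2 + x**2*y
    # Differentiate with respect to x; y is treated as a constant.
    print(sp.diff(f, x))  # prints 2*x*y + y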

11 / 39

slide-12
SLIDE 12

Gradient Descent — Multiple Variables

Suppose that we have a function that depends on an $n$-dimensional vector $x = (x_1, \ldots, x_n)$. Then the gradient is given by

$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right).$$

Gradient descent works in multiple dimensions, but there is even more of a chance that we have multiple local minima.
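The same loop works on vectors; a minimal numpy sketch (the quadratic example function is my own choice, not from the slides):

    import numpy as np

    def gradient_descent(grad, x0, alpha=0.1, steps=1000):
        # x is an n-dimensional vector; grad(x) returns the gradient at x.
        x = np.asarray(x0, dtype=float)
        for _ in range(steps):
            x = x - alpha * grad(x)
        return x

    # Example: f(x) = ||x - c||^2 has gradient 2(x - c) and its minimum at c.
    c = np.array([1.0, -2.0])
    print(gradient_descent(lambda x: 2 * (x - c), x0=np.zeros(2)))  # ~[1, -2]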

12 / 39

slide-13
SLIDE 13

New Notation

Given a data set $x, y$ of $m$ points, we will denote the $i$-th data item as $x^{(i)}, y^{(i)}$. This is an attempt to make expressions like $(x^{(i)})^2$ more understandable. I will try to be consistent.

13 / 39

slide-14
SLIDE 14

Linear Hypotheses

Consider a very simple data set: $x = (3, 6, 9)$, $y = (6.9, 12.1, 16)$. We want to fit a straight line to the data. Our hypothesis is a function parameterised by $\theta_0, \theta_1$:

$$h_{\theta_0,\theta_1}(x) = \theta_0 + \theta_1 x.$$

14 / 39

slide-15
SLIDE 15

Hypotheses

[Figure: training data with two candidate lines, $\theta_0 = 1.0, \theta_1 = 3.0$ and $\theta_0 = 1.5, \theta_1 = 2.0$.]

Just looking at the training data we would say that the green line is better. The question is: how do we quantify this?

15 / 39

slide-16
SLIDE 16

Measuring Error — Mean Squared Error

Mean squared error is a common cost function for regression. In our case, given the parameters $\theta_0, \theta_1$, the cost is defined as follows:

$$J(\theta_0, \theta_1, x, y) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\theta_0,\theta_1}(x^{(i)}) - y^{(i)} \right)^2$$

We assume that we have $m$ data points, where $x^{(i)}$ is the $i$-th data point and $y^{(i)}$ is the $i$-th value we want to predict; $h_{\theta_0,\theta_1}(x^{(i)})$ is the model's prediction given $\theta_0$ and $\theta_1$. (The factor $\frac{1}{2}$ is there to make the derivatives tidier.) For our data set we get

$$J(1.0, 3.0) = 33.54 \qquad J(1.5, 2.0) = 2.43$$

Obviously the second is a better fit to the data. Question: why $(h_\theta(x) - y)^2$ and not $(h_\theta(x) - y)$, or even $|h_\theta(x) - y|$?
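You can reproduce these two values directly from the definition; a small sketch (assuming numpy):

    import numpy as np

    x = np.array([3.0, 6.0, 9.0])
    y = np.array([6.9, 12.1, 16.0])

    def J(theta0, theta1):
        m = len(x)
        # Half the mean squared error of the predictions theta0 + theta1 * x.
        return np.sum((theta0 + theta1 * x - y) ** 2) / (2 * m)

    print(round(J(1.0, 3.0), 2))  # 33.54
    print(round(J(1.5, 2.0), 2))  # 2.43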

16 / 39

slide-17
SLIDE 17

Learning

The general form of a regression learning algorithm is as follows. Given:

• training data $x = (x^{(1)}, \ldots, x^{(i)}, \ldots, x^{(m)})$ and $y = (y^{(1)}, \ldots, y^{(i)}, \ldots, y^{(m)})$;
• a set of parameters $\Theta$, where each $\theta \in \Theta$ gives rise to a hypothesis function $h_\theta(x)$;
• a loss function $J(\theta, x, y)$ that computes the error or cost of hypothesis $\theta$ on the given data $x, y$;

find a (the) value $\theta$ that minimises $J$.

17 / 39

slide-18
SLIDE 18

Linear Regression

Given $m$ data samples $x = (x^{(1)}, \ldots, x^{(m)})$ and $y = (y^{(1)}, \ldots, y^{(m)})$, we want to find $\theta_0$ and $\theta_1$ such that $J(\theta_0, \theta_1, x, y)$ is minimised. That is, we want to minimise

$$J(\theta_0, \theta_1, x, y) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\theta_0,\theta_1}(x^{(i)}) - y^{(i)} \right)^2$$

where $h_{\theta_0,\theta_1}(x) = \theta_0 + \theta_1 x$.

18 / 39

slide-19
SLIDE 19

Linear Regression — Gradient Descent

To apply gradient descent we have to compute $\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$ and $\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$.

19 / 39

slide-20
SLIDE 20

Linear Regression — Gradient Descent

For $\theta_0$ we get

$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \frac{\partial}{\partial \theta_0} \left( h_{\theta_0,\theta_1}(x^{(i)}) - y^{(i)} \right)^2$$

So how do we compute $\frac{\partial}{\partial \theta_0} \left( h_{\theta_0,\theta_1}(x^{(i)}) - y^{(i)} \right)^2$? We could expand out the square term, or use the chain rule.

20 / 39

slide-21
SLIDE 21

The Chain Rule

$$\frac{d\, f(g(x))}{dx} = f'(g(x))\, g'(x)$$

If you set $f(x) = x^2$, then (since $f'(x) = 2x$) you get

$$\frac{d\, g(x)^2}{dx} = 2\, g(x)\, g'(x)$$

21 / 39

slide-22
SLIDE 22

Linear Regression — Gradient Descent

Using the chain rule,

$$\frac{\partial}{\partial \theta_0} \left( h_{\theta_0,\theta_1}(x^{(i)}) - y^{(i)} \right)^2 = 2 \left( h_{\theta_0,\theta_1}(x^{(i)}) - y^{(i)} \right) \frac{\partial}{\partial \theta_0} \left( h_{\theta_0,\theta_1}(x^{(i)}) - y^{(i)} \right)$$

With a bit more algebra, expanding out $h$:

$$\frac{\partial}{\partial \theta_0} \left( h_{\theta_0,\theta_1}(x^{(i)}) - y^{(i)} \right) = \frac{\partial}{\partial \theta_0} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) = 1$$

For the partial derivative, anything not involving $\theta_0$ is treated as a constant and hence has derivative 0.

22 / 39

slide-23
SLIDE 23

Linear Regression — Gradient Descent

So, putting it all together, we get

$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \frac{\partial}{\partial \theta_0} \left( h_{\theta_0,\theta_1}(x^{(i)}) - y^{(i)} \right)^2$$

which equals

$$\frac{1}{2m} \sum_{i=1}^{m} 2 \left( h_{\theta_0,\theta_1}(x^{(i)}) - y^{(i)} \right)$$

23 / 39

slide-24
SLIDE 24

Linear Regression — Gradient Descent

For $\theta_1$ we go through a similar exercise:

$$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \frac{\partial}{\partial \theta_1} \left( h_{\theta_0,\theta_1}(x^{(i)}) - y^{(i)} \right)^2$$

Again we can compute the partial derivative using the chain rule:

$$\frac{\partial}{\partial \theta_1} \left( h_{\theta_0,\theta_1}(x^{(i)}) - y^{(i)} \right)^2 = 2 \left( h_{\theta_0,\theta_1}(x^{(i)}) - y^{(i)} \right) \frac{\partial}{\partial \theta_1} \left( h_{\theta_0,\theta_1}(x^{(i)}) - y^{(i)} \right)$$

With a bit more algebra:

$$\frac{\partial}{\partial \theta_1} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) = x^{(i)}$$

24 / 39

slide-25
SLIDE 25

Linear Regression — Gradient Descent

So our two partial derivatives are:

$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta_0,\theta_1}(x^{(i)}) - y^{(i)} \right) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)$$

$$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta_0,\theta_1}(x^{(i)}) - y^{(i)} \right) x^{(i)} = \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) x^{(i)}$$

25 / 39

slide-26
SLIDE 26

Linear Regression — Gradient Descent

Our simultaneous update rule for $\theta_0$ and $\theta_1$ is now

$$\theta_0 \leftarrow \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta_0,\theta_1}(x^{(i)}) - y^{(i)} \right)$$

$$\theta_1 \leftarrow \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta_0,\theta_1}(x^{(i)}) - y^{(i)} \right) x^{(i)}$$

Since the error function is quadratic, we have only one minimum. So with a suitable choice of $\alpha$ we should converge to the solution.
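A sketch of this simultaneous update on the toy data set from earlier; the learning rate and iteration count are my own choices:

    import numpy as np

    x = np.array([3.0, 6.0, 9.0])
    y = np.array([6.9, 12.1, 16.0])
    m = len(x)

    theta0, theta1, alpha = 0.0, 0.0, 0.02
    for _ in range(20_000):
        err = theta0 + theta1 * x - y              # h(x^(i)) - y^(i) for all i
        g0, g1 = err.sum() / m, (err * x).sum() / m
        # Update both parameters simultaneously.
        theta0, theta1 = theta0 - alpha * g0, theta1 - alpha * g1

    print(theta0, theta1)  # ~2.57 and ~1.52 for this data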

26 / 39

slide-27
SLIDE 27

Linear Regression — Exact Solution

Remember that at a local or global minimum we have

$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = 0 = \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$$

We can try to solve these two equations for $\theta_0$ and $\theta_1$. In the case of linear regression, we can.

27 / 39

slide-28
SLIDE 28

Linear Regression – Exact Solution

The details are not important; the reason why you can solve it is much more interesting. When you fix the data, you get two linear equations in $\theta_0$ and $\theta_1$:

$$\frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) = \theta_0 + \frac{1}{m} \sum_{i=1}^{m} \left( \theta_1 x^{(i)} - y^{(i)} \right) = 0$$

$$\frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) x^{(i)} = \frac{1}{m} \sum_{i=1}^{m} \left( \theta_0 x^{(i)} + \theta_1 (x^{(i)})^2 - y^{(i)} x^{(i)} \right) = 0$$

Since you have two equations and two unknowns, $\theta_0$ and $\theta_1$, you can use linear algebra to find a solution. This generalises to multiple dimensions and is implemented in most numerical packages.
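For example, numpy's least-squares routine solves the corresponding system directly; a sketch on the same toy data:

    import numpy as np

    x = np.array([3.0, 6.0, 9.0])
    y = np.array([6.9, 12.1, 16.0])

    # Design matrix: a column of ones (for theta0) next to the feature x.
    X = np.column_stack([np.ones_like(x), x])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(theta)  # [theta0, theta1] ~ [2.57, 1.52]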

28 / 39

slide-29
SLIDE 29

Multiple Dimensions or features

So far we have had just one feature. In general we want to model multiple features $x_1, \ldots, x_n$. Our hypotheses become

$$h_{\theta_0,\theta_1,\ldots,\theta_n}(x_1, \ldots, x_n) = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n$$

We will need vectors. Let $\theta = (\theta_0, \theta_1, \ldots, \theta_n)$ and $x = (1, x_1, \ldots, x_n)$. Then our hypothesis is simply the dot product of the two vectors:

$$h_\theta(x) = \theta \cdot x = \sum_{j=0}^{n} \theta_j x_j$$

Notice that we can absorb the constant term by adding an extra feature that is always 1. The loss or error function is then

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta \cdot x^{(i)} - y^{(i)} \right)^2$$

29 / 39

slide-30
SLIDE 30

Multiple features partial derivatives

The maths is much the same as before (do the derivation to make sure that you understand):

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Note that you are summing over the $m$ data points $x^{(1)}, \ldots, x^{(m)}$; here $x_j^{(i)}$ is the $j$-th component of the $i$-th data item. Don't forget that by convention $x_0 = 1$. Again you can find an exact solution with a bit of linear algebra.
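In vector form the whole gradient can be computed at once; a minimal numpy sketch (the data here is made up for illustration):

    import numpy as np

    def gradient(theta, X, y):
        # X is m x (n+1), with its first column all ones (the x_0 = 1 convention).
        # Returns the vector of partial derivatives dJ/dtheta_j.
        m = len(y)
        return X.T @ (X @ theta - y) / m

    X = np.array([[1.0, 2.0, 3.0],
                  [1.0, 4.0, 5.0],
                  [1.0, 6.0, 7.0]])
    y = np.array([1.0, 2.0, 3.0])
    print(gradient(np.zeros(3), X, y))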

30 / 39

slide-31
SLIDE 31

Linear Regression — Exact or Gradient Descent

The exact method involves you doing some linear algebra. With large data sets and a large number of features this might be computationally expensive. Gradient descent is often quicker with many features and big data sets. Also, it illustrates a common theme with machine learning algorithms.

31 / 39

slide-32
SLIDE 32

New Features — Scaling

Suppose your data set contains the weight of a person along with the amount of magnesium in their blood. The body weights in your sample go from 50 kg to 120 kg, while the common range for magnesium is from 0.70 to 0.95 mmol/L. Obviously the body weight measurements dominate your data. In theory your machine learning algorithm will be able to cope, but you can help things along by rescaling each feature so that all the features have a similar range and mean.
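One common choice is to standardise each feature to zero mean and unit variance; a sketch with made-up numbers:

    import numpy as np

    weight = np.array([50.0, 80.0, 120.0])    # kg
    magnesium = np.array([0.70, 0.82, 0.95])  # mmol/L

    def standardise(v):
        # Rescale to zero mean and unit standard deviation.
        return (v - v.mean()) / v.std()

    print(standardise(weight))
    print(standardise(magnesium))  # both features now have a similar range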

32 / 39

slide-33
SLIDE 33

New Features — Linear Regression

Given your data set, you are free to make new features. You are trying to fit a linear hypothesis. Suppose that your data set has three features $x_1, x_2, x_3$; you could invent a new feature $x_3 - x_2$ and feed the 4 features $x_1, x_2, x_3, x_4 = (x_3 - x_2)$ to the linear regression model. A linear combination would not enable you to learn anything new. (Why?) But you are not limited to linear features.

33 / 39

slide-34
SLIDE 34

Non-Linear features with Linear Regression

With linear regression it is possible to fit non-linear polynomials. Suppose that you are trying to fit the polynomial

$$h_{\theta_0,\theta_1,\theta_2}(x) = \theta_0 + \theta_1 x + \theta_2 x^2$$

to your data set.

34 / 39

slide-35
SLIDE 35

Non-Linear features with Linear Regression

One way of thinking about it is to transform your data into another data set:

$$x \rightarrow (x, x^2) \rightarrow (1, x, x^2)$$

We add the extra 1 so that we can use the vector $\theta = (\theta_0, \theta_1, \theta_2)$ for the parameters.
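A sketch of this transformation followed by an exact fit (numpy assumed; the sample data is made up and exactly quadratic):

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = x**2 - 2*x + 1

    # Transform each x into the feature vector (1, x, x^2).
    X = np.column_stack([np.ones_like(x), x, x**2])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(theta)  # ~[1, -2, 1]: linear regression has fitted a quadratic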

35 / 39

slide-36
SLIDE 36

Non-Linear features with Linear Regression

Computing the partial derivatives

$$\frac{\partial}{\partial \theta_0} (h_\theta(x) - y)^2 \quad \text{and} \quad \frac{\partial}{\partial \theta_1} (h_\theta(x) - y)^2$$

comes out the same as before.

36 / 39

slide-37
SLIDE 37

Non-Linear features with Linear Regression

What about $\frac{\partial}{\partial \theta_2} (h_\theta(x) - y)^2$? Again using the chain rule we get

$$\frac{\partial}{\partial \theta_2} (h_\theta(x) - y)^2 = 2 (h_\theta(x) - y) \frac{\partial}{\partial \theta_2} (h_\theta(x) - y)$$

which gives

$$2 (h_\theta(x) - y) \frac{\partial}{\partial \theta_2} \left( \theta_0 + \theta_1 x + \theta_2 x^2 - y \right)$$

Since everything that is not $\theta_2$ is treated as a constant, we get

$$\frac{\partial}{\partial \theta_2} (h_\theta(x) - y)^2 = 2 (h_\theta(x) - y) x^2$$

37 / 39

slide-38
SLIDE 38

Linear Regression — Look at your model

Explainable AI is very important. Often you do not just want a predictor; you also want to know why your model makes certain predictions. A linear model is understandable: looking at the features with the largest weights gives you some idea of which factors are important in the model. Linear models help scientists formulate new theories; if certain features are more important, this helps you to formulate scientific hypotheses. Ethical considerations are more and more important in AI. Why does my model make certain predictions? Why have I not been granted a bank loan? Is my machine learning model sexist or racist? Is my machine learning model trustworthy? See https://ec.europa.eu/digital-single-market/en/news/ethics-guidelines-trustworthy-ai

38 / 39

slide-39
SLIDE 39

Regularisation

To avoid overfitting we can add the requirement that the learnt parameters do not get too big. We do this by modifying our cost function:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{i=1}^{n} \theta_i^2$$

Again you can do gradient descent, or in some cases you can find an exact solution. We will look at this later in the course.
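A sketch of the regularised cost and its gradient (my own code, following the formula above; note that $\theta_0$ is not penalised, since the penalty sum starts at $i = 1$):

    import numpy as np

    def J_reg(theta, X, y, lam):
        # X is m x (n+1) with a leading column of ones; theta[0] is not penalised.
        m = len(y)
        err = X @ theta - y
        return (err @ err) / (2 * m) + lam * np.sum(theta[1:] ** 2)

    def grad_reg(theta, X, y, lam):
        m = len(y)
        g = X.T @ (X @ theta - y) / m
        g[1:] += 2 * lam * theta[1:]  # derivative of the penalty term
        return g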

39 / 39