
Linear Regression
Dan Sheldon

A First Supervised Learning Problem

How do you measure the biomass of a forest?

Hard to measure:
◮ Mass of tree
◮ Height of tree (but can be done)

Easy to measure:
◮ Diameter at breast height (DBH)

Let’s simplify the problem: devise method to easily estimate the height of a tree

A First Supervised Learning Problem

Idea:
◮ Collect data on DBH and height for some trees
◮ Determine relationship between DBH and height
◮ Use DBH to predict height for a new tree

This is supervised learning:
◮ Collect training data
◮ Use a learning algorithm to fit a model
◮ Use model to make a prediction

Illustration on board: supervised learning

Some data

What do you predict for the height of a tree with DBH 15cm? 35cm? Why?

Supervised Learning

DBH (x)    Height (y)
17         63
19         65
20.5       66
···        ···

Find $h$ such that $h(x) \approx y$

What model? What algorithm? Largely what this class is about.

Supervised Learning: Notation and Terminology

◮ Observe $m$ “training examples” of form $(x^{(i)}, y^{(i)})$
◮ $x^{(i)}$: features / input / what we observe / DBH
◮ $y^{(i)}$: target / output / what we want to predict / height
◮ Training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$
◮ Find function (“hypothesis”) $h$ such that $h(x) \approx y$
    ◮ $h(x^{(i)}) \approx y^{(i)}$: good fit on training data
    ◮ Generalize well to new $x$ values

Variations: type of $x$, $y$, $h$

Linear Regression in One Variable

First example of supervised learning. Assume hypothesis is a linear function:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

◮ $\theta_0$: intercept, $\theta_1$: slope
◮ “parameters” or “weights”

How to find “best” $\theta_0$, $\theta_1$?

Illustration: hypotheses.

Finding the best hypothesis

Simplification: “slope-only” model $h_\theta(x) = \theta_1 x$
◮ We only need to find $\theta_1$

Idea: design cost function $J(\theta_1)$ to numerically measure the quality of hypothesis $h_\theta(x)$

Exercise: which cost functions below make sense?

A. $J(\theta_1) = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
B. $J(\theta_1) = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
C. $J(\theta_1) = \sum_{i=1}^{m} \left| h_\theta(x^{(i)}) - y^{(i)} \right|$

1. A only
2. B only
3. C only
4. B and C
5. A, B, and C

Answer. 4

Squared Error Cost Function

The “squared error” cost function is:

$$J(\theta_1) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

◮ E.g., $\theta_1 = 3$:

x       y     (3x − y)² / 2
17      63    (51 − 63)² / 2 = 144 / 2
19      65    (57 − 65)² / 2 = 64 / 2
20.5    66    (61.5 − 66)² / 2 = 20.25 / 2

$$J(3) = (144 + 64 + 20.25) / 2 = 228.25 / 2 = 114.125$$

Our First Algorithm

We can use calculus to find the hypothesis of minimum cost. Set the derivative of $J$ to zero and solve for $\theta_1$. For this example:

$$J(\theta_1) = \frac{1}{2} \left[ (17 \theta_1 - 63)^2 + (19 \theta_1 - 65)^2 + (20.5 \theta_1 - 66)^2 \right] = 535.125\, \theta_1^2 - 3659\, \theta_1 + 6275$$

$$0 = \frac{d}{d\theta_1} J(\theta_1) = 1070.25\, \theta_1 - 3659$$

$$\theta_1 = \frac{3659}{1070.25} = 3.4188$$

(See http://www.wolframalpha.com)

Our First Algorithm In Action

[Figure: height (in.) plotted against knee height (in.)]
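The slope-only calculation above can be reproduced in a few lines of Python. This is a minimal sketch using the three training examples (17, 63), (19, 65), (20.5, 66) from the slides; the function names `cost` and `best_slope` are illustrative choices, not part of the lecture.

```python
# Training examples from the slides: x = DBH, y = height.
xs = [17.0, 19.0, 20.5]
ys = [63.0, 65.0, 66.0]


def cost(theta1, xs, ys):
    """Squared-error cost J(theta1) = 1/2 * sum((theta1 * x - y)^2)."""
    return 0.5 * sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys))


def best_slope(xs, ys):
    """Solve dJ/dtheta1 = sum(x^2) * theta1 - sum(x * y) = 0 for theta1."""
    sum_xx = sum(x * x for x in xs)               # 1070.25 for this data
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # 3659 for this data
    return sum_xy / sum_xx


theta1_star = best_slope(xs, ys)
print(cost(3.0, xs, ys))        # cost of the hypothesis h(x) = 3x
print(round(theta1_star, 4))    # 3.4188
```

Evaluating `cost` at a few slopes on either side of `theta1_star` is a quick way to confirm that it is a minimum rather than some other stationary point.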

The General Algorithm

In general, we don’t want to plug numbers into $J(\theta_1)$ and solve a calculus problem every time.

Instead, we can solve for $\theta_1$ in terms of $x^{(i)}$ and $y^{(i)}$.

The general problem: find $\theta_1$ to minimize

$$J(\theta_1) = \frac{1}{2} \sum_{i=1}^{m} \left( \theta_1 x^{(i)} - y^{(i)} \right)^2$$

You will solve this in HW1.

Two Problems Remain

Problem one: we only fit the slope. What if $\theta_0 \neq 0$?

Problem two: we will need a better optimization algorithm than “Set $\frac{d}{d\theta} J(\theta) = 0$ and solve for $\theta$.”
◮ Wiggly functions
◮ Equation(s) may be non-linear, hard to solve

Exercise: ideas for problem one?

Solution to Problem One

Design a cost function that takes two parameters:

$$J(\theta_0, \theta_1) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{1}{2} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$$

Find $\theta_0$, $\theta_1$ to minimize $J(\theta_0, \theta_1)$

Functions of multiple variables!

Here is an example cost function:

$$J(\theta_0, \theta_1) = \frac{1}{2} (\theta_0 + 17 \theta_1 - 63)^2 + \frac{1}{2} (\theta_0 + 19 \theta_1 - 65)^2 + \frac{1}{2} (\theta_0 + 20.5 \theta_1 - 66)^2 + \frac{1}{2} (\theta_0 + 18.9 \theta_1 - 62.9)^2 + \cdots$$

Gain intuition on http://www.wolframalpha.com
◮ Surface plot
◮ Contour plot

Solution to Problem Two: Gradient Descent

◮ Gradient descent is a general purpose optimization algorithm. A “workhorse” of ML.
◮ Idea: repeatedly take steps in steepest downhill direction, with step length proportional to “slope”
◮ Illustration: contour plot and pictorial definition of gradient descent

Gradient Descent

To minimize a function $J(\theta_0, \theta_1)$ of two variables
◮ Initialize $\theta_0$, $\theta_1$ arbitrarily
◮ Repeat until convergence

$$\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$$
$$\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$$

◮ $\alpha$ = step-size or learning rate (not too big)
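The gradient descent loop above can be sketched generically in Python. Everything specific here is an assumption made for illustration: the test function `bowl` is a made-up two-variable quadratic (not the lecture's cost function), the partial derivatives are approximated by central finite differences rather than derived by hand, and the step size, tolerance, and iteration cap are arbitrary choices.

```python
def numerical_partials(J, t0, t1, eps=1e-6):
    """Approximate both partial derivatives of J at (t0, t1) by central differences."""
    d0 = (J(t0 + eps, t1) - J(t0 - eps, t1)) / (2 * eps)
    d1 = (J(t0, t1 + eps) - J(t0, t1 - eps)) / (2 * eps)
    return d0, d1


def gradient_descent(J, t0=0.0, t1=0.0, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Repeat the two update rules until both parameters stop changing."""
    for _ in range(max_iters):
        d0, d1 = numerical_partials(J, t0, t1)
        # Compute both new values before assigning, so the update is simultaneous.
        new_t0 = t0 - alpha * d0
        new_t1 = t1 - alpha * d1
        converged = abs(new_t0 - t0) < tol and abs(new_t1 - t1) < tol
        t0, t1 = new_t0, new_t1
        if converged:
            break
    return t0, t1


def bowl(t0, t1):
    """Hypothetical cost function with its minimum at (1, -3)."""
    return (t0 - 1.0) ** 2 + 2.0 * (t1 + 3.0) ** 2


t0, t1 = gradient_descent(bowl)
print(t0, t1)  # close to 1 and -3
```

Raising `alpha` well above 0.1 on this function makes the iterates overshoot and eventually diverge, which is one way to see why the step size should be "not too big".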

Partial derivatives

◮ The partial derivative with respect to $\theta_j$ is denoted $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$
◮ Treat all other variables as constants, then take derivative
◮ Example

$$\frac{\partial}{\partial u} 5 u^2 v^3 = 5 v^3 \frac{\partial}{\partial u} u^2 = 5 v^3 \cdot 2u = 10 v^3 u$$

$$\frac{\partial}{\partial v} 5 u^2 v^3 = \; ??$$

Partial derivative intuition

Interpretation of partial derivative: $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ is the rate of change along the $\theta_j$ axis

Example: illustrate function with elliptical contours
◮ Sign of $\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$?
◮ Sign of $\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$?
◮ Which has larger absolute value?

Gradient Descent

◮ Repeat until convergence

$$\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$$
$$\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$$

◮ Issues (explore in HW1)
    ◮ Pitfalls
    ◮ How to set step-size $\alpha$?
    ◮ How to diagnose convergence?

The Result in Our Problem

[Figure: height (in.) plotted against knee height (in.)]

$$h_\theta(x) = 39.75 + 1.25x$$

Gradient descent intuition

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \quad \text{for } j = 0, 1$$

◮ Why does this move in the direction of steepest descent?
◮ What would we do if we wanted to maximize $J(\theta_0, \theta_1)$ instead?

Gradient descent for linear regression

Algorithm

$$\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$$
$$\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$$

Cost function

$$J(\theta_0, \theta_1) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

We need to calculate partial derivatives.
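The rule "treat all other variables as constants, then take the derivative" can be checked numerically. The sketch below compares the worked result for the partial derivative of 5u²v³ with respect to u, namely 10v³u, against a central finite-difference estimate. The helper `partial` and the evaluation point (u, v) = (2, 3) are assumptions chosen for illustration.

```python
def f(u, v):
    """The example function from the slide: 5 * u^2 * v^3."""
    return 5 * u**2 * v**3


def partial(func, point, index, eps=1e-6):
    """Estimate the partial derivative of func at point along one coordinate.

    Only the coordinate at `index` is perturbed; all others are held constant.
    """
    lo = list(point)
    hi = list(point)
    lo[index] -= eps
    hi[index] += eps
    return (func(*hi) - func(*lo)) / (2 * eps)


u, v = 2.0, 3.0
numeric_du = partial(f, (u, v), 0)   # perturb u only, hold v fixed
analytic_du = 10 * v**3 * u          # worked result from the slide
print(numeric_du, analytic_du)
```

Calling `partial(f, (u, v), 1)` gives a numerical value to compare against your own answer for the derivative with respect to v.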

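Linear regression partial derivatives

Let’s first do this with a single training example $(x, y)$:

$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2} \left( h_\theta(x) - y \right)^2$$
$$= 2 \cdot \frac{1}{2} \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( h_\theta(x) - y \right)$$
$$= \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( \theta_0 + \theta_1 x - y \right)$$

So we get

$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \left( h_\theta(x) - y \right)$$
$$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \left( h_\theta(x) - y \right) x$$

Linear regression partial derivatives

More generally, with many training examples (work this out):

$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$
$$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$

So the algorithm is:

$$\theta_0 := \theta_0 - \alpha \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$
$$\theta_1 := \theta_1 - \alpha \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$

Demo: parameter space vs. hypotheses

Show gradient descent demo

Summary

◮ What to know
    ◮ Supervised learning setup
    ◮ Cost function
        ◮ Convert a learning problem to an optimization problem
        ◮ Squared error
    ◮ Gradient descent
◮ Next time
    ◮ More on gradient descent
    ◮ Linear algebra review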
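The update rules derived for linear regression can be sketched in Python as follows. This is an illustration under stated assumptions, not the lecture's demo: it uses only the three training examples listed earlier, starts from θ0 = θ1 = 0, and uses a step size of 0.001 and 1000 iterations chosen by hand. With so few points and uncentered x values, progress toward the minimum is slow, so the sketch only shows that the cost decreases at every step.

```python
# Three training examples from the slides: x = DBH, y = height.
xs = [17.0, 19.0, 20.5]
ys = [63.0, 65.0, 66.0]


def h(theta0, theta1, x):
    """Linear hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1):
    """Squared-error cost J(theta0, theta1)."""
    return 0.5 * sum((h(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))


def gradient(theta0, theta1):
    """Partial derivatives of J, using the sums derived above."""
    errors = [h(theta0, theta1, x) - y for x, y in zip(xs, ys)]
    d0 = sum(errors)                                # dJ/dtheta0
    d1 = sum(e * x for e, x in zip(errors, xs))     # dJ/dtheta1
    return d0, d1


def gradient_descent(alpha=0.001, iters=1000):
    """Run a fixed number of simultaneous updates; record the cost after each."""
    theta0, theta1 = 0.0, 0.0
    history = [cost(theta0, theta1)]
    for _ in range(iters):
        d0, d1 = gradient(theta0, theta1)
        theta0, theta1 = theta0 - alpha * d0, theta1 - alpha * d1
        history.append(cost(theta0, theta1))
    return theta0, theta1, history


theta0, theta1, history = gradient_descent()
print(history[0], history[-1])  # cost before and after the updates
```

Plotting `history` against the iteration number is one simple way to approach the "how to diagnose convergence" question raised earlier.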
