Machine Learning 2: Nonlinear Regression Stefano Ermon April 13, - PowerPoint PPT Presentation

Machine Learning 2: Nonlinear Regression Stefano Ermon April 13, 2016 Stefano Ermon April 13, 2016 1 / 51 Machine Learning 2: Nonlinear Regression

Non-linear regression 3 Peak Hourly Demand (GW) 2.5 2 1.5 0 20 40 60 80 100 High Temperature (F) High temperature / peak demand observations for all days in 2008-2011 Stefano Ermon April 13, 2016 2 / 51 Machine Learning 2: Nonlinear Regression

Stefano Ermon April 13, 2016 3 / 51 Machine Learning 2: Nonlinear Regression

Central idea of non-linear regression: same as linear regression, just with non-linear features  x 2  i E.g. φ ( x i ) = x i   1 Two ways to construct non-linear features: explicitly (construct actual feature vector), or implicitly (using kernels) Stefano Ermon April 13, 2016 4 / 51 Machine Learning 2: Nonlinear Regression

Observed Data 3 d = 2 Peak Hourly Demand (GW) 2.5 2 1.5 0 20 40 60 80 100 High Temperature (F) Degree 2 polynomial Stefano Ermon April 13, 2016 5 / 51 Machine Learning 2: Nonlinear Regression

Constructing explicit feature vectors Polynomial features (max degree d ) z d   z d − 1   .   ∈ R d +1 . Special case, n=1: φ ( z ) =   .     z   1 � n n � ∈ R ( n + d d ) � z b i � General case: φ ( z ) = i : b i ≤ d i =1 i =1 Stefano Ermon April 13, 2016 8 / 51 Machine Learning 2: Nonlinear Regression

1 0.5 φ i (x) 0 1 −0.5 x x 2 x 3 −1 −1 −0.5 0 0.5 1 x Plot of polynomial bases Stefano Ermon April 13, 2016 9 / 51 Machine Learning 2: Nonlinear Regression

Radial basis function (RBF) features Defined by bandwidth σ and k RBF centers µ j ∈ R n , j = 1 , . . . , k � −� z − µ j � 2 � φ j ( z ) = exp 2 σ 2 1 0.8 Feature Value 0.6 0.4 0.2 0 0 0.2 0.4 0.6 0.8 1 Input Stefano Ermon April 13, 2016 10 / 51 Machine Learning 2: Nonlinear Regression

Difficulties with non-linear features Problem #1: Computational difficulties Polynomial features, � n + d � = O ( d n ) k = d RBF features; suppose we want centers in uniform grid over input space (w/ d centers along each dimension) k = d n In both cases, exponential in the size of the input dimension; quickly intractable to even store in memory Stefano Ermon April 13, 2016 11 / 51 Machine Learning 2: Nonlinear Regression

Problem #2: Representational difficulties With many features, our prediction function becomes very expressive Can lead to overfitting (low error on input data points, but high error nearby) Let’s see an intuitive example Stefano Ermon April 13, 2016 12 / 51 Machine Learning 2: Nonlinear Regression

Observed Data Observed Data 3 3 d = 1 d = 2 Peak Hourly Demand (GW) Peak Hourly Demand (GW) 2.5 2.5 2 2 1.5 1.5 0 20 40 60 80 100 0 20 40 60 80 100 High Temperature (F) High Temperature (F) Observed Data Observed Data 3 3 d = 4 d = 50 Peak Hourly Demand (GW) Peak Hourly Demand (GW) 2.5 2.5 2 2 1.5 1.5 0 20 40 60 80 100 0 20 40 60 80 100 High Temperature (F) High Temperature (F) Least-squares fits for polynomial features of different degrees Stefano Ermon April 13, 2016 13 / 51 Machine Learning 2: Nonlinear Regression

Observed Data Observed Data 3 3 num RBFs = 2 num RBFs = 4 Peak Hourly Demand (GW) Peak Hourly Demand (GW) 2.5 2.5 2 2 1.5 1.5 0 20 40 60 80 100 0 20 40 60 80 100 High Temperature (F) High Temperature (F) Observed Data Observed Data 3 3 num RBFs = 10 num RBFs = 50, λ = 0 Peak Hourly Demand (GW) Peak Hourly Demand (GW) 2.5 2.5 2 2 1.5 1.5 0 20 40 60 80 100 0 20 40 60 80 100 High Temperature (F) High Temperature (F) Least-squares fits for different numbers of RBFs Stefano Ermon April 13, 2016 14 / 51 Machine Learning 2: Nonlinear Regression

A few ways to deal with representational problem: Choose less expressive function (e.g., lower degree polynomial, fewer RBF centers, larger RBF bandwidth) Regularization : penalize large parameters θ m � y i , y i ) + λ � θ � 2 minimize ℓ (ˆ 2 θ i =1 λ : regularization parameter, trades off between low loss and small values of θ (often, don’t regularize constant term) Stefano Ermon April 13, 2016 15 / 51 Machine Learning 2: Nonlinear Regression

6000 5000 4000 J( θ ) 3000 2000 1000 0 0 2 4 6 8 10 12 || θ || 2 Pareto optimal surface for 20 RBF functions Stefano Ermon April 13, 2016 16 / 51 Machine Learning 2: Nonlinear Regression

Observed Data Observed Data 3 3 num RBFs = 50, λ = 0 num RBFs = 50, λ = 2 Peak Hourly Demand (GW) Peak Hourly Demand (GW) 2.5 2.5 2 2 1.5 1.5 0 20 40 60 80 100 0 20 40 60 80 100 High Temperature (F) High Temperature (F) Observed Data Observed Data 3 3 num RBFs = 50, λ = 50 num RBFs = 50, λ = 1000 Peak Hourly Demand (GW) Peak Hourly Demand (GW) 2.5 2.5 2 2 1.5 1.5 0 20 40 60 80 100 0 20 40 60 80 100 High Temperature (F) High Temperature (F) RBF fits varying regularization parameter (not regularizing constant term) Stefano Ermon April 13, 2016 17 / 51 Machine Learning 2: Nonlinear Regression

Regularization : penalize large parameters θ m � y i , y i ) + λ � θ � 2 minimize ℓ (ˆ 2 θ i =1 λ : regularization parameter, trades off between low loss and small values of θ (often, don’t regularize constant term) Solve with normal equations like before 2 + λθ T θ � Φ θ − y � 2 minimize θ θ T Φ T Φ θ − 2 y T Φ θ + y T y + λθ T θ minimize θ Φ T Φ + λI θ − 2 y T Φ θ + y T y θ T � � minimize θ Setting gradient to zero θ ⋆ = (Φ T Φ + λI ) − 1 Φ T y Stefano Ermon April 13, 2016 18 / 51 Machine Learning 2: Nonlinear Regression

Evaluating algorithms How do we determine when an algorithm achieves “good” performance? How should we tune the parameters of the learning algorithms (regularization parameter, choice of features, etc?) How do we report the performance of learning algorithms? Stefano Ermon April 13, 2016 19 / 51 Machine Learning 2: Nonlinear Regression

One possibility: just look at the loss function m ℓ ( θ T φ ( x i ) , y i ) � J ( θ ) = i =1 The problem: adding more features will always decrease the loss Example example: random outputs, random features, we can get zero loss for enough features m = 500; y = randn(m,1); Phi = randn(m,m); theta = (Phi´ * Phi) \ (Phi´ * y); norm(Phi*theta - y)^2 ans = 2.3722e-22 Stefano Ermon April 13, 2016 20 / 51 Machine Learning 2: Nonlinear Regression

A better criterion: training and testing loss Training set: x i ∈ R n , y i ∈ R , i = 1 , . . . , m i ∈ R n , y ′ Testing set: x ′ i ∈ R , i = 1 , . . . , m ′ Find parameters by minimizing loss on the training set, but evaluate on the testing set m Training: θ ⋆ = arg min ℓ ( θ T φ ( x i ) , y i ) � θ i =1 Evaluation: Average Loss = 1 m ′ ℓ (( θ ⋆ ) T φ ( x ′ i ) , y ′ i ) Performance on test set called generalization performance. Stefano Ermon April 13, 2016 21 / 51 Machine Learning 2: Nonlinear Regression

Sometimes, there is a natural breakdown between training and testing data (e.g., train system on one year, test on the next) More common, simply divide the data: for example, use 70% for training, 30% for testing % Phi, y, m are all the data m train = ceil(0.7*m); m test = m - m train; p = randperm(m); Phi train = Phi(p(1:m train),:); y train = y(p(1:m train)); Phi test = Phi(p(m train+1:end),:); y test = y(p(m train+1:end)); Stefano Ermon April 13, 2016 22 / 51 Machine Learning 2: Nonlinear Regression

3 Peak Hourly Demand (GW) 2.5 2 1.5 0 20 40 60 80 100 High Temperature (F) High temperature / peak demand observations Stefano Ermon April 13, 2016 23 / 51 Machine Learning 2: Nonlinear Regression

0.09 Training set 0.08 Testing set Average squared loss 0.07 0.06 0.05 0.04 0.03 0.02 0 2 4 6 8 10 Polynomial degree, d Testing loss versus degree of polynomial Stefano Ermon April 13, 2016 24 / 51 Machine Learning 2: Nonlinear Regression

4 10 Training set Testing set Average squared loss 2 10 0 10 −2 10 0 10 20 30 40 Polynomial degree, d Testing loss (log-scale) versus degree of polynomial Stefano Ermon April 13, 2016 25 / 51 Machine Learning 2: Nonlinear Regression

0.1 Training set 0.09 Testing set 0.08 Average squared loss 0.07 0.06 0.05 0.04 0.03 0.02 2 4 6 8 10 12 Number of RBFs y Testing loss versus number of RBF bases Stefano Ermon April 13, 2016 26 / 51 Machine Learning 2: Nonlinear Regression

1 10 Training set Testing set Average squared loss 0 10 −1 10 −2 10 0 20 40 60 80 100 Number of RBFs Testing loss (log-scale) versus number of RBF bases Stefano Ermon April 13, 2016 27 / 51 Machine Learning 2: Nonlinear Regression

Machine Learning 2: Nonlinear Regression Stefano Ermon April 13, - PowerPoint PPT Presentation

Machine Learning 2: Nonlinear Regression Stefano Ermon April 13, 2016 Stefano Ermon April 13, 2016 1 / 51 Machine Learning 2: Nonlinear Regression Non-linear regression 3 Peak Hourly Demand (GW) 2.5 2 1.5 0 20 40 60 80 100 High

Kernel Methods for Regression Support Vector Regression Gaussian Mixture Regression Gaussian

Regression: Simple and Linear Introduction to Machine Learning Regression Principle REGRESSION

Nonlinear Control Lecture # 31 Nonlinear Observers Nonlinear Control Lecture # 31 Nonlinear

Nonlinear Control Lecture # 22 Special nonlinear Forms Nonlinear Control Lecture # 22 Special

Nonlinear Control Lecture # 21 Special nonlinear Forms Nonlinear Control Lecture # 21 Special

Nonlinear Control Lecture # 8 Special nonlinear Forms Nonlinear Control Lecture # 8 Special

Nonlinear Control Lecture # 12 Nonlinear Observers and Output Feedback Stabilization Nonlinear

Nonlinear Control Lecture # 20 Special nonlinear Forms Nonlinear Control Lecture # 20 Special

MACHINE LEARNING Linear and Weighted Regression Support Vector Regression 1 APPLIED MACHINE

10-601 Machine Learning Regression Outline Regression vs Classification Linear regression

Regression 3: Logistic Regression Marco Baroni Practical Statistics in R Outline Logistic

Regression Methods 1. Linear Regression and Logistic Regression: definitions, and a common

Nonlinear models Posterior Gradient Ascent Adaptive Step Size Approach to Limit Example Will

Machine Learning 1: Linear Regression Stefano Ermon March 31, 2016 Stefano Ermon March 31, 2016

Introduction to Machine Learning Introduction to Machine Learning Introduction to Machine

ADVANCED MACHINE LEARNING Non-linear regression techniques 1 1 ADVANCED MACHINE LEARNING

The probabilistic viewpoint and dynamics in arithmetic geometry Juan Rivera-Letelier Roots

Relativistic effects and non-collinear DFT What is relativistic effects? Dirac equation

Approximation power of deep networks Matus Telgarsky <mjt@illinois.edu> (with help from many

On existence and behavior of radial minimizers for the Schrdinger-Poisson-Slater problem.

Two classes of Blaschke products and their applications to operator theory Pamela Gorkin

Applied Machine Learning Applied Machine Learning Multilayer Perceptron Siamak Ravanbakhsh

Self-induced dust traps: overcoming planet formation barriers Jean-Franois Gonzalez Guillaume

Chapter 11 Rolling-Contact Bearings 11-1. bearing Types Function: Carry load in one or

Machine Learning 2: Nonlinear Regression Stefano Ermon April 13, - PowerPoint PPT Presentation

Machine Learning 2: Nonlinear Regression Stefano Ermon April 13, 2016 Stefano Ermon April 13, 2016 1 / 51 Machine Learning 2: Nonlinear Regression Non-linear regression 3 Peak Hourly Demand (GW) 2.5 2 1.5 0 20 40 60 80 100 High

Kernel Methods for Regression Support Vector Regression Gaussian Mixture Regression Gaussian

Regression: Simple and Linear Introduction to Machine Learning Regression Principle REGRESSION

Nonlinear Control Lecture # 31 Nonlinear Observers Nonlinear Control Lecture # 31 Nonlinear

Nonlinear Control Lecture # 22 Special nonlinear Forms Nonlinear Control Lecture # 22 Special

Nonlinear Control Lecture # 21 Special nonlinear Forms Nonlinear Control Lecture # 21 Special

Nonlinear Control Lecture # 8 Special nonlinear Forms Nonlinear Control Lecture # 8 Special

Nonlinear Control Lecture # 12 Nonlinear Observers and Output Feedback Stabilization Nonlinear

Nonlinear Control Lecture # 20 Special nonlinear Forms Nonlinear Control Lecture # 20 Special

MACHINE LEARNING Linear and Weighted Regression Support Vector Regression 1 APPLIED MACHINE

10-601 Machine Learning Regression Outline Regression vs Classification Linear regression

Regression 3: Logistic Regression Marco Baroni Practical Statistics in R Outline Logistic

Regression Methods 1. Linear Regression and Logistic Regression: definitions, and a common

Nonlinear models Posterior Gradient Ascent Adaptive Step Size Approach to Limit Example Will

Machine Learning 1: Linear Regression Stefano Ermon March 31, 2016 Stefano Ermon March 31, 2016

Introduction to Machine Learning Introduction to Machine Learning Introduction to Machine

ADVANCED MACHINE LEARNING Non-linear regression techniques 1 1 ADVANCED MACHINE LEARNING

The probabilistic viewpoint and dynamics in arithmetic geometry Juan Rivera-Letelier Roots

Relativistic effects and non-collinear DFT What is relativistic effects? Dirac equation

Approximation power of deep networks Matus Telgarsky &lt;mjt@illinois.edu&gt; (with help from many

On existence and behavior of radial minimizers for the Schrdinger-Poisson-Slater problem.

Two classes of Blaschke products and their applications to operator theory Pamela Gorkin

Applied Machine Learning Applied Machine Learning Multilayer Perceptron Siamak Ravanbakhsh

Self-induced dust traps: overcoming planet formation barriers Jean-Franois Gonzalez Guillaume

Chapter 11 Rolling-Contact Bearings 11-1. bearing Types Function: Carry load in one or

Approximation power of deep networks Matus Telgarsky <mjt@illinois.edu> (with help from many