9.54 class 8: Supervised learning, Optimization, regularization, kernels


SLIDE 1

9.54 class 8

Supervised learning: Optimization, regularization, kernels

Shimon Ullman + Tomaso Poggio

Danny Harari + Daniel Zysman + Darren Seibert

9.54, fall semester 2014

SLIDE 2

The Regularization Kingdom

  • Loss functions and empirical risk minimization
  • Basic regularization algorithms
SLIDE 3

Math

SLIDE 4

Given a Training Set

S = (x1, y1), . . . , (xn, yn)

Find

f(x) ∼ y

SLIDE 5

We need a way to measure errors: a loss function

V(f(x), y)

SLIDE 6

  • 0−1 loss: V(f(x), y) = θ(−yf(x)) (θ is the step function)
  • square loss (L2): V(f(x), y) = (f(x) − y)², which equals (1 − yf(x))² for y ∈ {−1, +1}
  • absolute value (L1): V(f(x), y) = |f(x) − y|
  • Vapnik's ε-insensitive loss: V(f(x), y) = (|f(x) − y| − ε)_+
  • hinge loss: V(f(x), y) = (1 − yf(x))_+
  • logistic loss: V(f(x), y) = log(1 + e^{−yf(x)}), logistic regression
  • exponential loss: V(f(x), y) = e^{−yf(x)}
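As a quick illustration (my own sketch, not from the slides), these losses in numpy; the margin-based losses assume labels y ∈ {−1, +1}:

```python
import numpy as np

# f: predicted values f(x); y: targets (y in {-1, +1} for the margin-based losses).

def zero_one_loss(f, y):
    return (y * f <= 0).astype(float)            # theta(-y f(x)), step function

def square_loss(f, y):
    return (f - y) ** 2                          # equals (1 - y f(x))^2 when y = +-1

def absolute_loss(f, y):
    return np.abs(f - y)

def eps_insensitive_loss(f, y, eps=0.1):
    return np.maximum(np.abs(f - y) - eps, 0.0)  # Vapnik's epsilon-insensitive loss

def hinge_loss(f, y):
    return np.maximum(1.0 - y * f, 0.0)

def logistic_loss(f, y):
    return np.log(1.0 + np.exp(-y * f))

def exponential_loss(f, y):
    return np.exp(-y * f)
```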
SLIDE 7

Given a loss function V(f(x), y) we can define the Empirical Error

I_S[f] = (1/n) Σ_{i=1}^n V(f(xi), yi)
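For concreteness, a minimal sketch (mine) of the empirical error of a fixed hypothesis f on a training set S, here scored with the hinge loss:

```python
import numpy as np

def hinge(f, y):
    return np.maximum(1.0 - y * f, 0.0)

def empirical_error(f, S, loss):
    """I_S[f] = (1/n) * sum_i V(f(x_i), y_i) for a training set S = [(x_i, y_i), ...]."""
    xs, ys = zip(*S)
    preds = np.array([f(x) for x in xs])
    return float(np.mean(loss(preds, np.array(ys))))

# Toy example: a fixed linear hypothesis on two labeled points.
S = [(np.array([1.0, 2.0]), 1.0), (np.array([-1.5, 0.5]), -1.0)]
f = lambda x: np.array([0.3, -0.2]) @ x
print(empirical_error(f, S, hinge))
```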

SLIDE 8

"Learning processes do not take place in vacuum."
Cucker and Smale, AMS 2001

We need to fix a Hypotheses Space

H ⊂ F = {f | f : X → Y}

SLIDE 9

  • Linear model: f(x) = Σ_{j=1}^p xj wj
  • Generalized linear models: f(x) = Σ_{j=1}^p Φ(x)j wj
  • Reproducing kernel Hilbert spaces: f(x) = Σ_{j≥1} Φ(x)j wj = Σ_{i≥1} K(x, xi) αi

K(x, x′) is a symmetric positive definite function called a reproducing kernel

parametric non-parametric

H ⊂ F = {f | f : X → Y}

SLIDE 10

  • Linear model: f(x) = Σ_{j=1}^p xj wj
  • Generalized linear models: f(x) = Σ_{j=1}^p Φ(x)j wj
  • Reproducing kernel Hilbert spaces: f(x) = Σ_{j≥1} Φ(x)j wj = Σ_{i≥1} K(x, xi) αi

K(x, x′) is a symmetric positive definite function called a reproducing kernel

parametric semi-parametric

H ⊂ F = {f | f : X → Y}

SLIDE 11

  • Linear model: f(x) = Σ_{j=1}^p xj wj
  • Generalized linear models: f(x) = Σ_{j=1}^p Φ(x)j wj
  • Reproducing kernel Hilbert spaces: f(x) = Σ_{j≥1} Φ(x)j wj = Σ_{i≥1} K(x, xi) αi

K(x, x′) is a symmetric positive definite function called a reproducing kernel

parametric semi-parametric non-parametric

H ⊂ F = {f | f : X → Y}

SLIDE 12

Empirical Risk Minimization (ERM)

min_{f∈H} I_S[f] = min_{f∈H} (1/n) Σ_{i=1}^n V(f(xi), yi)


SLIDE 14

Empirical Risk Minimization (ERM): Which is a good solution?

min_{f∈H} I_S[f] = min_{f∈H} (1/n) Σ_{i=1}^n V(f(xi), yi)

SLIDE 15

[Figure: training set → learning algorithm → hypothesis f; input x (living area of house) → f → predicted y (predicted price)]

S = (x1, y1), . . . , (xn, yn)

The training set is sampled independently and identically (i.i.d.) from a fixed unknown probability distribution p(x, y) = p(x)p(y|x)

SLIDE 16

Learning is an ill-posed problem (Jacques Hadamard)

Ill-posed problems often arise if one tries to infer general laws from few data:
  • the hypothesis space is too large
  • there are not enough data

In general ERM leads to ill-posed solutions because:
  • the solution may be too complex
  • it may not be unique
  • it may change radically when leaving one sample out

Regularization Theory provides results and techniques to restore well-posedness, that is stability (hence generalization)

SLIDE 17

  • Beyond drawings & intuitions (...) there is a deep, rigorous mathematical foundation of regularized learning algorithms (Cucker and Smale, Vapnik and Chervonenkis, ...).
  • The theory of learning is a synthesis of different fields, e.g. Computer Science (Algorithms, Complexity) and Mathematics (Optimization, Probability, Statistics).
  • Central to the theory of machine learning is the problem of understanding the conditions under which ERM can solve

inf_f E(f),   where E(f) = E_{(x,y)}[V(y, f(x))]

SLIDE 18

Algorithms: The Regularization Kingdom

  • loss functions and empirical risk minimization
  • basic regularization algorithms
SLIDE 19

(Tikhonov) Regularization

min_{f∈H} { (1/n) Σ_{i=1}^n V(yi, f(xi)) + λ R(f) } → f_S^λ

λ is the regularization parameter, R is the regularizer

  • The regularizer describes the complexity of the solution: R(f2) is bigger than R(f1) [Figure: a simple function f1 and a more complex function f2]
  • The regularization parameter determines the trade-off between complexity and empirical risk

SLIDE 20

Stability and (Tikhonov) Regularization

Consider f(x) = w^T x = Σ_{j=1}^p wj xj, and R(f) = w^T w.

ERM:

min_{f∈H} (1/n) Σ_{i=1}^n (yi − f(xi))²,   with solution w^T = Y X^T (XX^T)^{−1}

Tikhonov regularization:

min_{f∈H} { (1/n) Σ_{i=1}^n (yi − f(xi))² + λ ‖f‖² },   with solution w^T = Y X^T (XX^T + λI)^{−1}

Math
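To make the stability point concrete, here is a small numpy sketch (my own, under the slide's linear setup, with the xi as columns of X) that computes both solutions and compares how much w changes when one training point is left out; the regularized solution typically moves much less:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 6                                   # nearly as many parameters as samples
X = rng.normal(size=(p, n))                   # columns are the inputs x_i
w_true = rng.normal(size=p)
Y = w_true @ X + 0.1 * rng.normal(size=n)     # row vector of labels

def solve(X, Y, lam):
    """w^T = Y X^T (X X^T + lam*I)^(-1); lam = 0 gives plain ERM / least squares."""
    return Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(X.shape[0]))

for lam in (0.0, 0.1):
    w_all = solve(X, Y, lam)
    w_loo = solve(X[:, 1:], Y[1:], lam)       # leave the first sample out
    print("lambda =", lam, "  change in w:", np.linalg.norm(w_all - w_loo))
```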

SLIDE 21

From Linear to Semi-parametric Models

If instead of a linear model

f(x) = Σ_{j=1}^p xj wj   (linear model)

we have a generalized linear model

f(x) = Σ_{j=1}^p Φ(x)j wj   (generalized linear model)

we simply have to consider the feature matrix Xn with entries (Xn)_{ij} = Φ(xi)j, i = 1, . . . , n, j = 1, . . . , p.
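A minimal sketch of this substitution (my own, with a hypothetical polynomial feature map Φ): the regularized linear solve is unchanged, only the inputs are replaced by Φ(x):

```python
import numpy as np

def phi(x, degree=3):
    """Hypothetical feature map for a scalar input: Phi(x) = (x, x^2, x^3)."""
    return np.array([x ** d for d in range(1, degree + 1)])

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=20)
y = np.sin(3 * x) + 0.1 * rng.normal(size=20)

Xn = np.stack([phi(xi) for xi in x])      # n x p matrix, row i is Phi(x_i)
lam = 1e-2
# Tikhonov-regularized least squares in feature space
# (equivalent (Xn^T Xn + lam I)^(-1) Xn^T y form of the solution on the previous slide).
w = np.linalg.solve(Xn.T @ Xn + lam * np.eye(Xn.shape[1]), Xn.T @ y)

f = lambda xnew: phi(xnew) @ w            # f(x) = sum_j Phi(x)_j w_j
print(f(0.5))
```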

SLIDE 22

From Parametric to Nonparametric Models

Math

Some simple linear algebra shows that

w^T = Y X^T (XX^T)^{−1} = Y (X^T X)^{−1} X^T = C X^T,   since X^T (XX^T)^{−1} = (X^T X)^{−1} X^T

so that

f(x) = w^T x = C X^T x = Σ_{i=1}^n ci xi^T x

We can compute Cn or wn depending on whether n ≤ p. The above result is the most basic form of the Representer Theorem. How about nonparametric models?
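A quick numerical check (mine; the slide states the identity without regularization, and this is the regularized analogue, which holds for any shape of X): the size-p parameterization w and the size-n parameterization C give the same predictor, which is why one can work with whichever of the two is smaller:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, lam = 10, 4, 0.1
X = rng.normal(size=(p, n))     # columns are the training inputs x_i
Y = rng.normal(size=n)

w = Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(p))    # w^T = Y X^T (X X^T + lam I)^(-1)
c = Y @ np.linalg.inv(X.T @ X + lam * np.eye(n))          # C   = Y (X^T X + lam I)^(-1)

xnew = rng.normal(size=p)
print(w @ xnew, sum(ci * (xi @ xnew) for ci, xi in zip(c, X.T)))   # same value
```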

SLIDE 23

From Linear to Nonparametric Models

Note that

f(x) = Σ_{j=1}^p wj xj = Σ_{i=1}^n ci xi^T x,   with xi^T x = Σ_{j=1}^p (xi)_j xj

We can now consider a truly nonparametric model

f(x) = Σ_{j≥1} wj Φ(x)j = Σ_{i=1}^n ci K(x, xi),   where K(x, xi) = Σ_{j≥1} Φ(xi)j Φ(x)j

Math

SLIDE 24

From Linear to Nonparametric Models

We have

Cn = (Xn Xn^T + λn I)^{−1} Yn,   with (Xn Xn^T)_{ij} = xi^T xj

Cn = (Kn + λn I)^{−1} Yn,   with (Kn)_{ij} = K(xi, xj)

We can now consider a truly nonparametric model

f(x) = Σ_{j≥1} wj Φ(x)j = Σ_{i=1}^n ci K(x, xi),   where K(x, xi) = Σ_{j≥1} Φ(xi)j Φ(x)j

Math
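Putting the two displayed equations together, a minimal kernel regularized least squares sketch (my own, using a Gaussian kernel as listed on the next slide): fit c = (Kn + λn I)^{−1} y on the training set, then predict with f(x) = Σi ci K(x, xi):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=0.5):
    return np.exp(-np.linalg.norm(x - z) ** 2 / sigma ** 2)

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(30, 1))                  # training inputs x_i
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=30)

lam = 1e-2
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])   # (K_n)_{ij} = K(x_i, x_j)
c = np.linalg.solve(K + lam * np.eye(len(X)), y)                    # C_n = (K_n + lam I)^(-1) Y_n

def f(x):
    """f(x) = sum_i c_i K(x, x_i)"""
    return sum(ci * gaussian_kernel(x, xi) for ci, xi in zip(c, X))

print(f(np.array([0.3])))
```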

SLIDE 25

Kernels

  • Linear kernel: K(x, x′) = x^T x′
  • Gaussian kernel: K(x, x′) = e^{−‖x − x′‖² / σ²}, σ > 0
  • Polynomial kernel: K(x, x′) = (x^T x′ + 1)^d, d ∈ N
  • Inner product kernel / features: K(x, x′) = Σ_{j=1}^p Φ(x)j Φ(x′)j, with Φ : X → R^p
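A small numpy sketch (mine) of these kernels; it also checks, for the quadratic polynomial kernel in two dimensions, that one explicit feature map Φ reproduces it as an inner product kernel (the particular Φ below is a standard choice, not from the slides):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z                                  # K(x, x') = x^T x'

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / sigma ** 2)

def polynomial_kernel(x, z, d=2):
    return (x @ z + 1.0) ** d                     # K(x, x') = (x^T x' + 1)^d

def phi_quadratic(x):
    """Explicit feature map with Phi(x)^T Phi(x') = (x^T x' + 1)^2 for 2-d inputs."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, z), phi_quadratic(x) @ phi_quadratic(z))   # equal values
```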

SLIDE 26

Reproducing Kernel Hilbert Spaces

Given K, there exists a unique Hilbert space of functions (H, ⟨·, ·⟩) such that

  • Kx := K(x, ·) ∈ H, for all x ∈ X, and
  • f(x) = ⟨f, Kx⟩, for all x ∈ X, f ∈ H.

Note: An RKHS is equivalently defined as a Hilbert space where the evaluation functionals are continuous.

The norm of a function f(x) = Σ_{i=1}^n K(x, xi) ci is given by

‖f‖² = Σ_{i,j=1}^n K(xj, xi) ci cj

and is a natural complexity measure.

Math
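In matrix form this norm is just ‖f‖² = c^T Kn c, with Kn the Gram matrix of the expansion points; a one-line sketch (mine) on a toy Gram matrix:

```python
import numpy as np

K = np.array([[1.0, 0.5],
              [0.5, 1.0]])        # toy Gram matrix, (K_n)_{ij} = K(x_i, x_j)
c = np.array([0.7, -0.2])         # expansion coefficients of f = sum_i c_i K(., x_i)

rkhs_norm_sq = c @ K @ c          # ||f||^2 = sum_{i,j} K(x_j, x_i) c_i c_j
print(rkhs_norm_sq)
```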

SLIDE 27

Extensions: Other Loss Functions

  • V(f(x), y) = (f(x) − y)²   (RLS)
  • V(f(x), y) = (|f(x) − y| − ε)_+   (SVM regression)
  • V(f(x), y) = (1 − yf(x))_+   (SVM classification)
  • V(f(x), y) = log(1 + e^{−yf(x)})   (logistic regression)
  • V(f(x), y) = e^{−yf(x)}   (boosting)

For most loss functions the solution of Tikhonov regularization is of the form

f(x) = Σ_{i=1}^n K(x, xi) ci.

SLIDE 28

Extensions: Other Loss Functions (cont.)

By changing the loss function we change the way we compute the coefficients in the expansion

f(x) = Σ_{i=1}^n K(x, xi) ci.
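To illustrate (a sketch of my own, not an algorithm from the slides): with the square loss the coefficients come from a single linear solve, while with the logistic loss the same kernel expansion can be fit by gradient descent; the predictor keeps the form f(x) = Σi ci K(x, xi) in both cases.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] - X[:, 1])                    # labels in {-1, +1}
K = X @ X.T                                       # linear-kernel Gram matrix
n, lam = len(y), 0.1

# Square loss (RLS): closed form for min (1/n)||y - Kc||^2 + lam * c^T K c.
c_rls = np.linalg.solve(K + lam * n * np.eye(n), y)

# Logistic loss: gradient descent on (1/n) sum_i log(1 + exp(-y_i (Kc)_i)) + lam * c^T K c.
c_log = np.zeros(n)
lr = 1.0 / np.linalg.eigvalsh(K).max()
for _ in range(500):
    f = K @ c_log
    grad = K @ (-y / (1.0 + np.exp(y * f))) / n + 2 * lam * K @ c_log
    c_log -= lr * grad

# Both coefficient vectors define predictors of the form f(x) = sum_i c_i K(x, x_i).
print((np.sign(K @ c_rls) == y).mean(), (np.sign(K @ c_log) == y).mean())
```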

SLIDE 29

  • Regularization avoids overfitting, ensures stability of the solution and generalization
  • There are many different instances of regularization beyond Tikhonov, e.g. early stopping...

min_f { I_S[f] + λ R(f) }

where I_S[f] is the data fit term and R(f) is the complexity/smoothness term
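Early stopping is mentioned but not spelled out on the slide; a minimal sketch (my own) of the idea: run gradient descent on the data-fit term only and stop after a few iterations, so the number of iterations, rather than an explicit λ R(f), controls the complexity of the solution.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 50                                 # more parameters than data: plain ERM overfits
X = rng.normal(size=(n, p))
w_true = np.zeros(p); w_true[:3] = 1.0
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(p)
lr = 1e-3
for t in range(1, 2001):
    grad = X.T @ (X @ w - y) / n              # gradient of the data-fit term only
    w -= lr * grad
    if t in (10, 100, 2000):
        print(t, "iterations, ||w|| =", round(float(np.linalg.norm(w)), 3))
# Stopping early keeps ||w|| small (implicit regularization); running much longer
# lets the iterate fit the noise, which is what explicit regularization also prevents.
```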

SLIDE 30

  • Regularization ensures stability of the solution and generalization
  • There are different instances of regularization beyond Tikhonov, e.g. early stopping

SLIDE 31

Conclusions

  • Regularization Theory provides results and techniques to avoid overfitting (stability is key to generalization)
  • Regularization provides a core set of concepts and techniques to solve a variety of problems
  • Most algorithms can be seen as a form of regularization
SLIDE 67

Hebbian mechanisms can be used for biological supervised learning (Knudsen, 1990)