

SLIDE 1

Optimization for Machine Learning
Lecture 4: Quasi-Newton Methods

S.V.N. (vishy) Vishwanathan
Purdue University
vishy@purdue.edu

July 11, 2012

SLIDE 2

The Story So Far

Two Different Philosophies
Online algorithms: use a small subset of the data at a time and cycle through it repeatedly
Batch optimization: use the entire dataset to compute gradients and function values

Gradient-Based Approaches
Bundle methods: lower-bound the objective function using gradients
Quasi-Newton algorithms: use the gradients to estimate the Hessian (build a quadratic approximation of the objective)

SLIDE 3

Outline

1. Classical Quasi-Newton Algorithms
2. Non-smooth Problems
3. BFGS with Subgradients
4. Experiments

SLIDE 4

Classical Quasi-Newton Algorithms

Broyden, Fletcher, Goldfarb, Shanno

SLIDE 5

Classical Quasi-Newton Algorithms

Standard BFGS - I

Locally Quadratic Approximation
$\nabla J(w_t)$ is the gradient of $J$ at $w_t$, and $H_t$ is an $n \times n$ estimate of the Hessian of $J$:
$$m_t(w) = J(w_t) + \langle \nabla J(w_t), w - w_t \rangle + \tfrac{1}{2}(w - w_t)^\top H_t (w - w_t)$$

Parameter Update
Minimizing the model exactly gives
$$w_{t+1} = \operatorname{argmin}_w \; m_t(w) = w_t - H_t^{-1} \nabla J(w_t)$$
In practice the step is damped:
$$w_{t+1} = w_t - \eta_t B_t \nabla J(w_t)$$
$\eta_t$ is a step size, usually found via a line search; $B_t = H_t^{-1}$ is a symmetric PSD matrix.
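To make the update concrete, here is a minimal numpy sketch of one quasi-Newton step. The deck does not prescribe an interface, so `grad_J` and `line_search` are assumed callables and the names are illustrative; a Wolfe line search itself is sketched after Slide 7.

```python
import numpy as np

def quasi_newton_step(w, grad_J, B, line_search):
    """One BFGS-style update: w <- w + eta * d with d = -B grad J(w)."""
    g = grad_J(w)                # gradient at the current iterate
    d = -B @ g                   # search direction from the inverse-Hessian estimate B
    eta = line_search(w, d)      # step size, e.g. satisfying the Wolfe conditions
    return w + eta * d
```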

SLIDE 6

Classical Quasi-Newton Algorithms

Standard BFGS - II

B Matrix Update
Update $B$ by
$$B_{t+1} = \operatorname{argmin}_B \; \|B - B_t\|_W \quad \text{s.t.} \quad s_t = B y_t$$
$y_t = \nabla J(w_{t+1}) - \nabla J(w_t)$ is the difference of gradients; $s_t = w_{t+1} - w_t$ is the difference in parameters.

This yields the update formula
$$B_{t+1} = \left( I - \frac{s_t y_t^\top}{\langle s_t, y_t \rangle} \right) B_t \left( I - \frac{y_t s_t^\top}{\langle s_t, y_t \rangle} \right) + \frac{s_t s_t^\top}{\langle s_t, y_t \rangle}$$

Limited-memory variant: use a low-rank approximation to $B$.
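The closed-form update translates directly into code. A sketch, with the usual safeguard of skipping the update when the curvature $\langle s_t, y_t \rangle$ is not safely positive (the safeguard is my assumption, not something stated on the slide):

```python
import numpy as np

def bfgs_update(B, s, y, eps=1e-10):
    """BFGS update of the inverse-Hessian estimate B.

    s = w_{t+1} - w_t,  y = grad J(w_{t+1}) - grad J(w_t).
    Implements B <- (I - s y^T/<s,y>) B (I - y s^T/<s,y>) + s s^T/<s,y>.
    """
    sy = s @ y
    if sy <= eps:                      # curvature too small: skip to keep B PSD
        return B
    V = np.eye(len(s)) - np.outer(s, y) / sy
    return V @ B @ V.T + np.outer(s, s) / sy
```

The limited-memory variant (L-BFGS) never forms $B$ explicitly; it keeps the last few $(s, y)$ pairs and applies the update implicitly.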

SLIDE 7

Classical Quasi-Newton Algorithms

Line Search: Wolfe Conditions
Sufficient decrease: $J(w_t + \eta_t d_t) \le J(w_t) + c_1 \eta_t \langle \nabla J(w_t), d_t \rangle$
Curvature condition: $\langle \nabla J(w_t + \eta_t d_t), d_t \rangle \ge c_2 \langle \nabla J(w_t), d_t \rangle$
where $0 < c_1 < c_2 < 1$.
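A direct transcription of the two conditions as a predicate. The defaults $c_1 = 10^{-4}$ and $c_2 = 0.9$ are conventional choices for quasi-Newton methods, not values fixed by the slide:

```python
def wolfe_conditions_hold(J, grad_J, w, d, eta, c1=1e-4, c2=0.9):
    """Check sufficient decrease and the curvature condition for step eta."""
    g0_d = grad_J(w) @ d             # directional derivative at w (should be < 0)
    decrease = J(w + eta * d) <= J(w) + c1 * eta * g0_d
    curvature = grad_J(w + eta * d) @ d >= c2 * g0_d
    return decrease and curvature
```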

SLIDE 8

Outline

1. Classical Quasi-Newton Algorithms
2. Non-smooth Problems
3. BFGS with Subgradients
4. Experiments

SLIDE 9

Non-smooth Problems

Non-smooth Convex Optimization
BFGS assumes that the objective function is smooth.
But some of our losses look like this:
[Figure: a non-smooth convex loss with kinks]
Houston, we have a problem!

SLIDE 10

Non-smooth Problems

Subgradients
A subgradient of $f$ at $x'$ is any vector $s$ which satisfies
$$f(x) \ge f(x') + \langle x - x', s \rangle \quad \text{for all } x.$$
The set of all subgradients at $x'$ is the subdifferential, denoted $\partial f(x')$.
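As a concrete instance of the definition, here is one valid subgradient of the regularized hinge-loss objective that reappears on Slide 18. At examples whose margin is exactly 1 the hinge is kinked, and picking the zero branch there still yields a valid subgradient; the code is a sketch with illustrative names:

```python
import numpy as np

def hinge_subgradient(w, X, y, lam):
    """One element of the subdifferential of
    J(w) = lam/2 ||w||^2 + (1/m) sum_i max(0, 1 - y_i <x_i, w>)."""
    margins = 1.0 - y * (X @ w)
    active = margins > 0                       # examples with positive loss
    return lam * w - (X[active].T @ y[active]) / len(y)
```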

SLIDE 11

Non-smooth Problems

Why is Non-Smooth Optimization Hard?

The Key Difficulties
A negative subgradient direction is not necessarily a descent direction
Abrupt changes in function value can occur
It is difficult to detect convergence

[Figure: $f(x) = |x|$, with $\partial f(0) = [-1, 1]$]

SLIDE 12

Non-smooth Problems

Subgradients: The Good, the Bad, and the Ugly
The subdifferential is a convex set
Not every subgradient is a descent direction!
$d$ is a descent direction if, and only if, $\langle d, s \rangle < 0$ for all $s \in \partial f(x)$
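Since $\langle \cdot, d \rangle$ is linear and the subdifferential is convex, the descent condition only needs to be checked at the extreme points of the subdifferential. A small sketch using the running example $f(x) = |x|$ at $0$, where $\partial f(0) = [-1, 1]$:

```python
import numpy as np

def is_descent_direction(d, subgrad_extremes):
    """d is a descent direction iff <s, d> < 0 for every s in the
    subdifferential; by linearity, checking its extreme points suffices."""
    return max(s @ d for s in subgrad_extremes) < 0

# f(x) = |x| at x = 0: extreme subgradients are -1 and +1,
# so no direction descends; consistent with 0 being the minimizer.
extremes = [np.array([-1.0]), np.array([1.0])]
print(is_descent_direction(np.array([1.0]), extremes))    # False
print(is_descent_direction(np.array([-1.0]), extremes))   # False
```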

SLIDE 13

Outline

1. Classical Quasi-Newton Algorithms
2. Non-smooth Problems
3. BFGS with Subgradients
4. Experiments

SLIDE 14

BFGS with Subgradients

When Working with Subgradients, Three Things Break Down
The locally quadratic approximation is no longer well defined
The descent direction $-B_t \nabla J(w_t)$ is not well defined
The line search to find $\eta_t$ needs to be modified

SLIDE 15

BFGS with Subgradients

Changing the Approximation
The smooth model used the gradient $\nabla J(w_t)$. Substituting a single subgradient $s \in \partial J(w_t)$ leaves the model ill defined, since the choice of $s$ is arbitrary; instead, take the supremum over the whole subdifferential.

Locally (pseudo) Quadratic Approximation
$$m_t(w) = \sup_{s \in \partial J(w_t)} \Big\{ J(w_t) + \langle s, w - w_t \rangle + \tfrac{1}{2}(w - w_t)^\top H_t (w - w_t) \Big\}$$
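A one-dimensional worked example (my addition, not from the slides) shows what the supremum buys. Take $J(w) = |w|$ at $w_t = 0$ with Hessian estimate $H_t = h > 0$; since $\partial J(0) = [-1, 1]$,

$$m_t(w) = \sup_{s \in [-1,1]} \Big\{ \langle s, w \rangle + \tfrac{h}{2} w^2 \Big\} = |w| + \tfrac{h}{2} w^2,$$

which is minimized at $w = 0$, the true minimizer. A model built from any single subgradient $s$ with $|s| < 1$ would instead have its minimum at $w = -s/h$, a spurious step away from the optimum.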

SLIDE 16

BFGS with Subgradients

Descent Direction Finding

Locally (pseudo) Quadratic Approximation
$$w_{t+1} = \operatorname{argmin}_w \sup_{s \in \partial J(w_t)} \Big\{ J(w_t) + \langle s, w - w_t \rangle + \tfrac{1}{2}(w - w_t)^\top H_t (w - w_t) \Big\}$$

The supremum is intractable in general, so approximate it with a finite, growing set of subgradients $s_1, \dots, s_k \in \partial J(w_t)$:
$$w^k_{t+1} = \operatorname{argmin}_w \; \tfrac{1}{2}(w - w_t)^\top H_t (w - w_t) + \xi \quad \text{s.t.} \quad J(w_t) + \langle s_i, w - w_t \rangle \le \xi \quad \text{for } s_1, \dots, s_k \in \partial J(w_t)$$

In terms of the direction $d = w - w_t$ this is the quadratic program
$$P_k: \quad \min_{d, \xi} \; \tfrac{1}{2} d^\top H_t d + \xi \quad \text{s.t.} \quad J(w_t) + \langle s_i, d \rangle \le \xi \quad \text{for } s_1, \dots, s_k \in \partial J(w_t)$$

SLIDE 17

BFGS with Subgradients

Descent Direction Finding

Recall the quadratic program from the previous slide:
$$P_k: \quad \min_{d, \xi} \; \tfrac{1}{2} d^\top H_t d + \xi \quad \text{s.t.} \quad J(w_t) + \langle s_i, d \rangle \le \xi \quad \text{for } s_1, \dots, s_k \in \partial J(w_t)$$

Parameter Update
Require: maxitr
1: k ← 1, d_1 ← −B_t s_1 for some arbitrary s_1 ∈ ∂J(w_t)
2: repeat
3:   s_k ← argsup_{s ∈ ∂J(w_t)} ⟨s, d_k⟩
4:   if ⟨s_k, d_k⟩ < 0 then
5:     return d_k
6:   else
7:     d_{k+1} ← argmin_d P_k(d); k ← k + 1
8:   end if
9: until k ≥ maxitr
10: return failure
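A sketch of this loop in Python. The subdifferential oracle `argsup` is left to the caller, and scipy's general-purpose SLSQP solver stands in for whatever specialized QP solver one would use in practice; all names are illustrative rather than taken from the paper:

```python
import numpy as np
from scipy.optimize import minimize

def find_descent_direction(H, J_wt, s1, argsup, maxitr=20):
    """Descent-direction finding with subgradients (sketch of the slide's loop).

    H      : Hessian estimate H_t (PSD); B_t = H^{-1}
    J_wt   : J(w_t), which enters the constraints of P_k
    s1     : an arbitrary initial subgradient in dJ(w_t)
    argsup : oracle d -> argsup_{s in dJ(w_t)} <s, d>
    """
    subgrads = [np.asarray(s1)]
    d = -np.linalg.solve(H, s1)                # d_1 = -B_t s_1
    for _ in range(maxitr):
        s = argsup(d)                          # worst-case subgradient for d
        if s @ d < 0:                          # descent direction found
            return d
        subgrads.append(np.asarray(s))
        d = solve_Pk(H, J_wt, subgrads, d)     # d_{k+1} = argmin_d P_k
    return None                                # failure

def solve_Pk(H, J_wt, subgrads, d0):
    """P_k: min_{d, xi} 1/2 d^T H d + xi  s.t.  J_wt + <s_i, d> <= xi."""
    n = len(d0)
    fun = lambda z: 0.5 * z[:n] @ H @ z[:n] + z[n]
    cons = [{"type": "ineq", "fun": lambda z, s=s: z[n] - J_wt - s @ z[:n]}
            for s in subgrads]
    xi0 = J_wt + max(s @ d0 for s in subgrads)   # feasible starting slack
    res = minimize(fun, np.append(d0, xi0), method="SLSQP", constraints=cons)
    return res.x[:n]
```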

SLIDE 18

BFGS with Subgradients

The Hinge Loss Revisited

The objective function
$$J(w) := \frac{\lambda}{2} \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \max\big(0, 1 - y_i \langle x_i, w \rangle\big)$$

Plotted along any direction, $J$ looks like a smooth curve at full scale; zoomed in, the hinges become visible.
[Figures: $J$ along a search direction, at full scale and zoomed in near a hinge]

Piecewise quadratic $\Longrightarrow$ exact line search in linear time.
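The exact line search exploits exactly this structure. Along a direction $d$, $\phi(\eta) = J(w + \eta d)$ is piecewise quadratic with one breakpoint per hinge, and $\phi'$ is piecewise linear and nondecreasing, so it suffices to sweep the breakpoints until $\phi'$ crosses zero. A sketch assuming $\lambda > 0$ and $d \ne 0$; it sorts the breakpoints, so it is $O(m \log m)$ rather than the linear time quoted on the slide, which would need a median-finding variant:

```python
import numpy as np

def hinge_exact_line_search(w, d, X, y, lam):
    """Exactly minimize phi(eta) = J(w + eta d) over eta >= 0 for
    J(w) = lam/2 ||w||^2 + (1/m) sum_i max(0, 1 - y_i <x_i, w>).

    With a_i = y_i<x_i,w> and b_i = y_i<x_i,d>,
        phi'(eta) = lam (<w,d> + eta ||d||^2) - (1/m) sum_{i active} b_i,
    where hinge i is active while 1 - a_i - eta b_i > 0.
    """
    m = len(y)
    a = y * (X @ w)
    b = y * (X @ d)
    wd, dd = w @ d, d @ d

    active = (1.0 - a > 0) | ((1.0 - a == 0) & (b < 0))   # state at eta = 0+
    S = b[active].sum()

    def piece_zero(S):                     # zero of the current linear piece
        return (S / m - lam * wd) / (lam * dd)

    with np.errstate(divide="ignore", invalid="ignore"):
        breaks = np.where(b != 0, (1.0 - a) / b, np.inf)
    idx = np.where(np.isfinite(breaks) & (breaks > 0))[0]
    idx = idx[np.argsort(breaks[idx])]     # sweep breakpoints left to right

    prev = 0.0
    for i in idx:
        eta = piece_zero(S)
        if eta < prev:                     # phi' crossed zero at the breakpoint
            return prev
        if eta <= breaks[i]:               # crossing inside the current piece
            return eta
        S -= abs(b[i])                     # passing a breakpoint toggles hinge i
        prev = breaks[i]
    return max(piece_zero(S), prev)        # crossing in the final piece
```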

SLIDE 19

Outline

1. Classical Quasi-Newton Algorithms
2. Non-smooth Problems
3. BFGS with Subgradients
4. Experiments

SLIDE 20

Experiments

Why Not Just Use BFGS?
Leukemia: 38 train, 34 test, 7129 dimensions
real-sim: 57763 train, 14438 test, 20958 dimensions
[Figure: CPU time vs. objective function value]

SLIDE 21

Experiments

subBFGS: Results on a Simple Problem

The Problem
$$J(w_1, w_2) = 10\,|w_1| + |w_2|$$
A particularly evil problem for BFGS!
[Figure: surface plot of $J$]

SLIDE 22

Experiments

subBFGS: Results on a Simple Problem

BFGS
[Figures: BFGS iterates on $J(w_1, w_2) = 10|w_1| + |w_2|$, full view and zoomed in]
Hops from orthant to orthant
Slow convergence :(

subBFGS
[Figure: subBFGS iterates on the same problem]
Exact line search
Converges in 2 iterations :)

SLIDE 23

Experiments

Are Our Modifications Helpful?
INEX: 6053 train, 6054 test, 167295 dimensions, 18 classes
TMC2007: 21519 train, 7077 test, 30438 dimensions, 22 classes
[Figures: CPU time vs. objective function value for INEX ($\lambda = 10^{-6}$) and TMC2007 ($\lambda = 10^{-5}$), comparing GD, subGD, and subLBFGS]

SLIDE 24

Experiments

On a Simple Toy Problem
[Figures: BFGS approximation to the objective function and gradient; the BFGS quadratic model against the piecewise linear function, and the gradient of the BFGS model against the piecewise constant gradient]

SLIDE 25

Experiments

Results on Some Standard Datasets
Covertype: 522911 train, 58101 test, 54 dimensions
[Figure: CPU time vs. objective function value for Covertype ($\lambda = 10^{-6}$), comparing BMRM, OCAS, and subLBFGS]

SLIDE 26

Experiments

Results on Some Standard Datasets
CCAT: 781265 train, 23149 test, 47236 dimensions
[Figure: CPU time vs. objective function value for CCAT ($\lambda = 10^{-6}$), comparing BMRM, OCAS, and subLBFGS]

SLIDE 27

Experiments

The Pros and Cons of subBFGS

Quasi-Newton Philosophy
Use the gradients to build a quadratic approximation
Initially this approximation is a good fit, so initial progress is rapid
Closer to the optimum the hinges matter, so progress slows down near the optimum

Line Search
subBFGS requires a line search which fulfills the Wolfe conditions
For the binary and multiclass hinge loss, an exact line search is cheap
Can we do a cheap line search for structured losses?

SLIDE 28

References

J. Yu, S. V. N. Vishwanathan, S. Günter, and N. N. Schraudolph. A quasi-Newton approach to nonsmooth convex optimization. Submitted to JMLR (short version in ICML 2008).

A. Smola, S. V. N. Vishwanathan, and Q. Le. Bundle methods for machine learning. Submitted to JMLR (short version in NIPS 2007).

C-H. Teo, Q. Le, A. Smola, and S. V. N. Vishwanathan. A scalable modular convex solver for regularized risk minimization. KDD 2007.