

SLIDE 1

Second order machine learning

Michael W. Mahoney, ICSI and Department of Statistics, UC Berkeley

SLIDE 2

Outline

Machine Learning’s “Inverse” Problem. Your choice:

1st Order Methods: FLAG n’ FLARE
  - disentangle geometry from the sequence of iterates

2nd Order Methods: Stochastic Newton-Type Methods
  - “simple” methods for convex problems
  - “more subtle” methods for non-convex problems

SLIDE 3

Introduction

Big Data ... Massive Data ...

SLIDE 4

Introduction

Humongous Data ...

SLIDE 5

Introduction

Big Data

How do we view BIG data?

SLIDE 6

Introduction

Algorithmic & Statistical Perspectives ...

Computer Scientists
  - Data: are a record of everything that happened.
  - Goal: process the data to find interesting patterns and associations.
  - Methodology: develop approximation algorithms under different models of data access, since the goal is typically computationally hard.

Statisticians (and Natural Scientists, etc.)
  - Data: are a particular random instantiation of an underlying process describing unobserved patterns in the world.
  - Goal: extract information about the world from noisy data.
  - Methodology: make inferences (perhaps about unseen events) by positing a model that describes the random variability of the data around the deterministic model.

SLIDE 7

Introduction

... are VERY different paradigms

Statistics, natural sciences, scientific computing, etc.:
  - Problems often involve computation, but the study of computation per se is secondary
  - Only makes sense to develop algorithms for well-posed problems¹
  - First, write down a model, and think about computation later

Computer science:
  - Easier to study computation per se in discrete settings, e.g., Turing machines, logic, complexity classes
  - Theory of algorithms divorces computation from data
  - First, run a fast algorithm, and ask what it means later

¹Solution exists, is unique, and varies continuously with input data.

SLIDE 8

Introduction

Context: My first stab at deep learning

SLIDE 9

Introduction

A blog about my first stab at deep learning

SLIDE 10

Introduction

A blog about my first stab at deep learning

SLIDE 11

Efficient and Effective Optimization Methods

Problem Statement

Problem 1: Composite Optimization Problem

    min_{x ∈ X ⊆ R^d} F(x) = f(x) + h(x)

f: convex and smooth; h: convex and (non-)smooth

Problem 2: Minimizing Finite Sum Problem

    min_{x ∈ X ⊆ R^d} F(x) = (1/n) ∑_{i=1}^n f_i(x)

f_i: (non-)convex and smooth; n ≫ 1

SLIDE 12

Efficient and Effective Optimization Methods

Modern “Big-Data” + Classical Optimization Algorithms ⇒ Effective but Inefficient

Need to design variants that are:

1. Efficient, i.e., low per-iteration cost
2. Effective, i.e., fast convergence rate

SLIDE 13

Efficient and Effective Optimization Methods

Scientific Computing and Machine Learning share the same challenges, and use the same means, but to get to different ends! Machine Learning has been, and continues to be, very busy designing efficient and effective optimization methods

SLIDE 14

Efficient and Effective Optimization Methods

First Order Methods

Variants of Gradient Descent (GD):

    x^(k+1) = x^(k) − α_k ∇F(x^(k))

  - Reduce the per-iteration cost of GD ⇒ efficiency
  - Achieve the convergence rate of GD ⇒ effectiveness
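As a concrete illustration, here is a minimal NumPy sketch of this update on a least-squares objective; the objective, data, and step-size choice are illustrative assumptions, not from the slides:

```python
import numpy as np

# Illustrative objective: F(x) = 0.5 * ||A x - b||^2, with gradient A^T (A x - b).
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
b = rng.standard_normal(100)

def grad_F(x):
    return A.T @ (A @ x - b)

x = np.zeros(10)
alpha = 1.0 / np.linalg.norm(A, 2) ** 2   # constant step size 1/L, with L = ||A||_2^2
for k in range(500):
    x = x - alpha * grad_F(x)             # x^(k+1) = x^(k) - alpha_k * grad F(x^(k))
```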

SLIDE 15

Efficient and Effective Optimization Methods

First Order Methods

E.g.: SAG, SDCA, SVRG, Prox-SVRG, Acc-Prox-SVRG, Acc-Prox-SDCA, S2GD, mS2GD, MISO, SAGA, AMSVRG, ...

SLIDE 16

Efficient and Effective Optimization Methods

But why?

Q: Why do we use (stochastic) 1st order methods?
  - Cheaper iterations? (i.e., n ≫ 1 and/or d ≫ 1)
  - Avoiding over-fitting?

SLIDE 17

Efficient and Effective Optimization Methods

1st order method and “over-fitting”

Challenges with “simple” 1st order methods for “over-fitting”:
  - Highly sensitive to ill-conditioning
  - Very difficult to tune (many) hyper-parameters

“Over-fitting” is difficult with “simple” 1st order methods!

SLIDE 18

Efficient and Effective Optimization Methods

Remedy?

1. “Not-so-simple” 1st order methods, e.g., accelerated and adaptive methods
2. 2nd order methods, e.g.,

    x^(k+1) = x^(k) − [∇²F(x^(k))]^{-1} ∇F(x^(k))
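For concreteness, a minimal NumPy sketch of this Newton update on a strongly convex quadratic; the quadratic objective is an illustrative assumption:

```python
import numpy as np

# Illustrative objective: F(x) = 0.5 x^T Q x + q^T x, so grad F = Q x + q and
# the Hessian is the constant positive definite matrix Q.
rng = np.random.default_rng(1)
M = rng.standard_normal((10, 10))
Q = M @ M.T + np.eye(10)
q = rng.standard_normal(10)

x = np.zeros(10)
for k in range(10):
    g = Q @ x + q                  # grad F(x^(k))
    p = np.linalg.solve(Q, g)      # Newton direction: solve H p = grad F
    x = x - p                      # unit step; exact in one step for a quadratic
```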

SLIDE 19

Efficient and Effective Optimization Methods

Your Choice Of....

SLIDE 20

Efficient and Effective Optimization Methods

Which Problem?

1. “Not-so-simple” 1st order methods: FLAG n’ FLARE

   Problem 1: Composite Optimization Problem

       min_{x ∈ X ⊆ R^d} F(x) = f(x) + h(x)

   f: convex and smooth; h: convex and (non-)smooth

2. 2nd order methods: Stochastic Newton-Type Methods (Stochastic Newton, Trust Region, Cubic Regularization)

   Problem 2: Minimizing Finite Sum Problem

       min_{x ∈ X ⊆ R^d} F(x) = (1/n) ∑_{i=1}^n f_i(x)

   f_i: (non-)convex and smooth; n ≫ 1

SLIDE 21

Efficient and Effective Optimization Methods

Collaborators

FLAG n’ FLARE:
  Fred Roosta (UC Berkeley), Xiang Cheng (UC Berkeley), Stefan Palombo (UC Berkeley), Peter L. Bartlett (UC Berkeley & QUT)

Sub-Sampled Newton-Type Methods for Convex:
  Fred Roosta (UC Berkeley), Peng Xu (Stanford), Jiyan Yang (Stanford), Christopher Ré (Stanford)

Sub-Sampled Newton-Type Methods for Non-convex:
  Fred Roosta (UC Berkeley), Peng Xu (Stanford)

Implementations on GPU, etc.:
  Fred Roosta (UC Berkeley), Sudhir Kylasa (Purdue), Ananth Grama (Purdue)

SLIDE 22

First-order methods: FLAG n’ FLARE

Subgradient Method

Composite Optimization Problem

    min_{x ∈ X ⊆ R^d} F(x) = f(x) + h(x)

f: convex, (non-)smooth; h: convex, (non-)smooth

SLIDE 23

First-order methods: FLAG n’ FLARE

Subgradient Method

Algorithm 1 Subgradient Method
1: Input: x_1 and T
2: for k = 1, 2, ..., T − 1 do
3:   g_k ∈ ∂(f(x_k) + h(x_k))
4:   x_{k+1} = argmin_{x ∈ X} ⟨g_k, x⟩ + (1/(2α_k)) ||x − x_k||²
5: end for
6: Output: x̄ = (1/T) ∑_{t=1}^T x_t

α_k: step size
  - Constant step size: α_k = α
  - Diminishing step size: ∑_{k=1}^∞ α_k = ∞ and lim_{k→∞} α_k = 0
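A minimal NumPy sketch of Algorithm 1 on X = R^d, where the update in line 4 reduces to x_{k+1} = x_k − α_k g_k; the ℓ1-regularized least-squares objective and the problem sizes are illustrative assumptions:

```python
import numpy as np

# F(x) = 0.5*||Ax - b||^2 + lam*||x||_1; a subgradient is
# A^T(Ax - b) + lam*sign(x), with sign(0) = 0 picking one element of the subdifferential.
rng = np.random.default_rng(2)
A = rng.standard_normal((200, 50))
b = rng.standard_normal(200)
lam = 0.1

def subgrad(x):
    return A.T @ (A @ x - b) + lam * np.sign(x)

T = 1000
x = np.zeros(50)
x_sum = np.zeros(50)
for k in range(1, T):
    alpha_k = 1.0 / np.sqrt(k)        # diminishing step size
    x = x - alpha_k * subgrad(x)      # line 4 with X = R^d
    x_sum += x
x_bar = x_sum / (T - 1)               # averaged iterate, as in the output step
```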

SLIDE 24

First-order methods: FLAG n’ FLARE

Example: Logistic Regression

{a_i, b_i}: features and labels, a_i ∈ {0, 1}^d, b_i ∈ {0, 1}

    F(x) = ∑_{i=1}^n [ log(1 + e^{⟨a_i, x⟩}) − b_i ⟨a_i, x⟩ ]

    ∇F(x) = ∑_{i=1}^n ( 1/(1 + e^{−⟨a_i, x⟩}) − b_i ) a_i

Infrequent Features ⇒ Small Partial Derivative
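A minimal NumPy sketch of this objective and gradient; the random binary feature matrix is an illustrative assumption:

```python
import numpy as np

# Rows of A are the feature vectors a_i in {0,1}^d; b holds labels in {0,1}.
rng = np.random.default_rng(3)
A = (rng.random((100, 20)) < 0.1).astype(float)   # sparse binary features
b = rng.integers(0, 2, 100).astype(float)

def F(x):
    z = A @ x
    return np.sum(np.logaddexp(0.0, z) - b * z)   # log(1 + e^z), computed stably

def grad_F(x):
    z = A @ x
    return A.T @ (1.0 / (1.0 + np.exp(-z)) - b)   # sum_i (sigma(z_i) - b_i) a_i
```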

SLIDE 25

First-order methods: FLAG n’ FLARE

predictive vs. irrelevant features

Very infrequent features ⇒ highly predictive (e.g., “CANON” in document classification)
Very frequent features ⇒ highly irrelevant (e.g., “and” in document classification)

SLIDE 26

First-order methods: FLAG n’ FLARE

AdaGrad [Duchi et al., 2011]

Frequent features ⇒ large partial derivative ⇒ learning rate ↓
Infrequent features ⇒ small partial derivative ⇒ learning rate ↑

Replace α_k with an adaptively formed scaling matrix...

Many follow-up works: RMSProp, Adam, Adadelta, etc.

SLIDE 27

First-order methods: FLAG n’ FLARE

AdaGrad [Duchi et al., 2011]

Algorithm 2 AdaGrad
1: Input: x_1, η and T
2: for k = 1, 2, ..., T − 1 do
3:   g_k ∈ ∂f(x_k)
4:   Form scaling matrix S_k based on {g_t; t = 1, ..., k}
5:   x_{k+1} = argmin_{x ∈ X} ⟨g_k, x⟩ + h(x) + (1/2)(x − x_k)^T S_k (x − x_k)
6: end for
7: Output: x̄ = (1/T) ∑_{t=1}^T x_t
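A minimal NumPy sketch of the common diagonal special case of Algorithm 2 (h = 0, X = R^d), where the scaling reduces to a per-coordinate step built from the running sum of squared gradients; the least-squares objective is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((200, 50))
b = rng.standard_normal(200)

def grad(x):
    return A.T @ (A @ x - b)

eta = 0.5
x = np.zeros(50)
G = np.zeros(50)                                 # running sum of squared gradients
for k in range(500):
    g = grad(x)
    G += g * g
    x = x - eta * g / (np.sqrt(G) + 1e-12)       # per-coordinate adaptive step
```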

SLIDE 28

First-order methods: FLAG n’ FLARE

Convergence

Convergence. Let x* be an optimum point. We have:

AdaGrad [Duchi et al., 2011]:

    F(x̄) − F(x*) ≤ O( √d D_∞ α / √T ),

where α ∈ [1/√d, 1] and D_∞ = max_{x,y ∈ X} ||y − x||_∞, and

Subgradient descent:

    F(x̄) − F(x*) ≤ O( D_2 / √T ),

where D_2 = max_{x,y ∈ X} ||y − x||_2.

SLIDE 29

First-order methods: FLAG n’ FLARE

Comparison

Competitive factor: √d D_∞ α / D_2

D_∞ and D_2 depend on the geometry of X, e.g., if X = {x; ||x||_∞ ≤ 1} then D_2 = √d D_∞.

    α = ( ∑_{i=1}^d √( ∑_{t=1}^T [g_t]_i² ) ) / √( d ∑_{t=1}^T ||g_t||² )

depends on {g_t; t = 1, ..., T}.

SLIDE 30

First-order methods: FLAG n’ FLARE

Improving the T dependence

Problem 1: Composite Optimization Problem

    min_{x ∈ X ⊆ R^d} F(x) = f(x) + h(x)

f: convex and smooth (with L-Lipschitz gradient); h: convex and (non-)smooth

  - Subgradient methods: O(1/√T)
  - ISTA: O(1/T)
  - FISTA [Beck and Teboulle, 2009]: O(1/T²)

SLIDE 31

First-order methods: FLAG n’ FLARE

Best of both worlds?

Accelerated gradient methods ⇒ optimal rate, e.g., 1/T² vs. 1/T vs. 1/√T

Adaptive gradient methods ⇒ better constant: √d D_∞ α vs. D_2

How about accelerated and adaptive gradient methods?

SLIDE 32

First-order methods: FLAG n’ FLARE

FLAG: Fast Linearly-Coupled Adaptive Gradient Method
FLARE: FLAg RElaxed

SLIDE 33

First-order methods: FLAG n’ FLARE

FLAG [CRPBM, 2016]

Algorithm 3 FLAG
1: Input: x_0 = y_0 = z_0 and L
2: for k = 1, 2, ..., T do
3:   y_{k+1} = Prox(x_k)
4:   Gradient mapping: g_k = −L(y_{k+1} − x_k)
5:   Form S_k based on {g_t / ||g_t||; t = 1, ..., k}
6:   Compute η_k
7:   z_{k+1} = argmin_{z ∈ X} η_k ⟨g_k, z − z_k⟩ + (1/2)(z − z_k)^T S_k (z − z_k)
8:   x_k = LinearlyCouple(y_{k+1}, z_{k+1})
9: end for
10: Output: y_{T+1}

where

    Prox(x_k) := argmin_{x ∈ X} ⟨∇f(x_k), x⟩ + h(x) + (L/2) ||x − x_k||₂²

SLIDE 34

First-order methods: FLAG n’ FLARE

FLAG Simplified

Algorithm 4 Bird’s-Eye View of FLAG
1: Input: x_0
2: for k = 1, 2, ..., T do
3:   y_k: usual gradient step
4:   Form gradient history
5:   z_k: scaled gradient step
6:   Find mixing weight w via binary search
7:   x_{k+1} = (1 − w) y_{k+1} + w z_{k+1}
8: end for
9: Output: y_{T+1}
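A structural NumPy sketch of this loop, only to show the data flow: the gradient-history scaling mirrors the AdaGrad-style step, and the constant mixing weight w stands in for the binary search. All of it is an illustrative assumption, not the actual FLAG implementation:

```python
import numpy as np

def flag_skeleton(grad, x0, L, T, eta=0.5, w=0.5):
    x, z = x0.copy(), x0.copy()
    G = np.zeros_like(x0)                        # history of normalized gradients
    for k in range(T):
        g = grad(x)
        y = x - g / L                            # usual gradient step
        g_tilde = g / (np.linalg.norm(g) + 1e-12)
        G += g_tilde ** 2
        z = z - eta * g / (np.sqrt(G) + 1e-12)   # scaled gradient step
        x = (1 - w) * y + w * z                  # linear coupling with weight w
    return y                                     # output the y-iterate
```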

SLIDE 35

First-order methods: FLAG n’ FLARE

Convergence

Convergence. Let x* be an optimum point. We have:

FLAG [CRPBM, 2016]:

    F(x̄) − F(x*) ≤ O( d D_∞² β / T² ),

where β ∈ [1/d, 1] and D_∞ = max_{x,y ∈ X} ||y − x||_∞, and

FISTA [Beck and Teboulle, 2009]:

    F(x̄) − F(x*) ≤ O( D_2² / T² ),

where D_2 = max_{x,y ∈ X} ||y − x||_2.

SLIDE 36

First-order methods: FLAG n’ FLARE

Comparison

Competitive factor: d D_∞² β / D_2²

D_∞ and D_2 depend on the geometry of X, e.g., if X = {x; ||x||_∞ ≤ 1} then D_2 = √d D_∞.

    β = ( ∑_{i=1}^d √( ∑_{t=1}^T [g̃_t]_i² ) )² / (d T)

depends on {g̃_t := g_t / ||g_t||; t = 1, ..., T}.

SLIDE 37

First-order methods: FLAG n’ FLARE

Linear Coupling

Linearly couple (y_{k+1}, z_{k+1}) via an “ǫ-binary search”: find an ǫ-approximation to the root of the nonlinear equation

    ⟨Prox(t y + (1 − t) z) − (t y + (1 − t) z), y − z⟩ = 0,

where

    Prox(x) := argmin_{y ∈ C} h(y) + (L/2) ||y − (x − (1/L) ∇f(x))||₂².

  - At most log(1/ǫ) steps using bisection
  - At most 2 + log(1/ǫ) Prox evals per iteration more than FISTA
  - Can be expensive!

SLIDE 38

First-order methods: FLAG n’ FLARE

Linear Coupling

Linearly approximate:

    ⟨t Prox(y) + (1 − t) Prox(z) − (t y + (1 − t) z), y − z⟩ = 0.

This is a linear equation in t, so it has a closed-form solution:

    t = ⟨z − Prox(z), y − z⟩ / ⟨(z − Prox(z)) − (y − Prox(y)), y − z⟩

  - At most 2 Prox evals per iteration more than FISTA
  - Equivalent to ǫ-binary search with ǫ = 1/3
  - Better, but might not be good enough!

SLIDE 39

First-order methods: FLAG n’ FLARE

FLARE: FLAg RElaxed

Basic idea: choose the mixing weight by an intelligent “futuristic” guess.

Guess now and, at the next iteration, correct if the guess was wrong.

FLARE: exactly the same number of Prox evals per iteration as FISTA!
FLARE: has a similar theoretical guarantee to FLAG!

SLIDE 40

First-order methods: FLAG n’ FLARE

    L(x_1, x_2, ..., x_C) = ∑_{i=1}^n ∑_{c=1}^C −1(b_i = c) log( e^{⟨a_i, x_c⟩} / (1 + ∑_{b=1}^{C−1} e^{⟨a_i, x_b⟩}) )

                          = ∑_{i=1}^n [ log( 1 + ∑_{c=1}^{C−1} e^{⟨a_i, x_c⟩} ) − ∑_{c=1}^{C−1} 1(b_i = c) ⟨a_i, x_c⟩ ]
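A minimal NumPy sketch of this multi-class logistic loss, with class C as the reference class (its parameter vector fixed to zero); the random data are an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, C = 500, 40, 5
A = rng.standard_normal((n, d))                  # rows are a_i
labels = rng.integers(1, C + 1, n)               # b_i in {1, ..., C}

def loss(X):
    """X: d x (C-1) matrix whose columns are x_1, ..., x_{C-1}."""
    Z = A @ X                                    # logits <a_i, x_c>, c < C
    # log(1 + sum_c e^{z_c}) as a stable logsumexp over [0, z_1, ..., z_{C-1}]
    lse = np.logaddexp.reduce(np.column_stack([np.zeros(n), Z]), axis=1)
    hit = np.where(labels[:, None] == np.arange(1, C)[None, :], Z, 0.0).sum(axis=1)
    return np.sum(lse - hit)
```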

SLIDE 41

First-order methods: FLAG n’ FLARE

Classification: 20 Newsgroups

Prediction across 20 different newsgroups

Data            Train Size   Test Size   d        Classes
20 Newsgroups   10,142       1,127       53,975   20

    min_{||x||_∞ ≤ 1} L(x_1, x_2, ..., x_C)

SLIDE 42

First-order methods: FLAG n’ FLARE

Classification: 20 Newsgroups

SLIDE 43

First-order methods: FLAG n’ FLARE

Classification: Forest CoverType

Predicting forest cover type from cartographic variables

Data               Train Size   Test Size   d    Classes
Forest CoverType   435,759      145,253     54   7

    min_{x ∈ R^d} L(x_1, x_2, ..., x_C) + λ ||x||_1

SLIDE 44

First-order methods: FLAG n’ FLARE

Classification: Forest CoverType

SLIDE 45

First-order methods: FLAG n’ FLARE

Regression: BlogFeedback

Prediction of the number of comments in the next 24 hours for blogs

Data           Train Size   Test Size   d
BlogFeedback   47,157       5,240       280

    min_{x ∈ R^d} (1/2) ||Ax − b||₂² + λ ||x||_1

SLIDE 46

First-order methods: FLAG n’ FLARE

Regression: BlogFeedback

SLIDE 47

Second-order methods: Stochastic Newton-Type Methods

2nd order methods: Stochastic Newton-Type Methods
  - Stochastic Newton (think: convex)
  - Stochastic Trust Region (think: non-convex)
  - Stochastic Cubic Regularization (think: non-convex)

Problem 2: Minimizing Finite Sum Problem

    min_{x ∈ X ⊆ R^d} F(x) = (1/n) ∑_{i=1}^n f_i(x)

f_i: (non-)convex and smooth; n ≫ 1

SLIDE 48

Second-order methods: Stochastic Newton-Type Methods

Second Order Methods

  - Use both gradient and Hessian information
  - Fast convergence rate
  - Resilient to ill-conditioning
  - They “over-fit” nicely!
  - However, per-iteration cost is high!

SLIDE 49

Second-order methods: Stochastic Newton-Type Methods

Sensorless Drive Diagnosis

n = 50,000, p = 528, no. of classes = 11, λ = 0.0001

Figure: Test Accuracy

SLIDE 50

Second-order methods: Stochastic Newton-Type Methods

Sensorless Drive Diagnosis

n = 50,000, p = 528, no. of classes = 11, λ = 0.0001

Figure: Time/Iteration

SLIDE 51

Second-order methods: Stochastic Newton-Type Methods

Second Order Methods

Deterministically approximating second order information cheaply:
  - Quasi-Newton, e.g., BFGS and L-BFGS [Nocedal, 1980]

Randomly approximating second order information cheaply:
  - Sub-sampling the Hessian [Byrd et al., 2011; Erdogdu et al., 2015; Martens, 2010; RM-I, RM-II; XYRRM, 2016; Bollapragada et al., 2016; ...]
  - Sketching the Hessian [Pilanci et al., 2015]
  - Sub-sampling the Hessian and the gradient [RM-I & RM-II, 2016; Bollapragada et al., 2016; ...]

SLIDE 52

Second-order methods: Stochastic Newton-Type Methods

Iterative Scheme

    x^(k+1) = argmin_{x ∈ D ∩ X} { F(x^(k)) + (x − x^(k))^T g(x^(k)) + (1/(2α_k)) (x − x^(k))^T H(x^(k)) (x − x^(k)) }

SLIDE 53

Second-order methods: Stochastic Newton-Type Methods

Hessian Sub-Sampling

    g(x) = ∇F(x),    H(x) = (1/|S|) ∑_{j ∈ S} ∇²f_j(x)
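A minimal NumPy sketch of this estimator, plus one Newton-type step built from it, for the illustrative finite sum f_i(x) = (1/2)(a_i^T x − b_i)², whose Hessian is a_i a_i^T; the data, sample size, and small ridge term are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 10000, 20
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def full_gradient(x):
    return A.T @ (A @ x - b) / n                 # g(x) = grad F(x), exact

def subsampled_hessian(x, s):
    S = rng.choice(n, size=s, replace=False)     # uniform sample of indices
    As = A[S]
    return As.T @ As / s                         # (1/|S|) sum_{j in S} a_j a_j^T

x = np.zeros(d)
H = subsampled_hessian(x, s=200)
g = full_gradient(x)
x = x - np.linalg.solve(H + 1e-8 * np.eye(d), g) # one sub-sampled Newton step
```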

SLIDE 54

Second-order methods: Stochastic Newton-Type Methods

First, let’s consider the convex case....

SLIDE 55

Second-order methods: Stochastic Newton-Type Methods

Convex Problems

  - Each f_i is smooth and weakly convex
  - F is γ-strongly convex

SLIDE 56

Second-order methods: Stochastic Newton-Type Methods

“We want to design methods for machine learning that are not as ideal as Newton’s method but have [these] properties: first of all, they tend to turn towards the right directions and they have the right length, [i.e.,] the step size of one is going to be working most of the time...and we have to have an algorithm that scales up for machine learning.”

  • Prof. Jorge Nocedal

IPAM Summer School, 2012 Tutorial on Optimization Methods for ML (Video - Part I: 50’ 03”)

SLIDE 57

Second-order methods: Stochastic Newton-Type Methods

What do we need?

Requirements:

(R.1) Scale up: |S| must be independent of n, or at least smaller than n, and for p ≫ 1, allow for inexactness.
(R.2) Turn to right directions: H(x) must preserve the spectrum of ∇²F(x) as much as possible.
(R.3) Not ideal but close: fast local convergence rate, close to that of Newton.
(R.4) Right step length: unit step length eventually works.

SLIDE 62

Second-order methods: Stochastic Newton-Type Methods

Sub-sampling Hessian

Requirements:

(R.1) Scale up: |S| must be independent of n, or at least smaller than n, and for p ≫ 1, allow for inexactness.
(R.2) Turn to right directions: H(x) must preserve the spectrum of ∇²F(x) as much as possible.
(R.3) Not ideal but close: fast local convergence rate, close to that of Newton.
(R.4) Right step length: unit step length eventually works.

SLIDE 63

Second-order methods: Stochastic Newton-Type Methods

Sub-sampling Hessian

Lemma (Uniform Hessian Sub-Sampling)
Given any 0 < ǫ < 1, 0 < δ < 1, and x ∈ R^p, if

    |S| ≥ 2κ² ln(2p/δ) / ǫ²,

then

    Pr( (1 − ǫ) ∇²F(x) ⪯ H(x) ⪯ (1 + ǫ) ∇²F(x) ) ≥ 1 − δ.
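The sample-size bound translates directly into code; a tiny sketch (the numerical inputs are illustrative assumptions):

```python
import math

# |S| >= 2 * kappa^2 * ln(2p / delta) / eps^2, from the lemma above.
def hessian_sample_size(kappa, p, eps, delta):
    return math.ceil(2 * kappa**2 * math.log(2 * p / delta) / eps**2)

print(hessian_sample_size(kappa=10, p=100, eps=0.1, delta=0.01))  # required |S|
```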

SLIDE 65

Second-order methods: Stochastic Newton-Type Methods

Error Recursion: Hessian Sub-Sampling

Theorem (Error Recursion)
Using α_k = 1, with high probability we have

    ||x^(k+1) − x*|| ≤ ρ_0 ||x^(k) − x*|| + ξ ||x^(k) − x*||²,

where ρ_0 = ǫ / (1 − ǫ) and ξ = L / (2(1 − ǫ)γ).

ρ_0 is problem-independent! ⇒ Can be made arbitrarily small!

SLIDE 66

Second-order methods: Stochastic Newton-Type Methods

SSN-H: Q-Linear Convergence

Theorem (Q-Linear Convergence)
Consider any 0 < ρ_0 < ρ < 1 and ǫ ≤ ρ_0 / (1 + ρ_0). If

    ||x^(0) − x*|| ≤ (ρ − ρ_0) / ξ,

we get locally Q-linear convergence,

    ||x^(k) − x*|| ≤ ρ ||x^(k−1) − x*||,  k = 1, ..., k_0,

with high probability. It is possible to get a superlinear rate as well.

SLIDE 68

Second-order methods: Stochastic Newton-Type Methods

Sub-Sampling Hessian

Lemma (Uniform Hessian Sub-Sampling)
Given any 0 < ǫ < 1, 0 < δ < 1, and x ∈ R^p, if

    |S| ≥ 2κ ln(p/δ) / ǫ²,

then

    Pr( (1 − ǫ)γ ≤ λ_min(H(x)) ) ≥ 1 − δ.

SLIDE 69

Second-order methods: Stochastic Newton-Type Methods

SSN-H: Inexact Update

Assume X = R^p.

Descent direction:

    ||H(x^(k)) p_k + ∇F(x^(k))|| ≤ θ_1 ||∇F(x^(k))||

Step size:

    α_k = arg max α  s.t.  α ≤ 1  and  F(x^(k) + α p_k) ≤ F(x^(k)) + α β p_k^T ∇F(x^(k))

Update:

    x^(k+1) = x^(k) + α_k p_k

with 0 < β, θ_1, θ_2 < 1.
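A minimal sketch of the backtracking (Armijo-style) step-size rule above: start from the unit step and shrink until sufficient decrease holds; the halving schedule is an illustrative assumption:

```python
import numpy as np

def armijo_step(F, x, p, g, beta=1e-4, shrink=0.5, max_backtracks=50):
    """Largest alpha <= 1 along a halving schedule with
    F(x + alpha p) <= F(x) + alpha * beta * p^T g."""
    alpha, fx = 1.0, F(x)
    for _ in range(max_backtracks):
        if F(x + alpha * p) <= fx + alpha * beta * (p @ g):
            return alpha
        alpha *= shrink
    return alpha
```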

SLIDE 70

Second-order methods: Stochastic Newton-Type Methods

SSN-H Algorithm: Inexact Update

Algorithm 5 Globally Convergent SSN-H with Inexact Solve
1: Input: x^(0), 0 < δ < 1, 0 < ǫ < 1, 0 < β, θ_1, θ_2 < 1
2: Set the sample size |S| using ǫ and δ
3: for k = 0, 1, 2, ... until termination do
4:   Select a sample set S of size |S| and form H(x^(k))
5:   Update x^(k+1) with H(x^(k)) and inexact solve
6: end for

SLIDE 71

Second-order methods: Stochastic Newton-Type Methods

Global Convergence of SSN-H: Inexact Update

Theorem (Global Convergence of Algorithm 5)
Using Algorithm 5 with θ_1 ≈ 1/√κ, with high probability we have

    F(x^(k+1)) − F(x*) ≤ (1 − ρ) ( F(x^(k)) − F(x*) ),

where ρ = α_k β / κ and α_k ≥ 2(1 − θ_2)(1 − β)(1 − ǫ) / κ.

SLIDE 72

Second-order methods: Stochastic Newton-Type Methods

Local + Global

Theorem
For any ρ < 1 and ǫ ≈ ρ/√κ, Algorithm 5 is globally convergent and, after O(κ²) iterations, with high probability achieves “problem-independent” Q-linear convergence, i.e.,

    ||x^(k+1) − x*|| ≤ ρ ||x^(k) − x*||.

Moreover, the step size α_k = 1 passes the Armijo rule for all subsequent iterations.

SLIDE 73

Second-order methods: Stochastic Newton-Type Methods

“Any optimization algorithm for which the unit step length works has some wisdom. It is too much of a fluke if the unit step length [accidentally] works.”

  • Prof. Jorge Nocedal

IPAM Summer School, 2012 Tutorial on Optimization Methods for ML (Video - Part I: 56’ 32”)

SLIDE 74

Second-order methods: Stochastic Newton-Type Methods

So far these efforts mostly treated convex problems.... Now, it is time for non-convexity!

SLIDE 75

Second-order methods: Stochastic Newton-Type Methods

Non-Convex Is Hard!

  - Saddle points, local minima, local maxima
  - Optimization of a degree-four polynomial: NP-hard [Hillar et al., 2013]
  - Checking whether a point is not a local minimum: NP-complete [Murty et al., 1987]

SLIDE 76

Second-order methods: Stochastic Newton-Type Methods

All convex problems are the same, while every non-convex problem is different.

Not sure whose quote this is!

SLIDE 77

Second-order methods: Stochastic Newton-Type Methods

(ǫ_g, ǫ_H)-Optimality:

    ||∇F(x)|| ≤ ǫ_g,    λ_min(∇²F(x)) ≥ −ǫ_H

SLIDE 78

Second-order methods: Stochastic Newton-Type Methods

Trust Region: classical method for non-convex problems [Sorensen, 1982; Conn et al., 2000]

    s^(k) = argmin_{||s|| ≤ ∆_k} ⟨s, ∇F(x^(k))⟩ + (1/2) ⟨s, ∇²F(x^(k)) s⟩

Cubic Regularization: more recent method for non-convex problems [Griewank, 1981; Nesterov et al., 2006; Cartis et al., 2011a; Cartis et al., 2011b]

    s^(k) = argmin_{s ∈ R^d} ⟨s, ∇F(x^(k))⟩ + (1/2) ⟨s, ∇²F(x^(k)) s⟩ + (σ_k/3) ||s||³

SLIDE 79

Second-order methods: Stochastic Newton-Type Methods

To get iteration complexity, all previous work required

    ||( H(x^(k)) − ∇²F(x^(k)) ) s^(k)|| ≤ C ||s^(k)||²    (1)

This is stronger than Dennis-Moré:

    lim_{k→∞} ||( H(x^(k)) − ∇²F(x^(k)) ) s^(k)|| / ||s^(k)|| = 0.

We relaxed (1) to

    ||( H(x^(k)) − ∇²F(x^(k)) ) s^(k)|| ≤ ǫ ||s^(k)||    (2)

Quasi-Newton, sketching, and sub-sampling satisfy Dennis-Moré and (2), but not necessarily (1).

SLIDE 80

Second-order methods: Stochastic Newton-Type Methods

Recall...

    F(x) = (1/n) ∑_{i=1}^n f_i(x)

SLIDE 81

Second-order methods: Stochastic Newton-Type Methods

Lemma (Complexity of Uniform Sampling)
Suppose ||∇²f_i(x)|| ≤ K for all i. Given any 0 < ǫ < 1, 0 < δ < 1, and x ∈ R^d, if

    |S| ≥ (16 K² / ǫ²) log(2d/δ),

then for H(x) = (1/|S|) ∑_{j ∈ S} ∇²f_j(x) we have

    Pr( ||H(x) − ∇²F(x)|| ≤ ǫ ) ≥ 1 − δ.

Only the top eigenvalues/eigenvectors need to be preserved.

SLIDE 82

Second-order methods: Stochastic Newton-Type Methods

    F(x) = (1/n) ∑_{i=1}^n f_i(a_i^T x)

    p_i = |f_i''(a_i^T x)| ||a_i||₂² / ∑_{j=1}^n |f_j''(a_j^T x)| ||a_j||₂²
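A minimal NumPy sketch of these sampling probabilities for the illustrative choice f_i(z) = log(1 + e^z) (logistic loss, with f''(z) = σ(z)(1 − σ(z))); the random data are an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((1000, 30))              # rows are a_i
x = rng.standard_normal(30)

z = A @ x
sig = 1.0 / (1.0 + np.exp(-z))
curvature = sig * (1.0 - sig)                    # |f_i''(a_i^T x)|
row_norms2 = np.sum(A * A, axis=1)               # ||a_i||_2^2
p = curvature * row_norms2
p /= p.sum()                                     # p_i as in the formula above
S = rng.choice(len(p), size=100, replace=True, p=p)  # non-uniform sample
```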

SLIDE 83

Second-order methods: Stochastic Newton-Type Methods

Lemma (Complexity of Non-Uniform Sampling)
Suppose ||∇²f_i(x)|| ≤ K_i, i = 1, 2, ..., n. Given any 0 < ǫ < 1, 0 < δ < 1, and x ∈ R^d, if

    |S| ≥ (16 K̄² / ǫ²) log(2d/δ),

then for H(x) = (1/|S|) ∑_{j ∈ S} (1/(n p_j)) ∇²f_j(x) we have

    Pr( ||H(x) − ∇²F(x)|| ≤ ǫ ) ≥ 1 − δ,

where K̄ = (1/n) ∑_{i=1}^n K_i.

SLIDE 84

Second-order methods: Stochastic Newton-Type Methods

Non-Convex Problems

Algorithm 6 Stochastic Trust-Region Algorithm
1: Input: x_0, ∆_0 > 0, η ∈ (0, 1), γ > 1, 0 < ǫ, ǫ_g, ǫ_H < 1
2: for k = 0, 1, 2, ... until termination do
3:   s_k ≈ argmin_{||s|| ≤ ∆_k} m_k(s) := ∇F(x^(k))^T s + (1/2) s^T H(x^(k)) s
4:   ρ_k := ( F(x^(k) + s_k) − F(x^(k)) ) / m_k(s_k)
5:   if ρ_k ≥ η then
6:     x^(k+1) = x^(k) + s_k and ∆_{k+1} = γ ∆_k
7:   else
8:     x^(k+1) = x^(k) and ∆_{k+1} = γ^{-1} ∆_k
9:   end if
10: end for
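A minimal NumPy sketch of one iteration of this loop, solving the subproblem in line 3 approximately with the Cauchy point (the minimizer of the model along −∇F inside the ball), which is one standard cheap choice; this surrogate solver is an illustrative assumption, not the paper's method:

```python
import numpy as np

def tr_step(F, grad, subsampled_hess, x, Delta, eta=0.1, gamma=2.0):
    g = grad(x)
    gn = np.linalg.norm(g)
    if gn < 1e-12:
        return x, Delta                          # (near-)first-order stationary
    H = subsampled_hess(x)
    gHg = g @ H @ g
    # Cauchy point: minimize m_k(-t*g) subject to t*||g|| <= Delta.
    t = Delta / gn if gHg <= 0 else min(g @ g / gHg, Delta / gn)
    s = -t * g
    m = g @ s + 0.5 * s @ H @ s                  # model value m_k(s_k) < 0
    rho = (F(x + s) - F(x)) / m                  # line 4
    if rho >= eta:
        return x + s, gamma * Delta              # accept step, expand radius
    return x, Delta / gamma                      # reject step, shrink radius
```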

SLIDE 85

Second-order methods: Stochastic Newton-Type Methods

Theorem (Complexity of Stochastic TR)
If ǫ ∈ O(ǫ_H), then Stochastic TR terminates after

    T ∈ O( max{ ǫ_g^{−2} ǫ_H^{−1}, ǫ_H^{−3} } )

iterations, upon which, with high probability, we have

    ||∇F(x)|| ≤ ǫ_g  and  λ_min(∇²F(x)) ≥ −(ǫ + ǫ_H).

This is tight!

SLIDE 86

Second-order methods: Stochastic Newton-Type Methods

Non-Convex Problems

Algorithm 7 Stochastic Adaptive Regularization with Cubics (ARC)
1: Input: x_0, σ_0 > 0, η ∈ (0, 1), γ > 1, 0 < ǫ, ǫ_g, ǫ_H < 1
2: for k = 0, 1, 2, ... until termination do
3:   s_k ≈ argmin_{s ∈ R^d} m_k(s) := ∇F(x^(k))^T s + (1/2) s^T H(x^(k)) s + (σ_k/3) ||s||³
4:   ρ_k := ( F(x^(k) + s_k) − F(x^(k)) ) / m_k(s_k)
5:   if ρ_k ≥ η then
6:     x^(k+1) = x^(k) + s_k and σ_{k+1} = γ^{-1} σ_k
7:   else
8:     x^(k+1) = x^(k) and σ_{k+1} = γ σ_k
9:   end if
10: end for
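A minimal NumPy sketch of one iteration, solving the cubic subproblem approximately along −∇F (a Cauchy-like step whose optimal length solves a scalar quadratic); this one-dimensional surrogate is an illustrative assumption, not the paper's subproblem solver:

```python
import numpy as np

def arc_step(F, grad, subsampled_hess, x, sigma, eta=0.1, gamma=2.0):
    g = grad(x)
    gn = np.linalg.norm(g)
    if gn < 1e-12:
        return x, sigma                          # (near-)first-order stationary
    H = subsampled_hess(x)
    # Along s = -t*g: m(t) = -t*gn^2 + (t^2/2) g^T H g + (sigma/3) t^3 gn^3;
    # m'(t) = 0 is a quadratic in t with exactly one positive root.
    bq, aq = g @ H @ g, sigma * gn**3
    t = (-bq + np.sqrt(bq * bq + 4 * aq * gn**2)) / (2 * aq)
    s = -t * g
    m = g @ s + 0.5 * s @ H @ s + sigma / 3 * np.linalg.norm(s)**3
    rho = (F(x + s) - F(x)) / m                  # line 4
    if rho >= eta:
        return x + s, sigma / gamma              # success: relax regularization
    return x, gamma * sigma                      # failure: regularize more
```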

SLIDE 87

Second-order methods: Stochastic Newton-Type Methods

Theorem (Complexity of Stochastic ARC)
If ǫ ∈ O(ǫ_g, ǫ_H), then Stochastic ARC terminates after

    T ∈ O( max{ ǫ_g^{−3/2}, ǫ_H^{−3} } )

iterations, upon which, with high probability, we have

    ||∇F(x)|| ≤ ǫ_g  and  λ_min(∇²F(x)) ≥ −(ǫ + ǫ_H).

This is tight!

SLIDE 88

Second-order methods: Stochastic Newton-Type Methods

For ǫ_H² = ǫ_g = ǫ = ǫ_0:

  - Stochastic TR: T ∈ O(ǫ_0^{−3})
  - Stochastic ARC: T ∈ O(ǫ_0^{−3/2})

SLIDE 89

Second-order methods: Stochastic Newton-Type Methods

Non-Linear Least Squares

    min_{x ∈ R^d} (1/n) ∑_{i=1}^n ( b_i − Φ(a_i^T x) )²

SLIDE 90

Second-order methods: Stochastic Newton-Type Methods

Non-Linear Least Squares: synthetic, n = 1,000,000, d = 1,000, s = 1%

(a) Train loss vs. time  (b) Train loss vs. time

SLIDE 91

Second-order methods: Stochastic Newton-Type Methods

“Preliminary results” (1 of 5)

resiliency to problem ill-conditioning

SLIDE 92

Second-order methods: Stochastic Newton-Type Methods

“Preliminary results” (2 of 5)

good generalization error and robustness to hyper-parameter tuning

SLIDE 93

Second-order methods: Stochastic Newton-Type Methods

“Preliminary results” (3 of 5)

ability to escape undesirable saddle-points

SLIDE 94

Second-order methods: Stochastic Newton-Type Methods

“Preliminary results” (4 of 5)

low-communication costs in distributed settings

γ = 10⁻³: q = 8;  γ = 10⁻⁴: q = 26;  γ = 10⁻⁵: q = 72

SLIDE 95

Second-order methods: Stochastic Newton-Type Methods

“Preliminary results” (5 of 5)

computational advantages offered by leveraging the power of GPUs

SLIDE 96

Conclusion

Conclusions: Second order machine learning

Second order methods
  - A simple way to go beyond first order methods
  - Obviously, don’t be naïve about the details

FLAG n’ FLARE
  - Combine acceleration and adaptivity to get the best of both worlds

Can aggressively sub-sample gradient and/or Hessian
  - Improve running time at each step
  - Maintain strong second-order convergence

Apply to non-convex problems
  - Trust region methods and cubic regularization methods
  - Converge to second order stationary points
  - Quite promising “preliminary results” in ML/DA applications
