Boosting Methods: Implicit Combinatorial Optimization via First-Order Convex Optimization

Robert M. Freund, Paul Grigas, Rahul Mazumder (rfreund@mit.edu)

M.I.T. ADGO October 2013


Motivation

• Boosting methods are learning methods for combining weak models into accurate and predictive models
• Add one new weak model per iteration
• The weight on each weak model is typically small
• We consider boosting methods in two modeling contexts:
  - Binary (confidence-rated) Classification
  - (Regularized/sparse) Linear Regression
• Boosting methods are typically tuned to perform implicit regularization


Review of Subgradient Descent and Frank-Wolfe Methods

1. Subgradient Descent method
2. Frank-Wolfe method (also known as the Conditional Gradient method)


Subgradient Descent

Our problem of interest is:

$$f^* := \min_{x}\; f(x) \quad \text{s.t. } x \in \mathbb{R}^n$$

where f(x) is convex but not differentiable. Then f(x) has subgradients.

!!"" !#" !$" !%" !&" " &" %" $" #" !"" !$""" !%""" !&""" " &""" %""" $""" #"""

u f(u)


Subgradient Descent, continued

$$f^* := \min_{x}\; f(x) \quad \text{s.t. } x \in \mathbb{R}^n$$

• f(·) is a (non-smooth) Lipschitz continuous convex function with Lipschitz value L_f: |f(x) − f(y)| ≤ L_f ‖x − y‖ for any x, y
• ‖·‖ is a prescribed norm on R^n


Subgradient Descent, continued

$$f^* := \min_{x}\; f(x) \quad \text{s.t. } x \in \mathbb{R}^n$$

Subgradient Descent method for minimizing f(x) on R^n
Initialize at x_0 ∈ R^n, k ← 0. At iteration k:
1. Compute a subgradient g_k of f(x_k).
2. Choose step-size α_k.
3. Set x_{k+1} ← x_k − α_k g_k.
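To make the iteration concrete, here is a minimal Python/NumPy sketch (an illustration added here, not part of the original slides); the objective `f`, the subgradient oracle `subgrad`, and the step-size sequence `step_sizes` are assumed to be supplied by the user:

```python
import numpy as np

def subgradient_descent(f, subgrad, x0, step_sizes):
    """Minimal subgradient descent sketch.

    f          -- convex (possibly non-smooth) objective, used to track the best iterate
    subgrad(x) -- oracle returning any subgradient g of f at x
    x0         -- starting point in R^n
    step_sizes -- iterable of step-sizes alpha_k
    """
    x = np.asarray(x0, dtype=float)
    best_x, best_val = x.copy(), f(x)
    for alpha in step_sizes:
        g = subgrad(x)
        x = x - alpha * g                   # x_{k+1} <- x_k - alpha_k * g_k
        if f(x) < best_val:                 # the guarantee below is on the best iterate
            best_x, best_val = x.copy(), f(x)
    return best_x, best_val
```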


Computational Guarantees for SD

Computational Guarantee for Subgradient Descent
For each k ≥ 0 and for any x ∈ R^n, the following inequality holds:

$$\min_{i\in\{0,\dots,k\}} f(x_i) - f(x) \;\le\; \frac{\|x - x_0\|_2^2 + L_f^2 \sum_{i=0}^{k}\alpha_i^2}{2\sum_{i=0}^{k}\alpha_i}$$


Frank-Wolfe Method (Conditional Gradient method)

Here the problem of interest is:

$$f^* := \min_{x}\; f(x) \quad \text{s.t. } x \in P$$

• P is compact and convex
• f(x) is differentiable and ∇f(·) is Lipschitz on P: ‖∇f(x) − ∇f(y)‖_* ≤ L_∇ ‖x − y‖ for all x, y ∈ P
• it is "very easy" to do linear optimization on P for any c: x̃ ← arg min_{x∈P} {c^T x}


Frank-Wolfe Method, continued

$$f^* := \min_{x}\; f(x) \quad \text{s.t. } x \in P$$

Frank-Wolfe Method for minimizing f(x) on P
Initialize at x_0 ∈ P, k ← 0. At iteration k:
1. Compute ∇f(x_k).
2. Compute x̃_k ← arg min_{x∈P} {∇f(x_k)^T x}.
3. Set x_{k+1} ← x_k + ᾱ_k (x̃_k − x_k), where ᾱ_k ∈ [0, 1].
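A minimal Python sketch of this loop (illustrative only, not from the slides); it assumes the user supplies the gradient `grad` and a linear minimization oracle `lmo` that returns arg min_{x∈P} c^T x, and uses the step-size 2/(k+2) discussed on the next slide:

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, num_iters):
    """Minimal Frank-Wolfe (conditional gradient) sketch.

    grad(x) -- gradient of f at x
    lmo(c)  -- linear minimization oracle: returns argmin over P of c^T x
    x0      -- starting point in P
    """
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        x_tilde = lmo(grad(x))            # step 2: linear optimization over P
        alpha = 2.0 / (k + 2.0)           # standard step-size choice
        x = x + alpha * (x_tilde - x)     # step 3: convex combination stays in P
    return x
```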


Computational Guarantees for Frank-Wolfe Method

Here is one (simplified) computational guarantee:

A Computational Guarantee for the Frank-Wolfe Method
If the step-size sequence {ᾱ_k} is chosen as ᾱ_k = 2/(k+2), k ≥ 0, then for all k ≥ 1 it holds that:

$$f(x_k) - f^* \;\le\; \frac{C}{k+3}, \quad \text{where } C = 2\, L_\nabla \cdot \mathrm{diam}(P)^2 .$$


Binary Classification



Binary Classification

The set-up of the general binary classification boosting problem consists of:
• Data/training examples (x_1, y_1), …, (x_m, y_m), where each x_i ∈ X and each y_i ∈ {−1, +1}
• A set of base classifiers H = {h_1, …, h_n}, where each h_j : X → [−1, +1]
• Assume that H is closed under negation (h_j ∈ H ⇒ −h_j ∈ H)
• We would like to construct a nonnegative combination of weak classifiers H_λ = λ_1 h_1 + ⋯ + λ_n h_n that performs significantly better than any individual classifier in H.


Binary Classification Feature Matrix

Define the feature matrix A ∈ R^{m×n} by A_{ij} = y_i h_j(x_i).
We seek λ ≥ 0 for which Aλ > 0, or perhaps Aλ ⪆ 0.
In application/academic contexts:
• m is large-scale
• n is huge-scale, too large for many computational tasks
• we wish only to work with very sparse λ, namely ‖λ‖₀ is small
• we have access to a weak learner W(·) that, for any distribution w on the examples (w ≥ 0, e^T w = 1), returns the base classifier j* ∈ {1, …, n} that does best on the weighted examples determined by w:

$$j^* \in \arg\max_{j=1,\dots,n} w^T A_j$$
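When the feature matrix can be formed explicitly (in the huge-scale setting it cannot, which is exactly why a weak learner is used instead), the weak learner reduces to an argmax over the columns of A. A small illustrative NumPy sketch, with hypothetical helper names, not part of the original slides:

```python
import numpy as np

def feature_matrix(X, y, classifiers):
    """A[i, j] = y_i * h_j(x_i); `classifiers` is a list of callables h_j."""
    return np.array([[yi * h(xi) for h in classifiers] for xi, yi in zip(X, y)])

def weak_learner(A, w):
    """Return j* maximizing w^T A_j for a distribution w over the examples."""
    return int(np.argmax(w @ A))
```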


Binary Classification Aspirations

We seek λ ≥ 0 for which Aλ > 0, or perhaps Aλ ⪆ 0.
In the high-dimensional regime with n ≫ 0, m ≫ 0, and often n ≫≫ 0, we seek:
• Good predictive performance (on out-of-sample examples)
• Good performance on the training data (A_i λ > 0 for "most" i = 1, …, m)
• Sparsity of the coefficients (‖λ‖₀ is small)
• Regularization of the coefficients (‖λ‖₁ is small)


Two Objective Functions for Boosting

We seek λ ≥ 0 for which Aλ > 0, or perhaps Aλ ⪆ 0.
Two objective functions are often considered in this context:
• when the data are separable, maximize the margin:

$$p(\lambda) := \min_{i\in\{1,\dots,m\}} (A\lambda)_i$$

• when the data are non-separable, minimize the exponential loss:

$$L_{\exp}(\lambda) := \frac{1}{m}\sum_{i=1}^{m}\exp\left(-(A\lambda)_i\right)$$

(≡ the log-exponential loss L(λ) := log(L_exp(λ)))

It is known that a high margin implies good generalization properties [Schapire 97]. On the other hand, the exponential loss upper-bounds the empirical probability of misclassification.


Margin Maximization Problem

The margin is

$$p(\lambda) := \min_{i\in\{1,\dots,m\}} (A\lambda)_i$$

• p(λ) is positively homogeneous, so we normalize the variables λ
• Let Δ_n := {λ ∈ R^n : e^T λ = 1, λ ≥ 0}
• The problem of maximizing the margin over all normalized variables is:

$$(\mathrm{PM}): \quad \rho^* = \max_{\lambda\in\Delta_n} p(\lambda)$$

Recall that we have access to a weak learner W(·) that, for any distribution w on the examples (w ≥ 0, e^T w = 1), returns the base classifier j* ∈ {1, …, n} that does best on the weighted examples determined by w:

$$j^* \in \arg\max_{j=1,\dots,n} w^T A_j$$


AdaBoost Algorithm

AdaBoost Algorithm
Initialize at w^0 = (1/m, …, 1/m), λ^0 = 0, k = 0.
At iteration k ≥ 0:
1. Compute j_k ∈ W(w^k)
2. Choose step-size α_k ≥ 0 and set:
   λ^{k+1} ← λ^k + α_k e_{j_k},  λ̄^{k+1} ← λ^{k+1} / (e^T λ^{k+1})
3. Set w^{k+1}_i ← w^k_i exp(−α_k A_{i j_k}), i = 1, …, m, and re-normalize w^{k+1} so that e^T w^{k+1} = 1

AdaBoost has the following sparsity/regularization properties: ‖λ^k‖₀ ≤ k and ‖λ^k‖₁ ≤ Σ_{i=0}^{k−1} α_i
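A minimal NumPy sketch of these updates (illustrative only, with the weak learner written as an explicit argmax over the columns of A); the step-size rule is left as a user-supplied callable, for example the constant step-size from the complexity results on the following slides:

```python
import numpy as np

def adaboost(A, num_iters, step_size):
    """Minimal AdaBoost sketch with a generic step-size rule.

    A         -- m x n feature matrix, A[i, j] = y_i * h_j(x_i)
    step_size -- callable k -> alpha_k >= 0
    Returns lambda^k and the normalized lambda_bar^k.
    """
    m, n = A.shape
    w = np.full(m, 1.0 / m)               # distribution over examples, w^0
    lam = np.zeros(n)
    for k in range(num_iters):
        j = int(np.argmax(w @ A))         # weak learner: best column under w^k
        alpha = step_size(k)
        lam[j] += alpha                   # lambda^{k+1} <- lambda^k + alpha_k e_{j_k}
        w = w * np.exp(-alpha * A[:, j])
        w = w / w.sum()                   # re-normalize so that e^T w = 1
    lam_bar = lam / lam.sum() if lam.sum() > 0 else lam
    return lam, lam_bar
```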


Optimization Perspectives on AdaBoost

What has been known about AdaBoost in the context of optimization:
• AdaBoost has been interpreted as a coordinate descent method to minimize the exponential loss [Mason et al., Mukherjee et al., etc.]
• A related method, the Hedge Algorithm, has been interpreted as dual averaging [Baes and Bürgisser]
• Rudin et al. in fact show that AdaBoost can fail to maximize the margin, but this is under the particular popular "optimized" step-size α_k := (1/2) ln((1 + r_k)/(1 − r_k))
• Lots of other work as well...

Complexity of AdaBoost: General Case

Recall L(λ) := log( (1/m) Σ_{i=1}^m exp(−(Aλ)_i) ) and that ρ* is the maximum (normalized) margin.

Complexity of AdaBoost
For all k ≥ 1, the sequences λ^k and λ̄^k produced by AdaBoost satisfy:

$$\min_{i\in\{0,\dots,k-1\}} \|\nabla L(\lambda^i)\|_\infty - p(\bar{\lambda}^k) \;\le\; \frac{\ln(m) + \tfrac{1}{2}\sum_{i=0}^{k-1}\alpha_i^2}{\sum_{i=0}^{k-1}\alpha_i}\, .$$

If we decide a priori to run AdaBoost for k ≥ 1 iterations and use a constant step-size α_i := √(2 ln(m)/k) for all i = 0, …, k−1, then we have:

$$\min_{i\in\{0,\dots,k-1\}} \|\nabla L(\lambda^i)\|_\infty - p(\bar{\lambda}^k) \;\le\; \sqrt{\frac{2\ln(m)}{k}}\, .$$


Complexity of AdaBoost: Separable Case

If the data is separable, then ρ* > 0 and the margin is informative.

Complexity of AdaBoost: Separable Case
For all k ≥ 1, the sequence λ̄^k produced by AdaBoost satisfies:

$$p(\bar{\lambda}^k) \;\ge\; \rho^* - \frac{\ln(m) + \tfrac{1}{2}\sum_{i=0}^{k-1}\alpha_i^2}{\sum_{i=0}^{k-1}\alpha_i}\, .$$

If we decide a priori to run AdaBoost for k ≥ 1 iterations and use a constant step-size α_i := √(2 ln(m)/k) for all i = 0, …, k−1, then we have:

$$p(\bar{\lambda}^k) \;\ge\; \rho^* - \sqrt{\frac{2\ln(m)}{k}}\, .$$


Complexity of AdaBoost: Non-separable Case

If the data is not separable, then ρ* = 0 and the log-exponential loss function is informative.

Complexity of AdaBoost: Non-separable Case
If the data is not separable, then for all k ≥ 1, the sequence λ^k produced by AdaBoost satisfies:

$$\min_{i\in\{0,\dots,k-1\}} \|\nabla L(\lambda^i)\|_\infty \;\le\; \frac{\ln(m) + \tfrac{1}{2}\sum_{i=0}^{k-1}\alpha_i^2}{\sum_{i=0}^{k-1}\alpha_i}\, .$$

If we decide a priori to run AdaBoost for k ≥ 1 iterations and use a constant step-size α_i := √(2 ln(m)/k) for all i = 0, …, k−1, then we have:

$$\min_{i\in\{0,\dots,k-1\}} \|\nabla L(\lambda^i)\|_\infty \;\le\; \sqrt{\frac{2\ln(m)}{k}}\, .$$


What drives these results?

• Observation that AdaBoost corresponds to the Mirror Descent method [N-Y, B-M-T, B-T] of non-differentiable convex optimization, using the "entropy prox function", applied to the dual of the maximum margin problem
• application of Mirror Descent convergence theory for various step-size sequences
• development of some new algorithmic properties of the Mirror Descent method in general


What about Regularized Log-Exponential Loss Minimization?

Log-exponential loss function is:

$$L(\lambda) := \log\left(\frac{1}{m}\sum_{i=1}^{m}\exp\left(-(A\lambda)_i\right)\right)$$

In the non-separable case, AdaBoost guarantees ‖∇L(λ^i)‖_∞ ↘ 0.
Let us consider directly tackling L(λ) in the regularized setting:

$$L^*_\delta = \min_{\lambda}\; L(\lambda) \quad \text{s.t. } \|\lambda\|_1 \le \delta,\; \lambda \ge 0$$


FW Method for Log-Exponential Loss Minimization

$$L^*_\delta = \min_{\lambda}\; L(\lambda) \quad \text{s.t. } \|\lambda\|_1 \le \delta,\; \lambda \ge 0$$

Consider using the FW method. At iteration k the method needs to:
• compute ∇L(λ^k)
• solve min_{λ : ‖λ‖₁ ≤ δ, λ ≥ 0} ∇L(λ^k)^T λ
• update λ^{k+1}

We cannot necessarily do the first two steps directly... But we do have access to a weak learner W(·): for w ∈ Δ_m, W(w) computes:

$$j^* \in \arg\max_{j=1,\dots,n} w^T A_j$$


Log-Exponential Loss Minimization, continued

Instead, work with the log-exponential loss function in conjugate (adjoint) form. Let e(w) := Σ_{i=1}^m w_i ln(w_i) + ln(m) be the entropy function.

Proposition
• L(λ^k) = max_{w∈Δ_m} { −w^T Aλ^k − e(w) }
• ∇L(λ^k) = −A^T w^k, where

$$w^k_i = \frac{\exp\left(-(A\lambda^k)_i\right)}{\sum_{l=1}^{m}\exp\left(-(A\lambda^k)_l\right)}\, , \quad i = 1, \dots, m$$

The weak learner can be used to solve the linear optimization subproblem using w^k:

$$j_k \in W(w^k) \;\Longleftrightarrow\; \delta e_{j_k} \in \arg\min_{\lambda:\|\lambda\|_1\le\delta,\;\lambda\ge0} \nabla L(\lambda^k)^T\lambda$$
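For illustration, the proposition translates directly into code: the weights w^k are a softmax of −Aλ^k and the gradient is −A^T w^k. A small sketch (the max-shift is a numerical-stability detail added here, not part of the slides):

```python
import numpy as np

def logexp_loss_and_grad(A, lam):
    """Log-exponential loss L(lambda), its gradient, and the weights w on Delta_m."""
    m = A.shape[0]
    z = -(A @ lam)
    z_shift = z - z.max()                  # shift for numerical stability
    expz = np.exp(z_shift)
    w = expz / expz.sum()                  # w^k_i proportional to exp(-(A lam)_i)
    loss = np.log(expz.sum()) + z.max() - np.log(m)
    grad = -(A.T @ w)                      # grad L(lambda) = -A^T w
    return loss, grad, w
```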


FW-Boost Algorithm Description

FW-Boost Algorithm
Initialize at λ^0 = 0, w^0 = (1/m, …, 1/m), k = 0.
At iteration k ≥ 0:
1. Compute j_k ∈ W(w^k)
2. Choose ᾱ_k ∈ [0, 1] and set: λ^{k+1} ← (1 − ᾱ_k)λ^k + ᾱ_k δ e_{j_k}
3. Set w^{k+1}_i ← (w^k_i)^{1−ᾱ_k} exp(−ᾱ_k δ A_{i j_k}), i = 1, …, m, and re-normalize w^{k+1} so that e^T w^{k+1} = 1

Note that FW-Boost has the sparsity property that ‖λ^k‖₀ ≤ k.
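A minimal NumPy sketch of FW-Boost (illustrative only, with the weak learner again written as an explicit argmax and the step-size 2/(k+2) from the next slide):

```python
import numpy as np

def fw_boost(A, delta, num_iters):
    """Minimal FW-Boost sketch: Frank-Wolfe on the L1-constrained log-exponential loss."""
    m, n = A.shape
    w = np.full(m, 1.0 / m)                   # w^0
    lam = np.zeros(n)
    for k in range(num_iters):
        j = int(np.argmax(w @ A))             # weak learner step
        alpha = 2.0 / (k + 2.0)
        lam *= (1.0 - alpha)                  # lambda^{k+1} <- (1 - alpha) lambda^k ...
        lam[j] += alpha * delta               # ... + alpha * delta * e_{j_k}
        w = (w ** (1.0 - alpha)) * np.exp(-alpha * delta * A[:, j])
        w = w / w.sum()                       # keep w on the simplex
    return lam
```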


Complexity of FW-Boost

Complexity of FW-Boost
With the step-size rule ᾱ_k := 2/(k+2), for all k ≥ 1 the following inequalities hold:
(i) L(λ^k) − L*_δ ≤ 8δ²/(k+3)
(ii) p(λ̄^k) ≥ ρ* − 8δ/(k+3) − ln(m)/δ
(iii) ‖λ^k‖₁ ≤ δ
(iv) ‖λ^k‖₀ ≤ k


Binary Classification Boosting Summary

• AdaBoost is interpretable as an instance of the Mirror Descent method applied to the dual of the maximum margin problem
• Computational complexity guarantees for AdaBoost, both for maximizing the margin and for minimizing the log-exponential loss
• New properties of the Mirror Descent method
• The Frank-Wolfe method to minimize the log-exponential loss is seen to be a slight modification of AdaBoost, with associated computational complexity guarantees in the separable and non-separable cases


Linear Regression



Linear Regression

Consider the linear regression model y = Xβ + e:
• y ∈ R^n is the given response data
• X ∈ R^{n×p} is the given model matrix
• β ∈ R^p are the coefficients
• e ∈ R^n is noise


Linear Regression and Boosting

Linear regression model: y = Xβ + e
In the setting of boosting:
• the column X_j represents the data of the j-th weak model
• β_j is the regression coefficient for the j-th weak model


Linear Regression Aspirations

Linear regression model: y = Xβ + e
In the high-dimensional regime with p ≫ 0, n ≫ 0, and often p > n, we seek:
• Good predictive performance (on out-of-sample data)
• Good performance on the training data (residuals r := y − Xβ are small)
• "Shrinkage" in the coefficients (‖β‖₁ is small)
• Sparsity in the coefficients (‖β‖₀ := number of non-zero coefficients of β is small)


Traditional Least-Squares Regression

$$\mathrm{LS}: \quad \min_{\beta}\; L(\beta) := \tfrac{1}{2}\|y - X\beta\|_2^2$$

• L(β) := (1/2)‖y − Xβ‖₂² is the least-squares loss
• Let r := y − Xβ; then L(β) = (1/2)‖r‖₂²
• L* := min_β L(β)
• β_LS is any solution of LS


L1-Regularization and LASSO

L1-Penalized Least-Squares optimization problem:

$$\mathrm{LASSO}_\tau: \quad \min_{\beta}\; \tfrac{1}{2}\|y - X\beta\|_2^2 + \tau\|\beta\|_1$$

• LASSO stands for Least Absolute Shrinkage and Selection Operator
• Let β*_τ be an optimal solution of LASSO_τ
• ‖β*_τ‖₀ ↘ as τ ↗
• There is a well-developed theory of sparse L1-regularized solutions


Constraint Version of LASSO

$$\mathrm{LASSO}_\tau: \quad \min_{\beta}\; \tfrac{1}{2}\|y - X\beta\|_2^2 + \tau\|\beta\|_1$$

$$\mathrm{LASSO}_\delta: \quad \min_{\beta}\; L(\beta) := \tfrac{1}{2}\|y - X\beta\|_2^2 \quad \text{s.t. } \|\beta\|_1 \le \delta$$

Both LASSO_τ for τ ∈ [0, ∞) and LASSO_δ for δ ∈ [0, ∞) generate the same solution path, which is simply called the LASSO path.


Incremental Forward Stagewise Regression Algorithm (FSε)

Incremental Forward Stagewise Regression (FSε) is a simple boosting algorithm:
• Start with β^0 ← 0, and hence r^0 ← y.
• Given β^k and r^k := y − Xβ^k, determine the weak model X_j most correlated with the current residuals r^k:

$$j_k \leftarrow \arg\max_{j\in\{1,\dots,p\}} |(r^k)^T X_j|$$

• Adjust β^k_{j_k} by ±ε depending on sgn((r^k)^T X_{j_k})


FSε Algorithm

FSε Algorithm
Initialize at r^0 = y, β^0 = 0, k = 0; set ε > 0.
At iteration k ≥ 0:
1. Compute: r^k ← y − Xβ^k and j_k ∈ arg max_{j∈{1,…,p}} |(r^k)^T X_j|
2. Set: β^{k+1} ← β^k + ε sgn((r^k)^T X_{j_k}) e_{j_k}

FSε has the following regularization/sparsity properties: ‖β^k‖₁ ≤ kε and ‖β^k‖₀ ≤ k.
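A minimal NumPy sketch of FSε (illustrative only; in a boosting application the argmax over columns would again be delegated to a weak learner):

```python
import numpy as np

def incremental_forward_stagewise(X, y, eps, num_iters):
    """Minimal FS_eps sketch: fixed-size steps on the most-correlated column."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.astype(float)                       # residuals r^0 = y
    for _ in range(num_iters):
        corr = X.T @ r                        # correlations (r^k)^T X_j
        j = int(np.argmax(np.abs(corr)))      # most correlated weak model
        beta[j] += eps * np.sign(corr[j])     # move by +/- eps
        r = y - X @ beta                      # update residuals
    return beta
```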


Complexity of FSε

Complexity of FSε
With the constant shrinkage factor ε, for any k ≥ 0 there exists i ≤ k for which:

(i) $L(\beta^i) - L^* \;\le\; \frac{p}{2\,\lambda_{\min}(X)^2}\left[\frac{\|X\beta_{LS}\|_2^2}{\varepsilon(k+1)} + \varepsilon\|X\|_{1,2}^2\right]^2$

(ii) there exists a least-squares solution β_LS for which $\|\beta^i - \beta_{LS}\|_2 \;\le\; \frac{\sqrt{p}}{\lambda_{\min}(X)^2}\left[\frac{\|X\beta_{LS}\|_2^2}{\varepsilon(k+1)} + \varepsilon\|X\|_{1,2}^2\right]$

(iii) ‖β^i‖₁ ≤ kε

(iv) ‖β^i‖₀ ≤ k

(v) $\|\nabla L(\beta^i)\|_\infty \;\le\; \frac{\|X\beta_{LS}\|_2^2}{2\varepsilon(k+1)} + \frac{\varepsilon\|X\|_{1,2}^2}{2}$

Notes: Recall that L* is the optimal least-squares loss and β_LS is an optimal least-squares solution; therefore ‖Xβ_LS‖₂ ≤ ‖y‖₂.

Complexity of FSε, continued

Optimized Complexity of FSε
For a given number of iterations k, set $\varepsilon := \frac{\|X\beta_{LS}\|_2}{\|X\|_{1,2}\sqrt{k+1}}$. Then there exists i ≤ k for which:

(i) $L(\beta^i) - L^* \;\le\; \frac{2p\,\|X\|_{1,2}^2\,\|X\beta_{LS}\|_2^2}{\lambda_{\min}(X)^2\,(k+1)}$

(ii) there exists a solution β_LS for which $\|\beta^i - \beta_{LS}\|_2 \;\le\; \frac{\sqrt{4p}\,\|X\|_{1,2}\,\|X\beta_{LS}\|_2}{\lambda_{\min}(X)^2\,\sqrt{k+1}}$

(iii) $\|\beta^i\|_1 \;\le\; \frac{\sqrt{k}\,\|X\beta_{LS}\|_2}{\|X\|_{1,2}}$

(iv) ‖β^i‖₀ ≤ k

(v) $\|\nabla L(\beta^i)\|_\infty \;\le\; \frac{\|X\|_{1,2}\,\|X\beta_{LS}\|_2}{\sqrt{k+1}}$


A “Smarter” Forward Stagewise Regression Algorithm (FS)

Forward Stagewise regression (FS) chooses ε = ε_k "optimally" with respect to L(β) at each iterate:
• Start with β^0 ← 0, and hence r^0 ← y.
• Given β^k and r^k := y − Xβ^k, determine the weak model X_j most correlated with the current residuals r^k:

$$j_k \leftarrow \arg\max_{j\in\{1,\dots,p\}} |(r^k)^T X_j|$$

• Adjust β^k_{j_k} by ε_k ← arg min_ε L(β^k + ε e_{j_k})


FS Algorithm

FS Algorithm
Initialize at r^0 = y, β^0 = 0, k = 0.
At iteration k ≥ 0:
1. Compute: r^k ← y − Xβ^k and j_k ∈ arg max_{j∈{1,…,p}} |(r^k)^T X_j|
2. Set: ε_k ← (r^k)^T X_{j_k} / ‖X_{j_k}‖₂² and β^{k+1} ← β^k + ε_k e_{j_k}
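The same sketch as for FSε, now with the exact line-search step ε_k (illustrative only):

```python
import numpy as np

def forward_stagewise(X, y, num_iters):
    """Minimal FS sketch: exact line-search step on the most-correlated column."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.astype(float)
    for _ in range(num_iters):
        corr = X.T @ r
        j = int(np.argmax(np.abs(corr)))
        eps_k = corr[j] / (X[:, j] @ X[:, j])  # argmin over eps of L(beta^k + eps e_j)
        beta[j] += eps_k
        r = y - X @ beta
    return beta
```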


Complexity of FS

Complexity of FS
With the shrinkage factors ε_k ← (r^k)^T X_{j_k} / ‖X_{j_k}‖₂², for all k ≥ 0 it holds that:

(i) $L(\beta^k) - L^* \;\le\; (y^T y - L^*)\left(1 - \frac{\lambda_{\min}(X)^2}{4p\,\|X\|_{1,2}^2}\right)^{k}$

(ii) there exists a solution β_LS for which $\|\beta^k - \beta_{LS}\|_2 \;\le\; \frac{\sqrt{2(y^T y - L^*)}}{\lambda_{\min}(X)}\left(1 - \frac{\lambda_{\min}(X)^2}{4p\,\|X\|_{1,2}^2}\right)^{k/2}$

(iii) $\|\beta^k\|_1 \;\le\; \frac{\sqrt{k}\,\|X\beta_{LS}\|_2}{\min_j\{\|X_j\|_2\}}$

(iv) ‖β^k‖₀ ≤ k

(v) $\min_{i\in\{0,\dots,k\}}\|\nabla L(\beta^i)\|_\infty \;\le\; \frac{\|X\|_{1,2}\,\|X\beta_{LS}\|_2}{\sqrt{k+1}}$


What drives these results?

• Observation that FSε "looks like" subgradient descent for some objective function f(·) and some decision variables; indeed, the objective function is f(r) := ‖X^T r‖_∞ and the variables are the residuals r in the affine space P_res := {r ∈ R^n : r = y − Xβ for some β ∈ R^p}
• application of subgradient descent convergence theory for various step-size sequences
• development of some new theory on algorithmic implications of positive semi-definite quadratic functions

What about Explicit Regularized Linear Regression?

Recall LASSO_δ:

$$L^*_\delta := \min_{\beta}\; L(\beta) := \tfrac{1}{2}\|y - X\beta\|_2^2 \quad \text{s.t. } \|\beta\|_1 \le \delta$$

FSε guarantees that ‖β^k‖₀ ≤ k. A method with similar sparsity properties is the Frank-Wolfe method applied to LASSO_δ.
At iteration k, the Frank-Wolfe method needs to:
• Compute ∇L(β^k) = −X^T(y − Xβ^k) = −X^T r^k
• Solve min_{β : ‖β‖₁ ≤ δ} ∇L(β^k)^T β
• Update β^{k+1}


FW for LASSO, Linear Optimization Subproblem

The linear optimization subproblem is:

$$\min_{\beta}\; \nabla L(\beta^k)^T\beta \quad \text{s.t. } \|\beta\|_1 \le \delta$$

• Extreme points of the feasible region are {±δ e_j : j = 1, …, p}
• ∇L(β^k) = −X^T(y − Xβ^k) = −X^T r^k
• Therefore:

$$j^* \in \arg\max_{j\in\{1,\dots,p\}} |(r^k)^T X_j| \;\Longleftrightarrow\; \delta\,\mathrm{sgn}\!\left((r^k)^T X_{j^*}\right) e_{j^*} \in \arg\min_{\beta:\|\beta\|_1\le\delta} \nabla L(\beta^k)^T\beta$$

This is the same subproblem that FSε solves, namely, find the weak model X_{j*} that is most correlated with the current residuals r^k.


FW Algorithm for LASSO

FW-LASSO Algorithm
Initialize at β^0 = 0, k = 0.
At iteration k ≥ 0:
1. Compute: r^k ← y − Xβ^k and j_k ∈ arg max_{j∈{1,…,p}} |(r^k)^T X_j|
2. Choose ᾱ_k ∈ [0, 1] and set: β^{k+1} ← (1 − ᾱ_k)β^k + ᾱ_k δ sgn((r^k)^T X_{j_k}) e_{j_k}

Note that FW-LASSO is structurally very similar to FSε.
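A minimal NumPy sketch of FW-LASSO (illustrative only), using the step-size 2/(k+2) from the complexity result on the next slide:

```python
import numpy as np

def fw_lasso(X, y, delta, num_iters):
    """Minimal FW-LASSO sketch: Frank-Wolfe on the L1-ball of radius delta."""
    n, p = X.shape
    beta = np.zeros(p)
    for k in range(num_iters):
        r = y - X @ beta                      # residuals r^k
        corr = X.T @ r
        j = int(np.argmax(np.abs(corr)))      # same subproblem as FS_eps
        alpha = 2.0 / (k + 2.0)
        beta *= (1.0 - alpha)                 # beta^{k+1} <- (1 - alpha) beta^k ...
        beta[j] += alpha * delta * np.sign(corr[j])   # ... + alpha * delta * sgn * e_{j_k}
    return beta
```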


Properties of FW-LASSO

Note that FW-LASSO shares sparsity/regularization properties similar to those of FSε: ‖β^k‖₀ ≤ k and ‖β^k‖₁ ≤ δ.


Complexity of FW-LASSO

Complexity of FW-LASSO
With the step-size rule ᾱ_k := 2/(k+2), after k iterations there exists an i ∈ {1, …, k} such that the following hold:

(i) $L(\beta^i) - L^*_\delta \;\le\; \frac{17.4\,\|X\|_{1,2}^2\,\delta^2}{k}$

(ii) ‖β^k‖₁ ≤ δ

(iii) ‖β^k‖₀ ≤ k

(iv) $\|\nabla L(\beta^i)\|_\infty \;\le\; \frac{\|X\beta_{LS}\|_2^2}{2\delta} + \frac{17.4\,\|X\|_{1,2}^2\,\delta}{k}$


Linear Regression Summary

• FSε and FS are interpretable as subgradient descent to minimize the maximum correlation between the residuals and the predictors, in the space of residuals
• Computational complexity guarantees for FSε and FS: least-squares loss of the iterates, distance of the iterates to optimal least-squares solutions, and sparsity and regularization bounds
• New theory on algorithmic implications of positive semi-definite quadratic functions
• The Frank-Wolfe method to minimize the explicitly regularized least-squares loss (LASSO_δ) is seen to be a slight modification of FSε, with associated computational complexity guarantees