
Adaptive Sampling Strategies for Stochastic Optimization

Raghu (Vijaya Raghavendra) Bollapragada¹, Richard Byrd², Jorge Nocedal¹

¹Northwestern University   ²University of Colorado Boulder

January 8, 2018
11th US-Mexico Workshop on Optimization and its Applications, Huatulco, Mexico

Optimization Problem

min_{x∈R^d} F(x) = E_ζ[f(x; ζ)]

Structural Risk Minimization: min_{x∈R^d} F(x) = ∫ f(x; z, y) dP(z, y)

Empirical Risk Minimization: min_{x∈R^d} R(x) = (1/n) Σ_{i=1}^n F_i(x)

- Stochastic gradient is a popular first-order method for solving these problems
- Many stochastic first-order variance-reduced methods have been proposed for the finite-sum problem: SAG [Schmidt et al. 2016], SAGA [Defazio et al. 2014], SVRG [Johnson and Zhang 2013]
- These methods require either storage or computation of the full gradient
- They achieve linear convergence for strongly convex functions

Adaptive Sampling Methods

x_{k+1} = x_k − α_k ∇F_{S_k}(x_k),   ∇F_{S_k}(x_k) = (1/|S_k|) Σ_{i∈S_k} ∇F_i(x_k),

where the set S_k ⊂ {1, 2, ...} indexes data points (y_i, z_i) drawn at random from the distribution P.

- Noise in the steps is controlled by the sample sizes
- These methods can take advantage of parallel frameworks
- If the sample sizes are increased at a geometric rate, R-linear convergence is obtained for strongly convex functions [Byrd et al. 2012], [Friedlander and Schmidt 2012], [Pasupathy et al. 2015]

A sketch of one such step is given below.
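To make the update concrete, here is a minimal Python/NumPy sketch of one subsampled gradient step; the per-sample gradient oracle `grad_fi` and the way S_k is drawn are illustrative assumptions, not part of the talk.

```python
import numpy as np

def subsampled_gradient_step(x, alpha, sample_size, n, grad_fi, rng):
    """One step x_{k+1} = x_k - alpha_k * grad F_{S_k}(x_k).

    grad_fi(i, x) is assumed to return grad F_i(x) for sample i.
    """
    # Draw the index set S_k uniformly at random from {0, ..., n-1}.
    S = rng.choice(n, size=sample_size, replace=False)
    # Average the per-sample gradients over S_k.
    g = np.mean([grad_fi(i, x) for i in S], axis=0)
    return x - alpha * g
```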

Overview

1. Adaptive Sampling Tests: Norm Test, Inner Product Test
2. Convergence Analysis: Orthogonal Test, Linear Convergence
3. Practical Implementation: Step-Length Strategy, Parameter Selection
4. Numerical Experiments
5. Summary

Norm Test

‖∇F_{S_k}(x_k) − ∇F(x_k)‖ ≤ θ_n ‖∇F(x_k)‖, for some θ_n ∈ [0, 1)

In expectation, this is ensured by the variance condition

E[‖∇F_i(x_k) − ∇F(x_k)‖²] / |S_k| ≤ θ_n² ‖∇F(x_k)‖²

- Byrd et al. [2012] proposed this test to control the sample sizes
- Cartis and Scheinberg [2016] ensured the condition is satisfied in probability and analyzed global convergence properties
- Hashemi et al. [2014] used a similar test in simulation-optimization settings
- The test is designed to get more than just descent directions: sample gradients end up unnecessarily close to the true gradients, so sample sizes are increased at much faster rates than desired

A sample-based version of this check is sketched below.
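A minimal sketch of the norm test check on a batch, assuming (as with the inner product test approximation later in the talk) that the population variance and true gradient are replaced by their sample estimates; this sample-based variant is an assumption for illustration.

```python
import numpy as np

def norm_test_holds(grads_S, theta_n):
    """Sample-based norm test.

    grads_S: (|S_k|, d) array whose rows are the sampled gradients
             grad F_i(x_k), i in S_k.
    """
    S = grads_S.shape[0]
    g = grads_S.mean(axis=0)                      # sampled gradient
    var = np.sum((grads_S - g) ** 2) / (S - 1)    # sample variance
    return var / S <= theta_n ** 2 * np.dot(g, g)
```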

Inner Product Test

First-order descent condition: ∇F_{S_k}(x_k)ᵀ ∇F(x_k) > 0

The condition holds in expectation:

E[∇F_{S_k}(x_k)ᵀ ∇F(x_k)] = ‖∇F(x_k)‖² > 0

For the descent condition to hold at most iterations, we impose a bound on the variance:

E[(∇F_i(x_k)ᵀ ∇F(x_k) − ‖∇F(x_k)‖²)²] / |S_k| ≤ θ_ip² ‖∇F(x_k)‖⁴,   θ_ip ∈ [0, 1)

- The test is designed to achieve descent directions sufficiently often
- The sample sizes required to satisfy this condition are smaller than those required for the norm condition

Comparison

Figure: Given a gradient ∇F, the shaded areas denote the sets of vectors g(x) satisfying the deterministic versions of (a) the norm test, ‖g(x) − ∇F(x)‖ ≤ θ_n ‖∇F(x)‖, and (b) the inner product test, |g(x)ᵀ ∇F(x) − ‖∇F(x)‖²| ≤ θ_ip ‖∇F(x)‖².

Lemma

Let |S_ip| and |S_n| denote the minimum number of samples required to satisfy the inner product test and the norm test, respectively, at any given iterate x and for any given θ_ip = θ_n < 1. Then

|S_ip| / |S_n| = β(x) ≤ 1,

where

β(x) = (E[‖∇F_i(x)‖² cos²(γ_i)] − ‖∇F(x)‖²) / (E[‖∇F_i(x)‖²] − ‖∇F(x)‖²),

and γ_i is the angle made by ∇F_i(x) with ∇F(x).

Test Approximation

E[(∇F_i(x_k)ᵀ ∇F(x_k) − ‖∇F(x_k)‖²)²] / |S_k| ≤ θ_ip² ‖∇F(x_k)‖⁴,   θ_ip ∈ [0, 1)

- Computing the true gradient is expensive
- So we approximate the population variance with the sample variance, and the true gradient with the sampled gradient:

Var_{i∈S_k}(∇F_i(x_k)ᵀ ∇F_{S_k}(x_k)) / |S_k| ≤ θ_ip² ‖∇F_{S_k}(x_k)‖⁴,

where

Var_{i∈S_k}(∇F_i(x_k)ᵀ ∇F_{S_k}(x_k)) = (1/(|S_k| − 1)) Σ_{i∈S_k} (∇F_i(x_k)ᵀ ∇F_{S_k}(x_k) − ‖∇F_{S_k}(x_k)‖²)²

- Whenever the condition is not satisfied, the sample size is increased so as to satisfy it (see the sketch below)
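A minimal Python/NumPy sketch of this practical test, assuming the batch of per-sample gradients is available as the rows of an array; the sample-size suggestion simply inverts the variance bound with the current estimates frozen, an illustrative choice rather than the talk's exact update rule.

```python
import numpy as np

def inner_product_test_holds(grads_S, theta_ip):
    """Practical inner product test on a sampled batch.

    grads_S: (|S_k|, d) array of per-sample gradients grad F_i(x_k).
    """
    S = grads_S.shape[0]
    g = grads_S.mean(axis=0)       # sampled gradient, stands in for grad F
    gg = np.dot(g, g)              # ||grad F_{S_k}(x_k)||^2
    inner = grads_S @ g            # grad F_i^T grad F_{S_k} for each i in S_k
    var = np.sum((inner - gg) ** 2) / (S - 1)     # sample variance
    return var / S <= theta_ip ** 2 * gg ** 2

def suggested_sample_size(grads_S, theta_ip):
    """Smallest |S| passing the test if the variance estimate were exact."""
    S = grads_S.shape[0]
    g = grads_S.mean(axis=0)
    gg = np.dot(g, g)
    var = np.sum((grads_S @ g - gg) ** 2) / (S - 1)
    return int(np.ceil(var / (theta_ip ** 2 * gg ** 2)))
```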


Orthogonal Test

- Although the inner product test is practical, convergence cannot be established under it alone, because it allows sample gradients that are arbitrarily long relative to ∇F(x_k)
- It places no restriction on near orthogonality of the sample gradient and the true gradient
- The component of the sample gradient orthogonal to the true gradient is 0 in expectation
- We therefore control the variance of the orthogonal components of the sampled gradients (see the sketch below):

E[‖∇F_i(x_k) − (∇F_i(x_k)ᵀ ∇F(x_k) / ‖∇F(x_k)‖²) ∇F(x_k)‖²] / |S_k| ≤ ν² ‖∇F(x_k)‖²,   ν > 0
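Under the same conventions as the earlier sketches (rows of per-sample gradients; sampled gradient standing in for the true one), a sample-based version of this check might look as follows; the substitution of sample estimates is again an assumption for illustration.

```python
import numpy as np

def orthogonality_test_holds(grads_S, nu):
    """Sample-based check of the orthogonality condition.

    grads_S: (|S_k|, d) array of per-sample gradients grad F_i(x_k).
    """
    S = grads_S.shape[0]
    g = grads_S.mean(axis=0)
    gg = np.dot(g, g)
    # Component of each grad F_i orthogonal to g.
    orth = grads_S - np.outer(grads_S @ g / gg, g)
    var = np.sum(orth ** 2) / (S - 1)    # sample variance of those components
    return var / S <= nu ** 2 * gg
```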

Linear Convergence

Theorem

Suppose that F is twice continuously differentiable and that there exist constants 0 < µ ≤ L such that µI ⪯ ∇²F(x) ⪯ LI for all x ∈ R^d. Let {x_k} be the iterates generated by the subsampled gradient method from any x_0, where |S_k| is chosen such that the inner product test and the orthogonal test are satisfied at each iteration for any given θ_ip > 0 and ν > 0. Then, if the steplength satisfies

α_k = α = 1 / ((1 + θ_ip² + ν²) L),

we have

E[F(x_k) − F(x*)] ≤ ρ^k (F(x_0) − F(x*)),   where   ρ = 1 − (µ/L) · 1 / (1 + θ_ip² + ν²).

(Corresponding results for general convex and nonconvex objectives are given in the appendix.)


Step-Length Selection

x_{k+1} = x_k − α_k ∇F_{S_k}(x_k)

- Stochastic gradient is usually employed with diminishing stepsizes
- An exact line search could be performed to determine the stepsize, but it is too expensive
- A constant stepsize α_k = 1/L can be employed, but one needs to know the Lipschitz constant of the problem
- We propose to estimate the Lipschitz constant as we proceed, resulting in adaptive stepsizes

Algorithm 1: Estimating the Lipschitz constant
Input: L_{k−1} > 0, some η > 1
1: Compute parameter ζ_k > 1
2: Set L_k = L_{k−1} / ζ_k    ▷ decrease the Lipschitz estimate
3: Compute F_new = F_{S_k}(x_k − (1/L_k) ∇F_{S_k}(x_k))
4: while F_new > F_{S_k}(x_k) − (1/(2L_k)) ‖∇F_{S_k}(x_k)‖² do    ▷ sufficient decrease
5:     Set L_k = η L_k    ▷ increase the Lipschitz estimate
6:     Compute F_new = F_{S_k}(x_k − (1/L_k) ∇F_{S_k}(x_k))
7: end while

- Beck and Teboulle [2009] proposed this scheme to estimate the Lipschitz constant in deterministic settings
- Schmidt et al. [2016] adapted it to stochastic algorithms such as SAG
- It is similar to a line search on the sampled functions, with memory
- It only needs access to sampled function values

A runnable sketch follows.
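A runnable Python sketch of Algorithm 1, assuming the caller supplies a callable `F_S` for the sampled objective F_{S_k}, the sampled gradient `g_S`, and the parameter ζ_k (computed as in the aggressive-steps heuristic below); all names are illustrative.

```python
import numpy as np

def estimate_lipschitz(F_S, x, g_S, L_prev, zeta, eta=2.0):
    """Backtracking estimate of the Lipschitz constant (Algorithm 1).

    F_S:    callable evaluating the sampled objective F_{S_k}.
    g_S:    sampled gradient grad F_{S_k}(x_k) as an ndarray.
    L_prev: previous estimate L_{k-1} > 0.
    zeta:   parameter zeta_k > 1.
    eta:    increase factor > 1 (2.0 is an illustrative default).
    """
    g_sq = np.dot(g_S, g_S)
    L = L_prev / zeta                          # optimistically decrease L
    F_new = F_S(x - g_S / L)
    while F_new > F_S(x) - g_sq / (2.0 * L):   # sufficient decrease fails
        L *= eta                               # increase the estimate
        F_new = F_S(x - g_S / L)
    return L                                   # use stepsize alpha_k = 1/L
```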

Aggressive Steps Heuristic

It is well known, and easy to show, that

E[F(x_{k+1})] − F(x_k) ≤ −α_k ‖∇F(x_k)‖² + (α_k² L / 2) E[‖∇F_{S_k}(x_k)‖²]

Thus we obtain a decrease in the true objective, in expectation, if

(L α_k² / 2) (E[‖∇F_{S_k}(x_k) − ∇F(x_k)‖²] + ‖∇F(x_k)‖²) ≤ α_k ‖∇F(x_k)‖²

Using α_k = 1/L_k, assuming L_{k−1} ≥ L, and replacing population quantities by sample approximations, this decrease condition holds if

L_k ≥ (L_{k−1} / 2) (Var_{i∈S_k}(∇F_i(x_k)) / (|S_k| ‖∇F_{S_k}(x_k)‖²) + 1),

where Var_{i∈S_k}(∇F_i(x_k)) = (1/(|S_k| − 1)) Σ_{i∈S_k} ‖∇F_i(x_k) − ∇F_{S_k}(x_k)‖².

Therefore, in Algorithm 1 we take

ζ_k = max{ 1, 2 / (Var_{i∈S_k}(∇F_i(x_k)) / (|S_k| ‖∇F_{S_k}(x_k)‖²) + 1) }

(a sketch of this computation follows).
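A small sketch of the ζ_k computation under the same batch conventions as the earlier test sketches; it plugs directly into the `estimate_lipschitz` sketch above.

```python
import numpy as np

def compute_zeta(grads_S):
    """zeta_k = max{1, 2 / (Var_i(grad F_i) / (|S_k| ||g||^2) + 1)}."""
    S = grads_S.shape[0]
    g = grads_S.mean(axis=0)                     # sampled gradient
    var = np.sum((grads_S - g) ** 2) / (S - 1)   # sample variance
    return max(1.0, 2.0 / (var / (S * np.dot(g, g)) + 1.0))
```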

Parameter Selection

By a normal approximation,

(∇F_{S_k}(x_k)ᵀ ∇F(x_k) − ‖∇F(x_k)‖²) / (σ / √|S_k|) ∼ N(0, 1),

where σ² = E[(∇F_i(x_k)ᵀ ∇F(x_k) − ‖∇F(x_k)‖²)²] is the true variance.

- The parameter θ_ip governs the probability of obtaining a descent direction (a quick check is sketched below)
- θ_ip = 0.7 corresponds to a probability of about 0.9
- θ_ip = 0.9 works well in practice
- The orthogonal test is seldom active in practice, and we choose ν = tan(80°) ≈ 5.67 for all problems
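A quick numerical check of the θ_ip probability claim, using only the normal model above: when the inner product test holds with equality, the sampled direction is a descent direction with probability Φ(1/θ_ip), where Φ is the standard normal CDF.

```python
import math

def descent_probability(theta_ip):
    """Phi(1/theta_ip): P(descent) under the normal approximation,
    when the inner product test holds with equality."""
    return 0.5 * (1.0 + math.erf((1.0 / theta_ip) / math.sqrt(2.0)))

print(descent_probability(0.7))  # ~0.92, matching the ~0.9 on the slide
print(descent_probability(0.9))  # ~0.87
```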

slide-75
SLIDE 75

Results: Constant Step-Length Strategy

R(x) = 1 n

n

  • i=1

log(1 + exp(−zixTyi)) + λ 2 x2

Raghu Bollapragada (NU) Adaptive Sampling Methods 22/27 US - Mexico Workshop - 2018 22 / 27
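For reference, a minimal NumPy sketch of this objective and its per-sample gradients; the use of labels z_i ∈ {−1, +1} and feature vectors y_i follows the slide's notation, and the implementation is illustrative rather than the authors' code.

```python
import numpy as np

def logistic_objective(x, Y, z, lam):
    """R(x) = (1/n) sum_i log(1 + exp(-z_i x^T y_i)) + (lam/2) ||x||^2.

    Y: (n, d) matrix with rows y_i; z: (n,) labels in {-1, +1}.
    """
    margins = z * (Y @ x)
    return np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * np.dot(x, x)

def per_sample_gradients(x, Y, z, lam):
    """Rows are grad F_i(x) = -z_i y_i / (1 + exp(z_i x^T y_i)) + lam * x."""
    margins = z * (Y @ x)
    coeff = -z / (1.0 + np.exp(margins))
    return coeff[:, None] * Y + lam * x
```

These per-sample gradients are exactly what the batch-based test sketches above take as input.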

Figure: Norm test vs. inner product test on the synthetic dataset (n = 7000), α_n = α_ip = 2^0, θ = 0.90. Panels: R(x) − R* against effective gradient evaluations and against iterations, for the augmented inner product test and the norm test.

Figure: Synthetic dataset (n = 7000), α_n = α_ip = 2^0, θ = 0.90. Panels: batch sizes (%) and the angle between sampled and true gradients, against iterations. (Results for other datasets are in the appendix.)

Results: Adaptive Step-Length Strategy

Figure: Synthetic dataset (n = 7000), θ = 0.90. Panels: R(x) − R* against effective gradient evaluations, and batch sizes (%) against iterations.

Figure: Synthetic dataset (n = 7000), θ = 0.90. Stepsizes against iterations for the augmented inner product test and the norm test, compared with the optimal constant steplength. (Results for other datasets are in the appendix.)


Summary

- Adaptive sampling methods are alternative methods for noise reduction
- They can lead to speedups when implemented in parallel environments
- We propose a practical inner product test that controls the sample sizes better than the existing norm test
- These methods can use adaptive stepsizes, and second-order information can be incorporated
- We are currently working on practical sampling tests to control sample sizes in stochastic quasi-Newton methods

Questions?

Thank You

Appendix

Backup Mechanism

- Sample approximations are not accurate when the samples are very small (say 1, 5, or 10)
- Our tests may then be inaccurate in controlling the sample sizes
- In such scenarios we need more accurate approximations, so we average recent sampled gradients:

g_avg := (1/r) Σ_{j=k−r+1}^{k} ∇F_{S_j}(x_j)

- r should be chosen such that the iterates in the summation are close enough to each other and there are enough samples for g_avg to be accurate (r = 10)
- If ‖g_avg‖ < γ ‖∇F_{S_k}(x_k)‖ for some γ ∈ (0, 1), then we use g_avg instead of ∇F_{S_k}(x_k) in the tests (see the sketch below)
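A minimal sketch of this backup mechanism, assuming the caller maintains a list of recent sampled gradients; the default γ = 0.5 is an illustrative assumption, since the talk only requires γ ∈ (0, 1).

```python
import numpy as np

def gradient_for_tests(recent_grads, g_Sk, gamma=0.5, r=10):
    """Return the gradient estimate to use inside the adaptive tests.

    recent_grads: list of recent sampled gradients grad F_{S_j}(x_j).
    g_Sk:         current sampled gradient grad F_{S_k}(x_k).
    gamma:        threshold in (0, 1); 0.5 is an illustrative choice.
    """
    g_avg = np.mean(recent_grads[-r:], axis=0)   # average of last r gradients
    if np.linalg.norm(g_avg) < gamma * np.linalg.norm(g_Sk):
        return g_avg    # noise dominates: fall back to the average
    return g_Sk
```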

Convex Functions

Theorem (General Convex Objective)

Suppose that F is twice continuously differentiable and convex, and that there exists a constant L > 0 such that ∇²F(x) ⪯ LI for all x ∈ R^d. Let {x_k} be the iterates generated by the subsampled gradient method from any x_0, where |S_k| is chosen such that the inner product test and the orthogonal test are satisfied at each iteration for any given θ_ip > 0 and ν > 0. Then, if the steplength satisfies

α_k = α < 1 / ((1 + θ_ip² + ν²) L),

we have, for any positive integer T,

min_{0≤k≤T−1} E[F(x_k)] − F* ≤ (1 / (2αcT)) ‖x_0 − x*‖²,

where the constant c > 0 is given by c = 1 − Lα(1 + θ_ip² + ν²).

Non-Convex Functions

Theorem (Nonconvex Objective)

Suppose that F is twice continuously differentiable and bounded below, and that there exists a constant L > 0 such that ∇²F(x) ⪯ LI for all x ∈ R^d. Let {x_k} be the iterates generated by the subsampled gradient method from any x_0, where |S_k| is chosen such that the inner product test and the orthogonal test are satisfied at each iteration for any given θ_ip > 0 and ν > 0. Then, if the steplength satisfies

α_k = α ≤ 1 / ((1 + θ_ip² + ν²) L),

we have

lim_{k→∞} E[‖∇F(x_k)‖²] = 0.

Moreover, for any positive integer T,

min_{0≤k≤T−1} E[‖∇F(x_k)‖²] ≤ (2 / (αT)) (F(x_0) − F_min).

Results: Constant Step-Length Strategy (other datasets)

Each figure shows R(x) − R* against effective gradient evaluations and against iterations, and batch sizes (%) against iterations, for the augmented inner product test and the norm test (θ = 0.90).

Figure: Covertype dataset (n = 581012), α_n = α_ip = 2^1.
Figure: Real-Sim dataset (n = 65078), α_n = α_ip = 2^10.
Figure: RCV1 dataset (n = 20242), α_n = α_ip = 2^8.
Figure: Mushrooms dataset (n = 8124), α_n = α_ip = 2^2.
Figure: Sido dataset (n = 12678), α_n = α_ip = 2^(−1).
Figure: Ijcnn dataset (n = 35000), α_n = α_ip = 2^7.
Figure: Gisette dataset (n = 6000), α_n = α_ip = 2^(−7).
Figure: MNIST dataset (n = 60000), α_n = α_ip = 2^(−1).

Results: Adaptive Step-Length Strategy (other datasets)

Each figure shows R(x) − R* against effective gradient evaluations, batch sizes (%) against iterations, and stepsizes against iterations (with the best or optimal constant steplength for reference), for the augmented inner product test and the norm test (θ = 0.90).

Figure: Covertype dataset (n = 581012).
Figure: Real-Sim dataset (n = 65078).
Figure: RCV1 dataset (n = 20242).
Figure: Mushrooms dataset (n = 8124).
Figure: Sido dataset (n = 12678).
Figure: Ijcnn dataset (n = 35000).
Figure: Gisette dataset (n = 6000).
Figure: MNIST dataset (n = 60000).