Scalable Machine Learning: 4. Optimization. Alex Smola, Yahoo! (PowerPoint presentation)



SLIDE 1

Scalable Machine Learning

  • 4. Optimization

Alex Smola Yahoo! Research and ANU

http://alex.smola.org/teaching/berkeley2012 Stat 260 SP 12

SLIDE 2
  • 4. Optimization
SLIDE 3

Basic Techniques

  • Gradient descent
  • Newton's method
  • Conjugate Gradient Descent
  • Broyden-Fletcher-Goldfarb-Shanno (BFGS)
  • Constrained Convex Optimization
  • Properties
  • Lagrange function
  • Wolfe dual
  • Batch methods
  • Distributed subgradient
  • Bundle methods
  • Online methods
  • Unconstrained subgradient
  • Gradient projections
  • Parallel optimization

Optimization

SLIDE 4

Why

SLIDE 5

Parameter Estimation

  • Maximum a Posteriori with Gaussian Prior
  • We have lots of data
  • Does not fit on single machine
  • Bandwidth constraints
  • May grow in real time
  • Regularized Risk Minimization yields similar problems

(more on this in a later lecture)

−log p(θ|X) = (1/2σ²)‖θ‖²  +  Σ_{i=1}^m [g(θ) − ⟨φ(x_i), θ⟩]  +  const.
              (prior)          (data)

SLIDE 6

Batch and Online

  • Batch
  • Very large dataset available
  • Require parameter only at the end
  • optical character recognition
  • speech recognition
  • image annotation / categorization
  • machine translation
  • Online
  • Spam filtering
  • Computational advertising
  • Content recommendation / collaborative filtering
SLIDE 7

Many parameters

  • 100 million to 1 billion users

Personalized content provision; impossible to adjust all parameters manually or by heuristics

  • 1,000-10,000 computers

Cannot exchange all data between machines; requires distributed, multicore optimization

  • Large networks

Nontrivial parameter dependence structure

SLIDE 8

4.1 Unconstrained Problems

SLIDE 9

Convexity 101

SLIDE 10

Convexity 101

[figure: a convex set, a convex function, and a nonconvex counterexample]

  • Convex set: for x, x′ ∈ X it follows that λx + (1 − λ)x′ ∈ X for λ ∈ [0, 1]
  • Convex function: λf(x) + (1 − λ)f(x′) ≥ f(λx + (1 − λ)x′) for λ ∈ [0, 1]
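The convex-function inequality above can be spot-checked numerically. A minimal sketch, assuming f(x) = ‖x‖² as an illustrative convex function (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: float((x ** 2).sum())        # a convex function on R^2

ok = True
for _ in range(1000):
    x, x2 = rng.normal(size=2), rng.normal(size=2)
    lam = rng.uniform()
    # convexity: f(lam*x + (1-lam)*x2) <= lam*f(x) + (1-lam)*f(x2)
    ok = ok and f(lam * x + (1 - lam) * x2) <= lam * f(x) + (1 - lam) * f(x2) + 1e-12
```

For a nonconvex f (e.g. sin) the check fails for some random pairs, which is exactly the defining difference.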

SLIDE 12

Convexity 101

  • Below-set {x : f(x) ≤ c} of a convex function is convex
  • Convex functions don't have local minima

Proof by contradiction: linear interpolation breaks the local-minimum condition

f(λx + (1 − λ)x′) ≤ λf(x) + (1 − λ)f(x′) ≤ c, hence λx + (1 − λ)x′ ∈ X for x, x′ ∈ X

SLIDE 14

Convexity 101

  • Vertex of a convex set: a point that cannot be extrapolated within the convex set, i.e. λx + (1 − λ)x′ ∉ X for λ > 1 and all x′ ∈ X
  • Convex hull: co X := { x̄ : x̄ = Σ_{i=1}^n α_i x_i where n ∈ ℕ, α_i ≥ 0 and Σ_{i=1}^n α_i ≤ 1 }
  • Convex hull of a set is a convex set (proof trivial)

SLIDE 15

Convexity 101

  • Supremum on the convex hull: sup_{x∈X} f(x) = sup_{x∈co X} f(x)
  • Maximum of a convex function on a convex set is obtained at a vertex
  • Proof by contradiction:
  • Assume the maximum lies inside a line segment
  • Then the function cannot be convex
  • Hence the maximum must be at a vertex
SLIDE 16

Gradient descent

SLIDE 17

One dimensional problems

  • Key Idea
  • For differentiable f, search for x with f′(x) = 0
  • Interval bisection (the derivative is monotonic)
  • Need log₂(B − A) − log₂ ε iterations to converge
  • Can be extended to nondifferentiable problems

(exploit convexity in the upper bound and keep 5 points)

Require: a, b, precision ε
Set A = a, B = b
repeat
  if f′((A + B)/2) > 0 then
    B = (A + B)/2   (solution is on the left)
  else
    A = (A + B)/2
  end if
until (B − A) · min(|f′(A)|, |f′(B)|) ≤ ε
Output: x = (A + B)/2
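The bisection routine can be sketched directly in Python; `fprime` is the assumed derivative callable, and f(x) = (x − 1)² with f′(x) = 2(x − 1) serves as an illustrative test function:

```python
def bisect_minimize(fprime, a, b, eps=1e-8):
    """Minimize a convex differentiable f on [a, b] by bisecting on f'.

    Relies on f' being monotonically increasing (convexity): the sign of
    f'((A+B)/2) tells us on which side of the midpoint the minimum lies.
    """
    A, B = a, b
    while (B - A) * min(abs(fprime(A)), abs(fprime(B))) > eps:
        mid = (A + B) / 2
        if fprime(mid) > 0:
            B = mid          # minimum is to the left
        else:
            A = mid          # minimum is to the right
    return (A + B) / 2

# f(x) = (x - 1)^2 has derivative 2(x - 1) and its minimum at x = 1
x = bisect_minimize(lambda x: 2 * (x - 1), -2.0, 3.0)
```

The interval halves every step, which gives the logarithmic iteration count claimed above.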

SLIDE 18
  • Key idea
  • Gradient points in the descent direction
  • Locally the gradient is a good approximation of the objective function
  • GD with Line Search
  • Get descent direction
  • Unconstrained line search
  • Exponential convergence for strongly convex objective

Gradient descent

given a starting point x ∈ dom f
repeat
  1. Δx := −∇f(x)
  2. Line search: choose step size t via exact or backtracking line search
  3. Update: x := x + t·Δx
until stopping criterion is satisfied
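A minimal sketch of this loop with a backtracking (Armijo) line search; the quadratic test objective and the parameters alpha, beta are illustrative assumptions:

```python
import numpy as np

def gradient_descent(f, grad, x0, alpha=0.3, beta=0.5, tol=1e-6, max_iter=1000):
    """Gradient descent with backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        dx = -g                      # descent direction
        t = 1.0
        # shrink t until sufficient decrease: f(x + t*dx) <= f(x) + alpha*t*<g, dx>
        while f(x + t * dx) > f(x) + alpha * t * g.dot(dx):
            t *= beta
        x = x + t * dx
    return x

# minimize f(x) = x^T Q x with Q = diag(1, 10), condition number M/m = 10
Q = np.diag([1.0, 10.0])
xmin = gradient_descent(lambda x: x.dot(Q).dot(x),
                        lambda x: 2 * Q.dot(x),
                        x0=[5.0, -3.0])
```

For this strongly convex objective the iterates contract linearly, consistent with the exponential-convergence claim above.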

SLIDE 19

Convergence Analysis

  • Strongly convex function:
    f(y) ≥ f(x) + ⟨y − x, ∂_x f(x)⟩ + (m/2)‖y − x‖²
  • Progress guarantee (minimum x*): lower bound on the minimum, obtained by setting y = x*:
    f(x) − f(x*) ≤ ⟨x − x*, ∂_x f(x)⟩ − (m/2)‖x* − x‖²
      ≤ sup_y [⟨x − y, ∂_x f(x)⟩ − (m/2)‖y − x‖²] = (1/2m)‖∂_x f(x)‖²

SLIDE 20

Convergence Analysis

  • Bounded Hessian:
    f(y) ≤ f(x) + ⟨y − x, ∂_x f(x)⟩ + (M/2)‖y − x‖²
  • For a gradient step x ← x − t·g_x this gives
    f(x − t g_x) ≤ f(x) − t‖g_x‖² + (M/2)t²‖g_x‖² ≤ f(x) − (1/2M)‖g_x‖²  (at t = 1/M)
  • Using strong convexity (‖g_x‖² ≥ 2m[f(x) − f(x*)] from the previous slide):
    f(x − t g_x) − f(x*) ≤ [f(x) − f(x*)][1 − m/M]
  • Iteration bound: (M/m) log([f(x) − f(x*)]/ε)

SLIDE 21

Distributed Implementation

SLIDE 22

Basic steps

given a starting point x ∈ dom f
repeat
  1. Δx := −∇f(x)
  2. Line search: choose step size t via exact or backtracking line search
  3. Update: x := x + t·Δx
until stopping criterion is satisfied

  • distribute data over several machines
  • compute partial gradients and aggregate
  • update value in search direction and feed back
  • communicate final value to each machine

SLIDE 27

Basic steps

given a starting point x ∈ dom f
repeat
  1. Δx := −∇f(x)
  2. Line search: choose step size t via exact or backtracking line search
  3. Update: x := x + t·Δx
until stopping criterion is satisfied

distribute data over several machines; compute partial gradients and aggregate:

  • Map: compute gradient on subblock and emit
  • Reduce: aggregate parts of the gradients
  • Communicate the aggregate gradient back to all machines

SLIDE 30

Basic steps

given a starting point x ∈ dom f
repeat
  1. Δx := −∇f(x)
  2. Line search: choose step size t via exact or backtracking line search
  3. Update: x := x + t·Δx
until stopping criterion is satisfied

update value in search direction, feed back, and communicate the final value to each machine:

  • Repeat until converged
  • Map: compute function & derivative at given parameter t
  • Reduce: aggregate parts of function and derivative
  • Decide based on f(x) and f′(x) which interval to pursue
  • Send updated parameter to all machines

SLIDE 33

Scalability analysis

  • Linear time in number of instances
  • Linear storage in problem size, not data
  • Logarithmic time in accuracy
  • ‘perfect’ scalability
  • 10s of passes through dataset for each iteration

(line search is very expensive)

  • MapReduce loses state at each iteration
  • Single master as bottleneck

(important if the state space is several GB)
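The map/reduce gradient aggregation described in the preceding slides can be sketched with Python threads standing in for machines; the squared loss, shard layout, and learning rate below are illustrative assumptions, not part of the lecture:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def partial_gradient(shard, w):
    """Map step: gradient of the squared loss on one data shard."""
    X, y = shard
    return X.T @ (X @ w - y)

def distributed_gradient_step(shards, w, eta):
    """One iteration: scatter w, map partial gradients, reduce by summing."""
    with ThreadPoolExecutor(max_workers=len(shards)) as ex:
        parts = list(ex.map(lambda s: partial_gradient(s, w), shards))
    return w - eta * sum(parts)    # reduce + update; w is then broadcast again

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
shards = [(X[i::4], y[i::4]) for i in range(4)]   # 4 "machines"
w = np.zeros(3)
for _ in range(200):
    w = distributed_gradient_step(shards, w, eta=0.005)
```

Because the shards partition the data, the summed partial gradients equal the full-batch gradient exactly, which is the point of the map/reduce pattern.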

SLIDE 34

A Better Algorithm

  • Avoiding the line search
  • Not used in the convergence proof anyway
  • Simply pick the update x ← x − (1/M) ∂_x f(x)
  • Only a single pass through the data per iteration
  • Only a single MapReduce pass per iteration
  • Logarithmic iteration bound (as before): (M/m) log([f(x) − f(x*)]/ε)

SLIDE 35

Newton’s Method

Isaac Newton

SLIDE 36

Newton Method

  • Convex objective function f
  • Nonnegative second derivative: ∂²_x f(x) ⪰ 0
  • Taylor expansion:
    f(x + δ) = f(x) + ⟨δ, ∂_x f(x)⟩ + ½ δ⊤ ∂²_x f(x) δ + O(‖δ‖³)
    (with ∂_x f the gradient and ∂²_x f the Hessian)
  • Minimize the approximation & iterate until converged:
    x ← x − [∂²_x f(x)]⁻¹ ∂_x f(x)
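The update above, as a minimal sketch; the quartic test function and the small diagonal regularizer (to keep the Hessian invertible at the optimum) are illustrative assumptions:

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method: x <- x - [f''(x)]^{-1} f'(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # solve the linear system rather than forming the inverse explicitly
        x = x - np.linalg.solve(hess(x), g)
    return x

# f(x) = x1^4 + x2^2: gradient (4 x1^3, 2 x2), Hessian diag(12 x1^2, 2)
grad = lambda x: np.array([4 * x[0] ** 3, 2 * x[1]])
hess = lambda x: np.diag([12 * x[0] ** 2 + 1e-12, 2.0])
xmin = newton(grad, hess, [1.0, 1.0])
```

On the quadratic coordinate Newton converges in one step; on the quartic coordinate it contracts by a constant factor per step, matching the two-regime discussion on the next slides.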

SLIDE 37

Convergence Analysis

  • There exists a region around the optimum where Newton's method converges quadratically if f is twice continuously differentiable
  • For some region around x* the gradient is well approximated by the Taylor expansion:
    ‖∂_x f(x*) − ∂_x f(x) − ⟨x* − x, ∂²_x f(x)⟩‖ ≤ γ‖x* − x‖²
  • Expand the Newton update:
    ‖x_{n+1} − x*‖ = ‖x_n − x* − [∂²_x f(x_n)]⁻¹ [∂_x f(x_n) − ∂_x f(x*)]‖
      = ‖[∂²_x f(x_n)]⁻¹ [∂²_x f(x_n)(x_n − x*) − ∂_x f(x_n) + ∂_x f(x*)]‖
      ≤ γ ‖[∂²_x f(x_n)]⁻¹‖ ‖x_n − x*‖²
SLIDE 38

Convergence Analysis

  • Two convergence regimes
  • As slow as gradient descent outside the region where the Taylor expansion is good
  • Quadratic convergence once the bound
    ‖∂_x f(x*) − ∂_x f(x) − ⟨x* − x, ∂²_x f(x)⟩‖ ≤ γ‖x* − x‖²
    holds:
    ‖x_{n+1} − x*‖ ≤ γ ‖[∂²_x f(x_n)]⁻¹‖ ‖x_n − x*‖²
  • Newton's method is affine invariant (proof by chain rule)

See Boyd and Vandenberghe, Chapter 9.5 for much more

SLIDE 39

Newton method rescales space

[figure: iterates x(0), x(1), x(2) zig-zagging under the wrong metric; from Boyd & Vandenberghe]

SLIDE 40

Newton method rescales space

[figure: Newton step x + Δx_nt vs. normalized steepest descent step x + Δx_nsd; a locally adaptive metric; from Boyd & Vandenberghe]

SLIDE 41

Parallel Newton Method

  • Good rate of convergence
  • Few passes through data needed
  • Parallel aggregation of gradient and Hessian
  • Gradient requires O(d) data
  • Hessian requires O(d2) data
  • Update step is O(d3) & nontrivial to parallelize
  • Use it only for low dimensional problems
SLIDE 42

Conjugate Gradient Descent

SLIDE 43

Key Idea

  • Minimizing the quadratic function f(x) = ½ x⊤Kx − l⊤x + c takes cubic time (e.g. Cholesky factorization)
  • Instead use matrix-vector products and orthogonalization
  • Vectors x, x′ are K-orthogonal if x⊤Kx′ = 0 (for K ⪰ 0)
  • m mutually K-orthogonal vectors x_i ∈ ℝ^m
  • form a basis
  • allow the expansion z = Σ_{i=1}^m x_i (x_i⊤Kz)/(x_i⊤Kx_i)
  • solve the linear system y = Kz via z = Σ_{i=1}^m x_i (x_i⊤y)/(x_i⊤Kx_i)

SLIDE 44
  • m mutually K-orthogonal vectors x_i ∈ ℝ^m
  • form a basis
  • allow expansion
  • solve linear systems

Proof

  • Linear independence by contradiction: if Σ_i α_i x_i = 0 then
    0 = x_j⊤K Σ_i α_i x_i = α_j x_j⊤Kx_j, hence α_j = 0 for all j
  • Reconstruction: expand z = Σ_i α_i x_i into the basis; then
    x_j⊤Kz = x_j⊤K Σ_i α_i x_i = α_j x_j⊤Kx_j, which yields z = Σ_{i=1}^m x_i (x_i⊤Kz)/(x_i⊤Kx_i)
  • For the linear system plug in y = Kz:
    z = Σ_{i=1}^m x_i (x_i⊤y)/(x_i⊤Kx_i)

SLIDE 45

???

  • Need vectors x_i
  • Need to orthogonalize the vectors
  • How to select them?
  • K-orthogonal vectors whiten the space, since
    f(x) = ½ x⊤x − l⊤x + c has the trivial solution x = l

SLIDE 46

Conjugate Gradient Descent

  • Gradient computation: f(x) = ½ x⊤Kx − l⊤x + c, hence g(x) = Kx − l
  • Algorithm

initialize x₀ and v₀ = g₀ = Kx₀ − l and i = 0
repeat
  x_{i+1} = x_i − v_i (g_i⊤v_i)/(v_i⊤Kv_i)
  g_{i+1} = Kx_{i+1} − l
  v_{i+1} = −g_{i+1} + v_i (g_{i+1}⊤Kv_i)/(v_i⊤Kv_i)   (deflation step, keeps directions K-orthogonal)
  i ← i + 1
until g_i = 0
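The loop above, transcribed into Python; the small 2×2 system is an illustrative assumption, and an iteration cap replaces the exact g_i = 0 test for floating point:

```python
import numpy as np

def conjugate_gradient(K, l, tol=1e-10):
    """CG for minimizing f(x) = 1/2 x^T K x - l^T x (i.e. solving K x = l)."""
    x = np.zeros_like(l)
    g = K @ x - l                        # gradient
    v = g.copy()                         # first search direction
    for _ in range(5 * len(l)):          # n steps suffice in exact arithmetic
        if np.linalg.norm(g) <= tol:
            break
        Kv = K @ v
        x = x - v * (g @ v) / (v @ Kv)
        g = K @ x - l
        v = -g + v * (g @ Kv) / (v @ Kv)   # deflation: keep K-orthogonality
    return x

K = np.array([[4.0, 1.0], [1.0, 3.0]])
l = np.array([1.0, 2.0])
x = conjugate_gradient(K, l)
```

For this 2×2 system the iterate is exact after two steps, as the K-orthogonal basis argument predicts.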

SLIDE 47

Proof - Deflation property

  • First assume that the v_i are K-orthogonal and show that x_{i+1} is optimal in the span of {v_1 … v_i}
  • Enough if we show that v_j⊤g_{i+1} = 0 for all j ≤ i
  • For j = i expand:
    v_i⊤g_{i+1} = v_i⊤[Kx_i − l − Kv_i (g_i⊤v_i)/(v_i⊤Kv_i)]
      = v_i⊤g_i − v_i⊤Kv_i (g_i⊤v_i)/(v_i⊤Kv_i) = 0
  • For smaller j, v_j⊤g_i = 0 for all j < i is a consequence of K-orthogonality

SLIDE 48

Proof - K orthogonality

  • Need to check that v_{i+1} is K-orthogonal to all v_j (the rest is automatically true by construction):
    v_j⊤Kv_{i+1} = −v_j⊤Kg_{i+1} + v_j⊤Kv_i (g_{i+1}⊤Kv_i)/(v_i⊤Kv_i)
    For j < i, the second term is 0 by K-orthogonality and the first is 0 by deflation; for j = i the two terms cancel.

SLIDE 49

Properties

  • Subspace expansion method: optimal over the Krylov subspace (g, Kg, K²g, K³g, …)
  • Focuses on leading eigenvalues
  • Often sufficient to take only a few steps (whenever the eigenvalues decay rapidly)

SLIDE 50

Extensions

  • Generic method: compute the Hessian K_i := f″(x_i) and update α_i, β_i with
    α_i = −(g_i⊤v_i)/(v_i⊤K_i v_i),  β_i = (g_{i+1}⊤K_i v_i)/(v_i⊤K_i v_i)
    This requires calculation of the Hessian at each iteration.
  • Fletcher–Reeves [163]: find α_i via a line search, α_i = argmin_α f(x_i + αv_i), and use
    β_i = (g_{i+1}⊤g_{i+1})/(g_i⊤g_i)
  • Polak–Ribière [398]: find α_i via a line search, α_i = argmin_α f(x_i + αv_i), and use
    β_i = ((g_{i+1} − g_i)⊤g_{i+1})/(g_i⊤g_i)

Experimentally, Polak–Ribière tends to be better than Fletcher–Reeves.
(α_i and β_i enter the x and v updates of the algorithm above.)

SLIDE 51

BFGS algorithm Broyden-Fletcher-Goldfarb-Shanno

SLIDE 52

Basic Idea

  • Newton-like method to compute the descent direction: δ_i = B_i⁻¹ ∂_x f(x_i)
  • Line search on f in direction δ_i: x_{i+1} = x_i − α_i δ_i
  • Update B with a rank-2 matrix: B_{i+1} = B_i + u_i u_i⊤ + v_i v_i⊤
  • Require that the quasi-Newton condition holds:
    B_{i+1}(x_{i+1} − x_i) = ∂_x f(x_{i+1}) − ∂_x f(x_i)
    which yields (with g_i the gradient difference)
    B_{i+1} = B_i + (g_i g_i⊤)/(α_i δ_i⊤g_i) − (B_i δ_i δ_i⊤B_i)/(δ_i⊤B_i δ_i)

SLIDE 53

Properties

  • Simple rank 2 update for B
  • Use matrix inversion lemma to update inverse
  • Memory-limited versions L-BFGS
  • Use toolbox if possible (TAO, MATLAB)

(typically slower if you implement it yourself)

  • Works well for nonlinear nonconvex objectives

(often even for nonsmooth objectives)
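In the toolbox spirit of the slide, a minimal sketch using SciPy's memory-limited BFGS; the Rosenbrock-style test objective is an illustrative assumption:

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function: nonconvex, a standard stress test for quasi-Newton methods
def f(x):
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

def grad(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
        200 * (x[1] - x[0] ** 2),
    ])

# L-BFGS keeps only a few (s, y) pairs instead of the full B matrix
res = minimize(f, x0=np.array([-1.2, 1.0]), jac=grad, method="L-BFGS-B")
```

The minimum is at (1, 1); supplying the analytic gradient (`jac`) avoids costly finite differences, which matters at scale.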

SLIDE 54

4.2 Constrained Convex Problems

SLIDE 55

Basic Convexity

SLIDE 56

Constrained Convex Minimization

  • Optimization problem: minimize_x f(x) subject to c_i(x) ≤ 0 for all i
  • Common constraints
  • linear inequality constraints: ⟨w_i, x⟩ + b_i ≤ 0
  • quadratic cone constraints: x⊤Qx + b⊤x ≤ c with Q ⪰ 0
  • semidefinite constraints: M ⪰ 0, or M₀ + Σ_i x_i M_i ⪰ 0
  • Equality is a special case. Why?

SLIDE 58

Example - Support Vectors

[figure: maximum-margin hyperplane {x | ⟨w, x⟩ + b = 0} with margin planes ⟨w, x⟩ + b = ±1 touching the classes y_i = +1 and y_i = −1; margin width 2/‖w‖]

minimize_{w,b} ½‖w‖² subject to y_i[⟨w, x_i⟩ + b] ≥ 1

Note: ⟨w, x₁⟩ + b = +1 and ⟨w, x₂⟩ + b = −1, hence ⟨w, x₁ − x₂⟩ = 2 and therefore
⟨w/‖w‖, x₁ − x₂⟩ = 2/‖w‖ (the margin)

SLIDE 59

Lagrange Multipliers

  • Lagrange function: L(x, α) := f(x) + Σ_{i=1}^n α_i c_i(x) where α_i ≥ 0
  • Saddlepoint condition: if there are x* and nonnegative α* such that
    L(x*, α) ≤ L(x*, α*) ≤ L(x, α*) for all x and all α ≥ 0,
    then x* is an optimal solution to the constrained optimization problem

SLIDE 60

Proof

  • From the first inequality L(x*, α) ≤ L(x*, α*) we get
    (α_i − α_i*) c_i(x*) ≤ 0 for all α_i ≥ 0, hence c_i(x*) ≤ 0 and x* is feasible
  • Setting α_i = 0 yields the KKT condition α_i* c_i(x*) = 0
  • Consequently we have, for all feasible x,
    f(x*) = L(x*, α*) ≤ L(x, α*) = f(x) + Σ_i α_i* c_i(x) ≤ f(x)

This proves optimality.

SLIDE 61

Constraint gymnastics (all three conditions are equivalent)

  • Slater's condition

There exists some x such that c_i(x) < 0 for all i

  • Karlin's condition

For all nonnegative α there exists some x such that Σ_i α_i c_i(x) ≤ 0

  • Strict constraint qualification

The feasible region contains at least two distinct elements and there exists an x in X such that all c_i(x) are strictly convex at x with respect to X

SLIDE 62

Necessary Kuhn-Tucker Conditions

  • Assume the optimization problem
  • satisfies the constraint qualifications
  • has convex differentiable objective + constraints
  • Then the KKT conditions are necessary & sufficient:
    ∂_x L(x*, α*) = ∂_x f(x*) + Σ_i α_i* ∂_x c_i(x*) = 0  (saddlepoint in x*)
    ∂_{α_i} L(x*, α*) = c_i(x*) ≤ 0  (saddlepoint in α*)
    Σ_i α_i* c_i(x*) = 0  (vanishing KKT gap)

This yields an algorithm for solving optimization problems: solve for the saddlepoint and the KKT conditions.

SLIDE 63

Proof

f(x) − f(x*) ≥ [∂_x f(x*)]⊤(x − x*)  (by convexity)
  = −Σ_i α_i* [∂_x c_i(x*)]⊤(x − x*)  (by saddlepoint in x*)
  ≥ −Σ_i α_i* (c_i(x) − c_i(x*))  (by convexity)
  = −Σ_i α_i* c_i(x)  (by vanishing KKT gap)
  ≥ 0  (for feasible x)

SLIDE 64

Linear and Quadratic Programs

SLIDE 65

Linear Programs

  • Objective: minimize_x c⊤x subject to Ax + d ≤ 0
  • Lagrange function: L(x, α) = c⊤x + α⊤(Ax + d)
  • Optimality conditions (plug back into L):
    ∂_x L(x, α) = A⊤α + c = 0
    ∂_α L(x, α) = Ax + d ≤ 0
    0 = α⊤(Ax + d), 0 ≤ α
  • Dual problem: maximize_α d⊤α subject to A⊤α + c = 0 and α ≥ 0

SLIDE 68

Linear Programs

  • Primal: minimize_x c⊤x subject to Ax + d ≤ 0
  • Dual: maximize_α d⊤α subject to A⊤α + c = 0 and α ≥ 0
  • Free variables become equality constraints
  • Equality constraints become free variables
  • Inequalities become inequalities
  • Dual of dual is primal
SLIDE 69
  • Objective
  • Lagrange function
  • Optimality conditions

Quadratic Programs

plug into L

minimize

x

1 2x>Qx + c>x subject to Ax + d ≤ 0 L(x, α) = 1 2x>Qx + c>x + α>(Ax + d) ∂xL(x, α) = Qx + A>α + c = 0 ∂αL(x, α) = Ax + d ≤ 0 0 = α>(Ax + d) 0 ≤ α

SLIDE 70

Quadratic Program (dual)

  • Eliminating x from the Lagrangian via Qx + A⊤α + c = 0
  • Lagrange function:
    L(x, α) = ½x⊤Qx + c⊤x + α⊤(Ax + d)
      = −½x⊤Qx + α⊤d
      = −½(A⊤α + c)⊤Q⁻¹(A⊤α + c) + α⊤d
      = −½α⊤AQ⁻¹A⊤α + α⊤[d − AQ⁻¹c] − ½c⊤Q⁻¹c
    subject to α ≥ 0
SLIDE 72

Quadratic Programs

  • Primal: minimize_x ½x⊤Qx + c⊤x subject to Ax + d ≤ 0
  • Dual: minimize_α ½α⊤AQ⁻¹A⊤α + α⊤[AQ⁻¹c − d] subject to α ≥ 0
  • Dual constraints are simpler
  • Possibly many fewer variables
  • Dual of dual is not (always) primal (e.g. in SVMs x is in a Hilbert space)

SLIDE 73

Interior Point Solvers

SLIDE 74

Constrained Newton Method

  • Objective: minimize_x f(x) subject to Ax = b
  • Lagrange function and optimality conditions:
    L(x, α) = f(x) + α⊤[Ax − b]
    ∂_x L(x, α) = ∂_x f(x) + A⊤α = 0
    ∂_α L(x, α) = Ax − b = 0
  • Taylor expansion of the gradient:
    ∂_x f(x) = ∂_x f(x₀) + ∂²_x f(x₀)[x − x₀] + O(‖x − x₀‖²)
  • Plug back into the optimality conditions and solve:
    [ ∂²_x f(x₀)  A⊤ ] [x]   [ ∂²_x f(x₀)x₀ − ∂_x f(x₀) ]
    [ A           0  ] [α] = [ b                          ]
  • No need to be initially feasible!
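The KKT linear system above, solved once with NumPy; the quadratic objective ½x⊤Px, constraint x₁ + x₂ = 1, and the infeasible starting point are illustrative assumptions. Since f is quadratic, one Newton step is exact:

```python
import numpy as np

P = np.array([[2.0, 0.5], [0.5, 1.0]])   # Hessian of f(x) = 1/2 x^T P x
A = np.array([[1.0, 1.0]])               # constraint: x1 + x2 = 1
b = np.array([1.0])

x0 = np.array([5.0, -5.0])               # need not be feasible
g0 = P @ x0                              # gradient of f at x0
n, m = 2, 1

# KKT system: [H A^T; A 0] [x; alpha] = [H x0 - g(x0); b]
KKT = np.block([[P, A.T], [A, np.zeros((m, m))]])
rhs = np.concatenate([P @ x0 - g0, b])
sol = np.linalg.solve(KKT, rhs)
x, alpha = sol[:n], sol[n:]
```

The solution satisfies both optimality conditions exactly: Ax = b (primal feasibility) and Px + A⊤α = 0 (stationarity), even though x₀ violated the constraint.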
SLIDE 75

General Strategy

  • Optimality conditions
  • Solve equations repeatedly.
  • Yields primal and dual solution variables
  • Yields size of primal/dual gap
  • Feasibility not necessary at start
  • KKT conditions are problematic - need approximation

∂_x L(x*, α*) = ∂_x f(x*) + Σ_i α_i* ∂_x c_i(x*) = 0  (saddlepoint in x*)
∂_{α_i} L(x*, α*) = c_i(x*) ≤ 0  (saddlepoint in α*)
Σ_i α_i* c_i(x*) = 0  (vanishing KKT gap)

SLIDE 76

Quadratic Programs

  • Optimality conditions (with slack ξ):
    Qx + A⊤α + c = 0
    Ax + d + ξ = 0
    α_i ξ_i = 0,  α, ξ ≥ 0
  • Relax the KKT conditions: α_i ξ_i = 0 relaxed to α_i ξ_i = μ
  • Solve a linearization of the nonlinear system:
    [ Q  A⊤ ] [δx]   [c_x]
    [ A  −D ] [δα] = [c_α]
  • Predictor/corrector step for the nonlinearity
  • Iterate until converged

SLIDE 77

Implementation details

  • Dominant cost is solving reduced KKT system

Solve linear system with (dense) Q and A

  • Solve linear system twice (predictor / corrector)
  • Update steps are only taken far enough to

ensure nonnegativity of dual and slack

  • Tighten up KKT constraints by decreasing μ
  • Only 10-20 iterations typically needed

[ Q  A⊤ ] [δx]   [c_x]
[ A  −D ] [δα] = [c_α]

SLIDE 78

Solver Software (nontrivial to parallelize)

  • OOQP: object-oriented quadratic programming solver, http://pages.cs.wisc.edu/~swright/ooqp/
  • LOQO: interior point path-following solver, http://www.princeton.edu/~rvdb/loqo/LOQO.html
  • HOPDM: linear and nonlinear infeasible IP solver, http://www.maths.ed.ac.uk/~gondzio/software/hopdm.html
  • CVXOPT: Python package for convex optimization, http://abel.ee.ucla.edu/cvxopt/
  • SeDuMi: semidefinite programming solver, http://sedumi.ie.lehigh.edu/

SLIDE 80

Bundle Methods

simple parallelization

SLIDE 81

Some optimization problems

  • Density estimation:
    minimize_θ −Σ_{i=1}^m log p(x_i|θ) − log p(θ)
    equivalently
    minimize_θ Σ_{i=1}^m [g(θ) − ⟨φ(x_i), θ⟩] + (1/2σ²)‖θ‖²
  • Penalized regression:
    minimize_θ Σ_{i=1}^m l(y_i, ⟨φ(x_i), θ⟩) + (1/2σ²)‖θ‖²
    (e.g. squared loss; the second term is the regularizer)

SLIDE 82

Basic Idea

  • Loss
  • Convex but expensive to compute
  • Line search just as expensive as new computation
  • Gradient almost free with function value computation
  • Easy to compute in parallel
  • Regularizer
  • Convex and cheap to compute and to optimize
  • Strategy
  • Compute tangents on loss
  • Provides lower bound on objective
  • Solve dual optimization problem (fewer parameters)

minimize_θ Σ_{i=1}^m l_i(θ) + λΩ[θ]

SLIDE 83

Bundle Method

[figure: piecewise-linear lower bound (bundle of tangents) on the empirical risk]

SLIDE 84

Regularized Risk Minimization

minimize_w R_emp[w] + λΩ[w]

Taylor approximation for R_emp[w]:
R_emp[w] ≥ R_emp[w_{t−1}] + ⟨w − w_{t−1}, ∂_w R_emp[w_{t−1}]⟩ = ⟨a_t, w⟩ + b_t
where a_t = ∂_w R_emp[w_{t−1}] and b_t = R_emp[w_{t−1}] − ⟨a_t, w_{t−1}⟩

Bundle bound (a lower bound):
R_emp[w] ≥ R_t[w] := max_{i≤t} ⟨a_i, w⟩ + b_i

The regularizer Ω[w] solves stability problems.

SLIDE 85

Pseudocode

Initialize t = 0, w₀ = 0, a₀ = 0, b₀ = 0
repeat
  Find minimizer w_t := argmin_w R_t(w) + λΩ[w]
  Compute gradient a_{t+1} and offset b_{t+1}
  Increment t ← t + 1
until ε_t ≤ ε

Convergence monitor: ε_t = R_{t+1}[w_t] − R_t[w_t]
Since R_{t+1}[w_t] = R_emp[w_t] (Taylor approximation) we have
R_{t+1}[w_t] + λΩ[w_t] ≥ min_w R_emp[w] + λΩ[w] ≥ R_t[w_t] + λΩ[w_t]
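The core of the bundle construction is that the tangents (a_t, b_t) minorize the convex empirical risk. A minimal numerical sketch; the one-dimensional risk (a sum of two logistic-type losses) and the visited points are illustrative assumptions:

```python
import numpy as np

# a convex 1-d "empirical risk" and its derivative
remp = lambda w: np.log(1 + np.exp(-w)) + np.log(1 + np.exp(w - 2))
grad = lambda w: -1 / (1 + np.exp(w)) + 1 / (1 + np.exp(2 - w))

pts = [-3.0, 0.0, 1.0, 4.0]                      # iterates w_t visited so far
a = [grad(w) for w in pts]                        # tangent slopes  a_t
b = [remp(w) - grad(w) * w for w in pts]          # tangent offsets b_t

# bundle lower bound R_t(w) = max_i <a_i, w> + b_i
R_t = lambda w: max(ai * w + bi for ai, bi in zip(a, b))

grid = np.linspace(-5, 5, 101)
gap = max(remp(w) - R_t(w) for w in grid)         # always >= 0 for convex remp
```

Each iteration adds one tangent, tightening max_i ⟨a_i, w⟩ + b_i from below; the solver then minimizes this cheap piecewise-linear model plus λΩ[w] instead of the expensive risk.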

SLIDE 86

Dual Problem

Good news: for Ω[w] = ½‖w‖²₂ the dual optimization problem is a quadratic program regardless of the choice of the empirical risk R_emp[w]:

minimize_β (1/2λ) β⊤AA⊤β − β⊤b subject to β_i ≥ 0 and ‖β‖₁ = 1

The primal coefficient w is given by w = −λ⁻¹A⊤β.

General result: use the Fenchel–Legendre dual of Ω[w], e.g. ‖·‖₁ ↔ ‖·‖_∞.

Very cheap variant: can even use a simple line search for the update (almost as good).

SLIDE 87

Properties

  • Parallelization: the empirical risk is a sum of many terms (MapReduce); the gradient is a sum of many terms, gathered from the cluster. Possible even for multivariate performance scores. Data stays local; one can combine data from competing entities.
  • Solver independent of loss: no need to change the solver for a new loss.
  • Loss independent of solver/regularizer: add a new regularizer without re-implementing the loss.
  • Line search variant: optimization does not require a QP solver at all! Update along the gradient direction in the dual; we only need inner products on gradients.

SLIDE 88

Implementation

[figure: several mappers compute the empirical risk in parallel; reducers aggregate and feed a bundle solver]

SLIDE 89

Guarantees

Theorem: the number of iterations to reach ε precision is bounded by
n ≤ log₂(R_emp[0]/G²) + 8G²/ε − 4
steps. If the Hessian of R_emp[w] is bounded, convergence to any ε ≤ λ/2 takes at most
n ≤ log₂(R_emp[0]/(4G²)) + 4 max[0, 1 − 8G²H*/λ] − (4H*/λ) log(2ε)
steps.

Advantages: linear convergence for smooth loss. For non-smooth loss almost as good in practice (as long as it is smooth on a coarse scale). Does not require a primal line search.

SLIDE 90

Proof idea

  • Duality argument: the dual of R_i[w] + λΩ[w] lower-bounds the minimum of the regularized risk R_emp[w] + λΩ[w], while R_{i+1}[w_i] + λΩ[w_i] is an upper bound. Show that the gap γ_i := R_{i+1}[w_i] − R_i[w_i] vanishes.
  • Dual improvement: give a lower bound on the increase in the dual problem in terms of γ_i and the subgradient ∂_w[R_emp[w] + λΩ[w]]. For unbounded Hessian we have δγ = O(γ²); for bounded Hessian, δγ = O(γ).
  • Convergence: solve the difference equation in γ_t to get the desired result.

SLIDE 91

More

  • Dual decomposition methods
  • Optimization problem with many constraints
  • Replicate variable & add equality constraints
  • Solve relaxed problem
  • Gradient descent in dual variables
  • Prox operator
  • Problems with smooth & nonsmooth objective
  • Generalization of Bregman projections
SLIDE 92

4.3 Online Methods

SLIDE 93

The Perceptron

SLIDE 94

The Perceptron

[figure: separating hyperplane between spam and ham examples]
SLIDE 96

The Perceptron

  • Nothing happens if classified correctly
  • Weight vector is a linear combination of the data
  • Classifier is a linear combination of inner products

initialize w = 0 and b = 0
repeat
  if y_i[⟨w, x_i⟩ + b] ≤ 0 then w ← w + y_i x_i and b ← b + y_i end if
until all classified correctly

w = Σ_{i∈I} y_i x_i   and   f(x) = Σ_{i∈I} y_i ⟨x_i, x⟩ + b
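The update rule above, as a short sketch; the four-point linearly separable toy dataset is an illustrative assumption:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Train (w, b) on +/-1 labels; update only on mistakes."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:       # misclassified (or on the boundary)
                w += yi * xi                  # w <- w + y_i x_i
                b += yi                       # b <- b + y_i
                mistakes += 1
        if mistakes == 0:                     # all classified correctly
            break
    return w, b

# linearly separable toy data
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
```

Note that w is indeed a sum of the (signed) misclassified examples, which is the compression property used two slides later.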

SLIDE 97

Convergence Theorem

  • If there exists some (w*, b*) with unit length ‖w*‖ = 1 and
    y_i[⟨x_i, w*⟩ + b*] ≥ ρ for all i,
    then the perceptron converges to a linear separator after a number of steps bounded by
    (b*² + 1)(r² + 1) ρ⁻²  where ‖x_i‖ ≤ r
  • Dimensionality independent
  • Order independent (i.e. also worst case)
  • Scales with the 'difficulty' of the problem
SLIDE 98

Proof

Starting point: we start from w₁ = 0 and b₁ = 0.
Step 1: bound on the increase of alignment. Denote by w_j the value of w at step j (analogously b_j). Alignment: ⟨(w_j, b_j), (w*, b*)⟩. For an error on observation (x_i, y_i) we get
⟨(w_{j+1}, b_{j+1}), (w*, b*)⟩ = ⟨(w_j, b_j) + y_i(x_i, 1), (w*, b*)⟩
  = ⟨(w_j, b_j), (w*, b*)⟩ + y_i⟨(x_i, 1), (w*, b*)⟩
  ≥ ⟨(w_j, b_j), (w*, b*)⟩ + ρ ≥ jρ
The alignment increases with the number of errors.

SLIDE 99

Proof

Step 2: Cauchy-Schwarz for the dot product:
⟨(w_{j+1}, b_{j+1}), (w*, b*)⟩ ≤ ‖(w_{j+1}, b_{j+1})‖ ‖(w*, b*)‖ = √(1 + (b*)²) ‖(w_{j+1}, b_{j+1})‖
Step 3: upper bound on ‖(w_j, b_j)‖. If we make a mistake, then
‖(w_{j+1}, b_{j+1})‖² = ‖(w_j, b_j) + y_i(x_i, 1)‖²
  = ‖(w_j, b_j)‖² + 2y_i⟨(x_i, 1), (w_j, b_j)⟩ + ‖(x_i, 1)‖²
  ≤ ‖(w_j, b_j)‖² + ‖(x_i, 1)‖² ≤ j(r² + 1)
Step 4: combining the first three steps:
jρ ≤ √(1 + (b*)²) ‖(w_{j+1}, b_{j+1})‖ ≤ √(j(r² + 1)((b*)² + 1))
Solving for j proves the theorem.

SLIDE 100

Consequences

  • Only need to store errors.

This gives a compression bound for perceptron.

  • Stochastic gradient descent on hinge loss
  • Fails with noisy data

l(x_i, y_i, w, b) = max(0, 1 − y_i[⟨w, x_i⟩ + b])

do NOT train your avatar with perceptrons

Black & White

SLIDE 101

Stochastic Gradient Descent

SLIDE 102

Stochastic gradient descent

  • Empirical risk as an expectation:
    (1/m) Σ_{i=1}^m l(y_i, ⟨φ(x_i), θ⟩) = E_{i∼{1,…,m}}[l(y_i, ⟨φ(x_i), θ⟩)]
  • Stochastic gradient descent (pick a random pair (x_t, y_t)):
    θ_{t+1} ← θ_t − η_t ∂_θ l(y_t, ⟨φ(x_t), θ_t⟩)
  • Often we require that the parameters are restricted to some convex set X, hence we project onto it:
    θ_{t+1} ← π_X[θ_t − η_t ∂_θ l(y_t, ⟨φ(x_t), θ_t⟩)]
    where π_X(θ) = argmin_{x∈X} ‖x − θ‖
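A minimal projected-SGD sketch for squared loss with X a Euclidean ball; the data, ball radius, and step-size constant are illustrative assumptions:

```python
import numpy as np

def projected_sgd(X, y, steps=2000, radius=10.0, seed=0):
    """SGD on squared loss with projection onto {||theta|| <= radius}."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for t in range(1, steps + 1):
        i = rng.integers(len(X))                  # pick a random (x, y)
        g = (X[i] @ theta - y[i]) * X[i]          # gradient of the instantaneous loss
        theta = theta - 0.1 / np.sqrt(t) * g      # eta_t = O(t^{-1/2})
        norm = np.linalg.norm(theta)
        if norm > radius:                         # projection pi_X: rescale onto the ball
            theta *= radius / norm
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
theta_true = np.array([1.5, -0.5])
y = X @ theta_true                                # noiseless linear data
theta = projected_sgd(X, y)
```

The projection for a norm ball is just a rescaling, which is why this constraint set is a common choice in practice.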

SLIDE 103

Convergence in Expectation

  • Guarantee (from Nesterov and Vial): show that the parameters converge to the minimum:
    E[l(θ̄)] − l* ≤ (R² + L² Σ_{t=0}^{T−1} η_t²) / (2 Σ_{t=0}^{T−1} η_t)
  • Here l(θ) = E_{(x,y)}[l(y, ⟨φ(x), θ⟩)] is the expected loss, l* = inf_{θ∈X} l(θ),
    θ̄ = (Σ_{t=0}^{T−1} η_t θ_t)/(Σ_{t=0}^{T−1} η_t) is the parameter average,
    θ* ∈ argmin_{θ∈X} l(θ), r_t := ‖θ* − θ_t‖, R bounds the initial distance r₀, and L bounds the gradients

SLIDE 104

Proof

r²_{t+1} = ‖π_X[θ_t − η_t g_t] − θ*‖² ≤ ‖θ_t − η_t g_t − θ*‖²  (projections are contractions)
  = r_t² + η_t²‖g_t‖² − 2η_t⟨θ_t − θ*, g_t⟩
hence E[r²_{t+1} − r_t²] ≤ η_t²L² + 2η_t[l* − E[l(θ_t)]]  (by convexity)
  ≤ η_t²L² + 2η_t[l* − E[l(θ̄)]]  (by convexity)

  • Summing this inequality over t proves the claim
  • This yields a randomized algorithm for minimizing objective functions (run it log many times and pick the best, or use the average/median trick)

SLIDE 105

Rates

  • Guarantee
  • If we know R, L, T pick constant learning rate
  • If we don’t know T pick

This costs us an additional log term

$$\mathbf{E}\left[l(\bar\theta)\right] - l^* \le \frac{R^2 + L^2\sum_{t=0}^{T-1}\eta_t^2}{2\sum_{t=0}^{T-1}\eta_t}$$

With known $R$, $L$, $T$, the constant rate $\eta = \frac{R}{L\sqrt{T}}$ gives

$$\mathbf{E}_{\bar\theta}\left[l(\bar\theta)\right] - l^* \le \frac{R(1 + 1/T)L}{2\sqrt{T}} < \frac{LR}{\sqrt{T}}$$

With unknown $T$, the rate $\eta_t = O(t^{-\frac{1}{2}})$ gives

$$\mathbf{E}_{\bar\theta}\left[l(\bar\theta)\right] - l^* = O\left(\frac{\log T}{\sqrt{T}}\right)$$

slide-106
SLIDE 106

Strong Convexity

  • Use this to bound the expected deviation
  • Exponentially decaying averaging

Strong convexity of each loss:

$$l_i(\theta') \ge l_i(\theta) + \langle\partial_\theta l_i(\theta), \theta' - \theta\rangle + \frac{\lambda}{2}\|\theta - \theta'\|^2$$

Using this in the recursion for $r_t$:

$$r_{t+1}^2 \le r_t^2 + \eta_t^2\|g_t\|^2 - 2\eta_t\langle\theta_t - \theta^*, g_t\rangle \le r_t^2 + \eta_t^2 L^2 - 2\eta_t\left[l_t(\theta_t) - l_t(\theta^*)\right] - \lambda\eta_t r_t^2$$

hence

$$\mathbf{E}\left[r_{t+1}^2\right] \le (1 - \lambda\eta_t)\,\mathbf{E}\left[r_t^2\right] - 2\eta_t\left[\mathbf{E}[l(\theta_t)] - l^*\right]$$

With exponentially decaying averaging

$$\bar\theta = \frac{1-\sigma}{1-\sigma^T}\sum_{t=0}^{T-1}\sigma^{T-1-t}\theta_t$$

and plugging this into the discrepancy yields

$$l(\bar\theta) - l^* \le \frac{2L^2}{\lambda T}\log\left[1 + \frac{\lambda R T^{\frac{1}{2}}}{2L}\right] \quad\text{for}\quad \eta = \frac{2}{\lambda T}\log\left[1 + \frac{\lambda R T^{\frac{1}{2}}}{2L}\right]$$

slide-107
SLIDE 107

More variants

  • Adversarial guarantees

has low regret (average instantaneous cost) for arbitrary orders (useful for game theory)

  • Ratliff, Bagnell, Zinkevich: learning rate $\eta_t = O(t^{-\frac{1}{2}})$
  • Shalev-Shwartz, Srebro, Singer (Pegasos): learning rate $\eta_t = O(t^{-1})$ (but need constants)
  • Bartlett, Rakhlin, Hazan (add strong convexity penalty)

$$\theta_{t+1} \leftarrow \pi_X\left[\theta_t - \eta_t\,\partial_\theta\, l\left(y_t, \langle\phi(x_t), \theta_t\rangle\right)\right]$$

slide-108
SLIDE 108

Parallel distributed variants

slide-109
SLIDE 109

Online Learning

  • General Template
  • Get instance
  • Compute instantaneous gradient
  • Update parameter vector
  • Problems
  • Sequential execution (single core)
  • CPU core speed is no longer increasing
  • Disk/network bandwidth: 300GB/h
  • Does not scale to TBs of data
  • Batch subgradient has 50x penalty
slide-110
SLIDE 110

Parallel Online Templates

  • Data parallel
  • Parameter parallel

(diagrams: loss/gradient pipeline for a single data source vs. data split into parts, each feeding a shared updater)

slide-111
SLIDE 111

Delayed Updates

  • Data parallel
  • n processors compute gradients
  • delay is n-1 between gradient computation

and application

  • Parameter parallel
  • delay between partial computation and

feedback from joint loss

  • delay logarithmic in processors
slide-112
SLIDE 112
  • Optimization Problem
  • Algorithm

Delayed Updates

$$\operatorname*{minimize}_{w}\ \sum_i f_i(w)$$

Algorithm (delayed stochastic gradient descent):

Input: scalar $\sigma > 0$ and delay $\tau$
for $t = \tau + 1$ to $T + \tau$ do
  Obtain $f_t$ and incur loss $f_t(w_t)$
  Compute $g_t := \partial f_t(w_t)$ and set $\eta_t = \frac{1}{\sigma(t-\tau)}$
  Update $w_{t+1} = w_t - \eta_t g_{t-\tau}$
end for
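The delayed update rule can be simulated sequentially with a gradient queue (a sketch on a hypothetical quadratic objective; the τ processors are modeled by applying each gradient τ steps late):

```python
from collections import deque

def delayed_sgd(grad, w0, sigma=1.0, tau=4, T=200):
    """SGD where the gradient applied at step t was computed at step t - tau."""
    w = w0
    queue = deque()                      # gradients awaiting application
    for t in range(1, T + 1):
        queue.append(grad(w))            # g_t computed on current iterate
        if len(queue) > tau:             # apply the stale gradient g_{t - tau}
            eta = 1.0 / (sigma * (t - tau))
            w = w - eta * queue.popleft()
    return w

# hypothetical strongly convex objective f(w) = (w - 3)^2 / 2
w = delayed_sgd(lambda w: w - 3.0, w0=0.0)
```

The delayed gradients briefly overshoot the optimum, but the decaying learning rate damps the oscillation, matching the "no worse than sequential SGD" guarantee on the next slide.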

slide-113
SLIDE 113

Theoretical Guarantees

  • Worst case guarantee

SGD with delay τ on τ processors is no worse than sequential SGD

  • Lower bound is tight

Proof: send same instance τ times

  • Better bounds with iid data
  • Penalty is covariance in features
  • Vanishing penalty for smooth f(w)
slide-114
SLIDE 114
  • Linear function classes

Algorithm converges no worse than with serial execution; tight up to a factor of 4.
  • Strong convexity

Each loss function is strongly convex with modulus λ. Constant offset depends on the degree of parallelism.

Theoretical Guarantees

$$\mathbf{E}\left[f_i(w)\right] \le 4RL\sqrt{\tau T}$$

$$R[X] \le \lambda\tau R + \left[\frac{1}{2} + \tau\right]\frac{L^2}{\lambda}\left(1 + \tau + \log T\right)$$

slide-115
SLIDE 115
  • Lipschitz continuous loss gradients

Asymptotic rate no longer depends on the amount of parallelism

  • Strong convexity and Lipschitz gradients

This only works when the objective function is very close to a parabola (upper and lower bound)

  • Lock-free updates

(Hogwild - Recht, Wright, Re http://pages.cs.wisc.edu/~brecht/papers/hogwildTR.pdf)

Nonadversarial Guarantees

$$\mathbf{E}\left[R[X]\right] \le \left[28.3\,R^2H + \frac{2}{3}RL + \frac{4}{3}R^2H\log T\right]\tau^2 + \frac{8}{3}RL\sqrt{T}$$

$$\mathbf{E}\left[R[X]\right] \le O\left(\tau^2 + \log T\right)$$

slide-116
SLIDE 116

Lazy updates & sparsity

  • Sparse gradients (easy with l2 regularizer)
  • General coordinate-based penalty
  • Key insight - we only need to know the accurate

value of wj whenever we use it

  • Store wj with timestamp of last update
  • Before using wj update using past stepsizes
  • Approximate sum over stepsizes by integral

(Quadrianto et al, 2010; Li and Langford, 2009)

$$w \leftarrow w - \eta_t\, g(w, x_t)\, x_t$$

$$\mathbf{E}_{\mathrm{emp}}\left[l(x_i, y_i, w)\right] + \lambda\sum_j \gamma_j(w_j)$$
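A sketch of the timestamp trick for an ℓ2 penalty (hypothetical class and parameter names; each coordinate is shrunk lazily by all the regularization steps it missed, just before it is read):

```python
class LazyRegularizedWeights:
    """Apply l2 shrinkage (1 - eta*lam) lazily, only when a coordinate is touched.

    Sparse gradients touch few coordinates; the accumulated shrinkage on the
    untouched ones is a product of per-step factors, applied on demand."""
    def __init__(self, dim, lam=0.01, eta=0.1):
        self.w = [0.0] * dim
        self.last = [0] * dim            # timestamp of last update per coordinate
        self.t = 0
        self.factor = 1.0 - eta * lam    # per-step shrinkage factor
        self.eta = eta

    def _catch_up(self, j):
        # all shrinkage missed since the last touch, as a single power
        self.w[j] *= self.factor ** (self.t - self.last[j])
        self.last[j] = self.t

    def get(self, j):
        self._catch_up(j)
        return self.w[j]

    def sparse_step(self, grad):         # grad: {coordinate: value}
        self.t += 1
        for j, g in grad.items():
            self._catch_up(j)
            self.w[j] -= self.eta * g

w = LazyRegularizedWeights(dim=3)
w.sparse_step({0: -1.0})                 # only coordinate 0 is touched
for _ in range(10):
    w.sparse_step({1: -0.5})             # coordinate 0 decays lazily meanwhile
```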

slide-117
SLIDE 117

Convergence on TREC

(plot: log₂ error vs. thousands of iterations on TREC, comparing no delay with delays of 10, 100, and 1000)

slide-118
SLIDE 118

Convergence on Y!Data

(plot: log₂ error vs. thousands of iterations on Y! data, comparing no delay with delays of 10, 100, and 1000)

slide-119
SLIDE 119

Speedup on TREC

(plot: percent speedup vs. number of threads, 1-7, on TREC)

slide-120
SLIDE 120

Multiple Machines

slide-121
SLIDE 121

MapReduce variant

  • Idiot-proof, simple algorithm
  • Perform stochastic gradient on each

computer for a random subset of the data (drawn with replacement)

  • Average parameters
  • Benefits
  • No communication during optimization
  • Single pass MapReduce
  • Latency is not a problem
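The scheme above can be sketched in a few lines (single-process simulation with hypothetical 1-d data; each "mapper" runs independent SGD on a bootstrap resample, and the "reduce" step averages the parameters):

```python
import random

def local_sgd(data, steps, lam, seed):
    """One mapper: SGD on l2-regularized squared loss over a bootstrap resample."""
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in range(len(data))]  # drawn with replacement
    w = 0.0
    for t in range(1, steps + 1):
        x, y = rng.choice(sample)
        eta = 1.0 / (t + 10)                               # decaying step size
        w -= eta * ((w * x - y) * x + lam * w)             # stochastic gradient step
    return w

def averaged_sgd(data, k=8, steps=2000, lam=0.1):
    """Reduce step: average the k independently trained parameters."""
    return sum(local_sgd(data, steps, lam, seed) for seed in range(k)) / k

# hypothetical 1-d regression data with y = 2x; the regularized optimum is
# E[xy] / (E[x^2] + lam), roughly 1.6 here
data = [(x / 10.0, 2.0 * x / 10.0) for x in range(1, 11)]
w = averaged_sgd(data)
```

No communication happens during the k local runs; only the final parameters cross the network.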
slide-122
SLIDE 122

Guarantees

  • Requirements
  • Strongly convex loss
  • Lipschitz continuous gradient
  • Theorem
  • Not sample size dependent
  • Regularization limits parallelization
  • For runtime

$$\mathbf{E}_{w\in D^{\eta}_{T,k}}\left[c(w)\right] - \min_w c(w) \le \frac{8\eta G^2}{\sqrt{k\lambda}}\sqrt{\|\partial c\|_L} + \frac{8\eta G^2\|\partial c\|_L}{k\lambda} + 2\eta G^2$$

$$T = \frac{\ln k - (\ln\eta + \ln\lambda)}{2\eta\lambda}$$

slide-123
SLIDE 123

4.4 Discrete Problems

slide-124
SLIDE 124

Integer programming relaxations

  • Optimization problem
  • Relax to linear program if vertices are integral

since LP has vertex solution

$$\operatorname*{minimize}_{x}\ c^\top x \quad\text{subject to}\quad Ax \le b \ \text{and}\ x \in \mathbb{Z}^n$$

slide-125
SLIDE 125

Integer programming relaxations

  • Totally unimodular constraint matrix A
  • Inverse of each submatrix must be integral
  • RHS of constraints must be integral
  • Many useful sufficient conditions for TU.
slide-126
SLIDE 126

Example - Hungarian Marriage

  • Optimization Problem
  • n Hungarian men
  • n Hungarian women
  • Compatibility cij between them
  • Find optimal matching
  • All vertices of the constraint polytope are integral

$$\operatorname*{maximize}_{\pi}\ \sum_{ij}\pi_{ij}C_{ij} \quad\text{subject to}\quad \pi_{ij}\in\{0,1\},\quad \sum_i\pi_{ij} = 1,\quad \sum_j\pi_{ij} = 1$$
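Because the constraint matrix is totally unimodular, the LP relaxation has an integral optimum; for a small hypothetical compatibility matrix the optimal matching can be checked by brute force over permutations:

```python
from itertools import permutations

def best_matching(C):
    """Exhaustive search: pi maps man i to woman pi[i], maximizing total compatibility."""
    n = len(C)
    best_score, best_pi = float("-inf"), None
    for pi in permutations(range(n)):
        score = sum(C[i][pi[i]] for i in range(n))
        if score > best_score:
            best_score, best_pi = score, pi
    return best_score, best_pi

# hypothetical compatibilities c_ij
C = [[3, 1, 2],
     [2, 4, 6],
     [4, 5, 3]]
score, pi = best_matching(C)
```

For realistic n the Hungarian algorithm (or an LP solver, by integrality) replaces the factorial search.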

slide-127
SLIDE 127

Randomization

  • Maximum finding
  • Very large set of instances
  • Find approximate maximum
  • Draw a random set of n terms
  • Take maximum over subset

(59 for 95% with 95% confidence)

$$\Pr\left\{F\left[\max_i x_i\right] < 1 - \epsilon\right\} = (1-\epsilon)^n = \delta \quad\text{hence}\quad n = \frac{\log\delta}{\log(1-\epsilon)} \le \frac{-\log\delta}{\epsilon}$$
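A quick check of the sample-size formula; plugging in ε = δ = 0.05 reproduces the n = 59 figure on the slide:

```python
import math

def sample_size(eps, delta):
    """Smallest n with (1 - eps)^n <= delta: a random sample of n elements
    contains a top-eps-fraction element with probability >= 1 - delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - eps))

n = sample_size(0.05, 0.05)  # 95th percentile with 95% confidence -> 59
```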

slide-128
SLIDE 128

Randomization

  • Find good solution
  • Show that expected value is well behaved
  • Show that tails are bounded
  • Sufficiently large random draw must contain at least one

good element (e.g. CM sketch)

  • Find good majority
  • Show that majority satisfies condition
  • Bound probability of minority being overrepresented (e.g.

Mean-Median theorem)

  • Much more in these books
  • Raghavan & Motwani (Randomized Algorithms)
  • Alon & Spencer (Probabilistic Method)
slide-129
SLIDE 129


Submodular maximization

  • Submodular function
  • Defined on sets
  • Diminishing returns property
  • Example

For web search results each result might score well individually, but if we can show only 4 we should probably pick a diverse subset.

$$f(A \cup C) - f(A) \ge f(B \cup C) - f(B) \quad\text{for}\quad A \subseteq B$$

slide-130
SLIDE 130

Submodular maximization

  • Optimization problem

Often NP hard even to find tight approximation

  • Greedy optimization procedure
  • Start with empty set X
  • Find x such that $f(X \cup \{x\})$ is maximized
  • Add x to the set and repeat until |X| = k

$$\max_{X\in\mathcal{X}}\ f(X) \quad\text{subject to}\quad |X| \le k$$
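A sketch of the greedy procedure on a coverage function (set coverage is submodular, so greedy is within 1 − 1/e of optimal; the sets below are hypothetical):

```python
def greedy_max_coverage(sets, k):
    """Greedily pick k sets maximizing f(X) = |union of chosen sets|."""
    chosen, covered = [], set()
    for _ in range(k):
        # marginal gain f(X + {s}) - f(X) shrinks as X grows: diminishing returns
        best = max(sets, key=lambda s: len(s - covered))
        chosen.append(best)
        covered |= best
    return chosen, covered

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 6}]
chosen, covered = greedy_max_coverage(sets, k=2)
```

(This sketch may re-pick a set once all gains are zero; a production version would stop early or exclude chosen sets.)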

slide-131
SLIDE 131

Applications

  • Feature selection
  • Active learning and experimental design
  • Disease spread detection in networks
  • Document summarization
  • Learning graphical models
  • Extensions to
  • Weighted item sets
  • Decision trees
slide-132
SLIDE 132

Basic Techniques

  • Gradient descent
  • Newton's method
  • Conjugate Gradient Descent
  • Broyden-Fletcher-Goldfarb-Shanno (BFGS)
  • Constrained Convex Optimization
  • Properties
  • Lagrange function
  • Wolfe dual
  • Batch methods
  • Distributed subgradient
  • Bundle methods
  • Online methods
  • Unconstrained subgradient
  • Gradient projections
  • Parallel optimization

Optimization

slide-133
SLIDE 133

Further reading

  • Nesterov and Vial (expected convergence)

http://dl.acm.org/citation.cfm?id=1377347

  • Bartlett, Hazan, Rakhlin (strong convexity SGD)

http://books.nips.cc/papers/files/nips20/NIPS2007_0699.pdf

  • TAO (toolkit for advanced optimization)

http://www.mcs.anl.gov/research/projects/tao/

  • Ratliff, Bagnell, Zinkevich

http://martin.zinkevich.org/publications/ratliff_nathan_2007_3.pdf

  • Shalev-Shwartz, Srebro, Singer (Pegasos paper)

http://dl.acm.org/citation.cfm?id=1273598

  • Langford, Smola, Zinkevich (slow learners are fast)

http://arxiv.org/abs/0911.0491

  • Hogwild (Recht, Wright, Re)

http://pages.cs.wisc.edu/~brecht/papers/hogwildTR.pdf