Scalable Machine Learning
- 4. Optimization
Alex Smola Yahoo! Research and ANU
http://alex.smola.org/teaching/berkeley2012 Stat 260 SP 12
Basic Techniques
(more on this in a later lecture)
−log p(θ|X) = (1/2σ²)‖θ‖² + Σ_{i=1}^m [g(θ) − ⟨φ(x_i), θ⟩] + const.
Convexity: for x, x′ ∈ X it follows that λx + (1 − λ)x′ ∈ X for λ ∈ [0, 1], and
λf(x) + (1 − λ)f(x′) ≥ f(λx + (1 − λ)x′) for λ ∈ [0, 1].
f(λx + (1 − λ)x′) ≤ λf(x) + (1 − λ)f(x′), hence λx + (1 − λ)x′ ∈ X for x, x′ ∈ X.
λx + (1 − λ)x′ ∉ X for λ > 1 and all x′ ∈ X.
co X := { x | x = Σ_{i=1}^n α_i x_i where n ∈ ℕ, α_i ≥ 0 and Σ_{i=1}^n α_i ≤ 1 }
sup_{x∈X} f(x) = sup_{x∈co X} f(x)
(exploit convexity in upper bound and keep 5 points)
Require: a, b, precision ε
Set A = a, B = b
repeat
  if f′((A + B)/2) > 0 then B = (A + B)/2 (solution lies on the left) else A = (A + B)/2
  end if
until (B − A) · min(|f′(A)|, |f′(B)|) ≤ ε
Output: x = (A + B)/2
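To make the bisection rule above concrete, here is a minimal Python sketch (function and variable names are mine): it halves the bracket [A, B] based on the sign of f′ at the midpoint and stops once the gap criterion from the pseudocode holds.

```python
def bisect_minimize(f_prime, a, b, eps=1e-8):
    """Minimize a convex differentiable f on [a, b] by bisecting on the sign
    of its derivative, as in the pseudocode above (sketch; names are mine)."""
    A, B = a, b
    while (B - A) * min(abs(f_prime(A)), abs(f_prime(B))) > eps:
        mid = 0.5 * (A + B)
        if f_prime(mid) > 0:      # minimum lies to the left of mid
            B = mid
        else:                     # minimum lies to the right of mid
            A = mid
    return 0.5 * (A + B)

# Example: f(x) = (x - 2)^2 + 1, f'(x) = 2 (x - 2); minimizer is x = 2.
x_star = bisect_minimize(lambda x: 2.0 * (x - 2.0), a=-10.0, b=10.0)
print(x_star)  # approximately 2.0
```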
approximation of objective function
convex objective
General descent method (after Boyd & Vandenberghe):
given a starting point x ∈ dom f.
repeat
  1. Determine a descent direction Δx.
  2. Line search: choose a step size t > 0.
  3. Update: x := x + t Δx.
until stopping criterion is satisfied.
Lower bound from strong convexity (constant m):
f(y) ≥ f(x) + ⟨y − x, ∂_x f(x)⟩ + (m/2)‖y − x‖²
f(x) − f(x*) ≥ (m/2)‖x − x*‖²
f(x) − f(x*) ≤ ⟨x − x*, ∂_x f(x)⟩ − (m/2)‖x* − x‖² ≤ sup_y [⟨x − y, ∂_x f(x)⟩ − (m/2)‖y − x‖²] = (1/2m)‖∂_x f(x)‖²

Upper bound from smoothness (constant M), with g_x = −∂_x f(x):
f(y) ≤ f(x) + ⟨y − x, ∂_x f(x)⟩ + (M/2)‖y − x‖²
⟹ f(x + t g_x) ≤ f(x) − t‖g_x‖² + (M/2)t²‖g_x‖² ≤ f(x) − (1/2M)‖g_x‖² for t = 1/M
⟹ f(x + t g_x) − f(x*) ≤ f(x) − f(x*) − (1/2M)‖g_x‖² ≤ [f(x) − f(x*)][1 − m/M]
Hence (M/m) log((f(x) − f(x*))/ε) steps suffice for ε precision.
(line search is very expensive)
(important if the state space is several GB)
Update x ← x − (1/M) ∂_x f(x); reaching precision ε requires (M/m) log((f(x) − f(x*))/ε) iterations.
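A minimal sketch of gradient descent with the fixed step size 1/M from the update above, assuming the smoothness constant M of the gradient is known (the toy quadratic and all names are mine):

```python
import numpy as np

def gradient_descent(grad, x0, M, num_steps=100):
    """Gradient descent with fixed step size 1/M, where M upper-bounds the
    Lipschitz constant of the gradient. (Illustrative sketch; names are mine.)"""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - grad(x) / M
    return x

# Example: f(x) = 0.5 x^T Q x - b^T x with gradient Qx - b; M = largest eigenvalue of Q.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
M = np.linalg.eigvalsh(Q).max()
x_hat = gradient_descent(lambda x: Q @ x - b, x0=np.zeros(2), M=M)
print(x_hat, np.linalg.solve(Q, b))  # both close to the minimizer Q^{-1} b
```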
Isaac Newton
Require ∂²_x f(x) ⪰ 0.
Taylor expansion: f(x + δ) = f(x) + ⟨δ, ∂_x f(x)⟩ + ½ δ⊤ ∂²_x f(x) δ + O(‖δ‖³)
Newton update: x ← x − [∂²_x f(x)]⁻¹ ∂_x f(x)
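A small sketch of the Newton update, solving a linear system rather than forming the inverse (the example objective and names are mine):

```python
import numpy as np

def newton(grad, hess, x0, num_steps=10):
    """Newton's method: x <- x - [f''(x)]^{-1} f'(x), implemented by solving
    the linear system at each step. (Sketch; names are mine.)"""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# Example: f(x) = log(1 + exp(x)) - 0.3 x in one dimension.
grad = lambda x: np.array([1.0 / (1.0 + np.exp(-x[0])) - 0.3])
hess = lambda x: np.array([[np.exp(-x[0]) / (1.0 + np.exp(-x[0]))**2]])
print(newton(grad, hess, x0=np.array([0.0])))  # sigmoid(x) = 0.3, i.e. x = log(0.3/0.7)
```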
Convergence analysis. Assume the gradient is well approximated by its linearization, i.e.
‖∂_x f(x*) − ∂_x f(x) − ∂²_x f(x)(x* − x)‖ ≤ γ‖x* − x‖² (e.g. for a Lipschitz-continuous Hessian).
Then
x_{n+1} − x* = x_n − x* − [∂²_x f(x_n)]⁻¹ [∂_x f(x_n) − ∂_x f(x*)]
             = [∂²_x f(x_n)]⁻¹ [∂²_x f(x_n)(x_n − x*) − ∂_x f(x_n) + ∂_x f(x*)]
and hence
‖x_{n+1} − x*‖ ≤ γ ‖[∂²_x f(x_n)]⁻¹‖ ‖x_n − x*‖²,
i.e. quadratic convergence once x_n is sufficiently close to x*.
See Boyd and Vandenberghe, Chapter 9.5 for much more
(Figure from Boyd & Vandenberghe: iterates x⁽⁰⁾, x⁽¹⁾, x⁽²⁾; comparison of the Newton step x + Δx_nt with the steepest-descent step x + Δx_sd.)
Conjugate directions: vectors x_i ∈ ℝᵐ are K-conjugate (K ⪰ 0) if x⊤Kx′ = 0 for distinct pairs. Any z can then be expanded as
z = Σ_{i=1}^m x_i (x_i⊤Kz)/(x_i⊤Kx_i) = Σ_{i=1}^m x_i (x_i⊤y)/(x_i⊤Kx_i) for y = Kz.
Quadratic objective: f(x) = ½ x⊤Kx − l⊤x + c.
Conjugate directions are linearly independent: if Σ_i α_i x_i = 0 then 0 = x_j⊤K Σ_i α_i x_i = x_j⊤Kx_j α_j, hence α_j = 0 for all j.
Expansion coefficients: writing z = Σ_i α_i x_i gives x_j⊤Kz = x_j⊤K Σ_i α_i x_i = x_j⊤Kx_j α_j, hence α_j = (x_j⊤Kz)/(x_j⊤Kx_j).
For K = 1 the solution is immediate: f(x) = ½ x⊤x − l⊤x + c is minimized by x = l.
Conjugate gradient descent for f(x) = ½ x⊤Kx − l⊤x + c, hence g(x) = Kx − l:
initialize x_0 and v_0 = g_0 = Kx_0 − l and i = 0
repeat
  x_{i+1} = x_i − v_i (g_i⊤v_i)/(v_i⊤Kv_i)
  g_{i+1} = Kx_{i+1} − l
  v_{i+1} = −g_{i+1} + v_i (g_{i+1}⊤Kv_i)/(v_i⊤Kv_i)
  i ← i + 1
until g_i = 0
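A direct transcription of these conjugate gradient updates into Python, as a sketch (the toy matrix K and right-hand side l are made up):

```python
import numpy as np

def conjugate_gradient(K, l, x0=None, tol=1e-10):
    """Conjugate gradient for f(x) = 0.5 x^T K x - l^T x, following the update
    rules on the slide. (Sketch; variable names mirror the slides.)"""
    x = np.zeros_like(l) if x0 is None else np.asarray(x0, dtype=float)
    g = K @ x - l          # gradient
    if np.linalg.norm(g) < tol:
        return x
    v = g.copy()           # initial search direction
    for _ in range(len(l)):
        Kv = K @ v
        x = x - v * (g @ v) / (v @ Kv)
        g_new = K @ x - l
        if np.linalg.norm(g_new) < tol:
            break
        v = -g_new + v * (g_new @ Kv) / (v @ Kv)
        g = g_new
    return x

K = np.array([[4.0, 1.0], [1.0, 3.0]])
l = np.array([1.0, 2.0])
print(conjugate_gradient(K, l), np.linalg.solve(K, l))  # both solve Kx = l
```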
xi+1 is optimal in span of {v1 .. vi}
Proof that the new gradient is orthogonal to the previous direction: with
x_{i+1} = x_i − v_i (g_i⊤v_i)/(v_i⊤Kv_i), g_{i+1} = Kx_{i+1} − l, v_{i+1} = −g_{i+1} + v_i (g_{i+1}⊤Kv_i)/(v_i⊤Kv_i)
we have
v_i⊤g_{i+1} = v_i⊤[Kx_i − l − Kv_i (g_i⊤v_i)/(v_i⊤Kv_i)] = v_i⊤g_i − v_i⊤Kv_i (g_i⊤v_i)/(v_i⊤Kv_i) = 0,
and more generally v_j⊤g_i = 0 for all j < i.
(rest automatically true by construction)
Proof that the directions are K-conjugate: with the same updates,
v_j⊤Kv_{i+1} = −v_j⊤Kg_{i+1} + v_j⊤Kv_i (g_{i+1}⊤Kv_i)/(v_i⊤Kv_i),
which vanishes for j = i by construction and for j < i by induction.
Generic method: compute the Hessian K_i := f″(x_i) and update α_i, β_i with
α_i = −(g_i⊤v_i)/(v_i⊤K_i v_i), β_i = (g_{i+1}⊤K_i v_i)/(v_i⊤K_i v_i).
This requires calculation of the Hessian at each iteration.
Fletcher–Reeves [163]: find α_i via a line search and use Theorem 6.20 (iii) for β_i:
α_i = argmin_α f(x_i + α v_i), β_i = (g_{i+1}⊤g_{i+1})/(g_i⊤g_i).
Polak–Ribière [398]: find α_i via a line search:
α_i = argmin_α f(x_i + α v_i), β_i = ((g_{i+1} − g_i)⊤g_{i+1})/(g_i⊤g_i).
Experimentally, Polak–Ribière tends to be better than Fletcher–Reeves.
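A hedged sketch of nonlinear conjugate gradient with the Polak–Ribière choice of β_i from the table above, using a scalar line search for α_i (the test function and all names are mine):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nonlinear_cg(f, grad, x0, num_steps=50):
    """Nonlinear conjugate gradient with the Polak-Ribiere update, using a
    scalar line search for alpha_i. (Sketch; names are mine.)"""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    v = -g                                  # first direction: steepest descent
    for _ in range(num_steps):
        alpha = minimize_scalar(lambda a: f(x + a * v)).x   # line search
        x = x + alpha * v
        g_new = grad(x)
        if np.linalg.norm(g_new) < 1e-8:
            break
        beta = ((g_new - g) @ g_new) / (g @ g)              # Polak-Ribiere
        v = -g_new + beta * v
        g = g_new
    return x

# Example: mildly ill-conditioned quartic with minimizer (1, 1).
f = lambda x: (1 - x[0])**2 + 10 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 40 * x[0] * (x[1] - x[0]**2),
                           20 * (x[1] - x[0]**2)])
print(nonlinear_cg(f, grad, np.zeros(2)))   # approximately [1, 1]
```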
Quasi-Newton (BFGS): take δ_i = B_i⁻¹ ∂_x f(x_i) and x_{i+1} = x_i − α_i δ_i, and update the Hessian estimate by a rank-two correction B_{i+1} = B_i + u_i u_i⊤ + v_i v_i⊤ chosen to satisfy the secant condition
B_{i+1}(x_{i+1} − x_i) = ∂_x f(x_{i+1}) − ∂_x f(x_i).
This yields
B_{i+1} = B_i + (g_i g_i⊤)/(α_i δ_i⊤ g_i) − (B_i δ_i δ_i⊤ B_i)/(δ_i⊤ B_i δ_i).
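A sketch of the quasi-Newton idea in its standard BFGS form, written with the step s and gradient difference y (equivalent to the rank-two update above up to notation); the crude backtracking line search and the example are my own simplifications:

```python
import numpy as np

def bfgs(f, grad, x0, num_steps=100, tol=1e-8):
    """Standard BFGS sketch: maintain a Hessian estimate B via a rank-two
    update and step along -B^{-1} grad f. A fixed backtracking (Armijo) search
    stands in for a proper Wolfe line search. (Names are mine.)"""
    x = np.asarray(x0, dtype=float)
    B = np.eye(len(x))                          # initial Hessian estimate
    g = grad(x)
    for _ in range(num_steps):
        if np.linalg.norm(g) < tol:
            break
        d = -np.linalg.solve(B, g)              # quasi-Newton direction
        alpha = 1.0
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * (g @ d):  # backtracking
            alpha *= 0.5
        s = alpha * d
        x_new = x + s
        y = grad(x_new) - g
        if y @ s > 1e-12:                       # keep B positive definite
            Bs = B @ s
            B = B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)
        x, g = x_new, grad(x_new)
    return x

f = lambda x: (x[0] - 1)**2 + 10 * (x[1] + 2)**2
grad = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
print(bfgs(f, grad, np.zeros(2)))               # approximately [1, -2]
```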
Constrained optimization: minimize_x f(x) subject to c_i(x) ≤ 0 for all i.
Typical constraint classes:
linear: ⟨w_i, x⟩ + b_i ≤ 0
quadratic: x⊤Qx + b⊤x ≤ c with Q ⪰ 0
semidefinite: M ⪰ 0, or M_0 + Σ_i x_i M_i ⪰ 0
(Figure: maximum-margin separating hyperplane. The decision boundary is {x | ⟨w, x⟩ + b = 0}, flanked by the margin hyperplanes {x | ⟨w, x⟩ + b = −1} and {x | ⟨w, x⟩ + b = +1} for the classes y_i = −1 and y_i = +1. For points x_1, x_2 on the two margins, ⟨w, x_1⟩ + b = +1 and ⟨w, x_2⟩ + b = −1, hence ⟨w, x_1 − x_2⟩ = 2 and the margin width is ⟨w/‖w‖, x_1 − x_2⟩ = 2/‖w‖.)
Hard-margin SVM:
minimize_{w,b} ½‖w‖² subject to y_i [⟨w, x_i⟩ + b] ≥ 1.
Margin: ⟨w, x_1⟩ + b = 1 and ⟨w, x_2⟩ + b = −1, hence ⟨w, x_1 − x_2⟩ = 2, hence ⟨w/‖w‖, x_1 − x_2⟩ = 2/‖w‖.
Lagrange function: L(x, α) := f(x) + Σ_{i=1}^n α_i c_i(x) where α_i ≥ 0.
Saddle point condition: L(x*, α) ≤ L(x*, α*) ≤ L(x, α*) for all feasible x and all α ≥ 0.
From the first inequality, (α_i − α_i*) c_i(x*) ≤ 0 for all α_i ≥ 0; letting α_i → ∞ forces c_i(x*) ≤ 0, and setting α_i = 0 then yields α_i* c_i(x*) = 0.
From the second inequality, L(x*, α*) = f(x*) ≤ L(x, α*) = f(x) + Σ_i α_i* c_i(x) ≤ f(x) for feasible x, so x* is optimal.
Constraint qualifications (guaranteeing the existence of such a saddle point):
There exists some x such that c_i(x) < 0 for all i (Slater's condition).
For all nonnegative α there exists some x such that Σ_i α_i c_i(x) ≤ 0.
The feasible region contains at least two distinct elements, and there exists an x in X such that all c_i(x) are strictly convex at x with respect to X.
∂_x L(x*, α*) = ∂_x f(x*) + Σ_i α_i* ∂_x c_i(x*) = 0   (saddle point in x*)
∂_{α_i} L(x*, α*) = c_i(x*) ≤ 0   (saddle point in α*)
Σ_i α_i* c_i(x*) = 0   (vanishing KKT gap)
This yields an algorithm for solving constrained optimization problems: solve for the saddle point and the KKT conditions.
f(x) − f(x*) ≥ [∂_x f(x*)]⊤(x − x*)   (by convexity)
= −Σ_i α_i* [∂_x c_i(x*)]⊤(x − x*)   (by the saddle point in x*)
≥ −Σ_i α_i* (c_i(x) − c_i(x*))   (by convexity)
= −Σ_i α_i* c_i(x)   (by the vanishing KKT gap)
≥ 0   (since α_i* ≥ 0 and c_i(x) ≤ 0 for feasible x)
Linear program:
minimize_x c⊤x subject to Ax + d ≤ 0
Lagrange function: L(x, α) = c⊤x + α⊤(Ax + d)
Optimality conditions: ∂_x L(x, α) = A⊤α + c = 0, ∂_α L(x, α) = Ax + d ≤ 0, 0 = α⊤(Ax + d), 0 ≤ α
Dual problem:
maximize_α d⊤α subject to A⊤α + c = 0 and α ≥ 0
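A small numerical check of this primal/dual pair, assuming SciPy's linprog; the data A, d, c below are made up:

```python
import numpy as np
from scipy.optimize import linprog

# Solve the primal  min c^T x  s.t. A x + d <= 0  (x free)
# and the dual      max d^T alpha  s.t. A^T alpha + c = 0, alpha >= 0,
# then confirm the optimal values coincide (strong duality).
A = np.array([[1.0, 1.0], [-1.0, 2.0], [0.0, -1.0]])
d = np.array([-4.0, -2.0, 0.0])
c = np.array([1.0, 2.0])

primal = linprog(c, A_ub=A, b_ub=-d, bounds=[(None, None)] * 2)
dual = linprog(-d, A_eq=A.T, b_eq=-c, bounds=[(0, None)] * 3)
print(primal.fun, -dual.fun)   # both equal -2 at the optimum
```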
Quadratic program:
minimize_x ½ x⊤Qx + c⊤x subject to Ax + d ≤ 0
Lagrange function: L(x, α) = ½ x⊤Qx + c⊤x + α⊤(Ax + d)
Optimality conditions: ∂_x L(x, α) = Qx + A⊤α + c = 0, ∂_α L(x, α) = Ax + d ≤ 0, 0 = α⊤(Ax + d), 0 ≤ α
Eliminating x via Qx + A⊤α + c = 0:
L(x, α) = ½ x⊤Qx + c⊤x + α⊤(Ax + d) = −½ x⊤Qx + α⊤d
= −½ (A⊤α + c)⊤Q⁻¹(A⊤α + c) + α⊤d
= −½ α⊤AQ⁻¹A⊤α + α⊤[d − AQ⁻¹c] − ½ c⊤Q⁻¹c subject to α ≥ 0
Dual problem:
minimize_α ½ α⊤AQ⁻¹A⊤α + α⊤[AQ⁻¹c − d] subject to α ≥ 0
Equality constraints:
minimize_x f(x) subject to Ax = b
L(x, α) = f(x) + α⊤[Ax − b]
∂_x L(x, α) = ∂_x f(x) + A⊤α = 0, ∂_α L(x, α) = Ax − b = 0
Linearizing the gradient, ∂_x f(x) ≈ ∂_x f(x_0) + ∂²_x f(x_0)[x − x_0] + O(‖x − x_0‖²), yields the linear (KKT) system
[ ∂²_x f(x_0)   A⊤ ] [ x ]   [ ∂²_x f(x_0) x_0 − ∂_x f(x_0) ]
[ A             0  ] [ α ] = [ b                             ]
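A sketch of one such Newton step on a toy quadratic objective (so a single solve of the KKT system is exact); all data and names below are mine:

```python
import numpy as np

# One equality-constrained Newton step: linearize grad f around x0 and solve
# the KKT system above for the new iterate x and the multipliers alpha.
# Stand-in objective: f(x) = 0.5 x^T Q x - l^T x (made-up data).
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
l = np.array([1.0, 1.0])
A = np.array([[1.0, 1.0]])       # constraint: x1 + x2 = b
b = np.array([1.0])

x0 = np.zeros(2)
grad0 = Q @ x0 - l               # gradient of f at x0
hess0 = Q                        # Hessian of f at x0

# KKT system: [H  A^T; A  0] [x; alpha] = [H x0 - grad f(x0); b]
kkt = np.block([[hess0, A.T], [A, np.zeros((1, 1))]])
rhs = np.concatenate([hess0 @ x0 - grad0, b])
sol = np.linalg.solve(kkt, rhs)
x, alpha = sol[:2], sol[2:]
print(x, A @ x)                  # x satisfies the constraint A x = b
```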
Recall the KKT conditions: ∂_x f(x*) + Σ_i α_i* ∂_x c_i(x*) = 0 (saddle point in x*), c_i(x*) ≤ 0 (saddle point in α*), and Σ_i α_i* c_i(x*) = 0 (vanishing KKT gap).
Interior point method for the QP: the optimality conditions
Qx + A⊤α + c = 0, Ax + d + ξ = 0, α_i ξ_i = 0, α, ξ ≥ 0
are handled by relaxing the complementarity condition α_i ξ_i = 0 to α_i ξ_i = μ and driving μ → 0, at each step solving a linearized system of the form
[ Q   A⊤ ] [ δx ]   [ c_x ]
[ A  −D  ] [ δα ] = [ c_α ]
http://pages.cs.wisc.edu/~swright/ooqp/ Object oriented quadratic programming solver
http://www.princeton.edu/~rvdb/loqo/LOQO.html Interior point path following solver
http://www.maths.ed.ac.uk/~gondzio/software/hopdm.html Linear and nonlinear infeasible IP solver
http://abel.ee.ucla.edu/cvxopt/ Python package for convex optimization
http://sedumi.ie.lehigh.edu/ Semidefinite programming solver
Maximum a posteriori estimation:
minimize_θ −Σ_{i=1}^m log p(x_i|θ) − log p(θ)
equivalently, for an exponential-family model with Gaussian prior,
minimize_θ Σ_{i=1}^m [g(θ) − ⟨φ(x_i), θ⟩] + (1/2σ²)‖θ‖²
and, for a general loss l,
minimize_θ Σ_{i=1}^m l(y_i, ⟨φ(x_i), θ⟩) + (1/2σ²)‖θ‖²
In general: minimize_θ Σ_{i=1}^m l_i(θ) + λΩ[θ]
Regularized Risk Minimization: minimize_w R_emp[w] + λΩ[w]
Taylor approximation for R_emp[w]:
R_emp[w] ≥ R_emp[w_{t−1}] + ⟨w − w_{t−1}, ∂_w R_emp[w_{t−1}]⟩ = ⟨a_t, w⟩ + b_t
where a_t = ∂_w R_emp[w_{t−1}] and b_t = R_emp[w_{t−1}] − ⟨a_t, w_{t−1}⟩.
Bundle bound: R_emp[w] ≥ R_t[w] := max_{i≤t} ⟨a_i, w⟩ + b_i
The regularizer Ω[w] solves stability problems.
Bundle method:
Initialize t = 0, w_0 = 0, a_0 = 0, b_0 = 0
repeat
  Find minimizer w_t := argmin_w R_t(w) + Ω[w]
  Compute gradient a_{t+1} and offset b_{t+1}.
  Increment t ← t + 1.
until ε_t ≤ ε
Convergence monitor: ε_t := R_{t+1}[w_t] − R_t[w_t]. Since R_{t+1}[w_t] = R_emp[w_t] (Taylor approximation) we have
R_{t+1}[w_t] + Ω[w_t] ≥ min_w R_emp[w] + Ω[w] ≥ R_t[w_t] + Ω[w_t].
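A hedged sketch of this bundle loop for the hinge loss with Ω[w] = (λ/2)‖w‖², using a generic derivative-free solver in place of the dual QP that the slides describe; the synthetic data and all names are mine:

```python
import numpy as np
from scipy.optimize import minimize

# Bundle loop sketch: each round adds the cutting plane <a_t, w> + b_t and
# minimizes the piecewise-linear lower bound R_t(w) plus the regularizer.
# (In practice one solves the equivalent dual QP instead of calling Nelder-Mead.)
rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))
lam = 0.1

def remp_and_grad(w):
    margins = y * (X @ w)
    active = margins < 1                       # examples with nonzero hinge loss
    risk = np.maximum(0, 1 - margins).mean()
    grad = -(y[active, None] * X[active]).sum(axis=0) / len(y)
    return risk, grad

planes = []                                    # list of cutting planes (a_t, b_t)
w = np.zeros(3)
for t in range(50):
    risk, a = remp_and_grad(w)
    planes.append((a, risk - a @ w))           # b_t = Remp[w_t] - <a_t, w_t>
    def lower_bound(w):
        return max(a @ w + b for a, b in planes) + 0.5 * lam * w @ w
    w_new = minimize(lower_bound, w, method="Nelder-Mead").x
    gap = risk + 0.5 * lam * w @ w - lower_bound(w_new)   # convergence monitor
    w = w_new
    if gap < 1e-4:
        break
print(t, w)
```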
Good news: dual optimization for Ω[w] = ½‖w‖²₂ is a quadratic program regardless of the choice of the empirical risk R_emp[w].
Details: minimize_β (1/2λ) β⊤AA⊤β − β⊤b subject to β_i ≥ 0 and ‖β‖₁ = 1. The primal coefficient is recovered as w = −λ⁻¹A⊤β.
General result: use the Fenchel–Legendre dual of Ω[w], e.g. ‖·‖₁ → ‖·‖_∞.
Very cheap variant: one can even use a simple line search for the update (almost as good).
Parallelization: the empirical risk is a sum of many terms (MapReduce); the gradient is a sum of many terms, gathered from the cluster; possible even for multivariate performance scores; data stays local, so one can combine data from competing entities.
Solver independent of loss: no need to change the solver for a new loss.
Loss independent of solver/regularizer: add a new regularizer without re-implementing the loss.
Line search variant: the optimization does not require a QP solver at all; update along the gradient direction in the dual. We only need inner products of gradients.
Theorem: the number of iterations to reach ε precision is bounded by
n ≤ log₂(R_emp[0]/G²) + 8G²/ε − 4,
where G bounds the (sub)gradients of R_emp. For a smooth loss with Hessian bounded by H*, the 8G²/ε term improves to a term that is only logarithmic in 1/ε.
Advantages: linear convergence for smooth losses; for non-smooth losses almost as good in practice (as long as the loss is smooth on a coarse scale); does not require a primal line search.
Proof idea:
Duality argument: the dual of R_i[w] + λΩ[w] lower-bounds the minimum of the regularized risk R_emp[w] + λΩ[w]; R_{i+1}[w_i] + λΩ[w_i] is an upper bound. Show that the gap γ_i := R_{i+1}[w_i] − R_i[w_i] vanishes.
Dual improvement: give a lower bound on the increase in the dual problem in terms of γ_i and the subgradient ∂_w [R_emp[w] + λΩ[w]]. For unbounded Hessian we have δγ = O(γ²); for bounded Hessian, δγ = O(γ).
Convergence: solve the difference equation in γ_t to get the desired result.
Perceptron:
initialize w = 0 and b = 0
repeat
  if y_i [⟨w, x_i⟩ + b] ≤ 0 then w ← w + y_i x_i and b ← b + y_i end if
until all classified correctly
The solution is a combination of the misclassified points: w = Σ_{i∈I} y_i x_i and f(x) = Σ_{i∈I} y_i ⟨x_i, x⟩ + b.
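A direct Python sketch of the perceptron above on a toy separable data set (names are mine):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron: on a mistake, w <- w + y_i x_i and b <- b + y_i;
    stop once everything is classified correctly. (Sketch.)"""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:
                w += yi * xi
                b += yi
                errors += 1
        if errors == 0:
            break
    return w, b

# Linearly separable toy data.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))   # matches y
```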
Perceptron convergence (Novikoff): if there exists (w*, b*) with ‖w*‖ = 1 such that y_i [⟨x_i, w*⟩ + b*] ≥ ρ for all i, and ‖x_i‖ ≤ R, then the perceptron makes at most ((b*)² + 1)(R² + 1)/ρ² mistakes.
Starting point: we start from w_1 = 0 and b_1 = 0.
Step 1 (bound on the increase of alignment): denote by w_j the value of w at step j (analogously b_j) and consider the alignment ⟨(w_j, b_j), (w*, b*)⟩. For an error on observation (x_i, y_i) we get
⟨(w_{j+1}, b_{j+1}), (w*, b*)⟩ = ⟨(w_j, b_j) + y_i(x_i, 1), (w*, b*)⟩ = ⟨(w_j, b_j), (w*, b*)⟩ + y_i ⟨(x_i, 1), (w*, b*)⟩ ≥ ⟨(w_j, b_j), (w*, b*)⟩ + ρ ≥ jρ.
The alignment increases with the number of errors.
Step 2 (Cauchy–Schwarz): ⟨(w_{j+1}, b_{j+1}), (w*, b*)⟩ ≤ ‖(w_{j+1}, b_{j+1})‖ ‖(w*, b*)‖ = √(1 + (b*)²) ‖(w_{j+1}, b_{j+1})‖.
Step 3 (upper bound on ‖(w_j, b_j)‖): if we make a mistake,
‖(w_{j+1}, b_{j+1})‖² = ‖(w_j, b_j) + y_i(x_i, 1)‖² = ‖(w_j, b_j)‖² + 2y_i ⟨(x_i, 1), (w_j, b_j)⟩ + ‖(x_i, 1)‖² ≤ ‖(w_j, b_j)‖² + ‖(x_i, 1)‖² ≤ j(R² + 1).
Step 4 (combination of the first three steps): jρ ≤ √(1 + (b*)²) ‖(w_{j+1}, b_{j+1})‖ ≤ √(j(R² + 1)((b*)² + 1)).
Solving for j proves the theorem.
l(x_i, y_i, w, b) = max(0, 1 − y_i [⟨w, x_i⟩ + b])
Stochastic gradient descent: the empirical risk is an expectation over the data,
(1/m) Σ_{i=1}^m l(y_i, ⟨φ(x_i), θ⟩) = E_{i∼{1,…,m}} [l(y_i, ⟨φ(x_i), θ⟩)].
Here π_X(θ) = argmin_{x∈X} ‖x − θ‖ denotes projection onto X.
Update: θ_{t+1} ← θ_t − η_t ∂_θ l(y_t, ⟨φ(x_t), θ_t⟩)
Projected update: θ_{t+1} ← π_X [θ_t − η_t ∂_θ l(y_t, ⟨φ(x_t), θ_t⟩)]
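A sketch of the projected update with η_t ∝ t^{−1/2} and the η-weighted parameter average used in the analysis below; the hinge subgradient, the ball constraint, and the data are my own choices:

```python
import numpy as np

def projected_sgd(X, y, loss_grad, radius=1.0, T=1000, eta0=0.1):
    """Projected SGD: theta <- pi_X[theta - eta_t * grad], with pi_X the
    projection onto an l2-ball and eta_t = eta0 / sqrt(t). Returns the
    eta-weighted average of the iterates. (Sketch; names are mine.)"""
    rng = np.random.RandomState(0)
    theta = np.zeros(X.shape[1])
    avg, eta_sum = np.zeros_like(theta), 0.0
    for t in range(1, T + 1):
        i = rng.randint(len(y))                   # draw a random example
        eta = eta0 / np.sqrt(t)
        theta = theta - eta * loss_grad(theta, X[i], y[i])
        norm = np.linalg.norm(theta)
        if norm > radius:                         # projection onto the ball
            theta *= radius / norm
        avg, eta_sum = avg + eta * theta, eta_sum + eta   # weighted average
    return avg / eta_sum

# Hinge-loss subgradient for a linear classifier.
hinge_grad = lambda th, x, yi: -yi * x if yi * (th @ x) < 1 else 0 * x
X = np.vstack([np.random.RandomState(1).randn(100, 2) + [2, 2],
               np.random.RandomState(2).randn(100, 2) - [2, 2]])
y = np.r_[np.ones(100), -np.ones(100)]
print(projected_sgd(X, y, hinge_grad))
```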
Convergence (convex losses):
E_θ̄ [l(θ̄)] − l* ≤ (R² + L² Σ_{t=0}^{T−1} η_t²) / (2 Σ_{t=0}^{T−1} η_t)
where l(θ) = E_{(x,y)} [l(y, ⟨φ(x), θ⟩)], l* = inf_{θ∈X} l(θ), and θ̄ = (Σ_{t=0}^{T−1} η_t θ_t)/(Σ_{t=0}^{T−1} η_t) is the parameter average.
Proof (following Nesterov and Vial): let θ* ∈ argmin_{θ∈X} l(θ) and set r_t := ‖θ* − θ_t‖, with r_0 ≤ R bounding the initial distance.
r²_{t+1} = ‖π_X[θ_t − η_t g_t] − θ*‖² ≤ ‖θ_t − η_t g_t − θ*‖² = r²_t + η_t² ‖g_t‖² − 2η_t ⟨θ_t − θ*, g_t⟩,
hence, taking expectations and using convexity (⟨θ_t − θ*, g_t⟩ ≥ l_t(θ_t) − l_t(θ*)),
E[r²_{t+1} − r²_t] ≤ η_t² L² + 2η_t [l* − E[l(θ_t)]].
Summing over t, telescoping, and applying convexity once more to replace the η-weighted average of l(θ_t) by l(θ̄) yields the bound above.
E_θ̄ [l(θ̄)] − l* ≤ (R² + L² Σ_{t=0}^{T−1} η_t²) / (2 Σ_{t=0}^{T−1} η_t).
For a constant step size η = R/(L√T) this gives E_θ̄ [l(θ̄)] − l* ≤ R(1 + 1/T)L/(2√T) < LR/√T.
For a decreasing step size η_t = O(t^{−1/2}) one obtains E_θ̄ [l(θ̄)] − l* = O(log T / √T).
Strongly convex losses: assume l_i(θ′) ≥ l_i(θ) + ⟨∂_θ l_i(θ), θ′ − θ⟩ + (λ/2)‖θ − θ′‖².
Then
r²_{t+1} ≤ r²_t + η_t² ‖g_t‖² − 2η_t ⟨θ_t − θ*, g_t⟩ ≤ r²_t + η_t² L² − 2η_t [l_t(θ_t) − l_t(θ*)] − λη_t r²_t
hence E[r²_{t+1}] ≤ (1 − λη_t) E[r²_t] − 2η_t [E[l(θ_t)] − l*].
Using the weighted average θ̄ = ((1 − σ)/(1 − σ^T)) Σ_{t=0}^{T−1} σ^{T−1−t} θ_t one obtains
l(θ̄) − l* ≤ (2L²/(λT)) log[1 + λR√T/(2L)] for η = (2/(λT)) log[1 + λR√T/(2L)].
Summary: projected stochastic gradient descent θ_{t+1} ← π_X [θ_t − η_t ∂_θ l(y_t, ⟨φ(x_t), θ_t⟩)] converges at rate O(t^{−1/2}) for general convex losses and at rate O(t^{−1}) for strongly convex losses.
(Diagram: distributed setup — the data source is split into parts 1…n; each part computes loss gradients on its share of the data and an updater maintains the parameter x.)
Delayed stochastic gradient descent for minimize_w Σ_i f_i(w):
Input: scalar σ > 0 and delay τ
for t = τ + 1 to T + τ do
  Obtain f_t and incur loss f_t(w_t)
  Compute g_t := ∂f_t(w_t) and set η_t = 1/(σ(t − τ))
  Update w_{t+1} = w_t − η_t g_{t−τ}
end for
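A single-threaded simulation of the delayed update rule above, as a sketch in which gradients are buffered for τ steps before being applied (the stream of quadratic losses is made up):

```python
import numpy as np

def delayed_sgd(grads, w0, sigma=1.0, tau=3):
    """Delayed-update SGD: the gradient applied at time t is the one computed
    tau steps earlier, with eta_t = 1/(sigma*(t - tau)). (Sketch simulating the
    delay in a single thread; names are mine.)"""
    w = np.asarray(w0, dtype=float)
    buffer = []                                   # gradients waiting to be applied
    for t, grad_fn in enumerate(grads, start=tau + 1):
        buffer.append(grad_fn(w))                 # gradient computed on current w
        if len(buffer) > tau:
            eta = 1.0 / (sigma * (t - tau))
            w = w - eta * buffer.pop(0)           # apply the tau-delayed gradient
    return w

# Toy stream of strongly convex losses f_t(w) = 0.5 * ||w - c_t||^2.
rng = np.random.RandomState(0)
centers = rng.randn(2000, 2) + np.array([1.0, -1.0])
grads = [(lambda w, c=c: w - c) for c in centers]
print(delayed_sgd(grads, w0=np.zeros(2)))         # approaches the mean [1, -1]
```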
Guarantees: the algorithm converges no worse than its serial counterpart, up to a constant offset. Each loss function is assumed strongly convex with modulus λ. The constant offset depends on the degree of parallelism (the delay τ), while the asymptotic rate no longer depends on the amount of parallelism.
This only works when the objective function is very close to a parabola (upper and lower bound)
(Hogwild - Recht, Wright, Re http://pages.cs.wisc.edu/~brecht/papers/hogwildTR.pdf)
E[R[X]] ≤ 28.3 R²H + (2/3)RL + (4/3)R²H log T + 3RL√T.
Use the current value of w_j whenever we need it (updates proceed without locking).
(Quadrianto et al, 2010; Li and Langford, 2009)
E_emp [l(x_i, y_i, w)] + λ Σ_j γ_j(w_j)
(Plot: log₂ error vs. thousands of iterations, comparing no delay with delays of 10, 100, and 1000 updates.)
(Plot: percent speedup vs. number of threads, 1–7.)
E_{w∈D_{T,k,η}} [c(w)] − min_w c(w) ≤ 8ηG² √(‖∂c‖_L/(kλ)) + 8ηG² ‖∂c‖_L/(kλ) + 2ηG², for T = (ln k − ln η − ln λ)/(2ηλ).
Integer programming:
minimize_x c⊤x subject to Ax ≤ b and x ∈ ℤⁿ
Example (assignment/matching):
maximize_π Σ_{ij} π_{ij} C_{ij} subject to π_{ij} ∈ {0, 1}, Σ_i π_{ij} = 1 and Σ_j π_{ij} = 1
Random subsampling: drawing n candidates at random, the probability that none falls in the top ε quantile is Pr{F[max_i x_i] < 1 − ε} = (1 − ε)ⁿ ≤ δ, hence n = log δ / log(1 − ε) ≤ −log δ / ε samples suffice.
good element (e.g. CM sketch)
(mean–median theorem)
For web search results each result might be good individually, but if we can show only 4 we should probably pick a diverse subset.
f(A ∪ C) − f(A) ≥ f(B ∪ C) − f(B) for A ⊆ B
Submodular maximization: max_X f(X) subject to |X| ≤ k. Greedy selection adds, at each step, the element x with the largest value of f(X ∪ {x}) (see the sketch below).
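A sketch of the greedy rule with a toy coverage function, which is monotone and submodular; for such functions the classical Nemhauser–Wolsey–Fisher result gives a (1 − 1/e) approximation guarantee. The data and names are mine:

```python
def greedy_submodular(f, ground_set, k):
    """Greedy maximization of a monotone submodular f under |X| <= k:
    repeatedly add the element with the largest marginal gain. (Sketch.)"""
    X = set()
    for _ in range(k):
        gains = {x: f(X | {x}) - f(X) for x in ground_set - X}
        best = max(gains, key=gains.get)
        X.add(best)
    return X

# Toy coverage function: each element covers some topics; f(X) = #topics covered.
coverage = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c"}, 4: {"d"}}
f = lambda X: len(set().union(*(coverage[x] for x in X))) if X else 0
print(greedy_submodular(f, set(coverage), k=2))   # a size-2 set covering 3 topics
```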
Basic Techniques
http://dl.acm.org/citation.cfm?id=1377347
http://books.nips.cc/papers/files/nips20/NIPS2007_0699.pdf
http://www.mcs.anl.gov/research/projects/tao/
http://martin.zinkevich.org/publications/ratliff_nathan_2007_3.pdf
http://dl.acm.org/citation.cfm?id=1273598
http://arxiv.org/abs/0911.0491
http://pages.cs.wisc.edu/~brecht/papers/hogwildTR.pdf