Scalable Machine Learning: 4. Optimization. Alex Smola, Yahoo! (PowerPoint presentation)



SLIDE 1

Scalable Machine Learning

  • 4. Optimization

Alex Smola Yahoo! Research and ANU

http://alex.smola.org/teaching/berkeley2012 Stat 260 SP 12

SLIDE 2
  • 4. Optimization
SLIDE 3

Basic Techniques

  • Gradient descent
  • Newton's method
  • Conjugate Gradient Descent
  • Broyden-Fletcher-Goldfarb-Shanno (BFGS)
  • Constrained Convex Optimization
  • Properties
  • Lagrange function
  • Wolfe dual
  • Batch methods
  • Distributed subgradient
  • Bundle methods
  • Online methods
  • Unconstrained subgradient
  • Gradient projections
  • Parallel optimization

Optimization

SLIDE 4

Why

SLIDE 5

Parameter Estimation

  • Maximum a Posteriori with Gaussian Prior
  • We have lots of data
  • Does not fit on single machine
  • Bandwidth constraints
  • May grow in real time
  • Regularized Risk Minimization yields similar problems

(more on this in a later lecture)

−log p(θ|X) = (1/2σ²)‖θ‖²  +  Σ_{i=1}^m [g(θ) − ⟨φ(x_i), θ⟩]  +  const.
              (prior)          (data)

SLIDE 6

Batch and Online

  • Batch
  • Very large dataset available
  • Require parameter only at the end
  • optical character recognition
  • speech recognition
  • image annotation / categorization
  • machine translation
  • Online
  • Spam filtering
  • Computational advertising
  • Content recommendation / collaborative filtering
SLIDE 7

Many parameters

  • 100 million to 1 billion users

Personalized content provision; impossible to adjust all parameters manually or by heuristics

  • 1,000-10,000 computers

Cannot exchange all data between machines; requires distributed, multicore optimization

  • Large networks

Nontrivial parameter dependence structure

SLIDE 8

4.1 Unconstrained Problems

SLIDE 9

Convexity 101

SLIDE 10

Convexity 101

[figure: a convex set, a convex function, and a nonconvex counterexample]

  • Convex set: for x, x′ ∈ X it follows that λx + (1 − λ)x′ ∈ X for λ ∈ [0, 1]
  • Convex function: λf(x) + (1 − λ)f(x′) ≥ f(λx + (1 − λ)x′) for λ ∈ [0, 1]
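The convex-function inequality above can be spot-checked numerically. A minimal sketch, assuming f(x) = ‖x‖² as an illustrative convex function (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: float((x ** 2).sum())        # a convex function on R^2

ok = True
for _ in range(1000):
    x, x2 = rng.normal(size=2), rng.normal(size=2)
    lam = rng.uniform()
    # convexity: f(lam*x + (1-lam)*x2) <= lam*f(x) + (1-lam)*f(x2)
    ok = ok and f(lam * x + (1 - lam) * x2) <= lam * f(x) + (1 - lam) * f(x2) + 1e-12
```

For a nonconvex f (e.g. sin) the check fails for some random pairs, which is exactly the defining difference.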

SLIDE 12

Convexity 101

  • Below-set {x : f(x) ≤ c} of a convex function is convex
  • Convex functions don't have local minima

Proof by contradiction: linear interpolation breaks the local-minimum condition

f(λx + (1 − λ)x′) ≤ λf(x) + (1 − λ)f(x′) ≤ c, hence λx + (1 − λ)x′ ∈ X for x, x′ ∈ X

SLIDE 14

Convexity 101

  • Vertex of a convex set: a point that cannot be extrapolated within the convex set, i.e. λx + (1 − λ)x′ ∉ X for λ > 1 and all x′ ∈ X
  • Convex hull: co X := { x̄ : x̄ = Σ_{i=1}^n α_i x_i where n ∈ ℕ, α_i ≥ 0 and Σ_{i=1}^n α_i ≤ 1 }
  • Convex hull of a set is a convex set (proof trivial)

SLIDE 15

Convexity 101

  • Supremum on the convex hull: sup_{x∈X} f(x) = sup_{x∈co X} f(x)
  • Maximum of a convex function on a convex set is obtained at a vertex
  • Proof by contradiction:
  • Assume the maximum lies inside a line segment
  • Then the function cannot be convex
  • Hence the maximum must be at a vertex
SLIDE 16

Gradient descent

SLIDE 17

One dimensional problems

  • Key Idea
  • For differentiable f, search for x with f′(x) = 0
  • Interval bisection (the derivative is monotonic)
  • Need log₂(B − A) − log₂ ε iterations to converge
  • Can be extended to nondifferentiable problems

(exploit convexity in the upper bound and keep 5 points)

Require: a, b, precision ε
Set A = a, B = b
repeat
  if f′((A + B)/2) > 0 then
    B = (A + B)/2   (solution is on the left)
  else
    A = (A + B)/2
  end if
until (B − A) · min(|f′(A)|, |f′(B)|) ≤ ε
Output: x = (A + B)/2
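The bisection routine can be sketched directly in Python; `fprime` is the assumed derivative callable, and f(x) = (x − 1)² with f′(x) = 2(x − 1) serves as an illustrative test function:

```python
def bisect_minimize(fprime, a, b, eps=1e-8):
    """Minimize a convex differentiable f on [a, b] by bisecting on f'.

    Relies on f' being monotonically increasing (convexity): the sign of
    f'((A+B)/2) tells us on which side of the midpoint the minimum lies.
    """
    A, B = a, b
    while (B - A) * min(abs(fprime(A)), abs(fprime(B))) > eps:
        mid = (A + B) / 2
        if fprime(mid) > 0:
            B = mid          # minimum is to the left
        else:
            A = mid          # minimum is to the right
    return (A + B) / 2

# f(x) = (x - 1)^2 has derivative 2(x - 1) and its minimum at x = 1
x = bisect_minimize(lambda x: 2 * (x - 1), -2.0, 3.0)
```

The interval halves every step, which gives the logarithmic iteration count claimed above.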

SLIDE 18
  • Key idea
  • Gradient points in the descent direction
  • Locally the gradient is a good approximation of the objective function
  • GD with Line Search
  • Get descent direction
  • Unconstrained line search
  • Exponential convergence for strongly convex objective

Gradient descent

given a starting point x ∈ dom f
repeat
  1. Δx := −∇f(x)
  2. Line search: choose step size t via exact or backtracking line search
  3. Update: x := x + t·Δx
until stopping criterion is satisfied
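A minimal sketch of this loop with a backtracking (Armijo) line search; the quadratic test objective and the parameters alpha, beta are illustrative assumptions:

```python
import numpy as np

def gradient_descent(f, grad, x0, alpha=0.3, beta=0.5, tol=1e-6, max_iter=1000):
    """Gradient descent with backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        dx = -g                      # descent direction
        t = 1.0
        # shrink t until sufficient decrease: f(x + t*dx) <= f(x) + alpha*t*<g, dx>
        while f(x + t * dx) > f(x) + alpha * t * g.dot(dx):
            t *= beta
        x = x + t * dx
    return x

# minimize f(x) = x^T Q x with Q = diag(1, 10), condition number M/m = 10
Q = np.diag([1.0, 10.0])
xmin = gradient_descent(lambda x: x.dot(Q).dot(x),
                        lambda x: 2 * Q.dot(x),
                        x0=[5.0, -3.0])
```

For this strongly convex objective the iterates contract linearly, consistent with the exponential-convergence claim above.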

SLIDE 19

Convergence Analysis

  • Strongly convex function:
    f(y) ≥ f(x) + ⟨y − x, ∂_x f(x)⟩ + (m/2)‖y − x‖²
  • Progress guarantee (minimum x*): lower bound on the minimum, obtained by setting y = x*:
    f(x) − f(x*) ≤ ⟨x − x*, ∂_x f(x)⟩ − (m/2)‖x* − x‖²
      ≤ sup_y [⟨x − y, ∂_x f(x)⟩ − (m/2)‖y − x‖²] = (1/2m)‖∂_x f(x)‖²

SLIDE 20

Convergence Analysis

  • Bounded Hessian:
    f(y) ≤ f(x) + ⟨y − x, ∂_x f(x)⟩ + (M/2)‖y − x‖²
  • For a gradient step x ← x − t·g_x this gives
    f(x − t g_x) ≤ f(x) − t‖g_x‖² + (M/2)t²‖g_x‖² ≤ f(x) − (1/2M)‖g_x‖²  (at t = 1/M)
  • Using strong convexity (‖g_x‖² ≥ 2m[f(x) − f(x*)] from the previous slide):
    f(x − t g_x) − f(x*) ≤ [f(x) − f(x*)][1 − m/M]
  • Iteration bound: (M/m) log([f(x) − f(x*)]/ε)

SLIDE 21

Distributed Implementation

SLIDE 22

Basic steps

given a starting point x ∈ dom f
repeat
  1. Δx := −∇f(x)
  2. Line search: choose step size t via exact or backtracking line search
  3. Update: x := x + t·Δx
until stopping criterion is satisfied

  • distribute data over several machines
  • compute partial gradients and aggregate
  • update value in search direction and feed back
  • communicate final value to each machine

SLIDE 27

Basic steps

given a starting point x ∈ dom f
repeat
  1. Δx := −∇f(x)
  2. Line search: choose step size t via exact or backtracking line search
  3. Update: x := x + t·Δx
until stopping criterion is satisfied

distribute data over several machines; compute partial gradients and aggregate:

  • Map: compute gradient on subblock and emit
  • Reduce: aggregate parts of the gradients
  • Communicate the aggregate gradient back to all machines

SLIDE 30

Basic steps

given a starting point x ∈ dom f
repeat
  1. Δx := −∇f(x)
  2. Line search: choose step size t via exact or backtracking line search
  3. Update: x := x + t·Δx
until stopping criterion is satisfied

update value in search direction, feed back, and communicate the final value to each machine:

  • Repeat until converged
  • Map: compute function & derivative at given parameter t
  • Reduce: aggregate parts of function and derivative
  • Decide based on f(x) and f′(x) which interval to pursue
  • Send updated parameter to all machines

SLIDE 33

Scalability analysis

  • Linear time in number of instances
  • Linear storage in problem size, not data
  • Logarithmic time in accuracy
  • ‘perfect’ scalability
  • 10s of passes through dataset for each iteration

(line search is very expensive)

  • MapReduce loses state at each iteration
  • Single master as bottleneck

(important if the state space is several GB)
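The map/reduce gradient aggregation described in the preceding slides can be sketched with Python threads standing in for machines; the squared loss, shard layout, and learning rate below are illustrative assumptions, not part of the lecture:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def partial_gradient(shard, w):
    """Map step: gradient of the squared loss on one data shard."""
    X, y = shard
    return X.T @ (X @ w - y)

def distributed_gradient_step(shards, w, eta):
    """One iteration: scatter w, map partial gradients, reduce by summing."""
    with ThreadPoolExecutor(max_workers=len(shards)) as ex:
        parts = list(ex.map(lambda s: partial_gradient(s, w), shards))
    return w - eta * sum(parts)    # reduce + update; w is then broadcast again

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
shards = [(X[i::4], y[i::4]) for i in range(4)]   # 4 "machines"
w = np.zeros(3)
for _ in range(200):
    w = distributed_gradient_step(shards, w, eta=0.005)
```

Because the shards partition the data, the summed partial gradients equal the full-batch gradient exactly, which is the point of the map/reduce pattern.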

SLIDE 34

A Better Algorithm

  • Avoiding the line search
  • Not used in the convergence proof anyway
  • Simply pick the update x ← x − (1/M) ∂_x f(x)
  • Only a single pass through the data per iteration
  • Only a single MapReduce pass per iteration
  • Logarithmic iteration bound (as before): (M/m) log([f(x) − f(x*)]/ε)

SLIDE 35

Newton’s Method

Isaac Newton

SLIDE 36

Newton Method

  • Convex objective function f
  • Nonnegative second derivative: ∂²_x f(x) ⪰ 0
  • Taylor expansion:
    f(x + δ) = f(x) + ⟨δ, ∂_x f(x)⟩ + ½ δ⊤ ∂²_x f(x) δ + O(‖δ‖³)
    (with ∂_x f the gradient and ∂²_x f the Hessian)
  • Minimize the approximation & iterate until converged:
    x ← x − [∂²_x f(x)]⁻¹ ∂_x f(x)
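The update above, as a minimal sketch; the quartic test function and the small diagonal regularizer (to keep the Hessian invertible at the optimum) are illustrative assumptions:

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method: x <- x - [f''(x)]^{-1} f'(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # solve the linear system rather than forming the inverse explicitly
        x = x - np.linalg.solve(hess(x), g)
    return x

# f(x) = x1^4 + x2^2: gradient (4 x1^3, 2 x2), Hessian diag(12 x1^2, 2)
grad = lambda x: np.array([4 * x[0] ** 3, 2 * x[1]])
hess = lambda x: np.diag([12 * x[0] ** 2 + 1e-12, 2.0])
xmin = newton(grad, hess, [1.0, 1.0])
```

On the quadratic coordinate Newton converges in one step; on the quartic coordinate it contracts by a constant factor per step, matching the two-regime discussion on the next slides.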

SLIDE 37

Convergence Analysis

  • There exists a region around the optimum where Newton's method converges quadratically if f is twice continuously differentiable
  • For some region around x* the gradient is well approximated by the Taylor expansion:
    ‖∂_x f(x*) − ∂_x f(x) − ⟨x* − x, ∂²_x f(x)⟩‖ ≤ γ‖x* − x‖²
  • Expand the Newton update:
    ‖x_{n+1} − x*‖ = ‖x_n − x* − [∂²_x f(x_n)]⁻¹ [∂_x f(x_n) − ∂_x f(x*)]‖
      = ‖[∂²_x f(x_n)]⁻¹ [∂²_x f(x_n)(x_n − x*) − ∂_x f(x_n) + ∂_x f(x*)]‖
      ≤ γ ‖[∂²_x f(x_n)]⁻¹‖ ‖x_n − x*‖²
SLIDE 38

Convergence Analysis

  • Two convergence regimes
  • As slow as gradient descent outside the region where the Taylor expansion is good
  • Quadratic convergence once the bound
    ‖∂_x f(x*) − ∂_x f(x) − ⟨x* − x, ∂²_x f(x)⟩‖ ≤ γ‖x* − x‖²
    holds:
    ‖x_{n+1} − x*‖ ≤ γ ‖[∂²_x f(x_n)]⁻¹‖ ‖x_n − x*‖²
  • Newton's method is affine invariant (proof by chain rule)

See Boyd and Vandenberghe, Chapter 9.5 for much more

SLIDE 39

Newton method rescales space

[figure: iterates x(0), x(1), x(2) zig-zagging under the wrong metric; from Boyd & Vandenberghe]

SLIDE 40

Newton method rescales space

[figure: Newton step x + Δx_nt vs. normalized steepest descent step x + Δx_nsd; a locally adaptive metric; from Boyd & Vandenberghe]

SLIDE 41

Parallel Newton Method

  • Good rate of convergence
  • Few passes through data needed
  • Parallel aggregation of gradient and Hessian
  • Gradient requires O(d) data
  • Hessian requires O(d2) data
  • Update step is O(d3) & nontrivial to parallelize
  • Use it only for low dimensional problems
SLIDE 42

Conjugate Gradient Descent

SLIDE 43

Key Idea

  • Minimizing the quadratic function f(x) = ½ x⊤Kx − l⊤x + c takes cubic time (e.g. Cholesky factorization)
  • Instead use matrix-vector products and orthogonalization
  • Vectors x, x′ are K-orthogonal if x⊤Kx′ = 0 (for K ⪰ 0)
  • m mutually K-orthogonal vectors x_i ∈ ℝ^m
  • form a basis
  • allow the expansion z = Σ_{i=1}^m x_i (x_i⊤Kz)/(x_i⊤Kx_i)
  • solve the linear system y = Kz via z = Σ_{i=1}^m x_i (x_i⊤y)/(x_i⊤Kx_i)

SLIDE 44
  • m mutually K-orthogonal vectors x_i ∈ ℝ^m
  • form a basis
  • allow expansion
  • solve linear systems

Proof

  • Linear independence by contradiction: if Σ_i α_i x_i = 0 then
    0 = x_j⊤K Σ_i α_i x_i = α_j x_j⊤Kx_j, hence α_j = 0 for all j
  • Reconstruction: expand z = Σ_i α_i x_i into the basis; then
    x_j⊤Kz = x_j⊤K Σ_i α_i x_i = α_j x_j⊤Kx_j, which yields z = Σ_{i=1}^m x_i (x_i⊤Kz)/(x_i⊤Kx_i)
  • For the linear system plug in y = Kz:
    z = Σ_{i=1}^m x_i (x_i⊤y)/(x_i⊤Kx_i)

SLIDE 45

???

  • Need vectors x_i
  • Need to orthogonalize the vectors
  • How to select them?
  • K-orthogonal vectors whiten the space, since
    f(x) = ½ x⊤x − l⊤x + c has the trivial solution x = l

SLIDE 46

Conjugate Gradient Descent

  • Gradient computation: f(x) = ½ x⊤Kx − l⊤x + c, hence g(x) = Kx − l
  • Algorithm

initialize x₀ and v₀ = g₀ = Kx₀ − l and i = 0
repeat
  x_{i+1} = x_i − v_i (g_i⊤v_i)/(v_i⊤Kv_i)
  g_{i+1} = Kx_{i+1} − l
  v_{i+1} = −g_{i+1} + v_i (g_{i+1}⊤Kv_i)/(v_i⊤Kv_i)   (deflation step, keeps directions K-orthogonal)
  i ← i + 1
until g_i = 0
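The loop above, transcribed into Python; the small 2×2 system is an illustrative assumption, and an iteration cap replaces the exact g_i = 0 test for floating point:

```python
import numpy as np

def conjugate_gradient(K, l, tol=1e-10):
    """CG for minimizing f(x) = 1/2 x^T K x - l^T x (i.e. solving K x = l)."""
    x = np.zeros_like(l)
    g = K @ x - l                        # gradient
    v = g.copy()                         # first search direction
    for _ in range(5 * len(l)):          # n steps suffice in exact arithmetic
        if np.linalg.norm(g) <= tol:
            break
        Kv = K @ v
        x = x - v * (g @ v) / (v @ Kv)
        g = K @ x - l
        v = -g + v * (g @ Kv) / (v @ Kv)   # deflation: keep K-orthogonality
    return x

K = np.array([[4.0, 1.0], [1.0, 3.0]])
l = np.array([1.0, 2.0])
x = conjugate_gradient(K, l)
```

For this 2×2 system the iterate is exact after two steps, as the K-orthogonal basis argument predicts.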

SLIDE 47

Proof - Deflation property

  • First assume that the v_i are K-orthogonal and show that x_{i+1} is optimal in the span of {v_1 … v_i}
  • Enough if we show that v_j⊤g_{i+1} = 0 for all j ≤ i
  • For j = i expand:
    v_i⊤g_{i+1} = v_i⊤[Kx_i − l − Kv_i (g_i⊤v_i)/(v_i⊤Kv_i)]
      = v_i⊤g_i − v_i⊤Kv_i (g_i⊤v_i)/(v_i⊤Kv_i) = 0
  • For smaller j, v_j⊤g_i = 0 for all j < i is a consequence of K-orthogonality

SLIDE 48

Proof - K orthogonality

  • Need to check that v_{i+1} is K-orthogonal to all v_j (the rest is automatically true by construction):
    v_j⊤Kv_{i+1} = −v_j⊤Kg_{i+1} + v_j⊤Kv_i (g_{i+1}⊤Kv_i)/(v_i⊤Kv_i)
    For j < i, the second term is 0 by K-orthogonality and the first is 0 by deflation; for j = i the two terms cancel.

SLIDE 49

Properties

  • Subspace expansion method: optimal over the Krylov subspace (g, Kg, K²g, K³g, …)
  • Focuses on leading eigenvalues
  • Often sufficient to take only a few steps (whenever the eigenvalues decay rapidly)

SLIDE 50

Extensions

  • Generic method: compute the Hessian K_i := f″(x_i) and update α_i, β_i with
    α_i = −(g_i⊤v_i)/(v_i⊤K_i v_i),  β_i = (g_{i+1}⊤K_i v_i)/(v_i⊤K_i v_i)
    This requires calculation of the Hessian at each iteration.
  • Fletcher–Reeves [163]: find α_i via a line search, α_i = argmin_α f(x_i + αv_i), and use
    β_i = (g_{i+1}⊤g_{i+1})/(g_i⊤g_i)
  • Polak–Ribière [398]: find α_i via a line search, α_i = argmin_α f(x_i + αv_i), and use
    β_i = ((g_{i+1} − g_i)⊤g_{i+1})/(g_i⊤g_i)

Experimentally, Polak–Ribière tends to be better than Fletcher–Reeves.
(α_i and β_i enter the x and v updates of the algorithm above.)

SLIDE 51

BFGS algorithm Broyden-Fletcher-Goldfarb-Shanno

SLIDE 52

Basic Idea

  • Newton-like method to compute the descent direction: δ_i = B_i⁻¹ ∂_x f(x_i)
  • Line search on f in direction δ_i: x_{i+1} = x_i − α_i δ_i
  • Update B with a rank-2 matrix: B_{i+1} = B_i + u_i u_i⊤ + v_i v_i⊤
  • Require that the quasi-Newton condition holds:
    B_{i+1}(x_{i+1} − x_i) = ∂_x f(x_{i+1}) − ∂_x f(x_i)
    which yields (with g_i the gradient difference)
    B_{i+1} = B_i + (g_i g_i⊤)/(α_i δ_i⊤g_i) − (B_i δ_i δ_i⊤B_i)/(δ_i⊤B_i δ_i)

SLIDE 53

Properties

  • Simple rank 2 update for B
  • Use matrix inversion lemma to update inverse
  • Memory-limited versions L-BFGS
  • Use toolbox if possible (TAO, MATLAB)

(typically slower if you implement it yourself)

  • Works well for nonlinear nonconvex objectives

(often even for nonsmooth objectives)
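In the toolbox spirit of the slide, a minimal sketch using SciPy's memory-limited BFGS; the Rosenbrock-style test objective is an illustrative assumption:

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function: nonconvex, a standard stress test for quasi-Newton methods
def f(x):
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

def grad(x):
    return np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
        200 * (x[1] - x[0] ** 2),
    ])

# L-BFGS keeps only a few (s, y) pairs instead of the full B matrix
res = minimize(f, x0=np.array([-1.2, 1.0]), jac=grad, method="L-BFGS-B")
```

The minimum is at (1, 1); supplying the analytic gradient (`jac`) avoids costly finite differences, which matters at scale.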

SLIDE 54

4.2 Constrained Convex Problems

SLIDE 55

Basic Convexity

SLIDE 56

Constrained Convex Minimization

  • Optimization problem: minimize_x f(x) subject to c_i(x) ≤ 0 for all i
  • Common constraints
  • linear inequality constraints: ⟨w_i, x⟩ + b_i ≤ 0
  • quadratic cone constraints: x⊤Qx + b⊤x ≤ c with Q ⪰ 0
  • semidefinite constraints: M ⪰ 0, or M₀ + Σ_i x_i M_i ⪰ 0
  • Equality is a special case. Why?

SLIDE 58

Example - Support Vectors

[figure: maximum-margin hyperplane {x | ⟨w, x⟩ + b = 0} with margin planes ⟨w, x⟩ + b = ±1 touching the classes y_i = +1 and y_i = −1; margin width 2/‖w‖]

minimize_{w,b} ½‖w‖² subject to y_i[⟨w, x_i⟩ + b] ≥ 1

Note: ⟨w, x₁⟩ + b = +1 and ⟨w, x₂⟩ + b = −1, hence ⟨w, x₁ − x₂⟩ = 2 and therefore
⟨w/‖w‖, x₁ − x₂⟩ = 2/‖w‖ (the margin)

SLIDE 59

Lagrange Multipliers

  • Lagrange function: L(x, α) := f(x) + Σ_{i=1}^n α_i c_i(x) where α_i ≥ 0
  • Saddlepoint condition: if there are x* and nonnegative α* such that
    L(x*, α) ≤ L(x*, α*) ≤ L(x, α*) for all x and all α ≥ 0,
    then x* is an optimal solution to the constrained optimization problem

SLIDE 60

Proof

  • From the first inequality L(x*, α) ≤ L(x*, α*) we get
    (α_i − α_i*) c_i(x*) ≤ 0 for all α_i ≥ 0, hence c_i(x*) ≤ 0 and x* is feasible
  • Setting α_i = 0 yields the KKT condition α_i* c_i(x*) = 0
  • Consequently we have, for all feasible x,
    f(x*) = L(x*, α*) ≤ L(x, α*) = f(x) + Σ_i α_i* c_i(x) ≤ f(x)

This proves optimality.

SLIDE 61

Constraint gymnastics (all three conditions are equivalent)

  • Slater's condition

There exists some x such that c_i(x) < 0 for all i

  • Karlin's condition

For all nonnegative α there exists some x such that Σ_i α_i c_i(x) ≤ 0

  • Strict constraint qualification

The feasible region contains at least two distinct elements and there exists an x in X such that all c_i(x) are strictly convex at x with respect to X

SLIDE 62

Necessary Kuhn-Tucker Conditions

  • Assume the optimization problem
  • satisfies the constraint qualifications
  • has convex differentiable objective + constraints
  • Then the KKT conditions are necessary & sufficient:
    ∂_x L(x*, α*) = ∂_x f(x*) + Σ_i α_i* ∂_x c_i(x*) = 0  (saddlepoint in x*)
    ∂_{α_i} L(x*, α*) = c_i(x*) ≤ 0  (saddlepoint in α*)
    Σ_i α_i* c_i(x*) = 0  (vanishing KKT gap)

This yields an algorithm for solving optimization problems: solve for the saddlepoint and the KKT conditions.

SLIDE 63

Proof

f(x) − f(x*) ≥ [∂_x f(x*)]⊤(x − x*)  (by convexity)
  = −Σ_i α_i* [∂_x c_i(x*)]⊤(x − x*)  (by saddlepoint in x*)
  ≥ −Σ_i α_i* (c_i(x) − c_i(x*))  (by convexity)
  = −Σ_i α_i* c_i(x)  (by vanishing KKT gap)
  ≥ 0  (for feasible x)

SLIDE 64

Linear and Quadratic Programs

SLIDE 65

Linear Programs

  • Objective: minimize_x c⊤x subject to Ax + d ≤ 0
  • Lagrange function: L(x, α) = c⊤x + α⊤(Ax + d)
  • Optimality conditions (plug back into L):
    ∂_x L(x, α) = A⊤α + c = 0
    ∂_α L(x, α) = Ax + d ≤ 0
    0 = α⊤(Ax + d), 0 ≤ α
  • Dual problem: maximize_α d⊤α subject to A⊤α + c = 0 and α ≥ 0

SLIDE 68

Linear Programs

  • Primal: minimize_x c⊤x subject to Ax + d ≤ 0
  • Dual: maximize_α d⊤α subject to A⊤α + c = 0 and α ≥ 0
  • Free variables become equality constraints
  • Equality constraints become free variables
  • Inequalities become inequalities
  • Dual of dual is primal
SLIDE 69
  • Objective
  • Lagrange function
  • Optimality conditions

Quadratic Programs

plug into L

minimize

x

1 2x>Qx + c>x subject to Ax + d ≤ 0 L(x, α) = 1 2x>Qx + c>x + α>(Ax + d) ∂xL(x, α) = Qx + A>α + c = 0 ∂αL(x, α) = Ax + d ≤ 0 0 = α>(Ax + d) 0 ≤ α

SLIDE 70

Quadratic Program (dual)

  • Eliminating x from the Lagrangian via Qx + A⊤α + c = 0
  • Lagrange function:
    L(x, α) = ½x⊤Qx + c⊤x + α⊤(Ax + d)
      = −½x⊤Qx + α⊤d
      = −½(A⊤α + c)⊤Q⁻¹(A⊤α + c) + α⊤d
      = −½α⊤AQ⁻¹A⊤α + α⊤[d − AQ⁻¹c] − ½c⊤Q⁻¹c
    subject to α ≥ 0
SLIDE 72

Quadratic Programs

  • Primal: minimize_x ½x⊤Qx + c⊤x subject to Ax + d ≤ 0
  • Dual: minimize_α ½α⊤AQ⁻¹A⊤α + α⊤[AQ⁻¹c − d] subject to α ≥ 0
  • Dual constraints are simpler
  • Possibly many fewer variables
  • Dual of dual is not (always) primal (e.g. in SVMs x is in a Hilbert space)

SLIDE 73

Interior Point Solvers

SLIDE 74

Constrained Newton Method

  • Objective: minimize_x f(x) subject to Ax = b
  • Lagrange function and optimality conditions:
    L(x, α) = f(x) + α⊤[Ax − b]
    ∂_x L(x, α) = ∂_x f(x) + A⊤α = 0
    ∂_α L(x, α) = Ax − b = 0
  • Taylor expansion of the gradient:
    ∂_x f(x) = ∂_x f(x₀) + ∂²_x f(x₀)[x − x₀] + O(‖x − x₀‖²)
  • Plug back into the optimality conditions and solve:
    [ ∂²_x f(x₀)  A⊤ ] [x]   [ ∂²_x f(x₀)x₀ − ∂_x f(x₀) ]
    [ A           0  ] [α] = [ b                          ]
  • No need to be initially feasible!
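The KKT linear system above, solved once with NumPy; the quadratic objective ½x⊤Px, constraint x₁ + x₂ = 1, and the infeasible starting point are illustrative assumptions. Since f is quadratic, one Newton step is exact:

```python
import numpy as np

P = np.array([[2.0, 0.5], [0.5, 1.0]])   # Hessian of f(x) = 1/2 x^T P x
A = np.array([[1.0, 1.0]])               # constraint: x1 + x2 = 1
b = np.array([1.0])

x0 = np.array([5.0, -5.0])               # need not be feasible
g0 = P @ x0                              # gradient of f at x0
n, m = 2, 1

# KKT system: [H A^T; A 0] [x; alpha] = [H x0 - g(x0); b]
KKT = np.block([[P, A.T], [A, np.zeros((m, m))]])
rhs = np.concatenate([P @ x0 - g0, b])
sol = np.linalg.solve(KKT, rhs)
x, alpha = sol[:n], sol[n:]
```

The solution satisfies both optimality conditions exactly: Ax = b (primal feasibility) and Px + A⊤α = 0 (stationarity), even though x₀ violated the constraint.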
SLIDE 75

General Strategy

  • Optimality conditions
  • Solve equations repeatedly.
  • Yields primal and dual solution variables
  • Yields size of primal/dual gap
  • Feasibility not necessary at start
  • KKT conditions are problematic - need approximation

∂_x L(x*, α*) = ∂_x f(x*) + Σ_i α_i* ∂_x c_i(x*) = 0  (saddlepoint in x*)
∂_{α_i} L(x*, α*) = c_i(x*) ≤ 0  (saddlepoint in α*)
Σ_i α_i* c_i(x*) = 0  (vanishing KKT gap)

SLIDE 76

Quadratic Programs

  • Optimality conditions (with slack ξ):
    Qx + A⊤α + c = 0
    Ax + d + ξ = 0
    α_i ξ_i = 0,  α, ξ ≥ 0
  • Relax the KKT conditions: α_i ξ_i = 0 relaxed to α_i ξ_i = μ
  • Solve a linearization of the nonlinear system:
    [ Q  A⊤ ] [δx]   [c_x]
    [ A  −D ] [δα] = [c_α]
  • Predictor/corrector step for the nonlinearity
  • Iterate until converged

SLIDE 77

Implementation details

  • Dominant cost is solving reduced KKT system

Solve linear system with (dense) Q and A

  • Solve linear system twice (predictor / corrector)
  • Update steps are only taken far enough to

ensure nonnegativity of dual and slack

  • Tighten up KKT constraints by decreasing μ
  • Only 10-20 iterations typically needed

[ Q  A⊤ ] [δx]   [c_x]
[ A  −D ] [δα] = [c_α]

SLIDE 78

Solver Software (nontrivial to parallelize)

  • OOQP: object-oriented quadratic programming solver, http://pages.cs.wisc.edu/~swright/ooqp/
  • LOQO: interior point path-following solver, http://www.princeton.edu/~rvdb/loqo/LOQO.html
  • HOPDM: linear and nonlinear infeasible IP solver, http://www.maths.ed.ac.uk/~gondzio/software/hopdm.html
  • CVXOPT: Python package for convex optimization, http://abel.ee.ucla.edu/cvxopt/
  • SeDuMi: semidefinite programming solver, http://sedumi.ie.lehigh.edu/

SLIDE 80

Bundle Methods

simple parallelization

SLIDE 81

Some optimization problems

  • Density estimation:
    minimize_θ −Σ_{i=1}^m log p(x_i|θ) − log p(θ)
    equivalently
    minimize_θ Σ_{i=1}^m [g(θ) − ⟨φ(x_i), θ⟩] + (1/2σ²)‖θ‖²
  • Penalized regression:
    minimize_θ Σ_{i=1}^m l(y_i, ⟨φ(x_i), θ⟩) + (1/2σ²)‖θ‖²
    (e.g. squared loss; the second term is the regularizer)

SLIDE 82

Basic Idea

  • Loss
  • Convex but expensive to compute
  • Line search just as expensive as new computation
  • Gradient almost free with function value computation
  • Easy to compute in parallel
  • Regularizer
  • Convex and cheap to compute and to optimize
  • Strategy
  • Compute tangents on loss
  • Provides lower bound on objective
  • Solve dual optimization problem (fewer parameters)

minimize_θ Σ_{i=1}^m l_i(θ) + λΩ[θ]

SLIDE 83

Bundle Method

[figure: piecewise-linear lower bound (bundle of tangents) on the empirical risk]

SLIDE 84

Regularized Risk Minimization

minimize_w R_emp[w] + λΩ[w]

Taylor approximation for R_emp[w]:
R_emp[w] ≥ R_emp[w_{t−1}] + ⟨w − w_{t−1}, ∂_w R_emp[w_{t−1}]⟩ = ⟨a_t, w⟩ + b_t
where a_t = ∂_w R_emp[w_{t−1}] and b_t = R_emp[w_{t−1}] − ⟨a_t, w_{t−1}⟩

Bundle bound (a lower bound):
R_emp[w] ≥ R_t[w] := max_{i≤t} ⟨a_i, w⟩ + b_i

The regularizer Ω[w] solves stability problems.

SLIDE 85

Pseudocode

Initialize t = 0, w₀ = 0, a₀ = 0, b₀ = 0
repeat
  Find minimizer w_t := argmin_w R_t(w) + λΩ[w]
  Compute gradient a_{t+1} and offset b_{t+1}
  Increment t ← t + 1
until ε_t ≤ ε

Convergence monitor: ε_t = R_{t+1}[w_t] − R_t[w_t]
Since R_{t+1}[w_t] = R_emp[w_t] (Taylor approximation) we have
R_{t+1}[w_t] + λΩ[w_t] ≥ min_w R_emp[w] + λΩ[w] ≥ R_t[w_t] + λΩ[w_t]
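The core of the bundle construction is that the tangents (a_t, b_t) minorize the convex empirical risk. A minimal numerical sketch; the one-dimensional risk (a sum of two logistic-type losses) and the visited points are illustrative assumptions:

```python
import numpy as np

# a convex 1-d "empirical risk" and its derivative
remp = lambda w: np.log(1 + np.exp(-w)) + np.log(1 + np.exp(w - 2))
grad = lambda w: -1 / (1 + np.exp(w)) + 1 / (1 + np.exp(2 - w))

pts = [-3.0, 0.0, 1.0, 4.0]                      # iterates w_t visited so far
a = [grad(w) for w in pts]                        # tangent slopes  a_t
b = [remp(w) - grad(w) * w for w in pts]          # tangent offsets b_t

# bundle lower bound R_t(w) = max_i <a_i, w> + b_i
R_t = lambda w: max(ai * w + bi for ai, bi in zip(a, b))

grid = np.linspace(-5, 5, 101)
gap = max(remp(w) - R_t(w) for w in grid)         # always >= 0 for convex remp
```

Each iteration adds one tangent, tightening max_i ⟨a_i, w⟩ + b_i from below; the solver then minimizes this cheap piecewise-linear model plus λΩ[w] instead of the expensive risk.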

SLIDE 86

Dual Problem

Good news: for Ω[w] = ½‖w‖²₂ the dual optimization problem is a quadratic program regardless of the choice of the empirical risk R_emp[w]:

minimize_β (1/2λ) β⊤AA⊤β − β⊤b subject to β_i ≥ 0 and ‖β‖₁ = 1

The primal coefficient w is given by w = −λ⁻¹A⊤β.

General result: use the Fenchel–Legendre dual of Ω[w], e.g. ‖·‖₁ ↔ ‖·‖_∞.

Very cheap variant: can even use a simple line search for the update (almost as good).

SLIDE 87

Properties

  • Parallelization: the empirical risk is a sum of many terms (MapReduce); the gradient is a sum of many terms, gathered from the cluster. Possible even for multivariate performance scores. Data stays local; one can combine data from competing entities.
  • Solver independent of loss: no need to change the solver for a new loss.
  • Loss independent of solver/regularizer: add a new regularizer without re-implementing the loss.
  • Line search variant: optimization does not require a QP solver at all! Update along the gradient direction in the dual; we only need inner products on gradients.

SLIDE 88

Implementation

[figure: several mappers compute the empirical risk in parallel; reducers aggregate and feed a bundle solver]

SLIDE 89

Guarantees

Theorem: the number of iterations to reach ε precision is bounded by
n ≤ log₂(R_emp[0]/G²) + 8G²/ε − 4
steps. If the Hessian of R_emp[w] is bounded, convergence to any ε ≤ λ/2 takes at most
n ≤ log₂(R_emp[0]/(4G²)) + 4 max[0, 1 − 8G²H*/λ] − (4H*/λ) log(2ε)
steps.

Advantages: linear convergence for smooth loss. For non-smooth loss almost as good in practice (as long as it is smooth on a coarse scale). Does not require a primal line search.

SLIDE 90

Proof idea

  • Duality argument: the dual of R_i[w] + λΩ[w] lower-bounds the minimum of the regularized risk R_emp[w] + λΩ[w], while R_{i+1}[w_i] + λΩ[w_i] is an upper bound. Show that the gap γ_i := R_{i+1}[w_i] − R_i[w_i] vanishes.
  • Dual improvement: give a lower bound on the increase in the dual problem in terms of γ_i and the subgradient ∂_w[R_emp[w] + λΩ[w]]. For unbounded Hessian we have δγ = O(γ²); for bounded Hessian, δγ = O(γ).
  • Convergence: solve the difference equation in γ_t to get the desired result.

SLIDE 91

More

  • Dual decomposition methods
  • Optimization problem with many constraints
  • Replicate variable & add equality constraints
  • Solve relaxed problem
  • Gradient descent in dual variables
  • Prox operator
  • Problems with smooth & nonsmooth objective
  • Generalization of Bregman projections
SLIDE 92

4.3 Online Methods

SLIDE 93

The Perceptron

SLIDE 94

The Perceptron

[figure: separating hyperplane between spam and ham examples]
SLIDE 96

The Perceptron

  • Nothing happens if classified correctly
  • Weight vector is a linear combination of the data
  • Classifier is a linear combination of inner products

initialize w = 0 and b = 0
repeat
  if y_i[⟨w, x_i⟩ + b] ≤ 0 then w ← w + y_i x_i and b ← b + y_i end if
until all classified correctly

w = Σ_{i∈I} y_i x_i   and   f(x) = Σ_{i∈I} y_i ⟨x_i, x⟩ + b
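The update rule above, as a short sketch; the four-point linearly separable toy dataset is an illustrative assumption:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Train (w, b) on +/-1 labels; update only on mistakes."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:       # misclassified (or on the boundary)
                w += yi * xi                  # w <- w + y_i x_i
                b += yi                       # b <- b + y_i
                mistakes += 1
        if mistakes == 0:                     # all classified correctly
            break
    return w, b

# linearly separable toy data
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
```

Note that w is indeed a sum of the (signed) misclassified examples, which is the compression property used two slides later.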

SLIDE 97

Convergence Theorem

  • If there exists some (w*, b*) with unit length ‖w*‖ = 1 and
    y_i[⟨x_i, w*⟩ + b*] ≥ ρ for all i,
    then the perceptron converges to a linear separator after a number of steps bounded by
    (b*² + 1)(r² + 1) ρ⁻²  where ‖x_i‖ ≤ r
  • Dimensionality independent
  • Order independent (i.e. also worst case)
  • Scales with the 'difficulty' of the problem
SLIDE 98

Proof

Starting point: we start from w₁ = 0 and b₁ = 0.
Step 1: bound on the increase of alignment. Denote by w_j the value of w at step j (analogously b_j). Alignment: ⟨(w_j, b_j), (w*, b*)⟩. For an error on observation (x_i, y_i) we get
⟨(w_{j+1}, b_{j+1}), (w*, b*)⟩ = ⟨(w_j, b_j) + y_i(x_i, 1), (w*, b*)⟩
  = ⟨(w_j, b_j), (w*, b*)⟩ + y_i⟨(x_i, 1), (w*, b*)⟩
  ≥ ⟨(w_j, b_j), (w*, b*)⟩ + ρ ≥ jρ
The alignment increases with the number of errors.

SLIDE 99

Proof

Step 2: Cauchy-Schwarz for the dot product:
⟨(w_{j+1}, b_{j+1}), (w*, b*)⟩ ≤ ‖(w_{j+1}, b_{j+1})‖ ‖(w*, b*)‖ = √(1 + (b*)²) ‖(w_{j+1}, b_{j+1})‖
Step 3: upper bound on ‖(w_j, b_j)‖. If we make a mistake, then
‖(w_{j+1}, b_{j+1})‖² = ‖(w_j, b_j) + y_i(x_i, 1)‖²
  = ‖(w_j, b_j)‖² + 2y_i⟨(x_i, 1), (w_j, b_j)⟩ + ‖(x_i, 1)‖²
  ≤ ‖(w_j, b_j)‖² + ‖(x_i, 1)‖² ≤ j(r² + 1)
Step 4: combining the first three steps:
jρ ≤ √(1 + (b*)²) ‖(w_{j+1}, b_{j+1})‖ ≤ √(j(r² + 1)((b*)² + 1))
Solving for j proves the theorem.

SLIDE 100

Consequences

  • Only need to store errors.

This gives a compression bound for perceptron.

  • Stochastic gradient descent on hinge loss
  • Fails with noisy data

l(x_i, y_i, w, b) = max(0, 1 − y_i[⟨w, x_i⟩ + b])

do NOT train your avatar with perceptrons

Black & White

SLIDE 101

Stochastic Gradient Descent

SLIDE 102

Stochastic gradient descent

  • Empirical risk as an expectation:
    (1/m) Σ_{i=1}^m l(y_i, ⟨φ(x_i), θ⟩) = E_{i∼{1,…,m}}[l(y_i, ⟨φ(x_i), θ⟩)]
  • Stochastic gradient descent (pick a random pair (x_t, y_t)):
    θ_{t+1} ← θ_t − η_t ∂_θ l(y_t, ⟨φ(x_t), θ_t⟩)
  • Often we require that the parameters are restricted to some convex set X, hence we project onto it:
    θ_{t+1} ← π_X[θ_t − η_t ∂_θ l(y_t, ⟨φ(x_t), θ_t⟩)]
    where π_X(θ) = argmin_{x∈X} ‖x − θ‖
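A minimal projected-SGD sketch for squared loss with X a Euclidean ball; the data, ball radius, and step-size constant are illustrative assumptions:

```python
import numpy as np

def projected_sgd(X, y, steps=2000, radius=10.0, seed=0):
    """SGD on squared loss with projection onto {||theta|| <= radius}."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for t in range(1, steps + 1):
        i = rng.integers(len(X))                  # pick a random (x, y)
        g = (X[i] @ theta - y[i]) * X[i]          # gradient of the instantaneous loss
        theta = theta - 0.1 / np.sqrt(t) * g      # eta_t = O(t^{-1/2})
        norm = np.linalg.norm(theta)
        if norm > radius:                         # projection pi_X: rescale onto the ball
            theta *= radius / norm
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
theta_true = np.array([1.5, -0.5])
y = X @ theta_true                                # noiseless linear data
theta = projected_sgd(X, y)
```

The projection for a norm ball is just a rescaling, which is why this constraint set is a common choice in practice.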

SLIDE 103

Convergence in Expectation

  • Guarantee (from Nesterov and Vial): show that the parameters converge to the minimum:
    E[l(θ̄)] − l* ≤ (R² + L² Σ_{t=0}^{T−1} η_t²) / (2 Σ_{t=0}^{T−1} η_t)
  • Here l(θ) = E_{(x,y)}[l(y, ⟨φ(x), θ⟩)] is the expected loss, l* = inf_{θ∈X} l(θ),
    θ̄ = (Σ_{t=0}^{T−1} η_t θ_t)/(Σ_{t=0}^{T−1} η_t) is the parameter average,
    θ* ∈ argmin_{θ∈X} l(θ), r_t := ‖θ* − θ_t‖, R bounds the initial distance r₀, and L bounds the gradients

SLIDE 104

Proof

r²_{t+1} = ‖π_X[θ_t − η_t g_t] − θ*‖² ≤ ‖θ_t − η_t g_t − θ*‖²  (projections are contractions)
  = r_t² + η_t²‖g_t‖² − 2η_t⟨θ_t − θ*, g_t⟩
hence E[r²_{t+1} − r_t²] ≤ η_t²L² + 2η_t[l* − E[l(θ_t)]]  (by convexity)
  ≤ η_t²L² + 2η_t[l* − E[l(θ̄)]]  (by convexity)

  • Summing this inequality over t proves the claim
  • This yields a randomized algorithm for minimizing objective functions (run it log many times and pick the best, or use the average/median trick)

SLIDE 105

Rates

  • Guarantee
  • If we know R, L, T pick constant learning rate
  • If we don’t know T pick

This costs us an additional log term

$$\mathbf{E}\left[l(\bar\theta)\right] - l^* \le \frac{R^2 + L^2\sum_{t=0}^{T-1}\eta_t^2}{2\sum_{t=0}^{T-1}\eta_t}$$

With known $R$, $L$, $T$, the constant rate $\eta = \frac{R}{L\sqrt{T}}$ gives

$$\mathbf{E}_{\bar\theta}\left[l(\bar\theta)\right] - l^* \le \frac{R(1 + 1/T)L}{2\sqrt{T}} < \frac{LR}{\sqrt{T}}$$

With unknown $T$, the rate $\eta_t = O(t^{-\frac{1}{2}})$ gives

$$\mathbf{E}_{\bar\theta}\left[l(\bar\theta)\right] - l^* = O\left(\frac{\log T}{\sqrt{T}}\right)$$

slide-106
SLIDE 106

Strong Convexity

  • Use this to bound the expected deviation
  • Exponentially decaying averaging

Strong convexity of each loss:

$$l_i(\theta') \ge l_i(\theta) + \langle\partial_\theta l_i(\theta), \theta' - \theta\rangle + \frac{\lambda}{2}\|\theta - \theta'\|^2$$

Using this in the recursion for $r_t$:

$$r_{t+1}^2 \le r_t^2 + \eta_t^2\|g_t\|^2 - 2\eta_t\langle\theta_t - \theta^*, g_t\rangle \le r_t^2 + \eta_t^2 L^2 - 2\eta_t\left[l_t(\theta_t) - l_t(\theta^*)\right] - \lambda\eta_t r_t^2$$

hence

$$\mathbf{E}\left[r_{t+1}^2\right] \le (1 - \lambda\eta_t)\,\mathbf{E}\left[r_t^2\right] - 2\eta_t\left[\mathbf{E}[l(\theta_t)] - l^*\right]$$

With exponentially decaying averaging

$$\bar\theta = \frac{1-\sigma}{1-\sigma^T}\sum_{t=0}^{T-1}\sigma^{T-1-t}\theta_t$$

and plugging this into the discrepancy yields

$$l(\bar\theta) - l^* \le \frac{2L^2}{\lambda T}\log\left[1 + \frac{\lambda R T^{\frac{1}{2}}}{2L}\right] \quad\text{for}\quad \eta = \frac{2}{\lambda T}\log\left[1 + \frac{\lambda R T^{\frac{1}{2}}}{2L}\right]$$

slide-107
SLIDE 107

More variants

  • Adversarial guarantees

has low regret (average instantaneous cost) for arbitrary orders (useful for game theory)

  • Ratliff, Bagnell, Zinkevich: learning rate $\eta_t = O(t^{-\frac{1}{2}})$
  • Shalev-Shwartz, Srebro, Singer (Pegasos): learning rate $\eta_t = O(t^{-1})$ (but need constants)
  • Bartlett, Rakhlin, Hazan (add strong convexity penalty)

$$\theta_{t+1} \leftarrow \pi_X\left[\theta_t - \eta_t\,\partial_\theta\, l\left(y_t, \langle\phi(x_t), \theta_t\rangle\right)\right]$$

slide-108
SLIDE 108

Parallel distributed variants

slide-109
SLIDE 109

Online Learning

  • General Template
  • Get instance
  • Compute instantaneous gradient
  • Update parameter vector
  • Problems
  • Sequential execution (single core)
  • CPU core speed is no longer increasing
  • Disk/network bandwidth: 300GB/h
  • Does not scale to TBs of data
  • Batch subgradient has 50x penalty
slide-110
SLIDE 110

Parallel Online Templates

  • Data parallel
  • Parameter parallel

(diagrams: loss/gradient pipeline for a single data source vs. data split into parts, each feeding a shared updater)

slide-111
SLIDE 111

Delayed Updates

  • Data parallel
  • n processors compute gradients
  • delay is n-1 between gradient computation

and application

  • Parameter parallel
  • delay between partial computation and

feedback from joint loss

  • delay logarithmic in processors
slide-112
SLIDE 112
  • Optimization Problem
  • Algorithm

Delayed Updates

$$\operatorname*{minimize}_{w}\ \sum_i f_i(w)$$

Algorithm (delayed stochastic gradient descent):

Input: scalar $\sigma > 0$ and delay $\tau$
for $t = \tau + 1$ to $T + \tau$ do
  Obtain $f_t$ and incur loss $f_t(w_t)$
  Compute $g_t := \partial f_t(w_t)$ and set $\eta_t = \frac{1}{\sigma(t-\tau)}$
  Update $w_{t+1} = w_t - \eta_t g_{t-\tau}$
end for
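The delayed update rule can be simulated sequentially with a gradient queue (a sketch on a hypothetical quadratic objective; the τ processors are modeled by applying each gradient τ steps late):

```python
from collections import deque

def delayed_sgd(grad, w0, sigma=1.0, tau=4, T=200):
    """SGD where the gradient applied at step t was computed at step t - tau."""
    w = w0
    queue = deque()                      # gradients awaiting application
    for t in range(1, T + 1):
        queue.append(grad(w))            # g_t computed on current iterate
        if len(queue) > tau:             # apply the stale gradient g_{t - tau}
            eta = 1.0 / (sigma * (t - tau))
            w = w - eta * queue.popleft()
    return w

# hypothetical strongly convex objective f(w) = (w - 3)^2 / 2
w = delayed_sgd(lambda w: w - 3.0, w0=0.0)
```

The delayed gradients briefly overshoot the optimum, but the decaying learning rate damps the oscillation, matching the "no worse than sequential SGD" guarantee on the next slide.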

slide-113
SLIDE 113

Theoretical Guarantees

  • Worst case guarantee

SGD with delay τ on τ processors is no worse than sequential SGD

  • Lower bound is tight

Proof: send same instance τ times

  • Better bounds with iid data
  • Penalty is covariance in features
  • Vanishing penalty for smooth f(w)
slide-114
SLIDE 114
  • Linear function classes

Algorithm converges no worse than with serial execution; tight up to a factor of 4.
  • Strong convexity

Each loss function is strongly convex with modulus λ. Constant offset depends on the degree of parallelism.

Theoretical Guarantees

$$\mathbf{E}\left[f_i(w)\right] \le 4RL\sqrt{\tau T}$$

$$R[X] \le \lambda\tau R + \left[\frac{1}{2} + \tau\right]\frac{L^2}{\lambda}\left(1 + \tau + \log T\right)$$

slide-115
SLIDE 115
  • Lipschitz continuous loss gradients

Asymptotic rate no longer depends on the amount of parallelism

  • Strong convexity and Lipschitz gradients

This only works when the objective function is very close to a parabola (upper and lower bound)

  • Lock-free updates

(Hogwild - Recht, Wright, Re http://pages.cs.wisc.edu/~brecht/papers/hogwildTR.pdf)

Nonadversarial Guarantees

$$\mathbf{E}\left[R[X]\right] \le \left[28.3\,R^2H + \frac{2}{3}RL + \frac{4}{3}R^2H\log T\right]\tau^2 + \frac{8}{3}RL\sqrt{T}$$

$$\mathbf{E}\left[R[X]\right] \le O\left(\tau^2 + \log T\right)$$

slide-116
SLIDE 116

Lazy updates & sparsity

  • Sparse gradients (easy with l2 regularizer)
  • General coordinate-based penalty
  • Key insight - we only need to know the accurate

value of wj whenever we use it

  • Store wj with timestamp of last update
  • Before using wj update using past stepsizes
  • Approximate sum over stepsizes by integral

(Quadrianto et al, 2010; Li and Langford, 2009)

$$w \leftarrow w - \eta_t\, g(w, x_t)\, x_t$$

$$\mathbf{E}_{\mathrm{emp}}\left[l(x_i, y_i, w)\right] + \lambda\sum_j \gamma_j(w_j)$$
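A sketch of the timestamp trick for an ℓ2 penalty (hypothetical class and parameter names; each coordinate is shrunk lazily by all the regularization steps it missed, just before it is read):

```python
class LazyRegularizedWeights:
    """Apply l2 shrinkage (1 - eta*lam) lazily, only when a coordinate is touched.

    Sparse gradients touch few coordinates; the accumulated shrinkage on the
    untouched ones is a product of per-step factors, applied on demand."""
    def __init__(self, dim, lam=0.01, eta=0.1):
        self.w = [0.0] * dim
        self.last = [0] * dim            # timestamp of last update per coordinate
        self.t = 0
        self.factor = 1.0 - eta * lam    # per-step shrinkage factor
        self.eta = eta

    def _catch_up(self, j):
        # all shrinkage missed since the last touch, as a single power
        self.w[j] *= self.factor ** (self.t - self.last[j])
        self.last[j] = self.t

    def get(self, j):
        self._catch_up(j)
        return self.w[j]

    def sparse_step(self, grad):         # grad: {coordinate: value}
        self.t += 1
        for j, g in grad.items():
            self._catch_up(j)
            self.w[j] -= self.eta * g

w = LazyRegularizedWeights(dim=3)
w.sparse_step({0: -1.0})                 # only coordinate 0 is touched
for _ in range(10):
    w.sparse_step({1: -0.5})             # coordinate 0 decays lazily meanwhile
```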

slide-117
SLIDE 117

Convergence on TREC

(plot: log₂ error vs. thousands of iterations on TREC, comparing no delay with delays of 10, 100, and 1000)

slide-118
SLIDE 118

Convergence on Y!Data

(plot: log₂ error vs. thousands of iterations on Y! data, comparing no delay with delays of 10, 100, and 1000)

slide-119
SLIDE 119

Speedup on TREC

(plot: percent speedup vs. number of threads, 1-7, on TREC)

slide-120
SLIDE 120

Multiple Machines

slide-121
SLIDE 121

MapReduce variant

  • Idiot-proof, simple algorithm
  • Perform stochastic gradient on each

computer for a random subset of the data (drawn with replacement)

  • Average parameters
  • Benefits
  • No communication during optimization
  • Single pass MapReduce
  • Latency is not a problem
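The scheme above can be sketched in a few lines (single-process simulation with hypothetical 1-d data; each "mapper" runs independent SGD on a bootstrap resample, and the "reduce" step averages the parameters):

```python
import random

def local_sgd(data, steps, lam, seed):
    """One mapper: SGD on l2-regularized squared loss over a bootstrap resample."""
    rng = random.Random(seed)
    sample = [rng.choice(data) for _ in range(len(data))]  # drawn with replacement
    w = 0.0
    for t in range(1, steps + 1):
        x, y = rng.choice(sample)
        eta = 1.0 / (t + 10)                               # decaying step size
        w -= eta * ((w * x - y) * x + lam * w)             # stochastic gradient step
    return w

def averaged_sgd(data, k=8, steps=2000, lam=0.1):
    """Reduce step: average the k independently trained parameters."""
    return sum(local_sgd(data, steps, lam, seed) for seed in range(k)) / k

# hypothetical 1-d regression data with y = 2x; the regularized optimum is
# E[xy] / (E[x^2] + lam), roughly 1.6 here
data = [(x / 10.0, 2.0 * x / 10.0) for x in range(1, 11)]
w = averaged_sgd(data)
```

No communication happens during the k local runs; only the final parameters cross the network.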
slide-122
SLIDE 122

Guarantees

  • Requirements
  • Strongly convex loss
  • Lipschitz continuous gradient
  • Theorem
  • Not sample size dependent
  • Regularization limits parallelization
  • For runtime

$$\mathbf{E}_{w\in D^{\eta}_{T,k}}\left[c(w)\right] - \min_w c(w) \le \frac{8\eta G^2}{\sqrt{k\lambda}}\sqrt{\|\partial c\|_L} + \frac{8\eta G^2\|\partial c\|_L}{k\lambda} + 2\eta G^2$$

$$T = \frac{\ln k - (\ln\eta + \ln\lambda)}{2\eta\lambda}$$

slide-123
SLIDE 123

4.4 Discrete Problems

slide-124
SLIDE 124

Integer programming relaxations

  • Optimization problem
  • Relax to linear program if vertices are integral

since LP has vertex solution

$$\operatorname*{minimize}_{x}\ c^\top x \quad\text{subject to}\quad Ax \le b \ \text{and}\ x \in \mathbb{Z}^n$$

slide-125
SLIDE 125

Integer programming relaxations

  • Totally unimodular constraint matrix A
  • Inverse of each submatrix must be integral
  • RHS of constraints must be integral
  • Many useful sufficient conditions for TU.
slide-126
SLIDE 126

Example - Hungarian Marriage

  • Optimization Problem
  • n Hungarian men
  • n Hungarian women
  • Compatibility cij between them
  • Find optimal matching
  • All vertices of the constraint polytope are integral

$$\operatorname*{maximize}_{\pi}\ \sum_{ij}\pi_{ij}C_{ij} \quad\text{subject to}\quad \pi_{ij}\in\{0,1\},\quad \sum_i\pi_{ij} = 1,\quad \sum_j\pi_{ij} = 1$$
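Because the constraint matrix is totally unimodular, the LP relaxation has an integral optimum; for a small hypothetical compatibility matrix the optimal matching can be checked by brute force over permutations:

```python
from itertools import permutations

def best_matching(C):
    """Exhaustive search: pi maps man i to woman pi[i], maximizing total compatibility."""
    n = len(C)
    best_score, best_pi = float("-inf"), None
    for pi in permutations(range(n)):
        score = sum(C[i][pi[i]] for i in range(n))
        if score > best_score:
            best_score, best_pi = score, pi
    return best_score, best_pi

# hypothetical compatibilities c_ij
C = [[3, 1, 2],
     [2, 4, 6],
     [4, 5, 3]]
score, pi = best_matching(C)
```

For realistic n the Hungarian algorithm (or an LP solver, by integrality) replaces the factorial search.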

slide-127
SLIDE 127

Randomization

  • Maximum finding
  • Very large set of instances
  • Find approximate maximum
  • Draw a random set of n terms
  • Take maximum over subset

(59 for 95% with 95% confidence)

$$\Pr\left\{F\left[\max_i x_i\right] < 1 - \epsilon\right\} = (1-\epsilon)^n = \delta \quad\text{hence}\quad n = \frac{\log\delta}{\log(1-\epsilon)} \le \frac{-\log\delta}{\epsilon}$$
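A quick check of the sample-size formula; plugging in ε = δ = 0.05 reproduces the n = 59 figure on the slide:

```python
import math

def sample_size(eps, delta):
    """Smallest n with (1 - eps)^n <= delta: a random sample of n elements
    contains a top-eps-fraction element with probability >= 1 - delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - eps))

n = sample_size(0.05, 0.05)  # 95th percentile with 95% confidence -> 59
```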

slide-128
SLIDE 128

Randomization

  • Find good solution
  • Show that expected value is well behaved
  • Show that tails are bounded
  • Sufficiently large random draw must contain at least one

good element (e.g. CM sketch)

  • Find good majority
  • Show that majority satisfies condition
  • Bound probability of minority being overrepresented (e.g.

Mean-Median theorem)

  • Much more in these books
  • Raghavan & Motwani (Randomized Algorithms)
  • Alon & Spencer (Probabilistic Method)
slide-129
SLIDE 129


Submodular maximization

  • Submodular function
  • Defined on sets
  • Diminishing returns property
  • Example

For web search results each result might score well individually, but if we can show only 4 we should probably pick a diverse subset.

$$f(A \cup C) - f(A) \ge f(B \cup C) - f(B) \quad\text{for}\quad A \subseteq B$$

slide-130
SLIDE 130

Submodular maximization

  • Optimization problem

Often NP hard even to find tight approximation

  • Greedy optimization procedure
  • Start with empty set X
  • Find x such that $f(X \cup \{x\})$ is maximized
  • Add x to the set and repeat until |X| = k

$$\max_{X\in\mathcal{X}}\ f(X) \quad\text{subject to}\quad |X| \le k$$
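A sketch of the greedy procedure on a coverage function (set coverage is submodular, so greedy is within 1 − 1/e of optimal; the sets below are hypothetical):

```python
def greedy_max_coverage(sets, k):
    """Greedily pick k sets maximizing f(X) = |union of chosen sets|."""
    chosen, covered = [], set()
    for _ in range(k):
        # marginal gain f(X + {s}) - f(X) shrinks as X grows: diminishing returns
        best = max(sets, key=lambda s: len(s - covered))
        chosen.append(best)
        covered |= best
    return chosen, covered

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 6}]
chosen, covered = greedy_max_coverage(sets, k=2)
```

(This sketch may re-pick a set once all gains are zero; a production version would stop early or exclude chosen sets.)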

slide-131
SLIDE 131

Applications

  • Feature selection
  • Active learning and experimental design
  • Disease spread detection in networks
  • Document summarization
  • Learning graphical models
  • Extensions to
  • Weighted item sets
  • Decision trees
slide-132
SLIDE 132

Basic Techniques

  • Gradient descent
  • Newton's method
  • Conjugate Gradient Descent
  • Broyden-Fletcher-Goldfarb-Shanno (BFGS)
  • Constrained Convex Optimization
  • Properties
  • Lagrange function
  • Wolfe dual
  • Batch methods
  • Distributed subgradient
  • Bundle methods
  • Online methods
  • Unconstrained subgradient
  • Gradient projections
  • Parallel optimization

Optimization

slide-133
SLIDE 133

Further reading

  • Nesterov and Vial (expected convergence)

http://dl.acm.org/citation.cfm?id=1377347

  • Bartlett, Hazan, Rakhlin (strong convexity SGD)

http://books.nips.cc/papers/files/nips20/NIPS2007_0699.pdf

  • TAO (toolkit for advanced optimization)

http://www.mcs.anl.gov/research/projects/tao/

  • Ratliff, Bagnell, Zinkevich

http://martin.zinkevich.org/publications/ratliff_nathan_2007_3.pdf

  • Shalev-Shwartz, Srebro, Singer (Pegasos paper)

http://dl.acm.org/citation.cfm?id=1273598

  • Langford, Smola, Zinkevich (slow learners are fast)

http://arxiv.org/abs/0911.0491

  • Hogwild (Recht, Wright, Re)

http://pages.cs.wisc.edu/~brecht/papers/hogwildTR.pdf