Robust Nonlinear Optimization




  1. Robust Nonlinear Optimization. Maren Mahsereci. Workshop on Uncertainty Quantification, 09/15/2016, Sheffield. Emmy Noether Group on Probabilistic Numerics, Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen, Germany.

  2. Robust optimization ... outline
     ▸ basics about greedy optimizers
     ▸ GD and SGD: (stochastic) gradient descent
     ▸ robust stochastic optimization
     ▸ example: step size adaptation
     ▸ extending line searches
     ▸ robust search directions

  3. Typical scheme ... greedy and gradient-based optimizer
     x* = arg min_x L(x),   x_{i+1} ← x_i − α_i s_i
     1. s_i: which direction? → model the objective function locally
     2. α_i: how far? → prevent blow-ups and stagnation
     3. repeat
     ▸ needs to work for many different L(x)
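
In code, this scheme is a short loop with two pluggable choices. The sketch below is a minimal illustration, not from the talk; `grad`, `direction`, and `step_size` are placeholder callables standing in for whatever local model of the objective an optimizer uses.

```python
import numpy as np

def greedy_minimize(grad, x0, direction, step_size, num_steps=100):
    """Generic greedy scheme x_{i+1} = x_i - alpha_i * s_i.

    grad:      callable returning (an estimate of) the gradient of L at x
    direction: callable (x, g) -> search direction s_i, e.g. s_i = g for gradient descent
    step_size: callable (i, x, g, s) -> scalar alpha_i, e.g. a constant or a line search
    """
    x = np.asarray(x0, dtype=float)
    for i in range(num_steps):
        g = grad(x)                      # local information about L at x_i
        s = direction(x, g)              # 1. which direction?
        alpha = step_size(i, x, g, s)    # 2. how far?
        x = x - alpha * s                # 3. repeat
    return x
```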

  5. The steepest way downhill ... gradient descent finds a local minimum
     x* = arg min_x L(x),   x_{i+1} ← x_i − α ∇L(x_i),   α = const.
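
A minimal, self-contained instance of this update with a constant, hand-chosen α, run on a toy 1D quadratic (an assumed example, not the objective plotted on the slide):

```python
# toy objective L(x) = 0.5 * (x - 3)^2 with gradient dL(x) = x - 3 (not the slide's example)
L = lambda x: 0.5 * (x - 3.0) ** 2
dL = lambda x: x - 3.0

x, alpha = -4.0, 0.1              # constant, hand-chosen step size
for i in range(100):
    x = x - alpha * dL(x)         # x_{i+1} <- x_i - alpha * grad L(x_i)
print(x)                          # approaches the minimizer x* = 3
```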

  15. Additional difficulty ... noisy functions by mini-batching
      x* = arg min_x L(x)
      sometimes we do not know −∇L(x_i) precisely!

  16. Additional difficulty ... noisy functions by mini-batching
      x* = arg min_x L(x)
      L(x) := (1/M) ∑_{i=1}^{M} ℓ(x, y_i) ≈ (1/m) ∑_{j=1}^{m} ℓ(x, y_j) =: L̂(x),   m ≪ M
      ▸ compute only the smaller sum over m terms
      ▸ hope that L̂(x) approximates L(x) well
      ▸ smaller m means higher noise on ∇L̂(x)

  17. Additional difficulty ... noisy functions by mini-batching
      for i.i.d. mini-batches, the noise is approximately Gaussian:
      L̂(x) = L(x) + ε,   ε ∼ N(0, O((M − m)/m)),   i.e.   L̂(x) ∼ N(L(x), O((M − m)/m))
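
A small sketch of the mini-batch estimator and its noise, under an assumed toy data set and a squared-error per-example loss (both illustrative choices, not the talk's):

```python
import numpy as np

rng = np.random.default_rng(0)
M, m = 10_000, 32                            # full data set size vs. mini-batch size, m << M
y = rng.normal(loc=3.0, scale=1.0, size=M)   # toy data (assumption)

def ell_grad(x, y_i):
    # gradient of a per-example squared loss ell(x, y_i) = 0.5 * (x - y_i)^2
    return x - y_i

def full_gradient(x):
    return np.mean(ell_grad(x, y))           # (1/M) * sum_{i=1}^{M} d/dx ell(x, y_i)

def minibatch_gradient(x):
    batch = rng.choice(y, size=m, replace=False)
    return np.mean(ell_grad(x, batch))       # (1/m) * sum_{j=1}^{m} d/dx ell(x, y_j) -- noisy

# for i.i.d. batches the estimate scatters roughly Gaussian around the full gradient,
# with a spread that shrinks as the batch size m grows
estimates = [minibatch_gradient(0.0) for _ in range(1000)]
print(full_gradient(0.0), np.mean(estimates), np.std(estimates))
```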

  18. The steepest way downhill ... in expectation: SGD finds a local minimum, too
      x* = arg min_x L(x),   x_{i+1} ← x_i − α ∇L̂(x_i),   α = const.
      [figure: SGD iterates on the 1D objective]
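
The same constant-step update as before, now driven by a noisy gradient estimate; here the mini-batch noise is modeled as additive Gaussian, as on the previous slide (toy objective and noise level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
dL = lambda x: x - 3.0                                # true gradient of the toy quadratic above
noisy_dL = lambda x: dL(x) + rng.normal(0.0, 0.5)     # stand-in for mini-batch gradient noise

x, alpha = -4.0, 0.1
for i in range(500):
    x = x - alpha * noisy_dL(x)   # x_{i+1} <- x_i - alpha * grad L_hat(x_i)
print(x)                          # hovers near x* = 3, with residual scatter set by alpha and the noise
```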

  20. Step size adaptation ... by line searches
      x_{i+1} ← x_i − α_i s_i
      so far α was constant and hand-chosen!
      ▸ line searches automatically choose step sizes

  21. Line searches ... automated learning rate adaptation
      x* = arg min_x L(x),   x_{i+1} ← x_i − α_i ∇L(x_i)
      set the scalar step size α_i given the direction −∇L(x_i)
      [figure: 1D objective with a small step size, a large step size, and the step chosen by a line search]

  24. Line searches ... automated learning rate adaptation
      set the scalar step size α_i given the noisy direction −∇L̂(x_i)
      [figure: same objective, now with noisy evaluations]
      line searches break in the stochastic setting!
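
To make the failure mode concrete: a simple backtracking (Armijo) line search, one common classic variant and not necessarily the one used in the talk, accepts a step by comparing two function values. With mini-batch noise of a similar size to the true decrease, that comparison, and hence the accept/reject decision, becomes essentially random.

```python
import numpy as np

def backtracking_line_search(f, df, x, s, alpha0=1.0, c1=1e-4, rho=0.5, max_iter=20):
    """Shrink alpha until f(x - alpha*s) <= f(x) - c1*alpha*<df(x), s>  (sufficient decrease)."""
    f0 = f(x)
    decrease = c1 * np.dot(df(x), s)      # expected decrease per unit step (positive for descent)
    alpha = alpha0
    for _ in range(max_iter):
        if f(x - alpha * s) <= f0 - alpha * decrease:
            return alpha                  # sufficient decrease reached: accept
        alpha *= rho                      # otherwise shrink the step and try again
    return alpha

# With noisy evaluations f_hat = f + eps, the comparison above is between two noisy
# numbers; once eps is comparable to the true decrease, acceptance is a coin flip.
```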

  25. Step size adaptation ... by line searches
      x_{i+1} ← x_i − α_i s_i
      ▸ line searches automatically choose step sizes
      ▸ very fast subroutines called in each optimization step
      ▸ control blow-up or stagnation
      ▸ they do not work in stochastic optimization problems!
      small outline
      ▸ introduce classic (noise-free) line searches
      ▸ translate the concept to the language of probability
      ▸ get a new algorithm robust to noise

  27. Classic line searches: initial evaluation ≡ current position of the optimizer
      x* = arg min_x f(x),   x_{i+1} ← x_i − t s_i
      [figure: f(t) (top) and f′(t) (bottom) over the distance t in the line search direction]
      Wolfe conditions, accept when [Wolfe, SIAM Review, 1969]:
      f(t) ≤ f(0) + c_1 t f′(0)      (W-I)
      f′(t) ≥ c_2 f′(0)              (W-IIa)
      |f′(t)| ≤ c_2 |f′(0)|          (W-IIb)
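
Written as code, the Wolfe test for a candidate step t along the search line is the small predicate below; the defaults c1 = 1e-4 and c2 = 0.9 are common textbook choices, not values taken from the slide.

```python
def wolfe_conditions(f0, df0, ft, dft, t, c1=1e-4, c2=0.9, strong=True):
    """Check the Wolfe conditions for a candidate step t along the search line.

    f0, df0: value and derivative of the 1D objective f at t = 0
    ft, dft: value and derivative of f at the candidate t
    """
    w1 = ft <= f0 + c1 * t * df0               # (W-I)   sufficient decrease
    if strong:
        w2 = abs(dft) <= c2 * abs(df0)         # (W-IIb) strong curvature condition
    else:
        w2 = dft >= c2 * df0                   # (W-IIa) curvature condition
    return w1 and w2
```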

  28. Classic line searches: search candidate #1, marked as the initial candidate (same figure and Wolfe conditions as above)
  29. Classic line searches: collapse the search space
  30. Classic line searches: search candidate #2, by extrapolation
  31. Classic line searches: collapse the search space
  32. Classic line searches: search candidate #3, by interpolation (at a local minimum)
  33. Classic line searches: accept, datapoint #3 fulfills the Wolfe conditions

  34. Classic line searches: choosing meaningful step sizes at very low overhead
      many classic line searches
      1. model the 1D objective with a cubic spline (sketched below)
      2. search candidate points by collapsing the search space
      3. accept if the Wolfe conditions are fulfilled
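
Step 1 usually amounts to fitting a cubic through the two most recent evaluations and proposing its minimizer as the next candidate. The helper below is a sketch of the standard cubic-interpolation formula (e.g. Nocedal & Wright, eq. 3.59); the surrounding bracketing/collapsing loop and the bisection fallbacks are simplifications, not the talk's exact procedure.

```python
import math

def cubic_minimizer(t1, f1, g1, t2, f2, g2):
    """Minimizer of the cubic interpolant matching f and f' at t1 and t2.

    (t1, f1, g1) and (t2, f2, g2) are (location, value, derivative) of the 1D
    objective at the two most recent evaluations along the search direction.
    """
    d1 = g1 + g2 - 3.0 * (f1 - f2) / (t1 - t2)
    radicand = d1 * d1 - g1 * g2
    if radicand < 0.0:
        return 0.5 * (t1 + t2)                 # no real minimizer: fall back to bisection
    d2 = math.copysign(math.sqrt(radicand), t2 - t1)
    denom = g2 - g1 + 2.0 * d2
    if abs(denom) < 1e-12:
        return 0.5 * (t1 + t2)                 # degenerate cubic: fall back to bisection
    return t2 - (t2 - t1) * (g2 + d2 - d1) / denom
```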
