LECTURE 17: SWARM INTELLIGENCE 3 / CLASSICAL OPTIMIZATION I


  1. 15-382 COLLECTIVE INTELLIGENCE - S18. LECTURE 17: SWARM INTELLIGENCE 3 / CLASSICAL OPTIMIZATION I. INSTRUCTOR: GIANNI A. DI CARO

  2. PSO: SWARM COOPERATION + MEMORY + INERTIA. Decision-making / search strategy: [PSO update equation, with one best-position term P for the individual particle / agent and one P for the swarm / neighborhood]
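
The PSO update equation on this slide is an image in the source; as a hedged reconstruction of the standard particle swarm rule it annotates (inertia + individual memory + swarm/neighborhood cooperation), here is a minimal NumPy sketch. The coefficients w, c1, c2, the test function sphere, and all numeric values are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def sphere(x):
    # Illustrative objective (assumption): simple convex test function
    return np.sum(x**2)

def pso_step(x, v, p_best, p_swarm, w=0.7, c1=1.5, c2=1.5, rng=np.random):
    """One standard PSO update: inertia + memory (personal best) + cooperation (swarm best)."""
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v_new = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (p_swarm - x)
    return x + v_new, v_new

# Tiny swarm loop (illustrative parameters)
rng = np.random.default_rng(0)
n, dim = 20, 2
X = rng.uniform(-5, 5, (n, dim))
V = np.zeros((n, dim))
P = X.copy()                                   # personal best positions
g = P[np.argmin([sphere(p) for p in P])]       # swarm best position

for _ in range(100):
    for i in range(n):
        X[i], V[i] = pso_step(X[i], V[i], P[i], g, rng=rng)
        if sphere(X[i]) < sphere(P[i]):
            P[i] = X[i]
    g = P[np.argmin([sphere(p) for p in P])]
print(g, sphere(g))
```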

  3. A TYPICAL SWARM-LEVEL BEHAVIOR …

  4. WHAT IF WE HAVE ONE SINGLE AGENT … • PSO leverages the presence of a swarm: the outcome is truly a collective behavior • If left alone, each individual agent would behave like a hill-climber, moving in the direction of a local optimum, and then having quite a hard time escaping it. A single agent doesn’t look impressive 🤕 … How can a single agent be smarter?

  5. A (FIRST) GENERAL APPROACH • What is a good direction p? • How small / large / constant / variable should the step size β_k be? • How do we check that we are at the minimum? • This could work for a local optimum, but what about finding the global optimum?

  6. IF WE HAVE THE FUNCTION (AND ITS DERIVATIVES) • From the 1st-order Taylor series: g(y) ≈ g(y₀) + ∇g(y₀)ᵀ(y − y₀), the equation of the tangent plane to g in y₀: the gradient vector is orthogonal to the tangent plane • The partial derivatives determine the slope of the plane • In each point y, the gradient vector is orthogonal to the isocontours {y : g(y) = d} • ⇒ The gradient vector points in the direction of maximal change of the function in the point. The magnitude of the gradient is the (max) velocity of change of the function
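
To make the orthogonality claim concrete, here is a small hedged NumPy check (the quadratic g(y) = y1² + 4·y2² and the chosen point are illustrative assumptions, not from the slides): at a point of an isocontour, the gradient is perpendicular to the contour's tangent direction.

```python
import numpy as np

# Illustrative example (assumption): g(y) = y1^2 + 4*y2^2, isocontour g(y) = 8
t = np.pi / 4                                                          # parameter of the contour point
y0 = np.array([np.sqrt(8) * np.cos(t), np.sqrt(2) * np.sin(t)])        # the point (2, 1)
grad = np.array([2 * y0[0], 8 * y0[1]])                                # gradient of g at y0
tangent = np.array([-np.sqrt(8) * np.sin(t), np.sqrt(2) * np.cos(t)])  # contour tangent at y0
print(np.dot(grad, tangent))                                           # ~0: gradient is orthogonal to the isocontour
```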

  7. GRADIENTS AND RATE OF CHANGE • The directional derivative of a function g: ℝⁿ → ℝ is the rate of change of the function along a given direction w (where w is a unit-norm vector) • E.g., for a function of two variables and a direction w = (w_x, w_y) • Partial derivatives are directional derivatives along the vectors of the canonical basis of ℝⁿ
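
The formulas on this slide are images in the source; as a hedged illustration of the limit definition of the directional derivative and of the relation D_w g = ∇g · w stated on the next slide, here is a small NumPy check (the function g, the point, and the direction are assumptions).

```python
import numpy as np

# Illustrative function (assumption): g(y) = y1^2 + 3*y1*y2
g = lambda y: y[0]**2 + 3 * y[0] * y[1]
grad_g = lambda y: np.array([2 * y[0] + 3 * y[1], 3 * y[0]])

y0 = np.array([1.0, 2.0])
w = np.array([3.0, 4.0])
w = w / np.linalg.norm(w)                    # unit-norm direction

h = 1e-6
finite_diff = (g(y0 + h * w) - g(y0)) / h    # limit definition of the directional derivative
analytic = np.dot(grad_g(y0), w)             # D_w g(y0) = grad g(y0) . w
print(finite_diff, analytic)                 # both ~ 7.2
```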

  8. GRADIENTS AND RATE OF CHANGE • Theorem: Given a direction w and a differentiable function g, in each point y₀ of the function domain the following relation holds between the gradient vector and the directional derivative: D_w g(y₀) = ∇g(y₀) · w = ‖∇g(y₀)‖ cos θ, θ being the angle between w and the gradient • Corollary 1: D_w g(y₀) = ‖∇g(y₀)‖ only when the gradient is parallel to the direction w (θ = 0) • Corollary 2: that is, the directional derivative in a point gets its maximum value when the direction w is the same as the gradient vector • ⇒ In each point, the gradient vector corresponds to the direction of maximal change of the function • ⇒ The norm of the gradient corresponds to the max velocity of change in the point

  9. GRADIENT DESCENT / ASCENT • Move in the direction opposite to (min) or aligned with (max) the gradient vector, the direction of maximal change of the function

  10. ONLY LOCAL OPTIMA • The final local optimum depends on where we start from (for non-convex functions)
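
A hedged one-dimensional illustration (the non-convex function f(x) = x⁴ − 3x² + x and the starting points are assumptions, not from the slides): plain gradient descent started on opposite sides of the central hump converges to different local minima.

```python
import numpy as np

# Illustrative non-convex function (assumption): f(x) = x^4 - 3x^2 + x
f = lambda x: x**4 - 3 * x**2 + x
df = lambda x: 4 * x**3 - 6 * x + 1

def gradient_descent(x0, alpha=0.01, iters=2000):
    x = x0
    for _ in range(iters):
        x -= alpha * df(x)          # constant-step gradient descent
    return x

for x0 in (-2.0, 2.0):              # two different starting points
    x_star = gradient_descent(x0)
    print(f"start {x0:+.1f} -> x* = {x_star:+.4f}, f(x*) = {f(x_star):+.4f}")
# The two runs end in different local minima of the non-convex f.
```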

  11. ~ MOTION IN A POTENTIAL FIELD • A GD run ~ motion of a mass in a potential field towards the minimum-energy configuration. At each point the gradient defines the attraction force, while the step size α scales the force to define the next point

  12. GRADIENT DESCENT ALGORITHM (MIN)
      1. Initialization: (a) definition of a starting point x_0; (b) definition of a tolerance parameter ε for convergence; (c) initialization of the iteration variable, k ← 0
      2. Computation of a feasible direction for moving: d_k ← −∇f(x_k)
      3. Definition of the feasible (max) length of the move: α_k ← argmin_α f(x_k + α d_k) (a one-dimensional problem in α ∈ ℝ: the farthest point up to which f(x_k + α d_k) keeps decreasing)
      4. Move to the new point in the direction of gradient descent: x_{k+1} ← x_k + α_k d_k
      5. Check for convergence: if ‖α_k d_k‖ < ε [or if α_k ≤ c α_0, where c > 0 is a small constant] (i.e., the gradient has become ≈ zero): (a) output x* = x_{k+1}, f(x*) = f(x_{k+1}); (b) stop. Otherwise, k ← k + 1 and go to Step 2
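
A hedged, minimal Python sketch of the procedure above (the quadratic test function, the bracketing interval for the 1-D line search, and the use of SciPy's bounded scalar minimizer are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative quadratic (assumption): f(x) = 0.5 * x^T A x
A = np.array([[1.0, 0.0], [0.0, 10.0]])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

def gradient_descent(x0, eps=1e-8, max_iter=1000):
    x, k = np.asarray(x0, dtype=float), 0
    while k < max_iter:
        d = -grad_f(x)                                    # Step 2: steepest-descent direction
        line = lambda a: f(x + a * d)                     # Step 3: 1-D line search over the step length
        alpha = minimize_scalar(line, bounds=(0.0, 2.0), method='bounded').x
        step = alpha * d
        x = x + step                                      # Step 4: move to the new point
        if np.linalg.norm(step) < eps:                    # Step 5: convergence check
            break
        k += 1
    return x, f(x), k

x_star, f_star, iters = gradient_descent([10.0, 1.0])
print(x_star, f_star, iters)
```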

  13. ⇒ STEP BEHAVIOR • If the gradient in y₀ points to the local minimum, then a line search would determine a step size β_k that takes us directly to the minimum • However, this lucky case only happens for perfectly conditioned functions, or for a restricted set of points • It might be computationally heavy to solve a 1-D optimization problem at each step, so approximate methods can be preferred • In the general case, successive moving directions d_k are perpendicular to each other: β* minimizes g(y_k + β d_k), such that dg/dβ must be zero in β*, but …
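
As a hedged numerical check of the perpendicularity claim (the quadratic and the starting point are assumptions): with an exact line search on a quadratic, consecutive steepest-descent directions come out orthogonal.

```python
import numpy as np

# Illustrative quadratic (assumption): g(y) = 0.5 * y^T A y
A = np.array([[1.0, 0.0], [0.0, 10.0]])
grad = lambda y: A @ y

y = np.array([10.0, 1.0])
d_prev = None
for k in range(4):
    d = -grad(y)                          # steepest-descent direction at y_k
    beta = (d @ d) / (d @ A @ d)          # exact line-search step for a quadratic
    if d_prev is not None:
        print(f"step {k}: d_k . d_(k-1) = {d @ d_prev:.2e}")   # ~0: consecutive directions are perpendicular
    y = y + beta * d
    d_prev = d
```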

  14. ILL-CONDITIONED PROBLEMS • [Figures: ill-conditioned vs. well-conditioned problem] • If the function is very anisotropic, the problem is said to be ill-conditioned, since the gradient vector doesn’t point toward the local minimum, resulting in a zig-zagging trajectory • Ill-conditioning can be quantified by computing the ratio between the (largest and smallest) eigenvalues of the Hessian matrix (the matrix of the second partial derivatives)
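
A hedged sketch of the eigenvalue-ratio check mentioned above (the Hessian H is an illustrative assumption):

```python
import numpy as np

# Illustrative quadratic (assumption): f(x) = 0.5 * x^T H x, with constant Hessian H
H = np.array([[1.0, 0.0],
              [0.0, 50.0]])

eigvals = np.linalg.eigvalsh(H)            # eigenvalues of the (symmetric) Hessian
cond = eigvals.max() / eigvals.min()       # condition number = ratio of extreme eigenvalues
print(eigvals, cond)                       # cond >> 1 indicates an ill-conditioned problem
```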

  15. WHAT ABOUT A CONSTANT STEP SIZE? • [Figures: two gradient descent runs] α too large: divergence • Small, good α: convergence

  16. WHAT ABOUT A CONSTANT STEP SIZE? • [Figures: two gradient descent runs] α too large: divergence • α too low: slow convergence • Adapting the step size may be necessary to avoid either too-slow progress or overshooting the target minimum

  17. EXAMPLE: SUPERVISED MACHINE LEARNING • Loss function: sum of squared errors, Σ_{j=1}^{n} (y⁽ʲ⁾ · θ − z⁽ʲ⁾)² • n labeled training samples (examples) • z⁽ʲ⁾ = known correct value for sample j • y⁽ʲ⁾ · θ = linear hypothesis function, θ the vector of parameters • Goal: find the value of the parameter vector θ such that the loss (errors in classification / regression) is minimized (over the training set) • Any analogy with PSO? • Gradient: if the averaging factor 1/2n is used, then the update rule becomes: θ ← θ − (α/n) Σ_{j=1}^{n} y⁽ʲ⁾ (y⁽ʲ⁾ · θ − z⁽ʲ⁾)
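
A hedged NumPy sketch of batch gradient descent on the sum-of-squared-errors loss above (the synthetic data, the learning rate alpha, and the variable names Y, z, theta loosely follow the slide's notation and are otherwise assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): n samples, linear targets plus noise
n, d = 100, 3
Y = rng.normal(size=(n, d))                       # feature vectors y^(j) as rows
theta_true = np.array([2.0, -1.0, 0.5])
z = Y @ theta_true + 0.1 * rng.normal(size=n)     # known correct values z^(j)

# Loss with the averaging factor: (1/2n) * sum_j (y^(j).theta - z^(j))^2
loss = lambda theta: 0.5 / n * np.sum((Y @ theta - z) ** 2)

theta = np.zeros(d)
alpha = 0.1
for _ in range(500):
    grad = Y.T @ (Y @ theta - z) / n              # gradient of the averaged SSE loss
    theta -= alpha * grad                         # theta <- theta - (alpha/n) * sum_j y^(j)(y^(j).theta - z^(j))
print(theta, loss(theta))                         # theta ends up close to theta_true
```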

  18. STOCHASTIC GRADIENT DESCENT
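
The body of this slide is an image in the source; as a hedged sketch of the standard idea (reusing the synthetic linear-regression setup assumed above), stochastic gradient descent replaces the full-batch gradient with the gradient of a single randomly chosen sample per update:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 3
Y = rng.normal(size=(n, d))                       # synthetic features (assumption)
theta_true = np.array([2.0, -1.0, 0.5])
z = Y @ theta_true + 0.1 * rng.normal(size=n)     # synthetic targets (assumption)

theta = np.zeros(d)
alpha = 0.05
for epoch in range(50):
    for j in rng.permutation(n):                  # visit the samples in random order
        grad_j = Y[j] * (Y[j] @ theta - z[j])     # gradient contribution of a single sample
        theta -= alpha * grad_j
print(theta)                                      # noisy, but close to theta_true
```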

  19. ADAPTIVE STEP SIZE? • The problem min_α f(x_k + α d_k) can be tackled using any algorithm for one-dimensional search: Fibonacci, golden section, Newton's method, secant method, ... • If f ∈ C², it is possible to compute the Hessian H_k, and α_k can be determined analytically:
      – Let y_k = α d_k be the (feasible) small displacement applied in x_k, such that, from the Taylor series about x_k truncated at the 2nd order (quadratic approximation): f(x_k + y_k) ≈ f(x_k) + y_kᵀ ∇_k + ½ y_kᵀ H_k y_k, where ∇_k = ∇f(x_k)
      – In the algorithm, d_k is the direction of steepest descent, so y_k = −α ∇_k and: f(x_k − α ∇_k) ≈ f(x_k) − α ∇_kᵀ ∇_k + ½ α² ∇_kᵀ H_k ∇_k
      – We are looking for the α minimizing f ⇒ the first-order condition df/dα = 0 must be satisfied: df/dα ≈ −∇_kᵀ ∇_k + α ∇_kᵀ H_k ∇_k = 0 ⇒ α = α_k ≈ (∇_kᵀ ∇_k) / (∇_kᵀ H_k ∇_k)
      – The updating rule of the gradient descent algorithm becomes: x_{k+1} ← x_k − [(∇_kᵀ ∇_k) / (∇_kᵀ H_k ∇_k)] ∇_k
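
A hedged NumPy sketch of this analytically determined step size (the quadratic test function and the starting point are assumptions; for a quadratic, the 2nd-order Taylor expansion is exact, so α_k is exact at each iteration):

```python
import numpy as np

# Illustrative quadratic (assumption): f(x) = 0.5 * x^T H x, with constant Hessian H
H = np.array([[2.0, 0.5],
              [0.5, 8.0]])
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x

x = np.array([5.0, -3.0])
for k in range(10):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    alpha = (g @ g) / (g @ H @ g)       # analytic step: alpha_k = (grad^T grad) / (grad^T H grad)
    x = x - alpha * g                   # x_{k+1} = x_k - alpha_k * grad_k
print(x, f(x))                          # converges to the minimum at the origin
```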

  20. CONDITIONS ON THE HESSIAN MATRIX • Update rule: x_{k+1} ← x_k − [(∇_kᵀ ∇_k) / (∇_kᵀ H_k ∇_k)] ∇_k • The value found for α_k is a sound estimate of the "correct" value of α as long as the quadratic approximation of the Taylor series is an accurate approximation of f(x) • At the beginning of the iterations, ‖y_k‖ will be quite large, since the approximation will be inaccurate, but getting closer to the minimum ‖y_k‖ will decrease accordingly, and the accuracy will keep increasing • If f is a quadratic function, the Taylor series is exact, and we can use = instead of ≈ ⇒ α_k is exact at each iteration • The Hessian matrix is the matrix of 2nd-order partial derivatives. E.g., for f: X ⊆ ℝ³ → ℝ, the Hessian matrix computed in a point x*:
      H|_{x*} = [ ∂²f/∂x₁²      ∂²f/∂x₁∂x₂   ∂²f/∂x₁∂x₃
                  ∂²f/∂x₂∂x₁   ∂²f/∂x₂²      ∂²f/∂x₂∂x₃
                  ∂²f/∂x₃∂x₁   ∂²f/∂x₃∂x₂   ∂²f/∂x₃²  ]   (all derivatives evaluated at x*)

  21. PROPERTIES OF THE HESSIAN MATRIX • Theorem (Schwarz): If a function is C² (twice differentiable with continuity) in ℝⁿ ⇒ the order of derivation is irrelevant ⇒ the Hessian matrix is symmetric • Theorem (Quadratic form of a matrix): Given a square, symmetric matrix H ∈ ℝⁿˣⁿ, the associated quadratic form is defined as the function q(x) = ½ xᵀHx. The matrix is said:
      – Positive definite (convex quadratic form) if xᵀHx > 0, ∀x ∈ ℝⁿ, x ≠ 0
      – Positive semi-definite if xᵀHx ≥ 0, ∀x ∈ ℝⁿ
      – Negative definite (concave quadratic form) if xᵀHx < 0, ∀x ∈ ℝⁿ, x ≠ 0
      – Negative semi-definite if xᵀHx ≤ 0, ∀x ∈ ℝⁿ
      – Indefinite (saddle) if xᵀHx > 0 for some x and xᵀHx < 0 for other x
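
As a hedged numerical complement (the matrices below are illustrative assumptions): for a symmetric H, the sign pattern of its eigenvalues determines which of the cases above applies, and this can be checked directly.

```python
import numpy as np

def classify(H, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    lam = np.linalg.eigvalsh(H)
    if np.all(lam > tol):   return "positive definite (convex quadratic form)"
    if np.all(lam >= -tol): return "positive semi-definite"
    if np.all(lam < -tol):  return "negative definite (concave quadratic form)"
    if np.all(lam <= tol):  return "negative semi-definite"
    return "indefinite (saddle)"

# Illustrative matrices (assumptions)
for H in (np.array([[2.0, 0.0], [0.0, 3.0]]),      # positive definite
          np.array([[1.0, 0.0], [0.0, 0.0]]),      # positive semi-definite
          np.array([[1.0, 0.0], [0.0, -4.0]])):    # indefinite
    print(classify(H))
```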
