LECTURE 17: SWARM INTELLIGENCE 3 / CLASSICAL OPTIMIZATION I
15-382 COLLECTIVE INTELLIGENCE - S18
LECTURE 17: SWARM INTELLIGENCE 3 / CLASSICAL OPTIMIZATION I
INSTRUCTOR: GIANNI A. DI CARO
PSO: SWARM COOPERATION + MEMORY + INERTIA
[Figure: PSO velocity update, combining the individual term (personal-best memory), the swarm / neighborhood term (cooperation), and inertia]
- Decision-making / search strategy for an individual particle / agent (a code sketch of the per-particle update follows below)
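A minimal sketch of the per-particle PSO update the slide refers to; the variable names and coefficient values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def pso_step(x, v, p_best, n_best, w=0.7, c1=1.5, c2=1.5, rng=np.random):
    """One PSO update for a single particle.

    x      : current position of the particle
    v      : current velocity (inertia / memory of motion)
    p_best : best position found by this particle (individual memory)
    n_best : best position found in its swarm / neighborhood (cooperation)
    w, c1, c2 : inertia and acceleration coefficients (illustrative values)
    """
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v_new = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (n_best - x)
    x_new = x + v_new
    return x_new, v_new
```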
A TYPICAL SWARM-LEVEL BEHAVIOR…
WHAT IF WE HAVE ONE SINGLE AGENT…
- PSO leverages the presence of a swarm: the outcome is truly a collective behavior
- If left alone, each individual agent would behave like a hill-climber: it moves in the direction of a local optimum and then has quite a hard time escaping it
A single agent doesn’t look impressive 🤕 … How can a single agent be smarter?
A (FIRST) GENERAL APPROACH
- What is a good direction 𝒑?
- How small / large should the step size 𝜷k be, and should it be constant or variable?
- How do we check that we are at the minimum?
- This could work for a local optimum, but what about finding the global optimum? (The generic iterative scheme behind these questions is sketched below.)
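For reference, the generic iterative scheme these questions refer to can be written as follows (a sketch in the slide's own symbols; the exact slide formula is assumed):

```latex
\mathbf{y}_{k+1} = \mathbf{y}_k + \beta_k\, \mathbf{p}_k, \qquad k = 0, 1, 2, \dots
```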
IF WE HAVE THE FUNCTION (AND ITS DERIVATIVES)
! " ≈ ! "% + '(!("%)*(" − "%)
- From the 1st-order Taylor series: equation of the tangent plane to 𝒈 at 𝒚0; the gradient vector is normal to the isocontour of 𝒈 passing through 𝒚0
- The partial derivatives determine the slope of the plane
- In each point 𝒚, the gradient vector is orthogonal to the isocontours {𝒚 : 𝒈(𝒚) = 𝑑}
- ⇒ The gradient vector points in the direction of maximal change of the function at that point; the magnitude of the gradient is the (maximal) velocity of change of the function
GRADIENTS AND RATE OF CHANGE
- The directional derivative of a function 𝒈: ℝⁿ ⟼ ℝ is the rate of change of the function along a given direction 𝒘 (where 𝒘 is a unit-norm vector)
- Partial derivatives are directional derivatives along the vectors of the canonical basis of ℝⁿ
- E.g., for a function of two variables and a direction 𝒘 = (𝑤x, 𝑤y), as written out below
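In standard textbook form (a reconstruction; the slide's own equations are assumed rather than copied):

```latex
D_{\mathbf{w}}\, g(\mathbf{y}) \;=\; \lim_{h \to 0} \frac{g(\mathbf{y} + h\,\mathbf{w}) - g(\mathbf{y})}{h},
\qquad \|\mathbf{w}\| = 1;
\qquad
D_{\mathbf{w}}\, g \;=\; \frac{\partial g}{\partial x}\, w_x + \frac{\partial g}{\partial y}\, w_y
\ \ \text{(two variables)}
```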
GRADIENTS AND RATE OF CHANGE
- Theorem: Given a direction 𝒘 and a differentiable function 𝒈, in each point 𝒚0 of the function's domain the following relation holds between the gradient vector and the directional derivative (see below)
- Corollary 1 and Corollary 2 (see below): the directional derivative is maximal only when the direction is parallel to the gradient (𝜃 = 0); that is, the directional derivative at a point gets its maximum value when the direction 𝒘 is the same as that of the gradient vector
- ⇒ In each point, the gradient vector corresponds to the direction of maximal change of the function
- ⇒ The norm of the gradient corresponds to the max velocity of change at that point
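The theorem and the two corollaries are stated here in their standard textbook form (a reconstruction consistent with the bullets above; the exact slide formulas are assumptions):

```latex
D_{\mathbf{w}}\, g(\mathbf{y}_0) \;=\; \nabla g(\mathbf{y}_0) \cdot \mathbf{w}
\;=\; \|\nabla g(\mathbf{y}_0)\| \cos\theta
\qquad (\|\mathbf{w}\| = 1,\ \theta = \text{angle between } \mathbf{w} \text{ and } \nabla g(\mathbf{y}_0))
```

```latex
|D_{\mathbf{w}}\, g(\mathbf{y}_0)| \;\le\; \|\nabla g(\mathbf{y}_0)\|,
\qquad
\max_{\|\mathbf{w}\|=1} D_{\mathbf{w}}\, g(\mathbf{y}_0) \;=\; \|\nabla g(\mathbf{y}_0)\|
\ \text{ for } \ \mathbf{w} = \frac{\nabla g(\mathbf{y}_0)}{\|\nabla g(\mathbf{y}_0)\|}
```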
GRADIENT DESCENT / ASCENT
- Move in the direction opposite to (for minimization) or aligned with (for maximization) the gradient vector, the direction of maximal change of the function
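In symbols, with a step size 𝛼k as in the algorithm a few slides below (a compact restatement, not the slide's own figure):

```latex
\text{descent (min):}\quad \mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k),
\qquad
\text{ascent (max):}\quad \mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \nabla f(\mathbf{x}_k)
```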
ONLY LOCAL OPTIMA
The final local optimum depends on where we start from (for non-convex functions)
~ MOTION IN A POTENTIAL FIELD
- A GD run is ~ the motion of a mass in a potential field towards the minimum-energy configuration: at each point the gradient defines the attraction force, while the step size 𝛼 scales the force to define the next point
GRADIENT DESCENT ALGORITHM (MIN)
1. Initialization
   (a) Definition of a starting point x0
   (b) Definition of a tolerance parameter for convergence ε
   (c) Initialization of the iteration variable, k ← 0
2. Computation of a feasible direction for moving
   dk ← −∇f(xk)
3. Definition of the feasible (max) length of the move
   αk ← arg minα f(xk + α dk)
   (a 1-dimensional problem in α ∈ ℝ: the farthest point up to which f(xk + α dk) keeps decreasing)
4. Move to the new point in the direction of gradient descent
   xk+1 ← xk + αk dk
5. Check for convergence
   If ‖αk dk‖ < ε [or, if αk ≤ c·α0, where c > 0 is a small constant] (i.e., the gradient becomes ~zero)
   (a) Output: x∗ = xk+1, f(x∗) = f(xk+1)
   (b) Stop
   Otherwise, k ← k + 1, and go to Step 2
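A runnable sketch of the algorithm above. The exact line search of Step 3 is replaced here by a simple backtracking search, and the test function is my own illustrative choice; both are assumptions, not part of the slides.

```python
import numpy as np

def gradient_descent(f, grad_f, x0, eps=1e-6, max_iter=1000):
    """Minimize f starting from x0, following the steps of the slide's algorithm."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        d = -grad_f(x)                                # Step 2: feasible (descent) direction
        # Step 3: choose the step length by backtracking (Armijo) line search
        alpha = 1.0
        while f(x + alpha * d) > f(x) - 0.5 * alpha * np.dot(d, d):
            alpha *= 0.5
            if alpha < 1e-12:
                break
        x_new = x + alpha * d                         # Step 4: move along the descent direction
        if np.linalg.norm(alpha * d) < eps:           # Step 5: convergence check
            return x_new, f(x_new)
        x = x_new
    return x, f(x)

# Example: minimize the anisotropic quadratic f(x) = x1^2 + 10*x2^2
f = lambda x: x[0]**2 + 10 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])
x_star, f_star = gradient_descent(f, grad_f, [3.0, 2.0])
```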
STEP BEHAVIOR
- If the gradient in 𝒚0 points towards the local minimum, then a line search would determine a step size 𝜷k that takes us directly to the minimum
- However, this lucky case only happens for perfectly conditioned functions, or for a restricted set of points
- It might be computationally heavy to solve a 1-D optimization problem at each step, so approximate methods can be preferred
- In the general case, consecutive moving directions 𝒆k are perpendicular to each other: 𝜷* minimizes 𝒈(𝒚k + 𝜷𝒆k), so d𝒈/d𝜷 must be zero at 𝜷* ⇒ the gradient at the new point is orthogonal to 𝒆k (see the derivation below)
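Spelling out the last bullet's implication (the standard exact line-search property; a reconstruction, since the slide's own equation is assumed):

```latex
0 \;=\; \left.\frac{d}{d\beta}\, g(\mathbf{y}_k + \beta\,\mathbf{e}_k)\right|_{\beta^*}
\;=\; \nabla g(\mathbf{y}_{k+1})^{T} \mathbf{e}_k,
\qquad
\mathbf{e}_{k+1} = -\nabla g(\mathbf{y}_{k+1}) \;\Rightarrow\; \mathbf{e}_{k+1} \perp \mathbf{e}_k
```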
ILL-CONDITIONED PROBLEMS
[Figures: ill-conditioned problem vs. well-conditioned problem]
- If the function is very anisotropic, the problem is said to be ill-conditioned: the gradient vector doesn't point in the direction of the local minimum, resulting in a zig-zagging trajectory
- Ill-conditioning can be quantified by computing the ratio between the largest and smallest eigenvalues of the Hessian matrix (the matrix of the second partial derivatives)
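A common way to quantify this (the standard definition, added here for reference) is the condition number of the Hessian:

```latex
\kappa(H) \;=\; \frac{\lambda_{\max}(H)}{\lambda_{\min}(H)}:
\qquad \kappa \approx 1 \ \text{well-conditioned},
\qquad \kappa \gg 1 \ \text{ill-conditioned (zig-zagging)}
```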
WHAT ABOUT A CONSTANT STEP SIZE?
[Figures: a small / good 𝛼 gives convergence; an 𝛼 that is too large gives divergence]
WHAT ABOUT A CONSTANT STEP SIZE?
- Adapting the step size may be necessary to avoid either too slow a progress or overshooting the target minimum
[Figures: 𝛼 too low gives slow convergence; 𝛼 too large gives divergence]
EXAMPLE: SUPERVISED MACHINE LEARNING
- Loss function: sum of squared errors over the training set: 𝐿(𝜽) = Σ𝑗=1..𝑛 (𝒚(𝑗) · 𝜽 − 𝑧(𝑗))²
- 𝑛 labeled training samples (examples)
- 𝑧(𝑗) = known correct value for sample 𝑗
- 𝒚(𝑗) · 𝜽 = linear hypothesis function, 𝜽 vector of parameters
- Goal: find the value of the parameter vector 𝜽 such that the loss (errors in classification / regression) is minimized (over the training set)
- Gradient: ∇𝐿(𝜽) = 2 Σ𝑗=1..𝑛 𝒚(𝑗) (𝒚(𝑗) · 𝜽 − 𝑧(𝑗))
- Gradient-descent update (the constant 2 absorbed into the step size 𝛼): 𝜽 ← 𝜽 − 𝛼 Σ𝑗=1..𝑛 𝒚(𝑗) (𝒚(𝑗) · 𝜽 − 𝑧(𝑗))
- If the averaging factor 1/(2𝑛) is used in the loss, then the update action becomes: 𝜽 ← 𝜽 − (𝛼/𝑛) Σ𝑗=1..𝑛 𝒚(𝑗) (𝒚(𝑗) · 𝜽 − 𝑧(𝑗))
- Any analogy with PSO? (A runnable sketch of this update follows below.)
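A small runnable sketch of batch gradient descent on this loss; the synthetic data, the function names, and the use of the 1/(2𝑛) averaging factor are my assumptions for illustration.

```python
import numpy as np

def squared_error_loss(theta, Y, z):
    """L(theta) = 1/(2n) * sum_j (y(j)·theta - z(j))^2 over the n training samples."""
    r = Y @ theta - z
    return r @ r / (2 * len(z))

def batch_gradient_descent(Y, z, alpha=0.1, iters=500):
    """Y: (n, d) matrix whose rows are the samples y(j); z: (n,) known correct values."""
    n, d = Y.shape
    theta = np.zeros(d)
    for _ in range(iters):
        grad = Y.T @ (Y @ theta - z) / n   # gradient of the 1/(2n)-averaged loss
        theta = theta - alpha * grad       # theta <- theta - (alpha/n) * sum_j y(j) (y(j)·theta - z(j))
    return theta

# Tiny synthetic example (illustrative): z = Y @ [2, -1] plus a little noise
rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 2))
z = Y @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=100)
theta_hat = batch_gradient_descent(Y, z)
```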
STOCHASTIC GRADIENT DESCENT
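For continuity with the previous example, the standard stochastic variant replaces the full sum with a single randomly drawn sample 𝑗 per update (standard form, assumed rather than copied from the slide):

```latex
\theta \;\leftarrow\; \theta - \alpha\, \mathbf{y}(j)\,\big(\mathbf{y}(j)\cdot\theta - z(j)\big),
\qquad j \sim \mathrm{Uniform}\{1, \dots, n\}
```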
ADAPTIVE STEP SIZE?
- The problem minα f(xk + α dk) can be tackled using any algorithm for one-dimensional search: Fibonacci, Golden section, Newton's method, Secant method, …
- If f ∈ C², it is possible to compute the Hessian Hk, and αk can be determined analytically:
  – Let yk = α dk be the (feasible) small displacement applied in xk, such that, from the Taylor series about xk truncated at the 2nd order (quadratic approximation): f(xk + yk) ≈ f(xk) + ykᵀrk + ½ ykᵀHk yk, where rk = ∇f(xk)
  – In the algorithm, dk is the direction of steepest descent, yk = −α rk, such that: f(xk − α rk) ≈ f(xk) − α rkᵀrk + ½ α² rkᵀHk rk
  – We are looking for the α minimizing f ⇒ the first-order condition df/dα = 0 must be satisfied: df(xk − α rk)/dα ≈ −rkᵀrk + α rkᵀHk rk = 0 ⇒ α = αk ≈ (rkᵀrk) / (rkᵀHk rk)
  – The updating rule of the gradient descent algorithm becomes: xk+1 ← xk − [(rkᵀrk) / (rkᵀHk rk)] rk
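A sketch of gradient descent with this analytically determined step; the quadratic test function and the names are illustrative assumptions.

```python
import numpy as np

def gd_hessian_step(grad_f, hess_f, x0, eps=1e-8, max_iter=100):
    """Steepest descent with alpha_k = (r^T r) / (r^T H r), where r = grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r = grad_f(x)
        if np.linalg.norm(r) < eps:
            break
        H = hess_f(x)
        alpha = (r @ r) / (r @ H @ r)   # exact minimizer along -r for a quadratic model
        x = x - alpha * r
    return x

# Example on the quadratic f(x) = 1/2 x^T H x with H = [[2, 1], [1, 2]] (the matrix of the numeric example below)
H = np.array([[2.0, 1.0], [1.0, 2.0]])
grad_f = lambda x: H @ x
hess_f = lambda x: H
x_min = gd_hessian_step(grad_f, hess_f, x0=[3.0, -1.0])   # converges towards the minimum at the origin
```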
CONDITIONS ON THE HESSIAN MATRIX
xk+1 ← xk − [(rkᵀrk) / (rkᵀHk rk)] rk
- The value found for αk is a sound estimate of the "correct" value of α as long as the quadratic approximation from the Taylor series is an accurate approximation of f(x)
- At the beginning of the iterations ‖yk‖ will be quite large, so the approximation will be inaccurate; getting closer to the minimum, ‖yk‖ will decrease accordingly, and the accuracy will keep increasing
- If f is a quadratic function, the Taylor series is exact and we can use = instead of ≈ ⇒ αk is exact at each iteration
- The Hessian matrix is the matrix of 2nd-order partial derivatives. E.g., for f : X ⊆ ℝ³ ↦ ℝ, the Hessian matrix computed in a point x∗ is:

  H|x∗ = [ ∂²f/∂x1²     ∂²f/∂x1∂x2   ∂²f/∂x1∂x3
           ∂²f/∂x1∂x2   ∂²f/∂x2²     ∂²f/∂x2∂x3
           ∂²f/∂x3∂x1   ∂²f/∂x3∂x2   ∂²f/∂x3²  ] |x∗
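As a concrete illustration (my own example, chosen so that it matches the matrix used in the numeric example a few slides below):

```latex
f(x_1, x_2) = x_1^2 + x_1 x_2 + x_2^2
\;\Rightarrow\;
H = \begin{pmatrix}
\partial^2 f / \partial x_1^2 & \partial^2 f / \partial x_1 \partial x_2 \\
\partial^2 f / \partial x_2 \partial x_1 & \partial^2 f / \partial x_2^2
\end{pmatrix}
= \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}
```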
PROPERTIES OF THE HESSIAN MATRIX
- Theorem (Quadratic form of a matrix): Given a square, symmetric matrix H ∈ ℝⁿˣⁿ, the associated quadratic form is defined as the function q(x) = ½ xᵀHx
[Figures: quadratic forms of a positive semi-definite, an indefinite (saddle), a positive definite (convex), and a negative definite (concave) matrix]
- Theorem (Schwarz): If a function is C² (twice differentiable with continuity) in ℝⁿ, the order of derivation is irrelevant ⇒ the Hessian matrix is symmetric
- The matrix H is said:
  – Positive definite if xᵀHx > 0, ∀x ∈ ℝⁿ, x ≠ 0
  – Positive semi-definite if xᵀHx ≥ 0, ∀x ∈ ℝⁿ
  – Negative definite if xᵀHx < 0, ∀x ∈ ℝⁿ, x ≠ 0
  – Negative semi-definite if xᵀHx ≤ 0, ∀x ∈ ℝⁿ
  – Indefinite if xᵀHx > 0 for some x and xᵀHx < 0 for other x
HESSIAN QUADRATIC FORM AND MIN / MAX
- Theorem (Use of the eigenvalues): A matrix H ∈ ℝⁿˣⁿ is positive definite (semi-definite) if and only if all its eigenvalues are positive (non-negative).
  Proof sketch: For a diagonal matrix D = diag(d1, …, dn) where each diagonal entry is positive, the theorem holds, since xᵀDx = Σᵢ dᵢxᵢ² > 0 (unless x = 0).
  Since H is real and symmetric (Hermitian), the spectral theorem says that it is diagonalizable using the eigenvector matrix E as an orthonormal basis for the transformation: E⁻¹HE = Λ, where the elements of the diagonal matrix Λ are the eigenvalues of H. Therefore, H is positive (negative) definite if all its eigenvalues are positive (negative).
- Theorem (Hessian and min/max points): Given x∗, a stationary point of f : X ⊆ ℝⁿ ↦ ℝ (i.e., a point where ∇f(x∗) = 0), and given f's Hessian matrix H(f) evaluated at x∗, the following conditions are sufficient to determine the nature of x∗ as an extremum of the function:
  1. If H is positive definite ⇒ x∗ is a minimum (local or global)
  2. If H is negative definite ⇒ x∗ is a maximum (local or global)
  3. If H has eigenvalues of opposite signs ⇒ x∗ is (in general) a saddle
- If H is semi-definite, positive or negative, the case is more complex to analyze…
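A small sketch applying this classification numerically; the test function and the tolerance are illustrative assumptions.

```python
import numpy as np

def classify_stationary_point(H, tol=1e-10):
    """Classify a stationary point from the eigenvalues of the Hessian evaluated there."""
    eigvals = np.linalg.eigvalsh(H)          # H is symmetric -> real eigenvalues
    if np.all(eigvals > tol):
        return "minimum (H positive definite)"
    if np.all(eigvals < -tol):
        return "maximum (H negative definite)"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle (H indefinite)"
    return "inconclusive (H semi-definite)"

# Example: f(x) = x1^2 - x2^2 has a stationary point at the origin with H = diag(2, -2)
print(classify_stationary_point(np.diag([2.0, -2.0])))   # -> saddle (H indefinite)
```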
CONVERGENCE SPEED OF GRADIENT DESCENT
- The previous theorem says that in the neighborhood of a local minimum any C² function is approximated by a positive definite quadratic form: there the function is strictly convex
- The Hessian matrix is symmetric → the quadratic form is also symmetric
- In 2D → the isocurves are ellipses:
  - axes are oriented as the eigenvectors of H
  - the length of each semi-axis is proportional to 1/√𝜆ᵢ (see the numeric example below)
- Convergence speed depends on the shape of the ellipses:
  - ~ spherical: fast convergence
  - unbalanced: slow convergence
- Convergence rate: linear (p = 1); for a quadratic function the contraction factor depends on the ratio between the largest and smallest eigenvalues of H (see below)
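A standard reference bound for gradient descent with exact line search on a quadratic whose Hessian has eigenvalues 𝜆min, 𝜆max (textbook result; the slide's own formula is not reproduced here and may differ in form):

```latex
f(\mathbf{x}_{k+1}) - f(\mathbf{x}^*) \;\le\;
\left(\frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}}\right)^{\!2}
\big(f(\mathbf{x}_k) - f(\mathbf{x}^*)\big)
```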
NUMERIC EXAMPLE
- H = ( 2  1 ; 1  2 ); from the characteristic equation det(H − λI) = 0 we get λ² − 4λ + 3 = 0 → λ1 = 1, λ2 = 3
- For the eigenvectors, Hx = λx:
  – λ1 = 1: Hx = x → (H − I)x = 0, which reduces to the equation x2 = −x1 → all of λ1's eigenvectors are of the form (1, −1)ᵀ x1
  – λ2 = 3: Hx = 3x → (H − 3I)x = 0, which reduces to the equation x2 = x1 → all of λ2's eigenvectors are of the form (1, 1)ᵀ x1
- There are two directions defined by the eigenvectors: x2 = −x1 and x2 = x1
- Isolines of the quadratic form: xᵀHx = k → 2x1² + 2x1x2 + 2x2² = k
- We can transform it into canonical form in the eigenvector basis → λ1X² + λ2Y² = k, which is an ellipse oriented in the direction of the eigenvectors, with semi-axes of length √(k/λ1) and √(k/λ2)
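The same numbers can be checked in a few lines (a verification sketch, not part of the slides):

```python
import numpy as np

H = np.array([[2.0, 1.0], [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(H)     # symmetric matrix -> use eigh
print(eigvals)                           # [1. 3.]
print(eigvecs)                           # columns ~ (1, -1)/sqrt(2) and (1, 1)/sqrt(2), up to sign

# Semi-axes of the isoline x^T H x = k, for k = 1: sqrt(k / lambda_i)
k = 1.0
print(np.sqrt(k / eigvals))              # [1.0, 0.577...]
```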
STEP-SIZE WITHOUT THE HESSIAN MATRIX
- Let's assume to have a local estimate α̂ of the values αk, and let's denote f(xk) as fk and f(xk − α̂ rk) as f̂k. From the previous 2nd-order truncated Taylor series:
  f̂k = f(xk − α̂ rk) ≈ fk − α̂ rkᵀrk + ½ α̂² rkᵀHk rk
  ⇒ rkᵀHk rk ≈ 2(f̂k − fk + α̂ rkᵀrk) / α̂²
  ⇒ αk ≈ (rkᵀrk α̂²) / [2(f̂k − fk + α̂ rkᵀrk)]
  ⇒ xk+1 ← xk − αk rk
- A reasonable choice for α̂ is αk−1: the (optimal) value taken by α in the previous iteration, starting with α0 = 1
- Also in this case the approximation is more or less good depending on how accurately the 2nd-order Taylor series approximates f(x)
(A code sketch of this rule follows below.)
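A sketch of this Hessian-free step-size rule; the test function, the starting point, and the safeguard on the denominator are illustrative assumptions, while α̂ is reused from the previous iteration with α̂ = 1 at the start, as suggested above.

```python
import numpy as np

def gd_hessian_free_step(f, grad_f, x0, eps=1e-8, max_iter=200):
    """Gradient descent where r^T H r is estimated from one extra function evaluation."""
    x = np.asarray(x0, dtype=float)
    alpha_hat = 1.0                      # initial local estimate of the step size
    for _ in range(max_iter):
        r = grad_f(x)
        if np.linalg.norm(r) < eps:
            break
        fk = f(x)
        fk_hat = f(x - alpha_hat * r)    # probe point f(x_k - alpha_hat * r_k)
        denom = 2.0 * (fk_hat - fk + alpha_hat * (r @ r))
        if abs(denom) < 1e-16:           # curvature estimate ~ 0: fall back to alpha_hat
            alpha = alpha_hat
        else:
            alpha = (r @ r) * alpha_hat**2 / denom
        x = x - alpha * r
        alpha_hat = alpha                # reuse this alpha as the next local estimate
    return x

# Example on the quadratic f(x) = x1^2 + x1*x2 + x2^2 (same Hessian as the numeric example)
f = lambda x: x[0]**2 + x[0]*x[1] + x[1]**2
grad_f = lambda x: np.array([2*x[0] + x[1], x[0] + 2*x[1]])
x_min = gd_hessian_free_step(f, grad_f, [3.0, -1.0])
```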
Moral: Isn't it easier (more effective / general) to use and play with PSO?