LECTURE 17: SWARM INTELLIGENCE 3 / CLASSICAL OPTIMIZATION I
15-382 COLLECTIVE INTELLIGENCE - S18
LECTURE 17: SWARM INTELLIGENCE 3 / CLASSICAL OPTIMIZATION I
INSTRUCTOR: GIANNI A. DI CARO
PSO: SWARM COOPERATION + MEMORY + INERTIA
[Figure: PSO velocity update, combining the individual term (personal-best memory), the swarm / neighborhood term (cooperation), and inertia]
- Decision-making / search strategy for an individual particle / agent (a code sketch of the per-particle update follows below)
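A minimal sketch of the per-particle PSO update the slide refers to; the variable names and coefficient values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def pso_step(x, v, p_best, n_best, w=0.7, c1=1.5, c2=1.5, rng=np.random):
    """One PSO update for a single particle.

    x      : current position of the particle
    v      : current velocity (inertia / memory of motion)
    p_best : best position found by this particle (individual memory)
    n_best : best position found in its swarm / neighborhood (cooperation)
    w, c1, c2 : inertia and acceleration coefficients (illustrative values)
    """
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v_new = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (n_best - x)
    x_new = x + v_new
    return x_new, v_new
```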
A TYPICAL SWARM-LEVEL BEHAVIOR…
WHAT IF WE HAVE ONE SINGLE AGENT…
- PSO leverages the presence of a swarm: the outcome is truly a collective behavior
- If left alone, each individual agent would behave like a hill-climber: it moves in the direction of a local optimum and then has quite a hard time escaping it
A single agent doesn’t look impressive 🤕 … How can a single agent be smarter?
A (FIRST) GENERAL APPROACH
- What is a good direction 𝒑?
- How small / large should the step size 𝜷k be, and should it be constant or variable?
- How do we check that we are at the minimum?
- This could work for a local optimum, but what about finding the global optimum? (The generic iterative scheme behind these questions is sketched below.)
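For reference, the generic iterative scheme these questions refer to can be written as follows (a sketch in the slide's own symbols; the exact slide formula is assumed):

```latex
\mathbf{y}_{k+1} = \mathbf{y}_k + \beta_k\, \mathbf{p}_k, \qquad k = 0, 1, 2, \dots
```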
IF WE HAVE THE FUNCTION (AND ITS DERIVATIVES)
! " ≈ ! "% + '(!("%)*(" − "%)
- From the 1st-order Taylor series: equation of the tangent plane to 𝒈 at 𝒚0; the gradient vector is normal to the isocontour of 𝒈 passing through 𝒚0
- The partial derivatives determine the slope of the plane
- In each point 𝒚, the gradient vector is orthogonal to the isocontours {𝒚 : 𝒈(𝒚) = 𝑑}
- ⇒ The gradient vector points in the direction of maximal change of the function at that point; the magnitude of the gradient is the (maximal) velocity of change of the function
GRADIENTS AND RATE OF CHANGE
- The directional derivative of a function 𝒈: ℝⁿ ⟼ ℝ is the rate of change of the function along a given direction 𝒘 (where 𝒘 is a unit-norm vector)
- Partial derivatives are directional derivatives along the vectors of the canonical basis of ℝⁿ
- E.g., for a function of two variables and a direction 𝒘 = (𝑤x, 𝑤y), as written out below
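In standard textbook form (a reconstruction; the slide's own equations are assumed rather than copied):

```latex
D_{\mathbf{w}}\, g(\mathbf{y}) \;=\; \lim_{h \to 0} \frac{g(\mathbf{y} + h\,\mathbf{w}) - g(\mathbf{y})}{h},
\qquad \|\mathbf{w}\| = 1;
\qquad
D_{\mathbf{w}}\, g \;=\; \frac{\partial g}{\partial x}\, w_x + \frac{\partial g}{\partial y}\, w_y
\ \ \text{(two variables)}
```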
GRADIENTS AND RATE OF CHANGE
- Theorem: Given a direction 𝒘 and a differentiable function 𝒈, in each point 𝒚0 of the function's domain the following relation holds between the gradient vector and the directional derivative (see below)
- Corollary 1 and Corollary 2 (see below): the directional derivative is maximal only when the direction is parallel to the gradient (𝜃 = 0); that is, the directional derivative at a point gets its maximum value when the direction 𝒘 is the same as that of the gradient vector
- ⇒ In each point, the gradient vector corresponds to the direction of maximal change of the function
- ⇒ The norm of the gradient corresponds to the max velocity of change at that point
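The theorem and the two corollaries are stated here in their standard textbook form (a reconstruction consistent with the bullets above; the exact slide formulas are assumptions):

```latex
D_{\mathbf{w}}\, g(\mathbf{y}_0) \;=\; \nabla g(\mathbf{y}_0) \cdot \mathbf{w}
\;=\; \|\nabla g(\mathbf{y}_0)\| \cos\theta
\qquad (\|\mathbf{w}\| = 1,\ \theta = \text{angle between } \mathbf{w} \text{ and } \nabla g(\mathbf{y}_0))
```

```latex
|D_{\mathbf{w}}\, g(\mathbf{y}_0)| \;\le\; \|\nabla g(\mathbf{y}_0)\|,
\qquad
\max_{\|\mathbf{w}\|=1} D_{\mathbf{w}}\, g(\mathbf{y}_0) \;=\; \|\nabla g(\mathbf{y}_0)\|
\ \text{ for } \ \mathbf{w} = \frac{\nabla g(\mathbf{y}_0)}{\|\nabla g(\mathbf{y}_0)\|}
```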
GRADIENT DESCENT / ASCENT
- Move in the direction opposite to (for minimization) or aligned with (for maximization) the gradient vector, the direction of maximal change of the function
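In symbols, with a step size 𝛼k as in the algorithm a few slides below (a compact restatement, not the slide's own figure):

```latex
\text{descent (min):}\quad \mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \nabla f(\mathbf{x}_k),
\qquad
\text{ascent (max):}\quad \mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \nabla f(\mathbf{x}_k)
```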
ONLY LOCAL OPTIMA
The final local optimum depends on where we start from (for non-convex functions)
~ MOTION IN A POTENTIAL FIELD
- A GD run is ~ the motion of a mass in a potential field towards the minimum-energy configuration: at each point the gradient defines the attraction force, while the step size 𝛼 scales the force to define the next point
GRADIENT DESCENT ALGORITHM (MIN)
1. Initialization
   (a) Definition of a starting point x0
   (b) Definition of a tolerance parameter for convergence ε
   (c) Initialization of the iteration variable, k ← 0
2. Computation of a feasible direction for moving
   dk ← −∇f(xk)
3. Definition of the feasible (max) length of the move
   αk ← arg minα f(xk + α dk)
   (a 1-dimensional problem in α ∈ ℝ: the farthest point up to which f(xk + α dk) keeps decreasing)
4. Move to the new point in the direction of gradient descent
   xk+1 ← xk + αk dk
5. Check for convergence
   If ‖αk dk‖ < ε [or, if αk ≤ c·α0, where c > 0 is a small constant] (i.e., the gradient becomes ~zero)
   (a) Output: x∗ = xk+1, f(x∗) = f(xk+1)
   (b) Stop
   Otherwise, k ← k + 1, and go to Step 2
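A runnable sketch of the algorithm above. The exact line search of Step 3 is replaced here by a simple backtracking search, and the test function is my own illustrative choice; both are assumptions, not part of the slides.

```python
import numpy as np

def gradient_descent(f, grad_f, x0, eps=1e-6, max_iter=1000):
    """Minimize f starting from x0, following the steps of the slide's algorithm."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        d = -grad_f(x)                                # Step 2: feasible (descent) direction
        # Step 3: choose the step length by backtracking (Armijo) line search
        alpha = 1.0
        while f(x + alpha * d) > f(x) - 0.5 * alpha * np.dot(d, d):
            alpha *= 0.5
            if alpha < 1e-12:
                break
        x_new = x + alpha * d                         # Step 4: move along the descent direction
        if np.linalg.norm(alpha * d) < eps:           # Step 5: convergence check
            return x_new, f(x_new)
        x = x_new
    return x, f(x)

# Example: minimize the anisotropic quadratic f(x) = x1^2 + 10*x2^2
f = lambda x: x[0]**2 + 10 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])
x_star, f_star = gradient_descent(f, grad_f, [3.0, 2.0])
```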
STEP BEHAVIOR
- If the gradient in 𝒚0 points towards the local minimum, then a line search would determine a step size 𝜷k that takes us directly to the minimum
- However, this lucky case only happens for perfectly conditioned functions, or for a restricted set of points
- It might be computationally heavy to solve a 1-D optimization problem at each step, so approximate methods can be preferred
- In the general case, consecutive moving directions 𝒆k are perpendicular to each other: 𝜷* minimizes 𝒈(𝒚k + 𝜷𝒆k), so d𝒈/d𝜷 must be zero at 𝜷* ⇒ the gradient at the new point is orthogonal to 𝒆k (see the derivation below)
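Spelling out the last bullet's implication (the standard exact line-search property; a reconstruction, since the slide's own equation is assumed):

```latex
0 \;=\; \left.\frac{d}{d\beta}\, g(\mathbf{y}_k + \beta\,\mathbf{e}_k)\right|_{\beta^*}
\;=\; \nabla g(\mathbf{y}_{k+1})^{T} \mathbf{e}_k,
\qquad
\mathbf{e}_{k+1} = -\nabla g(\mathbf{y}_{k+1}) \;\Rightarrow\; \mathbf{e}_{k+1} \perp \mathbf{e}_k
```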
ILL-CONDITIONED PROBLEMS
[Figures: ill-conditioned problem vs. well-conditioned problem]
- If the function is very anisotropic, the problem is said to be ill-conditioned: the gradient vector doesn't point in the direction of the local minimum, resulting in a zig-zagging trajectory
- Ill-conditioning can be quantified by computing the ratio between the largest and smallest eigenvalues of the Hessian matrix (the matrix of the second partial derivatives)
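A common way to quantify this (the standard definition, added here for reference) is the condition number of the Hessian:

```latex
\kappa(H) \;=\; \frac{\lambda_{\max}(H)}{\lambda_{\min}(H)}:
\qquad \kappa \approx 1 \ \text{well-conditioned},
\qquad \kappa \gg 1 \ \text{ill-conditioned (zig-zagging)}
```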
WHAT ABOUT A CONSTANT STEP SIZE?
[Figures: a small / good 𝛼 gives convergence; an 𝛼 that is too large gives divergence]
WHAT ABOUT A CONSTANT STEP SIZE?
- Adapting the step size may be necessary to avoid either too slow a progress or overshooting the target minimum
[Figures: 𝛼 too low gives slow convergence; 𝛼 too large gives divergence]
EXAMPLE: SUPERVISED MACHINE LEARNING
- Loss function: sum of squared errors over the training set: 𝐿(𝜽) = Σ𝑗=1..𝑛 (𝒚(𝑗) · 𝜽 − 𝑧(𝑗))²
- 𝑛 labeled training samples (examples)
- 𝑧(𝑗) = known correct value for sample 𝑗
- 𝒚(𝑗) · 𝜽 = linear hypothesis function, 𝜽 vector of parameters
- Goal: find the value of the parameter vector 𝜽 such that the loss (errors in classification / regression) is minimized (over the training set)
- Gradient: ∇𝐿(𝜽) = 2 Σ𝑗=1..𝑛 𝒚(𝑗) (𝒚(𝑗) · 𝜽 − 𝑧(𝑗))
- Gradient-descent update (the constant 2 absorbed into the step size 𝛼): 𝜽 ← 𝜽 − 𝛼 Σ𝑗=1..𝑛 𝒚(𝑗) (𝒚(𝑗) · 𝜽 − 𝑧(𝑗))
- If the averaging factor 1/(2𝑛) is used in the loss, then the update action becomes: 𝜽 ← 𝜽 − (𝛼/𝑛) Σ𝑗=1..𝑛 𝒚(𝑗) (𝒚(𝑗) · 𝜽 − 𝑧(𝑗))
- Any analogy with PSO? (A runnable sketch of this update follows below.)
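A small runnable sketch of batch gradient descent on this loss; the synthetic data, the function names, and the use of the 1/(2𝑛) averaging factor are my assumptions for illustration.

```python
import numpy as np

def squared_error_loss(theta, Y, z):
    """L(theta) = 1/(2n) * sum_j (y(j)·theta - z(j))^2 over the n training samples."""
    r = Y @ theta - z
    return r @ r / (2 * len(z))

def batch_gradient_descent(Y, z, alpha=0.1, iters=500):
    """Y: (n, d) matrix whose rows are the samples y(j); z: (n,) known correct values."""
    n, d = Y.shape
    theta = np.zeros(d)
    for _ in range(iters):
        grad = Y.T @ (Y @ theta - z) / n   # gradient of the 1/(2n)-averaged loss
        theta = theta - alpha * grad       # theta <- theta - (alpha/n) * sum_j y(j) (y(j)·theta - z(j))
    return theta

# Tiny synthetic example (illustrative): z = Y @ [2, -1] plus a little noise
rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 2))
z = Y @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=100)
theta_hat = batch_gradient_descent(Y, z)
```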
STOCHASTIC GRADIENT DESCENT
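For continuity with the previous example, the standard stochastic variant replaces the full sum with a single randomly drawn sample 𝑗 per update (standard form, assumed rather than copied from the slide):

```latex
\theta \;\leftarrow\; \theta - \alpha\, \mathbf{y}(j)\,\big(\mathbf{y}(j)\cdot\theta - z(j)\big),
\qquad j \sim \mathrm{Uniform}\{1, \dots, n\}
```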
ADAPTIVE STEP SIZE?
- The problem minα f(xk + α dk) can be tackled using any algorithm for one-dimensional search: Fibonacci, Golden section, Newton's method, Secant method, …
- If f ∈ C², it is possible to compute the Hessian Hk, and αk can be determined analytically:
  – Let yk = α dk be the (feasible) small displacement applied in xk, such that, from the Taylor series about xk truncated at the 2nd order (quadratic approximation): f(xk + yk) ≈ f(xk) + ykᵀrk + ½ ykᵀHk yk, where rk = ∇f(xk)
  – In the algorithm, dk is the direction of steepest descent, yk = −α rk, such that: f(xk − α rk) ≈ f(xk) − α rkᵀrk + ½ α² rkᵀHk rk
  – We are looking for the α minimizing f ⇒ the first-order condition df/dα = 0 must be satisfied: df(xk − α rk)/dα ≈ −rkᵀrk + α rkᵀHk rk = 0 ⇒ α = αk ≈ (rkᵀrk) / (rkᵀHk rk)
  – The updating rule of the gradient descent algorithm becomes: xk+1 ← xk − [(rkᵀrk) / (rkᵀHk rk)] rk
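A sketch of gradient descent with this analytically determined step; the quadratic test function and the names are illustrative assumptions.

```python
import numpy as np

def gd_hessian_step(grad_f, hess_f, x0, eps=1e-8, max_iter=100):
    """Steepest descent with alpha_k = (r^T r) / (r^T H r), where r = grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r = grad_f(x)
        if np.linalg.norm(r) < eps:
            break
        H = hess_f(x)
        alpha = (r @ r) / (r @ H @ r)   # exact minimizer along -r for a quadratic model
        x = x - alpha * r
    return x

# Example on the quadratic f(x) = 1/2 x^T H x with H = [[2, 1], [1, 2]] (the matrix of the numeric example below)
H = np.array([[2.0, 1.0], [1.0, 2.0]])
grad_f = lambda x: H @ x
hess_f = lambda x: H
x_min = gd_hessian_step(grad_f, hess_f, x0=[3.0, -1.0])   # converges towards the minimum at the origin
```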
CONDITIONS ON THE HESSIAN MATRIX
xk+1 ← xk − [(rkᵀrk) / (rkᵀHk rk)] rk
- The value found for αk is a sound estimate of the "correct" value of α as long as the quadratic approximation from the Taylor series is an accurate approximation of f(x)
- At the beginning of the iterations ‖yk‖ will be quite large, so the approximation will be inaccurate; getting closer to the minimum, ‖yk‖ will decrease accordingly, and the accuracy will keep increasing
- If f is a quadratic function, the Taylor series is exact and we can use = instead of ≈ ⇒ αk is exact at each iteration
- The Hessian matrix is the matrix of 2nd-order partial derivatives. E.g., for f : X ⊆ ℝ³ ↦ ℝ, the Hessian matrix computed in a point x∗ is:

  H|x∗ = [ ∂²f/∂x1²     ∂²f/∂x1∂x2   ∂²f/∂x1∂x3
           ∂²f/∂x1∂x2   ∂²f/∂x2²     ∂²f/∂x2∂x3
           ∂²f/∂x3∂x1   ∂²f/∂x3∂x2   ∂²f/∂x3²  ] |x∗
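As a concrete illustration (my own example, chosen so that it matches the matrix used in the numeric example a few slides below):

```latex
f(x_1, x_2) = x_1^2 + x_1 x_2 + x_2^2
\;\Rightarrow\;
H = \begin{pmatrix}
\partial^2 f / \partial x_1^2 & \partial^2 f / \partial x_1 \partial x_2 \\
\partial^2 f / \partial x_2 \partial x_1 & \partial^2 f / \partial x_2^2
\end{pmatrix}
= \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}
```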
PROPERTIES OF THE HESSIAN MATRIX
- Theorem (Quadratic form of a matrix): Given a square, symmetric matrix H ∈ ℝⁿˣⁿ, the associated quadratic form is defined as the function q(x) = ½ xᵀHx
[Figures: quadratic forms of a positive semi-definite, an indefinite (saddle), a positive definite (convex), and a negative definite (concave) matrix]
- Theorem (Schwarz): If a function is C² (twice differentiable with continuity) in ℝⁿ, the order of derivation is irrelevant ⇒ the Hessian matrix is symmetric
- The matrix H is said:
  – Positive definite if xᵀHx > 0, ∀x ∈ ℝⁿ, x ≠ 0
  – Positive semi-definite if xᵀHx ≥ 0, ∀x ∈ ℝⁿ
  – Negative definite if xᵀHx < 0, ∀x ∈ ℝⁿ, x ≠ 0
  – Negative semi-definite if xᵀHx ≤ 0, ∀x ∈ ℝⁿ
  – Indefinite if xᵀHx > 0 for some x and xᵀHx < 0 for other x
HESSIAN QUADRATIC FORM AND MIN / MAX
- Theorem (Use of the eigenvalues): A matrix H ∈ ℝⁿˣⁿ is positive definite (semi-definite) if and only if all its eigenvalues are positive (non-negative).
  Proof sketch: For a diagonal matrix D = diag(d1, …, dn) where each diagonal entry is positive, the theorem holds, since xᵀDx = Σᵢ dᵢxᵢ² > 0 (unless x = 0).
  Since H is real and symmetric (Hermitian), the spectral theorem says that it is diagonalizable using the eigenvector matrix E as an orthonormal basis for the transformation: E⁻¹HE = Λ, where the elements of the diagonal matrix Λ are the eigenvalues of H. Therefore, H is positive (negative) definite if all its eigenvalues are positive (negative).
- Theorem (Hessian and min/max points): Given x∗, a stationary point of f : X ⊆ ℝⁿ ↦ ℝ (i.e., a point where ∇f(x∗) = 0), and given f's Hessian matrix H(f) evaluated at x∗, the following conditions are sufficient to determine the nature of x∗ as an extremum of the function:
  1. If H is positive definite ⇒ x∗ is a minimum (local or global)
  2. If H is negative definite ⇒ x∗ is a maximum (local or global)
  3. If H has eigenvalues of opposite signs ⇒ x∗ is (in general) a saddle
- If H is semi-definite, positive or negative, the case is more complex to analyze…
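A small sketch applying this classification numerically; the test function and the tolerance are illustrative assumptions.

```python
import numpy as np

def classify_stationary_point(H, tol=1e-10):
    """Classify a stationary point from the eigenvalues of the Hessian evaluated there."""
    eigvals = np.linalg.eigvalsh(H)          # H is symmetric -> real eigenvalues
    if np.all(eigvals > tol):
        return "minimum (H positive definite)"
    if np.all(eigvals < -tol):
        return "maximum (H negative definite)"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle (H indefinite)"
    return "inconclusive (H semi-definite)"

# Example: f(x) = x1^2 - x2^2 has a stationary point at the origin with H = diag(2, -2)
print(classify_stationary_point(np.diag([2.0, -2.0])))   # -> saddle (H indefinite)
```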
CONVERGENCE SPEED OF GRADIENT DESCENT
- The previous theorem says that in the neighborhood of a local minimum any C² function is approximated by a positive definite quadratic form: there the function is strictly convex
- The Hessian matrix is symmetric → the quadratic form is also symmetric
- In 2D → the isocurves are ellipses:
  - axes are oriented as the eigenvectors of H
  - the length of each semi-axis is proportional to 1/√𝜆ᵢ (see the numeric example below)
- Convergence speed depends on the shape of the ellipses:
  - ~ spherical: fast convergence
  - unbalanced: slow convergence
- Convergence rate: linear (p = 1); for a quadratic function the contraction factor depends on the ratio between the largest and smallest eigenvalues of H (see below)
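A standard reference bound for gradient descent with exact line search on a quadratic whose Hessian has eigenvalues 𝜆min, 𝜆max (textbook result; the slide's own formula is not reproduced here and may differ in form):

```latex
f(\mathbf{x}_{k+1}) - f(\mathbf{x}^*) \;\le\;
\left(\frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}}\right)^{\!2}
\big(f(\mathbf{x}_k) - f(\mathbf{x}^*)\big)
```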
NUMERIC EXAMPLE
- H = ( 2  1 ; 1  2 ); from the characteristic equation det(H − λI) = 0 we get λ² − 4λ + 3 = 0 → λ1 = 1, λ2 = 3
- For the eigenvectors, Hx = λx:
  – λ1 = 1: Hx = x → (H − I)x = 0, which reduces to the equation x2 = −x1 → all of λ1's eigenvectors are of the form (1, −1)ᵀ x1
  – λ2 = 3: Hx = 3x → (H − 3I)x = 0, which reduces to the equation x2 = x1 → all of λ2's eigenvectors are of the form (1, 1)ᵀ x1
- There are two directions defined by the eigenvectors: x2 = −x1 and x2 = x1
- Isolines of the quadratic form: xᵀHx = k → 2x1² + 2x1x2 + 2x2² = k
- We can transform it into canonical form in the eigenvector basis → λ1X² + λ2Y² = k, which is an ellipse oriented in the direction of the eigenvectors, with semi-axes of length √(k/λ1) and √(k/λ2)
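The same numbers can be checked in a few lines (a verification sketch, not part of the slides):

```python
import numpy as np

H = np.array([[2.0, 1.0], [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(H)     # symmetric matrix -> use eigh
print(eigvals)                           # [1. 3.]
print(eigvecs)                           # columns ~ (1, -1)/sqrt(2) and (1, 1)/sqrt(2), up to sign

# Semi-axes of the isoline x^T H x = k, for k = 1: sqrt(k / lambda_i)
k = 1.0
print(np.sqrt(k / eigvals))              # [1.0, 0.577...]
```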
STEP-SIZE WITHOUT THE HESSIAN MATRIX
- Let's assume to have a local estimate α̂ of the values αk, and let's denote f(xk) as fk and f(xk − α̂ rk) as f̂k. From the previous 2nd-order truncated Taylor series:
  f̂k = f(xk − α̂ rk) ≈ fk − α̂ rkᵀrk + ½ α̂² rkᵀHk rk
  ⇒ rkᵀHk rk ≈ 2(f̂k − fk + α̂ rkᵀrk) / α̂²
  ⇒ αk ≈ (rkᵀrk α̂²) / [2(f̂k − fk + α̂ rkᵀrk)]
  ⇒ xk+1 ← xk − αk rk
- A reasonable choice for α̂ is αk−1: the (optimal) value taken by α in the previous iteration, starting with α0 = 1
- Also in this case the approximation is more or less good depending on how accurately the 2nd-order Taylor series approximates f(x)
(A code sketch of this rule follows below.)
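A sketch of this Hessian-free step-size rule; the test function, the starting point, and the safeguard on the denominator are illustrative assumptions, while α̂ is reused from the previous iteration with α̂ = 1 at the start, as suggested above.

```python
import numpy as np

def gd_hessian_free_step(f, grad_f, x0, eps=1e-8, max_iter=200):
    """Gradient descent where r^T H r is estimated from one extra function evaluation."""
    x = np.asarray(x0, dtype=float)
    alpha_hat = 1.0                      # initial local estimate of the step size
    for _ in range(max_iter):
        r = grad_f(x)
        if np.linalg.norm(r) < eps:
            break
        fk = f(x)
        fk_hat = f(x - alpha_hat * r)    # probe point f(x_k - alpha_hat * r_k)
        denom = 2.0 * (fk_hat - fk + alpha_hat * (r @ r))
        if abs(denom) < 1e-16:           # curvature estimate ~ 0: fall back to alpha_hat
            alpha = alpha_hat
        else:
            alpha = (r @ r) * alpha_hat**2 / denom
        x = x - alpha * r
        alpha_hat = alpha                # reuse this alpha as the next local estimate
    return x

# Example on the quadratic f(x) = x1^2 + x1*x2 + x2^2 (same Hessian as the numeric example)
f = lambda x: x[0]**2 + x[0]*x[1] + x[1]**2
grad_f = lambda x: np.array([2*x[0] + x[1], x[0] + 2*x[1]])
x_min = gd_hessian_free_step(f, grad_f, [3.0, -1.0])
```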
Moral: Isn't it easier (more effective / general) to use and play with PSO?