Conjugate Gradient (CG)
Majid Lesani, Alireza Masoum
Overview
- Backpropagation
- Gradient Descent
- Quadratic Forms
- Gradient Descent in Quadratic Forms
- Eigenvectors and Eigenvalues
- Gradient Descent Convergence
- Conjugate Gradient
Backpropagation
Abstraction: the generalization problem
- Heuristic features
- Small networks
- Early stopping
- Regularization

Search: the convergence problem
Gradient Descent (or Steepest Descent)
$$\nabla f(x, y) = \left( \frac{\partial f(x, y)}{\partial x}, \; \frac{\partial f(x, y)}{\partial y} \right)$$
Faster Training

Gradient descent modifications:
- Gradient Descent BP with Momentum
- Variable Learning Rate BP

Numerical optimization techniques:
- Conjugate Gradient BP
- Quasi-Newton BP
Gradient Descent
The problem is choosing the step size
Gradient Descent: Choosing the Best Step Size
With the update $x_{i+1} = x_i + \alpha_i r_i$ along the direction $r_i = -\nabla f(x_i)$, choose $\alpha_i$ where $f(x_{i+1})$ is minimum. By the chain rule,

$$\frac{\partial f(x_{i+1})}{\partial \alpha_i} = \nabla f(x_{i+1})^T \frac{\partial x_{i+1}}{\partial \alpha_i} = \nabla f(x_{i+1})^T r_i = 0 \quad\Rightarrow\quad r_{i+1}^T r_i = 0.$$

So the step size is optimal exactly when the new gradient is orthogonal to the current search direction.
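A minimal numerical check of this condition (illustrative values, anticipating the quadratic form $f(x) = \tfrac{1}{2} x^T A x - b^T x$ introduced on the next slide):

```python
import numpy as np

# Illustrative symmetric positive-definite system (assumed values).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

x = np.zeros(2)
r = b - A @ x                      # residual r_i = -grad f(x_i)
alpha = (r @ r) / (r @ (A @ r))    # the optimal step derived above
x_next = x + alpha * r
r_next = b - A @ x_next

# With the optimal step, the new residual is orthogonal to the old direction.
print(r_next @ r)                  # ~0 (up to floating-point error)
```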
Quadratic Forms

Our goal is to minimize the quadratic function

$$f(x) = \tfrac{1}{2} x^T A x - b^T x + c,$$

where $A$ is positive definite (for every nonzero vector $v$, $v^T A v > 0$).
Quadratic Forms

For a symmetric positive-definite $A$, the quadratic form $f$ has a global minimum where its gradient is zero, so solving $Ax = b$ is equivalent to minimizing

$$f(x) = \tfrac{1}{2} x^T A x - b^T x + c, \qquad \nabla f(x) = Ax - b.$$
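A minimal sketch checking this equivalence, assuming small illustrative values for $A$, $b$, and $c$:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, -1.0])
c = 0.0

def f(x):
    return 0.5 * x @ A @ x - b @ x + c

x_star = np.linalg.solve(A, b)            # solves Ax = b
print(np.allclose(A @ x_star - b, 0))     # True: the gradient vanishes at x*
print(f(x_star) < f(x_star + 0.1))        # True: x* is lower than a nearby point
```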
Gradient Descent for Quadratic Forms

Steepest descent for the quadratic form steps along the residual $r_i = b - A x_i = -\nabla f(x_i)$:

$$x_{i+1} = x_i + \alpha_i r_i, \qquad \alpha_i = \frac{r_i^T r_i}{r_i^T A r_i}.$$
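A sketch of the full iteration (hypothetical tolerance, iteration cap, and test system):

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=1000):
    """Minimize f(x) = 0.5 x^T A x - b^T x by steepest descent
    with the exact line search derived above."""
    x = x0.astype(float)
    for _ in range(max_iter):
        r = b - A @ x                      # r_i = -grad f(x_i)
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (r @ (A @ r))    # optimal step size
        x = x + alpha * r
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
print(steepest_descent(A, b, np.zeros(2)))  # matches np.linalg.solve(A, b)
```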
Eigenvectors and Eigenvalues

An eigenvector of a matrix $A$ is a nonzero vector that does not rotate when $A$ is applied to it; it is only scaled by a constant, its eigenvalue.

Every symmetric $n \times n$ matrix has $n$ orthogonal eigenvectors, each with an associated eigenvalue.
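A quick NumPy check of both claims (illustrative matrix):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])        # symmetric
vals, vecs = np.linalg.eigh(A)                # columns of vecs are eigenvectors

v = vecs[:, 0]
print(np.allclose(A @ v, vals[0] * v))        # True: A only scales v
print(np.allclose(vecs.T @ vecs, np.eye(2)))  # True: eigenvectors orthonormal
```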
Using Eigenvectors

Think of a vector as a sum of other vectors whose behavior is understood.
Using Eigenvectors

A positive-definite matrix is one whose eigenvalues are all positive.

The eigenvectors are the axes of the rotated ellipse formed by the contours of $f$, and each axis length corresponds to its eigenvalue.
General Convergence of Steepest Descent

Convergence is governed by the relation between the eigenvalues of $A$ and the eigenvector components of the error.

Fast convergence: if all eigenvalues are equal, the contours are spherical and steepest descent converges in a single step.

Poor convergence: the eigenvalues differ widely and the error has large components along the eigenvectors with the smaller eigenvalues.
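A sketch illustrating the contrast on two hypothetical diagonal systems (equal eigenvalues vs. a wide spread):

```python
import numpy as np

def sd_steps(A, b, tol=1e-8, max_iter=100000):
    """Count steepest-descent iterations until the residual is small."""
    x = np.zeros_like(b)
    for k in range(max_iter):
        r = b - A @ x
        if np.linalg.norm(r) < tol:
            return k
        x = x + (r @ r) / (r @ (A @ r)) * r
    return max_iter

b = np.array([1.0, 1.0])
print(sd_steps(np.diag([2.0, 2.0]), b))    # equal eigenvalues: 1 step
print(sd_steps(np.diag([1.0, 100.0]), b))  # wide spread: many zig-zag steps
```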
Conjugate Gradient Overview

- Orthogonal Directions
- Conjugate Vectors
- Conjugate Directions
- Gram-Schmidt Algorithm
- Gradient and Error Optimality
- Conjugate Gradient
Orthogonal Directions

Steepest descent often moves in the same direction many times.

If we had $n$ orthogonal search directions and chose the best step along each, we would reach the goal after $n$ steps!
Orthogonal Directions

At every step, we need the error to be orthogonal to the previous search direction.
Conjugate Vectors

Two vectors $d_i$ and $d_j$ are A-orthogonal (or conjugate) if

$$d_i^T A d_j = 0.$$

Being conjugate corresponds to being orthogonal in the space scaled (stretched) by $A$.
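A numerical illustration with assumed vectors; the Cholesky factor of $A$ plays the role of the scaling:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])

d0 = np.array([1.0, 0.0])
d1 = np.array([-1.0, 3.0])           # chosen so that d0^T A d1 = 0

print(d0 @ A @ d1)                   # 0: d0, d1 are A-orthogonal (conjugate)
print(d0 @ d1)                       # -1: not orthogonal in the usual sense

L = np.linalg.cholesky(A)            # A = L @ L.T
print((L.T @ d0) @ (L.T @ d1))       # ~0: orthogonal in the scaled space
```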
Conjugate Directions

If we have $n$ conjugate search directions and, as with orthogonal directions, choose the best step along each, we reach the goal after $n$ steps!
Conjugate Directions

[Figure: orthogonal search directions vs. conjugate search directions]

At every step, we need the error to be A-orthogonal to the previous search direction.
Conjugate Directions

The error is $e_i = x_i - x$, so

$$A e_i = A x_i - A x = A x_i - b = -r_i.$$
Gram-Schmidt Algorithm

All that remains is to find $n$ conjugate directions, and the Gram-Schmidt conjugation process does exactly that: given $n$ linearly independent vectors $u_0, \dots, u_{n-1}$, it produces $n$ conjugate directions

$$d_i = u_i + \sum_{k=0}^{i-1} \beta_{ik} d_k, \qquad \beta_{ik} = -\frac{u_i^T A d_k}{d_k^T A d_k}.$$
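A direct sketch of the conjugation step (same illustrative $A$; the axial unit vectors serve as the independent inputs):

```python
import numpy as np

def gram_schmidt_conjugation(U, A):
    """Turn independent vectors u_0..u_{n-1} (columns of U) into
    mutually A-orthogonal directions d_0..d_{n-1}."""
    n = U.shape[1]
    D = np.zeros_like(U, dtype=float)
    for i in range(n):
        d = U[:, i].astype(float)
        for k in range(i):
            beta = -(U[:, i] @ A @ D[:, k]) / (D[:, k] @ A @ D[:, k])
            d = d + beta * D[:, k]
        D[:, i] = d
    return D

A = np.array([[3.0, 1.0], [1.0, 2.0]])
D = gram_schmidt_conjugation(np.eye(2), A)   # conjugate the axial unit vectors
print(D[:, 0] @ A @ D[:, 1])                 # ~0: the directions are conjugate
```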
Conjugate Directions

So the algorithm is complete, but it's expensive: Gram-Schmidt conjugation keeps all previous directions, costing $O(n^3)$ overall!

We already had an $O(n^3)$ algorithm, Gaussian elimination; in fact, conjugate directions with the axial unit vectors is equivalent to Gaussian elimination.
Gradient and Error Optimality

For every $j < i$ we have $d_j^T r_i = 0$. It means the residual (the gradient) is orthogonal to all previous search directions, so each step is optimal over the entire subspace searched so far.
Conjugate Gradient

Use the residuals as the Gram-Schmidt inputs: $u_i = r_i$.

This makes the equations very simple: all but one $\beta_{ik}$ vanish, and the complexity per iteration drops from $O(n^2)$ to $O(m)$, where $m$ is the number of nonzero entries of $A$.
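Putting it together, a minimal sketch of the standard CG iteration (hypothetical test values; the slides give no code):

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    """Solve A x = b (A symmetric positive definite) by conjugate gradient."""
    x = x0.astype(float)
    r = b - A @ x                    # residual
    d = r.copy()                     # first search direction is the residual
    rs = r @ r
    for _ in range(len(b)):          # at most n steps in exact arithmetic
        Ad = A @ d
        alpha = rs / (d @ Ad)        # exact line search along d
        x = x + alpha * d
        r = r - alpha * Ad
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d    # new direction, conjugate to the old ones
        rs = rs_new
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
print(conjugate_gradient(A, b, np.zeros(2)))  # matches np.linalg.solve(A, b)
```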
Line Search

Finding the step size: compute the best step size

$$\alpha_i = \arg\min_{\alpha \ge 0} f(x_i + \alpha \, d_i).$$
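For the quadratic form this arg min has a closed form; a short worked derivation (standard, not shown on the slides):

$$0 = \frac{d}{d\alpha} f(x_i + \alpha d_i) = \nabla f(x_i + \alpha d_i)^T d_i = \big( A(x_i + \alpha d_i) - b \big)^T d_i \quad\Rightarrow\quad \alpha_i = \frac{r_i^T d_i}{d_i^T A d_i}.$$

For a general (non-quadratic) $f$, a numerical line search is needed instead.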
End
Thanks for your patience!