2. Elements of convex optimization

Introduction to Machine Learning, CentraleSupélec Paris, Fall 2017
Chloé-Agathe Azencott, Centre for Computational Biology, Mines ParisTech
chloe-agathe.azencott@mines-paristech.fr


slide-1
SLIDE 1
  • 2. Elements of convex optimization

Introduction to Machine Learning CentraleSupélec Paris, Fall 2017 Chloé-Agathe Azencott

Centre for Computational Biology, Mines ParisTech

chloe-agathe.azencott@mines-paristech.fr

slide-2
SLIDE 2

Why talk about optimization?

  • Supervised ML: empirical risk minimization
slide-3
SLIDE 3

Why talk about optimization?

  • Supervised ML: empirical risk minimization
  • Quadratic loss
slide-4
SLIDE 4

Why talk about optimization?

  • Supervised ML: empirical risk minimization
  • Absolute loss
slide-5
SLIDE 5

Why talk about optimization?

  • Supervised ML: empirical risk minimization
  • 0/1 loss
slide-6
SLIDE 6
Why talk about optimization?

  • Unsupervised machine learning also involves minimizing functions. Examples:

– Dimensionality reduction: find a set of m features, m < p, such that the data projected on these m features retains maximal information.

– Clustering: find K groups of samples such that the between-group variance is high and the within-group variance is small.

slide-7
SLIDE 7

Learning objectives

  • Recognize a convex optimization problem.
  • Solve an unconstrained convex optimization problem

– Exactly, when possible – By gradient descent and a number of its variants.

  • Solve a quadratic program

– Formulate the dual problem – Write down the Karush-Kuhn-Tucker conditions.

  • Transform inequality constraints with slack variables.
slide-8
SLIDE 8

Convex functions

slide-9
SLIDE 9

Convex set

S is a convex set iff:

tu + (1 − t)v ∈ S for all u, v ∈ S and t ∈ [0, 1].

Line segments between 2 points of S lie entirely in S.

[Figure: a convex set of ℝ² and a non-convex set of ℝ²]

slide-10
SLIDE 10

Convex function

f : S → ℝ is convex iff:

  • its domain S is a convex set
  • f(tu + (1 − t)v) ≤ t f(u) + (1 − t) f(v) for all u, v ∈ S and t ∈ [0, 1]

f lies below the line segment joining f(u) and f(v).

[Figure: a convex function of ℝ → ℝ and a non-convex function of ℝ → ℝ]
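The defining inequality can be probed numerically. The sketch below (mine, not from the slides; the function name `violates_convexity` is an assumption) samples random pairs of points and checks whether f(tu + (1 − t)v) ≤ t f(u) + (1 − t) f(v) ever fails:

```python
import numpy as np

def violates_convexity(f, a, b, n_pairs=200, n_ts=20, seed=0):
    """Search for u, v, t with f(t*u + (1-t)*v) > t*f(u) + (1-t)*f(v).

    Returns True if a violation of the convexity inequality is found
    on random pairs drawn from [a, b]; a crude numerical check only.
    """
    rng = np.random.default_rng(seed)
    for _ in range(n_pairs):
        u, v = rng.uniform(a, b, size=2)
        for t in np.linspace(0.0, 1.0, n_ts):
            if f(t * u + (1 - t) * v) > t * f(u) + (1 - t) * f(v) + 1e-9:
                return True
    return False

print(violates_convexity(lambda x: x**2, -5, 5))   # x^2: no violation found
print(violates_convexity(np.sin, -5, 5))           # sin is not convex on [-5, 5]
```

A check like this can only disprove convexity, never prove it; the first-order and second-order characterizations later in the slides give actual proofs.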

slide-11
SLIDE 11

Concave function

f : S → ℝ is concave iff:

  • its domain S is a convex set
  • f(tu + (1 − t)v) ≥ t f(u) + (1 − t) f(v) for all u, v ∈ S and t ∈ [0, 1]

f concave ⟺ −f convex

slide-12
SLIDE 12

Are the following univariate functions convex?

slide-13
SLIDE 13

Are the following univariate functions convex?

  • ?
slide-14
SLIDE 14

Are the following univariate functions convex?

  • Yes!
slide-15
SLIDE 15

Are the following univariate functions convex?

  • ?
slide-16
SLIDE 16

Are the following univariate functions convex?

  • No!
slide-17
SLIDE 17

Are the following univariate functions convex?

  • No

No Yes Yes No Yes Yes

slide-18
SLIDE 18

Univariate examples

– Exponential: – Logarithmic: – Power functions:

slide-19
SLIDE 19

More examples

  • Affine functions are both convex and concave
  • Quadratic functions
  • Lp norms
  • Max

Q positive semi-definite

slide-20
SLIDE 20

More examples

  • Affine functions are both convex and concave
  • Quadratic functions
  • Lp norms
  • Max

Q positive semi-definite?

slide-21
SLIDE 21

More examples

  • Affine functions are both convex and concave
  • Quadratic functions: u ↦ uᵀQu, with Q positive semi-definite
  • Lp norms
  • Max

Q positive semi-definite:
– All eigenvalues of Q are non-negative – The bilinear form is an inner product – Q is a Gram matrix of independent vectors – Unique Cholesky decomposition
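The eigenvalue characterization gives a direct numerical test (a sketch of mine, not from the slides; the helper name `is_psd` is an assumption): a quadratic form uᵀQu is convex exactly when all eigenvalues of Q are non-negative.

```python
import numpy as np

def is_psd(Q, tol=1e-10):
    """Check that symmetric Q is positive semi-definite via its eigenvalues."""
    Q = np.asarray(Q, dtype=float)
    # Symmetrize to guard against tiny asymmetries before calling eigvalsh.
    eigenvalues = np.linalg.eigvalsh((Q + Q.T) / 2)
    return bool(np.all(eigenvalues >= -tol))

Q1 = [[2.0, 1.0], [1.0, 2.0]]   # eigenvalues 1 and 3: PSD
Q2 = [[1.0, 2.0], [2.0, 1.0]]   # eigenvalues -1 and 3: not PSD
print(is_psd(Q1), is_psd(Q2))
```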

slide-23
SLIDE 23
  • f is strictly convex iff:

f(tu + (1 − t)v) < t f(u) + (1 − t) f(v) for all u ≠ v and t ∈ (0, 1)

f is convex and has greater curvature than a linear function.

  • f is strongly convex with parameter m > 0 iff:

u ↦ f(u) − (m/2)‖u‖² is convex.

f is convex and has curvature at least as great as a quadratic function.

strongly convex ⇒ strictly convex ⇒ convex

slide-24
SLIDE 24

First-order characterization

  • If f is differentiable, then f is convex if and only if:

– its domain is a convex set – f(v) ≥ f(u) + ∇f(u)ᵀ(v − u) for all u, v

slide-25
SLIDE 25

First-order characterization

  • If f is differentiable, then f is convex if and only if:

– its domain is a convex set – f(v) ≥ f(u) + ∇f(u)ᵀ(v − u) for all u, v

∇f(u): gradient of f

slide-26
SLIDE 26

First-order characterization

  • If f is differentiable, then f is convex if and only if:

– its domain is a convex set – f(v) ≥ f(u) + ∇f(u)ᵀ(v − u) for all u, v

The right-hand side is the first-order Taylor expansion of f at u: the graph of f lies above each of its tangents (compare the points (u, f(u)), (v, f(v)) and (v, f(u) + ∇f(u)ᵀ(v − u)) in the figure).

What does it mean if ?

slide-31
SLIDE 31

Second-order characterization

  • If f is twice differentiable, then f is convex iff:

– its domain is a convex set – the Hessian ∇²f(u) is positive semi-definite for all u

  • f has positive curvature at any point u.

slide-34
SLIDE 34

Operations preserving convexity

  • Non-negative linear combination

If f₁, …, fₖ are convex and a₁, …, aₖ ≥ 0, then Σᵢ aᵢ fᵢ is convex.

  • Pointwise maximization

If f₁, …, fₖ are convex, then u ↦ maxᵢ fᵢ(u) is convex (also true for an infinite family of functions).

  • Partial minimization

If f(u, v) is convex in (u, v) and C is a convex set, then u ↦ min over v ∈ C of f(u, v) is convex.

slide-35
SLIDE 35

Convex optimization

slide-36
SLIDE 36

Unconstrained convex optimization

Unconstrained convex optimization program/problem:

min over u of f(u), where f is convex.

u*: minimizer; f(u*): optimal value.

slide-39
SLIDE 39

Constrained convex optimization

  • Convex optimization program/problem:

minimize f(u) subject to gᵢ(u) ≤ 0, i = 1, …, m, and hⱼ(u) = 0, j = 1, …, r

– f is convex – the gᵢ are convex – the hⱼ are affine – D is the common domain of all the functions.

Where is the solution? [Figure: objective f, constraint regions g₁ ≤ 0 and g₂ ≤ 0, and the line y = 0.]

slide-43
SLIDE 43

Constrained convex optimization

  • Convex optimization program/problem:

– f is the objective function – the gᵢ are the inequality constraints – the hⱼ are the equality constraints – a u that verifies all constraints is a feasible point – The set of all feasible points is the feasible region

slide-44
SLIDE 44

Constrained convex optimization

  • Convex optimization program/problem:

– Assuming it exists, the solution p*, that is to say, the minimum value of f over all feasible points, is the optimal value (optimum)
– A feasible u* such that f(u*) = p* is called optimal, or a minimizer (it need not be unique).
– If u is feasible and gᵢ(u) = 0, then gᵢ is active at u.

slide-45
SLIDE 45

Local & global optima

For convex optimization problems, local minima are global minima! If u is feasible and minimizes f in a local neighborhood:

f(u) ≤ f(v) for all feasible v close enough to u

then u minimizes f globally.

slide-47
SLIDE 47

Why talk about convex optimization?

convex / non-convex

  • Convex optimization is “easy”.
  • We’ll often try to formulate ML problems as convex optimization problems.

slide-48
SLIDE 48

Why talk about convex optimization?

  • Supervised ML: empirical risk minimization
  • Losses for classification

The 0/1 loss is non-convex. We’ll replace it with other losses.

slide-49
SLIDE 49

Unconstrained convex optimization

slide-50
SLIDE 50

First-order characterization

  • Suppose f differentiable
  • Given the first-order characterization of convex functions, how can we solve the minimization of f?

Set the gradient of f to 0: solve ∇f(u) = 0.

  • But what if ∇f(u) = 0 cannot be solved analytically?
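When the gradient equation is linear, it can be solved directly. A hypothetical example (mine, not from the slides): for a convex quadratic, setting the gradient to zero reduces to one linear system.

```python
import numpy as np

# f(u) = 0.5 * u^T Q u - b^T u is convex because Q is positive definite;
# its gradient is Q u - b, so setting the gradient to zero amounts to
# solving the linear system Q u = b.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])

u_star = np.linalg.solve(Q, b)
print(u_star)                           # the unique global minimizer
print(np.allclose(Q @ u_star - b, 0))   # the gradient vanishes at u_star
```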
slide-54
SLIDE 54

Gradient descent

  • Start from a random point u.
  • How do I get closer to the solution?
  • Follow the opposite of the gradient.

The gradient ∇f(u) indicates the direction of steepest increase: for a suitable step size α, moving from u to u⁺ = u − α∇f(u) gives f(u⁺) < f(u).

slide-57
SLIDE 57

Gradient descent algorithm

  • Choose an initial point u⁽⁰⁾
  • Repeat for k = 1, 2, 3, …: u⁽ᵏ⁾ = u⁽ᵏ⁻¹⁾ − αₖ ∇f(u⁽ᵏ⁻¹⁾)
  • Stop at some point

αₖ: step size. Stopping criterion: usually, stop when ‖∇f(u⁽ᵏ⁾)‖ is small.
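The algorithm above can be sketched in a few lines (mine, not from the slides; the fixed step size and the test quadratic are assumptions):

```python
import numpy as np

def gradient_descent(grad_f, u0, step_size=0.1, tol=1e-8, max_iter=10_000):
    """Minimize a differentiable convex f given its gradient.

    Stops when the gradient norm falls below `tol` (the usual criterion
    from the slides) or after `max_iter` iterations.
    """
    u = np.asarray(u0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(u)
        if np.linalg.norm(g) < tol:
            break
        u = u - step_size * g
    return u

# Example: f(u) = 0.5 * u^T Q u - b^T u, whose gradient is Q u - b.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])
u_star = gradient_descent(lambda u: Q @ u - b, np.zeros(2))
print(u_star)   # converges to the solution of Q u = b
```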

slide-60
SLIDE 60

Gradient descent algorithm

  • Choose an initial point
  • Repeat for k = 1, 2, 3, …

– If the step size is too big, the search might diverge
– If the step size is too small, the search might take a very long time
– Backtracking line search makes it possible to choose the step size adaptively.

slide-66
SLIDE 66

BLS: shrinking needed

[Figure: from u, the step to u − α∇f(u) overshoots the minimum.]

The step size is too big and we are overshooting our goal:

f(u − α∇f(u)) > f(u) − (α/2) ∇f(u)ᵀ∇f(u)

slide-71
SLIDE 71

BLS: no shrinking needed

f(u − α∇f(u)) ≤ f(u) − (α/2) ∇f(u)ᵀ∇f(u)

The step size is small enough.

slide-72
SLIDE 72

Backtracking line search

  • Shrinking parameter β ∈ (0, 1), initial step size α₀
  • Choose an initial point
  • Repeat for k = 1, 2, 3, …

– If f(u − α∇f(u)) > f(u) − (α/2) ∇f(u)ᵀ∇f(u): shrink the step size: α ← βα
– Else: update u ← u − α∇f(u)

  • Stop when ‖∇f(u)‖ is small
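A minimal sketch of gradient descent with backtracking line search (mine, not from the slides; I use the Armijo-type shrinking test with constant 1/2, matching the condition reconstructed above):

```python
import numpy as np

def gd_backtracking(f, grad_f, u0, alpha0=1.0, beta=0.5,
                    tol=1e-8, max_iter=1000):
    """Gradient descent where each step size is chosen by backtracking.

    The step size is shrunk by `beta` until
    f(u - alpha * g) <= f(u) - (alpha / 2) * g @ g.
    """
    u = np.asarray(u0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(u)
        if np.linalg.norm(g) < tol:
            break
        alpha = alpha0
        while f(u - alpha * g) > f(u) - 0.5 * alpha * (g @ g):
            alpha *= beta          # step too big: shrink
        u = u - alpha * g
    return u

Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])
f = lambda u: 0.5 * u @ Q @ u - b @ u
u_star = gd_backtracking(f, lambda u: Q @ u - b, np.zeros(2))
```

The inner loop always terminates because for small enough α the decrease condition holds whenever the gradient is non-zero.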
slide-73
SLIDE 73

Newton’s method

  • Suppose f is twice differentiable
  • Second-order Taylor expansion: f(v) ≈ f(u) + ∇f(u)ᵀ(v − u) + ½ (v − u)ᵀ ∇²f(u) (v − u)
  • Minimize in v: v = u − [∇²f(u)]⁻¹ ∇f(u)
slide-76
SLIDE 76
  • Computing the inverse of the Hessian is computationally intensive.
  • Instead, compute ∇f(u) and ∇²f(u), and solve ∇²f(u) w = ∇f(u) for w.
  • New update rule: u ← u − w

Newton CG (conjugate gradient)
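The update rule, solving a linear system rather than inverting the Hessian, looks like this (a sketch of mine, not from the slides; the toy function is an assumption):

```python
import numpy as np

def newton(grad, hess, u0, tol=1e-10, max_iter=50):
    """Newton's method: at each step solve H w = g instead of forming H^{-1}."""
    u = np.asarray(u0, dtype=float)
    for _ in range(max_iter):
        g = grad(u)
        if np.linalg.norm(g) < tol:
            break
        w = np.linalg.solve(hess(u), g)   # Newton step
        u = u - w
    return u

# Toy strictly convex function: f(u) = u1^2 + u2^2 + exp(u1)
grad = lambda u: np.array([2 * u[0] + np.exp(u[0]), 2 * u[1]])
hess = lambda u: np.array([[2 + np.exp(u[0]), 0.0], [0.0, 2.0]])
u_star = newton(grad, hess, np.array([1.0, 1.0]))
print(np.linalg.norm(grad(u_star)))   # near zero: a stationary point
```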

slide-78
SLIDE 78

Newton CG (conjugate gradient)

  • Computing the inverse of the Hessian is computationally intensive.
  • Instead, compute ∇f(u) and ∇²f(u), and solve ∇²f(u) w = ∇f(u) for w.

This is a problem of the form Aw = b with A positive semi-definite (by the second-order characterization of convex functions). Solve using the conjugate gradient method.

slide-82
SLIDE 82

Conjugate gradient method

Solve Aw = b

  • Idea: build a set of A-conjugate vectors (a basis of ℝⁿ)

– Initialisation: start from w₀, with residual r₀ = b − Aw₀ – At step t:

  • Update rule: move along the current conjugate direction
  • rₜ = b − Awₜ: residual

– Convergence: the directions are A-conjugate, hence after n steps they span ℝⁿ, which ensures Awₙ = b.
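A compact implementation of the standard conjugate gradient iteration (mine, not transcribed from the slides, whose formulas did not survive the transcript):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Solve A w = b for symmetric positive-definite A by conjugate gradients."""
    n = len(b)
    w = np.zeros(n)
    r = b - A @ w                 # residual
    p = r.copy()                  # first search direction
    for _ in range(n):            # exact convergence in at most n steps
        Ap = A @ p
        gamma = (r @ r) / (p @ Ap)            # optimal step along p
        w = w + gamma * p
        r_new = r - gamma * Ap
        if np.linalg.norm(r_new) < tol:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p   # next A-conjugate direction
        r = r_new
    return w

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
w = conjugate_gradient(A, b)
print(np.allclose(A @ w, b))   # True
```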

slide-83
SLIDE 83

Conjugate gradient method

Prove. Given

– Initialisation: – At step t:

  • Update rule:
  • residual
  • and assuming

?

slide-84
SLIDE 84

Prove. Given

– Initialisation: – Update rule: – residual – and assuming

slide-85
SLIDE 85

Conjugate gradient method

Prove and conclude the proof. Given

– Initialisation: – At step t:

  • Update rule:
  • residual
  • ?
slide-86
SLIDE 86

Prove. Given

– Initialisation: – Update rule: – residual –

slide-87
SLIDE 87

Quasi-Newton methods

  • What if the Hessian is unavailable / expensive to compute at each iteration?
  • Approximate the inverse Hessian: Wₖ ≈ [∇²f(u⁽ᵏ⁾)]⁻¹, updated iteratively
  • Conditions:

– Wₖ symmetric positive definite
– Secant equation: Wₖ₊₁ (∇f(u⁽ᵏ⁺¹⁾) − ∇f(u⁽ᵏ⁾)) = u⁽ᵏ⁺¹⁾ − u⁽ᵏ⁾ (⇒ first-order Taylor expansion applied to ∇f)

  • Initialization: Identity

slide-88
SLIDE 88

Quasi-Newton methods

  • BFGS: Broyden-Fletcher-Goldfarb-Shanno
  • L-BFGS: Limited-memory variant

Do not store the full matrix Wₖ.
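A minimal BFGS sketch (mine, not the full algorithm from the references): it maintains W, an approximation of the inverse Hessian initialized to the identity, and updates it so that the secant equation W y = s holds after each step. The Armijo line-search constant and step schedule are assumptions; in practice one would use a library routine such as scipy.optimize.minimize with the L-BFGS-B method.

```python
import numpy as np

def bfgs(f, grad_f, u0, tol=1e-8, max_iter=200):
    """Quasi-Newton minimization with the BFGS inverse-Hessian update."""
    u = np.asarray(u0, dtype=float)
    n = len(u)
    W = np.eye(n)                       # initialization: identity
    g = grad_f(u)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -W @ g                      # quasi-Newton search direction
        alpha = 1.0                     # backtracking (Armijo) line search
        while f(u + alpha * d) > f(u) + 1e-4 * alpha * (g @ d):
            alpha *= 0.5
        s = alpha * d                   # step taken
        g_new = grad_f(u + s)
        y = g_new - g                   # change in gradient
        if y @ s > 1e-12:               # curvature condition keeps W pos. def.
            rho = 1.0 / (y @ s)
            I = np.eye(n)
            W = (I - rho * np.outer(s, y)) @ W @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        u, g = u + s, g_new
    return u

Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])
u_star = bfgs(lambda u: 0.5 * u @ Q @ u - b @ u,
              lambda u: Q @ u - b, np.zeros(2))
```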

slide-89
SLIDE 89

Stochastic gradient descent

  • For f(u) = (1/m) Σᵢ fᵢ(u)
  • Gradient descent: update with the full gradient (1/m) Σᵢ ∇fᵢ(u)
  • Stochastic gradient descent: update with a single ∇f_{iₖ}(u)

– Cyclic: cycle over 1, 2, …, m, 1, 2, …, m, … – Randomized: choose iₖ uniformly at random in {1, 2, …, m}.
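The randomized variant, sketched on a least-squares objective (my example, not from the slides; the learning rate and epoch count are assumptions):

```python
import numpy as np

def sgd_least_squares(X, y, lr=0.01, n_epochs=100, seed=0):
    """Randomized SGD on f(u) = (1/m) * sum_i (x_i^T u - y_i)^2.

    Each update uses the gradient of a single randomly chosen term,
    2 * (x_i^T u - y_i) * x_i, instead of the full sum.
    """
    rng = np.random.default_rng(seed)
    m, p = X.shape
    u = np.zeros(p)
    for _ in range(n_epochs):
        for _ in range(m):
            i = rng.integers(m)                     # randomized index choice
            u -= lr * 2 * (X[i] @ u - y[i]) * X[i]
    return u

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
u_true = np.array([1.0, -2.0, 0.5])
y = X @ u_true                  # noiseless targets for the demonstration
u_hat = sgd_least_squares(X, y)
print(u_hat)                    # approaches u_true
```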

slide-90
SLIDE 90

Coordinate Descent

  • For f(u) = g(u) + Σᵢ hᵢ(uᵢ)

– g: convex and differentiable – hᵢ: convex
⇒ the non-smooth part of f is separable.

  • Minimize coordinate by coordinate:

– Initialisation: u⁽⁰⁾ – For k = 1, 2, …: minimize f over each coordinate in turn, the other coordinates held fixed.

Variants:
– re-order the coordinates randomly
– proceed by blocks of coordinates (2 or more at a time)
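The lasso is the classic fit for this scheme: the squared loss is smooth and the l1 penalty is separable, so every coordinate subproblem has a closed-form soft-thresholding solution. A sketch of mine (not from the slides; helper names are assumptions):

```python
import numpy as np

def soft_threshold(z, t):
    """Exact minimizer of 0.5 * (x - z)**2 + t * |x| over x."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for f(u) = 0.5 * ||y - X u||^2 + lam * ||u||_1."""
    m, p = X.shape
    u = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for i in range(p):
            # Residual with coordinate i's contribution removed.
            r_i = y - X @ u + X[:, i] * u[i]
            # Exact solution of the one-dimensional subproblem.
            u[i] = soft_threshold(X[:, i] @ r_i, lam) / col_sq[i]
    return u

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, 0.0, -3.0])
print(lasso_cd(X, y, lam=0.0))   # lam = 0 recovers ordinary least squares
```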

slide-93
SLIDE 93

Summary: Unconstrained convex optimization

If f is differentiable:

– Set its gradient to zero – If hard to solve: gradient descent

Setting the learning rate:

  • Backtracking Line Search (adapt heuristically to avoid “overshooting”)
  • Newton’s method: suppose f twice differentiable

– Update u ← u − [∇²f(u)]⁻¹ ∇f(u)
– If the Hessian is hard to invert, compute the update by solving a linear system with the conjugate gradient method
– If the Hessian is hard to compute, approximate the inverse Hessian with a quasi-Newton method such as BFGS (L-BFGS: less memory)
– If f is separable: stochastic gradient descent – If the non-smooth part of f is separable: coordinate descent.

slide-94
SLIDE 94

Constrained convex optimization

slide-95
SLIDE 95

Constrained convex optimization

  • Convex optimization program/problem:

– f is convex – the gᵢ are convex – the hⱼ are affine – The feasible set is convex

slide-96
SLIDE 96

Lagrangian

  • Lagrangian: L(u, α, β) = f(u) + Σᵢ αᵢ gᵢ(u) + Σⱼ βⱼ hⱼ(u)

α, β = Lagrange multipliers = dual variables

slide-97
SLIDE 97

Lagrange dual function

  • Lagrangian: L(u, α, β) = f(u) + Σᵢ αᵢ gᵢ(u) + Σⱼ βⱼ hⱼ(u)
  • Lagrange dual function: Q(α, β) = inf over u of L(u, α, β)
  • Q is concave (independently of the convexity of f)

Infimum = the greatest value x such that x ≤ L(u, α, β) for all u

slide-98
SLIDE 98

Lagrange dual function

  • Q is concave (independently of the convexity of f)

slide-99
SLIDE 99

Lagrange dual function

  • The dual function gives a lower bound on our solution: let p* be the optimal primal value. Then Q(α, β) ≤ p* for any α ⪰ 0 and any β, the bound coming from feasible points.
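To make the lower bound concrete, here is a small worked example of my own (not on the slides):

```latex
% Primal: minimize u^2 subject to 1 - u <= 0  (optimum u^* = 1, p^* = 1)
L(u, \alpha) = u^2 + \alpha (1 - u), \qquad \alpha \ge 0
% Minimizing over u: \partial L / \partial u = 2u - \alpha = 0, i.e. u = \alpha / 2, hence
Q(\alpha) = \inf_u L(u, \alpha) = \alpha - \frac{\alpha^2}{4}
% For every \alpha \ge 0, Q(\alpha) \le 1 = p^*; the bound is tight at \alpha^* = 2.
```

Note that Q is concave in α even though it was built by minimizing a family of functions, consistent with the previous slide.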

slide-100
SLIDE 100

Weak duality

  • Q(α, β) ≤ p* for any α ⪰ 0, β
  • What is the best lower bound on p* we can get?

Maximize Q over α ⪰ 0 and β: this is the Lagrange dual problem.

  • Optimal values α*, β* of α, β are called dual optimal, or optimal Lagrange multipliers.
  • The original optimization problem is the primal.
  • The dual is a convex optimization problem (even if the primal is not!)

slide-104
SLIDE 104

Weak duality

  • Let d* be the solution to the dual problem
  • Because Q(α, β) ≤ p* for every dual admissible α, β: d* ≤ p*

Weak duality (always holds)

slide-105
SLIDE 105

Strong duality & Slater’s conditions

  • Strong duality: d* = p*

– Does not hold in general – But often holds for convex optimization problems

  • Constraint qualifications: conditions under which strong duality holds (in addition to convexity)
  • In particular, Slater’s conditions:

– If the primal is convex and there exists at least one strictly feasible point (i.e. the inequalities hold strictly), then strong duality holds
– Strict inequalities only need to hold for non-affine constraints.

slide-108
SLIDE 108

Karush-Kuhn-Tucker conditions

  • Suppose f, gi, hj differentiable + strong duality. Then:

f(u*) = Q(α*, β*) [strong duality]
= inf over u of L(u, α*, β*) [definition of Q]
≤ L(u*, α*, β*) [definition of inf]
= f(u*) + Σᵢ αᵢ* gᵢ(u*) + Σⱼ βⱼ* hⱼ(u*)
≤ f(u*)

since u* feasible ⇒ hⱼ(u*) = 0, u* feasible ⇒ gᵢ(u*) ≤ 0, and α* feasible ⇒ αᵢ* ≥ 0, so the sum added to f(u*) is non-positive.

  • Hence all above inequalities are equalities.
slide-115
SLIDE 115

Karush-Kuhn-Tucker conditions

  • Suppose f, gi, hj differentiable + strong duality
  • Since all the above inequalities are equalities:

– u* minimizes L(u, α*, β*) over u, so the gradient of the Lagrangian vanishes at u*: stationarity
– Σᵢ αᵢ* gᵢ(u*) = 0 with αᵢ* ≥ 0 and gᵢ(u*) ≤ 0, so αᵢ* gᵢ(u*) = 0 for each i: complementary slackness
slide-120
SLIDE 120

Karush-Kuhn-Tucker conditions

  • Let’s sum up all of our conditions:

– Primal feasibility: gᵢ(u) ≤ 0, hⱼ(u) = 0
– Dual feasibility: αᵢ ≥ 0
– Complementary slackness: αᵢ gᵢ(u) = 0
– Stationarity: ∇f(u) + Σᵢ αᵢ ∇gᵢ(u) + Σⱼ βⱼ ∇hⱼ(u) = 0

These are the Karush-Kuhn-Tucker (KKT) conditions.

  • For convex optimization problems, any (u, α, β) that verify the KKT conditions are optimal.
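The conditions are easy to verify on a toy problem of my own (not from the slides): minimize f(u) = u² subject to g(u) = 1 − u ≤ 0, with candidate pair u* = 1, α* = 2.

```python
import numpy as np

u, alpha = 1.0, 2.0     # candidate primal and dual points

grad_f = 2 * u          # f'(u) for f(u) = u^2
g = 1 - u               # constraint value, must be <= 0
grad_g = -1.0           # g'(u)

primal_feasible = g <= 0
dual_feasible = alpha >= 0
complementary_slackness = np.isclose(alpha * g, 0.0)
stationarity = np.isclose(grad_f + alpha * grad_g, 0.0)

print(primal_feasible, dual_feasible, complementary_slackness, stationarity)
# All four conditions hold, so (u, alpha) is optimal for this convex problem.
```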


slide-124
SLIDE 124

Geometric interpretation

[Figure: iso-contours of f, the unconstrained minimum of f, and the feasible region.]

  • Case 1: the unconstrained minimum lies in the feasible region.
  • Case 2: it does not. The solution lies where the iso-contours of f meet the feasible region:

– The gradients of f and g are parallel, of opposite directions
– The solution lies at the border of the feasible space
slide-136
SLIDE 136

Geometric interpretation

  • Case 1: ∇f(u*) = 0 and the constraint is inactive
  • Case 2: ∇f(u*) = −α* ∇g(u*) with α* > 0, and g(u*) = 0
  • Can be summarized as:

∇f(u*) + α* ∇g(u*) = 0 (stationarity) and α* g(u*) = 0 (complementary slackness):
– Either α* = 0 (case 1) – or g(u*) = 0 (case 2).

slide-141
SLIDE 141

Quadratic Programs

  • Special case of convex optimization problems where

– f is quadratic – the gᵢ and hⱼ are affine

  • The feasible set is a polyhedron.

[Figure: iso-contours of f, unconstrained minimum of f, feasible region.]

slide-143
SLIDE 143

Quadratic Programs

  • Many methods can be used to solve QPs, for example:

– Interior point methods – Active set methods

  • Many solvers implement them:

– CPLEX – CVXOPT – CGAL and more.
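For the special case of equality constraints only, no solver is needed: the KKT conditions of a QP are linear, so the optimum comes out of one linear system. A sketch of mine (not from the slides; the toy problem is an assumption):

```python
import numpy as np

# Equality-constrained QP: minimize 0.5 * u^T Q u + c^T u subject to A u = b.
# Stationarity (Q u + c + A^T beta = 0) and feasibility (A u = b) stack into:
#   [Q  A^T] [u   ]   [-c]
#   [A   0 ] [beta] = [ b]
Q = np.array([[2.0, 0.0], [0.0, 2.0]])   # f(u) = u1^2 + u2^2
c = np.zeros(2)
A = np.array([[1.0, 1.0]])               # constraint: u1 + u2 = 1
b = np.array([1.0])

n, r = Q.shape[0], A.shape[0]
kkt = np.block([[Q, A.T], [A, np.zeros((r, r))]])
rhs = np.concatenate([-c, b])
sol = np.linalg.solve(kkt, rhs)
u_star, beta_star = sol[:n], sol[n:]
print(u_star)   # closest point to the origin on u1 + u2 = 1: (0.5, 0.5)
```

With inequality constraints the active set is unknown in advance, which is exactly what interior point and active set methods handle.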

slide-144
SLIDE 144

Slack variables

  • Replace the inequality constraints gᵢ(u) ≤ 0 with gᵢ(u) + sᵢ = 0, sᵢ ≥ 0.
  • sᵢ = slack variable.
slide-145
SLIDE 145

Summary

  • We often try to formulate machine learning problems as convex optimization problems
  • If f is differentiable, unconstrained convex optimization problems can be solved by gradient descent.

Flavors: backtracking line search, Newton’s methods, BFGS, stochastic gradient descent.

  • Constrained convex optimization problems can be solved in dual space via the Lagrangian.

slide-146
SLIDE 146

References

  • Convex optimization. S. Boyd and L. Vandenberghe. https://web.stanford.edu/~boyd/cvxbook/

– Convex sets: Chapter 2.1
– Convex functions: Chapters 3.1.1–3.1.5, 3.2
– Convex optimization problems: 4.1.1–4.1.2, 4.2.2
– Unconstrained minimization: 9.1.1, 9.2–9.3 (gradient descent), 9.5 (Newton)
– QP: 4.4.1; 5.1 (Lagrange), 5.2 (Duality), 5.3.2 (Slater), 5.5.3 (KKT)
– Slack variables: 4.1.3
– Also see the Bibliography section at the end of each chapter.

  • To go further

– Numerical Optimization. J. Bonnans, J. Gilbert, C. Lemaréchal, C. Sagastizábal. Quasi-Newton methods: 4.3–4.4.
– Stochastic gradient descent tricks. L. Bottou (2012). http://leon.bottou.org/publications/pdf/tricks-2012.pdf
– Coordinate Descent Algorithms. S. Wright (2015). https://arxiv.org/abs/1502.04759

slide-147
SLIDE 147

Homework

  • By Monday (Oct 2nd)

Visit http://tinyurl.com/ma2823-2017
Download and read the complete syllabus. Set up your computer for the labs.

  • By Friday (Oct 6th)

Download, solve and turn in HW 1.

See you on Monday, 8:30am in Amphi sc.046!