Optimization Models EECS 127 / EECS 227AT Laurent El Ghaoui EECS - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Optimization Models

EECS 127 / EECS 227AT Laurent El Ghaoui

EECS department UC Berkeley

Spring 2020

Sp20 1 / 30

slide-2
SLIDE 2

LECTURE 26 Implicit Deep Learning

“The Matrix is everywhere. It is all around us.” (Morpheus)

slide-3
SLIDE 3

Outline

1. Implicit Rules
2. Link with Neural Nets
3. Well-Posedness
4. Robustness Analysis
5. Training Implicit Models
6. Take-Aways

slide-4
SLIDE 4

Collaborators

Joint work with: Armin Askari, Fangda Gu, Bert Travacca, Alicia Tsai (UC Berkeley); Mert Pilanci (Stanford); Emmanuel Vallod, Stefano Proto (www.sumup.ai). Sponsors:


slide-5
SLIDE 5

Implicit prediction rule

Equilibrium equation: $x = \phi(Ax + Bu)$.
Prediction: $\hat{y}(u) = Cx + Du$.

Input $u \in \mathbb{R}^p$, predicted output $\hat{y}(u) \in \mathbb{R}^q$, hidden “state” vector $x \in \mathbb{R}^n$.

Model parameter matrix:
$$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}.$$

Activation: vector map $\phi : \mathbb{R}^n \to \mathbb{R}^n$, e.g. the ReLU: $\phi(\cdot) = \max(\cdot, 0)$ (acting componentwise on vectors).

slide-6
SLIDE 6

Deep neural nets as implicit models

Figure: A neural network. Figure: An implicit model.

Implicit models are more general: they allow loops in the network graph.


slide-7
SLIDE 7

Example

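Fully connected, feedforward neural network:
$$\hat{y}(u) = W_L x_L, \quad x_{l+1} = \phi_{l+1}(W_l x_l), \quad l = 0, \ldots, L-1, \quad x_0 = u.$$

Implicit model:
$$
\begin{pmatrix} A & B \\ C & D \end{pmatrix} =
\left(\begin{array}{cccc|c}
0 & W_{L-1} & \cdots & 0 & 0 \\
\vdots & \ddots & \ddots & \vdots & \vdots \\
0 & \cdots & 0 & W_1 & 0 \\
0 & \cdots & \cdots & 0 & W_0 \\ \hline
W_L & 0 & \cdots & 0 & 0
\end{array}\right), \quad
x = \begin{pmatrix} x_L \\ \vdots \\ x_1 \end{pmatrix}, \quad
\phi(z) = \begin{pmatrix} \phi_L(z_L) \\ \vdots \\ \phi_1(z_1) \end{pmatrix}.
$$

The equilibrium equation $x = \phi(Ax + Bu)$ is easily solved via backward substitution (forward pass).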
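
The block structure above can be checked numerically. The following is a minimal sketch, assuming ReLU activations in every layer and small random weights; the helper names (`feedforward`, `to_implicit`, `implicit_predict`) are hypothetical and not from the lecture. Because A is strictly upper block triangular, L sweeps of x ← φ(Ax + Bu) starting from x = 0 reproduce the forward pass.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def feedforward(weights, u):
    """Standard forward pass; weights = [W_0, ..., W_L], output y = W_L x_L."""
    x = u
    for W in weights[:-1]:
        x = relu(W @ x)
    return weights[-1] @ x

def to_implicit(weights):
    """Build (A, B, C, D) for the stacked state x = (x_L, ..., x_1)."""
    hidden = weights[:-1]                    # W_0, ..., W_{L-1}
    sizes = [W.shape[0] for W in hidden]     # sizes[l-1] = dimension of x_l
    L, n = len(sizes), sum(sizes)
    starts, pos = {}, 0
    for l in range(L, 0, -1):                # x_L is stored first, x_1 last
        starts[l] = pos
        pos += sizes[l - 1]
    A = np.zeros((n, n))
    B = np.zeros((n, weights[0].shape[1]))
    for l in range(1, L):                    # x_{l+1} depends on x_l through W_l
        r, c = starts[l + 1], starts[l]
        A[r:r + sizes[l], c:c + sizes[l - 1]] = hidden[l]
    B[starts[1]:starts[1] + sizes[0], :] = hidden[0]   # x_1 depends on u through W_0
    C = np.zeros((weights[-1].shape[0], n))
    C[:, starts[L]:starts[L] + sizes[L - 1]] = weights[-1]
    D = np.zeros((weights[-1].shape[0], weights[0].shape[1]))
    return A, B, C, D

def implicit_predict(A, B, C, D, u, n_iter):
    """Solve x = relu(A x + B u) by repeated substitution, then predict."""
    x = np.zeros(A.shape[0])
    for _ in range(n_iter):
        x = relu(A @ x + B @ u)
    return C @ x + D @ u

# Hypothetical network with two hidden layers (L = 2): W_0, W_1, W_2.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)),
           rng.standard_normal((5, 4)),
           rng.standard_normal((2, 5))]
u = rng.standard_normal(3)
A, B, C, D = to_implicit(weights)
y_net = feedforward(weights, u)
y_imp = implicit_predict(A, B, C, D, u, n_iter=len(weights) - 1)
```

Here `y_net` and `y_imp` should agree up to floating-point error, since each sweep fills in one more layer of the state.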

slide-8
SLIDE 8

Example: ResNet20

Figure: The A matrix for ResNet20.

20-layer network, implicit model of order $n \sim 180000$.

Convolutional layers have blocks with Toeplitz structure. Residual connections appear as lines.

slide-9
SLIDE 9

Neural networks as implicit models

Framework covers most neural network architectures:
  • Neural nets have strictly upper triangular matrix $A$.
  • Equilibrium equation solved by substitution, i.e. “forward pass”.
  • State vector $x$ contains all the hidden features.
  • Activation $\phi$ can be different for each component or block of $x$.
  • Covers CNNs, RNNs, (Bi-)LSTMs, attention, transformers, etc.

slide-10
SLIDE 10

Related concept: state-space models

The so-called “state-space” models for dynamical systems use the same idea to represent high-order differential equations ...

Linear, time-invariant (LTI) dynamical system:
$$\dot{x} = Ax + Bu, \quad y = Cx + Du.$$

Figure: LTI system

slide-11
SLIDE 11

Well-posedness

The matrix $A \in \mathbb{R}^{n \times n}$ is said to be well-posed for $\phi$ if, for every $b \in \mathbb{R}^n$, a solution $x \in \mathbb{R}^n$ to the equation
$$x = \phi(Ax + b)$$
exists and is unique.

Figure: Equation has two or no solutions, depending on sgn(b).
Figure: Solution is unique for every b.
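
A scalar case illustrates the definition. The worked example below is a sketch under assumptions not stated on the slide: $n = 1$, $\phi$ is the ReLU, and $A = a$ is a scalar. It is consistent with the two figure captions, but it is not claimed to be the exact example plotted there.

```latex
% Scalar equilibrium equation with the ReLU (assumed setting: n = 1, A = a).
x = \max(ax + b,\, 0), \qquad a, b \in \mathbb{R}.

% Case a < 1: exactly one solution for every b, so a is well-posed.
a < 1: \quad
x = \begin{cases} b/(1-a) & \text{if } b > 0, \\ 0 & \text{if } b \le 0. \end{cases}

% Case a > 1: the number of solutions depends on the sign of b, so a is not well-posed.
a > 1: \quad
\begin{cases} \text{no solution} & \text{if } b > 0, \\ x \in \{0,\ b/(1-a)\} & \text{if } b < 0. \end{cases}
```

In each case the candidates are $x = 0$ (valid when $b \le 0$) and $x = b/(1-a)$ (valid when it is positive); counting the valid candidates gives the result.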

slide-12
SLIDE 12

Perron-Frobenius theory [1]

A square matrix $P$ with non-negative entries admits a real eigenvalue $\lambda_{PF}$ with a non-negative eigenvector $v \neq 0$:
$$Pv = \lambda_{PF} v.$$
The value $\lambda_{PF}$ dominates all the other eigenvalues: for any other (complex) eigenvalue $\mu \in \mathbb{C}$, we have $|\mu| \le \lambda_{PF}$.

Figure: A web link matrix.

Google’s PageRank algorithm relies on computing the Perron-Frobenius eigenvector of the web link matrix.

slide-13
SLIDE 13

PF Sufficient condition for well-posedness

Fact: Assume that $\phi$ is componentwise non-expansive (e.g., $\phi$ = ReLU):
$$\forall\, u, v \in \mathbb{R}^n : \quad |\phi(u) - \phi(v)| \le |u - v|,$$
with absolute values and inequalities between vectors taken componentwise. Then the matrix $A$ is well-posed for $\phi$ if the non-negative matrix $|A|$ satisfies
$$\lambda_{PF}(|A|) < 1,$$
in which case the solution can be found via the fixed-point iterations
$$x(t+1) = \phi(Ax(t) + b), \quad t = 0, 1, 2, \ldots$$
Covers feedforward neural networks: there $|A|$ is strictly upper triangular, thus $\lambda_{PF}(|A|) = 0$.
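
The condition and the iteration can be tried on a small example. This is a minimal sketch, assuming the ReLU activation, a dense random A rescaled so that λ_PF(|A|) = 0.9, and a start at x(0) = 0; the names `lambda_pf` and `solve_equilibrium` and the tolerance are illustrative choices, not part of the lecture.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def lambda_pf(A):
    """Perron-Frobenius eigenvalue of |A|, computed as its spectral radius."""
    return float(np.max(np.abs(np.linalg.eigvals(np.abs(A)))))

def solve_equilibrium(A, b, phi=relu, tol=1e-10, max_iter=10_000):
    """Fixed-point iterations x(t+1) = phi(A x(t) + b), started from x(0) = 0."""
    x = np.zeros(len(b))
    for _ in range(max_iter):
        x_next = phi(A @ x + b)
        if np.max(np.abs(x_next - x)) <= tol:
            return x_next
        x = x_next
    raise RuntimeError("fixed-point iteration did not converge")

# Hypothetical 5 x 5 example, rescaled so that lambda_PF(|A|) = 0.9 < 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A *= 0.9 / lambda_pf(A)
b = rng.standard_normal(5)

x = solve_equilibrium(A, b)
residual = np.max(np.abs(x - relu(A @ x + b)))   # should be close to zero
```

The closer λ_PF(|A|) is to 1, the more iterations are needed, in line with the geometric bound used in the existence proof.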

slide-14
SLIDE 14

Proof: existence

We have, for every $t \ge 1$,
$$|x(t+1) - x(t)| = |\phi(Ax(t) + b) - \phi(Ax(t-1) + b)| \le |A|\,|x(t) - x(t-1)|,$$
which implies that for every $t, \tau \ge 0$:
$$|x(t+\tau) - x(t)| \le \sum_{k=t}^{t+\tau} |A|^k |x(1) - x(0)| \le |A|^t \sum_{k=0}^{\tau} |A|^k |x(1) - x(0)| \le |A|^t w,$$
where
$$w := \sum_{k=0}^{+\infty} |A|^k |x(1) - x(0)| = (I - |A|)^{-1} |x(1) - x(0)|,$$
since, due to $\lambda_{PF}(|A|) < 1$, $I - |A|$ is invertible, and the series above converges. Since $\lim_{t \to +\infty} |A|^t = 0$, we obtain that $x(t)$ is a Cauchy sequence, hence it has a limit, $x_\infty$. By continuity of $\phi$ we further obtain that $x_\infty = \phi(Ax_\infty + b)$, which establishes the existence of a solution.

slide-15
SLIDE 15

Proof: uniqueness

To prove uniqueness, consider $x^1, x^2 \in \mathbb{R}^n_+$, two solutions to the equation. Using the hypotheses in the theorem, we have, for any $k \ge 1$:
$$|x^1 - x^2| \le |A|\,|x^1 - x^2| \le |A|^k |x^1 - x^2|.$$
The fact that $|A|^k \to 0$ as $k \to +\infty$ then establishes uniqueness.

slide-16
SLIDE 16

Norm condition

More conservative condition: $\|A\|_\infty < 1$, where
$$\lambda_{PF}(|A|) \le \|A\|_\infty := \max_i \sum_j |A_{ij}|.$$
Under previous PF conditions for well-posedness:
  • we can always rescale the model so that $\|A\|_\infty < 1$, without altering the prediction rule;
  • scaling related to PF eigenvector of $|A|$.
Hence during training we may simply use norm condition.
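
One way to carry out such a rescaling is sketched below. It is an illustration under stated assumptions rather than the construction used in the lecture: φ is the ReLU (so that it commutes with positive diagonal scalings), and instead of the PF eigenvector itself the positive vector v = (I − |A|)^{-1} 1 is used, which satisfies |A|v = v − 1 < v whenever λ_PF(|A|) < 1. With S = diag(v), the model (S^{-1}AS, S^{-1}B, CS, D) has the same predictions and an ∞-norm below one. The function names and the 2×2 example are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def predict(A, B, C, D, u, n_iter=2000):
    """Solve x = relu(A x + B u) by fixed-point iterations, then predict."""
    x = np.zeros(A.shape[0])
    for _ in range(n_iter):
        x = relu(A @ x + B @ u)
    return C @ x + D @ u

def rescale(A, B, C, D):
    """Diagonal change of state variable x = S z with S = diag(v), v > 0."""
    n = A.shape[0]
    v = np.linalg.solve(np.eye(n) - np.abs(A), np.ones(n))  # v = (I - |A|)^{-1} 1
    A2 = A * v[None, :] / v[:, None]   # S^{-1} A S
    B2 = B / v[:, None]                # S^{-1} B
    C2 = C * v[None, :]                # C S
    return A2, B2, C2, D

# Hypothetical model with lambda_PF(|A|) = sqrt(0.5) < 1 but ||A||_inf = 5.
A = np.array([[0.0, -5.0], [0.1, 0.0]])
B = np.array([[1.0], [-0.05]])
C = np.array([[1.0, -1.0]])
D = np.array([[0.5]])
u = np.array([1.0])

A2, B2, C2, D2 = rescale(A, B, C, D)
norm_before = np.abs(A).sum(axis=1).max()
norm_after = np.abs(A2).sum(axis=1).max()
y_before = predict(A, B, C, D, u)
y_after = predict(A2, B2, C2, D2, u)
```

The two predictions should coincide, while `norm_after` drops below one.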

slide-17
SLIDE 17

Composing implicit models

Cascade connection

Figure: A cascade connection.

The class of implicit models is closed under the following connections:
  • Cascade
  • Parallel and sum
  • Multiplicative
  • Feedback

slide-18
SLIDE 18

Robustness analysis

Goal: analyze the impact of input perturbations on the state and outputs.

Motivations:
  • Diagnose a given (implicit) model.
  • Generate adversarial attacks.
  • Defense: modify the training problem so as to improve robustness properties.

slide-19
SLIDE 19

Why does it matter?

Changing a few carefully chosen pixels in a test image can cause a classifier to mis-categorize the image (Kwiatkowska et al., 2019).


slide-20
SLIDE 20

Robustness analysis

Input is unknown-but-bounded: $u \in \mathcal{U}$, with
$$\mathcal{U} := \{ u_0 + \delta \in \mathbb{R}^p : |\delta| \le \sigma_u \},$$
where $u_0 \in \mathbb{R}^p$ is a “nominal” input and $\sigma_u \in \mathbb{R}^p_+$ is a measure of componentwise uncertainty around it.

Assume (sufficient condition for) well-posedness: $\phi$ componentwise non-expansive; $\lambda_{PF}(|A|) < 1$.

Nominal prediction:
$$x_0 = \phi(Ax_0 + Bu_0), \quad \hat{y}(u_0) = Cx_0 + Du_0.$$

slide-21
SLIDE 21

Component-wise bounds on the state and output

Fact: If $\lambda_{PF}(|A|) < 1$, then $I - |A|$ is invertible, and
$$|\hat{y}(u) - \hat{y}(u_0)| \le S\,|u - u_0|,$$
where
$$S := |C|(I - |A|)^{-1}|B| + |D|$$
is a “sensitivity matrix” of the implicit model.

Figure: Sensitivity matrix of a classification network with 10 outputs (each image is a row).
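
The bound can be verified numerically on a toy model. The following is a minimal sketch, assuming the ReLU activation and a hypothetical random model scaled so that ‖A‖∞ = 0.5 (hence λ_PF(|A|) ≤ 0.5); the dimensions, the perturbation size and the helper names are illustrative only.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def solve_state(A, B, u, n_iter=500):
    """Fixed-point iterations for x = relu(A x + B u), started from x = 0."""
    x = np.zeros(A.shape[0])
    for _ in range(n_iter):
        x = relu(A @ x + B @ u)
    return x

def sensitivity(A, B, C, D):
    """S = |C| (I - |A|)^{-1} |B| + |D|."""
    n = A.shape[0]
    return np.abs(C) @ np.linalg.solve(np.eye(n) - np.abs(A), np.abs(B)) + np.abs(D)

# Hypothetical random model with ||A||_inf = 0.5.
rng = np.random.default_rng(0)
n, p, q = 6, 4, 3
A = rng.uniform(-1, 1, (n, n))
A *= 0.5 / np.abs(A).sum(axis=1).max()
B = rng.standard_normal((n, p))
C = rng.standard_normal((q, n))
D = rng.standard_normal((q, p))

u0 = rng.standard_normal(p)                 # nominal input
u = u0 + 0.1 * rng.uniform(-1, 1, p)        # perturbed input
y0 = C @ solve_state(A, B, u0) + D @ u0
y = C @ solve_state(A, B, u) + D @ u

S = sensitivity(A, B, C, D)
gap = S @ np.abs(u - u0) - np.abs(y - y0)   # componentwise slack in the bound
```

Every entry of `gap` should be non-negative (up to rounding); row i of `S` indicates which inputs can move output i the most.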

slide-22
SLIDE 22

Generate a sparse attack on a targeted output

Attack method:
  • select the output to attack based on the rows (classes) of the sensitivity matrix;
  • select the top $k$ entries in the chosen row;
  • randomly alter the corresponding pixels.

Changing $k = 1$ (top) or $k = 2$ (middle, bottom) pixels, images are wrongly classified, and accuracy decreases from 99% to 74%.
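
The three steps of the attack can be sketched as follows. The details are assumptions made for illustration: the target row is passed in by the caller, "top k" is read as the k largest entries of that row, and "randomly alter" is implemented as drawing new pixel values uniformly in [0, 1]. The function names and the small matrix `S` are hypothetical.

```python
import numpy as np

def top_k_pixels(S, target, k):
    """Indices of the k inputs with the largest sensitivity for output `target`."""
    return np.argsort(S[target])[::-1][:k]

def sparse_attack(u0, S, target, k, low=0.0, high=1.0, rng=None):
    """Randomly alter the k most sensitive inputs of the chosen output row."""
    rng = np.random.default_rng() if rng is None else rng
    idx = top_k_pixels(S, target, k)
    u = u0.copy()
    u[idx] = rng.uniform(low, high, size=k)   # assumed pixel range [low, high]
    return u, idx

# Hypothetical sensitivity matrix with 2 outputs and 3 inputs.
S = np.array([[0.1, 0.9, 0.3],
              [0.5, 0.2, 0.7]])
u0 = np.array([0.2, 0.4, 0.6])
u_attacked, idx = sparse_attack(u0, S, target=0, k=2, rng=np.random.default_rng(0))
```

For output 0 the two most sensitive inputs are those with indices 1 and 2, so only those entries of `u0` are modified.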

slide-24
SLIDE 24

Generate a sparse bounded attack on a targeted output

Target a specific output with sparse attacks:
$$\mathcal{U} := \{ u_0 + \delta \in \mathbb{R}^p : |\delta| \le \sigma_u,\ \mathbf{Card}(\delta) \le k \},$$
with $k \le p$. Solve a linear program, with $c$ related to the chosen target:
$$\max_{x,\,u}\ c^\top x \ :\ x \ge Ax + Bu,\ \ x \ge 0,\ \ |x - x_0| \le \sigma_x,\ \ |u - u_0| \le \sigma_u,\ \ \|\mathrm{diag}(\sigma_u)^{-1}(u - u_0)\|_1 \le k.$$
Changing $k = 100$ pixels by a tiny amount ($\sigma_u = 0.1$), target images are wrongly classified by a network with 99% nominal accuracy.

slide-26
SLIDE 26

Training problem

Setup

Inputs: $U = [u_1, \ldots, u_m]$, with $m$ data points $u_i \in \mathbb{R}^p$, $i \in [m]$.
Outputs: $Y = [y_1, \ldots, y_m]$, with $m$ responses $y_i \in \mathbb{R}^q$, $i \in [m]$.
Predictions: with $X = [x_1, \ldots, x_m] \in \mathbb{R}^{n \times m}$ the matrix of hidden feature vectors, and $\phi$ acting columnwise,
$$\hat{Y} = CX + DU, \quad X = \phi(AX + BU).$$

slide-27
SLIDE 27

Training problem

Constrained problem

$$\min_{X,A,B,C,D}\ \mathcal{L}(Y, \hat{Y}) + \pi(A, B, C, D) \quad \text{s.t.} \quad \hat{Y} = CX + DU,\ \ X = \phi(AX + BU),\ \ \|A\|_\infty \le \kappa.$$
  • Constraint on $A$ with $\kappa < 1$ ensures well-posedness.
  • $\pi(\cdot)$ is a (convex) penalty, e.g. one that encourages robustness:
$$\pi(A, B, C, D) \propto \frac{1}{2}\,\frac{\|B\|_\infty^2 + \|C\|_\infty^2}{1 - \|A\|_\infty} + \|D\|_\infty.$$
  • May also incorporate penalties to encourage sparsity, low-rank, etc., e.g. $\sum_{i \in [p]} \|Be_i\|_\infty$ encourages entire columns of $B$ to be zero, for feature selection.

slide-28
SLIDE 28

Projected (sub) gradient

SGD can be adapted to the problem:
  • Differentiating through the equilibrium equation is possible.
  • Need to deal with the constraint of well-posedness via projection.
  • Projection on constraint $\|A\|_\infty \le \kappa$ can be done extremely fast using (vectorized) bisection, solving for each row of $A$ in parallel.
  • Can extend to Frank-Wolfe methods, which are suited to seeking sparse models.
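
The projection step can be sketched as follows. The slides do not spell out the bisection, so this is one possible reading under stated assumptions: the projection is taken in the Frobenius norm, which decouples into projecting each row of A onto the ℓ1 ball of radius κ, and each row projection is a soft-thresholding whose threshold is found by bisection. The function name `project_inf_norm` and the example matrix are hypothetical.

```python
import numpy as np

def project_inf_norm(A, kappa, n_bisect=60):
    """Frobenius-norm projection of A onto {A : ||A||_inf <= kappa}.

    Each row is soft-thresholded by theta >= 0 chosen so that its l1 norm is
    at most kappa; the thresholds of all rows are found by bisection at once.
    """
    absA = np.abs(A)
    lo = np.zeros(A.shape[0])          # threshold too small (or zero)
    hi = absA.max(axis=1)              # threshold that zeroes the whole row
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        mass = np.maximum(absA - mid[:, None], 0.0).sum(axis=1)
        too_big = mass > kappa
        lo = np.where(too_big, mid, lo)
        hi = np.where(too_big, hi, mid)
    # Rows already inside the ball are left untouched (theta = 0).
    theta = np.where(absA.sum(axis=1) > kappa, hi, 0.0)
    return np.sign(A) * np.maximum(absA - theta[:, None], 0.0)

# Hypothetical example: the first row violates the constraint, the second does not.
A = np.array([[3.0, -1.0, 0.5],
              [0.2, 0.1, -0.3]])
P = project_inf_norm(A, kappa=0.9)
```

In this example the first row is shrunk until only its largest entry survives, while the second row, whose ℓ1 norm is 0.6, is returned unchanged.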

slide-29
SLIDE 29

Example: traffic sign data set

Figure: Test accuracy and test loss versus epochs, for the implicit model and the neural network.

slide-30
SLIDE 30

Take-aways

  • Implicit models are more general than standard neural networks.
  • Well-posedness is a key property that can be enforced via norm or eigenvalue conditions.
  • Models can be composed together in modular fashion.
  • The notationally very simple framework allows for rigorous analyses for robustness, model compression, architecture optimization, etc.
  • The corresponding training problem is amenable to SGD methods.

slide-31
SLIDE 31

Towards a general theory?


slide-32
SLIDE 32

References

[1] Stephen Boyd. Perron-Frobenius theory, 2008. Lecture slides for EE 363, Stanford University.
[2] Geir E. Dullerud and Fernando Paganini. A course in robust control theory: a convex approach, volume 36. Springer Science & Business Media, 2013.
[3] L. El Ghaoui, F. Gu, B. Travacca, A. Askari, and A. Tsai. Implicit deep learning. Submitted to ICML, preliminary version at https://arxiv.org/abs/1908.06315, February 2020.