  1. Optimization Models, EECS 127 / EECS 227AT. Laurent El Ghaoui, EECS Department, UC Berkeley. Spring 2020.

  2. LECTURE 26: Implicit Deep Learning. "The Matrix is everywhere. It is all around us." (Morpheus)

  3. Outline
1 Implicit Rules
2 Link with Neural Nets
3 Well-Posedness
4 Robustness Analysis
5 Training Implicit Models
6 Take-Aways

  4. Collaborators. Joint work with: Armin Askari, Fangda Gu, Bert Travacca, Alicia Tsai (UC Berkeley); Mert Pilanci (Stanford); Emmanuel Vallod, Stefano Proto (www.sumup.ai).

  5. Implicit prediction rule
Equilibrium equation: x = φ(Ax + Bu). Prediction: ŷ(u) = Cx + Du.
Input u ∈ R^p, predicted output ŷ(u) ∈ R^q, hidden "state" vector x ∈ R^n. Model parameter matrix:
$$ M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}. $$
Activation: vector map φ : R^n → R^n, e.g. the ReLU φ(·) = max(·, 0), acting componentwise on vectors.
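
A tiny numeric instance of the rule (my own numbers, with φ = ReLU), checking that a given state solves the equilibrium equation and reading off the prediction:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)  # phi, acting componentwise

A = np.array([[0.0, 0.5],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])

u = np.array([2.0])
x = np.array([1.0, 2.0])                    # a state solving x = relu(Ax + Bu)
assert np.allclose(x, relu(A @ x + B @ u))  # equilibrium equation holds
y_hat = C @ x + D @ u                       # prediction: y(u) = [1.0]
```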

  6. Deep neural nets as implicit models. Figure: A neural network. Figure: An implicit model. Implicit models are more general: they allow loops in the network graph.

  7. Example
Fully connected, feedforward neural network:
ŷ(u) = W_L x_L,  x_{l+1} = φ_l(W_l x_l),  l = 0, ..., L − 1,  x_0 = u.
Implicit model: stack the hidden states as x = (x_L, ..., x_1), partition z = Ax + Bu conformably, and let φ act blockwise:
$$ \phi(z) = \begin{pmatrix} \phi_{L-1}(z_L) \\ \vdots \\ \phi_0(z_1) \end{pmatrix}, \qquad
\begin{pmatrix} A & B \\ C & D \end{pmatrix} =
\begin{pmatrix}
0 & W_{L-1} & \cdots & 0 & 0 \\
\vdots & & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & W_1 & 0 \\
0 & 0 & \cdots & 0 & W_0 \\
W_L & 0 & \cdots & 0 & 0
\end{pmatrix}. $$
The equilibrium equation x = φ(Ax + Bu) is easily solved via backward block substitution (the network's forward pass).
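
As an illustration, here is a sketch (function and variable names are mine, not from the slides) that packs the weights W_0, ..., W_L of a ReLU feedforward net into the block matrices above, then checks that the equilibrium solution reproduces the usual forward pass:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def feedforward_to_implicit(Ws):
    """Ws = [W_0, ..., W_L]; hidden blocks stacked as x = (x_L, ..., x_1)."""
    L = len(Ws) - 1
    dims = [W.shape[0] for W in reversed(Ws[:-1])]   # dims of x_L, ..., x_1
    off = np.concatenate(([0], np.cumsum(dims)))     # block offsets
    n, p, q = off[-1], Ws[0].shape[1], Ws[-1].shape[0]
    A, B = np.zeros((n, n)), np.zeros((n, p))
    for i in range(L - 1):                           # superdiagonal: W_{L-1}, ..., W_1
        A[off[i]:off[i+1], off[i+1]:off[i+2]] = Ws[L - 1 - i]
    B[off[L-1]:, :] = Ws[0]                          # bottom block row feeds x_1
    C = np.zeros((q, n))
    C[:, :dims[0]] = Ws[-1]                          # the output reads the top block x_L
    return A, B, C, np.zeros((q, p))

# Sanity check: the equilibrium solution reproduces the ordinary forward pass.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal(s) for s in [(4, 3), (5, 4), (2, 5)]]  # L = 2
A, B, C, D = feedforward_to_implicit(Ws)
u = rng.standard_normal(3)
x = np.zeros(A.shape[0])
for _ in range(len(Ws) - 1):          # A strictly upper triangular: L steps suffice
    x = relu(A @ x + B @ u)
h = u
for W in Ws[:-1]:
    h = relu(W @ h)
assert np.allclose(C @ x + D @ u, Ws[-1] @ h)
```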

  8. Example: ResNet20. A 20-layer network; the implicit model has order n ≈ 180,000. Convolutional layers yield blocks with Toeplitz structure; residual connections appear as lines. Figure: The A matrix for ResNet20.

  9. Neural networks as implicit models
The framework covers most neural network architectures:
- Neural nets have a strictly upper triangular matrix A.
- The equilibrium equation is solved by substitution, i.e. the "forward pass".
- The state vector x contains all the hidden features.
- The activation φ can be different for each component or block of x.
- Covers CNNs, RNNs, (Bi-)LSTMs, attention, transformers, etc.

  10. Related concept: state-space models. The so-called "state-space" models for dynamical systems use the same idea to represent high-order differential equations. Linear, time-invariant (LTI) dynamical system:
$$ \dot{x} = Ax + Bu, \qquad y = Cx + Du. $$
Figure: LTI system.

  11. Well-posedness
The matrix A ∈ R^{n×n} is said to be well-posed for φ if, for every b ∈ R^n, a solution x ∈ R^n to the equation x = φ(Ax + b) exists, and it is unique.
Figure: Equation has two or no solutions, depending on sgn(b). Figure: Solution is unique for every b.

  12. Perron-Frobenius theory [1]
A square matrix P with non-negative entries admits a real eigenvalue λ_PF with a non-negative eigenvector v ≠ 0: Pv = λ_PF v. The value λ_PF dominates all the other eigenvalues: for any other (complex) eigenvalue µ ∈ C, we have |µ| ≤ λ_PF.
Google's PageRank search engine relies on computing the Perron-Frobenius eigenvector of the web link matrix. Figure: A web link matrix.
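
A quick illustration (the helper is my own, standard power iteration): for a primitive non-negative matrix P, the iteration converges to the Perron-Frobenius pair.

```python
import numpy as np

def pf_eigenvalue(P, iters=1000, tol=1e-12):
    """Power iteration on a non-negative matrix P (assumes P primitive;
    for a strictly triangular P the iterate hits zero and lambda_PF = 0)."""
    v = np.ones(P.shape[0])
    lam = 0.0
    for _ in range(iters):
        w = P @ v
        lam_new = np.linalg.norm(w, np.inf)
        if lam_new == 0.0:
            return 0.0, v
        v_new = w / lam_new
        if abs(lam_new - lam) < tol:
            return lam_new, v_new
        lam, v = lam_new, v_new
    return lam, v

P = np.array([[0.2, 0.7],
              [0.5, 0.1]])
lam, v = pf_eigenvalue(P)
# Cross-check against the spectral radius: |mu| <= lambda_PF for all eigenvalues.
assert np.isclose(lam, np.max(np.abs(np.linalg.eigvals(P))))
```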

  13. PF sufficient condition for well-posedness
Fact: Assume that φ is componentwise non-expansive (e.g., φ = ReLU):
$$ \forall\, u, v \in \mathbb{R}^n : \quad |\phi(u) - \phi(v)| \le |u - v|. $$
Then the matrix A is well-posed for φ if the non-negative matrix |A| satisfies λ_PF(|A|) < 1, in which case the solution can be found via the fixed-point iterations
x(t + 1) = φ(Ax(t) + b),  t = 0, 1, 2, ...
This covers neural networks: there |A| is strictly upper triangular, hence λ_PF(|A|) = 0.
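
A minimal sketch of the fixed-point iteration in the Fact (function names are mine; the PF check uses the spectral radius of |A|):

```python
import numpy as np

def solve_equilibrium(A, b, phi=lambda z: np.maximum(z, 0.0),
                      max_iter=10_000, tol=1e-10):
    """Solve x = phi(Ax + b) by iterating x(t+1) = phi(A x(t) + b),
    valid when phi is componentwise non-expansive and lambda_PF(|A|) < 1."""
    if np.max(np.abs(np.linalg.eigvals(np.abs(A)))) >= 1.0:
        raise ValueError("lambda_PF(|A|) >= 1: well-posedness not guaranteed")
    x = np.zeros(A.shape[0])
    for _ in range(max_iter):
        x_next = phi(A @ x + b)
        if np.max(np.abs(x_next - x)) < tol:
            return x_next
        x = x_next
    raise RuntimeError("fixed-point iteration did not converge")

def predict(A, B, C, D, u):
    """Implicit prediction rule: y(u) = Cx + Du with x = phi(Ax + Bu)."""
    x = solve_equilibrium(A, B @ u)
    return C @ x + D @ u
```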

  14. Proof: existence
We have
$$ |x(t+1) - x(t)| = |\phi(Ax(t)+b) - \phi(Ax(t-1)+b)| \le |A|\,|x(t) - x(t-1)|, $$
which implies that for every t, τ ≥ 0:
$$ |x(t+\tau) - x(t)| \le \sum_{k=t}^{t+\tau} |A|^k |x(1) - x(0)| \le |A|^t \sum_{k=0}^{\tau} |A|^k |x(1) - x(0)| \le |A|^t w, $$
where
$$ w := \sum_{k=0}^{+\infty} |A|^k |x(1) - x(0)| = (I - |A|)^{-1} |x(1) - x(0)|, $$
since, due to λ_PF(|A|) < 1, I − |A| is invertible, and the series above converges. Since lim_{t→+∞} |A|^t = 0, we obtain that x(t) is a Cauchy sequence, hence it has a limit point, x_∞. By continuity of φ we further obtain that x_∞ = φ(Ax_∞ + b), which establishes the existence of a solution.

  15. Proof: unicity
To prove unicity, consider two solutions x_1, x_2 ∈ R^n of the equation. Using the hypotheses of the theorem, we have, for any k ≥ 1:
$$ |x_1 - x_2| \le |A|\,|x_1 - x_2| \le |A|^k |x_1 - x_2|. $$
The fact that |A|^k → 0 as k → +∞ then establishes unicity.

  16. Norm condition
A more conservative condition is ‖A‖_∞ < 1, where
$$ \lambda_{\mathrm{PF}}(|A|) \le \|A\|_\infty := \max_i \sum_j |A_{ij}|. $$
Under the previous PF condition for well-posedness, we can always rescale the model so that ‖A‖_∞ < 1 without altering the prediction rule; the scaling is related to the PF eigenvector of |A|. Hence during training we may simply use the norm condition.
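
A sketch of that rescaling (my own code; it assumes φ = ReLU, which is positively homogeneous, and an irreducible |A| so the PF eigenvector v is strictly positive): with |A|v = λ_PF v and S = diag(v), the model (S^{-1}AS, S^{-1}B, CS, D) predicts identically and satisfies ‖S^{-1}AS‖_∞ = λ_PF(|A|) < 1.

```python
import numpy as np

def rescale_model(A, B, C):
    """Diagonal rescaling by the PF eigenvector of |A| (assumed > 0)."""
    eigvals, eigvecs = np.linalg.eig(np.abs(A))
    i = np.argmax(eigvals.real)
    lam_pf = eigvals[i].real
    assert lam_pf < 1.0, "need lambda_PF(|A|) < 1"
    v = np.abs(eigvecs[:, i].real)          # PF eigenvector of |A|
    S, S_inv = np.diag(v), np.diag(1.0 / v)
    return S_inv @ A @ S, S_inv @ B, C @ S

rng = np.random.default_rng(1)
A = 0.3 * rng.random((5, 5))                # entrywise positive: |A| irreducible
B, C = rng.standard_normal((5, 2)), rng.standard_normal((3, 5))
A2, B2, C2 = rescale_model(A, B, C)
print(np.max(np.sum(np.abs(A2), axis=1)))  # ||A'||_inf = lambda_PF(|A|) < 1
```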

  17. Composing implicit models
Cascade connection. Figure: A cascade connection.
The class of implicit models is closed under the following connections:
- Cascade
- Parallel and sum
- Multiplicative
- Feedback

  18. Robustness analysis
Goal: analyze the impact of input perturbations on the state and outputs. Motivations:
- Diagnose a given (implicit) model.
- Generate adversarial attacks.
- Defense: modify the training problem so as to improve robustness properties.

  19. Why does it matter? Changing a few carefully chosen pixels in a test image can cause a classifier to mis-categorize the image (Kwiatkowska et al., 2019).

  20. Robustness analysis
Input is unknown-but-bounded: u ∈ U, with
$$ \mathcal{U} := \left\{ u_0 + \delta \in \mathbb{R}^p : |\delta| \le \sigma_u \right\}, $$
where u_0 ∈ R^p is a "nominal" input and σ_u ∈ R^p_+ is a measure of componentwise uncertainty around it.
Assume the sufficient condition for well-posedness: φ componentwise non-expansive, λ_PF(|A|) < 1.
Nominal prediction: x_0 = φ(Ax_0 + Bu_0), ŷ(u_0) = Cx_0 + Du_0.

  21. Component-wise bounds on the state and output
Fact: If λ_PF(|A|) < 1, then I − |A| is invertible, and
$$ |\hat{y}(u) - \hat{y}(u_0)| \le S\,|u - u_0|, $$
where S := |C|(I − |A|)^{−1}|B| + |D| is a "sensitivity matrix" of the implicit model.
Figure: Sensitivity matrix of a classification network with 10 outputs (each image is a row).
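
The sensitivity matrix is directly computable; a sketch with illustrative names:

```python
import numpy as np

def sensitivity_matrix(A, B, C, D):
    """S = |C| (I - |A|)^{-1} |B| + |D|, valid when lambda_PF(|A|) < 1."""
    n = A.shape[0]
    M = np.linalg.solve(np.eye(n) - np.abs(A), np.abs(B))  # (I - |A|)^{-1} |B|
    return np.abs(C) @ M + np.abs(D)

# Componentwise worst-case output deviation over the box |u - u0| <= sigma_u:
# |y(u) - y(u0)| <= S @ sigma_u.
```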

  22. Generate a sparse attack on a targeted output
Attack method:
- select the output to attack, based on the rows (classes) of the sensitivity matrix;
- select the top k entries in the chosen row;
- randomly alter the corresponding pixels.
Changing k = 1 (top) or k = 2 (middle, bottom) pixels, the images are wrongly classified, and accuracy decreases from 99% to 74%.
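
A hedged sketch of this recipe (names are mine; S is the sensitivity matrix from the previous slide):

```python
import numpy as np

def sparse_attack(S, u0, sigma_u, target, k, rng):
    """Perturb the k input components the targeted output is most sensitive to."""
    idx = np.argsort(S[target])[-k:]     # top-k entries of the chosen row
    u = u0.copy()
    u[idx] += rng.choice([-1.0, 1.0], size=k) * sigma_u[idx]  # random alteration
    return u
```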


  23. Generate a sparse, bounded attack on a targeted output
Target a specific output with sparse attacks:
$$ \mathcal{U} := \left\{ u_0 + \delta \in \mathbb{R}^p : |\delta| \le \sigma_u,\ \mathbf{Card}(\delta) \le k \right\}, $$
with k ≤ p. Solve a linear program, with c related to the chosen target:
$$ \max_{x,\,u}\ c^\top x ~:~ x \ge Ax + Bu,\ x \ge 0,\ |x - x_0| \le \sigma_x,\ |u - u_0| \le \sigma_u,\ \|\mathrm{diag}(\sigma_u)^{-1}(u - u_0)\|_1 \le k. $$
Changing k = 100 pixels by a tiny amount (σ_u = 0.1), target images are wrongly classified by a network with 99% nominal accuracy.
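
A sketch of this LP with scipy.optimize.linprog (the variable layout and helper name are mine): the ℓ1 constraint is encoded with auxiliary variables s ∈ [0, 1]^p satisfying |u − u_0| ≤ σ_u ∘ s and Σ_i s_i ≤ k.

```python
import numpy as np
from scipy.optimize import linprog

def lp_attack(A, B, c, x0, u0, sigma_x, sigma_u, k):
    """Decision variables z = (x, u, s); maximize c^T x over the relaxation."""
    n, p = A.shape[0], B.shape[1]
    obj = np.concatenate([-c, np.zeros(2 * p)])            # linprog minimizes
    A_ub = np.block([
        [A - np.eye(n),    B,          np.zeros((n, p))],  # x >= Ax + Bu
        [np.zeros((p, n)), np.eye(p),  -np.diag(sigma_u)], # u - u0 <= sigma_u * s
        [np.zeros((p, n)), -np.eye(p), -np.diag(sigma_u)], # u0 - u <= sigma_u * s
        [np.zeros((1, n)), np.zeros((1, p)), np.ones((1, p))],  # sum(s) <= k
    ])
    b_ub = np.concatenate([np.zeros(n), u0, -u0, [k]])
    bounds = ([(max(0.0, lo), hi) for lo, hi in zip(x0 - sigma_x, x0 + sigma_x)]
              + [(None, None)] * p       # u is constrained through s
              + [(0.0, 1.0)] * p)        # s in [0, 1] enforces |delta| <= sigma_u
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[n:n + p] if res.success else None         # the attacked input u
```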


  24. Training problem: setup
Inputs: U = [u_1, ..., u_m], with m data points u_i ∈ R^p, i ∈ [m].
Outputs: Y = [y_1, ..., y_m], with m responses y_i ∈ R^q, i ∈ [m].
Predictions: with X = [x_1, ..., x_m] ∈ R^{n×m} the matrix of hidden feature vectors, and φ acting columnwise:
$$ \hat{Y} = CX + DU, \qquad X = \phi(AX + BU). $$
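
In this notation the forward computation for the whole batch is a single matrix fixed-point; a sketch, assuming φ = ReLU and λ_PF(|A|) < 1:

```python
import numpy as np

def batch_predict(A, B, C, D, U, iters=500):
    """Fixed-point iteration on X = relu(AX + BU); phi acts columnwise."""
    X = np.zeros((A.shape[0], U.shape[1]))
    for _ in range(iters):               # converges when lambda_PF(|A|) < 1
        X = np.maximum(A @ X + B @ U, 0.0)
    return C @ X + D @ U                 # Y_hat = CX + DU, one column per point
```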
