
Optimization Models, EECS 127 / EECS 227AT
Laurent El Ghaoui, EECS department, UC Berkeley
Spring 2020

LECTURE 26: Implicit Deep Learning
"The Matrix is everywhere. It is all around us." (Morpheus)

Outline
1. Implicit Rules
2. Link with Neural Nets
3. Well-Posedness
4. Robustness Analysis
5. Training Implicit Models
6. Take-Aways

Collaborators
Joint work with: Armin Askari, Fangda Gu, Bert Travacca, Alicia Tsai (UC Berkeley); Mert Pilanci (Stanford); Emmanuel Vallod, Stefano Proto (www.sumup.ai).
Sponsors: (logos not reproduced)

Implicit prediction rule
Equilibrium equation: $x = \phi(Ax + Bu)$
Prediction: $\hat{y}(u) = Cx + Du$
Input $u \in \mathbb{R}^p$, predicted output $\hat{y}(u) \in \mathbb{R}^q$, hidden "state" vector $x \in \mathbb{R}^n$.
Model parameter matrix:
$$M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}.$$
Activation: vector map $\phi : \mathbb{R}^n \to \mathbb{R}^n$, e.g. the ReLU: $\phi(\cdot) = \max(\cdot, 0)$ (acting componentwise on vectors).

Deep neural nets as implicit models
Figure: A neural network.
Figure: An implicit model.
Implicit models are more general: they allow loops in the network graph.

Example
Fully connected, feedforward neural network:
$$\hat{y}(u) = W_L x_L, \quad x_{l+1} = \phi_{l+1}(W_l x_l), \quad l = 0, \ldots, L-1, \quad x_0 = u.$$
Implicit model: with the state ordered as $x = (x_L, \ldots, x_1)$ and $z = Ax + Bu$,
$$x = \phi(z) = \begin{pmatrix} \phi_L(z_L) \\ \vdots \\ \phi_1(z_1) \end{pmatrix}, \qquad
\begin{pmatrix} A & B \\ C & D \end{pmatrix} =
\left(\begin{array}{cccc|c}
0 & W_{L-1} & \cdots & 0 & 0 \\
 & 0 & \ddots & \vdots & \vdots \\
 & & \ddots & W_1 & 0 \\
 & & & 0 & W_0 \\ \hline
W_L & 0 & \cdots & 0 & 0
\end{array}\right).$$
The equilibrium equation $x = \phi(Ax + Bu)$ is easily solved via backward substitution (the forward pass).
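The block structure above can be checked numerically. The following is a minimal sketch, assuming ReLU activations in every layer and small random weights; the helper names `feedforward`, `to_implicit` and `implicit_predict` are hypothetical and not from the lecture.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def feedforward(weights, u):
    """Plain forward pass: x_0 = u, x_{l+1} = relu(W_l x_l), y = W_L x_L."""
    x = u
    for W in weights[:-1]:
        x = relu(W @ x)
    return weights[-1] @ x

def to_implicit(weights):
    """Assemble (A, B, C, D) for the state ordering x = (x_L, ..., x_1)."""
    L = len(weights) - 1                             # number of hidden layers
    sizes = [weights[l].shape[0] for l in range(L)]  # dims of x_1, ..., x_L
    # Block i (0-based) of the state holds x_{L-i}; starts[i] is its first row.
    starts = np.concatenate(([0], np.cumsum(sizes[::-1])))
    n, p, q = starts[-1], weights[0].shape[1], weights[-1].shape[0]
    A, B = np.zeros((n, n)), np.zeros((n, p))
    C, D = np.zeros((q, n)), np.zeros((q, p))
    for i in range(L - 1):
        # x_{L-i} = relu(W_{L-i-1} x_{L-i-1}): block row i, block column i+1.
        A[starts[i]:starts[i + 1], starts[i + 1]:starts[i + 2]] = weights[L - i - 1]
    B[starts[L - 1]:starts[L], :] = weights[0]       # x_1 = relu(W_0 u)
    C[:, starts[0]:starts[1]] = weights[L]           # y = W_L x_L
    return A, B, C, D

def implicit_predict(A, B, C, D, u, num_layers):
    """Solve x = relu(Ax + Bu) by num_layers substitution sweeps, then predict."""
    x = np.zeros(A.shape[0])
    for _ in range(num_layers):
        x = relu(A @ x + B @ u)
    return C @ x + D @ u

# Illustrative network: 3 inputs, hidden layers of size 4 and 5, 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.standard_normal(s) for s in [(4, 3), (5, 4), (2, 5)]]
u = rng.standard_normal(3)
A, B, C, D = to_implicit(weights)
y_implicit = implicit_predict(A, B, C, D, u, num_layers=2)
y_forward = feedforward(weights, u)
```

Because A is strictly block upper triangular, each sweep fixes one more block of the state, so L sweeps starting from zero reproduce the forward pass.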

Example: ResNet20
- 20-layer network, implicit model of order $n \sim 180000$.
- Convolutional layers have blocks with Toeplitz structure.
- Residual connections appear as lines.
Figure: The A matrix for ResNet20.

Neural networks as implicit models
The framework covers most neural network architectures:
- Feedforward neural nets have a strictly upper triangular matrix $A$.
- The equilibrium equation is solved by substitution, i.e. the "forward pass".
- The state vector $x$ contains all the hidden features.
- The activation $\phi$ can be different for each component or block of $x$.
- Covers CNNs, RNNs, (Bi-)LSTMs, attention, transformers, etc.

Related concept: state-space models
The so-called "state-space" models for dynamical systems use the same idea to represent high-order differential equations...
Linear, time-invariant (LTI) dynamical system:
$$\dot{x} = Ax + Bu, \quad y = Cx + Du.$$
Figure: LTI system.

Well-posedness
The matrix $A \in \mathbb{R}^{n \times n}$ is said to be well-posed for $\phi$ if, for every $b \in \mathbb{R}^n$, a solution $x \in \mathbb{R}^n$ to the equation
$$x = \phi(Ax + b)$$
exists and is unique.
Figure: Equation has two or no solutions, depending on $\mathrm{sgn}(b)$.
Figure: Solution is unique for every $b$.

Perron-Frobenius theory [1]
A square matrix $P$ with non-negative entries admits a real eigenvalue $\lambda_{PF}$ with a non-negative eigenvector $v \neq 0$:
$$Pv = \lambda_{PF} v.$$
The value $\lambda_{PF}$ dominates all the other eigenvalues: for any other (complex) eigenvalue $\mu \in \mathbb{C}$, we have $|\mu| \le \lambda_{PF}$.
Google's PageRank algorithm relies on computing the Perron-Frobenius eigenvector of the web link matrix.
Figure: A web link matrix.
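As an illustration of the statement above, here is a minimal sketch that computes the PF eigenvalue and eigenvector of a tiny, made-up 3-page link matrix. The matrix `G`, the damping factor 0.85 and the function name `perron_frobenius` are assumptions chosen for the example; the damping makes every entry positive, so the PF eigenvalue is simple.

```python
import numpy as np

def perron_frobenius(P):
    """Return (lam, v) with P v = lam v, v >= 0 and sum(v) = 1.

    Assumes P is entrywise positive, so the PF eigenvalue is simple.
    """
    eigvals, eigvecs = np.linalg.eig(P)
    i = np.argmax(eigvals.real)       # the PF eigenvalue has the largest real part
    lam = eigvals[i].real
    v = np.abs(eigvecs[:, i].real)    # fix the arbitrary sign of the eigenvector
    return lam, v / v.sum()

# Hypothetical column-stochastic link matrix: column j spreads page j's links.
G = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
P = 0.85 * G + 0.15 / 3               # damped version, all entries positive
lam, v = perron_frobenius(P)
```

For a column-stochastic matrix the PF eigenvalue is 1, and the normalized eigenvector `v` plays the role of the page scores.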

PF sufficient condition for well-posedness
Fact: Assume that $\phi$ is componentwise non-expansive (e.g., $\phi$ = ReLU):
$$\forall\, u, v \in \mathbb{R}^n : \quad |\phi(u) - \phi(v)| \le |u - v|,$$
where the absolute value and the inequality are understood componentwise. Then the matrix $A$ is well-posed for $\phi$ if the non-negative matrix $|A|$ satisfies
$$\lambda_{PF}(|A|) < 1,$$
in which case the solution can be found via the fixed-point iterations:
$$x(t+1) = \phi(Ax(t) + b), \quad t = 0, 1, 2, \ldots$$
Covers feedforward neural networks: in that case $|A|$ is strictly upper triangular, thus $\lambda_{PF}(|A|) = 0$.
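The fixed-point iteration and the PF check can be sketched as follows. This is a minimal illustration, assuming a ReLU activation by default; the function names, tolerance and iteration cap are choices made for the example, not values from the lecture.

```python
import numpy as np

def pf_eigenvalue(A):
    """PF eigenvalue of |A|, i.e. the spectral radius of the non-negative matrix |A|."""
    return np.max(np.abs(np.linalg.eigvals(np.abs(A))))

def solve_equilibrium(A, b, phi=lambda z: np.maximum(z, 0.0),
                      tol=1e-10, max_iter=10_000):
    """Solve x = phi(Ax + b) by fixed-point iteration, starting from x(0) = 0."""
    if pf_eigenvalue(A) >= 1:
        raise ValueError("sufficient condition lambda_PF(|A|) < 1 is not satisfied")
    x = np.zeros(A.shape[0])
    for _ in range(max_iter):
        x_next = phi(A @ x + b)
        if np.max(np.abs(x_next - x)) <= tol:   # successive iterates agree
            return x_next
        x = x_next
    raise RuntimeError("fixed-point iteration did not converge")

# Illustrative data: A has a loop (it is not triangular) but lambda_PF(|A|) < 1.
A = np.array([[0.2, -0.5],
              [0.3,  0.1]])
b = np.array([1.0, -0.5])
x = solve_equilibrium(A, b)
```

The check is only a sufficient condition: a matrix that fails it may still be well-posed, so raising an error here is a conservative design choice.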

Proof: existence
We have, for every $t \ge 1$,
$$|x(t+1) - x(t)| = |\phi(Ax(t) + b) - \phi(Ax(t-1) + b)| \le |A|\,|x(t) - x(t-1)|,$$
which implies that for every $t, \tau \ge 0$:
$$|x(t+\tau) - x(t)| \le \sum_{k=t}^{t+\tau} |A|^k |x(1) - x(0)| \le |A|^t \sum_{k=0}^{\tau} |A|^k |x(1) - x(0)| \le |A|^t w,$$
where
$$w := \sum_{k=0}^{+\infty} |A|^k |x(1) - x(0)| = (I - |A|)^{-1} |x(1) - x(0)|,$$
since, due to $\lambda_{PF}(|A|) < 1$, $I - |A|$ is invertible, and the series above converges. Since $\lim_{t \to +\infty} |A|^t = 0$, we obtain that $x(t)$ is a Cauchy sequence, hence it has a limit point, $x_\infty$. By continuity of $\phi$ we further obtain that $x_\infty = \phi(Ax_\infty + b)$, which establishes the existence of a solution.

Proof: uniqueness
To prove uniqueness, consider $x^1, x^2 \in \mathbb{R}^n_+$, two solutions to the equation. Using the hypotheses in the theorem, we have, for any $k \ge 1$:
$$|x^1 - x^2| \le |A|\,|x^1 - x^2| \le \cdots \le |A|^k |x^1 - x^2|.$$
The fact that $|A|^k \to 0$ as $k \to +\infty$ then establishes uniqueness.

Norm condition
More conservative condition: $\|A\|_\infty < 1$, where
$$\lambda_{PF}(|A|) \le \|A\|_\infty := \max_i \sum_j |A_{ij}|.$$
Under the previous PF condition for well-posedness:
- we can always rescale the model so that $\|A\|_\infty < 1$, without altering the prediction rule;
- the scaling is related to the PF eigenvector of $|A|$.
Hence during training we may simply use the norm condition.
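One possible rescaling can be sketched as below. Two assumptions are made loudly here: the activation is the ReLU (so it commutes with positive diagonal scalings), and the scaling vector is taken as $s = (I - |A|)^{-1}\mathbf{1}$, which is always strictly positive when $\lambda_{PF}(|A|) < 1$. This is a substitute for the PF-eigenvector scaling mentioned on the slide, not the lecture's own construction; the names `rescale` and `predict` are hypothetical.

```python
import numpy as np

def predict(A, B, C, D, u, num_iter=300):
    """ReLU implicit prediction via fixed-point iteration from x = 0."""
    x = np.zeros(A.shape[0])
    for _ in range(num_iter):
        x = np.maximum(A @ x + B @ u, 0.0)
    return C @ x + D @ u

def rescale(A, B, C, D):
    """Change of state variables x = diag(s) x_new with s = (I - |A|)^{-1} 1 > 0.

    Row i of |A_new| sums to (s_i - 1) / s_i < 1, so ||A_new||_inf < 1.
    """
    n = A.shape[0]
    s = np.linalg.solve(np.eye(n) - np.abs(A), np.ones(n))
    A_new = A * s[None, :] / s[:, None]      # diag(s)^{-1} A diag(s)
    B_new = B / s[:, None]                   # diag(s)^{-1} B
    C_new = C * s[None, :]                   # C diag(s)
    return A_new, B_new, C_new, D

# Illustrative model: ||A||_inf = 3, yet lambda_PF(|A|) = sqrt(0.3) < 1.
A = np.array([[0.0, -3.0],
              [0.1,  0.0]])
B = np.array([[1.0, 0.5],
              [-0.5, 2.0]])
C = np.array([[1.0, -1.0]])
D = np.zeros((1, 2))
A2, B2, C2, D2 = rescale(A, B, C, D)
```

The rescaled model has the same predictions because ReLU satisfies relu(diag(s)^{-1} z) = diag(s)^{-1} relu(z) for s > 0; for other activations this argument does not apply as written.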

Composing implicit models
Figure: A cascade connection.
The class of implicit models is closed under the following connections:
- Cascade
- Parallel and sum
- Multiplicative
- Feedback

Robustness analysis
Goal: analyze the impact of input perturbations on the state and outputs.
Motivations:
- Diagnose a given (implicit) model.
- Generate adversarial attacks.
- Defense: modify the training problem so as to improve robustness properties.

Why does it matter?
Changing a few carefully chosen pixels in a test image can cause a classifier to miscategorize the image (Kwiatkowska et al., 2019).

Robustness analysis
Input is unknown-but-bounded: $u \in \mathcal{U}$, with
$$\mathcal{U} := \{ u_0 + \delta \in \mathbb{R}^p : |\delta| \le \sigma_u \},$$
- $u_0 \in \mathbb{R}^p$ is a "nominal" input;
- $\sigma_u \in \mathbb{R}^p_+$ is a measure of componentwise uncertainty around it.
Assume (sufficient condition for) well-posedness:
- $\phi$ componentwise non-expansive;
- $\lambda_{PF}(|A|) < 1$.
Nominal prediction:
$$x_0 = \phi(Ax_0 + Bu_0), \quad \hat{y}(u_0) = Cx_0 + Du_0.$$

Component-wise bounds on the state and output
Fact: If $\lambda_{PF}(|A|) < 1$, then $I - |A|$ is invertible, and
$$|\hat{y}(u) - \hat{y}(u_0)| \le S\,|u - u_0|,$$
where
$$S := |C|(I - |A|)^{-1}|B| + |D|$$
is a "sensitivity matrix" of the implicit model.
Figure: Sensitivity matrix of a classification network with 10 outputs (each image is a row).
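The bound can be checked numerically on a small random model. The sketch below assumes a ReLU activation and a random $A$ scaled so that $\|A\|_\infty = 0.5$; the dimensions, seed and function names are illustrative assumptions.

```python
import numpy as np

def predict(A, B, C, D, u, num_iter=200):
    """ReLU implicit prediction via fixed-point iteration from x = 0."""
    x = np.zeros(A.shape[0])
    for _ in range(num_iter):
        x = np.maximum(A @ x + B @ u, 0.0)
    return C @ x + D @ u

def sensitivity_matrix(A, B, C, D):
    """S = |C| (I - |A|)^{-1} |B| + |D|, computed with a linear solve."""
    n = A.shape[0]
    return np.abs(C) @ np.linalg.solve(np.eye(n) - np.abs(A), np.abs(B)) + np.abs(D)

rng = np.random.default_rng(0)
n, p, q = 6, 4, 3
A = rng.standard_normal((n, n))
A *= 0.5 / np.abs(A).sum(axis=1).max()     # enforce ||A||_inf = 0.5 < 1
B = rng.standard_normal((n, p))
C = rng.standard_normal((q, n))
D = rng.standard_normal((q, p))
S = sensitivity_matrix(A, B, C, D)

# Compare the actual output change with the componentwise bound S |u - u0|.
u0 = rng.standard_normal(p)
u = u0 + 0.1 * rng.standard_normal(p)
change = np.abs(predict(A, B, C, D, u) - predict(A, B, C, D, u0))
bound = S @ np.abs(u - u0)
```

Entry (i, j) of `S` bounds how much output i can move per unit change in input j, which is why its rows can be read as per-class sensitivity maps.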

Generate a sparse attack on a targeted output
Attack method:
- select the output to attack based on the rows (classes) of the sensitivity matrix;
- select the top k entries in the chosen row;
- randomly alter the corresponding pixels.
Changing k = 1 (top) or k = 2 (middle, bottom) pixels, images are wrongly classified, and accuracy decreases from 99% to 74%.
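The attack steps above can be sketched as follows. This is a minimal illustration: the target row is passed in by the caller (the slide does not specify how it is chosen), and the pixel range [0, 1], the toy sensitivity matrix and the function name `sparse_attack` are assumptions.

```python
import numpy as np

def sparse_attack(S, u0, target, k, rng, low=0.0, high=1.0):
    """Randomly alter the k inputs to which output `target` is most sensitive.

    S is the sensitivity matrix (outputs x inputs); pixel values are assumed
    to lie in [low, high].
    """
    idx = np.argsort(S[target])[-k:]          # indices of the k largest entries
    u = u0.copy()
    u[idx] = rng.uniform(low, high, size=k)   # random new values for those pixels
    return u, idx

# Toy sensitivity matrix with 2 outputs and 4 input pixels.
S = np.array([[0.1, 0.9, 0.3, 0.5],
              [0.7, 0.2, 0.4, 0.1]])
u0 = np.zeros(4)
rng = np.random.default_rng(0)
u_attacked, idx = sparse_attack(S, u0, target=0, k=2, rng=rng)
```

Only the selected pixels change, so the perturbation has cardinality at most k by construction.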

Generate a sparse bounded attack on a targeted output
Target a specific output with sparse attacks:
$$\mathcal{U} := \{ u_0 + \delta \in \mathbb{R}^p : |\delta| \le \sigma_u, \ \mathrm{Card}(\delta) \le k \},$$
with $k \le p$. Solve a linear program, with $c$ related to the chosen target:
$$\max_{x, u} \; c^\top x \;:\; x \ge Ax + Bu, \;\; x \ge 0, \;\; |x - x_0| \le \sigma_x, \;\; |u - u_0| \le \sigma_u, \;\; \|\mathrm{diag}(\sigma_u)^{-1}(u - u_0)\|_1 \le k.$$
Changing k = 100 pixels by a tiny amount ($\sigma_u = 0.1$), target images are wrongly classified by a network with 99% nominal accuracy.

Training problem
Setup
- Inputs: $U = [u_1, \ldots, u_m]$, with $m$ data points $u_i \in \mathbb{R}^p$, $i \in [m]$.
- Outputs: $Y = [y_1, \ldots, y_m]$, with $m$ responses $y_i \in \mathbb{R}^q$, $i \in [m]$.
- Predictions: with $X = [x_1, \ldots, x_m] \in \mathbb{R}^{n \times m}$ the matrix of hidden feature vectors, and $\phi$ acting columnwise,
$$\hat{Y} = CX + DU, \quad X = \phi(AX + BU).$$
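The matrix form of the prediction rule can be sketched in a few lines. This is a minimal illustration assuming a ReLU activation, a random model with $\|A\|_\infty = 0.5$, and a fixed number of fixed-point sweeps; the dimensions and the name `batch_predict` are assumptions made for the example.

```python
import numpy as np

def batch_predict(A, B, C, D, U, num_iter=200):
    """Solve X = relu(AX + BU) for all columns of U at once, then predict.

    U has shape (p, m); returns Y_hat of shape (q, m) and X of shape (n, m).
    """
    X = np.zeros((A.shape[0], U.shape[1]))
    for _ in range(num_iter):
        X = np.maximum(A @ X + B @ U, 0.0)
    return C @ X + D @ U, X

rng = np.random.default_rng(0)
n, p, q, m = 5, 3, 2, 8
A = rng.standard_normal((n, n))
A *= 0.5 / np.abs(A).sum(axis=1).max()    # enforce ||A||_inf = 0.5 < 1
B = rng.standard_normal((n, p))
C = rng.standard_normal((q, n))
D = rng.standard_normal((q, p))
U = rng.standard_normal((p, m))           # one data point per column
Y_hat, X = batch_predict(A, B, C, D, U)
```

Each column of `X` is the equilibrium state for the corresponding column of `U`, so the batch computation agrees with solving the m equilibrium equations one at a time.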
