SLIDE 1

Random Matrix Theory Proves that Deep Learning Representations of GAN-data Behave as Gaussian Mixtures

ICML 2020

  • M.E.A. Seddik¹,²,∗, C. Louart¹,³, M. Tamaazousti¹, R. Couillet²,³

¹CEA List, France  ²CentraleSupélec, L2S, France  ³GIPSA Lab, Grenoble-Alpes University, France  ∗http://melaseddik.github.io/

June 8, 2020


SLIDE 2

Abstract

Context:
◮ Study of large Gram matrices of concentrated data.

Motivation:
◮ Gram matrices are at the core of various ML algorithms.
◮ RMT predicts their performance under Gaussian assumptions on the data.
◮ BUT real data are unlikely to be close to Gaussian vectors.

Results:
◮ GAN data (≈ real data) fall within the class of concentrated vectors.
◮ Universality result: only first- and second-order statistics of concentrated data matter to describe the behavior of their Gram matrices.


SLIDE 3

Notion of Concentrated Vectors

Definition (Concentrated Vectors)

Given a normed space (E, ‖·‖_E) and q ∈ ℝ, a random vector Z ∈ E is q-exponentially concentrated if for any 1-Lipschitz¹ function F : E → ℝ, there exist C, c > 0 such that

∀t > 0,  P{ |F(Z) − E F(Z)| ≥ t } ≤ C e^{−(t/c)^q},   denoted Z ∈ E_q(c).

If c is independent of dim(E), we write Z ∈ E_q(1).

Concentrated vectors enjoy:
(P1) If X ∼ N(0, I_p) then X ∈ E_2(1): "Gaussian vectors are concentrated vectors."
(P2) If X ∈ E_q(1) and G is a λ_G-Lipschitz map, then G(X) ∈ E_q(λ_G): "Concentrated vectors are stable through Lipschitz maps."

¹Reminder: F : E → F is λ_F-Lipschitz if ∀(x, y) ∈ E² : ‖F(x) − F(y)‖_F ≤ λ_F ‖x − y‖_E.
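As a quick numerical illustration of (P1) and (P2), the sketch below (a toy with assumed sizes and an assumed random ReLU map, not part of the original slides) checks that the 1-Lipschitz functional F(x) = ‖x‖ of a Gaussian vector fluctuates by O(1) around its mean regardless of the dimension p, and that the same holds after applying a Lipschitz map.

```python
# Toy check of (P1)/(P2): the 1-Lipschitz functional F(x) = ||x|| has O(1)
# fluctuations for Gaussian vectors and for their images under a Lipschitz map,
# independently of the dimension p. Sizes and the map are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def norm_fluctuation(p, n_samples=2000):
    Z = rng.standard_normal((n_samples, p))        # Z ~ N(0, I_p): property (P1)
    W = rng.standard_normal((p, p)) / np.sqrt(p)   # O(1)-Lipschitz affine map
    GZ = np.maximum(Z @ W.T, 0)                    # ReLU(W z): Lipschitz image, (P2)
    return np.std(np.linalg.norm(Z, axis=1)), np.std(np.linalg.norm(GZ, axis=1))

for p in (100, 400, 1600):
    s_gauss, s_lip = norm_fluctuation(p)
    print(f"p = {p:5d}   std ||Z|| = {s_gauss:.2f}   std ||G(Z)|| = {s_lip:.2f}")
```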

SLIDE 4

Why Concentrated Vectors?

Figure: Images artificially generated using the BigGAN model [Brock et al., ICLR'19].

Real Data ≈ GAN Data = G(z) = F_L ◦ F_{L−1} ◦ · · · ◦ F_1(z),  with z Gaussian,

where the F_i's correspond to Fully Connected layers, Convolutional layers, Sub-sampling, Pooling and activation functions, residual connections or Batch Normalisation.
⇒ The F_i's are essentially Lipschitz operations.
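To make the composition above concrete, here is a hedged toy generator built only from Lipschitz blocks and applied to Gaussian noise; the architecture and sizes are illustrative assumptions (not BigGAN). By (P1) and (P2) from Slide 3, its outputs are concentrated vectors.

```python
# Hedged toy generator G = F_L ∘ ... ∘ F_1 applied to Gaussian noise; the
# architecture and dimensions are assumptions for illustration, not BigGAN.
import torch
import torch.nn as nn

G = nn.Sequential(
    nn.Linear(64, 256), nn.ReLU(),   # affine layer + 1-Lipschitz activation
    nn.Linear(256, 256), nn.ReLU(),
    nn.BatchNorm1d(256),             # affine (hence Lipschitz) at inference time
    nn.Linear(256, 784),             # output "image" of 28 x 28 pixels
).eval()

z = torch.randn(16, 64)              # Gaussian latent codes
with torch.no_grad():
    fake = G(z)                      # concentrated vectors by (P1) + (P2)
print(fake.shape)                    # torch.Size([16, 784])
```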


SLIDE 5

Why Concentrated Vectors?

◮ Fully Connected Layers and Convolutional Layers are affine operations: F_i(x) = W_i x + b_i, with ‖F_i‖_lip = sup_{u≠0} ‖W_i u‖_p / ‖u‖_p, for any p-norm.
◮ Pooling Layers and Activation Functions are 1-Lipschitz operations with respect to any p-norm (e.g., ReLU and max-pooling).
◮ Residual Connections: F_i(x) = x + F_i^{(ℓ)} ◦ · · · ◦ F_i^{(1)}(x), where the F_i^{(j)}'s are Lipschitz operations, thus F_i is a Lipschitz operation with Lipschitz constant bounded by 1 + ∏_{j=1}^{ℓ} ‖F_i^{(j)}‖_lip.
◮ . . .

By:
(P1) If X ∼ N(0, I_p) then X ∈ E_2(1)
(P2) If X ∈ E_q(1) and G is a λ_G-Lipschitz map, then G(X) ∈ E_q(λ_G)

⇒ GAN data are concentrated vectors by design.
Remark: we still need to control λ_G (see the product-of-spectral-norms sketch below).
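A minimal sketch of the remark above, with assumed toy widths and random weights (not any particular architecture): since affine layers are ‖W_i‖₂-Lipschitz and ReLU/pooling are 1-Lipschitz, the product of per-layer spectral norms upper-bounds λ_G.

```python
# Upper bound on lambda_G as the product of per-layer spectral norms; the widths
# and random weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
dims = [64, 128, 128, 256]                       # hypothetical layer widths
weights = [rng.standard_normal((dims[i + 1], dims[i])) / np.sqrt(dims[i])
           for i in range(len(dims) - 1)]

layer_lip = [np.linalg.norm(W, ord=2) for W in weights]   # largest singular values
lambda_G_bound = float(np.prod(layer_lip))       # 1-Lipschitz activations add nothing
print("per-layer constants:", np.round(layer_lip, 2),
      "  bound on lambda_G:", round(lambda_G_bound, 2))
```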


SLIDE 6

Control of λG with Spectral Normalization

Let σ∗ > 0 and G be a neural network composed of N affine layers, each one of input dimension d_{i−1} and output dimension d_i for i ∈ [N], with 1-Lipschitz activation functions. Consider the following dynamics with learning rate η:

W ← W − η E,  with E_{i,j} ∼ N(0, 1),
W ← W − max(0, σ_1(W) − σ∗) u_1(W) v_1(W)^⊺.

The Lipschitz constant of G is bounded at convergence with high probability as:

λ_G ≤ ∏_{i=1}^{N} ( ε + √(σ∗² + η² d_i d_{i−1}) ).

Figure: Largest singular value σ_1 across iterations, with and without SN, for σ∗ ∈ {2, 3, 4}, compared to the theoretical bound. Parameters: N = 1, d_0 = d_1 = 100 and η = 1/d_0.
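The sketch below implements one reading of the dynamics above in the N = 1 setting of the figure (a hedged toy, not the authors' code): a noisy update followed by clipping of the largest singular value at σ∗, compared against the stated bound.

```python
# Toy run of the spectral-normalization dynamics: noisy step, then subtract
# max(0, sigma_1 - sigma_star) * u_1 v_1^T. Settings follow the figure (N = 1).
import numpy as np

rng = np.random.default_rng(0)
d0 = d1 = 100
eta = 1.0 / d0
sigma_star = 2.0

W = rng.standard_normal((d1, d0)) / np.sqrt(d0)
for _ in range(1000):
    W -= eta * rng.standard_normal((d1, d0))           # W <- W - eta * E, E_ij ~ N(0, 1)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    excess = max(0.0, s[0] - sigma_star)                # max(0, sigma_1(W) - sigma_*)
    W -= excess * np.outer(U[:, 0], Vt[0])              # rank-one spectral clipping

bound = np.sqrt(sigma_star**2 + eta**2 * d0 * d1)       # stated bound with N = 1, eps -> 0
sigma1 = np.linalg.svd(W, compute_uv=False)[0]
print(f"final sigma_1 = {sigma1:.3f}   theoretical bound = {bound:.3f}")
```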


SLIDE 7

Model & Assumptions

(A1) Data matrix (distributed in k classes C_1, C_2, . . . , C_k):

X = [ x_1, . . . , x_{n_1}, x_{n_1+1}, . . . , x_{n_2}, . . . , x_{n−n_k+1}, . . . , x_n ] ∈ ℝ^{p×n},

where the columns belonging to class C_ℓ satisfy x_i ∈ E_{q_ℓ}(1).

Model statistics: µ_ℓ = E_{x_i ∈ C_ℓ}[x_i],  C_ℓ = E_{x_i ∈ C_ℓ}[x_i x_i^⊺].

(A2) Growth rate assumptions: as p → ∞,
1. p/n → c ∈ (0, ∞).
2. The number of classes k is bounded.
3. For any ℓ ∈ [k], ‖µ_ℓ‖ = O(√p).

Gram matrix and its resolvent:

G = (1/p) X^⊺ X,  Q(z) = (G + z I_n)^{−1},
m_L(z) = (1/n) tr Q(−z),  U U^⊺ = −(1/2πi) ∮_γ Q(−z) dz.
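For concreteness, a small sketch of these objects on a synthetic two-class mixture; all sizes and statistics below are toy assumptions for illustration.

```python
# Two-class data matrix X in R^{p x n}, its Gram matrix G = X^T X / p and the
# resolvent Q(z) = (G + z I_n)^{-1}; all sizes and statistics are toy assumptions.
import numpy as np

rng = np.random.default_rng(0)
p, n1, n2 = 256, 200, 200
n = n1 + n2
mu = np.zeros(p); mu[0] = 2.0                            # class-mean separation

X = np.concatenate([rng.standard_normal((p, n1)) - mu[:, None],
                    rng.standard_normal((p, n2)) + mu[:, None]], axis=1)

G = X.T @ X / p                                          # n x n Gram matrix
z = 1.0
Q = np.linalg.inv(G + z * np.eye(n))                     # resolvent Q(z)
trace_stat = np.trace(Q) / n                             # linear statistic (1/n) tr Q(z)
print("top eigenvalues of G:", np.round(np.linalg.eigvalsh(G)[-3:], 2),
      "  (1/n) tr Q(z) =", round(trace_stat, 3))
```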


SLIDE 8

Main Result

Theorem

Under Assumptions (A1) and (A2), we have Q(z) ∈ E_q(p^{−1/2}). Furthermore,

‖E[Q(z)] − Q̃(z)‖ = O( √(log p / p) ),

where

Q̃(z) = (1/z) Λ(z) + (1/(p z)) J Ω(z) J^⊺,

with J ∈ ℝ^{n×k} the matrix of class-membership indicator vectors,

Λ(z) = diag{ 1_{n_ℓ} / (1 + δ_ℓ(z)) }_{ℓ=1}^{k},  Ω(z) = diag{ µ_ℓ^⊺ R̃(z) µ_ℓ }_{ℓ=1}^{k},

R̃(z) = ( (1/k) ∑_{ℓ=1}^{k} C_ℓ / (1 + δ_ℓ(z)) + z I_p )^{−1},

and δ(z) = [δ_1(z), . . . , δ_k(z)] is the unique fixed point of the system of equations

δ_ℓ(z) = tr( C_ℓ ( (1/k) ∑_{j=1}^{k} C_j / (1 + δ_j(z)) + z I_p )^{−1} ),  for each ℓ ∈ [k].
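A hedged sketch of the fixed-point system as stated above (this is my reading of the slide, not the authors' code; the class covariances and means are assumed toy values): iterate the map defining δ(z), then assemble R̃(z) and Ω(z).

```python
# Fixed-point iteration for delta(z) and the quantities R~(z), Omega(z) of the
# deterministic equivalent, in a toy balanced two-class setting (assumed C_l, mu_l).
import numpy as np

p, k, z = 200, 2, 1.0
C = [np.eye(p), 2.0 * np.eye(p)]                   # assumed class covariances
mu = [np.zeros(p), np.ones(p) / np.sqrt(p)]        # assumed class means

delta = np.ones(k)
for _ in range(200):                               # delta_l = tr(C_l R~(z))
    R = np.linalg.inv(sum(C[j] / (1.0 + delta[j]) for j in range(k)) / k + z * np.eye(p))
    delta = np.array([np.trace(C[l] @ R) for l in range(k)])

Omega = np.diag([mu[l] @ R @ mu[l] for l in range(k)])
print("fixed point delta(z):", np.round(delta, 3))
print("Omega(z) diagonal:   ", np.round(np.diag(Omega), 4))
```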


SLIDE 9

Main Result

Theorem (as stated on Slide 8)

Key Observation: Only first and second order statistics matter!
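A hedged numerical illustration of this observation (a toy, not the paper's experiment): build concentrated data as a Lipschitz map of Gaussian noise, build a Gaussian surrogate with the same mean and covariance, and compare the spectra of the two Gram matrices.

```python
# Universality illustration: the Gram-matrix spectrum of concentrated data (a
# Lipschitz map of Gaussian noise) is close to that of a Gaussian surrogate with
# the same first two moments. All sizes and maps are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
p, n, d = 300, 600, 100

# "GAN-like" data: ReLU of an affine map of Gaussian noise (a concentrated vector).
W, b = rng.standard_normal((p, d)) / np.sqrt(d), rng.standard_normal(p)
X = np.maximum(W @ rng.standard_normal((d, n)) + b[:, None], 0)

# Gaussian surrogate with matched mean and covariance.
mu, C = X.mean(axis=1), np.cov(X)
Y = rng.multivariate_normal(mu, C, size=n).T

eig_X = np.linalg.eigvalsh(X.T @ X / p)
eig_Y = np.linalg.eigvalsh(Y.T @ Y / p)
print("largest Gram eigenvalues:", np.round(eig_X[-3:], 2), "vs", np.round(eig_Y[-3:], 2))
```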


SLIDE 10

Application to CNN Representations of GAN Images

Diagram: Generator → Discriminator (Real / Fake) and Representation Network; both are Lipschitz operations, so the resulting representations are Concentrated Vectors.

◮ CNN representations correspond to the penultimate layer (a feature-extraction sketch follows below).
◮ Popular architectures considered in practice are: ResNet, VGG, DenseNet.
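A hedged sketch of extracting such representations (not the paper's pipeline; the choice of torchvision's ResNet-18, its pretrained weights, and the random stand-in batch are assumptions for illustration): drop the classification head so the network returns its penultimate-layer features.

```python
# Extract penultimate-layer "CNN representations" from a pretrained ResNet-18.
# Model choice and preprocessing are illustrative assumptions.
import torch
import torchvision.models as models

resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()            # drop the head -> penultimate features
resnet.eval()

with torch.no_grad():
    images = torch.randn(8, 3, 224, 224)   # stand-in batch; real/GAN images in practice
    features = resnet(images)              # shape (8, 512): the representations studied here
print(features.shape)
```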


SLIDE 11

Application to CNN Representations of GAN Images

Figure: GAN Images vs. Real Images; k = 3 classes, n = 3000 images.


SLIDE 12

Application to CNN Representations of GAN Images

Figure: GAN Images vs. Real Images.


SLIDE 13

Application to CNN Representations of GAN Images

Figure: GAN Images vs. Real Images.


SLIDE 14

Application to CNN Representations of GAN Images

Figure: GAN Images vs. Real Images.


SLIDE 15

Performance of a linear SVM classifier

Figure: linear SVM performance on GAN Images.
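A hedged sketch of such an experiment (illustrative, not the paper's setup): train a linear SVM on representation vectors and report test accuracy. The stand-in features below are synthetic so the snippet runs on its own; in practice they would come from a feature extractor such as the ResNet sketch above.

```python
# Linear SVM accuracy on (stand-in) CNN representations of GAN images.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def svm_accuracy(features, labels):
    X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.3, random_state=0)
    clf = LinearSVC(C=1.0, max_iter=10000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# Synthetic stand-in features (3 classes, 512-dim), to be replaced by real representations.
rng = np.random.default_rng(0)
feats_gan = rng.standard_normal((300, 512)) + np.repeat(np.eye(3, 512) * 2, 100, axis=0)
y_gan = np.repeat(np.arange(3), 100)
print("accuracy on GAN-image representations:", svm_accuracy(feats_gan, y_gan))
```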


SLIDE 16

Performance of a linear SVM classifier

Figure: linear SVM performance on Real Images.


SLIDE 17

Take away messages

◮ Concentrated Vectors seem appropriate for realistic data modelling.
◮ Universality of linear classifiers regardless of the data distribution.
◮ RMT can anticipate the performance of standard classifiers for DL representations of GAN images.
◮ Universality supports the Gaussianity assumption on the data representations as considered in the literature, e.g., the FID metric (a computation sketch follows below):

d²((µ, C), (µ_w, C_w)) = ‖µ − µ_w‖² + tr( C + C_w − 2 (C C_w)^{1/2} ).
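A minimal sketch of the FID-style distance in the last bullet, assuming only numpy/scipy; the toy statistics are placeholders, while in practice µ, C and µ_w, C_w are the means and covariances of representations of real and generated images.

```python
# d^2((mu, C), (mu_w, C_w)) = ||mu - mu_w||^2 + tr(C + C_w - 2 (C C_w)^{1/2})
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu, C, mu_w, C_w):
    covmean = sqrtm(C @ C_w)
    if np.iscomplexobj(covmean):          # drop tiny imaginary parts from sqrtm
        covmean = covmean.real
    return np.sum((mu - mu_w) ** 2) + np.trace(C + C_w - 2 * covmean)

# Toy usage with assumed small statistics.
mu, C = np.zeros(4), np.eye(4)
mu_w, C_w = 0.1 * np.ones(4), 1.2 * np.eye(4)
print(frechet_distance(mu, C, mu_w, C_w))
```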
