SLIDE 1

Random Matrix Theory Proves that Deep Learning Representations of GAN-data Behave as Gaussian Mixtures

ICML 2020

  • M.E.A. Seddik¹,²,∗, C. Louart¹,³, M. Tamaazousti¹, R. Couillet²,³

¹CEA List, France  ²CentraleSupélec, L2S, France  ³GIPSA Lab, Grenoble-Alpes University, France  ∗http://melaseddik.github.io/

June 8, 2020


SLIDE 2

Abstract

Context:
◮ Study of large Gram matrices of concentrated data.

Motivation:
◮ Gram matrices are at the core of various ML algorithms.
◮ RMT predicts their performance under Gaussian assumptions on the data.
◮ BUT real data are unlikely to be close to Gaussian vectors.

Results:
◮ GAN data (≈ real data) fall within the class of concentrated vectors.
◮ Universality result: only first- and second-order statistics of concentrated data matter to describe the behavior of their Gram matrices.


SLIDE 3

Notion of Concentrated Vectors

Definition (Concentrated Vectors)

Given a normed space (E, ‖·‖_E) and q ∈ ℝ, a random vector Z ∈ E is q-exponentially concentrated if for any 1-Lipschitz¹ function F : E → ℝ, there exist C, c > 0 such that

∀t > 0,  P{ |F(Z) − E F(Z)| ≥ t } ≤ C e^{−(t/c)^q},   denoted Z ∈ E_q(c).

If c is independent of dim(E), we write Z ∈ E_q(1).

Concentrated vectors enjoy:
(P1) If X ∼ N(0, I_p) then X ∈ E_2(1): "Gaussian vectors are concentrated vectors."
(P2) If X ∈ E_q(1) and G is a λ_G-Lipschitz map, then G(X) ∈ E_q(λ_G): "Concentrated vectors are stable through Lipschitz maps."

¹Reminder: F : E → F is λ_F-Lipschitz if ∀(x, y) ∈ E² : ‖F(x) − F(y)‖_F ≤ λ_F ‖x − y‖_E.
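As a quick numerical illustration of (P1) and (P2), the sketch below (a toy with assumed sizes and an assumed random ReLU map, not part of the original slides) checks that the 1-Lipschitz functional F(x) = ‖x‖ of a Gaussian vector fluctuates by O(1) around its mean regardless of the dimension p, and that the same holds after applying a Lipschitz map.

```python
# Toy check of (P1)/(P2): the 1-Lipschitz functional F(x) = ||x|| has O(1)
# fluctuations for Gaussian vectors and for their images under a Lipschitz map,
# independently of the dimension p. Sizes and the map are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def norm_fluctuation(p, n_samples=2000):
    Z = rng.standard_normal((n_samples, p))        # Z ~ N(0, I_p): property (P1)
    W = rng.standard_normal((p, p)) / np.sqrt(p)   # O(1)-Lipschitz affine map
    GZ = np.maximum(Z @ W.T, 0)                    # ReLU(W z): Lipschitz image, (P2)
    return np.std(np.linalg.norm(Z, axis=1)), np.std(np.linalg.norm(GZ, axis=1))

for p in (100, 400, 1600):
    s_gauss, s_lip = norm_fluctuation(p)
    print(f"p = {p:5d}   std ||Z|| = {s_gauss:.2f}   std ||G(Z)|| = {s_lip:.2f}")
```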

SLIDE 4

Why Concentrated Vectors?

Figure: Images artificially generated using the BigGAN model [Brock et al., ICLR'19].

Real Data ≈ GAN Data = G(z) = F_L ◦ F_{L−1} ◦ · · · ◦ F_1(z),  with z Gaussian,

where the F_i's correspond to Fully Connected layers, Convolutional layers, Sub-sampling, Pooling and activation functions, residual connections or Batch Normalisation.
⇒ The F_i's are essentially Lipschitz operations.
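To make the composition above concrete, here is a hedged toy generator built only from Lipschitz blocks and applied to Gaussian noise; the architecture and sizes are illustrative assumptions (not BigGAN). By (P1) and (P2) from Slide 3, its outputs are concentrated vectors.

```python
# Hedged toy generator G = F_L ∘ ... ∘ F_1 applied to Gaussian noise; the
# architecture and dimensions are assumptions for illustration, not BigGAN.
import torch
import torch.nn as nn

G = nn.Sequential(
    nn.Linear(64, 256), nn.ReLU(),   # affine layer + 1-Lipschitz activation
    nn.Linear(256, 256), nn.ReLU(),
    nn.BatchNorm1d(256),             # affine (hence Lipschitz) at inference time
    nn.Linear(256, 784),             # output "image" of 28 x 28 pixels
).eval()

z = torch.randn(16, 64)              # Gaussian latent codes
with torch.no_grad():
    fake = G(z)                      # concentrated vectors by (P1) + (P2)
print(fake.shape)                    # torch.Size([16, 784])
```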


SLIDE 5

Why Concentrated Vectors?

◮ Fully Connected Layers and Convolutional Layers are affine operations: F_i(x) = W_i x + b_i, with ‖F_i‖_lip = sup_{u≠0} ‖W_i u‖_p / ‖u‖_p, for any p-norm.
◮ Pooling Layers and Activation Functions are 1-Lipschitz operations with respect to any p-norm (e.g., ReLU and max-pooling).
◮ Residual Connections: F_i(x) = x + F_i^{(ℓ)} ◦ · · · ◦ F_i^{(1)}(x), where the F_i^{(j)}'s are Lipschitz operations, thus F_i is a Lipschitz operation with Lipschitz constant bounded by 1 + ∏_{j=1}^{ℓ} ‖F_i^{(j)}‖_lip.
◮ . . .

By:
(P1) If X ∼ N(0, I_p) then X ∈ E_2(1)
(P2) If X ∈ E_q(1) and G is a λ_G-Lipschitz map, then G(X) ∈ E_q(λ_G)

⇒ GAN data are concentrated vectors by design.
Remark: we still need to control λ_G (see the product-of-spectral-norms sketch below).
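A minimal sketch of the remark above, with assumed toy widths and random weights (not any particular architecture): since affine layers are ‖W_i‖₂-Lipschitz and ReLU/pooling are 1-Lipschitz, the product of per-layer spectral norms upper-bounds λ_G.

```python
# Upper bound on lambda_G as the product of per-layer spectral norms; the widths
# and random weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
dims = [64, 128, 128, 256]                       # hypothetical layer widths
weights = [rng.standard_normal((dims[i + 1], dims[i])) / np.sqrt(dims[i])
           for i in range(len(dims) - 1)]

layer_lip = [np.linalg.norm(W, ord=2) for W in weights]   # largest singular values
lambda_G_bound = float(np.prod(layer_lip))       # 1-Lipschitz activations add nothing
print("per-layer constants:", np.round(layer_lip, 2),
      "  bound on lambda_G:", round(lambda_G_bound, 2))
```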


SLIDE 6

Control of λG with Spectral Normalization

Let σ∗ > 0 and G be a neural network composed of N affine layers, each one of input dimension d_{i−1} and output dimension d_i for i ∈ [N], with 1-Lipschitz activation functions. Consider the following dynamics with learning rate η:

W ← W − η E,  with E_{i,j} ∼ N(0, 1),
W ← W − max(0, σ_1(W) − σ∗) u_1(W) v_1(W)^⊺.

The Lipschitz constant of G is bounded at convergence with high probability as:

λ_G ≤ ∏_{i=1}^{N} ( ε + √(σ∗² + η² d_i d_{i−1}) ).

Figure: Largest singular value σ_1 across iterations, with and without SN, for σ∗ ∈ {2, 3, 4}, compared to the theoretical bound. Parameters: N = 1, d_0 = d_1 = 100 and η = 1/d_0.
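The sketch below implements one reading of the dynamics above in the N = 1 setting of the figure (a hedged toy, not the authors' code): a noisy update followed by clipping of the largest singular value at σ∗, compared against the stated bound.

```python
# Toy run of the spectral-normalization dynamics: noisy step, then subtract
# max(0, sigma_1 - sigma_star) * u_1 v_1^T. Settings follow the figure (N = 1).
import numpy as np

rng = np.random.default_rng(0)
d0 = d1 = 100
eta = 1.0 / d0
sigma_star = 2.0

W = rng.standard_normal((d1, d0)) / np.sqrt(d0)
for _ in range(1000):
    W -= eta * rng.standard_normal((d1, d0))           # W <- W - eta * E, E_ij ~ N(0, 1)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    excess = max(0.0, s[0] - sigma_star)                # max(0, sigma_1(W) - sigma_*)
    W -= excess * np.outer(U[:, 0], Vt[0])              # rank-one spectral clipping

bound = np.sqrt(sigma_star**2 + eta**2 * d0 * d1)       # stated bound with N = 1, eps -> 0
sigma1 = np.linalg.svd(W, compute_uv=False)[0]
print(f"final sigma_1 = {sigma1:.3f}   theoretical bound = {bound:.3f}")
```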


SLIDE 7

Model & Assumptions

(A1) Data matrix (distributed in k classes C_1, C_2, . . . , C_k):

X = [ x_1, . . . , x_{n_1}, x_{n_1+1}, . . . , x_{n_2}, . . . , x_{n−n_k+1}, . . . , x_n ] ∈ ℝ^{p×n},

where the columns belonging to class C_ℓ satisfy x_i ∈ E_{q_ℓ}(1).

Model statistics: µ_ℓ = E_{x_i ∈ C_ℓ}[x_i],  C_ℓ = E_{x_i ∈ C_ℓ}[x_i x_i^⊺].

(A2) Growth rate assumptions: as p → ∞,
1. p/n → c ∈ (0, ∞).
2. The number of classes k is bounded.
3. For any ℓ ∈ [k], ‖µ_ℓ‖ = O(√p).

Gram matrix and its resolvent:

G = (1/p) X^⊺ X,  Q(z) = (G + z I_n)^{−1},
m_L(z) = (1/n) tr Q(−z),  U U^⊺ = −(1/2πi) ∮_γ Q(−z) dz.
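For concreteness, a small sketch of these objects on a synthetic two-class mixture; all sizes and statistics below are toy assumptions for illustration.

```python
# Two-class data matrix X in R^{p x n}, its Gram matrix G = X^T X / p and the
# resolvent Q(z) = (G + z I_n)^{-1}; all sizes and statistics are toy assumptions.
import numpy as np

rng = np.random.default_rng(0)
p, n1, n2 = 256, 200, 200
n = n1 + n2
mu = np.zeros(p); mu[0] = 2.0                            # class-mean separation

X = np.concatenate([rng.standard_normal((p, n1)) - mu[:, None],
                    rng.standard_normal((p, n2)) + mu[:, None]], axis=1)

G = X.T @ X / p                                          # n x n Gram matrix
z = 1.0
Q = np.linalg.inv(G + z * np.eye(n))                     # resolvent Q(z)
trace_stat = np.trace(Q) / n                             # linear statistic (1/n) tr Q(z)
print("top eigenvalues of G:", np.round(np.linalg.eigvalsh(G)[-3:], 2),
      "  (1/n) tr Q(z) =", round(trace_stat, 3))
```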


SLIDE 8

Main Result

Theorem

Under Assumptions (A1) and (A2), we have Q(z) ∈ E_q(p^{−1/2}). Furthermore,

‖E[Q(z)] − Q̃(z)‖ = O( √(log p / p) ),

where

Q̃(z) = (1/z) Λ(z) + (1/(p z)) J Ω(z) J^⊺,

with J ∈ ℝ^{n×k} the matrix of class-membership indicator vectors,

Λ(z) = diag{ 1_{n_ℓ} / (1 + δ_ℓ(z)) }_{ℓ=1}^{k},  Ω(z) = diag{ µ_ℓ^⊺ R̃(z) µ_ℓ }_{ℓ=1}^{k},

R̃(z) = ( (1/k) ∑_{ℓ=1}^{k} C_ℓ / (1 + δ_ℓ(z)) + z I_p )^{−1},

and δ(z) = [δ_1(z), . . . , δ_k(z)] is the unique fixed point of the system of equations

δ_ℓ(z) = tr( C_ℓ ( (1/k) ∑_{j=1}^{k} C_j / (1 + δ_j(z)) + z I_p )^{−1} ),  for each ℓ ∈ [k].
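A hedged sketch of the fixed-point system as stated above (this is my reading of the slide, not the authors' code; the class covariances and means are assumed toy values): iterate the map defining δ(z), then assemble R̃(z) and Ω(z).

```python
# Fixed-point iteration for delta(z) and the quantities R~(z), Omega(z) of the
# deterministic equivalent, in a toy balanced two-class setting (assumed C_l, mu_l).
import numpy as np

p, k, z = 200, 2, 1.0
C = [np.eye(p), 2.0 * np.eye(p)]                   # assumed class covariances
mu = [np.zeros(p), np.ones(p) / np.sqrt(p)]        # assumed class means

delta = np.ones(k)
for _ in range(200):                               # delta_l = tr(C_l R~(z))
    R = np.linalg.inv(sum(C[j] / (1.0 + delta[j]) for j in range(k)) / k + z * np.eye(p))
    delta = np.array([np.trace(C[l] @ R) for l in range(k)])

Omega = np.diag([mu[l] @ R @ mu[l] for l in range(k)])
print("fixed point delta(z):", np.round(delta, 3))
print("Omega(z) diagonal:   ", np.round(np.diag(Omega), 4))
```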


SLIDE 9

Main Result

Theorem (as stated on Slide 8)

Key Observation: Only first and second order statistics matter!
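A hedged numerical illustration of this observation (a toy, not the paper's experiment): build concentrated data as a Lipschitz map of Gaussian noise, build a Gaussian surrogate with the same mean and covariance, and compare the spectra of the two Gram matrices.

```python
# Universality illustration: the Gram-matrix spectrum of concentrated data (a
# Lipschitz map of Gaussian noise) is close to that of a Gaussian surrogate with
# the same first two moments. All sizes and maps are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
p, n, d = 300, 600, 100

# "GAN-like" data: ReLU of an affine map of Gaussian noise (a concentrated vector).
W, b = rng.standard_normal((p, d)) / np.sqrt(d), rng.standard_normal(p)
X = np.maximum(W @ rng.standard_normal((d, n)) + b[:, None], 0)

# Gaussian surrogate with matched mean and covariance.
mu, C = X.mean(axis=1), np.cov(X)
Y = rng.multivariate_normal(mu, C, size=n).T

eig_X = np.linalg.eigvalsh(X.T @ X / p)
eig_Y = np.linalg.eigvalsh(Y.T @ Y / p)
print("largest Gram eigenvalues:", np.round(eig_X[-3:], 2), "vs", np.round(eig_Y[-3:], 2))
```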


SLIDE 10

Application to CNN Representations of GAN Images

Diagram: Generator → Discriminator (Real / Fake) and Representation Network; both are Lipschitz operations, so the resulting representations are Concentrated Vectors.

◮ CNN representations correspond to the penultimate layer (a feature-extraction sketch follows below).
◮ Popular architectures considered in practice are: ResNet, VGG, DenseNet.
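A hedged sketch of extracting such representations (not the paper's pipeline; the choice of torchvision's ResNet-18, its pretrained weights, and the random stand-in batch are assumptions for illustration): drop the classification head so the network returns its penultimate-layer features.

```python
# Extract penultimate-layer "CNN representations" from a pretrained ResNet-18.
# Model choice and preprocessing are illustrative assumptions.
import torch
import torchvision.models as models

resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()            # drop the head -> penultimate features
resnet.eval()

with torch.no_grad():
    images = torch.randn(8, 3, 224, 224)   # stand-in batch; real/GAN images in practice
    features = resnet(images)              # shape (8, 512): the representations studied here
print(features.shape)
```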


SLIDE 11

Application to CNN Representations of GAN Images

Figure: GAN Images vs. Real Images; k = 3 classes, n = 3000 images.


SLIDE 12

Application to CNN Representations of GAN Images

Figure: GAN Images vs. Real Images.


SLIDE 13

Application to CNN Representations of GAN Images

Figure: GAN Images vs. Real Images.


SLIDE 14

Application to CNN Representations of GAN Images

Figure: GAN Images vs. Real Images.


SLIDE 15

Performance of a linear SVM classifier

Figure: linear SVM performance on GAN Images.
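A hedged sketch of such an experiment (illustrative, not the paper's setup): train a linear SVM on representation vectors and report test accuracy. The stand-in features below are synthetic so the snippet runs on its own; in practice they would come from a feature extractor such as the ResNet sketch above.

```python
# Linear SVM accuracy on (stand-in) CNN representations of GAN images.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def svm_accuracy(features, labels):
    X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.3, random_state=0)
    clf = LinearSVC(C=1.0, max_iter=10000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# Synthetic stand-in features (3 classes, 512-dim), to be replaced by real representations.
rng = np.random.default_rng(0)
feats_gan = rng.standard_normal((300, 512)) + np.repeat(np.eye(3, 512) * 2, 100, axis=0)
y_gan = np.repeat(np.arange(3), 100)
print("accuracy on GAN-image representations:", svm_accuracy(feats_gan, y_gan))
```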


SLIDE 16

Performance of a linear SVM classifier

Figure: linear SVM performance on Real Images.


SLIDE 17

Take away messages

◮ Concentrated Vectors seem appropriate for realistic data modelling.
◮ Universality of linear classifiers regardless of the data distribution.
◮ RMT can anticipate the performance of standard classifiers for DL representations of GAN images.
◮ Universality supports the Gaussianity assumption on the data representations as considered in the literature, e.g., the FID metric (a computation sketch follows below):

d²((µ, C), (µ_w, C_w)) = ‖µ − µ_w‖² + tr( C + C_w − 2 (C C_w)^{1/2} ).
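A minimal sketch of the FID-style distance in the last bullet, assuming only numpy/scipy; the toy statistics are placeholders, while in practice µ, C and µ_w, C_w are the means and covariances of representations of real and generated images.

```python
# d^2((mu, C), (mu_w, C_w)) = ||mu - mu_w||^2 + tr(C + C_w - 2 (C C_w)^{1/2})
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu, C, mu_w, C_w):
    covmean = sqrtm(C @ C_w)
    if np.iscomplexobj(covmean):          # drop tiny imaginary parts from sqrtm
        covmean = covmean.real
    return np.sum((mu - mu_w) ** 2) + np.trace(C + C_w - 2 * covmean)

# Toy usage with assumed small statistics.
mu, C = np.zeros(4), np.eye(4)
mu_w, C_w = 0.1 * np.ones(4), 1.2 * np.eye(4)
print(frechet_distance(mu, C, mu_w, C_w))
```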
