SLIDE 1

Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks

Yuan Cao and Quanquan Gu

Computer Science Department

SLIDE 2

Learning Over-parameterized DNNs

Empirical observation on extremely wide deep neural networks (Zhang et al. 2017; Bartlett et al. 2017; Neyshabur et al. 2018; Arora et al. 2019)

SLIDE 3

Learning Over-parameterized DNNs

Empirical observation on extremely wide deep neural networks (Zhang et al. 2017; Bartlett et al. 2017; Neyshabur et al. 2018; Arora et al. 2019)

◮ Why can extremely wide neural networks generalize?
◮ What data can be learned by deep and wide neural networks?

SLIDE 4

Learning Over-parameterized DNNs

◮ Fully connected neural network with width m: f_W(x) = √m · W_L σ(W_{L−1} · · · σ(W_1 x) · · · ).
◮ σ(·) is the ReLU activation function: σ(t) = max(0, t).
◮ Loss on example (x_i, y_i): L_{(x_i,y_i)}(W) = ℓ[y_i · f_W(x_i)], where ℓ(z) = log(1 + exp(−z)).
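As a concrete illustration, here is a minimal JAX sketch of this forward pass and loss. The parameter layout, layer shapes, and function names are assumptions made for illustration, not the authors' code:

```python
import jax.numpy as jnp

def forward(params, x):
    # params = [W_1, ..., W_{L-1}, W_L]; hidden layers have width m, output is a scalar.
    *hidden, W_L = params
    h = x
    for W in hidden:
        h = jnp.maximum(W @ h, 0.0)        # ReLU: sigma(t) = max(0, t)
    m = W_L.shape[1]                       # network width
    return jnp.sqrt(m) * (W_L @ h)[0]      # f_W(x) = sqrt(m) * W_L sigma(W_{L-1} ... sigma(W_1 x))

def loss(params, x, y):
    # L_{(x_i, y_i)}(W) = l[y_i * f_W(x_i)] with l(z) = log(1 + exp(-z))
    return jnp.log1p(jnp.exp(-y * forward(params, x)))
```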

SLIDE 5

Learning Over-parameterized DNNs

Setup as on Slide 4: width-m fully connected ReLU network f_W(x) with loss L_{(x_i,y_i)}(W) = ℓ[y_i · f_W(x_i)].

Algorithm: SGD for DNNs starting at Gaussian initialization
  Initialize W_l^(0) ∼ N(0, 2/m) for l ∈ [L − 1], and W_L^(0) ∼ N(0, 1/m).
  for i = 1, 2, . . . , n do
    Draw (x_i, y_i) from D.
    Update W^(i) = W^(i−1) − η · ∇_W L_{(x_i,y_i)}(W^(i−1)).
  end for
  Output: randomly choose Ŵ uniformly from {W^(0), . . . , W^(n−1)}.
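A sketch of this one-pass (online) SGD in JAX, reusing the hypothetical `forward` and `loss` from the previous sketch; storing every iterate is wasteful but mirrors the algorithm's random-iterate output:

```python
from jax import grad, random
import jax.numpy as jnp

def init_params(key, d, m, L):
    # Entries of W_l^(0) ~ N(0, 2/m) for l in [L-1]; entries of W_L^(0) ~ N(0, 1/m).
    keys = random.split(key, L)
    params = [random.normal(keys[0], (m, d)) * jnp.sqrt(2.0 / m)]
    params += [random.normal(k, (m, m)) * jnp.sqrt(2.0 / m) for k in keys[1:L - 1]]
    params += [random.normal(keys[L - 1], (1, m)) * jnp.sqrt(1.0 / m)]
    return params

def sgd(key, data, d, m, L, eta):
    # data = [(x_1, y_1), ..., (x_n, y_n)], each example drawn fresh from D.
    init_key, pick_key = random.split(key)
    params = init_params(init_key, d, m, L)
    iterates = [params]                                 # W^(0)
    for x, y in data:
        g = grad(loss)(params, x, y)                    # gradient of L_{(x_i,y_i)} at W^(i-1)
        params = [W - eta * gW for W, gW in zip(params, g)]
        iterates.append(params)                         # W^(i)
    # Output: choose uniformly from {W^(0), ..., W^(n-1)}.
    i = int(random.randint(pick_key, (), 0, len(data)))
    return iterates[i]
```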

SLIDE 6

Generalization Bounds for DNNs

Theorem

For any R > 0, if m ≥ Ω(poly(R, L, n)), then with high probability, SGD returns Ŵ that satisfies

  E[L^{0−1}_D(Ŵ)] ≤ inf_{f ∈ F(W^(0), R)} (4/n) Σ_{i=1}^n ℓ[y_i · f(x_i)] + O( LR/√n + √(log(1/δ)/n) ),

where F(W^(0), R) = { f_{W^(0)}(·) + ⟨∇_W f_{W^(0)}(·), W⟩ : ‖W_l‖_F ≤ R · m^{−1/2}, l ∈ [L] }.

SLIDE 7

Generalization Bounds for DNNs

Theorem (restated from Slide 6)

The benchmark function class in the bound,

  F(W^(0), R) = { f_{W^(0)}(·) + ⟨∇_W f_{W^(0)}(·), W⟩ : ‖W_l‖_F ≤ R · m^{−1/2}, l ∈ [L] },

is the Neural Tangent Random Feature (NTRF) model: the network's first-order expansion around its random initialization.
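A hedged JAX sketch of an NTRF prediction, reusing the hypothetical `forward` above; here `W` is an assumed list of per-layer matrices playing the role of the displacement with ‖W_l‖_F ≤ R · m^{−1/2}:

```python
from jax import jvp

def ntrf_predict(params0, W, x):
    # f_{W^(0)}(x) + <grad_W f_{W^(0)}(x), W>, computed as a Jacobian-vector product
    # of the network at its random initialization params0.
    f0, directional = jvp(lambda p: forward(p, x), (params0,), (W,))
    return f0 + directional
```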

SLIDE 8

Generalization Bounds for DNNs

Corollary

Let y = (y_1, . . . , y_n)^⊤ and λ_0 = λ_min(Θ^(L)). If m ≥ Ω(poly(L, n, λ_0^{−1})), then with high probability, SGD returns Ŵ that satisfies

  E[L^{0−1}_D(Ŵ)] ≤ O( L · √( inf_{ỹ: ỹ_i y_i ≥ 1} ỹ^⊤(Θ^(L))^{−1} ỹ / n ) ) + O( √(log(1/δ)/n) ),

where Θ^(L) is the neural tangent kernel (Jacot et al. 2018) Gram matrix: Θ^(L)_{i,j} := lim_{m→∞} m^{−1} ⟨∇_W f_{W^(0)}(x_i), ∇_W f_{W^(0)}(x_j)⟩.
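For intuition, the finite-width counterpart of this Gram matrix can be computed directly from the gradients at initialization. A sketch under the same assumptions as the earlier snippets (exact only in the m → ∞ limit, and quadratic in n):

```python
from jax import grad
import jax.numpy as jnp

def ntk_gram(params0, xs):
    # Theta_{ij} ~= m^{-1} <grad_W f_{W^(0)}(x_i), grad_W f_{W^(0)}(x_j)> at finite width m.
    m = params0[-1].shape[1]
    feats = []
    for x in xs:
        g = grad(forward)(params0, x)                       # per-layer gradient matrices
        feats.append(jnp.concatenate([gl.ravel() for gl in g]))
    F = jnp.stack(feats)                                    # shape (n, #parameters)
    return (F @ F.T) / m
```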

SLIDE 9

Generalization Bounds for DNNs

Corollary

Let y = (y1, . . . , yn)⊤ and λ0 = λmin(Θ(L)). If m ≥ Ω

  • poly(L, n, λ−1

0 )

  • , then with high probability,

SGD returns W that satisfies E

  • L0−1

D

( W)

O

  • L ·

inf

  • yiyi≥1
  • y⊤(Θ(L))−1

y n

  • + O
  • log(1/δ)

n

  • .

where Θ(L) is the neural tangent kernel (Jacot et al. 2018) Gram matrix. Θ(L)

i,j := limm→∞ m−1∇WfW(0)(xi), ∇WfW(0)(xj).

The “classifiability” of the underlying data distribution D can also be measured by the quantity inf

yiyi≥1

  • y⊤(Θ(L))−1

y.
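Since the true label vector y (with y_i ∈ {±1}) is itself feasible for the constraint ỹ_i y_i ≥ 1, this quantity is at most y^⊤(Θ^(L))^{−1} y, which gives a simple upper bound on the leading term of the corollary. A sketch of that upper bound (ignoring constants and log factors):

```python
import jax.numpy as jnp

def classifiability_upper_bound(Theta, y):
    # Leading term of the bound with the feasible choice y~ = y: sqrt(y^T Theta^{-1} y / n).
    n = y.shape[0]
    return jnp.sqrt(y @ jnp.linalg.solve(Theta, y) / n)
```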

SLIDE 10

Overview of the Proof

Key observations

◮ Deep ReLU networks are almost linear in terms of their parameters in a small neighbourhood around random initialization:
  f_{W′}(x_i) ≈ f_W(x_i) + ⟨∇f_W(x_i), W′ − W⟩.
◮ L_{(x_i,y_i)}(W) is Lipschitz continuous and almost convex:
  ‖∇_{W_l} L_{(x_i,y_i)}(W)‖_F ≤ O(√m), l ∈ [L],
  L_{(x_i,y_i)}(W′) ≳ L_{(x_i,y_i)}(W) + ⟨∇_W L_{(x_i,y_i)}(W), W′ − W⟩.
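The first observation can be checked numerically. A sketch reusing the hypothetical `forward` from Slide 4; the returned gap should shrink as the width m grows and as W′ stays close to W:

```python
from jax import jvp

def linearization_gap(params, params_prime, x):
    # f_{W'}(x) - [ f_W(x) + <grad f_W(x), W' - W> ]
    delta = [Wp - W for Wp, W in zip(params_prime, params)]
    f_w, first_order = jvp(lambda p: forward(p, x), (params,), (delta,))
    return forward(params_prime, x) - (f_w + first_order)
```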

SLIDE 11

Overview of the Proof

Key observations

(Same two observations as on Slide 10.)

Optimization for Lipschitz and (almost) convex functions + online-to-batch conversion.

SLIDE 12

Overview of the Proof

Key observations

(Same two observations as on Slide 10.)

Applicable to general loss functions: if ℓ(·) is convex/Lipschitz/smooth, then L_{(x_i,y_i)}(W) is (almost) convex/Lipschitz/smooth.

SLIDE 13

Summary

◮ Generalization bounds for wide DNNs that do not increase in network width.
◮ A random feature model (NTRF model) that naturally connects over-parameterized DNNs with NTK.
◮ A quantification of the "classifiability" of data: inf_{ỹ: ỹ_i y_i ≥ 1} ỹ^⊤(Θ^(L))^{−1} ỹ.
◮ A clean and simple proof framework for neural networks in the "NTK regime" that is applicable to various problem settings.

SLIDE 14

Summary


Thank you!

Poster #141
