Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks
Yuan Cao and Quanquan Gu
Computer Science Department

Learning Over-parameterized DNNs
[Figure: empirical observation on extremely wide deep neural networks]

◮ Fully connected neural network with width m: f_W(x) = √m · W_L σ(W_{L−1} ⋯ σ(W_1 x) ⋯ ).
◮ σ(·) is the ReLU activation function: σ(t) = max(0, t).
◮ Loss on example (x_i, y_i): L_{(x_i,y_i)}(W) = ℓ[y_i · f_W(x_i)], where ℓ(z) = log(1 + exp(−z)) is the cross-entropy (logistic) loss.
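
For concreteness, here is a minimal NumPy sketch of this setup: the forward pass f_W and the per-example loss. The Gaussian initialization variances (2/m for hidden layers, 1/m for the output layer) are an assumed detail chosen to fit the √m-scaled parameterization, not something stated on the slide.

import numpy as np

def relu(t):
    # sigma(t) = max(0, t)
    return np.maximum(0.0, t)

def init_weights(d, m, L, rng):
    # Random Gaussian initialization; the variances 2/m and 1/m are an
    # assumed choice matching the sqrt(m)-scaled parameterization.
    Ws = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))]
    Ws += [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)) for _ in range(L - 2)]
    Ws += [rng.normal(0.0, np.sqrt(1.0 / m), size=(1, m))]
    return Ws

def f_W(Ws, x):
    # f_W(x) = sqrt(m) * W_L sigma(W_{L-1} ... sigma(W_1 x) ...)
    m = Ws[0].shape[0]
    h = x
    for W in Ws[:-1]:
        h = relu(W @ h)
    return np.sqrt(m) * (Ws[-1] @ h).item()

def example_loss(Ws, x, y):
    # L_{(x,y)}(W) = l[y * f_W(x)], with l(z) = log(1 + exp(-z));
    # logaddexp(0, -z) is the numerically stable form of log(1 + exp(-z)).
    return np.logaddexp(0.0, -y * f_W(Ws, x))

rng = np.random.default_rng(0)
d, m, L = 10, 512, 3
Ws = init_weights(d, m, L, rng)
x = rng.normal(size=d); x /= np.linalg.norm(x)   # unit-norm input
print(example_loss(Ws, x, y=1.0))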

Theorem. For any R > 0, if m ≥ Ω̃(poly(R, L, n)), then with probability at least 1 − δ over the random initialization W^{(0)}, SGD returns Ŵ that satisfies

    E[L_D^{0−1}(Ŵ)] ≤ inf_{f ∈ F(W^{(0)}, R)} { (4/n) Σ_{i=1}^{n} ℓ[y_i · f(x_i)] } + O( LR/√n + √(log(1/δ)/n) ),

where L_D^{0−1}(·) is the expected 0–1 classification error over the data distribution D, and F(W^{(0)}, R) is the neural tangent random feature (NTRF) function class: functions of the form f(x) = f_{W^{(0)}}(x) + ⟨∇_W f_{W^{(0)}}(x), W⟩ with ‖W‖ ≤ R · m^{−1/2}.
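
To make F(W^{(0)}, R) concrete, the sketch below (continuing the NumPy example above, so it reuses relu, init_weights, and f_W) evaluates one member of the NTRF class. The manual-backprop grad_f and the way the radius constraint ‖W‖ ≤ R · m^{−1/2} is enforced on a single flattened vector are illustrative assumptions, not the slides' exact construction.

import numpy as np

def grad_f(Ws, x):
    # Gradient of f_W(x) with respect to all weights, flattened.
    # Manual backprop through f_W(x) = sqrt(m) * W_L relu(... relu(W_1 x)).
    m = Ws[0].shape[0]
    hs = [x]
    for W in Ws[:-1]:
        hs.append(relu(W @ hs[-1]))
    grads = [np.sqrt(m) * hs[-1][None, :]]       # d f / d W_L
    delta = np.sqrt(m) * Ws[-1].ravel()          # d f / d h_{L-1}
    for l in range(len(Ws) - 2, -1, -1):
        delta = delta * (Ws[l] @ hs[l] > 0)      # ReLU mask at layer l+1
        grads.append(np.outer(delta, hs[l]))     # d f / d W_{l+1}
        delta = Ws[l].T @ delta                  # propagate back to h_l
    return np.concatenate([g.ravel() for g in reversed(grads)])

def ntrf(Ws0, dW, x):
    # One element of F(W^(0), R): the network at initialization plus a
    # linear function of the gradient features at initialization.
    return f_W(Ws0, x) + grad_f(Ws0, x) @ dW

rng = np.random.default_rng(0)
d, m, L, R = 10, 512, 3, 1.0
Ws0 = init_weights(d, m, L, rng)
dW = rng.normal(size=sum(W.size for W in Ws0))
dW *= (R / np.sqrt(m)) / np.linalg.norm(dW)      # enforce ||dW|| <= R * m^{-1/2}
x = rng.normal(size=d); x /= np.linalg.norm(x)
print(ntrf(Ws0, dW, x))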

Corollary (NTK-based bound). Let y = (y_1, …, y_n)^⊤ and λ_0 = λ_min(Θ^{(L)}). If m ≥ Ω̃(poly(n, L, λ_0^{−1})), then with probability at least 1 − δ, SGD returns Ŵ that satisfies

    E[L_D^{0−1}(Ŵ)] ≤ Õ( L · √( y^⊤ (Θ^{(L)})^{−1} y / n ) ) + O( √(log(1/δ)/n) ),

where Θ^{(L)}_{i,j} := lim_{m→∞} m^{−1} ⟨∇_W f_{W^{(0)}}(x_i), ∇_W f_{W^{(0)}}(x_j)⟩ is the neural tangent kernel (NTK) Gram matrix on the training data. The bound is small when the labels y are well aligned with the top eigendirections of Θ^{(L)}.
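
Continuing the same sketch (reusing init_weights and grad_f from above), one can form a finite-width estimate of Θ^{(L)} and evaluate the complexity term L · √(y^⊤(Θ^{(L)})^{−1}y / n). The synthetic data and the width m = 512 here are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(1)
d, m, L, n = 10, 512, 3, 20
Ws0 = init_weights(d, m, L, rng)
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit-norm inputs
y = rng.choice([-1.0, 1.0], size=n)                # random +/-1 labels

G = np.stack([grad_f(Ws0, x) for x in X])          # per-example gradients at init
Theta = (G @ G.T) / m                              # m^{-1} <grad_i, grad_j>
complexity = L * np.sqrt(y @ np.linalg.solve(Theta, y) / n)
print(complexity)

At finite width this Gram matrix only approximates the infinite-width limit Θ^{(L)}; the approximation improves as m grows.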