Nearly-tight VC-dimension bounds for piecewise linear neural networks - PowerPoint PPT Presentation




SLIDE 1

Nearly-tight VC-dimension bounds for piecewise linear neural networks

Nicholas J. A. Harvey, Christopher Liaw, Abbas Mehrabian University of British Columbia COLT ’17 July 10, 2017

SLIDE 2

Neural networks

[Figure: feedforward network with input layer y₁, …, y₅, two hidden layers of τ-units, and an output layer with identity activation; each unit computes τ(x · y + c) on its inputs]

Ο„(y) = max{y, 0} (ReLU)
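As a concrete illustration of the picture above, here is a minimal NumPy sketch of such a network. The layer shapes and random weights are illustrative placeholders, not from the talk: each hidden unit applies the ReLU Ο„, and the output unit uses the identity.

```python
import numpy as np

def relu(y):
    # tau(y) = max{y, 0}
    return np.maximum(y, 0.0)

# Illustrative shapes (5 inputs, two hidden layers, 1 output); random weights.
rng = np.random.default_rng(0)
W1, c1 = rng.standard_normal((4, 5)), rng.standard_normal(4)
W2, c2 = rng.standard_normal((3, 4)), rng.standard_normal(3)
W3, c3 = rng.standard_normal((1, 3)), rng.standard_normal(1)

def network(y):
    h1 = relu(W1 @ y + c1)           # hidden layer 1: each unit is tau(x . y + c)
    h2 = relu(W2 @ h1 + c2)          # hidden layer 2
    return float((W3 @ h2 + c3)[0])  # output layer: identity activation

out = network(rng.standard_normal(5))
```

The total number of weights and biases is the parameter count W that the VC-dimension bounds below are stated in.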

SLIDE 3

VC-dimension

Defn: If G is a family of functions then VCdim(G) ≥ m iff ∃ Y = {y₁, …, y_m} s.t. G achieves all 2^m sign patterns, i.e. {(sign(g(y₁)), …, sign(g(y_m))) : g ∈ G} = {0,1}^m

E.g. hyperplanes in ℝ^d have VC-dimension d + 1.

[Figure: 3 points in the plane that halfplanes can shatter; it is impossible to shatter any 4 points]
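The definition can be checked by brute force in small cases. The sketch below uses a hypothetical grid search over (a, b) standing in for all parameter choices, and verifies that sign(a·y + b) classifiers on ℝ (hyperplanes with d = 1) shatter 2 points but cannot shatter 3, consistent with VC-dimension d + 1 = 2:

```python
import numpy as np

def achievable_patterns(points, params):
    """All 0/1 sign patterns (1[a*y + b > 0], ...) the family achieves on points."""
    return {tuple(int(a * y + b > 0) for y in points) for a, b in params}

# A coarse grid over (a, b) stands in for the full parameter space (illustrative).
params = [(a, b) for a in np.linspace(-3, 3, 61) for b in np.linspace(-3, 3, 61)]

pats2 = achievable_patterns([-1.0, 1.0], params)        # 2 points: all 4 patterns
pats3 = achievable_patterns([-1.0, 0.0, 1.0], params)   # 3 points: some missing
```

The patterns (1, 0, 1) and (0, 1, 0) are unreachable on three collinear points because a·y + b is monotone in y, so no 3-point set is shattered.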

SLIDE 4

VC-dimension

Defn: If G is a family of functions then VCdim(G) ≥ m iff ∃ Y = {y₁, …, y_m} s.t. G achieves all 2^m sign patterns, i.e. {(sign(g(y₁)), …, sign(g(y_m))) : g ∈ G} = {0,1}^m

Thm [Fund. thm. of learning]: G is learnable iff VCdim(G) < ∞. Moreover, the sample complexity is Θ(VCdim(G)).

SLIDE 5

VC-dimension of NNs

Known lower bounds:

  • Ω(WL)  [BMM ’98]
  • Ω(W log W)  [M ’94]

Known upper bounds:

  • O(WL log W + WL²)  [BMM ’98]
  • O(W²)  [GJ ’95]

W = # parameters/edges, L = # layers

SLIDE 8

VC-dimension of NNs

Known lower bounds:

  • Ω(WL)  [BMM ’98]
  • Ω(W log W)  [M ’94]

Known upper bounds:

  • O(WL log W + WL²)  [BMM ’98]
  • O(W²)  [GJ ’95]

Main Thm [HLM ’17]: For a ReLU NN w/ W params, L layers: Ω(WL log(W/L)) ≤ VCdim ≤ O(WL log W). (The lower bound means there exists a NN with this VCdim.) Independently proved by Bartlett ’17.

Recently, lots of work on “power of depth” for expressiveness of NNs [T ’16, ES ’16, Y ’16, LS ’16, SS ’16, CSS ’16, LGMRA ’17, D ’17]

W = # parameters/edges, L = # layers

SLIDE 9

Lower bound

(refinement of [BMM ’98])

  • Shattered set: S = {x_i}_{i∈[n]} × {z_j}_{j∈[m]}
  • Encode g w/ weights w_i = 0.w_{i,1} … w_{i,m} (binary) where w_{i,j} = g(x_i, z_j)

SLIDE 12

Lower bound

(refinement of [BMM ’98])

  • Shattered set: S = {x_i}_{i∈[n]} × {z_j}_{j∈[m]}
  • Encode g w/ weights w_i = 0.w_{i,1} … w_{i,m} (binary) where w_{i,j} = g(x_i, z_j)
  • Given x_i, easy to extract w_i
  • Design bit extractor to extract w_{i,j}
  • [BMM ’98] do this 1 bit per layer ⇒ Ω(WL)
  • More efficient: log(W/L) bits per layer ⇒ Ω(WL log(W/L))

Thm [HLM ’17]: Suppose a ReLU NN w/ W params, L layers extracts the nth bit of its input. Then n ≤ O(L log(W/L)).

[Figure: on input (x_i, z_j), a NN block extracts the bits w_{i,1}, …, w_{i,m} from w_i, and the rest of the NN selects bit j from w_i]
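The weight-encoding and bit-extraction steps can be sketched arithmetically. In the actual construction the threshold below is what a small ReLU gadget realizes; this loop recovers one bit per round, matching the [BMM ’98] rate (the talk’s improvement extracts log(W/L) bits per layer instead):

```python
def encode(bits):
    """Pack a 0/1 row into a single weight w = 0.b1 b2 ... (binary expansion)."""
    return sum(b * 2.0 ** -(j + 1) for j, b in enumerate(bits))

def extract_bit(w, j):
    """Recover bit j by doubling and thresholding, one bit per round --
    the step that a ReLU block simulates in the lower-bound construction."""
    for _ in range(j + 1):
        bit = 1 if w >= 0.5 else 0   # threshold: realizable with a ReLU gadget
        w = 2.0 * w - bit            # shift left, drop the extracted bit
    return bit

bits = [1, 0, 1, 1, 0, 0, 1]
w = encode(bits)
recovered = [extract_bit(w, j) for j in range(len(bits))]
```

For short rows the doubling is exact in floating point, so the full row is recovered from the single weight w.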

SLIDE 13

[Figure: ReLU network on inputs ŷ₁, …, ŷ₅, with τ-units in the hidden layers]

Upper bound

(refinement of [BMM ’98] for ReLU)

  • Fix a shattered set Y = {y₁, …, y_m}
  • Partition parameter space s.t. the input to the 1st hidden layer has constant sign
  • can replace τ with 0 (if < 0) or the identity (if > 0)!
  • The number of pieces is small, i.e. ≤ (Cm)^W
  • Repeat the procedure for each layer to get a partition of size ≤ (CLm)^{O(WL)}
  • In each piece, the output is a polynomial of deg. L, so the total # of sign patterns is ≤ (CLm)^{O(WL)}
  • Since Y is shattered, we need 2^m ≤ (CLm)^{O(WL)}, which implies m = O(WL log W)

SLIDE 16

[Figure: ReLU network on inputs ŷ₁, …, ŷ₅]

Upper bound

(refinement of [BMM ’98] for ReLU)

  • Fix a shattered set Y = {y₁, …, y_m}
  • Partition parameter space s.t. the input to the 1st hidden layer has constant sign
  • can replace τ with 0 (if < 0) or the identity (if > 0)!
  • The size of the partition is small, i.e. ≤ (Cm)^W [Warren ’68]
  • Repeat the procedure for each layer to get a partition of size ≤ (CLm)^{O(WL)}
  • In each piece, the output is a polynomial of deg. L, so the total # of sign patterns is ≤ (CLm)^{O(WL)}
  • Since Y is shattered, we need 2^m ≤ (CLm)^{O(WL)}, which implies m = O(WL log W)

* C > 1 is some constant
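The final counting inequality can be probed numerically. In this sketch, C and c are made-up constants standing in for the unspecified ones in the bound; the largest m satisfying 2^m ≀ (CLm)^{cWL} then grows on the order of WL log W:

```python
import math

def max_shatterable(W, L, C=4.0, c=2.0):
    """Largest m with 2**m <= (C*L*m)**(c*W*L); C and c are hypothetical
    constants, chosen only to make the inequality concrete."""
    m = 1
    # Compare logs: m*log(2) vs c*W*L*log(C*L*m); LHS is linear in m,
    # RHS only logarithmic, so the loop terminates.
    while m * math.log(2) <= c * W * L * math.log(C * L * m):
        m += 1
    return m - 1

m100 = max_shatterable(100, 2)
m1000 = max_shatterable(1000, 2)
```

Dividing the result by W·L·log₂(W) gives a roughly constant ratio as W grows, which is the m = O(WL log W) conclusion of the slide.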

SLIDE 19

Open questions

  • Can we close the gap for ReLU NNs: Ω(WL log(W/L)) vs O(WL log W)?
  • For polynomial NNs, we have Ω(WL) ≤ VCdim ≤ O(WL²) (up to log factors). Can we close this gap? Do poly NNs have higher VCdim than ReLU NNs?
  • What about VC dimensions of CNNs, RNNs, ResNets, etc.?
SLIDE 20

Thank you!