Nearly-tight VC-dimension bounds for piecewise linear neural networks - PowerPoint PPT Presentation




SLIDE 1

Nearly-tight VC-dimension bounds for piecewise linear neural networks

Nicholas J. A. Harvey, Christopher Liaw, Abbas Mehrabian University of British Columbia COLT ’17 July 10, 2017

SLIDE 2

Neural networks

[Figure: feedforward network with input layer y₁, …, y₅, two hidden layers of τ-units, and an output layer with identity activation; each unit computes τ(x · y + c) on its inputs]

Ο„(y) = max{y, 0} (ReLU)
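As a concrete illustration of the picture above, here is a minimal NumPy sketch of such a network. The layer shapes and random weights are illustrative placeholders, not from the talk: each hidden unit applies the ReLU Ο„, and the output unit uses the identity.

```python
import numpy as np

def relu(y):
    # tau(y) = max{y, 0}
    return np.maximum(y, 0.0)

# Illustrative shapes (5 inputs, two hidden layers, 1 output); random weights.
rng = np.random.default_rng(0)
W1, c1 = rng.standard_normal((4, 5)), rng.standard_normal(4)
W2, c2 = rng.standard_normal((3, 4)), rng.standard_normal(3)
W3, c3 = rng.standard_normal((1, 3)), rng.standard_normal(1)

def network(y):
    h1 = relu(W1 @ y + c1)           # hidden layer 1: each unit is tau(x . y + c)
    h2 = relu(W2 @ h1 + c2)          # hidden layer 2
    return float((W3 @ h2 + c3)[0])  # output layer: identity activation

out = network(rng.standard_normal(5))
```

The total number of weights and biases is the parameter count W that the VC-dimension bounds below are stated in.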

SLIDE 3

VC-dimension

Defn: If G is a family of functions then VCdim(G) ≥ m iff ∃ Y = {y₁, …, y_m} s.t. G achieves all 2^m sign patterns, i.e. {(sign(g(y₁)), …, sign(g(y_m))) : g ∈ G} = {0,1}^m

E.g. hyperplanes in ℝ^d have VC-dimension d + 1.

[Figure: 3 points in the plane that halfplanes can shatter; it is impossible to shatter any 4 points]
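The definition can be checked by brute force in small cases. The sketch below uses a hypothetical grid search over (a, b) standing in for all parameter choices, and verifies that sign(a·y + b) classifiers on ℝ (hyperplanes with d = 1) shatter 2 points but cannot shatter 3, consistent with VC-dimension d + 1 = 2:

```python
import numpy as np

def achievable_patterns(points, params):
    """All 0/1 sign patterns (1[a*y + b > 0], ...) the family achieves on points."""
    return {tuple(int(a * y + b > 0) for y in points) for a, b in params}

# A coarse grid over (a, b) stands in for the full parameter space (illustrative).
params = [(a, b) for a in np.linspace(-3, 3, 61) for b in np.linspace(-3, 3, 61)]

pats2 = achievable_patterns([-1.0, 1.0], params)        # 2 points: all 4 patterns
pats3 = achievable_patterns([-1.0, 0.0, 1.0], params)   # 3 points: some missing
```

The patterns (1, 0, 1) and (0, 1, 0) are unreachable on three collinear points because a·y + b is monotone in y, so no 3-point set is shattered.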

SLIDE 4

VC-dimension

Defn: If G is a family of functions then VCdim(G) ≥ m iff ∃ Y = {y₁, …, y_m} s.t. G achieves all 2^m sign patterns, i.e. {(sign(g(y₁)), …, sign(g(y_m))) : g ∈ G} = {0,1}^m

Thm [Fund. thm. of learning]: G is learnable iff VCdim(G) < ∞. Moreover, the sample complexity is Θ(VCdim(G)).

SLIDE 5

VC-dimension of NNs

Known lower bounds:

  • Ω(WL)  [BMM ’98]
  • Ω(W log W)  [M ’94]

Known upper bounds:

  • O(WL log W + WL²)  [BMM ’98]
  • O(W²)  [GJ ’95]

W = # parameters/edges, L = # layers

SLIDE 8

VC-dimension of NNs

Known lower bounds:

  • Ω(WL)  [BMM ’98]
  • Ω(W log W)  [M ’94]

Known upper bounds:

  • O(WL log W + WL²)  [BMM ’98]
  • O(W²)  [GJ ’95]

Main Thm [HLM ’17]: For a ReLU NN w/ W params, L layers: Ω(WL log(W/L)) ≤ VCdim ≤ O(WL log W). (The lower bound means there exists a NN with this VCdim.) Independently proved by Bartlett ’17.

Recently, lots of work on “power of depth” for expressiveness of NNs [T ’16, ES ’16, Y ’16, LS ’16, SS ’16, CSS ’16, LGMRA ’17, D ’17]

W = # parameters/edges, L = # layers

SLIDE 9

Lower bound

(refinement of [BMM ’98])

  • Shattered set: S = {x_i}_{i∈[n]} × {z_j}_{j∈[m]}
  • Encode g w/ weights w_i = 0.w_{i,1} … w_{i,m} (binary) where w_{i,j} = g(x_i, z_j)

SLIDE 12

Lower bound

(refinement of [BMM ’98])

  • Shattered set: S = {x_i}_{i∈[n]} × {z_j}_{j∈[m]}
  • Encode g w/ weights w_i = 0.w_{i,1} … w_{i,m} (binary) where w_{i,j} = g(x_i, z_j)
  • Given x_i, easy to extract w_i
  • Design bit extractor to extract w_{i,j}
  • [BMM ’98] do this 1 bit per layer ⇒ Ω(WL)
  • More efficient: log(W/L) bits per layer ⇒ Ω(WL log(W/L))

Thm [HLM ’17]: Suppose a ReLU NN w/ W params, L layers extracts the nth bit of its input. Then n ≤ O(L log(W/L)).

[Figure: on input (x_i, z_j), a NN block extracts the bits w_{i,1}, …, w_{i,m} from w_i, and the rest of the NN selects bit j from w_i]
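The weight-encoding and bit-extraction steps can be sketched arithmetically. In the actual construction the threshold below is what a small ReLU gadget realizes; this loop recovers one bit per round, matching the [BMM ’98] rate (the talk’s improvement extracts log(W/L) bits per layer instead):

```python
def encode(bits):
    """Pack a 0/1 row into a single weight w = 0.b1 b2 ... (binary expansion)."""
    return sum(b * 2.0 ** -(j + 1) for j, b in enumerate(bits))

def extract_bit(w, j):
    """Recover bit j by doubling and thresholding, one bit per round --
    the step that a ReLU block simulates in the lower-bound construction."""
    for _ in range(j + 1):
        bit = 1 if w >= 0.5 else 0   # threshold: realizable with a ReLU gadget
        w = 2.0 * w - bit            # shift left, drop the extracted bit
    return bit

bits = [1, 0, 1, 1, 0, 0, 1]
w = encode(bits)
recovered = [extract_bit(w, j) for j in range(len(bits))]
```

For short rows the doubling is exact in floating point, so the full row is recovered from the single weight w.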

SLIDE 13

[Figure: ReLU network on inputs ŷ₁, …, ŷ₅, with τ-units in the hidden layers]

Upper bound

(refinement of [BMM ’98] for ReLU)

  • Fix a shattered set Y = {y₁, …, y_m}
  • Partition parameter space s.t. the input to the 1st hidden layer has constant sign
  • can replace τ with 0 (if < 0) or the identity (if > 0)!
  • The number of pieces is small, i.e. ≤ (Cm)^W
  • Repeat the procedure for each layer to get a partition of size ≤ (CLm)^{O(WL)}
  • In each piece, the output is a polynomial of deg. L, so the total # of sign patterns is ≤ (CLm)^{O(WL)}
  • Since Y is shattered, we need 2^m ≤ (CLm)^{O(WL)}, which implies m = O(WL log W)

SLIDE 16

[Figure: ReLU network on inputs ŷ₁, …, ŷ₅]

Upper bound

(refinement of [BMM ’98] for ReLU)

  • Fix a shattered set Y = {y₁, …, y_m}
  • Partition parameter space s.t. the input to the 1st hidden layer has constant sign
  • can replace τ with 0 (if < 0) or the identity (if > 0)!
  • The size of the partition is small, i.e. ≤ (Cm)^W [Warren ’68]
  • Repeat the procedure for each layer to get a partition of size ≤ (CLm)^{O(WL)}
  • In each piece, the output is a polynomial of deg. L, so the total # of sign patterns is ≤ (CLm)^{O(WL)}
  • Since Y is shattered, we need 2^m ≤ (CLm)^{O(WL)}, which implies m = O(WL log W)

* C > 1 is some constant
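The final counting inequality can be probed numerically. In this sketch, C and c are made-up constants standing in for the unspecified ones in the bound; the largest m satisfying 2^m ≀ (CLm)^{cWL} then grows on the order of WL log W:

```python
import math

def max_shatterable(W, L, C=4.0, c=2.0):
    """Largest m with 2**m <= (C*L*m)**(c*W*L); C and c are hypothetical
    constants, chosen only to make the inequality concrete."""
    m = 1
    # Compare logs: m*log(2) vs c*W*L*log(C*L*m); LHS is linear in m,
    # RHS only logarithmic, so the loop terminates.
    while m * math.log(2) <= c * W * L * math.log(C * L * m):
        m += 1
    return m - 1

m100 = max_shatterable(100, 2)
m1000 = max_shatterable(1000, 2)
```

Dividing the result by W·L·log₂(W) gives a roughly constant ratio as W grows, which is the m = O(WL log W) conclusion of the slide.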

SLIDE 19

Open questions

  • Can we close the gap for ReLU NNs: Ω(WL log(W/L)) vs O(WL log W)?
  • For polynomial NNs, we have Ω(WL) ≤ VCdim ≤ O(WL²) (up to log factors). Can we close this gap? Do poly NNs have higher VCdim than ReLU NNs?
  • What about VC dimensions of CNNs, RNNs, ResNets, etc.?
SLIDE 20

Thank you!