SLIDE 1

The committee machine: Computational to statistical gaps in learning a two-layers neural network

Benjamin Aubin, Antoine Maillard, Jean Barbier, Nicolas Macris, Florent Krzakala & Lenka Zdeborová

Benjamin Aubin - Institut de Physique Théorique, NeurIPS 2018

SLIDES 2-9

« Can we efficiently learn a teacher network from a limited number of samples? »

๏ Teacher: draws $n$ i.i.d. samples $(X_i)_{i=1}^n$ with $p$ features each and produces outputs $Y_i^\star$ through a two-layers network: first-layer weights $W^\star \in \mathbb{R}^{p \times K}$, activation $f^{(1)}$ on the $K$ hidden units, and a fixed second layer $W^{(2)}$ with output activation $f^{(2)}$.

๏ Student: a network with the same architecture (same $f^{(1)}$, $f^{(2)}$, same fixed $W^{(2)}$) that must infer the unknown weights $W$ from the pairs $(X_i, Y_i^\star)_{i=1}^n$.

[Figure: teacher-student diagram; the teacher maps $X_i \mapsto Y_i^\star$ through $W^\star$, and the student estimates $W$ from the same data. A data-generation sketch follows below.]

✓ Committee machine: second layer fixed [Schwarze'93]
✓ i.i.d. samples
✓ Learning task possible?
✓ Computational complexity?
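To make the setup concrete, here is a minimal data-generation sketch for the case studied in the plots later in the deck ($f^{(1)} = f^{(2)} = \mathrm{sign}$, Gaussian inputs and teacher weights, second layer fixed to all ones); the function name and interface are illustrative assumptions, not the paper's code.

```python
import numpy as np

def teacher_committee(n, p, K, seed=0):
    """Sketch of the teacher: sign activations, second layer fixed to ones.

    Assumes Gaussian i.i.d. inputs and teacher weights; use an odd K so
    the vote of the hidden units cannot tie.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))        # n samples, p features
    W_star = rng.standard_normal((p, K))   # teacher weights W* in R^{p x K}
    hidden = np.sign(X @ W_star)           # f^(1) = sign, K hidden units
    Y = np.sign(hidden.sum(axis=1))        # f^(2) = sign of the fixed vote
    return X, Y, W_star

# Example: n = alpha * K * p samples for alpha = 4, K = 3, p = 100.
X, Y, W_star = teacher_committee(n=4 * 3 * 100, p=100, K=3)
```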

SLIDES 10-12

Motivation

➡ Traditional approach
๏ Worst-case scenario / PAC bounds: VC dimension & Rademacher complexity
๏ Numerical experiments

➡ Complementary approach
✓ Revisit the statistical-physics typical-case scenario [Sompolinsky'92, Mezard'87]: i.i.d. data coming from a probabilistic model
✓ Theoretical understanding of the generalization performance
✓ Regime: $p \to \infty$, $n/p = \Theta(1)$

SLIDES 13-18

Main result (1) - Generalization error

๏ Information-theoretically optimal generalization error (Bayes-optimal case), which becomes explicit in the limit:

$$\epsilon_g^{(p)} \equiv \tfrac{1}{2}\,\mathbb{E}_{X,W^\star}\!\left[\big(\mathbb{E}_{W|X}\!\left[Y(XW)\right] - Y^\star(XW^\star)\big)^2\right] \;\xrightarrow[p\to\infty]{}\; \epsilon_g(q^*)$$

๏ $q^*$: obtained by extremizing the variational formulation of this mutual information:

$$\lim_{p\to\infty} \frac{1}{p} I(W; \mathbf{Y} | \mathbf{X}) = -\sup_{r \in S_K^+}\,\inf_{q \in S_K^+} \left\{ \psi_{P_0}(r) + \alpha\,\Psi_{\rm out}(q) - \tfrac{1}{2}\,{\rm Tr}(rq) \right\} + {\rm cst}$$

The heuristic replica expression for the mutual information has been well known in statistical physics since the 80's.

✓ Main contribution: rigorous proof by adaptive (Guerra) interpolation
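To spell out the step connecting the two bullets above: $q^*$ is the $q$-component of the saddle point, and stationarity of the bracket in the sup-inf gives coupled fixed-point equations. Schematically (exact factors and conventions as in the paper):

$$q = 2\,\nabla_r\,\psi_{P_0}(r), \qquad r = 2\,\alpha\,\nabla_q\,\Psi_{\rm out}(q).$$

These are the same critical points that govern the state evolution of the AMP algorithm on the next slides.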

SLIDES 19-22

Main result (2) - Message Passing Algorithm

๏ Traditional approach:
  • Minimize a loss function. Not optimal for a limited number of samples.

๏ Approximate Message Passing (AMP) algorithm:
  • Expansion of the BP equations on a factor graph. Closed set of iterative equations. Estimates the marginal probabilities $m_j(w_j)$.

[Figure: factor graph representation of the committee machine, with messages $m_{j \to i}(w_j)$ between the output factors $P_{\rm out}(Y_i | X_i W)$, the weight variables $w_j$, and the prior factors $P_0(w_j)$.]

✓ Conjectured to be optimal among polynomial algorithms
✓ Can be tracked rigorously (state evolution given by the critical points of the replica mutual information) [Montanari-Bayati '10]
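The committee-machine AMP itself is given in the paper and the repository linked at the end. As a hedged illustration of the structural ingredients named above (a closed set of iterative equations: a denoising step plus an Onsager correction), here is textbook AMP for the simpler sparse linear model $y = Ax + \text{noise}$, not the paper's algorithm:

```python
import numpy as np

def soft_threshold(x, t):
    """Denoiser eta(x; t) = sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def amp_sparse_linear(A, y, theta=1.0, iters=30):
    """Textbook AMP for y = A x + noise, A_ij ~ N(0, 1/n), x sparse.

    Illustrative only (not the committee-machine AMP): shows the shared
    structure of a denoising step plus an Onsager correction term.
    """
    n, p = A.shape
    x = np.zeros(p)
    z = y.copy()
    for _ in range(iters):
        pseudo = x + A.T @ z                    # effective AWGN observation of x
        x_new = soft_threshold(pseudo, theta)   # denoise with the sparse prior
        # Onsager term: (p/n) * z * average derivative of the denoiser.
        z = y - A @ x_new + (p / n) * z * np.mean(np.abs(pseudo) > theta)
        x = x_new
    return x
```

Roughly speaking, the committee-machine version replaces this scalar denoiser by $K$-dimensional updates derived from the prior $P_0$ and the channel $P_{\rm out}$.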

SLIDES 23-32

Large number of hidden units, $K = \Theta_p(1)$ - Gaussian weights, sign activation

[Figure, built up across slides 24-32: generalization error $\epsilon_g(\alpha)$ (from 0.0 to 0.5) versus $\alpha$ = (# of samples)/(# hidden units × input size) (from 2 to 14). Curves: Bayes-optimal $\epsilon_g(\alpha)$ and AMP $\epsilon_g(\alpha)$. Annotations revealed in order: the non-specialized hidden-units phase, a discontinuous specialization transition, the specialized hidden-units phase, and finally the computational gap between the AMP and Bayes-optimal curves.]
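A side note for intuition on why the asymptotic error is a function of a single overlap $q^*$, as in $\epsilon_g(q^*)$ above: in the simplest $K = 1$, sign-activation case, a student and teacher whose pre-activations are jointly Gaussian with correlation $q$ disagree in sign with probability $\arccos(q)/\pi$. A minimal Monte Carlo check of that classical identity (illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
q = 0.7                                   # overlap between teacher and student
u = rng.standard_normal(1_000_000)        # teacher pre-activation
v = q * u + np.sqrt(1 - q**2) * rng.standard_normal(1_000_000)  # student
mc = np.mean(np.sign(u) != np.sign(v))    # empirical sign-disagreement rate
print(mc, np.arccos(q) / np.pi)           # both approximately 0.253
```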

Poster #111

TO KNOW MORE: https://github.com/benjaminaubin/TheCommitteeMachine