Mad Max: Affine Spline Insights into Deep Learning
Richard Baraniuk


SLIDE 1

Richard Baraniuk

Mad Max: Affine Spline Insights into Deep Learning

SLIDE 2

[Figure: expectations vs. time]

SLIDE 3
SLIDE 4
SLIDE 5

greek questions for the babylonians

  • Why is deep learning so effective?
  • Can we derive deep learning systems from first principles?
  • When and why does deep learning fail?
  • How can deep learning systems be improved and extended in a principled fashion?

  • Where is the foundational framework for theory?

See also Mallat, Soatto, Arora, Poggio, Tishby, [growing community] …

SLIDE 6

splines and deep learning

  • R. Balestriero & R. Baraniuk

“A Spline Theory of Deep Networks,” ICML 2018
“Mad Max: Affine Spline Insights into Deep Learning,” arxiv.org/abs/1805.06576, 2018
“From Hard to Soft: Understanding Deep Network Nonlinearities…,” ICLR 2019
“A Max-Affine Spline Perspective of RNNs,” ICLR 2019 (w/ J. Wang)

SLIDE 7

prediction problem

  • Unknown function/operator $f$ mapping data $x$ (signal, image, video, …) to labels $y$: $y = f(x)$
  • Goal: learn an approximation $\hat{y} = f_\Theta(x)$ to $f$ using training data $\{(x_i, y_i)\}_{i=1}^n$

SLIDE 8

deep nets approximate

  • Deep nets solve a function approx problem (black box)

$\hat{y} = f_\Theta(x)$

SLIDE 9

deep nets approximate

  • Deep nets solve a function approx problem hierarchically

[Figure: pipeline of layer 1 (conv | ReLU), layer 2 (conv | max-pool | ReLU), layer 3 (…)]

$\hat{y} = f_\Theta(x) = \left( f^{(L)}_{\theta^{(L)}} \circ \cdots \circ f^{(3)}_{\theta^{(3)}} \circ f^{(2)}_{\theta^{(2)}} \circ f^{(1)}_{\theta^{(1)}} \right)(x)$
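To make the composition concrete, here is a minimal PyTorch sketch of such a hierarchy; the widths, kernel sizes, and input shape are illustrative assumptions, not the slides' exact architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of the composition f^(L) ∘ ... ∘ f^(1): each "layer"
# bundles an affine operator (conv / fully connected) with piecewise-affine
# nonlinearities (ReLU, max-pool), mirroring the pipeline above.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # layer 1: conv | ReLU
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # layer 2: conv | max-pool | ReLU
    nn.MaxPool2d(2),
    nn.ReLU(),
    nn.Flatten(),                                 # layer 3: fully connected
    nn.Linear(32 * 16 * 16, 10),
)

x = torch.randn(1, 3, 32, 32)  # toy input "signal"
y_hat = net(x)                 # ŷ = f_Θ(x), the composed mapping
```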

SLIDE 10

deep nets and splines

  • Deep nets solve a function approx problem hierarchically using a very special family of splines

[Figure: pipeline of layer 1 (conv | ReLU), layer 2 (conv | max-pool | ReLU), layer 3 (…)]

$\hat{y} = f_\Theta(x) = \left( f^{(L)}_{\theta^{(L)}} \circ \cdots \circ f^{(3)}_{\theta^{(3)}} \circ f^{(2)}_{\theta^{(2)}} \circ f^{(1)}_{\theta^{(1)}} \right)(x)$

SLIDE 11

deep nets and splines

SLIDE 12

spline approximation

  • A spline function approximation consists of

    – a partition Ω of the independent variable (input space)
    – a (simple) local mapping on each region of the partition (our focus: piecewise-affine mappings)

[Figure: 1-D partition Ω of the input space for x]

SLIDE 13

spline approximation

  • A spline function approximation consists of

    – a partition Ω of the independent variable (input space)
    – a (simple) local mapping on each region of the partition

  • Powerful splines
    – free, unconstrained partition Ω (ex: “free-knot” splines)
    – jointly optimize both the partition and local mappings (highly nonlinear, computationally intractable)

  • Easy splines (a minimal fitting sketch follows below)
    – fixed partition (ex: uniform grid, dyadic grid)
    – need only optimize the local mappings
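To illustrate the "easy" case, here is a minimal numpy sketch: fix a uniform partition of [0, 1] and fit each region's affine map by least squares. The target function, number of regions, and sample size are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)  # toy target

R = 8                                    # fixed number of regions
edges = np.linspace(0, 1, R + 1)         # uniform-grid partition
region = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, R - 1)

y_hat = np.empty_like(y)
for r in range(R):
    m = region == r
    D = np.column_stack([x[m], np.ones(m.sum())])    # design matrix [x, 1]
    coef, *_ = np.linalg.lstsq(D, y[m], rcond=None)  # slope a_r, offset b_r
    y_hat[m] = D @ coef                              # local affine fit
```

With the partition fixed, each region reduces to an independent linear least-squares problem; note that this sketch does not enforce continuity across region boundaries.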

SLIDE 14

max-affine spline (MAS)

  • Consider piecewise-affine approximation of a convex function over $R$ regions [Magnani & Boyd, 2009; Hannah & Dunson, 2013]
    – Affine functions: $a_r^T x + b_r, \quad r = 1, \dots, R$
    – Convex approximation: $z(x) = \max_{r=1,\dots,R} \; a_r^T x + b_r$

[Figure: convex function of x approximated by $R = 4$ affine pieces $(a_1, b_1), \dots, (a_4, b_4)$]
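A minimal numpy sketch of a MAS evaluation; the particular slopes and offsets below are made-up values standing in for the figure's four pieces.

```python
import numpy as np

def mas(x, A, b):
    """Max-affine spline z(x) = max_r (a_r^T x + b_r).
    A: (R, D) slopes a_r; b: (R,) offsets b_r; x: (D,) input."""
    return np.max(A @ x + b)

# Four affine pieces in 1-D (R = 4, D = 1), as in the figure:
A = np.array([[-2.0], [-0.5], [0.5], [2.0]])
b = np.array([0.0, 0.4, 0.4, 0.0])
z = mas(np.array([0.3]), A, b)          # convex, piecewise-affine in x
r = np.argmax(A @ np.array([0.3]) + b)  # active piece = partition region
```

The `argmax` line already hints at the next slide's point: the parameters alone determine which piece (region) is active at each x.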

SLIDE 15

max-affine spline (MAS)

  • Key: Any set of affine parameters $(a_r, b_r), \; r = 1, \dots, R$ implicitly determines a spline partition [Magnani & Boyd, 2009; Hannah & Dunson, 2013]
    – Affine functions: $a_r^T x + b_r, \quad r = 1, \dots, R$
    – Convex approximation: $z(x) = \max_{r=1,\dots,R} \; a_r^T x + b_r$

[Figure: the $R = 4$ affine pieces $(a_1, b_1), \dots, (a_4, b_4)$ partition the x-axis into the regions where each piece attains the max]

SLIDE 16

scale + bias | ReLU is a MAS

  • Scale $x$ by $a$ + bias $b$, then ReLU: $z(x) = \max(0, ax + b)$
    – Affine functions: $(a_1, b_1) = (0, 0), \; (a_2, b_2) = (a, b)$
    – Convex approximation: $z(x) = \max_{r=1,2} \; a_r^T x + b_r$ with $R = 2$

[Figure: ReLU as a 2-piece max-affine spline: the zero piece $(a_1, b_1)$ and the scaled piece $(a_2, b_2)$]
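A quick numerical check of this identity (the scale and bias values are arbitrary):

```python
import numpy as np

a, bias = 1.7, -0.3  # arbitrary scale and bias
for x in np.linspace(-2, 2, 9):
    relu = max(0.0, a * x + bias)            # scale + bias | ReLU
    mas2 = max(0.0 * x + 0.0, a * x + bias)  # 2-piece MAS: (0, 0) and (a, b)
    assert np.isclose(relu, mas2)
```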

SLIDE 17

max-affine spline operator (MASO)

  • A MAS for $x \in \mathbb{R}^D$ has affine parameters $a_r \in \mathbb{R}^D$, $b_r \in \mathbb{R}$
  • A MASO is simply a concatenation of $K$ MASs, mapping $x \in \mathbb{R}^D$ to $z \in \mathbb{R}^K$; output $k$ comes from a MAS with parameters $[A]_{k,i,r}$, $[b]_{k,r}$
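A minimal numpy sketch of a MASO as K parallel MASs; for convenience the parameter tensor here is ordered (K, R, D) rather than the slide's (k, i, r) indexing, and the sizes are arbitrary.

```python
import numpy as np

def maso(x, A, B):
    """MASO: z_k = max_r (A[k, r] @ x + B[k, r]), k = 1..K.
    A: (K, R, D) slopes; B: (K, R) offsets; returns z of shape (K,)."""
    return np.max(A @ x + B, axis=1)   # (K, R) scores -> max over pieces r

rng = np.random.default_rng(1)
K, R, D = 2, 3, 4                       # arbitrary sizes
A = rng.standard_normal((K, R, D))
B = rng.standard_normal((K, R))
z = maso(rng.standard_normal(D), A, B)  # z ∈ R^K
```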

SLIDE 18

modern deep nets

  • Focus: the lion's share of today's deep net architectures (convnets, resnets, skip-connection nets, inception nets, recurrent nets, …) employ piecewise-affine layers (fully connected, conv; (leaky) ReLU, abs value; max/mean/channel pooling)

[Figure: pipeline of layer 1 (conv | ReLU), layer 2 (conv | max-pool | ReLU), layer 3 (…)]

$\hat{y} = f_\Theta(x) = \left( f^{(L)}_{\theta^{(L)}} \circ \cdots \circ f^{(3)}_{\theta^{(3)}} \circ f^{(2)}_{\theta^{(2)}} \circ f^{(1)}_{\theta^{(1)}} \right)(x)$

SLIDE 19

theorems

  • Each deep net layer is a MASO

– convex wrt each output dimension, piecewise-affine operator


SLIDE 20

theorems

  • Each deep net layer is a MASO

– convex, piecewise-affine operator

  • A deep net is a composition of MASOs

– non-convex piecewise-affine spline operator

(WLOG: we ignore the final softmax output layer)

SLIDE 21

theorems

  • A deep net is a composition of MASOs

– non-convex piecewise-affine spline operator

  • A deep net is a convex MASO iff the convolution/fully connected weights in all but the first layer are nonnegative and the intermediate nonlinearities are nondecreasing (as illustrated in the sketch below)
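A numerical illustration of the theorem's "if" direction under stated assumptions: a two-layer toy net whose second-layer weights are nonnegative and whose intermediate nonlinearity (ReLU) is convex and nondecreasing, checked for convexity via the midpoint inequality. All sizes and random draws are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
W1, b1 = rng.standard_normal((8, 4)), rng.standard_normal(8)          # first layer: any sign
W2, b2 = np.abs(rng.standard_normal((1, 8))), rng.standard_normal(1)  # later layer: nonnegative

def f(x):
    # affine | ReLU (convex, nondecreasing), then nonnegative affine
    return (W2 @ np.maximum(0.0, W1 @ x + b1) + b2).item()

# Midpoint convexity check: f((u + v)/2) <= (f(u) + f(v))/2
for _ in range(1000):
    u, v = rng.standard_normal(4), rng.standard_normal(4)
    assert f((u + v) / 2) <= (f(u) + f(v)) / 2 + 1e-9
```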

SLIDE 22

MASO spline partition

  • The parameters of each deep net layer (MASO) induce a partition of its input space with convex regions
    – vector quantization (info theory)
    – k-means (statistics)
    – Voronoi tiling (geometry)

SLIDE 23
MASO spline partition

  • The L layer-partitions of an L-layer deep net combine to form the global input signal space partition
    – affine spline operator
    – non-convex regions

  • Toy example: 3-layer “deep net”
    – Input x: 2-D (4 classes)
    – Fully connected | ReLU (45-D output)
    – Fully connected | ReLU (3-D output)
    – Fully connected | (softmax) (4-D output)
    – Output y: 4-D

SLIDE 24
MASO spline partition

  • Same toy example; the global partition is depicted in the input plane

[Figure: global input-space partition of the toy deep net, axes x[1] and x[2]]

SLIDE 25

MASO spline partition

  • Toy example: 3-layer “deep net”
    – Input x: 2-D (4 classes)
    – Fully connected | ReLU (45-D output)
    – Fully connected | ReLU (3-D output)
    – Fully connected | (softmax) (4-D output)
    – Output y: 4-D

  • VQ partition of layer 1 depicted in the input space
    – convex regions

SLIDE 26

MASO spline partition

  • Toy example: 3-layer “deep net”
    – Input x: 2-D (4 classes)
    – Fully connected | ReLU (45-D output)
    – Fully connected | ReLU (3-D output)
    – Fully connected | (softmax) (4-D output)
    – Output y: 4-D

  • Given the partition region $Q(x)$ containing $x$, the layer's input/output mapping is affine:

$z(x) = A_{Q(x)} x + b_{Q(x)}$
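A minimal sketch of this for a fully connected | ReLU layer (like layer 1 of the toy example, but with made-up random weights): the VQ code Q(x) is the layer's ReLU activation pattern, and masking the weights with it recovers the local affine parameters.

```python
import numpy as np

rng = np.random.default_rng(3)
W, b = rng.standard_normal((45, 2)), rng.standard_normal(45)  # FC | ReLU layer

x = rng.standard_normal(2)
q = W @ x + b > 0                  # VQ code Q(x): the ReLU activation pattern

A_q = W * q[:, None]               # zero out rows where the unit is inactive
b_q = b * q
z = np.maximum(0.0, W @ x + b)     # the layer's actual output
assert np.allclose(z, A_q @ x + b_q)   # z(x) = A_Q(x) x + b_Q(x)
```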

SLIDE 27

MASO spline partition

  • Toy example: 3-layer “deep net”
    – Input x: 2-D (4 classes)
    – Fully connected | ReLU (45-D output)
    – Fully connected | ReLU (3-D output)
    – Fully connected | (softmax) (4-D output)
    – Output y: 4-D

  • VQ partition of layer 2 depicted in the input space
    – non-convex regions due to visualization in the input space

SLIDE 28

MASO spline partition

  • Toy example: 3-layer “deep net”
    – Input x: 2-D (4 classes)
    – Fully connected | ReLU (45-D output)
    – Fully connected | ReLU (3-D output)
    – Fully connected | (softmax) (4-D output)
    – Output y: 4-D

  • Given the partition region $Q(x)$ containing $x$, the layer's input/output mapping is affine:

$z(x) = A_{Q(x)} x + b_{Q(x)}$

SLIDE 29

MASO spline partition

  • Toy example: 3-layer “deep net”
    – Input x: 2-D (4 classes)
    – Fully connected | ReLU (45-D output)
    – Fully connected | ReLU (3-D output)
    – Fully connected | (softmax) (4-D output)
    – Output y: 4-D

  • VQ partition of layers 1 & 2 depicted in the input space
    – non-convex regions

SLIDE 30

learning

[Animation: partition of layers 1 & 2 evolving over learning epochs (time)]

SLIDE 31

local affine mapping – CNN

(WLOG: we ignore the final softmax output layer)

SLIDE 32

local affine mapping – CNN

$A_{Q(x)}$ and $b_{Q(x)}$ are fixed but different in each partition region

SLIDE 33

matched filters

SLIDE 34

deep nets are matched filterbanks

$z^{(L)}(x) = A_{Q(x)} x + b_{Q(x)}$

  • Row $c$ of $A_{Q(x)}$ is a vectorized signal/image corresponding to class $c$
  • Entry $c$ of the deep net output $z^{(L)}(x)$ = inner product between row $c$ and the signal $x$
  • For classification, select the largest output: a matched filter!
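A sketch of reading the templates off a small piecewise-affine net (a hypothetical toy model; the slides work with real trained nets). Away from region boundaries, the Jacobian at x is exactly $A_{Q(x)}$:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 4))  # toy 4-class net

x = torch.randn(64)
A = torch.autograd.functional.jacobian(net, x)  # A_Q(x): (4, 64), constant on x's region
b = net(x) - A @ x                              # b_Q(x)

logits = A @ x + b       # entry c = <row c of A_Q(x), x> + bias: a matched filterbank
pred = logits.argmax()   # classification picks the best-matched template
```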

SLIDE 35

deep nets are matched filterbanks

SLIDE 36

data memorization

SLIDE 37
  • Orthogonal deep nets
SLIDE 38

partition-based signal distance

SLIDE 39

partition-based signal distance

SLIDE 40

partition-based signal distance
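These slides carry only the section title, so the following is a guess at the construction, assuming (per the summary's "Hamming-VQ distance") that two signals are compared via the Hamming distance between their layer VQ codes; the single random FC | ReLU layer is a stand-in.

```python
import numpy as np

rng = np.random.default_rng(4)
W, b = rng.standard_normal((45, 2)), rng.standard_normal(45)  # stand-in FC | ReLU layer

def vq_code(x):
    return W @ x + b > 0               # binary region code Q(x)

def vq_hamming(x1, x2):
    return int(np.sum(vq_code(x1) != vq_code(x2)))  # number of differing ReLU states

d = vq_hamming(rng.standard_normal(2), rng.standard_normal(2))
```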

SLIDE 41

additional directions

  • Study the geometry of deep nets and signals via VQ partition
  • Affine input/output formula enables explicit calculation of the Lipschitz constant of a deep net for the analysis of stability, adversarial examples, … (see the sketch after this list)

  • Theory covers many recurrent neural networks (RNNs)
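For the Lipschitz point above, a sketch under stated assumptions: on each VQ region the net is affine, $z = A_Q x + b_Q$, so the local Lipschitz constant is the spectral norm of $A_Q$; sampling inputs probes regions and lower-bounds the global constant (the exact calculation in the papers may differ).

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))  # toy net

lip = 0.0
for _ in range(256):
    x = torch.randn(16)
    A = torch.autograd.functional.jacobian(net, x)             # A_Q(x) on x's region
    lip = max(lip, torch.linalg.matrix_norm(A, ord=2).item())  # spectral norm
```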
SLIDE 42

additional directions

  • Theory extends to non-piecewise-affine operators (ex: sigmoid) by replacing the “hard VQ” of a MASO with a “soft VQ”
    – soft VQ can generate new nonlinearities (ex: swish), as in the sketch below
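A minimal sketch of one way to soften the VQ: replace the hard max over the two ReLU pieces {0, x} (slide 16's MAS with a = 1, b = 0) with a softmax weighting of the pieces. This recovers swish, $x \cdot \sigma(x)$.

```python
import numpy as np

def soft_mas(x, slopes=(0.0, 1.0)):
    scores = np.array([a * x for a in slopes])       # per-piece affine values
    w = np.exp(scores) / np.exp(scores).sum(axis=0)  # soft VQ: softmax over pieces
    return (w * scores).sum(axis=0)                  # softly "selected" max

x = np.linspace(-5, 5, 11)
assert np.allclose(soft_mas(x), x / (1 + np.exp(-x)))  # = x·σ(x), i.e., swish
```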

SLIDE 43

summary

  • A wide range of deep nets solve function approximation problems using a composition of max-affine spline operators (MASOs)
    – links to vector quantization, k-means, Voronoi tiling

  • Input/output deep net mapping is a VQ-dependent affine transform
    – enables explicit calculation of the Lipschitz constant of a deep net for the analysis of stability, adversarial examples, …

  • Deep nets are (learned) matched filterbanks
    – new insights into dataset memorization

  • Theory is constructive
    – inspires orthogonalized deep nets
    – new geometric distance via Hamming-VQ distance

SLIDE 44

max-affine splines and deep learning

  • R. Balestriero & R. Baraniuk

“A Spline Theory of Deep Networks,” ICML 2018
“Mad Max: Affine Spline Insights into Deep Learning,” arxiv.org/abs/1805.06576, 2018
“From Hard to Soft: Understanding Deep Network Nonlinearities…,” ICLR 2019
“A Max-Affine Spline Perspective of RNNs,” ICLR 2019 (w/ J. Wang)

dsp.rice.edu