
Mad Max: Affine Spline Insights into Deep Learning - Richard Baraniuk


  1. Mad Max: Affine Spline Insights into Deep Learning Richard Baraniuk

  2. expectations time

  3. greek questions for the babylonians
     • Why is deep learning so effective?
     • Can we derive deep learning systems from first principles?
     • When and why does deep learning fail?
     • How can deep learning systems be improved and extended in a principled fashion?
     • Where is the foundational framework for theory?
     See also Mallat, Soatto, Arora, Poggio, Tishby, [growing community] …

  4. splines and deep learning
     R. Balestriero & R. Baraniuk
     • “A Spline Theory of Deep Networks,” ICML 2018
     • “Mad Max: Affine Spline Insights into Deep Learning,” arxiv.org/abs/1805.06576, 2018
     • “From Hard to Soft: Understanding Deep Network Nonlinearities…,” ICLR 2019
     • “A Max-Affine Spline Perspective of RNNs,” ICLR 2019 (w/ J. Wang)

  5. prediction problem
     • Unknown function/operator f mapping data x (signal, image, video, …) to labels y:  y = f(x)
     • Goal: Learn an approximation to f using training data {(x_i, y_i)}_{i=1}^n :  ŷ = f_Θ(x)

  6. deep nets approximate
     • Deep nets solve a function approximation problem (black box):  ŷ = f_Θ(x)

  7. deep nets approximate
     • Deep nets solve a function approximation problem hierarchically
       [figure: layer 1 (conv | ReLU) → layer 2 (conv | ReLU | max-pool) → layer 3 → … → ŷ]
       ŷ = f_Θ(x) = ( f^(L) ∘ ⋯ ∘ f^(3) ∘ f^(2) ∘ f^(1) )(x),  layer l with parameters θ^(l)

  8. deep nets and splines
     • Deep nets solve a function approximation problem hierarchically using a very special family of splines
       [figure: layer 1 (conv | ReLU) → layer 2 (conv | ReLU | max-pool) → layer 3 → … → ŷ]
       ŷ = f_Θ(x) = ( f^(L) ∘ ⋯ ∘ f^(3) ∘ f^(2) ∘ f^(1) )(x),  layer l with parameters θ^(l)
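A minimal numerical sketch of the composition on this slide, with hypothetical toy layer sizes and fully connected layers standing in for the conv and max-pool blocks; it is only meant to show the shape of ŷ = f_Θ(x) = (f^(3) ∘ f^(2) ∘ f^(1))(x), not any particular architecture.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical toy sizes; fully connected layers stand in for the slide's conv / max-pool blocks.
    W1, b1 = rng.standard_normal((16, 8)), rng.standard_normal(16)   # theta^(1)
    W2, b2 = rng.standard_normal((8, 16)), rng.standard_normal(8)    # theta^(2)
    W3, b3 = rng.standard_normal((4, 8)),  rng.standard_normal(4)    # theta^(3)

    relu = lambda u: np.maximum(0.0, u)

    def f(x):
        # y_hat = f_Theta(x) = (f3 o f2 o f1)(x); every layer is piecewise affine
        z1 = relu(W1 @ x + b1)      # layer 1: fully connected | ReLU
        z2 = relu(W2 @ z1 + b2)     # layer 2: fully connected | ReLU
        return W3 @ z2 + b3         # layer 3: fully connected (pre-softmax)

    x = rng.standard_normal(8)
    print(f(x))                     # 4-D output y_hat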

  9. deep nets and splines

  10. spline approximation
     • A spline function approximation consists of
       – a partition Ω of the independent variable (input space)
       – a (simple) local mapping on each region of the partition (our focus: piecewise-affine mappings)
       [figure: 1-D piecewise-affine spline over a partition Ω of the x-axis]

  11. spline approximation
     • A spline function approximation consists of
       – a partition Ω of the independent variable (input space)
       – a (simple) local mapping on each region of the partition
     • Powerful splines
       – free, unconstrained partition Ω (ex: “free-knot” splines)
       – jointly optimize both the partition and local mappings (highly nonlinear, computationally intractable)
     • Easy splines
       – fixed partition (ex: uniform grid, dyadic grid)
       – need only optimize the local mappings
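As a concrete illustration of the "easy spline" case above, here is a minimal sketch (not from the talk) that fixes a uniform partition of [0, 1) and fits one affine mapping per region by least squares; the target function sin(2πx) and the region count R = 8 are arbitrary choices.

    import numpy as np

    # "Easy" spline sketch: fixed uniform partition of [0, 1), one affine fit per region.
    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0.0, 1.0, 200))
    y = np.sin(2 * np.pi * x)                 # hypothetical target function

    R = 8                                     # number of partition regions
    region = np.minimum((x * R).astype(int), R - 1)

    params = []
    for r in range(R):
        xr, yr = x[region == r], y[region == r]
        A = np.column_stack([xr, np.ones_like(xr)])
        a, b = np.linalg.lstsq(A, yr, rcond=None)[0]   # local affine mapping a*x + b
        params.append((a, b))

    # Piecewise-affine approximation evaluated at the sample points
    y_hat = np.array([params[r][0] * xi + params[r][1] for xi, r in zip(x, region)])
    print("max abs error:", np.max(np.abs(y - y_hat)))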

  12. max-affine spline (MAS) [Magnani & Boyd, 2009; Hannah & Dunson, 2013]
     • Consider piecewise-affine approximation of a convex function over R regions
       – Affine functions:  a_r^T x + b_r,  r = 1, …, R
       – Convex approximation:  z(x) = max_{r=1,…,R} ( a_r^T x + b_r )
       [figure: R = 4 affine pieces (a_1,b_1), …, (a_4,b_4) whose max traces a convex function of x]

  13. max-affine spline (MAS) [Magnani & Boyd, 2009; Hannah & Dunson, 2013]
     • Key: Any set of affine parameters (a_r, b_r), r = 1, …, R implicitly determines a spline partition
       – Affine functions:  a_r^T x + b_r,  r = 1, …, R
       – Convex approximation:  z(x) = max_{r=1,…,R} ( a_r^T x + b_r )
       [figure: R = 4 affine pieces (a_1,b_1), …, (a_4,b_4); the argmax selects the partition region of x]
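A minimal sketch of a max-affine spline with R = 4 hypothetical pieces (a_r, b_r), showing that the argmax over the affine functions both evaluates z(x) and identifies the partition region containing x.

    import numpy as np

    # Max-affine spline with R = 4 hypothetical affine pieces (a_r, b_r)
    a = np.array([-2.0, -0.5, 0.5, 2.0])   # slopes  a_r
    b = np.array([ 0.0,  0.5, 0.5, 0.0])   # offsets b_r

    def mas(x):
        vals = a * x + b                    # all affine functions a_r x + b_r
        r_star = np.argmax(vals)            # winning piece = partition region containing x
        return vals[r_star], r_star

    for x in (-1.5, -0.2, 0.3, 1.5):
        z, r = mas(x)
        print(f"x={x:+.1f}  z(x)={z:+.2f}  region r*={r}")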

  14. scale + bias | ReLU is a MAS
     • Scale x by a, add bias b, apply ReLU:  z(x) = max(0, ax + b)
       – Affine functions:  (a_1, b_1) = (0, 0),  (a_2, b_2) = (a, b)
       – Convex approximation:  z(x) = max_{r=1,2} ( a_r x + b_r ),  R = 2
       [figure: the two affine pieces (a_1,b_1) and (a_2,b_2) whose max is the ReLU output]
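A quick numerical check of this slide, with an arbitrary scale a and bias b: the scale + bias | ReLU unit coincides with the two-piece MAS built from (a_1, b_1) = (0, 0) and (a_2, b_2) = (a, b).

    import numpy as np

    a, b = 1.7, -0.3                                      # hypothetical scale and bias
    x = np.linspace(-2, 2, 9)

    relu_out = np.maximum(0.0, a * x + b)                 # scale + bias | ReLU
    mas_out  = np.maximum(0.0 * x + 0.0, a * x + b)       # max over (a1,b1)=(0,0), (a2,b2)=(a,b)

    print(np.allclose(relu_out, mas_out))                 # True: ReLU is a 2-piece MAS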

  15. max-affine spline operator (MASO)
     • A MAS for x ∈ R^D has affine parameters a_r ∈ R^D, b_r ∈ R
     • A MASO is simply a concatenation of K MASs, the k-th with parameters [A]_{k,i,r}, [B]_{k,r}; it maps x ∈ R^D to z ∈ R^K
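A sketch of a MASO as K stacked MASs, using a parameter tensor A of shape (K, R, D) and offsets B of shape (K, R); this index order is chosen for convenience and may differ from the paper's notation. Setting one piece per output to the corresponding row of (W, b) and the other to (0, 0) recovers a fully connected | ReLU layer.

    import numpy as np

    rng = np.random.default_rng(2)
    D, K, R = 5, 3, 2                      # input dim, output dim, pieces per output

    A = rng.standard_normal((K, R, D))     # A[k, r, :] : slope of piece r for output k
    B = rng.standard_normal((K, R))        # B[k, r]    : offset of piece r for output k

    def maso(x):
        vals = A @ x + B                   # shape (K, R): all affine pieces, all outputs
        return vals.max(axis=1)            # per-output max -> z in R^K

    # Fully connected | ReLU as a MASO with R = 2: piece 0 is (W, b), piece 1 is (0, 0)
    W, b = rng.standard_normal((K, D)), rng.standard_normal(K)
    A[:, 0, :], B[:, 0] = W, b
    A[:, 1, :], B[:, 1] = 0.0, 0.0

    x = rng.standard_normal(D)
    print(np.allclose(maso(x), np.maximum(0.0, W @ x + b)))   # True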

  16. modern deep nets
     • Focus: The lion's share of today's deep net architectures (convnets, resnets, skip-connection nets, inception nets, recurrent nets, …) employ piecewise-linear (affine) layers (fully connected, conv; (leaky) ReLU, absolute value; max/mean/channel-pooling)
       [figure: layer 1 (conv | ReLU) → layer 2 (conv | ReLU | max-pool) → layer 3 → … → ŷ]
       ŷ = f_Θ(x) = ( f^(L) ∘ ⋯ ∘ f^(3) ∘ f^(2) ∘ f^(1) )(x),  layer l with parameters θ^(l)

  17. theorems
     • Each deep net layer is a MASO
       – a piecewise-affine operator, convex w.r.t. each output dimension

  18. theorems
     • Each deep net layer is a MASO
       – a convex, piecewise-affine operator (WLOG we ignore the output softmax)
     • A deep net is a composition of MASOs
       – a non-convex, piecewise-affine spline operator

  19. theorems
     • A deep net is a composition of MASOs
       – a non-convex, piecewise-affine spline operator
     • A deep net is a convex MASO iff the convolution/fully connected weights in all but the first layer are nonnegative and the intermediate nonlinearities are nondecreasing
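A numerical spot-check of the convexity statement, under the assumption of a toy two-layer fully connected | ReLU net: with nonnegative second-layer weights and the (nondecreasing) ReLU in between, each output should satisfy the midpoint convexity inequality at random input pairs.

    import numpy as np

    rng = np.random.default_rng(3)
    D, H, K = 4, 6, 2

    W1, b1 = rng.standard_normal((H, D)), rng.standard_normal(H)          # first layer: unconstrained
    W2, b2 = np.abs(rng.standard_normal((K, H))), rng.standard_normal(K)  # later layer: nonnegative weights

    def net(x):
        return W2 @ np.maximum(0.0, W1 @ x + b1) + b2   # ReLU is nondecreasing

    # Midpoint convexity check on random pairs: f((x+y)/2) <= (f(x) + f(y)) / 2, per output
    ok = True
    for _ in range(1000):
        x, y = rng.standard_normal(D), rng.standard_normal(D)
        ok &= np.all(net((x + y) / 2) <= (net(x) + net(y)) / 2 + 1e-9)
    print("midpoint convexity holds on samples:", ok)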

  20. MASO spline partition
     • The parameters of each deep net layer (MASO) induce a partition of its input space with convex regions
       – vector quantization (information theory)
       – k-means (statistics)
       – Voronoi tiling (geometry)

  21. MASO spline partition
     • The L layer-partitions of an L-layer deep net combine to form the global input signal space partition
       – affine spline operator
       – non-convex regions
     • Toy example: 3-layer “deep net”
       – Input x: 2-D (4 classes)
       – Fully connected | ReLU (45-D output)
       – Fully connected | ReLU (3-D output)
       – Fully connected | (softmax) (4-D output)
       – Output y: 4-D

  22. MASO spline partition
     • The L layer-partitions of an L-layer deep net combine to form the global input signal space partition
       – affine spline operator
       – non-convex regions
     • Toy example: 3-layer “deep net”
       – Input x: 2-D (4 classes)
       – Fully connected | ReLU (45-D output)
       – Fully connected | ReLU (3-D output)
       – Fully connected | (softmax) (4-D output)
       – Output y: 4-D
       [figure: the 2-D input space, axes x[1] and x[2]]

  23. MASO spline partition
     • Toy example: 3-layer “deep net”
       – Input x: 2-D (4 classes)
       – Fully connected | ReLU (45-D output)
       – Fully connected | ReLU (3-D output)
       – Fully connected | (softmax) (4-D output)
       – Output y: 4-D
     • VQ partition of layer 1 depicted in the input space
       – convex regions

  24. MASO spline partition
     • Toy example: 3-layer “deep net”
       – Input x: 2-D (4 classes)
       – Fully connected | ReLU (45-D output)
       – Fully connected | ReLU (3-D output)
       – Fully connected | (softmax) (4-D output)
       – Output y: 4-D
     • Given the partition region Q(x) containing x, the layer input/output mapping is affine:
       z(x) = A_{Q(x)} x + b_{Q(x)}
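A sketch of the formula above for a single fully connected | ReLU layer with hypothetical toy sizes: the VQ code Q(x) is the ReLU on/off pattern, and masking the rows of (W, b) accordingly gives A_{Q(x)} and b_{Q(x)}.

    import numpy as np

    rng = np.random.default_rng(4)
    D, K = 2, 5
    W, b = rng.standard_normal((K, D)), rng.standard_normal(K)

    def layer(x):
        return np.maximum(0.0, W @ x + b)          # fully connected | ReLU

    def local_affine(x):
        q = (W @ x + b > 0).astype(float)          # VQ code Q(x): ReLU on/off pattern
        A_q = q[:, None] * W                       # A_{Q(x)}: rows of W kept where ReLU is on
        b_q = q * b                                # b_{Q(x)}
        return A_q, b_q, q

    x = rng.standard_normal(D)
    A_q, b_q, q = local_affine(x)
    print("region code Q(x):", q)
    print("z(x) == A_{Q(x)} x + b_{Q(x)}:", np.allclose(layer(x), A_q @ x + b_q))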

  25. MASO spline partition
     • Toy example: 3-layer “deep net”
       – Input x: 2-D (4 classes)
       – Fully connected | ReLU (45-D output)
       – Fully connected | ReLU (3-D output)
       – Fully connected | (softmax) (4-D output)
       – Output y: 4-D
     • VQ partition of layer 2 depicted in the input space
       – non-convex regions due to visualization in the input space

  26. MASO spline partition
     • Toy example: 3-layer “deep net”
       – Input x: 2-D (4 classes)
       – Fully connected | ReLU (45-D output)
       – Fully connected | ReLU (3-D output)
       – Fully connected | (softmax) (4-D output)
       – Output y: 4-D
     • Given the partition region Q(x) containing x, the layer input/output mapping is affine:
       z(x) = A_{Q(x)} x + b_{Q(x)}

  27. MASO spline partition
     • Toy example: 3-layer “deep net”
       – Input x: 2-D (4 classes)
       – Fully connected | ReLU (45-D output)
       – Fully connected | ReLU (3-D output)
       – Fully connected | (softmax) (4-D output)
       – Output y: 4-D
     • VQ partition of layers 1 & 2 depicted in the input space
       – non-convex regions
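A sketch loosely mirroring the toy example above, but with smaller hypothetical widths (2 → 8 → 3): it rasterizes a patch of the 2-D input plane and counts the distinct joint layer-1 & layer-2 VQ codes, i.e. the regions of the combined spline partition visible in that patch.

    import numpy as np

    rng = np.random.default_rng(5)
    W1, b1 = rng.standard_normal((8, 2)), rng.standard_normal(8)
    W2, b2 = rng.standard_normal((3, 8)), rng.standard_normal(3)

    def codes(x):
        h1 = W1 @ x + b1
        q1 = h1 > 0                                   # layer-1 VQ code
        h2 = W2 @ np.maximum(0.0, h1) + b2
        q2 = h2 > 0                                   # layer-2 VQ code (seen in the input space)
        return tuple(q1) + tuple(q2)

    # Rasterize a patch of the 2-D input space and count distinct joint codes,
    # i.e. regions of the combined (layers 1 & 2) spline partition in that patch.
    grid = np.linspace(-3, 3, 200)
    regions = {codes(np.array([u, v])) for u in grid for v in grid}
    print("distinct partition regions in the patch:", len(regions))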

  28. learning layers 1 & 2
     [figure: evolution of the layer 1 & 2 partition over learning epochs (time)]

  29. local affine mapping – CNN
     • WLOG we ignore the output softmax

  30. local affine mapping – CNN
     • Fixed but different A_{Q(x)}, b_{Q(x)} in each partition region

  31. matched filters

  32. deep nets are matched filterbanks
     • End-to-end, the net output is  z^(L)(x) = A_{Q(x)} x + b_{Q(x)}
     • Row c of A_{Q(x)} is a vectorized signal/image corresponding to class c
     • Entry c of the deep net output = inner product between row c and the signal
     • For classification, select the largest output: a matched filter!
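A sketch of the matched-filter reading for a hypothetical two-layer fully connected | ReLU net: composing the per-layer local affine maps gives the end-to-end A_{Q(x)}, b_{Q(x)}, each row of A_{Q(x)} plays the role of a class template, and classification picks the largest inner product plus offset.

    import numpy as np

    rng = np.random.default_rng(6)
    D, H, C = 6, 10, 4
    W1, b1 = rng.standard_normal((H, D)), rng.standard_normal(H)
    W2, b2 = rng.standard_normal((C, H)), rng.standard_normal(C)

    def net(x):                                        # pre-softmax outputs
        return W2 @ np.maximum(0.0, W1 @ x + b1) + b2

    def end_to_end_affine(x):
        q = (W1 @ x + b1 > 0).astype(float)
        A1, c1 = q[:, None] * W1, q * b1               # layer-1 local affine map
        A = W2 @ A1                                    # A_{Q(x)}
        b = W2 @ c1 + b2                               # b_{Q(x)}
        return A, b

    x = rng.standard_normal(D)
    A, b = end_to_end_affine(x)
    print(np.allclose(net(x), A @ x + b))              # True: the net is locally affine
    # Matched-filter view: output c = <row c of A_{Q(x)}, x> + b_c; classify by the largest.
    print("predicted class:", np.argmax(A @ x + b))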

  33. deep nets are matched filterbanks

  34. data memorization

  35. orthogonal deep nets

  36. partition-based signal distance

  37. partition-based signal distance

  38. partition-based signal distance

  39. additional directions
     • Study the geometry of deep nets and signals via the VQ partition
     • The affine input/output formula enables explicit calculation of the Lipschitz constant of a deep net, for the analysis of stability, adversarial examples, …
     • The theory covers many recurrent neural networks (RNNs)
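A sketch of the Lipschitz remark for the same kind of toy fully connected | ReLU net: on the region Q(x) the net is the affine map A_{Q(x)} x + b_{Q(x)}, so its local Lipschitz constant (in the ℓ2 sense) is the largest singular value of A_{Q(x)}; the product of the layer norms gives a cruder global bound.

    import numpy as np

    rng = np.random.default_rng(7)
    W1, b1 = rng.standard_normal((10, 6)), rng.standard_normal(10)
    W2, b2 = rng.standard_normal((4, 10)), rng.standard_normal(4)

    def local_A(x):
        q = (W1 @ x + b1 > 0).astype(float)
        return W2 @ (q[:, None] * W1)                  # A_{Q(x)} for this FC | ReLU | FC net

    x = rng.standard_normal(6)
    lip_local = np.linalg.norm(local_A(x), 2)          # spectral norm = local Lipschitz constant
    lip_bound = np.linalg.norm(W2, 2) * np.linalg.norm(W1, 2)   # crude global upper bound (ReLU is 1-Lipschitz)
    print(f"local Lipschitz on region Q(x): {lip_local:.3f}  (global bound {lip_bound:.3f})")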

  40. additional directions
     • The theory extends to non-piecewise-affine operators (ex: sigmoid) by replacing the “hard VQ” of a MASO with a “soft VQ”
       – soft VQ can generate new nonlinearities (ex: swish)
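A sketch of the hard-to-soft idea as described above, under the assumption that the soft VQ replaces the hard argmax over the affine pieces with a softmax-weighted average of them. For the two ReLU pieces {0, x}, this softened MAS reproduces x·sigmoid(x), the swish/SiLU nonlinearity mentioned on the slide; the temperature beta = 1 is an arbitrary choice here.

    import numpy as np

    def hard_mas(x, a, b):
        vals = np.outer(a, x) + b[:, None]             # all affine pieces a_r x + b_r
        return vals.max(axis=0)                        # hard VQ: winner takes all

    def soft_mas(x, a, b, beta=1.0):
        vals = np.outer(a, x) + b[:, None]
        w = np.exp(beta * vals)
        w /= w.sum(axis=0)                             # soft VQ: softmax over the pieces
        return (w * vals).sum(axis=0)

    a, b = np.array([0.0, 1.0]), np.array([0.0, 0.0])  # the two ReLU pieces: 0 and x
    x = np.linspace(-5, 5, 11)
    print(np.allclose(hard_mas(x, a, b), np.maximum(0.0, x)))       # hard VQ -> ReLU
    print(np.allclose(soft_mas(x, a, b), x / (1 + np.exp(-x))))     # soft VQ -> swish/SiLU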
