  1. Deep networks CS 446

  2. The ERM perspective
  These lectures will follow an ERM perspective on deep networks:
  ◮ Pick a model/predictor class (network architecture). (We will spend most of our time on this!)
  ◮ Pick a loss/risk. (We will almost always use cross-entropy!)
  ◮ Pick an optimizer. (We will mostly treat this as a black box!)
  The goal is low test error, whereas the above only gives low training error; we will briefly discuss this as well.

  3. 1. Linear networks.

  4. Iterated linear predictors
  The most basic view of a neural network is an iterated linear predictor.
  ◮ 1 layer: $x \mapsto W_1 x + b_1$.
  ◮ 2 layers: $x \mapsto W_2 (W_1 x + b_1) + b_2$.
  ◮ 3 layers: $x \mapsto W_3 \bigl( W_2 (W_1 x + b_1) + b_2 \bigr) + b_3$.
  ◮ $L$ layers: $x \mapsto W_L \bigl( \cdots (W_1 x + b_1) \cdots \bigr) + b_L$.
  Alternatively, this is a composition of linear predictors: $x \mapsto (f_L \circ f_{L-1} \circ \cdots \circ f_1)(x)$, where $f_i(z) = W_i z + b_i$ is an affine function.
  Note: "layer" terminology is ambiguous; we'll revisit it.
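As a concrete illustration, here is a minimal NumPy sketch (not from the slides) of the iterated affine map; the layer sizes below are made up.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical layer sizes: input dimension 4, then three affine layers.
    dims = [4, 5, 3, 2]
    Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
    bs = [rng.standard_normal(dims[i + 1]) for i in range(len(dims) - 1)]

    def iterated_affine(x, Ws, bs):
        """Apply f_L o ... o f_1 where f_i(z) = W_i z + b_i (no nonlinearity yet)."""
        z = x
        for W, b in zip(Ws, bs):
            z = W @ z + b
        return z

    x = rng.standard_normal(dims[0])
    print(iterated_affine(x, Ws, bs))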

  5. Wait a minute. . .
  Note that
  $$W_L \bigl( \cdots (W_1 x + b_1) \cdots \bigr) + b_L = (W_L \cdots W_1) x + (b_L + W_L b_{L-1} + \cdots + W_L \cdots W_2 b_1) = w^{\mathsf T} \begin{bmatrix} x \\ 1 \end{bmatrix},$$
  where $w \in \mathbb{R}^{d+1}$ satisfies $w_{1:d}^{\mathsf T} = W_L \cdots W_1$ and $w_{d+1} = b_L + W_L b_{L-1} + \cdots + W_L \cdots W_2 b_1$.
  Oops, this is just a linear predictor.
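A quick numerical check of this collapse (a self-contained sketch with made-up sizes; output dimension 1, so the composed map is the single vector $w$ from the slide):

    import numpy as np

    rng = np.random.default_rng(1)
    dims = [4, 5, 3, 1]  # hypothetical sizes; last layer has scalar output
    Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
    bs = [rng.standard_normal(dims[i + 1]) for i in range(len(dims) - 1)]

    # Collapse: W = W_L ... W_1, b = b_L + W_L b_{L-1} + ... + W_L ... W_2 b_1.
    W_total, b_total = Ws[0], bs[0]
    for W, b in zip(Ws[1:], bs[1:]):
        W_total = W @ W_total
        b_total = W @ b_total + b

    x = rng.standard_normal(dims[0])
    deep = x
    for W, b in zip(Ws, bs):
        deep = W @ deep + b

    # The L-layer "linear network" agrees with a single affine predictor.
    assert np.allclose(deep, W_total @ x + b_total)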

  6. 2. Activations/nonlinearities.

  7-8. Iterated logistic regression
  Recall that logistic regression could be interpreted as a probability model:
  $$\Pr[Y = 1 \mid X = x] = \frac{1}{1 + \exp(-w^{\mathsf T} x)} =: \sigma_{\mathrm s}(w^{\mathsf T} x),$$
  where $\sigma_{\mathrm s}$ is the logistic or sigmoid function.
  [Figure: plot of the sigmoid function, increasing from 0 to 1 over the interval $[-6, 6]$.]
  Now suppose $\sigma_{\mathrm s}$ is applied coordinate-wise, and consider $x \mapsto (f_L \circ \cdots \circ f_1)(x)$ where $f_i(z) = \sigma_{\mathrm s}(W_i z + b_i)$.
  Don't worry, we'll slow down next slide; for now, iterated logistic regression gave our first deep network!
  Remark: can view intermediate layers as features to subsequent layers.

  9. Basic deep networks
  A self-contained expression is
  $$x \mapsto \sigma_L \Bigl( W_L \, \sigma_{L-1} \bigl( \cdots W_2 \, \sigma_1 (W_1 x + b_1) + b_2 \cdots \bigr) + b_L \Bigr),$$
  with equivalent "functional form" $x \mapsto (f_L \circ \cdots \circ f_1)(x)$ where $f_i(z) = \sigma_i(W_i z + b_i)$.
  Some further details (many more to come!):
  ◮ $(W_i)_{i=1}^{L}$ with $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$ are the weights, and $(b_i)_{i=1}^{L}$ are the biases.
  ◮ $(\sigma_i)_{i=1}^{L}$ with $\sigma_i : \mathbb{R}^{d_i} \to \mathbb{R}^{d_i}$ are called nonlinearities, or activations, or transfer functions, or link functions.
  ◮ This is only the basic setup; many things can and will change, please ask many questions!
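To make the functional form concrete, here is a small NumPy sketch (not from the slides) of the forward pass, with made-up layer sizes, ReLU on the hidden layers, and the identity on the last layer:

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def identity(z):
        return z

    def forward(x, Ws, bs, sigmas):
        """Compute sigma_L(W_L(... sigma_1(W_1 x + b_1) ...) + b_L)."""
        z = x
        for W, b, sigma in zip(Ws, bs, sigmas):
            z = sigma(W @ z + b)
        return z

    rng = np.random.default_rng(0)
    dims = [3, 8, 8, 2]                      # hypothetical d, d_1, d_2, d_3
    Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]
    bs = [np.zeros(dims[i + 1]) for i in range(3)]
    sigmas = [relu, relu, identity]          # identity on the last layer (see the next slide)

    print(forward(rng.standard_normal(dims[0]), Ws, bs, sigmas))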

  10. Choices of activation
  Basic form:
  $$x \mapsto \sigma_L \Bigl( W_L \, \sigma_{L-1} \bigl( \cdots W_2 \, \sigma_1 (W_1 x + b_1) + b_2 \cdots \bigr) + b_L \Bigr).$$
  Choices of activation (univariate, applied coordinate-wise):
  ◮ Indicator/step/Heaviside/threshold $z \mapsto \mathbf{1}[z \geq 0]$. This was the original choice (1940s!).
  ◮ Sigmoid $\sigma_{\mathrm s}(z) := \frac{1}{1 + \exp(-z)}$. This was popular roughly 1970s-2005?
  ◮ Hyperbolic tangent $z \mapsto \tanh(z)$. Similar to sigmoid, used during the same interval.
  ◮ Rectified Linear Unit (ReLU) $\sigma_{\mathrm r}(z) = \max\{0, z\}$. It (and slight variants, e.g., Leaky ReLU, ELU, . . . ) is the dominant choice now; popularized in the "ImageNet/AlexNet" paper (Krizhevsky-Sutskever-Hinton, 2012).
  ◮ Identity $z \mapsto z$; we'll often use this as the last layer when we use cross-entropy loss.
  ◮ NON-coordinate-wise choices: we will discuss "softmax" and "pooling" a bit later.
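For reference, a minimal sketch of these coordinate-wise activations in NumPy (NumPy applies them elementwise, matching the coordinate-wise convention above):

    import numpy as np

    # The coordinate-wise activations listed above.
    def step(z):      return (z >= 0).astype(float)    # indicator/step/Heaviside/threshold
    def sigmoid(z):   return 1.0 / (1.0 + np.exp(-z))  # logistic
    def tanh(z):      return np.tanh(z)                # hyperbolic tangent
    def relu(z):      return np.maximum(0.0, z)        # rectified linear unit
    def identity(z):  return z                         # common choice for the last layer

    z = np.linspace(-3, 3, 7)
    for f in (step, sigmoid, tanh, relu, identity):
        print(f.__name__, np.round(f(z), 3))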

  11. "Architectures" and "models"
  Basic form:
  $$x \mapsto \sigma_L \Bigl( W_L \, \sigma_{L-1} \bigl( \cdots W_2 \, \sigma_1 (W_1 x + b_1) + b_2 \cdots \bigr) + b_L \Bigr).$$
  $((W_i, b_i))_{i=1}^{L}$, the weights and biases, are the parameters. Let's roll them into $\mathcal{W} := ((W_i, b_i))_{i=1}^{L}$, and consider the network as a two-argument function $F_{\mathcal{W}}(x) = F(x; \mathcal{W})$.
  ◮ The model or class of functions is $\{ F_{\mathcal{W}} : \text{all possible } \mathcal{W} \}$. $F$ (both arguments unset) is also called an architecture.
  ◮ When we fit/train/optimize, typically we leave the architecture fixed and vary $\mathcal{W}$ to minimize risk. (More on this in a moment.)
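One way to see the architecture/parameter split in code (an illustrative sketch only): the architecture fixes the layer sizes and activations, while $\mathcal{W}$ is the collection of weights and biases that we vary.

    import numpy as np

    def init_params(dims, rng):
        """Sample one parameter setting W = ((W_i, b_i))_{i=1..L} for a fixed architecture."""
        Ws = [rng.standard_normal((dims[i + 1], dims[i])) * 0.1 for i in range(len(dims) - 1)]
        bs = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]
        return Ws, bs

    def F(x, params):
        """The architecture F; evaluating F(x; W) needs both an input and parameters."""
        Ws, bs = params
        z = x
        for W, b in zip(Ws[:-1], bs[:-1]):
            z = np.maximum(0.0, W @ z + b)   # ReLU on hidden layers
        return Ws[-1] @ z + bs[-1]           # identity on the last layer

    rng = np.random.default_rng(0)
    dims = [3, 16, 2]                        # hypothetical architecture: d = 3, one hidden layer, 2 outputs
    x = rng.standard_normal(3)
    print(F(x, init_params(dims, rng)))      # same architecture, different W,
    print(F(x, init_params(dims, rng)))      # hence different functions from the same class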

  12. ERM recipe for basic networks
  Standard ERM recipe:
  ◮ First we pick a class of functions/predictors; for deep networks, that means an $F(\cdot, \cdot)$.
  ◮ Then we pick a loss function and write down an empirical risk minimization problem; in these lectures we will pick cross-entropy:
  $$\operatorname*{arg\,min}_{\mathcal{W}} \; \frac{1}{n} \sum_{i=1}^{n} \ell_{\mathrm{ce}}\bigl( y_i, F(x_i; \mathcal{W}) \bigr)
  = \operatorname*{arg\,min}_{\substack{W_1 \in \mathbb{R}^{d_1 \times d},\, b_1 \in \mathbb{R}^{d_1} \\ \vdots \\ W_L \in \mathbb{R}^{d_L \times d_{L-1}},\, b_L \in \mathbb{R}^{d_L}}} \; \frac{1}{n} \sum_{i=1}^{n} \ell_{\mathrm{ce}}\bigl( y_i, F(x_i; ((W_i, b_i))_{i=1}^{L}) \bigr)
  = \operatorname*{arg\,min}_{\substack{W_1 \in \mathbb{R}^{d_1 \times d},\, b_1 \in \mathbb{R}^{d_1} \\ \vdots \\ W_L \in \mathbb{R}^{d_L \times d_{L-1}},\, b_L \in \mathbb{R}^{d_L}}} \; \frac{1}{n} \sum_{i=1}^{n} \ell_{\mathrm{ce}}\bigl( y_i, \sigma_L(\cdots \sigma_1(W_1 x_i + b_1) \cdots) \bigr).$$
  ◮ Then we pick an optimizer. In this class, we only use gradient descent variants. It is a miracle that this works.
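As an illustration of the recipe, here is a minimal training sketch using PyTorch (the slides do not prescribe a framework; the data, layer sizes, learning rate, and step count below are all made up):

    import torch
    from torch import nn

    # Toy data: n points in R^d with labels in {0, 1, 2}; in practice this is your training set.
    n, d, num_classes = 256, 10, 3
    X = torch.randn(n, d)
    y = torch.randint(0, num_classes, (n,))

    # The architecture F: two ReLU layers, identity on the last layer (logits).
    model = nn.Sequential(
        nn.Linear(d, 32), nn.ReLU(),
        nn.Linear(32, 32), nn.ReLU(),
        nn.Linear(32, num_classes),
    )

    loss_fn = nn.CrossEntropyLoss()                    # the cross-entropy loss ell_ce
    opt = torch.optim.SGD(model.parameters(), lr=0.1)  # a gradient descent variant

    for step in range(200):
        opt.zero_grad()
        loss = loss_fn(model(X), y)   # empirical risk (1/n) sum_i ell_ce(y_i, F(x_i; W))
        loss.backward()               # gradients with respect to all (W_i, b_i)
        opt.step()                    # one gradient step on W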

  13. Remark on affine expansion
  Note: we are writing
  $$x \mapsto \sigma_L \bigl( \cdots W_2 \, \sigma_1 (W_1 x + b_1) + b_2 \cdots \bigr),$$
  rather than
  $$x \mapsto \sigma_L \Bigl( \cdots W_2 \, \sigma_1 \Bigl( W_1 \begin{bmatrix} x \\ 1 \end{bmatrix} \Bigr) \cdots \Bigr).$$
  ◮ First form seems natural: with the "iterated linear prediction" perspective, it is natural to append 1 at every layer.
  ◮ Second form is sufficient: with ReLU, $\sigma_{\mathrm r}(1) = 1$, so we can pass the constant forward; similar (but more complicated) options exist for other activations.
  ◮ Why do we do it? It seems to make the optimization better behaved; this is currently not well understood.
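A small check of the equivalence being discussed (a sketch, not from the slides): folding the bias into an extra column of $W$ and appending a 1 to the input reproduces the affine layer, and since $\sigma_{\mathrm r}(1) = 1$ a ReLU layer can pass the constant forward to the next layer.

    import numpy as np

    rng = np.random.default_rng(0)
    W, b, x = rng.standard_normal((3, 4)), rng.standard_normal(3), rng.standard_normal(4)

    # Affine form W x + b  vs.  expanded form [W | b] [x; 1].
    W_aug = np.hstack([W, b[:, None]])
    x_aug = np.append(x, 1.0)
    assert np.allclose(W @ x + b, W_aug @ x_aug)

    # With ReLU, adding a row that outputs the constant 1 keeps the trick going layer to layer.
    relu = lambda z: np.maximum(0.0, z)
    W_carry = np.vstack([W_aug, np.append(np.zeros(4), 1.0)])  # last row reproduces the 1
    out = relu(W_carry @ x_aug)
    assert np.isclose(out[-1], 1.0)   # sigma_r(1) = 1, so the constant is passed forward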

  14-17. Which architecture?
  How do we choose an architecture?
  ◮ How did we choose $k$ in $k$-NN?
  ◮ Split the data into training and validation sets, train different architectures and evaluate them on validation, and choose the architecture with the lowest validation error. (A minimal sketch follows after this list.)
  ◮ As with other methods, this is a proxy for minimizing test error.
  Note.
  ◮ For many standard tasks (e.g., classification of standard vision datasets), people know good architectures.
  ◮ For new problems and new domains, things are absolutely not settled.
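A minimal sketch of architecture selection by validation error (illustrative only; the candidate hidden widths, data, and training budget are made up, and PyTorch is just one convenient choice):

    import torch
    from torch import nn

    torch.manual_seed(0)
    d, num_classes = 10, 3
    X, y = torch.randn(600, d), torch.randint(0, num_classes, (600,))
    X_tr, y_tr, X_val, y_val = X[:400], y[:400], X[400:], y[400:]   # train/validation split

    def train(width, epochs=200):
        model = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, num_classes))
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss_fn(model(X_tr), y_tr).backward()
            opt.step()
        return model

    def val_error(model):
        with torch.no_grad():
            return (model(X_val).argmax(dim=1) != y_val).float().mean().item()

    # Try a few candidate architectures (here: hidden widths) and keep the best on validation.
    candidates = [8, 32, 128]
    best = min(candidates, key=lambda w: val_error(train(w)))
    print("chosen hidden width:", best)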

  18. 3. What we have gained: representation power

  19. Sometimes, linear just isn't enough
  [Figure: two contour plots of decision values on the same 2-D dataset, left for a linear predictor and right for a ReLU network.]
  Linear predictor: $x \mapsto w^{\mathsf T} \begin{bmatrix} x \\ 1 \end{bmatrix}$. Some blue points misclassified.
  ReLU network: $x \mapsto W_2 \, \sigma_{\mathrm r}(W_1 x + b_1) + b_2$. 0 misclassifications!

  20-22. Classical example: XOR
  Classical "XOR problem" (Minsky-Papert '69). (Check Wikipedia for "AI Winter".)
  Theorem. On this data, any linear classifier (with affine expansion) makes at least one mistake.
  Picture proof. Recall: linear classifiers correspond to separating hyperplanes.
  ◮ If it splits the blue points, it's incorrect on one of them.
  ◮ If it doesn't split the blue points, then one halfspace contains the common midpoint, and is therefore wrong on at least one red point.
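By contrast, a two-layer ReLU network of the form $x \mapsto W_2 \, \sigma_{\mathrm r}(W_1 x + b_1) + b_2$ classifies XOR perfectly; here is one hand-picked set of weights (an illustrative example, not from the slides):

    import numpy as np

    # XOR data: label 1 iff exactly one coordinate is 1.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0])

    # Hand-picked weights for x -> W2 relu(W1 x + b1) + b2.
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([-0.5, -1.0])
    W2 = np.array([1.0, -2.0])
    b2 = -0.25

    scores = np.array([W2 @ np.maximum(0.0, W1 @ x + b1) + b2 for x in X])
    preds = (scores > 0).astype(int)
    print(preds, (preds == y).all())   # [0 1 1 0] True: zero mistakes, unlike any linear classifier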

  23. One layer was not enough. How about two?
  Theorem (Cybenko '89, Hornik-Stinchcombe-White '89, Funahashi '89, Leshno et al. '92, . . . ).
  Given any continuous function $f : \mathbb{R}^d \to \mathbb{R}$ and any $\epsilon > 0$, there exist parameters $(W_1, b_1, W_2)$ so that
  $$\sup_{x \in [0,1]^d} \bigl| f(x) - W_2 \, \sigma(W_1 x + b_1) \bigr| \leq \epsilon,$$
  as long as $\sigma$ is "reasonable" (e.g., ReLU or sigmoid or threshold).
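The theorem does not say how wide the hidden layer must be, but a one-dimensional sketch gives a feel for it (illustrative only; the target function, width, and grid are made up): fit a wide one-hidden-layer ReLU network to a continuous function on $[0, 1]$ by least squares on the output layer.

    import numpy as np

    f = lambda x: np.sin(2 * np.pi * x) + 0.5 * x          # a continuous target on [0, 1]

    m = 200                                                  # hidden width (chosen generously)
    x = np.linspace(0, 1, 1000)
    b1 = -np.linspace(0, 1, m)                               # hidden unit j "turns on" at x = j/m
    H = np.maximum(0.0, np.outer(x, np.ones(m)) + b1)        # relu(W1 x + b1) with W1 = all-ones
    W2, *_ = np.linalg.lstsq(H, f(x), rcond=None)            # fit only the output layer

    err = np.max(np.abs(f(x) - H @ W2))
    print(f"sup-norm error on the grid: {err:.4f}")          # small for large enough m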
