  1. Deep networks CS 446

  2. The ERM perspective
  These lectures will follow an ERM perspective on deep networks:
  ◮ Pick a model/predictor class (network architecture). (We will spend most of our time on this!)
  ◮ Pick a loss/risk. (We will almost always use cross-entropy!)
  ◮ Pick an optimizer. (We will mostly treat this as a black box!)
  The goal is low test error, whereas the above only gives low training error; we will briefly discuss this as well.

  3. 1. Linear networks.

  4. Iterated linear predictors
  The most basic view of a neural network is an iterated linear predictor.
  ◮ 1 layer: $x \mapsto W_1 x + b_1$.
  ◮ 2 layers: $x \mapsto W_2 (W_1 x + b_1) + b_2$.
  ◮ 3 layers: $x \mapsto W_3 \bigl( W_2 (W_1 x + b_1) + b_2 \bigr) + b_3$.
  ◮ $L$ layers: $x \mapsto W_L \bigl( \cdots (W_1 x + b_1) \cdots \bigr) + b_L$.
  Alternatively, this is a composition of linear predictors: $x \mapsto (f_L \circ f_{L-1} \circ \cdots \circ f_1)(x)$, where $f_i(z) = W_i z + b_i$ is an affine function.
  Note: "layer" terminology is ambiguous; we'll revisit it.
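As a concrete illustration, here is a minimal NumPy sketch (not from the slides) of the iterated affine map; the layer sizes below are made up.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical layer sizes: input dimension 4, then three affine layers.
    dims = [4, 5, 3, 2]
    Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
    bs = [rng.standard_normal(dims[i + 1]) for i in range(len(dims) - 1)]

    def iterated_affine(x, Ws, bs):
        """Apply f_L o ... o f_1 where f_i(z) = W_i z + b_i (no nonlinearity yet)."""
        z = x
        for W, b in zip(Ws, bs):
            z = W @ z + b
        return z

    x = rng.standard_normal(dims[0])
    print(iterated_affine(x, Ws, bs))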

  5. Wait a minute. . .
  Note that
  $$W_L \bigl( \cdots (W_1 x + b_1) \cdots \bigr) + b_L = (W_L \cdots W_1) x + (b_L + W_L b_{L-1} + \cdots + W_L \cdots W_2 b_1) = w^{\mathsf T} \begin{bmatrix} x \\ 1 \end{bmatrix},$$
  where $w \in \mathbb{R}^{d+1}$ satisfies $w_{1:d}^{\mathsf T} = W_L \cdots W_1$ and $w_{d+1} = b_L + W_L b_{L-1} + \cdots + W_L \cdots W_2 b_1$.
  Oops, this is just a linear predictor.
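A quick numerical check of this collapse (a self-contained sketch with made-up sizes; output dimension 1, so the composed map is the single vector $w$ from the slide):

    import numpy as np

    rng = np.random.default_rng(1)
    dims = [4, 5, 3, 1]  # hypothetical sizes; last layer has scalar output
    Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
    bs = [rng.standard_normal(dims[i + 1]) for i in range(len(dims) - 1)]

    # Collapse: W = W_L ... W_1, b = b_L + W_L b_{L-1} + ... + W_L ... W_2 b_1.
    W_total, b_total = Ws[0], bs[0]
    for W, b in zip(Ws[1:], bs[1:]):
        W_total = W @ W_total
        b_total = W @ b_total + b

    x = rng.standard_normal(dims[0])
    deep = x
    for W, b in zip(Ws, bs):
        deep = W @ deep + b

    # The L-layer "linear network" agrees with a single affine predictor.
    assert np.allclose(deep, W_total @ x + b_total)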

  6. 2. Activations/nonlinearities.

  7-8. Iterated logistic regression
  Recall that logistic regression could be interpreted as a probability model:
  $$\Pr[Y = 1 \mid X = x] = \frac{1}{1 + \exp(-w^{\mathsf T} x)} =: \sigma_{\mathrm s}(w^{\mathsf T} x),$$
  where $\sigma_{\mathrm s}$ is the logistic or sigmoid function.
  [Figure: plot of the sigmoid function, increasing from 0 to 1 over the interval $[-6, 6]$.]
  Now suppose $\sigma_{\mathrm s}$ is applied coordinate-wise, and consider $x \mapsto (f_L \circ \cdots \circ f_1)(x)$ where $f_i(z) = \sigma_{\mathrm s}(W_i z + b_i)$.
  Don't worry, we'll slow down next slide; for now, iterated logistic regression gave our first deep network!
  Remark: can view intermediate layers as features to subsequent layers.

  9. Basic deep networks
  A self-contained expression is
  $$x \mapsto \sigma_L \Bigl( W_L \, \sigma_{L-1} \bigl( \cdots W_2 \, \sigma_1 (W_1 x + b_1) + b_2 \cdots \bigr) + b_L \Bigr),$$
  with equivalent "functional form" $x \mapsto (f_L \circ \cdots \circ f_1)(x)$ where $f_i(z) = \sigma_i(W_i z + b_i)$.
  Some further details (many more to come!):
  ◮ $(W_i)_{i=1}^{L}$ with $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$ are the weights, and $(b_i)_{i=1}^{L}$ are the biases.
  ◮ $(\sigma_i)_{i=1}^{L}$ with $\sigma_i : \mathbb{R}^{d_i} \to \mathbb{R}^{d_i}$ are called nonlinearities, or activations, or transfer functions, or link functions.
  ◮ This is only the basic setup; many things can and will change, please ask many questions!
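To make the functional form concrete, here is a small NumPy sketch (not from the slides) of the forward pass, with made-up layer sizes, ReLU on the hidden layers, and the identity on the last layer:

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def identity(z):
        return z

    def forward(x, Ws, bs, sigmas):
        """Compute sigma_L(W_L(... sigma_1(W_1 x + b_1) ...) + b_L)."""
        z = x
        for W, b, sigma in zip(Ws, bs, sigmas):
            z = sigma(W @ z + b)
        return z

    rng = np.random.default_rng(0)
    dims = [3, 8, 8, 2]                      # hypothetical d, d_1, d_2, d_3
    Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]
    bs = [np.zeros(dims[i + 1]) for i in range(3)]
    sigmas = [relu, relu, identity]          # identity on the last layer (see the next slide)

    print(forward(rng.standard_normal(dims[0]), Ws, bs, sigmas))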

  10. Choices of activation
  Basic form:
  $$x \mapsto \sigma_L \Bigl( W_L \, \sigma_{L-1} \bigl( \cdots W_2 \, \sigma_1 (W_1 x + b_1) + b_2 \cdots \bigr) + b_L \Bigr).$$
  Choices of activation (univariate, applied coordinate-wise):
  ◮ Indicator/step/Heaviside/threshold $z \mapsto \mathbf{1}[z \geq 0]$. This was the original choice (1940s!).
  ◮ Sigmoid $\sigma_{\mathrm s}(z) := \frac{1}{1 + \exp(-z)}$. This was popular roughly 1970s-2005?
  ◮ Hyperbolic tangent $z \mapsto \tanh(z)$. Similar to sigmoid, used during the same interval.
  ◮ Rectified Linear Unit (ReLU) $\sigma_{\mathrm r}(z) = \max\{0, z\}$. It (and slight variants, e.g., Leaky ReLU, ELU, . . . ) is the dominant choice now; popularized in the "ImageNet/AlexNet" paper (Krizhevsky-Sutskever-Hinton, 2012).
  ◮ Identity $z \mapsto z$; we'll often use this as the last layer when we use cross-entropy loss.
  ◮ NON-coordinate-wise choices: we will discuss "softmax" and "pooling" a bit later.
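For reference, a minimal sketch of these coordinate-wise activations in NumPy (NumPy applies them elementwise, matching the coordinate-wise convention above):

    import numpy as np

    # The coordinate-wise activations listed above.
    def step(z):      return (z >= 0).astype(float)    # indicator/step/Heaviside/threshold
    def sigmoid(z):   return 1.0 / (1.0 + np.exp(-z))  # logistic
    def tanh(z):      return np.tanh(z)                # hyperbolic tangent
    def relu(z):      return np.maximum(0.0, z)        # rectified linear unit
    def identity(z):  return z                         # common choice for the last layer

    z = np.linspace(-3, 3, 7)
    for f in (step, sigmoid, tanh, relu, identity):
        print(f.__name__, np.round(f(z), 3))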

  11. "Architectures" and "models"
  Basic form:
  $$x \mapsto \sigma_L \Bigl( W_L \, \sigma_{L-1} \bigl( \cdots W_2 \, \sigma_1 (W_1 x + b_1) + b_2 \cdots \bigr) + b_L \Bigr).$$
  $((W_i, b_i))_{i=1}^{L}$, the weights and biases, are the parameters. Let's roll them into $\mathcal{W} := ((W_i, b_i))_{i=1}^{L}$, and consider the network as a two-argument function $F_{\mathcal{W}}(x) = F(x; \mathcal{W})$.
  ◮ The model or class of functions is $\{ F_{\mathcal{W}} : \text{all possible } \mathcal{W} \}$. $F$ (both arguments unset) is also called an architecture.
  ◮ When we fit/train/optimize, typically we leave the architecture fixed and vary $\mathcal{W}$ to minimize risk. (More on this in a moment.)
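One way to see the architecture/parameter split in code (an illustrative sketch only): the architecture fixes the layer sizes and activations, while $\mathcal{W}$ is the collection of weights and biases that we vary.

    import numpy as np

    def init_params(dims, rng):
        """Sample one parameter setting W = ((W_i, b_i))_{i=1..L} for a fixed architecture."""
        Ws = [rng.standard_normal((dims[i + 1], dims[i])) * 0.1 for i in range(len(dims) - 1)]
        bs = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]
        return Ws, bs

    def F(x, params):
        """The architecture F; evaluating F(x; W) needs both an input and parameters."""
        Ws, bs = params
        z = x
        for W, b in zip(Ws[:-1], bs[:-1]):
            z = np.maximum(0.0, W @ z + b)   # ReLU on hidden layers
        return Ws[-1] @ z + bs[-1]           # identity on the last layer

    rng = np.random.default_rng(0)
    dims = [3, 16, 2]                        # hypothetical architecture: d = 3, one hidden layer, 2 outputs
    x = rng.standard_normal(3)
    print(F(x, init_params(dims, rng)))      # same architecture, different W,
    print(F(x, init_params(dims, rng)))      # hence different functions from the same class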

  12. ERM recipe for basic networks
  Standard ERM recipe:
  ◮ First we pick a class of functions/predictors; for deep networks, that means an $F(\cdot, \cdot)$.
  ◮ Then we pick a loss function and write down an empirical risk minimization problem; in these lectures we will pick cross-entropy:
  $$\operatorname*{arg\,min}_{\mathcal{W}} \; \frac{1}{n} \sum_{i=1}^{n} \ell_{\mathrm{ce}}\bigl( y_i, F(x_i; \mathcal{W}) \bigr)
  = \operatorname*{arg\,min}_{\substack{W_1 \in \mathbb{R}^{d_1 \times d},\, b_1 \in \mathbb{R}^{d_1} \\ \vdots \\ W_L \in \mathbb{R}^{d_L \times d_{L-1}},\, b_L \in \mathbb{R}^{d_L}}} \; \frac{1}{n} \sum_{i=1}^{n} \ell_{\mathrm{ce}}\bigl( y_i, F(x_i; ((W_i, b_i))_{i=1}^{L}) \bigr)
  = \operatorname*{arg\,min}_{\substack{W_1 \in \mathbb{R}^{d_1 \times d},\, b_1 \in \mathbb{R}^{d_1} \\ \vdots \\ W_L \in \mathbb{R}^{d_L \times d_{L-1}},\, b_L \in \mathbb{R}^{d_L}}} \; \frac{1}{n} \sum_{i=1}^{n} \ell_{\mathrm{ce}}\bigl( y_i, \sigma_L(\cdots \sigma_1(W_1 x_i + b_1) \cdots) \bigr).$$
  ◮ Then we pick an optimizer. In this class, we only use gradient descent variants. It is a miracle that this works.
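As an illustration of the recipe, here is a minimal training sketch using PyTorch (the slides do not prescribe a framework; the data, layer sizes, learning rate, and step count below are all made up):

    import torch
    from torch import nn

    # Toy data: n points in R^d with labels in {0, 1, 2}; in practice this is your training set.
    n, d, num_classes = 256, 10, 3
    X = torch.randn(n, d)
    y = torch.randint(0, num_classes, (n,))

    # The architecture F: two ReLU layers, identity on the last layer (logits).
    model = nn.Sequential(
        nn.Linear(d, 32), nn.ReLU(),
        nn.Linear(32, 32), nn.ReLU(),
        nn.Linear(32, num_classes),
    )

    loss_fn = nn.CrossEntropyLoss()                    # the cross-entropy loss ell_ce
    opt = torch.optim.SGD(model.parameters(), lr=0.1)  # a gradient descent variant

    for step in range(200):
        opt.zero_grad()
        loss = loss_fn(model(X), y)   # empirical risk (1/n) sum_i ell_ce(y_i, F(x_i; W))
        loss.backward()               # gradients with respect to all (W_i, b_i)
        opt.step()                    # one gradient step on W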

  13. Remark on affine expansion
  Note: we are writing
  $$x \mapsto \sigma_L \bigl( \cdots W_2 \, \sigma_1 (W_1 x + b_1) + b_2 \cdots \bigr),$$
  rather than
  $$x \mapsto \sigma_L \Bigl( \cdots W_2 \, \sigma_1 \Bigl( W_1 \begin{bmatrix} x \\ 1 \end{bmatrix} \Bigr) \cdots \Bigr).$$
  ◮ First form seems natural: with the "iterated linear prediction" perspective, it is natural to append 1 at every layer.
  ◮ Second form is sufficient: with ReLU, $\sigma_{\mathrm r}(1) = 1$, so we can pass the constant forward; similar (but more complicated) options exist for other activations.
  ◮ Why do we do it? It seems to make the optimization better behaved; this is currently not well understood.
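A small check of the equivalence being discussed (a sketch, not from the slides): folding the bias into an extra column of $W$ and appending a 1 to the input reproduces the affine layer, and since $\sigma_{\mathrm r}(1) = 1$ a ReLU layer can pass the constant forward to the next layer.

    import numpy as np

    rng = np.random.default_rng(0)
    W, b, x = rng.standard_normal((3, 4)), rng.standard_normal(3), rng.standard_normal(4)

    # Affine form W x + b  vs.  expanded form [W | b] [x; 1].
    W_aug = np.hstack([W, b[:, None]])
    x_aug = np.append(x, 1.0)
    assert np.allclose(W @ x + b, W_aug @ x_aug)

    # With ReLU, adding a row that outputs the constant 1 keeps the trick going layer to layer.
    relu = lambda z: np.maximum(0.0, z)
    W_carry = np.vstack([W_aug, np.append(np.zeros(4), 1.0)])  # last row reproduces the 1
    out = relu(W_carry @ x_aug)
    assert np.isclose(out[-1], 1.0)   # sigma_r(1) = 1, so the constant is passed forward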

  14-17. Which architecture?
  How do we choose an architecture?
  ◮ How did we choose $k$ in $k$-NN?
  ◮ Split the data into training and validation sets, train different architectures and evaluate them on validation, and choose the architecture with the lowest validation error. (A minimal sketch follows after this list.)
  ◮ As with other methods, this is a proxy for minimizing test error.
  Note.
  ◮ For many standard tasks (e.g., classification of standard vision datasets), people know good architectures.
  ◮ For new problems and new domains, things are absolutely not settled.
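A minimal sketch of architecture selection by validation error (illustrative only; the candidate hidden widths, data, and training budget are made up, and PyTorch is just one convenient choice):

    import torch
    from torch import nn

    torch.manual_seed(0)
    d, num_classes = 10, 3
    X, y = torch.randn(600, d), torch.randint(0, num_classes, (600,))
    X_tr, y_tr, X_val, y_val = X[:400], y[:400], X[400:], y[400:]   # train/validation split

    def train(width, epochs=200):
        model = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, num_classes))
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss_fn(model(X_tr), y_tr).backward()
            opt.step()
        return model

    def val_error(model):
        with torch.no_grad():
            return (model(X_val).argmax(dim=1) != y_val).float().mean().item()

    # Try a few candidate architectures (here: hidden widths) and keep the best on validation.
    candidates = [8, 32, 128]
    best = min(candidates, key=lambda w: val_error(train(w)))
    print("chosen hidden width:", best)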

  18. 3. What we have gained: representation power

  19. Sometimes, linear just isn't enough
  [Figure: two contour plots of decision values on the same 2-D dataset, left for a linear predictor and right for a ReLU network.]
  Linear predictor: $x \mapsto w^{\mathsf T} \begin{bmatrix} x \\ 1 \end{bmatrix}$. Some blue points misclassified.
  ReLU network: $x \mapsto W_2 \, \sigma_{\mathrm r}(W_1 x + b_1) + b_2$. 0 misclassifications!

  20-22. Classical example: XOR
  Classical "XOR problem" (Minsky-Papert '69). (Check Wikipedia for "AI Winter".)
  Theorem. On this data, any linear classifier (with affine expansion) makes at least one mistake.
  Picture proof. Recall: linear classifiers correspond to separating hyperplanes.
  ◮ If it splits the blue points, it's incorrect on one of them.
  ◮ If it doesn't split the blue points, then one halfspace contains the common midpoint, and is therefore wrong on at least one red point.
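By contrast, a two-layer ReLU network of the form $x \mapsto W_2 \, \sigma_{\mathrm r}(W_1 x + b_1) + b_2$ classifies XOR perfectly; here is one hand-picked set of weights (an illustrative example, not from the slides):

    import numpy as np

    # XOR data: label 1 iff exactly one coordinate is 1.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0])

    # Hand-picked weights for x -> W2 relu(W1 x + b1) + b2.
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([-0.5, -1.0])
    W2 = np.array([1.0, -2.0])
    b2 = -0.25

    scores = np.array([W2 @ np.maximum(0.0, W1 @ x + b1) + b2 for x in X])
    preds = (scores > 0).astype(int)
    print(preds, (preds == y).all())   # [0 1 1 0] True: zero mistakes, unlike any linear classifier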

  23. One layer was not enough. How about two?
  Theorem (Cybenko '89, Hornik-Stinchcombe-White '89, Funahashi '89, Leshno et al. '92, . . . ).
  Given any continuous function $f : \mathbb{R}^d \to \mathbb{R}$ and any $\epsilon > 0$, there exist parameters $(W_1, b_1, W_2)$ so that
  $$\sup_{x \in [0,1]^d} \bigl| f(x) - W_2 \, \sigma(W_1 x + b_1) \bigr| \leq \epsilon,$$
  as long as $\sigma$ is "reasonable" (e.g., ReLU or sigmoid or threshold).
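The theorem does not say how wide the hidden layer must be, but a one-dimensional sketch gives a feel for it (illustrative only; the target function, width, and grid are made up): fit a wide one-hidden-layer ReLU network to a continuous function on $[0, 1]$ by least squares on the output layer.

    import numpy as np

    f = lambda x: np.sin(2 * np.pi * x) + 0.5 * x          # a continuous target on [0, 1]

    m = 200                                                  # hidden width (chosen generously)
    x = np.linspace(0, 1, 1000)
    b1 = -np.linspace(0, 1, m)                               # hidden unit j "turns on" at x = j/m
    H = np.maximum(0.0, np.outer(x, np.ones(m)) + b1)        # relu(W1 x + b1) with W1 = all-ones
    W2, *_ = np.linalg.lstsq(H, f(x), rcond=None)            # fit only the output layer

    err = np.max(np.abs(f(x) - H @ W2))
    print(f"sup-norm error on the grid: {err:.4f}")          # small for large enough m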
