

  1. CSC 311: Introduction to Machine Learning. Lecture 4 - Neural Networks. Roger Grosse, Chris Maddison, Juhan Bae, Silviu Pitis. University of Toronto, Fall 2020.

  2. Announcements Homework 2 is posted! Deadline Oct 14, 23:59.

  3. Overview Design choices so far:
  - task: regression, binary classification, multi-way classification
  - model: linear, logistic, hard-coded feature maps, feed-forward neural network
  - loss: squared error, 0-1 loss, cross-entropy
  - regularization: L_2, L_p, early stopping
  - optimization: direct solutions, linear programming, gradient descent (backpropagation)

  4. Neural Networks

  5. Inspiration: The Brain Neurons receive input signals and accumulate voltage; once the voltage crosses a threshold, they fire spiking responses. [Pic credit: www.moleculardevices.com]

  6. Inspiration: The Brain For neural nets, we use a much simpler model neuron, or unit. Compare with logistic regression: y = σ(w⊤x + b). By throwing together lots of these incredibly simplistic neuron-like processing units, we can do some powerful computations!

  7. Multilayer Perceptrons

  8. Multilayer Perceptrons We can connect lots of units together into a directed acyclic graph. Typically, units are grouped into layers. This gives a feed-forward neural network.

  9. Multilayer Perceptrons Each hidden layer i connects N_{i−1} input units to N_i output units. In a fully connected layer, all input units are connected to all output units. Note: the inputs and outputs for a layer are distinct from the inputs and outputs of the network. If we need to compute M outputs from N inputs, we can do so with matrix multiplication, using an M × N weight matrix. The outputs are a function of the input units: y = f(x) = φ(Wx + b), where φ is typically applied component-wise. A multilayer network consisting of fully connected layers is called a multilayer perceptron.
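
  To make the layer equation concrete, here is a minimal NumPy sketch of a fully connected layer y = φ(Wx + b). The shapes and the choice of a logistic φ are illustrative assumptions, not part of the slides.

```python
import numpy as np

def logistic(z):
    # Component-wise logistic activation: sigma(z) = 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def fully_connected(x, W, b, phi=logistic):
    # One fully connected layer: N inputs -> M outputs.
    # W has shape (M, N), b has shape (M,), and phi is applied component-wise.
    return phi(W @ x + b)

# Example: 3 inputs, 2 outputs (weights chosen arbitrarily for illustration).
x = np.array([1.0, -2.0, 0.5])
W = np.random.randn(2, 3)
b = np.zeros(2)
print(fully_connected(x, W, b))
```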

  10. Multilayer Perceptrons Some activation functions:
  - Identity: y = z
  - Rectified Linear Unit (ReLU): y = max(0, z)
  - Soft ReLU: y = log(1 + e^z)

  11. Multilayer Perceptrons Some activation functions:
  - Hard Threshold: y = 1 if z > 0, y = 0 if z ≤ 0
  - Logistic: y = 1 / (1 + e^(−z))
  - Hyperbolic Tangent (tanh): y = (e^z − e^(−z)) / (e^z + e^(−z))
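
  As a quick reference, here is a sketch of these six activation functions in NumPy; the function names are mine, chosen for readability.

```python
import numpy as np

def identity(z):       return z
def relu(z):           return np.maximum(0.0, z)
def soft_relu(z):      return np.log1p(np.exp(z))          # log(1 + e^z), a.k.a. softplus
def hard_threshold(z): return (z > 0).astype(float)        # 1 if z > 0, else 0
def logistic(z):       return 1.0 / (1.0 + np.exp(-z))
def tanh(z):           return np.tanh(z)

z = np.linspace(-3, 3, 7)
print(relu(z), logistic(z), sep="\n")
```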

  12. Multilayer Perceptrons Each layer computes a function, so the network computes a composition of functions:
  h^(1) = f^(1)(x) = φ(W^(1) x + b^(1))
  h^(2) = f^(2)(h^(1)) = φ(W^(2) h^(1) + b^(2))
  ...
  y = f^(L)(h^(L−1))
  Or more simply: y = f^(L) ∘ ··· ∘ f^(1)(x). Neural nets provide modularity: we can implement each layer's computations as a black box.
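
  A minimal sketch of this composition in NumPy. For simplicity it applies the same φ at every layer; in practice the last layer often uses a different output function (see the next slide). The layer sizes are arbitrary assumptions for illustration.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, params, phi=logistic):
    # params is a list of (W, b) pairs, one per layer.
    # Each layer is treated as a black box: h <- phi(W h + b).
    h = x
    for W, b in params:
        h = phi(W @ h + b)
    return h

# Example: a 4 -> 3 -> 2 network with randomly initialized weights.
rng = np.random.default_rng(0)
params = [(rng.standard_normal((3, 4)), np.zeros(3)),
          (rng.standard_normal((2, 3)), np.zeros(2))]
print(mlp_forward(np.ones(4), params))
```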

  13. Feature Learning Last layer:
  - If the task is regression: choose y = f^(L)(h^(L−1)) = (w^(L))⊤ h^(L−1) + b^(L)
  - If the task is binary classification: choose y = f^(L)(h^(L−1)) = σ((w^(L))⊤ h^(L−1) + b^(L))
  So neural nets can be viewed as a way of learning features.

  14. Feature Learning Suppose we're trying to classify images of handwritten digits. Each image is represented as a vector of 28 × 28 = 784 pixel values. Each first-layer hidden unit computes φ(w_i⊤ x). It acts as a feature detector. We can visualize w_i by reshaping it into an image. Here's an example that responds to a diagonal stroke.
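
  A sketch of that visualization step. It assumes W1 is the first-layer weight matrix of a trained digit classifier; the random placeholder weights and the use of matplotlib are my assumptions, not provided by the slides.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assume W1 has shape (num_hidden, 784): one 784-dimensional weight vector per hidden unit.
W1 = np.random.randn(16, 784)  # placeholder; in practice, use the trained weights

fig, axes = plt.subplots(4, 4, figsize=(6, 6))
for ax, w in zip(axes.ravel(), W1):
    ax.imshow(w.reshape(28, 28), cmap="gray")  # reshape each weight vector into an image
    ax.axis("off")
plt.show()
```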

  15. Feature Learning Here are some of the features learned by the first hidden layer of a handwritten digit classifier: Unlike hard-coded feature maps (e.g., in polynomial regression), features learned by neural networks adapt to patterns in the data.

  16. Expressivity In Lecture 3, we introduced the idea of a hypothesis space H, which is the set of input-output mappings that can be represented by some model. Suppose we are deciding between two models A, B with hypothesis spaces H_A, H_B. If H_B ⊆ H_A, then A is more expressive than B: A can represent any function f in H_B. Some functions (e.g., XOR) can't be represented by linear classifiers. Are deep networks more expressive?

  17. Expressivity—Linear Networks Suppose a layer's activation function were the identity, so the layer just computes an affine transformation of the input. ◮ We call this a linear layer. Any sequence of linear layers can be equivalently represented with a single linear layer: y = W^(3) W^(2) W^(1) x = W′ x, where W′ = W^(3) W^(2) W^(1). ◮ Deep linear networks can only represent linear functions. ◮ Deep linear networks are no more expressive than linear regression.
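
  A quick numerical check of this collapse (the random matrices are chosen just for illustration): composing three linear layers gives exactly the same map as the single matrix W′ = W3 W2 W1.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 5))
W2 = rng.standard_normal((3, 4))
W3 = rng.standard_normal((2, 3))
x = rng.standard_normal(5)

deep_linear = W3 @ (W2 @ (W1 @ x))   # three "layers" with identity activations
W_prime = W3 @ W2 @ W1               # the equivalent single linear layer
print(np.allclose(deep_linear, W_prime @ x))  # True
```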

  18. Expressive Power—Non-linear Networks Multilayer feed-forward neural nets with nonlinear activation functions are universal function approximators: they can approximate any function arbitrarily well, i.e., for any f : X → T there is a sequence f_i ∈ H with f_i → f. This has been shown for various activation functions (thresholds, logistic, ReLU, etc.). ◮ Even though ReLU is "almost" linear, it's nonlinear enough.

  19. Multilayer Perceptrons Designing a network to classify XOR: assume a hard threshold activation function.

  20. Multilayer Perceptrons
  - h_1 computes I[x_1 + x_2 − 0.5 > 0], i.e. x_1 OR x_2
  - h_2 computes I[x_1 + x_2 − 1.5 > 0], i.e. x_1 AND x_2
  - y computes I[h_1 − h_2 − 0.5 > 0] ≡ I[h_1 + (1 − h_2) − 1.5 > 0], i.e. h_1 AND (NOT h_2) = x_1 XOR x_2
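
  Here is a small sketch of this exact network, checked on all four inputs. The code layout is mine, but the weights and thresholds are the ones on the slide.

```python
def hard_threshold(z):
    # Indicator I[z > 0]
    return 1.0 if z > 0 else 0.0

def xor_net(x1, x2):
    h1 = hard_threshold(x1 + x2 - 0.5)   # x1 OR x2
    h2 = hard_threshold(x1 + x2 - 1.5)   # x1 AND x2
    y  = hard_threshold(h1 - h2 - 0.5)   # h1 AND (NOT h2) = x1 XOR x2
    return y

for x1 in (0.0, 1.0):
    for x2 in (0.0, 1.0):
        print(int(x1), int(x2), "->", int(xor_net(x1, x2)))
```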

  21. Expressivity Universality for binary inputs and targets: hard threshold hidden units, linear output. Strategy: 2^D hidden units, each of which responds to one particular input configuration. Only requires one hidden layer, though it needs to be extremely wide.
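
  A sketch of this construction for small D, under my reading of the strategy: one hidden unit per input configuration c ∈ {0,1}^D that fires only when x = c, and a linear output that sums the units whose configurations map to 1. The particular weight and bias choices below are one way to make a hard-threshold unit detect a single configuration; they are an illustrative assumption, not taken from the slides.

```python
import itertools
import numpy as np

def build_universal_net(D, target):
    # target: dict mapping each binary tuple of length D to 0 or 1.
    configs = list(itertools.product((0, 1), repeat=D))       # all 2^D configurations
    W = np.array([2 * np.array(c) - 1 for c in configs])      # +1 where c_i = 1, -1 where c_i = 0
    b = np.array([0.5 - sum(c) for c in configs])             # unit c fires only when x == c
    v = np.array([target[c] for c in configs], dtype=float)   # linear output weights
    def net(x):
        h = (W @ np.array(x) + b > 0).astype(float)           # hard threshold hidden layer
        return v @ h                                          # linear output
    return net

# Example: reproduce XOR on D = 2 binary inputs.
xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
net = build_universal_net(2, xor)
print([int(net(x)) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```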

  22. Expressivity What about the logistic activation function? You can approximate a hard threshold by scaling up the weights and biases: compare the plots of y = σ(x) and y = σ(5x). This is good: logistic units are differentiable, so we can train them with gradient descent.
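
  A tiny numerical illustration of the scaling argument (the scale factors beyond 5 are my own additions): as the weight grows, σ(kz) approaches the 0/1 step function away from z = 0.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-1.0, -0.1, 0.1, 1.0])
for k in (1, 5, 50):
    print(k, np.round(logistic(k * z), 3))
# As k increases, the outputs move toward 0 for z < 0 and toward 1 for z > 0.
```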

  23. Expressivity—What is it good for? Universality is not necessarily a golden ticket. ◮ You may need a very large network to represent a given function. ◮ How can you find the weights that represent a given function? Expressivity can be bad: if you can learn any function, overfitting is potentially a serious concern! ◮ Recall the polynomial feature mappings from Lecture 2. Expressivity increases with the degree M, eventually allowing multiple perfect fits to the training data. This motivated L_2 regularization. Do neural networks overfit, and how can we regularize them?

  24. Regularization and Overfitting for Neural Networks The topic of overfitting (when and how it happens, how to regularize, etc.) for neural networks is not well understood, even by researchers! ◮ In principle, you can always apply L_2 regularization. ◮ You will learn more in CSC413. A common approach is early stopping, i.e. stopping training early, because overfitting typically increases as training progresses. Unlike L_2 regularization, we don't add an explicit R(θ) term to our cost.
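
  A minimal sketch of early stopping as described here. The model object, train_one_epoch, validation_error, and the patience rule are all hypothetical placeholders for whatever training and evaluation code you have; the slide only says to stop training before overfitting sets in.

```python
def train_with_early_stopping(model, max_epochs=100, patience=5):
    # model, train_one_epoch, and validation_error are hypothetical placeholders.
    # Stop once validation error has not improved for `patience` consecutive epochs.
    best_error, best_weights, epochs_since_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)             # one pass of gradient descent over the training set
        error = validation_error(model)    # error on held-out data
        if error < best_error:
            best_error, best_weights = error, model.get_weights()
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                break                      # overfitting is likely increasing; stop training
    model.set_weights(best_weights)        # keep the weights with the best validation error
    return model
```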

  25. Training neural networks with backpropagation

  26. Recap: Gradient Descent Recall: gradient descent moves opposite the gradient (the direction of steepest descent). Weight space for a multilayer neural net has one coordinate for each weight or bias of the network, in all the layers. Conceptually, this is no different from what we've seen so far, just higher dimensional and harder to visualize! We want to define a loss L and compute the gradient of the cost, dJ/dw, which is the vector of partial derivatives. ◮ This is the average of dL/dw over all the training examples, so in this lecture we focus on computing dL/dw.
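
  A toy illustration of the relationship between dJ/dw and dL/dw. The 1-D linear model with squared error is my choice, made only to keep the example self-contained; the point is that the cost gradient is the average of the per-example loss gradients, and a descent step moves opposite it.

```python
import numpy as np

# Per-example loss L = 0.5 * (w*x - t)^2; cost J = average of L over the training set.
rng = np.random.default_rng(0)
x = rng.standard_normal(10)
t = rng.standard_normal(10)
w = 0.3

per_example_grads = (w * x - t) * x        # dL/dw for each training example
cost_grad = per_example_grads.mean()       # dJ/dw is the average over examples

# One gradient descent step moves opposite the gradient.
alpha = 0.1
w_new = w - alpha * cost_grad
print(cost_grad, w_new)
```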

  27. Univariate Chain Rule Let's now look at how we compute gradients in neural networks. We've already been using the univariate Chain Rule. Recall: if f(x) and x(t) are univariate functions, then d/dt f(x(t)) = (df/dx)(dx/dt).

  28. Univariate Chain Rule Recall the univariate logistic least squares model:
  z = wx + b
  y = σ(z)
  L = (1/2)(y − t)^2
  Let's compute the loss derivatives ∂L/∂w, ∂L/∂b.
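
  As a preview of that computation, here is a sketch that applies the univariate chain rule to this model and checks the result numerically. The derivative expressions (dL/dy = y − t, dy/dz = σ(z)(1 − σ(z)), dz/dw = x, dz/db = 1) follow from the definitions above; the finite-difference check is my addition.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(w, b, x, t):
    # Forward pass.
    z = w * x + b
    y = logistic(z)
    L = 0.5 * (y - t) ** 2
    # Backward pass via the univariate chain rule.
    dL_dy = y - t
    dy_dz = y * (1 - y)          # derivative of the logistic function
    dL_dz = dL_dy * dy_dz
    dL_dw = dL_dz * x            # dz/dw = x
    dL_db = dL_dz * 1.0          # dz/db = 1
    return L, dL_dw, dL_db

# Check against a finite-difference approximation at an arbitrary point.
w, b, x, t, eps = 0.4, -0.2, 1.5, 1.0, 1e-6
L, dL_dw, dL_db = loss_and_grads(w, b, x, t)
num_dw = (loss_and_grads(w + eps, b, x, t)[0] - L) / eps
print(dL_dw, num_dw)   # should agree to several decimal places
```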
