

SLIDE 1

15-388/688 - Practical Data Science: Deep learning

  • J. Zico Kolter

Carnegie Mellon University Fall 2019

1

SLIDE 2

Outline

  • Recent history in machine learning
  • Machine learning with neural networks
  • Training neural networks
  • Specialized neural network architectures
  • Deep learning in data science

2

SLIDE 3

Outline

  • Recent history in machine learning
  • Machine learning with neural networks
  • Training neural networks
  • Specialized neural network architectures
  • Deep learning in data science

3

SLIDE 4

AlexNet

“AlexNet” (Krizhevsky et al., 2012), the winning entry of the ImageNet 2012 competition, achieved a top-5 error rate of 15.3% (the next best system, based on highly engineered features, got 26.1% error)

4

SLIDE 5

AlphaGo

5

SLIDE 6

Google Translate

In November 2016, Google transitioned its translation service to a deep-learning-based system, dramatically improving translation quality in many settings

6

Before: “Kilimanjaro is 19,710 feet of the mountain covered with snow, and it is said that the highest mountain in Africa. Top of the west, “Ngaje Ngai” in the Maasai language, has been referred to as the house of God. The top close to the west, there is a dry, frozen carcass of a leopard. Whether the leopard had what the demand at that altitude, there is no that nobody explained.”

After: “Kilimanjaro is a mountain of 19,710 feet covered with snow and is said to be the highest mountain in Africa. The summit of the west is called “Ngaje Ngai” in Masai, the house of God. Near the top of the west there is a dry and frozen dead body of a leopard. No one has ever explained what leopard wanted at that altitude.”

https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html

SLIDE 7

Outline

  • Recent history in machine learning
  • Machine learning with neural networks
  • Training neural networks
  • Specialized neural network architectures
  • Deep learning in data science

7

SLIDE 8

Neural networks for machine learning

The term “neural network” largely refers to the hypothesis class part of a machine learning algorithm:

  • 1. Hypothesis: non-linear hypothesis function, which involves compositions of multiple linear operators (e.g. matrix multiplications) and elementwise non-linear functions
  • 2. Loss: “typical” loss functions for classification and regression: logistic, softmax (multiclass logistic), hinge, squared error, absolute error
  • 3. Optimization: gradient descent, or more specifically, a variant called stochastic gradient descent that we will discuss shortly

8

SLIDE 9

Linear hypotheses and feature learning

Until now, we have (mostly) considered machine learning algorithms that use a linear hypothesis class $h_\theta(x) = \theta^T \phi(x)$, where $\phi : \mathbb{R}^n \to \mathbb{R}^k$ denotes some set of typically non-linear features.

Examples: polynomials, radial basis functions, custom features like TF-IDF (in many domains, every 10 years or so there would be new feature types)

The performance of these algorithms depends crucially on coming up with good features.

Key question: can we come up with an algorithm that will automatically learn the features themselves?

9

SLIDE 10

Feature learning, take one

Instead of a simple linear classifier, let’s consider a two-stage hypothesis class where one linear function creates the features and another produces the final hypothesis:
$h_\theta(x) = W_2 \phi(x) + b_2 = W_2 (W_1 x + b_1) + b_2$, with $\theta = \{W_1 \in \mathbb{R}^{k \times n},\; b_1 \in \mathbb{R}^k,\; W_2 \in \mathbb{R}^{1 \times k},\; b_2 \in \mathbb{R}\}$

But there is a problem:
$h_\theta(x) = W_2 (W_1 x + b_1) + b_2 = \tilde{W} x + \tilde{b}$
i.e., we are still just using a linear classifier (the apparent added complexity is actually not changing the underlying hypothesis function).
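To see this collapse concretely, here is a small NumPy check (dimensions and random values are arbitrary) that the two-stage hypothesis equals a single linear map with $\tilde{W} = W_2 W_1$ and $\tilde{b} = W_2 b_1 + b_2$:

```python
import numpy as np

np.random.seed(0)
n, k = 5, 3                                  # input and "feature" dimensions (arbitrary)
W1, b1 = np.random.randn(k, n), np.random.randn(k)
W2, b2 = np.random.randn(1, k), np.random.randn(1)
x = np.random.randn(n)

two_stage = W2 @ (W1 @ x + b1) + b2          # h_theta(x) = W2 (W1 x + b1) + b2
W_tilde, b_tilde = W2 @ W1, W2 @ b1 + b2     # collapsed single linear map
collapsed = W_tilde @ x + b_tilde

print(np.allclose(two_stage, collapsed))     # True: still just a linear hypothesis
```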

10

SLIDE 11

Neural networks

Neural networks are a simple extension of this idea, where we additionally apply a non-linear function after each linear transformation:
$h_\theta(x) = g_2\left(W_2\, g_1\left(W_1 x + b_1\right) + b_2\right)$
where $g_1, g_2 : \mathbb{R} \to \mathbb{R}$ are non-linear functions (applied elementwise)

Common choices of $g_i$:

  • Hyperbolic tangent: $g(x) = \tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$
  • Sigmoid: $g(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$
  • Rectified linear unit (ReLU): $g(x) = \max(x, 0)$

11
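As a concrete sketch, the two-layer hypothesis can be written in a few lines of NumPy; the layer sizes, random weights, and choice of ReLU for the hidden layer are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0)

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

def two_layer_net(x, W1, b1, W2, b2, g1=relu, g2=lambda v: v):
    """h_theta(x) = g2(W2 g1(W1 x + b1) + b2)."""
    return g2(W2 @ g1(W1 @ x + b1) + b2)

# Example with arbitrary shapes: n=4 inputs, k=8 hidden units, 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)
print(two_layer_net(rng.normal(size=4), W1, b1, W2, b2))
```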

SLIDE 12

Illustrating neural networks

We can illustrate the form of neural networks using figures like the following. The middle layer $z$ is referred to as the hidden layer, or the activations. These are the learned features: nothing in the data prescribes what values they should take; it is left up to the algorithm to decide.

12

[Figure: two-layer network with inputs $x_1, \dots, x_n$, hidden activations $z_1, \dots, z_k$, output $y$, and parameters $W_1, b_1$ and $W_2, b_2$.]

SLIDE 13

Deep learning

“Deep learning” refers (almost always) to machine learning using neural network models with multiple hidden layers. Hypothesis function for a $k$-layer network:
$z_{i+1} = g_i\left(W_i z_i + b_i\right), \quad i = 1, \dots, k-1, \qquad z_1 = x, \qquad h_\theta(x) = z_k$
(note that $z_i$ here refers to a vector of activations, not an entry of a vector)

13

[Figure: deep network with layers $z_1 = x$, $z_2$, $z_3$, $z_4$, $z_5 = h_\theta(x)$, connected by weights and biases $W_1, b_1$ through $W_4, b_4$.]
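A minimal sketch of this forward pass as a loop over layers; the dimensions, random weights, and activation functions below are placeholders for illustration:

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Compute z_{i+1} = g_i(W_i z_i + b_i) with z_1 = x; the final z is h_theta(x)."""
    z = x
    for W, b, g in zip(weights, biases, activations):
        z = g(W @ z + b)
    return z

relu = lambda v: np.maximum(v, 0)
identity = lambda v: v

rng = np.random.default_rng(0)
dims = [4, 16, 16, 1]                 # x in R^4, two hidden layers, scalar output
weights = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
biases = [np.zeros(d) for d in dims[1:]]
activations = [relu, relu, identity]  # no non-linearity on the output layer

print(forward(rng.normal(size=4), weights, biases, activations))
```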

SLIDE 14

Properties of neural networks

A neural network with a single hidden layer (and enough hidden units) is a universal function approximator: it can approximate any function over the inputs. In practice this is not that relevant (similar to how polynomials can fit any function); the more important point is that neural networks appear to work very well in practice for many domains.

The hypothesis $h_\theta(x)$ is not a convex function of the parameters $\theta = \{W_i, b_i\}$, so we have the possibility of local optima.

Architectural choices (how many layers, how they are connected, etc.) become important algorithmic design choices (i.e. hyperparameters).

14

SLIDE 15

Why use deep networks

Motivation from circuit theory: many functions can be represented more efficiently using deep networks (e.g., the parity function requires $O(2^n)$ hidden units with a single hidden layer, but only $O(n)$ hidden units with $O(\log n)$ layers)

  • But it is not clear that deep learning really learns these types of networks

Motivation from biology: the brain appears to use multiple levels of interconnected neurons

  • But despite the name, the connection between neural networks and biology is extremely weak

Motivation from practice: works much better for many domains

  • Hard to argue with results

15

SLIDE 16

Why now?

16

  • Better models and algorithms
  • Lots of data
  • Lots of computing power

SLIDE 17

Poll: Benefits of deep networks

What advantages would you expect of applying a deep network to some machine learning problem versus a (pure) linear classifier?

  • 1. Less chance of overfitting data
  • 2. Can capture more complex prediction functions
  • 3. Better test set performance when the number of data points is small
  • 4. Better training set performance when number of data points is small
  • 5. Better test set performance when number of data points is large

17

SLIDE 18

Outline

  • Recent history in machine learning
  • Machine learning with neural networks
  • Training neural networks
  • Specialized neural network architectures
  • Deep learning in data science

18

SLIDE 19

Neural networks for machine learning

Hypothesis function: neural network

Loss function: “traditional” loss, e.g. logistic loss for binary classification:
$\ell\left(h_\theta(x), y\right) = \log\left(1 + \exp\left(-y \cdot h_\theta(x)\right)\right)$

Optimization: how do we solve the optimization problem
$\text{minimize}_\theta \;\; \sum_{i=1}^{m} \ell\left(h_\theta(x^{(i)}), y^{(i)}\right)$
Just use gradient descent as normal (or rather, a version called stochastic gradient descent)
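As a worked sketch of this loss and objective in NumPy (the hypothesis h below is a hypothetical stand-in for the network, and the toy data is made up):

```python
import numpy as np

def logistic_loss(pred, y):
    """ell(h_theta(x), y) = log(1 + exp(-y * h_theta(x))), with y in {-1, +1}."""
    return np.log1p(np.exp(-y * pred))

def total_loss(h, X, y):
    """Sum of per-example losses over all m examples: the training objective."""
    preds = np.array([h(x_i) for x_i in X])
    return logistic_loss(preds, y).sum()

# Example with a toy linear hypothesis
h = lambda x: x @ np.array([1.0, -2.0])
X, y = np.array([[1.0, 0.5], [0.2, 1.0]]), np.array([1, -1])
print(total_loss(h, X, y))
```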

19

SLIDE 20

Stochastic gradient descent

Key challenge for neural networks: we often have a very large number of samples, and computing gradients can be computationally intensive.

Traditional gradient descent computes the gradient with respect to the sum over all examples, then adjusts the parameters in this direction:
$\theta := \theta - \alpha \sum_{i=1}^{m} \nabla_\theta \ell\left(h_\theta(x^{(i)}), y^{(i)}\right)$

An alternative approach, stochastic gradient descent (SGD), adjusts the parameters based upon just one sample:
$\theta := \theta - \alpha \nabla_\theta \ell\left(h_\theta(x^{(i)}), y^{(i)}\right)$
and then repeats these updates for all samples

20

SLIDE 21

Gradient descent vs. SGD

Gradient descent, repeat:

  • For $i = 1, \dots, m$:
    $g^{(i)} \leftarrow \nabla_\theta \ell\left(h_\theta(x^{(i)}), y^{(i)}\right)$

  • Update parameters:
    $\theta \leftarrow \theta - \alpha \sum_{i=1}^{m} g^{(i)}$

Stochastic gradient descent, repeat:

  • For $i = 1, \dots, m$:
    $\theta \leftarrow \theta - \alpha \nabla_\theta \ell\left(h_\theta(x^{(i)}), y^{(i)}\right)$

In practice, stochastic gradient descent uses a small collection of samples, not just one, called a minibatch
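A minimal sketch of the two update rules; grad_loss is a hypothetical function returning $\nabla_\theta \ell(h_\theta(x^{(i)}), y^{(i)})$ for a single example, and the minibatch variant averages over each small batch (a common convention, not the only one):

```python
import numpy as np

def gradient_descent_step(theta, X, y, grad_loss, alpha):
    # Full-batch update: sum the per-example gradients, then take one step.
    g = sum(grad_loss(theta, X[i], y[i]) for i in range(len(y)))
    return theta - alpha * g

def sgd_epoch(theta, X, y, grad_loss, alpha, batch_size=32):
    # Minibatch SGD: shuffle the data, then update on each small batch.
    idx = np.random.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        g = sum(grad_loss(theta, X[i], y[i]) for i in batch) / len(batch)
        theta = theta - alpha * g
    return theta
```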

21

SLIDE 22

Computing gradients: backpropagation

So, how do we compute the gradient $\nabla_\theta \ell\left(h_\theta(x^{(i)}), y^{(i)}\right)$? Remember that $\theta$ here denotes a set of parameters, so we are really computing gradients with respect to all elements of that set.

This is accomplished via the backpropagation algorithm. We won't cover the algorithm in detail, but backpropagation is just an application of the (multivariate) chain rule from calculus, plus “caching” of intermediate terms that, for instance, occur in the gradients of both $W_1$ and $W_2$.
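To illustrate what backpropagation computes, here is a sketch of the chain rule applied by hand to a two-layer ReLU network with squared-error loss; this is a worked example under those assumptions, not the lecture's reference code:

```python
import numpy as np

def two_layer_backprop(x, y, W1, b1, W2, b2):
    """Gradients of 0.5*(h_theta(x) - y)^2 for h_theta(x) = W2 relu(W1 x + b1) + b2."""
    # Forward pass, caching intermediate values.
    a = W1 @ x + b1            # pre-activation
    z = np.maximum(a, 0)       # hidden activations
    pred = W2 @ z + b2         # scalar output (shape (1,))
    # Backward pass: chain rule, reusing the cached a and z.
    dpred = pred - y           # d loss / d pred
    dW2 = np.outer(dpred, z)   # d loss / d W2
    db2 = dpred
    dz = W2.T @ dpred          # propagate back through W2
    da = dz * (a > 0)          # back through the ReLU
    dW1 = np.outer(da, x)
    db1 = da
    return dW1, db1, dW2, db2
```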

22

SLIDE 23

Training neural networks in practice

The other piece of good news is that you will rarely need to implement backpropagation yourself. Many libraries provide methods that let you just specify the neural network “forward” pass and automatically compute the necessary gradients. Examples: TensorFlow, PyTorch. You'll use one of these a bit on the homework.
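For instance, a minimal PyTorch sketch (layer sizes, toy data, and learning rate are placeholder assumptions) in which the gradients are computed automatically by loss.backward():

```python
import torch
import torch.nn as nn

# A small fully connected network: we only specify the forward computation.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()                 # logistic-style loss for binary labels
opt = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.randn(100, 10)                         # toy data
y = (X[:, 0] > 0).float().unsqueeze(1)           # toy binary labels

for epoch in range(20):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()                              # gradients via backpropagation
    opt.step()
```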

23

SLIDE 24

Outline

  • Recent history in machine learning
  • Machine learning with neural networks
  • Training neural networks
  • Specialized neural network architectures
  • Deep learning in data science

24

SLIDE 25

Specialized architectures

Very little of the current wave of enthusiasm for deep learning has actually come from the simple “fully connected” neural network model we have seen so far. Instead, most of the excitement has come from two more specialized architectures: convolutional neural networks and recurrent neural networks.

25

SLIDE 26

The problem with fully-connected networks

A 256x256 (RGB) image means a ~200,000-dimensional input. A fully connected deep network would require a huge number of parameters and would be very likely to overfit the data. A generic deep network also doesn't capture any of the “natural” invariances we expect in images (location, scale).

26

[Figure: fully connected layers, where each entry of $z_{i+1}$ depends on every entry of $z_i$ through dense weights $W_i$.]

SLIDE 27

Convolutional neural networks

Constrain the weights: require that activations in the following layer be a “local” function of the previous layer, and share the weights across all locations. It is also common to use max-pooling layers that take the maximum over a region.

27

[Figure: convolutional connections, where $z_{i+1}$ is a local function of $z_i$ with shared weights $W_i$, and a max-pooling layer taking the max over a region of $z_i$.]

SLIDE 28

Convolutional networks in practice

In practice it is common to use “3D” convolutions to combine multiple channels, and to use multiple convolutions at each layer to create different features. Convolutions are still linear operations, so we can take gradients using backpropagation in much the same manner.

28

[Figure: multi-channel (“3D”) convolutions, with different filters $(W_i)_1, (W_i)_2$ producing different feature maps from $z_i$.]
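A PyTorch sketch of such layers: a 2D convolution over a 3-channel image producing 16 feature maps, followed by max-pooling (all sizes here are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Convolution: a local, weight-shared linear operation over all 3 input channels,
# producing 16 different feature maps; then max-pooling over 2x2 regions.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2)

x = torch.randn(1, 3, 256, 256)        # one 256x256 RGB image
features = pool(torch.relu(conv(x)))   # shape: (1, 16, 128, 128)
print(features.shape)
```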

SLIDE 29

Predicting sequential data

In practice, we often want to predict a sequence of outputs given a sequence of inputs. Just predicting each output independently would miss crucial information. There are many examples: time series forecasting, sentence labeling, part-of-speech tagging, etc.

29

SLIDE 30

Recurrent neural networks

Maintain state over time: activations are a function of the current input and the previous activations

30

[Figure: recurrent network unrolled over three time steps, with activations $z_1^{(t)}, z_2^{(t)}, z_3^{(t)}$, input weights $W_1$, recurrent weights $W_1^h$ carrying the hidden state forward in time, and output weights $W_2$.]

$z_{i+1}^{(t)} = g_i\left(W_i x^{(t)} + W_i^h z_{i+1}^{(t-1)} + b_i\right), \qquad h_\theta\left(x^{(t)}\right) = z_k^{(t)}$
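A minimal NumPy sketch of this recurrence for a single hidden layer (the tanh non-linearity and all shapes below are illustrative assumptions):

```python
import numpy as np

def rnn_forward(xs, W, Wh, b, W2, b2):
    """At each step t: z_t = tanh(W x_t + Wh z_{t-1} + b), output y_t = W2 z_t + b2."""
    z = np.zeros(W.shape[0])           # initial hidden state
    outputs = []
    for x_t in xs:                     # xs: sequence of input vectors
        z = np.tanh(W @ x_t + Wh @ z + b)
        outputs.append(W2 @ z + b2)
    return outputs

rng = np.random.default_rng(0)
n, k = 4, 8                            # input and hidden dimensions (arbitrary)
W, Wh, b = rng.normal(size=(k, n)), rng.normal(size=(k, k)), np.zeros(k)
W2, b2 = rng.normal(size=(1, k)), np.zeros(1)
print(rnn_forward([rng.normal(size=n) for _ in range(3)], W, Wh, b, W2, b2))
```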

SLIDE 31

Recurrent neural networks in practice

Traditional RNNs have trouble capturing long-term dependencies. It is more typical to use a more complex hidden unit and activation, called a long short-term memory (LSTM) network.

31

Figure from (Jozefowicz et al., 2015)

SLIDE 32

Outline

  • Recent history in machine learning
  • Machine learning with neural networks
  • Training neural networks
  • Specialized neural network architectures
  • Deep learning in data science

32

SLIDE 33

Deep learning in data science

What role does deep learning have to play in data science?

33

Data problems we would like to solve

[Chart: Unsolvable problems (50%); solvable problems (50%), of which problems that can use “simple” machine learning (45%) and problems that need, e.g., deep learning (5%).]

SLIDE 34

Deep learning in data science

What role does deep learning have to play in data science?

34

Data problems we would like to solve

[Chart: Unsolvable problems (50%); solvable problems (50%), of which problems that can use “simple” machine learning (45%) and problems that need, e.g., new deep learning (5%).]

SLIDE 35

Solving data science problems with deep learning

When you come up against some machine learning problem with “traditional” features (i.e., human-interpretable characteristics of the data), do not try to solve it by applying deep learning methods first. Use linear regression/classification, linear regression/classification with non-linear features, or gradient boosting methods instead. If these still don't solve your problem well and you can visualize the data in a way that lets you solve it “manually”, or if you really want to squeeze out a 1-2% improvement in performance, then you can apply deep learning.
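A scikit-learn sketch of that workflow, trying the simple baselines first; the data here is synthetic, and in practice X and y would be your traditional features and labels:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Stand-in data; replace with your own "traditional" feature matrix and labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for model in [LogisticRegression(max_iter=1000), GradientBoostingClassifier()]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(type(model).__name__, round(score, 3))
# Only if these baselines fall short would you reach for a deep network.
```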

35

SLIDE 36

The exceptions

However, it’s also undeniable that deep learning has made remarkable progress on structured data like images, audio, or text. For these types of data, you can use an already-trained network as a feature extractor (i.e., a way of mapping the data to some alternative, probably lower-dimensional, representation).

36

SLIDE 37

Example: Image processing with VGG

The VGG network (Simonyan and Zisserman, 2015) was trained on ImageNet 1000-way classification of images. Given a new image classification problem, take a pre-trained VGG network, take the representation at its last layer, and use it as features. You can also “fine-tune” the last few layers of a network to specialize it to a new task.
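A torchvision sketch of using a pre-trained VGG-16 as a feature extractor; the weight name and preprocessing constants follow common torchvision usage and should be treated as assumptions rather than part of the lecture:

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Load VGG-16 pre-trained on ImageNet and drop its final 1000-way classifier layer,
# so the network outputs a 4096-dimensional feature vector instead of class scores.
# (On older torchvision versions: models.vgg16(pretrained=True).)
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# img = PIL.Image.open("some_image.jpg")            # hypothetical input image
# with torch.no_grad():
#     features = vgg(preprocess(img).unsqueeze(0))  # shape: (1, 4096)
```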

37

[Table: VGG ConvNet configurations A–E (11 to 19 weight layers): stacks of conv3-64 / conv3-128 / conv3-256 / conv3-512 layers with maxpool between stacks, followed by FC-4096, FC-4096, FC-1000 and soft-max; input is a 224 × 224 RGB image. The ReLU activation function is not shown for brevity.]

Figure from Simonyan and Zisserman, 2015

SLIDE 38

Example: text processing with word2vec

word2vec (Mikolov et al., 2013) is a method developed for predicting surrounding words from a given word. To do so, it creates an “embedding” for every word that acts as a good surrogate for the things this word can mean; pre-trained versions are available. Bottom line: instead of using bag of words, use word2vec to get a vector representation of each word in a corpus.
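A sketch using the gensim library, assuming a pre-trained vector file (the filename is a placeholder for something like the Google News vectors):

```python
import numpy as np
from gensim.models import KeyedVectors

# Load pre-trained word2vec vectors (the file path is a placeholder).
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def document_vector(tokens):
    """Instead of bag-of-words: average the word2vec vectors of a document's words."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

print(wv.most_similar("king", topn=3))               # nearby words in embedding space
print(document_vector("the cat sat".split()).shape)  # one fixed-length vector per document
```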

38

!"#$ %&'(#)))))))))))'*+,-.#/+&))))))+(#'(# !"#01$ !"#02$ !"#32$ !"#31$

l architecture. The training objective i

Figure from Mikolov, et al., 2013

SLIDE 39

Example: text processing with BERT

BERT (Bidirectional Encoder Representations from Transformers; Devlin et al., 2018) trains a language model to predict missing elements of a sentence, and to predict whether one sentence follows another for pairs of sentences. At application time, this generic model can be fine-tuned for many other tasks such as question answering, sentence classification, etc.
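A sketch of fine-tuning BERT for sentence classification with the Hugging Face transformers library; the model name, toy data, and training details are placeholder assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["a great movie", "a terrible movie"]       # toy sentence-classification data
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)             # forward pass; loss is included
outputs.loss.backward()                             # fine-tune the whole model end to end
optimizer.step()
```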

39

Figure from Devlin, et al., 2018