BBM406 Fundamentals of Machine Learning Lecture 11: Multi-layer - PowerPoint PPT Presentation

Image: Jose-Luis Olivares BBM406 Fundamentals of   Machine Learning Lecture 11: Multi-layer Perceptron Forward Pass Aykut Erdem // Hacettepe University // Fall 2019

        Last time… Linear Discriminant Function • Linear discriminant function for a vector x   y ( x ) = w T x + w 0 where w is called weight vector, and w 0 is a bias. • The classification function is   C ( x ) = sign ( w T x + w 0 ) where step function sign(·) is defined as ( + 1 , a > 0 sign ( a ) = − 1 , a < 0 slide by Ce Liu 3

Last time… Properties of Linear Discriminant Functions • The decision surface, shown in red, x 2 y > 0 is perpendicular to w , and its y = 0 displacement from the origin is R 1 y < 0 controlled by the bias paramete r R 2 w 0 .   • The signed orthogonal distance of x w a general point x from the decision y ( x ) k w k surface is given by y ( x )/|| w ||   x ? • y ( x ) gives a signed measure of the x 1 perpendicular distance r of the � w 0 point x from the decision surface k w k • y( x ) = 0 for x on the decision surface. The normal distance from the origin to the decision surface is w T x k w k = � w 0 k w k slide by Ce Liu • So w 0 determines the location of the decision surface. the decision surface. 4

Last time… Multiple Classes: Simple Extension • One-versus-the-rest classifier: classify C k and samples not in C k . • One-versus-one classifier: classify every pair of classes. C 3 C 1 ? R 1 R 3 R 1 C 1 ? R 2 C 3 C 1 R 2 R 3 C 2 C 2 not C 1 slide by Ce Liu C 2 not C 2 5

Last time… Multiple Classes: K-Class Discriminant • A single K -class discriminant comprising K linear functions y k ( x ) = w T k x + w k 0 • Decision function C ( x ) = k , if y k ( x ) > y j ( x ) 8 j 6 = k • The decision boundary between class C k and C j is given by y k ( x ) = y j ( x ) C C ( w k � w j ) T x + ( w k 0 � w j 0 ) = 0 slide by Ce Liu 6

  Last time… Fisher’s Linear Discriminant • Pursue the optimal linear projection on which the two classes can be maximally separated   y = w T x A way to view a linear • The mean vectors of the two classes   classification model is in terms of m 1 = 1 m 2 = 1 X X dimensionality x n , x n N 1 N 2 reduction. n ∈ C 1 n ∈ C 2 Within-class variance = w T S B w J ( w ) = Between-class variance w T S W w 4 4 2 2 0 0 − 2 − 2 slide by Ce Liu − 2 2 6 − 2 2 6 Di ff erence of means Fisher’s Linear Discriminant 7

8 Last time… Linear classification slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Interactive web demo time…. slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson http://vision.stanford.edu/teaching/cs231n/linear-classify-demo/ 11

Last time… Perceptron x n x 1 x 2 x 3 . . . w n w 1 synaptic weights output X slide by Alex Smola f ( x ) = w i x i = h w, x i i 12

This Week • Multi-layer perceptron   • Forward Pass • Backward Pass   13

Introduction 14

A brief history of computers 1970s 1980s 1990s 2000s 2010s Data 10 2 10 3 10 11 10 5 10 8 RAM ? 1MB 100MB 10GB 1TB CPU ? 10MF 1GF 100GF 1PF GPU deep kernel   deep • Data grows   nets methods nets at higher exponent • Moore’s law (silicon) vs. Kryder’s law (disks) slide by Alex Smola • Early algorithms data bound, now CPU/RAM bound 15

Not linearly separable data • Some datasets are not linearly separable ! - e.g. XOR problem   • Nonlinear separation is trivial slide by Alex Smola 16

Addressing non-linearly separable data • Two options: - Option 1: Non-linear features - Option 2: Non-linear classifiers slide by Dhruv Batra 17

Option 1 — Non-linear features • Choose non-linear features, e.g., - Typical linear features: w 0 + Σ i w i x i - Example of non-linear features: • Degree 2 polynomials, w 0 + Σ i w i x i + Σ ij w ij x i x j • Classifier h w ( x ) still linear in parameters w - As easy to learn - Data is linearly separable in higher dimensional spaces - Express via kernels slide by Dhruv Batra 18

Option 2 — Non-linear classifiers • Choose a classifier h w ( x ) that is non-linear in parameters w , e.g., - Decision trees, neural networks,… • More general than linear classifiers • But, can often be harder to learn (non-convex optimization required) • Often very useful (outperforms linear classifiers) • In a way, both ideas are related slide by Dhruv Batra 19

Biological Neurons • Soma (CPU)   Cell body - combines signals   • Dendrite (input bus)   Combines the inputs from   several other nerve cells   • Synapse (interface)   Interface and parameter store between neurons   • Axon (cable)   May be up to 1m long and will transport the slide by Alex Smola activation signal to neurons at di ff erent locations 20

Recall: The Neuron Metaphor • Neurons - accept information from multiple inputs, - transmit information to other neurons. • Multiply inputs by weights along edges • Apply some function to the set of inputs at each node slide by Dhruv Batra 21

Types of Neuron 1 θ 1 θ 0 θ 2 f ( ~ x, ✓ ) X θ D y = θ 0 + x i θ i i Linear Neuron slide by Dhruv Batra 22

Types of Neuron 1 θ 1 θ 0 θ 2 f ( ~ x, ✓ ) X θ D y = θ 0 + x i θ i i Linear Neuron 1 θ 1 θ 0 θ 2 f ( ~ x, ✓ ) X z = θ 0 + x i θ i θ D slide by Dhruv Batra i ⇢ 1 if z ≥ 0 Perceptron y = 23 0 otherwise

Types of Neuron X 1 z = θ 0 + x i θ i θ 1 θ 0 i 1 y = θ 2 f ( ~ x, ✓ ) 1 + e − z 1 X θ D y = θ 0 + x i θ i θ 1 θ 0 i Linear Neuron θ 2 f ( ~ x, ✓ ) 1 θ D θ 1 θ 0 Logistic Neuron θ 2 f ( ~ x, ✓ ) X z = θ 0 + x i θ i θ D slide by Dhruv Batra i ⇢ 1 if z ≥ 0 Perceptron y = 24 0 otherwise

Types of Neuron X 1 z = θ 0 + x i θ i θ 1 θ 0 i 1 y = θ 2 f ( ~ x, ✓ ) 1 + e − z 1 X θ D y = θ 0 + x i θ i θ 1 θ 0 i Linear Neuron θ 2 f ( ~ x, ✓ ) 1 θ D θ 1 θ 0 Logistic Neuron θ 2 f ( ~ x, ✓ ) • Potentially more. Requires a convex X z = θ 0 + x i θ i θ D loss function for gradient descent slide by Dhruv Batra i training. ⇢ 1 if z ≥ 0 Perceptron y = 25 0 otherwise

Limitation • A single “neuron” is still a linear decision boundary • What to do? • Idea: Stack a bunch of them together! slide by Dhruv Batra 26

Nonlinearities via Layers • Cascade neurons together • The output from one layer is the input to the next • Each layer has its own sets of weights y 1 i = k ( x i , x ) Kernels y 1 i ( x ) = σ ( h w 1 i , x i ) y 2 ( x ) = σ ( h w 2 , y 1 i ) Deep Nets slide by Alex Smola optimize all weights 27

Nonlinearities via Layers y 1 i ( x ) = σ ( h w 1 i , x i ) y 2 i ( x ) = σ ( h w 2 i , y 1 i ) y 3 ( x ) = σ ( h w 3 , y 2 i ) slide by Alex Smola 28

    Representational Power • Neural network with at least one hidden layer is a universal approximator (can represent any function).   Proof in: Approximation by Superpositions of Sigmoidal Function, Cybenko, paper   slide by Raquel Urtasun, Richard Zemel, Sanja Fidler • The capacity of the network increases with more hidden units and more hidden layers 29

A simple example • Consider a neural network 0 1 2 3 4 5 6 7 8 9 with two layers of neurons. - neurons in the top layer represent known shapes. - neurons in the bottom layer represent pixel intensities.   • A pixel gets to vote if it has ink on it. - Each inked pixel can vote for several di ff erent shapes.   𝑦 � 𝑦 � • The shape that gets the ¡ ¡𝑔(∑𝑥 � 𝑦 � ) slide by Geoffrey Hinton 𝑦 � most votes wins. … … 𝑦 � 30 �

How to display the weights 1 2 3 4 5 6 7 8 9 0 The input image Give each output unit its own “map” of the input image and display the weight coming from each pixel in the location of that pixel in the map. Use a black or white blob with the area representing the slide by Geoffrey Hinton magnitude of the weight and the color representing the sign. 31

How to learn the weights 1 2 3 4 5 6 7 8 9 0 The image Show the network an image and increment the weights from active pixels to the correct class. Then decrement the weights from active pixels to whatever class the network guesses. slide by Geoffrey Hinton 32

1 2 3 4 5 6 7 8 9 0 The image slide by Geoffrey Hinton 33

BBM406 Fundamentals of Machine Learning Lecture 11: Multi-layer - PowerPoint PPT Presentation

Image: Jose-Luis Olivares BBM406 Fundamentals of Machine Learning Lecture 11: Multi-layer Perceptron Forward Pass Aykut Erdem // Hacettepe University // Fall 2019 Last time Linear Discriminant Function Linear

BBM406 Fundamentals of Machine Learning Lecture 1: Course outline and logistics An overview

BBM406 Fundamentals of Machine Learning Lecture 23: Dimensionality Reduction Aykut Erdem //

BBM406 Fundamentals of Machine Learning Lecture 6: Learning theory Probability Review Aykut

BBM406 Fundamentals of Machine Learning Lecture 18: Decision Trees Aykut Erdem // Hacettepe

BBM406 Fundamentals of Machine Learning Lecture 9: Logistic Regression Discriminative vs.

BBM406 Fundamentals of Machine Learning Lecture 13: Introduction to Deep Learning Aykut

BBM406 Fundamentals of Machine Learning Lecture 7: Probability Review (contd.) Maximum

BBM406 Fundamentals of Machine Learning Lecture 10: Linear Discriminant Functions Perceptron

BBM406 Fundamentals of Machine Learning Lecture 19: What is Ensemble Learning? Bagging

BBM406 Fundamentals of Machine Learning Lecture 8: Maximum a Posteriori (MAP) Nave Bayes

BBM406 Fundamentals of Machine Learning Lecture 2: Machine Learning by Examples, Nearest

BBM406 Fundamentals of Machine Learning Lecture 20: AdaBoost Aykut Erdem // Hacettepe

BBM406 Fundamentals of Machine Learning Lecture 15: Support Vector Machines Aykut Erdem //

BBM406 Fundamentals of Machine Learning Lecture 17: Kernel Trick for SVMs Risk and Loss

BBM406 Fundamentals of Machine Learning Lecture 12: Computational Graph Backpropagation

BBM406 Fundamentals of Machine Learning Lecture 14: Deep Convolutional Networks Aykut Erdem

Nerve cell model and asymptotic expansion Yasushi ISHIKAWA [Department of Mathematics, Ehime

Out line Neural net wor ks Percept r on Neural Net works Supervised learning

Cognitive Neuroscience Philipp Koehn 7 February 2019 Philipp Koehn Artificial Intelligence:

Buddhas Brain: Strengthening the Neural Foundations of Mindfulness and Compassion Leading

On Fourier and Wavelets: On Fourier and Wavelets: Representation, Approximation and

Data Centric Systems and Networking (DCSN) Session 1: Introduction to R212 Eiko Yoneki Systems

Machine Learning and Rendering Alex Keller, Director of Research Machine Learning and Rendering

Carol Lambe Head of Commissioning and Delivery HFCCG In 2013/14 an extensive review of MSK