

SLIDE 1

Machine Learning 2007: Lecture 8
Instructor: Tim van Erven (Tim.van.Erven@cwi.nl)
Website: www.cwi.nl/~erven/teaching/0708/ml/

October 31, 2007

SLIDE 2

Overview


  • Organisational Matters
  • Linear Functions as Inner Products
  • Neural Networks
      – The Perceptron
      – General Neural Networks
  • Gradient Descent
      – Convex Functions
      – Gradient Descent in One Variable
      – Gradient Descent in More Variables
      – Optimizing Perceptron Weights

SLIDE 4

Course Organisation


Final Exam:

  • You have to enroll for the final exam on tisvu (when possible).
  • The final exam will be more difficult than the intermediate exam.

Mitchell:

  • Read: Chapter 4, sections 4.1 – 4.4.

This Lecture:

  • Explanation of linear functions as inner products is needed to understand Mitchell.
  • Neural networks are in Mitchell. I have some extra pictures.
  • Convex functions are not discussed in Mitchell.
  • I will give more background on gradient descent.


SLIDE 8

Linear Functions as Inner Products


Linear Function:

hw(x) = w0 + w1x1 + . . . + wdxd

  • x = (x1, . . . , xd)⊤ is a d-dimensional feature vector.
  • w = (w0, w1, . . . , wd)⊤ is a (d+1)-dimensional weight vector.

As an Inner Product (a standard trick):

We may change x into a (d+1)-dimensional vector x′ by adding an imaginary extra feature x0, which always has value 1:

x = (x1, . . . , xd)⊤  ⇒  x′ = (1, x1, . . . , xd)⊤

hw(x) = ∑_{i=0}^{d} wi x′i = ⟨w, x′⟩

  • Mitchell writes w · x′ for ⟨w, x′⟩.
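
As a concrete illustration (a sketch of my own, assuming NumPy; all names are mine, not from the slides), the trick turns evaluating hw into a single inner product:

```python
import numpy as np

def linear_h(w, x):
    """Evaluate hw(x) = <w, x'> by prepending the constant feature x0 = 1."""
    x_prime = np.concatenate(([1.0], x))  # x' = (1, x1, ..., xd)
    return np.dot(w, x_prime)             # <w, x'> = w0 + w1*x1 + ... + wd*xd

w = np.array([0.5, -1.0, 2.0])   # (d+1)-dimensional weights (w0, w1, w2)
x = np.array([3.0, 1.0])         # d-dimensional feature vector
print(linear_h(w, x))            # 0.5 - 3.0 + 2.0 = -0.5
```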

SLIDE 10

Artificial Neurons


An Artificial Neuron:

An (artificial) neuron is some function h that gets a feature vector x as input and outputs a (single) label y.

The Perceptron:

The most famous type of (artificial) neuron is the perceptron:

hw(x) = 1 if w0 + w1x1 + . . . + wdxd > 0, and −1 otherwise.

  • Applies a threshold to a linear function of x.
  • Has parameters w.
SLIDE 13

Different Views of The Perceptron


Simple Neural Network (Mitchell’s drawing):

[Figure: a network with inputs x1, . . . , x4 feeding a single output neuron that produces y1.]

Equation:

hw(x) = 1 if w0 + w1x1 + . . . + wdxd > 0, and −1 otherwise.

  • One of the simplest neural networks consists of just one perceptron neuron.
  • A perceptron does classification.
  • The network has no hidden units, and just one output.
  • It may have any number of inputs.
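
Written out in Python (a sketch of my own, not code from the lecture), the perceptron is a one-liner on top of the linear function:

```python
import numpy as np

def perceptron(w, x):
    """Perceptron: threshold a linear function of x; w includes the bias w0."""
    activation = w[0] + np.dot(w[1:], x)  # w0 + w1*x1 + ... + wd*xd
    return 1 if activation > 0 else -1

w = np.array([-0.8, 0.5, 0.5])  # the AND weights from the next slide
print(perceptron(w, np.array([1, 1])))    # 1
print(perceptron(w, np.array([-1, 1])))   # -1
```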
SLIDE 15

Decision Boundary of the Perceptron


Decision boundary: w0 + w1x1 + . . . + wdxd = 0

  • This is where the perceptron changes its output y from −1 (-) to +1 (+) if we change x a little bit.
  • For d = 2 this decision boundary is always a line.

Representing Boolean Functions (−1 = false, 1 = true):

[Figure: decision boundaries in the (x1, x2)-plane for AND and OR.]

AND: w0 = −0.8, w1 = 0.5, w2 = 0.5
OR: w0 = 0.3, w1 = 0.5, w2 = 0.5 (wrong in Mitchell!)
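
A quick check (a sketch of my own) that these weights really implement AND and OR on inputs in {−1, 1}:

```python
import itertools

def perceptron2(w0, w1, w2, x1, x2):
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1

for x1, x2 in itertools.product([-1, 1], repeat=2):
    and_out = perceptron2(-0.8, 0.5, 0.5, x1, x2)
    or_out = perceptron2(0.3, 0.5, 0.5, x1, x2)
    print(f"x=({x1:+d},{x2:+d})  AND={and_out:+d}  OR={or_out:+d}")
# AND outputs +1 only for (+1,+1); OR outputs -1 only for (-1,-1).
```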


SLIDE 17

Perceptron Cannot Represent Exclusive Or


Exclusive Or:

[Figure: the four XOR points in the (x1, x2)-plane, with ‘+’ at (−1, 1) and (1, −1) and ‘−’ at (−1, −1) and (1, 1).]

  • There exists no line that separates the inputs with label ‘-’ from the inputs with label ‘+’. They are not linearly separable.
  • The decision boundary for the perceptron is always a line.
  • Hence a perceptron can never implement the ‘exclusive or’ function, whichever weights we choose!
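
No code can replace the geometric argument, but a random search (a sketch of my own, an illustration rather than a proof) shows how hopeless it is: among 100,000 random weight vectors, none classifies all four XOR points correctly.

```python
import numpy as np

rng = np.random.default_rng(0)
xor_data = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]

def classifies_all(w, data):
    return all((1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else -1) == y
               for (x1, x2), y in data)

hits = sum(classifies_all(rng.uniform(-5, 5, size=3), xor_data)
           for _ in range(100_000))
print(hits)  # 0: none of the sampled weight vectors implements XOR
```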


SLIDE 20

Artificial Neural Networks

[Figure: a network with inputs x1, . . . , x6 feeding hidden neurons, which feed output neurons producing y1, . . . , y4.]

  • We can create an (artificial) neural network (NN) by connecting neurons together.
  • We hook up our feature vector x to the input neurons in the network. We get a label vector y from the output neurons.
  • The parameters of the neural network w consist of all the parameters of the neurons in the network taken together in one big vector.
SLIDE 21

NN Example: ALVINN


[Figure: the ALVINN network — a 30×32 sensor input retina feeding 4 hidden units, which feed 30 output units ranging from Sharp Left through Straight Ahead to Sharp Right.]



SLIDE 25

Convex Functions


Intuition:

[Plot: x², which lies below the chord between any two points on its graph.]

  • A function is convex if it lies below the line between any two of its points. For example, the line between f(−3) and f(7).

Definition: A function f(x) is convex if

f(αx1 + (1 − α)x2) ≤ αf(x1) + (1 − α)f(x2)

for any two inputs x1, x2 and any 0 ≤ α ≤ 1.
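
The definition can be checked numerically (a sketch of my own; the tolerance guards against floating-point noise): for convex x² the inequality holds at every sampled α, while for x³ it fails.

```python
import numpy as np

def violates_convexity(f, x1, x2, num_alphas=101):
    """True if f(a*x1 + (1-a)*x2) > a*f(x1) + (1-a)*f(x2) for some sampled a."""
    alphas = np.linspace(0.0, 1.0, num_alphas)
    lhs = f(alphas * x1 + (1 - alphas) * x2)
    rhs = alphas * f(x1) + (1 - alphas) * f(x2)
    return bool(np.any(lhs > rhs + 1e-12))

print(violates_convexity(lambda x: x**2, -3.0, 7.0))   # False: x^2 is convex
print(violates_convexity(lambda x: x**3, -3.0, 1.0))   # True: x^3 is not convex
```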


SLIDE 31

Examples


Convex:

[Plots: eˣ and x².]

Not Convex:

[Plots: three non-convex functions, among them x³.]



SLIDE 35

Gradient Descent


  • Gradient descent is a method to find the minimum of a function: minx f(x).
  • It works for convex functions, but not for some other functions.

[Plot: a convex function f(x) with the starting point x1 marked.]

General Idea:

1. Pick some starting point x1.
2. Keep taking small steps downhill: f(x1) > f(x2) > f(x3) > . . .
3. Stop at the minimum.

SLIDE 39

Gradient Descent More Precisely


What is Downhill?

The derivative f′(x) points uphill, so downhill is −f′(x).

Step Size:

  • We multiply −f′(xn) by the learning rate η.
  • This controls the size of our steps.
  • If η is too big, we will walk past the minimum.
  • If η is too small, it will take very long before we get to the minimum.
  • There exist more advanced methods to choose your step size.

The Gradient Descent Algorithm:

1. Pick some starting point x1.
2. xn+1 = xn + ∆xn, where ∆xn = −η · f′(xn).
3. Stop when ∆xn is very small.
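
Translated directly into Python (a minimal sketch of my own; the function names, learning rate, and stopping threshold are my own choices):

```python
def gradient_descent_1d(f_prime, x1, eta=0.1, tol=1e-8, max_steps=10_000):
    """Minimize f by stepping against the derivative: x_{n+1} = x_n - eta*f'(x_n)."""
    x = x1
    for _ in range(max_steps):
        delta = -eta * f_prime(x)   # step downhill
        x += delta
        if abs(delta) < tol:        # stop when steps become very small
            break
    return x

# Minimize f(x) = x^2, whose derivative is f'(x) = 2x.
print(gradient_descent_1d(lambda x: 2 * x, x1=10.0))  # close to 0
```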


SLIDE 41

What Can Go Wrong?


Local minima:

[Plot: a non-convex function with a local minimum at x = 0, where gradient descent can get stuck.]

  • For some starting points, we may get stuck at a local minimum (x = 0 in figure).
  • Most important problem for gradient descent.
  • Convex functions do not have local minima!

No minimum exists:

[Plot: eˣ, which keeps decreasing as x decreases but never attains a minimum.]

  • The function may have no minima at all.
  • In that case gradient descent cannot find a minimum (of course).
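
To see the local-minimum problem concretely, here is a sketch of my own (the non-convex function is my own example, not the one in the figure): the same descent loop ends in different minima depending on the starting point.

```python
def gd(f_prime, x, eta=0.01, steps=5000):
    for _ in range(steps):
        x -= eta * f_prime(x)
    return x

# f(x) = x^4 + x^3 - 2x^2 has a global minimum near x = -1.44 and a
# merely local minimum near x = +0.69; where we end up depends on x1.
f_prime = lambda x: 4 * x**3 + 3 * x**2 - 4 * x

print(gd(f_prime, x=-2.0))  # about -1.44: the global minimum
print(gd(f_prime, x=+2.0))  # about +0.69: stuck in the local minimum
```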



SLIDE 44

The Gradient in Two Variables


One Variable:

  • Suppose g(x) is a function in one variable x.
  • Then we can take the derivative ∂g/∂x.

Two Variables:

  • But suppose f(x) is a function that takes a 2-dimensional vector x as input and outputs a scalar.
  • Does there exist something like the derivative of f with respect to x?
  • Yes, it is called the gradient:

Gradient: ∇f = (∂f/∂x1, ∂f/∂x2)⊤

  • Example: ∇(x1²x2 + x2) = (2x1x2, x1² + 1)⊤
  • Note that ∇f is a function that takes x as input (like f), but outputs a vector!
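
To double-check the example (a sketch of my own), compare the analytic gradient of f(x1, x2) = x1²x2 + x2 with a central-difference approximation:

```python
import numpy as np

def f(x):
    return x[0]**2 * x[1] + x[1]

def grad_f(x):
    return np.array([2 * x[0] * x[1], x[0]**2 + 1])  # analytic gradient

def numerical_grad(f, x, h=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)  # central difference
    return g

x = np.array([2.0, 3.0])
print(grad_f(x))             # [12.  5.]
print(numerical_grad(f, x))  # approximately [12.  5.]
```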
SLIDE 47

The Gradient in d Variables


Definition:

Suppose f is a function that takes a d-dimensional vector x as input and outputs a scalar. Then the gradient of f is

∇f = (∂f/∂x1, . . . , ∂f/∂xd)⊤

  • ∇f is a function that takes a d-dimensional vector x as input, just like f.
  • But ∇f also outputs a d-dimensional vector, unlike f.
  • For d = 1 the gradient is just the derivative.
  • The gradient is a generalisation of the derivative to higher dimensional inputs.

SLIDE 48

Gradient Examples


Examples on a 3-dimensional input vector x:

  f                 ∇f                          f at x = (1, 2, 3)⊤    ∇f at x = (1, 2, 3)⊤
  x1 + 2x2² − x3    (1, 4x2, −1)⊤               6                      (1, 8, −1)⊤
  x1x2x3²           (x2x3², x1x3², 2x1x2x3)⊤    18                     (18, 9, 12)⊤
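
The table can be verified numerically with the same central-difference idea (again a sketch of my own):

```python
import numpy as np

def f1(x): return x[0] + 2 * x[1]**2 - x[2]
def f2(x): return x[0] * x[1] * x[2]**2

def numerical_grad(f, x, h=1e-6):
    return np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                     for e in np.eye(len(x))])

x = np.array([1.0, 2.0, 3.0])
print(f1(x), f2(x))           # 6.0 18.0
print(numerical_grad(f1, x))  # ~ [ 1.  8. -1.]
print(numerical_grad(f2, x))  # ~ [18.  9. 12.]
```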

SLIDE 49

Gradient Descent in More Dimensions

[Plot: a surface f(x1, x2) over the (x1, x2)-plane.]

  • We can also use gradient descent to find the minimum of a function that takes a vector as input: minx f(x).
  • It is called gradient descent because it walks in the direction of minus the gradient.
  • It works for convex functions, but not for some other functions.

SLIDE 53

Gradient Descent in More Dimensions


What is Downhill?

It can be shown that the gradient ∇f(x) points in the direction of the steepest ascent at x, and that −∇f(x) points in the direction of the steepest descent.

Step Size:

  • We multiply −∇f(x) by the learning rate η.
  • This controls the size of our steps.

The Gradient Descent Algorithm:

1. Pick some starting point x1.
2. xn+1 = xn + ∆xn, where ∆xn = −η · ∇f(xn).
3. Stop when ∆xn is a very small vector.

  • Do not confuse ∆ (delta) and ∇ (the gradient).
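
The vector version looks almost identical in Python (a sketch of my own; applied here to f(x) = x1² + x2², whose gradient is 2x):

```python
import numpy as np

def gradient_descent(grad_f, x1, eta=0.1, tol=1e-8, max_steps=10_000):
    """Minimize f: step x_{n+1} = x_n - eta * grad_f(x_n) until steps are tiny."""
    x = np.asarray(x1, dtype=float)
    for _ in range(max_steps):
        delta = -eta * grad_f(x)
        x = x + delta
        if np.linalg.norm(delta) < tol:  # stop when the step vector is very small
            break
    return x

print(gradient_descent(lambda x: 2 * x, x1=[4.0, -3.0]))  # close to [0, 0]
```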


SLIDE 57

The Delta Rule


The idea: Given data D, use gradient descent to find perceptron weights that minimize the number of wrongly classified training examples in D.

A Problem:

  • The perceptron applies a threshold to a linear function.
  • This threshold makes the derivative/gradient undefined for some inputs.

Solution:

  • Minimize the sum of squared errors on D for the perceptron without the threshold.
  • Note that D is considered fixed: We are minimizing SSE(w, D) as a function of w.
  • The perceptron without the threshold is just a linear function hw(x) (also called a linear unit in NNs).
  • This is just linear regression!

SLIDE 60

Gradient Descent for Perceptron Weights


Remarks:

  • SSE(w, D) is a convex function of w.
  • To apply gradient descent we need to compute the gradient.
  • It will be convenient to minimize ½·SSE(w, D) instead of SSE(w, D).

Computing The Gradient:

We can compute the ith component of the gradient as follows (see Mitchell, Equation 4.6):

∂/∂wi [½·SSE(w, D)] = ∂/∂wi ½ ∑_{(y,x)∈D} (y − hw(x))² = ∑_{(y,x)∈D} (y − hw(x)) · (−xi)
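
Putting this together (a minimal sketch of my own, not code from the lecture: batch gradient descent on ½·SSE(w, D) for a linear unit; the AND data, learning rate, and step count are illustrative choices):

```python
import numpy as np

def delta_rule(X, y, eta=0.05, steps=1000):
    """Batch gradient descent on (1/2)*SSE for a linear unit hw(x) = <w, x'>."""
    Xp = np.hstack([np.ones((len(X), 1)), X])  # prepend x0 = 1 (bias trick)
    w = np.zeros(Xp.shape[1])
    for _ in range(steps):
        errors = y - Xp @ w        # y - hw(x) for every example
        grad = -Xp.T @ errors      # i-th component: sum of (y - hw(x)) * (-x'_i)
        w -= eta * grad / len(X)   # step against the gradient
    return w

# Train on the AND data from earlier; the sign of <w, x'> classifies correctly.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1], dtype=float)
w = delta_rule(X, y)
print(w)  # roughly [-0.5, 0.5, 0.5]
print(np.sign(np.hstack([np.ones((4, 1)), X]) @ w))  # [-1. -1. -1.  1.]
```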


SLIDE 62

References


  • S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
  • T.M. Mitchell, Machine Learning, McGraw-Hill, 1997.