

SLIDE 1

CS 188: Artificial Intelligence
Optimization and Neural Nets

Instructors: Pieter Abbeel and Dan Klein --- University of California, Berkeley

[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Reminder: Linear Classifiers

§ Inputs are feature values
§ Each feature has a weight
§ Sum is the activation: activation_w(x) = Σ_i w_i · f_i(x)
§ If the activation is:
  § Positive, output +1
  § Negative, output -1

[Diagram: features f1, f2, f3 weighted by w1, w2, w3, summed (Σ), and tested against the threshold >0?]
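To make the picture concrete, here is a minimal sketch (not from the slides) of the classifier just described: the activation is the dot product w · f(x), and its sign decides the output.

```python
def classify(weights, features):
    """Linear classifier: output +1 if w . f(x) > 0, else -1."""
    activation = sum(w * f for w, f in zip(weights, features))
    return +1 if activation > 0 else -1

# Example with three features f1, f2, f3 and weights w1, w2, w3:
print(classify([0.5, -1.0, 2.0], [1.0, 2.0, 1.0]))  # activation = 0.5 -> +1
```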

How to get probabilistic decisions?

§ Activation: z = w · f(x)
§ If very positive → want probability going to 1
§ If very negative → want probability going to 0
§ Sigmoid function:

φ(z) = 1 / (1 + e^(−z))
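A quick sketch of the sigmoid, illustrating the limiting behavior described above (the input values are just illustrative):

```python
import math

def sigmoid(z):
    """phi(z) = 1 / (1 + e^(-z)): squashes any activation into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(10))   # ~0.99995: very positive activation -> probability near 1
print(sigmoid(-10))  # ~0.00005: very negative activation -> probability near 0
print(sigmoid(0))    # 0.5: right on the decision boundary
```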

Best w?

§ Maximum likelihood estimation:

max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)

with:

P(y^(i) = +1 | x^(i); w) = 1 / (1 + e^(−w · f(x^(i))))
P(y^(i) = −1 | x^(i); w) = 1 − 1 / (1 + e^(−w · f(x^(i))))

= Logistic Regression
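A minimal sketch of the objective above, assuming labels in {+1, −1}; it uses the identity P(y | x; w) = φ(y · (w · f(x))), which covers both cases of the definition:

```python
import math

def log_likelihood(w, data):
    """ll(w) = sum_i log P(y_i | x_i; w) for (features, label) pairs in `data`.

    For y in {+1, -1}, both cases above collapse to sigmoid(y * (w . f(x))),
    since 1 - sigmoid(z) = sigmoid(-z).
    """
    total = 0.0
    for features, y in data:
        activation = sum(wi * fi for wi, fi in zip(w, features))
        total += math.log(1.0 / (1.0 + math.exp(-y * activation)))
    return total
```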

SLIDE 2

Multiclass Logistic Regression

§ Multi-class linear classification

§ A weight vector for each class: w_y
§ Score (activation) of a class y: w_y · f(x)
§ Prediction w/ highest score wins: y = argmax_y w_y · f(x)

§ How to make the scores into probabilities?

original activations → softmax activations:

z1, z2, z3 → ( e^(z1) / (e^(z1) + e^(z2) + e^(z3)), e^(z2) / (e^(z1) + e^(z2) + e^(z3)), e^(z3) / (e^(z1) + e^(z2) + e^(z3)) )
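A short sketch of the softmax transformation above (the max-shift is a standard numerical-stability trick, not something the slide mentions):

```python
import math

def softmax(scores):
    """Map activations z1..zk to probabilities e^(z_i) / sum_j e^(z_j)."""
    m = max(scores)  # shifting by the max leaves the result unchanged
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))  # ~[0.66, 0.24, 0.10], sums to 1
```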

Best w?

§ Maximum likelihood estimation:

max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)

with:

P(y^(i) | x^(i); w) = e^(w_y^(i) · f(x^(i))) / Σ_y e^(w_y · f(x^(i)))

= Multi-Class Logistic Regression
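As a sketch, the multiclass objective can be computed directly from the definition above; `W` here is a hypothetical dict mapping each class label to its weight vector:

```python
import math

def multiclass_log_likelihood(W, data):
    """ll(w) = sum_i [ w_{y_i} . f(x_i) - log sum_y e^(w_y . f(x_i)) ]."""
    total = 0.0
    for features, y in data:
        scores = {c: sum(wi * fi for wi, fi in zip(w, features))
                  for c, w in W.items()}
        log_norm = math.log(sum(math.exp(s) for s in scores.values()))
        total += scores[y] - log_norm
    return total
```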

This Lecture

§ Optimization

§ i.e., how do we solve:

max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)

Hill Climbing

§ Recall from CSPs lecture: simple, general idea

§ Start wherever
§ Repeat: move to the best neighboring state
§ If no neighbors better than current, quit

§ What’s particularly tricky when hill-climbing for multiclass logistic regression?

  • Optimization over a continuous space
  • Infinitely many neighbors!
  • How to do this efficiently?
SLIDE 3

1-D Optimization

§ Could evaluate g(w0 + h) and g(w0 − h)
§ Then step in best direction

§ Or, evaluate derivative:

∂g(w0)/∂w = lim_{h→0} [g(w0 + h) − g(w0 − h)] / 2h

§ Tells which direction to step in

[Figure: curve g(w) with the points g(w0 − h), g(w0), and g(w0 + h) marked along the w axis]
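The centered-difference formula above translates directly into code; a sketch with an illustrative step size h:

```python
def numerical_derivative(g, w0, h=1e-5):
    """Central-difference estimate: (g(w0 + h) - g(w0 - h)) / 2h."""
    return (g(w0 + h) - g(w0 - h)) / (2 * h)

# Example: d/dw of w^2 at w0 = 3 is 6.
print(numerical_derivative(lambda w: w ** 2, 3.0))  # ~6.0
```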

2-D Optimization

Source: offconvex.org

Gradient Ascent

§ Perform update in uphill direction for each coordinate
§ The steeper the slope (i.e. the higher the derivative), the bigger the step for that coordinate
§ E.g., consider a function g(w1, w2)

§ Updates:
  w1 ← w1 + α · ∂g/∂w1(w1, w2)
  w2 ← w2 + α · ∂g/∂w2(w1, w2)

§ Updates in vector notation: w ← w + α · ∇_w g(w)
  with: ∇_w g(w) = [∂g/∂w1(w), ∂g/∂w2(w)]ᵀ = gradient

§ Idea:
  § Start somewhere
  § Repeat: take a step in the gradient direction

Gradient Ascent

Figure source: Mathworks

SLIDE 4

What is the Steepest Direction?

§ First-Order Taylor Expansion:

g(w + Δ) ≈ g(w) + ∂g/∂w1 Δ1 + ∂g/∂w2 Δ2

§ Steepest ascent direction: maximize this expansion over all Δ with ‖Δ‖ ≤ ε
§ Recall: Δ · a is maximized over ‖Δ‖ ≤ ε by Δ = ε a / ‖a‖
§ Hence, solution: Δ = ε ∇g / ‖∇g‖, with ∇g = [∂g/∂w1, ∂g/∂w2]ᵀ

Gradient direction = steepest direction!

Gradient in n dimensions

∇g = [∂g/∂w1, ∂g/∂w2, …, ∂g/∂wn]ᵀ

Optimization Procedure: Gradient Ascent

§ init w
§ for iter = 1, 2, …

    w ← w + α · ∇_w g(w)

§ α: learning rate --- tweaking parameter that needs to be chosen carefully
§ How? Try multiple choices
  § Crude rule of thumb: each update should change w by about 0.1 – 1%
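A minimal sketch of this procedure; the objective g and its gradient below are placeholders chosen for illustration:

```python
def gradient_ascent(gradient, w, alpha=0.1, iters=100):
    """init w; for iter = 1, 2, ...: w <- w + alpha * grad_w g(w)."""
    for _ in range(iters):
        grad = gradient(w)
        w = [wi + alpha * gi for wi, gi in zip(w, grad)]
    return w

# Example: maximize g(w) = -(w1 - 1)^2 - (w2 + 2)^2,
# whose gradient is (-2(w1 - 1), -2(w2 + 2)).
print(gradient_ascent(lambda w: [-2 * (w[0] - 1), -2 * (w[1] + 2)],
                      [0.0, 0.0]))  # -> approximately [1.0, -2.0]
```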

Batch Gradient Ascent on the Log Likelihood Objective

max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)

§ init w
§ for iter = 1, 2, …

    w ← w + α · Σ_i ∇_w log P(y^(i) | x^(i); w)
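A sketch of the batch update: every step sums the per-example gradients over the whole training set. `grad_ll_i(w, example)` is a hypothetical helper returning ∇_w log P(y^(i) | x^(i); w) for one example:

```python
def batch_gradient_ascent(grad_ll_i, data, w, alpha=0.01, iters=100):
    """Each step uses the gradient summed over ALL training examples."""
    for _ in range(iters):
        total = [0.0] * len(w)
        for example in data:
            g = grad_ll_i(w, example)
            total = [t + gi for t, gi in zip(total, g)]
        w = [wi + alpha * ti for wi, ti in zip(w, total)]
    return w
```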

SLIDE 5

Stochastic Gradient Ascent on the Log Likelihood Objective

max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)

§ init w
§ for iter = 1, 2, …
  § pick random j

    w ← w + α · ∇_w log P(y^(j) | x^(j); w)

Observation: once the gradient on one training example has been computed, might as well incorporate it before computing the next one
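A sketch of the stochastic variant, reusing the hypothetical `grad_ll_i` helper from the batch version; only one randomly chosen example is touched per step:

```python
import random

def stochastic_gradient_ascent(grad_ll_i, data, w, alpha=0.01, iters=1000):
    """Each step uses the gradient of a single random example j."""
    for _ in range(iters):
        j = random.randrange(len(data))  # pick random j
        g = grad_ll_i(w, data[j])
        w = [wi + alpha * gi for wi, gi in zip(w, g)]
    return w
```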

Mini-Batch Gradient Ascent on the Log Likelihood Objective

max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)

§ init w
§ for iter = 1, 2, …
  § pick random subset of training examples J

    w ← w + α · Σ_{j∈J} ∇_w log P(y^(j) | x^(j); w)

Observation: the gradient over a small set of training examples (= mini-batch) can be computed in parallel, so might as well do that instead of using a single example
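And the mini-batch variant as a sketch, again with the hypothetical `grad_ll_i` helper; in practice the inner loop over J is what gets parallelized:

```python
import random

def minibatch_gradient_ascent(grad_ll_i, data, w, alpha=0.01,
                              iters=1000, batch_size=32):
    """Each step sums gradients over a small random subset J (mini-batch)."""
    for _ in range(iters):
        J = random.sample(range(len(data)), min(batch_size, len(data)))
        total = [0.0] * len(w)
        for j in J:
            g = grad_ll_i(w, data[j])
            total = [t + gi for t, gi in zip(total, g)]
        w = [wi + alpha * ti for wi, ti in zip(w, total)]
    return w
```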

How about computing all the derivatives?

§ We’ll talk about that once we’ve covered neural networks, which are a generalization of logistic regression

Neural Networks

SLIDE 6

Multi-class Logistic Regression

§ = special case of neural network

[Diagram: features f1(x) … fK(x) feed class scores z1, z2, z3, which pass through a softmax layer]

Deep Neural Network = Also learn the features!

[Diagram, built up over several slides: inputs x1 … xL pass through several hidden layers that compute the features f1(x) … fK(x), which feed class scores z1, z2, z3 and a softmax layer; g = nonlinear activation function applied at each hidden layer]
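A sketch of the forward pass pictured in the diagram, with one hidden layer for brevity (the diagram shows several); the weight matrices and the choice of g are illustrative assumptions:

```python
import math

def forward(x, W_hidden, W_out, g=math.tanh):
    """Inputs x1..xL -> learned features f(x) -> scores z_y -> softmax."""
    features = [g(sum(w * xi for w, xi in zip(row, x))) for row in W_hidden]
    scores = [sum(w * fi for w, fi in zip(row, features)) for row in W_out]
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]
    return [e / sum(exps) for e in exps]  # class probabilities
```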

SLIDE 7

Common Activation Functions

[source: MIT 6.S191 introtodeeplearning.com]
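The figure itself isn't reproduced here; as a sketch, these are the activation functions such overviews typically include (an assumption about the figure's content, not a transcription of it):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))  # output in (0, 1)

def tanh(z):
    return math.tanh(z)                # output in (-1, 1)

def relu(z):
    return max(0.0, z)                 # 0 for z < 0, identity for z >= 0
```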

Deep Neural Network: Also Learn the Features!

§ Training the deep neural network is just like logistic regression:

just w tends to be a much, much larger vector → just run gradient ascent + stop when log likelihood of hold-out data starts to decrease

Neural Networks Properties

§ Theorem (Universal Function Approximators). A two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy.

§ Practical considerations:
  § Can be seen as learning the features
  § Large number of neurons
    § Danger of overfitting
    § (hence early stopping!)

Universal Function Approximation Theorem*

§ In words: Given any continuous function f(x), if a 2-layer neural network has enough hidden units, then there is a choice of weights that allows it to closely approximate f(x).

Cybenko (1989), “Approximations by Superpositions of Sigmoidal Functions”
Hornik (1991), “Approximation Capabilities of Multilayer Feedforward Networks”
Leshno and Schocken (1991), “Multilayer Feedforward Networks with Non-Polynomial Activation Functions Can Approximate Any Function”

SLIDE 8


Fun Neural Net Demo Site

§ Demo site: http://playground.tensorflow.org/

How about computing all the derivatives?

§ Derivatives tables: [source: http://hyperphysics.phy-astr.gsu.edu/hbase/Math/derfunc.html]

How about computing all the derivatives?

§ But a neural net f is never one of those table entries?
§ No problem: CHAIN RULE:

If f(x) = g(h(x))
Then f′(x) = g′(h(x)) · h′(x)

→ Derivatives can be computed by following well-defined procedures
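A quick numerical sanity check of the chain rule, using an illustrative composition f(x) = sin(x²):

```python
import math

# f(x) = g(h(x)) with g(u) = sin(u) and h(x) = x^2,
# so the chain rule gives f'(x) = cos(x^2) * 2x.
def f_prime(x):
    return math.cos(x ** 2) * (2 * x)

# Compare against a central-difference estimate.
x0, h = 1.3, 1e-6
numeric = (math.sin((x0 + h) ** 2) - math.sin((x0 - h) ** 2)) / (2 * h)
print(abs(f_prime(x0) - numeric) < 1e-6)  # True
```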

SLIDE 9

Automatic Differentiation

§ Automatic differentiation software

§ e.g. Theano, TensorFlow, PyTorch, Chainer
§ Only need to program the function g(x, y, w)
§ Can automatically compute all derivatives w.r.t. all entries in w
§ This is typically done by caching info during the forward computation pass of f, and then doing a backward pass = “backpropagation”

§ Autodiff / backpropagation can often be done at computational cost comparable to the forward pass

§ Need to know this exists
§ How this is done? Outside of scope of CS 188
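A minimal PyTorch sketch of the idea: program only the forward function, and the backward pass fills in every derivative. The values are illustrative:

```python
import torch

# Forward pass: compute g(x, w); autograd caches intermediate results.
w = torch.tensor([1.0, -2.0], requires_grad=True)
x = torch.tensor([0.5, 3.0])
g = torch.sigmoid(w @ x)

# Backward pass ("backpropagation"): derivatives w.r.t. all entries of w.
g.backward()
print(w.grad)  # dg/dw1 and dg/dw2, no manual calculus needed
```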

Summary of Key Ideas

§ Optimize probability of label given input
§ Continuous optimization

§ Gradient ascent:
  § Compute steepest uphill direction = gradient (= just vector of partial derivatives)
  § Take step in the gradient direction
  § Repeat (until held-out data accuracy starts to drop = “early stopping”)

§ Deep neural nets
  § Last layer = still logistic regression
  § Now also many more layers before this last layer
    § = computing the features
    § → the features are learned rather than hand-designed

§ Universal function approximation theorem
  § If neural net is large enough
  § Then neural net can represent any continuous mapping from input to output with arbitrary accuracy
  § But remember: need to avoid overfitting / memorizing the training data → early stopping!

§ Automatic differentiation gives the derivatives efficiently (how? = outside of scope of 188)

How well does it work?

Computer Vision

SLIDE 10

Object Detection

Manual Feature Design

Features and Generalization

[HoG: Dalal and Triggs, 2005]

Features and Generalization

[Figure: an image alongside its HoG features]

SLIDE 11

Performance

[Graph, built up over several slides: classification performance results, with AlexNet highlighted; graph credit Matt Zeiler, Clarifai]

SLIDE 12


MS COCO Image Captioning Challenge

Karpathy & Fei-Fei, 2015; Donahue et al., 2015; Xu et al., 2015; many more

Visual QA Challenge

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh

Speech Recognition

graph credit Matt Zeiler, Clarifai

SLIDE 13

Machine Translation

Google Neural Machine Translation (in production)

Next: More Neural Net Applications!