CSC 311: Introduction to Machine Learning - Lecture 7: Probabilistic Models


1. CSC 311: Introduction to Machine Learning
   Lecture 7 - Probabilistic Models
   Roger Grosse, Chris Maddison, Juhan Bae, Silviu Pitis
   University of Toronto, Fall 2020

2. Today
   So far in the course we have adopted a modular perspective, in which the model, loss function, optimizer, and regularizer are specified separately.
   Today we will begin putting together a probabilistic interpretation of the choice of model and loss, and introduce the concept of maximum likelihood estimation.
   Let's start with a simple biased coin example.
   ◮ You flip a coin N = 100 times and get outcomes {x_1, ..., x_N} where x_i ∈ {0, 1} and x_i = 1 is interpreted as heads (H).
   ◮ Suppose you had N_H = 55 heads and N_T = 45 tails.
   ◮ What is the probability it will come up heads if we flip again?
   Let's design a model for this scenario and fit it. We can then use the fitted model to predict the next outcome.

3. Model?
   The coin is possibly loaded. So, we can assume that one coin flip outcome x is a Bernoulli random variable for some currently unknown parameter θ ∈ [0, 1]:
   p(x = 1 | θ) = θ and p(x = 0 | θ) = 1 − θ
   or, more succinctly,
   p(x | θ) = θ^x (1 − θ)^{1 − x}
   It's sensible to assume that {x_1, ..., x_N} are independent and identically distributed (i.i.d.) Bernoullis. Thus the joint probability of the outcome {x_1, ..., x_N} is
   p(x_1, ..., x_N | θ) = ∏_{i=1}^N θ^{x_i} (1 − θ)^{1 − x_i}
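As a quick sanity check, here is a minimal sketch (mine, not from the slides) that evaluates this joint probability for a candidate θ on the 55-heads / 45-tails data from the running example; the array construction and function name are illustrative assumptions.

```python
import numpy as np

# Hypothetical encoding of the data: 55 heads (1) and 45 tails (0).
x = np.array([1] * 55 + [0] * 45)

def joint_prob(x, theta):
    """Joint probability of i.i.d. Bernoulli outcomes under parameter theta."""
    return np.prod(theta ** x * (1 - theta) ** (1 - x))

print(joint_prob(x, 0.5))   # probability of the data under a fair coin
print(joint_prob(x, 0.55))  # larger value: theta = 0.55 explains the data better
```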

4. Loss?
   We call the probability mass (or density, for continuous data) of the observed data the likelihood function (viewed as a function of the parameters θ):
   L(θ) = ∏_{i=1}^N θ^{x_i} (1 − θ)^{1 − x_i}
   We usually work with log-likelihoods:
   ℓ(θ) = ∑_{i=1}^N x_i log θ + (1 − x_i) log(1 − θ)
   How can we choose θ? Good values of θ should assign high probability to the observed data. This motivates the maximum likelihood criterion: we should pick the parameters that maximize the likelihood,
   θ̂_ML = arg max_{θ ∈ [0, 1]} ℓ(θ)
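Before deriving the closed-form answer, a small sketch (an assumption of mine, not course code) that evaluates ℓ(θ) on a grid of candidates and picks the best one:

```python
import numpy as np

x = np.array([1] * 55 + [0] * 45)  # hypothetical 55 heads, 45 tails

def log_likelihood(theta, x):
    """Bernoulli log-likelihood: sum_i x_i log(theta) + (1 - x_i) log(1 - theta)."""
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Brute-force maximization over a grid of candidate parameters.
grid = np.linspace(0.01, 0.99, 99)
values = [log_likelihood(t, x) for t in grid]
print(grid[np.argmax(values)])  # approximately 0.55
```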

5. Maximum Likelihood Estimation for the Coin Example
   Remember how we found the optimal solution to linear regression by setting derivatives to zero? We can do that again for the coin example.
   dℓ/dθ = d/dθ ∑_{i=1}^N [ x_i log θ + (1 − x_i) log(1 − θ) ]
         = d/dθ [ N_H log θ + N_T log(1 − θ) ]
         = N_H/θ − N_T/(1 − θ)
   where N_H = ∑_i x_i and N_T = N − ∑_i x_i.
   Setting this to zero gives the maximum likelihood estimate:
   θ̂_ML = N_H / (N_H + N_T)
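The closed form is just a ratio of counts; a minimal sketch of the same computation (the variable names are mine):

```python
import numpy as np

x = np.array([1] * 55 + [0] * 45)  # hypothetical 55 heads, 45 tails
n_heads = x.sum()
n_tails = len(x) - n_heads

theta_ml = n_heads / (n_heads + n_tails)
print(theta_ml)  # 0.55, matching the grid search above
```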

6. Maximum Likelihood Estimation
   Notice that in the coin example we are actually minimizing cross-entropies!
   θ̂_ML = arg max_{θ ∈ [0, 1]} ℓ(θ)
        = arg min_{θ ∈ [0, 1]} −ℓ(θ)
        = arg min_{θ ∈ [0, 1]} ∑_{i=1}^N −x_i log θ − (1 − x_i) log(1 − θ)
   This is an example of maximum likelihood estimation:
   ◮ define a model that assigns a probability (or probability density) to a dataset
   ◮ maximize the likelihood (or minimize the negative log-likelihood).
   Many examples we've considered fall into this framework! Let's consider classification again.

7. Generative vs. Discriminative
   Two approaches to classification:
   Discriminative approach: estimate parameters of the decision boundary / class separator directly from labeled examples.
   ◮ Model p(t | x) directly (logistic regression models).
   ◮ Learn mappings from inputs to classes (linear/logistic regression, decision trees, etc.).
   ◮ Tries to solve: How do I separate the classes?
   Generative approach: model the distribution of inputs characteristic of the class (Bayes classifier).
   ◮ Model p(x | t).
   ◮ Apply Bayes' Rule to derive p(t | x).
   ◮ Tries to solve: What does each class "look" like?
   Key difference: is there a distributional assumption over the inputs?

8. A Generative Model: Bayes Classifier
   Aim: classify text into spam / not-spam (yes: c = 1; no: c = 0).
   Example: "You are one of the very few who have been selected as a winners for the free $1000 Gift Card."
   Use bag-of-words features: get a binary vector x for each email.
   Vocabulary:
   ◮ "a": 1
   ◮ ...
   ◮ "car": 0
   ◮ "card": 1
   ◮ ...
   ◮ "win": 0
   ◮ "winner": 1
   ◮ "winter": 0
   ◮ ...
   ◮ "you": 1
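A minimal bag-of-words sketch (my own illustration; the vocabulary slice and helper name are assumptions). Note that with exact word matching, "winners" in the example email would not activate the "winner" feature, so the slides presumably apply some word normalization:

```python
# Toy vocabulary slice from the slide.
vocab = ["a", "car", "card", "win", "winner", "winter", "you"]

def bag_of_words(text, vocab):
    """Binary feature vector: 1 if the vocabulary word appears in the text, else 0."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

email = "you are one of the very few who have been selected as a winners for the free $1000 gift card"
print(bag_of_words(email, vocab))  # [1, 0, 1, 0, 0, 0, 1]
```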

9. Bayes Classifier
   Given features x = [x_1, x_2, ..., x_D]^T, we want to compute class probabilities using Bayes' Rule:
   p(c | x) = p(x, c) / p(x) = p(x | c) p(c) / p(x)
   where p(c | x) is the probability of the class given the words and p(x | c) is the probability of the words given the class. More formally:
   posterior = (class likelihood × prior) / evidence
   How can we compute p(x) for the two-class case? (Do we need to?)
   p(x) = p(x | c = 0) p(c = 0) + p(x | c = 1) p(c = 1)
   To compute p(c | x) we need: p(x | c) and p(c).
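A small numerical sketch of the two-class posterior (all numbers are made up for illustration):

```python
# Made-up quantities for one particular feature vector x.
p_c1 = 0.3                # prior p(c = 1), e.g. the overall fraction of spam
p_c0 = 1 - p_c1
p_x_given_c1 = 2e-5       # p(x | c = 1)
p_x_given_c0 = 5e-7       # p(x | c = 0)

evidence = p_x_given_c0 * p_c0 + p_x_given_c1 * p_c1  # p(x)
posterior_c1 = p_x_given_c1 * p_c1 / evidence          # p(c = 1 | x)
print(posterior_c1)  # ~0.94: x is far more probable under the spam class
```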

10. Naïve Bayes
    Assume we have two classes: spam and non-spam. We have a dictionary of D words, and binary features x = [x_1, ..., x_D] saying whether each word appears in the e-mail.
    If we define a joint distribution p(c, x_1, ..., x_D), this gives enough information to determine p(c) and p(x | c).
    Problem: specifying a joint distribution over D + 1 binary variables requires 2^{D+1} − 1 entries. This is computationally prohibitive and would require an absurd amount of data to fit.
    We'd like to impose structure on the distribution such that:
    ◮ it can be compactly represented
    ◮ learning and inference are both tractable

11. Naïve Bayes
    Naïve assumption: Naïve Bayes assumes that the word features x_i are conditionally independent given the class c.
    ◮ This means x_i and x_j are independent under the conditional distribution p(x | c).
    ◮ Note: this doesn't mean they're independent marginally.
    ◮ Mathematically:
      p(c, x_1, ..., x_D) = p(c) p(x_1 | c) ··· p(x_D | c)
    Compact representation of the joint distribution:
    ◮ Prior probability of class: p(c = 1) = π (e.g. an email being spam)
    ◮ Conditional probability of a word feature given the class: p(x_j = 1 | c) = θ_jc (e.g. the word "price" appearing in spam)
    ◮ 2D + 1 parameters total (versus 2^{D+1} − 1 before)
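To make the compact parameterization concrete, here is a small sketch (my own, with made-up numbers) that evaluates the factorized joint p(c, x) = p(c) ∏_j p(x_j | c) from π and the θ_jc table:

```python
import numpy as np

# Made-up parameters for D = 3 word features; theta[j, k] = p(x_j = 1 | c = k).
pi = 0.3
theta = np.array([[0.01, 0.60],
                  [0.10, 0.05],
                  [0.30, 0.70]])

def joint(c, x, pi, theta):
    """Naive Bayes joint probability p(c, x) = p(c) * prod_j p(x_j | c)."""
    p_c = pi if c == 1 else 1 - pi
    p_x_given_c = np.prod(theta[:, c] ** x * (1 - theta[:, c]) ** (1 - x))
    return p_c * p_x_given_c

x = np.array([1, 0, 1])
print(joint(1, x, pi, theta), joint(0, x, pi, theta))
```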

12. Bayes Nets
    We can represent this model using a directed graphical model, or Bayesian network:
    [Figure: the class c is the parent of each word feature x_1, ..., x_D.]
    This graph structure means the joint distribution factorizes as a product of conditional distributions for each variable given its parent(s).
    Intuitively, you can think of the edges as reflecting a causal structure. But mathematically, this doesn't hold without additional assumptions.

13. Naïve Bayes: Learning
    The parameters can be learned efficiently because the log-likelihood decomposes into independent terms for each feature.
    ℓ(θ) = ∑_{i=1}^N log p(c^(i), x^(i))
         = ∑_{i=1}^N log [ p(x^(i) | c^(i)) p(c^(i)) ]
         = ∑_{i=1}^N log [ p(c^(i)) ∏_{j=1}^D p(x_j^(i) | c^(i)) ]
         = ∑_{i=1}^N [ log p(c^(i)) + ∑_{j=1}^D log p(x_j^(i) | c^(i)) ]
         = ∑_{i=1}^N log p(c^(i)) + ∑_{j=1}^D ∑_{i=1}^N log p(x_j^(i) | c^(i))
    The first term is the Bernoulli log-likelihood of the labels; each term of the second sum is the Bernoulli log-likelihood for feature x_j.
    Each of these log-likelihood terms depends on a different set of parameters, so they can be optimized independently.
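A quick numerical check of the decomposition on toy data (everything here is made up; it only verifies the algebra above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and made-up parameters: N examples, D binary features.
N, D = 8, 3
c = rng.integers(0, 2, size=N)              # labels c^(i)
X = rng.integers(0, 2, size=(N, D))         # features x_j^(i)
pi = 0.4                                    # p(c = 1)
theta = rng.uniform(0.1, 0.9, size=(D, 2))  # theta[j, k] = p(x_j = 1 | c = k)

# Left-hand side: sum_i log p(c^(i), x^(i)) under the naive Bayes factorization.
log_joint = 0.0
for i in range(N):
    log_joint += np.log(pi if c[i] == 1 else 1 - pi)
    log_joint += np.sum(X[i] * np.log(theta[:, c[i]]) + (1 - X[i]) * np.log(1 - theta[:, c[i]]))

# Right-hand side: label log-likelihood plus one Bernoulli log-likelihood per feature.
label_term = np.sum(c * np.log(pi) + (1 - c) * np.log(1 - pi))
feature_terms = sum(
    np.sum(X[:, j] * np.log(theta[j, c]) + (1 - X[:, j]) * np.log(1 - theta[j, c]))
    for j in range(D)
)
print(np.isclose(log_joint, label_term + feature_terms))  # True
```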

14. Naïve Bayes: Learning
    We can handle these terms separately. For the prior we maximize:
    ∑_{i=1}^N log p(c^(i))
    This is a minor variant of our coin flip example. Let p(c^(i) = 1) = π. Note that p(c^(i)) = π^{c^(i)} (1 − π)^{1 − c^(i)}.
    Log-likelihood:
    ∑_{i=1}^N log p(c^(i)) = ∑_{i=1}^N c^(i) log π + ∑_{i=1}^N (1 − c^(i)) log(1 − π)
    Obtain the MLE by setting the derivative to zero:
    π̂ = (∑_i I[c^(i) = 1]) / N = (# spams in dataset) / (total # samples)
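In code the prior MLE is just the empirical spam fraction; a minimal sketch with hypothetical labels:

```python
import numpy as np

c = np.array([1, 0, 0, 1, 1, 0, 0, 0])  # hypothetical labels: 1 = spam, 0 = not spam

pi_hat = np.mean(c == 1)  # (# spams in dataset) / (total # samples)
print(pi_hat)             # 0.375
```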

15. Naïve Bayes: Learning
    Each θ_jc can be treated separately: maximize ∑_{i=1}^N log p(x_j^(i) | c^(i)).
    This is (again) a minor variant of our coin flip example.
    Let θ_jc = p(x_j^(i) = 1 | c). Note that p(x_j^(i) | c) = θ_jc^{x_j^(i)} (1 − θ_jc)^{1 − x_j^(i)}.
    Log-likelihood:
    ∑_{i=1}^N log p(x_j^(i) | c^(i)) = ∑_{i=1}^N c^(i) [ x_j^(i) log θ_j1 + (1 − x_j^(i)) log(1 − θ_j1) ]
                                     + ∑_{i=1}^N (1 − c^(i)) [ x_j^(i) log θ_j0 + (1 − x_j^(i)) log(1 − θ_j0) ]
    Obtain the MLEs by setting derivatives to zero:
    θ̂_jc = (∑_i I[x_j^(i) = 1 & c^(i) = c]) / (∑_i I[c^(i) = c])
          = (# spams containing word j) / (# spams in dataset)   for c = 1
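The same counting estimate in code, on a made-up toy dataset (array shapes and names are my own):

```python
import numpy as np

# Toy dataset: rows are emails, columns are binary word features.
X = np.array([[1, 0, 1],
              [0, 0, 1],
              [1, 1, 0],
              [0, 0, 0],
              [1, 0, 1]])
c = np.array([1, 0, 1, 0, 1])  # 1 = spam, 0 = not spam

D = X.shape[1]
theta_hat = np.zeros((D, 2))
for k in (0, 1):
    # MLE: fraction of class-k emails in which word j appears.
    theta_hat[:, k] = X[c == k].mean(axis=0)

print(theta_hat)  # theta_hat[j, k] estimates p(x_j = 1 | c = k)
```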

16. Naïve Bayes: Inference
    We predict the category by performing inference in the model. Apply Bayes' Rule:
    p(c | x) = p(c) p(x | c) / ∑_{c'} p(c') p(x | c')
             = p(c) ∏_{j=1}^D p(x_j | c) / ∑_{c'} p(c') ∏_{j=1}^D p(x_j | c')
    We need not compute the denominator if we're simply trying to determine the most likely c.
    Shorthand notation:
    p(c | x) ∝ p(c) ∏_{j=1}^D p(x_j | c)
    For input x, predict by comparing the values of p(c) ∏_{j=1}^D p(x_j | c) for different c (e.g. choose the largest).
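A minimal prediction sketch (mine, not from the slides), working in log space so that products of many small probabilities do not underflow; π and the θ table are made-up parameters in the same shape as the learning sketches above:

```python
import numpy as np

def predict(x, pi, theta):
    """Choose the class with the largest p(c) * prod_j p(x_j | c), computed in log space."""
    scores = []
    for k in (0, 1):
        log_prior = np.log(pi if k == 1 else 1 - pi)
        log_lik = np.sum(x * np.log(theta[:, k]) + (1 - x) * np.log(1 - theta[:, k]))
        scores.append(log_prior + log_lik)
    return int(np.argmax(scores))

# Made-up parameters: theta[j, k] = p(x_j = 1 | c = k), kept away from 0/1 so the logs are finite.
pi = 0.3
theta = np.array([[0.05, 0.60],
                  [0.10, 0.05],
                  [0.20, 0.70]])
print(predict(np.array([1, 0, 1]), pi, theta))  # 1, i.e. classified as spam
```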
