CSC 311: Introduction to Machine Learning
Lecture 7 - Probabilistic Models
Roger Grosse, Chris Maddison, Juhan Bae, Silviu Pitis
University of Toronto, Fall 2020
Today: so far in the course we have adopted a loss-minimization view of learning; this lecture introduces probabilistic models and maximum likelihood estimation.
◮ You flip a coin N = 100 times and get outcomes {x1, . . . , xN}, where each xi ∈ {0, 1} and xi = 1 means heads.
◮ Suppose you had NH = 55 heads and NT = 45 tails.
◮ What is the probability it will come up heads if we flip again? Let’s set up a model and find out (a quick numerical sketch follows).
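A minimal numerical sketch of the answer this example is building toward, assuming heads are coded as 1 (the maximum likelihood estimate is derived properly further down):

```python
import numpy as np

# Hypothetical data matching the example: 55 heads (1) and 45 tails (0).
flips = np.array([1] * 55 + [0] * 45)
N_H = flips.sum()
N_T = len(flips) - N_H

# The maximum likelihood estimate of a Bernoulli parameter is just the
# empirical fraction of heads.
theta_hat = N_H / len(flips)
print(theta_hat)  # 0.55
```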
Maximum likelihood estimation:
◮ define a model that assigns a probability (or has a probability density) to the observed dataset;
◮ maximize the likelihood (or, equivalently, minimize the negative log-likelihood). The coin example is worked out below.
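For the coin, this recipe gives the familiar closed-form answer (using the NH heads and NT tails from above):

\[
\ell(\theta) = \log \prod_{i=1}^{N} \theta^{x_i} (1-\theta)^{1-x_i}
             = N_H \log \theta + N_T \log(1-\theta),
\qquad
\frac{d\ell}{d\theta} = \frac{N_H}{\theta} - \frac{N_T}{1-\theta} = 0
\;\Longrightarrow\;
\hat{\theta}_{\mathrm{ML}} = \frac{N_H}{N_H + N_T} = 0.55 .
\]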
Discriminative approach:
◮ Model p(t | x) directly (e.g. logistic regression).
◮ Learn mappings from inputs to classes (linear/logistic regression, decision trees, etc.).
◮ Tries to solve: How do I separate the classes?
Generative approach:
◮ Model p(x | t).
◮ Apply Bayes’ rule to derive p(t | x) (the rule is written out below).
◮ Tries to solve: What does each class “look” like?
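Spelled out, the Bayes rule step for a generative classifier is:

\[
p(t \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid t)\, p(t)}{\sum_{t'} p(\mathbf{x} \mid t')\, p(t')} .
\]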
Example: an e-mail represented as a binary word-occurrence (bag-of-words) feature vector; a sketch of computing such a vector follows the list.
◮ “a”: 1
◮ ...
◮ “car”: 0
◮ “card”: 1
◮ ...
◮ “win”: 0
◮ “winner”: 1
◮ “winter”: 0
◮ ...
◮ “you”: 1
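A minimal sketch of how such a vector might be computed; the tiny vocabulary and the helper function here are hypothetical, not from the slides:

```python
# Hypothetical tiny vocabulary; in practice it would be the full training vocabulary.
VOCAB = ["a", "car", "card", "win", "winner", "winter", "you"]

def bag_of_words(text, vocab=VOCAB):
    """Return a binary vector: 1 if the vocabulary word occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

print(bag_of_words("you are a winner claim your card now"))
# [1, 0, 1, 0, 1, 0, 1]
```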
We would like a model of the joint distribution over the class and all the word features where:
◮ it can be compactly represented;
◮ learning and inference are both tractable.
◮ Naive Bayes assumption: the word features xj are conditionally independent given the class c.
◮ This means xi and xj are independent under the conditional distribution p(x | c).
◮ Note: this doesn’t mean they’re marginally independent.
◮ Mathematically, the joint distribution then factorizes as shown below.
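The resulting factorization (standard naive Bayes, consistent with the parameterization on the next lines):

\[
p(c, x_1, \dots, x_D) = p(c) \prod_{j=1}^{D} p(x_j \mid c) .
\]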
◮ Prior probability of class: p(c = 1) = π (e.g. the probability an e-mail is spam).
◮ Conditional probability of word feature given class: p(xj = 1 | c) = θjc.
◮ 2D + 1 parameters total, versus the 2^(D+1) − 1 needed for an unrestricted joint distribution over (c, x1, . . . , xD).
The log-likelihood of the dataset decomposes into independent terms:

\[
\ell(\boldsymbol{\theta})
= \sum_{i=1}^{N} \log p(c^{(i)}, \mathbf{x}^{(i)})
= \sum_{i=1}^{N} \Big[ \log p(c^{(i)}) + \sum_{j=1}^{D} \log p(x_j^{(i)} \mid c^{(i)}) \Big]
= \sum_{i=1}^{N} \log p(c^{(i)})
  \;+\; \sum_{j=1}^{D} \sum_{i=1}^{N} \log p(x_j^{(i)} \mid c^{(i)}) .
\]

The first term depends only on π, and the j-th term of the second sum depends only on the parameters θj0, θj1 for feature xj, so each can be maximized separately.
Maximizing the term for the prior: writing \(N_1 = \sum_{i=1}^{N} \mathbb{I}[c^{(i)} = 1]\) for the number of class-1 (spam) examples,

\[
\sum_{i=1}^{N} \log p(c^{(i)}) = N_1 \log \pi + (N - N_1) \log(1 - \pi)
\;\Longrightarrow\;
\hat{\pi} = \frac{N_1}{N} = \frac{\#\,\text{class-1 examples}}{\#\,\text{examples}} .
\]
Maximizing the per-feature terms: the log-likelihood contribution of feature xj splits by class,

\[
\sum_{i=1}^{N} \log p(x_j^{(i)} \mid c^{(i)})
= \sum_{i:\,c^{(i)}=1} \Big[ x_j^{(i)} \log \theta_{j1} + (1 - x_j^{(i)}) \log(1 - \theta_{j1}) \Big]
+ \sum_{i:\,c^{(i)}=0} \Big[ x_j^{(i)} \log \theta_{j0} + (1 - x_j^{(i)}) \log(1 - \theta_{j0}) \Big] .
\]

Setting the derivative to zero gives, e.g. for c = 1,

\[
\hat{\theta}_{j1}
= \frac{\sum_{i} \mathbb{I}[x_j^{(i)} = 1 \wedge c^{(i)} = 1]}{\sum_{i} \mathbb{I}[c^{(i)} = 1]}
= \frac{\#\,\text{class-1 examples with } x_j = 1}{\#\,\text{class-1 examples}} .
\]
Learning the naive Bayes parameters therefore reduces to counting (see the sketch below):
◮ Compute co-occurrence counts of each feature with the labels.
◮ Requires only one pass through the data!
◮ Cheap because of the model structure. (For more general models, maximum likelihood typically requires iterative optimization instead of closed-form counts.)
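A minimal sketch of that one-pass counting, assuming a binary feature matrix X and binary labels c (no smoothing, so zero counts yield zero probabilities):

```python
import numpy as np

def fit_naive_bayes(X, c):
    """MLE for Bernoulli naive Bayes via co-occurrence counts.

    X: (N, D) binary feature matrix; c: (N,) binary class labels.
    """
    pi = c.mean()                    # p(c = 1)
    theta1 = X[c == 1].mean(axis=0)  # p(x_j = 1 | c = 1)
    theta0 = X[c == 0].mean(axis=0)  # p(x_j = 1 | c = 0)
    return pi, theta0, theta1

# Tiny toy example: 4 documents, 3 vocabulary words.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]])
c = np.array([1, 1, 0, 0])
print(fit_naive_bayes(X, c))  # pi = 0.5, theta0 = [0.5, 0, 0.5], theta1 = [1, 0.5, 0.5]
```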
Bayesian parameter estimation requires two ingredients:
◮ The prior distribution p(θ), which encodes our beliefs about the parameters before observing any data.
◮ The likelihood p(D | θ), the same as in maximum likelihood.
They are combined using Bayes’ rule, written below.
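Combining the two ingredients with Bayes’ rule gives the posterior over parameters:

\[
p(\theta \mid \mathcal{D}) = \frac{p(\theta)\, p(\mathcal{D} \mid \theta)}{\int p(\theta')\, p(\mathcal{D} \mid \theta')\, \mathrm{d}\theta'} .
\]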
◮ We can choose an uninformative prior, which assumes as little as possible; a natural choice here is the uniform prior on θ.
◮ But our experience tells us 0.5 is more likely than 0.99. One family of priors that can express this is the beta distribution, written below.
◮ The proportionality notation lets us ignore the normalization constant.
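The beta prior referred to above, stated up to its normalization constant:

\[
p(\theta; a, b) \propto \theta^{a-1} (1-\theta)^{b-1} .
\]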
◮ The expectation E[θ] = a/(a + b) (easy to derive; checked numerically below).
◮ The distribution gets more peaked when a and b are large.
◮ The uniform distribution is the special case where a = b = 1.
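A quick numerical check of the E[θ] = a/(a + b) fact, using SciPy's beta distribution (the parameter values are arbitrary):

```python
from scipy.stats import beta

# Compare the library's mean against a / (a + b) for a few (a, b) pairs.
for a, b in [(1, 1), (2, 2), (3, 7)]:
    print(a, b, beta(a, b).mean(), a / (a + b))
```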
◮ The reason this works is that the prior and likelihood have the same functional form: the beta distribution is the conjugate prior for the Bernoulli likelihood, as the update below shows.
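The conjugate update for the coin example, reusing the NH and NT counts from earlier:

\[
p(\theta \mid \mathcal{D}) \propto p(\theta)\, p(\mathcal{D} \mid \theta)
\propto \theta^{a-1}(1-\theta)^{b-1} \cdot \theta^{N_H}(1-\theta)^{N_T}
= \theta^{a+N_H-1}(1-\theta)^{b+N_T-1},
\]

i.e. the posterior is again a beta distribution, Beta(a + NH, b + NT).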