CMSC 691 Probabilistic and Statistical Models of Learning – PowerPoint PPT Presentation



SLIDE 1

CMSC 691 Probabilistic and Statistical Models of Learning Probabilities, Common Distributions, and Maximum Likelihood Estimation

SLIDE 2

Outline

Basics of Learning Probability Maximum Likelihood Estimation

SLIDE 3

Chris has just begun taking a machine learning course. Pat, the instructor, has to ascertain whether Chris has “learned” the topics covered by the end of the course. What is a “reasonable” exam?

(Bad) Choice 1: History of pottery

Chris’s performance is not indicative of what was learned in ML

(Bad) Choice 2: Questions answered during lectures

Open book?

A good exam should test the ability to answer “related” but “new” questions

What does it mean to learn?

Generalization

SLIDE 4

Model, parameters and hyperparameters

Model: mathematical formulation of a system (e.g., a classifier)
Parameters: primary “knobs” of the model that are set by a learning algorithm
Hyperparameters: secondary “knobs,” typically fixed before learning begins


SLIDE 5
SLIDE 6

score( )

SLIDE 7

score_θ( )

scoring model

objective F(θ)

SLIDE 8

scoring model

objective F(θ)

(implicitly) dependent on the observed data X =

score_θ( )

SLIDE 9

Machine Learning Framework: Learning

[Diagram: instances 1–4 flow into a Machine Learning Predictor (with extra knowledge); an Evaluator assigns a score]

instances are typically examined independently; gold/correct labels give feedback to the predictor

score_θ(X)

scoring model

objective F(θ)

SLIDE 10

F(θ), its derivative F′(θ) with respect to θ, and the maximizer θ*

How do we optimize? Follow the derivative/gradient of our training score function.

Set t = 0
Pick a starting value θt
Until converged:

  • 1. Get value yt = F(θt)
  • 2. Get derivative gt = F′(θt)
  • 3. Get scaling factor ρt
  • 4. Set θt+1 = θt + ρt * gt
  • 5. Set t += 1

[Figure: iterates θ0, θ1, θ2, θ3 with values y0–y3 and gradients g0–g2 climbing toward the maximum]

Gradient Ascent
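The loop above can be sketched in a few lines of Python. The quadratic objective and the fixed step size ρ are illustrative choices, not from the slides:

```python
def gradient_ascent(F, F_prime, theta0, rho=0.1, tol=1e-8, max_iters=10_000):
    """Follow the derivative of F uphill until the iterates stop moving."""
    theta = theta0
    for _ in range(max_iters):
        g = F_prime(theta)            # step 2: derivative at the current theta
        theta_next = theta + rho * g  # step 4: move in the uphill direction
        if abs(theta_next - theta) < tol:
            return theta_next         # converged
        theta = theta_next
    return theta

# Toy objective with its maximum at theta* = 3 (an illustrative choice)
F = lambda th: -(th - 3.0) ** 2
F_prime = lambda th: -2.0 * (th - 3.0)
theta_star = gradient_ascent(F, F_prime, theta0=0.0)  # ~3.0
```

With this concave objective the update contracts toward θ* geometrically, so any reasonable starting point converges.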

SLIDE 11

Outline

Basics of Learning Probability Maximum Likelihood Estimation

SLIDE 12

Probability Topics (High-Level)

Basics of Probability: Prereqs Philosophy of Probability, and Terminology Useful Quantities and Inequalities

SLIDE 13

Probability Prerequisites

Basic probability axioms and definitions Joint probability Marginal probability Probabilistic Independence Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 14

(Most) Probability Axioms

p(everything) = 1
p(∅) = 0
p(A) ≤ p(B), when A ⊆ B
p(A ∪ B) = p(A) + p(B), when A ∩ B = ∅

[Venn diagram: events A and B inside “everything”]

In general, p(A ∪ B) = p(A) + p(B) − p(A ∩ B); when A and B overlap, p(A ∪ B) ≠ p(A) + p(B)

SLIDE 15

Probabilities and Random Variables

Random variables: variables that represent the possible outcomes of some random “process”

SLIDE 16

Probabilities and Random Variables

Random variables: variables that represent the possible outcomes of some random “process” Example #1: A (weighted) coin that can come up heads or tails

X is a random variable denoting the possible outcomes: X=HEADS or X=TAILS

SLIDE 17

Probabilities and Random Variables

Random variables: variables that represent the possible outcomes of some random “process”

Example #1: A (weighted) coin that can come up heads or tails

X is a random variable denoting the possible outcomes: X=HEADS or X=TAILS

Example #2: Measuring the amount of snow that fell in the last storm

Y is a random variable denoting the amount of snow that fell, in inches: Y=0, or Y=0.5, or Y=1.0495928591, or Y=10, or …

SLIDE 18

Probabilities and Random Variables

Random variables: variables that represent the possible outcomes of some random “process”

Example #1: A (weighted) coin that can come up heads or tails

X is a random variable denoting the possible outcomes X=HEADS or X=TAILS

Example #2: Measuring the amount of snow that fell in the last storm

Y is a random variable denoting the amount of snow that fell, in inches: Y=0, or Y=0.5, or Y=1.0495928591, or Y=10, or …

DISCRETE random variable CONTINUOUS random variable

SLIDE 19

Random Variables

If X is a…                          Discrete random variable                        Continuous random variable
The values k that X can take are    finite or countably infinite (e.g., integers)   uncountably infinite (e.g., real values)

SLIDE 20

Random Variables

If X is a…                          Discrete random variable                        Continuous random variable
The values k that X can take are    finite or countably infinite (e.g., integers)   uncountably infinite (e.g., real values)
The relative likelihood p(X=k) is a probability mass function (PMF)                 probability density function (PDF)

SLIDE 21

Random Variables

If X is a…                          Discrete random variable                        Continuous random variable
The values k that X can take are    finite or countably infinite (e.g., integers)   uncountably infinite (e.g., real values)
The relative likelihood p(X=k) is a probability mass function (PMF)                 probability density function (PDF)
The values the PMF/PDF can take     0 ≤ p(X=k) ≤ 1                                  p(X=k) ≥ 0

SLIDE 22

Random Variables

If X is a…                          Discrete random variable                        Continuous random variable
The values k that X can take are    finite or countably infinite (e.g., integers)   uncountably infinite (e.g., real values)
The relative likelihood p(X=k) is a probability mass function (PMF)                 probability density function (PDF)
The values the PMF/PDF can take     0 ≤ p(X=k) ≤ 1                                  p(X=k) ≥ 0
We “add” with                       sums (∑)                                        integrals (∫)

SLIDE 23

Random Variables

If X is a…                          Discrete random variable                        Continuous random variable
The values k that X can take are    finite or countably infinite (e.g., integers)   uncountably infinite (e.g., real values)
The relative likelihood p(X=k) is a probability mass function (PMF)                 probability density function (PDF)
The values the PMF/PDF can take     0 ≤ p(X=k) ≤ 1                                  p(X=k) ≥ 0
We “add” with                       sums (∑)                                        integrals (∫)
p(everything) = 1 holds via         ∑_k p(X=k) = 1                                  ∫ p(x) dx = 1

SLIDE 24

Probability Prerequisites

Basic probability axioms and definitions Joint probability Marginal probability Probabilistic Independence Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 25

Joint Probability

Probability that multiple things “happen together”

[Venn diagram: events A and B inside “everything”; the overlap is the joint probability]

SLIDE 26

Joint Probability

Probability that multiple things “happen together”: p(x,y), p(x,y,z), p(x,y,w,z)
Symmetric: p(x,y) = p(y,x)

[Venn diagram: events A and B inside “everything”; the overlap is the joint probability]

SLIDE 27

Joint Probability

Probability that multiple things “happen together”: p(x,y), p(x,y,z), p(x,y,w,z)
Symmetric: p(x,y) = p(y,x)
Form a table based on outcomes: sum across cells = 1

p(x,y)       Y=0    Y=1
X=“cat”      .04    .32
X=“dog”      .20    .04
X=“bird”     .10    .10
X=“human”    .10    .10

SLIDE 28

Probability Prerequisites

Basic probability axioms and definitions Joint probability Marginal probability Probabilistic Independence Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 29

Marginal(ized) Probability: The Discrete Case

Consider the mutually exclusive ways that different values of x could occur with y

Q: How do we write this in terms of joint probabilities?

[Diagram: x=1 & y, x=2 & y, x=3 & y, x=4 & y]

SLIDE 30

Marginal(ized) Probability: The Discrete Case

[Diagram: x=1 & y, x=2 & y, x=3 & y, x=4 & y]

p(y) = ∑_x p(x, y)

Consider the mutually exclusive ways that different values of x could occur with y

SLIDE 31

Marginal(ized) Probability: The Discrete Case

[Diagram: x=1 & y, x=2 & y, x=3 & y, x=4 & y]

p(y) = ∑_x p(x, y)

Consider the mutually exclusive ways that different values of x could occur with y

Q: What is p(y=1)?

p(x,y)       Y=0    Y=1
X=“cat”      .04    .32
X=“dog”      .20    .04
X=“bird”     .10    .10
X=“human”    .10    .10

SLIDE 32

Marginal(ized) Probability: The Discrete Case

[Diagram: x=1 & y, x=2 & y, x=3 & y, x=4 & y]

p(y) = ∑_x p(x, y)

Consider the mutually exclusive ways that different values of x could occur with y

Q: What is p(y=1)?

p(x,y)       Y=0    Y=1
X=“cat”      .04    .32
X=“dog”      .20    .04
X=“bird”     .10    .10
X=“human”    .10    .10

A: 0.56
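The marginalization in this example can be checked mechanically; the dictionary below encodes the slide’s joint table (a sketch, with a helper name of my choosing):

```python
# The slide's joint table p(x, y)
joint = {
    ("cat", 0): 0.04, ("cat", 1): 0.32,
    ("dog", 0): 0.20, ("dog", 1): 0.04,
    ("bird", 0): 0.10, ("bird", 1): 0.10,
    ("human", 0): 0.10, ("human", 1): 0.10,
}

def marginal_y(joint, y):
    """p(Y = y) = sum over x of p(x, y)."""
    return sum(p for (_, yy), p in joint.items() if yy == y)

p_y1 = marginal_y(joint, 1)  # 0.32 + 0.04 + 0.10 + 0.10 = 0.56
```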

SLIDE 33

Probability Prerequisites

Basic probability axioms and definitions Joint probability Marginal probability Probabilistic Independence Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 34

Probabilistic Independence

Independence: when events can occur and not impact the probability of other events

Formally: p(x,y) = p(x)·p(y); generalizable to > 2 random variables

Q: Are the results of flipping the same coin twice in succession independent?

SLIDE 35

Probabilistic Independence

Independence: when events can occur and not impact the probability of other events

Formally: p(x,y) = p(x)·p(y); generalizable to > 2 random variables

Q: Are the results of flipping the same coin twice in succession independent? A: Yes (assuming no weird effects)

SLIDE 36

Probabilistic Independence

Independence: when events can occur and not impact the probability of other events

Formally: p(x,y) = p(x)·p(y); generalizable to > 2 random variables

Q: Are X and Y independent?

p(x,y)       Y=0    Y=1
X=“cat”      .04    .32
X=“dog”      .20    .04
X=“bird”     .10    .10
X=“human”    .10    .10

SLIDE 37

Probabilistic Independence

Independence: when events can occur and not impact the probability of other events

Formally: p(x,y) = p(x)·p(y); generalizable to > 2 random variables

Q: Are X and Y independent?

p(x,y)       Y=0    Y=1
X=“cat”      .04    .32
X=“dog”      .20    .04
X=“bird”     .10    .10
X=“human”    .10    .10

A: No (compute the marginal probabilities p(x) and p(y) and compare their product to each cell)
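This check can be sketched directly: encode the slide’s joint table and compare each cell against the product of its marginals (helper names are mine):

```python
# The slide's joint table p(x, y)
joint = {
    ("cat", 0): 0.04, ("cat", 1): 0.32,
    ("dog", 0): 0.20, ("dog", 1): 0.04,
    ("bird", 0): 0.10, ("bird", 1): 0.10,
    ("human", 0): 0.10, ("human", 1): 0.10,
}

def marginal_x(joint, x):
    return sum(p for (xx, _), p in joint.items() if xx == x)

def marginal_y(joint, y):
    return sum(p for (_, yy), p in joint.items() if yy == y)

def independent(joint, tol=1e-9):
    """X and Y are independent iff p(x, y) = p(x) * p(y) for every cell."""
    return all(abs(p - marginal_x(joint, x) * marginal_y(joint, y)) <= tol
               for (x, y), p in joint.items())

# e.g. p("cat") * p(Y=1) = 0.36 * 0.56 = 0.2016, but p("cat", 1) = 0.32
```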

SLIDE 38

Probability Prerequisites

Basic probability axioms and definitions Joint probability Marginal probability Probabilistic Independence Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 39

Conditional Probability

p(X | Y) = p(X, Y) / p(Y)

Conditional Probabilities are Probabilities

SLIDE 40

Conditional Probability

p(X | Y) = p(X, Y) / p(Y), where p(Y) is the marginal probability of Y

SLIDE 41

Conditional Probability

p(X | Y) = p(X, Y) / p(Y), where p(Y) = ∫ p(X, Y) dX

SLIDE 42

Revisiting Marginal Probability: The Discrete Case

[Diagram: x=1 & y, x=2 & y, x=3 & y, x=4 & y]

p(y) = ∑_x p(x, y) = ∑_x p(x) p(y | x)

SLIDE 43

Probability Prerequisites

Basic probability axioms and definitions Joint probability Marginal probability Probabilistic Independence Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 44

Deriving Bayes Rule

Start with conditional p(X | Y)

SLIDE 45

Deriving Bayes Rule

p(X | Y) = p(X, Y) / p(Y)

Solve for p(x,y)

SLIDE 46

Deriving Bayes Rule

p(X | Y) = p(Y | X) · p(X) / p(Y)

p(X | Y) = p(X, Y) / p(Y)

Solve for p(x,y):

p(X, Y) = p(X | Y) p(Y)

p(x,y) = p(y,x)

SLIDE 47

Bayes Rule

p(X | Y) = p(Y | X) · p(X) / p(Y)

posterior probability: p(X | Y); likelihood: p(Y | X); prior probability: p(X); marginal likelihood (probability): p(Y)

SLIDE 48

Probability Prerequisites

Basic probability axioms and definitions Joint probability Marginal probability Probabilistic Independence Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 49

Probability Chain Rule

p(x₁, x₂, …, x_T) = p(x₁) p(x₂ | x₁) p(x₃ | x₁, x₂) ⋯ p(x_T | x₁, …, x_{T−1}) = ∏_{i=1}^{T} p(xᵢ | x₁, …, x_{i−1})

repeated application of the definition of conditional probability
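The factorization above can be sketched as code: each factor is a (hypothetical) conditional p(xᵢ | x₁, …, x_{i−1}), represented here as a function of the current value and its history:

```python
def chain_rule(conditionals, xs):
    """p(x_1, ..., x_T) = prod over i of p(x_i | x_1, ..., x_{i-1}).

    conditionals[i] is a function (x_i, history) -> probability;
    the factors here are placeholders you would supply."""
    prob = 1.0
    for i, x in enumerate(xs):
        prob *= conditionals[i](x, xs[:i])
    return prob

# Three fair, independent coin flips: every factor ignores its history
conds = [lambda x, hist: 0.5 for _ in range(3)]
p_seq = chain_rule(conds, [1, 0, 1])  # 0.5 ** 3 = 0.125
```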

SLIDE 50

Probability Prerequisites

Basic probability axioms and definitions Joint probability Marginal probability Probabilistic Independence Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 51

Distribution Notation

If X is a R.V. and G is a distribution:

  • X ∼ G means X is distributed according to (“sampled from”) G

SLIDE 52

Distribution Notation

If X is a R.V. and G is a distribution:

  • X ∼ G means X is distributed according to (“sampled from”) G
  • G often has parameters θ = (θ₁, θ₂, …, θ_N) that govern its “shape”
  • Formally written as X ∼ G(θ)
SLIDE 53

Distribution Notation

If X is a R.V. and G is a distribution:

  • X ∼ G means X is distributed according to (“sampled from”) G
  • G often has parameters θ = (θ₁, θ₂, …, θ_N) that govern its “shape”
  • Formally written as X ∼ G(θ)

i.i.d.: If X₁, X₂, …, X_N are all independently sampled from G(θ), they are independently and identically distributed

SLIDE 54

Common Distributions

Bernoulli/Binomial Categorical/Multinomial Poisson Normal Gamma

Bernoulli: A single draw

  • Binary R.V.: 0 (failure) or 1 (success)
  • X ∼ Bernoulli(θ)
  • p(X = 1) = θ, p(X = 0) = 1 − θ
  • Generally, p(X = k) = θ^k (1 − θ)^{1−k}
SLIDE 55

Common Distributions

Bernoulli/Binomial Categorical/Multinomial Poisson Normal Gamma

Bernoulli: A single draw

  • Binary R.V.: 0 (failure) or 1 (success)
  • X ∼ Bernoulli(θ)
  • p(X = 1) = θ, p(X = 0) = 1 − θ
  • Generally, p(X = k) = θ^k (1 − θ)^{1−k}

Binomial: Sum of N iid Bernoulli draws

  • Values X can take: 0, 1, …, N
  • Represents number of successes
  • X ∼ Binomial(N, θ)
  • p(X = k) = (N choose k) θ^k (1 − θ)^{N−k}
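A sketch of the Binomial PMF using the formula above (Python’s `math.comb` supplies N choose k); the Bernoulli PMF falls out as the N = 1 special case:

```python
from math import comb

def binomial_pmf(k, N, theta):
    """p(X = k) = (N choose k) * theta**k * (1 - theta)**(N - k)."""
    return comb(N, k) * theta**k * (1.0 - theta)**(N - k)

# Bernoulli is the N = 1 special case: p(X = 1) = theta
p_heads = binomial_pmf(1, 1, 0.3)                         # 0.3
total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))  # PMF sums to 1
```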

SLIDE 56

Common Distributions

Bernoulli/Binomial Categorical/Multinomial Poisson Normal Gamma

Categorical: A single draw

  • Finite R.V. taking one of K values: 1, 2, …, K
  • X ∼ Cat(θ), θ ∈ ℝ^K
  • p(X = 1) = θ₁, p(X = 2) = θ₂, …, p(X = K) = θ_K
  • Generally, p(X = k) = ∏_j θⱼ^{1[k = j]}
  • 1[d] = 1 if d is true, 0 if d is false

Multinomial: Sum of N iid Categorical draws

  • Vector of size K representing how often value k was drawn
  • X ∼ Multinomial(N, θ), θ ∈ ℝ^K
SLIDE 57

Common Distributions

Bernoulli/Binomial Categorical/Multinomial Poisson Normal Gamma

Poisson

  • Discrete R.V. taking any integer ≥ 0
  • X ∼ Poisson(λ), λ ∈ ℝ>0 is the “rate”
  • PMF: p(X = k) = λ^k exp(−λ) / k!

SLIDE 58

Common Distributions

Bernoulli/Binomial Categorical/Multinomial Poisson Normal Gamma

Normal

  • Real R.V. taking any real number
  • X ∼ Normal(μ, σ): μ is the mean, σ is the standard deviation
  • PDF: p(X = x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))

https://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Normal_Distribution_PDF.svg/1920px-Normal_Distribution_PDF.svg.png

SLIDE 59

Common Distributions

Bernoulli/Binomial Categorical/Multinomial Poisson Normal Gamma

Multivariate Normal

  • Real vector R.V. X ∈ ℝ^K
  • X ∼ Normal(μ, Σ): μ ∈ ℝ^K is the mean, Σ ∈ ℝ^{K×K} is the covariance
  • p(X = x) ∝ exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))

SLIDE 60

Common Distributions

Bernoulli/Binomial Categorical/Multinomial Poisson Normal Gamma

Gamma

  • Real R.V. taking any positive real number
  • X ∼ Gamma(k, θ): k > 0 is the “shape” (how skewed it is), θ > 0 is the “scale” (how spread out the distribution is)
  • PDF: p(X = x) = x^{k−1} exp(−x/θ) / (θ^k Γ(k))

https://en.wikipedia.org/wiki/Gamma_distribution#/media/File:Gamma_distribution_pdf.svg
SLIDE 61

Probability Prerequisites

Basic probability axioms and definitions Joint probability Marginal probability Probabilistic Independence Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 62

Expected Value of a Random Variable

X ∼ p(⋅)

random variable

SLIDE 63

Expected Value of a Random Variable

X ∼ p(⋅)    𝔼[X] = ∑_x x p(x)

random variable; expected value (distribution p is implicit)

SLIDE 64

Expected Value: Example

1 2 3 4 5 6

uniform distribution of number of cats I have

1/6 · 1 + 1/6 · 2 + 1/6 · 3 + 1/6 · 4 + 1/6 · 5 + 1/6 · 6 = 3.5

𝔼[X] = ∑_x x p(x)
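The computation above, as a one-line sum over a PMF (a sketch; the helper name is mine):

```python
def expected_value(pmf, g=lambda x: x):
    """E[g(X)] = sum over x of g(x) * p(x); g defaults to the identity."""
    return sum(g(x) * p for x, p in pmf.items())

uniform_cats = {k: 1 / 6 for k in range(1, 7)}  # uniform over 1..6
ev = expected_value(uniform_cats)               # (1 + 2 + ... + 6) / 6 = 3.5
```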

SLIDE 65

Expected Value: Example

1 2 3 4 5 6

uniform distribution of number of cats I have

1/6 · 1 + 1/6 · 2 + 1/6 · 3 + 1/6 · 4 + 1/6 · 5 + 1/6 · 6 = 3.5

𝔼[X] = ∑_x x p(x)

Q: What common distribution is this?

SLIDE 66

Expected Value: Example

1 2 3 4 5 6

uniform distribution of number of cats I have

1/6 · 1 + 1/6 · 2 + 1/6 · 3 + 1/6 · 4 + 1/6 · 5 + 1/6 · 6 = 3.5

𝔼[X] = ∑_x x p(x)

Q: What common distribution is this? A: Categorical

SLIDE 67

Expected Value: Example 2

1 2 3 4 5 6

non-uniform distribution of number of cats a normal cat person has

1/2 · 1 + 1/10 · 2 + 1/10 · 3 + 1/10 · 4 + 1/10 · 5 + 1/10 · 6 = 2.5

𝔼[X] = ∑_x x p(x)

SLIDE 68

Expected Value of a Function of a Random Variable

X ∼ p(⋅)    𝔼[X] = ∑_x x p(x)    𝔼[g(X)] = ???

SLIDE 69

Expected Value of a Function of a Random Variable

X ∼ p(⋅)    𝔼[X] = ∑_x x p(x)    𝔼[g(X)] = ∑_x g(x) p(x)

SLIDE 70

Expected Value of Function: Example

1 2 3 4 5 6

non-uniform distribution of number of cats I start with

What if each cat magically becomes two? g(k) = 2^k

𝔼[g(X)] = ∑_x g(x) p(x)

SLIDE 71

Expected Value of Function: Example

1 2 3 4 5 6

non-uniform distribution of number of cats I start with

1/2 · 2¹ + 1/10 · 2² + 1/10 · 2³ + 1/10 · 2⁴ + 1/10 · 2⁵ + 1/10 · 2⁶ = 13.4

What if each cat magically becomes two? g(k) = 2^k

𝔼[g(X)] = ∑_x g(x) p(x) = ∑_x 2^x p(x)
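The same helper idea extends to 𝔼[g(X)]; with the slide’s PMF and g(k) = 2^k it reproduces both answers (2.5 and 13.4). A sketch:

```python
def expected_value(pmf, g=lambda x: x):
    """E[g(X)] = sum over x of g(x) * p(x); g defaults to the identity."""
    return sum(g(x) * p for x, p in pmf.items())

cats = {1: 0.5, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.1}  # the slide's PMF
ev = expected_value(cats)                                # 2.5
ev_doubled = expected_value(cats, g=lambda k: 2 ** k)    # 13.4
```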

SLIDE 72

Probability Prerequisites

Basic probability axioms and definitions Joint probability Marginal probability Probabilistic Independence Definition of conditional probability Bayes rule Probability chain rule Common distributions Expected Value (of a function) of a Random Variable

SLIDE 73

Example Problem: ITILA Ex. 2.3

➢ Jo has a test for a nasty disease. We denote Jo’s state of health by the variable a (a=1: Jo has the disease; a=0 o/w) and the test result by b. ➢ The result of the test is either ‘positive’ (b = 1) or ‘negative’ (b = 0). ➢ The test is 95% reliable: in 95% of cases of people who really have the disease, a positive result is returned, and in 95% of cases of people who do not have the disease, a negative result is obtained. ➢ The final piece of background information is that 1% of people of Jo’s age and background have the disease. Q: If Jo’s test is positive, what is the probability Jo has the disease?

SLIDE 74

Example Problem: ITILA Ex. 2.3

Q: If Jo’s test is positive, what is the probability Jo has the disease? ➢ Jo has a test for a nasty disease. We denote Jo’s state of health by the variable a (a=1: Jo has the disease; a=0 o/w) and the test result by b. ➢ The result of the test is either ‘positive’ (b = 1) or ‘negative’ (b = 0). ➢ The test is 95% reliable: in 95% of cases of people who really have the disease, a positive result is returned, and in 95% of cases of people who do not have the disease, a negative result is obtained. ➢ The final piece of background information is that 1% of people of Jo’s age and background have the disease.

p(a = 1 | b = 1)

SLIDE 75

Example Problem: ITILA Ex. 2.3

Q: If Jo’s test is positive, what is the probability Jo has the disease? ➢ Jo has a test for a nasty disease. We denote Jo’s state of health by the variable a (a=1: Jo has the disease; a=0 o/w) and the test result by b. ➢ The result of the test is either ‘positive’ (b = 1) or ‘negative’ (b = 0). ➢ The test is 95% reliable: in 95% of cases of people who really have the disease, a positive result is returned, and in 95% of cases of people who do not have the disease, a negative result is obtained. ➢ The final piece of background information is that 1% of people of Jo’s age and background have the disease.

p(a = 1 | b = 1) = p(b = 1 | a = 1) p(a = 1) / p(b = 1)

p(a = 1) = 0.01: marginal of a
SLIDE 76

Example Problem: ITILA Ex. 2.3

Q: If Jo’s test is positive, what is the probability Jo has the disease? ➢ Jo has a test for a nasty disease. We denote Jo’s state of health by the variable a (a=1: Jo has the disease; a=0 o/w) and the test result by b. ➢ The result of the test is either ‘positive’ (b = 1) or ‘negative’ (b = 0). ➢ The test is 95% reliable: in 95% of cases of people who really have the disease, a positive result is returned, and in 95% of cases of people who do not have the disease, a negative result is obtained. ➢ The final piece of background information is that 1% of people of Jo’s age and background have the disease.

p(a = 1 | b = 1) = p(b = 1 | a = 1) p(a = 1) / p(b = 1)

p(a = 1) = 0.01: marginal of a

Conditionals p(b | a): p(b = 1 | a = 1) = 0.95, p(b = 0 | a = 0) = 0.95

SLIDE 77

Example Problem: ITILA Ex. 2.3

Q: If Jo’s test is positive, what is the probability Jo has the disease? ➢ Jo has a test for a nasty disease. We denote Jo’s state of health by the variable a (a=1: Jo has the disease; a=0 o/w) and the test result by b. ➢ The result of the test is either ‘positive’ (b = 1) or ‘negative’ (b = 0). ➢ The test is 95% reliable: in 95% of cases of people who really have the disease, a positive result is returned, and in 95% of cases of people who do not have the disease, a negative result is obtained. ➢ The final piece of background information is that 1% of people of Jo’s age and background have the disease.

p(a = 1 | b = 1) = p(b = 1 | a = 1) p(a = 1) / p(b = 1) = (.95 · .01) / (.95 · .01 + .05 · .99) ≈ 0.16

p(a = 1) = 0.01: marginal of a

Conditionals p(b | a): p(b = 1 | a = 1) = 0.95, p(b = 0 | a = 0) = 0.95
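The arithmetic of this example, spelled out (the variable names are mine):

```python
# ITILA Ex. 2.3: posterior that Jo has the disease given a positive test
p_a1 = 0.01            # prior: 1% of Jo's cohort has the disease
p_b1_given_a1 = 0.95   # positive test given disease
p_b0_given_a0 = 0.95   # negative test given no disease

# Marginal p(b=1), summing over both health states
p_b1 = p_b1_given_a1 * p_a1 + (1 - p_b0_given_a0) * (1 - p_a1)

# Bayes rule: p(a=1 | b=1)
posterior = p_b1_given_a1 * p_a1 / p_b1   # ~0.16
```

Despite the 95% reliable test, the posterior is only about 16% because the disease is rare in the prior.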

SLIDE 78

Probability Topics (High-Level)

Basics of Probability: Prereqs Philosophy of Probability, and Terminology Useful Quantities and Inequalities

SLIDE 79

A Bit of Philosophy and Terminology

What is a probability? Core terminology

– Support/domain – Partition function

Some principles

– Generative story – Forward probability – Inverse probability

SLIDE 80

Kinds of Statistics

Descriptive Confirmatory Predictive

The average grade on this assignment is 83.

SLIDE 81

Interpretations of Probability

Past performance 58% of the past 100 flips were heads Hypothetical performance If I flipped the coin in many parallel universes… Subjective strength of belief Would pay up to 58 cents for chance to win $1 Output of some computable formula? p(heads) vs q(heads)

SLIDE 82

Camps of Probability

Past performance 58% of the past 100 flips were heads Hypothetical performance If I flipped the coin in many parallel universes… Subjective strength of belief Would pay up to 58 cents for chance to win $1 Output of some computable formula? p(heads) vs q(heads)

Frequentists Bayesians

(my grouping, not too far off though)

SLIDE 83

Camps of Probability

Past performance 58% of the past 100 flips were heads Hypothetical performance If I flipped the coin in many parallel universes… Subjective strength of belief Would pay up to 58 cents for chance to win $1 Output of some computable formula? p(heads) vs q(heads)

Frequentists Bayesians ML People

(my grouping, not too far off though)

SLIDE 84

Camps of Probability

Past performance 58% of the past 100 flips were heads Hypothetical performance If I flipped the coin in many parallel universes… Subjective strength of belief Would pay up to 58 cents for chance to win $1 Output of some computable formula? p(heads) vs q(heads)

Frequentists Bayesians ML People

(my grouping, not too far off though)

“You cannot do inference without making assumptions.”

– ITILA, 2.2, pg 26

SLIDE 85

What do we know before we see the data, and how does that influence our modeling decisions?

General ML Consideration: Inductive Bias

Courtesy Hamed Pirsiavash

SLIDE 86

General ML Consideration: Inductive Bias

A C B D Partition these into two groups…

Courtesy Hamed Pirsiavash

What do we know before we see the data, and how does that influence our modeling decisions?

SLIDE 87

General ML Consideration: Inductive Bias

A C B D Partition these into two groups

Courtesy Hamed Pirsiavash

Who selected red vs. blue?

What do we know before we see the data, and how does that influence our modeling decisions?

SLIDE 88

General ML Consideration: Inductive Bias

A C B D Partition these into two groups

Courtesy Hamed Pirsiavash

Who selected red vs. blue? Who selected vs. ?

What do we know before we see the data, and how does that influence our modeling decisions?

SLIDE 89

General ML Consideration: Inductive Bias

A C B D Partition these into two groups

Courtesy Hamed Pirsiavash

Who selected red vs. blue? Who selected vs. ?

What do we know before we see the data, and how does that influence our modeling decisions?

Tip: Remember how your own biases/interpretation are influencing your approach

SLIDE 90

Some Terminology

Support

– The valid values a R.V. can take on – The values over which a pmf/pdf is defined

SLIDE 91

Some Terminology

Support

– The valid values a R.V. can take on – The values over which a pmf/pdf is defined

Partition function/normalization function

– The function (or constant) that ensures a p{m,d}f sums to 1

SLIDE 92

Some Terminology

Support

– The valid values a R.V. can take on – The values over which a pmf/pdf is defined

Partition function/normalization function

– The function (or constant) that ensures a p{m,d}f sums to 1

Q: What is the support for a Poisson R.V.?

SLIDE 93

Some Terminology

Support

– The valid values a R.V. can take on – The values over which a pmf/pdf is defined

Partition function/normalization function

– The function (or constant) that ensures a p{m,d}f sums to 1

Poisson

  • X ∼ Poisson(λ), λ ∈ ℝ>0 is the “rate”
  • PMF: p(X = k) = λ^k exp(−λ) / k!

Q: What is the support for a Poisson R.V.?

SLIDE 94

Some Terminology

Support

– The valid values a R.V. can take on – The values over which a pmf/pdf is defined

Partition function/normalization function

– The function (or constant) that ensures a p{m,d}f sums to 1

Poisson

  • X ∼ Poisson(λ), λ ∈ ℝ>0 is the “rate”
  • PMF: p(X = k) = λ^k exp(−λ) / k!

Q: What is the partition function/constant?

SLIDE 95

Some More Terminology

(Generative) Probabilistic Modeling Generative Story Forward probability (ITILA) Inverse probability (ITILA)

SLIDE 96

What is (Generative) Probabilistic Modeling?

So far, we’ve (mostly) had labeled data pairs (x, y), and built classifiers p(y | x)

SLIDE 97

What is (Generative) Probabilistic Modeling?

So far, we’ve (mostly) had labeled data pairs (x, y), and built classifiers p(y | x)

What if we want to model both x and y together? p(x, y)

SLIDE 98

What is (Generative) Probabilistic Modeling?

So far, we’ve (mostly) had labeled data pairs (x, y), and built classifiers p(y | x)

What if we want to model both x and y together? p(x, y)

Q/678 Recap: Where have we used p(x,y)?

SLIDE 99

What is (Generative) Probabilistic Modeling?

So far, we’ve (mostly) had labeled data pairs (x, y), and built classifiers p(y | x)

What if we want to model both x and y together? p(x, y)

Q/678 Recap: Where have we used p(x,y)? A: Linear Discriminant Analysis

SLIDE 100

What is (Generative) Probabilistic Modeling?

So far, we’ve (mostly) had labeled data pairs (x, y), and built classifiers p(y | x)

What if we want to model both x and y together? p(x, y) Or what if we only have data but no labels? p(x)

Q: Where have we used p(x,y)? A: Linear Discriminant Analysis

SLIDE 101

Generative Stories

“A useful way to develop probabilistic models is to tell a generative story. This is a fictional story that explains how you believe your training data came into existence.” --- CIML Ch 9.5

SLIDE 102

Generative Stories

Generative stories are most often used with joint models p(x, y)…. but despite their name, generative stories are applicable to both generative and conditional models

“A useful way to develop probabilistic models is to tell a generative story. This is a fictional story that explains how you believe your training data came into existence.” --- CIML Ch 9.5

SLIDE 103

p(x, y) vs. p(y | x): Models of our Data

p(x, y) is the joint distribution Two main options for estimating:

  • 1. Directly

2.

SLIDE 104

p(x, y) vs. p(y | x): Models of our Data

p(x, y) is the joint distribution Two main options for estimating:

  • 1. Directly
  • 2. Using Bayes rule: p(x, y) = p(x | y)p(y)

Using Bayes rule transparently provides a generative story for how our data x and labels y are generated

SLIDE 105

p(x,y) vs. p(y | x): Models of our Data

p(x, y) is the joint distribution Two main options for estimating: 1. Directly 2. Using Bayes rule: p(x, y) = p(x | y)p(y) Using Bayes rule transparently provides a generative story for how our data x and labels y are generated p(y | x) is the conditional distribution Two main options for estimating: 1. Directly: used when you only care about making the right prediction

Examples: perceptron, logistic regression, neural networks (we’ve covered)

2.

SLIDE 106

p(x,y) vs. p(y | x): Models of our Data

p(x, y) is the joint distribution Two main options for estimating: 1. Directly 2. Using Bayes rule: p(x, y) = p(x | y)p(y) Using Bayes rule transparently provides a generative story for how our data x and labels y are generated p(y | x) is the conditional distribution Two main options for estimating: 1. Directly: used when you only care about making the right prediction

Examples: perceptron, logistic regression, neural networks (we’ve covered)

2. Estimate the joint

SLIDE 107

Example: Rolling a Die

p(x₁, x₂, …, x_N) = p(x₁) p(x₂) ⋯ p(x_N) = ∏ᵢ p(xᵢ)

SLIDE 108

Example: Rolling a Die

p(x₁, x₂, …, x_N) = p(x₁) p(x₂) ⋯ p(x_N) = ∏ᵢ p(xᵢ)

N different (independent) rolls

x₁ = 1, x₂ = 5, x₃ = 4, ⋯

SLIDE 109

Generative Story for Rolling a Die

p(x₁, x₂, …, x_N) = p(x₁) p(x₂) ⋯ p(x_N) = ∏ᵢ p(xᵢ)

N different (independent) rolls

x₁ = 1, x₂ = 5, x₃ = 4, ⋯

for roll i = 1 to N:

Generative Story

SLIDE 110

Generative Story for Rolling a Die

p(x₁, x₂, …, x_N) = p(x₁) p(x₂) ⋯ p(x_N) = ∏ᵢ p(xᵢ)

N different (independent) rolls

x₁ = 1, x₂ = 5, x₃ = 4, ⋯

for roll i = 1 to N: xᵢ ∼ Cat(θ)

Generative Story

SLIDE 111

Generative Story for Rolling a Die

p(x₁, x₂, …, x_N) = p(x₁) p(x₂) ⋯ p(x_N) = ∏ᵢ p(xᵢ)

N different (independent) rolls

x₁ = 1, x₂ = 5, x₃ = 4, ⋯

for roll i = 1 to N: xᵢ ∼ Cat(θ)

Generative Story: the “for each” loop becomes a product; calculate p(xᵢ) according to the provided distribution

SLIDE 112

Generative Story for Rolling a Die

p(x₁, x₂, …, x_N) = p(x₁) p(x₂) ⋯ p(x_N) = ∏ᵢ p(xᵢ)

N different (independent) rolls

x₁ = 1, x₂ = 5, x₃ = 4, ⋯

for roll i = 1 to N: xᵢ ∼ Cat(θ)

Generative Story: θ is a probability distribution over the 6 sides of the die, with ∑_{k=1}^{6} θₖ = 1 and 0 ≤ θₖ ≤ 1 for all k. The “for each” loop becomes a product; calculate p(xᵢ) according to the provided distribution.
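A sketch of this generative story as code, with hypothetical die weights θ (the numbers are mine): sampling follows the story, and the joint probability is exactly the product of per-roll factors:

```python
import random

# Hypothetical weights theta over the 6 sides; they sum to 1
theta = [0.1, 0.1, 0.2, 0.2, 0.2, 0.2]

def sample_rolls(theta, N, seed=0):
    """Generative story: for roll i = 1 to N, draw x_i ~ Cat(theta)."""
    rng = random.Random(seed)
    return rng.choices(range(1, 7), weights=theta, k=N)

def joint_probability(rolls, theta):
    """The 'for each' loop becomes a product: p(x_1..x_N) = prod_i theta[x_i]."""
    prob = 1.0
    for x in rolls:
        prob *= theta[x - 1]
    return prob

rolls = sample_rolls(theta, N=3)
p = joint_probability(rolls, theta)
```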

SLIDE 113

Some More Terminology

(Generative) Probabilistic Modeling Generative Story Forward probability (ITILA) Inverse probability (ITILA)

SLIDE 114

Forward & Inverse Probabilities

Forward Probability

  • Given some data that is “generated” according to some generative model, compute a data-dependent distribution or other quantity
  • Involves probabilistic computation for things produced by the story
  • Example (ITILA Ex 2.4): Urn problem
    – Urn with B black and W white balls. For N draws with replacement, find the distribution over n_B (the number of times a black ball is drawn)
slide-115
SLIDE 115

Forward & Inverse Probabilities

Forward Probability

  • Given some data that is “generated” according to some generative model, compute a data-dependent distribution or other quantity
  • Involves probabilistic computation for things produced by the story
  • Example (ITILA Ex 2.4): Urn problem
    – Urn with B black and W white balls. For N draws with replacement, find the distribution over n_B (the number of times a black ball is drawn)

Inverse Probability

  • Given some data that is “generated” according to some generative model, compute the conditional (posterior) probability of an unobserved variable in the model
  • The typical ML learning/inference problem

slide-116
SLIDE 116

Forward & Inverse Probabilities

Forward Probability

  • Given some data that is “generated” according to some generative model, compute a data-dependent distribution or other quantity
  • Involves probabilistic computation for things produced by the story
  • Example (ITILA Ex 2.4): Urn problem
    – Urn with B black and W white balls. For N draws with replacement, find the distribution over n_B (the number of times a black ball is drawn)

Inverse Probability

  • Given some data that is “generated” according to some generative model, compute the conditional (posterior) probability of an unobserved variable in the model
  • The typical ML learning/inference problem
  • Relies on Bayes rule
    – p(latent | obs) ∝ p(obs | latent) p(latent)
  • Example (ITILA Ex 2.6)
    – Multiple urns, each with their own number of black and white balls
    – N balls are drawn, but the selected urn is unobserved/not given
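The forward-probability urn example can be computed exactly: with per-draw black probability f = B/(B+W), the count n_B is Binomial(N, f). A small sketch (the B, W, N values are illustrative):

```python
from math import comb

B, W, N = 3, 7, 10          # illustrative urn: 3 black, 7 white balls, 10 draws
f = B / (B + W)             # probability a single draw is black

# Forward probability: the distribution over n_B, the number of black draws,
# is Binomial(N, f) because draws are with replacement (independent)
p_nB = [comb(N, k) * f**k * (1 - f) ** (N - k) for k in range(N + 1)]
```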

slide-117
SLIDE 117

Probability Topics (High-Level)

Basics of Probability: Prereqs Philosophy of Probability, and Terminology Useful Quantities and Inequalities

slide-118
SLIDE 118

Probabilistic Quantities

  • Many quantities involve expectations
  • Difficulty level varies:
    – Sometimes, they’re easy to compute
    – Sometimes, they look hard to compute but are easy
    – Sometimes, they’re hard to compute

slide-119
SLIDE 119

Probabilistic Quantities

  • Many quantities involve expectations
  • Difficulty level varies:
    – Sometimes, they’re easy to compute
    – Sometimes, they look hard to compute but are easy
    – Sometimes, they’re hard to compute

Exponential family formalism helps here (we’ll come back to this later)

slide-120
SLIDE 120

Entropy: H(X) = 𝔼_p[−log p(X)]

slide-121
SLIDE 121

Entropy: H(X) = 𝔼_p[−log p(X)] = −∑_x p(x) log p(x)

Discrete RV: marginalize over the support of p

slide-122
SLIDE 122

Entropy: H(X) = 𝔼_p[−log p(X)] = −∑_x p(x) log p(x) = −∫ p(x) log p(x) dx

Discrete RV (sum); continuous RV (integral)
slide-123
SLIDE 123

Entropy

  • H(X) ≥ 0
  • By convention, for any x s.t. p(x) = 0, p(x) log p(x) = 0
  • Sometimes written as H(p)
  • Low entropy → “peaky” distribution
  • High entropy → more uniform distribution

H(X) = 𝔼_p[−log p(X)]

slide-124
SLIDE 124

Entropy

  • H(X) ≥ 0
  • By convention, for any x s.t. p(x) = 0, p(x) log p(x) = 0
  • Sometimes written as H(p)
  • Low entropy → “peaky” distribution
  • High entropy → more uniform distribution

H(X) = 𝔼_p[−log p(X)]

Ex: If p is a Bernoulli distribution, what is H(p)?
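The Bernoulli question can be explored numerically; a short sketch of H(p) = −∑_x p(x) log p(x) (natural log), contrasting uniform, peaky, and deterministic cases:

```python
import math

def entropy(p):
    # H(p) = -sum_x p(x) log p(x), with 0 log 0 = 0 by convention
    return -sum(px * math.log(px) for px in p if px > 0)

# Bernoulli(q) has H = -q log q - (1 - q) log(1 - q), maximized at q = 0.5
uniform = entropy([0.5, 0.5])   # most uniform -> highest entropy (log 2)
peaky = entropy([0.9, 0.1])     # "peaky" -> lower entropy
det = entropy([1.0, 0.0])       # deterministic -> zero entropy
```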

slide-125
SLIDE 125

Joint Entropy

For X, Y ∼ p: H(X, Y) = 𝔼_p[−log p(X, Y)]

Q: If X & Y are independent, what is H(X,Y)?

slide-126
SLIDE 126

Joint Entropy

Q: If X & Y are independent, what is H(X,Y)? A: H(X) + H(Y)

For X, Y ∼ p: H(X, Y) = 𝔼_p[−log p(X, Y)]
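The answer H(X) + H(Y) can be checked numerically; a sketch with two arbitrary independent discrete distributions:

```python
import math
from itertools import product

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0)

px = [0.2, 0.8]           # distribution of X
py = [0.5, 0.3, 0.2]      # distribution of Y

# Independence: p(x, y) = p(x) p(y), so H(X, Y) = H(X) + H(Y)
pxy = [a * b for a, b in product(px, py)]
gap = entropy(pxy) - (entropy(px) + entropy(py))
```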

slide-127
SLIDE 127

Kullback-Leibler (KL) Divergence

  • Measures how “dissimilar” two distributions are
  • D_KL(p ‖ q) ≥ 0
    – D_KL = 0 iff p = q
    – Higher D_KL → more dissimilar
  • KL is not symmetric
    – D_KL(p ‖ q) ≠ D_KL(q ‖ p)

D_KL(p ‖ q) = 𝔼_p[log (p(x) / q(x))]

slide-128
SLIDE 128

Kullback-Leibler (KL) Divergence

  • Measures how “dissimilar” two distributions are
  • D_KL(p ‖ q) ≥ 0
    – D_KL = 0 iff p = q
    – Higher D_KL → more dissimilar
  • KL is not symmetric
    – D_KL(p ‖ q) ≠ D_KL(q ‖ p)

D_KL(p ‖ q) = 𝔼_p[log (p(x) / q(x))] = ∑_x p(x) log (p(x) / q(x)) = ∫ p(x) log (p(x) / q(x)) dx

Discrete RV (sum); continuous RV (integral)
slide-129
SLIDE 129

Kullback-Leibler (KL) Divergence

  • Measures how “dissimilar” two distributions are
  • D_KL(p ‖ q) ≥ 0
    – D_KL = 0 iff p = q
    – Higher D_KL → more dissimilar
  • KL is not symmetric
    – D_KL(p ‖ q) ≠ D_KL(q ‖ p)

D_KL(p ‖ q) = 𝔼_p[log (p(x) / q(x))] = ∑_x p(x) log (p(x) / q(x)) = ∫ p(x) log (p(x) / q(x)) dx

Discrete RV (sum); continuous RV (integral)

Ex 1: D_KL(p ‖ q) if p & q are both distributions for rolling a die; one is uniform, one has low entropy

slide-130
SLIDE 130

Kullback-Leibler (KL) Divergence

  • Measures how “dissimilar” two distributions are
  • D_KL(p ‖ q) ≥ 0
    – D_KL = 0 iff p = q
    – Higher D_KL → more dissimilar
  • KL is not symmetric
    – D_KL(p ‖ q) ≠ D_KL(q ‖ p)

D_KL(p ‖ q) = 𝔼_p[log (p(x) / q(x))] = ∑_x p(x) log (p(x) / q(x)) = ∫ p(x) log (p(x) / q(x)) dx

Discrete RV (sum); continuous RV (integral)

Ex 1: D_KL(p ‖ q) if p & q are both distributions for rolling a die; one is uniform, one has low entropy
Ex 2: D_KL(p ‖ q) if p & q are both Gamma distributions
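Ex 1 can be worked numerically; a sketch of the discrete KL sum for a fair die against a low-entropy die (the peaky weights are illustrative), which also exhibits the asymmetry:

```python
import math

def kl(p, q):
    # D_KL(p || q) = sum_x p(x) log(p(x) / q(x)); terms with p(x) = 0 contribute 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

uniform = [1 / 6] * 6          # fair die
peaky = [0.75] + [0.05] * 5    # low-entropy die (illustrative weights)

forward = kl(uniform, peaky)
reverse = kl(peaky, uniform)   # generally != forward: KL is not symmetric
```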

slide-131
SLIDE 131

Outline

Basics of Learning Probability Maximum Likelihood Estimation

slide-132
SLIDE 132

Learning: Maximum Likelihood Estimation (MLE)

Core concept in intro statistics:

  • Observe some data X
  • Compute some distribution g(X) to {predict, explain, generate} X
  • Assume g is controlled by parameters φ, i.e., g_φ(X)
    – Sometimes written g(X; φ)
  • Learning appropriate value(s) of φ allows you to GENERALIZE about X

slide-133
SLIDE 133

Learning: Maximum Likelihood Estimation (MLE)

Core concept in intro statistics:

  • Observe some data X
  • Compute some distribution g(X) to {predict, explain, generate} X
  • Assume g is controlled by parameters φ, i.e., g_φ(X)
    – Sometimes written g(X; φ)
  • Learning appropriate value(s) of φ allows you to GENERALIZE about X

How do we “learn appropriate value(s) of φ”?

Many different options: a common one is maximum likelihood estimation (MLE)

  • Find values φ s.t. g_φ(X = {x_1, …, x_N}) is maximized
  • Independence assumptions are very useful here!
  • Logarithms are also useful!
slide-134
SLIDE 134

Learning: Maximum Likelihood Estimation (MLE)

Core concept in intro statistics:

  • Observe some data X
  • Compute some distribution g(X) to {predict, explain, generate} X
  • Assume g is controlled by parameters φ, i.e., g_φ(X)
    – Sometimes written g(X; φ)
  • MLE: Find values φ s.t. g_φ(X = {x_1, …, x_N}) is maximized

Example: How much does it snow?

  • X = x_1, x_2, …, x_N are snowfall values from the previous N storms
  • Goal: learn φ such that g correctly models, as accurately as possible, the amount of snow likely

slide-135
SLIDE 135

Learning: Maximum Likelihood Estimation (MLE)

Core concept in intro statistics:

  • Observe some data X
  • Compute some distribution g(X) to {predict, explain, generate} X
  • Assume g is controlled by parameters φ, i.e., g_φ(X)
    – Sometimes written g(X; φ)
  • MLE: Find values φ s.t. g_φ(X = {x_1, …, x_N}) is maximized

Example: How much does it snow?

  • X = x_1, x_2, …, x_N are snowfall values from the previous N storms
  • Goal: learn φ such that g correctly models, as accurately as possible, the amount of snow likely
  • Assumption: each x_j is independent from all others

max_φ ∑_{j=1}^{N} log g_φ(x_j)

slide-136
SLIDE 136

MLE Snowfall Example

Example: How much does it snow?

  • X = x_1, x_2, …, x_N are snowfall values from the previous N storms
  • Goal: learn φ such that g correctly models, as accurately as possible, the amount of snow likely
  • Assumption: each x_j is independent from all others

max_φ ∑_{j=1}^{N} log g_φ(x_j)

Q: Why is taking logarithms okay?
Q: What other assumptions, or decisions, do we need to make?

slide-137
SLIDE 137

MLE Snowfall Example

Example: How much does it snow?

  • X = x_1, x_2, …, x_N are snowfall values from the previous N storms
  • Goal: learn φ such that g correctly models, as accurately as possible, the amount of snow likely
  • Assumption: each x_j is independent from all others, but all from g

max_φ ∑_{j=1}^{N} log g_φ(x_j)

Q: Why is taking logarithms okay?
Q: What other assumptions, or decisions, do we need to make? x_j is positive, real-valued. What’s a faithful probability distribution for x_j?

  • Normal? ✘
  • Gamma? ✓
  • Exponential? ✓
  • Bernoulli? ✘
  • Poisson? ✘
slide-138
SLIDE 138

MLE Snowfall Example

Example: How much does it snow?

  • X = x_1, x_2, …, x_N are snowfall values from the previous N storms
  • Goal: learn φ such that g correctly models, as accurately as possible, the amount of snow likely
  • Assumption: each x_j is independent from all others, but all from g

max_φ ∑_{j=1}^{N} log g_φ(x_j)

Q: Why is taking logarithms okay?
Q: What other assumptions, or decisions, do we need to make? x_j is positive, real-valued. What’s a faithful probability distribution for x_j?

  • Normal? ✘
  • Gamma? ✓
  • Exponential? ✓
  • Bernoulli? ✘
  • Poisson? ✘

Gamma: p(X = x) = x^{k−1} exp(−x/θ) / (θ^k Γ(k))

slide-139
SLIDE 139

MLE Snowfall Example

Example: How much does it snow?

  • X = x_1, x_2, …, x_N are snowfall values from the previous N storms
  • Goal: learn φ such that g correctly models, as accurately as possible, the amount of snow likely
  • Assumption: each x_j is independent from all others, but all from g

max_φ ∑_{j=1}^{N} log g_φ(x_j)

Q: Why is taking logarithms okay?
Q: What other assumptions, or decisions, do we need to make? x_j is positive, real-valued. What’s a faithful / nice-to-compute-and-good-enough probability distribution for x_j?

  • Normal? ✘ ✓
  • Gamma? ✓ ?
  • Exponential? ✓ ?
  • Bernoulli? ✘ ✘
  • Poisson? ✘ ✘

Normal: p(X = x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))

slide-140
SLIDE 140

MLE Snowfall Example

Example: How much does it snow?

  • X = x_1, x_2, …, x_N are snowfall values from the previous N storms
  • Goal: learn φ such that g correctly models, as accurately as possible, the amount of snow likely
  • Assumption: each x_j is independent from all others, but all from g

max_φ ∑_{j=1}^{N} log g_φ(x_j)    with x_j ∼ Normal(μ, σ²):

max_{(μ, σ²)} ∑_{j=1}^{N} log Normal_{μ,σ²}(x_j) =

slide-141
SLIDE 141

MLE Snowfall Example

Example: How much does it snow?

  • X = x_1, x_2, …, x_N are snowfall values from the previous N storms
  • Goal: learn φ such that g correctly models, as accurately as possible, the amount of snow likely
  • Assumption: each x_j is independent from all others, but all from g

max_φ ∑_{j=1}^{N} log g_φ(x_j)    with x_j ∼ Normal(μ, σ²):

max_{(μ, σ²)} ∑_{j=1}^{N} log Normal_{μ,σ²}(x_j) = max_{(μ, σ²)} ∑_{j=1}^{N} −(x_j − μ)² / (2σ²) − N log σ = F

slide-142
SLIDE 142

MLE Snowfall Example

Example: How much does it snow?

  • X = x_1, x_2, …, x_N are snowfall values from the previous N storms
  • Goal: learn φ such that g correctly models, as accurately as possible, the amount of snow likely
  • Assumption: each x_j is independent from all others, but all from g

max_φ ∑_{j=1}^{N} log g_φ(x_j)    with x_j ∼ Normal(μ, σ²):

max_{(μ, σ²)} ∑_{j=1}^{N} log Normal_{μ,σ²}(x_j) = max_{(μ, σ²)} ∑_{j=1}^{N} −(x_j − μ)² / (2σ²) − N log σ = F

Q: How do we find μ, σ²?

slide-143
SLIDE 143

MLE Snowfall Example

Example: How much does it snow?

  • X = x_1, x_2, …, x_N are snowfall values from the previous N storms
  • Goal: learn φ such that g correctly models, as accurately as possible, the amount of snow likely
  • Assumption: each x_j is independent from all others, but all from g

max_φ ∑_{j=1}^{N} log g_φ(x_j)    with x_j ∼ Normal(μ, σ²):

max_{(μ, σ²)} ∑_{j=1}^{N} log Normal_{μ,σ²}(x_j) = max_{(μ, σ²)} ∑_{j=1}^{N} −(x_j − μ)² / (2σ²) − N log σ = F

Q: How do we find μ, σ²? A: Differentiate and find that

μ̂ = (∑_j x_j) / N    σ̂² = (∑_j (x_j − μ̂)²) / N

slide-144
SLIDE 144

Learning: Maximum Likelihood Estimation (MLE)

Central to machine learning:

  • Observe some data (X, Y)
  • Compute some function f(X) to {predict, explain, generate} Y
  • Assume f is controlled by parameters θ, i.e., f_θ(X)
    – Sometimes written f(X; θ)

slide-145
SLIDE 145

Learning: Maximum Likelihood Estimation (MLE)

Central to machine learning:

  • Observe some data (X, Y)
  • Compute some function f(X) to {predict, explain, generate} Y
  • Assume f is controlled by parameters θ, i.e., f_θ(X)
    – Sometimes written f(X; θ)
  • Parameters are learned to minimize error (loss) ℓ:

min_θ ℓ(Y*, f_θ(X))

slide-146
SLIDE 146

Learning: Maximum Likelihood Estimation (MLE)

Central to machine learning:

  • Observe some data (X, Y)
  • Compute some function f(X) to {predict, explain, generate} Y
  • Assume f is controlled by parameters θ, i.e., f_θ(X)
    – Sometimes written f(X; θ)
  • Parameters are learned to minimize error (loss) ℓ

Seen in CMSC 678: linear regression, Naïve Bayes, logistic regression, neural networks, SVMs, PCA, k-means, …

slide-147
SLIDE 147

Learning: Maximum Likelihood Estimation (MLE)

Central to machine learning:

  • Observe some data (X, Y)
  • Compute some function f(X) to {predict, explain, generate} Y
  • Assume f is controlled by parameters θ, i.e., f_θ(X)
    – Sometimes written f(X; θ)
  • Parameters are learned to minimize error (loss) ℓ

Seen in CMSC 678: linear regression, Naïve Bayes, logistic regression, neural networks, SVMs, PCA, k-means, … We’ll get back to this in more depth on Wednesday

slide-148
SLIDE 148

Learning: Maximum Likelihood Estimation (MLE)

Example: Can I sleep in the next time it snows/is school canceled?

  • X = x_1, x_2, …, x_N are snowfall values from the previous N storms
  • Y = y_1, y_2, …, y_N are closure results from the previous N storms
  • Goal: learn θ such that f correctly predicts, as accurately as possible, whether UMBC will close in the next storm: y_{N+1} from x_{N+1}
  • If we assume the output of f is a probability distribution on Y | X…
    ➢ f(X) → {p(yes | X), p(no | X)}
  • Then re: θ, {predicting, explaining, generating} Y means… what?

slide-149
SLIDE 149

Learning: Maximum Likelihood Estimation (MLE)

Example: Can I sleep in the next time it snows/is school canceled?

  • X = x_1, x_2, …, x_N are snowfall values from the previous N storms
  • Y = y_1, y_2, …, y_N are closure results from the previous N storms
  • Goal: learn θ such that f correctly predicts, as accurately as possible, whether UMBC will close in the next storm: y_{N+1} from x_{N+1}
  • If we assume the output of f is a probability distribution on Y | X…
  • Then re: θ, {predicting, explaining, generating} Y means… what?

slide-150
SLIDE 150

Learning: Maximum Likelihood Estimation (MLE)

Example: Can I sleep in the next time it snows/is school canceled?

  • X = x_1, x_2, …, x_N are snowfall values from the previous N storms
  • Y = y_1, y_2, …, y_N are closure results from the previous N storms
  • Goal: learn θ such that f correctly predicts, as accurately as possible, whether UMBC will close in the next storm: y_{N+1} from x_{N+1}
  • If we assume the output of f is a probability distribution on Y | X…
  • Then re: θ, {predicting, explaining, generating} Y means finding a value for θ that maximizes the probability of Y given X, according to f

slide-151
SLIDE 151

Learning: Maximum Likelihood Estimation (MLE)

Example: Can I sleep in the next time it snows/is school canceled?

  • X = x_1, x_2, …, x_N are snowfall values from the previous N storms
  • Y = y_1, y_2, …, y_N are closure results from the previous N storms
  • Goal: learn θ such that f correctly predicts, as accurately as possible, whether UMBC will close in the next storm: y_{N+1} from x_{N+1}
  • If we assume the output of f is a probability distribution on Y | X…
  • Then re: θ, {predicting, explaining, generating} Y means finding a value for θ that maximizes the probability of Y given X, according to f

max_θ f_θ(x) → max_θ p(Y | X)

slide-152
SLIDE 152

Learning: Maximum Likelihood Estimation (MLE)

Example: Can I sleep in the next time it snows/is school canceled?

  • X = x_1, x_2, …, x_N are snowfall values from the previous N storms
  • Y = y_1, y_2, …, y_N are closure results from the previous N storms
  • Goal: learn θ such that f correctly predicts, as accurately as possible, whether UMBC will close in the next storm: y_{N+1} from x_{N+1}
  • If we assume the output of f is a probability distribution on Y | X…
  • Then re: θ, {predicting, explaining, generating} Y means finding a value for θ that maximizes the probability of Y given X, according to f

max_θ f_θ(x) → max_θ p(Y | X)

We’ll get back to this in more depth in next few days

slide-153
SLIDE 153

Learning: Maximum Likelihood Estimation (MLE)

Example: Can I sleep in the next time it snows/is school canceled?

  • X = x_1, x_2, …, x_N are snowfall values from the previous N storms
  • Y = y_1, y_2, …, y_N are closure results from the previous N storms
  • Goal: learn θ such that f correctly predicts, as accurately as possible, whether UMBC will close in the next storm: y_{N+1} from x_{N+1}

The 678 approach focused most on Y. What if we also care about X?

slide-154
SLIDE 154

Learning: Maximum Likelihood Estimation (MLE)

Example: Can I sleep in the next time it snows/is school canceled?

  • X = x_1, x_2, …, x_N are snowfall values from the previous N storms
  • Y = y_1, y_2, …, y_N are closure results from the previous N storms
  • Goal: learn θ such that f correctly predicts, as accurately as possible, whether UMBC will close in the next storm: y_{N+1} from x_{N+1}
  • Assume f is a probability distribution on Y | X
  • [Change] Assume there is g, a probability distribution on X

slide-155
SLIDE 155

Learning: Maximum Likelihood Estimation (MLE)

Example: Can I sleep in the next time it snows/is school canceled?

  • X = x_1, x_2, …, x_N are snowfall values from the previous N storms
  • Y = y_1, y_2, …, y_N are closure results from the previous N storms
  • Goal: learn θ such that f correctly predicts, as accurately as possible, whether UMBC will close in the next storm: y_{N+1} from x_{N+1}
  • Assume f is a probability distribution on Y | X
  • Assume there is g, a probability distribution on X
  • We also need to learn the distribution g

slide-156
SLIDE 156

Learning: Maximum Likelihood Estimation (MLE)

Example: Can I sleep in the next time it snows/is school canceled?

  • X = x_1, x_2, …, x_N are snowfall values from the previous N storms
  • Y = y_1, y_2, …, y_N are closure results from the previous N storms
  • Goal: learn θ such that f correctly predicts, as accurately as possible, whether UMBC will close in the next storm: y_{N+1} from x_{N+1}
  • Assume f is a probability distribution on Y | X
  • Assume there is g, a probability distribution on X
  • We also need to learn the distribution g

Core design problem: how does f use g? This is task-dependent!

slide-157
SLIDE 157

Outline

Basics of Learning Probability Maximum Likelihood Estimation

slide-158
SLIDE 158

Extended examples of MLE

slide-159
SLIDE 159

Learning Parameters for the Die Model

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

maximize (log-) likelihood to learn the probability parameters

Q: Why is maximizing log- likelihood a reasonable thing to do?

slide-160
SLIDE 160

Learning Parameters for the Die Model

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

maximize (log-) likelihood to learn the probability parameters

Q: Why is maximizing log- likelihood a reasonable thing to do? A: Develop a good model for what we observe

slide-161
SLIDE 161

Learning Parameters for the Die Model: Maximum Likelihood (Intuition)

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

maximize (log-) likelihood to learn the probability parameters

p(1) = ? p(3) = ? p(5) = ? p(2) = ? p(4) = ? p(6) = ?

If you observe these 9 rolls… …what are “reasonable” estimates for p(w)?

slide-162
SLIDE 162

Learning Parameters for the Die Model: Maximum Likelihood (Intuition)

p(1) = 2/9 p(3) = 1/9 p(5) = 1/9 p(2) = 1/9 p(4) = 3/9 p(6) = 1/9 maximum likelihood estimates

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

maximize (log-) likelihood to learn the probability parameters

If you observe these 9 rolls… …what are “reasonable” estimates for p(w)?

slide-163
SLIDE 163

Learning Parameters for the Die Model: Maximum Likelihood (Math)

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

N different (independent) rolls

x_1 = 1, x_2 = 5, x_3 = 4, ⋯

for roll j = 1 to N:
  x_j ∼ Cat(θ)

Generative Story

ℒ(θ) = ∑_j log p_θ(x_j) = ∑_j log θ_{x_j}

Maximize Log-likelihood

slide-164
SLIDE 164

Learning Parameters for the Die Model: Maximum Likelihood (Math)

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

N different (independent) rolls

for roll j = 1 to N:
  x_j ∼ Cat(θ)

Generative Story

ℒ(θ) = ∑_j log θ_{x_j}

Maximize Log-likelihood. Q: What’s an easy way to maximize this, as written exactly (even without calculus)?

slide-165
SLIDE 165

Learning Parameters for the Die Model: Maximum Likelihood (Math)

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

N different (independent) rolls

for roll j = 1 to N:
  x_j ∼ Cat(θ)

Generative Story

ℒ(θ) = ∑_j log θ_{x_j}

Maximize Log-likelihood. Q: What’s an easy way to maximize this, as written exactly (even without calculus)? A: Just keep increasing θ_k (we know θ must be a distribution, but that constraint is not specified in the objective as written)

slide-166
SLIDE 166

Learning Parameters for the Die Model: Maximum Likelihood (Math)

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

N different (independent) rolls

ℒ(θ) = ∑_j log θ_{x_j}    s.t. ∑_{k=1}^{6} θ_k = 1

Maximize Log-likelihood (with distribution constraints)

(we can include the inequality constraints 0 ≤ θ_k, but it complicates the problem and, right now, is not needed)

solve using Lagrange multipliers

slide-167
SLIDE 167

Learning Parameters for the Die Model: Maximum Likelihood (Math)

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

N different (independent) rolls

ℱ(θ) = ∑_j log θ_{x_j} − λ (∑_{k=1}^{6} θ_k − 1)

Maximize Log-likelihood (with distribution constraints)

(we can include the inequality constraints 0 ≤ θ_k, but it complicates the problem and, right now, is not needed)

∂ℱ(θ)/∂θ_k = ∑_{j: x_j = k} 1/θ_{x_j} − λ    ∂ℱ(θ)/∂λ = −∑_{k=1}^{6} θ_k + 1

slide-168
SLIDE 168

Learning Parameters for the Die Model: Maximum Likelihood (Math)

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

N different (independent) rolls

ℱ(θ) = ∑_j log θ_{x_j} − λ (∑_{k=1}^{6} θ_k − 1)

Maximize Log-likelihood (with distribution constraints)

(we can include the inequality constraints 0 ≤ θ_k, but it complicates the problem and, right now, is not needed)

θ_k = (∑_{j: x_j = k} 1) / λ

optimal λ when ∑_{k=1}^{6} θ_k = 1

slide-169
SLIDE 169

Learning Parameters for the Die Model: Maximum Likelihood (Math)

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

N different (independent) rolls

ℱ(θ) = ∑_j log θ_{x_j} − λ (∑_{k=1}^{6} θ_k − 1)

Maximize Log-likelihood (with distribution constraints)

(we can include the inequality constraints 0 ≤ θ_k, but it complicates the problem and, right now, is not needed)

θ_k = (∑_{j: x_j = k} 1) / (∑_k ∑_{j: x_j = k} 1) = N_k / N

optimal λ when ∑_{k=1}^{6} θ_k = 1
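The closed form θ_k = N_k / N is just count-and-normalize; a sketch using a hypothetical roll sequence consistent with the counts from the earlier intuition slide (2/9, 1/9, 1/9, 3/9, 1/9, 1/9):

```python
from collections import Counter

# Hypothetical sequence of 9 rolls matching the counts on the intuition slide
rolls = [1, 5, 4, 1, 4, 4, 2, 3, 6]
N = len(rolls)

# MLE for Cat(theta): theta_k = N_k / N (count of face k over total rolls)
counts = Counter(rolls)
theta_hat = {k: counts[k] / N for k in range(1, 7)}
```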

slide-170
SLIDE 170

Example: Conditionally Rolling a Die

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = p(z_1) p(x_1 | z_1) ⋯ p(z_N) p(x_N | z_N) = ∏_j p(x_j | z_j) p(z_j)

add complexity to better explain what we see

slide-171
SLIDE 171

Example: Conditionally Rolling a Die

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = p(z_1) p(x_1 | z_1) ⋯ p(z_N) p(x_N | z_N) = ∏_j p(x_j | z_j) p(z_j)

z_1 = T, z_2 = H, ⋯

First flip a coin…

add complexity to better explain what we see

slide-172
SLIDE 172

Example: Conditionally Rolling a Die

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = p(z_1) p(x_1 | z_1) ⋯ p(z_N) p(x_N | z_N) = ∏_j p(x_j | z_j) p(z_j)

add complexity to better explain what we see

z_1 = T, x_1 = 1, z_2 = H, x_2 = 5, ⋯

First flip a coin… …then roll a different die depending on the coin flip

slide-173
SLIDE 173

Learning in Conditional Die Roll Model: Maximize (Log-)Likelihood

p(x_1, x_2, …, x_N) = p(x_1) p(x_2) ⋯ p(x_N) = ∏_j p(x_j)

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = p(z_1) p(x_1 | z_1) ⋯ p(z_N) p(x_N | z_N) = ∏_j p(x_j | z_j) p(z_j)

add complexity to better explain what we see

If you observe the z_j values, this is easy!

slide-174
SLIDE 174

Learning in Conditional Die Roll Model: Maximize (Log-)Likelihood

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = ∏_j p(x_j | z_j) p(z_j)

If you observe the z_j values, this is easy!

First: Write the Generative Story

λ = distribution over the coin (z)
γ^(H) = distribution for the die when the coin comes up heads
γ^(T) = distribution for the die when the coin comes up tails
for item j = 1 to N:
  z_j ∼ Bernoulli(λ)
  x_j ∼ Cat(γ^(z_j))

slide-175
SLIDE 175

Learning in Conditional Die Roll Model: Maximize (Log-)Likelihood

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = ∏_j p(x_j | z_j) p(z_j)

If you observe the z_j values, this is easy!

First: Write the Generative Story

λ = distribution over the coin (z)
γ^(H) = distribution for the H die
γ^(T) = distribution for the T die
for item j = 1 to N:
  z_j ∼ Bernoulli(λ)
  x_j ∼ Cat(γ^(z_j))

Second: Generative Story → Objective

ℱ = ∑_{j=1}^{N} (log λ_{z_j} + log γ^{(z_j)}_{x_j}) − η (∑_{k=1}^{2} λ_k − 1) − ∑_{k=1}^{2} δ_k (∑_{m=1}^{6} γ^{(k)}_m − 1)

Lagrange multiplier constraints

slide-176
SLIDE 176

Learning in Conditional Die Roll Model: Maximize (Log-)Likelihood

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = ∏_j p(x_j | z_j) p(z_j)

If you observe the z_j values, this is easy!

First: Write the Generative Story

λ = distribution over the coin (z)
γ^(H) = distribution for the H die
γ^(T) = distribution for the T die
for item j = 1 to N:
  z_j ∼ Bernoulli(λ)
  x_j ∼ Cat(γ^(z_j))

Second: Generative Story → Objective

ℱ = ∑_{j=1}^{N} (log λ_{z_j} + log γ^{(z_j)}_{x_j}) − η (∑_{k=1}^{2} λ_k − 1) − ∑_{k=1}^{2} δ_k (∑_{m=1}^{6} γ^{(k)}_m − 1)

slide-177
SLIDE 177

Learning in Conditional Die Roll Model: Maximize (Log-)Likelihood

p(z_1, x_1, z_2, x_2, …, z_N, x_N) = ∏_j p(x_j | z_j) p(z_j)

If you observe the z_j values, this is easy!

First: Write the Generative Story

λ = distribution over the coin (z)
γ^(H) = distribution for the H die
γ^(T) = distribution for the T die
for item j = 1 to N:
  z_j ∼ Bernoulli(λ)
  x_j ∼ Cat(γ^(z_j))

Second: Generative Story → Objective

ℱ = ∑_{j=1}^{N} (log λ_{z_j} + log γ^{(z_j)}_{x_j}) − η (∑_{k=1}^{2} λ_k − 1) − ∑_{k=1}^{2} δ_k (∑_{m=1}^{6} γ^{(k)}_m − 1)

But if you don’t observe the z_j values, this is not easy!
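When the z_j are observed, the objective above separates, and the MLE is again count-and-normalize: λ_z = N_z / N for the coin and γ^(z)_k = N_{z,k} / N_z for each die. A sketch on hypothetical fully observed (z_j, x_j) pairs:

```python
from collections import Counter

# Hypothetical fully observed data: (coin flip z_j, die roll x_j) pairs
data = [("H", 1), ("T", 5), ("H", 4), ("H", 1), ("T", 3),
        ("H", 4), ("T", 5), ("H", 2), ("T", 6), ("H", 4)]
N = len(data)

# Coin MLE: lambda_z = N_z / N
coin_counts = Counter(z for z, _ in data)
lam = {z: coin_counts[z] / N for z in ("H", "T")}

# Per-coin die MLE: gamma^(z)_k = N_{z,k} / N_z
gamma = {z: {k: sum(1 for zz, x in data if zz == z and x == k) / coin_counts[z]
             for k in range(1, 7)}
         for z in ("H", "T")}
```

With unobserved z_j the counts N_{z,k} are unavailable, which is why this factored solution no longer applies and something like EM is needed.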