15-780 Graduate Artificial Intelligence: Probabilistic modeling (PowerPoint PPT Presentation)



slide-1
SLIDE 1

15-780 – Graduate Artificial Intelligence: Probabilistic modeling

  • J. Zico Kolter (this lecture) and Nihar Shah

Carnegie Mellon University Spring 2020

1

slide-2
SLIDE 2

Outline

  • Probability in AI
  • Background on probability
  • Common distributions
  • Maximum likelihood estimation
  • Probabilistic graphical models

2

slide-3
SLIDE 3

Outline

  • Probability in AI
  • Background on probability
  • Common distributions
  • Maximum likelihood estimation
  • Probabilistic graphical models

3

slide-4
SLIDE 4

Probability in AI

Basic idea: the real world is probabilistic (at least at the level we can observe it), and our reasoning about it needs to be too. The shift from “logical” to “probabilistic” AI systems (circa the 80s and 90s) represented a revolution in AI. Probabilistic approaches are now intertwined with virtually all areas of AI, though research in, e.g., “pure” probabilistic graphical models has declined a bit in recent years in favor of neural-network-based generative models

4

slide-5
SLIDE 5

Example: topic modeling

Can we learn about the content of text documents just by reading through them and seeing what sorts of words “co-occur”? Figure from (Blei et al., 2011) demonstrates words and topics recovered from reading 17,000 Science articles

5

[Figure: most frequent words for four inferred topics (“Genetics”, “Evolution”, “Disease”, “Computers”), and the distribution of topic probabilities for an example article.]

slide-6
SLIDE 6

Example: biological networks

Can we automatically determine how the presence or absence of some proteins in a cell affects others? Figure from (Sachs et al., 2005) shows an automatically inferred protein probability network, which captured most of the known interactions using data-driven methods (far less manual effort than previous methods)

6

slide-7
SLIDE 7

Outline

  • Probability in AI
  • Background on probability
  • Common distributions
  • Maximum likelihood estimation
  • Probabilistic graphical models

7

slide-8
SLIDE 8

Basics of probability

A probability space is a tuple (Ω, ℱ, P) where

  • The sample space Ω is a set of outcomes
  • The event space ℱ is a σ-algebra of subsets of Ω
  • The probability measure P : ℱ → [0,1] is a countably additive positive measure with P(Ω) = 1

A random variable is a measurable function X : Ω → ℝ (or ℝⁿ for a random vector), such that for all Borel sets B, P(X ∈ B) = P({ω ∈ Ω : X(ω) ∈ B})

8

slide-9
SLIDE 9

Basics of probability

A probability space is a tuple (Ω, ℱ, P) where

  • The sample space Ω is a set of outcomes
  • The event space ℱ is a σ-algebra of subsets of Ω
  • The probability measure P : ℱ → [0,1] is a countably additive positive measure with P(Ω) = 1

A random variable is a measurable function X : Ω → ℝ (or ℝⁿ for a random vector), such that for all Borel sets B, P(X ∈ B) = P({ω ∈ Ω : X(ω) ∈ B}) 😲😴😲😴

9

slide-10
SLIDE 10

Random variables

A random variable (informally) is a variable whose value is not initially known. Instead, it can take on different values (including possibly infinitely many), and must take on exactly one of these values, each with an associated probability, which all together sum to one. Example: “Weather” takes values sunny, rainy, cloudy, snowy, with p(Weather = sunny) = 0.3, p(Weather = rainy) = 0.2, … Slightly different notation applies to continuous random variables, which we will discuss shortly

10

slide-11
SLIDE 11

Notation for random variables

In this lecture, we use upper case letters, X_i, to denote random variables. For a random variable X_i taking values 1, 2, 3,

p(X_i) = [0.1  0.5  0.4]

represents the set of probabilities for each value that X_i can take on (this is a function mapping values of X_i to numbers that sum to one). Conversely, we will use lower case x_i to denote a specific value of X_i (i.e., for the above example x_i ∈ {1,2,3}), and p(X_i = x_i), or just p(x_i), refers to a number (the corresponding entry of p(X_i))

11

slide-12
SLIDE 12

Examples of probability notation

Given two random variables: X_1 with values in {1,2,3} and X_2 with values in {1,2}:

p(X_1, X_2) refers to the joint distribution, i.e., a set of 6 values, one for each setting of the variables (i.e., a function mapping (1,1), (1,2), (2,1), … to the corresponding probabilities)

p(x_1, x_2) is a number: the probability that X_1 = x_1 and X_2 = x_2

p(X_1, x_2) is a set of 3 values, the probabilities for all values of X_1 for the given value X_2 = x_2, i.e., a function mapping {1,2,3} to numbers (note: not a probability distribution, it will not sum to one)

We generally call all of these terms factors (functions mapping values to numbers, even if they do not sum to one)

12

slide-13
SLIDE 13

Operations on probabilities/factors

We can perform operations on probabilities/factors by performing the operation on every corresponding value in the probabilities/factors. For example, given three random variables X_1, X_2, X_3:

p(X_1, X_2) ⋅ p(X_2, X_3)

denotes a factor over X_1, X_2, X_3 (i.e., a function over all possible combinations of values these three random variables can take), where the value for (x_1, x_2, x_3) is given by p(x_1, x_2) ⋅ p(x_2, x_3)

13

slide-14
SLIDE 14

Conditional probability

The conditional probability p(X_1 | X_2) (the conditional probability of X_1 given X_2) is defined as

p(X_1 | X_2) = p(X_1, X_2) / p(X_2)

Can also be written p(X_1, X_2) = p(X_1 | X_2) p(X_2). More generally, this leads to the chain rule:

p(X_1, …, X_n) = ∏_{i=1}^n p(X_i | X_1, …, X_{i−1})
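As a quick sanity check, the definition of conditional probability and the two-variable chain rule can be verified numerically on a small made-up joint distribution (the numbers below are hypothetical, chosen only for illustration):

```python
# Hypothetical joint distribution over X1 ∈ {0,1}, X2 ∈ {0,1}
p_joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

# Marginal p(X2), summing the joint over x1
p_x2 = {x2: sum(p for (x1, v), p in p_joint.items() if v == x2) for x2 in (0, 1)}

# Conditional p(X1 | X2) = p(X1, X2) / p(X2)
p_x1_given_x2 = {(x1, x2): p_joint[(x1, x2)] / p_x2[x2] for (x1, x2) in p_joint}

# Chain rule: p(x1, x2) = p(x1 | x2) p(x2) recovers the joint exactly
for (x1, x2), p in p_joint.items():
    assert abs(p_x1_given_x2[(x1, x2)] * p_x2[x2] - p) < 1e-12
```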

14

slide-15
SLIDE 15

Marginalization

For random variables X_1, X_2 with joint distribution p(X_1, X_2):

p(X_1) = ∑_{x_2} p(X_1, x_2) = ∑_{x_2} p(X_1 | x_2) p(x_2)

Generalizes to joint distributions over multiple random variables:

p(X_1, …, X_i) = ∑_{x_{i+1}, …, x_n} p(X_1, …, X_i, x_{i+1}, …, x_n)

For p to be a probability distribution, the marginalization over all variables must be one:

∑_{x_1, …, x_n} p(x_1, …, x_n) = 1

15

slide-16
SLIDE 16

Bayes’ rule

A straightforward manipulation of probabilities:

p(X_1 | X_2) = p(X_1, X_2) / p(X_2) = p(X_2 | X_1) p(X_1) / p(X_2) = p(X_2 | X_1) p(X_1) / ∑_{x_1} p(X_2 | x_1) p(x_1)

Poll: I want to know if I have come down with a rare strain of flu (occurring in only 1/10,000 people). There is an “accurate” test for the flu: if I have the flu, it will tell me I have it 99% of the time, and if I do not have it, it will tell me I do not have it 99% of the time. I go to the doctor and test positive. What is the probability I have this flu?

  • ≈ 99%
  • ≈ 10%
  • ≈ 1%
  • ≈ 0.1%
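The poll can be worked out numerically with Bayes’ rule, plugging in the 1/10,000 prior and the 99% accuracy figures from the slide:

```python
prior = 1 / 10_000          # P(flu)
p_pos_given_flu = 0.99      # test sensitivity
p_pos_given_healthy = 0.01  # false positive rate (1 - specificity)

# Bayes' rule: P(flu | positive) = P(pos | flu) P(flu) / P(pos)
p_pos = p_pos_given_flu * prior + p_pos_given_healthy * (1 - prior)
posterior = p_pos_given_flu * prior / p_pos
print(round(posterior, 4))  # → 0.0098
```

So the posterior is about 1%: the rare prior overwhelms the accurate test, since almost all positives come from the much larger healthy population.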

16

slide-17
SLIDE 17

Independence

We say that random variables X_1 and X_2 are (marginally) independent if their joint distribution is the product of their marginals:

p(X_1, X_2) = p(X_1) p(X_2)

Equivalently, can also be stated as the condition that

p(X_1 | X_2) = p(X_1, X_2) / p(X_2) = p(X_1) p(X_2) / p(X_2) = p(X_1)

and similarly p(X_2 | X_1) = p(X_2)

17

slide-18
SLIDE 18

Conditional independence

We say that random variables X_1 and X_2 are conditionally independent given X_3 if

p(X_1, X_2 | X_3) = p(X_1 | X_3) p(X_2 | X_3)

Again, can be equivalently written:

p(X_1 | X_2, X_3) = p(X_1, X_2 | X_3) / p(X_2 | X_3) = p(X_1 | X_3) p(X_2 | X_3) / p(X_2 | X_3) = p(X_1 | X_3)

and similarly p(X_2 | X_1, X_3) = p(X_2 | X_3). Important: marginal independence does not imply conditional independence or vice versa

18

slide-19
SLIDE 19

Expectation

The expectation of a random variable is denoted:

E[X] = ∑_x x ⋅ p(x)

where we use upper case X to emphasize that this is a function of the entire random variable (but unlike p(X), it is a number). Note that this only makes sense when the values that the random variable takes on are numerical (i.e., we can’t ask for the expectation of the random variable “Weather”). Also generalizes to conditional expectation:

E[X_1 | x_2] = ∑_{x_1} x_1 ⋅ p(x_1 | x_2)

19

slide-20
SLIDE 20

Rules of expectation

Expectation of a sum is always equal to the sum of expectations (even when the variables are not independent):

E[X_1 + X_2] = ∑_{x_1, x_2} (x_1 + x_2) p(x_1, x_2)
             = ∑_{x_1} x_1 ∑_{x_2} p(x_1, x_2) + ∑_{x_2} x_2 ∑_{x_1} p(x_1, x_2)
             = ∑_{x_1} x_1 p(x_1) + ∑_{x_2} x_2 p(x_2)
             = E[X_1] + E[X_2]

If X_1, X_2 are independent, the expectation of the product is the product of expectations:

E[X_1 X_2] = ∑_{x_1, x_2} x_1 x_2 p(x_1, x_2) = ∑_{x_1, x_2} x_1 x_2 p(x_1) p(x_2)
           = (∑_{x_1} x_1 p(x_1)) (∑_{x_2} x_2 p(x_2)) = E[X_1] E[X_2]
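Both rules can be checked on a small made-up joint distribution; the one below is deliberately chosen so that X_1 and X_2 are dependent, which shows that linearity still holds while the product rule fails:

```python
# Made-up joint over (x1, x2); X1 and X2 are dependent here
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

E = lambda f: sum(f(x1, x2) * pr for (x1, x2), pr in p.items())

# Linearity of expectation holds even without independence
assert abs(E(lambda a, b: a + b) - (E(lambda a, b: a) + E(lambda a, b: b))) < 1e-12

# But E[X1 X2] != E[X1] E[X2], since the variables are dependent
assert abs(E(lambda a, b: a * b) - E(lambda a, b: a) * E(lambda a, b: b)) > 0.1
```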

20

slide-21
SLIDE 21

Variance

Variance of a random variable is the expectation of the variable minus its expectation, squared:

Var[X] = E[(X − E[X])²] = ∑_x (x − E[X])² p(x)
       = E[X² − 2X E[X] + E[X]²] = E[X²] − E[X]²

Generalizes to the covariance between two random variables:

Cov[X_1, X_2] = E[(X_1 − E[X_1])(X_2 − E[X_2])] = E[X_1 X_2] − E[X_1] E[X_2]
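The identity Var[X] = E[X²] − E[X]² is easy to confirm numerically on a small made-up distribution:

```python
# Small made-up distribution over x
p = {1: 0.2, 2: 0.5, 3: 0.3}

mean = sum(x * pr for x, pr in p.items())
var_def = sum((x - mean) ** 2 * pr for x, pr in p.items())     # E[(X - E[X])^2]
var_alt = sum(x * x * pr for x, pr in p.items()) - mean ** 2   # E[X^2] - E[X]^2

assert abs(var_def - var_alt) < 1e-12
```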

21

slide-22
SLIDE 22

Infinite random variables

All the math above works the same for discrete random variables that can take on an infinite number of values (countably infinite values, that is). The only difference is that p(X) (obviously) cannot be specified by an explicit dictionary mapping variable values to probabilities; we need to specify the functional form that produces the probabilities. To be a probability, we still must have ∑_x p(x) = 1. Example:

P(X = k) = (1/2)^k,  k = 1, …, ∞
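The example distribution does sum to one: the partial sums of the geometric series (1/2)^k converge to 1, as a quick computation shows:

```python
# Partial sum of P(X = k) = (1/2)^k; equals 1 - 2**(-59), i.e., essentially 1
partial = sum(0.5 ** k for k in range(1, 60))
print(partial)
assert abs(partial - 1.0) < 1e-12
```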

22

slide-23
SLIDE 23

Continuous random variables

For random variables taking on continuous values (we’ll only consider real-valued distributions), we need some slightly different mechanisms. As with infinite discrete variables, the distribution p(X) needs to be specified as a function, here referred to as a probability density function (PDF), and it must integrate to one:

∫_ℝ p(x) dx = 1

For any interval [a, b], we have that

p(a ≤ x ≤ b) = ∫_a^b p(x) dx

(with similar generalization to multi-dimensional random variables). Distributions can also be specified by their cumulative distribution function (CDF):

F(a) = p(x ≤ a) = ∫_{−∞}^a p(x) dx

23

slide-24
SLIDE 24

Outline

  • Probability in AI
  • Background on probability
  • Common distributions
  • Maximum likelihood estimation
  • Probabilistic graphical models

24

slide-25
SLIDE 25

Bernoulli distribution

A simple distribution over binary {0,1} random variables:

p(X = 1; φ) = φ,  p(X = 0; φ) = 1 − φ

where φ ∈ [0,1] is the parameter that governs the distribution. The expectation is just E[X] = φ (but it is not very common to refer to it this way, since this would imply that the {0,1} values are actual real-valued numbers)

25

slide-26
SLIDE 26

Categorical distribution

This is the discrete distribution we’ve mainly considered so far: a distribution over finite discrete elements with each probability specified. Written generically as:

p(X = i; φ) = φ_i

where φ_1, …, φ_k ∈ [0,1] are the parameters of the distribution (the probability of each value; they must sum to one). Note: we could actually parameterize using just φ_1, …, φ_{k−1}, since this determines the last element. Unless the actual numerical values of the i’s are relevant, it doesn’t make sense to take expectations of a categorical random variable

26

slide-27
SLIDE 27

Gaussian distribution

A distribution over real-valued numbers, empirically the most common distribution in all of machine learning (not necessarily in the data itself), the standard “bell curve”. Probability density function:

p(x; μ, σ²) = (2πσ²)^{−1/2} exp(−(x − μ)² / (2σ²)) ≡ 𝒩(x; μ, σ²)

with parameters μ ∈ ℝ (mean) and σ² ∈ ℝ₊ (variance)

27

[Figure: Gaussian density with μ = 0, σ² = 1]

slide-28
SLIDE 28

Multivariate Gaussians

The Gaussian distribution is one of the few distributions that generalizes nicely to higher dimensions. Gaussian distribution over random vectors x ∈ ℝⁿ:

p(x; μ, Σ) = (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/2)(x − μ)^T Σ^{−1} (x − μ))

where μ ∈ ℝⁿ is the mean, Σ ∈ ℝ^{n×n} is the covariance matrix, and |⋅| denotes the determinant of a matrix. Many extremely nice properties: marginal and conditional distributions are also multivariate Gaussian
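One practical consequence of the form above: sampling a multivariate Gaussian only requires standard normal draws and a Cholesky factor of Σ. A minimal sketch with made-up parameters (the empirical moments of the samples should match μ and Σ):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])                  # made-up mean
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])  # made-up covariance (positive definite)

# Sample x = mu + L z with z ~ N(0, I) and L L^T = Sigma, so Cov[x] = Sigma
L = np.linalg.cholesky(Sigma)
z = rng.standard_normal((200_000, 2))
x = mu + z @ L.T

# Empirical mean and covariance are close to the true parameters
assert np.allclose(x.mean(axis=0), mu, atol=0.02)
assert np.allclose(np.cov(x.T), Sigma, atol=0.05)
```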

28

slide-29
SLIDE 29

Exponential distribution

A one-sided Laplace distribution, often used to model arrival times. Probability density function:

p(x; λ) = λ exp(−λx)

with parameter λ ∈ ℝ₊ (mean E[X] = 1/λ, variance Var[X] = 1/λ²)

29

[Figure: exponential density with λ = 1]

slide-30
SLIDE 30

Outline

  • Probability in AI
  • Background on probability
  • Common distributions
  • Maximum likelihood estimation
  • Probabilistic graphical models

30

slide-31
SLIDE 31

Estimating the parameters of distributions

We’re moving now from probability to statistics. The basic question: given some data x^(1), …, x^(m), how do I find a distribution that captures this data “well”? In general (if we can pick from the space of all distributions), this is a hard question, but if we pick from a particular parameterized family of distributions p(X; θ), the question is (at least a little bit) easier. The question becomes: how do I find parameters θ of this distribution that fit the data?

31

slide-32
SLIDE 32

Maximum likelihood estimation

Given a distribution p(X; θ) and a collection of observed (independent) data points x^(1), …, x^(m), the probability of observing this data is simply

p(x^(1), …, x^(m); θ) = ∏_{i=1}^m p(x^(i); θ)

Basic idea of maximum likelihood estimation (MLE): find the parameters that maximize the probability of the observed data:

maximize_θ ∏_{i=1}^m p(x^(i); θ) ≡ maximize_θ ℓ(θ) = ∑_{i=1}^m log p(x^(i); θ)

where ℓ(θ) is called the log likelihood of the data. Seems “obvious”, but there are many other ways of fitting parameters

32

slide-33
SLIDE 33

Parameter estimation for Bernoulli

Simple example: Bernoulli distribution

p(X = 1; φ) = φ,  p(X = 0; φ) = 1 − φ

Given observed data x^(1), …, x^(m), the “obvious” answer is:

φ̂ = (# 1’s) / (# total) = (∑_{i=1}^m x^(i)) / m

But why is this the case? Maybe there are other estimates that are just as good, e.g.:

φ̂ = (∑_{i=1}^m x^(i) + 1) / (m + 2)

33

slide-34
SLIDE 34

MLE for Bernoulli

The maximum likelihood solution for the Bernoulli is given by

maximize_φ ∏_{i=1}^m p(x^(i); φ) = maximize_φ ∏_{i=1}^m φ^{x^(i)} (1 − φ)^{1 − x^(i)}

Taking the log of the optimization objective (which does not change the solution):

maximize_φ ℓ(φ) = ∑_{i=1}^m ( x^(i) log φ + (1 − x^(i)) log(1 − φ) )

The derivative with respect to φ is given by

(d/dφ) ℓ(φ) = ∑_{i=1}^m ( x^(i)/φ − (1 − x^(i))/(1 − φ) ) = (∑_{i=1}^m x^(i))/φ − (∑_{i=1}^m (1 − x^(i)))/(1 − φ)

34

slide-35
SLIDE 35

MLE for Bernoulli, continued

Setting the derivative to zero gives:

(∑_{i=1}^m x^(i))/φ − (∑_{i=1}^m (1 − x^(i)))/(1 − φ) ≡ a/φ − b/(1 − φ) = 0
⟹ (1 − φ) a = φ b
⟹ φ = a/(a + b) = (∑_{i=1}^m x^(i)) / m

So we have shown that the “natural” estimate of φ actually corresponds to the maximum likelihood estimate
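The closed-form result can be double-checked numerically: on synthetic Bernoulli data (made-up φ = 0.3 below), the fraction of 1’s attains at least the log likelihood of every value on a fine grid:

```python
import math, random

random.seed(0)
phi_true = 0.3  # made-up parameter for the synthetic data
data = [1 if random.random() < phi_true else 0 for _ in range(2000)]

# Closed-form MLE: fraction of 1's
phi_mle = sum(data) / len(data)

def loglik(phi):
    return sum(x * math.log(phi) + (1 - x) * math.log(1 - phi) for x in data)

# No point on a coarse grid over (0, 1) beats the closed-form estimate
best_grid = max(loglik(p / 1000) for p in range(1, 1000))
assert loglik(phi_mle) >= best_grid
```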

35

slide-36
SLIDE 36

MLE for Gaussian, briefly

For the Gaussian distribution

p(x; μ, σ²) = (2πσ²)^{−1/2} exp(−(x − μ)²/(2σ²))

the log likelihood is given by:

ℓ(μ, σ²) = −(m/2) log(2πσ²) − (1/2) ∑_{i=1}^m (x^(i) − μ)²/σ²

Derivatives (see if you can derive these fully):

(∂/∂μ) ℓ(μ, σ²) = ∑_{i=1}^m (x^(i) − μ)/σ² = 0 ⟹ μ = (1/m) ∑_{i=1}^m x^(i)

(∂/∂σ²) ℓ(μ, σ²) = −m/(2σ²) + (1/2) ∑_{i=1}^m (x^(i) − μ)²/(σ²)² = 0 ⟹ σ² = (1/m) ∑_{i=1}^m (x^(i) − μ)²
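A quick numerical check of these stationarity conditions (synthetic data with made-up μ = 5, σ = 2): perturbing either closed-form estimate should only lower the log likelihood:

```python
import math, random

random.seed(1)
data = [random.gauss(5.0, 2.0) for _ in range(5000)]
m = len(data)

# Closed-form MLE: sample mean and (1/m)-normalized variance
mu_hat = sum(data) / m
var_hat = sum((x - mu_hat) ** 2 for x in data) / m

def loglik(mu, var):
    return (-m / 2 * math.log(2 * math.pi * var)
            - 0.5 * sum((x - mu) ** 2 for x in data) / var)

# Perturbing either parameter decreases the log likelihood
base = loglik(mu_hat, var_hat)
for d in (-0.1, 0.1):
    assert loglik(mu_hat + d, var_hat) < base
    assert loglik(mu_hat, var_hat + d) < base
```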

36

slide-37
SLIDE 37

Machine learning via maximum likelihood

Many machine learning algorithms (specifically the loss function component) can be interpreted as maximum likelihood estimation.

Logistic regression:  minimize_θ ∑_{i=1}^m log(1 + exp(−y^(i) ⋅ h_θ(x^(i))))

Softmax (multiclass logistic) regression:  minimize_θ ∑_{i=1}^m ( log ∑_{j=1}^k exp(h_θ(x^(i))_j) − h_θ(x^(i))^T y^(i) )

Where did these come from?

37

slide-38
SLIDE 38

Logistic model

Suppose our random variable Y takes on values in {−1, +1}, and we want to model the conditional distribution p(Y | X) as a function of θ^T x for some parameters θ. Since probabilities must be positive, let’s look at the distribution

p(y = +1 | x; θ) ∝ exp(θ^T x),  p(y = −1 | x; θ) ∝ 1

Then, because the actual probability values need to sum to one,

p(y = +1 | x; θ) = exp(θ^T x) / (1 + exp(θ^T x)) = 1 / (1 + exp(−θ^T x))

This last term is called a logistic (or sigmoid) function, σ(z) = 1/(1 + exp(−z))

38

slide-39
SLIDE 39

Logistic probability model

Under the linear logistic model we can write the likelihood as

p(y | x; θ) = σ(y ⋅ θ^T x)

The maximum likelihood estimate for θ is then given by

maximize_θ ∑_{i=1}^m log p(y^(i) | x^(i); θ) ≡ maximize_θ ∑_{i=1}^m log ( 1 / (1 + exp(−y^(i) ⋅ θ^T x^(i))) )
≡ minimize_θ ∑_{i=1}^m log(1 + exp(−y^(i) ⋅ θ^T x^(i)))
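This MLE can be computed with plain gradient descent on the logistic loss. A minimal sketch on synthetic data (the “true” parameter vector below is made up; labels are drawn from the logistic model itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a made-up "true" parameter vector
theta_true = np.array([1.5, -2.0])
X = rng.standard_normal((500, 2))
p_pos = 1 / (1 + np.exp(-(X @ theta_true)))
y = np.where(rng.random(500) < p_pos, 1.0, -1.0)   # labels in {-1, +1}

def loss(theta):
    # Negative log likelihood: sum_i log(1 + exp(-y^(i) theta^T x^(i)))
    return np.sum(np.log1p(np.exp(-y * (X @ theta))))

# Plain gradient descent on the logistic loss
theta = np.zeros(2)
for _ in range(1000):
    s = 1 / (1 + np.exp(y * (X @ theta)))        # = sigma(-y theta^T x)
    grad = -(X * (y * s)[:, None]).sum(axis=0)   # gradient of the loss
    theta -= 0.005 * grad

assert loss(theta) < loss(np.zeros(2))  # fitting reduced the negative log likelihood
```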

39

slide-40
SLIDE 40

Softmax regression model

If instead Y takes on values in {1, …, k}, with

p(y = j | x; θ) ∝ exp(θ_j^T x)

then

p(y = j | x; θ) = exp(θ_j^T x) / ∑_{l=1}^k exp(θ_l^T x)

log p(y = j | x; θ) = θ_j^T x − log ∑_{l=1}^k exp(θ_l^T x)

⟹ minimize_θ ∑_{i=1}^m ( log ∑_{l=1}^k exp(θ_l^T x^(i)) − θ_{y^(i)}^T x^(i) )
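Two properties of this model are easy to verify numerically with made-up parameters: the softmax probabilities form a valid distribution, and the per-example loss above is exactly the negative log probability of the labeled class:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 3
Theta = rng.standard_normal((k, n))  # one made-up parameter vector theta_j per class
x = rng.standard_normal(n)

logits = Theta @ x
p = np.exp(logits) / np.exp(logits).sum()

# The softmax probabilities form a valid distribution
assert abs(p.sum() - 1.0) < 1e-12

# For a label j, the per-example loss  log sum_l exp(theta_l^T x) - theta_j^T x
# equals -log p(y = j | x; theta)
j = 2
loss = np.log(np.exp(logits).sum()) - logits[j]
assert abs(loss - (-np.log(p[j]))) < 1e-12
```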

40

slide-41
SLIDE 41

Least squares

In linear regression, assume

y = θ^T x + ε,  ε ∼ 𝒩(0, σ²)  ⟺  p(y | x; θ) = 𝒩(y; θ^T x, σ²)

Then the maximum likelihood estimate is given by

maximize_θ ∑_{i=1}^m log p(y^(i) | x^(i); θ) ≡ minimize_θ ∑_{i=1}^m (y^(i) − θ^T x^(i))²

i.e., the least-squares loss function can be viewed as MLE under Gaussian errors. Other approaches are possible too: the absolute loss function can be viewed as MLE under Laplace errors
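The equivalence can be checked numerically: the least-squares solution also attains the highest Gaussian log likelihood (for any fixed σ²). A minimal sketch on synthetic data with a made-up true parameter vector and noise level:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0])                  # made-up parameters
X = rng.standard_normal((200, 2))
y = X @ theta_true + 0.5 * rng.standard_normal(200)  # Gaussian noise, sigma = 0.5

# Least-squares fit (the MLE under the Gaussian error model)
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

def loglik(theta, sigma2=0.25):
    r = y - X @ theta
    return -0.5 * len(y) * np.log(2 * np.pi * sigma2) - 0.5 * np.sum(r**2) / sigma2

# Perturbing the least-squares solution never increases the log likelihood
for d in ([0.05, 0.0], [0.0, 0.05], [-0.05, 0.05]):
    assert loglik(theta_ls) >= loglik(theta_ls + np.array(d))
```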

41