Machine Learning Lecture 01-1: Basics of Probability Theory

SLIDE 1

Machine Learning

Lecture 01-1: Basics of Probability Theory Nevin L. Zhang lzhang@cse.ust.hk

Department of Computer Science and Engineering The Hong Kong University of Science and Technology

SLIDE 2

Basic Concepts in Probability Theory

Outline

1 Basic Concepts in Probability Theory
2 Interpretation of Probability
3 Univariate Probability Distributions
4 Multivariate Probability
  Bayes' Theorem
5 Parameter Estimation

SLIDE 3

Basic Concepts in Probability Theory

Random Experiments

Probability is associated with a random experiment: a process with uncertain outcomes. The experiment is often kept implicit. In machine learning, we often assume that data are generated by a hypothetical random process (or a model), and the task is to determine the structure and parameters of the model from data.

SLIDE 4

Basic Concepts in Probability Theory

Sample Space

Sample space (aka population) Ω: the set of possible outcomes of a random experiment. Example: rolling two dice. The elements of a sample space are called outcomes.

SLIDE 5

Basic Concepts in Probability Theory

Events

Event: A subset of the sample space. Example: The two results add to 4.

SLIDE 6

Basic Concepts in Probability Theory

Probability Weight Function

A probability weight P(ω) is assigned to each outcome ω. In machine learning, we often need to determine the probability weights, or related parameters, from data. This task is called parameter learning.

SLIDE 7

Basic Concepts in Probability Theory

Probability measure

Probability P(E) of an event E:

$$P(E) = \sum_{\omega \in E} P(\omega)$$

A probability measure is a mapping from the set of events to [0, 1],

$$P : 2^{\Omega} \to [0, 1],$$

that satisfies Kolmogorov's axioms:

1 $P(\Omega) = 1$.
2 $P(A) \geq 0$ for all $A \subseteq \Omega$.
3 Additivity: $P(A \cup B) = P(A) + P(B)$ if $A \cap B = \emptyset$.

In a more advanced treatment of probability theory, we would start with the concept of a probability measure instead of probability weights.
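As a concrete illustration, here is a small Python sketch (the function and variable names are ours, not from the slides) that builds the two-dice sample space, assigns uniform weights, and computes P(E) by summing the weights of the outcomes in E:

```python
from fractions import Fraction
from itertools import product

# Sample space for rolling two dice; each of the 36 outcomes gets equal weight.
omega = list(product(range(1, 7), repeat=2))
weight = {w: Fraction(1, 36) for w in omega}

def prob(event):
    """P(E) = sum of the probability weights of the outcomes in E."""
    return sum(weight[w] for w in event)

# Event: the two results add to 4.
E = [w for w in omega if sum(w) == 4]
print(E)        # [(1, 3), (2, 2), (3, 1)]
print(prob(E))  # 1/12
```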

SLIDE 8

Basic Concepts in Probability Theory

Random Variables

A random variable is a function over the sample space.

Example: X = sum of the two results. X((2, 5)) = 7; X((3, 1)) = 4.

Why is it random? Because the outcome of the underlying experiment is random. Domain of a random variable: the set of all its possible values. Here Ω_X = {2, 3, ..., 12}.

SLIDE 9

Basic Concepts in Probability Theory

Random Variables and Events

A random variable X taking a specific value x is an event: Ω_{X=x} = {ω ∈ Ω | X(ω) = x}. For example, Ω_{X=4} = {(1, 3), (2, 2), (3, 1)}.

SLIDE 10

Basic Concepts in Probability Theory

Probability Mass Function (Distribution)

Probability mass function P(X): Ω_X → [0, 1]

P(X = x) = P(Ω_{X=x})

P(X = 4) = P({(1, 3), (2, 2), (3, 1)}) = 3/36 = 1/12.

If X is continuous, we have a density function p(X).
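A short sketch of the random-variable view, assuming the same two-dice experiment as above: the pmf of X aggregates the weights of all outcomes that X maps to each value.

```python
from fractions import Fraction
from itertools import product
from collections import Counter

omega = list(product(range(1, 7), repeat=2))

# X is a function over the sample space: the sum of the two results.
X = lambda w: w[0] + w[1]

# P(X = x) = P(Omega_{X=x}): aggregate the weights of the outcomes mapped to x.
pmf = Counter()
for w in omega:
    pmf[X(w)] += Fraction(1, 36)

print(pmf[4])             # 1/12, i.e. 3/36
print(sum(pmf.values()))  # 1
```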

SLIDE 11

Interpretation of Probability

Outline

1 Basic Concepts in Probability Theory
2 Interpretation of Probability
3 Univariate Probability Distributions
4 Multivariate Probability
  Bayes' Theorem
5 Parameter Estimation

SLIDE 12

Interpretation of Probability

Frequentist interpretation

Probabilities are long-term relative frequencies. Example:

X is the result of a coin toss, Ω_X = {H, T}. P(X=H) = 1/2 means that the relative frequency of getting heads will almost surely approach 1/2 as the number of tosses goes to infinity.

Justified by the Law of Large Numbers. Let X_i be the result of the i-th toss, with 1 for H and 0 for T. Then

$$\lim_{n \to \infty} \frac{\sum_{i=1}^{n} X_i}{n} = \frac{1}{2} \quad \text{with probability } 1.$$

The frequentist interpretation is meaningful only when the experiment can be repeated under the same conditions.
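A quick simulation (the seed and sample sizes are illustrative choices of ours) shows the relative frequency approaching 1/2 as the number of tosses grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# X_i = 1 for heads, 0 for tails; a fair coin has p = 1/2.
tosses = rng.integers(0, 2, size=1_000_000)

# Relative frequency of heads after n tosses, for growing n.
for n in (10, 1_000, 1_000_000):
    print(n, tosses[:n].mean())
```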

SLIDE 13

Interpretation of Probability

Bayesian interpretation

Probabilities are logically consistent degrees of belief. Applicable when the experiment is not repeatable. Depends on a person's state of knowledge. Example: "the probability that the Suez Canal is longer than the Panama Canal".

This doesn't make sense under the frequentist interpretation. Subjectivist: degree of belief based on one's state of knowledge.

Primary school student: 0.5. Me: 0.8. Geographer: 1 or 0.

Arguments such as the Dutch book are used to explain why one's probability beliefs must satisfy Kolmogorov's axioms.

SLIDE 14

Interpretation of Probability

Interpretations of Probability

Now both interpretations are accepted. In practice, subjective beliefs and statistical data complement each other.

We rely on subjective beliefs (prior probabilities) when data are scarce. As more and more data become available, we rely less and less on subjective beliefs. Often, we also use prior probabilities to impose some bias on the kind of results we want from a machine learning algorithm.

The subjectivist interpretation makes concepts such as conditional independence easy to understand.

SLIDE 15

Univariate Probability Distributions

Outline

1 Basic Concepts in Probability Theory
2 Interpretation of Probability
3 Univariate Probability Distributions
4 Multivariate Probability
  Bayes' Theorem
5 Parameter Estimation

SLIDE 16

Univariate Probability Distributions

Binomial and Bernoulli Distributions

Suppose we toss a coin n times. At each toss, the probability of getting a head is θ. Let X be the number of heads. Then X follows the binomial distribution, written X ∼ Bin(n, θ):

$$\mathrm{Bin}(X = k \mid n, \theta) = \binom{n}{k} \theta^{k} (1 - \theta)^{n-k} \quad \text{if } 0 \le k \le n,$$

and 0 if k < 0 or k > n.

If n = 1, then X follows the Bernoulli distribution, written X ∼ Ber(θ):

$$\mathrm{Ber}(X = x \mid \theta) = \begin{cases} \theta & \text{if } x = 1 \\ 1 - \theta & \text{if } x = 0 \end{cases}$$
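A minimal sketch of both pmfs using only the standard library (the function names are ours):

```python
from math import comb

def binom_pmf(k, n, theta):
    """Bin(X = k | n, theta); zero outside 0 <= k <= n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def bernoulli_pmf(x, theta):
    """Ber(X = x | theta) for x in {0, 1}."""
    return theta if x == 1 else 1 - theta

print(binom_pmf(3, 10, 0.5))                        # ~0.117
print(binom_pmf(1, 1, 0.3), bernoulli_pmf(1, 0.3))  # both 0.3: Bin(1, theta) = Ber(theta)
```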

SLIDE 17

Univariate Probability Distributions

Multinomial Distribution

Suppose we toss a K-sided die n times. At each toss, the probability of getting result j is θ_j. Let θ = (θ_1, ..., θ_K)^⊤.

Let x = (x_1, ..., x_K) be a random vector, where x_j is the number of times side j of the die occurs. Then x follows the multinomial distribution, written x ∼ Multi(n, θ):

$$\mathrm{Multi}(\mathbf{x} \mid n, \boldsymbol{\theta}) = \binom{n}{x_1, \ldots, x_K} \prod_{j=1}^{K} \theta_j^{x_j},$$

where

$$\binom{n}{x_1, \ldots, x_K} = \frac{n!}{x_1! \cdots x_K!}$$

is the multinomial coefficient.

SLIDE 18

Univariate Probability Distributions

Categorical Distribution

In the previous slide, if n = 1, then x = (x_1, ..., x_K) has one component equal to 1 and the others equal to 0. In other words, it is a one-hot vector. In this case, x follows the categorical distribution, written x ∼ Cat(θ):

$$\mathrm{Cat}(\mathbf{x} \mid \boldsymbol{\theta}) = \prod_{j=1}^{K} \theta_j^{\mathbb{1}(x_j = 1)},$$

where 1(x_j = 1) is the indicator function, whose value is 1 when x_j = 1 and 0 otherwise.
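The categorical pmf just picks out the θ_j of the component that is 1; a small sketch (names ours):

```python
def cat_pmf(x, theta):
    """Cat(x | theta) for a one-hot vector x: product of theta_j^{1(x_j = 1)}."""
    assert sum(x) == 1 and all(v in (0, 1) for v in x)
    p = 1.0
    for xj, tj in zip(x, theta):
        if xj == 1:        # the indicator selects exactly one factor
            p *= tj
    return p

print(cat_pmf((0, 1, 0), (0.2, 0.5, 0.3)))  # 0.5
```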

SLIDE 19

Univariate Probability Distributions

Gaussian (Normal) Distribution

The most widely used distribution in statistics and machine learning is the Gaussian or normal distribution. Its probability density is given by

$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$

Here µ = E[X] is the mean (and mode), and σ² = var[X] is the variance.
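Evaluating the density directly from the formula (a sketch; normal_pdf is our name):

```python
import math

def normal_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2) evaluated at a single point."""
    return math.exp(-(x - mu)**2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

print(normal_pdf(0.0, 0.0, 1.0))  # ~0.3989, the standard normal at its mode
```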

SLIDE 20

Multivariate Probability

Outline

1 Basic Concepts in Probability Theory
2 Interpretation of Probability
3 Univariate Probability Distributions
4 Multivariate Probability
  Bayes' Theorem
5 Parameter Estimation

SLIDE 21

Multivariate Probability

Joint probability mass function

Recall the probability mass function of a single random variable X: P(X) : Ω_X → [0, 1], with P(X = x) = P(Ω_{X=x}).

Suppose there are n random variables X_1, X_2, ..., X_n. A joint probability mass function P(X_1, X_2, ..., X_n) over those random variables is a function defined on the Cartesian product of their state spaces,

$$\prod_{i=1}^{n} \Omega_{X_i} \to [0, 1],$$

such that

$$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = P(\Omega_{X_1 = x_1} \cap \Omega_{X_2 = x_2} \cap \cdots \cap \Omega_{X_n = x_n}).$$

SLIDE 22

Multivariate Probability

Joint probability mass function

Example:

Population: apartments in the Hong Kong rental market. Random variables (of a randomly selected apartment):

Monthly Rent: {low (≤ 1k), medium ((1k, 2k]), upper medium ((2k, 4k]), high (> 4k)}
Type: {public, private, others}

Joint probability distribution P(Rent, Type):

                  public   private   others
  low              .17      .01       .02
  medium           .44      .03       .01
  upper medium     .09      .07       .01
  high             .00      .14       .01
SLIDE 23

Multivariate Probability

Multivariate Gaussian Distributions

For continuous variables, the most commonly used joint distribution is the multivariate Gaussian distribution N(µ, Σ):

$$N(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{\sqrt{(2\pi)^{D} |\Sigma|}} \exp\left( -\frac{(\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}{2} \right)$$

D: dimensionality. x: vector of D random variables, representing data. µ: vector of means. Σ: covariance matrix; |Σ| denotes the determinant of Σ.
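A sketch that evaluates this density with NumPy (the 2-D µ and Σ below are illustrative values of ours):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """N(x | mu, Sigma) for a D-dimensional point x."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)          # (x-mu)^T Sigma^{-1} (x-mu)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-quad / 2) / norm

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_pdf(np.array([0.0, 0.0]), mu, Sigma))  # density at the mean, ~0.120
```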

SLIDE 24

Multivariate Probability

Multivariate Gaussian Distributions

A 2-D Gaussian distribution (contour plot). µ: center of the contours. Σ: orientation and size of the contours.

SLIDE 25

Multivariate Probability

Marginal probability

What is the probability of a randomly selected apartment being a public one? (Law of total probability)

P(Type=public) = P(Type=public, Rent=low) + P(Type=public, Rent=medium) + P(Type=public, Rent=upper medium) + P(Type=public, Rent=high) = .7

P(Type=private) = P(Type=private, Rent=low) + P(Type=private, Rent=medium) + P(Type=private, Rent=upper medium) + P(Type=private, Rent=high) = .25

                  public   private   others   P(Rent)
  low              .17      .01       .02      .20
  medium           .44      .03       .01      .48
  upper medium     .09      .07       .01      .17
  high             .00      .14       .01      .15
  P(Type)          .70      .25       .05

These are called marginal probabilities because they are written on the margins of the table.
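The same computation in NumPy: marginalizing is summing the joint table over the other variable's axis (the array below encodes the table above):

```python
import numpy as np

# Joint distribution P(Rent, Type).
# Rows: low, medium, upper medium, high. Columns: public, private, others.
joint = np.array([
    [0.17, 0.01, 0.02],
    [0.44, 0.03, 0.01],
    [0.09, 0.07, 0.01],
    [0.00, 0.14, 0.01],
])

p_rent = joint.sum(axis=1)  # marginal P(Rent): sum each row over Type
p_type = joint.sum(axis=0)  # marginal P(Type): sum each column over Rent
print(p_rent)  # [0.2  0.48 0.17 0.15]
print(p_type)  # [0.7  0.25 0.05]
```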

SLIDE 26

Multivariate Probability

Conditional probability

For events A and B:

$$P(A \mid B) = \frac{P(A, B)}{P(B)} \; \left( = \frac{P(A \cap B)}{P(B)} \right)$$

Meaning:

P(A): my probability of A (without any knowledge about B). P(A|B): my probability of A assuming that I know event B is true.

What is the probability of a randomly selected private apartment having "low" rent?

P(Rent=low | Type=private) = P(Rent=low, Type=private) / P(Type=private) = .01/.25 = .04

In contrast: P(Rent=low) = 0.2.
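As a sketch, the same conditional probability computed from the joint table:

```python
import numpy as np

# P(Rent, Type) as above; rows: low, medium, upper medium, high;
# columns: public, private, others.
joint = np.array([
    [0.17, 0.01, 0.02],
    [0.44, 0.03, 0.01],
    [0.09, 0.07, 0.01],
    [0.00, 0.14, 0.01],
])

# P(Rent = low | Type = private) = P(low, private) / P(private)
print(joint[0, 1] / joint[:, 1].sum())  # 0.01 / 0.25 = 0.04
```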

SLIDE 27

Multivariate Probability

Marginal independence

Two random variables X and Y are marginally independent, written X ⊥ Y , if

for any state x of X and any state y of Y, P(X=x | Y=y) = P(X=x) whenever P(Y=y) ≠ 0.

Meaning: learning the value of Y does not give me any information about X, and vice versa. Equivalent definition: P(X=x, Y=y) = P(X=x)P(Y=y). Shorthand for the equations: P(X|Y) = P(X), P(X, Y) = P(X)P(Y).

SLIDE 28

Multivariate Probability

Marginal independence

Examples:

X: result of tossing a fair coin for the first time; Y: result of the second toss of the same coin. X: result of the US election; Y: your grades in this course.

Counterexample: X = oral presentation grade, Y = project report grade.

SLIDE 29

Multivariate Probability

Conditional independence

Two random variables X and Y are conditionally independent given a third variable Z, written X ⊥ Y | Z, if P(X=x | Y=y, Z=z) = P(X=x | Z=z) whenever P(Y=y, Z=z) ≠ 0.

Meaning: if I know the state of Z already, then learning the state of Y does not give me additional information about X. Y might contain some information about X; however, all the information about X contained in Y is also contained in Z.

Shorthand for the equation: P(X | Y, Z) = P(X | Z). Equivalent definition: P(X, Y | Z) = P(X | Z)P(Y | Z).

SLIDE 30

Multivariate Probability

Example of Conditional Independence

There is a bag of 100 coins. 10 coins were made by a malfunctioning machine and are biased toward heads: tossing such a coin results in a head 80% of the time. The other coins are fair. Randomly draw a coin from the bag and toss it a few times.

X_i: result of the i-th toss; Y: whether the coin was produced by the malfunctioning machine. The X_i's are not marginally independent of each other:

If I get 9 heads in the first 10 tosses, then the coin is probably a biased coin. Hence the next toss will be more likely to result in a head than a tail. Learning the value of X_i gives me some information about whether the coin is biased, which in turn gives me some information about X_j.

SLIDE 31

Multivariate Probability

Example of Conditional Independence

However, they are conditionally independent given Y:

If the coin is not biased, the probability of getting a head in one toss is 1/2 regardless of the results of the other tosses. If the coin is biased, the probability of getting a head in one toss is 80% regardless of the results of the other tosses. If I already know whether the coin is biased, learning the value of X_i does not give me additional information about X_j.

Here is how the variables are related pictorially. We will return to this picture later.
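An exact numerical check of this example (variable names are ours): marginally, P(X1=H, X2=H) ≠ P(X1=H)P(X2=H), but given Y the joint factorizes by construction.

```python
# Y = 1 if the coin is biased (probability 0.1, heads with prob 0.8),
# Y = 0 if it is fair (probability 0.9, heads with prob 0.5).
p_y = {1: 0.1, 0: 0.9}
p_head = {1: 0.8, 0: 0.5}

# Marginally, X1 and X2 are dependent:
p_h = sum(p_y[y] * p_head[y] for y in (0, 1))       # P(X1 = H) = 0.53
p_hh = sum(p_y[y] * p_head[y]**2 for y in (0, 1))   # P(X1 = H, X2 = H)
print(p_hh, p_h * p_h)  # 0.289 vs 0.2809: not equal, so not independent

# Given Y, they are independent by construction:
# e.g. P(X1 = H, X2 = H | Y = 1) = 0.8 * 0.8.
```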

SLIDE 32

Multivariate Probability Bayes’ Theorem

Prior, posterior, and likelihood

Three important concepts in Bayesian inference, with respect to a piece of evidence E:

Prior probability P(H): belief about a hypothesis before observing the evidence.

Example: Suppose 10% of people suffer from Hepatitis B. A doctor's prior probability that a new patient suffers from Hepatitis B is 0.1.

Posterior probability P(H|E): belief about a hypothesis after obtaining the evidence.

If the doctor finds that the eyes of the patient are yellow, his belief that the patient suffers from Hepatitis B would be > 0.1.

SLIDE 33

Multivariate Probability Bayes’ Theorem

Prior, posterior, and likelihood

Suppose a patient is observed to have yellow eyes (E). Consider two possible explanations:

1 The patient has Hepatitis B (H1).
2 The patient does not have Hepatitis B (H2).

Obviously, H1 is a better explanation because P(E|H1) > P(E|H2). To state it another way, we say that H1 is more likely than H2 given E.

In general, the likelihood of a hypothesis H given evidence E is a measure of how well H explains E. Mathematically, it is

L(H|E) = P(E|H)

In machine learning, we often talk about the likelihood of a model M given data D. It is a measure of how well the model M explains the data D. Mathematically, it is L(M|D) = P(D|M).

SLIDE 34

Multivariate Probability Bayes’ Theorem

Bayes’ Theorem/Bayes Rule

Bayes' Theorem relates prior probability, likelihood, and posterior probability:

$$P(H \mid E) = \frac{P(H)\, P(E \mid H)}{P(E)} \propto P(H)\, L(H \mid E),$$

where P(E) is a normalization constant that ensures $\sum_{h \in \Omega_H} P(H = h \mid E) = 1$.

That is: posterior ∝ prior × likelihood.
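A small sketch of this update for the yellow-eyes example. The prior 0.1 comes from the slides, but the two likelihood values are made-up illustrative numbers, not from the lecture:

```python
# Posterior ∝ prior × likelihood, then normalize.
# The likelihoods below are assumed for illustration only.
p_h = {'hepB': 0.1, 'no hepB': 0.9}            # prior P(H)
p_e_given_h = {'hepB': 0.8, 'no hepB': 0.05}   # assumed P(E | H)

unnorm = {h: p_h[h] * p_e_given_h[h] for h in p_h}
p_e = sum(unnorm.values())                     # normalization constant P(E)
posterior = {h: v / p_e for h, v in unnorm.items()}
print(posterior)  # {'hepB': 0.64, 'no hepB': 0.36}
```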

SLIDE 35

Parameter Estimation

Outline

1 Basic Concepts in Probability Theory
2 Interpretation of Probability
3 Univariate Probability Distributions
4 Multivariate Probability
  Bayes' Theorem
5 Parameter Estimation

SLIDE 36

Parameter Estimation

A Simple Problem

Let X be the result of tossing a thumbtack and ΩX = {H, T}. Data instances: D1 = H, D2 = T, D3 = H, . . . , Dm = H Data set: D = {D1, D2, D3, . . . , Dm} Task: To estimate parameter θ = P(X=H).

SLIDE 37

Parameter Estimation

Likelihood

Data: D = {H, T, H, T, T, H, T}. As possible values of θ, which of the following is the most likely, and why?

θ = 0, θ = 0.01, θ = 0.5

θ = 0 contradicts the data because P(D|θ=0) = 0. It cannot explain the data at all. θ = 0.01 almost contradicts the data; it does not explain the data well. However, it is more consistent with the data than θ = 0 because P(D|θ=0.01) > P(D|θ=0). θ = 0.5 is more consistent with the data than θ = 0.01 because P(D|θ=0.5) > P(D|θ=0.01). It explains the data the best and is hence the most likely.

SLIDE 38

Parameter Estimation

Maximum Likelihood Estimation

In general, the larger P(D|θ) is, the more likely the value θ is. Likelihood of parameter θ given the data set: L(θ|D) = P(D|θ). The maximum likelihood estimate (MLE) θ* is

$$\theta^* = \arg\max_{\theta} L(\theta \mid D).$$

The MLE best explains, or best fits, the data.

SLIDE 39

Parameter Estimation

i.i.d and Likelihood

Assume the data instances D_1, ..., D_m are independent given θ:

$$P(D_1, \ldots, D_m \mid \theta) = \prod_{i=1}^{m} P(D_i \mid \theta)$$

Assume the data instances are identically distributed: P(D_i = H) = θ, P(D_i = T) = 1 − θ for all i. (Note: i.i.d. means independent and identically distributed.) Then

$$L(\theta \mid D) = P(D \mid \theta) = P(D_1, \ldots, D_m \mid \theta) = \prod_{i=1}^{m} P(D_i \mid \theta) = \theta^{m_h} (1 - \theta)^{m_t} \quad (1)$$

where m_h is the number of heads and m_t is the number of tails. This is the binomial likelihood.

SLIDE 40

Parameter Estimation

Example of Likelihood Function

Example: D = {D_1 = H, D_2 = T, D_3 = H, D_4 = H, D_5 = T}

L(θ|D) = P(D|θ) = P(D_1 = H|θ)P(D_2 = T|θ)P(D_3 = H|θ)P(D_4 = H|θ)P(D_5 = T|θ) = θ(1 − θ)θθ(1 − θ) = θ³(1 − θ)².
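Evaluating this likelihood at a few candidate values of θ (a sketch; the grid of values is an illustrative choice of ours):

```python
# Binomial likelihood L(theta | D) = theta^mh * (1 - theta)^mt for
# D = {H, T, H, H, T}: mh = 3, mt = 2.
def likelihood(theta, mh=3, mt=2):
    return theta**mh * (1 - theta)**mt

for theta in (0.0, 0.01, 0.5, 0.6, 1.0):
    print(theta, likelihood(theta))
# theta = 0.6 (= mh / m) gives the largest value, consistent with the MLE.
```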

SLIDE 41

Parameter Estimation

Sufficient Statistic

A sufficient statistic is a function s(D) of the data that summarizes the information relevant for computing the likelihood; that is, s(D) = s(D′) ⇒ L(θ|D) = L(θ|D′). Sufficient statistics tell us all there is to know about the data. Since L(θ|D) = θ^{m_h}(1 − θ)^{m_t}, the pair (m_h, m_t) is a sufficient statistic.

SLIDE 42

Parameter Estimation

Loglikelihood

Loglikelihood:

$$l(\theta \mid D) = \log L(\theta \mid D) = \log \theta^{m_h} (1 - \theta)^{m_t} = m_h \log \theta + m_t \log(1 - \theta)$$

Maximizing the likelihood is the same as maximizing the loglikelihood, and the latter is easier. Taking the derivative of l(θ|D) with respect to θ and setting it to zero, we get

$$\theta^* = \frac{m_h}{m_h + m_t} = \frac{m_h}{m}$$

The MLE is intuitive. It also has nice properties, e.g., consistency: θ* approaches the true value of θ with probability 1 as m goes to infinity.
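A sketch that checks the closed-form MLE against a brute-force grid search over the loglikelihood (the counts and grid are illustrative):

```python
import numpy as np

mh, mt = 3, 2
m = mh + mt

# Closed form from setting the derivative of the loglikelihood to zero.
theta_mle = mh / m

# Numerical check: evaluate the loglikelihood on a grid and take the argmax.
grid = np.linspace(0.001, 0.999, 999)
loglik = mh * np.log(grid) + mt * np.log(1 - grid)
print(theta_mle, grid[np.argmax(loglik)])  # both ~0.6
```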

SLIDE 43

Parameter Estimation

Drawback of MLE

Thumbtack tossing:

(m_h, m_t) = (3, 7). MLE: θ = 0.3.

Reasonable. The data suggest that the thumbtack is biased toward tails.

Coin tossing:

Case 1: (mh, mt) = (3, 7). MLE: θ = 0.3.

Not reasonable. Our experience (prior) strongly suggests that coins are fair, hence θ = 1/2. The size of the data set is too small to convince us that this particular coin is biased. The fact that we get (3, 7) instead of (5, 5) is probably due to randomness.

Case 2: (m_h, m_t) = (30,000, 70,000). MLE: θ = 0.3.

Reasonable. Data suggest that the coin is after all biased, overshadowing our prior.

MLE does not differentiate between these two cases. It does not take prior information into account.

SLIDE 44

Parameter Estimation

Two Views on Parameter Estimation

MLE: treats θ as an unknown but fixed parameter. Estimates it by θ*, the value that maximizes the likelihood function. Makes predictions based on the estimate: P(D_{m+1} = H | D) = θ*.

Bayesian estimation: treats θ as a random variable. Assumes a prior probability density over θ: p(θ). Uses the data to get the posterior probability of θ: p(θ|D).

SLIDE 45

Parameter Estimation

Two Views on Parameter Estimation

Bayesian estimation: predicting D_{m+1}.

$$P(D_{m+1} = H \mid D) = \int P(D_{m+1} = H, \theta \mid D)\, d\theta$$
$$= \int P(D_{m+1} = H \mid \theta, D)\, p(\theta \mid D)\, d\theta$$
$$= \int P(D_{m+1} = H \mid \theta)\, p(\theta \mid D)\, d\theta$$
$$= \int \theta\, p(\theta \mid D)\, d\theta.$$

Full Bayesian: take the expectation over θ. Bayesian MAP: P(D_{m+1} = H | D) = θ*, where θ* = arg max_θ p(θ|D).

SLIDE 46

Parameter Estimation

Calculating Bayesian Estimation

Posterior distribution:

$$p(\theta \mid D) \propto p(\theta)\, L(\theta \mid D) = \theta^{m_h} (1 - \theta)^{m_t}\, p(\theta),$$

where the equality follows from (1). To facilitate analysis, assume the prior has a Beta distribution B(α_h, α_t):

$$p(\theta) \propto \theta^{\alpha_h - 1} (1 - \theta)^{\alpha_t - 1}$$

Then

$$p(\theta \mid D) \propto \theta^{m_h + \alpha_h - 1} (1 - \theta)^{m_t + \alpha_t - 1} \quad (2)$$
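Equation (2) in code: with a Beta prior, the posterior is again a Beta distribution with updated counts. A sketch using SciPy (the prior and data counts below are illustrative):

```python
from scipy.stats import beta

# Prior B(alpha_h, alpha_t); data with mh heads and mt tails.
alpha_h, alpha_t = 2, 2
mh, mt = 3, 7

prior = beta(alpha_h, alpha_t)
posterior = beta(mh + alpha_h, mt + alpha_t)   # equation (2), as a Beta distribution

print(prior.mean(), posterior.mean())  # 0.5 -> (3+2)/(10+4) ~ 0.357
```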

SLIDE 47

Parameter Estimation

Beta Distribution

The normalization constant for the Beta distribution B(α_h, α_t) is

$$\frac{\Gamma(\alpha_h + \alpha_t)}{\Gamma(\alpha_h)\,\Gamma(\alpha_t)},$$

where Γ(·) is the Gamma function. For any positive integer α, Γ(α) = (α − 1)!. It is also defined for non-integers.

The density function of the prior Beta distribution B(α_h, α_t) is

$$p(\theta) = \frac{\Gamma(\alpha_h + \alpha_t)}{\Gamma(\alpha_h)\,\Gamma(\alpha_t)}\, \theta^{\alpha_h - 1} (1 - \theta)^{\alpha_t - 1}$$

The hyperparameters α_h and α_t can be thought of as "imaginary" counts from our prior experience. Their sum α = α_h + α_t is called the equivalent sample size. The larger the equivalent sample size, the more confident we are in our prior.

SLIDE 48

Parameter Estimation

Conjugate Families

Binomial likelihood: θ^{m_h}(1 − θ)^{m_t}. Beta prior: θ^{α_h − 1}(1 − θ)^{α_t − 1}. Beta posterior: θ^{m_h + α_h − 1}(1 − θ)^{m_t + α_t − 1}. The Beta distributions are hence called a conjugate family for the binomial likelihood. Conjugate families allow a closed-form posterior distribution over the parameters and a closed-form solution for prediction.

SLIDE 49

Parameter Estimation

Calculating Prediction

We have

$$P(D_{m+1} = H \mid D) = \int \theta\, p(\theta \mid D)\, d\theta = c \int \theta \cdot \theta^{m_h + \alpha_h - 1} (1 - \theta)^{m_t + \alpha_t - 1}\, d\theta = \frac{m_h + \alpha_h}{m + \alpha},$$

where c is the normalization constant, m = m_h + m_t, and α = α_h + α_t. Consequently,

$$P(D_{m+1} = T \mid D) = \frac{m_t + \alpha_t}{m + \alpha}.$$

After taking the data D into consideration, our updated belief in X = T is (m_t + α_t)/(m + α).
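A sketch of this prediction rule (the function name is ours); it also reproduces the two coin-tossing cases discussed on the next slide:

```python
def predict_head(mh, mt, alpha_h, alpha_t):
    """P(D_{m+1} = H | D) = (mh + alpha_h) / (m + alpha)."""
    return (mh + alpha_h) / (mh + mt + alpha_h + alpha_t)

# Coin example with a strong fair-coin prior (equivalent sample size 200):
print(predict_head(3, 7, 100, 100))            # ~0.49: the prior prevails
print(predict_head(30_000, 70_000, 100, 100))  # ~0.30: the data prevail
```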

SLIDE 50

Parameter Estimation

MLE and Bayesian estimation

As m goes to infinity, P(D_{m+1} = H | D) approaches the MLE m_h/(m_h + m_t), which approaches the true value of θ with probability 1.

Coin tossing example revisited. Suppose α_h = α_t = 100, so the equivalent sample size is 200.

In case 1, P(D_{m+1} = H | D) = (3 + 100)/(10 + 100 + 100) ≈ 0.5. Our prior prevails.

In case 2, P(D_{m+1} = H | D) = (30,000 + 100)/(100,000 + 100 + 100) ≈ 0.3. The data prevail.

SLIDE 51

Parameter Estimation

MLE vs Bayesian Estimation

Much of machine learning is about parameter estimation. In all cases, both MLE and Bayesian estimation can be used, although the latter is mathematically harder. In this course, we will focus on MLE.
