
Probability & Information Theory

Shan-Hung Wu

shwu@cs.nthu.edu.tw

Department of Computer Science, National Tsing Hua University, Taiwan

Machine Learning


Outline

1. Random Variables & Probability Distributions
2. Multivariate & Derived Random Variables
3. Bayes’ Rule & Statistics
4. Application: Principal Components Analysis
5. Technical Details of Random Variables
6. Common Probability Distributions
7. Common Parametrizing Functions
8. Information Theory
9. Application: Decision Trees & Random Forest

Random Variables

A random variable x is a variable that can take on different values randomly.
E.g., Pr(x = x1) = 0.1, Pr(x = x2) = 0.3, etc. Technically, x is a function that maps events to real values.
A random variable must be coupled with a probability distribution P that specifies how likely each value is.
x ∼ P(θ) means “x has distribution P parametrized by θ.”

Probability Mass and Density Functions

If x is discrete, P(x = x) denotes a probability mass function Px(x) = Pr(x = x).
E.g., the output of a fair die has a discrete uniform distribution with P(x) = 1/6.
If x is continuous, P(x = x) denotes a probability density function px(x) ≥ 0.
Is px(x) a probability? No, it is the “rate of increase in probability at x”:
$\Pr(a \leq \mathrm{x} \leq b) = \int_{[a,b]} p(x)\,dx$
px(x) can be greater than 1. E.g., a continuous uniform distribution on [a, b] has p(x) = 1/(b − a) if x ∈ [a, b]; 0 otherwise.
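
A quick numerical reminder of the last point (a sketch, not from the slides): for Uniform(0, 1/2) the density value is 2 everywhere on its support, yet the total probability mass is still 1.

```python
# A density value is not a probability: it may exceed 1 as long as the
# density still integrates to 1 over the support.
a, b = 0.0, 0.5
density = 1.0 / (b - a)         # p(x) = 1/(b - a) = 2 for x in [a, b]
total_mass = density * (b - a)  # integral of the constant density over [a, b]

print(density)      # 2.0 -- a legal density value greater than 1
print(total_mass)   # 1.0 -- the total probability is still 1
```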

Marginal Probability

Consider a probability distribution over a set of variables, e.g., P(x, y).
The probability distribution over a subset of the random variables is called the marginal probability distribution:
$P(\mathrm{x} = x) = \sum_y P(x, y)$ or $\int p(x, y)\,dy$
Also called the sum rule of probability.

Conditional Probability

Conditional density function: $P(\mathrm{x} = x \mid \mathrm{y} = y) = \dfrac{P(\mathrm{x} = x, \mathrm{y} = y)}{P(\mathrm{y} = y)}$
Defined only when P(y = y) > 0.
Product rule of probability: $P(x^{(1)}, \cdots, x^{(n)}) = P(x^{(1)}) \prod_{i=2}^{n} P(x^{(i)} \mid x^{(1)}, \cdots, x^{(i-1)})$
E.g., P(a, b, c) = P(a | b, c) P(b | c) P(c).

Independence and Conditional Independence

We say a random variable x is independent of y iff P(x | y) = P(x).
Implies P(x, y) = P(x)P(y). Denoted by x ⊥ y.
We say a random variable x is conditionally independent of y given z iff P(x | y, z) = P(x | z).
Implies P(x, y | z) = P(x | z)P(y | z). Denoted by x ⊥ y | z.

Expectation

The expectation (or expected value or mean) of some function f with respect to x is the “average” value that f takes on:¹
$\mathbb{E}_{\mathrm{x} \sim P}[f(x)] = \sum_x P_{\mathrm{x}}(x) f(x)$ or $\int p_{\mathrm{x}}(x) f(x)\,dx = \mu_{f(\mathrm{x})}$
Expectation is linear: E[a f(x) + b] = a E[f(x)] + b for deterministic a and b.
E[E[f(x)]] = E[f(x)], as E[f(x)] is deterministic.

¹ The bracket [·] here is used to distinguish the parentheses inside and has nothing to do with functionals.

Expectation over Multiple Variables

Defined over the joint probability distribution, e.g.,
$\mathbb{E}[f(x, y)] = \sum_{x,y} P_{\mathrm{x},\mathrm{y}}(x, y) f(x, y)$ or $\int_{x,y} p_{\mathrm{x},\mathrm{y}}(x, y) f(x, y)\,dx\,dy$
$\mathbb{E}[f(x) \mid \mathrm{y} = y] = \int p_{\mathrm{x}\mid\mathrm{y}}(x \mid y) f(x)\,dx$ is called the conditional expectation.
E[f(x)g(y)] = E[f(x)] E[g(y)] if x and y are independent. [Proof]

Variance

The variance measures how much the values of f deviate from its expected value when seeing different values of x:
$\mathrm{Var}[f(x)] = \mathbb{E}\left[(f(x) - \mathbb{E}[f(x)])^2\right] = \sigma^2_{f(\mathrm{x})}$
σ_f(x) is called the standard deviation.
Var[f(x)] = E[f(x)²] − E[f(x)]² [Proof]
Var[a f(x) + b] = a² Var[f(x)] for deterministic a and b [Proof]

Covariance I

Covariance gives some sense of how much two values are linearly related to each other:
$\mathrm{Cov}[f(x), g(y)] = \mathbb{E}[(f(x) - \mathbb{E}[f(x)])(g(y) - \mathbb{E}[g(y)])]$
If the sign is positive, both variables tend to take on high values simultaneously; if the sign is negative, one variable tends to take on a high value while the other takes on a low one.
If x and y are independent, then Cov(x, y) = 0. [Proof]
The converse is not true, as x and y may be related in a nonlinear way, e.g., y = cos(x) and x ∼ Uniform(−π, π).
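
A quick numerical check of the example above (a sketch using NumPy): y is completely determined by x, yet the sample covariance is essentially zero.

```python
# Zero covariance does not imply independence.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-np.pi, np.pi, size=1_000_000)
y = np.cos(x)                       # y is a deterministic function of x

print(np.cov(x, y)[0, 1])           # ~0: zero covariance despite dependence
```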

Covariance II

Var(ax + by) = a² Var(x) + b² Var(y) + 2ab Cov(x, y) [Proof]
Var(x + y) = Var(x) + Var(y) if x and y are independent.
Cov(ax + b, cy + d) = ac Cov(x, y) [Proof]
Cov(ax + by, cw + dv) = ac Cov(x, w) + ad Cov(x, v) + bc Cov(y, w) + bd Cov(y, v) [Proof]

Multivariate Random Variables I

A multivariate random variable is denoted by x = [x1, ··· , xd]⊤.
Normally, the xi’s (attributes or variables or features) are dependent on each other.
P(x) is a joint distribution of x1, ··· , xd.
The mean of x is defined as µx = E(x) = [µx1, ··· , µxd]⊤.
The covariance matrix of x is defined as:
$\Sigma_{\mathbf{x}} = \begin{bmatrix} \sigma^2_{x_1} & \sigma_{x_1,x_2} & \cdots & \sigma_{x_1,x_d} \\ \sigma_{x_2,x_1} & \sigma^2_{x_2} & \cdots & \sigma_{x_2,x_d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{x_d,x_1} & \sigma_{x_d,x_2} & \cdots & \sigma^2_{x_d} \end{bmatrix}$
$\sigma_{x_i,x_j} = \mathrm{Cov}(x_i, x_j) = \mathbb{E}[(x_i - \mu_{x_i})(x_j - \mu_{x_j})] = \mathbb{E}(x_i x_j) - \mu_{x_i}\mu_{x_j}$
$\Sigma_{\mathbf{x}} = \mathrm{Cov}(\mathbf{x}) = \mathbb{E}\left[(\mathbf{x} - \boldsymbol{\mu}_{\mathbf{x}})(\mathbf{x} - \boldsymbol{\mu}_{\mathbf{x}})^\top\right] = \mathbb{E}(\mathbf{x}\mathbf{x}^\top) - \boldsymbol{\mu}_{\mathbf{x}}\boldsymbol{\mu}_{\mathbf{x}}^\top$

Multivariate Random Variables II

Σx is always symmetric.
Σx is always positive semidefinite. [Homework]
Σx is nonsingular iff it is positive definite.
Σx being singular implies that x has either:
deterministic/independent/non-linearly dependent attributes causing zero rows, or
redundant attributes causing linear dependency between rows.

Derived Random Variables

Let y = f(x; w) = w⊤x be a random variable transformed from x.
µy = E(w⊤x) = w⊤E(x) = w⊤µx
σ²y = w⊤Σxw [Homework]
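
A Monte Carlo sanity check of these two formulas (a sketch; the values of µ, Σ, and w below are made up for illustration): the sample mean and variance of y = w⊤x approach w⊤µx and w⊤Σxw.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
w = np.array([0.5, 1.5])

X = rng.multivariate_normal(mu, Sigma, size=500_000)  # i.i.d. samples of x
y = X @ w                                             # derived variable y = w^T x

print(y.mean(), w @ mu)          # both close to w^T mu_x
print(y.var(), w @ Sigma @ w)    # both close to w^T Sigma_x w
```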

What Does Pr(x = x) Mean?

1. Bayesian probability: it is a degree of belief or qualitative level of certainty.
2. Frequentist probability: if we can draw samples of x, then the proportion of samples having the value x equals Pr(x = x).

Bayes’ Rule

$P(y \mid x) = \dfrac{P(x \mid y) P(y)}{P(x)} = \dfrac{P(x \mid y) P(y)}{\sum_{y} P(x \mid \mathrm{y} = y) P(\mathrm{y} = y)}$
Bayes’ rule is so important in statistics (and ML as well) that each term has a name:
posterior of y = (likelihood of y) × (prior of y) / evidence
Why is it so important? E.g., a doctor diagnoses you as having a disease by letting x be “symptom” and y be “disease.”
P(x | y) and P(y) may be estimated from sample frequencies more easily.
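
A small worked illustration of the diagnosis example (a sketch; the prior and likelihood values below are hypothetical, not from the slides):

```python
# Bayes' rule: P(disease | symptom) = P(symptom | disease) P(disease) / P(symptom)
prior_disease = 0.01            # hypothetical P(y = disease)
p_symptom_given_disease = 0.9   # hypothetical P(x = symptom | y = disease)
p_symptom_given_healthy = 0.1   # hypothetical P(x = symptom | y = healthy)

evidence = (p_symptom_given_disease * prior_disease
            + p_symptom_given_healthy * (1 - prior_disease))
posterior = p_symptom_given_disease * prior_disease / evidence

print(posterior)  # ~0.083: the posterior stays small because the prior is small
```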

Point Estimation

Point estimation is the attempt to estimate some fixed but unknown quantity θ of a random variable by using sample data.
Let {x(1), ··· , x(n)} be a set of n independent and identically distributed (i.i.d.) samples of a random variable x; a point estimator or statistic is a function of the data:
$\hat{\theta}_n = g(x^{(1)}, \cdots, x^{(n)})$
θ̂n is called the estimate of θ.

Sample Mean and Covariance

Given X = [x(1), ··· , x(n)]⊤ ∈ Rn×d, the i.i.d. samples, what are the estimates of the mean and covariance of x?
A sample mean: $\hat{\boldsymbol{\mu}}_{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}^{(i)}$
A sample covariance matrix: $\hat{\Sigma}_{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n} (\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}}_{\mathbf{x}})(\mathbf{x}^{(i)} - \hat{\boldsymbol{\mu}}_{\mathbf{x}})^\top$
$\hat{\sigma}^2_{x_i,x_j} = \frac{1}{n}\sum_{s=1}^{n} (x^{(s)}_i - \hat{\mu}_{x_i})(x^{(s)}_j - \hat{\mu}_{x_j})$
If each x(i) is centered (by subtracting µ̂x first), then $\hat{\Sigma}_{\mathbf{x}} = \frac{1}{n} X^\top X$.
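
A NumPy sketch of these estimators, checking that (1/n) X⊤X on centered data matches the sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))        # n = 1000 i.i.d. samples, d = 3

mu_hat = X.mean(axis=0)               # sample mean
Xc = X - mu_hat                       # center each sample
Sigma_hat = (Xc.T @ Xc) / len(X)      # (1/n) X^T X on centered data

# Same as the explicit sum of outer products (bias=True uses the 1/n convention)
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True)))  # True
```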

Principal Components Analysis (PCA) I

Given a collection of data points X = $\{\mathbf{x}^{(i)}\}_{i=1}^{N}$, where x(i) ∈ RD.
Suppose we want to lossily compress X, i.e., to find a function f such that f(x(i)) = z(i) ∈ RK, where K < D.
How to keep the maximum info in X?

Principal Components Analysis (PCA) II

Let the x(i)’s be i.i.d. samples of a random variable x.
Let f be linear, i.e., f(x) = W⊤x for some W ∈ RD×K.
Principal Component Analysis (PCA) finds K orthonormal vectors W = [w(1), ··· , w(K)] such that the transformed variable z = W⊤x has the most “spread out” attributes, i.e., each attribute zj = w(j)⊤x has the maximum variance Var(zj).
w(1), ··· , w(K) are called the principal components.
Why do w(1), ··· , w(K) need to be orthogonal to each other?
Each w(j) keeps information that cannot be explained by the others, so together they preserve the most info.
Why ‖w(j)‖ = 1 for all j?
Only directions matter; we don’t want to maximize Var(zj) by finding a long w(j).

Solving W I

For simplicity, let’s consider K = 1 first. How to evaluate Var(z1)?
Recall that z1 = w(1)⊤x implies σ²z1 = w(1)⊤Σxw(1). [Homework]
How to get Σx? An estimate: Σ̂x = (1/N) X⊤X (assuming the x(i)’s are centered first).
Optimization problem to solve:
$\arg\max_{\mathbf{w}^{(1)} \in \mathbb{R}^D} \mathbf{w}^{(1)\top} X^\top X \mathbf{w}^{(1)}, \;\text{subject to}\; \|\mathbf{w}^{(1)}\| = 1$
X⊤X is symmetric and thus can be eigendecomposed.
By the Rayleigh quotient, the optimal w(1) is given by the eigenvector of X⊤X corresponding to the largest eigenvalue.

Solving W II

Optimization problem for w(2):
$\arg\max_{\mathbf{w}^{(2)} \in \mathbb{R}^D} \mathbf{w}^{(2)\top} X^\top X \mathbf{w}^{(2)}, \;\text{subject to}\; \|\mathbf{w}^{(2)}\| = 1 \;\text{and}\; \mathbf{w}^{(2)\top}\mathbf{w}^{(1)} = 0$
By the Rayleigh quotient again, w(2) is the eigenvector corresponding to the 2nd largest eigenvalue.
For the general case where K > 1, the w(1), ··· , w(K) are the eigenvectors of X⊤X corresponding to the largest K eigenvalues.
Proof by induction. [Proof]
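
A minimal PCA sketch following this derivation: center X, eigendecompose X⊤X, and keep the eigenvectors with the largest eigenvalues as the principal components.

```python
import numpy as np

def pca(X, K):
    """Return the top-K principal components (columns of W) and the projections Z."""
    Xc = X - X.mean(axis=0)                        # center the data first
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)   # X^T X is symmetric, so eigh applies
    order = np.argsort(eigvals)[::-1]              # sort eigenvalues in descending order
    W = eigvecs[:, order[:K]]                      # D x K matrix of orthonormal columns
    return W, Xc @ W                               # projections z^(i) = W^T x^(i)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
W, Z = pca(X, K=2)
print(np.round(W.T @ W, 6))                        # ~identity: components are orthonormal
```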

Visualization

Figure: PCA learns a linear projection that aligns the direction of greatest variance with the axes of the new space. With these new axes, the estimated covariance matrix $\hat{\Sigma}_{\mathbf{z}} = W^\top \hat{\Sigma}_{\mathbf{x}} W \in \mathbb{R}^{K \times K}$ is always diagonal.

Sure and Almost Sure Events

Given a continuous random variable x, we have Pr(x = x) = 0 for any value x.
Will the event x = x occur? Yes!
An event A happens surely if it always occurs.
An event A happens almost surely if Pr(A) = 1 (e.g., Pr(x ≠ x) = 1).

Equality of Random Variables I

Definition (Equality in Distribution): Two random variables x and y are equal in distribution iff Pr(x ≤ a) = Pr(y ≤ a) for all a.
Definition (Almost Sure Equality): Two random variables x and y are equal almost surely iff Pr(x = y) = 1.
Definition (Equality): Two random variables x and y are equal iff they map the same events to the same values.

Equality of Random Variables II

What’s the difference between “equality in distribution” and “almost sure equality”?
Almost sure equality implies equality in distribution, but the converse is not true.
E.g., let x and y be independent binary random variables with Px(0) = Px(1) = Py(0) = Py(1) = 0.5.
They are equal in distribution, but Pr(x = y) = 0.5 ≠ 1.

Convergence of Random Variables I

Definition (Convergence in Distribution): A sequence of random variables {x(1), x(2), ···} converges in distribution to x iff lim_{n→∞} P(x(n) = x) = P(x = x).
Definition (Convergence in Probability): A sequence of random variables {x(1), x(2), ···} converges in probability to x iff for any ε > 0, lim_{n→∞} Pr(|x(n) − x| < ε) = 1.
Definition (Almost Sure Convergence): A sequence of random variables {x(1), x(2), ···} converges almost surely to x iff Pr(lim_{n→∞} x(n) = x) = 1.

Convergence of Random Variables II

What’s the difference between convergence “in probability” and “almost surely”?
Almost sure convergence implies convergence in probability, but the converse is not true.
lim_{n→∞} Pr(|x(n) − x| < ε) = 1 leaves open the possibility that |x(n) − x| > ε happens an infinite number of times.
Pr(lim_{n→∞} x(n) = x) = 1 guarantees that, almost surely, |x(n) − x| > ε occurs only finitely many times.

Distribution of Derived Variables I

Suppose y = f(x) and f⁻¹ exists; does P(y = y) = P(x = f⁻¹(y)) always hold?
No, not when x and y are continuous.
Suppose x ∼ Uniform(0, 1) is continuous with p(x) = c for x ∈ (0, 1), and let y = x/2 ∼ Uniform(0, 1/2).
If py(y) = px(2y), then
$\int_{0}^{1/2} p_{\mathrm{y}}(y)\,dy = \int_{0}^{1/2} c\,dy = \frac{1}{2} \neq 1,$
which violates the axioms of probability.

Distribution of Derived Variables II

Recall that Pr(y = y) = py(y)dy and Pr(x = x) = px(x)dx.
Since f may distort space, we need to ensure that |py(f(x))dy| = |px(x)dx|.
We have $p_{\mathrm{y}}(y) = p_{\mathrm{x}}(f^{-1}(y)) \left|\frac{\partial f^{-1}(y)}{\partial y}\right|$ (or $p_{\mathrm{x}}(x) = p_{\mathrm{y}}(f(x)) \left|\frac{\partial f(x)}{\partial x}\right|$).
In the previous example: py(y) = 2 · px(2y).
In the multivariate case, we have $p_{\mathbf{y}}(\mathbf{y}) = p_{\mathbf{x}}(f^{-1}(\mathbf{y})) \left|\det\left(J(f^{-1})(\mathbf{y})\right)\right|$, where J(f⁻¹)(y) is the Jacobian matrix of f⁻¹ at input y: $J(f^{-1})(\mathbf{y})_{i,j} = \partial f^{-1}_i(\mathbf{y}) / \partial y_j$.

Random Experiments

The value of a random variable x can be thought of as the outcome of a random experiment.
This helps us define P(x).

Bernoulli Distribution (Discrete)

Let x ∈ {0, 1} be the outcome of tossing a coin. We have:
$\mathrm{Bernoulli}(\mathrm{x} = x; \rho) = \begin{cases} \rho, & \text{if } x = 1 \\ 1 - \rho, & \text{otherwise} \end{cases} \;=\; \rho^x (1 - \rho)^{1-x}$
Properties: [Proof]
E(x) = ρ and Var(x) = ρ(1 − ρ)

Categorical Distribution (Discrete)

Let x ∈ {1, ··· , k} be the outcome of rolling a k-sided die. We have:
$\mathrm{Categorical}(\mathrm{x} = x; \boldsymbol{\rho}) = \prod_{i=1}^{k} \rho_i^{\mathbb{1}(x;\, x = i)}, \;\text{where}\; \mathbf{1}^\top \boldsymbol{\rho} = 1$
An extension of the Bernoulli distribution to k states.

Multinomial Distribution (Discrete)

Let x ∈ Rk be a random vector where xi is the number of times outcome i occurs after rolling a k-sided die n times:
$\mathrm{Multinomial}(\mathbf{x} = \mathbf{x}; n, \boldsymbol{\rho}) = \frac{n!}{x_1! \cdots x_k!} \prod_{i=1}^{k} \rho_i^{x_i}, \;\text{where}\; \mathbf{1}^\top \boldsymbol{\rho} = 1 \;\text{and}\; \mathbf{1}^\top \mathbf{x} = n$
Properties: [Proof]
E(x) = nρ and Var(x) = n(diag(ρ) − ρρ⊤)
(i.e., Var(xi) = nρi(1 − ρi) and Cov(xi, xj) = −nρiρj)

Normal/Gaussian Distribution (Continuous)

Theorem (Central Limit Theorem): The sum x of many independent random variables is approximately normally/Gaussian distributed:
$\mathcal{N}(\mathrm{x} = x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$
Holds regardless of the original distributions of the individual variables.
µx = µ and σ²x = σ².
To avoid inverting σ², we can parametrize the distribution using the precision β:
$\mathcal{N}(\mathrm{x} = x; \mu, \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}} \exp\left(-\frac{\beta}{2}(x - \mu)^2\right)$
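
A quick illustration of the Central Limit Theorem (a sketch): sums of many independent Uniform(0, 1) variables have roughly Gaussian moments, with mean n/2 and variance n/12.

```python
import numpy as np

rng = np.random.default_rng(4)
n_terms = 30
s = rng.uniform(size=(200_000, n_terms)).sum(axis=1)  # sums of 30 uniforms

print(s.mean(), n_terms * 0.5)           # mean ~ n * 1/2
print(s.var(), n_terms * (1.0 / 12.0))   # variance ~ n * 1/12
```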

Confidence Intervals

Figure: Graph of N(µ, σ²).
We say the interval [µ − 2σ, µ + 2σ] has about 95% confidence.

Why Is the Gaussian Distribution Common in ML?

1. It can model noise in data (e.g., Gaussian white noise).
Such noise can be considered the accumulation of a large number of small, independent latent factors affecting the data collection process.
2. Out of all possible probability distributions (over real numbers) with the same variance, it encodes the maximum amount of uncertainty.
Assuming P(y | x) ∼ N, we insert the least amount of prior knowledge into a model.
3. It is convenient for many analytical manipulations.
Closed under affine transformation, summation, marginalization, conditioning, etc. Many of the integrals involving Gaussian distributions that arise in practice have simple closed-form solutions.

Properties

Closed under affine transformation: if x ∼ N(µ, σ²), then ax + b ∼ N(aµ + b, a²σ²) for any deterministic a, b ∈ R, a ≠ 0. [Proof]
z = (x − µ)/σ ∼ N(0, 1) is the z-normalization or standardization of x.
Closed under summation: if x(1) ∼ N(µ(1), σ²(1)) is independent of x(2) ∼ N(µ(2), σ²(2)), then x(1) + x(2) ∼ N(µ(1) + µ(2), σ²(1) + σ²(2)).
[Homework: $p_{\mathrm{x}^{(1)}+\mathrm{x}^{(2)}}(x) = \int p_{\mathrm{x}^{(1)}}(x - y)\, p_{\mathrm{x}^{(2)}}(y)\,dy$, the convolution]
Not true if x(1) and x(2) are dependent.

Multivariate Gaussian Distribution

When x is the sum of many random vectors:
$\mathcal{N}(\mathbf{x} = \mathbf{x}; \boldsymbol{\mu}, \Sigma) = \sqrt{\frac{1}{(2\pi)^d \det(\Sigma)}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$
µx = µ and Σx = Σ (must be nonsingular).
If x ∼ N(µ, Σ), then each attribute xi is univariate normal. The converse is not true.
However, if x1, ··· , xd are independent and xi ∼ N(µi, σ²i), then x ∼ N(µ, Σ), where µ = [µ1, ··· , µd]⊤ and Σ = diag(σ²1, ··· , σ²d).
What does the graph of N(µ, Σ) look like?

Bivariate Example I

Consider the Mahalanobis distance first:
$\mathcal{N}(\boldsymbol{\mu}, \Sigma) = \sqrt{\frac{1}{(2\pi)^d \det(\Sigma)}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$
Figure: level sets over (x1, x2) for the cases Cov(x1, x2) = 0 with Var(x1) = Var(x2), Cov(x1, x2) = 0 with Var(x1) > Var(x2), Cov(x1, x2) > 0, and Cov(x1, x2) < 0.
The level sets closer to the center µx are lower.
Increasing Cov[x1, x2] stretches the level sets along the 45° axis; decreasing Cov[x1, x2] stretches the level sets along the −45° axis.

Bivariate Example II

The height of
$\mathcal{N}(\boldsymbol{\mu}, \Sigma) = \sqrt{\frac{1}{(2\pi)^d \det(\Sigma)}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$
in its graph is inversely proportional to the Mahalanobis distance.
Figure: density surface of a bivariate Gaussian over (x1, x2).
A multivariate Gaussian distribution is isotropic iff Σ = σI.

Properties

Closed under affine transformation: if x ∼ N(µ, Σ), then w⊤x ∼ N(w⊤µ, w⊤Σw) for any deterministic w ∈ Rd.
More generally, given W ∈ Rd×k, k < d, we have W⊤x ∼ N(W⊤µ, W⊤ΣW), which is k-variate normal. I.e., the projection of x onto a k-dimensional subspace is still normal.
Consider $\mathbf{x} = \begin{bmatrix}\mathbf{x}_1 \\ \mathbf{x}_2\end{bmatrix} \sim \mathcal{N}\left(\boldsymbol{\mu} = \begin{bmatrix}\boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2\end{bmatrix}, \Sigma = \begin{bmatrix}\Sigma_{1,1} & \Sigma_{1,2} \\ \Sigma_{2,1} & \Sigma_{2,2}\end{bmatrix}\right)$:
Closed under marginalization: x1 ∼ N(µ1, Σ1,1) [Proof: $P(\mathbf{x}_1) = \int_{\mathbf{x}_2} P(\mathbf{x}_1, \mathbf{x}_2; \boldsymbol{\mu}, \Sigma)\,d\mathbf{x}_2$]
Closed under conditioning: $(\mathbf{x}_1 \mid \mathbf{x}_2) \sim \mathcal{N}\left(\boldsymbol{\mu}_1 + \Sigma_{1,2}\Sigma_{2,2}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),\; \Sigma_{1,1} - \Sigma_{1,2}\Sigma_{2,2}^{-1}\Sigma_{2,1}\right)$ [Proof]

Exponential Distribution (Continuous)

In deep learning, we often want to have a probability distribution with a sharp point at x = 0.
To accomplish this, we can use the exponential distribution:
$\mathrm{Exponential}(\mathrm{x} = x; \lambda) = \lambda\, \mathbb{1}(x;\, x \geq 0) \exp(-\lambda x)$

Laplace Distribution (Continuous)

The Laplace distribution can be thought of as a “two-sided” exponential distribution centered at µ:
$\mathrm{Laplace}(\mathrm{x} = x; \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right)$

Dirac Distribution (Continuous)

In some cases, we wish to specify that all of the mass in a probability distribution clusters around a single data point µ.
This can be accomplished by using the Dirac distribution: Dirac(x = x; µ) = δ(x − µ), where δ(·) is the Dirac delta function that
1. is zero-valued everywhere except at input 0, and
2. integrates to 1.

Empirical Distribution (Continuous)

Given a dataset X = $\{\mathbf{x}^{(i)}\}_{i=1}^{N}$ where the x(i)’s are i.i.d. samples of x.
What is the distribution P(θ) that maximizes the likelihood P(θ | X) of X?
If x is discrete, the distribution simply reflects the empirical frequency of values:
$\mathrm{Empirical}(\mathrm{x} = x; X) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}(x;\, x = x^{(i)})$
If x is continuous, we have the empirical distribution:
$\mathrm{Empirical}(\mathrm{x} = x; X) = \frac{1}{N}\sum_{i=1}^{N} \delta(x - x^{(i)})$

Mixtures of Distributions

We may define a probability distribution by combining other, simpler probability distributions {P(i)(θ(i))}i.
E.g., the mixture model:
$\mathrm{Mixture}(\mathrm{x} = x; \boldsymbol{\rho}, \{\theta^{(i)}\}_i) = \sum_i P^{(i)}(\mathrm{x} = x \mid \mathrm{c} = i; \theta^{(i)})\, \mathrm{Categorical}(\mathrm{c} = i; \boldsymbol{\rho})$
The empirical distribution is a mixture distribution (where ρi = 1/N).
The component identity variable c is a latent variable, whose values are not observed.

Gaussian Mixture Model

A mixture model is called a Gaussian mixture model iff
$P^{(i)}(\mathrm{x} = x \mid \mathrm{c} = i; \theta^{(i)}) = \mathcal{N}^{(i)}(\mathrm{x} = x \mid \mathrm{c} = i; \boldsymbol{\mu}^{(i)}, \Sigma^{(i)}), \;\forall i$
Variants: Σ(i) = Σ, or Σ(i) = diag(σ), or Σ(i) = σI.
Any smooth density can be approximated by a Gaussian mixture model with enough components.
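
A sketch of sampling from a two-component Gaussian mixture, following the definition above: draw the latent component c ∼ Categorical(ρ) first, then draw x from that component's Gaussian (the parameter values below are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(5)
rho = np.array([0.3, 0.7])                          # mixing weights, sum to 1
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]  # component means
Sigmas = [np.eye(2), 0.5 * np.eye(2)]               # component covariances

c = rng.choice(len(rho), size=2_000, p=rho)         # latent component identities
x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in c])

print(np.bincount(c) / len(c))                      # ~rho: empirical component frequencies
print(x.shape)                                      # (2000, 2) samples from the mixture
```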

Parametrizing Functions

A probability distribution P(θ) is parametrized by θ.
In ML, θ may be the output value of a deterministic function, called a parametrizing function.

Logistic Function

The logistic function (a special case of sigmoid functions) is defined as:
$\sigma(x) = \frac{\exp(x)}{\exp(x) + 1} = \frac{1}{1 + \exp(-x)}$
It always takes on values in (0, 1).
Commonly used to produce the ρ parameter of a Bernoulli distribution.

Softplus Function

The softplus function: ζ(x) = log(1 + exp(x))
A “softened” version of x⁺ = max(0, x).
Range: (0, ∞).
Useful for producing the β or σ parameter of a Gaussian distribution.

Properties [Homework]

1 − σ(x) = σ(−x)
log σ(x) = −ζ(−x)
d/dx σ(x) = σ(x)(1 − σ(x))
d/dx ζ(x) = σ(x)
∀x ∈ (0, 1), σ⁻¹(x) = log(x / (1 − x))
∀x > 0, ζ⁻¹(x) = log(exp(x) − 1)
$\zeta(x) = \int_{-\infty}^{x} \sigma(y)\,dy$
ζ(x) − ζ(−x) = x
ζ(−x) is the softened x⁻ = max(0, −x); x = x⁺ − x⁻
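
A small sketch implementing the two parametrizing functions in a numerically careful way (using the identity log σ(x) = −ζ(−x) from the list), plus a spot check of two of the identities:

```python
import numpy as np

def softplus(x):
    # zeta(x) = log(1 + exp(x)), computed as max(0, x) + log1p(exp(-|x|)) for stability
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def sigmoid(x):
    # sigma(x) = exp(log sigma(x)) = exp(-zeta(-x))
    return np.exp(-softplus(-x))

x = np.linspace(-10, 10, 5)
print(np.allclose(1 - sigmoid(x), sigmoid(-x)))     # 1 - sigma(x) = sigma(-x)
print(np.allclose(softplus(x) - softplus(-x), x))   # zeta(x) - zeta(-x) = x
```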

What’s Information Theory?

Probability theory allows us to make uncertain statements and to reason in the presence of uncertainty.
Information theory allows us to quantify the amount of uncertainty.

Self-Information

Given a random variable x, how much information do you receive when seeing an event x = x?
1. Likely events should have low information.
E.g., we are less surprised by the outcome when tossing a biased coin.
2. Independent events should have additive information.
E.g., “two heads” should have twice as much info as “one head.”
The self-information: I(x = x) = −log P(x = x)
Called a bit if the base-2 logarithm is used; called a nat if base e is used.

Entropy

Self-information deals with a particular outcome.
We can quantify the amount of uncertainty in an entire probability distribution using the entropy:
$H(\mathrm{x} \sim P) = \mathbb{E}_{\mathrm{x} \sim P}[I(x)] = -\sum_x P(x) \log P(x) \;\text{ or }\; -\int p(x) \log p(x)\,dx$
Let 0 log 0 = lim_{x→0} x log x = 0.
Called the Shannon entropy when x is discrete; the differential entropy when x is continuous.
Figure: Shannon entropy H(x) over Bernoulli distributions with different ρ.
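
A small sketch computing the Shannon entropy of a Bernoulli(ρ) distribution in bits; as the figure suggests, it is maximized at ρ = 0.5.

```python
import numpy as np

def bernoulli_entropy(rho):
    p = np.array([rho, 1.0 - rho])
    p = p[p > 0]                         # convention: 0 log 0 = 0
    return -np.sum(p * np.log2(p))       # base-2 log gives entropy in bits

for rho in (0.0, 0.1, 0.5, 0.9):
    print(rho, bernoulli_entropy(rho))   # 0.0, ~0.469, 1.0, ~0.469
```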

slide-152
SLIDE 152

Average Code Length

Shannon entropy gives a lower bound on the number of “bits” needed on average to encode values drawn from a distribution P

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 65 / 76

slide-153
SLIDE 153

Average Code Length

Shannon entropy gives a lower bound on the number of “bits” needed on average to encode values drawn from a distribution P

Consider a random variable x ∼ Uniform having 8 equally likely states

To send a value x to the receiver, we would encode it into 3 bits

Shannon entropy: H(x ∼ Uniform) = −8 × (1/8) log2(1/8) = 3

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 65 / 76

slide-154
SLIDE 154

Average Code Length

Shannon entropy gives a lower bound on the number of “bits” needed on average to encode values drawn from a distribution P

Consider a random variable x ∼ Uniform having 8 equally likely states

To send a value x to the receiver, we would encode it into 3 bits

Shannon entropy: H(x ∼ Uniform) = −8 × (1/8) log2(1/8) = 3

If the probabilities of the 8 states are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64) instead, then H(x) = 2

The encoding 0, 10, 110, 1110, 111100, 111101, 111110, 111111 gives the average code length 2

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 65 / 76
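
The numbers on this slide can be checked with a short sketch (hypothetical Python, not from the slides): the entropy of the skewed 8-state distribution is 2 bits, and the listed prefix code attains exactly that average length:

```python
import numpy as np

probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

entropy = -sum(p * np.log2(p) for p in probs)             # H(x) in bits
avg_len = sum(p * len(c) for p, c in zip(probs, codes))   # expected code length in bits

print(entropy, avg_len)   # both 2.0: this prefix code meets the entropy lower bound
```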

slide-155
SLIDE 155

Kullback-Leibler (KL) Divergence

How many extra “bits” are needed on average to transmit a value drawn from distribution P when we use a code that was designed for another distribution Q?

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 66 / 76

slide-156
SLIDE 156

Kullback-Leibler (KL) Divergence

How many extra “bits” are needed on average to transmit a value drawn from distribution P when we use a code that was designed for another distribution Q?

Kullback-Leibler (KL) divergence (or relative entropy) from distribution Q to P:

DKL(P‖Q) = Ex∼P[log (P(x)/Q(x))] = −Ex∼P[log Q(x)] − H(x ∼ P)

The term −Ex∼P[log Q(x)] is called the cross entropy

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 66 / 76

slide-157
SLIDE 157

Kullback-Leibler (KL) Divergence

How many extra “bits” are needed on average to transmit a value drawn from distribution P when we use a code that was designed for another distribution Q?

Kullback-Leibler (KL) divergence (or relative entropy) from distribution Q to P:

DKL(P‖Q) = Ex∼P[log (P(x)/Q(x))] = −Ex∼P[log Q(x)] − H(x ∼ P)

The term −Ex∼P[log Q(x)] is called the cross entropy

Since H(x ∼ P) does not depend on Q, we can solve argminQ DKL(P‖Q) by solving argminQ −Ex∼P[log Q(x)], i.e., by minimizing the cross entropy

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 66 / 76
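
A minimal sketch (assuming Python/NumPy; the helper names and the two distributions are illustrative) that computes DKL(P‖Q) directly and via the decomposition "cross entropy minus H(P)":

```python
import numpy as np

def kl_divergence(p, q, base=2):
    """D_KL(P || Q) = sum_x P(x) log(P(x)/Q(x)); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return np.sum(p[m] * (np.log(p[m]) - np.log(q[m]))) / np.log(base)

def cross_entropy(p, q, base=2):
    """-E_{x~P}[log Q(x)]"""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return -np.sum(p[m] * np.log(q[m])) / np.log(base)

def entropy(p, base=2):
    p = np.asarray(p, float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

# Hypothetical distributions over 4 states.
P = [1/2, 1/4, 1/8, 1/8]
Q = [1/4, 1/4, 1/4, 1/4]

# D_KL(P||Q) is the extra bits paid, on average, for encoding samples
# from P with a code designed for Q.
print(kl_divergence(P, Q))                 # 0.25 bits
print(cross_entropy(P, Q) - entropy(P))    # 0.25 bits: same value via the decomposition
```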

slide-158
SLIDE 158

Properties

DKL(P‖Q) ≥ 0, ∀P, Q

DKL(P‖Q) = 0 iff P and Q are equal almost surely

KL divergence is asymmetric, i.e., DKL(P‖Q) ≠ DKL(Q‖P) in general

Figure: KL divergence for two normal distributions.

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 67 / 76
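
These properties can be checked numerically. The sketch below uses scipy.stats.entropy, which computes DKL when given two arguments; the distributions are made up for illustration:

```python
from scipy.stats import entropy  # entropy(p, q) computes D_KL(p || q)

P = [0.8, 0.1, 0.1]
Q = [1/3, 1/3, 1/3]

print(entropy(P, Q, base=2))  # ~0.66 bits (non-negative)
print(entropy(Q, P, base=2))  # ~0.74 bits: a different value, so D_KL is asymmetric
print(entropy(P, P, base=2))  # 0.0: the divergence vanishes iff the distributions are equal
```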

slide-159
SLIDE 159

Minimizer of KL Divergence

Given P, we want to find the Q∗ that minimizes the KL divergence. Should it be Q∗(from) = argminQ DKL(P‖Q) or Q∗(to) = argminQ DKL(Q‖P)?

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 68 / 76

slide-160
SLIDE 160

Minimizer of KL Divergence

Given P, we want to find the Q∗ that minimizes the KL divergence. Should it be Q∗(from) = argminQ DKL(P‖Q) or Q∗(to) = argminQ DKL(Q‖P)?

Q∗(from) places high probability where P has high probability; Q∗(to) places low probability where P has low probability

Figure: Approximating a mixture P of two Gaussians using a single Gaussian Q.

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 68 / 76
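
A rough numerical sketch of the figure's idea (assuming Python with NumPy/SciPy; the mixture parameters, grid, and search ranges are illustrative choices, not from the slides): fit a single Gaussian Q to a two-mode mixture P by grid search, once minimizing DKL(P‖Q) and once DKL(Q‖P):

```python
import numpy as np
from scipy.stats import norm

# Discretize the real line and define P: a mixture of two well-separated Gaussians.
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, loc=-3, scale=1) + 0.5 * norm.pdf(x, loc=3, scale=1)
p /= p.sum() * dx   # renormalize on the grid

def kl(a, b):
    """Discretized D_KL(a || b) on the grid (nats); b is floored to avoid log(0)."""
    m = a > 1e-12
    b = np.maximum(b, 1e-300)
    return np.sum(a[m] * np.log(a[m] / b[m])) * dx

best_fwd, best_rev = None, None
for mu in np.linspace(-5, 5, 101):
    for sigma in np.linspace(0.5, 5, 46):
        q = norm.pdf(x, loc=mu, scale=sigma)
        q /= q.sum() * dx
        fwd, rev = kl(p, q), kl(q, p)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

# Forward KL (argmin_Q D_KL(P||Q)) spreads Q to cover both modes (mu near 0, large sigma);
# reverse KL (argmin_Q D_KL(Q||P)) concentrates Q on a single mode (mu near +-3, sigma near 1).
print("argmin D_KL(P||Q):", best_fwd)
print("argmin D_KL(Q||P):", best_rev)
```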

slide-161
SLIDE 161

Outline

1

Random Variables & Probability Distributions

2

Multivariate & Derived Random Variables

3

Bayes’ Rule & Statistics

4

Application: Principal Components Analysis

5

Technical Details of Random Variables

6

Common Probability Distributions

7

Common Parametrizing Functions

8

Information Theory

9

Application: Decision Trees & Random Forest

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 69 / 76

slide-162
SLIDE 162

Decision Trees

Given a supervised dataset X = {(x(i), y(i))}, i = 1, …, N

Can we find a tree-like function f (i.e., a set of rules) such that f(x(i)) = y(i)?

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 70 / 76

slide-163
SLIDE 163

Training a Decision Tree

Start from the root, which corresponds to all data points {(x(i), y(i)) : Rules = ∅}

Recursively split leaf nodes until the data corresponding to the children are “pure” in labels

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 71 / 76

slide-164
SLIDE 164

Training a Decision Tree

Start from the root, which corresponds to all data points {(x(i), y(i)) : Rules = ∅}

Recursively split leaf nodes until the data corresponding to the children are “pure” in labels

How to split?

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 71 / 76

slide-165
SLIDE 165

Training a Decision Tree

Start from the root, which corresponds to all data points {(x(i), y(i)) : Rules = ∅}

Recursively split leaf nodes until the data corresponding to the children are “pure” in labels

How to split? Find a cutting point (j, v) among all unseen attributes such that, after partitioning the corresponding data points Xparent = {(x(i), y(i)) : Rules} into two groups Xleft = {(x(i), y(i)) : Rules ∪ {xj(i) < v}} and Xright = {(x(i), y(i)) : Rules ∪ {xj(i) ≥ v}}, the “impurity” of the labels drops the most

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 71 / 76

slide-166
SLIDE 166

Training a Decision Tree

Start from the root, which corresponds to all data points {(x(i), y(i)) : Rules = ∅}

Recursively split leaf nodes until the data corresponding to the children are “pure” in labels

How to split? Find a cutting point (j, v) among all unseen attributes such that, after partitioning the corresponding data points Xparent = {(x(i), y(i)) : Rules} into two groups Xleft = {(x(i), y(i)) : Rules ∪ {xj(i) < v}} and Xright = {(x(i), y(i)) : Rules ∪ {xj(i) ≥ v}}, the “impurity” of the labels drops the most, i.e., solve

argmaxj,v [Impurity(Xparent) − Impurity(Xleft, Xright)]
  • Shan-Hung Wu (CS, NTHU)
  • Prob. & Info. Theory

Machine Learning 71 / 76

slide-167
SLIDE 167

Impurity Measure

argmaxj,v [Impurity(Xparent) − Impurity(Xleft, Xright)]

What's Impurity(·)?

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 72 / 76

slide-168
SLIDE 168

Impurity Measure

argmaxj,v [Impurity(Xparent) − Impurity(Xleft, Xright)]

What's Impurity(·)?

Entropy is a common choice:

Impurity(Xparent) = H[y ∼ Empirical(Xparent)]

Impurity(Xleft, Xright) = ∑i=left,right (|X(i)| / |Xparent|) H[y ∼ Empirical(X(i))]

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 72 / 76

slide-169
SLIDE 169

Impurity Measure

argmaxj,v [Impurity(Xparent) − Impurity(Xleft, Xright)]

What's Impurity(·)?

Entropy is a common choice:

Impurity(Xparent) = H[y ∼ Empirical(Xparent)]

Impurity(Xleft, Xright) = ∑i=left,right (|X(i)| / |Xparent|) H[y ∼ Empirical(X(i))]

In this case, Impurity(Xparent) − Impurity(Xleft, Xright) is called the information gain

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 72 / 76
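
A minimal sketch of the split search described above (hypothetical Python/NumPy; the toy data and helper names are made up for illustration), using entropy-based impurity and information gain:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y_parent, y_left, y_right):
    """Impurity(parent) minus the size-weighted impurity of the two children."""
    n = len(y_parent)
    children = (len(y_left) / n) * entropy(y_left) + (len(y_right) / n) * entropy(y_right)
    return entropy(y_parent) - children

def best_split(X, y):
    """Search all (feature j, threshold v) pairs for the split maximizing information gain."""
    best = (None, None, -np.inf)          # (j, v, gain)
    for j in range(X.shape[1]):
        for v in np.unique(X[:, j]):
            left, right = X[:, j] < v, X[:, j] >= v
            if left.any() and right.any():
                gain = information_gain(y, y[left], y[right])
                if gain > best[2]:
                    best = (j, v, gain)
    return best

# Toy data (hypothetical): feature 0 separates the labels perfectly at v = 2.5.
X = np.array([[0.5, 3.0], [1.0, 1.0], [2.5, 2.0], [3.0, 0.5]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))   # (0, 2.5, 1.0): one full bit of information gained
```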

slide-170
SLIDE 170

Random Forests

A decision tree can be very deep

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 73 / 76

slide-171
SLIDE 171

Random Forests

A decision tree can be very deep

Deeper nodes give more specific rules

Backed by less training data

May not be applicable to testing data

How to ensure the generalizability of a decision tree?

I.e., to have high prediction accuracy on testing data

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 73 / 76

slide-172
SLIDE 172

Random Forests

A decision tree can be very deep

Deeper nodes give more specific rules

Backed by less training data

May not be applicable to testing data

How to ensure the generalizability of a decision tree?

I.e., to have high prediction accuracy on testing data

1

Pruning (e.g., limit the depth of the tree)

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 73 / 76

slide-173
SLIDE 173

Random Forests

A decision tree can be very deep

Deeper nodes give more specific rules

Backed by less training data

May not be applicable to testing data

How to ensure the generalizability of a decision tree?

I.e., to have high prediction accuracy on testing data

1

Pruning (e.g., limit the depth of the tree)

2

Random forest: an ensemble of many (deep) trees

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 73 / 76

slide-174
SLIDE 174

Training a Random Forest

1

Randomly pick M samples from the training set with replacement

Called the bootstrap samples

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 74 / 76

slide-175
SLIDE 175

Training a Random Forest

1

Randomly pick M samples from the training set with replacement

Called the bootstrap samples

2

Grow a decision tree from the bootstrap samples. At each node:

1

Randomly select K features without replacement

2

Find the best cutting point (j,v) and split the node

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 74 / 76

slide-176
SLIDE 176

Training a Random Forest

1

Randomly pick M samples from the training set with replacement

Called the bootstrap samples

2

Grow a decision tree from the bootstrap samples. At each node:

1

Randomly select K features without replacement

2

Find the best cutting point (j,v) and split the node

3

Repeat steps 1 and 2 T times to get T trees

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 74 / 76

slide-177
SLIDE 177

Training a Random Forest

1

Randomly pick M samples from the training set with replacement

Called the bootstrap samples

2

Grow a decision tree from the bootstrap samples. At each node:

1

Randomly select K features without replacement

2

Find the best cutting point (j,v) and split the node

3

Repeat steps 1 and 2 T times to get T trees

4

Aggregate the predictions made by different trees via the majority vote

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 74 / 76

slide-178
SLIDE 178

Training a Random Forest

1

Randomly pick M samples from the training set with replacement

Called the bootstrap samples

2

Grow a decision tree from the bootstrap samples. At each node:

1

Randomly select K features without replacement

2

Find the best cutting point (j,v) and split the node

3

Repeat steps 1 and 2 T times to get T trees

4

Aggregate the predictions made by different trees via the majority vote

Each tree is trained slightly differently because of Steps 1 and 2(a), which provides different “perspectives” when voting (a minimal training sketch follows this slide)

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 74 / 76
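
A rough sketch of Steps 1 to 4 (assuming Python with NumPy and scikit-learn's DecisionTreeClassifier as the base learner; the function names and defaults below are illustrative, and scikit-learn's own RandomForestClassifier packages the same procedure):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, y, n_trees=50, M=None, K=None, seed=0):
    """Steps 1-3: T bootstrap samples, one entropy-based tree per sample.
    X: (N, d) feature array, y: integer class labels."""
    rng = np.random.default_rng(seed)
    M = M or len(X)                              # bootstrap sample size (step 1)
    K = K or max(1, int(np.sqrt(X.shape[1])))    # features considered per node (a common default)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=M)    # step 1: sample M points with replacement
        # step 2: grow a tree; max_features=K makes sklearn pick K random candidate
        # features at every node before searching for the best cutting point (j, v)
        tree = DecisionTreeClassifier(criterion="entropy", max_features=K,
                                      random_state=int(rng.integers(1 << 31)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees                                 # step 3: T trees

def forest_predict(trees, X):
    """Step 4: aggregate the trees' predictions by majority vote."""
    votes = np.stack([t.predict(X) for t in trees])   # shape (T, n_points)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])

# Usage (hypothetical data):
# trees = train_random_forest(X_train, y_train, n_trees=100)
# y_pred = forest_predict(trees, X_test)
```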

slide-179
SLIDE 179

Decision Boundaries

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 75 / 76

slide-180
SLIDE 180

Decision Trees vs. Random Forests

Cons of random forests:

Less interpretable model

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 76 / 76

slide-181
SLIDE 181

Decision Trees vs. Random Forests

Cons of random forests:

Less interpretable model

Pros:

Less sensitive to the depth of trees

The majority voting can “absorb” the noise from individual trees

Can be parallelized

Each tree can grow independently

Shan-Hung Wu (CS, NTHU)

  • Prob. & Info. Theory

Machine Learning 76 / 76