

SLIDE 1

Probability and Information Theory

Lecture slides for Chapter 3 of Deep Learning www.deeplearningbook.org Ian Goodfellow 2016-09-26

SLIDE 2

Probability Mass Function

  • The domain of $P$ must be the set of all possible states of $\mathrm{x}$.
  • $\forall x \in \mathrm{x},\ 0 \le P(x) \le 1$. An impossible event has probability 0, and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring.
  • $\sum_{x \in \mathrm{x}} P(x) = 1$. We refer to this property as being normalized. Without this property, we could obtain probabilities greater than one by computing the probability of one of many events occurring.

Example: uniform distribution: $P(\mathrm{x} = x_i) = \frac{1}{k}$.
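As a quick sanity check, here is a minimal NumPy sketch (the choice of `k` is illustrative, not from the slides) verifying that the uniform PMF satisfies the properties above:

```python
import numpy as np

k = 6                       # number of discrete states (e.g., a fair die)
pmf = np.full(k, 1.0 / k)   # uniform PMF: P(x = x_i) = 1/k

assert np.all((pmf >= 0) & (pmf <= 1))  # 0 <= P(x) <= 1 for every state
assert np.isclose(pmf.sum(), 1.0)       # normalized: sum_x P(x) = 1
```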

SLIDE 3

Probability Density Function

  • The domain of p must be the set of all possible states of x.
  • $\forall x \in \mathrm{x},\ p(x) \ge 0$. Note that we do not require $p(x) \le 1$.
  • $\int p(x)\,dx = 1$.

Example: uniform distribution: $u(x; a, b) = \frac{1}{b - a}$ for $x \in [a, b]$ and $0$ elsewhere, so there is no probability mass outside the interval and the density integrates to 1.
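A minimal numeric check of that last claim, assuming SciPy; the endpoints `a` and `b` are illustrative values:

```python
from scipy.integrate import quad

a, b = 2.0, 5.0  # illustrative interval endpoints

def u(x):
    """Uniform density u(x; a, b): 1/(b - a) on [a, b], 0 elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

# Integrate over a range covering [a, b]; pass the discontinuities as break points.
total, _ = quad(u, -10, 10, points=[a, b])
print(total)  # ~1.0: the density integrates to 1
```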

SLIDE 4

Computing Marginal Probability with the Sum Rule

$$\forall x \in \mathrm{x},\ P(\mathrm{x} = x) = \sum_y P(\mathrm{x} = x, \mathrm{y} = y). \tag{3.3}$$

$$p(x) = \int p(x, y)\,dy. \tag{3.4}$$
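A sketch of the discrete sum rule (3.3), assuming NumPy; the joint table `P_xy` is an illustrative made-up distribution:

```python
import numpy as np

# Illustrative joint distribution P(x, y): rows index x, columns index y.
P_xy = np.array([[0.1, 0.2],
                 [0.3, 0.4]])

P_x = P_xy.sum(axis=1)  # sum rule: P(x) = sum_y P(x, y)
print(P_x)              # [0.3, 0.7]
```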

SLIDE 5

Conditional Probability

$$P(\mathrm{y} = y \mid \mathrm{x} = x) = \frac{P(\mathrm{y} = y, \mathrm{x} = x)}{P(\mathrm{x} = x)}. \tag{3.5}$$
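Applying eq. 3.5 to the same kind of illustrative joint table used above (again a made-up distribution, not from the slides):

```python
import numpy as np

# Illustrative joint table P(x, y); rows index x, columns index y.
P_xy = np.array([[0.1, 0.2],
                 [0.3, 0.4]])

# Eq. 3.5: P(y | x = 0) = P(x = 0, y) / P(x = 0).
P_y_given_x0 = P_xy[0] / P_xy[0].sum()
print(P_y_given_x0)  # [0.333..., 0.666...]; sums to 1 as a distribution over y
```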

SLIDE 6

Chain Rule of Probability

$$P(\mathrm{x}^{(1)}, \ldots, \mathrm{x}^{(n)}) = P(\mathrm{x}^{(1)}) \prod_{i=2}^{n} P(\mathrm{x}^{(i)} \mid \mathrm{x}^{(1)}, \ldots, \mathrm{x}^{(i-1)}). \tag{3.6}$$

SLIDE 7

Independence

$$\forall x \in \mathrm{x}, y \in \mathrm{y},\ p(\mathrm{x} = x, \mathrm{y} = y) = p(\mathrm{x} = x)\,p(\mathrm{y} = y). \tag{3.7}$$

SLIDE 8

Conditional Independence

$$\forall x \in \mathrm{x}, y \in \mathrm{y}, z \in \mathrm{z},\ p(\mathrm{x} = x, \mathrm{y} = y \mid \mathrm{z} = z) = p(\mathrm{x} = x \mid \mathrm{z} = z)\,p(\mathrm{y} = y \mid \mathrm{z} = z). \tag{3.8}$$

We can denote independence and conditional independence with compact notation: $\mathrm{x} \perp \mathrm{y}$ and $\mathrm{x} \perp \mathrm{y} \mid \mathrm{z}$, respectively.

SLIDE 9

Expectation

$$\mathbb{E}_{\mathrm{x} \sim P}[f(x)] = \sum_x P(x)\,f(x), \tag{3.9}$$

$$\mathbb{E}_{\mathrm{x} \sim p}[f(x)] = \int p(x)\,f(x)\,dx. \tag{3.10}$$

Linearity of expectations:

$$\mathbb{E}_{\mathrm{x}}[\alpha f(x) + \beta g(x)] = \alpha\,\mathbb{E}_{\mathrm{x}}[f(x)] + \beta\,\mathbb{E}_{\mathrm{x}}[g(x)]. \tag{3.11}$$
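A Monte Carlo sketch of eq. 3.11, assuming NumPy; the distribution, the functions `f` and `g`, and the coefficients are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)   # samples x ~ p (standard normal, illustrative)

f, g = np.square, np.abs       # illustrative functions f and g
alpha, beta = 2.0, -3.0

lhs = np.mean(alpha * f(x) + beta * g(x))           # E[alpha f(x) + beta g(x)]
rhs = alpha * np.mean(f(x)) + beta * np.mean(g(x))  # alpha E[f(x)] + beta E[g(x)]
print(np.isclose(lhs, rhs))    # True: expectation is linear
```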

SLIDE 10

Variance and Covariance

$$\mathrm{Var}(f(x)) = \mathbb{E}\left[\left(f(x) - \mathbb{E}[f(x)]\right)^2\right]. \tag{3.12}$$

$$\mathrm{Cov}(f(x), g(y)) = \mathbb{E}\left[\left(f(x) - \mathbb{E}[f(x)]\right)\left(g(y) - \mathbb{E}[g(y)]\right)\right]. \tag{3.13}$$

Covariance matrix:

$$\mathrm{Cov}(\mathbf{x})_{i,j} = \mathrm{Cov}(\mathrm{x}_i, \mathrm{x}_j). \tag{3.14}$$

The diagonal elements of the covariance give the variance:

$$\mathrm{Cov}(\mathrm{x}_i, \mathrm{x}_i) = \mathrm{Var}(\mathrm{x}_i). \tag{3.15}$$
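A minimal sketch of eqs. 3.14–3.15 using NumPy's sample covariance; the two correlated variables are an illustrative construction:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two correlated variables (illustrative): y = x + noise.
x = rng.normal(size=100_000)
y = x + 0.5 * rng.normal(size=100_000)

C = np.cov(np.stack([x, y]))  # 2x2 covariance matrix, eq. 3.14
print(C)                      # diagonal entries are Var(x) and Var(y), eq. 3.15
```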

SLIDE 11

Bernoulli Distribution

$$P(\mathrm{x} = 1) = \phi \tag{3.16}$$
$$P(\mathrm{x} = 0) = 1 - \phi \tag{3.17}$$
$$P(\mathrm{x} = x) = \phi^x (1 - \phi)^{1 - x} \tag{3.18}$$
$$\mathbb{E}_{\mathrm{x}}[\mathrm{x}] = \phi \tag{3.19}$$
$$\mathrm{Var}_{\mathrm{x}}(\mathrm{x}) = \phi(1 - \phi) \tag{3.20}$$
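A quick empirical check of eqs. 3.19–3.20, assuming NumPy; the parameter value is illustrative:

```python
import numpy as np

phi = 0.3  # illustrative Bernoulli parameter
rng = np.random.default_rng(0)
x = rng.binomial(1, phi, size=100_000)  # Bernoulli(phi) samples

print(x.mean())  # ~phi            (eq. 3.19)
print(x.var())   # ~phi * (1-phi)  (eq. 3.20)
```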

SLIDE 12

Gaussian Distribution

Parametrized by variance:

$$\mathcal{N}(x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}}\,\exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right). \tag{3.21}$$

See figure 3.1 for a plot of the density function.

Parametrized by precision:

$$\mathcal{N}(x; \mu, \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}}\,\exp\left(-\frac{1}{2}\beta(x - \mu)^2\right). \tag{3.22}$$
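A sketch confirming that the two parametrizations agree when $\beta = 1/\sigma^2$, assuming NumPy; the function names and test values are illustrative:

```python
import numpy as np

def gauss_var(x, mu, sigma2):
    """Eq. 3.21: Gaussian density parametrized by variance."""
    return np.sqrt(1.0 / (2 * np.pi * sigma2)) * np.exp(-(x - mu) ** 2 / (2 * sigma2))

def gauss_prec(x, mu, beta):
    """Eq. 3.22: the same density parametrized by precision beta = 1/sigma^2."""
    return np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * (x - mu) ** 2)

x = np.linspace(-3, 3, 7)
print(np.allclose(gauss_var(x, 0.0, 2.0), gauss_prec(x, 0.0, 0.5)))  # True
```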

SLIDE 13

Gaussian Distribution

Figure 3.1: The Gaussian density $p(x)$. Maximum at $x = \mu$; inflection points at $x = \mu \pm \sigma$.

SLIDE 14

Multivariate Gaussian

Parametrized by covariance matrix:

$$\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sqrt{\frac{1}{(2\pi)^n \det(\boldsymbol{\Sigma})}}\,\exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right). \tag{3.23}$$

Parametrized by precision matrix:

$$\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\beta}^{-1}) = \sqrt{\frac{\det(\boldsymbol{\beta})}{(2\pi)^n}}\,\exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\beta} (\mathbf{x} - \boldsymbol{\mu})\right). \tag{3.24}$$
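A sketch evaluating eq. 3.23 directly and comparing against SciPy's implementation; the mean, covariance, and test point are illustrative values:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])         # illustrative mean
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])    # illustrative covariance matrix

x = np.array([0.5, 0.5])
n = len(mu)

# Eq. 3.23 evaluated directly...
diff = x - mu
direct = np.sqrt(1.0 / ((2 * np.pi) ** n * np.linalg.det(Sigma))) * \
         np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# ...matches the library density.
print(np.isclose(direct, multivariate_normal(mu, Sigma).pdf(x)))  # True
```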

SLIDE 15

More Distributions

Exponential:

$$p(x; \lambda) = \lambda\,\mathbf{1}_{x \ge 0}\,\exp(-\lambda x). \tag{3.25}$$

The exponential distribution uses the indicator function $\mathbf{1}_{x \ge 0}$ to assign probability zero to all negative values of $x$.

Laplace:

$$\mathrm{Laplace}(x; \mu, \gamma) = \frac{1}{2\gamma}\,\exp\left(-\frac{|x - \mu|}{\gamma}\right). \tag{3.26}$$

Dirac:

$$p(x) = \delta(x - \mu). \tag{3.27}$$

SLIDE 16

Empirical Distribution

$$\hat{p}(\mathbf{x}) = \frac{1}{m} \sum_{i=1}^{m} \delta(\mathbf{x} - \mathbf{x}^{(i)}) \tag{3.28}$$

SLIDE 17

Mixture Distributions

$$P(\mathrm{x}) = \sum_i P(\mathrm{c} = i)\,P(\mathrm{x} \mid \mathrm{c} = i) \tag{3.29}$$

Figure 3.2: Samples from a Gaussian mixture with three components (axes $x_1$, $x_2$).
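A sketch of ancestral sampling from a mixture per eq. 3.29, assuming NumPy; the weights, means, and standard deviations are an illustrative 1-D mixture, not the one in figure 3.2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3-component Gaussian mixture in 1-D.
weights = np.array([0.5, 0.3, 0.2])   # P(c = i)
means = np.array([-2.0, 0.0, 3.0])
stds = np.array([0.5, 1.0, 0.8])

# Ancestral sampling: draw the component identity c, then x | c.
c = rng.choice(3, size=10_000, p=weights)
x = rng.normal(means[c], stds[c])
print(x[:5])
```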

SLIDE 18

Logistic Sigmoid

$$\sigma(x) = \frac{1}{1 + \exp(-x)} \tag{3.30}$$

Commonly used to parametrize Bernoulli distributions.

Figure 3.3: The logistic sigmoid function.

SLIDE 19

Softplus Function

$$\zeta(x) = \log\left(1 + \exp(x)\right) \tag{3.31}$$

Figure 3.4: The softplus function.
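A minimal sketch of both functions, assuming NumPy, checking two standard identities: $\zeta(x) - \zeta(-x) = x$ and $\sigma(-x) = 1 - \sigma(x)$.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, eq. 3.30."""
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    """Softplus, eq. 3.31: a smooth version of max(0, x)."""
    return np.log1p(np.exp(x))

x = np.linspace(-5, 5, 11)
print(np.allclose(softplus(x) - softplus(-x), x))  # zeta(x) - zeta(-x) = x
print(np.allclose(sigmoid(-x), 1.0 - sigmoid(x)))  # sigma(-x) = 1 - sigma(x)
```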

SLIDE 20

Bayes’ Rule

$$P(\mathrm{x} \mid \mathrm{y}) = \frac{P(\mathrm{x})\,P(\mathrm{y} \mid \mathrm{x})}{P(\mathrm{y})}. \tag{3.42}$$

Though $P(\mathrm{y})$ appears in the formula, it is usually feasible to compute it as $P(\mathrm{y}) = \sum_x P(\mathrm{y} \mid x)\,P(x)$.
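A worked sketch of eq. 3.42 on a classic diagnostic-test setup; all the numbers here are hypothetical, chosen only to show the mechanics:

```python
# Hypothetical diagnostic-test numbers (illustrative, not from the slides).
P_x = 0.01             # prior P(x): prevalence of a condition
P_y_given_x = 0.95     # likelihood P(y | x): test positive given condition
P_y_given_not_x = 0.05 # false-positive rate

# Compute P(y) by marginalizing over x, then apply Bayes' rule.
P_y = P_y_given_x * P_x + P_y_given_not_x * (1 - P_x)
P_x_given_y = P_x * P_y_given_x / P_y
print(P_x_given_y)  # ~0.161: the posterior is far below the test's accuracy
```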

SLIDE 21

Change of Variables

$$p_x(\mathbf{x}) = p_y(g(\mathbf{x}))\,\left|\det\left(\frac{\partial g(\mathbf{x})}{\partial \mathbf{x}}\right)\right|. \tag{3.47}$$
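A scalar sketch of eq. 3.47, assuming SciPy: with $\mathrm{y} = g(\mathrm{x}) = 2\mathrm{x}$ and $\mathrm{x} \sim \mathcal{N}(0, 1)$, $p_y$ must be the $\mathcal{N}(0, 4)$ density. The transform and distributions are illustrative choices:

```python
import numpy as np
from scipy.stats import norm

def g(x):
    return 2.0 * x  # illustrative change of variables y = g(x) = 2x

dg_dx = 2.0         # |dg/dx|, constant for this linear map

x = np.linspace(-3, 3, 13)
lhs = norm(0, 1).pdf(x)             # p_x(x), standard normal
rhs = norm(0, 2).pdf(g(x)) * dg_dx  # p_y(g(x)) |dg/dx|; note scale = std = 2
print(np.allclose(lhs, rhs))        # True: eq. 3.47 holds
```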

SLIDE 22

Information Theory

Information:

$$I(x) = -\log P(x). \tag{3.48}$$

Entropy:

$$H(\mathrm{x}) = \mathbb{E}_{\mathrm{x} \sim P}[I(x)] = -\mathbb{E}_{\mathrm{x} \sim P}[\log P(x)]. \tag{3.49}$$

KL divergence:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \mathbb{E}_{\mathrm{x} \sim P}\left[\log \frac{P(x)}{Q(x)}\right] = \mathbb{E}_{\mathrm{x} \sim P}\left[\log P(x) - \log Q(x)\right]. \tag{3.50}$$
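A sketch computing eqs. 3.49 and 3.50 for two illustrative discrete distributions, assuming NumPy; it also previews the asymmetry shown on the next slide:

```python
import numpy as np

# Two illustrative distributions over 3 states.
P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.3, 0.4, 0.3])

H_P = -np.sum(P * np.log(P))                  # entropy of P in nats, eq. 3.49
kl_PQ = np.sum(P * (np.log(P) - np.log(Q)))   # D_KL(P || Q), eq. 3.50
kl_QP = np.sum(Q * (np.log(Q) - np.log(P)))   # D_KL(Q || P)
print(H_P, kl_PQ, kl_QP)                      # KL is asymmetric: kl_PQ != kl_QP
```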

SLIDE 23

Entropy of a Bernoulli Variable

Figure 3.5: Shannon entropy (in nats) of a Bernoulli variable as a function of the Bernoulli parameter.

SLIDE 24

The KL Divergence is Asymmetric

Figure 3.6: Fitting an approximation $q^\ast(x)$ to a density $p(x)$. Left panel: $q^\ast = \operatorname{argmin}_q D_{\mathrm{KL}}(p \,\|\, q)$. Right panel: $q^\ast = \operatorname{argmin}_q D_{\mathrm{KL}}(q \,\|\, p)$. Axes: $x$ versus probability density.

SLIDE 25

Directed Model

$$p(a, b, c, d, e) = p(a)\,p(b \mid a)\,p(c \mid a, b)\,p(d \mid b)\,p(e \mid c). \tag{3.54}$$

Figure 3.7: A directed graphical model over random variables a, b, c, d, and e.

SLIDE 26

Undirected Model

$$p(a, b, c, d, e) = \frac{1}{Z}\,\phi^{(1)}(a, b, c)\,\phi^{(2)}(b, d)\,\phi^{(3)}(c, e). \tag{3.56}$$

Figure 3.8: An undirected graphical model over random variables a, b, c, d, and e.