Probability and Information Theory
Lecture slides for Chapter 3 of Deep Learning (www.deeplearningbook.org). Ian Goodfellow, 2016-09-26.
Probability Mass Function

The domain of P must be the set of all possible states of x.

∀x ∈ x, 0 ≤ P(x) ≤ 1. An impossible event has probability 0, and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring.

∑_{x∈x} P(x) = 1. We refer to this property as being normalized. Without this property, we could obtain probabilities greater than one by computing the probability of one of many events occurring.
Example: uniform distribution: P(x = x_i) = 1/k.
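As a quick illustration (not part of the original slides), a minimal NumPy sketch of a discrete uniform PMF and a check of its two defining properties; k = 6 is arbitrary:

```python
import numpy as np

# Discrete uniform PMF over k states: P(x = x_i) = 1/k.
k = 6
P = np.full(k, 1.0 / k)

assert np.all((P >= 0) & (P <= 1))  # every probability lies in [0, 1]
assert np.isclose(P.sum(), 1.0)     # normalized: probabilities sum to 1
```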
Probability Density Function

A probability density function p must be normalized: ∫ p(x) dx = 1.

Example: uniform distribution: u(x; a, b) = 1/(b − a) for x in [a, b].
Computing Marginal Probability with the Sum Rule

∀x ∈ x, P(x = x) = ∑_y P(x = x, y = y). (3.3)

For continuous variables, an integral replaces the sum:

p(x) = ∫ p(x, y) dy. (3.4)
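A minimal NumPy sketch of the sum rule on a small joint table (the table values are made up for illustration):

```python
import numpy as np

# Joint distribution P(x, y) as a table: rows index x, columns index y.
P_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

# Sum rule: marginalize out y by summing over the y axis.
P_x = P_xy.sum(axis=1)  # P(x = x) = sum over y of P(x = x, y = y)
print(P_x)              # [0.3 0.7]
```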
Conditional Probability

P(y = y | x = x) = P(y = y, x = x) / P(x = x). (3.5)
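Continuing the same illustrative joint table, conditioning is a division by the marginal:

```python
import numpy as np

P_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])  # illustrative joint P(x, y)
P_x = P_xy.sum(axis=1)           # marginal P(x)

# P(y | x) = P(y, x) / P(x), computed row by row.
P_y_given_x = P_xy / P_x[:, None]
print(P_y_given_x.sum(axis=1))   # each row sums to 1
```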
The Chain Rule of Conditional Probability

P(x^(1), ..., x^(n)) = P(x^(1)) ∏_{i=2}^{n} P(x^(i) | x^(1), ..., x^(i−1)). (3.6)
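A sketch verifying the chain rule in the two-variable case, P(x, y) = P(x) P(y | x), on the same illustrative table:

```python
import numpy as np

P_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])    # illustrative joint P(x, y)
P_x = P_xy.sum(axis=1)             # P(x)
P_y_given_x = P_xy / P_x[:, None]  # P(y | x)

# Chain rule: multiplying the factors back together recovers the joint.
assert np.allclose(P_x[:, None] * P_y_given_x, P_xy)
```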
Independence

Two random variables x and y are independent if their joint distribution can be expressed as a product of two factors, one involving only x and one involving only y:

∀x ∈ x, y ∈ y, p(x = x, y = y) = p(x = x) p(y = y). (3.7)
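A sketch of the factorization test, using marginals chosen for illustration:

```python
import numpy as np

# Build an independent joint as an outer product: P(x, y) = P(x) P(y).
P_x = np.array([0.3, 0.7])
P_y = np.array([0.6, 0.4])
P_xy = np.outer(P_x, P_y)

# Independence check: the joint equals the outer product of its marginals.
assert np.allclose(P_xy, np.outer(P_xy.sum(axis=1), P_xy.sum(axis=0)))
```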
Conditional Independence

∀x ∈ x, y ∈ y, z ∈ z, p(x = x, y = y | z = z) = p(x = x | z = z) p(y = y | z = z). (3.8)

We can denote independence and conditional independence with compact notation: x⊥y means that x and y are independent, while x⊥y | z means that x and y are conditionally independent given z.
Expectation

For discrete variables:

E_{x∼P}[f(x)] = ∑_x P(x) f(x). (3.9)

For continuous variables:

E_{x∼p}[f(x)] = ∫ p(x) f(x) dx. (3.10)

Linearity of expectations:

E_x[αf(x) + βg(x)] = αE_x[f(x)] + βE_x[g(x)]. (3.11)
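A Monte Carlo sketch of linearity, with f, g, α, β chosen arbitrarily; sample means satisfy the identity exactly because averaging is itself linear:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)  # samples x ~ p

f, g = np.square, np.abs      # two arbitrary functions
alpha, beta = 2.0, -3.0

# E[alpha f(x) + beta g(x)] vs. alpha E[f(x)] + beta E[g(x)].
lhs = np.mean(alpha * f(x) + beta * g(x))
rhs = alpha * np.mean(f(x)) + beta * np.mean(g(x))
assert np.isclose(lhs, rhs)
```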
Variance and Covariance

Var(f(x)) = E[(f(x) − E[f(x)])²]. (3.12)

Cov(f(x), g(y)) = E[(f(x) − E[f(x)]) (g(y) − E[g(y)])]. (3.13)

Covariance matrix:

Cov(x)_{i,j} = Cov(x_i, x_j). (3.14)

The diagonal elements of the covariance give the variance:

Cov(x_i, x_i) = Var(x_i). (3.15)
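A sketch with np.cov, confirming that the diagonal of the covariance matrix holds the variances:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 10_000))  # 3 variables, 10,000 samples each

C = np.cov(x)                     # 3x3 sample covariance matrix
# Diagonal entries are the variances: Cov(x_i, x_i) = Var(x_i).
assert np.allclose(np.diag(C), np.var(x, axis=1, ddof=1))
```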
Bernoulli Distribution

P(x = 1) = φ (3.16)
P(x = 0) = 1 − φ (3.17)
P(x = x) = φ^x (1 − φ)^(1−x) (3.18)
E_x[x] = φ (3.19)
Var_x(x) = φ(1 − φ) (3.20)
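A sketch checking the Bernoulli mean and variance formulas by direct summation over the two states (φ = 0.3 is illustrative):

```python
import numpy as np

phi = 0.3
x = np.array([0, 1])
P = phi**x * (1 - phi)**(1 - x)    # P(x) = phi^x (1 - phi)^(1 - x)

mean = np.sum(P * x)               # E[x] = phi
var = np.sum(P * (x - mean) ** 2)  # Var(x) = phi (1 - phi)
assert np.isclose(mean, phi)
assert np.isclose(var, phi * (1 - phi))
```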
Gaussian Distribution

Parametrized by variance:

N(x; µ, σ²) = √(1/(2πσ²)) exp(−(1/(2σ²)) (x − µ)²). (3.21)

Parametrized by precision β = 1/σ²:

N(x; µ, β⁻¹) = √(β/(2π)) exp(−(1/2) β (x − µ)²). (3.22)

See figure 3.1 for a plot of the density function.
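A sketch of both parametrizations, checking that they agree when β = 1/σ²:

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    """N(x; mu, sigma^2), parametrized by the variance sigma^2."""
    return np.sqrt(1.0 / (2 * np.pi * sigma2)) * np.exp(-(x - mu)**2 / (2 * sigma2))

def normal_pdf_precision(x, mu, beta):
    """N(x; mu, beta^-1), parametrized by the precision beta = 1/sigma^2."""
    return np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * (x - mu)**2)

# The two forms agree when beta = 1/sigma^2.
assert np.isclose(normal_pdf(0.5, 0.0, 2.0),
                  normal_pdf_precision(0.5, 0.0, 0.5))
```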
Figure 3.1: The normal distribution density p(x). Maximum at x = µ; inflection points at x = µ ± σ.
Multivariate Gaussian

Parametrized by covariance matrix:

N(x; µ, Σ) = √(1/((2π)ⁿ det(Σ))) exp(−(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ)). (3.23)

Parametrized by precision matrix:

N(x; µ, β⁻¹) = √(det(β)/(2π)ⁿ) exp(−(1/2) (x − µ)ᵀ β (x − µ)). (3.24)
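A sketch of the covariance parametrization in equation 3.23, using np.linalg.solve for the quadratic form rather than an explicit matrix inverse (x, µ, Σ values are illustrative):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """N(x; mu, Sigma) for an n-dimensional x with covariance matrix Sigma."""
    n = x.shape[0]
    d = x - mu
    norm = np.sqrt(1.0 / ((2 * np.pi) ** n * np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d))

x = np.array([0.5, -0.2])
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(mvn_pdf(x, mu, Sigma))
```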
More Distributions

Exponential:

p(x; λ) = λ 1_{x≥0} exp(−λx). (3.25)

The exponential distribution uses the indicator function 1_{x≥0} to assign probability zero to all negative values of x.

Laplace:

Laplace(x; µ, γ) = (1/(2γ)) exp(−|x − µ|/γ). (3.26)

Dirac:

p(x) = δ(x − µ). (3.27)
Empirical Distribution

p̂(x) = (1/m) ∑_{i=1}^{m} δ(x − x^(i)). (3.28)
Mixture Distributions

P(x) = ∑_i P(c = i) P(x | c = i). (3.29)
Figure 3.2: Gaussian mixture with three components (plotted over x1 and x2).
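A sketch of ancestral sampling from a three-component 1-D Gaussian mixture (weights, means, and standard deviations are made up): first sample the component identity c, then sample x given c:

```python
import numpy as np

rng = np.random.default_rng(0)

weights = np.array([0.3, 0.4, 0.3])  # P(c = i)
means = np.array([-2.0, 0.0, 3.0])
stds = np.array([0.5, 1.0, 0.8])

c = rng.choice(3, size=10_000, p=weights)  # sample component identities
x = rng.normal(means[c], stds[c])          # then sample x ~ P(x | c)
```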
Logistic Sigmoid

σ(x) = 1/(1 + exp(−x)). (3.30)

Figure 3.3: The logistic sigmoid function.
Commonly used to parametrize Bernoulli distributions
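A sketch of using the sigmoid to produce a valid Bernoulli parameter from an unconstrained real score z (z = 1.5 is arbitrary):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x)), always in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

z = 1.5              # unconstrained real-valued score
phi = sigmoid(z)     # a valid Bernoulli parameter
print(phi, 1 - phi)  # P(x = 1), P(x = 0)
```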
Softplus Function

ζ(x) = log(1 + exp(x)). (3.31)

Figure 3.4: The softplus function.
Bayes' Rule

P(x | y) = P(x) P(y | x) / P(y). (3.42)

Note that while P(y) appears in the formula, it is usually feasible to compute P(y) = ∑_x P(y | x) P(x), so we do not need to begin with knowledge of P(y).
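A sketch of Bayes' rule for a binary x, with made-up prior and likelihood values, computing P(y) from the other terms as noted above:

```python
import numpy as np

P_x = np.array([0.99, 0.01])           # prior P(x), illustrative
P_y_given_x = np.array([0.05, 0.90])   # P(y = 1 | x) for each state of x

P_y = np.sum(P_y_given_x * P_x)        # P(y = 1) = sum_x P(y | x) P(x)
P_x_given_y = P_x * P_y_given_x / P_y  # posterior P(x | y = 1)
print(P_x_given_y)
```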
Change of Variables

For y = g(x) with g invertible:

p_x(x) = p_y(g(x)) |∂g(x)/∂x|. (3.47)
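A sketch checking equation 3.47 on a simple invertible map (a linear g and a uniform p_y, both chosen for illustration):

```python
import numpy as np

# y = g(x) = 2x + 1 with y ~ Uniform(1, 3), so x should be Uniform(0, 1).
def p_y(y):
    return np.where((y >= 1) & (y <= 3), 0.5, 0.0)

g = lambda x: 2 * x + 1
dg_dx = 2.0

x = np.linspace(0.1, 0.9, 5)
p_x = p_y(g(x)) * abs(dg_dx)  # p_x(x) = p_y(g(x)) |dg/dx|
print(p_x)                    # all 1.0, the Uniform(0, 1) density
```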
Information Theory

Information: I(x) = −log P(x). (3.48)

Entropy: H(x) = E_{x∼P}[I(x)] = −E_{x∼P}[log P(x)]. (3.49)

KL divergence: D_KL(P‖Q) = E_{x∼P}[log(P(x)/Q(x))]. (3.50)
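A sketch of entropy and KL divergence for small discrete distributions (P and Q are illustrative; assumes Q > 0 wherever P > 0):

```python
import numpy as np

def entropy(P):
    """Shannon entropy H(P) = -sum_x P(x) log P(x), in nats."""
    P = P[P > 0]
    return -np.sum(P * np.log(P))

def kl_divergence(P, Q):
    """D_KL(P || Q), in nats; assumes Q > 0 wherever P > 0."""
    m = P > 0
    return np.sum(P[m] * np.log(P[m] / Q[m]))

P = np.array([0.5, 0.5])
Q = np.array([0.9, 0.1])
print(entropy(P))                                # log 2 ≈ 0.693
print(kl_divergence(P, Q), kl_divergence(Q, P))  # not equal: asymmetric
```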
Figure 3.5: Shannon entropy of a Bernoulli variable, in nats, as a function of the Bernoulli parameter.
Figure 3.6: The KL divergence is asymmetric. Left panel: q* = argmin_q D_KL(p‖q). Right panel: q* = argmin_q D_KL(q‖p). Each panel plots the probability densities p(x) and q*(x) against x.
Directed Model

p(a, b, c, d, e) = p(a) p(b | a) p(c | a, b) p(d | b) p(e | c). (3.54)

Figure 3.7: A directed graphical model over the variables a, b, c, d, e.
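A sketch of ancestral sampling from the factorization in equation 3.54, with made-up binary conditional probability tables:

```python
import numpy as np

rng = np.random.default_rng(0)

p_a = 0.6                     # P(a = 1)
p_b = np.array([0.2, 0.7])    # P(b = 1 | a)
p_c = np.array([[0.1, 0.5],
                [0.4, 0.9]])  # P(c = 1 | a, b)
p_d = np.array([0.3, 0.8])    # P(d = 1 | b)
p_e = np.array([0.25, 0.75])  # P(e = 1 | c)

# Sample each variable given its parents, in topological order.
a = int(rng.random() < p_a)
b = int(rng.random() < p_b[a])
c = int(rng.random() < p_c[a, b])
d = int(rng.random() < p_d[b])
e = int(rng.random() < p_e[c])
print(a, b, c, d, e)
```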
Undirected Model

p(a, b, c, d, e) = (1/Z) φ^(1)(a, b, c) φ^(2)(b, d) φ^(3)(c, e). (3.56)

Figure 3.8: An undirected graphical model over the variables a, b, c, d, e.