

SLIDE 1

Probability and Information Theory

Lecture slides for Chapter 3 of Deep Learning www.deeplearningbook.org Ian Goodfellow 2016-09-26

SLIDE 2

Probability Mass Function

  • The domain of $P$ must be the set of all possible states of $\mathrm{x}$.
  • $\forall x \in \mathrm{x},\ 0 \le P(x) \le 1$. An impossible event has probability 0, and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring.
  • $\sum_{x \in \mathrm{x}} P(x) = 1$. We refer to this property as being normalized. Without this property, we could obtain probabilities greater than one by computing the probability of one of many events occurring.

Example: uniform distribution: $P(\mathrm{x} = x_i) = \frac{1}{k}$.
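As a quick sanity check, here is a minimal NumPy sketch (the choice of `k` is illustrative, not from the slides) verifying that the uniform PMF satisfies the properties above:

```python
import numpy as np

k = 6                       # number of discrete states (e.g., a fair die)
pmf = np.full(k, 1.0 / k)   # uniform PMF: P(x = x_i) = 1/k

assert np.all((pmf >= 0) & (pmf <= 1))  # 0 <= P(x) <= 1 for every state
assert np.isclose(pmf.sum(), 1.0)       # normalized: sum_x P(x) = 1
```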

SLIDE 3

Probability Density Function

  • The domain of p must be the set of all possible states of x.
  • $\forall x \in \mathrm{x},\ p(x) \ge 0$. Note that we do not require $p(x) \le 1$.
  • $\int p(x)\,dx = 1$.

Example: uniform distribution: $u(x; a, b) = \frac{1}{b - a}$ for $x \in [a, b]$ and $0$ elsewhere, so there is no probability mass outside the interval and the density integrates to 1.
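A minimal numeric check of that last claim, assuming SciPy; the endpoints `a` and `b` are illustrative values:

```python
from scipy.integrate import quad

a, b = 2.0, 5.0  # illustrative interval endpoints

def u(x):
    """Uniform density u(x; a, b): 1/(b - a) on [a, b], 0 elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

# Integrate over a range covering [a, b]; pass the discontinuities as break points.
total, _ = quad(u, -10, 10, points=[a, b])
print(total)  # ~1.0: the density integrates to 1
```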

SLIDE 4

Computing Marginal Probability with the Sum Rule

$$\forall x \in \mathrm{x},\ P(\mathrm{x} = x) = \sum_y P(\mathrm{x} = x, \mathrm{y} = y). \tag{3.3}$$

$$p(x) = \int p(x, y)\,dy. \tag{3.4}$$
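A sketch of the discrete sum rule (3.3), assuming NumPy; the joint table `P_xy` is an illustrative made-up distribution:

```python
import numpy as np

# Illustrative joint distribution P(x, y): rows index x, columns index y.
P_xy = np.array([[0.1, 0.2],
                 [0.3, 0.4]])

P_x = P_xy.sum(axis=1)  # sum rule: P(x) = sum_y P(x, y)
print(P_x)              # [0.3, 0.7]
```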

SLIDE 5

Conditional Probability

$$P(\mathrm{y} = y \mid \mathrm{x} = x) = \frac{P(\mathrm{y} = y, \mathrm{x} = x)}{P(\mathrm{x} = x)}. \tag{3.5}$$
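Applying eq. 3.5 to the same kind of illustrative joint table used above (again a made-up distribution, not from the slides):

```python
import numpy as np

# Illustrative joint table P(x, y); rows index x, columns index y.
P_xy = np.array([[0.1, 0.2],
                 [0.3, 0.4]])

# Eq. 3.5: P(y | x = 0) = P(x = 0, y) / P(x = 0).
P_y_given_x0 = P_xy[0] / P_xy[0].sum()
print(P_y_given_x0)  # [0.333..., 0.666...]; sums to 1 as a distribution over y
```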

SLIDE 6

Chain Rule of Probability

$$P(\mathrm{x}^{(1)}, \ldots, \mathrm{x}^{(n)}) = P(\mathrm{x}^{(1)}) \prod_{i=2}^{n} P(\mathrm{x}^{(i)} \mid \mathrm{x}^{(1)}, \ldots, \mathrm{x}^{(i-1)}). \tag{3.6}$$

SLIDE 7

Independence

$$\forall x \in \mathrm{x}, y \in \mathrm{y},\ p(\mathrm{x} = x, \mathrm{y} = y) = p(\mathrm{x} = x)\,p(\mathrm{y} = y). \tag{3.7}$$

SLIDE 8

Conditional Independence

$$\forall x \in \mathrm{x}, y \in \mathrm{y}, z \in \mathrm{z},\ p(\mathrm{x} = x, \mathrm{y} = y \mid \mathrm{z} = z) = p(\mathrm{x} = x \mid \mathrm{z} = z)\,p(\mathrm{y} = y \mid \mathrm{z} = z). \tag{3.8}$$

We can denote independence and conditional independence with compact notation: $\mathrm{x} \perp \mathrm{y}$ and $\mathrm{x} \perp \mathrm{y} \mid \mathrm{z}$, respectively.

SLIDE 9

Expectation

$$\mathbb{E}_{\mathrm{x} \sim P}[f(x)] = \sum_x P(x)\,f(x), \tag{3.9}$$

$$\mathbb{E}_{\mathrm{x} \sim p}[f(x)] = \int p(x)\,f(x)\,dx. \tag{3.10}$$

Linearity of expectations:

$$\mathbb{E}_{\mathrm{x}}[\alpha f(x) + \beta g(x)] = \alpha\,\mathbb{E}_{\mathrm{x}}[f(x)] + \beta\,\mathbb{E}_{\mathrm{x}}[g(x)]. \tag{3.11}$$
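A Monte Carlo sketch of eq. 3.11, assuming NumPy; the distribution, the functions `f` and `g`, and the coefficients are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)   # samples x ~ p (standard normal, illustrative)

f, g = np.square, np.abs       # illustrative functions f and g
alpha, beta = 2.0, -3.0

lhs = np.mean(alpha * f(x) + beta * g(x))           # E[alpha f(x) + beta g(x)]
rhs = alpha * np.mean(f(x)) + beta * np.mean(g(x))  # alpha E[f(x)] + beta E[g(x)]
print(np.isclose(lhs, rhs))    # True: expectation is linear
```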

SLIDE 10

Variance and Covariance

$$\mathrm{Var}(f(x)) = \mathbb{E}\left[\left(f(x) - \mathbb{E}[f(x)]\right)^2\right]. \tag{3.12}$$

$$\mathrm{Cov}(f(x), g(y)) = \mathbb{E}\left[\left(f(x) - \mathbb{E}[f(x)]\right)\left(g(y) - \mathbb{E}[g(y)]\right)\right]. \tag{3.13}$$

Covariance matrix:

$$\mathrm{Cov}(\mathbf{x})_{i,j} = \mathrm{Cov}(\mathrm{x}_i, \mathrm{x}_j). \tag{3.14}$$

The diagonal elements of the covariance give the variance:

$$\mathrm{Cov}(\mathrm{x}_i, \mathrm{x}_i) = \mathrm{Var}(\mathrm{x}_i). \tag{3.15}$$
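A minimal sketch of eqs. 3.14–3.15 using NumPy's sample covariance; the two correlated variables are an illustrative construction:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two correlated variables (illustrative): y = x + noise.
x = rng.normal(size=100_000)
y = x + 0.5 * rng.normal(size=100_000)

C = np.cov(np.stack([x, y]))  # 2x2 covariance matrix, eq. 3.14
print(C)                      # diagonal entries are Var(x) and Var(y), eq. 3.15
```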

SLIDE 11

Bernoulli Distribution

$$P(\mathrm{x} = 1) = \phi \tag{3.16}$$
$$P(\mathrm{x} = 0) = 1 - \phi \tag{3.17}$$
$$P(\mathrm{x} = x) = \phi^x (1 - \phi)^{1 - x} \tag{3.18}$$
$$\mathbb{E}_{\mathrm{x}}[\mathrm{x}] = \phi \tag{3.19}$$
$$\mathrm{Var}_{\mathrm{x}}(\mathrm{x}) = \phi(1 - \phi) \tag{3.20}$$
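A quick empirical check of eqs. 3.19–3.20, assuming NumPy; the parameter value is illustrative:

```python
import numpy as np

phi = 0.3  # illustrative Bernoulli parameter
rng = np.random.default_rng(0)
x = rng.binomial(1, phi, size=100_000)  # Bernoulli(phi) samples

print(x.mean())  # ~phi            (eq. 3.19)
print(x.var())   # ~phi * (1-phi)  (eq. 3.20)
```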

SLIDE 12

Gaussian Distribution

Parametrized by variance:

$$\mathcal{N}(x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}}\,\exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right). \tag{3.21}$$

See figure 3.1 for a plot of the density function.

Parametrized by precision:

$$\mathcal{N}(x; \mu, \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}}\,\exp\left(-\frac{1}{2}\beta(x - \mu)^2\right). \tag{3.22}$$
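A sketch confirming that the two parametrizations agree when $\beta = 1/\sigma^2$, assuming NumPy; the function names and test values are illustrative:

```python
import numpy as np

def gauss_var(x, mu, sigma2):
    """Eq. 3.21: Gaussian density parametrized by variance."""
    return np.sqrt(1.0 / (2 * np.pi * sigma2)) * np.exp(-(x - mu) ** 2 / (2 * sigma2))

def gauss_prec(x, mu, beta):
    """Eq. 3.22: the same density parametrized by precision beta = 1/sigma^2."""
    return np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * (x - mu) ** 2)

x = np.linspace(-3, 3, 7)
print(np.allclose(gauss_var(x, 0.0, 2.0), gauss_prec(x, 0.0, 0.5)))  # True
```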

SLIDE 13

Gaussian Distribution

Figure 3.1: The Gaussian density $p(x)$. Maximum at $x = \mu$; inflection points at $x = \mu \pm \sigma$.

SLIDE 14

Multivariate Gaussian

Parametrized by covariance matrix:

$$\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sqrt{\frac{1}{(2\pi)^n \det(\boldsymbol{\Sigma})}}\,\exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right). \tag{3.23}$$

Parametrized by precision matrix:

$$\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\beta}^{-1}) = \sqrt{\frac{\det(\boldsymbol{\beta})}{(2\pi)^n}}\,\exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\beta} (\mathbf{x} - \boldsymbol{\mu})\right). \tag{3.24}$$
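A sketch evaluating eq. 3.23 directly and comparing against SciPy's implementation; the mean, covariance, and test point are illustrative values:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])         # illustrative mean
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])    # illustrative covariance matrix

x = np.array([0.5, 0.5])
n = len(mu)

# Eq. 3.23 evaluated directly...
diff = x - mu
direct = np.sqrt(1.0 / ((2 * np.pi) ** n * np.linalg.det(Sigma))) * \
         np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# ...matches the library density.
print(np.isclose(direct, multivariate_normal(mu, Sigma).pdf(x)))  # True
```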

SLIDE 15

More Distributions

Exponential:

$$p(x; \lambda) = \lambda\,\mathbf{1}_{x \ge 0}\,\exp(-\lambda x). \tag{3.25}$$

The exponential distribution uses the indicator function $\mathbf{1}_{x \ge 0}$ to assign probability zero to all negative values of $x$.

Laplace:

$$\mathrm{Laplace}(x; \mu, \gamma) = \frac{1}{2\gamma}\,\exp\left(-\frac{|x - \mu|}{\gamma}\right). \tag{3.26}$$

Dirac:

$$p(x) = \delta(x - \mu). \tag{3.27}$$

SLIDE 16

Empirical Distribution

$$\hat{p}(\mathbf{x}) = \frac{1}{m} \sum_{i=1}^{m} \delta(\mathbf{x} - \mathbf{x}^{(i)}) \tag{3.28}$$

SLIDE 17

Mixture Distributions

$$P(\mathrm{x}) = \sum_i P(\mathrm{c} = i)\,P(\mathrm{x} \mid \mathrm{c} = i) \tag{3.29}$$

Figure 3.2: Samples from a Gaussian mixture with three components (axes $x_1$, $x_2$).
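A sketch of ancestral sampling from a mixture per eq. 3.29, assuming NumPy; the weights, means, and standard deviations are an illustrative 1-D mixture, not the one in figure 3.2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3-component Gaussian mixture in 1-D.
weights = np.array([0.5, 0.3, 0.2])   # P(c = i)
means = np.array([-2.0, 0.0, 3.0])
stds = np.array([0.5, 1.0, 0.8])

# Ancestral sampling: draw the component identity c, then x | c.
c = rng.choice(3, size=10_000, p=weights)
x = rng.normal(means[c], stds[c])
print(x[:5])
```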

SLIDE 18

Logistic Sigmoid

$$\sigma(x) = \frac{1}{1 + \exp(-x)} \tag{3.30}$$

Commonly used to parametrize Bernoulli distributions.

Figure 3.3: The logistic sigmoid function.

SLIDE 19

Softplus Function

$$\zeta(x) = \log\left(1 + \exp(x)\right) \tag{3.31}$$

Figure 3.4: The softplus function.
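A minimal sketch of both functions, assuming NumPy, checking two standard identities: $\zeta(x) - \zeta(-x) = x$ and $\sigma(-x) = 1 - \sigma(x)$.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, eq. 3.30."""
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    """Softplus, eq. 3.31: a smooth version of max(0, x)."""
    return np.log1p(np.exp(x))

x = np.linspace(-5, 5, 11)
print(np.allclose(softplus(x) - softplus(-x), x))  # zeta(x) - zeta(-x) = x
print(np.allclose(sigmoid(-x), 1.0 - sigmoid(x)))  # sigma(-x) = 1 - sigma(x)
```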

SLIDE 20

Bayes’ Rule

$$P(\mathrm{x} \mid \mathrm{y}) = \frac{P(\mathrm{x})\,P(\mathrm{y} \mid \mathrm{x})}{P(\mathrm{y})}. \tag{3.42}$$

Though $P(\mathrm{y})$ appears in the formula, it is usually feasible to compute it as $P(\mathrm{y}) = \sum_x P(\mathrm{y} \mid x)\,P(x)$.
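A worked sketch of eq. 3.42 on a classic diagnostic-test setup; all the numbers here are hypothetical, chosen only to show the mechanics:

```python
# Hypothetical diagnostic-test numbers (illustrative, not from the slides).
P_x = 0.01             # prior P(x): prevalence of a condition
P_y_given_x = 0.95     # likelihood P(y | x): test positive given condition
P_y_given_not_x = 0.05 # false-positive rate

# Compute P(y) by marginalizing over x, then apply Bayes' rule.
P_y = P_y_given_x * P_x + P_y_given_not_x * (1 - P_x)
P_x_given_y = P_x * P_y_given_x / P_y
print(P_x_given_y)  # ~0.161: the posterior is far below the test's accuracy
```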

SLIDE 21

Change of Variables

$$p_x(\mathbf{x}) = p_y(g(\mathbf{x}))\,\left|\det\left(\frac{\partial g(\mathbf{x})}{\partial \mathbf{x}}\right)\right|. \tag{3.47}$$
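A scalar sketch of eq. 3.47, assuming SciPy: with $\mathrm{y} = g(\mathrm{x}) = 2\mathrm{x}$ and $\mathrm{x} \sim \mathcal{N}(0, 1)$, $p_y$ must be the $\mathcal{N}(0, 4)$ density. The transform and distributions are illustrative choices:

```python
import numpy as np
from scipy.stats import norm

def g(x):
    return 2.0 * x  # illustrative change of variables y = g(x) = 2x

dg_dx = 2.0         # |dg/dx|, constant for this linear map

x = np.linspace(-3, 3, 13)
lhs = norm(0, 1).pdf(x)             # p_x(x), standard normal
rhs = norm(0, 2).pdf(g(x)) * dg_dx  # p_y(g(x)) |dg/dx|; note scale = std = 2
print(np.allclose(lhs, rhs))        # True: eq. 3.47 holds
```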

SLIDE 22

Information Theory

Information:

$$I(x) = -\log P(x). \tag{3.48}$$

Entropy:

$$H(\mathrm{x}) = \mathbb{E}_{\mathrm{x} \sim P}[I(x)] = -\mathbb{E}_{\mathrm{x} \sim P}[\log P(x)]. \tag{3.49}$$

KL divergence:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \mathbb{E}_{\mathrm{x} \sim P}\left[\log \frac{P(x)}{Q(x)}\right] = \mathbb{E}_{\mathrm{x} \sim P}\left[\log P(x) - \log Q(x)\right]. \tag{3.50}$$
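A sketch computing eqs. 3.49 and 3.50 for two illustrative discrete distributions, assuming NumPy; it also previews the asymmetry shown on the next slide:

```python
import numpy as np

# Two illustrative distributions over 3 states.
P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.3, 0.4, 0.3])

H_P = -np.sum(P * np.log(P))                  # entropy of P in nats, eq. 3.49
kl_PQ = np.sum(P * (np.log(P) - np.log(Q)))   # D_KL(P || Q), eq. 3.50
kl_QP = np.sum(Q * (np.log(Q) - np.log(P)))   # D_KL(Q || P)
print(H_P, kl_PQ, kl_QP)                      # KL is asymmetric: kl_PQ != kl_QP
```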

SLIDE 23

Entropy of a Bernoulli Variable

Figure 3.5: Shannon entropy (in nats) of a Bernoulli variable as a function of the Bernoulli parameter.

SLIDE 24

The KL Divergence is Asymmetric

Figure 3.6: Fitting an approximation $q^\ast(x)$ to a density $p(x)$. Left panel: $q^\ast = \operatorname{argmin}_q D_{\mathrm{KL}}(p \,\|\, q)$. Right panel: $q^\ast = \operatorname{argmin}_q D_{\mathrm{KL}}(q \,\|\, p)$. Axes: $x$ versus probability density.

SLIDE 25

Directed Model

$$p(a, b, c, d, e) = p(a)\,p(b \mid a)\,p(c \mid a, b)\,p(d \mid b)\,p(e \mid c). \tag{3.54}$$

Figure 3.7: A directed graphical model over random variables a, b, c, d, and e.

SLIDE 26

Undirected Model

$$p(a, b, c, d, e) = \frac{1}{Z}\,\phi^{(1)}(a, b, c)\,\phi^{(2)}(b, d)\,\phi^{(3)}(c, e). \tag{3.56}$$

Figure 3.8: An undirected graphical model over random variables a, b, c, d, and e.