
Probability, Entropy, and Inference

Based on David J.C. MacKay: Information Theory, Inference and Learning Algorithms, 2003, Chapter 2

Juha Raitio juha.raitio@iki.fi 5th February 2004

HUT T-61.182 Information Theory and Machine Learning


Outline

  • 1. On notation of probabilities
  • 2. Meaning of probability
  • 3. Forward and inverse probabilities
  • 4. Probabilistic inference
  • 5. Shannon information and entropy
  • 6. On convexity of functions
  • 7. Exercises


Ensembles and probabilities

  • Ensemble X is a triple (x, AX, PX), where

– x is the outcome of a random variable
– AX = {a1, a2, . . . , aI} are the possible values for x
– PX = {p1, p2, . . . , pI} are the probabilities of the outcomes, P(x = ai) = pi
– pi ≥ 0
– Σ_{ai∈AX} P(x = ai) = 1

  • P(x = ai) may be written as P(ai) or P(x)
  • Probability of a subset T of AX

P(T) = P(x ∈ T) = Σ_{ai∈T} P(x = ai)    (1)
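
A minimal Python sketch, not from the original slides, of how an ensemble and the subset probability of equation (1) can be represented; the outcomes and probability values below are illustrative assumptions.

    # Illustrative ensemble: outcomes a_i mapped to probabilities p_i (values made up).
    P_X = {"a": 0.25, "b": 0.25, "c": 0.5}

    assert all(p >= 0 for p in P_X.values())        # p_i >= 0
    assert abs(sum(P_X.values()) - 1.0) < 1e-12     # probabilities sum to one

    def prob_subset(P, T):
        """Equation (1): P(T) = P(x in T) = sum of P(x = a_i) over a_i in T."""
        return sum(P[a] for a in T if a in P)

    print(prob_subset(P_X, {"a", "c"}))   # 0.75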


Joint ensembles and marginal probabilities

  • Joint ensemble XY

– Outcome is an ordered pair x, y (or xy)
– Possible values AX = {a1, a2, . . . , aI} and AY = {b1, b2, . . . , bJ}
– Joint probability P(x, y)

  • Marginal probabilities

P(x = ai) ≡ Σ_{y∈AY} P(x = ai, y)    (2)

P(y) ≡ Σ_{x∈AX} P(x, y)    (3)
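
As an illustration of equations (2) and (3), a short Python sketch of marginalisation over a joint table; the joint probabilities are made-up values and the dictionary keyed by ordered pairs is just one convenient representation.

    # Joint ensemble XY stored over ordered pairs (x, y); the values are illustrative.
    P_XY = {("a1", "b1"): 0.25, ("a1", "b2"): 0.25,
            ("a2", "b1"): 0.125, ("a2", "b2"): 0.375}

    def marginal_x(P_joint, ai):
        """Equation (2): P(x = ai) = Σ_{y∈AY} P(x = ai, y)."""
        return sum(p for (x, y), p in P_joint.items() if x == ai)

    def marginal_y(P_joint, bj):
        """Equation (3): P(y) = Σ_{x∈AX} P(x, y)."""
        return sum(p for (x, y), p in P_joint.items() if y == bj)

    print(marginal_x(P_XY, "a1"))   # 0.5
    print(marginal_y(P_XY, "b1"))   # 0.375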


Conditioning rules

  • Conditional probability

P(x = ai|y = bj) ≡ P(x = ai, y = bj) / P(y = bj),    if P(y = bj) ≠ 0    (4)

  • Conditioning on assumptions H

– P(x = ai|H) reads “the probability that x equals ai, given H”

  • Product (chain) rule

P(x, y|H) = P(x|y, H)P(y|H) = P(y|x, H)P(x|H) (5)

  • Sum rule

P(x|H) = Σ_y P(x, y|H) = Σ_y P(x|y, H)P(y|H)    (6)
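
The conditioning and sum rules of equations (4) and (6) can be checked numerically; in the following sketch the joint distribution P(x, y) and its values are assumptions chosen only for illustration.

    # Illustrative joint distribution P(x, y); the numbers are made up for the example.
    P_XY = {("a1", "b1"): 0.25, ("a1", "b2"): 0.25,
            ("a2", "b1"): 0.125, ("a2", "b2"): 0.375}

    def P_y(bj):
        return sum(p for (x, y), p in P_XY.items() if y == bj)

    def P_x_given_y(ai, bj):
        """Equation (4): P(x = ai | y = bj) = P(x = ai, y = bj) / P(y = bj)."""
        return P_XY[(ai, bj)] / P_y(bj)

    # Sum rule (6): P(x = a1) = Σ_y P(x = a1 | y) P(y)
    print(sum(P_x_given_y("a1", bj) * P_y(bj) for bj in ("b1", "b2")))   # ≈ 0.5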


Bayes theorem and independence

  • Bayes theorem

P(y|x, H) = P(x|y, H)P(y|H) / P(x|H)    (7)
          = P(x|y, H)P(y|H) / Σ_{y′} P(x|y′, H)P(y′|H)    (8)

  • Two random variables X and Y are independent (X ⊥ Y ) if and only if

P(x, y) = P(x)P(y) (9)
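
A short Python sketch of Bayes' theorem, equations (7) and (8), applied to a made-up prior and likelihood; the variable names and numbers are illustrative assumptions, not part of the slides.

    # Illustrative prior P(y|H) and likelihood P(x = x1 | y, H); the numbers are made up.
    prior = {"y0": 0.9, "y1": 0.1}
    likelihood = {"y0": 0.2, "y1": 0.8}   # P(x = x1 | y, H)

    # Bayes' theorem, equations (7)-(8): posterior over y after observing x = x1.
    evidence = sum(likelihood[y] * prior[y] for y in prior)             # P(x = x1 | H)
    posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}
    print(posterior)   # y0 ≈ 0.692, y1 ≈ 0.308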


Two meanings for probability

  • Frequentist view of probability

– Probabilities are frequencies of outcomes in random experiments
– Probabilities describe random variables

  • Bayesian view of probability

– Probabilities are degrees of belief in propositions
– Probabilities describe assumptions, and inferences given those assumptions
– Subjective interpretation of probability: “you cannot do inference without making assumptions”


Forward and inverse probabilities

  • Assume a generative model describing a process that gives rise to some data
  • Forward probability

– The task is to compute the probability distribution of some quantity that depends on the data

  • Inverse probability

– The task is to compute the probability distribution of unobserved variables given the data
– Requires the use of Bayes’ theorem


Inference with inverse probabilities

  • Inference on parameters θ given data D and hypothesis H by Bayes’ theorem

P(θ|D, H) = P(D|θ, H)P(θ|H) / P(D|H),    (10)

where
– P(θ|H) is the prior probability for the parameters
– P(D|θ, H) is the likelihood of the parameters given the data
– P(D|H) is the evidence
– P(θ|D, H) is the posterior probability for the parameters

  • In words

posterior = (likelihood × prior) / evidence    (11)
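
As a concrete sketch of equations (10) and (11), the following computes a posterior over a parameter θ on a grid for a hypothetical bent-coin example; the flat prior, the grid, and the data (3 heads in 10 flips) are assumptions, not taken from the slides.

    import numpy as np

    # Hypothetical bent-coin example: infer the bias theta after observing
    # D = 3 heads in 10 flips (these data and the flat prior are assumptions).
    theta = np.linspace(0.01, 0.99, 99)          # grid over the parameter
    prior = np.ones_like(theta) / len(theta)     # flat prior P(theta | H)
    likelihood = theta**3 * (1 - theta)**7       # P(D | theta, H), up to a constant

    evidence = np.sum(likelihood * prior)        # P(D | H)
    posterior = likelihood * prior / evidence    # equation (10)

    print(theta[np.argmax(posterior)])           # posterior mode near 0.3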


Shannon information and entropy

  • Shannon information content of an outcome x = ai (bits)

h(x = ai) = log2 (1 / P(x = ai))    (12)

  • Entropy of an ensemble X (bits)

H(X) ≡ Σ_{x∈AX} P(x) log2 (1 / P(x))    (13)

  • Joint entropy of X, Y

H(X, Y) ≡ Σ_{xy∈AXAY} P(x, y) log2 (1 / P(x, y))    (14)
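
A small Python sketch of equations (12) to (14); the example distribution is an assumption, and the joint entropy is obtained by applying the same function to a distribution keyed by pairs.

    from math import log2

    def info_content(p):
        """Shannon information of an outcome with probability p, equation (12), in bits."""
        return log2(1 / p)

    def entropy(P):
        """Entropy of an ensemble given as {outcome: probability}, equation (13), in bits."""
        return sum(p * log2(1 / p) for p in P.values() if p > 0)

    P_X = {"a": 0.5, "b": 0.25, "c": 0.25}   # illustrative distribution
    print(info_content(P_X["b"]))            # 2.0 bits
    print(entropy(P_X))                      # 1.5 bits

    # Joint entropy, equation (14): the same function applied to a distribution
    # whose keys are pairs (x, y).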


Decomposability of entropy

  • Entropy of probability distribution p = {p1, p2, . . . , pI}

H(p) = H(p1, 1 − p1) + (1 − p1) H(p2/(1 − p1), p3/(1 − p1), . . . , pI/(1 − p1))    (15)
  • More generally

H(p) = H[(p1 + p2 + . . . + pm), (pm+1 + pm+2 + . . . + pI)]
       + (p1 + . . . + pm) H(p1/(p1 + . . . + pm), . . . , pm/(p1 + . . . + pm))
       + (pm+1 + . . . + pI) H(pm+1/(pm+1 + . . . + pI), . . . , pI/(pm+1 + . . . + pI))    (16)
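
A quick numerical check of the decomposition in equation (15); the distribution below is an illustrative assumption.

    from math import log2

    def H(probs):
        """Entropy of a list of probabilities, in bits."""
        return sum(p * log2(1 / p) for p in probs if p > 0)

    p = [0.5, 0.25, 0.125, 0.125]   # illustrative distribution
    lhs = H(p)

    # Equation (15): split off p1, then weight the entropy of the renormalised remainder.
    p1, rest = p[0], p[1:]
    rhs = H([p1, 1 - p1]) + (1 - p1) * H([q / (1 - p1) for q in rest])

    print(lhs, rhs)   # both 1.75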


Relative entropy

  • Kullback-Leibler divergence between P(x) and Q(x) over alphabet AX

DKL(P‖Q) = Σ_x P(x) log (P(x) / Q(x))    (17)

  • Properties of relative entropy

– Gibbs’ inequality: DKL(P‖Q) ≥ 0, with DKL(P‖Q) = 0 if and only if P = Q
– In general DKL(P‖Q) ≠ DKL(Q‖P)
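
A sketch of equation (17) with a numerical look at Gibbs' inequality and the asymmetry of DKL; the example distributions are assumptions, and the base-2 logarithm (bits) is used for consistency with the entropy definitions above.

    from math import log2

    def D_KL(P, Q):
        """Relative entropy D_KL(P || Q), equation (17), in bits (base-2 logarithm).
        Assumes Q(x) > 0 wherever P(x) > 0."""
        return sum(P[x] * log2(P[x] / Q[x]) for x in P if P[x] > 0)

    P = {"a": 0.5, "b": 0.5}   # illustrative distributions
    Q = {"a": 0.9, "b": 0.1}
    print(D_KL(P, Q), D_KL(Q, P))   # both non-negative, and generally unequal
    print(D_KL(P, P))               # 0.0, the equality case of Gibbs' inequality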


Convex and concave functions

  • f(x) is convex over (a, b), if for all x1, x2 ∈ (a, b) and 0 ≤ λ ≤ 1

f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2) (18)

  • f(x) is concave if the above holds with the inequality reversed
  • f(x) is strictly convex (strictly concave) if equality in (18) holds only for λ = 0 and λ = 1

  • Jensen’s inequality for convex function f(x) of random variable x

E[f(x)] ≥ f(E[x]),    where E denotes expectation    (19)

  • If f(x) is convex (concave) and ∇f(x) = 0, then f has its minimum (maximum) value at x
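
A small numerical illustration of Jensen's inequality (19) for the convex function f(x) = x²; the discrete distribution is an assumption chosen only for the example.

    # Check E[f(x)] >= f(E[x]) for the convex function f(x) = x**2.
    outcomes = [(-1.0, 0.25), (0.0, 0.5), (2.0, 0.25)]   # (value, probability), made up

    E_x = sum(x * p for x, p in outcomes)
    E_fx = sum(x**2 * p for x, p in outcomes)

    print(E_fx, E_x**2)   # 1.25 >= 0.0625, consistent with Jensen's inequality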


Problems

  • 1. A circular coin of diameter a is thrown onto a square grid whose squares are b × b (a < b). What is the probability that the coin will lie entirely within one square? (MacKay exercise 2.31)

  • 2. The inhabitants of an island tell the truth one third of the time. They lie with probability 2/3. On an occasion, after one of them made a statement, you ask another ’was the statement true?’ and he says ’yes’. What is the probability that the statement was indeed true? (MacKay exercise 2.37)
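
One way to sanity-check an analytic answer to problem 1 is a Monte Carlo sketch; the particular values of a and b and the trial count below are assumptions, and by symmetry it is enough to drop the coin centre uniformly in a single square.

    import random

    def coin_in_square(a, b, trials=200_000):
        """Estimate the probability that a coin of diameter a, dropped at random
        on a grid of b x b squares (a < b), lies entirely within one square.
        By symmetry, drop the coin centre uniformly in one square and check that
        it stays at least a/2 away from every edge."""
        hits = 0
        for _ in range(trials):
            x, y = random.uniform(0, b), random.uniform(0, b)
            if a / 2 <= x <= b - a / 2 and a / 2 <= y <= b - a / 2:
                hits += 1
        return hits / trials

    print(coin_in_square(a=1.0, b=2.0))   # compare with your analytic answer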
