SLIDE 1
Information Theory
Maneesh Sahani
maneesh@gatsby.ucl.ac.uk
Gatsby Computational Neuroscience Unit, University College London. Term 1, Autumn 2010
SLIDE 2 Quantifying a Code
- How much information does a neural response carry about a stimulus?
- How efficient is a hypothetical code, given the statistical behaviour of the components?
- How much better could another code do, given the same components?
- Is the information carried by different neurons complementary, synergistic (whole is greater than the sum of parts), or redundant?
- Can further processing extract more information about a stimulus?
Information theory is the mathematical framework within which questions such as these can be framed and answered. Information theory does not directly address:
- estimation (but there are some relevant bounds)
- computation (but the "information bottleneck" might provide a motivating framework)
- representation (but redundancy reduction has obvious information-theoretic connections)
SLIDE 3 Uncertainty and Information
Information is related to the removal of uncertainty.
S → R → P(S|R)
How informative is R about S?
P(S|R) = (0, 0, 1, 0, . . . , 0) ⇒ high information?
P(S|R) = (1/M, 1/M, . . . , 1/M) ⇒ low information?
But the answer also depends on P(S). We need to start by considering the uncertainty in a probability distribution, called the entropy.
Let S ∼ P(S). The entropy is the minimum number of bits needed, on average, to specify the value S takes, assuming P(S) is known. Equivalently, it is the minimum average number of yes/no questions needed to guess S.
SLIDE 4 Entropy
- Suppose there are M equiprobable stimuli: P(s_m) = 1/M.
To specify which stimulus appears on a given trial, we would need to assign each a (binary) number. This would take
B_s ≤ log2 M + 1 bits   [since we need 2^B ≥ M]   = − log2 (1/M) + 1 bits
- Now suppose we code N such stimuli, drawn iid, at once.
B_N ≤ log2 (M^N) + 1 = −N log2 (1/M) + 1
so that, as N → ∞, the cost per stimulus B_N / N → − log2 (1/M) = − log2 p bits
This is called block coding. It is useful for extracting theoretical limits. The nervous system is unlikely to use block codes in time, but may in space.
SLIDE 5 Entropy
- Now suppose stimuli are not equiprobable. Write P(s_m) = p_m. Then
P(S1, S2, . . . , SN) = ∏_m p_m^(n_m)   [where n_m = (# of S_i = s_m)].
Now, as N → ∞, only "typical" sequences, with n_m ≈ p_m N, have non-zero probability of occurring; and they are all equally likely. This is called the Asymptotic Equipartition Property (or AEP). Thus,
B_N → − log2 ∏_m p_m^(n_m) = − Σ_m n_m log2 p_m = − Σ_m p_m N log2 p_m = −N Σ_m p_m log2 p_m
H[S] = − Σ_m p_m log2 p_m = −E[log2 P(S)], also written H[P(S)], is the entropy of the stimulus distribution.
Rather than appealing to typicality, we could instead have used the law of large numbers directly:
(1/N) log2 P(S1, S2, . . . , SN) = (1/N) log2 ∏_i P(S_i) = (1/N) Σ_i log2 P(S_i) → E[log2 P(S_i)] = −H[S]   as N → ∞
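A small numerical sketch (not part of the original slides; assumes Python with numpy and a made-up stimulus distribution) of this convergence: −(1/N) log2 P(S1, . . . , SN) for iid draws approaches H[S].

import numpy as np

rng = np.random.default_rng(0)

# A discrete stimulus distribution P(S) over M = 4 values.
p = np.array([0.5, 0.25, 0.125, 0.125])
H = -np.sum(p * np.log2(p))                      # entropy in bits

for N in [10, 100, 10_000, 1_000_000]:
    s = rng.choice(len(p), size=N, p=p)          # iid sequence S_1 .. S_N
    log2_prob = np.sum(np.log2(p[s]))            # log2 P(S_1, ..., S_N)
    print(f"N={N:>8d}  -(1/N) log2 P = {-log2_prob / N:.4f}   H[S] = {H:.4f}")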
SLIDE 6 Conditional Entropy
Entropy is a measure of "available information" in the stimulus ensemble. Now suppose we measure a particular response r which depends on the stimulus according to P(R|S).
How uncertain is the stimulus once we know r? Bayes' rule gives us
P(S|r) = P(r|S) P(S) / P(r)
so we can write
H[S|r] = − Σ_s P(s|r) log2 P(s|r)
The average uncertainty in S, for r ∼ P(R) = Σ_s P(R|s) P(s), is then
H[S|R] = − Σ_r P(r) Σ_s P(s|r) log2 P(s|r) = − Σ_{s,r} P(s, r) log2 P(s|r)
It is easy to show that:
- 1. H[S|R] ≤ H[S]
- 2. H[S|R] = H[S, R] − H[R]
- 3. H[S|R] = H[S] iff S ⊥⊥ R
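A minimal numerical check of properties 1 and 2 (not from the slides; assumes Python with numpy and an arbitrary made-up joint table P(s, r)):

import numpy as np

# An arbitrary joint distribution P(s, r) over 3 stimuli and 4 responses.
P_sr = np.array([[0.10, 0.05, 0.05, 0.00],
                 [0.05, 0.20, 0.05, 0.10],
                 [0.00, 0.05, 0.15, 0.20]])

def entropy(p):
    """Entropy in bits of a (possibly multidimensional) probability table."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

P_s = P_sr.sum(axis=1)                    # marginal P(s)
P_r = P_sr.sum(axis=0)                    # marginal P(r)
P_s_given_r = P_sr / P_r                  # P(s|r), columns indexed by r

# H[S|R] = - sum_{s,r} P(s,r) log2 P(s|r)
mask = P_sr > 0
H_S_given_R = -np.sum(P_sr[mask] * np.log2(P_s_given_r[mask]))

print("H[S]          =", entropy(P_s))
print("H[S|R]        =", H_S_given_R)                    # property 1: <= H[S]
print("H[S,R] - H[R] =", entropy(P_sr) - entropy(P_r))   # property 2: equals H[S|R]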
SLIDE 7 Average Mutual Information
A natural definition of the average information gained about S from R is
I[S; R] = H[S] − H[S|R]
which measures the reduction in uncertainty about S due to R. It follows from the definition that
I[S; R] = Σ_s P(s) log (1/P(s)) − Σ_{s,r} P(s, r) log (1/P(s|r))
= Σ_{s,r} P(s, r) log (1/P(s)) + Σ_{s,r} P(s, r) log P(s|r)
= Σ_{s,r} P(s, r) log ( P(s|r) / P(s) )
= Σ_{s,r} P(s, r) log ( P(s, r) / (P(s)P(r)) ) = I[R; S]
SLIDE 8 Average Mutual Information
The symmetry suggests a Venn-like diagram.
[Diagram: two overlapping circles, H[S] and H[R]; the overlap is I[S; R] = I[R; S], the non-overlapping parts are H[S|R] and H[R|S], and the union is H[S, R].]
All of the additive and equality relationships implied by this picture hold for two variables. Unfortunately, we will see that this does not generalise to any more than two.
SLIDE 9 Kullback-Leibler Divergence
Another useful information-theoretic quantity measures the difference between two distributions:
KL[P(S) ‖ Q(S)] = Σ_s P(s) log ( P(s) / Q(s) ) = Σ_s P(s) log ( 1 / Q(s) ) − H[P]
Excess cost in bits paid by encoding according to Q instead of P.
−KL[P ‖ Q] = Σ_s P(s) log ( Q(s) / P(s) ) ≤ log Σ_s P(s) ( Q(s) / P(s) )   [by Jensen]
= log Σ_s Q(s) = log 1 = 0
So KL[P ‖ Q] ≥ 0, with equality iff P = Q.
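A small sketch of KL as the excess coding cost (not from the slides; assumes Python with numpy and two invented distributions), checking that it is non-negative and zero only when P = Q:

import numpy as np

def kl(p, q):
    """KL[p || q] in bits; assumes q > 0 wherever p > 0."""
    m = p > 0
    return np.sum(p[m] * np.log2(p[m] / q[m]))

p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

# Cross-entropy = H[p] + KL[p||q]: average bits paid when coding P-samples with a Q-code.
H_p = -np.sum(p * np.log2(p))
cross = -np.sum(p * np.log2(q))

print("KL[p||q] =", kl(p, q))          # > 0
print("excess   =", cross - H_p)       # the same number
print("KL[p||p] =", kl(p, p))          # = 0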
SLIDE 10 Mutual Information and KL
I[S; R] = Σ_{s,r} P(s, r) log ( P(s, r) / (P(s)P(r)) ) = KL[ P(s, r) ‖ P(s)P(r) ]
Thus:
- 1. Mutual information is always non-negative
I[S; R] ≥ 0
- 2. Conditioning never increases entropy
H[S|R] ≤ H[S]
SLIDE 11
Multiple Responses
Two responses to the same stimulus, R1 and R2, may provide either more or less information jointly than independently.
I12 = I[S; R1, R2] = H[R1, R2] − H[R1, R2|S]
R1 ⊥⊥ R2 ⇒ H[R1, R2] = H[R1] + H[R2]
R1 ⊥⊥ R2 | S ⇒ H[R1, R2|S] = H[R1|S] + H[R2|S]

R1 ⊥⊥ R2    R1 ⊥⊥ R2 | S    result
   no           yes          I12 < I1 + I2   (redundant)
   yes          yes          I12 = I1 + I2   (independent)
   yes          no           I12 > I1 + I2   (synergistic)
   no           no           any of the above
I12 ≥ max(I1, I2): the second response cannot destroy information.
Thus, the Venn-like diagram with three variables is misleading.
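A numerical illustration of the synergistic case (not from the slides; Python with numpy, and an invented XOR-like code): two responses that are individually uninformative but jointly determine the stimulus.

import numpy as np

def mi(p_xy):
    """Mutual information (bits) from a joint probability table."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    m = p_xy > 0
    return np.sum(p_xy[m] * np.log2((p_xy / (px * py))[m]))

# XOR code: S and R1 are independent fair bits, R2 = S xor R1.
# Joint table P(s, r1, r2), shape (2, 2, 2).
P = np.zeros((2, 2, 2))
for s in (0, 1):
    for r1 in (0, 1):
        P[s, r1, s ^ r1] = 0.25

I1  = mi(P.sum(axis=2))                    # I[S; R1] = 0
I2  = mi(P.sum(axis=1))                    # I[S; R2] = 0
I12 = mi(P.reshape(2, 4))                  # I[S; (R1, R2)] = 1 bit

print(f"I1 = {I1:.3f}, I2 = {I2:.3f}, I12 = {I12:.3f} bits  (synergy: I12 > I1 + I2)")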
SLIDE 12
Data Processing Inequality
Suppose S → R1 → R2 form a Markov chain; that is, R2 ⊥⊥ S | R1.
Then,
P(R2, S|R1) = P(R2|R1)P(S|R1) ⇒ P(S|R1, R2) = P(S|R1)
Thus, H[S|R2] ≥ H[S|R1, R2] = H[S|R1]
⇒ I[S; R2] ≤ I[S; R1]
So any computation based on R1 that does not have separate access to S cannot add information (in the Shannon sense) about the world. Equality holds iff S → R2 → R1 as well. In this case R2 is called a sufficient statistic for S.
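A quick numerical check of the inequality (not from the slides; Python with numpy, random invented conditional tables): for any chain S → R1 → R2, I[S; R2] never exceeds I[S; R1].

import numpy as np

rng = np.random.default_rng(1)

def mi(p_xy):
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    m = p_xy > 0
    return np.sum(p_xy[m] * np.log2((p_xy / (px * py))[m]))

def random_conditional(n_in, n_out):
    """A random conditional table T[i, j] = P(out = j | in = i)."""
    t = rng.random((n_in, n_out))
    return t / t.sum(axis=1, keepdims=True)

P_s = rng.dirichlet(np.ones(5))            # P(S)
A   = random_conditional(5, 6)             # P(R1|S)
B   = random_conditional(6, 4)             # P(R2|R1): no separate access to S

P_s_r1 = P_s[:, None] * A                  # P(S, R1)
P_s_r2 = P_s_r1 @ B                        # P(S, R2), using the Markov property

print("I[S;R1] =", mi(P_s_r1))
print("I[S;R2] =", mi(P_s_r2), "  (never larger)")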
SLIDE 13 Entropy Rate
So far we have discussed S and R as single (or iid) random variables. But real stimuli and responses form a time series. Let S = {S1, S2, S3, . . .} form a stochastic process. Then
H[S1, S2, . . . , Sn] = H[Sn | S1, S2, . . . , Sn−1] + H[S1, S2, . . . , Sn−1]
= H[Sn | S1, S2, . . . , Sn−1] + H[Sn−1 | S1, S2, . . . , Sn−2] + . . . + H[S1]
The entropy rate of S is defined as
H[S] = lim_{n→∞} H[S1, S2, . . . , Sn] / n
or, equivalently for stationary processes,
H[S] = lim_{n→∞} H[Sn | S1, S2, . . . , Sn−1]
If the S_i are drawn iid from P(S), then the entropy rate equals the entropy of a single symbol, H[S_i].
If S is Markov (and stationary) then H[S] = H[Sn | Sn−1].
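A sketch for the Markov case (not from the slides; Python with numpy, invented transition matrix): compute the entropy rate from H[Sn|Sn−1] and check it against −(1/n) log2 P(path) for a long sample path.

import numpy as np

rng = np.random.default_rng(2)

# Two-state stationary Markov chain, T[i, j] = P(S_n = j | S_{n-1} = i).
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Stationary distribution pi (left eigenvector of T with eigenvalue 1).
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi = pi / pi.sum()

# Entropy rate: sum_i pi_i * H[S_n | S_{n-1} = i]
rate = -np.sum(pi[:, None] * T * np.log2(T))
print("entropy rate H[S_n|S_{n-1}] =", rate, "bits/symbol")

# Check via a long sample path: -(1/n) log2 P(s_1, ..., s_n) -> entropy rate.
n = 200_000
s = np.zeros(n, dtype=int)
s[0] = rng.choice(2, p=pi)
for t in range(1, n):
    s[t] = rng.choice(2, p=T[s[t - 1]])
log2p = np.log2(pi[s[0]]) + np.sum(np.log2(T[s[:-1], s[1:]]))
print("-(1/n) log2 P(path)         =", -log2p / n)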
SLIDE 14 Continuous Random Variables
The discussion so far has involved discrete S and R. Now let S be continuous (real-valued) with density p(s). What is its entropy? Suppose we discretise with bins of length ∆s:
H_∆[S] = − Σ_i p(s_i)∆s log ( p(s_i)∆s )
= − Σ_i p(s_i)∆s ( log p(s_i) + log ∆s )
= − Σ_i ∆s p(s_i) log p(s_i) − log ∆s Σ_i p(s_i)∆s
= − Σ_i ∆s p(s_i) log p(s_i) − log ∆s   [since Σ_i p(s_i)∆s ≈ 1]
→ − ∫ ds p(s) log p(s) − log ∆s   as ∆s → 0
We define the differential entropy:
h(S) = − ∫ ds p(s) log p(s)
Note that h(S) can be < 0, and can be ±∞.
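A numerical sketch of the relation above (not from the slides; Python with numpy, taking p(s) to be a standard Gaussian): the discretised entropy behaves as H_∆[S] ≈ h(S) − log2 ∆s, diverging as ∆s → 0, while h(S) stays finite.

import numpy as np

# Standard Gaussian: differential entropy h = (1/2) log2(2*pi*e).
h_true = 0.5 * np.log2(2 * np.pi * np.e)

def gauss(s):
    return np.exp(-0.5 * s**2) / np.sqrt(2 * np.pi)

for ds in [0.5, 0.1, 0.01]:
    s = np.arange(-10, 10, ds) + ds / 2        # bin centres
    p_bin = gauss(s) * ds                      # probability mass per bin (approx.)
    H_disc = -np.sum(p_bin * np.log2(p_bin))
    print(f"ds={ds:5.2f}  H_disc={H_disc:6.3f}   h - log2(ds)={h_true - np.log2(ds):6.3f}")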
SLIDE 15 Continuous Random Variables
We can define other information theoretic quantities similarly. The conditional differential entropy is
h(S|R) = − ∫ dr p(r) ∫ ds p(s|r) log p(s|r)
and, like the differential entropy itself, may be poorly behaved. The mutual information, however, is well-defined:
I_∆[S; R] = H_∆[S] − H_∆[S|R]
= ( − Σ_i ∆s p(s_i) log p(s_i) − log ∆s ) − ⟨ − Σ_i ∆s p(s_i|r) log p(s_i|r) − log ∆s ⟩_{P(r)}
→ h(S) − h(S|R)   (the log ∆s terms cancel as ∆s → 0)
as are other KL divergences.
SLIDE 16 Maximum Entropy Distributions
- 1. H[R1, R2] ≤ H[R1] + H[R2], with equality iff R1 ⊥⊥ R2. So, for fixed marginals, the maximum entropy joint distribution is the independent one.
- 2. Let ∫ ds p(s) f(s) = a for some function f. What distribution has maximum entropy?
Use Lagrange multipliers:
L = − ∫ ds p(s) log p(s) − λ0 ( ∫ ds p(s) − 1 ) − λ1 ( ∫ ds p(s) f(s) − a )
δL/δp(s) = −1 − log p(s) − λ0 − λ1 f(s) = 0
⇒ log p(s) = −1 − λ0 − λ1 f(s)
⇒ p(s) = (1/Z) exp( −λ1 f(s) )
The constants λ0 and λ1 can be found by solving the constraint equations. Thus,
f(s) = s ⇒ p(s) = (1/Z) exp(−λ1 s). Exponential (need p(s) = 0 for s < T).
f(s) = s² ⇒ p(s) = (1/Z) exp(−λ1 s²). Gaussian.
Both results together ⇒ maximum entropy point process (for fixed mean arrival rate) is homogeneous Poisson – independent, exponentially distributed ISIs.
SLIDE 17 Channels
We now direct our focus to the conditional P(R|S), which defines the channel linking S to R:
S —— P(R|S) ——→ R
The mutual information
I[S; R] = Σ_{s,r} P(s, r) log ( P(s, r) / (P(s)P(r)) ) = Σ_{s,r} P(s) P(r|s) log ( P(r|s) / P(r) )
depends on the marginals P(s) and P(r) = Σ_s P(r|s) P(s) as well, and thus is unsuitable to characterise the conditional alone. Instead, we characterise the channel by its capacity
C_{R|S} = sup_{P(s)} I[S; R]
Thus the capacity gives the theoretical limit on the amount of information that can be transmitted over a channel. Clearly, this is limited by the properties of the noise.
SLIDE 18 Joint source-channel coding theorem
The remarkable central result of information theory.
S —— encoder ——→ S̃ —— channel (capacity C_{R|S̃}) ——→ R —— decoder ——→ T
Any source ensemble S with entropy H[S] < C_{R|S̃} can be transmitted (in sufficiently long blocks) with P_error → 0. The proof is beyond our scope. Some of the key ideas that appear in the proof are:
- block coding
- error correction
- joint typicality
- random codes
SLIDE 19
The channel coding problem
S —— encoder ——→ S̃ —— channel (capacity C_{R|S̃}) ——→ R —— decoder ——→ T
Given channel P(R|S̃) and source P(S), find an encoding P(S̃|S) (which may be deterministic) to maximise I[S; R]. By the data processing inequality, and the definition of capacity:
I[S; R] ≤ I[S̃; R] ≤ C_{R|S̃}
By the JSCT, equality can be achieved (in the limit of increasing block size). Thus I[S̃; R] should saturate C_{R|S̃}.
See homework for an algorithm (Blahut-Arimoto) to find the P(S̃) that saturates C_{R|S̃} for a general discrete channel.
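A sketch of the standard Blahut-Arimoto alternating update (not the homework solution itself; Python with numpy, and an invented channel matrix) for the capacity of a discrete memoryless channel:

import numpy as np

def blahut_arimoto(W, n_iter=1000, tol=1e-12):
    """Capacity (bits) of a discrete memoryless channel W[x, y] = P(y|x),
    together with the capacity-achieving input distribution P(x)."""
    n_x, _ = W.shape
    p = np.full(n_x, 1.0 / n_x)                  # start from a uniform input distribution
    for _ in range(n_iter):
        q = p[:, None] * W                       # unnormalised posterior P(x|y)
        q /= q.sum(axis=0, keepdims=True)
        logq = np.zeros_like(W)
        mask = W > 0
        logq[mask] = np.log(q[mask])             # safe: q > 0 wherever W > 0 and p > 0
        p_new = np.exp(np.sum(W * logq, axis=1))
        p_new /= p_new.sum()
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    # Mutual information I[X;Y] at the final input distribution.
    pxy = p[:, None] * W
    py = pxy.sum(axis=0, keepdims=True)
    m = pxy > 0
    I = np.sum(pxy[m] * np.log2((pxy / (p[:, None] * py))[m]))
    return I, p

# Binary symmetric channel with flip probability 0.1: capacity = 1 - H2(0.1) ≈ 0.531 bits.
W = np.array([[0.9, 0.1],
              [0.1, 0.9]])
C, p_opt = blahut_arimoto(W)
print("capacity =", C, "bits;  optimal input distribution:", p_opt)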
SLIDE 20 Entropy maximisation
I[S̃; R] = H[R] (marginal entropy) − H[R|S̃] (noise entropy)
If the noise is small and "constant" ⇒ maximise the marginal entropy H[R].
- Consider a (rate coding) neuron with r ∈ [0, r_max].
h(r) = − ∫_0^{r_max} dr p(r) log p(r)
To maximise the marginal entropy, we add a Lagrange multiplier (µ) to enforce normalisation and then differentiate:
δ/δp(r) [ − ∫_0^{r_max} dr p(r) log p(r) − µ ( ∫_0^{r_max} dr p(r) − 1 ) ] = − log p(r) − 1 − µ = 0,   r ∈ [0, r_max]
⇒ p(r) = const for r ∈ [0, r_max], i.e.
p(r) = 1/r_max for r ∈ [0, r_max]
SLIDE 21 Histogram Equalisation
Suppose r = s̃ + η, where η represents a (relatively small) source of noise. Consider a deterministic encoding s̃ = f(s). How do we ensure that p(r) = 1/r_max?
1/r_max = p(r) ≈ p(s̃) = p(s) / f′(s)
⇒ f′(s) = r_max p(s)
⇒ f(s) = r_max ∫_{−∞}^{s} ds′ p(s′)
[Figure: the encoding s̃ = f(s), proportional to the cumulative distribution of s, plotted against s.]
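A sketch of this mapping (not from the slides; Python with numpy, an invented Gaussian stimulus ensemble, and an arbitrary r_max): f(s) is r_max times the cumulative distribution of s, and the transformed samples come out approximately uniform on [0, r_max].

import numpy as np

rng = np.random.default_rng(3)
r_max = 100.0                              # e.g. a maximal firing rate (arbitrary units)

# Stimulus samples from a made-up Gaussian ensemble P(s).
s = rng.normal(0.0, 1.0, size=100_000)

# Empirical version of f(s) = r_max * integral_{-inf}^{s} p(s') ds':
# rank each sample, i.e. use the empirical CDF.
f_of_s = r_max * (np.argsort(np.argsort(s)) + 0.5) / len(s)

# The encoded values should now be (approximately) uniform on [0, r_max]:
counts, _ = np.histogram(f_of_s, bins=10, range=(0, r_max))
print("counts per bin:", counts)           # all close to len(s)/10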
SLIDE 22
Histogram Equalisation
Laughlin (1981)
SLIDE 23 Gaussian channel
A similar idea of output-entropy maximisation appears in the theory of Gaussian channel coding, where it is called the water-filling algorithm. We will need the differential entropy of a (multivariate) Gaussian distribution. Let
p(Z) = |2πΣ|^(−1/2) exp( −(1/2) (Z − µ)ᵀ Σ^(−1) (Z − µ) )
then,
h(Z) = − ∫ dZ p(Z) log p(Z)
= (1/2) log |2πΣ| + (1/2) (log e) ∫ dZ p(Z) (Z − µ)ᵀ Σ^(−1) (Z − µ)
= (1/2) log |2πΣ| + (1/2) (log e) Tr[ Σ^(−1) ∫ dZ p(Z) (Z − µ)(Z − µ)ᵀ ]
= (1/2) log |2πΣ| + (1/2) (log e) Tr[ Σ^(−1) Σ ]
= (1/2) log |2πΣ| + (1/2) d (log e) = (1/2) log |2πeΣ|
SLIDE 24 Gaussian channel – white noise
[Channel diagram: S̃ (with power constraint ≤ P) → ⊕ → R, with additive noise Z ∼ N(0, k_z); that is, R = S̃ + Z.]
I[S̃; R] = h(R) − h(R|S̃) = h(R) − h(S̃ + Z | S̃) = h(R) − h(Z)
⇒ I[S̃; R] = h(R) − (1/2) log 2πe k_z
Without constraint, h(R) → ∞ and C_{R|S̃} = ∞. Therefore, constrain the input power: (1/n) Σ_i s̃_i² ≤ P.
Then,
⟨R²⟩ = ⟨(S̃ + Z)²⟩ = ⟨S̃²⟩ + 2⟨S̃ Z⟩ + ⟨Z²⟩ ≤ P + k_z   [⟨S̃ Z⟩ = 0]
⇒ h(R) ≤ h( N(0, P + k_z) ) = (1/2) log 2πe (P + k_z)
⇒ I[S̃; R] ≤ (1/2) log 2πe (P + k_z) − (1/2) log 2πe k_z = (1/2) log ( (P + k_z) / k_z )
⇒ C_{R|S̃} = (1/2) log ( 1 + P / k_z )
- The capacity is achieved iff R ∼ N(0, P + k_z) ⇒ S̃ ∼ N(0, P).
SLIDE 25 Gaussian channel – correlated noise
Now consider a vector Gaussian channel:
[Channel diagram: S̃ = (S̃1, . . . , S̃d), with power constraint (1/d) Tr⟨S̃ S̃ᵀ⟩ ≤ P, → ⊕ → R = (R1, . . . , Rd), with additive noise Z = (Z1, . . . , Zd) ∼ N(0, K_z).]
Following the same approach as before:
I[S̃; R] = h(R) − h(Z) ≤ (1/2) log [ (2πe)^d |K_s̃ + K_z| ] − (1/2) log [ (2πe)^d |K_z| ]
⇒ C_{R|S̃} is achieved when S̃ (and thus R) ∼ N, with |K_s̃ + K_z| maximised subject to (1/d) Tr[K_s̃] ≤ P.
Diagonalise K_z ⇒ the optimal K_s̃ is diagonal in the same basis.
For stationary noise (with respect to the dimension indexed by d) this can be achieved by a Fourier transform ⇒ index the diagonal elements by frequency ω:
k*_s̃(ω) = argmax ∏_ω ( k_s̃(ω) + k_z(ω) )   such that (1/d) Σ_ω k_s̃(ω) ≤ P
SLIDE 26 Water filling
Assume that the optimum is achieved at maximum input power. Then
k*_s̃(ω) = argmax [ Σ_ω log ( k_s̃(ω) + k_z(ω) ) − λ ( (1/d) Σ_ω k_s̃(ω) − P ) ]
Differentiating with respect to each k_s̃(ω):
1 / ( k*_s̃(ω) + k_z(ω) ) − λ/d = 0 ⇒ k*_s̃(ω) + k_z(ω) = ν (const.)
Together with the constraint k_s̃ ≥ 0:
k*_s̃(ω) = [ν − k_z(ω)]_+
Water filling: choose ν so that Σ_ω k*_s̃(ω) = d · P
[Figure: the noise spectrum k_z(ω), with signal power k_s̃(ω) filling the region up to the constant level ν.]
R is white or decorrelated (within the power budget) ⇒ variance equalisation.
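A sketch of the water-filling solution (not from the slides; Python with numpy, an invented noise spectrum k_z(ω) and power budget): bisect on the water level ν so that Σ_ω [ν − k_z(ω)]_+ = d·P, then allocate k_s̃(ω) = [ν − k_z(ω)]_+.

import numpy as np

# Made-up noise power spectrum over d frequency channels, and an average power budget P.
k_z = np.array([0.2, 0.5, 1.0, 2.0, 4.0, 8.0])
d = len(k_z)
P = 1.5                                        # constraint: (1/d) sum_w k_s(w) <= P

def allocated_power(nu):
    return np.maximum(nu - k_z, 0.0)

# Bisect on the water level nu so that sum_w [nu - k_z(w)]_+ = d * P.
lo, hi = 0.0, k_z.max() + d * P
for _ in range(100):
    nu = 0.5 * (lo + hi)
    if allocated_power(nu).sum() > d * P:
        hi = nu
    else:
        lo = nu

k_s = allocated_power(nu)
capacity = 0.5 * np.sum(np.log2(1.0 + k_s / k_z))      # bits per d-dimensional block
print("water level nu     =", nu)
print("signal power k_s(w) =", k_s)
print("capacity           =", capacity, "bits per block")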
SLIDE 27 Decorrelation at the retina
Atick and Redlich (1992) argued that the retina decorrelates natural spatial statistics. RGCs exhibit roughly linear (centre-surround) processing:
r_a − ⟨r_a⟩ = ∫ dx D_s(x − a) s(x)
Therefore the correlation (covariance) between cells is
Q_r(a, b) = ⟨ ∫ dx dy D_s(x − a) D_s(y − b) s(x) s(y) ⟩
= ∫ dx dy D_s(x − a) D_s(y − b) ⟨s(x) s(y)⟩
= ∫ dx dy D_s(x − a) D_s(y − b) Q_s(x, y)
Using (spatial) stationarity, we can transform to the Fourier domain:
Q_r(k) = |D̃_s(k)|² Q_s(k)
and thus output decorrelation requires
|D̃_s(k)|² ∝ 1 / Q_s(k)
SLIDE 28 Decorrelation at the retina
Spatial correlations of natural images fall off as f^−2:
Q_s(k) ∝ 1 / ( |k|² + k_0² )
and the optical filter of the eye introduces (crudely) a low-pass term ∝ exp(−α|k|). So decorrelation requires
|D̃_s(k)|² ∝ ( |k|² + k_0² ) / exp(−α|k|)
But: not all input is signal. Photodetection introduces noise. Therefore, cascade linear filters:
s + η ——[D_η]——→ ŝ ——[D_s]——→ r
with
D̃_η(k) = Q_s(k) / ( Q_s(k) + Q_η(k) )   (Wiener filter)
Thus the combined RGC filter is predicted to be:
|D̃_s(k)| D̃_η(k) ∝ √(1/Q_s(k)) · Q_s(k) / ( Q_s(k) + Q_η(k) ) = √(Q_s(k)) / ( Q_s(k) + Q_η(k) )
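A sketch of the predicted combined filter shape (not from the slides; Python with numpy, invented values for k_0 and the noise power, and ignoring the optical factor for simplicity): whitening times Wiener gives √Q_s(k) / (Q_s(k) + Q_η(k)), whose peak shifts to lower spatial frequencies as noise grows.

import numpy as np

# Spatial frequency axis and made-up spectra.
k = np.linspace(0.1, 10.0, 200)
k0 = 0.5
Q_s = 1.0 / (k**2 + k0**2)                    # natural-image spectrum, roughly 1/|k|^2

for Q_eta in [0.01, 1.0, 100.0]:              # increasing (flat) photoreceptor noise power
    D = np.sqrt(Q_s) / (Q_s + Q_eta)          # whitening filter x Wiener filter
    k_peak = k[np.argmax(D)]
    print(f"noise Q_eta={Q_eta:g}: combined filter peaks at k = {k_peak:.2f}")

print("-> the peak moves to lower spatial frequencies (toward low-pass) as noise increases")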
SLIDE 29
Decorrelation at the retina
SLIDE 30
Decorrelation at the retina
SLIDE 31 Related ideas
- efficient channel utilisation
- output entropy maximisation
- variance equalisation
- redundancy reduction
- decorrelation
- discovery of independent projections or components