Probability Theory as Extended Logic:

A short introduction into quantitative reasoning with incomplete information


SLIDE 1

Probability Theory as Extended Logic:

A short introduction into quantitative reasoning with incomplete information

Erik van Nimwegen
Division of Bioinformatics, Biozentrum, Universität Basel, and Swiss Institute of Bioinformatics

• Axiomatic derivation of probability theory.
• Bayes' theorem and posterior probabilities vs. p-values and confidence intervals.
• Model selection: inferring dependence between variables.
• Prior probabilities: symmetry transformations and the maximum entropy principle.
• Stochastic processes: generating functions and the central limit theorem.

Probability Theory as Extended Logic:

A short introduction into quantitative reasoning with incomplete information

"I cannot conceal the fact here that in the specific application of these rules, I foresee many things happening which can cause one to be badly mistaken if he does not proceed cautiously."

Jacob Bernoulli, Ars Conjectandi, Basel 1705

SLIDE 2

Probability Theory as Extended Logic.

E.T. Jaynes in 1982. Almost everything in this lecture can be found in his book. Jaynes left the book unfinished when he died in 1998. The unfinished version was available on the internet for many years (it still is). It was edited by a former student and finally published in 2003.


From logic to extended logic

Aristotelian logic is a calculus of propositions. It tells us how to deduce the truth or falsity of certain statements from the truth or falsity of other statements.

Assume: If A is true then B is true. Or in symbols: B|A. The two strong syllogisms are:

• A is true ⟹ B is true: (B|A)(A) = (B)
• B is false ⟹ A is false: (B|A)(¬B) = (¬A)

But in reality it is almost always necessary to reason like this:

• B is true ⟹ A becomes more plausible.
• A is false ⟹ B becomes less plausible.

Or even: If A is true then B becomes more plausible, and

• B is true ⟹ A becomes more plausible.

SLIDE 3

From logic to extended logic

R.T. Cox (1946):

1. Plausibilities are represented by real numbers and depend on the information we have, i.e. P(x|I) is the plausibility of x given our information I.
2. Plausibilities should match common sense: they should reduce to logic for statements that we know to be true or false, and should go up and down in accordance with common sense.
3. Consistency: if a plausibility can be derived in multiple ways, all ways should give the same answer.

The solution is unique and matches probability theory à la Laplace.

The two quantitative rules

(1) A certainly true statement has probability 1, a false statement has probability 0. The probability that a statement is true determines the probability that it is false:

$$P(A|I) + P(\neg A|I) = 1$$

(2) The probability of A and B given the information I can be written either as the probability of B given I times the probability of A given B and I, or as the probability of A given I times the probability of B given A and I:

$$P(AB|I) = P(A|BI)\,P(B|I) = P(B|AI)\,P(A|I)$$

Example: The probability that there is liquid water and life on Mars is the probability that there is liquid water times the probability of life given liquid water, or the probability of life times the probability of liquid water given life.
SLIDE 4

Assigning probabilities using symmetry

• Assume n mutually exclusive and exhaustive hypotheses A_i:

$$\sum_{i=1}^{n} P(A_i|I) = 1, \qquad P(A_i A_j|I) = 0 \quad \forall\, i \neq j$$

• Assume you know nothing else.
• Consistency now demands that:

$$P(A_i|I) = \frac{1}{n} \quad \forall i$$

Proof:

• Any relabelling of our hypotheses changes our problem into an equivalent problem. That is, the same information I applies to all.
• When the supplied information I is the same, the assignment of probabilities has to be the same.
• Unless all P(A_i|I) are equal, this will be violated.

Contrast with 'frequency' interpretation of probabilities

• In orthodox probability theory a probability is associated with a random variable and records the physical tendency for something to happen in repeated trials.

Example: The probability of "a coin coming up heads when thrown" is a feature of the coin and can be determined by repeated experiment.

SLIDE 5

Contrast with 'frequency' interpretation of probabilities

• In standard probability theory a probability is associated with a random variable and records the physical tendency for something to happen in repeated trials.

Example: The probability of "a coin coming up heads when thrown" is a feature of the coin and can be determined by repeated experiment.

• Quote from William Feller (An Introduction to Probability Theory and its Applications, 1950): "The number of possible distributions of cards in Bridge is almost 10^30. Usually we agree to consider them as equally probable. For a check of this convention more than 10^30 experiments would be required."

• Is this really how anyone reasons?

Example: Say that I tell you that I went to the store, bought a normal deck of cards, and dealt 1000 Bridge hands, making sure to shuffle well between every two deals. I found that the king and queen of hearts were always in the same hand. What would you think?

SLIDE 6

Assessing the evolutionary divergence of two genomes

A reference genome has G genes.

• A different strain of the species is isolated and we want to estimate what number g of its genes is mutated with respect to the reference genome.
• To estimate this we sequence one gene at a time from the new strain and compare it with the reference genome.

[Figure: schematic of the reference genome with wildtype and mutant genes.]

Assessing the evolutionary divergence of two genomes

After sequencing (m+w) genes we have m mutants and w wildtypes. What do we now know about the number g of all genes that are mutants?

Formalizing our information:

• We have no information on whether the two genomes are closely or distantly related, so a priori g = G is as likely as g = 0 or any other value.
• Assuming the number of mutants g is given, there is no information about which of the G genes are the mutants.

SLIDE 7

Assessing the evolutionary divergence of two genomes

Formalizing our information:

• Prior probability that g genes are mutant given our information:

$$P(g|I) = \frac{1}{G+1}$$

• Assuming g mutants, the probability that the first sequenced gene will be a mutant or wildtype:

$$P(\mu|g) = \frac{g}{G}, \qquad P(wt|g) = \frac{G-g}{G}$$

• The probabilities for the first two sequenced genes are:

$$P(\mu,\mu|g) = \frac{g(g-1)}{G(G-1)}, \quad P(\mu,wt|g) = \frac{g(G-g)}{G(G-1)}, \quad P(wt,\mu|g) = \frac{(G-g)g}{G(G-1)}, \quad P(wt,wt|g) = \frac{(G-g)(G-g-1)}{G(G-1)}$$

and so on.

Assessing the evolutionary divergence of two genomes

Generally, the probability for a particular series of mutant/wildtype observations containing m mutants and w wildtypes is given by:

$$P(m,w|g) = \frac{g(g-1)\cdots(g-m+1)\,(G-g)(G-g-1)\cdots(G-g-w+1)}{G(G-1)\cdots(G-m-w+1)}$$

or

$$P(m,w|g) = \frac{g!\,(G-g)!\,(G-m-w)!}{(g-m)!\,(G-g-w)!\,G!}$$

We now know the prior probability P(g|I) that a certain number of genes are mutants. We know the likelihood P(m,w|g) to observe a given string of observations given g. We want to know the posterior probability P(g|m,w) of g given m and w.

SLIDE 8

Assessing the evolutionary divergence of two genomes

Bayes' Theorem: We can write the joint probability of g, m, and w in terms of conditional probabilities in two ways:

$$P(m,w,g|I) = P(m,w|g)\,P(g|I)$$

and

$$P(m,w,g|I) = P(g|m,w)\,P(m,w|I).$$

Combining these we obtain:

$$P(g|m,w) = \frac{P(m,w|g)\,P(g|I)}{P(m,w|I)}$$

We also have:

$$P(m,w|I) = \sum_{g'} P(m,w,g'|I) = \sum_{g'} P(m,w|g')\,P(g'|I)$$

Putting it all together we have:

$$P(g|m,w) = \frac{P(m,w|g)\,P(g|I)}{\sum_{g'} P(m,w|g')\,P(g'|I)}$$

Assessing the evolutionary divergence of two genomes

The G-dependent factors of the likelihood cancel between numerator and denominator, so the posterior becomes:

$$P(g|m,w) = \frac{1}{Z(m,w)}\,\frac{g!\,(G-g)!}{(g-m)!\,(G-g-w)!}, \qquad Z(m,w) = \sum_{g=0}^{G}\frac{g!\,(G-g)!}{(g-m)!\,(G-g-w)!}$$

[Figure: posterior distributions P(g|m,w) for (m=0, w=0), (m=0, w=1), (m=2, w=5), and (m=23, w=50).]
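As a sanity check of this formula, here is a minimal numerical sketch (assuming Python with NumPy and SciPy; the helper name `posterior` is ours). It works with log-factorials to avoid overflowing factorials for G = 500:

```python
import numpy as np
from scipy.special import gammaln

def posterior(m, w, G):
    """Posterior P(g|m,w) over g = 0..G after observing m mutants and
    w wildtypes among sequenced genes, with a uniform prior on g."""
    g = np.arange(m, G - w + 1)          # outside this range the likelihood is 0
    # log of g!(G-g)! / ((g-m)!(G-g-w)!), the g-dependent part of P(m,w|g)
    logp = (gammaln(g + 1) - gammaln(g - m + 1)
            + gammaln(G - g + 1) - gammaln(G - g - w + 1))
    p = np.exp(logp - logp.max())        # subtract the max for numerical stability
    full = np.zeros(G + 1)
    full[g] = p / p.sum()                # normalization plays the role of Z(m,w)
    return full

print(posterior(2, 5, 500).argmax())     # posterior mode; the slides find g* = 143
```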

SLIDE 9

Assessing the evolutionary divergence of two genomes

All our information about g is encoded in the posterior distribution. For example, the probability for g to lie in a particular interval [a, b] is simply given by summing the probabilities:

$$P(a \le g \le b\,|\,m,w) = \sum_{g=a}^{b} P(g|m,w)$$

[Figure: posterior for m = 2, w = 5 with an interval [a, b] shaded.]

Assessing the evolutionary divergence of two genomes

The 95% posterior probability interval for m = 2, w = 5 follows from:

$$P(g < 42\,|\,2,5) = \sum_{g=0}^{41} P(g|2,5) \approx 0.025, \qquad P(g > 325\,|\,2,5) = \sum_{g=326}^{500} P(g|2,5) \approx 0.025$$

so g lies in [42, 325] with 95% posterior probability.

[Figure: posterior for m = 2, w = 5 with the 95% interval between 42 and 325 marked.]

How does this compare to so-called confidence intervals of orthodox statistics?
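A sketch of how this interval can be read off from the cumulative posterior (reusing the `posterior` helper above; exact boundaries may shift by one unit depending on the tail convention):

```python
import numpy as np

def central_interval(p, mass=0.95):
    """Central credible interval: cut (1-mass)/2 probability from each tail."""
    tail = (1.0 - mass) / 2.0
    cdf = np.cumsum(p)
    lo = int(np.searchsorted(cdf, tail))        # lower boundary
    hi = int(np.searchsorted(cdf, 1.0 - tail))  # upper boundary
    return lo, hi

print(central_interval(posterior(2, 5, 500)))   # ~ (42, 325), as on the slide
```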

SLIDE 10

Confidence Intervals

• In orthodox statistics one cannot talk about the probability of g.
• The probability of a particular sample is the same as we have it, for example:

$$P(\mu,\mu,wt,wt,wt\,|\,g) = \frac{g(g-1)(G-g)(G-g-1)(G-g-2)}{G(G-1)(G-2)(G-3)(G-4)}$$

• Now one focuses on a statistic s which is a function of the sample. For example, a statistic s could be the total number of mutants: s(μ,μ,wt,wt,wt) = 2.
• One then calculates the probabilities P(s) for the statistic to take on different values:

$$P(s) = \frac{g!}{s!\,(g-s)!}\;\frac{(G-g)!}{(n-s)!\,(G-g-n+s)!}\;\frac{n!\,(G-n)!}{G!}$$

with n the total number of samples.

Confidence Intervals

$$P(s) = \frac{\binom{g}{s}\binom{G-g}{n-s}}{\binom{G}{n}}$$

This is called the hypergeometric distribution. Given a fixed value of g we can calculate the probability that s takes on different values, for example for g = 175 and n = 20.

[Figure: P(s) for g = 175, n = 20, G = 500.]
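SciPy ships this distribution directly, so the sampling distribution of the statistic can be tabulated in a few lines (a sketch; `scipy.stats.hypergeom` takes the population size M, the number of "success" objects n, and the number of draws N, in that order):

```python
import numpy as np
from scipy.stats import hypergeom

G, g, n = 500, 175, 20                 # genome size, true mutants, sampled genes
s = np.arange(n + 1)
pmf = hypergeom.pmf(s, G, g, n)        # P(s|g), the hypergeometric distribution
print(s[pmf.argmax()], pmf.sum())      # most likely s (~ n*g/G = 7); pmf sums to 1
```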

SLIDE 11

Confidence Intervals

• We can now, for a given g, calculate the range of s values that is likely to occur.
• For example, we can find s_min and s_max such that

$$\sum_{s=0}^{s_{\min}} P(s) = 0.025, \qquad \sum_{s=s_{\max}}^{n} P(s) = 0.025$$

• With probability 95%, s will lie in the interval [s_min, s_max].

[Figure: P(s) for g = 175, n = 20, G = 500 with the central 95% range marked.]

Confidence Intervals

But we need an interval for g given an s, not for s given a g! Solution: for a given s, find all values of g such that s occurs within the 95% confidence interval for that g.

[Figure: P(s) for g = 175, n = 20, G = 500.]

SLIDE 12

Confidence Intervals

Solution: For a given s, find all values of g such that s occurs within the 95% confidence interval for that g. Find g_min and g_max such that:

$$\sum_{s'=s}^{n} P(s'|g_{\min}) = 0.05, \qquad \sum_{s'=0}^{s} P(s'|g_{\max}) = 0.05$$

[Figure: P(s) for g = 27, G = 500, n = 7 and for g = 328, G = 500, n = 7.]

The 95% confidence interval for g is thus [27, 328].

Assessing the evolutionary divergence of two genomes

What to do if we are forced to make one specific estimate g_est of g?

[Figure: posterior for m = 2, w = 5.]

SLIDE 13

Assessing the evolutionary divergence of two genomes

What to do if we are forced to make one specific estimate g_est of g? We could pick the g_est = g* that maximizes the posterior P(g|m,w); for m = 2, w = 5 this gives g* = 143. However, it is clear that g > 143 occurs more often than g < 143. So can't we decrease the expected "error" by choosing g_est a bit larger?

Assessing the evolutionary divergence of two genomes

What is the "error" we want to minimize?

• Absolute deviation: E(g_est, g_true) ∝ |g_est − g_true|
• Square deviation: E(g_est, g_true) ∝ (g_est − g_true)²
• All errors equally bad: E(g_est, g_true) ∝ 1 − δ_{g_est, g_true}

General solution: minimize the expected error

$$\langle E\rangle = \sum_{g=0}^{G} E(g_{est}, g)\,P(g|m,w)$$

• Absolute deviation: the median, i.e.

$$\sum_{g=0}^{g_{est}} P(g|m,w) = \sum_{g=g_{est}}^{G} P(g|m,w)$$

• Square deviation: the mean, i.e.

$$g_{est} = \sum_{g=0}^{G} g\,P(g|m,w)$$

• All errors equally bad: g_est = g*, the maximum of the posterior.
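All three estimators can be read off directly from the posterior array; a minimal sketch reusing the `posterior` helper from above:

```python
import numpy as np

p = posterior(2, 5, 500)
g = np.arange(501)
g_map    = int(p.argmax())                          # minimizes P(g_est != g)
g_median = int(np.searchsorted(np.cumsum(p), 0.5))  # minimizes <|g_est - g|>
g_mean   = float((g * p).sum())                     # minimizes <(g_est - g)^2>
print(g_map, g_median, g_mean)   # the slides report 143, 159, 166
```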

SLIDE 14

Assessing the evolutionary divergence of two genomes

What to do if we are forced to make one specific estimate g_est of g? For m = 2, w = 5: MAP g* = 143; median = 159; mean = 166.

Comparing two classes of genes

The G genes can be divided into R regulatory genes and N = G − R non-regulatory genes. What do we now know about the relative frequency of occurrence of mutants among regulatory and non-regulatory genes? Among the m mutants are m_r regulators and m_n non-regulators. Among the w wildtypes are w_r regulators and w_n non-regulators.

SLIDE 15

Comparing two classes of genes

We need the prior for g_r regulatory mutants and g_n non-regulatory mutants:

$$P(g_r,g_n|R,N) = P(g_r|R)\,P(g_n|N) = \frac{1}{(R+1)(N+1)}$$

We need the likelihood to observe the sample:

$$P(m_r,w_r,m_n,w_n|g_r,g_n,R,N) = P(m_r,w_r|g_r,R)\,P(m_n,w_n|g_n,N) = \frac{g_r!\,(R-g_r)!\,(R-m_r-w_r)!}{(g_r-m_r)!\,(R-g_r-w_r)!\,R!}\;\frac{g_n!\,(N-g_n)!\,(N-m_n-w_n)!}{(g_n-m_n)!\,(N-g_n-w_n)!\,N!}$$

So the posteriors become:

$$P(g_r|m_r,w_r,R) = \frac{1}{Z}\,\frac{g_r!\,(R-g_r)!}{(g_r-m_r)!\,(R-g_r-w_r)!}, \qquad P(g_n|m_n,w_n,N) = \frac{1}{Z'}\,\frac{g_n!\,(N-g_n)!}{(g_n-m_n)!\,(N-g_n-w_n)!}$$

with Z and Z' normalizing constants.

Comparing two classes of genes

Example: R = 100, m_r = 2, w_r = 5; N = 400, m_n = 4, w_n = 23.

[Figure: the posteriors P(g_r|2,5,100) and P(g_n|4,23,400).]

The probability that the fraction of regulatory mutants is larger than the fraction of non-regulatory mutants is:

$$P\!\left(\frac{g_r}{R} > \frac{g_n}{N}\right) = \sum_{g_r=1}^{R} P(g_r)\sum_{g_n=0}^{4g_r-1} P(g_n)$$

(here N = 4R, so the condition g_n/N < g_r/R amounts to g_n < 4g_r).

SLIDE 16

Comparing two classes of genes

Example: R = 100, m_r = 2, w_r = 5; N = 400, m_n = 4, w_n = 23. We find:

$$P\!\left(\frac{g_r}{R} > \frac{g_n}{N}\right) = 0.835$$

[Figure: joint posterior P(g_r|2,5,100)·P(g_n|4,23,400) over the (g_r, g_n) plane, with the line g_n = 4g_r marked.]

How would orthodox statistics answer this question?
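A sketch of how this double sum can be evaluated numerically, reusing the `posterior` helper from the earlier sketch (`frac_greater` is our own name):

```python
import numpy as np

def frac_greater(pr, pn, R, N):
    """P(g_r/R > g_n/N) for independent posteriors pr over g_r, pn over g_n."""
    cdf_n = np.cumsum(pn)                        # P(g_n <= x)
    total = 0.0
    for gr in range(1, R + 1):
        gn_max = int(np.ceil(gr * N / R)) - 1    # largest g_n with g_n/N < g_r/R
        total += pr[gr] * cdf_n[min(gn_max, N)]
    return total

pr = posterior(2, 5, 100)                        # regulators
pn = posterior(4, 23, 400)                       # non-regulators
print(frac_greater(pr, pn, 100, 400))            # ~0.835, as on the slide
```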

Comparing two classes of genes

Null Hypothesis: Regulators are equally likely to be mutants as non-regulatory genes. Formally: we are given (m_r + w_r) regulators and (m_n + w_n) non-regulators, and pick the (m_r + m_n) mutants at random from all (m_r + m_n + w_r + w_n) genes:

• (m_r + w_r) regulators
• (m_n + w_n) non-regulators
• (m_r + m_n) mutants
• (w_r + w_n) wild-type

Now calculate the probability to end up with m_r regulators among the mutants.

SLIDE 17

Comparing two classes of genes

Null Hypothesis: Regulators are equally likely to be mutants as non-regulator genes. With (m_r + w_r) regulators, (m_n + w_n) non-regulators, (m_r + m_n) mutants, and (w_r + w_n) wild-type, the probability under the null hypothesis to draw m_r regulators among the m_r + m_n mutants is:

$$P(m_r) = \frac{\binom{m_r+w_r}{m_r}\binom{m_n+w_n}{m_n}}{\binom{m_r+w_r+m_n+w_n}{m_r+m_n}}$$

Comparing two classes of genes

Null Hypothesis: Regulators are equally likely to be mutants as non-regulator genes.

$$P(m_r) = \frac{\binom{m_r+w_r}{m_r}\binom{m_n+w_n}{m_n}}{\binom{m_r+w_r+m_n+w_n}{m_r+m_n}}$$

We observe: m_r = 2, w_r = 5, m_n = 4, w_n = 23. Summing over all outcomes at least as extreme:

$$\sum_{k=m_r}^{m_r+m_n} P(k) = \sum_{k=2}^{6}\frac{\binom{7}{k}\binom{27}{6-k}}{\binom{34}{6}} = 0.36$$

The p-value at which the null hypothesis is rejected is thus 0.36.
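The same number falls out of SciPy's hypergeometric survival function (a sketch; `hypergeom.sf(k, M, n, N)` returns P(X > k) for a population of size M with n regulators and N drawn mutants):

```python
from scipy.stats import hypergeom

mr, wr, mn, wn = 2, 5, 4, 23
M = mr + wr + mn + wn                              # 34 genes in the sample
pval = hypergeom.sf(mr - 1, M, mr + wr, mr + mn)   # P(X >= mr) under the null
print(pval)                                        # ~0.36, the p-value above
```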

SLIDE 18

Comparing two classes of genes

Examples of posterior probabilities and p-values of the null hypothesis test:

m_r  w_r  m_n  w_n   P(g_r/R > g_n/N)   p-value
 2    5    4   23         0.84           0.36
 3    4    4   23         0.95           0.14
 4    3    4   23         0.99           0.04
 2    5    2   25         0.94           0.18
 2    5    0   27         0.99           0.04
 2    5    1   26         0.98           0.10

The p-values might seem very conservative. However, remember that the p-value is essentially only asking how plausible the data is given the null hypothesis.

Comparing two classes of genes

Model Selection: Independent versus equal fractions of mutants. We assume that either the fractions of mutants are equal or independent:

$$P(g_r,g_n|R,N) = P(g_r,g_n|R,N,\text{equal})\,P(\text{equal}) + P(g_r,g_n|R,N,\text{indep})\,P(\text{indep})$$

Prior assuming the fractions are equal:

$$P(g_r,g_n|R,N,\text{equal}) = \frac{\delta_{g_r/R,\;g_n/N}}{\min(R,N)+1}$$

Prior assuming the fractions are independent:

$$P(g_r,g_n|R,N,\text{indep}) = \frac{1}{(R+1)(N+1)}$$

We now want to calculate the posterior probability of the models given the data:

$$P(\text{indep}\,|\,m_r,w_r,m_n,w_n,R,N)$$

SLIDE 19

Comparing two classes of genes

Model Selection: Independent versus equal fractions of mutants. Bayes' theorem gives:

$$P(\text{indep}|\text{Data}) = \frac{P(\text{Data}|\text{indep})\,P(\text{indep})}{P(\text{Data}|\text{indep})\,P(\text{indep}) + P(\text{Data}|\text{equal})\,P(\text{equal})}$$

The probability of the data depends on g_r and g_n. Probability theory tells us that we can simply sum these nuisance parameters out of the problem:

$$P(\text{Data}|\text{indep}) = \sum_{g_r=0}^{R}\sum_{g_n=0}^{N} P(\text{Data},g_r,g_n|\text{indep}) = \sum_{g_r=0}^{R}\sum_{g_n=0}^{N} P(\text{Data}|g_r,g_n)\,P(g_r,g_n|\text{indep})$$

$$P(\text{Data}|\text{equal}) = \sum_{g_r=0}^{R}\sum_{g_n=0}^{N} P(\text{Data},g_r,g_n|\text{equal}) = \sum_{g_r=0}^{R}\sum_{g_n=0}^{N} P(\text{Data}|g_r,g_n)\,P(g_r,g_n|\text{equal})$$

Comparing two classes of genes

Model Selection: Independent versus equal fractions of mutants. In our case we have specifically:

$$P(m_r,w_r,m_n,w_n|\text{indep}) = P(m_r,w_r|\text{indep})\,P(m_n,w_n|\text{indep})$$

$$P(m,w|\text{indep}) = \sum_{g=0}^{N} P(m,w|g)\,P(g|\text{indep}) = \sum_{g=0}^{N}\frac{g!\,(N-g)!\,(N-m-w)!}{(g-m)!\,(N-g-w)!\,N!}\;\frac{1}{N+1} = \frac{m!\,w!}{(m+w+1)!}$$

We obtain:

$$P(\text{Data}|\text{indep}) = \frac{m_r!\,w_r!}{(m_r+w_r+1)!}\;\frac{m_n!\,w_n!}{(m_n+w_n+1)!}$$

We can obtain P(Data|equal) in a similar way by summing out g_r and g_n (although the result is not a nice analytical expression).

SLIDE 20

Comparing two classes of genes

Model Selection: Independent versus equal fractions of mutants.

• P(Data|indep): average the probability of the data over all combinations of g_r and g_n.
• P(Data|equal): average the probability of the data only over the line g_n = 4g_r (red box).

[Figure: the data likelihood over the (g_r, g_n) plane, with the line g_n = 4g_r marked.]

Comparing two classes of genes

Model Selection: Independent versus equal fractions of mutants. For the case m_r = 2, w_r = 5, R = 100, m_n = 4, w_n = 23, N = 400:

$$P(\text{Data}|\text{indep}) = 1.21\times 10^{-8}, \qquad P(\text{Data}|\text{equal}) = 2.16\times 10^{-8}$$

If we assume that "equal" and "independent" are a priori equally likely, P(indep) = P(equal) = 1/2, the posterior becomes:

$$P(\text{equal}|\text{Data}) = \frac{\tfrac{1}{2}\cdot 2.16\times 10^{-8}}{\tfrac{1}{2}\cdot 1.21\times 10^{-8} + \tfrac{1}{2}\cdot 2.16\times 10^{-8}} = 0.64$$

The "equal" model is thus slightly preferred.
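A sketch of the full computation under the stated priors (uniform over (g_r, g_n) for "indep", uniform along the line g_n = 4g_r for "equal"; `lik` is our own helper for the sampling likelihood):

```python
from math import comb

def lik(m, w, g, G):
    """P(m,w|g): probability of an ordered sample with m mutants, w wildtypes."""
    if g < m or G - g < w:
        return 0.0
    return comb(g, m) * comb(G - g, w) / (comb(G, m + w) * comb(m + w, m))

mr, wr, R = 2, 5, 100
mn, wn, N = 4, 23, 400

ev_indep = (sum(lik(mr, wr, g, R) for g in range(R + 1)) / (R + 1)
            * sum(lik(mn, wn, g, N) for g in range(N + 1)) / (N + 1))
ev_equal = sum(lik(mr, wr, g, R) * lik(mn, wn, 4 * g, N)
               for g in range(R + 1)) / (R + 1)

print(ev_indep, ev_equal)                    # ~1.21e-8 and ~2.16e-8
print(ev_equal / (ev_equal + ev_indep))      # P(equal|Data) ~ 0.64
```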

SLIDE 21

Comparing two classes of genes

Examples of posterior probabilities, p-values of the null hypothesis test, and posterior probability P(indep|Data) of the model selection:

m_r  w_r  m_n  w_n   P(g_r/R > g_n/N)   p-value   P(indep|Data)
 2    5    4   23         0.84           0.36         0.36
 3    4    4   23         0.95           0.14         0.59
 4    3    4   23         0.99           0.04         0.84
 2    5    2   25         0.94           0.18         0.50
 2    5    0   27         0.99           0.04         0.64
 2    5    1   26         0.98           0.10         0.83

Note how different the two posteriors are. Both the prior and the precise question can matter a great deal.

Splice variation in a mouse Transcription Unit

http://www.spaed.unibas.ch

SLIDE 22

Splice variation in a mouse Transcription Unit

http://www.spaed.unibas.ch

[Figure: transcripts aligned to the locus, with a cryptic exon and the four promoters (1, 2, 3, 4) marked.]

4 different promoters are used in the transcripts that could have contained the exon.

SLIDE 23

Exon inclusion dependence on promoter usage

Assume that we have the following data for a given cryptic exon:

• n transcripts in total.
• P different promoters used.
• i_p: number of times the exon is included when promoter p is used.
• e_p: number of times the exon is excluded when promoter p is used.
• i: total number of times the exon is included.
• e: total number of times the exon is excluded.

Independent model: each transcript has an independent probability f to include the exon, with a uniform prior probability for f.

Dependent model: for a transcript from promoter p the probability of including the exon is f_p, with a uniform prior probability over f_p for each p.

Exon inclusion dependence on promoter usage

Probability of the data under the independent model:

$$P(\text{data}|\text{indep}) = \int_0^1 f^i(1-f)^e\,P(f)\,df = \int_0^1 f^i(1-f)^e\,df = \frac{i!\,e!}{(i+e+1)!}$$

Probability of the data under the dependent model:

$$P(\text{data}|\text{dep}) = \prod_{p=1}^{P}\left[\int_0^1 f_p^{i_p}(1-f_p)^{e_p}\,df_p\right] = \prod_{p=1}^{P}\frac{i_p!\,e_p!}{(i_p+e_p+1)!}$$

With Bayes' theorem:

$$P(\text{dep}|\text{data}) = \frac{P(\text{data}|\text{dep})\,P(\text{dep})}{P(\text{data}|\text{dep})\,P(\text{dep}) + P(\text{data}|\text{indep})\,P(\text{indep})}$$

SLIDE 24

Exon inclusion dependence on promoter usage

Using this we estimate that between 5% and 15% of mouse internal cryptic exons are included in a promoter-dependent way.

Example of promoter-dependent exon inclusion: Ugt1a6 (UDP glucuronosyltransferase 1 family, polypeptide A6), with

$$P(\text{dep}|\text{Data}) = 0.985$$

[Figure: Ugt1a6 transcripts with multiplicities 1x, 3x, 8x, 1x, 1x, 1x, 2x.]
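A sketch of the whole dep-vs-indep computation; the Beta integral gives each evidence in closed form. The promoter counts below are made up purely for illustration:

```python
from math import factorial

def beta_evidence(i, e):
    """Integral of f^i (1-f)^e over [0,1], which equals i! e! / (i+e+1)!"""
    return factorial(i) * factorial(e) / factorial(i + e + 1)

def p_dep(counts, prior_dep=0.5):
    """counts: list of (included, excluded) pairs, one per promoter."""
    i = sum(c[0] for c in counts)
    e = sum(c[1] for c in counts)
    ev_indep = beta_evidence(i, e)            # one shared inclusion fraction f
    ev_dep = 1.0
    for ip, ep in counts:                     # one fraction f_p per promoter
        ev_dep *= beta_evidence(ip, ep)
    return ev_dep * prior_dep / (ev_dep * prior_dep + ev_indep * (1 - prior_dep))

print(p_dep([(5, 0), (0, 4), (1, 3), (0, 2)]))  # hypothetical counts -> ~0.98
```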

SLIDE 25

Prior Probabilities: Know thy ignorance

Given hypotheses H, our prior probability P(H|I) represents our information I. There are two main methods for determining such priors:

1. Invariance of our problem under a group of transformations.
2. The maximum entropy method.

Simple example, a "scale parameter":

• We want to model gene expression values e and need, before we see any data, some prior probability P(e) over expression values.
• We assume we know nothing about expression values.
• What is a reasonable prior P(e)?

Prior Probabilities: Know thy ignorance

What distribution P(e) expresses complete ignorance about expression levels e?

• You might be tempted to suggest the uniform distribution: P(e) = constant.
• Instead, think of transformations e → e' = f(e) that leave your state of knowledge unchanged.
• In this case, if we are really ignorant of e, it shouldn't matter if the gene expression was measured in units of mRNAs per cell, mRNAs per ml of solution, or light intensity as measured by some optical scanner.
• The distribution P(e) should thus be invariant under the scale on which e is measured.

SLIDE 26

Prior Probabilities: Know thy ignorance

What distribution P(e) expresses complete ignorance about expression levels e? We should have invariance under e → e' = λe for any λ. We thus demand that:

$$P(\lambda e)\,d(\lambda e) = P(e)\,de \;\Leftrightarrow\; \lambda\,P(\lambda e) = P(e)$$

Taking the derivative with respect to λ and setting λ = 1 we get:

$$P'(e) = -\frac{P(e)}{e} \;\Leftrightarrow\; P(e) = \frac{\text{constant}}{e}$$

Thus instead of a uniform distribution we find that the distribution is uniform in the logarithm of the expression level:

$$P(e)\,de = \text{constant}\; d\log(e)$$

Prior Probabilities: Know thy ignorance

Assume we obtain a very large table of expression levels; what fraction of the numbers would we a priori expect to start with digit d? For the first digit to be d = 1 the number e has to lie between 1 and 2, or 10 and 20, or 0.1 and 0.2, and so on. Notice:

$$c\int_{10}^{20}\frac{de}{e} = c\int_{1}^{2}\frac{de}{e} = c\int_{0.1}^{0.2}\frac{de}{e} = c\,\log\!\left(\frac{2}{1}\right)$$

Similarly, for the first digit to be 2:

$$c\int_{20}^{30}\frac{de}{e} = c\int_{2}^{3}\frac{de}{e} = c\int_{0.2}^{0.3}\frac{de}{e} = c\,\log\!\left(\frac{3}{2}\right)$$

So in general, the probability for the first digit to be d is:

$$P(d) = \frac{\log\!\left(\frac{d+1}{d}\right)}{\log(10)}$$

This is called Benford's law.
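Benford's law is one line of NumPy; a minimal check that the nine digit probabilities follow from the log-uniform prior and sum to one:

```python
import numpy as np

d = np.arange(1, 10)
benford = np.log10((d + 1) / d)                  # P(d) = log10((d+1)/d)
print(dict(zip(d.tolist(), benford.round(3))))   # {1: 0.301, 2: 0.176, ...}
print(benford.sum())                             # 1.0
```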

SLIDE 27

Prior Probabilities: Know thy ignorance

The actual frequencies of first digits in various collections of numbers follow Benford’s law. Carefully representing one’s ignorance can already make nontrivial predictions!

The over-expressed pathway

• A mutant strain of an organism has an altered phenotypic property X.
• We know that this is caused by overexpression of one or more genes in one of the three pathways A, B, or C.
• We measured the absolute number of mRNAs in a single cell of the mutant strain and found that there were:
• 180 mRNAs from genes in pathway A.
• 170 mRNAs from genes in pathway B.
• 40 mRNAs from genes in pathway C.
• We can only investigate one of the pathways in detail and want to guess which one is most promising to study, assuming different kinds of prior information.

SLIDE 28

The over-expressed pathway

Assume we progressively receive more information:

• Stage 1: the total number of mRNAs of each pathway in the mutant cell: A = 180, B = 170, C = 40.
• Stage 2: the average number of mRNAs of each pathway in wild-type cells: A = 50, B = 100, C = 10.
• Stage 3: the average number of mRNAs per gene of each pathway in wild-type cells: A = 75, B = 10, C = 20.

Let's take a poll to see if our common sense agrees. At each stage, which pathway seems most likely to be over-expressed in the mutant?

The over-expressed pathway

1. You only know the total mRNA in each pathway in the mutant: A = 180, B = 170, C = 40.

Although we have very little information to go by, if we are forced to guess which pathway is over-expressed, common sense says to pick the one with the highest number of mRNAs, i.e. pathway A. Can probability theory tell us this?

• A pathway is over-expressed if its mRNA count n_m is bigger than the mRNA count n in a wild-type cell in the same condition.
• Being completely ignorant, our prior for n is P(n) ∝ 1/n.
• The probability that n < n_m is:

$$P(n < n_m) = \sum_{n=1}^{n_m-1} P(n) \propto \log(n_m - 1)$$

SLIDE 29

The over-expressed pathway

1. You only know the total mRNA in each pathway in the mutant: A = 180, B = 170, C = 40.

$$P(n < n_m) = \sum_{n=1}^{n_m-1} P(n) \propto \log(n_m - 1)$$

Thus we find:

$$P(n < n_m(A)) \propto \log(179), \qquad P(n < n_m(B)) \propto \log(169), \qquad P(n < n_m(C)) \propto \log(39)$$

Indeed this says the best guess is pathway A.

2. You know the average number of mRNAs expressed in each pathway in wild-type cells: A = 50, B = 100, C = 10.

• What probability distribution P(n_A, n_B, n_C) best represents our information, namely that we only know the averages 50, 100, and 10?
• What probability distribution is as "ignorant" as possible, while satisfying the right averages?
• Can we quantify how much ignorance a distribution represents?

The answer was given by Claude Shannon in 1948.

SLIDE 30

Definition of an ignorance function

Axioms:

1. There exists a function H[P] that assigns a real number to each probability distribution, quantifying the ignorance associated with that distribution.
2. It is a continuous function of its arguments.
3. For the uniform distribution over n possibilities, the function

$$h(n) = H\!\left[\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}\right]$$

should increase with n.
4. It should be consistent in that, if it can be calculated in multiple ways, it always gives the same result. In particular, it should be additive:

$$H[p_a, p_b, p_c] = H[p_a, p_b + p_c] + (p_b + p_c)\,H\!\left[\frac{p_b}{p_b+p_c}, \frac{p_c}{p_b+p_c}\right]$$

Definition of an ignorance function

$$H[p_a, p_b, p_c] = H[p_a, p_d] + p_d\,H\!\left[\frac{p_b}{p_d}, \frac{p_c}{p_d}\right], \qquad p_d = p_b + p_c$$

The function H[p_a, p_b, p_c] measures how ignorant we are of a, b, or c being the case. Now group b and c into a category d. Our ignorance about a, b, or c being the case should equal our ignorance about a or d being the case, plus, a fraction p_d = p_b + p_c of the time, the additional ignorance about b or c given that d is the case.

SLIDE 31

Definition of an ignorance function

• Assume we have a uniform probability distribution over n hypotheses, P_i = 1/n for i = 1, 2, …, n. By definition the ignorance is h(n).
• Now divide the n possibilities into G groups, with the first g_1 hypotheses in the first group, the next g_2 in the second group, and so on. Using the consistency requirement we have:

$$h(n) = H\!\left[\frac{g_1}{n}, \frac{g_2}{n}, \ldots, \frac{g_G}{n}\right] + \sum_{i=1}^{G}\frac{g_i}{n}\,h(g_i)$$

• If we set all g_i = g = n/G equal we get h(gG) = h(G) + h(g).
• The solution to this equation is h(n) = K log(n), where K is a constant we can choose freely.

Definition of an ignorance function

• Now use the solution h(n) = log(n) and substitute it in the equation for general distributions:

$$H\!\left[\frac{g_1}{n}, \frac{g_2}{n}, \ldots, \frac{g_G}{n}\right] = h(n) - \sum_{i=1}^{G}\frac{g_i}{n}\,h(g_i) = \log(n) - \sum_{i=1}^{G}\frac{g_i}{n}\log(g_i) = -\sum_{i=1}^{G}\frac{g_i}{n}\log\!\left(\frac{g_i}{n}\right)$$

• But we can now interpret the fractions p_i = g_i/n as general probabilities.
• So finally our solution becomes:

$$H[p_1, p_2, \ldots, p_n] = -\sum_{i=1}^{n} p_i\log(p_i)$$

SLIDE 32

The Entropy of a distribution

$$H[p_1, p_2, \ldots, p_n] = -\sum_{i=1}^{n} p_i\log(p_i)$$

• Thermodynamics: Because this function has the same functional form as the entropy function of statistical physics, Shannon called it entropy.
• Yes-and-no questions: If we want to find out which hypothesis is true by asking yes/no questions, it takes on average H questions to find out.
• Optimal coding: If a large number n of samples are taken from the distribution, the shortest description of the whole sample will have size nH.

(Claude Shannon, 1948)

Entropy measures Ignorance

Back to the over-expression problem: the Maximum Entropy Principle.

• We will find the distribution P(n_A, n_B, n_C) that maximizes the entropy under the constraint that it has the correct average values.
• Any other distribution is inconsistent with our information: such a distribution would be less ignorant and would thus effectively assume things that we do not know.

SLIDE 33

The over-expressed pathway

2. You know the average number of mRNAs expressed in each pathway in wild-type cells: A = 50, B = 100, C = 10.

• We need a distribution P(n_A, n_B, n_C) that maximizes H[P] conditioned on:

$$\sum_{n_A,n_B,n_C} n_A\,P(n_A,n_B,n_C) = 50, \quad \sum_{n_A,n_B,n_C} n_B\,P(n_A,n_B,n_C) = 100, \quad \sum_{n_A,n_B,n_C} n_C\,P(n_A,n_B,n_C) = 10$$

• Thus, since we have no information relating the pathways, our solution will have independent distributions for A, B, and C:

$$P(n_A,n_B,n_C) = P_A(n_A)\,P_B(n_B)\,P_C(n_C)$$

• Performing the sums, the constraints become:

$$\sum_{n_A} n_A\,P_A(n_A) = 50, \qquad \sum_{n_B} n_B\,P_B(n_B) = 100, \qquad \sum_{n_C} n_C\,P_C(n_C) = 10$$

Our general problem thus has the form: find the distribution P(n) such that the average matches a given value,

$$\sum_n n\,P(n) = n_{av},$$

and the entropy is maximized:

$$-\sum_n P(n)\log\!\big(P(n)\big) = \text{maximal}$$

This is a variational problem that can be solved using the method of Lagrange multipliers. The solution satisfies:

$$\delta\!\left[-\sum_n P(n)\log\!\big(P(n)\big) - \lambda\sum_n n\,P(n) - \mu\sum_n P(n)\right] = 0$$

and is given by:

$$P(n) = \frac{e^{-\lambda n}}{Z}$$

with Z a normalizing constant.

SLIDE 34

The over-expressed pathway

The general form of the distribution is:

$$P(n) = \frac{e^{-\lambda n}}{Z}$$

Z is often called a partition function. Normalization requires:

$$Z = \sum_{n=0}^{\infty} e^{-\lambda n} = \frac{1}{1 - e^{-\lambda}}$$

We set λ such that the average takes on the desired value. Note that:

$$-\frac{d\log(Z)}{d\lambda} = \sum_n n\,P(n) = \langle n\rangle$$

So we can solve for λ by solving:

$$n_{av} = -\frac{d\log(Z)}{d\lambda} = \frac{e^{-\lambda}}{1 - e^{-\lambda}} \;\Leftrightarrow\; \lambda = \log\!\left[1 + \frac{1}{n_{av}}\right]$$

The over-expressed pathway

So the maximum entropy distribution given average n_av is:

$$P(n) = \frac{1}{n_{av}+1}\left(\frac{n_{av}}{n_{av}+1}\right)^{n}$$

And the solution for our case is:

$$P_A(n) = \frac{1}{51}\left(\frac{50}{51}\right)^{n}, \qquad P_B(n) = \frac{1}{101}\left(\frac{100}{101}\right)^{n}, \qquad P_C(n) = \frac{1}{11}\left(\frac{10}{11}\right)^{n}$$

Thus, the probability that a wild-type cell would have fewer mRNAs in pathway A than the 180 that the mutant has is:

$$P(n < 180) = \frac{1}{51}\sum_{n=0}^{179}\left(\frac{50}{51}\right)^{n} = 1 - \left(\frac{50}{51}\right)^{180} = 0.97$$
SLIDE 35

The over-expressed pathway

For all three pathways we have:

$$P(n_A < 180) = 1 - \left(\frac{50}{51}\right)^{180} = 0.97, \quad P(n_B < 170) = 1 - \left(\frac{100}{101}\right)^{170} = 0.82, \quad P(n_C < 40) = 1 - \left(\frac{10}{11}\right)^{40} = 0.98$$

Now pathway C looks the most promising! This is, roughly speaking, because the ratio n_mutant/n_av is the largest for this pathway, namely 4.
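These tail probabilities follow directly from the geometric form of the maximum entropy distribution; a minimal sketch:

```python
def p_less(n_mut, n_av):
    """P(n < n_mut) under the max-entropy distribution with mean n_av."""
    q = n_av / (n_av + 1.0)        # P(n) = (1-q) q^n, so P(n < m) = 1 - q^m
    return 1.0 - q ** n_mut

for name, n_mut, n_av in [("A", 180, 50), ("B", 170, 100), ("C", 40, 10)]:
    print(name, round(p_less(n_mut, n_av), 2))   # 0.97, 0.82, 0.98 as above
```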

The over-expressed pathway

3. You know the average number of mRNAs per gene in each pathway in wild-type cells: A = 75, B = 10, C = 20.

• We now break down the total expression in terms of the number of genes with different expression levels.
• Because the information about each pathway is still independent of the others, we focus on a single pathway first (say A).
• The expression states of genes in this pathway can be specified by a vector (n_1, n_2, n_3, …), meaning n_1 genes with 1 mRNA, n_2 genes with 2 mRNAs, etc.
• The total number of mRNAs in the pathway is t = Σ_i i·n_i.
• The total number of genes is n = Σ_i n_i.
• We again find the maximum entropy distribution P(n_1, n_2, …) satisfying the constraints on the averages ⟨t⟩ and ⟨n⟩.

SLIDE 36

The over-expressed pathway

The variational equation now gives:

$$\delta\!\left[-\sum P\log(P) - c\sum P - \lambda\sum t\,P - \mu\sum n\,P\right] = 0$$

where the sums run over all states (n_1, n_2, …). This can be solved to give:

$$\log P(n_1, n_2, \ldots) = C - \lambda\sum_i i\,n_i - \mu\sum_i n_i$$

The normalization constant C is again obtained from the partition function:

$$Z = \sum_{n_1}\sum_{n_2}\cdots\, e^{-\sum_i(\lambda i + \mu)\,n_i} = \prod_{i=1}^{\infty}\left[\sum_{n_i=0}^{\infty} e^{-(\lambda i + \mu)\,n_i}\right] = \prod_{i=1}^{\infty}\left[1 - e^{-(\lambda i + \mu)}\right]^{-1}$$

To fit the constraints we again take derivatives of the partition function:

$$\langle n\rangle = -\frac{d\log(Z)}{d\mu} = \sum_{i=1}^{\infty}\frac{1}{e^{\lambda i + \mu} - 1}, \qquad \langle t\rangle = -\frac{d\log(Z)}{d\lambda} = \sum_{i=1}^{\infty}\frac{i}{e^{\lambda i + \mu} - 1}$$

The over-expressed pathway

Total mRNAs: A = 50, B = 100, C = 10. mRNAs per gene: A = 75, B = 10, C = 20.

For pathway A the constraint on the total number of mRNAs gives:

$$\sum_{i=1}^{\infty}\frac{i}{e^{\lambda_A i + \mu_A} - 1} = 50$$

The average mRNA per gene gives, for the expected number of expressed genes:

$$\sum_{i=1}^{\infty}\frac{1}{e^{\lambda_A i + \mu_A} - 1} = \frac{50}{75}$$

Solving this numerically we find λ_A = 0.0134, μ_A = 4.73.
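A numerical sketch of this two-constraint fit (truncating the infinite sums at i = 2000 and keeping λ positive during the root search; `solve_lagrange` is our own name, and convergence depends on the starting point):

```python
import numpy as np
from scipy.optimize import fsolve

I = np.arange(1, 2001)                         # truncation of the sum over levels i

def moments(lam, mu):
    occ = 1.0 / (np.exp(lam * I + mu) - 1.0)   # expected n_i at expression level i
    return occ.sum(), (I * occ).sum()          # (<number of genes>, <total mRNAs>)

def solve_lagrange(total, per_gene):
    def eqs(p):
        n, t = moments(abs(p[0]), p[1])        # abs() keeps lambda positive
        return [n - total / per_gene, t - total]
    lam, mu = fsolve(eqs, x0=[1.0 / per_gene, 1.0])
    return abs(lam), mu

print(solve_lagrange(50, 75))   # ~ (0.0134, 4.73) for pathway A, as on the slide
```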

SLIDE 37

The over-expressed pathway

Total mRNAs: A = 50, B = 100, C = 10. mRNAs per gene: A = 75, B = 10, C = 20.

Similarly, for all three pathways we find the solutions:

$$\lambda_A = 0.0134,\; \mu_A = 4.73; \qquad \lambda_B = 0.085,\; \mu_B = 0.599; \qquad \lambda_C = 0.051,\; \mu_C = 3.708$$

And the general form of the distribution is:

$$P(n_1, n_2, \ldots) = \frac{1}{Z}\,e^{-\sum_{k=1}^{\infty}(\lambda k + \mu)\,n_k}$$

However, we want the distribution P(t) of the total

$$t = \sum_{k=1}^{\infty} k\,n_k.$$

For t not too small, the total is a sum of many independent contributions. As we will see in a minute, we can therefore approximate P(t) by a Gaussian distribution.

The over-expressed pathway

Notice that the distribution for the number of genes n_k that have k mRNAs expressed is independent for each k:

$$P(n_1, n_2, \ldots) = \prod_{k=1}^{\infty} P(n_k), \qquad P(n_k) = \left(1 - e^{-(\lambda k + \mu)}\right) e^{-(\lambda k + \mu)\,n_k}$$

with the total t = Σ_k k·n_k. We can therefore approximate P(t) by a Gaussian distribution:

$$P(t) \approx C\exp\!\left(-\frac{(t - \langle t\rangle)^2}{2\sigma^2}\right)$$

Using this approximation we find for the standard deviations of the distributions for A, B, and C:

$$\sigma_A = 85.8, \qquad \sigma_B = 46.0, \qquad \sigma_C = 19.3$$

SLIDE 38

The over-expressed pathway

In summary, we find the following distributions for the total number t of mRNAs in wild-type cells for each pathway:

$$P_A(t) \approx C_A\exp\!\left(-\frac{1}{2}\left(\frac{t-50}{85.8}\right)^2\right), \quad P_B(t) \approx C_B\exp\!\left(-\frac{1}{2}\left(\frac{t-100}{46.0}\right)^2\right), \quad P_C(t) \approx C_C\exp\!\left(-\frac{1}{2}\left(\frac{t-10}{19.3}\right)^2\right)$$

The probabilities for each pathway that the mutant is over-expressed are given by:

$$\int_0^{180} P_A(t)\,dt = 0.91, \qquad \int_0^{170} P_B(t)\,dt = 0.93, \qquad \int_0^{40} P_C(t)\,dt = 0.91$$

So now we have a slight preference for pathway B.
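The quoted 0.91/0.93/0.91 values are reproduced if each Gaussian is restricted to t ≥ 0 with C_X the normalization over that range, which is our reading of the constants; a sketch:

```python
from scipy.stats import norm

def p_overexpressed(t_mut, mean, sigma):
    """P(t < t_mut) for a Gaussian truncated to t >= 0 and renormalized."""
    below0 = norm.cdf(0.0, mean, sigma)
    return (norm.cdf(t_mut, mean, sigma) - below0) / (1.0 - below0)

for name, t_mut, mean, sigma in [("A", 180, 50, 85.8),
                                 ("B", 170, 100, 46.0),
                                 ("C", 40, 10, 19.3)]:
    print(name, round(p_overexpressed(t_mut, mean, sigma), 2))  # 0.91, 0.93, 0.91
```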

The over-expressed pathway

Summary:

• The entropy of a distribution quantifies the 'ignorance' associated with it.
• We can use the maximum entropy principle to represent our information in a given situation.
• If we know the averages ⟨f_i⟩ of a certain number of quantities f_i, the maximum entropy distribution takes on the form:

$$P(h) = \frac{e^{-\sum_i \lambda_i f_i(h)}}{Z}, \qquad Z = \sum_h e^{-\sum_i \lambda_i f_i(h)}$$

with Z the partition function. The constraints can be solved by taking derivatives of the partition function:

$$\langle f_i\rangle = -\frac{\partial\log(Z)}{\partial\lambda_i}$$

SLIDE 39

Stochastic processes

We can also use probability theory to describe processes that show variations which we can neither predict nor control. Probably the simplest example is a process in which certain events happen irregularly but at a certain overall rate r per unit time. Examples:

• A gene being transcribed to produce a new mRNA.
• An mRNA being degraded.
• A cell reproducing.
• A cell dying.
• A mutation occurring.
• And so on.

Stochastic processes

What is the distribution P(t|r) of the time until the next transcription of a gene if we only know the overall rate r? That is, we know that in a sufficiently large time T the number of times N that the gene is transcribed is given by r = N/T.

The solution is given by the maximum entropy distribution:

$$P(t|r)\,dt = r\,e^{-rt}\,dt$$

Notice that normally this distribution is derived by assuming a constant rate per unit time:

$$\frac{dP(t|r)}{dt} = -r\,P(t|r) \;\Leftrightarrow\; P(t|r) = r\,e^{-rt}$$

Thus probability theory tells us a constant rate is the most 'ignorant' assumption.

SLIDE 40

Stochastic processes

$$P(t|r)\,dt = r\,e^{-rt}\,dt$$

How long does it take before n transcriptions have occurred? The probability to obtain the first transcription at t_1, the second at t_2, and so on until the nth transcription at time t is:

$$P(t_1, t_2, \ldots, t_{n-1}, t\,|\,r) = r\,e^{-rt_1}\; r\,e^{-r(t_2 - t_1)}\cdots r\,e^{-r(t - t_{n-1})} = r^n e^{-rt}$$

We then integrate out the n−1 nuisance parameters:

$$P(t|n,r) = r^n e^{-rt}\int_0^t dt_{n-1}\int_0^{t_{n-1}} dt_{n-2}\cdots\int_0^{t_2} dt_1 = \frac{r^n t^{n-1}}{(n-1)!}\,e^{-rt}$$

Stochastic processes

Time until the nth transcription:

$$P(t|n,r) = \frac{r^n t^{n-1}}{(n-1)!}\,e^{-rt}$$

This is called a Gamma distribution.

[Figure: Gamma distributions for n = 2, 5, 10, 50.]

SLIDE 41

The moment generating function

For each probability distribution P(t) the associated moment-generating function G(k) is defined as:

$$G(k) = \left\langle e^{-kt}\right\rangle = \int P(t)\,e^{-kt}\,dt$$

Formally, this is a Laplace transform of the function P(t). It is called a moment generating function because of the property:

$$(-1)^n \left.\frac{d^n G(k)}{dk^n}\right|_{k=0} = \int P(t)\,t^n\,dt = \langle t^n\rangle$$

The generating function of a convolution is the product of the generating functions: if t = t_1 + t_2 + ⋯ + t_n, then

$$G_n(k) = \int e^{-kt}\,P_n(t)\,dt = \int dt_1\,dt_2\cdots dt_n\; e^{-k(t_1+t_2+\cdots+t_n)}\,P(t_1)\,P(t_2)\cdots P(t_n) = \left[G(k)\right]^n$$

The central limit theorem

Let y be the average of n values x_i, where each x_i has the same probability distribution P(x), and let G(k) be the generating function of P(x). The generating function for y is given by:

$$G'(k) = \int dx_1\,dx_2\cdots dx_n\; e^{-k(x_1+x_2+\cdots+x_n)/n}\,P(x_1)\cdots P(x_n) = \left[G\!\left(\frac{k}{n}\right)\right]^n$$

Any smooth function raised to a very large power is dominated by its behavior close to its maximum. So for very large n we can approximate:

$$\log\!\left[G\!\left(\frac{k}{n}\right)^{n}\right] \approx n\left(G'(0)\,\frac{k}{n} + \left(G''(0) - G'(0)^2\right)\frac{k^2}{2n^2}\right) = -\langle x\rangle\,k + \frac{\mathrm{var}(x)}{2n}\,k^2$$

so that

$$G\!\left(\frac{k}{n}\right)^{n} \approx e^{-\langle x\rangle k + \frac{\mathrm{var}(x)}{2n}k^2}$$
SLIDE 42

The central limit theorem

For large n we found:

$$G\!\left(\frac{k}{n}\right)^{n} \approx e^{-\langle x\rangle k + \frac{\mathrm{var}(x)}{2n}k^2}$$

The generating function of a Gaussian distribution is given by:

$$G(k) = \int dx\; e^{-kx}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}} = e^{-k\mu + \frac{k^2\sigma^2}{2}}$$

(up to normalization). We have thus established that the generating function of the average of n variables from the same distribution matches that of a Gaussian distribution with mean ⟨x⟩ and variance var(x)/n.

Conclusion: adding many independent random contributions together leads to a Gaussian distribution of the sum.
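A quick simulation illustrating the conclusion, averaging n exponential draws (the waiting-time distribution from the previous slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
y = rng.exponential(1.0, size=(100_000, n)).mean(axis=1)  # averages of n draws
print(y.mean(), y.std())   # ~1 and ~1/sqrt(50) = 0.141, as the CLT predicts
# a histogram of y is visually indistinguishable from the matching Gaussian
```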

General Summary

• Probability theory is the unique extension of logic to cases where our information is incomplete.
• A probability represents our state of knowledge.
• We assign probabilities by formalizing precisely what it is we do and do not know about the problem at hand.
• We can use symmetries (in equivalent situations we assign the same probabilities) to determine the probabilities.
• We can use the maximum entropy principle to determine which probability distributions correctly represent partial information.
• Bayes' theorem allows us to update our probabilities in light of data:

$$P(h|\text{Data}) = \frac{P(\text{Data}|h)\,P(h)}{\sum_{h'} P(\text{Data}|h')\,P(h')}$$