

SLIDE 1

Information, Learning and Falsification

David Balduzzi

December 17, 2011

Max Planck Institute for Intelligent Systems, Tübingen, Germany

SLIDES 2-6

Three main theories of information:

  • Algorithmic information. "Description". The information embedded in a single string depends on its shortest description.
  • Shannon information. "Transmission". The information transmitted by a symbol depends on the transmission probabilities of the other symbols in the ensemble.
  • Statistical learning theory. "Prediction". The information about the world embedded in a classifier (its expected error) depends on the complexity of the learning algorithm.

Can these be related?

  • Effective information. "Discrimination". The information a physical process generates when it produces an output depends on how sharply it discriminates between inputs.

SLIDE 7

Effective information

SLIDE 8

Nature decomposes into specific, bounded physical systems, which we model as deterministic functions f : X → Y, or more generally as Markov matrices p_m(y|x), where X and Y are finite sets.

SLIDE 9

Physical processes discriminate between inputs

[Figure: a thermometer, an example of a physical process that discriminates between inputs.]

SLIDE 10

Definition

The discrimination given by Markov matrix m outputting y is

$$\hat{p}_m(x \mid y) := \frac{p_m(y \mid do(x)) \cdot p_{unif}(x)}{p_m(y)},
\qquad \text{where} \quad p_m(y) := \sum_x p_m(y \mid do(x)) \cdot p_{unif}(x)$$

is the effective distribution.

Definition

Effective information is the Kullback-Leibler divergence

$$ei(m, y) := H\Big[\hat{p}_m(X \mid y) \,\Big\|\, p_{unif}(X)\Big].$$

(Balduzzi and Tononi, PLoS Computational Biology, 2008)
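To make the two definitions concrete, here is a minimal numerical sketch (my own illustration; the helper name and the toy matrix are not from the talk). Rows of the matrix are interventions do(x) on a uniformly weighted input alphabet, and the KL divergence is computed in bits.

```python
import numpy as np

def discrimination_and_ei(markov, y):
    """Discrimination p̂_m(x|y) and effective information ei(m, y) in bits.

    markov[x, j] = p_m(y=j | do(x)); inputs x are weighted uniformly,
    following the definition of the effective distribution above.
    """
    n_inputs = markov.shape[0]
    p_unif = np.full(n_inputs, 1.0 / n_inputs)            # p_unif(x)
    p_y = markov[:, y] @ p_unif                           # effective distribution p_m(y)
    p_hat = markov[:, y] * p_unif / p_y                   # Bayes: p̂_m(x | y)
    nonzero = p_hat > 0
    ei = np.sum(p_hat[nonzero] * np.log2(p_hat[nonzero] / p_unif[nonzero]))  # KL divergence
    return p_hat, ei

# Toy example: 4 inputs mapped noisily onto 2 outputs.
m = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9],
              [0.2, 0.8]])
p_hat, ei = discrimination_and_ei(m, y=1)
print(p_hat, ei)   # sharper discrimination between inputs -> larger ei
```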
SLIDE 11

Special case: deterministic f : X → Y

Definition

The discrimination given by f outputting y assigns equal probability to all elements of the pre-image f⁻¹(y).

Definition

Effective information is ei(f, y) := −log ( |f⁻¹(y)| / |X| ).
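In the deterministic case the computation reduces to counting pre-images. A tiny sketch (my own toy "thermometer", not from the slides): a map that coarse-grains 16 micro-states into 4 readings generates ei = −log₂(4/16) = 2 bits whenever it outputs a reading.

```python
from math import log2

def ei_deterministic(f, domain, y):
    """Effective information of a deterministic f: X -> Y at output y, in bits."""
    preimage_size = sum(1 for x in domain if f(x) == y)
    return -log2(preimage_size / len(domain))

# Toy thermometer: 16 micro-temperatures coarse-grained into 4 readings.
domain = range(16)
thermometer = lambda x: x // 4
print(ei_deterministic(thermometer, domain, y=2))   # -log2(4/16) = 2.0
```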

SLIDE 12

[Figure: when the thermometer produces an output, its discrimination is the set of inputs consistent with that output, and ei = −log ( size of discrimination / size of input space ).]

SLIDE 13

Algorithmic information

SLIDE 14

Definition

Given a universal prefix Turing machine T, the Kolmogorov complexity of string s is

$$K_T(s) := \min_{\{i \,:\, T(i) = s*\}} \mathrm{len}(i),$$

the length of the shortest program that generates s. For any other universal Turing machine U, there exists a constant c such that K_U(s) − c ≤ K_T(s) ≤ K_U(s) + c for all s.

SLIDE 15

Definition

Given T, the (unnormalized) Solomonoff prior probability of string s is

$$p_T(s) := \sum_{\{i \,:\, T(i) = s*\}} 2^{-\mathrm{len}(i)},$$

where the sum is over strings i that cause T to output s as a prefix, and no proper prefix of i outputs s. The Turing machine discriminates between programs according to which strings they output; the Solomonoff prior counts the programs in each class, weighted by their length.
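A minimal sketch of how this prior can be tabulated for a toy machine (my own stand-in: a finite, prefix-free dictionary from program strings to outputs, not a real universal prefix Turing machine). Each output string collects weight 2^(−len) from every program producing it.

```python
from collections import defaultdict
from math import log2

# Toy prefix-free "machine": programs (bit strings) mapped to the string they output.
toy_machine = {"0": "ab", "10": "ab", "110": "ba", "111": "abab"}

def solomonoff_prior(machine):
    """Unnormalized p_T(s): sum of 2^(-len(i)) over programs i that output s."""
    prior = defaultdict(float)
    for program, output in machine.items():
        prior[output] += 2.0 ** (-len(program))
    return dict(prior)

prior = solomonoff_prior(toy_machine)
print(prior)                 # {'ab': 0.75, 'ba': 0.125, 'abab': 0.125}
print(-log2(prior["ab"]))    # by Levin's theorem (next slide), ≈ K_T('ab') up to a constant
```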

SLIDE 16

Kolmogorov complexity = Algorithmic probability

Theorem (Levin)

For all s, −log p_T(s) = K_T(s) up to an additive constant c. Upshot: for my purposes, Solomonoff's formulation of Kolmogorov complexity is the right one: K_T(s) := −log p_T(s).

SLIDE 17

Recall that the effective distribution was the denominator when computing discriminations using Bayes' rule:

$$\hat{p}_m(x \mid y) := \frac{p_m(y \mid do(x)) \cdot p_{unif}(x)}{p_m(y)}.$$

SLIDE 18

Solomonoff prior → Effective distribution

Proposition

The effective distribution on Y induced by f is

$$p_f(y) = \sum_{\{x \,:\, f(x) = y\}} 2^{-\mathrm{len}(x)}.$$

Compare with the Solomonoff distribution:

$$p_T(s) := \sum_{\{i \,:\, T(i) = s*\}} 2^{-\mathrm{len}(i)}.$$

Compute the effective distribution by replacing the universal Turing machine T with f : X → Y, and giving inputs len(x) = log |X|, their length in the optimal code for the uniform distribution on X.

SLIDE 19

Kolmogorov Complexity → Effective information

Proposition

For a function f : X → Y, effective information equals

$$ei(f, y) = -\log p_f(y) = -\log \Big( \sum_{\{x \,:\, f(x) = y\}} 2^{-\mathrm{len}(x)} \Big).$$

Compare with Kolmogorov complexity:

$$K_T(s) = -\log p_T(s) = -\log \Big( \sum_{\{i \,:\, T(i) = s*\}} 2^{-\mathrm{len}(i)} \Big).$$

SLIDE 20

Statistical learning theory

SLIDE 21

Hypothesis space

Given unlabeled data D = (x₁, . . . , x_l) ⊂ X^l, let the hypothesis space

$$\Sigma_D = \{\sigma : D \to \{\pm 1\}\}$$

be the set of all possible labelings.

[Figure: HYPOTHESIS SPACE, sample labelings assigning +1 or −1 to each data point.]
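For a tiny dataset this space can be written out explicitly (a sketch with my own placeholder points): Σ_D is just the 2^l sign patterns on the l points.

```python
from itertools import product

D = ["x1", "x2", "x3"]                                   # l = 3 unlabeled points
sigma_D = list(product([-1, +1], repeat=len(D)))         # Sigma_D: all possible labelings
print(len(sigma_D))                                      # 2^3 = 8
print(sigma_D[:3])                                       # (-1,-1,-1), (-1,-1,+1), (-1,+1,-1)
```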

SLIDE 22

Setup

Suppose data D = (x₁, . . . , x_l) is drawn from an unknown probability distribution P_X and labeled y_i = σ(x_i) by an unknown supervisor σ ∈ Σ_X. The learning problem: find a classifier f̂ that is guaranteed to perform well on future (unseen) data sampled from P_X and labeled by σ.

SLIDES 23-25

Empirical risk minimization

Suppose we are given a class F of functions to work with. A simple algorithm for tackling the learning problem:

Algorithm: Given data labeled by σ ∈ Σ_D, find the classifier f̂ ∈ F ⊂ Σ_D that minimizes the empirical risk:

$$\hat{f} := \arg\min_{f \in F} \; \frac{1}{l} \sum_{i=1}^{l} \mathbf{1}_{f(x_i) \neq \sigma(x_i)}.$$

Key step. Reformulate the algorithm as a function between finite sets. Empirical risk minimization is the map

$$R_{F,D} : \text{HYPOTHESIS SPACE} \to \text{EMPIRICAL RISK}, \qquad
\Sigma_D \to R, \qquad
\sigma \mapsto \min_{f \in F} \frac{1}{l} \sum_{i=1}^{l} \mathbf{1}_{f(x_i) \neq \sigma(x_i)}.$$
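A sketch of R_{F,D} as a literal finite function (the threshold class F and the three data points are my own toy choices, not from the talk): for every labeling σ, record the smallest fraction of points that any f ∈ F gets wrong. The same toy setup is reused in the checks further below.

```python
from itertools import product

D = [0.1, 0.4, 0.7]                                           # l = 3 data points
# Toy function class F: threshold classifiers x -> sign(x - t).
F = [lambda x, t=t: +1 if x > t else -1 for t in (0.0, 0.25, 0.55, 1.0)]

def empirical_risk_map(F, D):
    """R_{F,D}: map each labeling sigma in Sigma_D to its minimal empirical risk."""
    l = len(D)
    return {
        sigma: min(sum(f(x) != s for x, s in zip(D, sigma)) / l for f in F)
        for sigma in product([-1, +1], repeat=l)
    }

for sigma, risk in empirical_risk_map(F, D).items():
    print(sigma, risk)          # labelings F can realize get risk 0, the rest get risk > 0
```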

SLIDE 26

[Figure: TRAINING ERROR as a function of the HYPOTHESIS SPACE. A high-capacity class F₁ fits many hypotheses, a low-capacity class F₂ fits few; the empirical risk minimizers R₁, R₂ assign errors ε₁, ε₂, ε₃ to hypotheses.]

SLIDE 27

Theorem (standard template for error bounds in SLT)

With probability 1 − δ,

  expected error of learner  ≤  historical error  +  capacity of algorithm  +  confidence term,

where a dominant historical-error term corresponds to UNDERFITTING and a dominant capacity term corresponds to OVERFITTING.
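One standard instantiation of this template is a Rademacher-complexity bound of the Bartlett-Mendelson type (shown here for orientation; the exact constants vary between references and are not taken from the talk):

```latex
% With probability at least 1 - \delta over the draw of the l training points,
% simultaneously for every f in the class F (constants vary across references):
\Pr_{(x,y)\sim P}\!\big[f(x)\neq y\big]
  \;\le\;
  \underbrace{\frac{1}{l}\sum_{i=1}^{l}\mathbf{1}_{f(x_i)\neq y_i}}_{\text{historical error}}
  \;+\;
  \underbrace{\mathrm{Rademacher}(F,D)}_{\text{capacity of algorithm}}
  \;+\;
  \underbrace{3\sqrt{\frac{\ln(2/\delta)}{2l}}}_{\text{confidence term}}
```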

SLIDE 28

Minimizing empirical risk, R_{F,D} : Σ_X → R, is a physical process. Questions:

  • Q1. What is the effective distribution ("Solomonoff prior") of the ERM?
  • Q2. What is the effective information ("Kolmogorov complexity") of its outputs?

SLIDE 29

Effective distribution → Rademacher complexity

Proposition ("Solomonoff → Rademacher")

The expectation of the ERM over the effective distribution "is" the empirical Rademacher complexity:

$$\sum_{\epsilon \in R} \epsilon \cdot p_{R_{F,D}}(\epsilon) \;=\; \frac{1}{2}\Big(1 - \mathrm{Rademacher}(F, D)\Big).$$
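A brute-force check of this identity on the toy class from the ERM sketch above (my own illustration; the empirical Rademacher complexity is computed by exact enumeration over sign patterns rather than by sampling):

```python
from itertools import product

D = [0.1, 0.4, 0.7]
F = [lambda x, t=t: +1 if x > t else -1 for t in (0.0, 0.25, 0.55, 1.0)]
l = len(D)
sigmas = list(product([-1, +1], repeat=l))        # Sigma_D; the effective prior is uniform

erm = lambda sigma: min(sum(f(x) != s for x, s in zip(D, sigma)) / l for f in F)

# Left-hand side: expectation of the ERM's output over the effective distribution.
lhs = sum(erm(sigma) for sigma in sigmas) / len(sigmas)

# Right-hand side: (1/2) * (1 - empirical Rademacher complexity of F on D).
rademacher = sum(
    max(sum(s * f(x) for x, s in zip(D, sigma)) / l for f in F) for sigma in sigmas
) / len(sigmas)
rhs = 0.5 * (1.0 - rademacher)

print(lhs, rhs)                                   # the two numbers agree
```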
slide-30
SLIDE 30

Effective information Algorithmic information Learning theory Falsification Conclusion

Effective information → VC-entropy

Proposition ("Kolmogorov → Vapnik")

The effective information generated by the ERM when it outputs 0 "is" the empirical VC-entropy:

$$ei(R_{F,D}, 0) = -\log p_{R_{F,D}}(0) = l - \text{VC-entropy}(F, D),$$

where l is the amount of training data.
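Continuing the same toy setup (my own check; logarithms are in bits and the empirical VC-entropy is taken as log₂ of the number of labelings of D that F realizes):

```python
from itertools import product
from math import log2

D = [0.1, 0.4, 0.7]
F = [lambda x, t=t: +1 if x > t else -1 for t in (0.0, 0.25, 0.55, 1.0)]
l = len(D)
sigmas = list(product([-1, +1], repeat=l))

erm = lambda sigma: min(sum(f(x) != s for x, s in zip(D, sigma)) / l for f in F)

# Effective information of the output "zero training error".
p_zero = sum(erm(sigma) == 0 for sigma in sigmas) / len(sigmas)
ei_zero = -log2(p_zero)

# Empirical VC-entropy: log2 of the number of distinct labelings F induces on D.
vc_entropy = log2(len({tuple(f(x) for x in D) for f in F}))

print(ei_zero, l - vc_entropy)        # both equal 3 - log2(4) = 1.0
```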

SLIDE 31

Corollary (reformulation of error bounds in SLT)

With probability 1 − δ,

  expected output of ERM  ≤  historical output of ERM  +  discrimination of inputs by ERM  +  confidence term.

[Figure: the ERM maps HYPOTHESES to ERRORS ε₁, ε₂, ε₃; "what the ERM outputs" plays the role of the historical term and "how the ERM discriminates inputs" plays the role of the capacity term.]

SLIDE 32

Falsification

SLIDES 33-34

Karl Popper wanted to justify scientific knowledge. He was very impressed by Einstein's bold conjecture that the Sun's gravitational field bends starlight, which, when confirmed, overthrew Newtonian physics despite an enormous body of evidence in favor of Newton.

Popper's big idea: rely on theories that have been severely tested, rather than theories supported by lots of facts. Unfortunately, Popper failed to justify his big idea.

SLIDE 35

Counting falsified hypotheses. Rademacher complexity.

$$\sum_{\epsilon \in R} \epsilon \cdot p_{R_{F,D}}(\epsilon) \;=\; \frac{1}{2}\Big(1 - \mathrm{Rademacher}(F, D)\Big)$$

$$\sum_{\epsilon \in R} p_{R_{F,D}}(\epsilon) \cdot \epsilon
\;=\; \sum_{\epsilon} \underbrace{p_{R_{F,D}}(\epsilon)}_{\substack{\text{fraction of hypotheses}\\ \text{the ERM falsifies}}} \cdot \underbrace{\epsilon}_{\substack{\text{on a fraction } \epsilon \\ \text{of the data}}}
\;=\; \text{weighted count of falsified hypotheses}.$$
SLIDE 36

Counting falsified hypotheses. VC-entropy.

$$ei(R_{F,D}, 0) = l - \text{VC-entropy}(F, D)$$

$$ei(R_{F,D}, 0)
\;=\; \underbrace{\log |\Sigma_X|}_{\text{total \# hypotheses}}
\;-\; \underbrace{\log |R_{F,D}^{-1}(0)|}_{\text{\# hypotheses the ERM fits}}
\;=\; \text{logarithmic count of falsified hypotheses}.$$
slide-37
SLIDE 37

Effective information Algorithmic information Learning theory Falsification Conclusion

Back to Popper and justifying scientific knowledge. Minimal model of Popper’s question: When can we trust generalizations based on training error? Answer: If empirical risk minimizer has small capacity

slide-38
SLIDE 38

Effective information Algorithmic information Learning theory Falsification Conclusion

Back to Popper and justifying scientific knowledge. A minimal model of Popper's question: when can we trust generalizations based on training error? Answer: if the empirical risk minimizer has small capacity. And

  ERM has small capacity  ↔  ERM falsifies many hypotheses.

SLIDE 39

Conclusion

SLIDES 40-41

Philosophy

A major theme of 20th-century mathematics was the transition from set theory (a language for talking about points, i.e. elements) to category theory (a language for talking about arrows, i.e. functions). This talk substituted thinking about sets (e.g. the function class F ⊂ Σ_X) with thinking about the structure of the arrow ERM : Σ_X → R from hypothesis space to training errors. Immediate consequences:

  1. SLT ↔ algorithmic information theory
  2. SLT ↔ falsification

SLIDE 42

Conclusion

Physical processes discriminate between inputs. Effective information is a non-universal analog of Kolmogorov complexity:

  universal Turing machine → finite function.

The information generated while minimizing empirical risk

  1. controls error bounds (SLT), and
  2. can be expressed in terms of the number of falsified hypotheses.

Conjecture: effective information generated by optimizations other than ERM also controls future performance.
SLIDE 43

Thank you!