Information, Learning and Falsification
David Balduzzi
December 17, 2011
Max Planck Institute for Intelligent Systems, Tübingen, Germany
Three main theories of information:

- Algorithmic information. "Description." The information embedded in a single string depends on its shortest description.
- Shannon information. "Transmission." The information transmitted by symbols depends on the transmission probabilities of other symbols in an ensemble.
- Statistical learning theory. "Prediction." The information about the world embedded in a classifier (its expected error) depends on the complexity of the learning algorithm.

Can these be related?

- Effective information. "Discrimination." The information produced by a physical process when it produces an output depends on how sharply it discriminates between inputs.
Effective information
Nature decomposes into specific, bounded physical systems, which we model as deterministic functions f : X → Y, where X and Y are finite sets.
Definition
The discrimination given by Markov matrix m outputting y is the posterior distribution

  p̂_m(x | y) := p_m(y | x) · p(x) / p_m(y),  where p_m(y) := Σ_x p_m(y | x) · p(x)

and p is the uniform distribution on X.

Definition
Effective information is the Kullback-Leibler divergence

  ei(m, y) := H[ p̂_m(X | y) ∥ p_unif(X) ].
Definition
The discrimination given by deterministic f outputting y assigns equal probability to all elements of the pre-image f⁻¹(y).

Definition
Effective information is

  ei(f, y) := − log ( |f⁻¹(y)| / |X| ).
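A minimal Python sketch of these two definitions; the lookup table below is an invented toy function, not an example from the talk:

```python
import math

# Toy deterministic process f : X -> Y as a lookup table.
f = {0: "a", 1: "a", 2: "a", 3: "b"}   # X = {0, 1, 2, 3}, Y = {"a", "b"}

def discrimination(f, y):
    """Uniform distribution over the pre-image f^{-1}(y)."""
    preimage = [x for x in f if f[x] == y]
    return {x: 1 / len(preimage) for x in preimage}

def effective_information(f, y):
    """ei(f, y) = -log2(|f^{-1}(y)| / |X|), measured in bits."""
    preimage_size = sum(1 for x in f if f[x] == y)
    return -math.log2(preimage_size / len(f))

print(discrimination(f, "b"))         # {3: 1.0}: output "b" pins down the input
print(effective_information(f, "b"))  # 2.0 bits: sharp discrimination
print(effective_information(f, "a"))  # ~0.42 bits: "a" barely discriminates
```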
[Figure: thermometer example: the discrimination over thermometer inputs when the thermometer produces a given output.]
Algorithmic information
Definition
Given a universal prefix Turing machine T, the Kolmogorov complexity of string s is

  K_T(s) := min_{i : T(i) = s•} len(i),

the length of the shortest program that generates s.

For any universal Turing machine U ≠ T, there exists a constant c such that K_U(s) − c ≤ K_T(s) ≤ K_U(s) + c for all s.
Definition
Given T, the (unnormalized) Solomonoff prior probability of string s is

  p_T(s) := Σ_i 2^(−len(i)),

where the sum is over strings i that cause T to output s as a prefix, and no proper prefix of i outputs s.

The Turing machine discriminates between programs according to which strings they output; the Solomonoff prior counts the programs in each class (weighted by their lengths).
Theorem (Levin)
For all s,

  − log p_T(s) = K_T(s)

up to an additive constant c.

Upshot: for my purposes, Solomonoff's formulation of Kolmogorov complexity is the right one: K_T(s) := − log p_T(s).
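A toy numerical illustration (the five-program lookup table below is invented; Levin's theorem itself requires a genuine universal prefix machine, so here −log p only tracks K approximately):

```python
import math

# Invented toy "prefix machine": a prefix-free set of programs with outputs.
machine = {"0": "a", "10": "a", "110": "b", "1110": "a", "1111": "b"}

def prior(s):
    """p_T(s) = sum of 2^{-len(i)} over programs i that output s."""
    return sum(2 ** -len(i) for i, out in machine.items() if out == s)

def K(s):
    """K_T(s) = length of the shortest program that outputs s."""
    return min(len(i) for i, out in machine.items() if out == s)

for s in ("a", "b"):
    print(s, round(-math.log2(prior(s)), 2), K(s))
# a: -log2(13/16) ≈ 0.30 vs K = 1;  b: -log2(3/16) ≈ 2.42 vs K = 3.
# -log2 p(s) follows K(s) to within a small additive constant.
```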
Recall that the effective distribution appeared as the denominator when computing discriminations via Bayes' rule:

  p̂_m(x | y) = p_m(y | x) · p(x) / p_m(y).
Proposition
The effective distribution on Y induced by f is

  p_f(y) = Σ_{x ∈ f⁻¹(y)} 2^(−len(x)).

Compare with the Solomonoff distribution:

  p_T(s) := Σ_i 2^(−len(i)).

Compute the effective distribution by replacing the universal Turing machine T with f : X → Y, and giving inputs length len(x) = log |X|, their length in the optimal code for the uniform distribution on X.
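Spelling the substitution out (a one-line check using only the definitions above):

```latex
p_f(y) = \sum_{x \in f^{-1}(y)} 2^{-\mathrm{len}(x)}
       = \sum_{x \in f^{-1}(y)} 2^{-\log|X|}
       = \frac{|f^{-1}(y)|}{|X|},
\qquad \text{hence} \qquad
\operatorname{ei}(f, y) = -\log p_f(y).
```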
Proposition
For a function f : X → Y, effective information equals

  ei(f, y) = − log p_f(y) = − log Σ_{x ∈ f⁻¹(y)} 2^(−len(x)).

Compare with Kolmogorov complexity:

  K_T(s) = − log p_T(s) = − log Σ_i 2^(−len(i)).
Learning theory
Given unlabeled data D = (x1, . . . , xl) ∈ X^l, let the hypothesis space Σ_D be the set of all ±1 labelings of D.

[Figure: HYPOTHESIS SPACE: the data points with candidate +1/−1 labelings.]
Suppose data D = (x1, . . . , xl) is drawn from an unknown probability distribution P_X and labeled y_i = σ(x_i) by an unknown supervisor σ ∈ Σ_X. The learning problem: find a classifier f̂ guaranteed to perform well on future (unseen) data sampled from P_X and labeled by σ.
Suppose we are given a class F of functions to work with. A simple algorithm for tackling the learning problem:

Algorithm: Given data labeled by σ ∈ Σ_D, find the classifier f̂ ∈ F ⊂ Σ_D that minimizes empirical risk:

  f̂ := argmin_{f ∈ F} (1/l) Σ_{i=1}^{l} 1[f(x_i) ≠ σ(x_i)].

Key step. Reformulate the algorithm as a function between finite sets. Empirical risk minimization:

  R_{F,D} : HYPOTHESIS SPACE → EMPIRICAL RISK
  R_{F,D} : Σ_D → R,  σ ↦ min_{f ∈ F} (1/l) Σ_{i=1}^{l} 1[f(x_i) ≠ σ(x_i)].
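A minimal Python sketch of R_{F,D} as a literal function between finite sets. The data points and the threshold class below are invented toy choices, not from the talk; they only serve to make the map enumerable:

```python
from itertools import product

# Toy unlabeled data D: l = 4 points on the real line (invented example).
D = [0.5, 1.5, 2.5, 3.5]
l = len(D)

# Toy hypothesis class F: threshold classifiers x -> +1 iff x > t.
F = [lambda x, t=t: +1 if x > t else -1 for t in [0, 1, 2, 3, 4]]

def erm(sigma):
    """R_{F,D}: send a labeling sigma in Sigma_D to its minimal empirical risk."""
    return min(sum(f(x) != y for x, y in zip(D, sigma)) / l for f in F)

# Sigma_D = all 2^l labelings of D; tabulate the finite function Sigma_D -> R.
for sigma in product([-1, +1], repeat=l):
    print(sigma, erm(sigma))
```

Counting how many labelings land on each error value reads off exactly the effective distribution of the ERM used below.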
[Figure: the EMPIRICAL RISK MINIMIZER maps the HYPOTHESIS SPACE to TRAINING ERRORS ε1, ε2, ε3; a HIGH-CAPACITY class F1 fits many hypotheses, a LOW-CAPACITY class F2 fits few.]
Theorem (standard template for error bounds in SLT)
With probability 1 − δ,

  expected error ≤ training error of the algorithm + capacity term,

where the capacity term guards against OVERFITTING.
Minimizing empirical risk, R_{F,D} : Σ_D → R, is a physical process. Questions:

- What is the effective distribution ("Solomonoff prior") of the ERM?
- What is the effective information ("Kolmogorov complexity") of its outputs?
[Figure: the ERM sends hypotheses to training errors ε1, ε2, ε3 in R.]

Proposition ("Solomonoff → Rademacher")
The expectation of the ERM over the effective distribution "is" empirical Rademacher complexity:

  Σ_ε ε · p_{R_{F,D}}(ε) = 1/2 − (1/2) · R̂(F, D).
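Since Σ_D is finite, the identity can be checked by exact enumeration. A sketch using the same invented toy data and threshold class as the ERM example above (the 1/2 − (1/2)·R̂ form follows from rewriting the 0-1 loss as (1 − σ_i f(x_i))/2):

```python
from itertools import product

D = [0.5, 1.5, 2.5, 3.5]                                # toy data, as above
l = len(D)
F = [lambda x, t=t: +1 if x > t else -1 for t in [0, 1, 2, 3, 4]]

def erm(sigma):
    return min(sum(f(x) != y for x, y in zip(D, sigma)) / l for f in F)

labelings = list(product([-1, +1], repeat=l))

# LHS: expectation of the ERM's output over the effective distribution,
# i.e. over labelings sigma drawn uniformly from Sigma_D.
lhs = sum(erm(sigma) for sigma in labelings) / len(labelings)

# RHS: 1/2 - (1/2) * empirical Rademacher complexity of F on D.
rademacher = sum(
    max(sum(s * f(x) for s, x in zip(sigma, D)) / l for f in F)
    for sigma in labelings
) / len(labelings)

print(lhs, 0.5 - 0.5 * rademacher)  # the two values coincide exactly
```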
[Figure: the ERM sends hypotheses to training errors ε1, ε2, ε3 in R.]

Proposition ("Kolmogorov → Vapnik")
The effective information generated by the ERM when it outputs 0 "is" empirical VC-entropy:

  ei(R_{F,D}, 0) = − log p_{R_{F,D}}(0) = l − VC-entropy(F, D),

where l is the amount of training data.
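The same toy setup checks this identity too; here VC-entropy(F, D) is taken to be log2 of the number of distinct labelings F realizes on D, matching the counting that reappears on the falsification slides below:

```python
import math
from itertools import product

D = [0.5, 1.5, 2.5, 3.5]                                # toy data, as above
l = len(D)
F = [lambda x, t=t: +1 if x > t else -1 for t in [0, 1, 2, 3, 4]]

def erm(sigma):
    return min(sum(f(x) != y for x, y in zip(D, sigma)) / l for f in F)

labelings = list(product([-1, +1], repeat=l))

# p_{R_{F,D}}(0): fraction of labelings the class fits with zero training error.
p0 = sum(1 for sigma in labelings if erm(sigma) == 0) / len(labelings)
ei_0 = -math.log2(p0)

# VC-entropy(F, D) = log2 #{distinct labelings realized by F on D}.
realized = {tuple(f(x) for x in D) for f in F}
vc_entropy = math.log2(len(realized))

print(ei_0, l - vc_entropy)  # both equal 4 - log2(5) ≈ 1.678 here
```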
Corollary (reformulation of error bounds in SLT)
With probability 1 − δ,

  expected error ≤ (what the ERM outputs) + (how sharply the ERM discriminates inputs) + confidence term.

[Figure: the ERM maps HYPOTHESES to ERRORS ε1, ε3; the training error is what the ERM outputs, the capacity term reflects how the ERM discriminates inputs.]
Falsification
Karl Popper wanted to justify scientific knowledge. He was very impressed by Einstein's bold conjecture about the Sun's gravitational field bending starlight, which, when proved correct, overthrew Newtonian physics despite an enormous body of evidence in favor of Newton.

Popper's big idea: rely on theories that have been severely tested, rather than theories supported by lots of facts. Unfortunately, Popper failed to justify his big idea.
Recall the first proposition:

  Σ_ε p_{R_{F,D}}(ε) · ε = 1/2 − (1/2) · R̂(F, D).

[Figure: training errors ε1, ε2 in R.]
Recall the second proposition:

  ei(R_{F,D}, 0) = l − VC-entropy(F, D).

[Figure: training errors ε1, ε2, ε3 in R.]

Unpacking the counting:

  ei(R_{F,D}, 0) = log |Σ_D| − log |R_{F,D}⁻¹(0)|
                 = log(total # hypotheses) − log(# hypotheses with zero training error)
                 = # hypotheses falsified by the data (in bits).
Back to Popper and justifying scientific knowledge.

Minimal model of Popper's question: when can we trust generalizations based on training error?
Answer: if the empirical risk minimizer has small capacity.

ERM has small capacity ↔ ERM falsifies many hypotheses.
Conclusion
A major theme of 20th-century mathematics was the transition from set theory (a language for talking about points = elements) to category theory (a language for talking about arrows = functions).

This talk replaced thinking about sets (e.g. the function class F ⊂ Σ_X) with thinking about the structure of the arrow ERM : Σ_X → R from hypothesis space to training errors. Immediate consequences:

1. SLT ↔ algorithmic information theory
2. SLT ↔ falsification
- Physical processes discriminate between inputs.
- Effective information is a non-universal analog of Kolmogorov complexity: universal Turing machine → finite function.
- The information generated while minimizing empirical risk (1) controls error bounds (SLT) and (2) can be restated in terms of the number of falsified hypotheses.
- Conjecture: effective information generated by optimizations …