High-dimensional integration without Markov chains
Alexander Gray
Carnegie Mellon University School of Computer Science
High-dimensional integration by:
– Nonparametric statistics
– Computational geometry
– Computational physics
…but NOT Markov chains.
Compute
\[ I = \frac{\int g(x)\, f(x)\, dx}{\int f(x)\, dx}, \qquad x \in \mathbb{R}^D \]
Motivating example: Bayesian inference on large datasets.
We can evaluate f(x) at any x, but we have no other knowledge about f. Often f(x) is expensive to evaluate.
Quadrature doesn’t extend to high dimensions: O(m^D) evaluations for m grid points per dimension. How to get the job done (Monte Carlo):
Often: MCMC (the Metropolis-Hastings algorithm).
Do this a huge number of times. Analyze the stream of x’s so far by hand to see if you can stop the process. If you jump around f long enough (make the sequence long enough), the x’s act as if drawn i.i.d. from f. Then
\[ E_f[g(x)] = \int g(x)\, f(x)\, dx \approx \frac{1}{N} \sum_{i=1}^{N} g(x_i) \]
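As a concrete illustration (not from the slides), here is a minimal Python sketch of the Metropolis scheme just described. The symmetric Gaussian proposal of scale s and the toy 2-D Gaussian target are assumptions for the example.

import numpy as np

def metropolis(f, x0, s, n_steps, rng=np.random.default_rng(0)):
    # Minimal Metropolis sampler: symmetric Gaussian jumps of scale s.
    # f need only be evaluable pointwise (up to a constant); assumes f(x0) > 0.
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    chain = np.empty((n_steps, x.size))
    for t in range(n_steps):
        x_new = x + s * rng.standard_normal(x.shape)  # propose a jump of scale s
        f_new = f(x_new)
        if rng.random() < f_new / fx:                 # accept w.p. min(1, ratio)
            x, fx = x_new, f_new
        chain[t] = x
    return chain

# Toy run: target is a standard 2-D Gaussian; estimate E_f[x_1^2] = 1.
f = lambda x: np.exp(-0.5 * np.sum(x**2))
xs = metropolis(f, np.zeros(2), s=1.0, n_steps=50_000)
print((xs[:, 0]**2).mean())   # (1/N) sum over the stream; about 1

Note how the scale s is a free parameter the user must guess, which is exactly the first complaint on the next slide.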
Really cool:
– Ultra-simple
– Requires only evaluations of f
– Faster than quadrature in high dimensions
– Gave us the bomb (??)
– Listed in “Top 10 algorithms of all time”
Really unfortunate:
1. No reliable way to choose the scale s; yet, its choice is critical for good performance
2. With multiple modes, can get stuck
3. Correlated sampling is hard to characterize in terms of error
4. Prior knowledge can be incorporated only in simple ways
5. No way to reliably stop automatically
Requires lots of runs, tuning, and guesswork. Many workarounds exist, each for different knowledge about f. (Note that in the general case, almost nothing has changed.) You must become an MCMC expert just to do integrals. A black art; not yet in Numerical Recipes.
Goal:
– Simple like MCMC
– Weak/no requirements on f()
– Principled and automatic choice of key parameter(s)
– Real error bounds
– Better handling of isolated modes
– Better incorporation of prior knowledge
– Holy grail: automatic stopping rule
Importance sampling:
\[ I = \int \frac{f(x)}{q(x)}\, q(x)\, dx, \qquad \hat{I} = \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{q(x_i)}, \quad x_i \sim q() \]
Choose q() as close to f() as you can; try not to underestimate f().
– error is hard to analyze/measure
– (requires that you know something about f)
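A minimal sketch of the importance-sampling estimator above, in Python. The Gaussian proposal q and the toy unnormalized target are assumptions for the example; the point is that the weights give both the estimate and a measurable error.

import numpy as np

rng = np.random.default_rng(0)
D, N = 5, 200_000

# Unnormalized target; the true integral is I = (2*pi)^(D/2).
f = lambda x: np.exp(-0.5 * np.sum(x**2, axis=1))

# Proposal q = N(0, sigma^2 I): an assumed stand-in. Anything we can
# sample from and evaluate pointwise works, ideally close in shape to f.
sigma = 2.0
q_pdf = lambda x: np.exp(-0.5 * np.sum(x**2, axis=1) / sigma**2) \
                  / (2 * np.pi * sigma**2)**(D / 2)

x = sigma * rng.standard_normal((N, D))     # x_i ~ q()
w = f(x) / q_pdf(x)                         # importance weights f(x_i)/q(x_i)
I_hat = w.mean()                            # Î = (1/N) Σ w_i
se = w.std(ddof=1) / np.sqrt(N)             # error estimate from the weights
print(I_hat, "+/-", se, "truth:", (2 * np.pi)**(D / 2))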
Idea: can we improve q() as we go along? Sample from q(); re-estimate q() from the samples. By the way… now we’re doing integration by statistical inference.
Assume f() is a density.
1. Generate x’s from an initial q().
2. Estimate the density f̂ from the x’s, using kernel density estimation (KDE):
\[ \hat{f}(x_i) = \frac{1}{N} \sum_{j \neq i} K_h(x_i - x_j) \]
3. Sample x’s from f̂.
4. Return to step 2.
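A naive sketch of step 2 (not the talk’s fast version): leave-one-out Gaussian KDE at the sample points. The bandwidth h is an input here; the talk returns to choosing it automatically, and to making this O(N²) computation fast.

import numpy as np

def loo_kde(x, h):
    # Leave-one-out Gaussian KDE: f̂(x_i) = (1/N) Σ_{j≠i} K_h(x_i - x_j).
    # O(N²) as written; exactly the computation the dual-tree
    # machinery later in the talk exists to speed up.
    N, D = x.shape
    d2 = np.sum((x[:, None, :] - x[None, :, :])**2, axis=-1)   # ||x_i - x_j||²
    K = np.exp(-0.5 * d2 / h**2) / (np.sqrt(2.0 * np.pi) * h)**D
    np.fill_diagonal(K, 0.0)          # drop the j = i term
    return K.sum(axis=1) / N          # 1/N normalization, as on the slide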
What to use for the estimate of q()? Candidates:
– Kernel density estimation: O(N²); choice of bandwidth
– Gaussian processes [Rasmussen & Ghahramani, NIPS 2002]: O(N³); choice of bandwidth
– Mixtures: nonlinear unconstrained optimization is itself time-consuming and contributes variance; choice of k, …
– (let’s get serious) awkward to sample from; like mixtures but worse
– Bayesian quadrature: right idea but not general
Desiderata for q() (i.e., so that it is a viable alternative to MCMC):
– ability to sample from it
– efficiently computable
– reliable and automatic parameter estimation
What objective function should we be optimizing? Is this active learning (aka optimal experiment design)? Basic framework: bias-variance decomposition of a least-squares objective function; minimize only the variance term. Seems reasonable, but: shouldn’t we sample more in regions where f() is large?
\[ \int \left( f(x) - q(x) \right)^2 dx \]
\[ I = \int \frac{f(x)}{q(x)}\, q(x)\, dx, \qquad \hat{I} = \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{q(x_i)}, \quad x_i \sim q() \]
Estimate (unbiased), with variance:
\[ V(\hat{I}_q) = E_q[\hat{I}^2] - \left( E_q[\hat{I}] \right)^2 = \frac{1}{N} \left( \int f(x)\, \frac{f(x)}{q(x)}\, dx - I^2 \right) = \frac{1}{N} \left( \int \frac{f^2(x)}{q(x)}\, dx - I^2 \right) \]
Optimal q():
\[ q^*(x) = \frac{f(x)}{I} \]
\[ V(\hat{I}_q) = \frac{1}{N} \int \frac{\left( f(x) - I\, q(x) \right)^2}{q(x)}\, dx \]
\[ \hat{V}(\hat{I}_q) = \frac{1}{N^2} \sum_{i=1}^{N} \frac{\left( f(x_i) - \hat{I}_q\, q(x_i) \right)^2}{q^2(x_i)}, \qquad x_i \sim q() \]
Observation #1: \( \hat{V}(\hat{I}_q) \) is (1/N times) the sample variance of the importance weights
\[ w_i = \frac{f(x_i)}{q(x_i)} \]
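A toy 1-D check of this observation (my example, not the talk’s): the plug-in variance estimate above coincides with the sample variance of the weights, up to the 1/N factor.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.exp(-0.5 * x**2)                               # unnormalized target
q_pdf = lambda x: np.exp(-0.5 * x**2 / 4) / np.sqrt(8 * np.pi)  # q = N(0, 4)
x = 2.0 * rng.standard_normal(10_000)                           # draws from q
w = f(x) / q_pdf(x)
I_hat = w.mean()
V_plugin = np.sum((f(x) - I_hat * q_pdf(x))**2 / q_pdf(x)**2) / len(x)**2
print(np.isclose(V_plugin, w.var() / len(x)))                   # True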
Observation #2: (even if f() is a density) the right objective is not that of density estimation (kernel estimators, finite-sample).
\[ E[Y \mid X = x] = \int y\, f(y \mid x)\, dy = \frac{\int y\, f(x, y)\, dy}{\int f(x, y)\, dy} \]
\[ \hat{f}(x_i) = \frac{\frac{1}{N} \sum_{j \neq i} y_j\, K_h(x_i - x_j)}{\frac{1}{N} \sum_{j \neq i} K_h(x_i - x_j)} \]
To sample from the model: pick a point x_i with probability
\[ p_i = \frac{\hat{f}(x_i)}{\sum_i \hat{f}(x_i)} \]
then draw from N(x_i, h*). To sample more widely, use a larger bandwidth.
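The sampling step in code, a minimal sketch under the description above (f_hat and h_star come from the fit; their values are whatever the estimation step produced):

import numpy as np

def sample_from_model(x, f_hat, h_star, n, rng=np.random.default_rng(0)):
    # Pick a kernel center x_i with probability p_i = f̂(x_i)/Σ f̂(x_i),
    # then perturb with N(x_i, h*² I). A larger h* samples more widely.
    p = f_hat / f_hat.sum()
    idx = rng.choice(len(x), size=n, p=p)
    return x[idx] + h_star * rng.standard_normal((n, x.shape[1]))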
This also answered: “How can we incorporate prior knowledge?” Setting required: points in a metric space [Euclidean], a monotonic kernel [pdf].
Extension from KDE to NWR (Nadaraya-Watson regression):
\[ \hat{f}(x_i) = \frac{\frac{1}{N} \sum_{j \neq i} y_j\, K_h(x_i - x_j)}{\frac{1}{N} \sum_{j \neq i} K_h(x_i - x_j)} \]
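A naive sketch of the NWR estimator just shown (again O(N²); the same dual-tree machinery that accelerates KDE applies, since numerator and denominator are both kernel sums):

import numpy as np

def nwr(x, y, h):
    # Leave-one-out Nadaraya-Watson regression at the training points:
    # ŷ(x_i) = Σ_{j≠i} y_j K_h(x_i - x_j) / Σ_{j≠i} K_h(x_i - x_j).
    d2 = np.sum((x[:, None, :] - x[None, :, :])**2, axis=-1)
    K = np.exp(-0.5 * d2 / h**2)
    np.fill_diagonal(K, 0.0)          # leave out the j = i term
    return (K @ y) / K.sum(axis=1)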
Okay, so we can compute NWR efficiently and accurately. What about an analytical method for choosing the bandwidth? The bias-variance decomposition has many terms that can be estimated (if very roughly), but the real problem is D. Thus this is not reliable in our setting.
Instead, choose the bandwidth empirically:
\[ h^* = \arg\min_{h \in \{h_i\}} \sum_i \left[ f(x_i) - \hat{I}_q\, q_h(x_i) \right]^2 \]
then refine around the winner:
\[ h^*_{new} = \arg\min_{h \in \{\frac{2}{3} h^*,\; h^*,\; \frac{3}{2} h^*\}} \sum_i \left[ f(x_i) - \hat{I}_q\, q_h(x_i) \right]^2 \]
We can share work between bandwidths. [Gray and Moore, 2003]
\[ K(x, x_i) = \begin{cases} (1 - t^2)^a & 0 \le t < 1 \\ 0 & t \ge 1 \end{cases}, \qquad t = \frac{\|x - x_i\|}{h} \]
What is the right notion of dimension for the N-body algorithm?
\[ v_i = \left[ f(x_i) - \hat{I}_q\, q(x_i) \right]^2, \qquad p_i = \frac{v_i}{\sum_i v_i} \]
I tried that. It doesn’t work. So I remember all the old samples.
\[ \hat{I} = \frac{ N_{tot}\, \hat{I}_q + \sum_{i=1}^{N} \frac{f(x_i)}{q(x_i)} }{ N_{tot} + N } \]
\[ \hat{V}(\hat{I}) = \frac{ N_{tot}^2\, \hat{V}(\hat{I}_q) + \sum_{i=1}^{N} \frac{\left[ f(x_i) - \hat{I}_q\, q(x_i) \right]^2}{q^2(x_i)} }{ (N_{tot} + N)^2 } \]
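The two update formulas as a small Python helper, a sketch assuming the batch comes from the current q() (note that the residual term equals (w_i - Î_q)² for weights w_i = f(x_i)/q(x_i)):

import numpy as np

def update_estimate(I_old, V_old, N_tot, f_new, q_new):
    # Fold a new batch into the running estimate, per the formulas above.
    # f_new, q_new: arrays of f(x_i) and q(x_i) for the batch drawn from q().
    N = len(f_new)
    w = f_new / q_new
    I = (N_tot * I_old + w.sum()) / (N_tot + N)
    V = (N_tot**2 * V_old + np.sum((w - I_old)**2)) / (N_tot + N)**2
    return I, V, N_tot + N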
Overall method: FIRE

Repeat:
1. Resample N points from {x_i} using the p_i below; add them to the training set T; build/update f̂ (with per-point bandwidths h_i*) on T.
2. Set the new sampling distribution to the mixture
\[ q() = (1 - \alpha)\, \hat{f}() + \alpha\, q_0() \]
3. Draw new samples {x̃_i} from q(); update \( \hat{I} \) and \( \hat{V}(\hat{I}) \) as above.
4. Re-select the bandwidth on the new samples:
\[ h^*_{new} = \arg\min_{h \in \{\frac{2}{3} h^*,\; h^*,\; \frac{3}{2} h^*\}} \sum_i \left[ f(\tilde{x}_i) - \hat{I}_q\, q_h(\tilde{x}_i) \right]^2 \]
5. Compute the resampling weights for the next round:
\[ v_i = \left[ f(\tilde{x}_i) - \hat{I}_q\, q(\tilde{x}_i) \right]^2, \qquad p_i = \frac{v_i}{\sum_i v_i} \]
(A toy code sketch of this loop follows below.)
Properties:
– consistent
– unbiased
EM-like; like N interacting Metropolis samplers.
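The promised toy sketch of a FIRE-style loop, with heavy simplifications that are mine, not the talk’s: 1-D, a fixed bandwidth h instead of the automatic search, and a defensive mixture with the initial proposal q0 (the mixture component is assumed).

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.exp(-0.5 * (x - 3.0)**2)      # unnormalized target; I = sqrt(2*pi)

s0, alpha, h, N = 5.0, 0.1, 0.5, 2000          # assumed constants for the toy
q0_pdf = lambda x: np.exp(-0.5 * (x / s0)**2) / (s0 * np.sqrt(2 * np.pi))
train = s0 * rng.standard_normal(N)            # initial training set from q0

def kde_pdf(x, centers, h):
    d2 = (x[:, None] - centers[None, :])**2
    return np.exp(-0.5 * d2 / h**2).mean(axis=1) / (np.sqrt(2 * np.pi) * h)

for it in range(5):
    centers = train.copy()
    q_pdf = lambda x: (1 - alpha) * kde_pdf(x, centers, h) + alpha * q0_pdf(x)
    # Draw from the mixture: with prob. alpha from q0, else from the KDE.
    from_q0 = rng.random(N) < alpha
    kde_draw = centers[rng.integers(len(centers), size=N)] + h * rng.standard_normal(N)
    x = np.where(from_q0, s0 * rng.standard_normal(N), kde_draw)
    w = f(x) / q_pdf(x)
    I_hat = w.mean()
    # Resample where the model explains f badly; grow the training set.
    v = (f(x) - I_hat * q_pdf(x))**2
    train = np.concatenate([train, x[rng.choice(N, size=N, p=v / v.sum())]])
    print(it, round(I_hat, 3))                 # approaches sqrt(2*pi) ≈ 2.507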
Test problems:
– Anisotropic Gaussian: a·σ² in the off-diagonals; a = {0.99, 0.9, 0.5}, D = {5, 10, 25, 100}
– Mixture of two normals: 0.5 N(4,1) + 0.5 N(4+b,1); b = {2, 4, 6, 8, 10}, D = {5, 10, 25, 100}

MCMC tuning, for comparison:
– starting point [Gelman-Roberts-Gilks 95]
– adaptation phase [Gelman-Roberts-Gilks 95]
– burn-in time [Geyer 92]
– multiple chains [Geyer 92]
– thinning [Gelman 95]
Look at its relative error over many runs When to stop it?
Anisotropic Gaussian (a=0.9, D=10)
MCMC:
– started at center of mass
– when it wants to stop: >2 hours
– after 2 hours: …
– ~40M samples
FIRE:
– when it wants to stop: ~1 hour
– after 2 hours: rel. err {1%, 2%, 1%, 1%}
– ~1.5M samples
Mixture of Gaussians (b=10, D=10)
MCMC:
– started at one mode
– when it wants to stop: ~30 minutes
– after 2 hours: …
– ~40M samples
FIRE:
– when it wants to stop: ~10 minutes
– after 2 hours: rel. err {<1%, 1%, 32%, <1%}
– ~1.2M samples
Future directions:
– Non-positive functions: positivization [Owen-Zhou 1998]
– More defensiveness, and accuracy: control variates [Veach 1997]
– More accurate regression: local linear regression
– Fully automatic stopping: function-wide confidence bands (stitch together pointwise bands, control with FDR)
High-dimensional integration without Markov chains, by statistical inference:
– safer (e.g. isolated modes)
– not a black art
– faster
One notion of intrinsic dimension: the ‘correlation dimension’ (similar to a notion in metric analysis), read off as the slope of log C(r) versus log r.
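A sketch (my example) of estimating the correlation dimension from the slope of log C(r) vs. log r, where C(r) is the fraction of point pairs within distance r; a 1-D curve embedded in 3-D should come out near 1.

import numpy as np

def correlation_dimension(x, r1, r2):
    # Slope of log C(r) between two radii, C(r) = fraction of pairs within r.
    d = np.sqrt(np.sum((x[:, None, :] - x[None, :, :])**2, axis=-1))
    pair_d = d[np.triu_indices(len(x), k=1)]
    C = lambda r: np.mean(pair_d < r)
    return (np.log(C(r2)) - np.log(C(r1))) / (np.log(r2) - np.log(r1))

t = np.random.default_rng(0).uniform(0, 1, 500)
x = np.stack([t, np.sin(6 * t), np.cos(6 * t)], axis=1)  # a curve in R^3
print(correlation_dimension(x, 0.05, 0.2))               # ≈ 1, the intrinsic dimension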
Example kernels, by field:

Computational physics (high accuracy required):
\[ K(x, x_i) = \frac{m\, m_i}{\|x - x_i\|^a} \]

Statistics, estimation (only moderate accuracy required, often high-D):
\[ K(x, x_i) = e^{-\|x - x_i\|^2 / 2\sigma^2} \]
\[ K(x, x_i) = \begin{cases} (1 - t^2)^a & 0 \le t < 1 \\ 0 & t \ge 1 \end{cases}, \qquad t = \frac{\|x - x_i\|}{\sigma} \]

Computational physics, smoothed particle hydrodynamics (only moderate accuracy required):
\[ K(x, x_i) \propto \begin{cases} 4 - 6t^2 + 3t^3 & 0 \le t < 1 \\ (2 - t)^3 & 1 \le t < 2 \\ 0 & t \ge 2 \end{cases}, \qquad t = \frac{\|x - x_i\|}{\sigma} \]

Also: kernels can be different for every point, non-isotropic, edge-dependent, …
Node approximations:

Barnes-Hut: approximate a far-away node R by its centroid \( \mu_R \):
\[ \sum_{x_i \in R} K(x, x_i) \approx N_R\, K(x, \mu_R) \qquad \text{if } s > \theta\, r \]
(s: distance from x to R; r: size of R; could also use the center of mass).

FMM: for all x, replace \( \sum_i K(x, x_i) \) by a multipole/Taylor expansion if \( s > r \), i.e. if the nodes are well-separated.
These runtime analyses are non-rigorous and assume a uniform distribution. [Callahan-Kosaraju 95]: O(N) is impossible for a log-depth tree (in the worst case).
We want to handle arbitrary (not even analytic), heterogeneous kernels. Are expansions:
– needed to achieve high accuracy? (?)
– needed to have hard error bounds? (?)
Our framework can use any kind of tree (rather than requiring a grid structure):
– quad-tree/oct-tree: not very adaptive
– kd-tree: adaptive
– ball-tree/metric tree: very adaptive
kd-trees: the most widely-used space-partitioning tree [Friedman, Bentley & Finkel 1977]
A kd-tree: levels 1–6 (figure sequence).
A ball-tree: levels 1–5 (figure sequence). [Uhlmann 1991], [Omohundro 1991]
                    Barnes-Hut    FMM
runtime             O(N log N)    O(N)
expansions          optional      required
simple, recursive?  yes           no
adaptive trees?     yes           no
error bounds?       no            yes
Is it really because of the expansions? We want: expansions optional, simple, recursive.
Two variants: single-tree, and dual-tree (symmetric):
Simple recursive algorithm:

SingleTree(q, R) {
  if approximate(q, R), return.
  if leaf(R), SingleTreeBase(q, R).
  else,
    SingleTree(q, R.left).
    SingleTree(q, R.right).
}

(NN or range-search: recurse on the closer node first)
Simple recursive algorithm:

DualTree(Q, R) {
  if approximate(Q, R), return.
  if leaf(Q) and leaf(R), DualTreeBase(Q, R).
  else,
    DualTree(Q.left, R.left).  DualTree(Q.left, R.right).
    DualTree(Q.right, R.left). DualTree(Q.right, R.right).
}

(NN or range-search: recurse on the closer node first)
Dual-tree traversal (depth-first) over the query points and reference points. (Animation sequence.)
Finite-difference function approximation.

Taylor expansion:
\[ f(x) \approx f(a) + f'(a)\,(x - a) \]

Gregory-Newton finite form:
\[ f(x) \approx f(x_i) + \frac{f(x_{i+1}) - f(x_i)}{x_{i+1} - x_i}\,(x - x_i) \]
Finite-difference function approximation, applied to the kernel over a node pair (Q, R):
\[ \tilde{K}_{QR} = \frac{1}{2} \left( K(\delta_{min}^{QR}) + K(\delta_{max}^{QR}) \right) \]
\[ err = \left| \sum_{r \in R} K_{qr} - N_R\, \tilde{K}_{QR} \right| \le \frac{N_R}{2} \left( K(\delta_{min}^{QR}) - K(\delta_{max}^{QR}) \right) \]
(assumes a monotonic decreasing kernel)
Stopping rule: approximate if s > r.
Simple approximation method:

approximate(Q, R) {
  if δ_min ≥ s · max(diam(Q), diam(R)),
    dl = N_R · K(δ_max), du = N_R · K(δ_min),
    incorporate(dl, du).
}
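The bound computation in that rule, sketched in Python (the Gaussian kernel and the sample numbers are my assumptions; any monotonically decreasing kernel works, which is why changing the kernel is trivial):

import math

# Gaussian kernel of bandwidth h: monotonically decreasing in distance.
K = lambda d, h=1.0: math.exp(-0.5 * (d / h)**2)

def node_pair_bounds(n_ref, d_min, d_max):
    # dl/du bound each query's kernel sum over the reference node R,
    # using only the node-node distance bounds [d_min, d_max].
    dl = n_ref * K(d_max)          # as if all references sat at the far bound
    du = n_ref * K(d_min)          # ...or at the near bound
    estimate = 0.5 * (dl + du)     # midpoint approximation from the slides
    err_bound = 0.5 * (du - dl)    # = (N_R/2)(K(d_min) - K(d_max)), a hard bound
    return dl, du, estimate, err_bound

print(node_pair_bounds(100, d_min=2.0, d_max=3.0))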
Trivial to change the kernel; hard error bounds.
THEOREM: The dual-tree algorithm is O(N) in the worst case (linear-depth trees).
NOTE: A faster algorithm using a different approximation rule achieves O(N) expected case.
ASSUMPTION: N points drawn from density f.
single-tree (point-node): \( T(N) = T(N/2) + O(1),\ T(1) = O(1) \;\Rightarrow\; N \cdot O(\log N) \)
dual-tree (node-node): \( T(N) = 2\,T(N/2) + O(1),\ T(1) = O(1) \;\Rightarrow\; O(N) \)
LEMMA: The number of nodes that are well-separated from a query node Q is bounded by a constant \( g(s, c, C)^{D} \). Thus the recurrence yields the entire runtime. Done. CONJECTURE: the exponent should actually be D′ (the intrinsic dimension).
Speedup Results: Number of points

One order-of-magnitude speedup

N        naïve     dual-tree
12.5K    7         .12
25K      31        .31
50K      123       .46
100K     494       1.0
200K     1976*     2
400K     7904*     5
800K     31616*    10
1.6M     35 hrs    23

(times in seconds except as noted; 5500× speedup at N = 1.6M)
Speedup Results: Different kernels

N        Epan.    Gauss.
12.5K    .12      .32
25K      .31      .70
50K      .46      1.1
100K     1.0      2
200K     2        5
400K     5        11
800K     10       22
1.6M     23       51

Epanechnikov: 10⁻⁶ relative error. Gaussian: 10⁻³ relative error.
Speedup Results: Dimensionality
Speedup Results: Different datasets

Name      N       D      Time (sec)
PSF2d     3M      2      9
MNIST     10K     784    24
CovType   136K    38     8
Bio5      103K    5      10
Meets desiderata? [Gray and Moore, 2003], [Gray and Moore, 2005, in prep.]
Which data structure is best in practice? (Variants: approximate, all-nearest-neighbor, bichromatic nearest-neighbor, point location.) Metric trees? SR-trees? Miller et al.’s separator tree? WSPD? Navigating nets? Locality-sensitive hashing?
Coming to a journal near you.
Side note: Many problems are easy for this framework
Now use q() to do importance sampling. Compute
\[ \hat{I}_{final} = \frac{1}{M} \sum_{i=1}^{M} \frac{f(x_i)}{q(x_i)}, \qquad x_i \sim q() \]