SLIDE 1

Statistical Regular Sub-pavings (SRSPs) Adaptive Histograms Arithmetic on SRSPs Application Conclusion

Adaptive Histograms from a Randomized Queue that is Prioritized for Statistically Equivalent Blocks

Gloria Teng, Jennifer Harlow and Raazesh Sainudiin

Department of Mathematics and Statistics, University of Canterbury, New Zealand

August 19, 2010

Teng, Harlow and Sainudiin Adaptive Histograms from SEB-based PQ

SLIDE 2

Introduction

Present statistical regular sub-pavings (SRSPs) as an efficient, data-driven, multi-dimensional data structure for non-parametric density estimation of massive data sets.

Apply our methods to earthquakes in NZ, to weather and aircraft trajectories over a busy US airport, and to samples simulated from challenging multi-dimensional densities, including the Levy and Rosenbrock densities.

Figure: Shape of a Levy density with 700 modes.

SLIDE 3

Intervals and Boxes in $\mathbb{R}^d$

Intervals and boxes as interval vectors: $\boldsymbol{x} = [\underline{x}_1, \overline{x}_1] \times [\underline{x}_2, \overline{x}_2] \times \dots \times [\underline{x}_d, \overline{x}_d]$, with $\underline{x}_i \le \overline{x}_i$.

Figure: Boxes in 1D, 2D, and 3D.

SLIDE 4

Binary Tree Representation

These boxes can also be represented by ordered binary trees. An operation of bisection on a box is equivalent to performing the operation on its corresponding node in the tree, i.e.:

Figure: Bisecting a box or its equivalent node. The root node ρ with box X gains a left child ρL with box XL and a right child ρR with box XR.

SLIDE 8

Regular Sub-pavings (RSPs) (Jaulin et al., 2001)

An RSP is obtained by a sequence of bisections of boxes, starting from the root box, each along the first widest dimension of the box being bisected.

Figure: A sequence of bisections on root box X to produce a 4-leafed RSP s. Bisecting ρ gives the leaves L and R (boxes XL and XR); bisecting L gives LL and LR; bisecting LR gives LRL and LRR. The final leaves are LL, LRL, LRR and R, with boxes XLL, XLRL, XLRR and XR.
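The bisection sequence in the figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, its field names, the `leaves` helper and the unit-square root box are hypothetical choices, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Interval = Tuple[float, float]


@dataclass
class Node:
    """Hypothetical RSP node; `box` is a list of (lower, upper) intervals."""
    box: List[Interval]
    name: str = "ρ"
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None

    def bisect(self) -> None:
        """Bisect the box at the midpoint of its first widest dimension."""
        widths = [hi - lo for lo, hi in self.box]
        k = widths.index(max(widths))  # index() returns the FIRST maximum
        lo, hi = self.box[k]
        mid = (lo + hi) / 2.0
        lower, upper = list(self.box), list(self.box)
        lower[k] = (lo, mid)
        upper[k] = (mid, hi)
        base = "" if self.name == "ρ" else self.name
        self.left = Node(lower, base + "L")
        self.right = Node(upper, base + "R")


def leaves(node: Node) -> List[Node]:
    """Leaves in left-to-right order."""
    if node.is_leaf():
        return [node]
    return leaves(node.left) + leaves(node.right)


# Assumed root box: the unit square.
root = Node([(0.0, 1.0), (0.0, 1.0)])
root.bisect()             # ρ  -> L, R
root.left.bisect()        # L  -> LL, LR
root.left.right.bisect()  # LR -> LRL, LRR
leaf_names = [n.name for n in leaves(root)]
```

Each call to `bisect` acts on a node and on its box at the same time, which is the equivalence between boxes and tree nodes that the slides rely on.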

SLIDE 9

The Space of All Possible RSPs

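The number of distinct RSPs with i splits is equal to the Catalan number:

$$C_i = \frac{1}{i+1}\binom{2i}{i} = \frac{(2i)!}{(i+1)!\,i!}.$$

Figure: The RSPs with up to three splits, labelled by the depths of their leaves: s0 (the root box alone); 11; 221, 122; 3321, 2331, 2222, 1332, 1233.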
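As a quick check of the count, the following sketch evaluates the Catalan formula and also enumerates the leaf-depth labels of all binary trees with a given number of splits. The function names are hypothetical, and reading the figure labels as leaf depths is an assumption that the enumeration happens to reproduce.

```python
from math import comb


def catalan(i: int) -> int:
    """C_i = binom(2i, i) / (i + 1); the division is always exact."""
    return comb(2 * i, i) // (i + 1)


def leaf_depths(i: int, depth: int = 0):
    """All leaf-depth sequences (left to right) of binary trees with i splits."""
    if i == 0:
        return [[depth]]
    out = []
    for k in range(i):  # k splits go to the left subtree, i - 1 - k to the right
        for left in leaf_depths(k, depth + 1):
            for right in leaf_depths(i - 1 - k, depth + 1):
                out.append(left + right)
    return out


counts = [catalan(i) for i in range(6)]
labels_3 = sorted("".join(map(str, s)) for s in leaf_depths(3))
```

With three splits the enumeration yields five trees, in agreement with $C_3 = 5$.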

SLIDE 12

Statistical Regular Sub-pavings (SRSPs)

An SRSP is extended from the RSP. It caches recursively computable statistics at each box or node as data falls through. These statistics include:

the sample count;
the sample mean vector;
the sample variance-covariance matrix; and
the volume of the box.

Figure: Caching the sample count in each node (or box). The root ρ holds 10 points and its children L and R hold 5 each; L is bisected again into LL and LR, which share its 5 points (3 and 2).
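One way to keep such statistics "recursively computable" is to update them one point at a time, without storing the data. The sketch below uses the standard Welford-style running update for the mean and the co-moment matrix; the `BoxStats` class and its method names are hypothetical, and the slides do not say which update formulas the authors use.

```python
class BoxStats:
    """Hypothetical per-node cache: count, mean, co-moments and box volume."""

    def __init__(self, box):
        self.box = box  # list of (lower, upper) intervals
        d = len(box)
        self.count = 0
        self.mean = [0.0] * d
        # Running sum of (x - mean)(x - mean)^T, updated one point at a time.
        self.comoment = [[0.0] * d for _ in range(d)]

    def add(self, x):
        """Update the cached statistics with one new point x."""
        self.count += 1
        delta_old = [xi - mi for xi, mi in zip(x, self.mean)]
        self.mean = [mi + di / self.count for mi, di in zip(self.mean, delta_old)]
        delta_new = [xi - mi for xi, mi in zip(x, self.mean)]
        for a in range(len(x)):
            for b in range(len(x)):
                self.comoment[a][b] += delta_old[a] * delta_new[b]

    def covariance(self):
        """Sample variance-covariance matrix (needs at least two points)."""
        return [[c / (self.count - 1) for c in row] for row in self.comoment]

    def volume(self):
        v = 1.0
        for lo, hi in self.box:
            v *= hi - lo
        return v


stats = BoxStats([(0.0, 10.0), (0.0, 20.0)])
for point in [(1.0, 2.0), (3.0, 6.0), (5.0, 10.0)]:
    stats.add(point)
cov = stats.covariance()
```

Because each update touches only the cached values, the cost per point at a node does not grow with the number of points already seen.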

SLIDE 14

SRSPs as Adaptive Histograms

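The histogram estimate of i.i.d. random variables $X_1, X_2, \dots, X_n$ in $\mathbb{R}^d$ with density $f$ is given by:

$$\hat{f}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{\mathbb{1}_{\{X_i \in \boldsymbol{x}(x)\}}}{\mathrm{vol}(\boldsymbol{x}(x))}$$

$\boldsymbol{x}(x)$: the leaf box that contains $x$
$\mathrm{vol}(\boldsymbol{x})$: volume of box $\boldsymbol{x}$

Figure: An SRSP as a histogram estimate. The root ρ holds 10 points, the leaf R holds 5, and the leaves LL and LR hold the remaining 5 (2 and 3).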
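The estimate reduces to "count in the leaf containing x, divided by n times the leaf volume". A minimal sketch follows, assuming the three leaves of the figure sit in a unit-square root box. The box coordinates, the assignment of the counts 2 and 3 to particular leaves, and all function names are hypothetical.

```python
def volume(box):
    v = 1.0
    for lo, hi in box:
        v *= hi - lo
    return v


def contains(box, x):
    # Half-open intervals so that neighbouring leaves do not overlap.
    return all(lo <= xi < hi for (lo, hi), xi in zip(box, x))


def histogram_estimate(leaves, x):
    """leaves: list of (box, count); returns count / (n * vol) for x's leaf."""
    n = sum(count for _, count in leaves)
    for box, count in leaves:
        if contains(box, x):
            return count / (n * volume(box))
    return 0.0  # x lies outside the root box


# Assumed geometry for the 10-point example.
leaves = [
    ([(0.0, 0.5), (0.0, 0.5)], 2),  # X_LL
    ([(0.0, 0.5), (0.5, 1.0)], 3),  # X_LR
    ([(0.5, 1.0), (0.0, 1.0)], 5),  # X_R
]
f_ll = histogram_estimate(leaves, (0.25, 0.25))
f_lr = histogram_estimate(leaves, (0.25, 0.75))
f_r = histogram_estimate(leaves, (0.75, 0.50))
total_mass = sum(
    histogram_estimate(leaves, [(lo + hi) / 2 for lo, hi in box]) * volume(box)
    for box, _ in leaves
)
```

The heights times the leaf volumes sum to one, so the estimate is itself a density.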

SLIDE 15

A Prioritized Queue based Algorithm

Algorithm SplitMostCounts: As data arrives, order the leaf boxes of the SRSP so that the leaf box with the most points is chosen for the next bisection.

Split the root box.

Two or more boxes with the most points? Break ties by picking one of these boxes at random for the next bisection.

Keep bisecting until each box has at most kn points (let kn = 3 here).

Figure: SplitMostCounts on 10 points with kn = 3. The root box X (10 points) is bisected into XL and XR (5 points each). XL is then bisected into XLL and XLR, and XR into XRL and XRR. In the final state each pair of leaves holds 3 and 2 points.
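The steps above can be sketched with a binary heap as the priority queue. This is a rough illustration, not the authors' code: the function names, the midpoint bisection along the first widest dimension, the random tie-breaking key and the `max_splits` guard (which stops the loop if more than kn points coincide) are all assumptions.

```python
import heapq
import random


def bisect(box, points):
    """Split a box at the midpoint of its first widest dimension."""
    widths = [hi - lo for lo, hi in box]
    k = widths.index(max(widths))
    lo, hi = box[k]
    mid = (lo + hi) / 2.0
    left_box, right_box = list(box), list(box)
    left_box[k], right_box[k] = (lo, mid), (mid, hi)
    left_pts = [p for p in points if p[k] < mid]
    right_pts = [p for p in points if p[k] >= mid]
    return (left_box, left_pts), (right_box, right_pts)


def split_most_counts(root_box, points, k_n, rng=None, max_splits=10_000):
    """Bisect the fullest leaf until every leaf holds at most k_n points."""
    if rng is None:
        rng = random.Random()
    # heapq is a min-heap, so counts are negated; the random second key
    # breaks ties among leaves with equal counts.
    heap = [(-len(points), rng.random(), root_box, points)]
    splits = 0
    while -heap[0][0] > k_n and splits < max_splits:
        _, _, box, pts = heapq.heappop(heap)
        for child_box, child_pts in bisect(box, pts):
            heapq.heappush(
                heap, (-len(child_pts), rng.random(), child_box, child_pts)
            )
        splits += 1
    return [(box, len(pts)) for _, _, box, pts in heap]


# Ten made-up points in the unit square, k_n = 3 as on the slide.
pts = [(0.05 + i / 10.0, 0.05 + ((7 * i) % 10) / 10.0) for i in range(10)]
result = split_most_counts([(0.0, 1.0), (0.0, 1.0)], pts, k_n=3,
                           rng=random.Random(1))
leaf_counts = [count for _, count in result]
```

Only the top of the queue is inspected at each step, so the stopping rule "largest count at most kn" costs nothing extra to check.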

SLIDE 21

Some Examples

Figure: Histogram density estimates and their corresponding sub-pavings for the bivariate Gaussian, Levy and Rosenbrock densities.

SLIDE 22

Choice of kn

Figure: Two histogram density estimates for the standard bivariate Gaussian density with different choices of kn. The histogram is under-smoothed when kn is small relative to n and over-smoothed when kn is large relative to n.

SLIDE 24

Adding and Averaging SRSPs

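Perform a non-minimal union (or add sub-pavings) and adjust counts:

$s^{(1)}$ has leaves LL, LR and R with counts $n^{(1)}_{LL}$, $n^{(1)}_{LR}$ and $n^{(1)}_{R}$.
$s^{(2)}$ has leaves L, RL and RR with counts $n^{(2)}_{L}$, $n^{(2)}_{RL}$ and $n^{(2)}_{RR}$.
$s^{(1)} + s^{(2)}$ has leaves LL, LR, RL and RR with counts:

LL: $n^{(1)}_{LL} + \frac{n^{(2)}_{L}}{2}$
LR: $n^{(1)}_{LR} + \frac{n^{(2)}_{L}}{2}$
RL: $\frac{n^{(1)}_{R}}{2} + n^{(2)}_{RL}$
RR: $\frac{n^{(1)}_{R}}{2} + n^{(2)}_{RR}$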
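The count adjustment can be written as a short recursion over the two trees. In this sketch a leaf is a plain number (its count) and an internal node is a `(left, right)` tuple. That encoding, the function name and the example counts are assumptions made for illustration.

```python
def add_counts(s1, s2):
    """Non-minimal union of two RSPs on the same root box, adding counts."""
    leaf1 = not isinstance(s1, tuple)
    leaf2 = not isinstance(s2, tuple)
    if leaf1 and leaf2:
        return s1 + s2
    # A leaf that has to be refined passes half of its count to each child,
    # which gives the n/2 terms in the formulas above.
    l1, r1 = (s1 / 2, s1 / 2) if leaf1 else s1
    l2, r2 = (s2 / 2, s2 / 2) if leaf2 else s2
    return (add_counts(l1, l2), add_counts(r1, r2))


# s1 has leaves LL, LR, R; s2 has leaves L, RL, RR (made-up counts).
s1 = ((2, 3), 5)
s2 = (6, (1, 3))
s_sum = add_counts(s1, s2)
```

Here `s_sum` has the four leaves LL, LR, RL and RR, and the total count of the two inputs is preserved.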

SLIDE 26

Adding and Averaging SRSPs

Adding m histogram density estimates:

$$\sum_{i=1}^{m} \hat{f}^{(i)} = \hat{f}^{(1)} + \hat{f}^{(2)} + \hat{f}^{(3)} + \dots + \hat{f}^{(m)}.$$

Averaging m histogram density estimates:

$$\overline{f} = \frac{1}{m} \sum_{i=1}^{m} \hat{f}^{(i)}.$$
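Averaging can be sketched directly on trees of histogram heights. A leaf is a float (the density value on that box) and an internal node is a `(left, right)` tuple. This encoding, the function names and the three one-dimensional example estimates on [0, 1] are assumptions. Unlike a count, a height is passed unchanged to both halves when a leaf has to be refined.

```python
def add_heights(f, g):
    """Pointwise sum of two piecewise-constant densities on the same root box."""
    leaf_f = not isinstance(f, tuple)
    leaf_g = not isinstance(g, tuple)
    if leaf_f and leaf_g:
        return f + g
    # Refining a leaf keeps its height on both children.
    lf, rf = (f, f) if leaf_f else f
    lg, rg = (g, g) if leaf_g else g
    return (add_heights(lf, lg), add_heights(rf, rg))


def scale(f, c):
    """Multiply every leaf height by the constant c."""
    if isinstance(f, tuple):
        return (scale(f[0], c), scale(f[1], c))
    return f * c


def average(estimates):
    """(1/m) * sum of the m estimates, on their common refinement."""
    total = estimates[0]
    for f in estimates[1:]:
        total = add_heights(total, f)
    return scale(total, 1.0 / len(estimates))


f1 = (0.5, 1.5)         # heights on [0, 0.5) and [0.5, 1)
f2 = 1.0                # a single leaf: the uniform density
f3 = ((2.0, 1.0), 0.5)  # heights on [0, 0.25), [0.25, 0.5), [0.5, 1)
f_bar = average([f1, f2, f3])
mass = 0.25 * f_bar[0][0] + 0.25 * f_bar[0][1] + 0.5 * f_bar[1]
```

The averaged tree lives on the union of the three partitions and still integrates to one.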

SLIDE 27

An Example

Figure: Histogram density estimates of the bivariate Levy using different values of kn.

SLIDE 28

An Example

Figure: The averaged histogram density estimate.

SLIDE 29

An Example of Application

Example: Air Traffic Data (link to SAGE server). We are interested in applying SRSPs to the analysis of thunderstorm effects on aggregated aircraft trajectories.

SLIDE 30

Conclusions

We proposed an efficient, data-driven, multi-dimensional data structure, the SRSP, for non-parametric density estimation of massive data sets.

The SRSP can be represented by a binary tree and can either grow (through bisection of nodes) or be pruned (through merging of nodes) adaptively.

Arithmetic operations can be efficiently extended to these data structures, e.g. averaging histograms.

SLIDE 31

References

Jaulin, L., Kieffer, M., Didrit, O. and Walter, E. (2001). Applied interval analysis. London: Springer-Verlag.

Lugosi, G. and Nobel, A. (1996). Consistency of data-driven histogram methods for density estimation and classification. The Annals of Statistics 24 687–706.

Sainudiin, R. and York, T. L. (2005). An Auto-validating Rejection Sampler. BSCB Dept. Technical Report BU-1661-M, Cornell University, Ithaca, New York.

Tucker, W. (2004). Auto-validating numerical methods. Lecture Notes, Uppsala University, Sweden.

Thank you!
