Principles of Database Systems
V. Megalooikonomou
Fractals and Databases
(based on notes by C. Faloutsos at CMU)
2
Indexing - Detailed outline
fractals
intro
applications
3
Intro to fractals - outline
Motivation - 3 problems / case studies
Definition of fractals and power laws
Solutions to posed problems
More examples and tools
Discussion - putting fractals to work!
Conclusions - practitioner's guide
Appendix: gory details - boxcounting
4
Road end-points of Montgomery county:
an R-tree?
5
6
disk trace (from HP - J. Wilkes); Web
[Plot: # bytes vs. time]
7
Fractals / self-similarities / power laws
Seminal works from Hilbert, Minkowski,
8
9
zero area; infinite perimeter!
10
Paradox: infinite perimeter; zero area!
'dimensionality': between 1 and 2
actually: log(3)/log(2) = 1.58...
11
a perfectly self-similar object with n similar pieces, each scaled down by a factor f, has fractal dimension log(n)/log(f)
zero area; infinite length!
12
Q: fractal dimension of a line?
A: 1 (= log(2)/log(2)!)
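The log(3)/log(2) and log(2)/log(2) values above are both instances of log(n)/log(f) for n self-similar pieces, each scaled down by a factor f. A minimal sketch of that arithmetic follows; the function name `similarity_dimension` is an illustrative choice, not something taken from the slides.

```python
import math


def similarity_dimension(n_pieces: int, scale_factor: float) -> float:
    """Return log(n)/log(f) for a perfectly self-similar object.

    n_pieces: number of similar pieces the object splits into.
    scale_factor: factor by which each piece is scaled down (must be > 1).
    """
    return math.log(n_pieces) / math.log(scale_factor)


# A line splits into 2 halves, each scaled down by 2.
print(similarity_dimension(2, 2))
# The triangle with zero area and infinite perimeter (Sierpinski triangle):
# 3 pieces, each scaled down by 2.
print(similarity_dimension(3, 2))
# A filled square splits into 4 quarters, each scaled down by 2.
print(similarity_dimension(4, 2))
```

The line and the square give the familiar integer dimensions; the triangle lands strictly between them.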
13
Q: dfn for a given
[Figure: point set in the x-y plane]
14
Q: fractal dimension of a line?
A: nn(<= r) ~ r^1
('power law': y = x^a)
Q: fd of a plane?
A: nn(<= r) ~ r^2
fd = slope of (log(nn) vs log(r))
15
Algorithm, to estimate it?
avg nn(<= r) is exactly tot # pairs(<= r) / N,
including 'mirror' pairs
16
[Plot: log(# pairs within <= r) vs. log(r); slope 1.58]
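The pair-count estimate can be sketched as follows: count the point pairs within each distance r, then fit the slope of log(# pairs) against log(r). This is a brute-force O(N^2) illustration under stated assumptions, not the deck's own code. The function names are hypothetical, the least-squares fit is one simple way to read off the slope, and the radii must be chosen so that every count is positive.

```python
import math


def pair_counts(points, radii):
    """Count unordered point pairs whose Euclidean distance is <= r, per r."""
    counts = [0] * len(radii)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 1
    return counts


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) versus log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


def correlation_dimension(points, radii):
    """Estimate the fractal dimension as the slope of the pair-count plot."""
    return loglog_slope(radii, pair_counts(points, radii))


# Points on a straight line: the estimate should come out close to 1.
line = [(i, 0) for i in range(400)]
print(correlation_dimension(line, [4, 8, 16, 32, 64]))
```

Counting unordered rather than 'mirror' pairs only halves every count, which shifts the log-log plot vertically and leaves the slope unchanged.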
17
Euclidean objects have integer fractal dimensions:
point: 0
lines and smooth curves: 1
smooth surfaces: 2
fractal dimension -> roughness of the
18
fd = embedding dimension -> uniform
a point set may have several fd,
19
20
Cross-roads of Montgomery county:
21
< = > fractals < = > scale-free < = > power-laws
avg# neighbors(<= r
[Plot: log(# pairs within <= r) vs. log(r)]
22
avg# neighbors(<= r
[Plot: log(# pairs within <= r) vs. log(r)]
23
Montgomery County of MD (road end-points)
24
Long Beach county of CA (road end-points)
25
26
27
28
Heuristic on choosing # of clusters
29
30
31
disk traces: self-similar:
[Plot: # bytes vs. time]
32
disk traces (80-20 'law' = 'multifractal')
[Plot: # bytes vs. time]
33
fractal: a set of points that is self-similar
multifractal: a probability density function that is self-similar
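One way to picture a self-similar density is a recursive 80-20 split: give 80% of the total to one half of the interval and 20% to the other, then repeat inside each half. The sketch below is an assumed illustration in the spirit of the '80-20 law' slide, not the construction used in the cited work. The function name `b_model`, the `bias` parameter, and the choice to always favour the left half are all hypothetical.

```python
def b_model(total, levels, bias=0.8):
    """Recursively split `total` over 2**levels equal intervals.

    At every level each interval passes `bias` of its mass to its left half
    and (1 - bias) to its right half (deterministic variant, assumed here).
    """
    values = [total]
    for _ in range(levels):
        next_values = []
        for v in values:
            next_values.append(v * bias)
            next_values.append(v * (1.0 - bias))
        values = next_values
    return values


# 100 units of traffic spread over 8 time slots.
print(b_model(100.0, 3))
```

Every level conserves the total, and the same 80-20 imbalance shows up at each scale, which is the self-similarity the definition refers to.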
34
[Figure: timeline of Tape# 1 ... Tape# N]
# tapes needed, to retrieve n records?
(# days down, due to failures / hurricanes / communication noise...)
35
[Figure: timeline of Tape# 1 ... Tape# N]
[Plot: # tapes retrieved vs. # qual. records]
36
37
Zipf's law
Korcak's law / "fat fractals"
38
[Plot: freq. per word, from "aaron" to "zoo"]
39
[Plot: log(freq) vs. log(rank); top-ranked words: "a", "the"]
40
[Plot: log(freq) vs. log(rank)]
41
[Plot: log(freq) vs. log(rank)]
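The rank-frequency plot behind these slides can be sketched as follows: count how often each word occurs, sort the counts in decreasing order, and fit log(freq) against log(rank). The function names and the toy token list are assumptions made for illustration; a slope near -1 is what Zipf's law predicts for real text.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return frequencies sorted in decreasing order (index 0 is rank 1)."""
    return sorted(Counter(tokens).values(), reverse=True)


def zipf_slope(freqs):
    """Least-squares slope of log(freq) versus log(rank)."""
    lx = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ly = [math.log(f) for f in freqs]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


# Toy vocabulary whose frequencies are exactly 120 / rank.
tokens = ["a"] * 120 + ["the"] * 60 + ["of"] * 40 + ["to"] * 30 + ["in"] * 24
freqs = rank_frequency(tokens)
print(freqs, zipf_slope(freqs))
```

The same two functions apply to any categorical attribute, for example product ids in a sales table.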
42
[Plot: log(# medals) vs. rank; linear fit y = -0.9676x + 2.3054, R^2 = 0.9458]
43
Scandinavian lakes Any pattern?
44
Scandinavian lakes: area vs complementary cumulative count (log-log axes)
[Plot: log(count(>= area)) vs. log(area)]
45
Japan islands
46
Japan islands: area vs cumulative count (log-log axes)
[Plot: log(count(>= area)) vs. log(area)]
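The count(>= area) axis used in the lake and island plots is a complementary cumulative count. A minimal sketch of how it could be computed follows; the function name `ccdf` and the sample areas are illustrative only.

```python
def ccdf(values):
    """Return (value, number of items >= value) for each distinct value, ascending."""
    ordered = sorted(values)
    n = len(ordered)
    result = []
    for i, v in enumerate(ordered):
        # The first occurrence of v sits at index i, so n - i items are >= v.
        if i == 0 or v != ordered[i - 1]:
            result.append((v, n - i))
    return result


# Hypothetical areas; plotting both columns on log-log axes gives the CCDF plot.
print(ccdf([5, 1, 2, 2]))  # [(1, 4), (2, 3), (5, 1)]
```

A straight line on log-log axes would indicate a power law in the areas (Korcak's law).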
47
48
How to generate such regions?
49
Q: How to generate such regions?
A: recursively, from a single region
50
concepts:
fractals, multifractals and fat fractals
tools:
correlation integral (= pair-count plot)
rank/frequency plot (Zipf's law)
CCDF (Korcak's law)
51
52
What does the Internet look like?
53
What does the Internet look like?
Internet routers: how many neighbors
54
55
Internet routers: how many neighbors
Reachability function: number of neighbors within r hops, vs r (log-log). Mbone routers, 1995
[Plot: log(# pairs) vs. log(hops); slope 2.8]
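The reachability function can be sketched with a breadth-first search from every node, counting how many node pairs lie within h hops. This is a small illustration under stated assumptions, not the code behind the Mbone plot. The function name, the adjacency-dictionary input, and the choice to count ordered pairs are all hypothetical.

```python
from collections import deque


def pairs_within_hops(adjacency, max_hops):
    """counts[h] = number of ordered pairs (u, v), u != v, at most h hops apart.

    adjacency: dict mapping each node to an iterable of its neighbors.
    """
    counts = [0] * (max_hops + 1)
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == max_hops:
                continue  # do not expand beyond the hop limit
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for node, d in dist.items():
            if node != source:
                # A node at distance d is reachable within every h >= d.
                for h in range(d, max_hops + 1):
                    counts[h] += 1
    return counts


# Four routers in a chain: 0 - 1 - 2 - 3.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(pairs_within_hops(chain, 3))  # [0, 6, 10, 12]
```

Plotting log(counts[h]) against log(h) for h >= 1 gives the hop plot; its slope plays the role of the fractal dimension for a graph.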
56
degree vs rank, for Internet domains (log-log) [sigcomm99]
[Plot: log(degree) vs. log(rank)]
57
pdf of degrees (slope: 2.2)
58
Scree plot for Internet domains (log-log) [sigcomm99]
[Plot: log(i-th eigenvalue) vs. log(i)]
59
Oct-trees; brain-scans
[Plot: log(# octants); slope 2.63 = fd]
60
benign tumors: fd ~ 2.37
malignant: fd ~ 2.56
61
cardiovascular system: 3 (!)
stock prices (LYCOS) - random walks: 1.5
Coastlines: 1.2-1.58 (Norway!)
62
63
duration of UNIX jobs [Harchol-Balter]
Energy of earthquakes (Gutenberg-Richter law)
64
publication counts (Lotka's law)
Distribution of UNIX file sizes
Income distribution (Pareto's law)
web hit counts [Huberman]
65
In- and out-degree distribution of web
length of file transfers [Bestavros+]
Click-stream data (w/ A. Montgomery
66
67
68
cities/stores/hospitals, over earth’s
time-stamps of events (customer
regions (sales areas, islands, patches of
69
customer feature vectors (age, income,
70
Detect non-existence of rules (if points
Detect non-homogeneous regions (eg.,
Estimate number of neighbors /
71
cities w/ populations
positions on earth and amount of
product ids and sales per product
people and their salaries
months and count of accidents
72
Estimate tape/disk accesses
how many of the 100 tapes contain my 50
how many days without an accident?
[Figure: timeline of Tape# 1 ... Tape# N]
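For comparison, the textbook baseline for the tape question assumes that each record lands on a tape uniformly and independently of the others. Under that assumption (a stated baseline, not the fractal-based estimate this deck argues for), the expected number of distinct tapes touched by n records over N tapes is N * (1 - (1 - 1/N)^n). The function name below is hypothetical.

```python
def expected_tapes_uniform(n_tapes, n_records):
    """Expected number of distinct tapes hit by n_records.

    Assumes uniformity and independence: each record is equally likely to be
    on any tape, regardless of where the other records are.
    """
    miss_probability = (1.0 - 1.0 / n_tapes) ** n_records
    return n_tapes * (1.0 - miss_probability)


# 50 records over 100 tapes, under the uniform assumption.
print(expected_tapes_uniform(100, 50))
```

When the records are clustered in time (bursty), they tend to share tapes, so the true number of tapes is typically lower than this baseline.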
73
how often do we exceed the threshold?
[Plot: # bytes vs. time]
74
Extrapolations for/from samples
[Plot: # bytes vs. time]
75
How many distinct products account for
76
77
Real data often disobey textbook
avoid ‘mean’ - use median, or even better,
fractals, self-similarity, and power laws,
78
tool# 1: (for points) 'correlation integral' (= pair-count plot)
tool# 2: (for categorical values) rank-frequency plot
tool# 3: (for numerical values) CCDF
79
tool# 1: # pairs vs distance, for a set of objects,
[Plot: log(# pairs) vs. log(hops); slope 2.8]
[Plot: log(# pairs within <= r) vs. log(r)]
80
tool# 2: rank-frequency plot (for categorical values)
[Plot: log(degree) vs. log(rank)]
81
tool# 3: CCDF, for (skewed) numerical values
[Plot: log(count(>= area)) vs. log(area)]
82
Strongly recommended intro book:
Manfred Schroeder, Fractals, Chaos, Power Laws
Classic book on fractals:
B. Mandelbrot, The Fractal Geometry of Nature
83
[ieeeTN94] W. E. Leland, M. S. Taqqu, W. Willinger, D. V. Wilson, On the Self-Similar Nature of Ethernet Traffic, IEEE/ACM Transactions on Networking, 2, 1, pp. 1-15, Feb. 1994.
[pods94] Christos Faloutsos and Ibrahim Kamel, Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension, PODS, Minneapolis, MN, May 24-26, 1994, pp. 4-13.
84
[vldb95] Alberto Belussi and Christos Faloutsos, Estimating the Selectivity of Spatial Queries Using the 'Correlation' Fractal Dimension, Proc. of VLDB, 1995.
[vldb96] Christos Faloutsos, Yossi Matias and Avi Silberschatz, Modeling Skewed Distributions Using Multifractals and the '80-20 Law', Conf. on Very Large Data Bases (VLDB), Bombay, India, Sept. 1996.
85
[vldb96] Christos Faloutsos and Volker Gaede, Analysis of the Z-Ordering Method Using the Hausdorff Fractal Dimension, VLDB, Bombay, India, Sept. 1996.
[sigcomm99] Michalis Faloutsos, Petros Faloutsos and Christos Faloutsos, What does the Internet look like? Empirical Laws of the Internet Topology, SIGCOMM 1999.
86
[icde99] Guido Proietti and Christos Faloutsos, I/O complexity for range queries on region data stored using an R-tree, International Conference on Data Engineering (ICDE), March 23-26, 1999.
[sigmod2000] Christos Faloutsos, Bernhard Seeger, Agma J. M. Traina and Caetano Traina Jr., Spatial Join Selectivity Using Power Laws, SIGMOD 2000.
87
Bad news: There is more than one fractal dimension
Minkowski fd; Hausdorff fd; Correlation fd;
Great news:
they can all be computed fast!
they usually have nearby values
88
How, for the (correlation) fractal dimension?
A: Box-counting plot:
[Plot: log(sum(pi^2)) vs. log(r)]
89
pi: the percentage (or count) of points in the i-th cell
r: the side of the grid
90
compute sum(pi^2) for another grid
[Plot: log(sum(pi^2)) vs. log(r)]
91
etc; if the resulting plot has a linear part, its slope is the correlation fractal dimension
[Plot: log(sum(pi^2)) vs. log(r)]
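The grid procedure above can be sketched as follows: overlay a grid of side r, compute the fraction pi of points in each non-empty cell, sum pi^2, repeat for several grid sides, and fit the slope of log(sum(pi^2)) against log(r). This is an illustrative sketch, not the code the deck refers to. The function names are hypothetical, and a least-squares fit over all supplied grid sides stands in for picking out the linear part of the plot by eye.

```python
import math
from collections import Counter


def sum_pi_squared(points, r):
    """Sum of squared cell occupancy fractions for a grid of side r."""
    cells = Counter(tuple(math.floor(c / r) for c in p) for p in points)
    n = len(points)
    return sum((count / n) ** 2 for count in cells.values())


def correlation_dimension_boxcount(points, sides):
    """Slope of log(sum(pi^2)) versus log(r) over the given grid sides."""
    lx = [math.log(r) for r in sides]
    ly = [math.log(sum_pi_squared(points, r)) for r in sides]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


# Points on a straight line embedded in 2-d: the slope should be close to 1.
line = [(i, 0) for i in range(1024)]
print(correlation_dimension_boxcount(line, [4, 8, 16, 32, 64]))
```

Each grid side needs only one pass over the points, which is why this is much cheaper than counting all pairs.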
92
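Many more fractal dimensions Dq (related to Rényi entropies):
$$D_q = \frac{1}{q-1}\,\frac{\partial \log \sum_i p_i^q}{\partial \log r}, \quad q \neq 1$$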
93
Box counting plot: log(N(r)) vs log(r)
r: grid side
N(r): count of non-empty cells
(Hausdorff) fractal dimension D0:
$$D_0 = -\frac{\partial \log N(r)}{\partial \log r}$$
94
Hausdorff fd:
[Plot: log(# non-empty cells) vs. log(r); D0 read off the slope]
95
q = 0: Hausdorff fractal dimension
q = 2: Correlation fractal dimension
q = 1: Information fractal dimension
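The three cases above can be computed by one routine. The sketch below assumes the standard generalized-dimension definitions: for q != 1 it fits the slope of log(sum(pi^q)) / (q - 1) against log(r), and for q = 1 it fits the slope of sum(pi * log(pi)). The function names are hypothetical, and the fit uses every supplied grid side rather than a hand-picked linear range.

```python
import math
from collections import Counter


def cell_probabilities(points, r):
    """Occupancy fraction of every non-empty cell for a grid of side r."""
    cells = Counter(tuple(math.floor(c / r) for c in p) for p in points)
    n = len(points)
    return [count / n for count in cells.values()]


def moment(points, r, q):
    """Quantity whose slope against log(r) gives D_q."""
    probs = cell_probabilities(points, r)
    if q == 1:
        return sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** q for p in probs)) / (q - 1)


def generalized_dimension(points, sides, q):
    """Least-squares estimate of D_q over the given grid sides."""
    lx = [math.log(r) for r in sides]
    ly = [moment(points, r, q) for r in sides]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


# For a perfectly regular line, D0, D1 and D2 should all be close to 1.
line = [(i, 0) for i in range(1024)]
for q in (0, 1, 2):
    print(q, generalized_dimension(line, [4, 8, 16, 32, 64], q))
```

With q = 0 every non-empty cell contributes 1, so the sum is the count of non-empty cells N(r) and the estimate reduces to the box-counting (Hausdorff) dimension.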
96
in general, the Dq's take similar, but not identical, values
except for perfectly self-similar point-sets
97
Montgomery County of MD (road end-points)
98
Long Beach county of CA (road end-points)
99
many fractal dimensions, with nearby values
can be computed quickly
(code: on the web)