Principles of Database Systems V. Megalooikonomou Fractals and - - PowerPoint PPT Presentation



slide-1
SLIDE 1
  • V. Megalooikonomou

Fractals and Databases

(based on notes by C. Faloutsos at CMU)

Principles of Database Systems

slide-2
SLIDE 2

2

Indexing - Detailed outline

  • fractals
      • intro
      • applications

slide-3
SLIDE 3

3

Intro to fractals - outline

  • Motivation – 3 problems / case studies
  • Definition of fractals and power laws
  • Solutions to posed problems
  • More examples and tools
  • Discussion - putting fractals to work!
  • Conclusions – practitioner’s guide
  • Appendix: gory details - boxcounting plots

slide-4
SLIDE 4

4

Road end-points of Montgomery county:

  • Q1: how many d.a. (disk accesses) for an R-tree?

  • Q2 : distribution?
  • not uniform
  • not Gaussian
  • no rules??

Problem # 1: GIS - points

slide-5
SLIDE 5

5

Problem # 2 - spatial d.m.

Galaxies (Sloan Digital Sky Survey -B. Nichol)

  • ‘spiral’ and ‘elliptical’ galaxies (stores and households ...)
  • patterns?
  • attraction/repulsion?
  • how many ‘spi’ within r from an ‘ell’?

slide-6
SLIDE 6

6

Problem # 3: traffic

  • disk trace (from HP - J. Wilkes); Web traffic - fit a model

[Plot: # bytes vs time; Poisson model for comparison]

  • how many explosions to expect?
  • queue length distr.?

slide-7
SLIDE 7

7

Common answer:

  • Fractals / self-similarities / power laws
  • Seminal works from Hilbert, Minkowski, Cantor, Mandelbrot, (Hausdorff, Lyapunov, Ken Wilson, …)


slide-9
SLIDE 9

9

What is a fractal?

= self-similar point set, e.g., Sierpinski triangle:

...

zero area; infinite perimeter!

slide-10
SLIDE 10

10

Definitions (cont’d)

  • Paradox: infinite perimeter; zero area!
  • ‘dimensionality’: between 1 and 2
  • actually: log(3)/log(2) = 1.58...

slide-11
SLIDE 11

11

Dfn of fd:

ONLY for a perfectly self-similar point set, i.e., a perfectly self-similar object with n similar pieces, each scaled down by a factor f:

fd = log(n) / log(f)

e.g., Sierpinski triangle (n = 3, f = 2): fd = log(3) / log(2) = 1.58

zero area; infinite length!
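The definition above is a one-line computation. A minimal Python sketch, where the function name `similarity_dimension` is an illustrative choice and not something from the slides:

```python
import math

def similarity_dimension(n_pieces, scale_factor):
    """fd = log(n) / log(f) for a perfectly self-similar object."""
    return math.log(n_pieces) / math.log(scale_factor)

# Sierpinski triangle: 3 similar pieces, each scaled down by a factor of 2.
sierpinski_fd = similarity_dimension(3, 2)

# A line segment: 2 halves, each scaled down by a factor of 2.
line_fd = similarity_dimension(2, 2)
```

This only applies to perfectly self-similar objects; for an arbitrary point set the dimension has to be estimated from the data.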

slide-12
SLIDE 12

12

Intrinsic (‘fractal’) dimension

  • Q: fractal dimension of a line?
  • A: 1 (= log(2)/log(2)!)

slide-13
SLIDE 13

13

Intrinsic (‘fractal’) dimension

  • Q: dfn for a given set of points?

[Figure: example point set on x-y axes]

slide-14
SLIDE 14

14

Intrinsic (‘fractal’) dimension

  • Q: fractal dimension of a line?
  • A: nn(<= r) ~ r^1 (‘power law’: y = x^a)
  • Q: fd of a plane?
  • A: nn(<= r) ~ r^2

fd == slope of (log(nn) vs log(r))

slide-15
SLIDE 15

15

Intrinsic (‘fractal’) dimension

  • Algorithm, to estimate it?

Notice:

  • avg nn(<= r) is exactly tot# pairs(<= r) / N, including ‘mirror’ pairs (each pair counted once in each direction)

slide-16
SLIDE 16

16

Sierpinski triangle

[Plot: log(# pairs within <= r) vs log(r); slope 1.58]

== ‘correlation integral’
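The pair-count estimate can be sketched in a few lines of Python. This is an illustrative brute-force version under stated assumptions, not the code used for the slides: the chaos-game generator, the sample size (800), the radii, and all function names are choices made here for the example.

```python
import bisect
import math
import random

def sierpinski_points(n, seed=0):
    """Chaos-game sample of the Sierpinski triangle (illustrative test data)."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = 0.25, 0.25
    pts = []
    for i in range(n + 20):
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2
        if i >= 20:  # drop the first few iterates (burn-in)
            pts.append((x, y))
    return pts

def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = sum((a - mx) ** 2 for a in xs)
    return num / den

def correlation_dimension(pts, radii):
    """Slope of log(# pairs within <= r) versus log(r)."""
    # All pairwise distances, O(N^2): fine for a small demo only.
    dists = sorted(math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:])
    counts = [bisect.bisect_right(dists, r) for r in radii]
    return fit_slope([math.log(r) for r in radii], [math.log(c) for c in counts])

radii = [0.02 * 1.5 ** k for k in range(6)]  # roughly 0.02 .. 0.15
d2 = correlation_dimension(sierpinski_points(800), radii)
```

With a finite sample the fitted slope only approximates log(3)/log(2); the radii should stay inside the range where the log-log plot is linear.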

slide-17
SLIDE 17

17

Observations:

  • Euclidean objects have integer fractal dimensions
      • point: 0
      • lines and smooth curves: 1
      • smooth surfaces: 2
  • fractal dimension -> roughness of the periphery

slide-18
SLIDE 18

18

Important properties

  • fd = embedding dimension -> uniform pointset
  • a point set may have several fd, depending on scale


slide-20
SLIDE 20

20

Cross-roads of Montgomery county:

  • any rules?

Problem # 1: GIS points

slide-21
SLIDE 21

21

Solution # 1

A: self-similarity ->

  • <=> fractals
  • <=> scale-free
  • <=> power-laws (y = x^a, F = C * r^(-2))
  • avg# neighbors(<= r) ~ r^D

[Plot: log(# pairs(within <= r)) vs log(r); slope 1.51]

slide-22
SLIDE 22

22

Solution # 1

A: self-similarity

  • avg# neighbors(<= r) ~ r^(1.51)

[Plot: log(# pairs(within <= r)) vs log(r); slope 1.51]
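A practical reading of this power law: if the average neighbor count is known at one radius, it can be extrapolated to another. A minimal sketch, assuming the power law holds across both radii; the function name and the starting count of 10 neighbors are hypothetical.

```python
def scale_neighbor_count(count_at_r1, r1, r2, fd):
    """Extrapolate avg # neighbors from radius r1 to r2, assuming count ~ r^fd."""
    return count_at_r1 * (r2 / r1) ** fd

# Hypothetical: 10 neighbors on average within radius 1; what about radius 2?
fractal_estimate = scale_neighbor_count(10.0, 1.0, 2.0, 1.51)  # fd from the plot
uniform_estimate = scale_neighbor_count(10.0, 1.0, 2.0, 2.0)   # uniform 2-d assumption
```

Doubling the radius multiplies the count by 2^1.51 (about 2.85) rather than the factor of 4 that a uniformity assumption would predict.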

slide-23
SLIDE 23

23

Examples:MG county

  • Montgomery County of MD (road end-points)

slide-24
SLIDE 24

24

Examples:LB county

  • Long Beach county of CA (road end-points)

slide-25
SLIDE 25

25

Solution# 2: spatial d.m.

Galaxies ( ‘BOPS’ plot - [sigmod2000])

[Plot: log(# pairs) vs log(r)]

slide-26
SLIDE 26

26

Solution# 2: spatial d.m.

[Plot: log(# pairs within <= r) vs log(r), for spi-spi, spi-ell and ell-ell pairs]

  • 1.8 slope
  • plateau!
  • repulsion!
slide-28
SLIDE 28

28

spatial d.m.

r1 r2 r1 r2

Heuristic on choosing # of clusters

slide-30
SLIDE 30

30

spatial d.m.

[Plot: log(# pairs within <= r) vs log(r), for spi-spi, spi-ell and ell-ell pairs]

  • 1.8 slope
  • plateau!
  • repulsion!!
  • duplicates
slide-31
SLIDE 31

31

Solution # 3: traffic

  • disk traces: self-similar

[Plot: # bytes vs time]

slide-32
SLIDE 32

32

Solution # 3: traffic

  • disk traces (80-20 ‘law’ = ‘multifractal’)

[Plot: # bytes vs time, split 20% / 80%]
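One way to picture the 80-20 ‘law’ is as a recursive split: the total volume is divided between the two halves of the time interval in an 80/20 ratio, and the same split is repeated inside each half. The sketch below is an illustration under assumptions made here: the function name, the random choice of which half gets the larger share, and the parameter values are not taken from the slides.

```python
import random

def eighty_twenty_trace(total, levels, bias=0.8, seed=0):
    """Recursively split `total` over 2**levels time slots with a bias/(1-bias) rule."""
    rng = random.Random(seed)
    series = [float(total)]
    for _ in range(levels):
        split = []
        for volume in series:
            big, small = bias * volume, (1.0 - bias) * volume
            # Assumption: the heavier half is chosen at random at every split.
            if rng.random() < 0.5:
                split.extend((big, small))
            else:
                split.extend((small, big))
        series = split
    return series

trace = eighty_twenty_trace(total=1000.0, levels=10)
```

The resulting trace is bursty at every time scale: a few slots carry most of the volume, unlike a Poisson-style trace of the same total.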

slide-33
SLIDE 33

33

Solution# 3: traffic

Clarification:

  • fractal: a set of points that is self-similar
  • multifractal: a probability density function that is self-similar

Many other time-sequences are bursty/clustered: (such as?)

slide-34
SLIDE 34

34

Tape accesses

[Figure: accesses over time, Tape# 1 ... Tape# N]

# tapes needed, to retrieve n records? (# days down, due to failures / hurricanes / communication noise...)

slide-35
SLIDE 35

35

Tape accesses

[Figure: accesses over time, Tape# 1 ... Tape# N]

[Plot: # tapes retrieved vs # qual. records; curves: 50-50 = Poisson, real]


slide-37
SLIDE 37

37

More tools

  • Zipf’s law
  • Korcak’s law / “fat fractals”

slide-38
SLIDE 38

38

A famous power law: Zipf’s law

  • Q: vocabulary word frequency in a document - any pattern?

[Plot: freq. per vocabulary word, ‘aaron’ ... ‘zoo’]

slide-39
SLIDE 39

39

A famous power law: Zipf’s law

  • Bible - rank vs frequency (log-log)

[Plot: log(freq) vs log(rank); labeled words: “the”, “a”]

slide-40
SLIDE 40

40

A famous power law: Zipf’s law

  • Bible - rank vs frequency (log-log)
  • similarly, in many other languages; for customers and sales volume; city populations etc etc

[Plot: log(freq) vs log(rank)]

slide-41
SLIDE 41

41

A famous power law: Zipf’s law

  • Zipf distr: freq = 1 / rank
  • generalized Zipf: freq = 1 / (rank)^a

[Plot: log(freq) vs log(rank)]
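The exponent a of a generalized Zipf distribution can be read off a rank-frequency plot as the negated slope on log-log axes. A minimal sketch, assuming a plain least-squares fit over all ranks; the function name and the synthetic token list are illustrative only.

```python
import math
from collections import Counter

def zipf_exponent(tokens):
    """Estimate a in freq ~ 1 / rank^a by a least-squares fit on log-log axes."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Synthetic "document": the word of rank k appears 1200 / k times (pure Zipf, a = 1).
tokens = [f"w{k}" for k in range(1, 7) for _ in range(1200 // k)]
exponent = zipf_exponent(tokens)
```

On real text the low-frequency tail is noisy, so a fit over all ranks is only a rough estimate.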

slide-42
SLIDE 42

42

Olympic medals (Sydney):

[Plot: log(# medals) vs rank, with linear fit y = -0.9676x + 2.3054, R^2 = 0.9458]

slide-43
SLIDE 43

43

More power laws: areas – Korcak’s law

Scandinavian lakes Any pattern?

slide-44
SLIDE 44

44

More power laws: areas – Korcak’s law

Scandinavian lakes: area vs complementary cumulative count (log-log axes)

[Plot: log(count(>= area)) vs log(area)]
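The complementary cumulative count behind such a plot is simple to compute: for each distinct value, count how many items are at least that large. A minimal sketch; the function name `ccdf` and the sample areas are made up for illustration.

```python
def ccdf(values):
    """Return (value, count of items >= value) for each distinct value, ascending."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, v in enumerate(ordered):
        if i == 0 or v != ordered[i - 1]:
            points.append((v, n - i))  # the n - i items from position i onward are >= v
    return points

# Hypothetical areas, arbitrary units.
ccdf_points = ccdf([1, 2, 2, 5])
```

Plotting the logarithms of both coordinates gives the log(count(>= area)) vs log(area) plot; a straight line there indicates a power law.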

slide-45
SLIDE 45

45

More power laws: Korcak

Japan islands

slide-46
SLIDE 46

46

More power laws: Korcak

Japan islands: area vs cumulative count (log-log axes)

[Plot: log(count(>= area)) vs log(area)]

slide-47
SLIDE 47

47

(Korcak’s law: Aegean islands)

slide-48
SLIDE 48

48

Korcak’s law & “fat fractals”

How to generate such regions?

slide-49
SLIDE 49

49

Korcak’s law & “fat fractals”

Q: How to generate such regions?

A: recursively, from a single region

slide-50
SLIDE 50

50

so far we’ve seen:

  • concepts:
      • fractals, multifractals and fat fractals
  • tools:
      • correlation integral (= pair-count plot)
      • rank/frequency plot (Zipf’s law)
      • CCDF (Korcak’s law)


slide-52
SLIDE 52

52

Other applications: Internet

  • What does the Internet look like?

[Figure: Internet map; CMU labeled]

slide-53
SLIDE 53

53

Other applications: Internet

  • What does the Internet look like?
  • Internet routers: how many neighbors within h hops?

[Figure: Internet map; CMU labeled]

slide-54
SLIDE 54

54

(reminder: our tool-box:)

  • concepts:
      • fractals, multifractals and fat fractals
  • tools:
      • correlation integral (= pair-count plot)
      • rank/frequency plot (Zipf’s law)
      • CCDF (Korcak’s law)

slide-55
SLIDE 55

55

Internet topology

  • Internet routers: how many neighbors within h hops?

Reachability function: number of neighbors within h hops, vs h (log-log). Mbone routers, 1995.

[Plot: log(# pairs) vs log(hops); slope 2.8]
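The reachability function can be computed with one breadth-first search per node. The sketch below is a minimal illustration on a toy graph, not the procedure used on the Mbone data: the adjacency-dict representation and the function name are assumptions, and it counts ordered pairs of distinct nodes, which may differ from the exact convention in [sigcomm99].

```python
from collections import deque

def pairs_within_hops(adj, max_hops):
    """Return [P(1), ..., P(max_hops)]; P(h) = # ordered node pairs at most h hops apart."""
    exact = [0] * (max_hops + 1)  # exact[d] = # ordered pairs at distance exactly d
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == max_hops:
                continue  # do not expand beyond the hop limit
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    exact[dist[v]] += 1
                    queue.append(v)
    totals, running = [], 0
    for h in range(1, max_hops + 1):
        running += exact[h]
        totals.append(running)
    return totals

# Toy graph: a path 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
hop_counts = pairs_within_hops(path, 3)
```

Plotting log(P(h)) against log(h) and fitting the linear part gives the slope reported on the slide.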

slide-56
SLIDE 56

56

More power laws on the Internet

degree vs rank, for Internet domains (log-log) [sigcomm99]

[Plot: log(degree) vs log(rank); slope -0.82]
slide-57
SLIDE 57

57

More power laws - internet

  • pdf of degrees: (slope: -2.2)

[Plot: log(count) vs log(degree); slope -2.2]
slide-58
SLIDE 58

58

Even more power laws on the Internet

Scree plot for Internet domains (log-log) [sigcomm99]

[Plot: log(i-th eigenvalue) vs log(i); slope label 0.47]

slide-59
SLIDE 59

59

More apps: Brain scans

  • Oct-trees; brain-scans

[Plot: log(# octants) vs octree levels; slope 2.63 = fd]

slide-60
SLIDE 60

60

More apps: Medical images

[Burdett et al, SPIE ‘93]:

  • benign tumors: fd ~ 2.37
  • malignant: fd ~ 2.56

slide-61
SLIDE 61

61

More fractals:

  • cardiovascular system: 3 (!)
  • stock prices (LYCOS) - random walks: 1.5
  • Coastlines: 1.2-1.58 (Norway!)

[Plot: stock price over 1 year and over 2 years]

slide-62
SLIDE 62

62

slide-63
SLIDE 63

63

More power laws

  • duration of UNIX jobs [Harchol-Balter]
  • Energy of earthquakes (Gutenberg-Richter law) [simscience.org]

[Plots: log(freq) vs magnitude; amplitude vs day]

slide-64
SLIDE 64

64

Even more power laws:

  • publication counts (Lotka’s law)
  • Distribution of UNIX file sizes
  • Income distribution (Pareto’s law)
  • web hit counts [Huberman]

slide-65
SLIDE 65

65

Power laws, cont’d

  • In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER]
  • length of file transfers [Bestavros+]
  • Click-stream data (w/ A. Montgomery (CMU-GSIA) + MediaMetrix)


slide-67
SLIDE 67

67

Settings for fractals:

Points; areas (-> fat fractals), eg:

slide-68
SLIDE 68

68

Settings for fractals:

Points; areas, eg:

  • cities/stores/hospitals, over earth’s surface
  • time-stamps of events (customer arrivals, packet losses, criminal actions) over time
  • regions (sales areas, islands, patches of habitats) over space

slide-69
SLIDE 69

69

Settings for fractals:

  • customer feature vectors (age, income, frequency of visits, amount of sales per visit)

[Figure: ‘good’ customers vs ‘bad’ customers]

slide-70
SLIDE 70

70

Some uses of fractals:

  • Detect non-existence of rules (if points are uniform)
  • Detect non-homogeneous regions (e.g., legal login time-stamps may have different fd than intruders’)
  • Estimate number of neighbors / customers / competitors within a radius

slide-71
SLIDE 71

71

Multi-Fractals

Setting: points or objects, w/ some value, eg:

  • cities w/ populations
  • positions on earth and amount of gold/water/oil underneath
  • product ids and sales per product
  • people and their salaries
  • months and count of accidents

slide-72
SLIDE 72

72

Use of multifractals:

  • Estimate tape/disk accesses
      • how many of the 100 tapes contain my 50 phonecall records?
      • how many days without an accident?

[Figure: accesses over time, Tape# 1 ... Tape# N]
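For comparison, the textbook baseline assumes each record lands on a tape uniformly and independently, which gives the standard occupancy formula T * (1 - (1 - 1/T)^n) for the expected number of distinct tapes. The sketch below computes only this baseline, not the multifractal estimate the slides advocate; the function name is illustrative.

```python
def expected_tapes_uniform(num_tapes, num_records):
    """Expected # distinct tapes hit when records are placed uniformly and independently."""
    prob_tape_missed = (1.0 - 1.0 / num_tapes) ** num_records
    return num_tapes * (1.0 - prob_tape_missed)

# The slide's question: 50 records spread over 100 tapes.
baseline = expected_tapes_uniform(100, 50)
```

Under uniformity about 39.5 tapes are expected; when records cluster on a few tapes, as in skewed real data, the actual number is lower.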

slide-73
SLIDE 73

73

Use of multifractals

  • how often do we exceed the threshold?

[Plot: # bytes vs time; Poisson model for comparison]

slide-74
SLIDE 74

74

Use of multifractals cont’d

  • Extrapolations for/from samples

[Plot: # bytes vs time]

slide-75
SLIDE 75

75

Use of multifractals cont’d

  • How many distinct products account for 90% of the sales?

[Figure: 20% / 80% split]


slide-77
SLIDE 77

77

Conclusions

  • Real data often disobey textbook assumptions (Gaussian, Poisson, uniformity, independence)
      • avoid ‘mean’ - use median, or even better, use:
  • fractals, self-similarity, and power laws, to find patterns - specifically:

slide-78
SLIDE 78

78

Conclusions

  • tool# 1: (for points) ‘correlation integral’: (# pairs within <= r) vs (distance r)
  • tool# 2: (for categorical values) rank-frequency plot (a’la Zipf)
  • tool# 3: (for numerical values) CCDF: Complementary cumulative distr. function (# of elements with value >= a)

slide-79
SLIDE 79

79

Practitioner’s guide:

  • tool# 1: # pairs vs distance, for a set of objects, with a distance function (slope = intrinsic dimensionality)

[Plot (internet): log(# pairs) vs log(hops); slope 2.8]

[Plot (MGcounty): log(# pairs(within <= r)) vs log(r); slope 1.51]

slide-80
SLIDE 80

80

Practitioner’s guide:

  • tool# 2: rank-frequency plot (for categorical attributes)

[Plot (internet domains): log(degree) vs log(rank); slope -0.82]

[Plot (Bible): log(freq) vs log(rank)]

slide-81
SLIDE 81

81

Practitioner’s guide:

  • tool# 3: CCDF, for (skewed) numerical attributes (e.g., areas of islands/lakes, UNIX jobs...)

[Plot (Scandinavian lakes): log(count(>= area)) vs log(area)]

slide-82
SLIDE 82

82

Books

  • Strongly recommended intro book:
      • Manfred Schroeder, Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise, W.H. Freeman and Company, 1991
  • Classic book on fractals:
      • B. Mandelbrot, Fractal Geometry of Nature, W.H. Freeman, 1977

slide-83
SLIDE 83

83

References

  • [ieeeTN94] W. E. Leland, M. S. Taqqu, W. Willinger, D. V. Wilson, On the Self-Similar Nature of Ethernet Traffic, IEEE/ACM Transactions on Networking, 2, 1, pp. 1-15, Feb. 1994.
  • [pods94] Christos Faloutsos and Ibrahim Kamel, Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension, PODS, Minneapolis, MN, May 24-26, 1994, pp. 4-13.

slide-84
SLIDE 84

84

References

  • [vldb95] Alberto Belussi and Christos Faloutsos, Estimating the Selectivity of Spatial Queries Using the `Correlation' Fractal Dimension, Proc. of VLDB, pp. 299-310, 1995.
  • [vldb96] Christos Faloutsos, Yossi Matias and Avi Silberschatz, Modeling Skewed Distributions Using Multifractals and the `80-20 Law', Conf. on Very Large Data Bases (VLDB), Bombay, India, Sept. 1996.

slide-85
SLIDE 85

85

References

  • [vldb96] Christos Faloutsos and Volker Gaede, Analysis of the Z-Ordering Method Using the Hausdorff Fractal Dimension, VLDB, Bombay, India, Sept. 1996.
  • [sigcomm99] Michalis Faloutsos, Petros Faloutsos and Christos Faloutsos, What does the Internet look like? Empirical Laws of the Internet Topology, SIGCOMM 1999.

slide-86
SLIDE 86

86

References

  • [icde99] Guido Proietti and Christos Faloutsos, I/O complexity for range queries on region data stored using an R-tree, International Conference on Data Engineering (ICDE), Sydney, Australia, March 23-26, 1999.
  • [sigmod2000] Christos Faloutsos, Bernhard Seeger, Agma J. M. Traina and Caetano Traina Jr., Spatial Join Selectivity Using Power Laws, SIGMOD 2000.

slide-87
SLIDE 87

87

Appendix - Gory details

  • Bad news: there is more than one fractal dimension
      • Minkowski fd; Hausdorff fd; Correlation fd; Information fd
  • Great news:
      • they can all be computed fast!
      • they usually have nearby values

slide-88
SLIDE 88

88

Fast estimation of fd(s):

  • How, for the (correlation) fractal dimension?
  • A: Box-counting plot:

[Plot: log(sum(pi^2)) vs log(r), for a grid of side r whose i-th cell holds pi of the points]

slide-89
SLIDE 89

89

Definitions

  • pi: the percentage (or count) of points in the i-th cell
  • r: the side of the grid

slide-90
SLIDE 90

90

Fast estimation of fd(s):

  • compute sum(pi^2) for another grid side, r’

[Plot: log(sum(pi^2)) vs log(r), with the new point for grid side r’]

slide-91
SLIDE 91

91

Fast estimation of fd(s):

  • etc; if the resulting plot has a linear part, its slope is the correlation fractal dimension D2

[Plot: log(sum(pi^2)) vs log(r)]
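The box-counting procedure for D2 can be sketched as follows. This is a minimal illustration under assumptions made here (function names, a least-squares fit over all supplied grid sides, points given as coordinate tuples), not the code referred to in the slides.

```python
import math
from collections import Counter

def sum_squared_occupancies(points, r):
    """sum(pi^2), where pi is the fraction of points in the i-th grid cell of side r."""
    cells = Counter(tuple(math.floor(c / r) for c in p) for p in points)
    n = len(points)
    return sum((k / n) ** 2 for k in cells.values())

def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return num / sum((a - mx) ** 2 for a in xs)

def correlation_fd(points, sides):
    """Slope of log(sum(pi^2)) versus log(r) over the given grid sides."""
    xs = [math.log(r) for r in sides]
    ys = [math.log(sum_squared_occupancies(points, r)) for r in sides]
    return fit_slope(xs, ys)

# Sanity check: points evenly spaced along a diagonal line should give D2 = 1.
line = [(i / 4096, i / 4096) for i in range(4096)]
d2_line = correlation_fd(line, [1 / 4, 1 / 8, 1 / 16, 1 / 32])
```

Each grid needs a single pass over the points, which is where the fast running time comes from. On real data the fit should be restricted to the linear part of the plot.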

slide-92
SLIDE 92

92

Definitions (cont’d)

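  • Many more fractal dimensions Dq (related to Renyi entropies):

$$D_q = \frac{1}{q-1}\,\frac{\partial \log\left(\sum_i p_i^q\right)}{\partial \log(r)} \qquad (q \neq 1)$$

$$D_1 = \frac{\partial \sum_i p_i \log(p_i)}{\partial \log(r)} \qquad (q = 1)$$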

slide-93
SLIDE 93

93

Hausdorff or box-counting fd:

  • Box counting plot: log(N(r)) vs log(r)
  • r: grid side
  • N(r): count of non-empty cells
  • (Hausdorff) fractal dimension D0:

$$D_0 = -\frac{\partial \log(N(r))}{\partial \log(r)}$$
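A minimal sketch of the D0 estimate: count the non-empty grid cells for several grid sides and negate the fitted log-log slope. The function names and the test grid are illustrative assumptions, not the code referred to in the slides.

```python
import math

def count_nonempty_cells(points, r):
    """N(r): number of grid cells of side r that contain at least one point."""
    return len({tuple(math.floor(c / r) for c in p) for p in points})

def box_counting_fd(points, sides):
    """D0 = -(slope of log(N(r)) versus log(r)), via a least-squares fit."""
    xs = [math.log(r) for r in sides]
    ys = [math.log(count_nonempty_cells(points, r)) for r in sides]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return -slope

# Sanity check: a uniformly filled unit square should give D0 = 2.
square = [(i / 64, j / 64) for i in range(64) for j in range(64)]
d0_square = box_counting_fd(square, [1 / 4, 1 / 8, 1 / 16])
```

For grid sides much smaller than the spacing between points, N(r) saturates at the number of points, so such sides should be left out of the fit.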

slide-94
SLIDE 94

94

Definitions (cont’d)

  • Hausdorff fd:

[Plot: log(# non-empty cells) vs log(r); D0 is read from the slope]

slide-95
SLIDE 95

95

Observations

  • q = 0: Hausdorff fractal dimension
  • q = 2: Correlation fractal dimension (identical to the exponent of the number of neighbors vs radius)
  • q = 1: Information fractal dimension

slide-96
SLIDE 96

96

Observations, cont’d

  • in general, the Dq’s take similar, but not identical, values.
  • except for perfectly self-similar point-sets, where Dq = Dq’ for any q, q’

slide-97
SLIDE 97

97

Examples:MG county

  • Montgomery County of MD (road end-points)

slide-98
SLIDE 98

98

Examples:LB county

  • Long Beach county of CA (road end-points)

slide-99
SLIDE 99

99

Conclusions

  • many fractal dimensions, with nearby values
  • can be computed quickly (O(N) or O(N log(N)))
  • (code: on the web)