Power-Law Distributions in Empirical Data


SLIDE 1

Power-Law Distributions in Empirical Data

Article for Advanced Methods in Applied Statistics Christian Anker Rosiek 8th March 2018

Christian Anker Rosiek Power-Law Distributions in Empirical Data 1 / 14

SLIDE 2

SIAM REVIEW © 2009 Society for Industrial and Applied Mathematics

Vol. 51, No. 4, pp. 661-703

Power-Law Distributions in Empirical Data*

Aaron Clauset†
Cosma Rohilla Shalizi‡
M. E. J. Newman§

Abstract. Power-law distributions occur in many situations of scientific interest and have significant consequences for our understanding of natural and man-made phenomena. Unfortunately, the detection and characterization of power laws is complicated by the large fluctuations that occur in the tail of the distribution (the part of the distribution representing large but rare events) and by the difficulty of identifying the range over which power-law behavior holds. Commonly used methods for analyzing power-law data, such as least-squares fitting, can produce substantially inaccurate estimates of parameters for power-law distributions, and even in cases where such methods return accurate answers they are still unsatisfactory because they give no indication of whether the data obey a power law at all. Here we present a principled statistical framework for discerning and quantifying power-law behavior in empirical data. Our approach combines maximum-likelihood fitting methods with goodness-of-fit tests based on the Kolmogorov-Smirnov (KS) statistic and likelihood ratios. We evaluate the effectiveness of the approach with tests on synthetic data and give critical comparisons to previous approaches. We also apply the proposed methods to twenty-four real-world data sets from a range of different disciplines, each of which has been conjectured to follow a power-law distribution. In some cases we find these conjectures to be consistent with the data, while in others the power law is ruled out.

Key words. power-law distributions, Pareto, Zipf, maximum likelihood, heavy-tailed distributions, likelihood ratio test, model selection

AMS subject classifications. 62-07, 62P99, 65C05, 62F99

DOI. 10.1137/070710111

1. Introduction. Many empirical quantities cluster around a typical value. The speeds of cars on a highway, the weights of apples in a store, air pressure, sea level, the temperature in New York at noon on a midsummer's day: all of these things vary somewhat, but their distributions place a negligible amount of probability far from the typical value, making the typical value representative of most observations. For instance, it is a useful statement to say that an adult male American is about 180 cm tall because no one deviates very far from this height. Even the largest deviations, which are exceptionally rare, are still only about a factor of two from the mean in

* Received by the editors December 2, 2007; accepted for publication (in revised form) February 2, 2009; published electronically November 6, 2009. This work was supported in part by the Santa Fe Institute (AC) and by grants from the James S. McDonnell Foundation (CRS and MEJN) and the National Science Foundation (MEJN).

http://www.siam.org/journals/sirev/51-4/71011.html

† Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, and Department of Computer Science, University of New Mexico, Albuquerque, NM 87131.
‡ Department of Statistics, Carnegie Mellon University, Pittsburgh, PA 15213.
§ Department of Physics and Center for the Study of Complex Systems, University of Michigan, Ann Arbor, MI 48109.

doi:10.1137/070710111 http://tuvalu.santafe.edu/~aaronc/powerlaws/


SLIDE 3

Power law distributions

Continuous distribution

$$p(x) = \frac{\alpha - 1}{x_{\min}} \left( \frac{x}{x_{\min}} \right)^{-\alpha} \tag{1}$$

Discrete distribution

$$p(x) = \frac{x^{-\alpha}}{\zeta(\alpha, x_{\min})} \tag{2}$$
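
As a rough illustration of Eqs. (1) and (2), both densities can be evaluated in a few lines of Python. This is a sketch and not code from the slides or the article: the function names are invented here, and the Hurwitz zeta function ζ(α, x_min) = Σ_{k≥0} (k + x_min)^(−α) is approximated by a partial sum plus an Euler-Maclaurin estimate of the tail.

```python
import math

def hurwitz_zeta(s, q, terms=1000):
    """Approximate zeta(s, q) = sum_{k>=0} (k + q)**(-s) for s > 1 and q > 0."""
    head = sum((q + k) ** (-s) for k in range(terms))
    a = q + terms
    # Euler-Maclaurin estimate of the remaining sum over k >= terms
    tail = a ** (1 - s) / (s - 1) + 0.5 * a ** (-s) + s * a ** (-s - 1) / 12
    return head + tail

def continuous_pdf(x, alpha, xmin):
    """Eq. (1): density of the continuous power law, zero below xmin."""
    if x < xmin:
        return 0.0
    return (alpha - 1) / xmin * (x / xmin) ** (-alpha)

def discrete_pmf(x, alpha, xmin):
    """Eq. (2): probability of the integer x under the discrete power law."""
    if x < xmin:
        return 0.0
    return x ** (-alpha) / hurwitz_zeta(alpha, xmin)

print(continuous_pdf(1.0, alpha=3.5, xmin=1.0))  # equals (alpha - 1) / xmin
print(discrete_pmf(1, alpha=3.5, xmin=1))
```

The normalizing constants differ because the continuous form integrates over x ≥ x_min while the discrete form sums over integers x ≥ x_min.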


SLIDE 4

Power-law histogram (continuous distribution)

[Figure: histogram of p(x) versus x on linear axes.]

$n = 10\,000$, $\alpha = 3.5$, $x_{\min} = 1$.
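
A sample with these parameters can be drawn by inverse-transform sampling: if u is uniform on [0, 1), then x = x_min (1 − u)^(−1/(α − 1)) follows the continuous power law. The sketch below assumes that method, since the slides do not say how the sample was generated; the function name and the seed are illustrative.

```python
import random

def sample_power_law(n, alpha, xmin, seed=None):
    """Draw n values from the continuous power law with exponent alpha, x >= xmin."""
    rng = random.Random(seed)
    exponent = -1.0 / (alpha - 1.0)
    # rng.random() is in [0, 1), so 1 - u is in (0, 1] and the power stays finite
    return [xmin * (1.0 - rng.random()) ** exponent for _ in range(n)]

data = sample_power_law(10_000, alpha=3.5, xmin=1.0, seed=1)
print(min(data), max(data))
```

Most values land close to x_min and only a few are large, which is why a histogram on linear axes shows little beyond the first bins.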


SLIDE 5

Power-law histogram (continuous distribution)

[Figure: histogram of p(x) versus x on log-log axes.]

$n = 10\,000$, $\alpha = 3.5$, $x_{\min} = 1$.


SLIDE 6

Linear least squares fit

[Figure: histogram of p(x) versus x on log-log axes, with two lines: the LS + PDF fit and the true α.]

$n = 10\,000$, $\alpha = 3.5$, $x_{\min} = 1$ → $\hat{\alpha}_{\mathrm{LS}} = 3.34(10)$.
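
The "LS + PDF" estimate comes from fitting a straight line to the histogram on log-log axes and reading α off the slope. A minimal sketch of that procedure follows. The linear binning, the bin count, and both helper names are assumptions made here, as the slides do not state them.

```python
import math

def histogram_density(data, bins, lo, hi):
    """Normalized histogram on [lo, hi): returns (bin centers, densities)."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        if lo <= x < hi:
            # min() guards against a rounding error at the upper edge
            counts[min(int((x - lo) / width), bins - 1)] += 1
    n = len(data)
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    return centers, [c / (n * width) for c in counts]

def loglog_least_squares_alpha(xs, ys):
    """Fit ln y = c - alpha * ln x by ordinary least squares and return alpha."""
    # empty bins are skipped because their logarithm is undefined
    pts = [(math.log(x), math.log(y)) for x, y in zip(xs, ys) if x > 0 and y > 0]
    mx = sum(px for px, _ in pts) / len(pts)
    my = sum(py for _, py in pts) / len(pts)
    sxy = sum((px - mx) * (py - my) for px, py in pts)
    sxx = sum((px - mx) ** 2 for px, _ in pts)
    return -sxy / sxx

# Noise-free check: points lying exactly on a power law recover its exponent.
xs = [1.0, 2.0, 4.0, 8.0]
print(loglog_least_squares_alpha(xs, [2.5 * x ** -3.5 for x in xs]))
```

On a real sample, the output of `histogram_density` would be passed to `loglog_least_squares_alpha`; the sparsely populated tail bins then carry the same weight in the fit as the well-populated ones.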


SLIDE 7

Maximum likelihood parameter estimation

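Continuous distribution

$$\hat{\alpha}_{\mathrm{MLE}} = 1 + n \left[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{\min}} \right]^{-1} \tag{3}$$

Discrete distribution

$$\hat{\alpha}_{\mathrm{MLE}} = \arg\max_{\alpha} \mathcal{L} \tag{4a}$$

with

$$\mathcal{L} = -n \ln \zeta(\alpha, x_{\min}) - \alpha \sum_{i=1}^{n} \ln x_i \tag{4b}$$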
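
Eq. (3) is a closed-form expression, whereas Eqs. (4a) and (4b) require a one-dimensional numerical maximization. The sketch below is one possible reading, not the article's implementation. The leading-order standard error (α̂ − 1)/√n is the one given in article [1]. The golden-section search and its bracket [1.01, 6] are choices made here; they are adequate because (4b) is concave in α. The Hurwitz zeta function is approximated by a partial sum with an Euler-Maclaurin tail.

```python
import math

def mle_continuous(data, xmin):
    """Eq. (3) on the observations >= xmin; returns (alpha_hat, standard error)."""
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    alpha = 1.0 + n / sum(math.log(x / xmin) for x in tail)
    return alpha, (alpha - 1.0) / math.sqrt(n)

def hurwitz_zeta(s, q, terms=200):
    """Approximate sum_{k>=0} (k + q)**(-s) with an Euler-Maclaurin tail."""
    head = sum((q + k) ** (-s) for k in range(terms))
    a = q + terms
    return head + a ** (1 - s) / (s - 1) + 0.5 * a ** (-s) + s * a ** (-s - 1) / 12

def log_likelihood_discrete(alpha, n, sum_log, xmin):
    """Eq. (4b), with sum_log = sum of ln x_i over the tail."""
    return -n * math.log(hurwitz_zeta(alpha, xmin)) - alpha * sum_log

def mle_discrete(data, xmin, lo=1.01, hi=6.0, tol=1e-6):
    """Eq. (4a): maximize Eq. (4b) over alpha by golden-section search."""
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    sum_log = sum(math.log(x) for x in tail)

    def f(a):
        return log_likelihood_discrete(a, n, sum_log, xmin)

    g = (math.sqrt(5) - 1) / 2
    c, d = hi - g * (hi - lo), lo + g * (hi - lo)
    while hi - lo > tol:
        if f(c) > f(d):      # maximum lies in [lo, d]
            hi, d = d, c
            c = hi - g * (hi - lo)
        else:                # maximum lies in [c, hi]
            lo, c = c, d
            d = lo + g * (hi - lo)
    return (lo + hi) / 2

print(mle_continuous([math.e, math.e, math.e], xmin=1.0))
```

A result written as 3.51(2) would correspond to `alpha` and the rounded standard error returned by `mle_continuous`, assuming that is how the slides report uncertainty.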


SLIDE 8

Maximum likelihood parameter estimation

[Figure: histogram of p(x) versus x on log-log axes, with three lines: LS + PDF, true α, and Cont. MLE.]

$n = 10\,000$, $\alpha = 3.5$, $x_{\min} = 1$ → $\hat{\alpha}_{\mathrm{MLE}} = 3.51(2)$.


SLIDE 9

Parameter estimation comparison

[Figure: estimated α versus true α, both from 1.5 to 3.5, in panels (a) and (b). Legend: Disc. MLE, Cont. MLE, LS + PDF, LS + CDF.]

Article [1] Figure 3.2. Different α-estimators used with (a) discrete and (b) continuous power laws.


SLIDE 10

US city population

[Figure: P(x) versus city population x on log-log axes.]

$$P(x) = \int_{x}^{\infty} p(x')\,dx'$$
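
For a finite sample, the complementary CDF P(x) is commonly estimated as the fraction of observations that are at least x. For the continuous power law this integral evaluates to (x/x_min)^(−(α − 1)), a straight line on log-log axes. A minimal sketch of the empirical estimate follows. The function name is illustrative and the small data set is made up; it is not the city population data.

```python
def empirical_ccdf(data):
    """Return (x, P(x)) pairs, with P(x) the fraction of observations >= x."""
    xs = sorted(data)
    n = len(xs)
    points = []
    for i, x in enumerate(xs):
        # at the first occurrence of x, exactly n - i observations are >= x
        if i == 0 or x != xs[i - 1]:
            points.append((x, (n - i) / n))
    return points

print(empirical_ccdf([3, 1, 2, 2, 10]))
# [(1, 1.0), (2, 0.8), (3, 0.4), (10, 0.2)]
```

Unlike a histogram, this estimate needs no bin width, and every observation contributes one step.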


SLIDE 11

Estimating cut-off $x_{\min}$

[Figure: estimated α versus estimated $x_{\min}$.]

Article [1] Figure 3.3. 5000 samples with $\alpha = 2.5$, $x_{\min} = 100$, averaged over 2500 trials.


SLIDE 12

Estimating cut-off $x_{\min}$

[Figure: estimated α versus estimated $x_{\min}$ (logarithmic axis, $10^0$ to $10^4$).]

5000 samples with $\alpha = 2.5$, $x_{\min} = 100$. 10 individual trials.


SLIDE 13

Estimating cut-off $x_{\min}$

One method: maximize the similarity between the measured data distribution and the best-fit distribution.
Similarity is measured here with the Kolmogorov-Smirnov test statistic:

$$D = \max_{x \ge x_{\min}} |S(x) - P(x)| \tag{5}$$

where S(x) is the CDF of the measured data and P(x) is the CDF of the best-fit model. The chosen $x_{\min}$ is the one that minimizes D.

Additionally, a Monte Carlo goodness-of-fit (GOF) test is proposed: sample a large number of artificial data sets from the distribution with the best-fit parameters.
The p-value is then the fraction of simulated data sets whose D is worse (larger) than that of the measured data. (Note: a greater p-value is better.)
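
Both steps can be sketched in Python for the continuous case. This is an illustration under stated assumptions, not the article's code: all function names are invented here, each distinct data value is tried as a candidate cut-off, and the Monte Carlo step is simplified by holding the cut-off fixed and resampling only the tail. The fuller procedure in article [1] also accounts for the data below the cut-off.

```python
import math
import random

def fit_alpha(tail, xmin):
    """Continuous MLE of alpha for observations that are all >= xmin."""
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

def ks_distance(sorted_tail, alpha, xmin):
    """D = max |empirical CDF - model CDF| over a sorted tail (all >= xmin)."""
    n = len(sorted_tail)
    d = 0.0
    for i, x in enumerate(sorted_tail):
        model = 1.0 - (x / xmin) ** (1.0 - alpha)
        # the empirical CDF jumps from i/n to (i+1)/n at x; check both sides
        d = max(d, abs(model - i / n), abs(model - (i + 1) / n))
    return d

def estimate_xmin(data, min_tail=10):
    """Try each distinct value as the cut-off; return (D, xmin, alpha) with smallest D."""
    xs = sorted(data)
    best = None
    for i in range(len(xs) - min_tail):
        if i > 0 and xs[i] == xs[i - 1]:
            continue  # same candidate as the previous index
        tail = xs[i:]
        alpha = fit_alpha(tail, xs[i])
        d = ks_distance(tail, alpha, xs[i])
        if best is None or d < best[0]:
            best = (d, xs[i], alpha)
    return best

def gof_p_value(tail, xmin, sims=200, seed=0):
    """Fraction of synthetic samples whose D is at least the observed D."""
    rng = random.Random(seed)
    tail = sorted(tail)
    n = len(tail)
    alpha = fit_alpha(tail, xmin)
    d_obs = ks_distance(tail, alpha, xmin)
    power = -1.0 / (alpha - 1.0)
    worse = 0
    for _ in range(sims):
        # synthetic sample from the best-fit power law, refitted before measuring D
        synth = sorted(xmin * (1.0 - rng.random()) ** power for _k in range(n))
        d_syn = ks_distance(synth, fit_alpha(synth, xmin), xmin)
        if d_syn >= d_obs:
            worse += 1
    return worse / sims
```

The scan in `estimate_xmin` is quadratic in the sample size, which is acceptable for a few thousand points. Refitting α for every synthetic sample means each D is measured against that sample's own best fit, just as the observed D is.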


SLIDE 14

US city population

[Figure: P(x) versus city population x on log-log axes.]

$$P(x) = \int_{x}^{\infty} p(x')\,dx'$$


SLIDE 15

Rounding off

Not covered here:
  Model comparison using likelihood ratios.
  Application to real-world datasets.
  Appendices: mathematical and computational details, e.g. MLE convergence, sampling from power-law distributions, etc.

Follow-up article [2]: Power-law distributions in binned empirical data.

References:
[1] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Review 51, 661–703 (2009).
[2] Y. Virkar and A. Clauset. Power-law distributions in binned empirical data. The Annals of Applied Statistics 8, 89–119 (2014).
