Challenges of forecasting with fat tailed data Aaron Clauset - PowerPoint PPT Presentation

Challenges of forecasting with fat tailed data
Aaron Clauset (@aaronclauset)
Assistant Professor, Computer Science and BioFrontiers Institute, University of Colorado Boulder
External Faculty, Santa Fe Institute
15 October 2013

joint work with Mark Newman, Cosma Shalizi, Ryan Woodard

1. predicting the unpredictable
2. modeling rare events
3. historical probability
4. statistical forecast
5. financial data
6. outlook

1. predicting the unpredictable
complex systems: “heavy” or “fat” tailed quantities
• book sales
• earthquakes
• terrorist attacks
• civil or international wars
• stock market crashes
• electrical power outages
• solar flare intensity
• etc.

24 empirical data sets
[Figure: 24 panels, (a) through (x), each plotting P(x) against x on log-log axes. Panel titles: cities, email, fires, words, proteins, metabolic, flares, quakes, religions, Internet, calls, wars, surnames, wealth, citations, terrorism, HTTP, species, authors, web hits, web links, birds, blackouts, book sales.]


• 1906 San Francisco, M7.8
• 2008 Sichuan, M7.9
• 2011 Japan, M8.9

earthquake physics
Gutenberg-Richter law: frequency vs. size
$(\text{frequency}) \propto (\text{seismic moment})^{-\alpha}$
[Figure: magnitude M against earthquake number, with the proportion of earthquakes ≥ M.]

earthquakes vs. wars
inter-state wars, 1816-2007
$(\text{frequency}) \propto (\text{deaths})^{-\alpha}$
[Figure: battle deaths S (severity) against interstate war number (1816-2007), with WWI and WWII labeled and the proportion of wars ≥ S.]

earthquakes vs. global terrorism
Jan. 1998-2008: 13,274 deadly attacks worldwide
Richardson’s law (1941)
$(\text{frequency}) \propto (\text{deaths})^{-\alpha}$
[Figure: severity S (deaths) against attack number (Jan 1998-2008), with 9-11 labeled and the proportion of attacks ≥ S.]

earthquakes
• Gutenberg-Richter law: $F \propto M^{-\alpha}$
• physics largely known
• processes fixed
• forecasting possible (years of successes)
• prediction very hard (years of failures)

terrorism & insurgency
• Richardson’s law: $F \propto S^{-\alpha}$
• processes largely unknown
• processes dynamic, adaptive
• how do we forecast?
• what can we predict?
• what can we not predict?

2. modeling rare events
• not in financial markets (yet)
• but in global terrorism
• how probable was a 9/11-sized event?
• how probable is another 9/11-sized event?

deadly terrorist events, 1968-2008 (RAND-MIPT event database)
number of incidents by deaths per attack:
• 1-9 deaths: 12,280
• 10-99 deaths: 957
• 100-999 deaths: 36
• 1000+ deaths: 1

deadly terrorist events, 1968-2008 (RAND-MIPT event database)
number of incidents by deaths per attack:
• 1-9 deaths: 12,280
• 10-99 deaths: 957
• 100-999 deaths: 36
• 1000+ deaths: 1
brackets on the chart, from smallest to largest events: “normal,” 92%; large, 8%; very large, 0.3%
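The bracket percentages can be roughly checked against the bin counts. The sketch below assumes, since the flattened chart does not say so explicitly, that "normal" means 1-9 deaths, "large" means 10 or more, and "very large" means 100 or more. The computed shares land close to the chart's rounded labels but need not match them digit for digit.

```python
# Counts per severity bin, copied from the slide (RAND-MIPT, 1968-2008).
counts = {"1-9": 12280, "10-99": 957, "100-999": 36, "1000+": 1}

total = sum(counts.values())


def share(bins):
    """Fraction of all deadly events that fall in the given bins."""
    return sum(counts[b] for b in bins) / total


# Assumed bracket boundaries (not stated in the transcript).
normal = share(["1-9"])
large = share(["10-99", "100-999", "1000+"])
very_large = share(["100-999", "1000+"])

print(f"total events: {total}")
print(f"1-9 deaths:   {normal:.1%}")
print(f"10+ deaths:   {large:.1%}")
print(f"100+ deaths:  {very_large:.2%}")
```

The total matches the 13,274 deadly events quoted elsewhere in the deck.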

how probable was a 9/11-sized event?
requires a probability model Pr(x)

how probable was a 9/11-sized event?
requires a probability model Pr(x)
key observations
• care only about large events: disproportionate consequences
• unknown upper tail structure: several models fit well
• little data in upper tail: large statistical uncertainty

how probable was a 9/11-sized event?
requires a probability model Pr(x)
key observations, each with the response it motivates
• care only about large events (disproportionate consequences) → separate tail from body
• unknown upper tail structure (several models fit well) → multiple tail models
• little data in upper tail (large statistical uncertainty) → distribution over conclusions
model-based, data-driven forecasts

step 1: the data
• Terrorism event data from RAND-MIPT Terrorism Knowledge Base (2008).
• 40 years of data (1968-2007)
• Worldwide (~200 countries)
• 13,274 deadly events
• Each event is localized in time and space, and MIPT records its severity (deaths).
• 9/11 recorded as three events; the NYC event records 2749 deaths.
[Figure: Pr(X ≥ x) against severity x (deaths) on log-log axes, with 9/11 marked.]
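The plotted quantity, Pr(X ≥ x), is the empirical complementary CDF of the severities. A minimal sketch of how it could be computed follows; the function name and the toy severities are illustrative assumptions, not the RAND-MIPT data.

```python
from collections import Counter


def empirical_ccdf(values):
    """Return sorted (x, Pr(X >= x)) pairs for the distinct values in `values`."""
    n = len(values)
    counts = Counter(values)
    remaining = n  # number of observations >= the current x
    ccdf = []
    for x in sorted(counts):
        ccdf.append((x, remaining / n))
        remaining -= counts[x]
    return ccdf


# Made-up severities (deaths per event), for illustration only.
severities = [1, 1, 1, 2, 2, 3, 5, 8, 13, 40]
for x, p in empirical_ccdf(severities):
    print(x, p)
```

Plotting these pairs on log-log axes gives the kind of curve shown on the slide; a power-law tail appears as a roughly straight line.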

step 2: separate tail from body
Choose $x_{\min} = y$ such that $d\left[ S(x \ge y),\, F(x \ge y \mid \hat{\theta}) \right]$ is minimized. Here, we let $d[\cdot,\cdot]$ be the KS-statistic.
[Figure: Pr(X ≥ x) against severity x (deaths) on log-log axes, with the body and tail regions and 9/11 marked.]
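The threshold search can be sketched as a scan over candidate values of y: fit the tail model above y, measure the KS distance between the empirical and fitted tails, and keep the minimizer. The sketch assumes a continuous Pareto tail model and continuous data; the function names, the `min_tail` cutoff, and the synthetic data are illustrative choices rather than anything specified on the slide.

```python
import math
import random


def fit_alpha(tail, xmin):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x / xmin))."""
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)


def ks_distance(tail, xmin, alpha):
    """Largest gap between the empirical and fitted CDFs of the tail."""
    ordered = sorted(tail)
    n = len(ordered)
    gap = 0.0
    for i, x in enumerate(ordered):
        model_cdf = 1.0 - (x / xmin) ** (1.0 - alpha)
        # The empirical CDF steps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return gap


def choose_xmin(data, min_tail=10):
    """Scan candidate thresholds y and keep the one with the smallest KS distance."""
    best = None
    for y in sorted(set(data)):
        tail = [x for x in data if x >= y]
        if len(tail) < min_tail:
            break  # too few points left for a meaningful fit
        if max(tail) == y:
            continue  # every tail value equals y, so the MLE is undefined
        alpha = fit_alpha(tail, y)
        d = ks_distance(tail, y, alpha)
        if best is None or d < best[0]:
            best = (d, y, alpha)
    return best  # (KS distance, x_min, alpha)


# Synthetic data: a non-power-law body below 10 and a Pareto tail above it.
rng = random.Random(1)
body = [rng.uniform(1.0, 10.0) for _ in range(300)]
tail = [10.0 * (1.0 - rng.random()) ** (-1.0 / 1.4) for _ in range(200)]
d, xmin, alpha = choose_xmin(body + tail)
print(f"x_min = {xmin:.2f}, alpha = {alpha:.2f}, KS distance = {d:.3f}")
```

Event severities are integers, so a faithful treatment would use the discrete power-law likelihood; the continuous form is used here only to keep the sketch short.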

step 2: model the upper tail
Let $\Pr(x) \propto x^{-\alpha}$ for values $x \ge x_{\min}$ (Pareto distribution).
For the empirical data, we estimate $\hat{\alpha} = 2.4 \pm 0.1$, with $x_{\min} = 10$. This yields 994 tail events (7.5%).
A Monte Carlo hypothesis test finds $p = 0.68 \pm 0.03$, meaning the power law cannot be rejected as a model of these data.
A likelihood ratio test finds the stretched exponential and log-normal distributions also plausible.
[Figure: Pr(X ≥ x) against severity x (deaths) on log-log axes, with the body, the fitted tail, and 9/11 marked.]
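As a rough worked example, the point estimates on this slide can be plugged into the Pareto tail to ask how likely an event of at least 2749 deaths is. This is a sketch under strong assumptions that are not from the slides: a continuous Pareto tail, independent events, and no parameter uncertainty. It is not the talk's final estimate.

```python
import math

# Point estimates quoted on the slides.
alpha = 2.4      # fitted tail exponent
xmin = 10.0      # tail threshold (deaths)
n_tail = 994     # number of events with x >= xmin
x_911 = 2749.0   # recorded severity of the NYC 9/11 event

# For a continuous Pareto tail with density proportional to x^(-alpha):
# Pr(X >= x | X >= xmin) = (x / xmin)^(1 - alpha)
q = (x_911 / xmin) ** (1.0 - alpha)

# Treating the n_tail events as independent draws from the tail (an assumption):
expected_count = n_tail * q
p_at_least_one = 1.0 - (1.0 - q) ** n_tail

print(f"per-event tail probability:   {q:.2e}")
print(f"expected number of such events: {expected_count:.2f}")
print(f"Pr(at least one such event):  {p_at_least_one:.2f}")
```

The later steps of the method exist precisely because a single plug-in number like this hides the statistical and model uncertainty.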

step 3: bootstrap the data and repeat
Given $n$ observed event sizes, generate $Y$ by drawing $y_j$, $j = 1, \dots, n$, uniformly at random, with replacement, from the observed events $X = \{x_i\}$.
For each tail model $\Pr(x \mid \theta, x_{\min})$, the MLE parameter choice $\theta(Y, x_{\min})$ is deterministic.
This produces a bootstrap distribution $\Pr(\theta, x_{\min})$ that captures the statistical uncertainty within this model.
[Figure: Pr(X ≥ x) against severity x (deaths) on log-log axes for the Pareto distribution, with 9/11 marked; the inset shows the bootstrap distribution $\Pr(\hat{\alpha})$ over roughly 2.2 to 2.6.]
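The resampling loop can be sketched as follows. To stay short, this version holds x_min fixed and re-estimates only alpha, whereas the slide describes a joint distribution over (θ, x_min). The continuous Pareto MLE, the function names, and the synthetic data are assumptions made for illustration.

```python
import math
import random


def fit_alpha(data, xmin):
    """Continuous power-law MLE for alpha, using only values >= xmin."""
    tail = [x for x in data if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)


def bootstrap_alpha(data, xmin, n_boot=200, seed=0):
    """Refit alpha on n_boot resamples drawn with replacement from data."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(data, k=len(data))  # uniform, with replacement
        estimates.append(fit_alpha(resample, xmin))
    return estimates


# Synthetic tail data with a known exponent (alpha = 2.4, xmin = 10).
rng = random.Random(42)
data = [10.0 * (1.0 - rng.random()) ** (-1.0 / 1.4) for _ in range(1000)]

estimates = sorted(bootstrap_alpha(data, xmin=10.0))
mean_alpha = sum(estimates) / len(estimates)
low, high = estimates[4], estimates[-5]  # rough central 95% range of 200 draws
print(f"mean alpha = {mean_alpha:.2f}, approx. 95% range = [{low:.2f}, {high:.2f}]")
```

Each refit is deterministic given its resample, so the spread of the estimates reflects sampling variability only, which is the uncertainty the slide attributes to the bootstrap.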

step 4: repeat with alternative models
Repeat the above steps, but with additional tail models. Here, we choose:
• Stretched exponential: $\Pr(x) \propto x^{\beta - 1} e^{-\lambda x^{\beta}}$
• Log-normal: $\Pr(x) \propto \frac{1}{x} e^{-(\ln x - \mu)^2 / (2\sigma^2)}$
Neither can be rejected, under a likelihood ratio test (LRT), as a model of events $x \ge x_{\min} = 10$.
Multiple tail models better represent model uncertainty.
[Figure: Pr(X ≥ x) against severity x (deaths) on log-log axes, with the Pareto, stretched exponential, and log-normal tail fits and 9/11 marked.]
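To compare what the three tail models imply for very large events, it helps to write each one as a conditional tail probability Pr(X ≥ x | X ≥ x_min). The sketch below uses the continuous forms, with the stretched exponential density taken as proportional to x^(β-1) e^(-λ x^β) and the log-normal in its usual form. The Pareto exponent is the slide's estimate; the other parameter values are arbitrary placeholders, not fitted values from the talk.

```python
import math


def pareto_tail(x, xmin, alpha):
    """Pr(X >= x | X >= xmin) for density proportional to x^(-alpha)."""
    return (x / xmin) ** (1.0 - alpha)


def stretched_exp_tail(x, xmin, lam, beta):
    """Pr(X >= x | X >= xmin) for density proportional to x^(beta-1) exp(-lam x^beta)."""
    return math.exp(-lam * (x ** beta - xmin ** beta))


def lognormal_tail(x, xmin, mu, sigma):
    """Pr(X >= x | X >= xmin) for a log-normal truncated below at xmin."""
    def upper(v):
        # Upper-tail probability of an untruncated log-normal at v.
        return 0.5 * math.erfc((math.log(v) - mu) / (sigma * math.sqrt(2.0)))
    return upper(x) / upper(xmin)


xmin, x = 10.0, 2749.0
print("Pareto:               ", pareto_tail(x, xmin, alpha=2.4))
# Placeholder parameters below, chosen only to exercise the functions.
print("stretched exponential:", stretched_exp_tail(x, xmin, lam=0.5, beta=0.3))
print("log-normal:           ", lognormal_tail(x, xmin, mu=1.0, sigma=2.0))
```

Models that fit the bulk of the tail about equally well can still differ substantially at the far end, which is one way to read the slide's remark about model uncertainty.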
