SLIDE 1

This presentation was prepared by CircleUp Network, Inc. This is for information purposes only and is not intended to be an offer or solicitation of any services or products offered by CircleUp Network, Inc. or any products or offerings of its subsidiaries or affiliates.

Working with Time Series: Smoothing and Imputation Frameworks to Improve Data Density

Anjali Samani Director of Data Science

asamani@circleup.com @anjalisamani @circleup

September 26, 2019

SLIDE 2

O’Reilly Strata Data Conference, New York, 2019 strataconf.com #stratadata

SETTING EXPECTATIONS

Conceptual frameworks and approach rather than low-level technical details

What will be covered

  • When you should and should not use smoothing or denoising
  • How to correctly approach imputation for time-series datasets and quantitative evaluation of techniques
  • Downstream implications and considerations of different choices

What will not be covered in this presentation

  • Mathematical theory and implementation details

SLIDE 3

OUTLINE

Roadmap to the frameworks

  • Introductions: About CircleUp
  • Problem Statement: Why is smoothing and imputation needed?
  • Semantics: Denoising vs Smoothing
  • Smoothing Framework: To smooth or not to smooth…
  • Missing Value Imputation: How to approach
  • Imputation Framework: Making data driven choices
SLIDE 4

This presentation was prepared by CircleUp Advisors LLC for individuals believed to be Qualified Institutional Buyers and interested in learning more about CircleUp Advisors LLC and the technological platform of its affiliate. This is for information purposes only and is not intended to be an offer or solicitation of any services or products offered by CircleUp Advisors LLC or any products or offerings of its affiliate(s).

SLIDE 5

CIRCLEUP

A data-driven investment platform for the consumer sector

  • Applications: Equity Funds, Credit Fund, Insights
  • Platform: Helio
  • Models: Interpretable Models, Black-Box Models
  • Data: Public Sources, Partnerships, Proprietary

SLIDE 6

CIRCLEUP

Unlocking private market investing with data

Brand Discovery, Data Acquisition & Ingestion, Knowledge Graph, Data Aggregation, Descriptive Analytics, Predictive Modeling, Prescriptive Analytics

  • Helio finds, classifies, and evaluates a universe of ~1.4 million companies across more than 200 sources through a combination of practitioner, public, and partner data.
  • With this data, Helio evaluates businesses on an array of factors to spotlight breakout brands.

SLIDE 7

Why is smoothing and imputation needed?

SLIDE 8

IF YOUR DATA IS BAD, YOUR ANALYTICS TOOLS ARE USELESS

Your predictions are only as good as the data used by the model to learn patterns

  • A model’s ability to learn and correctly predict future outcomes is greatly influenced by the underlying data
  • Noisy, incomplete data can restrict its application to only a small set of techniques
  • Business decisions made using poor-quality data, and hence poor modelling outcomes, can be very costly

SLIDE 9

ALTERNATIVE DATA: THE NEW GAME CHANGER

Complementary to conventional data

  • Data curated from non-traditional sources
  • Typically complementary to traditional data in terms of the additional signal that can be derived from it

Alternative data is noisy and ephemeral

SLIDE 10

ALTERNATIVE DATA: SOURCES

Nontraditional sources

Alternative data is often generated by

  • Connected devices and varied sensors: e.g. satellite images, geolocation data

  • Transactional systems: e.g. credit card transactions
  • Social networking sites: e.g. interests and affiliations
  • The internet: e.g. online browsing activity

And many more…

SLIDE 11

ALTERNATIVE DATA: DIAMOND IN THE ROUGH

Requires significant cleaning and processing to mine signal from alternative data

To extract meaningful signals from alternative data, it is necessary to apply appropriate smoothing or denoising and imputation techniques to generate clean and complete time-series.

SLIDE 12

Denoising vs Smoothing signals

SLIDE 13

DE-NOISING: MATHEMATICAL SOLUTIONS

Removal of noise from a mixture of signal and noise to preserve maximum information

  • If the structure of the signal or noise is known, it can be explicitly modeled
  • If the statistical properties of the signal are known, identify and remove noise
  • Create a series of test functions and noisy versions of test functions to simulate the signal and the types of noise it may be susceptible to
  • Apply different de-noising techniques and choose the one that maximizes signal-to-noise ratio

[Figure: observed signal shown with its signal and noise components]
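
As a rough illustration of the test-function approach on this slide, the Python sketch below builds one test function, adds one kind of noise, and scores a few candidate filters by signal-to-noise ratio. The sine test function, the Gaussian noise level, the window of 5, and all helper names are assumptions made for this example; they are not taken from the talk.

```python
import math
import random
import statistics

def snr_db(clean, estimate):
    """Signal-to-noise ratio in decibels: signal power over residual power."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((c - e) ** 2 for c, e in zip(clean, estimate))
    return 10 * math.log10(signal_power / noise_power)

def moving_average(x, window=5):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    return [statistics.fmean(x[max(0, i - half): i + half + 1]) for i in range(len(x))]

def moving_median(x, window=5):
    """Centered moving median; more robust to isolated spikes."""
    half = window // 2
    return [statistics.median(x[max(0, i - half): i + half + 1]) for i in range(len(x))]

random.seed(0)
# Hypothetical test function: a slow sine wave standing in for the true signal.
clean = [math.sin(2 * math.pi * t / 50) for t in range(200)]
# Hypothetical noise model: additive Gaussian noise.
noisy = [c + random.gauss(0, 0.3) for c in clean]

# "none" is the do-nothing baseline every candidate has to beat.
candidates = {"none": list, "moving_average": moving_average, "moving_median": moving_median}
scores = {name: snr_db(clean, f(noisy)) for name, f in candidates.items()}
best = max(scores, key=scores.get)
```

In practice the exercise would be repeated over several test functions and noise types, since the winning filter depends on both.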

SLIDE 14

DE-NOISING: ENGINEERING SOLUTION

Redundancy in data collection can help with noise removal

  • If the system is stable, build redundancy in data collection

May not be possible with ephemeral data
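
One way to see why redundancy helps: averaging several independent noisy readings of the same stable quantity shrinks the noise. The sketch below is a toy simulation under assumed conditions (a linear "true" signal, nine hypothetical collectors, independent Gaussian noise); none of these specifics come from the talk.

```python
import math
import random
import statistics

random.seed(1)
true_signal = [10 + 0.1 * t for t in range(100)]  # hypothetical stable system

def measure(signal, noise_sd=1.0):
    """One noisy reading of the signal from a single (hypothetical) collector."""
    return [s + random.gauss(0, noise_sd) for s in signal]

def rmse(a, b):
    """Root mean squared difference between two equal-length series."""
    return math.sqrt(statistics.fmean((x - y) ** 2 for x, y in zip(a, b)))

single = measure(true_signal)
# Nine redundant collectors observing the same system with independent noise.
redundant = [measure(true_signal) for _ in range(9)]
averaged = [statistics.fmean(readings) for readings in zip(*redundant)]

error_single = rmse(single, true_signal)
error_averaged = rmse(averaged, true_signal)  # expected to be roughly a third of error_single
```

The benefit relies on the noise being independent across collectors and the system staying stable while it is re-measured, which is exactly what ephemeral data cannot guarantee.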

SLIDE 15

DE-NOISING

Not easy, but possible

  • Advantage: Greater confidence in the processed signal and its information content
  • Disadvantage: Can be challenging - requires technical and domain expertise

Likely unsuitable for Alternative Data – nature of signal and noise typically unknown

SLIDE 16

SMOOTHING: SIGNAL OR NOISE?

When it is difficult to know what is signal and what is noise

Growth in signal for an emerging brand

SLIDE 17

SMOOTHING: SIGNAL OR NOISE?

When it is difficult to know what is signal and what is noise

Growth in signal for an emerging brand

Brand shows up on our radar
SLIDE 18

SMOOTHING: SIGNAL OR NOISE?

When it is difficult to know what is signal and what is noise

Growth in signal for an emerging brand

Brand shows up on our radar

Signal or Noise?

SLIDE 19

SMOOTHING: SIGNAL OR NOISE?

When it is difficult to know what is signal and what is noise

Growth in signal for an emerging brand

Signal or Noise?

Brand shows up on our radar

Actual future values

SLIDE 20

SMOOTHING: SIGNAL OR NOISE?

Reduction of excess variance from the data to highlight patterns and trends

Growth in signal for an emerging brand

Signal or Noise?

Brand shows up on our radar

  • Smoothing can be a little experimental in nature
  • Difficult to distinguish legitimate observations from noisy outliers
SLIDE 21

SMOOTHING: SIGNAL OR NOISE?

Smoothing can hide the very patterns you may want to identify

Growth in signal for an emerging brand

Brand shows up on our radar

Unsmoothed values Smoothed values

SLIDE 22

SMOOTHING: SIGNAL OR NOISE?

Smoothing can hide the very patterns you may want to identify

Growth in signal for an emerging brand

Brand shows up on our radar

Unsmoothed values Smoothed values Hypothetical Growth Threshold for investment

SLIDE 23

SMOOTHING: SIGNAL OR NOISE?

Smoothing can hide the very patterns you may want to identify

Growth in signal for an emerging brand

Brand shows up on our radar

Unsmoothed values Smoothed values Hypothetical Growth Threshold for investment Actual future values

Cost of missed opportunities can far outweigh any time-saving benefits of smoothing
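
The effect described on this slide can be reproduced with a toy series. In the sketch below, the growth path, the 6-period trailing window, and the threshold of 30 are all hypothetical values chosen for illustration; the point is only that the smoothed series reaches the threshold later than the raw one.

```python
def trailing_mean(x, window):
    """Trailing moving average using only the current and past values."""
    out = []
    for i in range(len(x)):
        chunk = x[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def first_crossing(x, threshold):
    """Index of the first value at or above the threshold, or None."""
    return next((i for i, v in enumerate(x) if v >= threshold), None)

# Hypothetical brand signal: flat, then growing 25% per period from period 12.
raw = [10.0] * 12 + [10.0 * 1.25 ** k for k in range(1, 13)]
smoothed = trailing_mean(raw, window=6)

threshold = 30.0  # hypothetical growth threshold for investment
raw_hit = first_crossing(raw, threshold)
smoothed_hit = first_crossing(smoothed, threshold)
delay = smoothed_hit - raw_hit  # periods lost by acting on the smoothed series
```

A longer window reduces variance further but lengthens the delay, which is the trade-off behind the cost of missed opportunities.
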
SLIDE 24

DANGERS OF SMOOTHING

Smoothing can be misleading and cost of missed opportunities can be high

  • Smoothing makes a series appear less volatile than it is
  • It may also mask the very patterns a practitioner is seeking to identify

Example borrowed from https://serialmentor.com/dataviz/visualizing-trends.html

SLIDE 25

SMOOTHING

Reduction of excess variance from the data to highlight patterns and trends

  • Identify trends and other patterns (e.g. seasonality) more easily
  • Useful for simple forecasting when patterns over the mid- to long-term matter more than short-term fluctuations
  • Will reduce variance, not eliminate completely

[Figure: observed signal with its trend, seasonal, and residual components]
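
The figure labels on this slide (observed signal, trend, seasonal, residual) match a classical additive decomposition. The sketch below is one minimal way to compute it, assuming an additive model, a known odd period of 7, and synthetic data; the function names are illustrative and the talk does not specify which decomposition method was used.

```python
import math
import random
import statistics

def centered_mean(x, window):
    """Centered moving average for an odd window; None where the window does not fit."""
    half = window // 2
    return [
        statistics.fmean(x[i - half: i + half + 1]) if half <= i < len(x) - half else None
        for i in range(len(x))
    ]

def decompose(x, period):
    """Additive decomposition: observed = trend + seasonal + residual."""
    trend = centered_mean(x, period)
    # Average the detrended values at each position in the cycle.
    by_phase = [[] for _ in range(period)]
    for i, (v, t) in enumerate(zip(x, trend)):
        if t is not None:
            by_phase[i % period].append(v - t)
    phase_means = [statistics.fmean(p) for p in by_phase]
    centre = statistics.fmean(phase_means)  # force the seasonal component to average zero
    seasonal = [phase_means[i % period] - centre for i in range(len(x))]
    residual = [None if t is None else v - t - s for v, t, s in zip(x, trend, seasonal)]
    return trend, seasonal, residual

random.seed(2)
period = 7  # hypothetical weekly cycle on daily data
observed = [
    0.5 * t + 3 * math.sin(2 * math.pi * t / period) + random.gauss(0, 0.5)
    for t in range(70)
]
trend, seasonal, residual = decompose(observed, period)
```

The residual still carries variance after the trend and seasonal parts are removed, consistent with the point that smoothing reduces variance rather than eliminating it.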

SLIDE 26

TECHNICAL ISSUES WITH SMOOTHING

Correctly using smoothing requires thorough statistical knowledge and expertise

  • Higher Prediction Errors: If smoothed series used as inputs to
  • Uncertainty due to smoothing needs to be explicitly accounted for
  • Information loss
  • Bias – variance tradeoff
  • Correlations may be distorted
  • Parameter Errors Underestimated: Predicted by
  • Algebraic propagation-of-error
  • Bootstrapping
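
A small simulation can make the "correlations may be distorted" point concrete. The sketch below draws pairs of independent noise series (true correlation zero) and compares the typical size of the sample correlation before and after smoothing. The series length, the 10-point trailing window, and the 200 trials are arbitrary choices for this illustration.

```python
import math
import random
import statistics

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length series."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

def trailing_mean(x, window):
    """Trailing moving average; the window shrinks at the start of the series."""
    return [statistics.fmean(x[max(0, i - window + 1): i + 1]) for i in range(len(x))]

random.seed(3)
raw_abs, smooth_abs = [], []
for _ in range(200):
    # Two independent noise series: the true correlation is zero.
    a = [random.gauss(0, 1) for _ in range(100)]
    b = [random.gauss(0, 1) for _ in range(100)]
    raw_abs.append(abs(pearson(a, b)))
    smooth_abs.append(abs(pearson(trailing_mean(a, 10), trailing_mean(b, 10))))

mean_abs_raw = statistics.fmean(raw_abs)        # typical spurious correlation, raw
mean_abs_smooth = statistics.fmean(smooth_abs)  # typically larger after smoothing
```

Smoothing introduces autocorrelation, which leaves fewer effectively independent points and so makes large chance correlations more likely.
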
SLIDE 27

COMMON SMOOTHING MISTAKES TO AVOID

Mediocrity vs excellence

  • Generalising too much too soon
  • Not inquiring about root cause of variance / noise
  • Not checking implicit assumptions and implications of the chosen solution

  • Using “pre-smoothed” series in machine learning models

Beware of data leakage from “pre-smoothed” series
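
One common source of such leakage, assumed here for illustration since the slide does not name a mechanism, is a centered smoothing window: the smoothed value at time t depends on observations after t. The sketch below changes only a future value and checks whether an earlier smoothed value moves.

```python
def window_mean(x, i, lo, hi):
    """Mean of x over indices i-lo .. i+hi, clipped to the series bounds."""
    chunk = x[max(0, i - lo): i + hi + 1]
    return sum(chunk) / len(chunk)

def centered_mean(x, half=2):
    """Uses `half` past and `half` FUTURE points: leaks future information."""
    return [window_mean(x, i, half, half) for i in range(len(x))]

def trailing_mean(x, window=5):
    """Uses only the current and past points: safe as a model feature."""
    return [window_mean(x, i, window - 1, 0) for i in range(len(x))]

history = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
shocked = history[:-1] + [80.0]  # same past, different value at the final step

t = 5  # a time step strictly before the changed value
centered_leaks = centered_mean(history)[t] != centered_mean(shocked)[t]
trailing_leaks = trailing_mean(history)[t] != trailing_mean(shocked)[t]
```

A series that arrives already smoothed gives no way to tell which kind of window was used, which is a reason to treat it with suspicion as a model input.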

SLIDE 28

SMOOTHING: KEY TAKEAWAYS

Proceed with caution!

  • There is no one-size-fits-all solution
  • Treat each time series with equal respect
  • Approach scaling to a large number of series with extreme caution

SLIDE 29

Smoothing Framework

SLIDE 30

TO SMOOTH OR NOT TO SMOOTH…

It depends…

[Flowchart: Should I smooth the data? The decision nodes and outcomes are listed below; the YES/NO and BIZ/DS branch connections are not preserved in this text.]

Decision nodes

  • Data Science or Business Use?
  • Is the purpose simplification for heuristics or audience understanding?
  • Do you believe it will support simpler conclusions than raw data?
  • Are you sure you’re not fooling yourself?
  • Will the data be undergoing any other statistical procedures?
  • Are you quantifying uncertainty or forecasting unseen data?
  • Are you going to be making inferences from the data?
  • Will the in-sample smoothed data be used for more analysis / model?
  • Do you believe that the smoothed data will be more accurate than the raw data?
  • Do you have exogenous data or theory to distinguish signal from noise?
  • Are you accounting for uncertainty of smoothing in results?
  • Are small scale structures in the data very important?

Outcomes

  • Proceed
  • Proceed with Caution: Some methods may be more appropriate than others
  • Not Recommended: High chance of statistical malpractice
  • Not Recommended: Compromising veracity
  • Not Recommended: Errors underestimated. Simulates higher certainty than warranted
  • Not Recommended: Small scale structures are likely not preserved

SLIDE 31

Imputing Missing Values

SLIDE 32

A PRACTITIONER’S APPROACH

Missing data is not the same as missing information

  • Missing data is ubiquitous in data science
  • Academic literature will only get you so far
  • Pay attention to your business requirements
  • Computational complexity is a real consideration
  • Use prudence over theory
SLIDE 33

Imputation Framework

SLIDE 34

KNOW YOUR DATA

If you do not know your data, you do not know what is good for it

  • 1. Start with some dense but representative subset of your data

If you have sufficient data, repeat the exercise with multiple cuts of the data

[Figure: distribution of actual data vs that used for imputation testing; the test data shown is NOT representative of the actual data]

SLIDE 35

KNOW YOUR DATA: MISSINGNESS

Your simulation is only as good as your understanding of your data and its missingness

  • 2. Simulate the missingness in the data
  • Missing at random vs not missing at random
  • Fraction of missingness
  • Location of missingness in the series
  • Single value vs multiple consecutive values
  • Simulate incrementally longer missing streaks, starting with a single value
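
A minimal sketch of step 2, assuming a complete series held as a Python list and gaps marked with None. It covers two of the patterns on this slide: a fixed fraction of points hidden uniformly at random, and incrementally longer streaks at a fixed location. The function names, the 20% fraction, and the streak position are hypothetical; missingness that depends on the values themselves would need its own masking rule.

```python
import random

def mask_at_random(series, fraction, rng):
    """Hide a given fraction of points, chosen uniformly at random."""
    n_missing = round(fraction * len(series))
    hidden = set(rng.sample(range(len(series)), n_missing))
    return [None if i in hidden else v for i, v in enumerate(series)]

def mask_streak(series, start, length):
    """Hide `length` consecutive points beginning at index `start`."""
    return [None if start <= i < start + length else v for i, v in enumerate(series)]

rng = random.Random(4)
dense = [float(v) for v in range(1, 61)]  # hypothetical dense, complete series

scattered = mask_at_random(dense, fraction=0.2, rng=rng)
# Incrementally longer gaps at a fixed location, starting with a single value.
streaks = {length: mask_streak(dense, start=30, length=length) for length in range(1, 6)}
```

Because the original values are kept in `dense`, every imputed value can later be scored against the truth.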

SLIDE 36

IMPUTE MISSING VALUES: TEST MULTIPLE STRATEGIES

Different approaches may work for different segments of the population

  • 3. Use imputation strategies to estimate the artificially missing values

Effectiveness of imputation strategies will vary based on

  • Type of missingness
  • Fraction of missingness
  • Distribution of data
  • Correlations within the data

[Figure: incomplete data and imputed data]
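
As a sketch of step 3, here are two simple strategies that the deck itself names later (last-value fill-forward and linear interpolation), written in plain Python with None marking a gap. The function names and the sample series are illustrative only; other strategies can be slotted into the same comparison.

```python
def forward_fill(series):
    """Replace each gap with the last observed value (leading gaps stay None)."""
    out, last = [], None
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out

def linear_interpolate(series):
    """Fill interior gaps on a straight line between the neighbouring observations."""
    out = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = series[left] + step * (i - left)
    return out

incomplete = [2.0, None, None, 8.0, 9.0, None, 13.0]
filled_ffill = forward_fill(incomplete)         # [2.0, 2.0, 2.0, 8.0, 9.0, 9.0, 13.0]
filled_linear = linear_interpolate(incomplete)  # [2.0, 4.0, 6.0, 8.0, 9.0, 11.0, 13.0]
```

Forward fill uses only past values, while linear interpolation also needs the next observation, so the two are not interchangeable in a live setting.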

SLIDE 37

EVALUATE IMPUTATION STRATEGIES

A useful metric is both accurate (in that it measures what it says it measures) and aligned with your goals ~ Seth Godin

  • 4. Evaluate the imputed values against actual values

Ensure statistical properties of final data (with imputed values) have not changed materially from observed data
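
The check on statistical properties can be sketched as a before/after comparison of summary statistics. The example below deliberately uses a naive strategy (filling every gap with the observed mean), chosen only because its side effect is easy to see; the data, the 30% missing fraction, and the helper names are assumptions for illustration.

```python
import random
import statistics

rng = random.Random(5)
actual = [rng.gauss(100, 15) for _ in range(200)]  # hypothetical complete series

# Hide 30% of the points to create artificial gaps with known true values.
hidden = set(rng.sample(range(len(actual)), 60))
observed = [v for i, v in enumerate(actual) if i not in hidden]

# Naive strategy used purely for illustration: fill every gap with the observed mean.
fill_value = statistics.fmean(observed)
imputed = [fill_value if i in hidden else v for i, v in enumerate(actual)]

def summary(x):
    """Summary statistics compared before and after imputation."""
    return {"mean": statistics.fmean(x), "stdev": statistics.pstdev(x)}

before, after = summary(observed), summary(imputed)
# A ratio below 1 means the imputed series looks less variable than the observed data.
stdev_ratio = after["stdev"] / before["stdev"]
# Accuracy on the hidden points, scored against the known true values.
mae = statistics.fmean(abs(actual[i] - imputed[i]) for i in hidden)
```

Here the mean is preserved but the spread shrinks, so a strategy can score acceptably on error and still change the data's statistical properties materially.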

SLIDE 38

MULTIPLE CONSECUTIVE MISSING VALUES

Although possible, it may not make sense to impute all consecutive missing values

How many values can you safely impute?

  • Look for step-change in error
  • Determine based on maximum tolerable error for your specific application

[Figure: error (MAPE) vs the number of consecutive missing values, using last-value fill-forward and linear interpolation. Annotations: "Beyond this point, error between imputed and actual values rises sharply"; "Error between imputed and actual values is already >100%"; "Both strategies perform equally poorly at this point"]
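
The curve on this slide can be approximated for any dense series by hiding gaps of increasing length and scoring each strategy. The sketch below assumes a synthetic series growing 10% per period, a gap starting at index 10, and a 10% MAPE tolerance; all three are hypothetical, and real data will give a different curve.

```python
import statistics

def mape(actual, estimate):
    """Mean absolute percentage error; assumes no actual value is zero."""
    return 100 * statistics.fmean(abs(a - e) / abs(a) for a, e in zip(actual, estimate))

def gap_mape(series, start, length, strategy):
    """Hide series[start:start+length], impute it, and score the hidden points."""
    left, right = series[start - 1], series[start + length]
    if strategy == "ffill":
        estimate = [left] * length
    else:  # linear interpolation between the two neighbouring observations
        estimate = [left + (right - left) * k / (length + 1) for k in range(1, length + 1)]
    return mape(series[start:start + length], estimate)

def longest_safe_gap(errors_by_length, limit):
    """Longest gap such that it and every shorter gap stay within the limit."""
    n = 0
    while n + 1 in errors_by_length and errors_by_length[n + 1] <= limit:
        n += 1
    return n

series = [100 * 1.1 ** t for t in range(60)]  # hypothetical growth series
max_tolerable_mape = 10.0                     # hypothetical application limit
lengths = range(1, 21)
errors = {s: {n: gap_mape(series, 10, n, s) for n in lengths} for s in ("ffill", "linear")}
safe = {s: longest_safe_gap(errors[s], max_tolerable_mape) for s in errors}
```

Gaps longer than the safe length are better left missing than filled with values the application cannot tolerate.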

SLIDE 39

Final Thoughts

SLIDE 40

FINAL THOUGHTS

If you remember nothing else, please remember

  • Smoothing: Unless visualising or doing EDA, approach with extreme caution
  • Imputation: Know your data and its missingness well and then simulate
  • No one-size-fits-all solution
  • Treat each time series with equal respect
  • Approach scaling smoothing and imputation with extreme caution!!!

Thank You!

SLIDE 42

Appendix

SLIDE 43

ERROR METRICS

A useful metric is both accurate (in that it measures what it says it measures) and aligned with your goals ~ Seth Godin

  • Root Mean Squared Error (RMSE): Penalises higher differences more
  • Mean Absolute Error (MAE): More interpretable and robust to outliers
  • Mean Absolute % Error (MAPE): Relative error - needs to be adjusted if the true value is 0

  • Will be higher for more volatile series
  • Useful when range of values is large or data is skewed
  • Correlation between estimated and actual values
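
For reference, the four metrics can be written in a few lines of standard-library Python. The slide says MAPE needs adjusting when the true value is 0 but does not say how; skipping those points, as done here, is one possible adjustment and is an assumption of this sketch. The sample values are made up.

```python
import math
import statistics

def rmse(actual, estimate):
    """Root mean squared error: penalises large differences more heavily."""
    return math.sqrt(statistics.fmean((a - e) ** 2 for a, e in zip(actual, estimate)))

def mae(actual, estimate):
    """Mean absolute error, in the same units as the data."""
    return statistics.fmean(abs(a - e) for a, e in zip(actual, estimate))

def mape(actual, estimate):
    """Mean absolute % error over points whose true value is non-zero."""
    pairs = [(a, e) for a, e in zip(actual, estimate) if a != 0]
    return 100 * statistics.fmean(abs(a - e) / abs(a) for a, e in pairs)

def pearson(actual, estimate):
    """Correlation between estimated and actual values."""
    ma, me = statistics.fmean(actual), statistics.fmean(estimate)
    cov = sum((a - ma) * (e - me) for a, e in zip(actual, estimate))
    var_a = sum((a - ma) ** 2 for a in actual)
    var_e = sum((e - me) ** 2 for e in estimate)
    return cov / math.sqrt(var_a * var_e)

actual = [10.0, 20.0, 0.0, 40.0, 50.0]
estimate = [12.0, 18.0, 1.0, 40.0, 60.0]
metrics = {
    "rmse": rmse(actual, estimate),
    "mae": mae(actual, estimate),
    "mape": mape(actual, estimate),
    "correlation": pearson(actual, estimate),
}
```

On this sample the single large miss (50 vs 60) pulls RMSE well above MAE, which is the "penalises higher differences more" behaviour.
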
SLIDE 44

MULTIPLE IMPUTATION

Once is luck, twice is coincidence, three times is a pattern

[Figure: multiple imputation workflow: Incomplete Data, Imputed Data, Analysis Data, Final outcome + Uncertainty Estimates]
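
The workflow on this slide can be sketched end to end: impute several times with random draws, run the same analysis on each completed dataset, then pool the results. The pooling below follows Rubin's rules (total variance = within-imputation variance + (1 + 1/m) × between-imputation variance). Everything else is a simplifying assumption for illustration: the data are synthetic, the analysis is just a mean, and missing values are drawn from a normal distribution fitted to the observed points without also drawing its parameters, which a full multiple-imputation procedure would do.

```python
import random
import statistics

rng = random.Random(6)
complete = [rng.gauss(50, 10) for _ in range(120)]  # hypothetical data
hidden = set(rng.sample(range(len(complete)), 30))  # positions treated as missing
observed = [v for i, v in enumerate(complete) if i not in hidden]
obs_mean, obs_sd = statistics.fmean(observed), statistics.stdev(observed)

m = 20  # number of imputed datasets
estimates, variances = [], []
for _ in range(m):
    # Stochastic imputation: draw each missing value from the fitted normal.
    filled = [rng.gauss(obs_mean, obs_sd) if i in hidden else v
              for i, v in enumerate(complete)]
    estimates.append(statistics.fmean(filled))                   # analysis on one dataset
    variances.append(statistics.variance(filled) / len(filled))  # its sampling variance

# Pool the m analyses with Rubin's rules.
pooled_estimate = statistics.fmean(estimates)
within = statistics.fmean(variances)
between = statistics.variance(estimates)
total_variance = within + (1 + 1 / m) * between
pooled_se = total_variance ** 0.5  # uncertainty estimate for the final outcome
```

The between-imputation term is what a single imputation cannot provide: it measures how much the answer moves when the missing values are filled in differently.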

SLIDE 45

References

Problem Statement

  • “4 Most Popular Alternative Data Sources Explained”: https://www.kdnuggets.com/2019/07/4-most-popular-alternative-data-sources-explained.html
  • “Bad Data Costs the U.S. $3 Trillion Per Year”: https://hbr.org/2016/09/bad-data-costs-the-u-s-3-trillion-per-year
  • “If Your Data Is Bad, Your Machine Learning Tools Are Useless”: https://hbr.org/2018/04/if-your-data-is-bad-your-machine-learning-tools-are-useless
  • “Can Your Data Be Trusted?”: https://hbr.org/2015/10/can-your-data-be-trusted
  • “Historical and impact analysis of API breaking changes: A large-scale study”: https://ieeexplore.ieee.org/document/7884616

SLIDE 46

References

Smoothing

  • “Forecasting: Principles and Practice”; Rob J Hyndman and George Athanasopoulos: https://otexts.com/fpp2/
  • “Do NOT smooth time series before computing forecast skill”: https://wmbriggs.com/post/735/
  • “Do not smooth times series, you hockey puck!”: https://wmbriggs.com/post/195/
  • “Signals and noise”: https://terpconnect.umd.edu/~toh/spectrum/SignalsAndNoise.html
  • “Smoothing”: https://terpconnect.umd.edu/~toh/spectrum/Smoothing.html
SLIDE 47

References

Imputation

  • “Missing-Data Imputation”: http://www.stat.columbia.edu/~gelman/arm/missing.pdf
  • “From Predictive Methods to Missing Data Imputation: An Optimization Approach”: http://www.jmlr.org/papers/volume18/17-073/17-073.pdf
  • “To impute or not to impute: that is the question”: https://www.paultwin.com/wp-content/uploads/Lodder_1140873_Paper_Imputation.pdf
  • “A Review of Methods for Missing Data”: http://galton.uchicago.edu/~eichler/stat24600/Admin/MissingDataReview.pdf
  • “What to Do about Missing Values in Time-Series Cross-Section Data”: https://gking.harvard.edu/files/pr.pdf
  • “Statistical analysis with missing data”; Little, Roderick JA, and Donald B. Rubin; Vol. 793. John Wiley & Sons, 2014.
  • “Inference and missing data”; Rubin, Donald B.; Biometrika 63.3 (1976): 581–592.

SLIDE 48

References

Imputation

  • “How to handle missing values”: https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4
  • “Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls”: https://www.bmj.com/content/338/bmj.b2393