Alternative Data in Finance. Example: Lodging Key Metrics - PowerPoint PPT Presentation



SLIDE 1

Alternative Data in Finance

SLIDE 2

Example: Lodging Key Metrics

Occupancy × Room Rate ≈ Revenues

Alternative data proxies:

  • Number of lights on → Occupancy
  • Online Room Rates → Room Rate

SLIDE 3

Alternative Data

  • 1. Point of sale transactions
  • 2. Online behavior
  • 3. Purchases
      • Online
      • Brick and mortar
  • 4. Obscure public records
  • 5. Drone footage analysis ;)
  • 6. Etc etc etc
SLIDE 4

Supply Chain

  • 1. Data Vendors / Suppliers
  • 2. Aggregators and Analysts
  • 3. Clients / Funds
SLIDE 5

Outline

  • Basic Example (done)
  • What's Alternative Big Data (done)
  • Sourcing
  • Compliance and ethics
  • Predicting revenue and other uses
  • Walk through of common technical challenges
  • Basic trading strategy
  • Q & A
SLIDE 6

Data Sourcing

  • Direct data gathering
  • Data vendors
  • Just download the data (JDD)
SLIDE 7

Data gathering / Sourcing

  • Harvest the web
  • Primary Research
SLIDE 8

Harvesting: Build or Buy?

Build:

  • Control over compliance procedures
  • All IP and harvesting target information stays in house
  • Complete control over costs

Buy:

  • Faster to scale
  • Back data
  • Risk mitigated by an intermediary
  • Some structuring of the data done by vendor
  • Leverage vendors’ expertise in the data and spidering

* Tip for finding web harvesting firms: look on LinkedIn for people with web scraping skills and see who they work for.

SLIDE 9

Harvesting: Semantic web

  • Diffbot recognizes the content of web pages
  • Compares against schema.org’s structures
  • Automatically collects structured data without explicit structure definitions
  • Adjusts for changes in page layouts
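As a toy illustration of this kind of semantic harvesting, the sketch below pulls schema.org annotations out of a page using only the Python standard library. This is not how Diffbot works internally (Diffbot analyzes rendered page content; this only reads explicit JSON-LD blocks), and the page snippet is invented:

```python
import json
import re

def extract_json_ld(html: str) -> list:
    """Collect schema.org JSON-LD blocks embedded in an HTML page."""
    pattern = re.compile(
        r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
        re.DOTALL | re.IGNORECASE,
    )
    blocks = []
    for match in pattern.findall(html):
        try:
            blocks.append(json.loads(match))
        except json.JSONDecodeError:
            continue  # skip malformed annotations rather than failing the crawl
    return blocks

# Invented example page for a lodging use case.
page = '''<html><head>
<script type="application/ld+json">
{"@type": "Hotel", "name": "Example Inn", "priceRange": "$120-$180"}
</script>
</head></html>'''
print(extract_json_ld(page)[0]["name"])  # prints "Example Inn"
```

A real harvester would combine this with microdata/RDFa parsing and, as the slide notes, fall back to layout-robust extraction when no explicit annotations exist.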
SLIDE 10

Primary Research

  • Expert networks
  • Surveys
  • New ways to look at the world
  • Receipts
  • Serial numbers
  • Alexa or other web monitoring tools
  • Google trends
  • Classifieds
  • Drone footage
SLIDE 11

Evaluating Datasets

  • Scarcity
      • How widely used or marketed is it?
  • Granularity
      • Time
      • Aggregation levels
      • How structured is it?
  • Coverage
      • Sectors / Stocks – Hedge fund hotels?
      • Geo

* Creating a standardized quantitative scoring system or ROI matrix to evaluate datasets based on these criteria is a worthwhile endeavor
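A minimal sketch of what such a scoring system could look like in Python; the criteria weights and the example ratings are hypothetical, not a recommended calibration:

```python
# Hypothetical weights over the slide's criteria; calibrate to your own priorities.
WEIGHTS = {"scarcity": 0.30, "granularity": 0.25, "structure": 0.15, "coverage": 0.30}

def score_dataset(ratings: dict) -> float:
    """Weighted average of 1-10 analyst ratings for a candidate dataset."""
    return sum(WEIGHTS[k] * ratings[k] for k in WEIGHTS)

# Invented example: a credit card transaction panel.
credit_card_panel = {"scarcity": 6, "granularity": 9, "structure": 7, "coverage": 5}
print(round(score_dataset(credit_card_panel), 2))  # 6.6
```

Even a crude score like this forces datasets to be compared on the same axes before money is spent.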

SLIDE 12

Evaluating Vendors

  • Companies monetizing their exhaust data
      • High quality, high margin revenue
      • Upstream insights from buyer
  • Traditional data vendors
      • Survey data
      • Financial data aggregation
  • Hybrids
      • 1010 / ITG
SLIDE 13

Free Datasets

http://aws.amazon.com/datasets
http://databib.org
http://datacite.org
http://figshare.com
http://linkeddata.org
http://reddit.com/r/datasets
http://thedatahub.org (alias http://ckan.net)
http://quandl.com
http://enigma.io

Hundreds more: http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
SLIDE 14

High opportunity datasets

  • International
      • Asia
      • Latam
  • Insight into margins
      • Stocks are more sensitive to EPS surprises than to revenue surprises
      • COGS
      • SG&A
      • Etc.
  • B2B
SLIDE 15

Compliance overview

  • Intent / Ethics
  • Regulatory
SLIDE 16

Compliance overview

[Diagram: data flows from the Data Vendor through a PII Scrubbing Process / Encrypted Archiving in a Restricted Environment into the Organization’s Production Environment]

SLIDE 17

Compliance overview: Guidelines / Control Frameworks

  • NIST 800-122
  • GLBA (Gramm-Leach-Bliley Act)
  • COBIT 5
  • COSO 2013
SLIDE 18

Compliance overview

  • Just use regular expressions

[Dense regular expressions for matching passwords, Windows file paths, IP addresses, and e-mail addresses]

* Use RegexBuddy.
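A hedged Python sketch of the idea: run a battery of regexes over raw records and redact anything that matches. The patterns here are deliberately simplified stand-ins for the much stricter expressions on the slide:

```python
import re

# Simplified illustrative patterns; production PII scrubbing uses far
# stricter expressions (see the slide) plus validation logic.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace anything matching a PII pattern with a [REDACTED:<kind>] tag."""
    for kind, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{kind}]", text)
    return text

record = "user jane.doe@example.com logged in from 192.168.0.12"
print(scrub(record))  # user [REDACTED:email] logged in from [REDACTED:ipv4]
```

Regex scrubbing is a first pass, not a guarantee; it should sit inside the control frameworks listed on the previous slide.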

SLIDE 19

Compliance overview: Web Harvesting Precedent Cases

  • Largely uncharted territory; the major cases (and the majority of them):
  • Feist Publications, Inc., v. Rural Telephone Service Co.,
  • Ryanair Scraping Cases
  • Ebay vs Bidders Edge
  • Intel vs Hamidi
  • Cases discussing Browserwrap vs clickwraps
  • Cvent, Inc. v. Eventbrite, Inc
  • 3taps vs Craigslist
  • None of these cases directly address investment research
SLIDE 20

Compliance overview

  • Respect websites’ TOS, especially if in a clickwrap
  • Maintain a sensible web harvesting policy
  • Address incoming complaints
  • Limit the number of HTTP requests
  • Stay current on laws and cases
  • Explicitly address headline risk and regulatory risk; create a cost-benefit analysis for headline risk
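The "limit the number of HTTP requests" guideline can be implemented as a per-host throttle. A minimal sketch, with an injected `fetch` callable standing in for a real HTTP client; the class name and interval are invented:

```python
import time

class PoliteFetcher:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, fetch, min_interval_s=2.0):
        self.fetch = fetch                  # e.g. a urllib / requests wrapper
        self.min_interval_s = min_interval_s
        self._last_call = {}                # host -> monotonic time of last request

    def get(self, host, url):
        elapsed = time.monotonic() - self._last_call.get(host, 0.0)
        if elapsed < self.min_interval_s:
            time.sleep(self.min_interval_s - elapsed)  # back off before re-hitting the host
        self._last_call[host] = time.monotonic()
        return self.fetch(url)

calls = []
fetcher = PoliteFetcher(fetch=calls.append, min_interval_s=0.01)
for u in ("https://example.com/a", "https://example.com/b"):
    fetcher.get("example.com", u)
print(len(calls))  # 2
```

In practice the interval would also respect robots.txt and any rate limits stated in the site's TOS.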

SLIDE 21

Generating value with alternative data

  • Revenue surprise estimates
  • Operating GAAP measures
  • Non-GAAP measures
      • Churn, etc.
  • Fully or partially automated quant strategies
  • Non-equity asset classes
  • PE could benefit from the same operating metrics for diligence
  • PM Development and Big Data Thought Leadership
  • Strategic Investments
  • Marketing Tool for Raising Capital and Talent Recruitment
SLIDE 22

Workflow and Process

Data

  • Data Partners
  • Web Collection
  • Storage optimization

Normalization

  • Cleansing
  • Benchmarking
  • De-biasing, Enrichment

Modeling

  • GAAP / Operating Metrics
  • Quant Signals
  • Investment Thesis Insights

Deliverable

  • Metrics Reporting
  • R&D Portfolio
  • Published Signal

[Workflow diagram connecting Data Acquisition (Data Vendors, Third Party Sources, Web Collection), Raw Data Production, Data Analysts, High Performance Computing, R&D Quant, Visualizations, and Sector Research, with the Published Signal and Interpretive Research Metrics delivered to L/S Teams & Quant Teams]

SLIDE 23

The shifting bias longitudinal panel problem

Full Panel

[Grid: Users 1–10 observed every month, Jan–Dec]

Panel with user add and churn (missing data MAR)

[Grid: Users 1–10 with gaps from user adds and churn, Jan–Dec]

SLIDE 24

The 200k and the ~800k are different

[Chart: Total Spend Index (0.5–5) by month, Jan–Dec. One line is the complete panel of ~200k users; the other is users who have the second year of data but not the first. Dashed lines mark the 95% confidence interval, N(μ, σ²)]

Solutions:

  • Imputation
  • Complete case analysis
  • Weighting methods
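Two of the three solutions can be sketched on a toy panel: complete case analysis and a deliberately crude per-user mean imputation. Real imputation and the weighting methods used later in the deck are far more careful, and all names and numbers here are invented:

```python
# Toy monthly spend panel with churned users (None = missing, assumed MAR).
panel = {
    "user1": [10, 12, 11, 13],
    "user2": [20, None, None, None],   # churned after the first month
    "user3": [None, 30, 32, 31],       # joined in the second month
}

def complete_case(panel):
    """Complete case analysis: keep only users observed in every month."""
    return {u: v for u, v in panel.items() if None not in v}

def impute_user_mean(panel):
    """Fill each user's gaps with that user's own observed mean (crude imputation)."""
    out = {}
    for u, v in panel.items():
        obs = [x for x in v if x is not None]
        mean = sum(obs) / len(obs)
        out[u] = [x if x is not None else mean for x in v]
    return out

print(list(complete_case(panel)))        # ['user1']
print(impute_user_mean(panel)["user2"])  # [20, 20.0, 20.0, 20.0]
```

Complete case analysis throws away most of the panel here, which is exactly why the deck moves on to weighting.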

SLIDE 25

Complete Panel and the rest of users are different

[Grid: Panels 1–10 by month, Jan–Dec; Panel 1 holds >200K users (680K)]

SLIDE 26

Complete Panel and the rest of users are different

[Grids: Panels 1–10 by month; left, Panel 1 at >200K users (680K); right, panel sizes now shown for Panel 1 (680K) and Panel 2 (720K)]

SLIDE 27

Complete Panel and the rest of users are different

[Grids: Panels 1–10 by month; panel sizes now shown through Panel 4: 680K, 720K, 740K, 760K]

Many users are the same, with ~90% overlap between adjacent panels. The further apart the panels, the less user overlap: Panel 1 and Panel 22 overlap only ~32%, so most users are different.

SLIDE 28

[Charts: time series of Sum of Cnt and Sum of DPT for Users A–D]

Multivariate Time Series Clustering

SLIDE 29

Multivariate Time Series Clustering

The pdc package takes a permutation distribution, which is a measure of the complexity of a time series. Similarity of time series is measured as the distance between their permutation distributions. It allows us to make groupings, based on multiple variables, over time.

clust <- pdclust(datamatrix, m = 4)
plot(clust, cols = c("red", "blue", "red", "blue"))

[Dendrogram: Users A–D]
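A minimal Python analogue of the permutation-distribution idea, not the pdc package itself: each series is summarized by its distribution of ordinal patterns, and series are compared by a distance between those distributions (total variation here, as a simplified stand-in for pdc's divergence). The toy series are invented:

```python
from itertools import permutations

def perm_distribution(series, m=3):
    """Empirical distribution of ordinal patterns of length m:
    the 'permutation distribution' that pdc clusters on."""
    patterns = list(permutations(range(m)))
    counts = {p: 0 for p in patterns}
    for i in range(len(series) - m + 1):
        window = series[i:i + m]
        # Ordinal pattern: index order that sorts the window's values.
        rank = tuple(sorted(range(m), key=lambda j: window[j]))
        counts[rank] += 1
    total = sum(counts.values())
    return [counts[p] / total for p in patterns]

def tv_distance(p, q):
    """Total-variation distance between two pattern distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

rising = [1, 2, 3, 4, 5, 6, 7, 8]
falling = [8, 7, 6, 5, 4, 3, 2, 1]
noisy_rise = [1, 3, 2, 4, 6, 5, 7, 8]

d_same = tv_distance(perm_distribution(rising), perm_distribution(noisy_rise))
d_diff = tv_distance(perm_distribution(rising), perm_distribution(falling))
print(d_same < d_diff)  # True
```

Feeding such pairwise distances into hierarchical clustering gives the same kind of grouping step that pdclust performs.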

SLIDE 30

User dropout in a longitudinal panel

  • We cluster each panel
  • Can use multivariate time series clustering like pdclust
  • Cluster on number of transactions and average transaction amount, low covariance features
  • Each panel’s cluster boundaries are independently defined

[Grid: Panel 1 and Panel 2, each split into Clusters 1–5, January–October]

SLIDE 31

Create Global Clusters

[Grid: Users A–C observed January–December]

Our data has the following toy examples:

  • User B and User C have no overlapping data
  • User A overlaps with both User B and C
  • During the overlap period, A shows the same patterns of behavior as each of B and C

Our methodology needs to have the following property:

  • A & C and A & B are clustered together
  • Thus B and C are also clustered together

K-means or even hclust cannot make the inference that B and C belong in the same cluster

SLIDE 32

Global Clusters with Latent Class Analysis

Latent Class Analysis

library(poLCA)
mod5 <- poLCA(f, maxiter = 50000, nclass = 5,
              nrep = 10, na.rm = FALSE, data = wclusters1)

USER  REGION     PANEL1  PANEL2  PANEL3
101   NORTHEAST  A       B       B
102   SOUTHEAST  A       A       D
103   SOUTHEAST  NA      B       B
104   PACIFIC    C       C       E
105   NORTHEAST  D       D       C
106   NORTHEAST  E       NA      NA
107   NORTHEAST  A       A       B

SLIDE 33

Global Clusters with Latent Class Analysis

  • Specialized for categorical data.
  • Iteratively takes each response pattern, and assigns that pattern a probability of being in some latent class.
  • Adjusts that probability based on associations in the data.
  • Co-occurring patterns are paired together.

USER  REGION     PANEL1  PANEL2  PANEL3  GLOBAL.Clust
101   NORTHEAST  A       B       B       A
102   SOUTHEAST  A       A       D       B
103   SOUTHEAST  NA      B       B       A
104   PACIFIC    C       C       E       D
105   NORTHEAST  D       D       C       E
106   NORTHEAST  E       NA      NA      C
107   NORTHEAST  A       A       B       B

SLIDE 34

Global Clusters with Latent Class Analysis - Graph

  • Create a membership roster: in how many of the 22 panels does a pair of users show up in the same cluster?
  • Network-map cluster this data to create second order, global clusters
  • I.e. if User B and User D share 20/22 panels together, they should be put into the same second order cluster

Number of clusters with shared membership:

        User A  User B  User C  User D
User A  22
User B  14      22
User C  4               22
User D  3       20      9       22

B and D should be in the same global cluster. If cluster probabilities, rather than hard memberships, are derived from the in-panel clusters, those can be used instead of hard mutual membership counts.
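A sketch of the roster idea in Python: treat "co-clustered in at least N of the 22 panels" as an edge between two users and take connected components as the second order clusters. Connected components are a simplified stand-in for the network-map clustering; the function name, threshold, and toy roster are invented:

```python
def second_order_clusters(panel_labels, min_shared=15):
    """Group users who co-cluster in at least min_shared panels.

    panel_labels: {user: [cluster label per panel, None if absent]}.
    Uses union-find connected components as a stand-in for the
    network-map clustering described on the slide.
    """
    users = list(panel_labels)
    parent = {u: u for u in users}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path compression
            u = parent[u]
        return u

    for i, a in enumerate(users):
        for b in users[i + 1:]:
            shared = sum(
                1 for la, lb in zip(panel_labels[a], panel_labels[b])
                if la is not None and la == lb
            )
            if shared >= min_shared:
                parent[find(a)] = find(b)  # union: same global cluster

    groups = {}
    for u in users:
        groups.setdefault(find(u), []).append(u)
    return list(groups.values())

# Toy 22-panel roster: A and B co-cluster in 20 panels, C never joins them.
labels = {
    "A": ["x"] * 20 + ["y", "y"],
    "B": ["x"] * 20 + ["z", "z"],
    "C": ["q"] * 22,
}
print(second_order_clusters(labels))  # [['A', 'B'], ['C']]
```

As the slide notes, soft co-membership probabilities could replace the hard counts; the graph construction stays the same.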

SLIDE 35

User dropout in a longitudinal panel - Results

[Figures: example memberships in the first two and the last panel; global cluster memberships]

SLIDE 36

User dropout in a longitudinal panel - Results

[Charts: spend distribution pre-weighting vs. post-weighting, Y1 Q1 to Y2 Q4]

SLIDE 37

Next steps after bias stabilization

  • Triangulate “stable” longitudinal data
  • External benchmarks
      • Census
      • CE Survey
      • “Pure Comps” revenues
  • Create distance metric summing across errors from our data to benchmarks
      • Rev Y/Y (leave some out for CV)
      • Census geo proportions
      • Spend ratios
  • Can manually weight each distance metric before aggregation to reflect importance
  • Use the same global clusters as before
  • Optimize cluster multiplier to minimize distance
  • Examine solution surface
  • Cross validate on revenues
  • Create company-specific models only from representative data (avoids spurious correlations)

* The aim of company-specific models is not to reduce data bias, but to model revenues from already representative data
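The weighted distance objective might be sketched as follows; the benchmark values, the manual weights, the metrics() formulas, and the grid search over a single cluster multiplier are all illustrative stand-ins for the deck's actual data and optimizer:

```python
# Hypothetical benchmark targets and manual importance weights.
benchmarks = {"rev_yoy": 0.08, "geo_share_ne": 0.21}
weights = {"rev_yoy": 2.0, "geo_share_ne": 1.0}

def metrics(multiplier):
    """Panel-implied metrics after scaling one cluster's weight.

    The linear formulas are stand-ins; in practice these come from
    re-aggregating the weighted panel."""
    return {"rev_yoy": 0.05 + 0.02 * multiplier,
            "geo_share_ne": 0.25 - 0.03 * multiplier}

def distance(multiplier):
    """Manually weighted sum of absolute errors versus the benchmarks."""
    m = metrics(multiplier)
    return sum(weights[k] * abs(m[k] - benchmarks[k]) for k in benchmarks)

# Grid search in place of a real optimizer over the solution surface.
best = min((round(x * 0.1, 1) for x in range(0, 31)), key=distance)
print(best)  # 1.5
```

With one multiplier per global cluster, a proper optimizer replaces the grid search, and the solution surface is inspected for local minima before cross-validating on revenues.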

SLIDE 38

Trading revenues from alt data: Basic strategy

  • Input
      • Three scores
          • Our surprise estimate as a % of revenues
          • Revenue estimate confidence bands – nonparametric
          • Expected stock sensitivity to revenue surprises
      • Desired trading window
          • Around announcement
          • As data comes in and scores are updated
  • Output
      • Positions
      • Quantities
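A hedged sketch of how the three input scores might map to the output; the sizing rule below (direction from the surprise sign, conviction shrunk by the width of the confidence band, scaled by sensitivity) is an invented toy, not the deck's model:

```python
def position_size(surprise_pct, conf_band, sensitivity, gross_limit=1_000_000):
    """Map the three slide inputs to a signed dollar position.

    surprise_pct: estimated revenue surprise as a fraction of revenues.
    conf_band: (low, high) nonparametric band around the estimate.
    sensitivity: expected stock sensitivity to revenue surprises.
    """
    lo, hi = conf_band
    if lo <= 0.0 <= hi:
        return 0.0  # band straddles zero: no conviction, no trade
    band_width = hi - lo
    conviction = abs(surprise_pct) / (abs(surprise_pct) + band_width)
    direction = 1 if surprise_pct > 0 else -1
    return gross_limit * conviction * sensitivity * direction

# +3% revenue surprise, a tight band, a moderately sensitive stock.
print(round(position_size(0.03, (0.01, 0.05), sensitivity=0.5)))  # 214286
```

Quantities then follow from dividing the dollar position by price, and the position is refreshed as new data updates the scores inside the trading window.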
SLIDE 39

Prediction: In 3 to 6 years, the same kind of presentation would not have the word “alternative” in the title.

SLIDE 40

Questions?

Gene Ekster
geneman at Google’s email service
10/23/2014