Alternative Data in Finance
Example: Lodging Key Metrics
Occupancy x Room Rate ~ Revenues
- Occupancy proxy: number of lights on
- Room rate proxy: online room rates
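A minimal sketch of the lodging identity above. The function name, the room count, and every number are hypothetical; the lit-window share stands in for occupancy and a scraped online rate stands in for the room rate.

```python
def estimate_room_revenue(rooms, nights, occupancy_proxy, avg_room_rate):
    """Revenue ~ occupancy x room rate, scaled by available room-nights."""
    if not 0.0 <= occupancy_proxy <= 1.0:
        raise ValueError("occupancy_proxy must be a share between 0 and 1")
    return rooms * nights * occupancy_proxy * avg_room_rate

# Hypothetical inputs: 200 rooms, 90 nights, 70% of windows lit, 150.00 online rate.
estimate = estimate_room_revenue(200, 90, 0.70, 150.0)
```

The estimate is only as good as the proxies, so in practice it would be compared against reported revenues before being trusted.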
Alternative Data
- 1. Point of sale transactions
- 2. Online behavior
- 3. Purchases (online; brick and mortar)
- 4. Obscure public records
- 5. Drone footage analysis ;)
- 6. Etc etc etc
Supply Chain
- 1. Data Vendors / Suppliers
- 2. Aggregators and Analysts
- 3. Clients / Funds
Outline
- Basic Example (done)
- What's Alternative Big Data (done)
- Sourcing
- Compliance and ethics
- Predicting revenue and other uses
- Walkthrough of common technical challenges
- Basic trading strategy
- Q & A
Data Sourcing
- Direct data gathering
- Data vendors
- Just download the data (JDD)
Data gathering / Sourcing
- Harvest the web
- Primary Research
Harvesting: Build or Buy?
Build
- Control over compliance procedures
- All IP and harvesting target information stays in house
- Complete control over costs
Buy
- Faster to scale
- Back data
- Risk mitigated by an intermediary
- Some structuring of the data done by vendor
- Leverage vendors’ expertise in the data and spidering
* Tip for finding web harvesting firms: Look on LinkedIn for folks with web scraping skills and see who they work for.
Harvesting: Semantic web
- Diffbot recognizes the content of web pages
- Compares against schema.org’s structures
- Automatically collects structured data without explicit structure definitions
- Adjusts for changes in page layouts
Primary Research
- Expert networks
- Surveys
- New ways to look at the world
- Receipts
- Serial numbers
- Alexa or other web monitoring tools
- Google trends
- Classified
- Drone footage
Evaluating Datasets
- Scarcity
- How widely used or marketed is it?
- Granularity
- Time
- Aggregation levels
- How structured is it?
- Coverage
- Sectors / Stocks – Hedge fund motels?
- Geo
* Creating a standardized quantitative scoring system or ROI matrix to evaluate datasets based on these criteria is a worthwhile endeavor
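One possible shape for such a scoring system, as a sketch only. The criteria names follow the slide; the weights, the 1 to 5 rating scale, and the two candidate datasets are made-up assumptions.

```python
# Hypothetical criteria weights (assumption: they sum to 1; tune to your own priorities).
WEIGHTS = {"scarcity": 0.4, "granularity": 0.3, "coverage": 0.3}

def score_dataset(ratings, weights=WEIGHTS):
    """Weighted average of 1-5 ratings, one rating per criterion."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    return sum(weights[c] * ratings[c] for c in weights)

# Made-up candidate datasets rated 1 (poor) to 5 (excellent).
candidates = {
    "card_panel": {"scarcity": 2, "granularity": 5, "coverage": 4},
    "web_prices": {"scarcity": 4, "granularity": 3, "coverage": 2},
}
ranked = sorted(candidates, key=lambda name: score_dataset(candidates[name]), reverse=True)
```

Dividing each score by the dataset's cost would turn the same table into a rough ROI ranking.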
Evaluating Vendors
- Companies monetizing their exhaust data
- High quality high margin revenue
- Upstream insights from buyer
- Traditional data vendors
- Survey data
- Financial data aggregation
- Hybrids
- 1010 / ITG
Free Datasets
- http://aws.amazon.com/datasets
- http://databib.org
- http://datacite.org
- http://figshare.com
- http://linkeddata.org
- http://reddit.com/r/datasets
- http://thedatahub.org (alias http://ckan.net)
- http://quandl.com
- http://enigma.io
- Hundreds more: http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
High opportunity datasets
- International
- Asia
- Latam
- Insight into margins
- Companies are more EPS surprise sensitive than revenue surprise sensitive
- COGS
- SG&A
- Etc
- B2B
Compliance overview
- Intent / Ethics
- Regulatory
Compliance overview
[Diagram labels: Data Vendor; Restricted Environment; PII Scrubbing Process / Encrypted Archiving; Production Environment; Organization]
Compliance overview: Guidelines / Control Frameworks
- NIST 800-122
- GLBA (Gramm-Leach-Bliley Act)
- COBIT 5
- COSO 2013
Compliance overview
- Just use regular expressions
^(?:(?=.*\d)(?=.*[A-Z])(?=.*[a-z])|(?=.*\d)(?=.*[^A-Za-z0-9])(?=.*[a-z])|(?=.*[^A-Za-z0-9])(?=.*[A- Z])(?=.*[a-z])|(?=.*\d)(?=.*[A-Z])(?=.*[^A-Za-z0-9]))(?!.*(.)\1{2,})[A-Za-z0- 9!~<>,;:_=?*+#."&§%°()\|\[\]\-\$\^\@\/]{8,32} [a-zA-Z]:|\\)\\)?(((\.)|(\.\.)|([^\\/:*?"|<>. ](([^\\/:*?"|<>. ])|([^\\/:*?"|<>]*[^\\/:*?"|<>. ]))?))\\)*[^\\/:*?"|<>. ](([^\\/:*?"|<>. ])|([^\\/:*?"|<>]*[^\\/:*?"|<>. ]))? ((25[0-5]|2[0-4][0-9]|19[0-1]|19[3-9]|18[0-9]|17[0-1]|17[3- 9]|1[0-6][0-9]|1[1-9]|[2-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9%[0-9A-Fa- f]{2}|[-()_.!~*';/?:@&=+$,A-Za-z0- 9])+)([).!';/?:^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0- 9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$
* Use Regexp Buddy.
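By contrast, a deliberately small scrubbing sketch. The two patterns, the `scrub` function, and the salt are assumptions for illustration, not the patterns on the slide, and they will miss edge cases; treat this as a starting point rather than a complete PII scrubber.

```python
import hashlib
import re

# Deliberately simple patterns (assumptions for illustration; they will miss edge cases).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def _token(kind, value, salt):
    """Replace a match with a salted hash so records stay joinable but not readable."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]
    return f"<{kind}:{digest}>"

def scrub(text, salt="hypothetical-salt"):
    """Swap email addresses and IPv4 addresses for hashed placeholders."""
    text = EMAIL_RE.sub(lambda m: _token("EMAIL", m.group(0).lower(), salt), text)
    text = IPV4_RE.sub(lambda m: _token("IP", m.group(0), salt), text)
    return text

clean = scrub("Order from jane.doe@example.com at 192.168.0.12")
```

Hashing rather than deleting keeps the same user linkable across records inside the restricted environment, which matters for panel work.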
Compliance overview: Web Harvesting Precedent Cases
- Major cases (and the majority of cases). It's uncharted territory
- Feist Publications, Inc. v. Rural Telephone Service Co.
- Ryanair scraping cases
- eBay v. Bidder's Edge
- Intel v. Hamidi
- Cases discussing browsewrap vs. clickwrap
- Cvent, Inc. v. Eventbrite, Inc.
- Craigslist v. 3Taps
- These do not apply to investment research
Compliance overview
- Respect the website’s TOS, especially if it is presented as a clickwrap
- Have a sensible web harvesting policy
- Address incoming complaints
- Limit the number of HTTP requests
- Stay current on laws and cases
- Explicitly address headline risk and regulatory risk; create a cost-benefit analysis for headline risk
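A sketch of how the request-limiting bullet might look in code. The class name, the user agent string, and the 2-second interval are hypothetical, and checking robots.txt is an added assumption (the slide only mentions TOS); it is one machine-readable signal, not a substitute for reading the terms.

```python
import time
from urllib.robotparser import RobotFileParser

class PoliteFetchPolicy:
    """Checks robots.txt rules and enforces a minimum delay between requests."""

    def __init__(self, robots_lines, user_agent="example-research-bot",
                 min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.parser = RobotFileParser()
        self.parser.parse(robots_lines)  # rules supplied by the caller; no network access here
        self.user_agent = user_agent
        self.min_interval = min_interval
        self._clock, self._sleep = clock, sleep
        self._last_request = None

    def allowed(self, url):
        """True if the parsed robots.txt rules permit this user agent to fetch url."""
        return self.parser.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Sleep just long enough to keep min_interval seconds between requests."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_interval - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now

policy = PoliteFetchPolicy(["User-agent: *", "Disallow: /private/"])
```

A harvester would call `policy.allowed(url)` before each fetch and `policy.wait_turn()` immediately before sending the request.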
Generating value with alternative data
- Revenue surprise estimates
- Operating GAAP measures
- Non GAAP measures
- Churn, etc
- Fully or partially automated quant strategies
- Non equity asset classes
- PE could benefit from the same operating metrics for diligence
- PM Development and Big Data Thought Leadership
- Strategic Investments
- Marketing Tool for Raising Capital and Talent Recruitment
Workflow and Process
Data
- Data Partners
- Web Collection
- Storage optimization
Normalization
- Cleansing
- Benchmarking
- De-biasing, Enrichment
Modeling
- GAAP / Operating Metrics
- Quant Signals
- Investment Thesis Insights
Deliverable
- Metrics Reporting
- R&D Portfolio
- Published Signal
[Diagram labels: Data Analysts; High Performance Computing R&D Quant; Visualizations Sector Research; Data Vendors Third Party Sources; Published Signal Interpretive Research Metrics; L/S Teams & Quant Teams Raw Data Production; Data Acquisition]
The shifting bias longitudinal panel problem
Full Panel
[Figure: User 1 to User 10 by month, Jan to Dec]
Panel with user add and churn (missing data MAR)
[Figure: User 1 to User 10 by month, Jan to Dec]
The 200k and the ~800k are different
- The complete panel: ~200k users
- Users who have the second year of data, but not the first
- Dashed line: 95% confidence, N(μ, σ²)
Solutions:
- Imputation
- Complete case analysis
- Weighting methods
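To make the weighting option concrete, a post-stratification sketch. The segment labels, spend figures, and target shares are made up; the deck does not say which weighting method was used at this stage.

```python
from collections import defaultdict

def weighted_mean_spend(users, target_shares):
    """Post-stratified mean: average spend within each segment, then mix the
    segment means using target_shares instead of the sample's own mix."""
    totals, counts = defaultdict(float), defaultdict(int)
    for segment, spend in users:
        totals[segment] += spend
        counts[segment] += 1
    missing = set(target_shares) - set(counts)
    if missing:
        raise ValueError(f"no observed users in segments: {sorted(missing)}")
    return sum(share * totals[seg] / counts[seg] for seg, share in target_shares.items())

# Made-up month: heavy spenders are over-represented after churn (3 of 4 users).
month = [("heavy", 100.0), ("heavy", 120.0), ("heavy", 110.0), ("light", 20.0)]
raw_mean = sum(spend for _, spend in month) / len(month)
adjusted = weighted_mean_spend(month, {"heavy": 0.5, "light": 0.5})
```

Here the raw mean is pulled up by the shifted user mix, while the adjusted mean holds the mix fixed at the target shares.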
[Figure: Total Spend Index by month, Jan to Dec]
Complete Panel and the rest of users are different
[Figure: Panels 1 to 10 laid out across Jan to Dec, each with >200K users. Panel 1: 680K; Panel 2: 720K; Panel 3: 740K; Panel 4: 760K]
- Neighboring panels share most of their users (~90% overlap)
- The further apart the panels, the less user overlap: P1 vs. P22 have only ~32% overlap, so most users are different
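Overlap figures like these can be computed with plain set operations. A sketch on toy panels; the user IDs and panel sizes are invented and far smaller than the real panels.

```python
def overlap_share(panel_a, panel_b):
    """Share of panel_a's users who also appear in panel_b."""
    a, b = set(panel_a), set(panel_b)
    if not a:
        raise ValueError("panel_a is empty")
    return len(a & b) / len(a)

# Toy rolling panels: each later panel drops some users and adds new ones.
panels = {
    1: {"u1", "u2", "u3", "u4", "u5"},
    2: {"u2", "u3", "u4", "u5", "u6"},
    3: {"u4", "u5", "u6", "u7", "u8"},
}
adjacent = overlap_share(panels[1], panels[2])
distant = overlap_share(panels[1], panels[3])
```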
[Figure: Sum of Cnt and Sum of DPT time series for User A, User B, User C, User D]
Multivariate Time Series Clustering
The pdc package (R) computes a permutation distribution for each time series, which is a measure of the series' complexity. Similarity between time series is defined as the distance between their permutation distributions. This lets us group users on multiple variables over time.
library(pdc)
clust <- pdclust(datamatrix, m=4)
plot(clust, cols=c("red", "blue", "red", "blue"))
[Figure: clustering plot for User A, User B, User C, User D]
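To show what a permutation distribution is, a univariate sketch in Python. It is not the pdc implementation: the Hellinger distance is an assumption chosen for simplicity, the embedding dimension of 3 is arbitrary, and the three series are toy data.

```python
import math
from collections import Counter
from itertools import permutations

def permutation_distribution(series, m=3):
    """Relative frequency of each ordinal pattern of length m in the series."""
    n_windows = len(series) - m + 1
    if n_windows < 1:
        raise ValueError("series is shorter than the embedding dimension m")
    counts = Counter(
        tuple(sorted(range(m), key=lambda k: series[i + k]))
        for i in range(n_windows)
    )
    return {p: counts[p] / n_windows for p in permutations(range(m))}

def hellinger(p, q):
    """Hellinger distance between two distributions over the same patterns."""
    return math.sqrt(sum((math.sqrt(p[k]) - math.sqrt(q[k])) ** 2 for k in p) / 2)

rising = [1, 2, 3, 4, 5, 6, 7, 8]
also_rising = [10, 20, 40, 80, 160, 320, 640, 1280]
zigzag = [1, 5, 2, 6, 3, 7, 4, 8]

d_same = hellinger(permutation_distribution(rising), permutation_distribution(also_rising))
d_diff = hellinger(permutation_distribution(rising), permutation_distribution(zigzag))
```

Because only the ordering of values inside each window matters, the two rising series are indistinguishable despite very different scales, which is why the approach suits users with different spend levels but similar behavior.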
User dropout in a longitudinal panel
- We cluster each panel
- Can use multivariate time series clustering like pdclust
- Cluster on number of transactions and average transaction amount (low-covariance features)
- Each panel’s cluster boundaries are independently defined
[Figure: Panel 1 and Panel 2, each split into Cluster 1 to Cluster 5, January to October]
Create Global Clusters
[Figure: timelines for User A, User B, User C, January to December]
Our data has the following toy example:
- User B and User C have no overlapping data
- User A overlaps with both User B and User C
- During each overlap period, A shows the same pattern of behavior as B and as C
Our methodology needs to have the following property:
- A and C are clustered together, and A and B are clustered together
- Thus B and C are also clustered together
K-means, or even hclust, cannot infer that B and C belong in the same cluster
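The required transitive property can be illustrated with a union-find structure. This is only a sketch of the property, not the method the deck goes on to use, and the list of similar pairs is assumed to come from some earlier per-panel comparison.

```python
class DisjointSet:
    """Union-find: merging A with B and A with C puts B and C in one group."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        """Return the representative of x's group, adding x if unseen."""
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        """Merge the groups containing x and y."""
        self.parent[self.find(x)] = self.find(y)

# Pairs observed behaving alike during their overlap periods (toy example from the text).
similar_pairs = [("User A", "User B"), ("User A", "User C")]
groups = DisjointSet()
for left, right in similar_pairs:
    groups.union(left, right)
same_cluster = groups.find("User B") == groups.find("User C")
```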
Global Clusters with Latent Class Analysis
Latent Class Analysis
- library(poLCA)
- mod5 <- poLCA(f, data=wclusters1, nclass=5, maxiter=50000, nrep=10, na.rm=FALSE)
USER  REGION     PANEL1  PANEL2  PANEL3
101   NORTHEAST  A       B       B
102   SOUTHEAST  A       A       D
103   SOUTHEAST  NA      B       B
104   PACIFIC    C       C       E
105   NORTHEAST  D       D       C
106   NORTHEAST  E       NA      NA
107   NORTHEAST  A       A       B
Global Clusters with Latent Class Analysis
- Specialized for categorical data.
- Iteratively takes each response pattern and assigns that pattern a probability of being in some latent class.
- Adjusts that probability based on associations in the data.
- Co-occurring patterns are paired together.
USER  REGION     PANEL1  PANEL2  PANEL3  GLOBAL.Clust
101   NORTHEAST  A       B       B       A
102   SOUTHEAST  A       A       D       B
103   SOUTHEAST  NA      B       B       A
104   PACIFIC    C       C       E       D
105   NORTHEAST  D       D       C       E
106   NORTHEAST  E       NA      NA      C
107   NORTHEAST  A       A       B       B
Global Clusters with Latent Class Analysis - Graph
- Create a membership roster: in how many of the 22 panels does a pair of users show up in the same cluster?
- Cluster this data as a network to create second-order, global clusters
- I.e., if User B and User D share a cluster in 20 of 22 panels, they should be put into the same second-order cluster
Number of clusters with shared membership (columns: User A, User B, User C, User D)
User A: 22
User B: 14 22
User C: 4 22
User D: 3 20 9 22
* User B and User D (20 of 22) should be in the same global cluster
* If cluster probabilities rather than hard memberships are derived from the in-panel clusters, those can be used instead of hard mutual-membership counts.
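A sketch of building the membership roster from per-panel cluster labels. The three toy panels, the single-letter user names, and the 0.8 linking threshold are assumptions for illustration; the deck does not state a threshold.

```python
from collections import Counter
from itertools import combinations

def shared_membership(panel_labels):
    """Count, for each pair of users, the panels where both have the same cluster label."""
    counts = Counter()
    for labels in panel_labels:  # one {user: cluster} dict per panel
        for u, v in combinations(sorted(labels), 2):
            if labels[u] == labels[v]:
                counts[(u, v)] += 1
    return counts

def link_pairs(counts, n_panels, min_share=0.8):
    """Pairs that share a cluster in at least min_share of all panels."""
    return sorted(pair for pair, c in counts.items() if c / n_panels >= min_share)

# Toy data: 3 panels instead of 22; users missing from a panel are simply absent.
panels = [
    {"A": 1, "B": 2, "D": 2},
    {"A": 1, "B": 3, "C": 1, "D": 3},
    {"B": 1, "C": 2, "D": 1},
]
counts = shared_membership(panels)
linked = link_pairs(counts, n_panels=len(panels))
```

The linked pairs form the edges of the network; any graph clustering or connected-components step can then turn them into second-order clusters.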
User dropout in a longitudinal panel - Results
[Figures: example memberships in the first two panels and the last panel; global cluster memberships]
User dropout in a longitudinal panel - Results
[Figures: spend distribution, pre-weighted and post-weighted, Y1 Q1 and Y2 Q4]
Next steps after bias stabilization
- Triangulate “stable” longitudinal data
- External benchmarks
- Census
- CE Survey
- “Pure Comps” revenues
- Create a distance metric summing across errors from our data to benchmarks
- Rev Y/Y (leave some out for CV)
- Census geo proportions
- Spend ratios
- Can manually weight each distance metric before aggregation to reflect its relative importance
- Use the same global clusters as before
- Optimize cluster multiplier to minimize distance
- Examine solution surface
- Cross-validate on revenues
- Create company-specific models only from representative data (avoids spurious correlations)
* The aim of company-specific models is not to reduce data bias, but to model revenues from already representative data
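A sketch of the distance-and-multiplier idea with two clusters and two benchmarks. Every field name, number, and metric weight is made up, and a one-dimensional grid search stands in for whatever optimizer is actually used.

```python
def distance_to_benchmarks(multipliers, clusters, benchmarks, metric_weights):
    """Weighted sum of absolute errors between re-weighted panel stats and benchmarks."""
    pairs = list(zip(multipliers, clusters))
    prior = sum(m * c["spend_prior"] for m, c in pairs)
    current = sum(m * c["spend_current"] for m, c in pairs)
    users = sum(m * c["users"] for m, c in pairs)
    northeast = sum(m * c["users"] for m, c in pairs if c["region"] == "NORTHEAST")
    errors = {
        "rev_yoy": abs(current / prior - 1 - benchmarks["rev_yoy"]),
        "geo_share": abs(northeast / users - benchmarks["northeast_share"]),
    }
    return sum(metric_weights[name] * err for name, err in errors.items())

def grid_search(clusters, benchmarks, metric_weights, grid):
    """Fix the first cluster's multiplier at 1 and scan the second over grid."""
    return min(
        ((1.0, m) for m in grid),
        key=lambda ms: distance_to_benchmarks(ms, clusters, benchmarks, metric_weights),
    )

# Hypothetical global clusters and external benchmarks.
clusters = [
    {"users": 100, "region": "NORTHEAST", "spend_prior": 1000.0, "spend_current": 1100.0},
    {"users": 100, "region": "SOUTHEAST", "spend_prior": 1000.0, "spend_current": 1300.0},
]
benchmarks = {"rev_yoy": 0.25, "northeast_share": 0.25}
metric_weights = {"rev_yoy": 1.0, "geo_share": 1.0}
best = grid_search(clusters, benchmarks, metric_weights, grid=[0.5, 1.0, 2.0, 3.0, 4.0])
```

Evaluating the distance at every grid point, rather than keeping only the minimum, is one simple way to examine the solution surface.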
Trading revenues from alt data: Basic strategy
- Input
- Three scores
- Our surprise estimate as a % of revs
- Rev estimate confidence bands – nonparametric
- Expected stock sensitivity to rev. surprises
- Desired trading window
- Around announcement
- As data comes in and scores are updated
- Output
- Positions
- Quantities
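One way the three scores could map to positions and quantities, purely as a sketch. The stay-flat rule, the linear scaling, the 10% cap, and the notional limit are all assumptions, not the strategy described in the deck.

```python
def target_position(surprise_pct, band_low, band_high, sensitivity, max_notional=1_000_000.0):
    """Turn the three scores into a signed dollar position.

    surprise_pct: our revenue surprise estimate as a fraction of revenues
    band_low, band_high: nonparametric confidence band around that estimate
    sensitivity: expected stock move per unit of revenue surprise
    """
    # Assumption: stay flat unless the whole confidence band is on one side of zero.
    if band_low <= 0.0 <= band_high:
        return 0.0
    expected_move = surprise_pct * sensitivity
    # Assumption: scale linearly and cap the absolute expected move at 10%.
    scale = max(-1.0, min(1.0, expected_move / 0.10))
    return scale * max_notional

def to_quantity(notional, price):
    """Convert a dollar position into a whole number of shares (rounded to nearest)."""
    return int(round(notional / price))

long_pos = target_position(0.02, 0.005, 0.035, 2.0)
flat_pos = target_position(0.01, -0.01, 0.03, 2.0)
shares = to_quantity(long_pos, 50.0)
```

Re-running `target_position` as new data arrives and the scores update gives the "as data comes in" variant of the trading window.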