SLIDE 1 Nov / 14 / 16
Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch
Nick Pentreath
SLIDE 2
About
§ @MLnick
§ Principal Engineer, IBM
§ Apache Spark PMC
§ Focused on machine learning
§ Author of Machine Learning with Spark
SLIDE 3
Agenda
§ Recommender systems & the machine learning workflow
§ Data modelling for recommender systems
§ Why Spark, Kafka & Elasticsearch?
§ Kafka & Spark Streaming
§ Spark ML for collaborative filtering
§ Deploying & scoring recommender models with Elasticsearch
§ Monitoring, feedback & re-training
§ Scaling model serving
§ Demo
SLIDE 4
Recommender Systems & the ML Workflow
SLIDE 5
Recommender Systems
Overview
SLIDE 6
The Machine Learning Workflow
Perception
Data ??? Machine Learning ??? $$$
SLIDE 7 The Machine Learning Workflow
Reality
Diagram stages: Data; Ingest; Data Processing (transformation & engineering); Model Training (evaluation); Deploy (models); Live System (data, evaluation); Feedback Loop
Technologies shown: Stream (Kafka), Spark DataFrames, Spark ML, Various, ???
Missing piece!
SLIDE 8 The Machine Learning Workflow
Recommender Version
Diagram stages: Data; Ingest; Data Processing (aggregation, handle implicit data); Model Training (evaluation); Deploy (complexity); Live System (recommendations); Feedback => another Event Type
Technologies shown: Stream (Kafka), Spark DataFrames, Spark ML, Elasticsearch, Metadata, Elasticsearch
SLIDE 9
Data Modeling for Recommender Systems
SLIDE 10
Data model
User and Item Metadata
SLIDE 11 System Requirements
User and Item Metadata
Filtering & Grouping Business Rules
SLIDE 12 User interactions
Anatomy of a User Event
Implicit preference data
- Page view
- eCommerce - cart, purchase
- Media - preview, watch, listen
Intent data
Explicit preference data
Social network interactions
SLIDE 13 Data model
Anatomy of a User Event
SLIDE 14 How to handle implicit feedback?
Anatomy of a User Event
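One common way to handle implicit feedback is to map each event type to a weight and sum the weights per user-item pair before model training. The sketch below assumes that approach. The event fields (`user_id`, `item_id`, `event_type`) and the weight values are hypothetical and are not taken from the talk.

```python
from collections import defaultdict

# Hypothetical weights per event type; these values are illustrative only.
EVENT_WEIGHTS = {"view": 1.0, "cart": 3.0, "purchase": 5.0}


def aggregate_implicit(events):
    """Sum event weights per (user_id, item_id) pair."""
    totals = defaultdict(float)
    for event in events:
        weight = EVENT_WEIGHTS.get(event["event_type"])
        if weight is None:
            continue  # skip event types that have no assigned weight
        totals[(event["user_id"], event["item_id"])] += weight
    return dict(totals)


# Hypothetical events, shaped as they might arrive from the event stream.
events = [
    {"user_id": 1, "item_id": 10, "event_type": "view"},
    {"user_id": 1, "item_id": 10, "event_type": "cart"},
    {"user_id": 1, "item_id": 11, "event_type": "purchase"},
    {"user_id": 2, "item_id": 10, "event_type": "share"},  # no weight defined
]
ratings = aggregate_implicit(events)
```

The resulting `(user, item) -> weight` map has the shape a collaborative filtering model expects as input: one aggregated preference value per user-item pair.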
SLIDE 15
Why Kafka, Spark & Elasticsearch?
SLIDE 16
Why Kafka?
Scalability
§ De facto standard for a centralized enterprise message / event queue
Integration
§ Integrates with just about every storage & processing system
§ Good Spark Streaming integration – 1st class citizen
§ Including for Structured Streaming (but still very new & rough!)
SLIDE 17
Why Spark?
DataFrames
§ Events & metadata are “lightly structured” data
§ Suited to DataFrames
§ Pluggable external data source support
Spark ML
§ Spark ML pipelines – including scalable ALS model for collaborative filtering
§ Implicit feedback & NMF in ALS
§ Cross-validation
§ Custom transformers & algorithms
SLIDE 18
Why Elasticsearch?
Storage
§ Native JSON
§ Scalable
§ Good support for time-series / event data
§ Kibana for data visualisation
§ Integration with Spark DataFrames
Scoring
§ Full-text search
§ Filtering
§ Aggregations (grouping)
§ Search ~== recommendation (more later)
SLIDE 19
Kafka for Recommender Systems
SLIDE 20 Event Data Pipeline
Kafka
Spark Streaming
Item analytics & aggregation
User analytics & aggregation
Event store
Dashboards
SLIDE 21 Write to Event Store
Spark Streaming Event store
SLIDE 22 Kibana Dashboards
Spark Streaming
Dashboards
SLIDE 23 Item Metadata Analytics
Spark Streaming
Item analytics & aggregation
Aggregated activity metrics
SLIDE 24 User Metadata Analytics
Spark Streaming
User analytics & aggregation
Aggregated activity metrics & item exclusions
SLIDE 25
Structured Streaming
Status
§ Still early days
§ Initial Kafka support in Spark 2.0.2
§ No ES support yet – not clear if it will be a full-blown datasource or ForeachWriter
§ For now, you can create a custom ForeachWriter for your needs
SLIDE 26
Spark ML for Collaborative Filtering
SLIDE 27
Matrix Factorization
Collaborative Filtering
Figure: a ratings matrix (sample entries 3, 1, 5, 2, 1, 2, 1) shown alongside two matrices of real-valued factors.
SLIDE 28
Prediction
Collaborative Filtering
Figure: a ratings matrix (sample entries 3, 1, 5, 2, 1, 2, 1) shown alongside two matrices of real-valued factors.
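In a matrix factorization model, the predicted preference of a user for an item is the dot product of that user's factor vector and that item's factor vector. The sketch below illustrates this in plain Python. The factor values and the names `user_factors`, `item_factors`, `predict` and `top_n` are made up for illustration and do not come from the slides.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def predict(user_factors, item_factors, user, item):
    """Predicted preference = user factor vector . item factor vector."""
    return dot(user_factors[user], item_factors[item])


def top_n(user_factors, item_factors, user, n):
    """Score every item for one user and return the n best item ids."""
    scores = {
        item: dot(user_factors[user], vector)
        for item, vector in item_factors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical 2-dimensional factors.
user_factors = {"alice": [1.0, 0.5]}
item_factors = {"a": [2.0, 0.0], "b": [0.0, 2.0], "c": [1.0, 1.0]}

score_a = predict(user_factors, item_factors, "alice", "a")
best_two = top_n(user_factors, item_factors, "alice", 2)
```

Scoring every item per query like this is the brute-force approach; it is the step the rest of the talk moves into Elasticsearch.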
SLIDE 29
Loading Data in Spark ML
Collaborative Filtering
SLIDE 30
Implicit Preference Data
Alternating Least Squares
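For reference, the formulation in Collaborative Filtering for Implicit Feedback Datasets (listed in the references) can be written as below. The symbols follow that paper rather than these slides: $r_{ui}$ is the observed interaction strength, $p_{ui}$ the binary preference, $c_{ui}$ the confidence, $x_u$ and $y_i$ the user and item factor vectors, and $\alpha$, $\lambda$ are tuning parameters.

```latex
\[
p_{ui} =
\begin{cases}
1 & \text{if } r_{ui} > 0 \\
0 & \text{if } r_{ui} = 0
\end{cases}
\qquad
c_{ui} = 1 + \alpha \, r_{ui}
\]
\[
\min_{x_{*},\, y_{*}} \;
\sum_{u,i} c_{ui} \bigl( p_{ui} - x_u^{\top} y_i \bigr)^2
+ \lambda \Bigl( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \Bigr)
\]
```

Spark ML's ALS exposes this mode through its `implicitPrefs` and `alpha` parameters.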
SLIDE 31
Deploying & Scoring Recommendation Models
SLIDE 32 Full-text Search & Similarity
Prelude: Search
Figure: the query “cat videos” flows through three stages. Analysis: Term vectors (1 ⋯ 1 ⋯). Scoring: Similarity against the document term vectors. Ranking: Sort results.
SLIDE 33 Can we use the same machinery?
Recommendation
Figure: the search pipeline (Analysis: Term vectors; Scoring: Similarity; Ranking: Sort results) with a User (or item) vector (1.2 ⋯ −0.2 0.3) as the query. A “?” marks the open step.
Dot product & cosine similarity … the same as we need for recommendations!
SLIDE 34 Delimited Payload Filter
Elasticsearch Term Vectors
Raw vector
1.2 ⋯ −0.2 0.3
Term vector with payloads
0|1.2 ⋯ 3|-0.2 4|0.3
Custom analyzer
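A minimal sketch of the raw-vector-to-payload-string conversion shown above, assuming the `|` delimiter from the slide. Each vector index becomes a "term" and each value becomes that term's payload. The function name `to_payload_string` is hypothetical.

```python
def to_payload_string(vector, delimiter="|"):
    """Encode a factor vector as 'index|value' tokens separated by spaces."""
    return " ".join(
        f"{index}{delimiter}{value}" for index, value in enumerate(vector)
    )


# Hypothetical 4-dimensional factor vector.
encoded = to_payload_string([1.2, 0.5, -0.2, 0.3])
```

A string in this form can be indexed into a field whose custom analyzer splits on whitespace and applies the delimited payload filter, so the values are stored as term payloads.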
SLIDE 35 Custom scoring function
Elasticsearch Scoring
- Native script (Java), compiled for speed
- Scoring function computes dot product by:
  § For each document vector index (“term”), retrieve payload
  § score += payload * query(i)
- Normalizes with query vector norm and document vector norm for cosine similarity
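The scoring logic described on this slide can be sketched in plain Python. This is an illustration of the arithmetic only, not the plugin's Java code: it parses an `index|value` string instead of reading Lucene payloads, and the function names are hypothetical.

```python
import math


def parse_payloads(text, delimiter="|"):
    """Turn 'index|value' tokens into an {index: payload} mapping."""
    payloads = {}
    for token in text.split():
        term, value = token.split(delimiter)
        payloads[int(term)] = float(value)
    return payloads


def score(doc_text, query, cosine=False):
    """Dot product of document payloads with the query vector.

    With cosine=True the result is normalized by both vector norms.
    """
    payloads = parse_payloads(doc_text)
    total = 0.0
    for index, payload in payloads.items():
        total += payload * query[index]  # score += payload * query(i)
    if not cosine:
        return total
    doc_norm = math.sqrt(sum(p * p for p in payloads.values()))
    query_norm = math.sqrt(sum(q * q for q in query))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0  # cosine similarity is undefined for zero vectors
    return total / (doc_norm * query_norm)


dot_score = score("0|1.0 1|2.0", [3.0, 4.0])
cos_score = score("0|1.0 1|0.0", [2.0, 0.0], cosine=True)
```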
SLIDE 36 Can we use the same machinery?
Recommendation
Figure: the search pipeline with each stage filled in. Analysis: Delimited payload filter, producing term vectors from the factor vectors (−1.1 1.3 ⋯ 0.4, and so on). Scoring: Custom scoring function against the User (or item) vector (1.2 ⋯ −0.2 0.3). Ranking: Sort results.
SLIDE 37
We get search engine functionality for free!
Elasticsearch Scoring
SLIDE 38
Deploying to Elasticsearch
Alternating Least Squares
SLIDE 39
Monitoring & Feedback
SLIDE 40
Logging Recommendations Served
System Events
SLIDE 41
Logging Recommendation Actions
System Events
SLIDE 42 Tracking Performance
Kafka
Spark Streaming
Impression capping / fatigue
Performance monitoring & alerts
Event store
Dashboards
SLIDE 43
Scaling Model Scoring
SLIDE 44 Scoring Performance
Chart: Scoring time per query, by factor dimension & number of items
X-axis: Size of item set (100,000; 1,000,000); Y-axis: Time (ms); series: k=20, k=50, k=100
*3x nodes, 30x shards
SLIDE 45 Scoring Performance
Chart: Scoring time per query, by number of shards & number of items
X-axis: Size of item set (100,000; 1,000,000); Y-axis: Time (ms); series: 10 shards, 30 shards, 60 shards, 90 shards
*3x nodes, k=50
Increasing number of shards
SLIDE 46 Scoring Performance
Locality Sensitive Hashing
- LSH hashes each input vector into L “hash tables”. Each table contains a “hash signature” created by applying k hash functions.
- Standard for cosine similarity is Sign Random Projections
- At indexing time, create a “bucket” by combining hash table id and hash signature
- Store buckets as part of item model metadata
- At scoring time, filter candidate set using term filter on buckets of query item
- Tune LSH parameters to trade off speed / accuracy
- LSH coming soon to Spark ML – SPARK-5992
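The bucketing steps above can be sketched with sign random projections in plain Python. This is a minimal illustration under stated assumptions, not the talk's implementation: the bucket format `tableId_signature`, the parameter values and all function names are hypothetical, and the candidate filter is an in-memory stand-in for an Elasticsearch term filter.

```python
import random


def make_hyperplanes(num_tables, num_hashes, dim, seed=42):
    """Draw num_tables sets of num_hashes random Gaussian hyperplanes."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hashes)]
        for _ in range(num_tables)
    ]


def lsh_buckets(vector, hyperplanes):
    """One bucket per hash table: table id plus the sign-bit signature."""
    buckets = []
    for table_id, planes in enumerate(hyperplanes):
        bits = "".join(
            "1" if sum(p * x for p, x in zip(plane, vector)) >= 0 else "0"
            for plane in planes
        )
        buckets.append(f"{table_id}_{bits}")
    return buckets


def candidates(query_buckets, item_buckets):
    """Keep items that share at least one bucket with the query."""
    query = set(query_buckets)
    return [
        item for item, buckets in item_buckets.items() if query & set(buckets)
    ]


# Hypothetical item factor vectors; L = 3 tables, k = 4 hash functions.
planes = make_hyperplanes(num_tables=3, num_hashes=4, dim=3)
items = {"a": [1.0, 0.2, -0.5], "b": [-0.9, 0.1, 0.7], "c": [0.8, 0.3, -0.4]}
item_buckets = {item: lsh_buckets(vec, planes) for item, vec in items.items()}
found = candidates(lsh_buckets(items["a"], planes), item_buckets)
```

Only the items in `found` then need full dot-product scoring. Raising the number of tables admits more candidates, and raising the number of hash functions per table admits fewer, which is the speed / accuracy trade-off mentioned above.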
SLIDE 47 Scoring Performance
Locality Sensitive Hashing
Chart: Scoring time per query - brute force vs LSH
Bars: Brute force, LSH; Y-axis: Time (ms)
*3x nodes, 30x shards, k=50, 1,000,000 items
SLIDE 48 Scoring Performance
Comparison to “score then search”
Chart: Scoring time per query – LSH vs score-then-search
Bars: Brute force, LSH, Score-then-search; legend: Score, Sort, Search; Y-axis: Time (ms)
*3x nodes, 30x shards, k=50, 1,000,000 items
SLIDE 49
Demo
SLIDE 50
Future Work
SLIDE 51 Future Work
- Apache Solr version of scoring plugin (any takers?)
- Investigate ways to improve Elasticsearch scoring performance
  § Performance for LSH-filtered scoring should be better!
  § Can we dig deep into ES scoring internals to combine efficiency of matrix-vector math with ES search & filter capabilities?
- Investigate more complex models
  § Factorization machines & other contextual recommender models
  § Scoring performance
- Spark Structured Streaming with Kafka, Elasticsearch & Kibana
  § Continuous recommender application including data, model training, analytics & monitoring
SLIDE 52 References
- Elasticsearch
- Elasticsearch Spark Integration
- Spark ML ALS for Collaborative Filtering
- Collaborative Filtering for Implicit Feedback Datasets
- Factorization Machines
- Elasticsearch Term Vectors & Payloads
- Delimited Payload Filter
- Vector Scoring Plugin
- Kafka & Spark Streaming
- Kibana
SLIDE 53 Thanks!
https://github.com/MLnick/elasticsearch-vector-scoring