Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch

Nov / 14 / 16. Nick Pentreath (@MLnick): Principal Engineer, IBM; Apache Spark PMC; focused on machine learning; author of Machine Learning with Spark.


slide-1
SLIDE 1

Nov / 14 / 16

Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch

Nick Pentreath

slide-2
SLIDE 2

About

  • @MLnick
  • Principal Engineer, IBM
  • Apache Spark PMC
  • Focused on machine learning
  • Author of Machine Learning with Spark

slide-3
SLIDE 3

Agenda

  • Recommender systems & the machine learning workflow
  • Data modelling for recommender systems
  • Why Spark, Kafka & Elasticsearch?
  • Kafka & Spark Streaming
  • Spark ML for collaborative filtering
  • Deploying & scoring recommender models with Elasticsearch
  • Monitoring, feedback & re-training
  • Scaling model serving
  • Demo

slide-4
SLIDE 4

Recommender Systems & the ML Workflow

slide-5
SLIDE 5

Recommender Systems

Overview

slide-6
SLIDE 6

The Machine Learning Workflow

Perception

Data ??? Machine Learning ??? $$$

slide-7
SLIDE 7

The Machine Learning Workflow

Reality

Data

  • Historical
  • Streaming

Ingest

Data Processing

  • Feature transformation & engineering

Model Training

  • Model selection & evaluation

Deploy

  • Pipelines, not just models
  • Versioning

Live System

  • Predict given new data
  • Monitoring & live evaluation

Feedback Loop

Spark DataFrames, Spark ML, Various, ???, Stream (Kafka)

Missing piece!

slide-8
SLIDE 8

The Machine Learning Workflow

Recommender Version

Data

Ingest

Data Processing

  • Aggregation
  • Handle implicit data

Model Training

  • ALS
  • Ranking-style evaluation

Deploy

  • Model size & complexity

Live System

  • User & item recommendations
  • Monitoring, filters

Feedback => another Event Type

Spark DataFrames, Spark ML, Elasticsearch

  • User & Item Metadata
  • Events

Elasticsearch, Stream (Kafka)

slide-9
SLIDE 9

Data Modeling for Recommender Systems

slide-10
SLIDE 10

Data model

User and Item Metadata

slide-11
SLIDE 11

System Requirements

User and Item Metadata

Filtering & Grouping Business Rules

slide-12
SLIDE 12

User interactions

Implicit preference data

  • Page view
  • eCommerce - cart, purchase
  • Media – preview, watch, listen

Intent data

  • Search query

Anatomy of a User Event

Explicit preference data

  • Rating
  • Review

Social network interactions

  • Like
  • Share
  • Follow

User Interactions

slide-13
SLIDE 13

Data model

Anatomy of a User Event

slide-14
SLIDE 14

How to handle implicit feedback?

Anatomy of a User Event
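
One common way to handle implicit feedback is to aggregate raw events per user-item pair into a single weighted score before training. The sketch below is illustrative only: the event-type weights and the field names (`user_id`, `item_id`, `type`) are assumptions and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical weights per event type; tune these for your own data.
EVENT_WEIGHTS = {"view": 1.0, "cart": 3.0, "purchase": 5.0}


def aggregate_events(events, weights=EVENT_WEIGHTS):
    """Sum event weights per (user, item) pair; unknown event types are skipped."""
    scores = defaultdict(float)
    for event in events:
        weight = weights.get(event["type"])
        if weight is None:
            continue  # e.g. intent events such as search carry no item preference here
        scores[(event["user_id"], event["item_id"])] += weight
    return dict(scores)


events = [
    {"user_id": 1, "item_id": 10, "type": "view"},
    {"user_id": 1, "item_id": 10, "type": "purchase"},
    {"user_id": 2, "item_id": 10, "type": "search"},
]
print(aggregate_events(events))
```

The aggregated score can then be used as the "rating" column when training a model on implicit data.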

slide-15
SLIDE 15

Why Kafka, Spark & Elasticsearch?

slide-16
SLIDE 16

Why Kafka?

Scalability

  • De facto standard for a centralized enterprise message / event queue

Integration

  • Integrates with just about every storage & processing system
  • Good Spark Streaming integration – 1st class citizen
  • Including for Structured Streaming (but still very new & rough!)

slide-17
SLIDE 17

Why Spark?

DataFrames

  • Events & metadata are “lightly structured” data
  • Suited to DataFrames
  • Pluggable external data source support

Spark ML

  • Spark ML pipelines – including scalable ALS model for collaborative filtering
  • Implicit feedback & NMF in ALS
  • Cross-validation
  • Custom transformers & algorithms

slide-18
SLIDE 18

Why Elasticsearch?

Storage

  • Native JSON
  • Scalable
  • Good support for time-series / event data
  • Kibana for data visualisation
  • Integration with Spark DataFrames

Scoring

  • Full-text search
  • Filtering
  • Aggregations (grouping)
  • Search ~== recommendation (more later)

slide-19
SLIDE 19

Kafka for Recommender Systems

slide-20
SLIDE 20

Event Data Pipeline

[Diagram: Kafka and Spark Streaming, with outputs labelled Item analytics & aggregation, User analytics & aggregation, Event store, and Dashboards]

slide-21
SLIDE 21

Write to Event Store

Spark Streaming Event store

slide-22
SLIDE 22

Kibana Dashboards

Spark Streaming

Dashboards

slide-23
SLIDE 23

Item Metadata Analytics

Spark Streaming

Item analytics & aggregation

Aggregated activity metrics

slide-24
SLIDE 24

User Metadata Analytics

Spark Streaming

User analytics & aggregation

Aggregated activity metrics & item exclusions

slide-25
SLIDE 25

Structured Streaming

Status

  • Still early days
  • Initial Kafka support in Spark 2.0.2
  • No ES support yet – not clear if it will be a full-blown datasource or ForeachWriter
  • For now, you can create a custom ForeachWriter for your needs

slide-26
SLIDE 26

Spark ML for Collaborative Filtering

slide-27
SLIDE 27

Matrix Factorization

Collaborative Filtering

[Figure: a sparse ratings matrix (entries 3, 1, 5, 2, 1, 2, 1) shown alongside real-valued user and item factor matrices]

slide-28
SLIDE 28

Prediction

Collaborative Filtering
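
In a matrix factorization model, the predicted preference of a user for an item is the dot product of the two factor vectors. A minimal sketch with made-up factor values (the function names and the numbers are illustrative, not taken from the slides):

```python
def predict(user_factors, item_factors):
    """Predicted preference: dot product of the user and item factor vectors."""
    if len(user_factors) != len(item_factors):
        raise ValueError("factor vectors must have the same dimension")
    return sum(u * v for u, v in zip(user_factors, item_factors))


def top_n(user_factors, items, n=2):
    """Rank items (id -> factor vector) by predicted score, highest first."""
    scored = [(item_id, predict(user_factors, vec)) for item_id, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


user = [1.0, 0.5]
items = {"a": [2.0, 0.0], "b": [0.0, 2.0], "c": [1.0, 1.0]}
print(top_n(user, items))  # [('a', 2.0), ('c', 1.5)]
```

Scoring every item this way is the brute-force approach; the later slides on scoring performance look at how to make it cheaper.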

slide-29
SLIDE 29

Loading Data in Spark ML

Collaborative Filtering

slide-30
SLIDE 30

Implicit Preference Data

Alternating Least Squares
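
The implicit-feedback formulation listed in the references (Collaborative Filtering for Implicit Feedback Datasets) treats a raw interaction count r as a binary preference p with a confidence c = 1 + alpha * r. Spark ML's ALS exposes this through its `implicitPrefs` and `alpha` parameters. The sketch below only illustrates the weighting for a single user-item pair; the function names are hypothetical and this is not how Spark implements the solver.

```python
def preference_and_confidence(r, alpha=1.0):
    """Map a raw interaction count r to (preference p, confidence c).

    p = 1 if r > 0 else 0; c = 1 + alpha * r.
    """
    p = 1.0 if r > 0 else 0.0
    c = 1.0 + alpha * r
    return p, c


def weighted_squared_error(r, user_vec, item_vec, alpha=1.0):
    """Confidence-weighted squared error for one user-item pair (no regularization)."""
    p, c = preference_and_confidence(r, alpha)
    prediction = sum(u * v for u, v in zip(user_vec, item_vec))
    return c * (p - prediction) ** 2


print(preference_and_confidence(3, alpha=2.0))  # (1.0, 7.0)
print(weighted_squared_error(3, [1.0, 0.0], [0.5, 0.0], alpha=2.0))  # 1.75
```

Unobserved pairs (r = 0) still contribute to the loss with confidence 1, which is what distinguishes this from the explicit-rating case.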

slide-31
SLIDE 31

Deploying & Scoring Recommendation Models

slide-32
SLIDE 32

Full-text Search & Similarity

Prelude: Search

[Diagram: the query “cat videos” is turned into a sparse term vector and compared with document term vectors; stage labels: Analysis, Term vectors, Similarity, Scoring, Sort results, Ranking]

slide-33
SLIDE 33

Can we use the same machinery?

Recommendation

Dot product & cosine similarity … the same as we need for recommendations!

[Diagram: the search pipeline with a user (or item) vector (1.2 ⋯ −0.2 0.3) in place of the query; stage labels: Analysis, Term vectors, Similarity, Scoring, Sort results, Ranking; one step is marked “?”]

slide-34
SLIDE 34

Delimited Payload Filter

Elasticsearch Term Vectors

Raw vector

1.2 ⋯ −0.2 0.3

Term vector with payloads

0|1.2 ⋯ 3|-0.2 4|0.3

Custom analyzer
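
A minimal sketch of the encoding step shown above: each vector element becomes an `index|value` token, so that a delimited payload filter can store the value as the payload of the index "term". The helper name is hypothetical.

```python
def to_payload_string(vector, delimiter="|"):
    """Encode a dense vector as space-separated 'index|value' tokens."""
    return " ".join(f"{i}{delimiter}{value}" for i, value in enumerate(vector))


print(to_payload_string([1.2, -0.2, 0.3]))  # 0|1.2 1|-0.2 2|0.3
```

The resulting string is what would be indexed in the field that uses the custom analyzer.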

slide-35
SLIDE 35

Custom scoring function

  • Native script (Java), compiled for speed
  • Scoring function computes dot product by:
      • For each document vector index (“term”), retrieve payload
      • score += payload * query(i)
  • Normalizes with query vector norm and document vector norm for cosine similarity

Elasticsearch Scoring
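
The scoring logic described on this slide can be sketched in plain Python. This is not the plugin's Java code: it assumes the document vector is available as `index|value` tokens and that the query is a dense list covering every index in the document.

```python
import math


def parse_payloads(text, delimiter="|"):
    """Parse 'index|value' tokens into a dict of index -> payload value."""
    payloads = {}
    for token in text.split():
        index, value = token.split(delimiter)
        payloads[int(index)] = float(value)
    return payloads


def score(doc_text, query, cosine=False):
    """Dot product of document payloads and query; optionally cosine-normalized."""
    payloads = parse_payloads(doc_text)
    dot = 0.0
    for i, payload in payloads.items():
        dot += payload * query[i]  # score += payload * query(i)
    if not cosine:
        return dot
    doc_norm = math.sqrt(sum(v * v for v in payloads.values()))
    query_norm = math.sqrt(sum(q * q for q in query))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0  # avoid division by zero for all-zero vectors
    return dot / (doc_norm * query_norm)


print(score("0|1.0 1|2.0", [3.0, 4.0]))  # 11.0
```

With `cosine=True` the same function returns a value in [-1, 1], which makes scores comparable across documents whose vectors have different magnitudes.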

slide-36
SLIDE 36

Can we use the same machinery?

Recommendation

[Diagram: a user (or item) vector (1.2 ⋯ −0.2 0.3) used as the query against stored factor vectors; stage labels: Analysis, Term vectors, Scoring, Ranking, Sort results; annotations: Delimited payload filter, Custom scoring function]

slide-37
SLIDE 37

We get search engine functionality for free!

Elasticsearch Scoring

slide-38
SLIDE 38

Deploying to Elasticsearch

Alternating Least Squares

slide-39
SLIDE 39

Monitoring & Feedback

slide-40
SLIDE 40

Logging Recommendations Served

System Events

slide-41
SLIDE 41

Logging Recommendation Actions

System Events

slide-42
SLIDE 42

Tracking Performance

[Diagram: Kafka, Spark Streaming, Event store and Dashboards, with additional labels Impression capping / fatigue and Performance monitoring & alerts]

slide-43
SLIDE 43

Scaling Model Scoring

slide-44
SLIDE 44

Scoring Performance

Scoring time per query, by factor dimension & number of items

[Chart: Time (ms) against size of item set (100,000 and 1,000,000); series k=20, k=50, k=100. *3x nodes, 30x shards]

slide-45
SLIDE 45

Scoring Performance

Scoring time per query, by number of shards & number of items

[Chart: Time (ms) against size of item set (100,000 and 1,000,000); series 10 shards, 30 shards, 60 shards, 90 shards; annotation: Increasing number of shards. *3x nodes, k=50]

slide-46
SLIDE 46

Scoring Performance

Locality Sensitive Hashing

  • LSH hashes each input vector into L “hash tables”. Each table contains a “hash signature” created by applying k hash functions.
  • Standard for cosine similarity is Sign Random Projections
  • At indexing time, create a “bucket” by combining hash table id and hash signature
  • Store buckets as part of item model metadata
  • At scoring time, filter candidate set using term filter on buckets of query item
  • Tune LSH parameters to trade off speed / accuracy

  • LSH coming soon to Spark ML – SPARK-5992
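
The bucketing scheme in the list above can be sketched as follows, using sign random projections. This is a minimal illustration under assumed names and parameters (`make_hash_tables`, `buckets`, the `tableid_signature` bucket format); it is not the plugin's or Spark's implementation.

```python
import random


def make_hash_tables(dim, num_tables, num_hashes, seed=42):
    """Create L tables, each holding k random hyperplanes of the given dimension."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hashes)]
        for _ in range(num_tables)
    ]


def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


def buckets(vector, tables):
    """Bucket id = hash table id combined with the hash signature."""
    return [f"{table_id}_{signature(vector, planes)}"
            for table_id, planes in enumerate(tables)]


def candidates(query_vector, item_buckets, tables):
    """Keep only items sharing at least one bucket with the query vector."""
    query_buckets = set(buckets(query_vector, tables))
    return [item for item, b in item_buckets.items() if query_buckets & set(b)]


tables = make_hash_tables(dim=3, num_tables=4, num_hashes=2)
item_vectors = {"a": [1.0, 0.5, -0.3], "b": [-1.0, -0.5, 0.3]}
item_buckets = {item: buckets(vec, tables) for item, vec in item_vectors.items()}
print(candidates([1.0, 0.5, -0.3], item_buckets, tables))
```

More tables (L) raise recall at the cost of a larger candidate set; more hash functions per table (k) make buckets more selective. The exact scoring (dot product or cosine) then runs only on the filtered candidates.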
slide-47
SLIDE 47

Scoring Performance

Scoring time per query - brute force vs LSH

[Chart: Time (ms) for Brute force and LSH. *3x nodes, 30x shards, k=50, 1,000,000 items]

Locality Sensitive Hashing

slide-48
SLIDE 48

Scoring Performance

Scoring time per query – LSH vs score-then-search

[Chart: Time (ms) for Brute force, LSH and Score-then-search, with components Score, Sort, Search. *3x nodes, 30x shards, k=50, 1,000,000 items]

Comparison to “score then search”

slide-49
SLIDE 49

Demo

slide-50
SLIDE 50

Future Work

slide-51
SLIDE 51

Future Work

  • Apache Solr version of scoring plugin (any takers?)
  • Investigate ways to improve Elasticsearch scoring performance
      • Performance for LSH-filtered scoring should be better!
      • Can we dig deep into ES scoring internals to combine efficiency of matrix-vector math with ES search & filter capabilities?
  • Investigate more complex models
      • Factorization machines & other contextual recommender models
      • Scoring performance
  • Spark Structured Streaming with Kafka, Elasticsearch & Kibana
      • Continuous recommender application including data, model training, analytics & monitoring

slide-52
SLIDE 52

References

  • Elasticsearch
  • Elasticsearch Spark Integration
  • Spark ML ALS for Collaborative Filtering
  • Collaborative Filtering for Implicit Feedback Datasets
  • Factorization Machines
  • Elasticsearch Term Vectors & Payloads
  • Delimited Payload Filter
  • Vector Scoring Plugin
  • Kafka & Spark Streaming
  • Kibana
slide-53
SLIDE 53

Thanks!

https://github.com/MLnick/elasticsearch-vector-scoring