Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch

Nick Pentreath
Nov 14, 2016

About
§ @MLnick
§ Principal Engineer, IBM
§ Apache Spark PMC
§ Focused on machine learning
§ Author of Machine Learning with Spark

Agenda
§ Recommender systems & the machine learning workflow
§ Data modelling for recommender systems
§ Why Spark, Kafka & Elasticsearch?
§ Kafka & Spark Streaming
§ Spark ML for collaborative filtering
§ Deploying & scoring recommender models with Elasticsearch
§ Monitoring, feedback & re-training
§ Scaling model serving
§ Demo

Recommender Systems & the ML Workflow

Recommender Systems
Overview

The Machine Learning Workflow
Perception
[Diagram: Data, ???, Machine Learning, ???, $$$]

The Machine Learning Workflow
Reality
Data → Ingest → Data Processing → Model Training → Deploy → Live System
Technologies: various data sources, Spark DataFrames, Spark ML, ??? (missing piece!)
• Data: historical; streaming
• Data Processing: feature transformation & engineering
• Model Training: model selection & evaluation
• Deploy: pipelines, not just models; versioning
• Live System: predict given new data; monitoring & live evaluation
Feedback Loop
Stream (Kafka)

The Machine Learning Workflow
Recommender Version
Data → Ingest → Data Processing → Model Training → Deploy → Live System
Technologies: Elasticsearch, Spark DataFrames, Spark ML, Elasticsearch
• Data: user & item metadata; events
• Data Processing: aggregation; handle implicit data
• Model Training: ALS; ranking-style evaluation
• Deploy: model size & complexity
• Live System: user & item recommendations; monitoring, filters
Feedback => another Event Type
Stream (Kafka)

Data Modeling for Recommender Systems

User and Item Metadata
Data model

User and Item Metadata
System Requirements
Filtering & Grouping
Business Rules

Anatomy of a User Event
User Interactions
Implicit preference data
• Page view
• eCommerce - cart, purchase
• Media - preview, watch, listen
Intent data
• Search query
Explicit preference data
• Rating
• Review
Social network interactions
• Like
• Share
• Follow

Anatomy of a User Event
Data model

Anatomy of a User Event
How to handle implicit feedback?

Why Kafka, Spark & Elasticsearch?

Why Kafka?
Scalability
§ De facto standard for a centralized enterprise message / event queue
Integration
§ Integrates with just about every storage & processing system
§ Good Spark Streaming integration: a first-class citizen
§ Including for Structured Streaming (but still very new & rough!)

Why Spark?
DataFrames
§ Events & metadata are “lightly structured” data
§ Suited to DataFrames
§ Pluggable external data source support
Spark ML
§ Spark ML pipelines, including a scalable ALS model for collaborative filtering
§ Implicit feedback & NMF in ALS
§ Cross-validation
§ Custom transformers & algorithms

Why Elasticsearch?
Storage
§ Native JSON
§ Scalable
§ Good support for time-series / event data
§ Kibana for data visualisation
§ Integration with Spark DataFrames
Scoring
§ Full-text search
§ Filtering
§ Aggregations (grouping)
§ Search ~== recommendation (more later)

Kafka for Recommender Systems

Event Data Pipeline
[Diagram: Kafka → Spark Streaming → event store, dashboards, user analytics & aggregation, item analytics & aggregation]

Write to Event Store
[Diagram: Spark Streaming → event store]

Kibana Dashboards
[Diagram: Spark Streaming → dashboards]

Item Metadata Analytics
[Diagram: Spark Streaming → item analytics & aggregation (aggregated activity metrics)]

User Metadata Analytics
[Diagram: Spark Streaming → user analytics & aggregation (aggregated activity metrics & item exclusions)]

Structured Streaming
Status
§ Still early days
§ Initial Kafka support in Spark 2.0.2
§ No ES support yet; not clear if it will be a full-blown datasource or ForeachWriter
§ For now, you can create a custom ForeachWriter for your needs

Spark ML for Collaborative Filtering

Collaborative Filtering
Matrix Factorization
[Figure: a sparse ratings matrix approximated by the product of a user factor matrix and an item factor matrix]

Collaborative Filtering
Prediction
[Figure: a predicted rating computed as the dot product of a user factor vector and an item factor vector]
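As a minimal sketch of the prediction step in matrix factorization, the snippet below scores a user against items by taking the dot product of their factor vectors and keeps the highest-scoring items. The function names `predict` and `top_n` and the dictionary layout are illustrative assumptions, not part of the talk or of Spark ML.

```python
from typing import Dict, List, Sequence, Tuple


def predict(user_factors: Sequence[float], item_factors: Sequence[float]) -> float:
    """Predicted preference = dot product of the user and item factor vectors."""
    if len(user_factors) != len(item_factors):
        raise ValueError("factor vectors must have the same dimension k")
    return sum(u * v for u, v in zip(user_factors, item_factors))


def top_n(
    user_factors: Sequence[float],
    item_factors_by_id: Dict[str, Sequence[float]],
    n: int,
) -> List[Tuple[str, float]]:
    """Score every item for one user and return the n best (item_id, score) pairs."""
    scored = [
        (item_id, predict(user_factors, factors))
        for item_id, factors in item_factors_by_id.items()
    ]
    # Highest predicted score first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]
```

Scoring every item this way is the brute-force approach; its cost grows with both the number of items and the factor dimension k.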

Collaborative Filtering
Loading Data in Spark ML

Alternating Least Squares
Implicit Preference Data
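One common way to treat implicit preference data, following the formulation in "Collaborative Filtering for Implicit Feedback Datasets" (listed in the references), is to turn an aggregated interaction strength r into a binary preference p = 1 if r > 0 and a confidence c = 1 + alpha * r. The sketch below illustrates that idea only. The event weights, the default `alpha` and the function names are hypothetical choices for illustration, not values from the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical weights per event type; real values would be tuned per application.
EVENT_WEIGHTS = {"view": 1.0, "cart": 3.0, "purchase": 5.0}


def aggregate_events(
    events: Iterable[Tuple[str, str, str]],
) -> Dict[Tuple[str, str], float]:
    """Sum weighted events into one interaction strength r per (user, item)."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for user_id, item_id, event_type in events:
        # Unknown event types contribute nothing.
        totals[(user_id, item_id)] += EVENT_WEIGHTS.get(event_type, 0.0)
    return dict(totals)


def preference_and_confidence(r: float, alpha: float = 40.0) -> Tuple[float, float]:
    """Binary preference p and confidence c = 1 + alpha * r (alpha is an assumed default)."""
    preference = 1.0 if r > 0 else 0.0
    confidence = 1.0 + alpha * r
    return preference, confidence
```

The aggregated strengths are the kind of "rating" column that an implicit-feedback ALS model would be trained on.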

Deploying & Scoring Recommendation Models

Prelude: Search
Full-text Search & Similarity
[Diagram: query “cat videos” → Analysis → Term vectors → Scoring (similarity between query and document term vectors) → Ranking (sort results)]

Recommendation
Can we use the same machinery?
[Diagram: user (or item) vector → Analysis → Term vectors → Scoring (similarity) → Ranking (sort results)]
Dot product & cosine similarity … the same as we need for recommendations!

Elasticsearch Term Vectors
Delimited Payload Filter
Raw vector: 1.2 ⋯ −0.2 0.3
Custom analyzer → term vector with payloads: 0|1.2 ⋯ 3|-0.2 4|0.3
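The `index|value` layout shown on this slide can be sketched as a small encoder and decoder: each vector position becomes a "term" and the factor value becomes its payload. The helper names below are illustrative assumptions and the code does not call Elasticsearch; it only produces and parses the delimited string a custom analyzer would consume.

```python
from typing import List, Sequence


def to_payload_string(vector: Sequence[float], delimiter: str = "|") -> str:
    """Encode a dense vector as space-separated 'index|value' tokens."""
    return " ".join(f"{i}{delimiter}{value}" for i, value in enumerate(vector))


def from_payload_string(text: str, delimiter: str = "|") -> List[float]:
    """Decode 'index|value' tokens back into a dense vector ordered by index."""
    indexed = {}
    for token in text.split():
        index, value = token.split(delimiter)
        indexed[int(index)] = float(value)
    return [indexed[i] for i in range(len(indexed))]
```

For example, `to_payload_string([1.2, -0.2, 0.3])` produces `0|1.2 1|-0.2 2|0.3`.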

Elasticsearch Scoring
Custom scoring function
• Native script (Java), compiled for speed
• Scoring function computes dot product by:
  § For each document vector index (“term”), retrieve payload
  § score += payload * query(i)
• Normalizes with query vector norm and document vector norm for cosine similarity
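The scoring loop described on this slide can be sketched in plain Python. The talk's actual function is a native Java script inside Elasticsearch; the version below only mirrors the arithmetic, with the document's payloads modelled as a hypothetical dictionary from vector index to value.

```python
import math
from typing import Dict, Sequence


def dot_product_score(doc_payloads: Dict[int, float], query_vector: Sequence[float]) -> float:
    """For each document vector index ("term"), multiply its payload by query[i] and sum."""
    score = 0.0
    for i, payload in doc_payloads.items():
        score += payload * query_vector[i]
    return score


def cosine_score(doc_payloads: Dict[int, float], query_vector: Sequence[float]) -> float:
    """Dot product normalized by the document and query vector norms."""
    doc_norm = math.sqrt(sum(p * p for p in doc_payloads.values()))
    query_norm = math.sqrt(sum(q * q for q in query_vector))
    if doc_norm == 0.0 or query_norm == 0.0:
        # A zero vector has no direction; treat similarity as 0 by convention.
        return 0.0
    return dot_product_score(doc_payloads, query_vector) / (doc_norm * query_norm)
```

The dot product gives the raw predicted preference, while the normalized form is the one typically used for item-to-item similarity.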

Recommendation
Can we use the same machinery?
[Diagram: user (or item) vector → Analysis (delimited payload filter) → Term vectors → Scoring (custom scoring function) → Ranking (sort results)]

Elasticsearch Scoring
We get search engine functionality for free!

Alternating Least Squares
Deploying to Elasticsearch

Monitoring & Feedback

System Events
Logging Recommendations Served

System Events
Logging Recommendation Actions

Tracking Performance
[Diagram: Kafka → Spark Streaming → event store, dashboards, performance monitoring & alerts, impression capping / fatigue]

Scaling Model Scoring

Scoring Performance
[Chart: scoring time per query (ms), by factor dimension (k=20, k=50, k=100) and size of item set (100,000 and 1,000,000 items); 3 nodes, 30 shards]

Scoring Performance
Increasing number of shards
[Chart: scoring time per query (ms), by number of shards (10, 30, 60, 90) and size of item set (100,000 and 1,000,000 items); 3 nodes, k=50]

Scoring Performance
Locality Sensitive Hashing
• LSH hashes each input vector into L “hash tables”. Each table contains a “hash signature” created by applying k hash functions.
• Standard for cosine similarity is Sign Random Projections
• At indexing time, create a “bucket” by combining hash table id and hash signature
• Store buckets as part of item model metadata
• At scoring time, filter candidate set using term filter on buckets of query item
• Tune LSH parameters to trade off speed / accuracy
• LSH coming soon to Spark ML: SPARK-5992
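The bucket construction described on this slide can be sketched with sign random projections: each of L tables holds k random hyperplanes, the signature is the string of sign bits, and a bucket is the table id joined to the signature. This is a minimal standard-library sketch; the function names, the `tableId_signature` bucket format and the fixed seed are assumptions made for illustration, not the plugin's actual implementation.

```python
import random
from typing import Dict, List, Sequence

Table = List[List[float]]  # k hyperplanes, each of dimension dim


def make_hash_tables(dim: int, num_tables: int, num_hashes: int, seed: int = 42) -> List[Table]:
    """Draw L tables of k Gaussian random hyperplanes (fixed seed so buckets are reproducible)."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hashes)]
        for _ in range(num_tables)
    ]


def signature(vector: Sequence[float], hyperplanes: Table) -> str:
    """One bit per hyperplane: which side of the plane the vector falls on."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


def buckets(vector: Sequence[float], tables: List[Table]) -> List[str]:
    """Combine the table id and the hash signature into one bucket term per table."""
    return [f"{table_id}_{signature(vector, planes)}" for table_id, planes in enumerate(tables)]


def candidates(query_buckets: Sequence[str], item_buckets: Dict[str, Sequence[str]]) -> List[str]:
    """Keep only items sharing at least one bucket with the query (a stand-in for a term filter)."""
    query = set(query_buckets)
    return [item_id for item_id, b in item_buckets.items() if query & set(b)]
```

More tables (larger L) raise recall at the cost of a bigger candidate set, while more hash functions per table (larger k) make each bucket more selective; that is the speed / accuracy trade-off the slide refers to.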

Scoring Performance
Locality Sensitive Hashing
[Chart: scoring time per query (ms), brute force vs LSH; 3 nodes, 30 shards, k=50, 1,000,000 items]

Scoring Performance
Comparison to “score then search”
[Chart: scoring time per query (ms), split into score, sort and search, for brute force, LSH and score-then-search; 3 nodes, 30 shards, k=50, 1,000,000 items]

Future Work

Future Work
• Apache Solr version of scoring plugin (any takers?)
• Investigate ways to improve Elasticsearch scoring performance
  § Performance for LSH-filtered scoring should be better!
  § Can we dig deep into ES scoring internals to combine efficiency of matrix-vector math with ES search & filter capabilities?
• Investigate more complex models
  § Factorization machines & other contextual recommender models
  § Scoring performance
• Spark Structured Streaming with Kafka, Elasticsearch & Kibana
  § Continuous recommender application including data, model training, analytics & monitoring

References
• Elasticsearch
• Elasticsearch Spark Integration
• Spark ML ALS for Collaborative Filtering
• Collaborative Filtering for Implicit Feedback Datasets
• Factorization Machines
• Elasticsearch Term Vectors & Payloads
• Delimited Payload Filter
• Vector Scoring Plugin
• Kafka & Spark Streaming
• Kibana
