Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch


  1. Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch (Nick Pentreath, Nov 14, 2016)

  2. About
     • @MLnick
     • Principal Engineer, IBM
     • Apache Spark PMC
     • Focused on machine learning
     • Author of Machine Learning with Spark

  3. Agenda
     • Recommender systems & the machine learning workflow
     • Data modelling for recommender systems
     • Why Spark, Kafka & Elasticsearch?
     • Kafka & Spark Streaming
     • Spark ML for collaborative filtering
     • Deploying & scoring recommender models with Elasticsearch
     • Monitoring, feedback & re-training
     • Scaling model serving
     • Demo

  4. Recommender Systems & the ML Workflow

  5. Recommender Systems: Overview

  6. The Machine Learning Workflow: Perception
     [Diagram: Data, ???, Machine Learning, ???, $$$]

  7. The Machine Learning Workflow: Reality
     Stages: Data → Ingest → Data Processing → Model Training → Deploy → Live System
     Tool labels on the diagram: Various, Spark DataFrames, Spark ML, ??? (missing piece!)
     • Data: historical; streaming
     • Data Processing: feature transformation & engineering
     • Model Training: model selection & evaluation
     • Deploy: pipelines, not just models; versioning
     • Live System: predict given new data; monitoring & live evaluation
     Feedback loop: stream (Kafka)

  8. The Machine Learning Workflow: Recommender Version
     Stages: Data → Ingest → Data Processing → Model Training → Deploy → Live System
     Tool labels on the diagram: Elasticsearch, Spark DataFrames, Spark ML, Elasticsearch
     • Data: user & item metadata; events
     • Data Processing: aggregation; handle implicit data
     • Model Training: ALS; ranking-style evaluation
     • Deploy: model size & complexity
     • Live System: user & item recommendations; monitoring, filters
     Feedback => another event type; stream (Kafka)

  9. Data Modeling for Recommender Systems

  10. User and Item Metadata: Data model

  11. User and Item Metadata: System Requirements
      • Filtering & Grouping
      • Business Rules

  12. Anatomy of a User Event: User Interactions
      Implicit preference data
      • Page view
      • eCommerce - cart, purchase
      • Media - preview, watch, listen
      Intent data
      • Search query
      Explicit preference data
      • Rating
      • Review
      Social network interactions
      • Like
      • Share
      • Follow

  13. Anatomy of a User Event: Data model

  14. Anatomy of a User Event: How to handle implicit feedback?
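
The slide's own answer is not preserved in this transcript. One common approach, sketched here as an assumption rather than the speaker's method, is to collapse raw events into a single weighted strength per (user, item) pair. The field names `user_id`, `item_id` and `event_type`, and the weights themselves, are hypothetical.

```python
from collections import defaultdict

# Hypothetical weights per event type; real values would need tuning.
EVENT_WEIGHTS = {"page_view": 1.0, "cart": 3.0, "purchase": 5.0}

def aggregate_implicit(events):
    """Sum weighted event counts per (user_id, item_id) pair."""
    strengths = defaultdict(float)
    for event in events:
        weight = EVENT_WEIGHTS.get(event["event_type"])
        if weight is None:
            continue  # skip event types that carry no preference signal here
        strengths[(event["user_id"], event["item_id"])] += weight
    return dict(strengths)

events = [
    {"user_id": "u1", "item_id": "i1", "event_type": "page_view"},
    {"user_id": "u1", "item_id": "i1", "event_type": "purchase"},
    {"user_id": "u1", "item_id": "i2", "event_type": "page_view"},
    {"user_id": "u2", "item_id": "i1", "event_type": "search"},
]
strengths = aggregate_implicit(events)
```

The resulting strengths can serve as the "rating" column when training a model on implicit preference data.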

  15. Why Kafka, Spark & Elasticsearch?

  16. Why Kafka?
      Scalability
      • De facto standard for a centralized enterprise message / event queue
      Integration
      • Integrates with just about every storage & processing system
      • Good Spark Streaming integration: a first-class citizen
      • Including for Structured Streaming (but still very new & rough!)

  17. Why Spark?
      DataFrames
      • Events & metadata are “lightly structured” data
      • Suited to DataFrames
      • Pluggable external data source support
      Spark ML
      • Spark ML pipelines, including scalable ALS model for collaborative filtering
      • Implicit feedback & NMF in ALS
      • Cross-validation
      • Custom transformers & algorithms

  18. Why Elasticsearch?
      Storage
      • Native JSON
      • Scalable
      • Good support for time-series / event data
      • Kibana for data visualisation
      • Integration with Spark DataFrames
      Scoring
      • Full-text search
      • Filtering
      • Aggregations (grouping)
      • Search ~== recommendation (more later)

  19. Kafka for Recommender Systems

  20. Event Data Pipeline
      [Diagram: Kafka → Spark Streaming → event store, dashboards, user analytics & aggregation, item analytics & aggregation]

  21. Write to Event Store
      [Diagram: Spark Streaming → event store]

  22. Kibana Dashboards
      [Diagram: Spark Streaming → dashboards]

  23. Item Metadata Analytics
      [Diagram: Spark Streaming → item analytics & aggregation (aggregated activity metrics)]

  24. User Metadata Analytics
      [Diagram: Spark Streaming → user analytics & aggregation (aggregated activity metrics & item exclusions)]

  25. Structured Streaming: Status
      • Still early days
      • Initial Kafka support in Spark 2.0.2
      • No ES support yet; not clear if it will be a full-blown datasource or ForeachWriter
      • For now, you can create a custom ForeachWriter for your needs

  26. Spark ML for Collaborative Filtering

  27. Collaborative Filtering: Matrix Factorization
      [Figure: a sparse ratings matrix and its user and item factor matrices; numeric values omitted]

  28. Collaborative Filtering: Prediction
      [Figure: the same ratings matrix and factor matrices, used to predict a missing entry; numeric values omitted]
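
In a matrix factorization model, the predicted preference of a user for an item is the dot product of the user's factor vector and the item's factor vector. The sketch below illustrates that step in plain Python; the vectors and item names are made up for illustration and are not the values from the slide.

```python
def predict(user_vector, item_vector):
    """Predicted preference: dot product of user and item factor vectors."""
    return sum(u * v for u, v in zip(user_vector, item_vector))

def top_n(user_vector, item_vectors, n):
    """Score every item for one user and return the n highest-scoring items."""
    scored = [(item, predict(user_vector, vec)) for item, vec in item_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Illustrative factor vectors with k = 2 latent factors.
user = [1.0, 2.0]
items = {"item_a": [3.0, 0.5], "item_b": [0.0, 1.0], "item_c": [-1.0, 0.0]}
recommendations = top_n(user, items, n=2)
```

Scoring every item like this is the brute-force approach; the later slides on scoring performance are about making this step fast at scale.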

  29. Collaborative Filtering: Loading Data in Spark ML

  30. Alternating Least Squares: Implicit Preference Data
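
The slide body is not preserved here. As background, the paper listed in the references ("Collaborative Filtering for Implicit Feedback Datasets") treats an observed interaction strength r as a binary preference p with a confidence c = 1 + alpha * r. The sketch below shows that transform only; the function name and the sample alpha are illustrative assumptions, not taken from the talk.

```python
def preference_and_confidence(strength, alpha):
    """Map a raw interaction strength r to (preference, confidence).

    preference p = 1 if r > 0 else 0
    confidence c = 1 + alpha * r
    """
    preference = 1.0 if strength > 0 else 0.0
    confidence = 1.0 + alpha * strength
    return preference, confidence

# alpha = 2.0 is an arbitrary illustrative value.
observed = preference_and_confidence(3.0, alpha=2.0)
unobserved = preference_and_confidence(0.0, alpha=2.0)
```

Unobserved pairs keep a preference of 0 with the minimum confidence of 1, so they still contribute to the loss but with low weight.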

  31. Deploying & Scoring Recommendation Models

  32. Prelude: Search. Full-text Search & Similarity
      [Diagram: the query “cat videos” passes through Analysis → Term vectors → Scoring (similarity against document term vectors) → Ranking (sort results)]

  33. Recommendation: Can we use the same machinery?
      [Diagram: a user (or item) vector takes the place of the query: Analysis (?) → Term vectors → Scoring (similarity) → Ranking (sort results)]
      Dot product & cosine similarity … the same as we need for recommendations!

  34. Elasticsearch Term Vectors: Delimited Payload Filter
      [Diagram: Raw vector → Custom analyzer → Term vector with payloads; the raw vector (1.2, ⋯, −0.2, 0.3) becomes the terms 0|1.2 ⋯ 3|-0.2 4|0.3]
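
The `index|value` layout shown on this slide can be produced by a small helper before a document is indexed. This is a minimal sketch of the string encoding only, assuming `|` as the delimiter as in the slide; the function names are hypothetical, and the Elasticsearch analyzer configuration is not shown.

```python
def to_payload_string(vector, delimiter="|"):
    """Encode a factor vector as space-separated 'index|value' terms."""
    return " ".join(f"{i}{delimiter}{value}" for i, value in enumerate(vector))

def parse_payload_string(text, delimiter="|"):
    """Decode 'index|value' terms back into an {index: value} mapping."""
    payloads = {}
    for term in text.split():
        index, value = term.split(delimiter)
        payloads[int(index)] = float(value)
    return payloads

encoded = to_payload_string([1.2, -0.2, 0.3])
decoded = parse_payload_string(encoded)
```

Each vector index becomes a "term" and each value becomes that term's payload, which is what lets a scoring function look values up by index.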

  35. Elasticsearch Scoring: Custom scoring function
      • Native script (Java), compiled for speed
      • Scoring function computes dot product by: for each document vector index (“term”),
        - retrieve payload
        - score += payload * query(i)
      • Normalizes with query vector norm and document vector norm for cosine similarity
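
The scoring loop described on this slide can be sketched outside Elasticsearch. The version below is plain Python rather than the native Java script the slide refers to, and it assumes the document's payloads are already available as an `{index: payload}` mapping; the function names are hypothetical.

```python
import math

def dot_product_score(doc_payloads, query):
    """For each document vector index, add payload * query[i] to the score."""
    score = 0.0
    for i, payload in doc_payloads.items():
        score += payload * query[i]
    return score

def cosine_score(doc_payloads, query):
    """Dot product normalized by the query and document vector norms."""
    doc_norm = math.sqrt(sum(p * p for p in doc_payloads.values()))
    query_norm = math.sqrt(sum(q * q for q in query))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0  # avoid dividing by zero for all-zero vectors
    return dot_product_score(doc_payloads, query) / (doc_norm * query_norm)

doc = {0: 3.0, 1: 4.0}
query = [3.0, 4.0]
dot = dot_product_score(doc, query)
cosine = cosine_score(doc, query)
```

Here the document and query vectors are identical, so the cosine similarity is 1.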

  36. Recommendation: Can we use the same machinery?
      [Diagram: User (or item) vector → Analysis (delimited payload filter) → Term vectors (real-valued factor vectors) → Scoring (custom scoring function) → Ranking (sort results)]

  37. Elasticsearch Scoring: We get search engine functionality for free!

  38. Alternating Least Squares: Deploying to Elasticsearch

  39. Monitoring & Feedback

  40. System Events: Logging Recommendations Served

  41. System Events: Logging Recommendation Actions

  42. Tracking Performance
      [Diagram: Kafka → Spark Streaming → event store, dashboards, performance monitoring & alerts, impression capping / fatigue]

  43. Scaling Model Scoring

  44. Scoring Performance
      [Chart: Scoring time per query, by factor dimension & number of items. Y axis: Time (ms); X axis: Size of item set (100,000 and 1,000,000); series: k=20, k=50, k=100. Setup: 3x nodes, 30x shards]

  45. Scoring Performance: Increasing number of shards
      [Chart: Scoring time per query, by number of shards & number of items. Y axis: Time (ms); X axis: Size of item set (100,000 and 1,000,000); series: 10, 30, 60 and 90 shards. Setup: 3x nodes, k=50]

  46. Scoring Performance: Locality Sensitive Hashing
      • LSH hashes each input vector into L “hash tables”. Each table contains a “hash signature” created by applying k hash functions.
      • Standard for cosine similarity is Sign Random Projections
      • At indexing time, create a “bucket” by combining hash table id and hash signature
      • Store buckets as part of item model metadata
      • At scoring time, filter candidate set using term filter on buckets of query item
      • Tune LSH parameters to trade off speed / accuracy
      • LSH coming soon to Spark ML: SPARK-5992
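
The bucket scheme above can be sketched with sign random projections in plain Python. This is an illustration under stated assumptions, not the plugin's implementation: the `tableid_signature` bucket format, the function names, and the parameter values (L = 4 tables, k = 8 hash functions) are hypothetical.

```python
import random

def make_tables(dim, num_tables, num_hashes, seed=0):
    """Draw num_hashes random hyperplanes for each of num_tables hash tables."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hashes)]
        for _ in range(num_tables)
    ]

def signature(vector, hyperplanes):
    """One bit per hyperplane: the sign of the vector's projection onto it."""
    bits = []
    for plane in hyperplanes:
        projection = sum(p * v for p, v in zip(plane, vector))
        bits.append("1" if projection >= 0 else "0")
    return "".join(bits)

def buckets(vector, tables):
    """Combine hash table id and hash signature into one bucket term per table."""
    return [f"{t}_{signature(vector, planes)}" for t, planes in enumerate(tables)]

def candidates(query_vector, indexed_buckets, tables):
    """Keep items sharing at least one bucket with the query (the term filter)."""
    wanted = set(buckets(query_vector, tables))
    return [item for item, b in indexed_buckets.items() if wanted.intersection(b)]

tables = make_tables(dim=3, num_tables=4, num_hashes=8)
# Item "c" points in the same direction as item "a" (every component doubled).
items = {"a": [1.2, -0.2, 0.3], "b": [-1.1, 1.3, 0.4], "c": [2.4, -0.4, 0.6]}
indexed = {item: buckets(vec, tables) for item, vec in items.items()}
similar_to_a = candidates(items["a"], indexed, tables)
```

Because only the sign of each projection is kept, vectors pointing in the same direction always land in the same buckets, which is why this family suits cosine similarity. More tables raise recall; more hash functions per table make each bucket more selective.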

  47. Scoring Performance: Locality Sensitive Hashing
      [Chart: Scoring time per query, brute force vs LSH. Y axis: Time (ms); bars: Brute force, LSH. Setup: 3x nodes, 30x shards, k=50, 1,000,000 items]

  48. Scoring Performance: Comparison to “score then search”
      [Chart: Scoring time per query, LSH vs score-then-search. Y axis: Time (ms); bars: Brute force, LSH, Score-then-search; series: Score, Sort, Search. Setup: 3x nodes, 30x shards, k=50, 1,000,000 items]

  49. Demo

  50. Future Work

  51. Future Work
      • Apache Solr version of scoring plugin (any takers?)
      • Investigate ways to improve Elasticsearch scoring performance
        - Performance for LSH-filtered scoring should be better!
        - Can we dig deep into ES scoring internals to combine efficiency of matrix-vector math with ES search & filter capabilities?
      • Investigate more complex models
        - Factorization machines & other contextual recommender models
        - Scoring performance
      • Spark Structured Streaming with Kafka, Elasticsearch & Kibana
        - Continuous recommender application including data, model training, analytics & monitoring

  52. References
      • Elasticsearch
      • Elasticsearch Spark Integration
      • Spark ML ALS for Collaborative Filtering
      • Collaborative Filtering for Implicit Feedback Datasets
      • Factorization Machines
      • Elasticsearch Term Vectors & Payloads
      • Delimited Payload Filter
      • Vector Scoring Plugin
      • Kafka & Spark Streaming
      • Kibana
