Recommender System in KKBOX


  1. Recommender System in KKBOX

  2. Simple → Complex (diagram labels): Collaborative Filtering ● Persona Based ● Attribute Based ● Context Aware ● Embedding Representation ● Ranking Model

  3. #items x #users x #attributes

  4. Serendipity/Novelty ● Diversity ● Precision

  5. Collaborative Filtering Matrix Factorization
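
To make the matrix factorization idea concrete, here is a minimal sketch using only the Python standard library. It is not KKBOX's implementation: the toy (user, item, score) triples, the hyperparameters, and the function names `factorize` and `predict_score` are all assumptions made for illustration.

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Learn k-dimensional user and item factors with SGD on observed triples."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict_score(P, Q, u, i):
    """Predicted affinity is the dot product of user and item factors."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))

# Made-up (user, item, score) triples: two groups of users with different tastes.
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 1, 5),
           (2, 2, 5), (2, 3, 4), (3, 2, 4), (3, 3, 5)]
P, Q = factorize(ratings, n_users=4, n_items=4)
```

Calling `predict_score(P, Q, u, i)` on a pair that was never observed gives the model's guess for that user and item.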

  6. Word2Vec - “The results, to our own surprise, show that the buzz is fully justified, as the context-predicting models obtain a thorough and resounding victory against their count-based counterparts.” - Baroni et al. “You shall know a word by the company it keeps” (Firth, J.R.)

  7. CBOW ● Skip-gram
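
The difference between the two Word2Vec training setups can be sketched by how training examples are built from a sequence. This is only an illustration of the data preparation step, not a full Word2Vec trainer; the function names and the `window` parameter are assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: predict each context token from the center token."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: predict the center token from its surrounding context tokens."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, center))
    return examples

# A listening session can play the role of a sentence, with songs as words.
session = ["song_a", "song_b", "song_c"]
pairs = skipgram_pairs(session, window=1)
examples = cbow_examples(session, window=1)
```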

  8. DeepWalk (Bryan Perozzi, Rami Al-Rfou & Steven Skiena, 2014): Random Walk + Word2Vec
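
The random-walk half of DeepWalk can be sketched in a few lines: walks over a graph produce node sequences, which are then fed to Word2Vec as if they were sentences. The tiny user/song graph, the uniform neighbor sampling, and the parameter names below are assumptions for illustration; a weighted graph would sample neighbors in proportion to edge weight.

```python
import random

def random_walks(graph, walk_length=5, walks_per_node=2, seed=0):
    """Generate fixed-length uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(graph)
        rng.shuffle(nodes)  # visit start nodes in a random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph[walk[-1]]
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Hypothetical bipartite graph: users link to songs they played, and back.
graph = {
    "user_a": ["song_1", "song_2"],
    "user_b": ["song_2", "song_3"],
    "song_1": ["user_a"],
    "song_2": ["user_a", "user_b"],
    "song_3": ["user_b"],
}
walks = random_walks(graph)
```

Each walk alternates between users and songs, so songs that share listeners tend to appear close together in the sequences.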

  9. 青花瓷 ● 給我一首歌的時間 ● 珊瑚海 ● 我不配 ● 黃金甲 ● 珊瑚海 ● 雙截棍 ● 天地一鬥

  10. (#items + #users) x log(#items + #users) x #hidden nodes x window_size

  11. Cold start? Learn the relationship between latent factors and audio signals
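
One way to read this slide: for songs that already have latent factors, fit a mapping from audio features to those factors, then apply it to new songs that have no listening history yet. The sketch below uses a plain linear map fitted by gradient descent. The two-dimensional audio features, the toy values, and the linear model itself are assumptions; the slides do not say which model is used.

```python
def fit_linear_map(X, Y, lr=0.1, steps=2000):
    """Fit W minimizing mean squared error between X @ W and Y."""
    n, d, k = len(X), len(X[0]), len(Y[0])
    W = [[0.0] * k for _ in range(d)]
    for _ in range(steps):
        grad = [[0.0] * k for _ in range(d)]
        for x, y in zip(X, Y):
            pred = [sum(x[a] * W[a][b] for a in range(d)) for b in range(k)]
            for a in range(d):
                for b in range(k):
                    grad[a][b] += 2.0 * x[a] * (pred[b] - y[b]) / n
        for a in range(d):
            for b in range(k):
                W[a][b] -= lr * grad[a][b]
    return W

def predict_factors(W, x):
    """Map one audio feature vector to an estimated latent factor vector."""
    return [sum(x[a] * W[a][b] for a in range(len(x))) for b in range(len(W[0]))]

# Made-up data: audio features of known songs and their learned latent factors.
audio_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latent_factors = [[0.5, -1.0], [2.0, 0.3], [2.5, -0.7]]
W = fit_linear_map(audio_features, latent_factors)

# A brand-new song with audio features but no plays yet.
new_song_factors = predict_factors(W, [0.2, 0.8])
```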

  12. We have features. Ranking is another problem.

  13. Click/Play Prediction ● Regression ● Classification

      Learn to Rank

      Content Understanding ● Embedding ● Classification ● Topic Mining

      User Understanding ● User Profiling ● Embedding

  14. Bonus (買菜送蔥, literally "buy vegetables, get free scallions"): Building a pipeline

      ● Data pre-processing → ETL Job
      ● Feature extraction → Numerical / Categorical...
      ● Model fitting → Logistic Regression / GBDT
      ● Validation stages → Cross Validation
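
The pipeline stages can be sketched end to end with the standard library alone: feature extraction (one numerical and one categorical feature), logistic regression fitting, and cross validation. This is a toy stand-in rather than the production pipeline; the row layout, the feature names, the genre list, and the hyperparameters are all assumptions.

```python
import math

GENRES = ["rock", "pop"]  # hypothetical categorical values

def extract_features(plays, genre):
    """Scaled numerical feature + one-hot categorical feature + bias term."""
    one_hot = [1.0 if genre == g else 0.0 for g in GENRES]
    return [plays / 10.0] + one_hot + [1.0]

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Logistic regression trained with stochastic gradient steps."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            w = [wi + lr * (t - p) * xi for wi, xi in zip(w, x)]
    return w

def classify(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def cross_validate(X, y, folds=3):
    """Hold out every `folds`-th row in turn; return accuracy per fold."""
    scores = []
    for f in range(folds):
        held_out = set(range(f, len(X), folds))
        train = [i for i in range(len(X)) if i not in held_out]
        w = fit_logistic([X[i] for i in train], [y[i] for i in train])
        hits = sum(classify(w, X[i]) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return scores

# Made-up rows standing in for ETL output: (plays, genre, replayed).
rows = [(1, "rock", 0), (9, "pop", 1), (2, "pop", 0), (10, "rock", 1),
        (1, "rock", 0), (9, "pop", 1), (2, "pop", 0), (10, "rock", 1),
        (1, "rock", 0), (9, "pop", 1), (2, "pop", 0), (10, "rock", 1)]
X = [extract_features(plays, genre) for plays, genre, _ in rows]
y = [label for _, _, label in rows]
scores = cross_validate(X, y)
```

In practice each stage would be a separate, swappable step, which is what makes it cheap to try another model such as GBDT on the same features.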

  15. Challenges ● Big data ● Heterogeneous sources ● Various formatting ● Data versioning ● Data quality ● Data freshness ● Cost ● Coding is hard, debugging is harder

  16. ● Logs: Parquet, JSON, TSV, Text, ...
      ● External Datasets / Logs: Genre, BPM, Artist DB, Mixpanel, App Annie, ...
      ● Databases: Songs, Members, ...

      ETL ● Data cleaning, normalization ● Pre-aggregation / join

      Parquet files in S3, partitioned by date and service region if needed.
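
Partitioning by date and service region is often laid out as `key=value` path segments so that query engines can skip irrelevant files. The sketch below builds such prefixes. The `key=value` layout, the bucket name `example-bucket`, and the region codes are assumptions; the slides only state that the files are partitioned by date and, if needed, by service region.

```python
from datetime import date, timedelta

def partition_prefix(base, day, region=None):
    """Build a prefix like base/date=2017-01-01/region=tw/ for one partition."""
    parts = [base.rstrip("/"), f"date={day.isoformat()}"]
    if region is not None:
        parts.append(f"region={region}")
    return "/".join(parts) + "/"

def prefixes_for_range(base, start, end, regions=(None,)):
    """List the partition prefixes covering an inclusive date range."""
    n_days = (end - start).days + 1
    return [partition_prefix(base, start + timedelta(days=d), r)
            for d in range(n_days) for r in regions]

# Hypothetical bucket and dataset name.
BASE = "s3://example-bucket/play_logs"
prefixes = prefixes_for_range(BASE, date(2017, 1, 1), date(2017, 1, 2), regions=("tw", "jp"))
```

A job that only needs one day of one region can then read a single prefix instead of scanning the whole dataset.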

  17. ETL ● Data (Parquet files on S3) ● DB Replication ● Thrift (or Protobuf, Avro) Schema ● Hive Table ● Presto (or Amazon Athena)

      ● Apache Spark (Scala)
        ○ From files on S3 to RDD / DataFrame
        ○ Use JDBC driver from Presto
      ● Python / R
        ○ Read file from S3, deserialize Parquet
        ○ Use JDBC/ODBC driver from Presto

  18. Example

  19. Challenges ● Big data = EC2 + Spark + Hadoop Family + Presto ● Heterogeneous sources = ETL ● Various formatting = ETL ● Data versioning = ETL ● Data quality = ETL ● Data freshness = DB Replication, Data Streaming ● Cost = EC2, Good Tool Chain ● Coding is hard, debugging is harder - Good Design

  20. Case Study

  21. Nearest Neighbors of Songs

      1. Build a weighted bipartite graph of users and songs from logs
         ● Terabytes of data, billions of nodes and edges
         ● Spark cluster on EC2 (on-demand, hundreds of cores, I/O optimized)
      2. Put each song in a vector space
         ● Random walks
         ● An embedding model (we use word2vec)
         ● On a very, very large instance with a lot of memory and cores
      3. Find K-NN for each song
         ● O(n^2) is impossible
         ● Approximation, for example Locality-Sensitive Hashing
         ● Using a Spark cluster on EC2; the worker nodes are CPU optimized

      All intermediate results are in Parquet format on S3, so we can inspect them with Presto.
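
Locality-Sensitive Hashing avoids the O(n^2) comparison by hashing similar vectors into the same bucket and only comparing within a bucket. The sketch below uses random-hyperplane LSH for cosine similarity with a single hash table. The choice of this LSH family, the toy vectors, and all names are assumptions; the slides name LSH only as an example of approximation.

```python
import math
import random
from collections import defaultdict

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
                 for plane in planes)

def build_index(vectors, planes):
    buckets = defaultdict(list)
    for name, vec in vectors.items():
        buckets[signature(vec, planes)].append(name)
    return buckets

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def approx_knn(query, vectors, planes, buckets, k=2):
    """Rank only the candidates that share the query's bucket."""
    q = vectors[query]
    candidates = [n for n in buckets[signature(q, planes)] if n != query]
    candidates.sort(key=lambda n: cosine(q, vectors[n]), reverse=True)
    return candidates[:k]

# Made-up song embeddings: song_b points the same way as song_a, song_c the opposite way.
vectors = {"song_a": [1.0, 0.0, 0.2], "song_b": [2.0, 0.0, 0.4], "song_c": [-1.0, 0.0, -0.2]}
planes = make_hyperplanes(dim=3, n_bits=4)
buckets = build_index(vectors, planes)
neighbors = approx_knn("song_a", vectors, planes, buckets)
```

A single table can miss true neighbors that fall just across a hyperplane, so real deployments typically use several tables with independent hyperplanes and merge the candidates.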

  22. Songs a User Likes to Listen to Again

      1. Extract features from logs, databases, and external data sets
         ● Join billions of transactions
         ● Spark cluster on EC2 (on-demand, hundreds of cores)
      2. Train a model
         ● Spark MLlib (e.g. GBDT)
         ● Deep learning frameworks (TensorFlow)
      3. Repeat: feature selection, parameter tuning
      4. Predict from recent logs
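
As a rough illustration of the feature extraction step, the sketch below turns raw play logs into labeled training examples: features come from plays before a split day, and the label records whether the user played the song again afterwards. The log layout `(user, song, day)`, the two features, and the split-day labeling scheme are assumptions made for the example.

```python
from collections import Counter

def build_examples(logs, split_day):
    """Features from plays before split_day; label = replayed on or after it."""
    history = Counter()
    last_day = {}
    replayed = set()
    for user, song, day in logs:
        key = (user, song)
        if day < split_day:
            history[key] += 1
            last_day[key] = max(last_day.get(key, day), day)
        else:
            replayed.add(key)
    examples = []
    for key, plays in history.items():
        features = {"plays": plays, "days_since_last_play": split_day - last_day[key]}
        examples.append((key, features, 1 if key in replayed else 0))
    return examples

# Made-up log rows: (user, song, day index).
logs = [("u1", "s1", 1), ("u1", "s1", 2), ("u1", "s2", 2),
        ("u1", "s1", 5), ("u2", "s3", 6)]
examples = build_examples(logs, split_day=4)
```

Pairs that only appear after the split day, such as `("u2", "s3")` here, produce no training example because there is no history to build features from.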

  23. Life cycle of ML-related features (cycle diagram): Define the Problem ● Inspect the Data ● Hypothesis ● Train and Verify the Model ● Deploy and A/B Testing

  24. References ● Apache Spark ● Apache Parquet ● Apache Thrift ● Apache Hive ● Presto ● Amazon Elastic Compute Cloud
