Recommender System in KKBOX
Simple → Complex: Attribute Based, Collaborative Filtering, Embedding Representation, Ranking Model Based, Context Aware, Persona Aware
#Item x #users x #attributes
Precision Diversity Serendipity/Novelty
Collaborative Filtering
Matrix Factorization
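A minimal sketch of matrix factorization for collaborative filtering, assuming explicit (user, song, strength) triples and plain SGD with L2 regularization. The function names, hyperparameters, and toy data are illustrative assumptions, not the model used at KKBOX.

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02, epochs=1000, seed=0):
    """Fit user factors P and item factors Q so that P[u] . Q[i] approximates r."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    """Predicted strength for user u and item i."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

# Toy (user, song, strength) triples, made up for illustration.
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 5.0),
           (1, 2, 1.0), (2, 1, 4.0), (2, 2, 1.0)]
P, Q = factorize(ratings, n_users=3, n_items=3)
```

Unobserved pairs such as user 1 and song 1 can then be scored with `predict(P, Q, 1, 1)`.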
Word2Vec - “The results, to our own surprise, show that the buzz is fully justified, as the context-predicting models obtain a thorough and resounding victory against their count-based counterparts.” - Baroni et al.
“You shall know a word by the company it keeps” (Firth, J.R.)
CBOW / Skip-gram
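A small sketch of how the two Word2Vec variants frame their training examples: Skip-gram predicts each context item from the center item, while CBOW predicts the center item from its context. The helper names and the `window_size` default are assumptions for illustration; the network and its training are not shown.

```python
def context_window(sequence, idx, window_size):
    """Items within window_size positions of idx, excluding idx itself."""
    lo = max(0, idx - window_size)
    hi = min(len(sequence), idx + window_size + 1)
    return [sequence[j] for j in range(lo, hi) if j != idx]

def skipgram_pairs(sequence, window_size=2):
    """(center, context) pairs: one training example per context item."""
    return [(center, ctx)
            for idx, center in enumerate(sequence)
            for ctx in context_window(sequence, idx, window_size)]

def cbow_examples(sequence, window_size=2):
    """(context_list, center) examples: the whole context predicts the center."""
    return [(context_window(sequence, idx, window_size), center)
            for idx, center in enumerate(sequence)]
```

For a listening session `["a", "b", "c"]` with `window_size=1`, Skip-gram yields four pairs and CBOW yields three examples.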
DeepWalk (Bryan Perozzi, Rami Al-Rfou & Steven Skiena, 2014): Random Walk + Word2Vec
青花瓷
珊瑚海 我不配 給我一首歌的時間
黃金甲
珊瑚海 雙截棍 天地一鬥
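A minimal sketch of the random-walk half of DeepWalk on a user-song graph, assuming play counts as edge weights. The log format, node tagging, and function names are hypothetical. Each walk is later treated as a "sentence" for Word2Vec.

```python
import random
from collections import defaultdict

def build_graph(plays):
    """Weighted bipartite graph from (user, song, play_count) records."""
    graph = defaultdict(dict)
    for user, song, count in plays:
        u, s = ("user", user), ("song", song)
        graph[u][s] = graph[u].get(s, 0) + count
        graph[s][u] = graph[s].get(u, 0) + count
    return graph

def random_walk(graph, start, length, rng):
    """Walk `length` nodes, choosing each neighbor in proportion to edge weight."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph.get(walk[-1], {})
        if not neighbors:
            break
        nodes = list(neighbors)
        weights = [neighbors[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk

# Toy play logs, made up for illustration.
plays = [("u1", "青花瓷", 3), ("u1", "珊瑚海", 1),
         ("u2", "珊瑚海", 2), ("u2", "黃金甲", 5)]
graph = build_graph(plays)
walk = random_walk(graph, ("song", "青花瓷"), 5, random.Random(0))
```

Because the graph is bipartite, a walk alternates between song and user nodes.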
(#item + #users) x log(#Item + #users) x #hidden nodes x window_size
Cold start?
Learn the relationships between latent factors and audio signals
We got features - And ranking is another problem
User Understanding / Content Understanding
Learn to Rank
- Embedding
- Classification
- Topic Mining
- User Profiling
- Embedding
Click/Play Prediction
- Regression
- Classification
Buy vegetables, get free scallions (a bonus topic): Building a pipeline
Data pre-processing (ETL Job) → Feature extraction (Numerical/Categorical...) → Model fitting (Logistic Regression/GBDT) → Validation stages (Cross Validation)
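The four stages can be sketched end to end in plain Python. This is a toy illustration under stated assumptions: the field names (`plays`, `genre`, `label`), the hand-written logistic regression, and the data are all hypothetical, and a real pipeline would use a library model such as those in Spark MLlib.

```python
import math

def preprocess(rows):
    """Data pre-processing: drop rows with missing values."""
    return [r for r in rows if None not in r.values()]

def extract_features(rows, categories):
    """Feature extraction: numerical value as-is, categorical value one-hot."""
    X, y = [], []
    for r in rows:
        one_hot = [1.0 if r["genre"] == c else 0.0 for c in categories]
        X.append([1.0, r["plays"]] + one_hot)  # leading 1.0 is the bias term
        y.append(r["label"])
    return X, y

def fit_logistic(X, y, lr=0.1, epochs=200):
    """Model fitting: logistic regression trained with SGD."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(a * b for a, b in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            w = [wj + lr * (yi - p) * xj for wj, xj in zip(w, xi)]
    return w

def accuracy(w, X, y):
    hits = sum((sum(a * b for a, b in zip(w, xi)) > 0) == (yi == 1)
               for xi, yi in zip(X, y))
    return hits / len(y)

def cross_validate(X, y, folds=2):
    """Validation: k-fold cross validation, folds assigned by index modulo."""
    scores = []
    for f in range(folds):
        train = [i for i in range(len(y)) if i % folds != f]
        test = [i for i in range(len(y)) if i % folds == f]
        w = fit_logistic([X[i] for i in train], [y[i] for i in train])
        scores.append(accuracy(w, [X[i] for i in test], [y[i] for i in test]))
    return sum(scores) / folds

# Toy rows, made up for illustration; one row has a missing value.
rows = [
    {"plays": 0.9, "genre": "pop", "label": 1},
    {"plays": 0.8, "genre": "pop", "label": 1},
    {"plays": None, "genre": "pop", "label": 0},
    {"plays": 0.1, "genre": "pop", "label": 0},
    {"plays": 0.2, "genre": "pop", "label": 0},
    {"plays": 0.7, "genre": "rock", "label": 1},
    {"plays": 0.95, "genre": "rock", "label": 1},
    {"plays": 0.3, "genre": "rock", "label": 0},
    {"plays": 0.05, "genre": "rock", "label": 0},
]
X, y = extract_features(preprocess(rows), categories=["pop", "rock"])
score = cross_validate(X, y, folds=2)
```

Keeping each stage a separate function makes it possible to inspect and test intermediate results on their own.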
Challenges
- Big data
- Heterogeneous sources
- Various formatting
- Data versioning
- Data quality
- Data freshness
- Cost
- Coding is hard, debugging is harder
Logs: Parquet, JSON, TSV, Text, ...
Databases: Songs, Members, ...
External Datasets / Logs: Genre, BPM, Artist DB, Mixpanel, App Annie, ...
ETL
- Data cleaning, normalization
- Pre-aggregation / join
Parquet files in S3, partitioned by date and service region if needed.
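One possible way to lay out such partitions is the Hive-style `key=value` directory convention, which Hive and Presto can map to partition columns. The sketch below only builds the object path; the bucket, table, and file names are hypothetical, and the slides do not state the exact layout used.

```python
from datetime import date

def partition_path(bucket, table, day, region=None, filename="part-00000.parquet"):
    """Build a Hive-style partitioned S3 path; region is optional."""
    parts = [f"s3://{bucket}/{table}", f"date={day.isoformat()}"]
    if region is not None:
        parts.append(f"region={region}")
    parts.append(filename)
    return "/".join(parts)

# Hypothetical bucket and table names.
path = partition_path("example-bucket", "play_logs", date(2017, 1, 2), "TW")
```

Partitioning by date lets a query over a few days read only the matching directories instead of the whole table.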
ETL
Data (Parquet files on S3)
Thrift (or Protobuf, Avro)
Hive Table Schema
Presto (or Amazon Athena)
DB Replication
- Apache Spark (Scala)
○ From files on S3 to RDD / DataFrame
○ Use JDBC driver from Presto
- Python / R
○ Read file from S3, deserialize Parquet
○ Use JDBC/ODBC driver from Presto
Example
Challenges
- Big data = EC2 + Spark + Hadoop Family + Presto
- Heterogeneous sources = ETL
- Various formatting = ETL
- Data versioning = ETL
- Data quality = ETL
- Data freshness = DB Replication, Data Streaming
- Cost = EC2, Good Tool Chain
- Coding is hard, debugging is harder = Good Design
Case Study
Nearest Neighbors of Songs
- 1. Build a weighted bipartite graph of users and songs from logs
- Terabytes of data, billions of nodes and edges
- Spark cluster on EC2. (On-demand, hundreds of cores, I/O optimized)
- 2. Put each song on a vector space
- Random walks
- An embedding model (we use word2vec)
- On a very large instance with a lot of memory and cores.
- 3. Find K-NN for each song
- O(n^2) is impossible
- Approximation, for example Locality-Sensitive Hashing
- Using a Spark cluster on EC2; the worker nodes are CPU-optimized.
All intermediate results are in Parquet format on S3, so we can inspect them with Presto.
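Step 3 can be illustrated with random-hyperplane Locality-Sensitive Hashing, one common LSH family for cosine similarity. This is a single-machine sketch under assumptions: the slides do not say which LSH variant or library was used, and all names and toy vectors here are hypothetical.

```python
import math
import random
from collections import defaultdict

def make_hyperplanes(dim, n_bits, seed=0):
    """Random Gaussian hyperplanes; each contributes one bit of the signature."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def signature(vec, planes):
    """Bit i records on which side of hyperplane i the vector falls."""
    return tuple(1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
                 for plane in planes)

def build_index(vectors, planes):
    """Group song keys by signature so similar vectors tend to share a bucket."""
    buckets = defaultdict(list)
    for key, vec in vectors.items():
        buckets[signature(vec, planes)].append(key)
    return buckets

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def approx_knn(key, vectors, planes, buckets, k):
    """Rank only the candidates in the query's bucket instead of all n songs."""
    bucket = buckets.get(signature(vectors[key], planes), [])
    candidates = [c for c in bucket if c != key]
    candidates.sort(key=lambda c: cosine(vectors[key], vectors[c]), reverse=True)
    return candidates[:k]

# Toy song vectors, made up for illustration.
vectors = {"a": [1.0, 0.0], "b": [2.0, 0.0], "c": [-1.0, 0.0]}
planes = make_hyperplanes(dim=2, n_bits=4)
buckets = build_index(vectors, planes)
neighbors = approx_knn("a", vectors, planes, buckets, k=2)
```

More bits make buckets smaller and faster to scan but raise the chance of missing a true neighbor; using several independent hash tables is a common way to recover recall.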
Songs a User Likes to Listen to Again
- 1. Extract features from logs, databases, and external datasets
- Join billions of transactions.
- Spark cluster on EC2. (On-demand, hundreds of cores)
- 2. Train a model
- Spark MLlib (e.g., GBDT)
- Deep learning frameworks (TensorFlow)
- 3. Repeat: feature selection, parameter tuning
- 4. Predict from recent logs
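A sketch of what step 1 might look like for this case, assuming play logs of the form (user, song, day number). The feature names, the label definition, and the log format are hypothetical; the slides do not list the actual features.

```python
from collections import defaultdict

def replay_features(logs, as_of_day):
    """Per (user, song) features computed from plays up to and including as_of_day."""
    days = defaultdict(list)
    for user, song, day in logs:
        if day <= as_of_day:
            days[(user, song)].append(day)
    return {
        pair: {
            "play_count": len(ds),
            "active_days": len(set(ds)),
            "days_since_last": as_of_day - max(ds),
        }
        for pair, ds in days.items()
    }

def replay_labels(logs, pairs, as_of_day, horizon):
    """Label 1 if the pair is played again within `horizon` days after as_of_day."""
    future = {(u, s) for u, s, d in logs if as_of_day < d <= as_of_day + horizon}
    return {pair: int(pair in future) for pair in pairs}

# Toy logs, made up for illustration.
logs = [("u1", "s1", 1), ("u1", "s1", 1), ("u1", "s1", 3),
        ("u1", "s2", 2), ("u1", "s1", 5)]
features = replay_features(logs, as_of_day=3)
labels = replay_labels(logs, features.keys(), as_of_day=3, horizon=7)
```

Splitting on `as_of_day` keeps the features strictly in the past relative to the label, which avoids leaking future plays into training.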
Define the Problem → Inspect the Data → Hypothesis → Train and Verify the Model → Deploy and A/B Testing
Life cycle of ML-related features
References
- Apache Spark
- Apache Parquet
- Apache Thrift
- Apache Hive
- Presto
- Amazon Elastic Compute Cloud