Recommender System in KKBOX



slide-1
SLIDE 1

Recommender System in KKBOX

slide-2
SLIDE 2
slide-3
SLIDE 3

Simple → Complex: Attribute Based, Collaborative Filtering, Embedding Representation, Ranking Model based, Context Aware, Persona Aware

slide-4
SLIDE 4
slide-5
SLIDE 5
slide-6
SLIDE 6
slide-7
SLIDE 7

#items x #users x #attributes

slide-8
SLIDE 8

Precision Diversity Serendipity/Novelty
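The three goals on this slide can be made concrete with simple offline metrics. The sketch below is one possible formulation and not necessarily what KKBOX measures; the function names, the Jaccard-based diversity definition, and the toy genre tags are all illustrative assumptions.

```python
from itertools import combinations

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that the user actually liked."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

def intra_list_diversity(recommended, tags):
    """Mean Jaccard distance between the tag sets of every pair in the list."""
    def jaccard_distance(a, b):
        union = tags[a] | tags[b]
        return 1.0 - len(tags[a] & tags[b]) / len(union) if union else 0.0

    pairs = list(combinations(recommended, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

def novelty(recommended, seen):
    """Share of recommended items the user has never played before."""
    return sum(1 for item in recommended if item not in seen) / len(recommended)

# Toy data (hypothetical song ids and tags).
recs = ["s1", "s2", "s3", "s4"]
tags = {"s1": {"pop"}, "s2": {"pop"}, "s3": {"rock"}, "s4": {"pop", "rock"}}

p_at_2 = precision_at_k(recs, relevant={"s1", "s3"}, k=2)
diversity = intra_list_diversity(recs, tags)
new_share = novelty(recs, seen={"s1", "s2"})
print(p_at_2, diversity, new_share)
```

Optimising precision alone tends to push the other two down, which is presumably why the slide lists them together.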

slide-9
SLIDE 9
slide-10
SLIDE 10

Collaborative Filtering

Matrix Factorization
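As a reminder of what matrix factorization does, here is a minimal sketch that learns user and item latent factors with stochastic gradient descent on observed (user, item, score) triples. The hyperparameters and the toy scores are assumptions for illustration; a production system would more likely use a distributed solver.

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Learn P (users x k) and Q (items x k) so that P[u] . Q[i] approximates the score."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularisation.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def predict(P, Q, u, i):
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

# Toy (user, item, score) observations; the missing pairs are what we want to predict.
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 5.0), (2, 2, 1.0)]
P, Q = factorize(ratings, n_users=3, n_items=3)

mae = sum(abs(predict(P, Q, u, i) - r) for u, i, r in ratings) / len(ratings)
print("training MAE:", round(mae, 3))
print("predicted score for user 0, item 2:", round(predict(P, Q, 0, 2), 2))
```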

slide-11
SLIDE 11

Word2Vec: “The results, to our own surprise, show that the buzz is fully justified, as the context-predicting models obtain a thorough and resounding victory against their count-based counterparts.” (Baroni et al., 2014)

“You shall know a word by the company it keeps” (J. R. Firth)

slide-12
SLIDE 12

CBOW vs. Skip-gram
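The two word2vec architectures differ only in what is predicted from what: CBOW predicts the centre word from its context, skip-gram predicts each context word from the centre. The sketch below only generates the training examples for each variant (it does not train a network); the function names and window size are illustrative.

```python
def skipgram_pairs(sequence, window=2):
    """Yield (centre, context) pairs: skip-gram predicts each context word from the centre."""
    for pos, centre in enumerate(sequence):
        lo, hi = max(0, pos - window), min(len(sequence), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                yield centre, sequence[ctx_pos]

def cbow_examples(sequence, window=2):
    """Yield (context_words, centre) examples: CBOW predicts the centre from its context."""
    for pos, centre in enumerate(sequence):
        lo, hi = max(0, pos - window), min(len(sequence), pos + window + 1)
        context = [sequence[p] for p in range(lo, hi) if p != pos]
        if context:
            yield context, centre

sentence = ["you", "shall", "know", "a", "word"]
pairs = list(skipgram_pairs(sentence, window=1))
examples = list(cbow_examples(sentence, window=1))
print(pairs[:3])
print(examples[2])
```

Replacing words with song ids and sentences with listening sequences gives the same training data for item embeddings.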

slide-13
SLIDE 13
slide-14
SLIDE 14
slide-15
SLIDE 15
slide-16
SLIDE 16

DeepWalk (Bryan Perozzi, Rami Al-Rfou & Steven Skiena, 2014) = Random Walk + Word2Vec
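The random-walk half of DeepWalk is easy to sketch: start from every node, repeatedly hop to a uniformly chosen neighbour, and treat each walk as a "sentence" for word2vec. The tiny user-song graph, node names, and walk parameters below are assumptions for illustration; edge weights are ignored for simplicity.

```python
import random

def random_walks(graph, walks_per_node=2, walk_length=5, seed=0):
    """Generate truncated uniform random walks starting from every node."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit start nodes in a fresh random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = graph[walk[-1]]
                if not neighbours:
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks

# Hypothetical bipartite graph: users link to songs they played, and vice versa.
graph = {
    "user_a": ["song_1", "song_2"],
    "user_b": ["song_2", "song_3"],
    "song_1": ["user_a"],
    "song_2": ["user_a", "user_b"],
    "song_3": ["user_b"],
}
walks = random_walks(graph)
print(walks[0])
```

Each walk can then be passed to any word2vec implementation as one sentence, so that nodes visited close together end up with similar vectors.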

slide-17
SLIDE 17

青花瓷

珊瑚海 我不配 給我一首歌的時間

黃金甲

珊瑚海 雙截棍 天地一鬥

slide-18
SLIDE 18

(#items + #users) x log(#items + #users) x #hidden nodes x window_size

slide-19
SLIDE 19

Cold start?

Learn the relationships between latent factors and audio signals
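One way to read this slide: for songs that already have latent factors, fit a model from audio features to those factors, then apply it to brand-new songs that have no play history. The deck does not say which model is used; the sketch below substitutes a plain linear map fitted by gradient descent, and the audio descriptors (tempo, energy) and all numbers are made up for illustration.

```python
def fit_linear_map(features, factors, lr=0.1, epochs=3000):
    """Fit W (k x d) minimising mean squared error between W @ x and the latent factors y."""
    d, k, n = len(features[0]), len(factors[0]), len(features)
    W = [[0.0] * d for _ in range(k)]
    for _ in range(epochs):
        grad = [[0.0] * d for _ in range(k)]
        for x, y in zip(features, factors):
            pred = [sum(W[r][c] * x[c] for c in range(d)) for r in range(k)]
            for r in range(k):
                err = pred[r] - y[r]
                for c in range(d):
                    grad[r][c] += err * x[c] / n
        for r in range(k):
            for c in range(d):
                W[r][c] -= lr * grad[r][c]
    return W

def predict_factors(W, x):
    return [sum(w * xc for w, xc in zip(row, x)) for row in W]

# Known songs: [normalised tempo, energy, bias term] -> latent factors (toy values).
audio = [[0.2, 0.9, 1.0], [0.8, 0.1, 1.0], [0.5, 0.5, 1.0], [0.9, 0.7, 1.0], [0.1, 0.2, 1.0]]
latent = [[-0.7, 0.65], [0.7, 0.25], [0.0, 0.45], [0.2, 0.55], [-0.1, 0.3]]

W = fit_linear_map(audio, latent)
new_song = predict_factors(W, [0.4, 0.6, 1.0])  # a song with no listening history
print([round(v, 2) for v in new_song])
```

The predicted factors place the new song in the same space as existing songs, so the usual nearest-neighbour lookups work without any play logs.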

slide-20
SLIDE 20
slide-21
SLIDE 21
slide-22
SLIDE 22

We have features, and ranking is another problem

slide-23
SLIDE 23

User Understanding / Content Understanding

Learn to Rank

  • Embedding
  • Classification
  • Topic Mining
  • User Profiling
  • Embedding

Click/Play Prediction

  • Regression
  • Classification
slide-24
SLIDE 24

A free bonus (買菜送蔥, "buy vegetables, get scallions free"): Building a pipeline

Data pre-processing → Feature extraction → Model fitting → Validation stages

  • Data pre-processing: ETL job
  • Feature extraction: numerical / categorical ...
  • Model fitting: Logistic Regression / GBDT
  • Validation stages: cross validation
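The four stages can be sketched end to end in a few dozen lines. This is a toy, single-machine illustration using only the standard library, not the Spark pipeline the deck describes; the record fields (`plays`, `genre`, `liked`) and the synthetic data are assumptions.

```python
import math
import random

def preprocess(rows):
    """Data pre-processing: drop records with missing fields (stand-in for an ETL job)."""
    return [r for r in rows if r["plays"] is not None and r["genre"] is not None]

def extract_features(row, genres):
    """Feature extraction: a bias term, one numerical and one one-hot categorical feature."""
    one_hot = [1.0 if row["genre"] == g else 0.0 for g in genres]
    return [1.0, math.log1p(row["plays"])] + one_hot

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Model fitting: logistic regression trained with batch gradient descent."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, label in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for j, xj in enumerate(x):
                grad[j] += (p - label) * xj / len(X)
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def cross_validate(X, y, folds=3):
    """Validation: k-fold cross validation, returning mean held-out accuracy."""
    scores = []
    for f in range(folds):
        test_idx = set(range(f, len(X), folds))
        train_X = [x for i, x in enumerate(X) if i not in test_idx]
        train_y = [t for i, t in enumerate(y) if i not in test_idx]
        w = fit_logistic(train_X, train_y)
        hits = sum(predict(w, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / folds

# Synthetic records: heavily played songs are labelled as liked.
rng = random.Random(0)
genres = ["pop", "rock", "jazz"]
rows = []
for n in range(30):
    liked = n % 2
    plays = rng.randint(15, 40) if liked else rng.randint(0, 2)
    rows.append({"plays": plays, "genre": rng.choice(genres), "liked": liked})
rows.append({"plays": None, "genre": "pop", "liked": 0})  # dirty record, dropped below

clean = preprocess(rows)
X = [extract_features(r, genres) for r in clean]
y = [r["liked"] for r in clean]
accuracy = cross_validate(X, y)
print("mean CV accuracy:", accuracy)
```

Keeping each stage as a separate function mirrors the slide: any stage can be swapped (for example GBDT instead of logistic regression) without touching the others.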

slide-25
SLIDE 25

Challenges

  • Big data
  • Heterogeneous sources
  • Various formatting
  • Data versioning
  • Data quality
  • Data freshness
  • Cost
  • Coding is hard, debugging is harder
slide-26
SLIDE 26

Logs: Parquet, JSON, TSV, Text, …

Databases: Songs, Members, …

External Datasets / Logs: Genre, BPM, Artist DB, Mixpanel, App Annie, …

ETL

  • Data cleaning, normalization
  • Pre aggregation / join

Parquet files in S3, partitioned by date and service region if needed.
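A common way to lay out such partitions is the Hive-style `key=value` directory convention, which query engines like Presto and Hive can use to prune partitions. The deck does not show the actual layout, so the sketch below is an assumption, and the bucket and table names are hypothetical.

```python
from datetime import date

def partition_prefix(bucket, table, day, region=None):
    """Build a Hive-style partitioned S3 prefix: date first, then optional service region."""
    parts = [f"s3://{bucket}/{table}", f"date={day.isoformat()}"]
    if region is not None:
        parts.append(f"region={region}")
    return "/".join(parts) + "/"

# Hypothetical bucket and table names.
with_region = partition_prefix("example-bucket", "play_logs", date(2017, 1, 31), "tw")
without_region = partition_prefix("example-bucket", "songs", date(2017, 1, 31))
print(with_region)
print(without_region)
```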

slide-27
SLIDE 27

ETL Data (Parquet files on S3)
Thrift (or Protobuf, Avro)
Hive Table Schema
Presto (or Amazon Athena)

DB Replication

  • Apache Spark (Scala)
      ○ From files on S3 to RDD / DataFrame
      ○ Use JDBC driver from Presto
  • Python / R
      ○ Read file from S3, deserialize Parquet
      ○ Use JDBC/ODBC driver from Presto

slide-28
SLIDE 28

Example

slide-29
SLIDE 29

Challenges

  • Big data = EC2 + Spark + Hadoop Family + Presto
  • Heterogeneous sources = ETL
  • Various formatting = ETL
  • Data versioning = ETL
  • Data quality = ETL
  • Data freshness = DB Replication, Data Streaming
  • Cost = EC2, Good Tool Chain
  • Coding is hard, debugging is harder = Good Design
slide-30
SLIDE 30

Case Study

slide-31
SLIDE 31

Nearest Neighbors of Songs

  • 1. Build a weighted bipartite graph of users and songs from logs
      ○ Terabytes of data, billions of nodes and edges
      ○ Spark cluster on EC2 (on-demand, hundreds of cores, I/O optimized)
  • 2. Put each song in a vector space
      ○ Random walks
      ○ An embedding model (we use word2vec)
      ○ On a very large instance with a lot of memory and cores
  • 3. Find K-NN for each song
      ○ Exact O(n^2) comparison is infeasible at this scale
      ○ Approximation, for example Locality-Sensitive Hashing
      ○ Using a Spark cluster on EC2 whose worker nodes are CPU optimized

All intermediate results are in Parquet format on S3, so we can inspect them with Presto.
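To illustrate the approximation step, here is a minimal single-machine sketch of one Locality-Sensitive Hashing scheme, random-hyperplane hashing for cosine similarity. The deck does not say which LSH family or parameters are used, so the choice here, the function names, and the toy vectors are assumptions.

```python
import random
from collections import defaultdict

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(int(sum(h * v for h, v in zip(plane, vector)) >= 0) for plane in hyperplanes)

def build_index(vectors, hyperplanes):
    """Group song ids by signature; similar directions tend to share a bucket."""
    buckets = defaultdict(list)
    for key, vec in vectors.items():
        buckets[signature(vec, hyperplanes)].append(key)
    return buckets

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def approx_knn(query_key, vectors, buckets, hyperplanes, k=2):
    """Rank only the songs in the query's bucket instead of comparing against all n songs."""
    query = vectors[query_key]
    candidates = [c for c in buckets[signature(query, hyperplanes)] if c != query_key]
    candidates.sort(key=lambda c: cosine(query, vectors[c]), reverse=True)
    return candidates[:k]

# Toy song embeddings (hypothetical): song_b points the same way as song_a, song_c the opposite way.
vectors = {
    "song_a": [1.0, 2.0, 3.0],
    "song_b": [2.0, 4.0, 6.0],
    "song_c": [-1.0, -2.0, -3.0],
}
planes = make_hyperplanes(dim=3, n_bits=8)
index = build_index(vectors, planes)
neighbours = approx_knn("song_a", vectors, index, planes)
print(neighbours)
```

A single hash table can miss true neighbours that fall just across a hyperplane; practical setups usually query several independent tables and merge the candidates.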

slide-32
SLIDE 32

Songs a User Likes to Listen to Again

  • 1. Extract features from logs, databases, and external datasets
      ○ Join billions of transactions
      ○ Spark cluster on EC2 (on-demand, hundreds of cores)
  • 2. Train a model
      ○ Spark MLlib (e.g. GBDT)
      ○ Deep learning frameworks (TensorFlow)
  • 3. Repeat: feature selection, parameter tuning
  • 4. Predict from recent logs
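Step 1 can be sketched on a toy play log: features are computed from plays before a cutoff day, and the label records whether the same user played the same song again afterwards. The log schema `(user, song, day)`, the two features, and the cutoff are illustrative assumptions; the real feature set is not given in the deck.

```python
from collections import defaultdict

def build_examples(logs, cutoff_day):
    """Return (pair, features, label) rows for every (user, song) seen before cutoff_day."""
    history = defaultdict(list)
    replayed = set()
    for user, song, day in logs:
        if day < cutoff_day:
            history[(user, song)].append(day)
        else:
            replayed.add((user, song))
    examples = []
    for pair, days in sorted(history.items()):
        features = {
            "play_count": len(days),
            "days_since_last_play": cutoff_day - max(days),
        }
        # Label: 1 if this user played this song again on or after the cutoff.
        examples.append((pair, features, int(pair in replayed)))
    return examples

# Toy play log: (user, song, day number).
logs = [
    ("u1", "s1", 1), ("u1", "s1", 3), ("u1", "s2", 2),
    ("u1", "s1", 6), ("u2", "s2", 4), ("u2", "s3", 7),
]
examples = build_examples(logs, cutoff_day=5)
for row in examples:
    print(row)
```

The resulting rows are what a classifier such as GBDT would be trained on; at prediction time the same features are computed from the most recent logs.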
slide-33
SLIDE 33

Life cycle of ML-related features

Define the Problem → Inspect the Data → Hypothesis → Train and Verify the Model → Deploy and A/B Testing

slide-34
SLIDE 34

References

  • Apache Spark
  • Apache Parquet
  • Apache Thrift
  • Apache Hive
  • Presto
  • Amazon Elastic Compute Cloud