Distributed Implementation of the Triplets View (CS535 Big Data)


CS535 Big Data 4/8/2020 Week 11-B Sangmi Lee Pallickara http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, page 1

CS535 BIG DATA

PART B. GEAR SESSIONS

SESSION 4: LARGE SCALE RECOMMENDATION SYSTEMS AND SOCIAL MEDIA

Sangmi Lee Pallickara Computer Science, Colorado State University http://www.cs.colostate.edu/~cs535

CS535 Big Data | Computer Science | Colorado State University

FAQs

  • Today is the last day of discussion period for Session III on Piazza
  • Watch video clips on Canvas → Assignments → Echo360
  • Term project phase 1 (Proposal)
  • Feedback is available in Canvas
  • Please arrange a meeting if needed

Topics of Today’s Class

  • Part 1: Distributed implementation of Triplets View in GraphX
  • Recommendation Systems
  • Part 2: Introduction and Content-based recommendation systems
  • Part 3: Collaborative Filtering (Case study of Amazon’s Item-to-Item model and Netflix’s Latent Factor Model)

GEAR Session 3. Big Graph Analysis

Lecture 3. Distributed Large Graph Analysis-II GraphX: Graph Processing in a Distributed Dataflow Framework

Distributed Implementation of the Triplets View


Efficient lookup of edges

  • Edges within a partition are clustered by source vertex id using a compressed sparse row (CSR) representation and hash-indexed by their target id
  • CSR with an example
  • With a sparse m x n matrix M
  • Using three (one-dimensional) arrays (V, COL_INDEX, ROW_INDEX)
  • Nonzero entries of the example matrix M: 5, 8, 3, 6
  • V = [5 8 3 6]
  • COL_INDEX = [0 1 2 1]
  • column indices
  • ROW_INDEX = [0 2 3 4]
  • index in V where the given row starts

Define
row_start = ROW_INDEX[row]
row_end = ROW_INDEX[row+1]
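
The row lookup defined above can be sketched in a few lines of Python. This is only an illustration: the array values are copied from the example as it appears on the slide, the helper name `row_entries` is hypothetical, and the number of rows is simply whatever ROW_INDEX implies.

```python
# CSR arrays copied from the slide's example.
V = [5, 8, 3, 6]          # nonzero values, stored row by row
COL_INDEX = [0, 1, 2, 1]  # column of each value in V
ROW_INDEX = [0, 2, 3, 4]  # offset in V where each row starts; last entry is len(V)


def row_entries(row):
    """Return the (column, value) pairs stored for one row (hypothetical helper)."""
    row_start = ROW_INDEX[row]
    row_end = ROW_INDEX[row + 1]
    return [(COL_INDEX[i], V[i]) for i in range(row_start, row_end)]


# Walk every row; ROW_INDEX has one more entry than there are rows.
for r in range(len(ROW_INDEX) - 1):
    print(r, row_entries(r))
```

Because all entries of a row are contiguous in V, scanning the edges of one source vertex touches a single slice and needs no search.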

Index Reuse

  • GraphX inherits the immutability of Spark
  • All graph operators logically create new collections rather than destructively modifying existing ones
  • Derived vertex and edge collections can often share indices to reduce memory overhead and improve local performance
  • Hash index on vertices can enable fast aggregation, and the resulting aggregates share the index with the original vertices
  • Faster Joins
  • Vertex collections sharing the same index can be joined by a coordinated scan
  • Without requiring any index lookups
  • Index reuse reduces the per-iteration runtime of PageRank on the Twitter graph by 59% (GraphX paper)
  • Operators that do not modify the graph structure (e.g. mapV) automatically preserve indices
  • Operators that restrict the graph structure (e.g. subgraph) rely on bitmasks to construct restricted views
  • reindex operator
  • For operators that change the structure heavily (e.g. a heavily filtered graph)

Implementing the Triplets View

  • Triplets view
  • Three-way join between the source and destination vertex properties and the edge properties
  • Vertex Mirroring
  • Multicast Join
  • Partial Materialization
  • Incremental View Maintenance

Implementing the Triplets View: Vertex Mirroring

  • Join requires data movement
  • Vertex and edge property collections are partitioned independently
  • Three-way join
  • Shipping the vertex properties across the network to the edges
  • Setting the edge partitions as the join sites
  • Observation 1: Real-world graphs commonly have orders of magnitude more edges than vertices
  • Observation 2: A single vertex may have many edges in the same partition
  • Enabling substantial reuse of the vertex property

Implementing the Triplets View: Multicast Join

  • Broadcast join
  • All vertices are sent to each edge partition
  • Multicast join
  • Each vertex property is sent only to the edge partitions that contain adjacent edges
  • Join site information is stored in the routing table
  • Co-partitioned with the vertex collection
  • Routing table is associated with the edge collection
  • Routing table is constructed lazily upon first instantiation of the triplets view
  • Example
  • Per-city partitioning scheme on the Facebook social network graph
  • 50.5% reduction in query time
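
To make the difference between the broadcast join and the multicast join concrete, here is a minimal Python sketch. The toy graph, the partition ids, and the function name `build_routing_table` are all hypothetical; GraphX's real routing table is a distributed structure, so this only illustrates the idea of shipping each vertex to the edge partitions that actually need it.

```python
from collections import defaultdict

# Hypothetical toy graph: each edge partition holds (src, dst) vertex-id pairs.
edge_partitions = {
    0: [(1, 2), (1, 3)],
    1: [(2, 3)],
    2: [(4, 1)],
}


def build_routing_table(partitions):
    """Map each vertex id to the set of edge partitions holding an adjacent edge."""
    table = defaultdict(set)
    for pid, edges in partitions.items():
        for src, dst in edges:
            table[src].add(pid)
            table[dst].add(pid)
    return dict(table)


routing = build_routing_table(edge_partitions)

# Multicast: one message per (vertex, join site) pair in the routing table.
multicast_msgs = sum(len(pids) for pids in routing.values())
# Broadcast: every vertex goes to every edge partition.
broadcast_msgs = len(routing) * len(edge_partitions)

print(routing)
print(multicast_msgs, "messages with multicast vs", broadcast_msgs, "with broadcast")
```

The gap between the two counts grows when the partitioning keeps a vertex's edges together, which is the situation the per-city example describes.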


Implementing the Triplets View: Partial Materialization

  • Local joins at the edge partitions
  • Mirrored vertex properties are stored in local hash maps on each edge partition
  • Referenced when the triplets are constructed


Implementing the Triplets View: Incremental View Maintenance

  • Iterative graph algorithms often modify only a subset of the vertex properties in each iteration

  • Incremental view maintenance
  • To avoid unnecessary movement of unchanged data
  • After each graph operation
  • You can track which vertex properties have changed since the triplets view was last constructed
  • When the triplets view is accessed next time
  • Only the changed vertices are re-routed to their edge-partition join sites
  • Local mirrored values of the unchanged vertices are reused


Query Optimizations for the mrTriplets operator

  • Filtered Index Scanning
  • mrTriplets operator logically involves a scan of the triplets view to apply the user-defined map function
  • As iterative graph algorithms converge, the working sets tend to shrink
  • Map function skips many triplets
  • Active set
  • Map function only needs to operate on triplets containing active vertices
  • Defined by the application specific predicate
  • E.g. connected component analysis
  • Indexed scan for the triplets view
  • Application expresses the current active set by restricting the graph using subgraph operator
  • Filter the triplets using this vertex predicate


Query Optimizations for the mrTriplets operator

  • Automatic Join Elimination
  • Some operations on the triplets view may access only one of the vertex properties or none at all
  • E.g. counting the degree of each vertex
  • GraphX uses a JVM bytecode analyzer to inspect user-defined functions at runtime
  • Checks whether the source or destination vertex properties are referenced
  • If only one property is referenced and the triplets view has not already been materialized

  • GraphX rewrites the query plan for generating the triplets view
  • From three-way join to a two-way join
  • If none of the vertex properties are referenced
  • GraphX eliminates the join entirely


Additional Optimizations

  • Memory-based Shuffle
  • Spark’s default shuffle implementation materializes the temporary data to disk
  • GraphX modified the shuffle phase to materialize map outputs in memory and remove this temporary data using a timeout
  • Batching and Columnar Structure
  • In the join code, batch a block of vertices routed to the same target join site and convert the block from row-oriented format to column-oriented format
  • Apply the LZF compression algorithm on these blocks to send them
  • Variable Integer Encoding
  • While GraphX uses 64-bit vertex ids, most ids are much smaller than 2^64
  • GraphX uses a variable-encoding scheme
  • Uses only the first 7 bits of each byte to encode the value; the highest-order bit indicates whether another byte follows
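
A variable-length integer encoding of this kind can be sketched as follows. This is a generic illustration rather than GraphX's actual code: it assumes that the eighth bit of each byte flags whether another byte follows and that the low-order 7-bit group is written first, and the function names are hypothetical.

```python
def encode_varint(n):
    """Encode a nonnegative integer with 7 value bits per byte, low-order group first."""
    if n < 0:
        raise ValueError("n must be nonnegative")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: another byte follows
        else:
            out.append(low7)         # high bit clear: this is the last byte
            return bytes(out)


def decode_varint(data):
    """Rebuild the integer from the bytes produced by encode_varint."""
    value = 0
    for position, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * position)
        if not byte & 0x80:
            break
    return value


for vertex_id in (5, 300, 2**40, 2**64 - 1):
    encoded = encode_varint(vertex_id)
    print(vertex_id, len(encoded), "byte(s)")
```

Small ids fit in one or two bytes instead of eight, at the cost of up to ten bytes for the largest 64-bit values.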


GEAR Session 4. Large Scale Recommendation Systems and Social Media

Lecture 1. Large Scale Recommendation Systems

Recommendation Systems: Introduction


“What percentage of the top 10,000 titles in any online media store (Netflix, iTunes, Amazon, or any other) will rent or sell at least once a month?”


The long tail phenomenon [1/2]

  • Distribution of numbers with a portion that has a large number of occurrences far from the “head” or central part of the distribution

  • The vertical axis represents popularity
  • The items are ordered on the horizontal axis according to their popularity
  • The long-tail phenomenon forces online institutions to recommend items to individual users

Erik Brynjolfsson, Yu (Jeffrey) Hu, and Duncan Simester. 2011. Goodbye Pareto Principle, Hello Long Tail: The Effect of Search Costs on the Concentration of Product Sales. Manage. Sci. 57, 8 (August 2011), 1373-1386. DOI=http://dx.doi.org/10.1287/mnsc.1110.1371

Recommendation systems

  • Seek to predict the “rating” or “preference” that a user would give to an item


Applications of Recommendation Systems

  • Product recommendations
  • Amazon or similar online vendors
  • Movie recommendations
  • Netflix offers its customers recommendations of movies they might like
  • News articles
  • News services have attempted to identify articles of interest to readers based on the articles that they have read in the past
  • Blogs, YouTube

Types of Recommendation Systems

  • Random prediction algorithm
  • Randomly chooses items from the set of available items and recommends them to the users
  • Accuracy of this algorithm is poor
  • Frequent sequence
  • Uses the frequent pattern to recommend other items
  • Cold start problem
  • Content based algorithms
  • Based on properties of items
  • Similarity of items is determined by measuring the similarity in their properties
  • Collaborative Filtering algorithms (CF)
  • Based on the relationship between users and items
  • Similarity of items is determined by the similarity of the ratings of those items by the users who have rated both items
  • Serendipitous recommendation systems
  • Assumes that the user may want to be surprised with something unexpected
  • From the results of existing recommendation systems, SR increases diversity and novelty

GEAR Session 4. Large Scale Recommendation Systems and Social Media

Lecture 1. Large Scale Recommendation Systems

Recommendation Systems: Content-based Recommendations


Content-Based Recommendations

  • Focuses on properties of items
  • Similarity of items is determined by measuring the similarity in their properties


Item Profiles

  • A record or collection of records representing important characteristics of the item
  • E.g. the features of a movie
  • The set of actors of the movie (Some viewers prefer movies with their favorite actors)
  • The director
  • The year in which the movie was made
  • The genre or general type of movie
  • Other features: manufacturer, screen size, etc.


Discovering Features of Documents

  • Some items have features that are not immediately apparent to the system
  • E.g. document collections and images
  • E.g. News articles
  • Suggesting articles on topics a user is interested in
  • Possible features
  • n words with the highest TF.IDF scores
  • The n percent of words with the highest TF.IDF scores
  • To measure the similarity
  • Jaccard distance or Cosine distance

What is the TF-IDF value? Term frequency–inverse document frequency is a numerical statistic intended to reflect how important a word is to a document in a corpus. It combines term frequency and inverse document frequency.
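
As a rough illustration of how such a score is computed, the sketch below uses one common variant: term frequency normalized by document length, multiplied by log2(N / document frequency). The three-document corpus and the function name `tf_idf` are hypothetical, and other weighting variants exist.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already tokenized.
docs = [
    "spark graph graph partition".split(),
    "spark recommendation rating".split(),
    "spark movie rating rating".split(),
]


def tf_idf(documents):
    """Return one {word: score} dict per document, using tf * log2(N / df)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in documents for word in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log2(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


scores = tf_idf(docs)
top_word = max(scores[0], key=scores[0].get)
print(top_word, round(scores[0][top_word], 3))
```

A word that appears in every document, such as "spark" here, gets a score of zero, which is why the highest-scoring words are useful as document features.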


Obtaining Item Features from Tags of Images

  • Crowd sourcing
  • Inviting users to tag the items
  • del.icio.us: an early attempt to tag massive amounts of data
  • Yahoo: invited users to tag Web pages
  • Disadvantage
  • Are users willing to take the trouble to create the tags?
  • Erroneous tags can bias the system

Representing Item Profiles

  • Goal
  • Create both an item profile consisting of feature-value pairs and a user profile summarizing the preferences of the user
  • Example
  • Word vector (with 0’s and 1’s)
  • 1 represents the occurrence of a high TF-IDF word in the document
  • 0 represents the occurrence of a low TF-IDF word in the document

Representing Item Profiles

  • Suppose the only features of movies are the set of actors and the average rating
  • Consider two movies with five actors each
  • Two of the actors are in both movies
  • Example
  • One movie has an average rating of 3 and the other an average of 4

A = (0 1 1 0 1 1 0 1 3α)
B = (1 1 0 1 0 1 1 0 4α)

  • Cosine similarity between above vectors
  • $\text{CosSimilarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{2 + 12\alpha^2}{\sqrt{25 + 125\alpha^2 + 144\alpha^4}}$
  • If we use α = 1
  • We take the average ratings as they are
  • If we use α = 2
  • We double the ratings
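
The effect of the scaling factor on the rating component can be checked numerically. The sketch below rebuilds the two movie vectors from the example (eight actor indicators plus the scaled average rating) and evaluates the cosine similarity for both scaling choices; the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def movie_vectors(alpha):
    """Actor indicators from the example, with the average rating scaled by alpha."""
    a = [0, 1, 1, 0, 1, 1, 0, 1, 3 * alpha]
    b = [1, 1, 0, 1, 0, 1, 1, 0, 4 * alpha]
    return a, b


for alpha in (1, 2):
    print(alpha, round(cosine_similarity(*movie_vectors(alpha)), 3))
```

A larger scaling factor lets the rating dominate the actor components, so the two movies look more alike even though they share only two of their actors.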


User Profiles

  • Using the utility matrix representing the connection between users and items
  • Example: “Find user’s preference for movies with a specific actor!”
  • Suppose items are movies, represented by Boolean profiles with components corresponding to actors
  • The utility matrix has a 1 if the user has seen the movie and is blank otherwise
  • If 20% of the movies that user U likes have Julia Roberts as one of the actors
  • then the user profile for U will have 0.2 in the component for Julia Roberts
  • Suppose user U gives an average rating of 3
  • There are three movies with Julia Roberts as an actor, and those movies got ratings of 3, 4, and 5
  • The component for Julia Roberts will have a value that is the average of (3 − 3), (4 − 3), and (5 − 3), that is, 1
  • If user V gives an average rating of 4
  • Three movies with Julia Roberts as an actor, and ratings of 2, 3, and 5
  • The user profile for V has, in the component for Julia Roberts, the average of (2 − 4), (3 − 4), and (5 − 4), that is, −2/3
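
The arithmetic for the two users can be written as a small helper. The function name `profile_component` is hypothetical; it averages the differences between a user's ratings for the items that have the feature and that user's overall average rating.

```python
def profile_component(item_ratings, user_average):
    """Average of (rating - user's average rating) over the items with the feature."""
    return sum(r - user_average for r in item_ratings) / len(item_ratings)


# User U: average rating 3, Julia Roberts movies rated 3, 4, and 5.
component_u = profile_component([3, 4, 5], user_average=3)
# User V: average rating 4, Julia Roberts movies rated 2, 3, and 5.
component_v = profile_component([2, 3, 5], user_average=4)

print(component_u, component_v)
```

Subtracting the user's own average makes the component positive for U, who rates these movies above their usual level, and negative for V, who rates them below it.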


Recommending Items to Users Based on Content

  • With the previous example
  • The highest recommendations (lowest cosine distance) belong to the movies with lots of actors that appear in many of the movies the user likes

Classification Algorithms

  • Decision tree
  • A collection of nodes, arranged as a binary tree
  • The leaves render decisions
  • In this case, the decision would be “likes” or “doesn’t like”
  • Each interior node is a condition on the objects being classified

GEAR Session 4. Large Scale Recommendation Systems and Social Media

Lecture 1. Large Scale Recommendation Systems

Recommendation Systems: Collaborative Filtering


Collaborative Filtering

  • Identifies similar users and recommends what similar users like
  • Instead of using features of items to determine their similarity
  • Focus on the similarity of the user ratings for two items
  • Users are similar if their rating vectors are close according to some distance measure
  • Jaccard or cosine distance
  • Recommendation for a user U is made by looking at the users that are most similar to U
  • Recommending items that these users like

Measuring Similarity? -- Jaccard Similarity Coefficient

The utility matrix (ratings listed in column order; empty cells not shown)

Columns: SW Episode VII, SW Episode VIII, SW Episode IX, Frozen I, Frozen II, Joker, Avengers: Endgame
Reviewer A: 4, 2, 5
Reviewer B: 5, 4, 5
Reviewer C: 2, 3, 3, 5
Reviewer D: 4, 2

$\text{Jaccard similarity}(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{1}{5} = 20\%$

$\text{Jaccard similarity}(A, C) = \frac{|A \cap C|}{|A \cup C|} = \frac{2}{5} = 40\%$

  • If the utility matrix only reflects purchases of the movie, this can be useful
  • If utilities are more detailed ratings, the Jaccard distance loses important information

Measuring Similarity? -- Cosine Similarity

Using the same utility matrix (Reviewers A to D):

$\text{Cosine similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{20}{\sqrt{16 + 4 + 25} \, \sqrt{25 + 16 + 25}} = \frac{20}{6.7 \times 8.1} = \frac{20}{54.27} = 0.37$
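
Both similarity measures can be reproduced with a short Python sketch. The rating values come from the utility matrix, but the column in which each rating sits is an assumption here: the placements below were chosen only so that the results agree with the figures quoted on the slides (1/5, 2/5, and 0.37). The function names are hypothetical.

```python
import math

# Rating values from the utility matrix. The movie assigned to each rating is an
# ASSUMPTION, chosen so the results match the numbers quoted on the slides.
ratings = {
    "A": {"SW VII": 4, "Frozen I": 2, "Avengers: Endgame": 5},
    "B": {"SW VII": 5, "SW VIII": 4, "SW IX": 5},
    "C": {"SW VIII": 2, "Frozen I": 3, "Frozen II": 3, "Avengers: Endgame": 5},
}


def jaccard(u, v):
    """Shared rated items divided by all items rated by either reviewer."""
    items_u, items_v = set(ratings[u]), set(ratings[v])
    return len(items_u & items_v) / len(items_u | items_v)


def cosine(u, v):
    """Cosine of the rating vectors, treating unrated items as 0."""
    common = set(ratings[u]) & set(ratings[v])
    dot = sum(ratings[u][item] * ratings[v][item] for item in common)
    norm_u = math.sqrt(sum(r * r for r in ratings[u].values()))
    norm_v = math.sqrt(sum(r * r for r in ratings[v].values()))
    return dot / (norm_u * norm_v)


print(jaccard("A", "B"), jaccard("A", "C"), round(cosine("A", "B"), 2))
```

Jaccard only looks at which movies were rated, so it ranks C as closer to A than B is, while the cosine measure also takes the rating values into account.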


Clustering Users and Items

  • It is hard to detect similarity among either items or users
  • we have little information about user-item pairs in the sparse utility matrix
  • Clustering items and/or users

Utility matrix (ratings listed in column order; empty cells not shown)

Columns: SW Episode VII, SW Episode VIII, SW Episode IX, Frozen I, Frozen II, Joker, Avengers: Endgame
Reviewer A: 4, 2, 5
Reviewer B: 5, 4, 5
Reviewer C: 2, 3, 3, 5
Reviewer D: 4, 2

Clustering Users and Items

  • Cluster Items based on the series

Clustered utility matrix (ratings listed in column order; empty cells not shown)

Columns: SW Episode VII/VIII/IX, Frozen I and II, Joker, Avengers: Endgame
Reviewer A: 4, 2, 5
Reviewer B: 4.66
Reviewer C: 2, 3, 5
Reviewer D: 4, 2

Clustering Users and Items

  • Find the clusters to which the user and the item belong
  • Estimate entries based on the user-item relationship
  • If the entry is empty, find the most similar item group


GEAR Session 4. Large Scale Recommendation Systems and Social Media

Lecture 1. Large Scale Recommendation Systems

Recommendation Systems

Amazon.com: Item-to-item collaborative filtering

This material is built based on:

  • Greg Linden, Brent Smith, and Jeremy York, “Amazon.com Recommendations: Item-to-Item Collaborative Filtering,” IEEE Internet Computing, 2003

  • Amazon.com uses recommendations as a targeted marketing tool
  • Email campaigns
  • Most of their web pages


Item-to-item collaborative filtering

  • It does NOT match the user to similar customers
  • Item-to-item collaborative filtering
  • Matches each of the user’s purchased and rated items to similar items
  • Combines those similar items into a recommendation list


Determining the most-similar match

  • The algorithm builds a similar-items table
  • By finding items that customers tend to purchase together
  • How about building a product-to-product matrix by iterating through all item pairs and computing a similarity metric for each pair?
  • Many product pairs have no common customer
  • If you already bought a TV today, will you buy another TV again today?

  • Calculating the similarity between a single product and all related products:

For each item in product catalog, I1
    For each customer C who purchased I1
        For each item I2 purchased by customer C
            Record that a customer purchased I1 and I2
    For each item I2
        Compute the similarity between I1 and I2

Creating a similar-item table

  • Building the similar-items table is extremely compute-intensive
  • Offline computation
  • O(N^2 M) in the worst case
  • Where N is the number of items and M is the number of users
  • Average case is closer to O(NM)
  • Most customers have very few purchases
  • Sampling customers who purchase best-selling titles reduces runtime even more
  • With little reduction in quality

Computing similarity

  • Option 1. Using co-occurrence matrix
  • If two items have been purchased together by the same users many times, they are considered “similar” items
  • Option 2. Using cosine measure
  • Each vector corresponds to an item rather than a customer
  • M dimensions correspond to customers who have purchased that item
  • $\text{Cosine\_Similarity}(A, B) = \cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$

Example

Item-to-item matrix (rows and columns I0 to I6), initially empty

Purchase record for the user UA = {I1, I3, I4}
Purchase record for the user UB = {I2, I3, I4}
Purchase record for the user UC = {I2}
Purchase record for the user UD = {I0, I5, I6}
Purchase record for the user UE = {I1, I3}
Purchase record for the user UF = {I0, I3, I5}
Purchase record for the user UG = {I5, I6}

Example

Co-occurrence matrix (rows and columns I0 to I6; filled entries of each row listed in column order)

I0: 1 2 1
I1: 2 1
I2: 1 1
I3: 1 2 1
I4: 1 1
I5: 2 1
I6: 1 1

Example

Using the co-occurrence matrix:

$\text{Cosine similarity}(I_0, I_1) = \text{Cosine}(I_0, I_1) = \frac{I_0 \cdot I_1}{\|I_0\| \, \|I_1\|}$
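
The offline computation can be sketched end to end in Python. The purchase records are the ones listed in the example; the co-occurrence counts are derived directly from those records, and treating each row of the co-occurrence matrix as the item's vector for the cosine measure is an assumption of this sketch. The helper names `row` and `cosine` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

# Purchase records from the example.
purchases = {
    "UA": ["I1", "I3", "I4"],
    "UB": ["I2", "I3", "I4"],
    "UC": ["I2"],
    "UD": ["I0", "I5", "I6"],
    "UE": ["I1", "I3"],
    "UF": ["I0", "I3", "I5"],
    "UG": ["I5", "I6"],
}
items = ["I0", "I1", "I2", "I3", "I4", "I5", "I6"]

# For every pair of items, count how many customers purchased both.
cooccur = defaultdict(int)
for basket in purchases.values():
    for i1, i2 in combinations(sorted(basket), 2):
        cooccur[(i1, i2)] += 1
        cooccur[(i2, i1)] += 1


def row(item):
    """Co-occurrence counts of one item against every item, in catalog order."""
    return [cooccur[(item, other)] for other in items]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


print(row("I0"))
print(row("I1"))
print(round(cosine(row("I0"), row("I1")), 3))
```

I0 and I1 were never bought by the same customer, yet their rows overlap through I3, so the cosine of the co-occurrence rows is greater than zero.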


Scalability

  • Amazon.com has around 110 million active customers (244 million total customers) and several million catalog items

  • Traditional collaborative filtering does little or no offline computation
  • Online computation scales with the number of customers and catalog items.

http://www.fool.com/investing/general/2014/05/24/how-many-customers-does-amazon-have.aspx


Key scalability strategy for Amazon recommendations

  • Creating the expensive similar-items table offline
  • Online component
  • Looking up similar items for the user’s purchases and ratings
  • Scales independently of the catalog size or the total number of customers
  • It is dependent only on how many titles the user has purchased or rated


Recommendation quality

  • The algorithm recommends highly correlated similar items
  • Recommendation quality is excellent
  • Algorithm performs well with limited user data


GEAR Session 4. Large Scale Recommendation Systems and Social Media

Lecture 1. Large Scale Recommendation Systems

Recommendation Systems

Recommending Music and the Audioscrobbler Dataset

Dataset

  • Audioscrobbler dataset
  • 2002, Richard Jones
  • Collecting and analyzing users’ listening data to generate recommendations
  • Started with support for Winamp and XMMS
  • iTunes, Winamp, Windows Media Player, Foobar, iPod, Amarok, Rhythmbox, mpd, Xbox media center, Slimserver, Jinzora, mpg321, Muine, Rhapsody, YME, Soundbridge, VLC…

Dataset

  • Confined rating system
  • “Bob rates Coldplay 3.5 stars.”
  • Users rate music far less frequently than they play music
  • Audioscrobbler dataset
  • “Bob played Coldplay track”
  • Each individual data point carries less information
  • Implicit feedback
  • User-artist connections are implied as a side effect of other actions


Dataset

  • 141,000 unique users
  • 1.6 million unique artists
  • 24.2 million users’ plays of artists are recorded
  • user_artist_data.txt
  • http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html
  • On average, each user has played songs from about 171 artists (out of 1.6 M)
  • Extremely sparse dataset


Netflix Prize

  • The Netflix Prize challenge concerned recommender systems for movies (October 2006)
  • Netflix released a training set consisting of data from almost 500,000 customers and their ratings on 18,000 movies
  • More than 100 million ratings
  • The task was to use these data to build a model to predict ratings for a hold-out set of 3 million ratings

GEAR Session 4. Large Scale Recommendation Systems and Social Media

Lecture 1. Large Scale Recommendation Systems

Recommendation Systems

Collaborative Filtering: Latent Factor Model

Collaborative filtering [1/2]

  • Collects and analyzes a large amount of information on users’ behaviors, activities or preferences and predicts what users will like based on their similarity to other users

  • Explicit data collection
  • Rate an item
  • Search history
  • Favorite item
  • Wish list
  • Implicit data collection
  • Viewing times
  • Tracking online purchases
  • Analyzing the user’s social network


Collaborative filtering [2/2]

  • Two users may share similar tastes because they are the same age
  • It is NOT an example of collaborative filtering
  • Two users may both like the same song because they play many other same songs
  • It IS an example of collaborative filtering
  • Algorithm that learns without access to user or artist attributes


Latent-Factor model

  • Tries to explain observed interactions between large numbers of users and products through a relatively small number of unobserved, underlying reasons
  • Within the music business context,
  • Explains why millions of people buy a particular few of thousands of possible albums by describing users and albums in terms of tastes for perhaps tens of genres, which are not directly observable

Simplified illustration of the latent factor approach

[Figure: movies (Fast and Furious, King Arthur, Ken Burns: The Civil War, Twilight, Still Alice, Iron Man) and users (Bob, Jennifer, Tom, Nancy) placed in a two-dimensional latent-factor space. One axis runs from "geared toward males" to "geared toward females", the other from "serious" to "escapist". The four quadrants are labeled Area 1 to Area 4.]


How do we model this?

  • User and product data in a large matrix A
  • Row i and column j
  • An entry exists if user i has played product j
  • A ≈ X Y^T
  • The k columns of X and Y correspond to the latent factors

[Figure: A (users × products) ≈ X (users × k) × Y^T (k × products)]

Creating user and artist matrices

  • Two matrices
  • Matrix X for users
  • Each value corresponds to a latent feature in the model
  • Matrix Y for products
  • Each value corresponds to a latent feature in the model
  • Rows express how much users and products associate with these latent features
  • Product of X and Y^T
  • Complete estimation of the entire, dense user-product interaction matrix

[Figure: users’ matrix X (k columns) × products’ matrix Y^T (k rows)]

Computational challenge

  • A = X Y^T generally has no exact solution
  • X and Y are not large enough (too low in rank) to represent A perfectly
  • Goal
  • Finding the best X and Y

Alternating Least Squares (ALS)

  • Alternating least squares algorithm to compute X and Y
  • Spark MLlib’s ALS implementation
  • Step 1
  • Y is not known
  • Initialized to a matrix with randomly chosen row vectors
  • Then simple linear algebra gives the best X, given Y and A
  • $A_i Y (Y^T Y)^{-1} = X_i$
  • Equality cannot be achieved exactly
  • The goal becomes to minimize $\| A_i Y (Y^T Y)^{-1} - X_i \|$
  • The sum of squared differences between the two matrices’ entries

Alternating Least Squares (ALS)

  • Step 2.
  • Repeat similar sequence as step 1 to compute Y from the X (from step 1)
  • Step 3.
  • Repeat similar sequence as step 1 to compute X from the Y (from step 2)

  • X and Y do eventually converge to good (acceptable) solutions
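
The alternation between the two steps can be illustrated with a deliberately simplified sketch. It assumes a single latent factor (k = 1), a small dense matrix with every entry observed, no regularization, and a fixed starting vector instead of random rows, so the matrix inverse in the update formula reduces to a scalar division. The function name `als_rank1` is hypothetical, and Spark MLlib's ALS is considerably more elaborate.

```python
def als_rank1(A, iterations=5):
    """Alternate least-squares updates of x (users) and y (products) with k = 1."""
    n_users, n_items = len(A), len(A[0])
    y = [1.0] * n_items  # fixed start for reproducibility; the slides use random rows
    x = [0.0] * n_users
    for _ in range(iterations):
        # Fix y and solve for x: x_i = A_i y (y^T y)^-1, a scalar division when k = 1.
        yy = sum(v * v for v in y)
        x = [sum(A[i][j] * y[j] for j in range(n_items)) / yy for i in range(n_users)]
        # Fix x and solve for y, with the roles of rows and columns swapped.
        xx = sum(v * v for v in x)
        y = [sum(A[i][j] * x[i] for i in range(n_users)) / xx for j in range(n_items)]
    return x, y


# Hypothetical 3 x 2 play-count matrix that happens to have rank 1.
A = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
x, y = als_rank1(A)
approx = [[xi * yj for yj in y] for xi in x]
print(approx)
```

Each half-step is an independent least-squares problem per row, which is what makes the method easy to run in parallel over users and then over products.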


Alternating Least Squares (ALS)

  • Takes advantage of the sparsity of the input data
  • Easy to apply data parallelism


GEAR Workshop I | Advanced Big Data Analytics Case Study

Recommendation Systems

Building a model with Spark MLlib


Preparing the Data

  • Files are available at /user/ds/
  • Spark MLlib’s ALS implementation
  • Requires numeric IDs for users and items
  • Nonnegative 32-bit integers
  • An ID larger than Integer.MAX_VALUE cannot be used

val rawUserArtistData = sc.textFile("hdfs:///user/ds/user_artist_data.txt")
// Summary statistics (including the maximum) of the user ID column
rawUserArtistData.map(_.split(' ')(0).toDouble).stats()
// Summary statistics (including the maximum) of the artist ID column
rawUserArtistData.map(_.split(' ')(1).toDouble).stats()

Maximum user ID: 24443548
Maximum artist ID: 2147483647
No additional transformation will be needed

Extracting names

  • artist_data.txt
  • Artist ID and name separated by a tab
  • Straightforward parsing of the file into (Int, String) tuples will fail on malformed lines (for example, a missing tab or a non-numeric ID)

val rawArtistData = sc.textFile("hdfs:///user/ds/artist_data.txt")
val artistByID = rawArtistData.map { line =>
  val (id, name) = line.span(_ != '\t')
  (id.toInt, name.trim)
}

Extracting names

  • Scala’s Option class
  • Option represents a value that might only optionally exist

val artistByID = rawArtistData.flatMap { line =>
  val (id, name) = line.span(_ != '\t')
  if (name.isEmpty) {
    None
  } else {
    try {
      Some((id.toInt, name.trim))
    } catch {
      case e: NumberFormatException => None
    }
  }
}
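
The same Option-plus-flatMap pattern can be tried on an ordinary Scala collection. A minimal sketch without Spark; `ArtistParsing` and `parseArtist` are hypothetical names, and the input lines in the usage note are made up for illustration.

```scala
// Parse "id<TAB>name" lines, dropping malformed ones (illustrative sketch).
object ArtistParsing {
  // Some((id, name)) for a well-formed line, None otherwise.
  def parseArtist(line: String): Option[(Int, String)] = {
    val (id, name) = line.span(_ != '\t')
    if (name.isEmpty) {
      None // no tab found, so there is no name part
    } else {
      try {
        Some((id.toInt, name.trim))
      } catch {
        case _: NumberFormatException => None // ID is not a valid integer
      }
    }
  }
}
```

Calling `Seq("1\tArtist One", "no tab here", "abc\tBad ID").flatMap(ArtistParsing.parseArtist)` keeps only the first entry, because flatMap discards each None and unwraps each Some.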

Building a First Model

  • Two transformations are required
  • The aliases dataset should be applied to convert all artist IDs to a canonical ID
  • The data should be converted to a Rating object
  • User-product-value data

import org.apache.spark.mllib.recommendation._

val bArtistAlias = sc.broadcast(artistAlias)
val trainData = rawUserArtistData.map { line =>
  val Array(userID, artistID, count) = line.split(' ').map(_.toInt)
  val finalArtistID = bArtistAlias.value.getOrElse(artistID, artistID)
  Rating(userID, finalArtistID, count)
}.cache()
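
The canonical-ID step is just a map lookup with a fallback. A minimal sketch on local collections, assuming the alias data is a `Map[Int, Int]` from alias ID to canonical ID; `AliasLookup` and `canonicalize` are hypothetical names, and the IDs in the usage note are made up.

```scala
// Replace aliased artist IDs with their canonical IDs (illustrative sketch).
object AliasLookup {
  // Each play is (userID, artistID, count); IDs without an alias entry are kept unchanged.
  def canonicalize(plays: Seq[(Int, Int, Int)], aliases: Map[Int, Int]): Seq[(Int, Int, Int)] =
    plays.map { case (user, artist, count) =>
      (user, aliases.getOrElse(artist, artist), count)
    }
}
```

With `aliases = Map(100 -> 10)`, a play of artist 100 is rewritten to artist 10, while a play of artist 200 passes through untouched.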

cache()

  • RDD should be temporarily stored after being computed
  • ALS is iterative
  • It will typically need to access this RDD ≥ 10 times
  • Otherwise, this RDD could be repeatedly recomputed from the original data each time

Broadcast variables

  • For the case where many tasks (from different closures) need access to the same (immutable) data structure

  • Extends normal handling of task closures
  • Caching data as raw Java objects on each executor
  • Caching data across multiple jobs and stages
  • Spark will send, and hold in memory, just one copy for each executor in the cluster
  • Saves network traffic and memory

Building the ALS model

  • Constructs model as a MatrixFactorizationModel

val model = ALS.trainImplicit(trainData, 10, 5, 0.01, 1.0)

Retrieving some feature vectors

  • Array of 10 numbers

val model = ALS.trainImplicit(trainData, 10, 5, 0.01, 1.0)
model.userFeatures.mapValues(_.mkString(", ")).first()
...
(4293,-0.3233030601963864, 0.31964527593541325, 0.49025505511361034, 0.09000932568001832, 0.4429537767744912, 0.4186675713407441, 0.8026858843673894, -0.4841300444834003, -0.12485901532338621, 0.19795451025931002)

Spot Checking Recommendations

  • To see whether the artist recommendations for user 2093760 make any intuitive sense

val rawArtistsForUser = rawUserArtistData.map(_.split(' ')).filter {
  case Array(user, _, _) => user.toInt == 2093760
}
val existingProducts = rawArtistsForUser.map {
  case Array(_, artist, _) => artist.toInt
}.collect().toSet
artistByID.filter {
  case (id, name) => existingProducts.contains(id)
}.values.collect().foreach(println)
...
David Gray
Blackalicious
Jurassic
The Saw Doctors
Xzibit

Spot Checking Recommendations

  • To see five recommendations for this user (ID: 2093760)

val recommendations = model.recommendProducts(2093760, 5)
recommendations.foreach(println)
...
Rating(2093760,1300642,0.02833118412903932)
Rating(2093760,2814,0.027832682960168387)
Rating(2093760,1037970,0.02726611004625264)
Rating(2093760,1001819,0.02716011293509426)
Rating(2093760,4605,0.027118271894797333)

  • 5. Advanced Data Analytics with Apache Spark

Recommending Music and the Audioscrobbler Dataset

Evaluating the Recommendation Model

What is a “good” recommendation?

  • “a popular artist”?
  • “artists the user has listened to”?
  • “artists the user will listen to”?

Preparing data for evaluation

  • To perform a meaningful evaluation, some of the artist play data can be set aside
  • Hidden from the ALS model building process
  • The held-out data can be used as a collection of good recommendations for each user
  • Compute the recommender’s score

[Figure: the play data split into a portion for building the model and a portion for testing the model]

AUC metric

  • An AUC of 1.0 is perfect, 0.0 is the worst
  • Receiver Operating Characteristic (ROC)
  • Based on the rank used to decide final recommendations
  • Area Under the Curve (AUC) of ROC can be interpreted as the probability that a randomly chosen good recommendation ranks above a randomly chosen bad recommendation
  • Spark’s BinaryClassificationMetrics
  • Computes AUC per user and averages the results
  • Generating mean AUC
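
The probabilistic reading of AUC can be computed by brute force over all (good, bad) pairs. A minimal sketch in plain Scala, assuming each user's good and bad items have already been scored by some model; it is not how Spark's BinaryClassificationMetrics works internally, the names `PairwiseAuc`, `auc`, and `meanAuc` are hypothetical, and counting ties as one half is a convention chosen here.

```scala
// AUC as the fraction of (good, bad) pairs ranked in the right order (illustrative sketch).
object PairwiseAuc {
  // Probability that a randomly chosen good item scores above a randomly chosen bad item.
  def auc(goodScores: Seq[Double], badScores: Seq[Double]): Double = {
    require(goodScores.nonEmpty && badScores.nonEmpty, "need at least one score of each kind")
    val wins = for (g <- goodScores; b <- badScores) yield {
      if (g > b) 1.0 else if (g == b) 0.5 else 0.0
    }
    wins.sum / (goodScores.size.toLong * badScores.size)
  }

  // Mean AUC: compute AUC per user, then average over users.
  def meanAuc(perUser: Seq[(Seq[Double], Seq[Double])]): Double = {
    require(perUser.nonEmpty, "need at least one user")
    perUser.map { case (good, bad) => auc(good, bad) }.sum / perUser.size
  }
}
```

The pairwise loop is quadratic in the number of items per user, so it is meant for building intuition on small inputs rather than for full datasets.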

MAP metric

  • Mean average precision
  • Focuses on the top recommendations
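
A minimal sketch of mean average precision in plain Scala. It assumes each user has a ranked recommendation list without duplicates and a set of held-out relevant items, and it normalizes by the number of relevant items (other normalizations exist); the names `MeanAveragePrecision`, `averagePrecision`, and `mean` are hypothetical.

```scala
// Mean average precision over ranked recommendation lists (illustrative sketch).
object MeanAveragePrecision {
  // Average of precision@k taken at every rank k that holds a relevant item.
  def averagePrecision(ranked: Seq[Int], relevant: Set[Int]): Double = {
    if (relevant.isEmpty) {
      0.0
    } else {
      var hits = 0
      var sum = 0.0
      ranked.zipWithIndex.foreach { case (item, idx) =>
        if (relevant.contains(item)) {
          hits += 1
          sum += hits.toDouble / (idx + 1) // precision at this rank
        }
      }
      sum / relevant.size
    }
  }

  // Mean of the per-user average precision values.
  def mean(users: Seq[(Seq[Int], Set[Int])]): Double =
    if (users.isEmpty) 0.0
    else users.map { case (ranked, relevant) => averagePrecision(ranked, relevant) }.sum / users.size
}
```

Relevant items near the top of the list contribute larger precision terms, which is why the metric emphasizes the top recommendations.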

Computing AUC

  • 90% of the data is used for training and the remaining 10% for validation

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd._

def areaUnderCurve(
    positiveData: RDD[Rating],
    bAllItemIDs: Broadcast[Array[Int]],
    predictFunction: (RDD[(Int, Int)] => RDD[Rating])) = {
  ...
}

val allData = buildRatings(rawUserArtistData, bArtistAlias)
val Array(trainData, cvData) = allData.randomSplit(Array(0.9, 0.1))

Computing AUC

  • continued

trainData.cache()
cvData.cache()

val allItemIDs = allData.map(_.product).distinct().collect()
val bAllItemIDs = sc.broadcast(allItemIDs)

val model = ALS.trainImplicit(trainData, 10, 5, 0.01, 1.0)
val auc = areaUnderCurve(cvData, bAllItemIDs, model.predict)

k-Fold Cross-validation

  • Create a k-fold partition of the dataset
  • For each of the k experiments use k-1 folds for training
  • The remaining fold for testing

[Figure: k-fold partition of the total set of examples across Experiments 1-4, with a different fold held out as the test examples in each experiment]

True error estimate

  • k-fold cross validation is similar to random subsampling
  • The advantage of k-Fold Cross validation
  • All the examples in the dataset are eventually used for both training and testing
  • The true error is estimated as the average error rate

$E = \frac{1}{K} \sum_{i=1}^{K} E_i$
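
The partitioning and the averaged error estimate can be sketched on a local collection. This is a minimal plain-Scala illustration of the idea, not Spark's `MLUtils.kFold`; the names `KFold`, `splits`, and `meanError` are hypothetical, and the per-fold errors passed to `meanError` would come from whatever model is being evaluated.

```scala
// k-fold cross-validation splits and the averaged error estimate (illustrative sketch).
object KFold {
  // One (train, test) pair per experiment; every example is tested exactly once.
  def splits[T](data: Seq[T], k: Int, seed: Long = 42L): Seq[(Seq[T], Seq[T])] = {
    require(k >= 2 && k <= data.size, "k must be between 2 and the number of examples")
    val shuffled = new scala.util.Random(seed).shuffle(data)
    // Example at position i goes to fold i % k, so fold sizes differ by at most one.
    val folds = shuffled.zipWithIndex.groupBy(_._2 % k)
    (0 until k).map { i =>
      val test = folds(i).map(_._1)
      val train = (0 until k).filter(_ != i).flatMap(j => folds(j).map(_._1))
      (train, test)
    }
  }

  // E = (1/K) * sum of the per-experiment errors E_i.
  def meanError(errors: Seq[Double]): Double = {
    require(errors.nonEmpty, "need at least one error value")
    errors.sum / errors.size
  }
}
```

Each experiment would train on `train`, measure the error on `test`, and the K resulting errors are then averaged with `meanError`.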

k-Fold Cross-validation with Spark

  • MLUtils.kFold()

import org.apache.spark.SparkContext

def predictMostListened(sc: SparkContext, train: RDD[Rating])(allData: RDD[(Int, Int)]) = {
  val bListenCount = sc.broadcast(
    train.map(r => (r.product, r.rating)).reduceByKey(_ + _).collectAsMap()
  )
  allData.map { case (user, product) =>
    Rating(user, product, bListenCount.value.getOrElse(product, 0.0))
  }
}

val auc = areaUnderCurve(cvData, bAllItemIDs, predictMostListened(sc, trainData))

Hyperparameter selection

  • MatrixFactorizationModel
  • ALS.trainImplicit()
  • rank = 10
  • The number of latent factors in the model
  • The number of columns, k
  • iterations = 5
  • The number of iterations that the factorization runs
  • lambda = 0.01
  • A standard overfitting parameter
  • Higher value guards against overfitting
  • Values that are too high will decrease the factorization’s accuracy
  • alpha = 1.0
  • Controls the relative weight of observed versus unobserved user-product interactions in the factorization
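
One common way to choose these values is a simple grid search: evaluate every combination and keep the best-scoring one. A minimal sketch in plain Scala, assuming a caller-supplied `score` function; in this lecture's setting that function would presumably train a model with ALS.trainImplicit() on the training split and return its AUC on the held-out split, but here it is left abstract. The names `GridSearch`, `Params`, and `search` are hypothetical.

```scala
// Exhaustive search over hyperparameter combinations (illustrative sketch).
object GridSearch {
  final case class Params(rank: Int, lambda: Double, alpha: Double)

  // Scores every combination and returns them sorted from best to worst.
  // `score` is supplied by the caller; higher is assumed to be better (e.g. mean AUC).
  def search(ranks: Seq[Int], lambdas: Seq[Double], alphas: Seq[Double])(
      score: Params => Double): Seq[(Params, Double)] = {
    val results = for {
      rank <- ranks
      lambda <- lambdas
      alpha <- alphas
    } yield {
      val params = Params(rank, lambda, alpha)
      (params, score(params))
    }
    results.sortBy { case (_, s) => -s }
  }
}
```

The number of evaluations is the product of the three list sizes, so each added candidate value multiplies the training cost.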

Questions?
