Big Data Moscow, October 11th, 2018
Moran Gavish, Outbrain
mgavish@outbrain.com
Look-a-likes
How Internet Giants Reach the Most Relevant Audience at Scale
Outbrain’s Mission: Helping people discover great content
+250B Recommendations Served Monthly (35K/Sec Requests @ 50ms)
6000 Servers Across 3 Data Centers
+550M Unique Monthly Global Audience
Marketers optimize by their marketing objectives:
Visitors, Video views, Downloads, Purchases, Sign-ups, Social sharing, Time spent, Leads, Comments
Original Seed Users → Amplified Look-a-like Audience
New LAL Modeling Request → Create LAL Model → Models Repository → Online Serving
LAL Scoring (1ms)
Input parameters:
Q: How to identify LAL users? A: Classification with confidence.
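One way to read "classification with confidence" is a binary classifier that outputs a probability, trained with seed users as positives and a sample of the general population as negatives. The sketch below uses a tiny logistic regression written in plain Python. The choice of model, the feature layout, the toy data and the 0.5 threshold are all assumptions for illustration; the slides do not name the classifier.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(rows, labels, epochs=500, lr=0.5):
    """Batch gradient descent on the logistic loss. Returns (weights, bias)."""
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def lal_confidence(model, x):
    """Confidence in [0, 1] that a user resembles the seed users."""
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)

# Hypothetical toy features: [politics_weight, sports_weight]
seed_users = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]      # label 1
neutral_users = [[0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]   # label 0
model = train_logistic(seed_users + neutral_users, [1, 1, 1, 0, 0, 0])

candidates = {"user_a": [0.85, 0.15], "user_b": [0.15, 0.85]}
scores = {user: lal_confidence(model, x) for user, x in candidates.items()}
look_alikes = [user for user, score in scores.items() if score >= 0.5]
```

Raising the threshold trades audience size for similarity to the seed users, which is the knob a confidence score makes available.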
Seed Users
#OutbrainMasterclass
Strategy Creation Discovery Engagement ROI
SEARCH GRAPH: What they are searching for
SOCIAL GRAPH: What they are sharing (Driven by Ego)
INTEREST GRAPH: What they are reading & watching
Dissonance between Consuming vs. Sharing Content
Seed Users
Perhaps an “unbiased sample” of the general population?
General Population
(“Unbiased sample”)
Seed Users
90% from 50% - 40% - 5% - 5% - Rest of World
–E.g. “US_English_Desktop”, “ES_Spanish_Mobile”, etc…
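A small sketch of how users could be bucketed by such a Country_Language_Platform key, so that each look-a-like model only compares users within one segment. The field names (`country`, `language`, `platform`, `id`) and the sample users are hypothetical.

```python
from collections import defaultdict

def segment_key(country, language, platform):
    # Builds keys in the style of the slide, e.g. "US_English_Desktop".
    return f"{country}_{language}_{platform}"

def group_by_segment(users):
    """Map each segment key to the list of user ids that belong to it."""
    segments = defaultdict(list)
    for user in users:
        key = segment_key(user["country"], user["language"], user["platform"])
        segments[key].append(user["id"])
    return dict(segments)

users = [
    {"id": "u1", "country": "US", "language": "English", "platform": "Desktop"},
    {"id": "u2", "country": "ES", "language": "Spanish", "platform": "Mobile"},
    {"id": "u3", "country": "US", "language": "English", "platform": "Desktop"},
]
segments = group_by_segment(users)
```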
– Aggregated
– Decayed
– Top-K
– Un-structured (JSON)
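The four properties above can be sketched together: page views are aggregated per key, older views are down-weighted, only the top K keys are kept, and the result is stored as JSON. The exponential half-life decay, the 30-day half-life, the event layout and the function name are assumptions; the slide only says "Decayed".

```python
import json
from collections import defaultdict

def build_profile(events, today, half_life_days=30.0, top_k=3):
    """events: iterable of (key, day, page_views), with day as an integer day index.

    Returns a JSON string holding the top_k keys by decayed page views.
    """
    totals = defaultdict(float)
    for key, day, page_views in events:
        age_days = today - day
        # Assumed decay: a view loses half its weight every half_life_days.
        totals[key] += page_views * 0.5 ** (age_days / half_life_days)
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return json.dumps({key: round(value, 3) for key, value in ranked[:top_k]})

events = [
    ("Politics", 100, 8),
    ("Politics", 70, 4),    # 30 days old: counts as 2
    ("Basketball", 100, 3),
    ("Dining", 40, 4),      # 60 days old: counts as 1
    ("Autos", 10, 1),       # 90 days old: counts as 0.125, falls out of the top 3
]
profile_json = build_profile(events, today=100)
```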
Websites (Where)                            Page Views #
http://www.cnn.com/news                     30
http://www.espn.com/basketball              17
http://fox-news/news                        12
http://www.wired.com/journalists            6
http://www.the-sun.com/men                  5
http://www.outbrain.com/blog/               4
http://www.foodsdictionary.co.il/recipes    3
http://www.geektime.co.il/                  3
http://www.maariv.co.il/culture             2

Categories (What)    Page Views #
Politics             26
Basketball           18
Investing            14
Justice              10
Dining               8
Marketing            4
Music                3
Autos                1
Celebrities          1
(All the users are of the same Country–Language–Platform)
Sparse Users-Features Matrix
Columns: User Categories (100 features), Web Domains (500 features)
Rows: Seed 1, Seed 2, Seed 3, Neutral 1, Neutral 2, Neutral 3, Neutral 4, Neutral 5
Concatenate into a unified sparse feature-vector
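The concatenation can be sketched with a dictionary-based sparse vector: category columns come first, and web-domain columns are shifted by the number of category columns. The toy vocabularies below have 3 + 2 columns instead of the 100 + 500 on the slide, and all names are hypothetical.

```python
def unified_sparse_vector(category_weights, domain_weights,
                          category_index, domain_index):
    """Return {column: weight} over len(category_index) + len(domain_index) columns."""
    offset = len(category_index)  # web-domain columns start after the categories
    vector = {}
    for name, weight in category_weights.items():
        if name in category_index:
            vector[category_index[name]] = weight
    for name, weight in domain_weights.items():
        if name in domain_index:
            vector[offset + domain_index[name]] = weight
    return vector

category_index = {"Politics": 0, "Basketball": 1, "Science": 2}
domain_index = {"www.cnn.com": 0, "www.espn.com": 1}

user_vector = unified_sparse_vector(
    {"Science": 0.7, "Basketball": 0.3},
    {"www.espn.com": 1.0},
    category_index,
    domain_index,
)
```

Stacking one such vector per seed and neutral user gives the rows of the sparse users-features matrix.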
Schematic LAL View
(per additional LAL Model)
(even when numOfSeedUsers << numOfFeatures)
Reduction of Dimensionality
Recall: Principal components analysis (PCA) is a procedure for identifying a smaller number of uncorrelated variables, called "principal components", from a large set of data. The goal of PCA is to explain the maximum amount of variance with the fewest number of principal components.
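A minimal sketch of the idea in plain Python: center the data, build the covariance matrix, and extract the leading components by power iteration with deflation. This is for illustration on tiny inputs only, not the method used in the talk; a production system would rely on a numerical library. The function names and the toy data are hypothetical.

```python
import math

def center(rows):
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, iterations=200):
    """Power iteration: returns (eigenvalue, unit eigenvector) of the largest eigenvalue."""
    d = len(cov)
    v = [float(i + 1) for i in range(d)]  # arbitrary non-symmetric start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigenvalue, v

def pca(rows, k):
    """Return the k leading (variance, component) pairs, largest variance first."""
    cov = covariance(center(rows))
    d = len(cov)
    components = []
    for _ in range(k):
        eigenvalue, v = top_component(cov)
        components.append((eigenvalue, v))
        # Deflation: remove the variance already explained by this component.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return components

# Toy data: almost all variance lies along the first feature.
rows = [[2.0, 0.0, 1.0], [4.0, 0.1, 1.0], [6.0, -0.1, 1.0], [8.0, 0.0, 1.0]]
components = pca(rows, 2)
```

In the toy data the first component points almost exactly along the first feature and carries nearly all of the variance, which is the sense in which a few components can stand in for many raw features.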
PCA #   Eigen Profile (Human interpretation)
1       Politics
2       MSN no Sport
3       Sport Fans
4       Autos-Investing-Computers (Men?)
5       Celebrities
6       Investing
7       CNN over Fox (Liberals?)
8       News junkies
9       Stock Markets
10      Dining
11      Television but not Celebrities
12      Football but not Basketball
13      Travel but not Autos and not Dining
14      Baseball
15      Interpersonal Relationships
16      Football and Basketball but not Baseball
17      Education
18      War and Conflict but not Travel
19      Lifestyle
Sparse Users-Features Matrix: (S + N) × 600
Columns: User Categories (100 features), Web Domains (500 features)
Rows: Seed 1, Seed 2, Neutral 1, Neutral 2
Eigen Profiles: 600 × 250
Columns: Eigen Profile 1, Eigen Profile 2, Eigen Profile 3, Eigen Profile 4, Eigen Profile 5, Eigen Profile 6, Eigen Profile 7, …
User Eigen Space (Up to 250)
Dense Users-Features Matrix in eigen space: (S + N) × 250
(S + N) × 600 · 600 × 250 = (S + N) × 250, where S is the number of seed users and N the number of neutral users
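The projection itself is a matrix product: each sparse user row (one entry per non-zero feature) is multiplied by the features × eigen-profiles matrix, giving a short dense row. The sketch below uses 3 features and 2 eigen profiles instead of 600 and 250, and it skips mean-centering, which a full PCA projection would normally include. All names and values are hypothetical.

```python
def project_to_eigen_space(sparse_rows, eigen_profiles):
    """sparse_rows: list of {feature_column: weight} dicts, one per user.
    eigen_profiles: features x k matrix given as a list of rows.
    Returns a dense users x k matrix.
    """
    k = len(eigen_profiles[0])
    dense = []
    for row in sparse_rows:
        out = [0.0] * k
        # Only the non-zero features of a user contribute to the product.
        for column, weight in row.items():
            for c in range(k):
                out[c] += weight * eigen_profiles[column][c]
        dense.append(out)
    return dense

eigen_profiles = [
    [1.0, 0.0],   # feature 0
    [0.0, 1.0],   # feature 1
    [1.0, 1.0],   # feature 2
]
users = [
    {0: 0.5, 2: 0.5},   # e.g. a seed user
    {1: 1.0},           # e.g. a neutral user
]
dense_users = project_to_eigen_space(users, eigen_profiles)
```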
–Same Dates
–Same CPC
–Same Ads

–Control
–Look-a-Likes
–Re-targeting
KNIME (Konstanz Information Miner)
mgavish@outbrain.com
Category      Page Views   Weight
Science       7            0.7
Basketball    2            0.2
Investing     1            0.1

Sparse Categories Vector (entries without a value are empty):
Politics, Crime, Celebrities, Television, Soccer, Basketball = 0.2, Mobile, Science = 0.7, Careers, Health, Investing = 0.1, Education, Aging
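The weights in this example are each category's share of the user's page views (7 of 10 views gives 0.7). A short sketch of that normalization, using the slide's numbers and a shortened category vocabulary; the function names are hypothetical.

```python
def to_weights(page_views):
    """Turn {category: page_views} into {category: share of total page views}."""
    total = sum(page_views.values())
    if total == 0:
        return {}
    return {name: views / total for name, views in page_views.items() if views > 0}

def to_vector(weights, vocabulary):
    """Lay the weights out over a fixed category order; missing categories are 0."""
    return [weights.get(name, 0.0) for name in vocabulary]

weights = to_weights({"Science": 7, "Basketball": 2, "Investing": 1})
vocabulary = ["Politics", "Basketball", "Mobile", "Science", "Investing"]
vector = to_vector(weights, vocabulary)
```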
Website                     PVs
www.cnn.com/news            2
www.cnn.com/politics        2
www.espn.com/nfl            2
www.espn.com/mma            1
www.techcrunch.com          1
www.really-good-food.com    1
www.best-invest.com         1

Web Domain            PVs   Weight
www.cnn.com           4     0.4
www.espn.com          3     0.3
www.techcrunch.com    1     0.1
Long Tail Domains     2     0.2

Web Domains (entries without a value are empty):
www.cnn.com = 0.4, www.msn.com, www.foxnews.com, www.espn.com = 0.3, www.wired.com, www.mtv.com, www.techcrunch.com = 0.1, www.RateMyProfessors.com, www.ask.com, www.dummies.com, www.historychannel.com, www.vogue.com, Long Tail Domains = 0.2
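The flattening in this example can be sketched as: reduce each visited page to its host, sum page views per host, send every host outside a fixed list of known domains to a single "Long Tail Domains" bucket, and normalize. How the list of known domains is chosen is not stated on the slide, so here it is simply passed in. The function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

LONG_TAIL = "Long Tail Domains"

def domain_of(url):
    # urlsplit only fills netloc when the URL contains "//".
    if "//" not in url:
        url = "//" + url
    return urlsplit(url).netloc.lower()

def domain_weights(page_views, known_domains):
    """Aggregate page views per web domain and normalize them to shares."""
    counts = Counter()
    for url, views in page_views.items():
        domain = domain_of(url)
        counts[domain if domain in known_domains else LONG_TAIL] += views
    total = sum(counts.values())
    return {domain: views / total for domain, views in counts.items()}

page_views = {
    "www.cnn.com/news": 2,
    "www.cnn.com/politics": 2,
    "www.espn.com/nfl": 2,
    "www.espn.com/mma": 1,
    "www.techcrunch.com": 1,
    "www.really-good-food.com": 1,
    "www.best-invest.com": 1,
}
known_domains = {"www.cnn.com", "www.msn.com", "www.foxnews.com",
                 "www.espn.com", "www.wired.com", "www.techcrunch.com"}
weights = domain_weights(page_views, known_domains)
```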
(Sparse Web Domains Vector)
Clustering of websites into Web Domains
[Figure: Web Domains Popularity; y-axis: Popularity (0% to 40%), x-axis: Web Domain Rank (20 to 100); series: US-Desktop, Israel-Desktop]
[Figure: Web Domains Flattening “Lossy-ness”; y-axis: 0% to 100%, x-axis: # Features (50 to 500); series: US-Desktop, Israel-Desktop]
Classification Results vs. Feature Sets