Look-a-likes How Internet Giants Reach the Most Relevant Audience - - PowerPoint PPT Presentation


Look-a-likes: How Internet Giants Reach the Most Relevant Audience at Scale. Moran Gavish, Outbrain (mgavish@outbrain.com). Big Data Moscow, October 11, 2018.


slide-1
SLIDE 1

Look-a-likes

How Internet Giants Reach the Most Relevant Audience at Scale

Moran Gavish, Outbrain

mgavish@outbrain.com

Big Data Moscow, October 11, 2018

slide-2
SLIDE 2

Outbrain’s Mission:

Helping people discover great content

slide-3
SLIDE 3


slide-4
SLIDE 4

  • +550M unique monthly global audience
  • +250B recommendations served monthly
  • 35K requests/sec @ 50ms
  • 6000 servers across 3 data centers

slide-5
SLIDE 5


slide-6
SLIDE 6


slide-7
SLIDE 7

Online Marketing KPIs

Marketers optimize by their marketing objectives

  • Objectives: Traffic, Engagement, Action
  • KPIs shown: visitors, video views, downloads, purchases, sign-ups, social sharing, time spent, leads, comments

slide-8
SLIDE 8
Returning Audience

  • Past engagement is a great predictor of future action.
  • Therefore, users who engaged in the past (e.g. visited the marketer’s website) are much more likely to convert and hence will be targeted aggressively.

slide-9
SLIDE 9

Remarketing: how it works from an advertiser’s perspective

slide-10
SLIDE 10
What are “Lookalike Audiences”?

  • A “Lookalike Audience” is a way to reach new people who are similar to the “engaged” (seed) population.

slide-11
SLIDE 11


Consider the following online retailer…

slide-12
SLIDE 12
Case Study

  • The retailer has a list of 1K users who visited its website but did not complete a purchase (“seed users”).
  • The seed users respond very well to the retailer’s online campaign.
  • However, the scale of this audience is marginal.

slide-13
SLIDE 13
Schematic Conceptual View

  • A lookalike audience amplifies reach by targeting users who are similar to the seed users.

[Figure: Original Seed Users → Amplified Look-a-like Audience]

slide-14
SLIDE 14
Schematic Product View

  • Real-time vs. offline LAL scoring
  • Flow: New LAL Modeling Request → Create LAL Model → Models Repository → Online Serving (LAL Scoring, 1ms)
  • Input parameters:
    – Seed users
    – Required reach
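One way to read the "Required Reach" input is as a cutoff on the model score: rank the candidate users by score and admit as many as the marketer asked for. The sketch below illustrates that reading only. The function name, the score values, and the idea of a single score cutoff are assumptions, not something the slides specify.

```python
def threshold_for_reach(scores, required_reach):
    """Return the lowest score that still admits `required_reach` users.

    `scores` holds one model score per candidate user (higher means more
    similar to the seed users). All names here are hypothetical.
    """
    if required_reach <= 0 or not scores:
        raise ValueError("need a positive reach and at least one score")
    ranked = sorted(scores, reverse=True)
    # If fewer candidates exist than requested, admit everyone.
    k = min(required_reach, len(ranked))
    return ranked[k - 1]


# Made-up scores for six candidate users.
candidate_scores = [0.91, 0.15, 0.78, 0.40, 0.66, 0.05]
cutoff = threshold_for_reach(candidate_scores, required_reach=3)
audience = [s for s in candidate_scores if s >= cutoff]
print(cutoff, audience)
```

With tied scores this admits slightly more users than requested, which is usually acceptable for an audience-size target.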
slide-15
SLIDE 15

Q: How to identify LAL users?

A: Classification with confidence.

[Figure: Seed Users]

slide-16
SLIDE 16

How do marketing platforms get to know their online audience?

  • Search graph: what they are searching for
  • Social graph: what they are sharing (driven by ego)
  • Interest graph: what they are reading and watching

slide-17
SLIDE 17
Dissonance between Consuming and Sharing Content

  • Source: “For Your Eyes Only: Consuming vs. Sharing Content”, April 4, 2016, Roy Sasson and Ram Meshulam
  • More Consume
  • More Share
slide-18
SLIDE 18
How is user data represented?

  • The user data is represented by the content she consumed, i.e. the categories of the articles she read in the past and the websites where she read them.
  • There is neither identification nor demographic data.

slide-19
SLIDE 19

What are the negative examples for training the classifier?

  • Positive examples: the seed users.
  • Negative examples: perhaps an “unbiased sample” of the general population?

slide-20
SLIDE 20

1st LAL Classifier

Seed Users vs. General Population (“Unbiased sample”)

slide-21
SLIDE 21

1st LAL Classifier

Seed Users vs. General Population (“Unbiased sample”)

[Figure: “90% from …” shown next to “50% - 40% - 5% - 5% - Rest of World”]

slide-22
SLIDE 22
Siloed LAL Modeling

  • Observation: “Commonly, the seed users are associated with a small set of distinct silos.”

slide-23
SLIDE 23
Sub-Silos

  • Country–Language–Platform triplets (equivalence classes)
    – E.g. “US_English_Desktop”, “ES_Spanish_Mobile”, etc.
  • Important property: at serving time, a request corresponds to exactly one silo.
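The silo idea can be sketched as a simple key function: users and requests are mapped to a Country–Language–Platform string, and seed users are grouped by that key before modeling. The key format follows the examples on the slide; the field names and sample users are assumptions for illustration.

```python
from collections import defaultdict


def silo_key(country, language, platform):
    """Build a silo identifier such as "US_English_Desktop"."""
    return f"{country}_{language}_{platform}"


# Hypothetical seed users; the dictionary layout is an assumption.
seed_users = [
    {"id": "u1", "country": "US", "language": "English", "platform": "Desktop"},
    {"id": "u2", "country": "US", "language": "English", "platform": "Desktop"},
    {"id": "u3", "country": "ES", "language": "Spanish", "platform": "Mobile"},
]

# Group seed users so that one model can be trained per silo.
seeds_by_silo = defaultdict(list)
for user in seed_users:
    key = silo_key(user["country"], user["language"], user["platform"])
    seeds_by_silo[key].append(user["id"])

# At serving time a request maps to exactly one silo,
# so only that silo's models need to be scored.
request = {"country": "ES", "language": "Spanish", "platform": "Mobile"}
request_silo = silo_key(request["country"], request["language"], request["platform"])
print(request_silo, seeds_by_silo[request_silo])
```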

slide-24
SLIDE 24
User Data Example

  • The user data is:
    – Aggregated
    – Decayed
    – Top-K
    – Unstructured (JSON)
  • Normalization and flattening are required

Websites (Where)                            Page Views #
http://www.cnn.com/news                     30
http://www.espn.com/basketball              17
http://fox-news/news                        12
http://www.wired.com/journalists            6
http://www.the-sun.com/men                  5
http://www.outbrain.com/blog/               4
http://www.foodsdictionary.co.il/recipes    3
http://www.geektime.co.il/                  3
http://www.maariv.co.il/culture             2

Categories (What)    Page Views #
Politics             26
Basketball           18
Investing            14
Justice              10
Dining               8
Marketing            4
Music                3
Autos                1
Celebrities          1
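The "aggregated, decayed, top-K" properties can be illustrated with a small sketch. The slides do not say how decay is computed, so the exponential half-life below is an assumption, as are the function name, the event tuples, and the numbers.

```python
from collections import defaultdict


def decayed_top_k(events, now_day, half_life_days, k):
    """Aggregate page views per key with exponential decay, keep the top k.

    `events` is a list of (key, day, views) tuples. Exponential half-life
    decay is an assumed choice; the deck does not specify the scheme.
    """
    totals = defaultdict(float)
    for key, day, views in events:
        age = now_day - day
        totals[key] += views * 0.5 ** (age / half_life_days)
    # Sort by decayed count (descending), then by name for stable output.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Made-up events: (category, day observed, page views).
events = [
    ("Politics", 10, 8),    # seen today, full weight
    ("Politics", 0, 8),     # one half-life old, half weight
    ("Basketball", 10, 6),
    ("Autos", 0, 2),
]
profile = decayed_top_k(events, now_day=10, half_life_days=10, k=2)
print(profile)
```

Keeping only the top K entries bounds the profile size, at the cost of dropping rarely visited categories such as "Autos" in this toy input.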

slide-25
SLIDE 25
Within a Silo

(All the users are of the same Country–Language–Platform)

  • Repository of “Neutral” users
  • Important property: higher homogeneity of the user profiles.
  • Flattening and dimensionality reduction
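A minimal sketch of how a per-silo training set might be assembled from the seed users and the repository of "Neutral" users: seed users are labeled 1 and a random sample of neutral users from the same silo is labeled 0. The sampling ratio, the function name, and the user ids are assumptions; the slides only state that a neutral repository exists.

```python
import random


def build_training_set(seed_ids, neutral_ids, negatives_per_positive=3, rng_seed=0):
    """Label seed users 1 and a random sample of same-silo neutral users 0."""
    seed_set = set(seed_ids)
    # A seed user must never double as a negative example.
    pool = sorted(u for u in set(neutral_ids) if u not in seed_set)
    wanted = min(len(pool), negatives_per_positive * len(seed_set))
    negatives = random.Random(rng_seed).sample(pool, wanted)
    return [(u, 1) for u in sorted(seed_set)] + [(u, 0) for u in negatives]


# Hypothetical ids within one silo; "s1" also appears in the neutral repository.
seeds = ["s1", "s2"]
neutrals = [f"n{i}" for i in range(10)] + ["s1"]
training = build_training_set(seeds, neutrals)
print(training)
```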

slide-26
SLIDE 26

Classification Formulation

  • Columns: User Categories (100 features) and Web Domains (500 features)
  • Rows: Seed 1, Seed 2, Seed 3, Neutral 1, Neutral 2, Neutral 3, Neutral 4, Neutral 5
  • Concatenate into a unified sparse feature vector (sparse users-features matrix)
  • Best classification performance achieved using Random Forests
  • White-box classification algorithms
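The concatenation step can be sketched with sparse vectors stored as index-to-weight dictionaries: category indices keep their position and web-domain indices are shifted by the number of categories. The feature counts come from the slide; the dictionary representation and the example weights are assumptions.

```python
NUM_CATEGORIES = 100  # from the slide
NUM_DOMAINS = 500     # from the slide


def unify(category_weights, domain_weights):
    """Concatenate two sparse vectors (index -> weight) into one of length 600."""
    if any(not 0 <= i < NUM_CATEGORIES for i in category_weights):
        raise ValueError("category index out of range")
    if any(not 0 <= i < NUM_DOMAINS for i in domain_weights):
        raise ValueError("domain index out of range")
    unified = dict(category_weights)
    for index, weight in domain_weights.items():
        # Web-domain features follow the category features.
        unified[NUM_CATEGORIES + index] = weight
    return unified


# Made-up profile: two categories and two web domains with non-zero weight.
user_vector = unify({7: 0.7, 5: 0.3}, {0: 0.4, 499: 0.6})
print(user_vector)
```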

slide-27
SLIDE 27


Schematic LAL View

slide-28
SLIDE 28
Towards Productization

  • Requirement 1: Score 1K models in less than 1 ms
  • Requirement 2: Reasonable memory footprint (per additional LAL model)
  • Requirement 3: Work for any size of training set (even when numOfSeedUsers << numOfFeatures)

Reduction of Dimensionality

slide-29
SLIDE 29
Dimensionality Reduction

  • Observation: The features within user profiles are highly correlated.
  • Eigenfaces

Recall: Principal component analysis (PCA) is a procedure for identifying a smaller number of uncorrelated variables, called "principal components", from a large set of data. The goal of PCA is to explain the maximum amount of variance with the fewest principal components.
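To make the PCA recap concrete, here is a small pure-Python sketch that finds the first principal component by power iteration on the covariance matrix. It is an illustration of the technique on toy data, not the implementation used in the talk; a production system would more likely use a linear-algebra library.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (unit direction, share of variance explained) for the top component."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration; the start vector must not be orthogonal to the answer.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance


# Toy data: the second feature is always twice the first (fully correlated).
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
pc, explained = first_principal_component(rows)
print(pc, explained)
```

Because the two toy features are perfectly correlated, a single component along the direction (1, 2) captures essentially all of the variance, which is the effect the slide relies on for highly correlated profile features.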

slide-30
SLIDE 30

Eigen Profiles

PCA #   Eigen Profile (human interpretation)
1       Politics
2       MSN no Sport
3       Sport Fans
4       Autos-Investing-Computers (Men?)
5       Celebrities
6       Investing
7       CNN over Fox (Liberals?)
8       News junkies
9       Stock Markets
10      Dining
11      Television but not Celebrities
12      Football but not Basketball
13      Travel but not Autos and not Dining
14      Baseball
15      Interpersonal Relationships
16      Football and Basketball but not Baseball
17      Education
18      War and Conflict but not Travel
19      Lifestyle

slide-31
SLIDE 31

Eigen Profiles Transform

  • Input: sparse users-features matrix. Rows: Seed 1, Seed 2, Neutral 1, Neutral 2, … Columns: User Categories (100 features) + Web Domains (500 features).
  • Transform: eigen-profiles matrix. Columns: Eigen Profile 1, Eigen Profile 2, …, up to 250.
  • Output: dense users-features matrix in eigen space (“User Eigen Space”, up to 250 features).

[(S + N) × 600] × [600 × 250] = [(S + N) × 250]

(S and N presumably denote the numbers of seed and neutral users.)
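The transform on this slide is a matrix product between a sparse users-features matrix and a dense matrix of eigen profiles. A minimal sketch with toy dimensions (4 features, 2 eigen profiles instead of 600 and 250) follows. The component values are made up, and mean-centering, which PCA normally applies first, is left out for brevity.

```python
def project(sparse_rows, components, n_features):
    """Multiply a sparse (users x features) matrix by a dense (features x k) matrix.

    Each user row is a dict of feature index -> weight, so the cost per user
    depends on the number of non-zero features, not on `n_features`.
    """
    if len(components) != n_features:
        raise ValueError("components must have one row per feature")
    k = len(components[0])
    dense = []
    for row in sparse_rows:
        out = [0.0] * k
        for index, weight in row.items():
            for c in range(k):
                out[c] += weight * components[index][c]
        dense.append(out)
    return dense


# Toy stand-in for the 600 x 250 eigen-profiles matrix (values are made up).
components = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
users = [{0: 0.5, 3: 0.5}, {1: 1.0}]   # one seed user, one neutral user
eigen_space = project(users, components, n_features=4)
print(eigen_space)
```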

slide-32
SLIDE 32
Eigen Space Operations

  • Logistic regression
  • Reconstruct the model coefficients (inverse transform)
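Why reconstructing the coefficients helps can be shown with a short sketch: a linear model trained in eigen space gives the same score as a model in the original feature space whose coefficients are the eigen-space coefficients multiplied by the component matrix. The sparse user vector can then be scored directly, without projecting it at serving time. All numbers below are made up, and mean-centering is omitted (it would only change the bias term).

```python
import math


def to_feature_space(components, eigen_coefs):
    """Map coefficients from eigen space (k) back to feature space (d): w = V w_e."""
    return [sum(v * c for v, c in zip(row, eigen_coefs)) for row in components]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Toy component matrix V (3 features x 2 eigen profiles) with orthonormal columns.
components = [[0.6, 0.0], [0.8, 0.0], [0.0, 1.0]]
eigen_coefs = [2.0, -1.0]   # hypothetical logistic-regression weights in eigen space
bias = -0.5
user = {0: 0.5, 2: 0.5}     # sparse profile: feature index -> weight

# Path A: project the user into eigen space, then score.
projected = [sum(w * components[i][c] for i, w in user.items()) for c in range(2)]
score_eigen = sigmoid(bias + sum(p * c for p, c in zip(projected, eigen_coefs)))

# Path B: score the sparse profile directly with reconstructed coefficients.
feature_coefs = to_feature_space(components, eigen_coefs)
score_direct = sigmoid(bias + sum(w * feature_coefs[i] for i, w in user.items()))

print(score_eigen, score_direct)
```

Path B touches only the user's non-zero features, which is one plausible way to meet a tight per-model latency budget; the slides do not confirm that this is the exact serving path.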

slide-33
SLIDE 33

Experiments

  • Methodology
    – Same dates
    – Same CPC
    – Same ads
  • Groups
    – Control
    – Look-a-likes
    – Re-targeting
  • Results
slide-34
SLIDE 34

Hair Building (magical) Science - Example

  • What is the top LAL differentiating category?

slide-35
SLIDE 35
Some Remarks

  • Silos: no free meals, complex engineering
  • Example of bias in time: World Cup
  • Eliminate cherry picking: all models compete on the same users; remove metadata
  • Recap the entire process × silos

slide-36
SLIDE 36

One more Takeaway

KNIME (Konstanz Information Miner)

slide-37
SLIDE 37

Thank You

mgavish@outbrain.com

slide-38
SLIDE 38

Backup Slides

slide-39
SLIDE 39
Flattening

  • ~100 categories
  • Lossless

Category     Page Views   Weight
Science      7            0.7
Basketball   2            0.2
Investing    1            0.1

Sparse categories vector over: Politics, Crime, Celebrities, Television, Soccer, Basketball, Mobile, Science, Careers, Health, Investing, Education, Aging

Non-zero entries: Basketball 0.2, Science 0.7, Investing 0.1
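The category flattening on this slide amounts to normalizing page views into weights and placing them at fixed positions in a sparse vector. The sketch below reproduces the slide's numbers; the category order is taken from the slide, while the function name and the index-to-weight dictionary are illustrative assumptions.

```python
# Category order as listed on the slide (the real list has about 100 entries).
CATEGORIES = [
    "Politics", "Crime", "Celebrities", "Television", "Soccer", "Basketball",
    "Mobile", "Science", "Careers", "Health", "Investing", "Education", "Aging",
]
CATEGORY_INDEX = {name: i for i, name in enumerate(CATEGORIES)}


def flatten_categories(page_views):
    """Turn {category: page views} into a sparse {index: weight} vector."""
    total = sum(page_views.values())
    if total == 0:
        return {}
    return {CATEGORY_INDEX[name]: views / total for name, views in page_views.items()}


category_vector = flatten_categories({"Science": 7, "Basketball": 2, "Investing": 1})
print(category_vector)
```

Every category keeps its own position, so no page views are merged or dropped, which is why the slide calls this step lossless.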

slide-40
SLIDE 40
Flattening (cont.)

  • > 1M websites → lossy
  • Clustering of websites into web domains

Website                     PVs
www.cnn.com/news            2
www.cnn.com/politics        2
www.espn.com/nfl            2
www.espn.com/mma            1
www.techcrunch.com          1
www.really-good-food.com    1
www.best-invest.com         1

Web Domain           PVs   Weight
www.cnn.com          4     0.4
www.espn.com         3     0.3
www.techcrunch.com   1     0.1
Long Tail Domains    2     0.2

Sparse web domains vector over: www.cnn.com, www.msn.com, www.foxnews.com, www.espn.com, www.wired.com, www.mtv.com, www.techcrunch.com, www.RateMyProfessors.com, www.ask.com, www.dummies.com, www.historychannel.com, www.vogue.com, Long Tail Domains

Non-zero entries: www.cnn.com 0.4, www.espn.com 0.3, www.techcrunch.com 0.1, Long Tail Domains 0.2

[Figure: Web Domains Popularity. Popularity (0% to 40%) vs. web domain rank (20 to 100), for US-Desktop and Israel-Desktop.]
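The website flattening can be sketched as follows: page views are rolled up from URLs to web domains, and every domain outside a fixed list of popular domains goes into one "Long Tail Domains" bucket. The inputs and the expected weights come from the slide; the function name and the way the domain is extracted from the URL are assumptions.

```python
from collections import Counter

# Popular web domains as listed on the slide.
TOP_DOMAINS = {
    "www.cnn.com", "www.msn.com", "www.foxnews.com", "www.espn.com",
    "www.wired.com", "www.mtv.com", "www.techcrunch.com",
    "www.RateMyProfessors.com", "www.ask.com", "www.dummies.com",
    "www.historychannel.com", "www.vogue.com",
}
LONG_TAIL = "Long Tail Domains"


def flatten_websites(page_views, top_domains=TOP_DOMAINS):
    """Roll website page views up to web domains and normalize to weights."""
    totals = Counter()
    for site, views in page_views.items():
        domain = site.split("/", 1)[0]   # drop the path, keep the host
        totals[domain if domain in top_domains else LONG_TAIL] += views
    total = sum(totals.values())
    return {domain: views / total for domain, views in totals.items()} if total else {}


website_views = {
    "www.cnn.com/news": 2, "www.cnn.com/politics": 2,
    "www.espn.com/nfl": 2, "www.espn.com/mma": 1,
    "www.techcrunch.com": 1, "www.really-good-food.com": 1,
    "www.best-invest.com": 1,
}
domain_vector = flatten_websites(website_views)
print(domain_vector)
```

The long-tail bucket is where the loss happens: the two unlisted sites can no longer be told apart once they share a single feature.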

slide-41
SLIDE 41

Features Required vs Data Loss

[Figure: Web Domains Flattening “Lossy-ness”. Accumulated weight (0% to 100%) vs. number of features (50 to 500), for US-Desktop and Israel-Desktop.]

slide-42
SLIDE 42


Classification Results vs. Feature Sets