SLIDE 1

Tracing Information Flows Between Ad Exchanges Using Retargeted Ads

Muhammad Ahmad Bashir, Sajjad Arshad, William Robertson, Christo Wilson Northeastern University

SLIDE 2

Your Privacy Footprint

SLIDE 8

Real Time Bidding

  • RTB brings more flexibility to the ad ecosystem.
  • Each ad request is managed by an ad exchange, which holds an auction.
  • Advertisers bid on each ad impression.
  • RTB spending to cross $20B by 2017 [1].
  • 49% annual growth.
  • Will account for 80% of US display ad spending by 2022.

Cookie matching is a prerequisite.

[1] http://www.prnewswire.com/news-releases/new-idc-study-shows-real-time-bidding-rtb-display-ad-spend-to-grow-worldwide-to-208-billion-by-2017-228061051.html

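SLIDE 11

Real Time Bidding (RTB)

Participants: User, Publisher, Ad Exchange, Advertisers

1) User → Publisher: GET, CNN’s cookie
2) User → Ad Exchange: GET, DoubleClick’s cookie
3) Ad Exchange → Advertisers: solicit bids, DoubleClick’s cookie
4) Advertisers → Ad Exchange: bid
5) User → Advertiser: GET, RightMedia’s cookie
6) Advertiser → User: advertisement

Advertisers cannot read their cookie!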

SLIDE 12

Cookie Matching

Key problem: advertisers cannot read their cookies in the RTB auction

  • How can they submit reasonable bids if they cannot identify the user?

Solution: cookie matching

  • Also known as cookie syncing
  • Process of linking the identifiers used by two ad exchanges

Example:

1) Browser → DoubleClick: GET, Cookie=12345
2) DoubleClick → Browser: 301 Redirect, Location=http://criteo.com/?dblclk_id=12345
3) Browser → Criteo: GET ?dblclk_id=12345, Cookie=ABCDE
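The redirect exchange on this slide can be sketched in a few lines of Python. This is only an illustration, not how any exchange actually implements matching: the `MatchTable` class and its method names are hypothetical, and only the `dblclk_id` parameter, the two cookie values, and the criteo.com URL come from the slide.

```python
from urllib.parse import urlparse, parse_qs


class MatchTable:
    """Hypothetical advertiser-side store linking a partner's user ID to our own cookie."""

    def __init__(self):
        self._partner_to_own = {}

    def handle_sync_request(self, url, own_cookie, param="dblclk_id"):
        """Record the partner ID carried in the redirect URL next to our own cookie."""
        query = parse_qs(urlparse(url).query)
        partner_ids = query.get(param, [])
        if not partner_ids:
            return None  # no identifier in the URL, nothing to link
        self._partner_to_own[partner_ids[0]] = own_cookie
        return partner_ids[0]

    def lookup(self, partner_id):
        """In an auction the bid request carries only the partner's ID."""
        return self._partner_to_own.get(partner_id)


table = MatchTable()
# The browser follows the 301 redirect and sends the advertiser's own cookie along.
table.handle_sync_request("http://criteo.com/?dblclk_id=12345", own_cookie="ABCDE")
print(table.lookup("12345"))  # ABCDE
print(table.lookup("99999"))  # None: this user was never matched
```

Once the mapping exists, an advertiser that receives only the exchange's identifier in a bid request can still recognize the user, which is why matching has to happen before the auction.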

SLIDE 14

Prior Work

  • Several studies have examined cookie matching
  • Acar et al. found hundreds of domains passing identifiers to each other
  • Olejnik et al. found 125 exchanges matching cookies
  • Falahrastegar et al. analyzed clusters of exchanges that share the exact same cookies
  • These studies rely on studying HTTP requests/responses.

SLIDE 15

Challenge 1: Server Side Matching

1) Criteo observes the user (IP: 207.91.160.7).
2) RightMedia observes the user (IP: 207.91.160.7).

Behind the scenes, RightMedia and Criteo sync up (IP: 207.91.160.7).

SLIDE 16

Challenge 2: Obfuscation

Page amazon.com, script dbclk.js:

GET %^$ck#&93#&, Cookie=XYZYX
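To make the challenge concrete, here is a rough sketch of the kind of exact-match heuristic a request-based analysis can apply, and of how it goes blind once the identifier is transformed. The function name, the example URLs, and the use of SHA-256 as the obfuscation step are all illustrative assumptions; the slides do not say how identifiers are obfuscated in practice.

```python
import hashlib
from urllib.parse import urlparse, parse_qsl


def find_shared_ids(cookie_values, request_url):
    """Return cookie values that appear verbatim as query parameter values."""
    params = parse_qsl(urlparse(request_url).query)
    return sorted({value for _, value in params if value in cookie_values})


cookies = {"XYZYX"}

# Plain-text sharing: the identifier is visible in the outgoing URL.
plain = "http://partner.example/sync?uid=XYZYX"
print(find_shared_ids(cookies, plain))   # ['XYZYX']

# Obfuscated sharing (assumed here to be a hash): the same heuristic sees nothing.
hidden = "http://partner.example/sync?uid=" + hashlib.sha256(b"XYZYX").hexdigest()
print(find_shared_ids(cookies, hidden))  # []
```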

SLIDE 19

Goal

Develop a method to identify information flows (cookie matching) between ad exchanges

  • Mechanism agnostic: resilient to obfuscation
  • Platform agnostic: detect sharing on the client- and server-side

SLIDE 20

Key Insight: Use Retargeted Ads

Retargeted ads are the most highly targeted form of online ads

Key insight: because retargets are so specific, they can be used to conduct controlled experiments

  • Information must be shared between ad exchanges to serve retargeted ads

SLIDE 21

Contributions

1. Novel methodology for identifying information flows between ad exchanges
2. Demonstrate the impact of ad network obfuscation in practice
  • 31% of cookie matching partners cannot be identified using heuristics
3. Develop a method to categorize information sharing relationships
4. Use graph analysis to infer the roles of actors in the ad ecosystem

SLIDE 23

  • Data Collection
  • Classifying Ad Network Flows
  • Results

SLIDE 24

Using Retargets as an Experimental Tool

Key observation: retargets are only served under very specific circumstances

1) Advertiser observes the user at a shop
2) Advertiser and the exchange must have matched cookies

This implies a causal flow of information from Exchange → Advertiser

SLIDE 27

Data Collection Overview

  • 90 personas; a single persona covers 10 websites/persona, 10 products/website
  • 150 publishers, 15 pages/publisher

1) Visit persona
2) Visit publishers
3) Store images, inclusion chains, HTTP requests/responses: 571,636 images
4) Ad detection; filter out images which appeared in > 1 persona: 31,850 potential targeted ads
5) Crowd sourcing: isolated retargeted ads
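The persona filter in this pipeline can be sketched as follows, reading "filter" as "discard". The sketch assumes each stored image has already been reduced to some identifier (for example a content hash) and tagged with the persona that saw it; the function and variable names are hypothetical.

```python
from collections import defaultdict


def potential_targeted_ads(observations):
    """Keep images seen by exactly one persona.

    observations: iterable of (persona_id, image_id) pairs.
    An image shown to more than one persona is unlikely to be a retarget
    for any single persona's products, so it is dropped.
    """
    personas_by_image = defaultdict(set)
    for persona_id, image_id in observations:
        personas_by_image[image_id].add(persona_id)
    return {img for img, personas in personas_by_image.items() if len(personas) == 1}


obs = [
    ("persona-1", "img-shoes"),
    ("persona-1", "img-banner"),
    ("persona-2", "img-banner"),  # seen by two personas, filtered out
    ("persona-2", "img-camera"),
    ("persona-1", "img-shoes"),   # repeat sighting by the same persona is fine
]
print(sorted(potential_targeted_ads(obs)))  # ['img-camera', 'img-shoes']
```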

SLIDE 28

Crowd Sourcing

We used Amazon Mechanical Turk (AMT) to label 31,850 ads.

  • Total 1,142 tasks.
  • 30 ads/task.
      • 27 unlabeled.
      • 3 labeled by us.
  • 2 workers per ad.
  • $415 spent.

SLIDE 32

Final Dataset

5,102 unique retargeted ads

  • From 281 distinct online retailers

35,448 publisher-side chains that served the retargets

  • We observed some retargets multiple times

SLIDE 33

  • Data Collection
  • Classifying Ad Network Flows
  • Results

SLIDE 34

A look at Publisher Chains

[Figure: example shopper-side chain and publisher-side chain]

  • How does Criteo know to serve an ad on BBC?
  • In this case it is pretty trivial.
  • Criteo observed us on the shopper side.
  • Can we classify all such publisher-side chains?

SLIDE 37

What is a Chain?

[Figure: example chain with nodes labeled e and a]

^pub .* e a$

SLIDE 38

Four Classifications

Four possible ways for a retargeted ad to be served

1. Direct (Trivial) Matching
2. Cookie Matching
3. Indirect Matching
4. Latent (Server-side) Matching

SLIDE 41

1) Direct (Trivial) Matching

Rule:

  • Shopper-side: ^shop .* a .*$
      • a must appear on the shopper-side, but other trackers may also appear
  • Publisher-side: ^pub a$
      • a is the advertiser that serves the retarget

SLIDE 44

2) Cookie Matching

Rule:

  • Shopper-side: ^shop .* a .*$
      • a must appear on the shopper-side
  • Publisher-side: ^pub .* e a$
      • e precedes a, which implies an RTB auction
  • Anywhere: ^* .* e a .*$
      • Transition e → a is where the cookie match occurs

SLIDE 47

3) Latent (Server-side) Matching

Rule:

  • Shopper-side: ^shop [^ea]$
      • Neither e nor a appears on the shopper-side
  • Publisher-side: ^pub .* e a$
      • a must receive information from some shopper-side tracker

We find latent matches in practice!
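The three rules spelled out on these slides (direct, cookie, latent) can be sketched as a small classifier. This is an illustrative reading, not the authors' implementation: chains are modeled as lists of domain names rather than matched with regular expressions, the function name and labels are made up, the "Anywhere" e → a check is omitted, and Indirect Matching is left out because the slides name it without giving its rule.

```python
def classify(publisher_chain, shopper_chains):
    """Classify one publisher-side chain using the direct, cookie, and latent rules.

    publisher_chain: list of domains, e.g. ["pub", "e", "a"].
    shopper_chains:  chains observed at the shop, e.g. [["shop", "x", "a"]].
    """
    if len(publisher_chain) < 2:
        return "unclassified"
    a = publisher_chain[-1]  # advertiser that serves the retarget
    # Every domain after the shop itself counts as a shopper-side tracker.
    shop_trackers = {d for chain in shopper_chains for d in chain[1:]}

    if len(publisher_chain) == 2:  # ^pub a$
        return "direct" if a in shop_trackers else "unclassified"

    e = publisher_chain[-2]        # ^pub .* e a$
    if a in shop_trackers:         # ^shop .* a .*$
        return "cookie"
    if e not in shop_trackers:     # neither e nor a on the shopper side
        return "latent"
    return "unclassified"          # e present, a absent: not covered by this sketch


print(classify(["pub", "a"], [["shop", "x", "a"]]))       # direct
print(classify(["pub", "e", "a"], [["shop", "a"]]))       # cookie
print(classify(["pub", "x", "e", "a"], [["shop", "t"]]))  # latent
```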

SLIDE 48

  • Data Collection
  • Classifying Ad Network Flows
  • Results

SLIDE 49

Categorizing Chains

                              Raw Chains      Clustered Chains
Type                          Chains    %     Chains    %
Direct (Trivial) Match          1770    5       8449   24
Cookie Match                   25049   71      25873   73
Latent (Server-side) Match      5362   15        343    1
No Match                         775    2        183    1

Takeaways:

1) As expected, most retargets are due to cookie matching
2) Very small number of chains that cannot be categorized
  • Suggests low false positive rate of AMT image labeling task
3) Surprisingly large number of latent matches…
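As a quick consistency check, the percentage columns are reproduced if each count is divided by the 35,448 publisher-side chains reported on the Final Dataset slide. Treating that figure as the denominator is an assumption; the slides do not state it explicitly.

```python
TOTAL_CHAINS = 35_448  # assumed denominator, taken from the Final Dataset slide

# (raw chains, clustered chains) per category, as given in the table
counts = {
    "Direct (Trivial) Match": (1770, 8449),
    "Cookie Match": (25049, 25873),
    "Latent (Server-side) Match": (5362, 343),
    "No Match": (775, 183),
}


def pct(n, total=TOTAL_CHAINS):
    """Share of all chains, rounded to a whole percent."""
    return round(100 * n / total)


for label, (raw, clustered) in counts.items():
    print(f"{label:28s} raw {pct(raw):2d}%  clustered {pct(clustered):2d}%")
```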

SLIDE 51

Categorizing Chains

Clustered chains: cluster together domains by “owner”

  • E.g. google.com, doubleclick.com, googlesyndication.com

Latent matches essentially disappear (5362 raw chains vs. 343 clustered chains)

  • The vast majority of these chains involve Google
  • Suggests that Google shares tracking data across their services
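One way to picture the clustering step is sketched below. The owner mapping uses only the three domains named on the slide; the function name and the choice to collapse adjacent hops with the same owner are assumptions made for illustration, not a description of the authors' code.

```python
OWNER = {
    "google.com": "google",
    "doubleclick.com": "google",
    "googlesyndication.com": "google",
}


def cluster_chain(chain, owner=OWNER):
    """Map each domain to its owner and collapse adjacent repeats."""
    clustered = []
    for domain in chain:
        name = owner.get(domain, domain)  # unknown domains keep their own name
        if not clustered or clustered[-1] != name:
            clustered.append(name)
    return clustered


raw_publisher = ["pub", "doubleclick.com", "googlesyndication.com"]
raw_shopper = ["shop", "google.com"]
print(cluster_chain(raw_publisher))  # ['pub', 'google']
print(cluster_chain(raw_shopper))    # ['shop', 'google']
```

Under this reading, a raw chain in which neither publisher-side domain was seen at the shop turns, after clustering, into a chain whose single advertiser was observed on the shopper side, which is one plausible way a latent match could become a direct one.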

SLIDE 58

Who is Cookie Matching?

Participant 1      Participant 2      Chains   Ads  Heuristics
criteo             googlesyndication    9090  1887  P
criteo             doubleclick          3610  1144  E, P / DC, P
criteo             adnxs                3263  1066  E, P
criteo             rubiconproject       1586   749  E, P
criteo             servedbyopenx         707   460  P
doubleclick        steelhousemedia       362    27  P / E, P
mathtag            mediaforge            360   124  E, P
netmng             scene7                267   119  E / ?
googlesyndication  adsrvr                107    29  P
rubiconproject     steelhousemedia        86    30  E
googlesyndication  steelhousemedia        47    22  ?
adtechus           adacado                36    18  ?
atwola             adacado                32     6  ?
adroll             adnxs                  31     8  ?

Heuristics key (used by prior work):

  • E: share exact cookies
  • P: special URL parameters
  • DC: DoubleClick URL parameters
  • ?: unknown sharing method

31% of cookie matching partners would be missed.
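The Chains column can be thought of as a tally of publisher-side chains per pair of participants. A minimal sketch, assuming the chains already classified as cookie matches are available as lists of domains; the function name and the example chains are made up, and pairs are treated as ordered here, which may differ from how the table groups them.

```python
from collections import Counter


def count_matching_pairs(cookie_match_chains):
    """Count how many chains end in each (e, a) pair."""
    pairs = Counter()
    for chain in cookie_match_chains:
        if len(chain) >= 3:  # needs at least pub, e, a
            pairs[(chain[-2], chain[-1])] += 1
    return pairs


chains = [
    ["pub", "doubleclick", "criteo"],
    ["pub", "x", "doubleclick", "criteo"],
    ["pub", "adnxs", "criteo"],
]
for (e, a), n in count_matching_pairs(chains).most_common():
    print(f"{e} -> {a}: {n} chains")
```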

SLIDE 59

Summary

We develop a novel methodology to detect information flows between ad exchanges

  • Controlled methodology enables causal inference
  • Defeats obfuscation attempts
  • Detects client- and server-side flows

Dataset gives a better picture of the ad ecosystem

  • Reveals which ad exchanges are linking information about users
  • Allows us to reason about how information is being transferred