XRay: Transparency for the data-driven Web. Mathias Lécuyer


  1. XRay: Transparency for the data-driven Web. Mathias Lécuyer, Guillaume Ducoffe, Francis Lan, Andrei Papancea, Theofilos Petsios, Riley Spahn, Augustin Chaintreau, and Roxana Geambasu. Columbia University.

  2. Why this ad? Cedars Hotel Loughborough: 36 Bedrooms, Restaurant & Bar. Free WiFi, Parking, Best Rates. www.thecedarshotel.com

  3. Why this ad? [Overlay label: Homosexuality] Cedars Hotel Loughborough: 36 Bedrooms, Restaurant & Bar. Free WiFi, Parking, Best Rates. www.thecedarshotel.com

  4. Why this ad? Ralph Lauren Online Shop: The official Site for Ralph Lauren Apparel, Accessories & More. www.ralphlauren.com

  5. Why this ad? [Overlay label: Pregnancy] Ralph Lauren Online Shop: The official Site for Ralph Lauren Apparel, Accessories & More. www.ralphlauren.com

  6. Did you know?
     • Data brokers can tell when you're sick, tired and depressed (and sell the information). [CNN ’14]
     • Google Apps for Education used institutional emails to target ads in personal accounts. [SafeGov ’14]
     • Credit companies use Facebook data to decide loans. [CNN ’13]

  7. Welcome to the big data world
     • A myriad of Web services and parties collect immense amounts of information about us and use it for varied purposes.
     • Data has lots of beneficial uses: useful recommendations; powerful, predictive applications; improving business with effective product placement; improving public health and disaster response; …

  8. Big data lacks transparency
     • We have no visibility into what services do with our data: What is the data used for exactly? Is it being shared? With whom? Can we delete it?
     • Obscurity threatens to transform the data-driven Web into a breeding ground for data misuse.
     • No robust tools exist to reveal data (mis)uses; even auditors cannot find answers.

  9. Question: can we build tools that reveal data misuse?
     • Which emails trigger which ads?
     • Which prior searches trigger which prices?
     • Does Facebook share our data with third parties?

  10. Question: can we build tools that reveal data misuse?
      • Which emails trigger which ads?
      • Which prior searches trigger which prices?
      • Does Facebook share our data with third parties?
      Can we use taint-tracking systems?
      • Lots of prior work, many successful systems (e.g., TaintDroid)
      • They assume a controlled environment (runtime, language, OS)
      • We need something for the complex and uncontrolled Web

  11. XRay
      • First generic data-tracking system for the Web: it associates inputs (e.g., emails) with outputs (e.g., ads)
      • It is accurate, scalable, and generic: it currently works on Gmail, Amazon and YouTube
      • Provides key building blocks for a new ecosystem of tools to keep big data in check

  12. Overview: Motivation, Design, Evaluation

  13. Goals
      1. Fine-grained, accurate data use prediction: predict use at the level of individual inputs (e.g., emails)
      2. Scalability: track many inputs (e.g., 100s of emails)
      3. Widely applicable and self-tuning: applies to many services (e.g., Gmail, Amazon…)

  14. Web service model: treat Web services as a black box
      Data inputs (D_i), e.g., emails, searches, browsing -> Targeted outputs (O_k), e.g., ads, recommendations

  15. Web service model: treat Web services as a black box
      Data inputs (D_i), e.g., emails, searches, browsing -> Targeted outputs (O_k), e.g., ads, recommendations
      XRay (Differential Correlation) produces associations D_i -> O_k

  16. Differential Correlation
      • Key idea: correlate inputs with outputs
      • Populate extra accounts with subsets of inputs
      • Use shadow account observations to relate inputs to outputs

  17. Differential Correlation
      • Key idea: correlate inputs with outputs
      • Populate extra accounts with subsets of inputs
      • Use shadow account observations to relate inputs to outputs
      Main account: D_1, D_2, D_3 -> O_1
      Shadow accounts: D_1, D_2 -> O_1; D_2, D_3 -> O_1; D_1, D_3 -> no O_1
      Association: D_2 -> O_1

  18. Challenge: scaling
      • Key idea: correlate inputs with outputs
      • Populate extra accounts with subsets of inputs
      • Use shadow account observations to relate inputs to outputs
      Main account: D_1, D_2, D_3 -> O_1
      Shadow accounts: D_1, D_2 -> O_1; D_2, D_3 -> O_1; D_1, D_3 -> no O_1
      Association: D_2 -> O_1
      It sounds like a lot of accounts…

  19. Scalable algorithms
      • Theorem: under certain assumptions, for any ε > 0 there exists an algorithm that requires C × log(N) accounts, where N is the number of inputs, to correctly identify the inputs of a targeted ad with probability (1 − ε).
      • Algo1: Set Intersection (simple, not robust)
      • Algo2: Bayesian (more robust)
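To get a feel for why a logarithmic number of accounts can suffice, here is a back-of-the-envelope bound for the set-intersection approach. It rests on idealized assumptions that are not stated on the slides: each of the N inputs is placed in each of m shadow accounts independently with probability 1/2, exactly one input D* is targeted, and the ad appears in exactly the accounts that contain D*. It is an illustration only, not the proof behind the theorem.

```latex
% K = number of shadow accounts containing the targeted input D^*, K ~ Binomial(m, 1/2).
% A non-targeted input D_j survives the intersection only if it sits in all K of those accounts.
% The (1/2)^m term covers the case K = 0, where no shadow account shows the ad.
\begin{align*}
\Pr[D_j \text{ survives}] &= \mathbb{E}\!\left[2^{-K}\right]
  = \left(\tfrac{1}{2} + \tfrac{1}{2}\cdot\tfrac{1}{2}\right)^{m}
  = \left(\tfrac{3}{4}\right)^{m}, \\
\Pr[\text{failure}] &\le (N-1)\left(\tfrac{3}{4}\right)^{m} + \left(\tfrac{1}{2}\right)^{m}
  \le N\left(\tfrac{3}{4}\right)^{m}, \\
N\left(\tfrac{3}{4}\right)^{m} \le \varepsilon
  &\iff m \ge \frac{\log(N/\varepsilon)}{\log(4/3)}.
\end{align*}
```

Under these assumptions, N = 100 inputs and ε = 0.01 give m ≥ ln(10^4)/ln(4/3) ≈ 32.02, so 33 shadow accounts would be enough.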

  20. Algo1: Set Intersection
      Input: output O_k (an ad), inputs D_i (emails), observations x
      Output: targeted input
      Step 1: Randomly assign emails to shadow accounts. [Diagram: accounts {D_1, D_2}, {D_2, D_3}, {D_1, D_3}]
      Step 2: Take the sets of emails from accounts where the ad appeared. [Diagram: the same three accounts, one crossed out]
      Step 3: Compute the intersection of these sets. [Diagram: D_2]
      Step 4: If the intersection is non-empty, it contains the targeted emails; otherwise there is no targeting.

  21. Algo1: Set Intersection
      Input: output O_k (an ad), inputs D_i (emails), observations x
      Output: targeted input
      Step 1: Randomly assign emails to shadow accounts.
      Step 2: Take the sets of emails from accounts where the ad appeared.
      Step 3: Compute the intersection of these sets.
      Step 4: If the intersection is non-empty, it contains the targeted emails; otherwise there is no targeting.
      • We prove it needs a number of accounts logarithmic in the number of inputs for a high probability of detection.
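The four steps can be sketched in a few lines of Python. This is a minimal illustration, not XRay's code: the function names, the 1/2 placement probability, and the string labels for inputs are all assumptions made for the example, and the observations are taken from the three-shadow-account example in the Differential Correlation slides.

```python
import random


def assign_inputs(inputs, n_accounts, rng, p=0.5):
    """Step 1: place each input in each shadow account independently with
    probability p (p = 0.5 is an assumption, not a value from the slides)."""
    return [{d for d in inputs if rng.random() < p} for _ in range(n_accounts)]


def set_intersection(accounts, saw_ad):
    """Steps 2-4: intersect the input sets of the accounts where the ad appeared.

    accounts: list of sets of inputs, one set per shadow account
    saw_ad:   list of booleans, True where the ad was observed
    Returns the candidate targeted inputs; an empty set means no targeting found.
    """
    with_ad = [inputs for inputs, seen in zip(accounts, saw_ad) if seen]
    if not with_ad:
        return set()
    return set.intersection(*with_ad)


# Shadow accounts and observations from the differential-correlation example:
# the ad O_1 shows up in the first two accounts but not in the third.
accounts = [{"D1", "D2"}, {"D2", "D3"}, {"D1", "D3"}]
saw_ad = [True, True, False]
print(set_intersection(accounts, saw_ad))  # {'D2'}
```

Note that a single stray ad in the wrong account is enough to empty the intersection, which is the robustness problem the next slide raises.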

  22. Challenge: it needs tuning
      • Ads must never appear in the wrong accounts. Not true in practice: email redundancy, cache. Needs a manual threshold to detect emails in a significant number of accounts.
      • Doesn't take low signal into account. Needs a hard-coded minimum number of accounts that see an ad.
      • Tuning is service-specific and hard to do.

  23. Algo2: Bayesian (XRay)
      Input: output O_k (an ad), inputs D_i (emails), observations x
      Output: targeted input
      foreach input D_i do
          compute prob. = apply_bayes
      end
      compute prob. = apply_bayes
      return D_i with max prob.
      • Bayes' rule: [equation not captured]
      • Probability of observations, with D_i targeted: [equation not captured]

  25. Algo2: Bayesian (XRay)
      Input: output O_k (an ad), inputs D_i (emails), observations x
      Output: targeted input
      foreach input D_i do
          compute prob. = apply_bayes
      end
      compute prob. = apply_bayes
      return D_i with max prob.
      • Bayes' rule: [equation not captured]
      • Probability of observations, with D_i targeted: [equation not captured]
      • P_in: probability to see the ad if the targeted email is in the account
      • P_out: probability to see the ad if the targeted email is not in the account
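A plausible form of the two expressions named on this slide, given the definitions of P_in and P_out, is sketched below. It assumes that observations in different accounts are independent once the targeted input is fixed. The symbols x_j (1 if the ad was seen in shadow account j, 0 otherwise) and A_j (the set of inputs placed in account j) are introduced here for the illustration and do not come from the slides.

```latex
\begin{align*}
P(D_i \mid x) &= \frac{P(x \mid D_i)\, P(D_i)}{P(x)}, \\
P(x \mid D_i \text{ targeted}) &=
  \prod_{j \,:\, D_i \in A_j} P_{in}^{\,x_j} (1 - P_{in})^{1 - x_j}
  \prod_{j \,:\, D_i \notin A_j} P_{out}^{\,x_j} (1 - P_{out})^{1 - x_j}.
\end{align*}
```

With a uniform prior P(D_i), the input with the largest likelihood is also the one with the largest posterior, which matches the pseudo-code's final "return D_i with max" step.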

  28. Algo2: Bayesian (XRay)
      Input: output O_k (an ad), inputs D_i (emails), observations x
      Output: targeted input
      foreach input D_i do
          compute prob. = apply_bayes
      end
      compute prob. = apply_bayes
      return D_i with max prob.
      • If an email is targeted, we can tell which one.
      • Challenge: what if no tracked input is targeted?

  29. Algo2: Bayesian (XRay)
      Input: output O_k (an ad), inputs D_i (emails), observations x
      Output: targeted input
      foreach input D_i do
          compute prob. = apply_bayes
      end
      compute prob. = apply_bayes
      return D_i with max prob.
      • Bayes' rule: [equation not captured]
      • Probability of observations, with D_i targeted: [equation not captured]
      • Probability of observations, without targeting: [equation not captured]
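The pseudo-code above can be sketched in Python as follows. This is a hedged illustration rather than XRay's implementation: it assumes independent observations per account, a uniform prior over hypotheses, and a "no targeting" model in which the ad appears with the same probability in every account. The parameter `p_none` and all default values are hypothetical, chosen only for the example.

```python
import math


def log_likelihood_targeted(d, accounts, saw_ad, p_in, p_out):
    """log P(x | input d is targeted): the ad shows with probability p_in in
    accounts containing d and p_out in the others. Probabilities must lie
    strictly between 0 and 1 so that the logarithms are defined."""
    total = 0.0
    for inputs, seen in zip(accounts, saw_ad):
        p = p_in if d in inputs else p_out
        total += math.log(p if seen else 1.0 - p)
    return total


def log_likelihood_untargeted(saw_ad, p_none):
    """log P(x | no tracked input is targeted): same probability p_none in
    every account (hypothetical model and parameter)."""
    return sum(math.log(p_none if seen else 1.0 - p_none) for seen in saw_ad)


def bayes_targeting(inputs, accounts, saw_ad, p_in=0.9, p_out=0.05, p_none=0.5):
    """Return (best hypothesis, posterior dict) under a uniform prior.
    The key None stands for the hypothesis 'no tracked input is targeted'."""
    log_post = {d: log_likelihood_targeted(d, accounts, saw_ad, p_in, p_out)
                for d in inputs}
    log_post[None] = log_likelihood_untargeted(saw_ad, p_none)
    # Normalize in log space to avoid underflow with many accounts.
    top = max(log_post.values())
    z = sum(math.exp(v - top) for v in log_post.values())
    posterior = {h: math.exp(v - top) / z for h, v in log_post.items()}
    return max(posterior, key=posterior.get), posterior


accounts = [{"D1", "D2"}, {"D2", "D3"}, {"D1", "D3"}]
best, posterior = bayes_targeting(["D1", "D2", "D3"], accounts, [True, True, False])
print(best)  # D2
```

When the ad appears in all three accounts, no single input explains the observations well and the `None` hypothesis wins under these illustrative parameters.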

  30. Bayesian can self-tune
      • Automatic self-tuning with classic iterative inference to learn the parameters
      • Many other challenges (input overlap, different kinds of targeting…)
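The slides do not say which iterative inference procedure is used. One possible reading, sketched here purely as an assumption, is a hard-EM-style loop: pick the most likely targeted input under the current parameters, re-estimate the in-account and out-of-account ad rates from the observations, and repeat until nothing changes. The function names, starting values, and Laplace smoothing are hypothetical choices for the example.

```python
import math


def log_likelihood(d, accounts, saw_ad, p_in, p_out):
    """log P(x | d targeted) with independent observations per account."""
    total = 0.0
    for inputs, seen in zip(accounts, saw_ad):
        p = p_in if d in inputs else p_out
        total += math.log(p if seen else 1.0 - p)
    return total


def estimate_rates(d, accounts, saw_ad, smoothing=1.0):
    """Re-estimate (p_in, p_out) assuming d is the targeted input.
    Laplace smoothing keeps both estimates strictly between 0 and 1."""
    n_in = sum(1 for a in accounts if d in a)
    n_out = len(accounts) - n_in
    ads_in = sum(1 for a, s in zip(accounts, saw_ad) if s and d in a)
    ads_out = sum(1 for a, s in zip(accounts, saw_ad) if s and d not in a)
    p_in = (ads_in + smoothing) / (n_in + 2 * smoothing)
    p_out = (ads_out + smoothing) / (n_out + 2 * smoothing)
    return p_in, p_out


def self_tune(inputs, accounts, saw_ad, p_in=0.6, p_out=0.4, max_iters=20):
    """Alternate between choosing the best input and refitting the rates."""
    best = None
    for _ in range(max_iters):
        new_best = max(
            inputs, key=lambda d: log_likelihood(d, accounts, saw_ad, p_in, p_out)
        )
        new_p_in, new_p_out = estimate_rates(new_best, accounts, saw_ad)
        converged = (
            new_best == best
            and abs(new_p_in - p_in) < 1e-9
            and abs(new_p_out - p_out) < 1e-9
        )
        best, p_in, p_out = new_best, new_p_in, new_p_out
        if converged:
            break
    return best, p_in, p_out


accounts = [{"D1", "D2"}, {"D2", "D3"}, {"D1", "D3"}]
best, p_in, p_out = self_tune(["D1", "D2", "D3"], accounts, [True, True, False])
print(best, p_in, p_out)
```

The point of the sketch is that no service-specific threshold appears anywhere: the rates are learned from the shadow-account observations themselves.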

  31. XRay’s architecture [diagram labels]
      • Web service: user account; shadow accounts (Shadow 1 … Shadow n)
      • XRay frontend: browser plugin (gives visual feedback); arrows labeled "All inputs", "All outputs", "Get associations"
      • XRay backend: Shadow Account Manager (service specific), with arrows labeled "Tracked inputs" and "Shadow outputs"; Correlation Engine (service agnostic)

  32. Prototype
      • We built the prototype for Gmail, to associate ads with the emails they target.
      • We applied the correlation engine as-is to Amazon product recommendations and YouTube video recommendations.
      • No lines of code had to change to adapt the correlation mechanisms.

  33. Talk overview: Motivation, Design, Evaluation

  34. Evaluation questions
      • How accurate is XRay?
      • Is XRay general, extensible and self-tuning?
      • How does XRay scale with the number of inputs?
      • How can we manage input overlap?
      • Is XRay useful?

  35. #1 How accurate is XRay?
      • We measured recall and precision for XRay’s associations on Gmail, YouTube and Amazon.
      • We need ground truth: it is provided by Amazon and YouTube, and obtained by manual labeling and validation for Gmail.
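For reference, precision and recall over a set of predicted associations can be computed as in the sketch below. The pair representation, the example identifiers, and the convention of returning 1.0 for an empty denominator are assumptions for the illustration; the numbers are made up and are not results from the evaluation.

```python
def precision_recall(predicted, truth):
    """predicted, truth: sets of (input, output) association pairs.

    precision = correct predictions / all predictions
    recall    = correct predictions / all true associations
    An empty denominator yields 1.0 (a convention chosen for this sketch).
    """
    true_pos = len(predicted & truth)
    precision = true_pos / len(predicted) if predicted else 1.0
    recall = true_pos / len(truth) if truth else 1.0
    return precision, recall


# Hypothetical associations: two of three predictions match the ground truth,
# and one true association is missed.
predicted = {("email_1", "ad_A"), ("email_2", "ad_B"), ("email_3", "ad_C")}
truth = {("email_1", "ad_A"), ("email_2", "ad_B"), ("email_4", "ad_D")}
print(precision_recall(predicted, truth))
```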
