XRay: Transparency for the data-driven Web. Mathias Lécuyer

XRay: Transparency for the data-driven Web
Mathias Lécuyer, Guillaume Ducoffe, Francis Lan, Andrei Papancea, Theofilos Petsios, Riley Spahn, Augustin Chaintreau, and Roxana Geambasu
Columbia University

Why this ad?
Cedars Hotel Loughborough
36 Bedrooms, Restaurant & Bar. Free WiFi, Parking, Best Rates
www.thecedarshotel.com

Why this ad?
Homosexuality
Cedars Hotel Loughborough
36 Bedrooms, Restaurant & Bar. Free WiFi, Parking, Best Rates
www.thecedarshotel.com

Why this ad?
Ralph Lauren Online Shop
The official Site for Ralph Lauren Apparel, Accessories & More
www.ralphlauren.com

Why this ad?
Pregnancy
Ralph Lauren Online Shop
The official Site for Ralph Lauren Apparel, Accessories & More
www.ralphlauren.com

Did you know?
• Data brokers can tell when you're sick, tired and depressed (and sell the information) [CNN ’14]
• Google Apps for Ed used institutional emails to target ads in personal accounts? [SafeGov ’14]
• Credit companies use Facebook data to decide loans? [CNN ’13]

Welcome to the big data world
• A myriad of web services and parties collect immense amounts of information about us and use it for varied purposes
• Data has lots of beneficial uses
    • Useful recommendations
    • Powerful, predictive applications
    • Improve business with effective product placement
    • Improve public health, disaster response
    • …

Big data lacks transparency
• We have no visibility into what services do with our data:
    • What is the data used for exactly?
    • Is it being shared? With whom?
    • Can we delete it?
• Obscurity threatens to transform the data-driven web into a breeding ground for data misuse.
• No robust tools exist to reveal data (mis)uses; even auditors cannot find answers.

Question: can we build tools that reveal data misuse?
• Which emails trigger which ads?
• Which prior searches trigger which prices?
• Does Facebook share our data with third parties?

Question: can we build tools that reveal data misuse?
• Which emails trigger which ads?
• Which prior searches trigger which prices?
• Does Facebook share our data with third parties?
Can we use taint tracking systems?
• Lots of prior work, many successful systems (e.g., TaintDroid)
• They assume a controlled environment (runtime, language, OS)
• We need something for the complex and uncontrolled web

XRay
• First generic data tracking system for the Web: associates inputs (e.g., emails) with outputs (e.g., ads)
• It is accurate, scalable, and generic
    • Works now on Gmail, Amazon and YouTube
• Provides key building blocks for a new ecosystem of tools to keep big data in check

Overview
• Motivation
• Design
• Evaluation

Goals
1. Fine-grained, accurate data use prediction
    • Predict use at individual input level (e.g., emails)
2. Scalability
    • Track many inputs (e.g., 100s of emails)
3. Widely applicable and self-tuning
    • Applies to many services (e.g., Gmail, Amazon…)

Web service model
Treat Web services as a black box
• Data inputs (D_i): e.g., emails, searches, browsing
• Targeted outputs (O_k): e.g., ads, recommendations

Web service model
Treat Web services as a black box
• Data inputs (D_i): e.g., emails, searches, browsing
• Targeted outputs (O_k): e.g., ads, recommendations
• XRay (Differential Correlation) produces associations: D_i -> O_k

Differential Correlation
• Key idea: correlate inputs with outputs
• Populate extra accounts with subsets of inputs
• Use shadow account observations to relate inputs to outputs

Differential Correlation
• Key idea: correlate inputs with outputs
• Populate extra accounts with subsets of inputs
• Use shadow account observations to relate inputs to outputs
Example:
• Main account: D_1, D_2, D_3 -> O_1
• Shadow account: D_1, D_2 -> O_1
• Shadow account: D_2, D_3 -> O_1
• Shadow account: D_1, D_3 -> no O_1
• Association: D_2 -> O_1
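The example on this slide can be reproduced with a small sketch. It is only an illustration of the differential idea, not XRay's actual scoring: the function name `differential_scores` and the "rate with the input minus rate without it" score are assumptions made here for clarity.

```python
def differential_scores(shadow_accounts, saw_output):
    """Score each input by how much its presence raises the output's rate.

    shadow_accounts: list of sets of input names, one set per shadow account.
    saw_output: list of booleans, True if the output appeared in that account.
    The scoring rule is a hypothetical stand-in for XRay's correlation step.
    """
    inputs = set().union(*shadow_accounts)
    scores = {}
    for d in sorted(inputs):
        with_d = [seen for acc, seen in zip(shadow_accounts, saw_output) if d in acc]
        without_d = [seen for acc, seen in zip(shadow_accounts, saw_output) if d not in acc]
        rate_with = sum(with_d) / len(with_d) if with_d else 0.0
        rate_without = sum(without_d) / len(without_d) if without_d else 0.0
        scores[d] = rate_with - rate_without
    return scores


# Shadow accounts and observations taken from the slide's example.
shadow_accounts = [{"D1", "D2"}, {"D2", "D3"}, {"D1", "D3"}]
saw_o1 = [True, True, False]

scores = differential_scores(shadow_accounts, saw_o1)
best = max(scores, key=scores.get)
print(best, scores)
```

D2 is the only input whose presence always coincides with O1 and whose absence coincides with no O1, so it gets the highest score.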

Challenge: scaling
• Key idea: correlate inputs with outputs
• Populate extra accounts with subsets of inputs
• Use shadow account observations to relate inputs to outputs
It sounds like a lot of accounts…
Example:
• Main account: D_1, D_2, D_3 -> O_1
• Shadow account: D_1, D_2 -> O_1
• Shadow account: D_2, D_3 -> O_1
• Shadow account: D_1, D_3 -> no O_1
• Association: D_2 -> O_1

Scalable algorithms
• Theorem: Under certain assumptions, for any ε > 0 there exists an algorithm that requires C × log(N) accounts, where N is the number of inputs, to correctly identify the inputs of a targeted ad with probability (1 − ε).
• Algo1: Set Intersection (simple, not robust)
• Algo2: Bayesian (more robust)

Algo1: Set Intersection
Input: output O_k (an ad), inputs D_i (emails), observations x
Output: targeted input
Step 1: Randomly assign emails to shadow accounts.
Step 2: Take the sets of emails from accounts where the ad appeared.
Step 3: Compute the intersection of these sets.
Step 4: If the intersection is non-empty, it is the targeted emails; else there is no targeting.
Example: Step 1 gives accounts {D_1, D_2}, {D_2, D_3}, {D_1, D_3}; Step 2 discards {D_1, D_3} (no ad); Step 3 yields D_2.

Algo1: Set Intersection
Input: output O_k (an ad), inputs D_i (emails), observations x
Output: targeted input
Step 1: Randomly assign emails to shadow accounts.
Step 2: Take the sets of emails from accounts where the ad appeared.
Step 3: Compute the intersection of these sets.
Step 4: If the intersection is non-empty, it is the targeted emails; else there is no targeting.
• We prove it needs a number of accounts logarithmic in the number of inputs for high probability of detection
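The four steps above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the helper names `assign_inputs` and `set_intersection`, the placement probability `p`, the account count, and the simulated service (which shows the ad exactly when the targeted email is present) are all hypothetical and not taken from the slides.

```python
import random


def assign_inputs(inputs, n_accounts, p=0.5, seed=0):
    """Step 1: place each input in each shadow account with probability p."""
    rng = random.Random(seed)
    return [{d for d in inputs if rng.random() < p} for _ in range(n_accounts)]


def set_intersection(accounts, saw_ad):
    """Steps 2-4: intersect the input sets of the accounts that saw the ad.

    Returns the candidate targeted inputs; an empty set means no targeting.
    """
    with_ad = [acc for acc, seen in zip(accounts, saw_ad) if seen]
    if not with_ad:
        return set()
    return set.intersection(*with_ad)


# The three-account example from the slide.
slide_accounts = [{"D1", "D2"}, {"D2", "D3"}, {"D1", "D3"}]
slide_result = set_intersection(slide_accounts, [True, True, False])
print(slide_result)  # {'D2'}

# A larger simulated run: 16 inputs, 12 accounts, ad targeted at "D7".
inputs = [f"D{i}" for i in range(1, 17)]
accounts = assign_inputs(inputs, n_accounts=12)
saw_ad = ["D7" in acc for acc in accounts]  # idealized, noise-free service
candidates = set_intersection(accounts, saw_ad)
print(candidates)
```

With a noise-free service the targeted input always survives the intersection; other inputs survive only if they happen to be in every account that saw the ad, which becomes less likely as accounts are added.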

Challenge: it needs tuning
• Ads must never appear in the wrong accounts
    • Not true: email redundancy, cache
    • Need a manual threshold to detect emails in a significant number of accounts
• Doesn’t take low signal into account
    • Need a hard-coded minimum number of accounts that see an ad
• Tuning is service specific and hard to do

Algo2: Bayesian (XRay)
Input: output O_k (an ad), inputs D_i (emails), observations x
Output: targeted input
foreach input D_i do
    compute prob. = apply_bayes
end
compute prob. = apply_bayes
return D_i with max
• Bayes’ rule:
• Probability of observations:
    • With D_i targeted:

Algo2: Bayesian (XRay)
Input: output O_k (an ad), inputs D_i (emails), observations x
Output: targeted input
foreach input D_i do
    compute prob. = apply_bayes
end
compute prob. = apply_bayes
return D_i with max
• Bayes’ rule:
• Probability of observations:
    • With D_i targeted:
• P_in: probability to see the ad if the targeted email is in the account
• P_out: probability to see the ad if the targeted email is not in the account

Algo2: Bayesian (XRay)
Input: output O_k (an ad), inputs D_i (emails), observations x
Output: targeted input
foreach input D_i do
    compute prob. = apply_bayes
end
compute prob. = apply_bayes
return D_i with max
• If an email is targeted, we can tell which one.
• Challenge: what if no tracked input is targeted?

Algo2: Bayesian (XRay)
Input: output O_k (an ad), inputs D_i (emails), observations x
Output: targeted input
foreach input D_i do
    compute prob. = apply_bayes
end
compute prob. = apply_bayes
return D_i with max
• Bayes’ rule:
• Probability of observations:
    • With D_i targeted:
    • Without targeting:
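The formulas on these slides did not survive extraction, so the following is only a sketch of how such a Bayesian comparison could look, not the authors' exact model. It assumes a uniform prior over hypotheses, independent observations across accounts, and fixed parameter values; `p_in` and `p_out` follow the slide's P_in and P_out, while `p_none` (the chance of seeing the ad when nothing is targeted) and the function name `bayes_targeting` are hypothetical.

```python
from math import exp, log


def bayes_targeting(accounts, saw_ad, inputs, p_in=0.9, p_out=0.05, p_none=0.1):
    """Return the most probable hypothesis and the posterior over hypotheses.

    Hypotheses: "input d is targeted" for each d, plus None for "no targeting".
    All probabilities here are assumed values, not learned parameters.
    """
    log_like = {}
    for d in inputs:
        total = 0.0
        for acc, seen in zip(accounts, saw_ad):
            p = p_in if d in acc else p_out
            total += log(p if seen else 1.0 - p)
        log_like[d] = total
    # "No targeting": the ad shows up with the same probability everywhere.
    log_like[None] = sum(log(p_none if seen else 1.0 - p_none) for seen in saw_ad)

    # Bayes' rule with a uniform prior: posterior is proportional to likelihood.
    top = max(log_like.values())
    weights = {h: exp(v - top) for h, v in log_like.items()}
    norm = sum(weights.values())
    posterior = {h: w / norm for h, w in weights.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior


accounts = [{"D1", "D2"}, {"D2", "D3"}, {"D1", "D3"}]
inputs = ["D1", "D2", "D3"]

best, posterior = bayes_targeting(accounts, [True, True, False], inputs)
print(best, round(posterior[best], 3))

best_none, posterior_none = bayes_targeting(accounts, [False, False, False], inputs)
print(best_none)
```

Unlike the set intersection, a single stray ad in the wrong account only lowers a hypothesis's likelihood instead of eliminating it, and the explicit "no targeting" hypothesis answers the question of what happens when no tracked input is targeted.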

Bayesian can self-tune
• Automatic self-tuning with classic iterative inference to learn the parameters
• Many other challenges (input overlap, different kinds of targeting…)

XRay’s architecture
• Web service: user account and shadow accounts (Shadow 1 … Shadow n)
• XRay frontend: browser plugin (gives visual feedback); sends all inputs and all outputs, gets associations
• XRay backend:
    • Shadow Account Manager (service specific): tracked inputs, shadow outputs
    • Correlation Engine (service agnostic)

Prototype
• We built the prototype for Gmail, to associate ads with the emails they target.
• Applied the correlation engine as-is to Amazon product recommendations and YouTube video recommendations.
• No lines of code had to change to adapt the correlation mechanisms.

Talk overview
• Motivation
• Design
• Evaluation

Evaluation questions
• How accurate is XRay?
• Is XRay general, extensible and self-tuning?
• How does XRay scale with the number of inputs?
• How can we manage input overlap?
• Is XRay useful?

#1 How accurate is XRay?
• We measured recall and precision for XRay’s associations on Gmail, YouTube and Amazon
• We need ground truth:
    • Ground truth provided by Amazon and YouTube
    • Manual labeling and validation for Gmail
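As a reminder of what the two metrics measure, here is a minimal sketch that scores a set of predicted input-to-output associations against ground truth. The function name `precision_recall` and the example pairs are made up for illustration; they are not results from the evaluation.

```python
def precision_recall(predicted, ground_truth):
    """Compute precision and recall over sets of (input, output) associations."""
    true_pos = len(predicted & ground_truth)
    # Precision: fraction of predicted associations that are correct.
    precision = true_pos / len(predicted) if predicted else 0.0
    # Recall: fraction of true associations that were predicted.
    recall = true_pos / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical associations, for illustration only.
predicted = {("email1", "adA"), ("email2", "adB"), ("email3", "adC"), ("email4", "adD")}
ground_truth = {("email1", "adA"), ("email2", "adB"), ("email3", "adC"),
                ("email5", "adE"), ("email6", "adF")}

precision, recall = precision_recall(predicted, ground_truth)
print(precision, recall)  # 0.75 0.6
```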
