HUMAN-POWERED DATA MANAGEMENT. Aditya Parameswaran, with H. Garcia-Molina, J. Widom, A. Polyzotis, M. Teh.



SLIDE 1

HUMAN-POWERED DATA MANAGEMENT

Aditya Parameswaran
with H. Garcia-Molina, J. Widom, A. Polyzotis, M. Teh

SLIDE 2

Why should we (DM/DB folks) care?

Reason 1: Most data is unstructured

Structured Data vs. Unstructured Data (images, videos, text)
Automated processing: not yet solved. Incorporate xyzabc

SLIDE 3

Why should we (DM/DB folks) care?

Reason 2: Software companies use crowds at scale

Often 10s+ of millions of $ / yr. / company (on crowds + supervisors).
Plenty of startups too! We undertook a survey of industry crowdsourcing users.

[Logos of companies that use crowds]

SLIDE 4

Why should we (DM/DB folks) care?

Reason 3: Marketplaces are growing rapidly

Crowdsourcing marketplaces: 20+ marketplaces; big companies have internal ones.
The size of these marketplaces doubled between 2011 and 2013.

SLIDE 5

Why should we (DM/DB folks) care?

Reason 1: Most data is unstructured
Reason 2: Software companies use crowds at scale
Reason 3: Marketplaces are growing rapidly

SLIDE 6

What is Human-Powered Data Management?

Humans act as "data processors": e.g., compare, label, extract.

It draws on:
  • Data Processing: algorithms, learning accuracies, data processing systems
  • Machine Learning
  • HCI: interfaces, patterns
  • Economics: incentives

SLIDE 7

Efficient Data Processing Algorithms & Systems

Data processing algorithms, data processing systems, and auxiliary plugins (quality, pricing):

Filter [SIGMOD12, VLDB14], Max [SIGMOD12], Clean [KDD12, TKDD13], Categorize [VLDB11], Search [ICDE14], Debugging [NIPS12], Deco [CIKM12, VLDB12, TR12, SIGMOD Record 12], DataSift [HCOMP13, SIGMOD14], HQuery [CIDR11], Confidence [KDD13, TR14], Eviction [TR12], Pricing [VLDB15], Quality [HCOMP14]

i.stanford.edu/~adityagp/scoop.html

SLIDE 8

Data Proc. Sys.: Crowd-Powered Search

Can your search engine handle this?
  • "apartments in a good school district near Urbana, with a bus stop nearby"
  • "type of cable that connects to buildings in the vicinity of xxx"

SLIDE 9

DataSift: Crowd-Powered Search

  • Non-textual content:
    "cables that plug into <img>"
    "funny pictures of cats with hats with captions"
  • Time-consuming:
    "find noise canceling headphones where the battery lasts 13 hrs"
    "apartments in a nice area around urbana"

SLIDE 10

SLIDE 11

Building DataSift: Challenges

Workflow: Gather → Retrieve → Filter (possibly iterated: Gather → Retrieve → Filter → Gather → Retrieve → Filter → …)
Gather: ask for text reformulations of the query. Filter: check if an item satisfies the query.

  • How many reformulations should we gather?
  • How many items should we retrieve at each step?
  • How do we filter items? How many people do we ask?
  • How do we optimize the workflow?
  • How do we guarantee correctness?

SLIDE 12

Fundamental Tradeoffs

Cost: how much am I willing to spend?
Latency: how long can I wait?
Quality: what is my desired quality?

SLIDE 13

DataSift Summary

Gather → Retrieve → Filter, trading off latency, cost, and quality.

Sample applications: education, social media, commerce, journalism, …

[SIGMOD14] DataSift: A Crowd-Powered Search Toolkit (demo)
[HCOMP13] An expressive and accurate crowd-powered search

SLIDE 14

Filtering: The Simplest Version

Dataset of items → Boolean predicate → filtered dataset (answers Y, Y, N, …).
Humans are asked: "Does X satisfy the predicate?", e.g., "Is this image a cat?"
For now, all humans have the same error rates.
Tradeoffs: latency, cost, quality.

SLIDE 15

Our Visualization of Strategies

[Figure: a grid with axes Yes (# Yes answers) and No (# No answers), each running 1 to 5; every grid point is marked decide PASS, continue, or decide FAIL]

A strategy is a Markov Decision Process over this grid.

SLIDE 16

Strategy Examples

[Figure: two example strategies on the (Yes, No) grid, with points marked decide PASS, continue, or decide FAIL]

SLIDE 17

Simplest Version

Given:
— Human error probability (FP/FN): Pr[Yes | 0], Pr[No | 1]
— A-priori probability: Pr[0], Pr[1]
(both obtained via sampling, prior history, or a gold standard)

Find the strategy with minimum expected cost (# of questions), such that:
— Expected error < t (say, 5%)
— Cost per item < m (say, 20 questions), i.e., the grid is bounded by x + y = m

SLIDE 18

Evaluating Strategies

On the (Yes, No) grid, with x = # Yes answers and y = # No answers:

  • Pr[reach (4, 2)] = Pr[reach (4, 1) & get a No] + Pr[reach (3, 2) & get a Yes]
  • Cost = Σ (x + y) · Pr[reach (x, y)], summed over points where the strategy stops (decide PASS / decide FAIL)
  • Error = Pr[reaching a decision of 1 for a 0-item] + Pr[reaching a decision of 0 for a 1-item]
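The reach recurrence and the cost/error formulas above can be turned into a small dynamic program. The sketch below is illustrative: the error rates, prior, budget, and the majority-style `decision` strategy are assumptions of the example, not taken from the talk.

```python
from functools import lru_cache

# Parameters assumed for illustration (not from the talk)
PRIOR1 = 0.5  # a-priori Pr[item satisfies the predicate]
FP = 0.2      # Pr[Yes | item is 0]
FN = 0.2      # Pr[No  | item is 1]
M = 4         # budget: at most M answers per item

def decision(x, y):
    """A majority-style example strategy on the (x = #Yes, y = #No) grid:
    decide once the margin reaches 2, or when the budget runs out."""
    if x - y >= 2:
        return "PASS"
    if y - x >= 2:
        return "FAIL"
    if x + y == M:
        return "PASS" if x >= y else "FAIL"
    return "CONT"

@lru_cache(maxsize=None)
def reach(x, y):
    """(Pr[reach (x,y) & item=1], Pr[reach (x,y) & item=0]),
    following the recurrence on the slide."""
    if (x, y) == (0, 0):
        return PRIOR1, 1 - PRIOR1
    p1 = p0 = 0.0
    if x > 0 and decision(x - 1, y) == "CONT":  # ... & got a Yes
        a1, a0 = reach(x - 1, y)
        p1 += a1 * (1 - FN)
        p0 += a0 * FP
    if y > 0 and decision(x, y - 1) == "CONT":  # ... & got a No
        a1, a0 = reach(x, y - 1)
        p1 += a1 * FN
        p0 += a0 * (1 - FP)
    return p1, p0

cost = err = 0.0
for x in range(M + 1):
    for y in range(M + 1 - x):
        d = decision(x, y)
        if d == "CONT":
            continue
        p1, p0 = reach(x, y)
        cost += (x + y) * (p1 + p0)       # Cost = sum (x+y) Pr[reach (x,y)]
        err += p0 if d == "PASS" else p1  # mass that terminates incorrectly
print(round(cost, 3), round(err, 3))  # -> 2.64 0.104
```

With these numbers the example strategy asks 2.64 questions per item on average and errs on about 10.4% of items; tightening the margin or budget trades cost against error, which is exactly the tradeoff the optimal-strategy search explores.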

SLIDE 19

Naïve Approach

For each grid point, assign decide PASS, continue, or decide FAIL.
For all strategies: evaluate cost & error; return the best.
O(3^g) strategies, where g = O(m²) is the number of grid points. If m = 5, g = 21.

SLIDE 20

Comparison

Strategy               | Computing             | Money
Naïve deterministic    | Not feasible          | $$
Our best deterministic | Exponential; feasible | $$$

SLIDE 21

Probabilistic Strategy Example

[Figure: the (Yes, No) grid; each point carries a probability triple over (decide PASS, continue, decide FAIL), e.g., (0.2, 0.8, 0)]

SLIDE 22

Comparison

Strategy               | Computing                 | Money
Naïve deterministic    | Exponential; not feasible | $$
Our best deterministic | Exponential; feasible     | $$$
The best probabilistic | Polynomial(m); THE BEST   | $

SLIDE 23

Finding the Optimal Strategy

Simple: use Linear Programming
  • Variables: a probabilistic decision per grid point
  • Constraints: probability conservation; boundary conditions

[SIGMOD12] CrowdScreen: Algorithms for filtering data with humans
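One way to write that LP out concretely is as a probability-mass flow over the grid. This is a sketch: the flow-style variables and notation below are mine, not necessarily the exact formulation of the CrowdScreen paper. Since all workers share the same error rates, the posterior at a grid point is fixed by Bayes' rule, π(x,y) = Pr[1 | x Yes, y No], and the next answer there is Yes with probability q(x,y) = π(x,y)(1 − fn) + (1 − π(x,y)) fp, so everything below is linear in the decision masses.

```latex
\begin{align*}
&\text{variables: } P_{x,y},\, F_{x,y},\, C_{x,y} \ge 0
  \quad\text{(mass deciding PASS, FAIL, or continuing at } (x,y)\text{)}\\
&\text{conservation: } P_{x,y} + F_{x,y} + C_{x,y}
  = C_{x-1,y}\, q(x-1,y) + C_{x,y-1}\,\bigl(1 - q(x,y-1)\bigr),
  \quad \text{inflow } 1 \text{ at } (0,0)\\
&\text{boundary: } C_{x,y} = 0 \ \text{whenever } x + y = m\\
&\text{minimize } \sum_{x,y} (x+y)\,(P_{x,y} + F_{x,y})
  \quad\text{s.t.}\quad
  \sum_{x,y} \bigl[P_{x,y}\,(1-\pi(x,y)) + F_{x,y}\,\pi(x,y)\bigr] \le t
\end{align*}
```

The probabilistic decision at a point is recovered from the optimum by normalizing: e.g., the continue probability at (x,y) is C divided by the total inflow there.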

SLIDE 24

Generalizations

  • Multiple answers (ratings, categories)
  • Multiple independent filters
  • Difficulty
  • Different penalty functions
  • Latency
  • Different worker abilities
  • Different worker probes
  • A-priori scores

Some of these are doable; others (such as different worker abilities, next) are hard!

SLIDE 25

Generalization: Worker Abilities

State: (W1 Yes, W1 No, …, Wn Yes, Wn No) → O(m^(2n)) points; with n ≈ 1000, an explosion of state!

       | Item 1 | Item 2 | Item 3
Actual | 0      | 1      | 0
W1     | 0      | 1      | 0
W2     | 1      | 1      | 1
W3     | 1      | 0      | 1

SLIDE 26

A Different Representation

[Figure: instead of the (Yes, No) grid, plot cost on one axis against Pr[1 | Ans] (from 0 to 1) on the other]

SLIDE 27

Worker Abilities: Sufficiency

Instead of the full state (W1 Yes, W1 No, W2 Yes, W2 No, …, Wn Yes, Wn No), recording Pr[1 | Ans] is sufficient: a strategy on the (cost, Pr[1 | Ans]) representation is still optimal.
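A tiny sketch of why the posterior is enough to track: with per-worker error rates (all numbers below are assumed for illustration), Pr[1 | Ans] can be updated answer by answer, so the full answer vector never needs to be stored.

```python
# Illustrative sketch: heterogeneous worker error rates (numbers assumed).
PRIOR1 = 0.5  # a-priori Pr[item is 1]
WORKERS = {   # worker id -> (fp, fn) = (Pr[Yes | 0], Pr[No | 1])
    "w1": (0.10, 0.30),
    "w2": (0.25, 0.25),
    "w3": (0.30, 0.10),
}

def posterior(answers):
    """answers: list of (worker_id, 'Y' or 'N') -> Pr[item is 1 | answers]."""
    p1, p0 = PRIOR1, 1 - PRIOR1  # unnormalized joint masses for item = 1 / 0
    for w, a in answers:
        fp, fn = WORKERS[w]
        if a == "Y":
            p1 *= 1 - fn
            p0 *= fp
        else:
            p1 *= fn
            p0 *= 1 - fp
    return p1 / (p1 + p0)

print(posterior([("w1", "Y"), ("w2", "N")]))  # ~0.7
# Answer order doesn't matter, and a symmetric worker's Y then N cancels:
print(posterior([("w2", "Y"), ("w2", "N")]))  # back to the prior, 0.5
```

Different answer vectors that yield the same Pr[1 | Ans] collapse into one state, which is what tames the O(m^(2n)) explosion from the previous slide.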

SLIDE 28

MOOCs: Application of Filtering

Peer evaluation: required crowdsourcing.
A generalization of boolean filtering to scoring [1-5] (grades A+, A, B+, B-, …).

SLIDE 29

Experiments on MOOCs

Stanford HCI course: 1000 × 5 × 5 parts = 25,000 parts, graded by random peers with known error rates.
To study: how much we can reduce error for fixed cost.

SLIDE 30

Summary: for the same cost, a reduction in error (distance from the correct grade) of:
  • 50% over median
  • 30% over MLE
  • 10-20% over same accuracy

[VLDB14] Optimal Crowd-Powered Rating and Filtering Algorithms

SLIDE 31

Efficient Data Processing Algorithms & Systems

Data processing algorithms, data processing systems, and auxiliary plugins (quality, pricing), trading off latency, cost, and quality:

Filter [SIGMOD12, VLDB14], Max [SIGMOD12], Clean [KDD12, TKDD13], Categorize [VLDB11], Search [ICDE14], Debugging [NIPS12], Deco [CIKM12, VLDB12, TR12, SIGMOD Record 12], DataSift [HCOMP13, SIGMOD14], HQuery [CIDR11], Confidence [KDD13, TR14], Eviction [TR12], Pricing [VLDB15], Quality [HCOMP14]

i.stanford.edu/~adityagp/scoop.html

SLIDE 32

SLIDE 33

VISUAL DATA MANAGEMENT with SeeDB

Aditya Parameswaran
with: Hector Garcia-Molina, Sam Madden, Alkis Polyzotis, Manasi Vartak

SLIDE 34

Simplifying Data Analytics

"Up to a million additional analysts will be needed to address data analytics needs in 2018 in the US alone." --- McKinsey Big Data Report, 2013

How do we make it easier for novice data analysts to get insights from data?

SLIDE 35

Data Analytics Workflow

Query (e.g., select "Staplers" from All Products) → views: "Sales by Year", "Production by Year", "Production by State" (bar charts over MA, CA, IL, NY).

Laborious and tiresome! Can we automate this?

Similar issues with Tableau, ShowMe, Profiler, Spotfire.

SLIDE 36

Potentially Interesting Views (Visualizations)

"Potentially interesting": a trend in a subset that is not in the overall data (e.g., "Sales by Year" for the subset).

Can we automatically highlight potentially interesting views?

Saving: instead of stepping through all views, now only the potentially interesting ones!

SLIDE 37

Our Proposed System: SeeDB

Query Q on database D (many rows, many columns) → DBMS + SeeDB → k potentially interesting visualizations.

[VLDB14] SeeDB: Visualizing Database Queries Efficiently (Vision)
[VLDB14] Automatically Generating Query Visualizations (Demo)

SLIDE 38

SeeDB: Conceptual Workflow

Query Q on D → candidate views V1, V2, …, Vn → score each view V → Visual Engine.

Objective: find the k best-scoring views (or visualizations).

Really expensive!

SLIDE 39

How do we score views?

We are pursuing ways to learn this scoring function using crowds. For now, a proxy that is "good enough": differences in "distribution", e.g., EMD, Euclidean, KL-divergence.

Difference(Distribution of Sales by year overall, Distribution of Sales by year for Staplers)

Our techniques work with any scoring metric. This is a hard, domain-specific question!
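A minimal sketch of the distribution-difference proxy, here with KL-divergence; the metric choice and all histogram counts below are illustrative, not from the talk.

```python
from math import log

# Distribution-difference proxy for scoring a view.
def normalize(counts):
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) with a small epsilon so empty bins don't blow up."""
    return sum(pi * log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# "Sales by year": overall vs. restricted to the "Staplers" subset
overall  = normalize([120, 130, 125, 140, 135])  # roughly flat trend
staplers = normalize([10, 15, 60, 20, 12])       # a spike in one year
flatline = normalize([118, 128, 127, 138, 133])  # mimics the overall trend

print(kl_divergence(staplers, overall))  # large  -> potentially interesting
print(kl_divergence(flatline, overall))  # ~ zero -> not interesting
```

The same skeleton works with EMD or Euclidean distance: only `kl_divergence` changes, which is the sense in which the techniques are metric-agnostic.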

SLIDE 40

How many views to consider?

Star schema; histogram visualizations.
M measure attributes, A dimension attributes, F aggregation functions.
One-dimensional visualizations: M × A × F. If we also consider binning: M × A × F × B.

SLIDE 41

Building SeeDB: Concrete Directions

How do we minimize computation?
  • Sharing computation
  • Approximate visualizations
  • Approximate scoring
  • Visualization pruning

SLIDE 42

Technique 1: Sharing Computation

"Sales by Year" + "Production by Year" → one combined query, "Sales and Production by Year":

SELECT AGG(M1), AGG(M2), D
FROM R
WHERE Prod = "Staplers"
GROUP BY D

Linear speedup!

SLIDE 43

Technique 1: Sharing Computation

"Sales by Year" + "Sales by Region" → "Sales by Year, Region":

SELECT AGG(M), D1, D2
FROM R
WHERE Prod = "Staplers"
GROUP BY D1, D2

Problematic: the number of aggregates grows rapidly. Intractable!

SLIDE 44

Technique 2: Approximate Visualizations

Can we provide visualizations ("Production by Year") that are guaranteed to look similar (e.g., similar order, similar differences) to the actual ones, but at much lower cost?

Analysts are only interested in trends, not absolutes, and are limited by resolution anyway.

SLIDE 45

Technique 2: Approximate Visualizations

The answer is yes! At a high level, the algorithm samples "more" from contentious areas.

  • Orders-of-magnitude savings compared to baselines
  • Optimality guarantees
  • Also of independent interest

[TR14] Generating Rapid Visualizations with Guarantees
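A rough sketch of the "sample more from contentious areas" idea (not the [TR14] algorithm itself; the data, the Hoeffding-style bound, and all constants are assumptions): groups whose confidence intervals still overlap, i.e., whose bar ordering is unresolved, receive more samples, while clearly separated groups stop early.

```python
import random
from math import log, sqrt

random.seed(7)

# Illustrative data: per-group measure values ("Production" per year)
POP = {year: [random.gauss(mu, 1.0) for _ in range(20000)]
       for year, mu in [("2010", 3.0), ("2011", 3.1), ("2012", 4.5)]}

def ci_halfwidth(n, delta=0.05, value_range=8.0):
    # Hoeffding-style half-width for a mean, treating values as roughly
    # bounded within value_range -- an assumption of this sketch
    return value_range * sqrt(log(2 / delta) / (2 * n))

def approx_bars(pop, batch=100, max_n=20000):
    samples = {g: random.sample(pop[g], batch) for g in pop}
    while True:
        means = {g: sum(s) / len(s) for g, s in samples.items()}
        hw = {g: ci_halfwidth(len(samples[g])) for g in samples}
        # contentious pairs: intervals overlap, so bar order is unresolved
        unresolved = [(a, b) for a in pop for b in pop if a < b
                      and abs(means[a] - means[b]) <= hw[a] + hw[b]]
        needy = {g for pair in unresolved for g in pair
                 if len(samples[g]) < max_n}
        if not needy:
            return means, {g: len(s) for g, s in samples.items()}
        for g in needy:  # sample more only where the ordering is contested
            # (re-sampling fresh batches, i.e., with replacement across
            # batches, for simplicity)
            samples[g] += random.sample(pop[g], batch)

means, counts = approx_bars(POP)
print(counts)  # separated groups stop early; contested ones keep sampling
```

Here "2012" separates from the other bars after a few hundred samples, while the near-tied "2010"/"2011" pair absorbs the budget: the ordering-preserving guarantee is bought only where it is actually in doubt.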

SLIDE 46

Building SeeDB: Concrete Directions

How do we minimize computation?
  • Sharing computation
  • Approximate visualizations
  • Approximate utility computation
  • Visualization pruning

Overall, a rich space of questions generalizable beyond SeeDB!

SLIDE 47

Our Current Design

The architecture of our system is shown in Figure 1 below.

SLIDE 48

Interactive Query Builder

SLIDE 49

Top-k Visualizations

SLIDE 50

To summarize…

SeeDB has some ambitious goals: "show me all that's interesting about the query result", i.e., the holy grail of exploratory visual data analysis.

We've barely scratched the surface yet, but that doesn't mean we can't build a useful tool!