Human-Powered Data Management

  1. HUMAN-POWERED DATA MANAGEMENT. Aditya Parameswaran, with H. Garcia-Molina, J. Widom, A. Polyzotis, M. Teh.

  2. Why should we (DM/DB folks) care? Reason 1: Most data is unstructured. Unstructured data (images, videos, text): automated processing not yet solved. Diagram labels: Unstructured Data; Incorporate xyzabc; Structured Data.

  3. Why should we (DM/DB folks) care? Reason 2: Software companies use crowds at scale. We undertook a survey of industry crowdsourcing users. The surveyed companies use crowds, often spending $10s+ of millions per year per company (on crowds + supervisors). Plenty of startups too!

  4. Why should we (DM/DB folks) care? Reason 3: Marketplaces are growing rapidly. Crowdsourcing marketplaces: 20+ marketplaces; big companies have internal ones. The size of these marketplaces doubled between 2011 and 2013.

  5. Why should we (DM/DB folks) care?
     • Reason 1: Most data is unstructured
     • Reason 2: Software companies use crowds at scale
     • Reason 3: Marketplaces are growing rapidly

  6. What is Human-Powered Data Management? Data processing algorithms and systems where humans act as “data processors” (e.g., compare, label, extract). Related areas: Machine Learning (learning accuracies); HCI (interfaces, patterns); Economics (incentives).

  7. Efficient Data Processing Algorithms & Systems
     • Data Processing Algorithms: Filter [SIGMOD12, VLDB14]; Max [SIGMOD12]; Clean [KDD12, TKDD13]; Categorize [VLDB11]; Search [ICDE14]; Debugging [NIPS12]
     • Data Processing Systems: Deco [CIKM12, VLDB12, TR12, SIGMOD Record 12]; DataSift [HCOMP13, SIGMOD14]; HQuery [CIDR11]
     • Auxiliary Plugins (Quality, Pricing): Confidence [KDD13, TR14]; Eviction [TR12]; Pricing [VLDB15]; Quality [HCOMP14]
     i.stanford.edu/~adityagp/scoop.html

  8. Data Proc. Sys.: Crowd-Powered Search. Can your search engine handle this?
     • “buildings in the vicinity of xxx”
     • “type of cable that connects to”
     • “apartments in a good school district near Urbana, with a bus stop near by”

  9. DataSift: Crowd-Powered Search
     • Non-textual content: “cables that plug into <img>”; “funny pictures of cats with hats with captions”
     • Time-consuming: “find noise canceling headphones where the battery lasts 13 hrs”; “apartments in a nice area around urbana”

  11. Building DataSift: Challenges. Workflow: Gather (ask for text reformulations for query), Retrieve, Filter (check if item satisfies query), repeated over several rounds.
     • How many reformulations should we gather?
     • How many items should we retrieve at each step?
     • How do we filter items? How many people do we ask?
     • How do we optimize the workflow?
     • How do we guarantee correctness?

  12. Fundamental Tradeoffs
     • Latency: How long can I wait?
     • Quality: What is my desired quality?
     • Cost: How much am I willing to spend?

  13. DataSift Summary. Sample applications: education, social media, commerce, journalism, … Workflow: Gather, Retrieve, Filter. Tradeoffs: Latency, Quality, Cost.
     • [SIGMOD14] DataSift: A Crowd-Powered Search Toolkit (demo)
     • [HCOMP13] An expressive and accurate crowd powered search

  14. Filtering: The Simplest Version. A dataset of items is passed through a Boolean predicate (e.g., “Is this image a cat?”) to produce a filtered dataset. Humans are asked “Does X satisfy predicate?” and answer Y or N. For now, all humans have the same error rates. Tradeoffs: Latency, Quality, Cost.

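  15. Our Visualization of Strategies. [Figure: a strategy drawn on a grid with the number of Yes answers (1–5) on one axis and the number of No answers (1–5) on the other; each grid point is marked continue, decide PASS, or decide FAIL. Label: Markov Decision Process.]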

  16. Strategy Examples. [Figure: two example strategies on the Yes/No grid (axes 1–5), with points marked continue, decide PASS, or decide FAIL.]

  17. Simplest Version
     Given (via sampling, prior history, or gold standard):
     • Human error probability (FP/FN): Pr[Yes | 0]; Pr[No | 1]
     • A-priori probability: Pr[0]; Pr[1]
     Find the strategy with minimum expected cost (# of questions), subject to:
     • Expected error < t (say, 5%)
     • Cost per item < m (say, 20 questions)
     [Figure: triangular grid with both axes running to m, bounded by the line x + y = m.]

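  18. Evaluating Strategies. With x = number of Yes answers and y = number of No answers:
     • $\text{Cost} = \sum_{(x,y) \in \text{PASS} \cup \text{FAIL}} (x+y)\,\Pr[\text{reach}(x,y)]$
     • $\text{Error} = \sum_{(x,y) \in \text{FAIL}} \Pr[\text{reach}(x,y) \wedge 1] + \sum_{(x,y) \in \text{PASS}} \Pr[\text{reach}(x,y) \wedge 0]$
     • Example: Pr[reach(4, 2)] = Pr[reach(4, 1) & get a No] + Pr[reach(3, 2) & get a Yes]
     [Figure: strategy grid with Yes count x (1–5) and No count y (1–5); points marked continue, decide PASS, or decide FAIL.]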
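The cost and error sums on slide 18 can be computed by pushing probability mass through the grid, one answer at a time. The following is a minimal sketch, not the authors' code: the function names, the string constants, and the `majority_of_three` example strategy are hypothetical, and it assumes a deterministic strategy, identical workers, and answers that are independent given the item's true value.

```python
from typing import Callable, Dict, Tuple

CONTINUE, PASS, FAIL = "continue", "pass", "fail"


def evaluate_strategy(
    decide: Callable[[int, int], str],
    m: int,
    prior_one: float,
    p_yes_given_one: float,
    p_yes_given_zero: float,
) -> Tuple[float, float]:
    """Return (expected cost, expected error) of a deterministic strategy.

    decide(x, y) maps a grid point (x Yes answers, y No answers) to
    CONTINUE, PASS, or FAIL. Points with x + y == m must not CONTINUE.
    """
    # reach[(x, y)] = (Pr[reach(x, y) and item is 1], Pr[reach(x, y) and item is 0])
    reach: Dict[Tuple[int, int], Tuple[float, float]] = {
        (0, 0): (prior_one, 1.0 - prior_one)
    }
    cost = 0.0
    error = 0.0
    for total in range(m + 1):  # visit points in increasing order of x + y
        for x in range(total + 1):
            y = total - x
            if (x, y) not in reach:
                continue  # this point is unreachable under the strategy
            r1, r0 = reach[(x, y)]
            action = decide(x, y)
            if action == CONTINUE:
                if total == m:
                    raise ValueError(f"strategy must decide at boundary point {(x, y)}")
                # A Yes answer moves to (x + 1, y); a No answer moves to (x, y + 1).
                for nxt, w1, w0 in (
                    ((x + 1, y), p_yes_given_one, p_yes_given_zero),
                    ((x, y + 1), 1.0 - p_yes_given_one, 1.0 - p_yes_given_zero),
                ):
                    a1, a0 = reach.get(nxt, (0.0, 0.0))
                    reach[nxt] = (a1 + r1 * w1, a0 + r0 * w0)
            else:
                cost += (x + y) * (r1 + r0)
                # PASS is wrong when the item is 0; FAIL is wrong when the item is 1.
                error += r0 if action == PASS else r1
    return cost, error


def majority_of_three(x: int, y: int) -> str:
    """Hypothetical example strategy: stop as soon as either answer has 2 votes."""
    if x >= 2:
        return PASS
    if y >= 2:
        return FAIL
    return CONTINUE


cost, error = evaluate_strategy(
    majority_of_three, m=3, prior_one=0.5, p_yes_given_one=0.9, p_yes_given_zero=0.1
)
print(round(cost, 4), round(error, 4))
```

With a 10% error rate in both directions and a uniform prior, this example strategy asks 2.18 questions per item on average and errs with probability 0.028, which matches the usual majority-of-three error 3p²(1−p) + p³ for p = 0.1.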

  19. Naïve Approach. For each grid point, assign continue, decide PASS, or decide FAIL. For all strategies: evaluate cost & error. Return the best. Complexity: O(3^g), where g = O(m^2). If m = 5, g = 21.
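The counts on slide 19 are easy to check numerically. This small sketch assumes the grid contains every point (x, y) with x, y ≥ 0 and x + y ≤ m, including the origin, which is the reading that reproduces g = 21 for m = 5; the function names are illustrative only.

```python
def grid_points(m: int) -> int:
    """Number of grid points (x, y) with x, y >= 0 and x + y <= m."""
    return sum(1 for x in range(m + 1) for _ in range(m + 1 - x))


def deterministic_strategies(m: int) -> int:
    """Each grid point gets one of three labels: continue, PASS, or FAIL."""
    return 3 ** grid_points(m)


print(grid_points(5), deterministic_strategies(5))
```

Even at m = 5 the naïve enumeration already covers more than ten billion labelings, which is why exhaustive search is marked infeasible.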

  20. Comparison (Strategy: Computing; Money)
     • Naïve deterministic: not feasible; $$
     • Our best deterministic: exponential, feasible; $$$

  21. Probabilistic Strategy Example. [Figure: strategy on the Yes/No grid (Yes 1–5, No 1–6) with legend continue, decide PASS, decide FAIL; one grid point is annotated with the probability triple (0.2, 0.8, 0).]

  22. Comparison (Strategy: Computing; Money)
     • Naïve deterministic: exponential, not feasible; $$
     • Our best deterministic: exponential, feasible; $$$
     • The best probabilistic: polynomial(m); $ (THE BEST)

  23. Finding the Optimal Strategy. Simple: Use Linear Programming.
     • variables: “probabilistic decision per grid point”
     • constraints: probability conservation; boundary conditions
     [SIGMOD12] Crowdscreen: Algorithms for filtering data with humans
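One way to write slide 23's linear program explicitly is sketched below. This formulation is an assumption reconstructed from the slide's two bullets, not copied from the deck: the symbols s, e_0, e_1, a_0, a_1, n_C, n_P, n_F are introduced here for illustration. The idea is that every path to grid point (x, y) contains exactly x Yes and y No answers, so all such paths are equally likely given the item's true value; the unknowns can then be fractional path counts, which keeps both the objective and the constraints linear.

```latex
% Assumed notation (not from the slides):
%   s   = Pr[1],   e_1 = Pr[No | 1],   e_0 = Pr[Yes | 0]
%   n_C, n_P, n_F >= 0: fractional number of paths that reach (x, y)
%   and then continue / decide PASS / decide FAIL.
%   Terms with a negative coordinate, e.g. n_C(-1, y), are taken to be 0.
\begin{align*}
a_1(x,y) &= s\,(1-e_1)^{x}\,e_1^{\,y}, \qquad
a_0(x,y) = (1-s)\,e_0^{\,x}\,(1-e_0)^{y} \\[4pt]
\min \quad & \sum_{x+y \le m} (x+y)\,\bigl(a_0(x,y)+a_1(x,y)\bigr)\,\bigl(n_P(x,y)+n_F(x,y)\bigr) \\
\text{s.t.} \quad
& \sum_{x+y \le m} \bigl(a_0(x,y)\,n_P(x,y) + a_1(x,y)\,n_F(x,y)\bigr) \le t
  && \text{(expected error)} \\
& n_C(0,0)+n_P(0,0)+n_F(0,0) = 1
  && \text{(start)} \\
& n_C(x,y)+n_P(x,y)+n_F(x,y) = n_C(x-1,y)+n_C(x,y-1), \quad x+y \ge 1
  && \text{(conservation)} \\
& n_C(x,y) = 0, \quad x+y = m
  && \text{(boundary)}
\end{align*}
```

Under this sketch, the probabilistic decision at a reachable grid point is recovered by normalizing its three path counts, for example Pr[continue at (x, y)] = n_C / (n_C + n_P + n_F). The slide's strict bound "expected error < t" is relaxed to ≤ here because linear programs work with closed constraints.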

  24. Generalizations (annotated from “Doable” at the top of the list to “Hard!” at the bottom)
     • Multiple answers (ratings, categories)
     • Multiple independent filters
     • Difficulty
     • Different penalty functions
     • Latency
     • Different worker abilities
     • Different worker probes
     • A-priori scores

  25. Generalization: Worker Abilities
     Answers by item (Item 1, Item 2, Item 3):
     • Actual: 0, 1, 0
     • W1: 0, 1, 0
     • W2: 1, 1, 1
     • W3: 1, 0, 1
     State per item: (W1 Yes, W1 No, …, Wn Yes, Wn No), giving O(m^(2n)) points with n ≈ 1000. Explosion of state!

  26. A Different Representation. [Figure: the Yes/No grid (axes 1–3) shown next to an equivalent plot of Pr[1|Ans] (0.2 to 1) against Cost (1–3).]

  27. Worker Abilities: Sufficiency. [Figure: Pr[1|Ans] (0.2 to 1) plotted against Cost (1–5), replacing the full state (W1 Yes, W1 No, W2 Yes, W2 No, …, Wn Yes, Wn No).] Recording Pr[1|Ans] is sufficient: the resulting strategy is optimal.
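The quantity Pr[1|Ans] on slide 27 can be maintained with one Bayes update per answer, so the per-worker counts never need to be stored. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function name and tuple layout are hypothetical, each worker's rates Pr[Yes | 1] and Pr[Yes | 0] are assumed known and strictly between 0 and 1, and answers are assumed independent given the item's true value.

```python
from typing import Iterable, Tuple


def posterior_one(
    prior_one: float,
    answers: Iterable[Tuple[bool, float, float]],
) -> float:
    """Return Pr[item is 1 | answers].

    Each answer is (said_yes, p_yes_given_one, p_yes_given_zero) for the
    worker who gave it, so different workers may have different abilities.
    """
    p = prior_one
    for said_yes, p_yes_1, p_yes_0 in answers:
        # Likelihood of this answer under each possible true value.
        like_1 = p_yes_1 if said_yes else 1.0 - p_yes_1
        like_0 = p_yes_0 if said_yes else 1.0 - p_yes_0
        p = p * like_1 / (p * like_1 + (1.0 - p) * like_0)
    return p


# A reliable worker says Yes, then a weaker worker says No.
print(posterior_one(0.5, [(True, 0.9, 0.1), (False, 0.8, 0.2)]))
```

Because the update multiplies likelihoods, the result does not depend on the order in which answers arrive, which is what lets a single number stand in for the full answer history.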

  28. MOOCs: Application of Filtering. Peer evaluation: crowdsourcing required. Example grades: A+, A, B-, B+. Generalization of boolean filtering to scoring [1-5].

  29. Experiments on MOOCs. Stanford HCI Course: 1000 x 5 x 5 parts = 25000 parts, graded by random peers with known error rates. To study: how much we can reduce error for fixed cost.

  30. Summary: For same cost, reduction in error (distance from correct grade) of:
     • 50% over median
     • 30% over MLE
     • 10-20% over same accuracy
     [VLDB14] Optimal Crowd-Powered Rating and Filtering Algorithms

  31. Efficient Data Processing Algorithms & Systems (recap; Filter and DataSift highlighted)
     • Data Processing Algorithms: Filter [SIGMOD12, VLDB14]; Max [SIGMOD12]; Clean [KDD12, TKDD13]; Categorize [VLDB11]; Search [ICDE14]; Debugging [NIPS12]
     • Data Processing Systems: Deco [CIKM12, VLDB12, TR12, SIGMOD Record 12]; DataSift [HCOMP13, SIGMOD14]; HQuery [CIDR11]
     • Auxiliary Plugins (Quality, Pricing): Confidence [KDD13, TR14]; Eviction [TR12]; Pricing [VLDB15]; Quality [HCOMP14]
     Tradeoffs: Latency, Quality, Cost. i.stanford.edu/~adityagp/scoop.html

  33. VISUAL DATA MANAGEMENT with SeeDB. Aditya Parameswaran, with Hector Garcia Molina, Sam Madden, Alkis Polyzotis, Manasi Vartak.

  34. Simplifying Data Analytics. “Up to a million additional analysts will be needed to address data analytics needs in 2018 in the US alone.” (McKinsey Big Data Report, 2013) How do we make it easier for novice data analysts to get insights from data?

  35. Data Analytics Workflow. [Figure: a query (“Staplers”) produces several views: “Production by State” (all products; MA, CA, IL, NY), “Sales by Year”, and “Production by Year”.] Laborious and tiresome! Can we automate this? Similar issues with Tableau, ShowMe, Profiler, Spotfire.
