how am i going to skim through these data
play

How am I going to skim through these data ? 1 Trends Computers - PowerPoint PPT Presentation

How am I going to skim through these data ? 1 Trends Computers keep getting faster But data grows faster yet! Remember? BIG DATA! Queries are becoming more complex Remember? ANALYTICS! 2 Analytic Queries Analytic


  1. How am I going to skim through these data …? 1

  2. Trends • Computers keep getting faster • But data grows faster yet! – Remember? BIG DATA! • Queries are becoming more complex – Remember? ANALYTICS! 2

  3. Analytic Queries • Analytic queries typically involve aggregates • Simple query – report the total • A CUBE operator in OLAP – allows sales of each region users to drill down or roll up between multiple nodes of the data SELECT SUM(S.sales) cube operation FROM SALES S GROUP BY region SELECT SUM(S.sales) FROM SALES S A query involving multiple aggregates • GROUP BY CUBE(pid, locid, timeid) ─ find the average supplier ‐ quantity supplied by suppliers of a particular {pid, locid, timeid} part SELECT AVG(quantity) {pid, locid} {pid, timeid} {locid, timeid} FROM (SELECT supp, part, SUM(quantity) as quantity FROM lineitem {pid} {locid} {timeid} WHERE part = 10 GROUP BY supp, part); { } 3

  4. System perspective • Aggregation queries – read a large amount of data – a long time to compute – return a very small final result • answers are or derived from summary data 4 4

  5. User perspective • “Big Picture” analytics – Decision makers want to know something about some data quickly – Precise answers typically not required; approximate results are ok – Real ‐ time interaction and control over processing – Visually oriented interface Time System 1 System 2 System 3 1.0000 3.01325 4.32445 7.5654 2.0000 4.54673 6.56784 8.6562 VS 3.0000 5.46571 6.87658 10.3343 5

  6. Challenge Mismatch between user needs and system functionality 6

  7. Drawbacks of Current Systems • Only exact answers are available – A losing proposition as data volume grows – Hardware improvements not sufficient • HCI solution: interactive tools don’t do big jobs – E.g., spreadsheet programs (1 M row, 16k column limit) • Systems solution: big jobs aren’t interactive – No user feedback or control in big DBMS queries (“back to the 60’s”) – Long processing time – Fundamental mismatch with preferred modes of HCI • Best solutions to date – precompute (store answers of queries beforehand), e.g. OLAP – Don’t handle ad hoc queries or data sets well 7

  8. Desirable features for Big Picture Analytics • Early (approximate) answers with guarantees! • Refinement over time • Interaction and ad ‐ hoc control (human in the loop) 100% Online  Traditional Did you see any problem with “Online”? Time 8

  9. Example (Conventional) • Find the average stock price from the nasdaq ‐ db where company = ‘abc’ After 1 second 30 seconds 5 minutes Conventional database 9

  10. Example (Conventional) • Find the average stock price from the nasdaq ‐ db where company = ‘abc’ After 30 minutes Avg Stock Price = 1000 Conventional database 10

  11. Example (Online) • Find the average stock price from the nasdaq ‐ db where company = ‘abc’ After 1 second Avg Stock Price $2031+/ ‐ $523 90% 5% Sampling Progress With online aggregation 11

  12. Example (Online) • Find the average stock price from the nasdaq ‐ db where company = ‘abc’ After 30 seconds Avg Stock Price $1890+/ ‐ $420 95% 15% Sampling Progress With online aggregation 12

  13. Example (Online) • Find the average stock price from the nasdaq ‐ db where company = ‘abc’ After 5 minutes Avg Stock Price $1150+/ ‐ $210 97% 40% Sampling Progress With online aggregation 13

  14. Example (online) • Find the average stock price from the nasdaq ‐ db where company = ‘abc’ After 30 minutes Avg Stock Price $1040+/ ‐ $70 99% 95% Sampling Progress With online aggregation 14

  15. Example: Online Aggregation Additional Features: Speed up Slow down Terminate 15

  16. Example: Online visualization 16

  17. Example: Browsing 17

  18. Key benefit: “Premature” termination After 5 minutes • If acceptably accurate answer reached quickly, the query can be aborted Conventional database Avg Stock Price $1150+/ ‐ $210 97% 40% Sampling Progress With online aggregation Stop early 18

  19. Why Stop Early?? • Save human time (30 min vs 5 min) – Precise vs estimate answers • For exploratory applications • Save machine time – Save cost $$ • Very important when dealing with BIG DATA in the cloud – Pay for what you used (users need to justify the cost to the organization) 19

  20. Analytic queries are costly! QphH = Query ‐ per ‐ Hour Performance 20 20

  21. Solution: Online aggregation • Users must get continual feedback on results of processing – Observe the progress of their queries – Give continually improving partial results: aggregates have running output and confidence interval • Control execution on ‐ the ‐ fly For a retailer, approximate result, Average Sales: $21,712+/ ‐ $47 $21,795+/ ‐ $105 $21,255+/ ‐ $286 $22,131+/ ‐ $523 85% 90% 95% 98% such as $21,712+/ ‐ $47, can provide a good estimation for its 40% 35% 20% 15% daily sale’s statistics. And it is Sampling Progress more cost effective. 21 21

  22. Statistical estimation • Users do not need to set a priori specification of stopping condition • The interface is easier for users with no statistical background • It requires more powerful statistical estimation techniques (Hoeffding’s inequality versus Chebyshev’s inequality) 22 22

  23. Usability goals • Continuous observation • Control of time/precision • Control of fairness/partiality 23 23

  24. Performance goals • Minimum time to accuracy – produce a useful estimate of the final answer ASAP • Minimum time to completion – secondary goal, assume user will terminate processing long before the final answer is produced • Pacing – guarantee a smooth and continuous improving display 24 24

  25. What are the tools?? 25

  26. Random access to data • We need to retrieve data in random order to produce meaningful statistical estimation • At any time, the input to the query is a sample • Input grows over time until • the query is terminated prematurely or • all data examined 26 26

  27. Sampling – design issues • Granularity of sample – Record ‐ level: high I/O cost – Block ‐ level: high variability from clustering • Types of sample – Often simple random sample (SRS) – With/without replacement?? • Data structures from which to sample – Files or relational tables – Indexes (B + trees, etc) 27

  28. Row ‐ level sampling techniques • Maintain file in (pseudo) random order – Sampling = scan – Is file initially in random order? • Statistical tests needed: e.g., Runs test, Kolmogorov ‐ Smirnov test – Can start scans from random positions • Best I/O behavior – Must “freshen” ordering (online reorg) • On ‐ the ‐ fly sampling – Index scan via index on “random” column • Indexed attributes are different from (and not correlated to) aggregated attributes, e.g., name is not correlated to salary – Else get random page, then row within page • Less efficient • Problem: variable number of records on page 28

  29. Sampling from Index Root 17 27 5 13 30 2* 3* 33* 34* 38* 39* 5* 7* 8* 22* 24* 27* 29* 14* 16* 29

  30. How expensive is row ‐ level sampling? 100 Pages fet ched (% ) 80 60 40 20 0 2 4 6 8 2 4 6 8 0 1 2 . . . . . . . . 0 0 0 0 1 1 1 1 Sampling Rat e (% ) 30

  31. Group By operation • Aggregate queries involve “grouping”, e.g., count of grades of students, average sales per month, etc • How are “group ‐ by” queries processed? – Sort on “group ‐ by” attributes, e.g., sort by grades, then count – Hash on “group ‐ by” attributes • Same attribute values will be hashed to the same bucket, e.g., students with the same grade will be grouped together 31

  32. Non ‐ blocking GROUP BY and DISTINCT • Blocking operator – Cannot work on another operator while the current operator is being processed • Sorting is a blocking algorithm – only one group is computed at a time after sorting • Hashing is non ‐ blocking, but … – hash table need to fit in memory to have good performance 32 32

  33. Index striding • For fair Group By: – Want to update all groups • Challenging for groups with small number of records – Want random tuple from Group 1, random tuple from Group 2, ... – Idea • Index gives tuples from a single group • Opens many cursors in index, one per group • Fetch records in round ‐ robin • Can control speed by weighting the schedule – Gives fairness/partiality, info/speed match! 33 33

  34. Conclusion • Big data analytics is becoming increasingly important • Online aggregation (OLA) is a promising direction • OLA mechanisms – Random sampling – Non ‐ blocking schemes – Index striding • Online Hadoop is a variant of Hadoop that offers a flavor of “online aggregation” 34

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend