

SLIDE 1

How am I going to skim through these data …?

SLIDE 2

Trends

  • Computers keep getting faster
  • But data grows faster yet!

    – Remember? BIG DATA!

  • Queries are becoming more complex

    – Remember? ANALYTICS!

SLIDE 3

Analytic Queries

  • Analytic queries typically involve aggregates
  • Simple query – report the total sales of each region

SELECT S.region, SUM(S.sales)
FROM SALES S
GROUP BY S.region

  • A query involving multiple aggregates

    – find the average quantity supplied per supplier for a particular part

SELECT AVG(quantity)
FROM (SELECT supp, part, SUM(quantity) AS quantity
      FROM lineitem
      WHERE part = 10
      GROUP BY supp, part) AS per_supplier;


  • A CUBE operator in OLAP – allows users to drill down or roll up between multiple nodes of the data cube

SELECT SUM(S.sales)
FROM SALES S
GROUP BY CUBE(pid, locid, timeid)

The eight grouping sets of the cube:
{pid, locid, timeid}, {pid, locid}, {pid, timeid}, {locid, timeid}, {pid}, {locid}, {timeid}, { }
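The eight grouping sets are just the power set of the three CUBE dimensions. An illustrative sketch (Python, not part of the original slides) that enumerates them:

```python
from itertools import combinations

def cube_grouping_sets(dims):
    """Enumerate the grouping sets computed by GROUP BY CUBE(dims):
    every subset of the dimension list, from the full set of
    dimensions down to the empty set (the grand total)."""
    return [set(c)
            for k in range(len(dims), -1, -1)
            for c in combinations(dims, k)]

sets = cube_grouping_sets(["pid", "locid", "timeid"])
print(sets)       # 2^3 = 8 grouping sets, ending with set() (grand total)
print(len(sets))
```

Each of these sets corresponds to one node of the data-cube lattice that a user can roll up or drill down through.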

SLIDE 4


System perspective

  • Aggregation queries

    – read a large amount of data
    – take a long time to compute
    – return a very small final result

  • Answers are, or are derived from, summary data

SLIDE 5

User perspective

Time     System 1   System 2   System 3
1.0000   3.01325    4.32445    7.5654
2.0000   4.54673    6.56784    8.6562
3.0000   5.46571    6.87658    10.3343

VS

  • “Big Picture” analytics

    – Decision makers want to know something about some data quickly
    – Precise answers typically not required; approximate results are OK
    – Real‐time interaction and control over processing
    – Visually oriented interface

SLIDE 6

Challenge


Mismatch between user needs and system functionality

SLIDE 7

Drawbacks of Current Systems

  • Only exact answers are available

    – A losing proposition as data volume grows
    – Hardware improvements not sufficient

  • HCI solution: interactive tools don’t do big jobs

    – E.g., spreadsheet programs (1M‐row, 16K‐column limits)

  • Systems solution: big jobs aren’t interactive

    – No user feedback or control in big DBMS queries (“back to the 60’s”)
    – Long processing time
    – Fundamental mismatch with preferred modes of HCI

  • Best solutions to date

    – Precompute (store the answers to queries beforehand), e.g., OLAP
    – Don’t handle ad hoc queries or data sets well

SLIDE 8

[Figure: answer accuracy (up to 100%) vs. time. The “Online” curve climbs quickly toward 100%, while the “Traditional” curve delivers its answer only at the very end.]

Desirable features for Big Picture Analytics

  • Early (approximate) answers with guarantees!
  • Refinement over time
  • Interaction and ad‐hoc control (human in the loop)


Did you see any problem with “Online”?

SLIDE 9

Example (Conventional)

  • Find the average stock price from the nasdaq‐db where company = ‘abc’

After 1 second … 30 seconds … 5 minutes: the conventional database has still returned nothing.

SLIDE 10

Example (Conventional)

  • Find the average stock price from the nasdaq‐db where company = ‘abc’

After 30 minutes, the conventional database finally returns: Avg Stock Price = 1000

SLIDE 11

Example (Online)

  • Find the average stock price from the nasdaq‐db where company = ‘abc’

After 1 second with online aggregation: Sampling Progress 5%, Avg Stock Price = $2031 ± $523 (90% confidence)

SLIDE 12

Example (Online)

  • Find the average stock price from the nasdaq‐db where company = ‘abc’

After 30 seconds with online aggregation: Sampling Progress 15%, Avg Stock Price = $1890 ± $420 (95% confidence)

SLIDE 13

Example (Online)

  • Find the average stock price from the nasdaq‐db where company = ‘abc’

After 5 minutes with online aggregation: Sampling Progress 40%, Avg Stock Price = $1150 ± $210 (97% confidence)

SLIDE 14

Example (Online)

  • Find the average stock price from the nasdaq‐db where company = ‘abc’

After 30 minutes with online aggregation: Sampling Progress 95%, Avg Stock Price = $1040 ± $70 (99% confidence)

SLIDE 15

Example: Online Aggregation

Additional Features:

  • Speed up
  • Slow down
  • Terminate

SLIDE 16

Example: Online visualization

SLIDE 17

Example: Browsing

SLIDE 18

Key benefit: “Premature” termination

  • If an acceptably accurate answer is reached quickly, the query can be aborted

After 5 minutes with online aggregation: Sampling Progress 40%, Avg Stock Price = $1150 ± $210 (97% confidence). Stop early; the conventional database is still running.

SLIDE 19

Why Stop Early??

  • Save human time (30 min vs. 5 min)

    – Precise vs. estimated answers

  • For exploratory applications
  • Save machine time

    – Save cost $$

  • Very important when dealing with BIG DATA in the cloud

    – Pay for what you use (users need to justify the cost to the organization)

SLIDE 20

Analytic queries are costly!

QphH = Query‐per‐Hour Performance (the TPC‐H throughput metric)

SLIDE 21


Solution: Online aggregation

  • Users must get continual feedback on the results of processing

    – Observe the progress of their queries
    – Give continually improving partial results: aggregates have a running output and a confidence interval

  • Control execution on‐the‐fly

Sampling Progress   Average Sales
15%                 $22,131 ± $523 (85%)
20%                 $21,255 ± $286 (90%)
35%                 $21,795 ± $105 (95%)
40%                 $21,712 ± $47  (98%)

For a retailer, an approximate result such as $21,712 ± $47 can provide a good estimate of its daily sales statistics, and it is more cost‐effective.
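A minimal sketch of this running-estimate loop (illustrative Python; the synthetic data, batch size, and normal-approximation confidence interval are assumptions, not part of the slides):

```python
import math
import random

def online_average(data, batch, z=1.96):
    """Online aggregation sketch: visit rows in random order and, every
    `batch` rows, yield (sampling progress, running mean, CI half-width).
    The +/- interval uses a normal approximation and shrinks as the
    sample grows."""
    order = list(range(len(data)))
    random.shuffle(order)                 # random access to the data
    total = total_sq = 0.0
    for i, idx in enumerate(order, 1):
        x = data[idx]
        total += x
        total_sq += x * x
        if i % batch == 0 or i == len(order):
            mean = total / i
            var = max(total_sq / i - mean * mean, 0.0)
            yield i / len(order), mean, z * math.sqrt(var / i)

random.seed(1)
prices = [random.gauss(1000.0, 200.0) for _ in range(10_000)]
snapshots = list(online_average(prices, batch=2_500))
for progress, mean, half in snapshots:
    print(f"{progress:5.0%}  avg = {mean:8.2f} +/- {half:6.2f}")
```

The half-width falls roughly as 1/sqrt(n), which is why the display can keep improving smoothly until the user chooses to stop the query.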

SLIDE 22


Statistical estimation

  • Users do not need to set an a priori specification of the stopping condition
  • The interface is easier for users with no statistical background
  • It requires more powerful statistical estimation techniques (Hoeffding’s inequality versus Chebyshev’s inequality)
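To see why Hoeffding's inequality is the more powerful tool, compare the two tail bounds on the error of a sample mean of values bounded in [0, 1] (an illustrative sketch; the error tolerance and sample sizes are assumptions, not from the slides):

```python
import math

def chebyshev_bound(n, eps, var):
    """Chebyshev: P(|sample_mean - mu| >= eps) <= var / (n * eps^2)."""
    return var / (n * eps * eps)

def hoeffding_bound(n, eps, lo, hi):
    """Hoeffding, for samples bounded in [lo, hi]:
    P(|sample_mean - mu| >= eps) <= 2 * exp(-2 * n * eps^2 / (hi - lo)^2)."""
    return 2.0 * math.exp(-2.0 * n * eps * eps / (hi - lo) ** 2)

# Worst-case variance of a [0, 1]-bounded variable is 1/4.
for n in (1_000, 10_000, 100_000):
    c = chebyshev_bound(n, eps=0.05, var=0.25)
    h = hoeffding_bound(n, eps=0.05, lo=0.0, hi=1.0)
    print(f"n={n:7d}  Chebyshev <= {c:.2e}  Hoeffding <= {h:.2e}")
```

Chebyshev's bound shrinks only linearly in n, while Hoeffding's shrinks exponentially, so it can certify a tight confidence interval from far fewer samples.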

SLIDE 23


Usability goals

  • Continuous observation
  • Control of time/precision
  • Control of fairness/partiality

SLIDE 24


Performance goals

  • Minimum time to accuracy

    – produce a useful estimate of the final answer ASAP

  • Minimum time to completion

    – secondary goal; assume the user will terminate processing long before the final answer is produced

  • Pacing

    – guarantee a smooth and continuously improving display

SLIDE 25

What are the tools??

SLIDE 26


Random access to data

  • We need to retrieve data in random order to produce meaningful statistical estimates
  • At any time, the input to the query is a sample
  • The input grows over time until

    – the query is terminated prematurely, or
    – all the data has been examined

SLIDE 27

Sampling – design issues

  • Granularity of sample

    – Record‐level: high I/O cost
    – Block‐level: high variability from clustering

  • Types of sample

    – Often simple random sample (SRS)
    – With/without replacement??

  • Data structures from which to sample

    – Files or relational tables
    – Indexes (B+ trees, etc.)
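The with/without-replacement distinction comes down to whether a drawn row can be drawn again. An illustrative sketch in Python (the tiny row-id population is a made-up stand-in for a table):

```python
import random

row_ids = list(range(1, 11))          # stand-in for the rows of a table

random.seed(7)
# Simple random sample WITHOUT replacement: every drawn row id is distinct.
without = random.sample(row_ids, k=5)
# Sampling WITH replacement: the same row id may be drawn more than once.
with_repl = [random.choice(row_ids) for _ in range(5)]

print("without replacement:", sorted(without))
print("with replacement:   ", with_repl)
```

Without replacement, the sample converges to the full data set as it grows; with replacement, the estimator math is often simpler because draws are independent.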

SLIDE 28

Row‐level sampling techniques

  • Maintain the file in (pseudo) random order

    – Sampling = scan
    – Is the file initially in random order?

      • Statistical tests needed: e.g., runs test, Kolmogorov‐Smirnov test

    – Can start scans from random positions

      • Best I/O behavior

    – Must “freshen” the ordering (online reorganization)

  • On‐the‐fly sampling

    – Index scan via an index on a “random” column

      • Indexed attributes should be different from (and not correlated with) the aggregated attributes, e.g., name is not correlated with salary

    – Else get a random page, then a row within the page

      • Less efficient
      • Problem: variable number of records on a page
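The "variable number of records on a page" problem is that picking a random page and then a random row over-weights rows sitting on sparsely filled pages. A standard correction is acceptance/rejection against the maximum page occupancy; a sketch in illustrative Python (the page layout is invented):

```python
import random

def sample_row(pages, max_rows_per_page, rng=random):
    """Draw one row uniformly at random from a paged file whose pages
    hold a variable number of rows. Accepting a chosen page with
    probability len(page) / max_rows_per_page (and retrying otherwise)
    equalizes every row's overall probability of being drawn."""
    while True:
        page = rng.choice(pages)
        if rng.random() < len(page) / max_rows_per_page:
            return rng.choice(page)

pages = [[1, 2, 3, 4], [5], [6, 7]]   # hypothetical file: 3 pages, 7 rows
random.seed(0)
counts = {r: 0 for r in range(1, 8)}
for _ in range(70_000):
    counts[sample_row(pages, max_rows_per_page=4)] += 1
print(counts)   # each of the 7 rows is drawn roughly 10,000 times
```

Without the acceptance step, row 5 (alone on its page) would be drawn about four times as often as each row on the full page.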

SLIDE 29

Sampling from Index

[Figure: a B+ tree index. Root node with keys 17 and 30; inner nodes with keys 5, 13 and 27; leaf entries 2*, 3*, 5*, 7*, 8*, 14*, 16*, 22*, 24*, 27*, 29*, 33*, 34*, 38*, 39*.]

SLIDE 30

How expensive is row‐level sampling?

[Figure: pages fetched (%) vs. sampling rate (%). As the row‐level sampling rate rises from 0.2% to 2%, the fraction of pages fetched climbs from roughly 20% toward 100%.]

SLIDE 31

Group By operation

  • Aggregate queries involve “grouping”, e.g., count of students per grade, average sales per month, etc.
  • How are “group‐by” queries processed?

    – Sort on the “group‐by” attributes, e.g., sort by grade, then count
    – Hash on the “group‐by” attributes

      • The same attribute values hash to the same bucket, e.g., students with the same grade are grouped together

SLIDE 32


Non‐blocking GROUP BY and DISTINCT

  • Blocking operator

    – Other operators cannot make progress while the current operator is still processing

  • Sorting is a blocking algorithm

    – only one group is computed at a time, after the sort completes

  • Hashing is non‐blocking, but …

    – the hash table needs to fit in memory to get good performance
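A sketch of why hashing is non-blocking for GROUP BY (illustrative Python; the tiny sales stream is invented): as rows stream in, a running average for every group is available at any moment, instead of only after a full sort.

```python
from collections import defaultdict

def streaming_group_avg(rows):
    """Non-blocking, hash-based GROUP BY sketch: after every input row,
    yield a snapshot of the running average for every group seen so far."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for key, value in rows:
        sums[key] += value       # same key always hashes to the same bucket
        counts[key] += 1
        yield {k: sums[k] / counts[k] for k in sums}

sales = [("east", 10.0), ("west", 20.0), ("east", 30.0), ("west", 40.0)]
for snapshot in streaming_group_avg(sales):
    print(snapshot)
# final snapshot: {'east': 20.0, 'west': 30.0}
```

A sort-based plan could emit the "east" average only after all input was sorted; here every group's estimate improves continuously, which is exactly what online aggregation needs.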

SLIDE 33


Index striding

  • For a fair GROUP BY:

    – Want to update all groups

      • Challenging for groups with a small number of records

    – Want a random tuple from Group 1, a random tuple from Group 2, ...
    – Idea

      • An index gives tuples from a single group
      • Open many cursors in the index, one per group
      • Fetch records in round‐robin order
      • Can control speed by weighting the schedule

    – Gives fairness/partiality and an info/speed match!

SLIDE 34

Conclusion

  • Big data analytics is becoming increasingly important
  • Online aggregation (OLA) is a promising direction
  • OLA mechanisms

    – Random sampling
    – Non‐blocking schemes
    – Index striding

  • Online Hadoop is a variant of Hadoop that offers a flavor of “online aggregation”
