TRUST BUT VERIFY: Optimistic Visualizations of Approximate Queries


TRUST BUT VERIFY: Optimistic Visualizations of Approximate Queries for Exploring Big Data. Dominik Moritz (@domoritz), Paul G. Allen School of CSE, University of Washington; Danyel Fisher (@FisherDanyel), Bolin Ding (@AtlasDing), and Chi Wang, HCI and DMX, Microsoft Research.


SLIDE 1

TRUST BUT VERIFY

Optimistic Visualizations of Approximate Queries for Exploring Big Data

Dominik Moritz @domoritz
Paul G. Allen School of CSE, University of Washington

Danyel Fisher @FisherDanyel, Bolin Ding @AtlasDing, Chi Wang
HCI and DMX, Microsoft Research

SLIDE 2


What's the distribution of flight distances?

SLIDE 3


Computer by Simple Icons from the Noun Project

Visual Analysis

analyst by Gregor Cresnar from the Noun Project

$ wget https://www.transtats.bts.gov/download.zip
==========================================> 70GB
> Done

$ import download.zip
> Done

$ SELECT bin(distance), count(*)
  FROM flights
> Running Query. Please wait ...
SLIDE 4


Visual Analysis

Computer by Simple Icons from the Noun Project
analyst by Gregor Cresnar from the Noun Project

SLIDE 5


Big Data Visual Analysis

Coffee by jeff from the Noun Project

Query finished!

SLIDE 6

State of the Art in Big Data Exploration


$ SELECT bin(distance), count(*)
  FROM flights
>

$ SELECT bin(distance), count(*)
  FROM flights
  WHERE airline = 'hi'
> Running Query. Please wait ...
SLIDE 7


State of the Art in Big Data Exploration

Distributed Systems
 Expensive and high latency.
Indexes (Data Cubes)
 Requires precomputation and supports only limited queries.
Sampling
 Use a representative subset of the data.

Cluster servers by Branis Panos from the Noun Project
Rubik's Cube by Aleks from the Noun Project

Sampling

SLIDE 8

Sampling and Approximate Query Processing (AQP)

Use a representative subset of the data and estimate the true values of aggregate results.

SLIDE 9

Sampling and Approximate Query Processing (AQP)

Use a representative subset of the data and estimate the true values of aggregate results.
Decide on an acceptable uncertainty or timeout.

Sum of 25% = 42
Estimate: Sum of 100% = 168 ±10 (uncertainty)
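A minimal Python sketch of the estimate above, assuming a uniform random sample and a normal-approximation (central limit theorem) confidence interval. The slides do not say how the ±10 is computed, so `estimate_sum`, the sample values, and the 95% level are illustrative assumptions, not the method used in the talk.

```python
import math
import statistics


def estimate_sum(sample, population_size, z=1.96):
    """Estimate a population SUM from a uniform random sample.

    Returns (estimate, half_width): the scaled-up sum and an approximate
    confidence half-width (z = 1.96 corresponds to roughly 95%).
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    # Sample standard deviation (n - 1 in the denominator).
    stdev = statistics.stdev(sample)
    estimate = population_size * mean
    # Finite population correction: the error shrinks to zero as the
    # sample approaches the full data set.
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    half_width = z * population_size * stdev / math.sqrt(n) * fpc
    return estimate, half_width


# Hypothetical data: a 25% sample (4 of 16 rows) whose values sum to 42.
sample = [9, 12, 10, 11]
estimate, half_width = estimate_sum(sample, population_size=16)
print(f"Sum of 100% = {estimate:.0f} ±{half_width:.0f}")
```

With only four rows the normal approximation is rough; the point is the shape of the calculation (scale the sample up, attach an uncertainty), not the exact bound.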

SLIDE 10

Progressive Visualization with Online Aggregation

Growing sample ➞ continuously improving results
Analysts watch updates until error bounds are low enough

Sum of 25% = 42 ➞ Sum of 100% = 168 ±10
Sum of 35% = 59 ➞ Sum of 100% = 168 ±5
Sum of 50% = 84 ➞ Sum of 100% = 168 ±1

Query finished!
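The progressive updates can be sketched as a Python generator that re-estimates the total after each batch of rows. This assumes the rows are already in random order, so that every prefix is a uniform sample; `progressive_sum` and the data are hypothetical and only illustrate the idea of online aggregation.

```python
def progressive_sum(rows, batch_size):
    """Yield (fraction_seen, estimated_total) after each batch of rows.

    Assumes `rows` is already in random order, so every prefix is a
    uniform sample of the full data.
    """
    total_rows = len(rows)
    running_sum = 0
    for start in range(0, total_rows, batch_size):
        batch = rows[start:start + batch_size]
        running_sum += sum(batch)
        seen = min(start + batch_size, total_rows)
        fraction = seen / total_rows
        # Scale the partial sum up to the full data size.
        yield fraction, running_sum / fraction


rows = [10, 12, 9, 11, 10, 8, 13, 11]  # hypothetical, pre-shuffled values
updates = list(progressive_sum(rows, batch_size=2))
for fraction, estimate in updates:
    print(f"After {fraction:.0%} of the rows: estimated sum = {estimate:.1f}")
```

A visualization would redraw on every yielded update; once the fraction reaches 100% the estimate equals the exact sum.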

SLIDE 11


Challenges with AQP

$ SELECT bin(distance), count(*)
  FROM flights
  WHERE airline = 'hi'
> No Results

$ SELECT bin(distance), count(*)
  FROM flights
  WHERE airline = 'ha'
> Running Query. Please wait ...
SLIDE 12


Challenges with AQP

Approximate results ➞ Convey uncertainty
Probabilistic guarantees
Unbounded errors
Arbitrary aggregation or joins

[Figure labels: Estimate, Max]

SLIDE 13


Optimistic Visualization

A UX approach to challenges with AQP that are traditionally treated as database problems.

SLIDE 14


Optimistic Visualization

Assume that the approximation is mostly right, but offer a way to detect and recover from mistakes.
Analysts use initial estimates, run the precise query in the background, and confirm results later.
Gives users confidence in using AQP.
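A minimal Python sketch of this workflow, assuming a histogram query over hypothetical flight distances. The helper names (`histogram`, `approximate_histogram`, `differences`) are made up for illustration and are not Pangloss APIs; a systematic every-third-row sample stands in for a real AQP engine to keep the example deterministic.

```python
from concurrent.futures import ThreadPoolExecutor


def histogram(values, bin_width):
    """Count values per bin; bins are keyed by their lower edge."""
    counts = {}
    for v in values:
        edge = (v // bin_width) * bin_width
        counts[edge] = counts.get(edge, 0) + 1
    return counts


def approximate_histogram(values, bin_width, step):
    """Histogram of every `step`-th value, scaled up by `step`.

    Assumption: a systematic sample keeps this sketch deterministic;
    a real AQP engine would use a proper random sample.
    """
    sample_counts = histogram(values[::step], bin_width)
    return {edge: count * step for edge, count in sample_counts.items()}


def differences(approx, precise):
    """Per-bin difference (precise - approximate) over all bins."""
    bins = sorted(set(approx) | set(precise))
    return {b: precise.get(b, 0) - approx.get(b, 0) for b in bins}


# Hypothetical flight distances in miles.
distances = [120, 340, 560, 150, 900, 2500, 310, 480, 130, 700, 220, 2600]

with ThreadPoolExecutor(max_workers=1) as pool:
    # Start the precise query in the background ...
    precise_future = pool.submit(histogram, distances, 500)
    # ... and keep exploring with the approximate answer right away.
    approx = approximate_histogram(distances, 500, step=3)
    # Later: verify the approximation against the precise result.
    precise = precise_future.result()

diff = differences(approx, precise)
print(diff)
```

In this toy data the difference reveals a 2,500+ mile bin that the sample missed entirely, which is the kind of mistake the verification step is meant to surface.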

SLIDE 15

Pangloss implements Optimistic Visualization


Query Specification

SLIDE 16

Pangloss implements Optimistic Visualization


Visualization View

SLIDE 17

Pangloss implements Optimistic Visualization


Approximation
Expected Error (Uncertainty)

SLIDE 18

Pangloss implements Optimistic Visualization


Annotation + Remember Button

SLIDE 19

Pangloss implements Optimistic Visualization


History

SLIDE 20

Pangloss implements Optimistic Visualization


SLIDE 21

170 Million flights (30 years). ~100ms query time

SLIDE 22

Text annotations help analysts clarify observations.
SLIDE 23

"Remember" button moves query into the background

SLIDE 24

Continue exploration without waiting

SLIDE 25

Orange ➞ Approximate
Blue ➞ Precise

SLIDE 26

Difference Visualization

SLIDE 27


Evaluation

Lab Study
 5 users
 Flight delay data (170 Million records)
 1 hour each

Case Study
 3 teams
 Product insights, Social media, Bing
 ~1+ hour exploration

SLIDE 28


Findings from the study

AQP works: “seeing something right away at first glimpse is really great”

Optimism works: “I was thinking what to do next, and I saw that it had loaded, so I went back and checked it . . . [the passive update is] very nice for not interrupting your workflow.”

Need for guarantees: “[with a competitor] I was willing to wait 70-80 seconds. It wasn’t ideally interactive, but it meant I was looking at all the data.”

SLIDE 29


Findings from the study (cont)

“When I’m using your system, there is a path that I need to follow.”

“Now that I’ve been sitting here for an hour, after I go back, it makes a lot of sense [to have these annotations], but as I was doing it, I was thinking, ‘I want to move on, I want to move on.’”

SLIDE 30


Conclusions

Fundamental problems with AQP addressed as a UX problem
Gives analysts confidence in AQP
Future: Alerting, Remembering, Progressive + Optimistic

SLIDE 31


AQP needs Multi-Disciplinary Solutions

Bolin - DB
Chi - DB
Danyel - HCI
Dominik - Vis+DB

SLIDE 32

Implications for the Database Community

HILDA at SIGMOD 2017

SLIDE 33


Trust But Verify: Optimistic Visualizations for AQP

Fundamental problems with AQP addressed as a UX problem
Optimistic Visualization gives analysts confidence in AQP
Integrates well into existing Visual Analysis tools
Future: Alerting, Remembering, Progressive
Details: bit.ly/2pwQQg7

Dominik Moritz @domoritz, Danyel Fisher @FisherDanyel, Bolin Ding @AtlasDing, Chi Wang

Query finished!

SLIDE 34

Backup Slides


SLIDE 35


Histogram of Distances for Hawaiian Airlines

SLIDE 36


Distribution Uncertainty

[Figure: panels labeled Approximation, Within Distribution Uncertainty, and Outside Distribution Uncertainty, each annotated with an Error and a Sum; the individual values did not survive extraction.]

Distribution Uncertainty: 4

SLIDE 37


Distribution Uncertainty

SLIDE 38


Filtering can show new groups

new predicate → new query → different sample → different groups

SLIDE 39


Precise results can show new groups

Approximate Precise

SLIDE 40


Vocabulary of visual cues

Heatmap Barchart