Information Visualization Aggregate & Filter Tamara Munzner - - PowerPoint PPT Presentation


Information Visualization Aggregate & Filter Tamara Munzner Department of Computer Science University of British Columbia Lect 17, 10 Mar 2020 https://www.cs.ubc.ca/~tmm/courses/436V-20 Upcoming Foundations 5: out Thu Mar 12, due Wed


slide-1
SLIDE 1

https://www.cs.ubc.ca/~tmm/courses/436V-20

Information Visualization Aggregate & Filter

Tamara Munzner Department of Computer Science University of British Columbia

Lect 17, 10 Mar 2020

slide-2
SLIDE 2

Upcoming

  • Foundations 5: out Thu Mar 12, due Wed Mar 18 11:59pm
  • Milestone 2: due Wed Mar 25 11:59pm

–(with the update announced last week: schedule status component)

2

slide-3
SLIDE 3

Correction

3

slide-4
SLIDE 4

Idiom: Small multiples

  • encoding: same
  • data: none shared

–different items (different condition keys, same gene keys), same attributes: expression values for node colors
–(same network layout for nodes=genes)

  • navigation: shared

4

System: Cerebral

[Cerebral: Visualizing Multiple Experimental Conditions on a Graph with Biological Context. Barsky, Munzner, Gardy, and Kincaid. IEEE Trans. Visualization and Computer Graphics (Proc. InfoVis 2008) 14:6 (2008), 1253–1260.]

slide-5
SLIDE 5

Reminder

5

slide-6
SLIDE 6

Beyond slides: Textbook for further reading (optional)

  • Intro

– Ch 1. What's Vis, and Why Do It?

  • Data Abstraction

– Ch 2. What: Data Abstraction
– Ch 4. Analysis: Four Levels for Validation

  • Task Abstraction

– Ch 3. Why: Task Abstraction

  • Marks & Channels

– Ch 5. Marks and Channels

  • Multivariate Tables

– Ch 7. Arrange Tables

  • Interactive Views

– Ch 11. Manipulate View
– Ch 12. Facet into Multiple Views

  • Maps

– Ch 8. Arrange Spatial Data (only 8.1-8.3)

  • Color

– Ch 10. Map Color and Other Channels

  • Networks & Trees

– Ch 9. Arrange Networks and Trees

  • Aggregation

– Ch 13. Reduce Items and Attributes
– Ch 14. Embed: Focus+Context

  • Rules of Thumb (upcoming)

– Ch 6. Rules of Thumb

6

Visualization Analysis & Design, free through library: catalog page EZProxy direct link

slide-7
SLIDE 7

Filter & Aggregate

7

slide-8
SLIDE 8

Exercise: Too much stuff

  • Cars dataset: 7 attributes

–MPG: quantitative
–Cylinders: ordinal
–Horsepower: quantitative
–Weight: quantitative
–Acceleration: quantitative
–Model Year: ordinal
–Origin: categorical

  • This table has 100 million items
  • Pair up, discuss how to have scalable approach, create sketch to illustrate

–[8 min]
–Socrative: true when done

8

slide-9
SLIDE 9

How to handle complexity: 1 previous strategy + 3 more

9

  • Derive: derive new data to show within view
  • Manipulate (Change, Select, Navigate): change view over time
  • Facet (Juxtapose, Partition, Superimpose): facet across multiple views
  • Reduce (Filter, Aggregate, Embed): reduce items/attributes within single view

slide-10
SLIDE 10

10

[Figure: How? design choices. Encode: Arrange (Express, Separate, Order, Align, Use); Map from categorical and ordered attributes (Color: Hue, Saturation, Luminance; Size, Angle, Curvature, ...; Shape; Motion: Direction, Rate, Frequency, ...). Manipulate: Change, Select, Navigate. Facet: Juxtapose, Partition, Superimpose. Reduce: Filter, Aggregate, Embed.]

slide-11
SLIDE 11

11

[Figure: Reducing Items and Attributes. Filter: Items, Attributes. Aggregate: Items, Attributes.]

slide-12
SLIDE 12

Reduce items and attributes

12

  • reduce/increase: inverses
  • filter

–pro: straightforward and intuitive to understand and compute
–con: out of sight, out of mind

  • aggregation

–pro: inform about whole set
–con: difficult to avoid losing signal

  • not mutually exclusive

–combine filter, aggregate
–combine reduce, change, facet


slide-13
SLIDE 13

Filter

  • eliminate some elements

–either items or attributes

  • according to what?

–any possible function that partitions dataset into two sets

  • attribute values bigger/smaller than x
  • noise/signal
  • filters vs queries

–query: start with nothing, add in elements
–filters: start with everything, remove elements
–best approach depends on dataset size
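The filter/query distinction above can be sketched in a few lines. This is only an illustration: the records, their values, and the helper names `partition` and `efficient` are invented here, with attribute names borrowed from the Cars exercise.

```python
# Hypothetical car records; values are invented for illustration.
cars = [
    {"name": "a", "mpg": 18.0, "cylinders": 8},
    {"name": "b", "mpg": 31.5, "cylinders": 4},
    {"name": "c", "mpg": 24.0, "cylinders": 6},
    {"name": "d", "mpg": 36.0, "cylinders": 4},
]


def partition(items, predicate):
    """Split items into (kept, removed) using any boolean function."""
    kept = [item for item in items if predicate(item)]
    removed = [item for item in items if not predicate(item)]
    return kept, removed


def efficient(car):
    """Example predicate: attribute value bigger than a threshold x."""
    return car["mpg"] > 25


# Filter: start with everything, remove elements that fail the predicate.
shown, hidden = partition(cars, efficient)

# Query: start with nothing, add in elements that match.
result = []
for car in cars:
    if car["cylinders"] == 4:
        result.append(car)

print([c["name"] for c in shown])
print([c["name"] for c in hidden])
print([c["name"] for c in result])
```

Either direction produces the same kind of two-set partition; which one is more practical depends on how large the dataset is relative to the subset of interest.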

13


slide-14
SLIDE 14

Idiom: FilmFinder

  • dynamic queries/filters for items

–tightly coupled interaction and visual encoding idioms, so user can immediately see results of action

14

[Ahlberg & Shneiderman, Visual Information Seeking: Tight Coupling of Dynamic Query Filters with Starfield Displays. CHI 1994.]

slide-15
SLIDE 15

Idiom: cross filtering

  • item filtering
  • coordinated views/controls combined
  • all scented histogram bisliders update when any ranges change
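A rough sketch of the cross-filtering idea, not the Crossfilter library's API: every histogram is recomputed whenever any range changes. The flight records, the `ranges` dictionary, and the helper names are assumptions made for this example, as is the choice that each histogram ignores the range on its own attribute.

```python
from collections import Counter

# Hypothetical flight records; values are invented.
flights = [
    {"hour": 6, "delay": 5},
    {"hour": 6, "delay": 40},
    {"hour": 18, "delay": 10},
    {"hour": 18, "delay": 55},
    {"hour": 12, "delay": 0},
]

# Current slider ranges per attribute: inclusive lower, exclusive upper.
ranges = {"hour": (0, 24), "delay": (0, 60)}


def passes(item, ranges, skip=None):
    """True if the item lies inside every range except the skipped attribute."""
    return all(
        lo <= item[attr] < hi
        for attr, (lo, hi) in ranges.items()
        if attr != skip
    )


def histogram(items, attr, bin_width, ranges):
    """Counts per bin of attr over items passing the filters on all other attributes."""
    counts = Counter()
    for item in items:
        if passes(item, ranges, skip=attr):
            counts[(item[attr] // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


before = histogram(flights, "hour", 6, ranges)
ranges["delay"] = (30, 60)  # the user narrows the delay slider
after = histogram(flights, "hour", 6, ranges)
print(before, after)
```

Skipping a histogram's own attribute keeps the full distribution visible under its slider, which is what lets the widget act as a scent for where to drag next.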

15

System: Crossfilter

[http://square.github.io/crossfilter/]

slide-16
SLIDE 16

Idiom: cross filtering

16

[https://www.nytimes.com/interactive/2014/upshot/buy-rent-calculator.html?_r=0]

slide-17
SLIDE 17

Aggregate

  • a group of elements is represented by a smaller number of derived elements

17


slide-18
SLIDE 18

Idiom: histogram

  • static item aggregation
  • task: find distribution
  • data: table
  • derived data

–new table: keys are bins, values are counts

  • bin size crucial

–pattern can change dramatically depending on discretization
–opportunity for interaction: control bin size on the fly
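A minimal sketch of the derived table behind a histogram (keys are bins, values are counts), showing how the visible pattern depends on bin size. The sample values and the function name `histogram_table` are invented for illustration.

```python
def histogram_table(values, bin_width, start=0.0):
    """Return a derived table mapping each bin's lower edge to its item count."""
    table = {}
    for v in values:
        index = int((v - start) // bin_width)
        lower = start + index * bin_width
        table[lower] = table.get(lower, 0) + 1
    return dict(sorted(table.items()))


# Invented sample with two groups of values.
weights = [1.0, 1.5, 2.0, 2.5, 6.0, 6.5, 7.0, 7.5]

coarse = histogram_table(weights, 10.0)  # one wide bin hides the two groups
fine = histogram_table(weights, 4.0)     # narrower bins separate them
print(coarse)
print(fine)
```

Exposing `bin_width` as an interactive control is one way to let the viewer check whether a pattern survives a change of discretization.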

18

[Figure: histogram of item counts by Weight Class (lbs)]

slide-19
SLIDE 19

Histograms explained

  • also great example of scrollytelling!

19

http://tinlizzie.org/histograms/

slide-20
SLIDE 20

Histogram bins

  • good # bins hard to predict

–make it interactive when possible

  • rules of thumb

–# bins = sqrt(n)
–# bins = log2(n)+1
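The two rules of thumb can be compared directly. Rounding up to a whole number of bins is an assumption of this sketch (the slide gives only the formulas); the second rule is commonly known as Sturges' formula.

```python
import math


def bins_sqrt(n):
    """Square-root rule: # bins = sqrt(n), rounded up."""
    return math.ceil(math.sqrt(n))


def bins_log2(n):
    """Log rule: # bins = log2(n) + 1, with the logarithm rounded up."""
    return math.ceil(math.log2(n)) + 1


for n in (100, 1024, 1_000_000):
    print(n, bins_sqrt(n), bins_log2(n))
```

The two rules diverge quickly as n grows, which is one more reason to treat them as starting points for an interactive control rather than as final answers.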

20

[Figure: histograms of # passengers by age, with 20 bins vs. 10 bins]

slide-21
SLIDE 21

Idiom: scented widgets

  • augmented widgets show information scent

–better cues for information foraging: show whether value in drilling down further vs looking elsewhere

  • concise use of space: histogram on slider

21

[Scented Widgets: Improving Navigation Cues with Embedded Visualizations. Willett, Heer, and Agrawala. IEEE TVCG (Proc. InfoVis 2007) 13:6 (2007), 1129–1136.]

[Multivariate Network Exploration and Presentation: From Detail to Overview via Selections and Aggregations. van den Elzen, van Wijk, IEEE TVCG 20(12): 2014 (Proc. InfoVis 2014).]

slide-22
SLIDE 22

Scented histogram bisliders: detailed

22

[ICLIC: Interactive categorization of large image collections. van der Corput and van Wijk. Proc. PacificVis 2016.]
slide-23
SLIDE 23

Example: Keshif

  • interactive item filtering with scented widgets

–also: interaction speed w/ scatterplot vs list view

23

https://keshif.me/gallery/olympics

slide-24
SLIDE 24

Interactive legends

  • controls combining

–visual representation of static legends w/
–interaction mechanisms of widgets

  • define & control visual display together

24

Riche 2010

slide-25
SLIDE 25

Idiom: boxplot

  • static item aggregation
  • task: find distribution
  • data: table
  • derived data

–5 quant attribs

  • median: central line
  • lower and upper quartile: boxes
  • lower and upper fences: whiskers

– values beyond which items are outliers

–outliers beyond fence cutoffs explicitly shown

  • scalability

–unlimited number of items!
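A sketch of the derived data behind a boxplot. The fence rule used here (quartiles plus or minus 1.5 times the interquartile range) is the common Tukey convention and is an assumption, since the slide does not state a cutoff; the sample values and the name `boxplot_stats` are invented.

```python
import statistics


def boxplot_stats(values, k=1.5):
    """Derive median, quartiles, fences, and outliers from raw items."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(stats)
```

However many items go in, the summary stays the same size (plus the explicitly drawn outliers), which is where the scalability claim comes from.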

25

[Figure: boxplots of four distributions, labeled n, s, k, mm]

[40 years of boxplots. Wickham and Stryjewski. 2012. had.co.nz]

slide-26
SLIDE 26

Boxplots

  • aka box-and-whisker plots

–show outliers as points

  • bad for non-normal distributions
  • really bad for bimodal or multimodal distributions

26

[wikipedia]

slide-27
SLIDE 27

Boxplots: Drawbacks

  • four distributions with same boxplot

27

http://stat.mq.edu.au/wp-content/uploads/2014/05/Can_the_Box_Plot_be_Improved.pdf

slide-28
SLIDE 28

Violin plots

  • boxplot + probability density function

28

https://towardsdatascience.com/violin-plots-explained-fb1d115e023d

slide-29
SLIDE 29

Density plots

  • aka kernel density plots, kernel density estimation (KDE)

–smoothed, continuous version of a histogram estimated from data
–continuous curve (the kernel, usually Gaussian bell curve) drawn at each data point
–add curves together for single smooth density estimation

  • bandwidth influences estimate
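The "one kernel per data point, then add them up" description can be written out as a short sketch. It assumes a Gaussian kernel; the sample values and the two bandwidths are invented for illustration.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kde(samples, bandwidth):
    """Return a density function: the average of one scaled kernel per sample."""
    n = len(samples)

    def density(x):
        total = sum(gaussian_kernel((x - s) / bandwidth) for s in samples)
        return total / (n * bandwidth)

    return density


samples = [0.8, 1.0, 1.2, 5.0, 5.3]  # invented, two groups
narrow = kde(samples, 0.3)  # small bandwidth: two separate bumps
wide = kde(samples, 3.0)    # large bandwidth: one smooth hump

for x in (1.0, 3.0, 5.0):
    print(x, round(narrow(x), 4), round(wide(x), 4))
```

With the narrow bandwidth the estimate drops to nearly zero between the two groups, while the wide bandwidth fills the gap in; neither is "correct" on its own, which is why interactive bandwidth control is useful.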

29

https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0

KDE wikipedia

slide-30
SLIDE 30

KDE in D3: Interactive bandwidth controls

30

https://observablehq.com/@d3/kernel-density-estimation

slide-31
SLIDE 31

Idiom: Continuous scatterplot

  • static item aggregation
  • data: table
  • derived data: table

– key attribs x,y for pixels
– quant attrib: overplot density

  • dense space-filling 2D matrix
  • color: sequential categorical hue + ordered luminance colormap

  • scalability

– no limits on overplotting: millions of items
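The derived table (pixel keys x,y with a density value) can be approximated by simple 2D binning. Note that this is a much simpler stand-in, not the method of the Continuous Scatterplots paper; the function name, grid size, and points are invented for illustration.

```python
def density_grid(points, width, height, x_range, y_range):
    """Count how many points fall in each cell of a width x height grid."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the plotted extent
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col] += 1
    return grid


points = [(0.1, 0.1), (0.15, 0.12), (0.9, 0.9), (0.12, 0.18)]
grid = density_grid(points, 2, 2, (0.0, 1.0), (0.0, 1.0))
print(grid)
```

The grid has a fixed size regardless of how many items are binned into it, so each cell's count can be mapped to color without any overplotting.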

31

[Continuous Scatterplots. Bachthaler and Weiskopf. IEEE TVCG (Proc. Vis 08) 14:6 (2008), 1428–1435.]

slide-32
SLIDE 32

Aggregate 2

32

slide-33
SLIDE 33

News

  • Online lectures and office hours start today, using Zoom: https://zoom.us/j/9016202871
  • Lecture mode

–Plan: I livestream with video + audio + screenshare, will also try recording.
–You'll be able to just join the session
–Please connect audio-only, no video, to avoid congestion
–You'll be auto-muted. If you have a question use the Show Hand (click on Participants, button is at the bottom of the popup window), I'll unmute you myself

  • Office hours mode

–Please do connect with video if possible, in addition to audio
–I'll use the Waiting Room feature, where I will individually allow you in

  • If I'm already talking to somebody else I'll briefly let you know, then put you back in WR until it's your turn.

33

slide-34
SLIDE 34

News

  • Labs will be Zoom + Canvas scheduling

–different Zoom URL for each TA, stay tuned
–you can sign up for reserved slots in advance, or check for availability on the fly
–more details soon

  • Final exam plan still TBD

–but will not be in person
–you are free to leave campus when you want (but are not required to do so)

34

slide-35
SLIDE 35

Schedule shift

  • Nothing due this Wed
  • M2 & M3 on schedule

–M2 due Wed Mar 25
–M3 due Wed Apr 8

  • Combined F5/6

–will go out Thu Mar 26, due Wed Apr 1

35

slide-36
SLIDE 36

News

  • Midterm marks and solutions released

–Gradescope has detailed breakdown, note stats are wrt total of 75
–Canvas has percentages, mean was 79%
–solutions have detailed rubric w/ answer alternatives & explanations

  • M1 marks released

–we specifically suggested to several teams that they meet with us during labs or office hrs to discuss

  • P3 marks released

–bimodal distribution

36

slide-37
SLIDE 37

P1-P3 marks

  • increasingly bimodal

37

slide-38
SLIDE 38

Q1-Q7 marks

38

slide-39
SLIDE 39

Foundations F1-F4

39

slide-40
SLIDE 40

Spatial aggregation

  • MAUP: Modifiable Areal Unit Problem

–changing boundaries of cartographic regions can yield dramatically different results
–zone effects
–scale effects
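A toy illustration of the zone effect: the same items, aggregated under two different boundary choices, give different results. The 3x3 grid of votes and the helper `seats` are invented for this sketch; real regions are of course not equal-sized grid rows and columns.

```python
from collections import Counter

# Invented 3x3 grid of precincts; each cell is the party that won it.
grid = [
    ["A", "A", "B"],
    ["B", "A", "A"],
    ["B", "B", "B"],
]


def seats(zones):
    """Count how many zones each party wins by simple majority."""
    winners = Counter()
    for zone in zones:
        winners[Counter(zone).most_common(1)[0][0]] += 1
    return dict(winners)


row_zones = [list(row) for row in grid]           # boundaries drawn horizontally
column_zones = [list(col) for col in zip(*grid)]  # boundaries drawn vertically

totals = Counter(cell for row in grid for cell in row)
print(dict(totals))
print(seats(row_zones))
print(seats(column_zones))
```

Party A holds 4 of 9 precincts in both cases, yet wins 2 of 3 zones under one set of boundaries and 1 of 3 under the other.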

40

[http://www.e-education.psu/edu/geog486/l4_p7.html, Fig 4.cg.6]

https://blog.cartographica.com/blog/2011/5/19/the-modifiable-areal-unit-problem-in-gis.html

slide-41
SLIDE 41

Gerrymandering: MAUP for political gain

41

https://www.washingtonpost.com/news/wonk/wp/2015/03/01/this-is-the-best-explanation-of-gerrymandering-you-will-ever-see/

A real district in Pennsylvania: Democrats won 51% of the vote but only 5 out of 18 house seats

slide-42
SLIDE 42

Example: Gerrymandering in PA

42

https://www.nytimes.com/interactive/2018/01/17/upshot/pennsylvania-gerrymandering.html

slide-43
SLIDE 43

Example: Gerrymandering in PA

  • updated map after court decision

43

https://www.nytimes.com/interactive/2018/11/29/us/politics/north-carolina-gerrymandering.html?action=click&module=Top%20Stories&pgtype=Homepage

slide-44
SLIDE 44

Clustering

  • classification of items into similar bins

–based on similarity measure

  • Euclidean distance, Pearson correlation

–partitioning algorithms

  • divide data into set of bins
  • # bins (k) set manually or automatically

–hierarchical algorithms

  • produce "similarity tree" (dendrograms): cluster hierarchy
  • agglomerative clustering: start w/ each node as own cluster, then iteratively merge
  • cluster hierarchy: derived data used w/ many dynamic aggregation idioms

–cluster more homogeneous than whole dataset

  • statistical measures & distribution more meaningful
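A small sketch of agglomerative clustering as described above: each item starts as its own cluster, and the two closest clusters are merged repeatedly. The choices of single linkage and Euclidean distance, the points, and the function name are assumptions of this example.

```python
import math


def agglomerate(points):
    """Single-linkage agglomerative clustering.

    Returns the merge list (cluster_a, cluster_b, distance), which encodes
    the cluster hierarchy (dendrogram). New clusters get ids n, n+1, ...
    """
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Single linkage: distance between the closest pair of members.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
merges = agglomerate(points)
for a, b, d in merges:
    print(a, b, round(d, 2))
```

The merge list is the derived data: cutting it at a chosen distance yields a flat set of bins, and keeping all of it gives the hierarchy that dynamic aggregation idioms can expand and contract.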

44

slide-45
SLIDE 45

Idiom: GrouseFlocks

45

  • data: compound graphs

–network
–cluster hierarchy atop it

  • derived or interactively chosen
  • visual encoding

–connection marks for network links
–containment marks for hierarchy
–point marks for nodes

  • dynamic interaction

–select individual metanodes in hierarchy to expand/contract

[GrouseFlocks: Steerable Exploration of Graph Hierarchy Space. Archambault, Munzner, and Auber. IEEE TVCG 14(4): 900-913, 2008.]

slide-46
SLIDE 46

Idiom: aggregation via hierarchical clustering (visible)

46

System: Hierarchical Clustering Explorer

[http://www.cs.umd.edu/hcil/hce/]

slide-47
SLIDE 47

Idiom: Hierarchical parallel coordinates

  • dynamic item aggregation
  • derived data: hierarchical clustering
  • encoding:

–cluster band with variable transparency, line at mean, width by min/max values
–color by proximity in hierarchy
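The per-axis numbers needed to draw one cluster band (line at mean, width by min/max) can be derived as follows. This is only a sketch of the aggregation step, with an invented three-attribute cluster and function name; it does not cover the transparency or color encoding.

```python
def cluster_band(items):
    """Per-axis (min, mean, max) for a cluster of equal-length numeric tuples."""
    axes = list(zip(*items))
    return [(min(axis), sum(axis) / len(axis), max(axis)) for axis in axes]


# Invented cluster of three items with three quantitative attributes.
cluster = [(1.0, 10.0, 3.0), (2.0, 14.0, 3.0), (3.0, 12.0, 6.0)]
band = cluster_band(cluster)
print(band)
```

One band replaces all of the cluster's polylines, so the number of marks depends on the chosen level of the hierarchy rather than on the number of items.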

47

[Hierarchical Parallel Coordinates for Exploration of Large Datasets. Fua, Ward, and Rundensteiner. Proc. IEEE Visualization Conference (Vis ’99), pp. 43–50, 1999.]

slide-48
SLIDE 48

Dimensionality Reduction

48

slide-49
SLIDE 49

Dimensionality reduction

  • attribute aggregation

–derive low-dimensional target space from high-dimensional measured space

  • capture most of variance with minimal error

–use when you can’t directly measure what you care about

  • true dimensionality of dataset conjectured to be smaller than dimensionality of measurements
  • latent factors, hidden variables

[Figure: Tumor Measurement Data, malignant vs. benign. DR maps data in 9D measured space to derived data in 2D target space.]

slide-50
SLIDE 50

Idiom: Dimensionality reduction for documents

50

[Figure: task sequence. Task 1 (Produce, Derive): in high-dimensional data, out 2D data. Task 2 (Discover, Explore, Identify; Encode, Navigate, Select): in 2D data, out scatterplot, clusters & points. Task 3 (Produce, Annotate): in scatterplot, clusters & points, out labels for clusters.]

slide-51
SLIDE 51

Dimensionality reduction & visualization

  • why do people do DR?

–improve performance of downstream algorithm

  • avoid curse of dimensionality

–data analysis

  • if you look at the output: visual data analysis
  • abstract tasks when visualizing DR data

– dimension-oriented tasks

  • naming synthesized dims, mapping synthesized dims to original dims

– cluster-oriented tasks

  • verifying clusters, naming clusters, matching clusters and classes

51

[Visualizing Dimensionally-Reduced Data: Interviews with Analysts and a Characterization of Task Sequences. Brehmer, Sedlmair, Ingram, and Munzner. Proc. BELIV 2014.]
slide-52
SLIDE 52

Dimension-oriented tasks

  • naming synthesized dims: inspect data represented by lowD points

52

[A global geometric framework for nonlinear dimensionality reduction. Tenenbaum, de Silva, and Langford. Science, 290(5500):2319–2323, 2000.]

slide-53
SLIDE 53

Cluster-oriented tasks

  • verifying, naming, matching to classes

53

[Figure panels: no discernable clusters; clearly discernable clusters; partial match cluster/class; clear match cluster/class; no match cluster/class]

[Visualizing Dimensionally-Reduced Data: Interviews with Analysts and a Characterization of Task Sequences. Brehmer, Sedlmair, Ingram, and Munzner. Proc. BELIV 2014.]
slide-54
SLIDE 54

Linear dimensionality reduction

  • principal components analysis (PCA)

–finding axes: first with most variance, second with next most, …
–describe location of each point as linear combination of weights for each axis

  • mapping synthesized dims to original dims
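A minimal PCA sketch for the 2D case, where the axes can be found in closed form without a linear algebra library. The data points and the function name `pca_2d` are invented; real uses involve many more dimensions and a general eigen-decomposition or SVD.

```python
import math


def pca_2d(points):
    """PCA for 2D points: returns (axes, variances, scores, mean)."""
    n = len(points)
    mean = (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)
    centered = [(x - mean[0], y - mean[1]) for x, y in points]
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix: variance along each new axis.
    half_trace = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b * b)
    variances = (half_trace + spread, half_trace - spread)
    # Direction of the first axis (most variance); the second is perpendicular.
    angle = 0.5 * math.atan2(2 * b, a - c)
    axis1 = (math.cos(angle), math.sin(angle))
    axis2 = (-math.sin(angle), math.cos(angle))
    # Each point's weights for the two axes.
    scores = [
        (x * axis1[0] + y * axis1[1], x * axis2[0] + y * axis2[1])
        for x, y in centered
    ]
    return (axis1, axis2), variances, scores, mean


points = [(2.0, 1.9), (4.0, 4.2), (6.0, 5.8), (8.0, 8.1)]
axes, variances, scores, mean = pca_2d(points)
print(axes[0], variances)
```

Because each axis is an explicit combination of the original attributes (here roughly equal parts x and y for the first axis), the synthesized dimensions can be mapped back to the original ones, and every point is recovered as the mean plus its weights times the axes.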

54

[http://en.wikipedia.org/wiki/File:GaussianScatterPCA.png]

slide-55
SLIDE 55

Nonlinear dimensionality reduction

  • pro: can handle curved rather than linear structure
  • cons: lose all ties to original dims/attribs

–new dimensions often cannot be easily related to originals

– mapping synthesized dims to original dims task is difficult

  • many techniques proposed

–many literatures: visualization, machine learning, optimization, psychology, ...
–techniques: t-SNE, MDS (multidimensional scaling), charting, isomap, LLE, ...
–t-SNE: excellent for clusters

  • but some trickiness remains: http://distill.pub/2016/misread-tsne/

–MDS: confusingly, entire family of techniques, both linear and nonlinear

  • minimize stress or strain metrics
  • early formulations equivalent to PCA
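To make "minimize stress" concrete, here is a sketch that only evaluates a stress value for a given layout; it does not perform the optimization. It uses raw stress with unit weights (the sum of squared differences between high-dimensional and low-dimensional pairwise distances), which is one of several stress definitions; the points and layouts are invented.

```python
import math
from itertools import combinations


def raw_stress(high, low):
    """Sum of squared differences between pairwise distances in the two spaces."""
    return sum(
        (math.dist(high[i], high[j]) - math.dist(low[i], low[j])) ** 2
        for i, j in combinations(range(len(high)), 2)
    )


# Three invented 3D points that happen to lie on a straight line.
high = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]

step = math.sqrt(3.0)
faithful = [(0.0,), (step,), (2 * step,)]  # 1D layout preserving all distances
squashed = [(0.0,), (1.0,), (2.0,)]        # 1D layout with distances too short

print(raw_stress(high, faithful))
print(raw_stress(high, squashed))
```

An MDS algorithm treats the low-dimensional coordinates as the unknowns and searches for the layout that drives a value like this as low as possible.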

55

slide-56
SLIDE 56

Nonlinear DR: Many options

  • MDS: multidimensional scaling (treat as optimization problem)
  • t-SNE: t-distributed stochastic neighbor embedding
  • UMAP: uniform manifold approximation and projection

–t-SNE and UMAP both emphasize cluster structure

56

https://pair-code.github.io/understanding-umap/
https://distill.pub/2016/misread-tsne/
https://colah.github.io/posts/2014-10-Visualizing-MNIST/

[Figure panels: MDS, PCA, t-SNE, UMAP]

slide-57
SLIDE 57

VDA with DR example: nonlinear vs linear

  • DR for computer graphics reflectance model

–goal: simulate how light bounces off materials to make realistic pictures

  • computer graphics: BRDF (reflectance)

–idea: measure what light does with real materials

57

[Fig 2. Matusik, Pfister, Brand, and McMillan. A Data-Driven Reflectance Model. SIGGRAPH 2003]

slide-58
SLIDE 58

Capturing & using material reflectance

  • reflectance measurement: interaction of light with real materials (spheres)
  • result: 104 high-res images of material

–each image 4M pixels

  • goal: image synthesis

–simulate completely new materials

  • need for more concise model

–104 materials * 4M pixels = 400M dims
–want concise model with meaningful knobs

  • how shiny/greasy/metallic
  • DR to the rescue!

58

[Figs 5/6. Matusik et al. A Data-Driven Reflectance Model. SIGGRAPH 2003]

slide-59
SLIDE 59

Linear DR

  • first try: PCA (linear)
  • result: error falls off sharply after ~45 dimensions

–scree plots: error vs number of dimensions in lowD projection

  • problem: physically impossible intermediate points when simulating new materials

–specular highlights cannot have holes!
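The data behind a scree plot can be sketched as the leftover error after keeping the first k dimensions. Here "error" is taken to be the fraction of variance not captured, which is an assumption (the paper's error measure may differ), and the per-dimension variances are invented rather than taken from the reflectance data.

```python
def scree(variances):
    """Residual variance fraction after keeping the first k dims, for k = 0..len."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    residuals = []
    kept = 0.0
    for k in range(len(ordered) + 1):
        residuals.append((total - kept) / total)
        if k < len(ordered):
            kept += ordered[k]
    return residuals


def dims_needed(residuals, tolerance):
    """Smallest k whose residual error is at or below the tolerance."""
    return next(k for k, r in enumerate(residuals) if r <= tolerance)


variances = [8.0, 4.0, 2.0, 1.0, 1.0]  # invented per-dimension variances
residuals = scree(variances)
print(residuals)
print(dims_needed(residuals, 0.1))
```

Plotting `residuals` against k gives the scree curve; the point where it flattens suggests how many dimensions are worth keeping.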

59

[Figs 6/7. Matusik et al. A Data-Driven Reflectance Model. SIGGRAPH 2003]

slide-60
SLIDE 60

Nonlinear DR

  • second try: charting (nonlinear DR technique)

–scree plot suggests 10-15 dims
–note: dim estimate depends on technique used!

60

[Fig 10/11. Matusik et al. A Data-Driven Reflectance Model. SIGGRAPH 2003]

slide-61
SLIDE 61

Finding semantics for synthetic dimensions

  • look for meaning in scatterplots

–synthetic dims created by algorithm but named by human analysts
–points represent real-world images (spheres)
–people inspect images corresponding to points to decide if axis could have meaningful name

  • cross-check meaning

–arrows show simulated images (teapots) made from model
–check if those match dimension semantics

61

row 4

[Fig 12/16. Matusik et al. A Data-Driven Reflectance Model. SIGGRAPH 2003]

slide-62
SLIDE 62

Understanding synthetic dimensions

62

[Fig 13/14/16. Matusik et al. A Data-Driven Reflectance Model. SIGGRAPH 2003]

[Figure panels: Specular-Metallic, Diffuseness-Glossiness]

slide-63
SLIDE 63

Embed

63

slide-64
SLIDE 64

Embed: Focus+Context

64

  • combine information within single view
  • elide

–selectively filter and aggregate

  • superimpose layer

–local lens

  • distortion design choices

–region shape: radial, rectilinear, complex
–how many regions: one, many
–region extent: local, global
–interaction metaphor

[Figure: Embed. Elide Data, Superimpose Layer, Distort Geometry.]

slide-65
SLIDE 65

Idiom: DOITrees Revisited

65

  • elide

–some items dynamically filtered out
–some items dynamically aggregated together
–some items shown in detail

[DOITrees Revisited: Scalable, Space-Constrained Visualization of Hierarchical Data. Heer and Card. Proc. Advanced Visual Interfaces (AVI), pp. 421–424, 2004.]

slide-66
SLIDE 66

Idiom: Fisheye Lens

66

  • distort geometry

–shape: radial
–focus: single extent
–extent: local
–metaphor: draggable lens
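A sketch of a radial, local fisheye distortion. The magnification function g(x) = (d+1)x / (dx+1) on normalized distance x is one commonly used choice and is an assumption here; it is not taken from the Tulip or D3 implementations, and the function and parameter names are invented.

```python
import math


def fisheye(point, focus, radius, distortion=3.0):
    """Push a point away from the focus; points outside the lens radius are unchanged."""
    dx = point[0] - focus[0]
    dy = point[1] - focus[1]
    dist = math.hypot(dx, dy)
    if dist == 0 or dist >= radius:
        return point  # local extent: nothing outside the lens moves
    x = dist / radius  # normalized distance from focus, in (0, 1)
    g = (distortion + 1) * x / (distortion * x + 1)  # magnified normalized distance
    scale = g / x
    return (focus[0] + dx * scale, focus[1] + dy * scale)


focus = (0.0, 0.0)
for p in [(2.0, 0.0), (8.0, 0.0), (20.0, 0.0)]:
    print(p, fisheye(p, focus, radius=10.0))
```

Points near the focus are spread apart and points near the lens boundary are compressed, while the boundary itself stays fixed so the lens can be dragged without disturbing the rest of the view.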

http://tulip.labri.fr/TulipDrupal/?q=node/351
 http://tulip.labri.fr/TulipDrupal/?q=node/371

slide-67
SLIDE 67

Idiom: Fisheye Lens

67

[D3 Fisheye Lens](https://bost.ocks.org/mike/fisheye/)

System: D3

slide-68
SLIDE 68

Idiom: Stretch and Squish Navigation

68

  • distort geometry

–shape: rectilinear
–foci: multiple
–impact: global
–metaphor: stretch and squish, borders fixed

[TreeJuxtaposer: Scalable Tree Comparison Using Focus+Context With Guaranteed Visibility. Munzner, Guimbretiere, Tasiran, Zhang, and Zhou. ACM Transactions on Graphics (Proc. SIGGRAPH) 22:3 (2003), 453–462.]

System: TreeJuxtaposer

slide-69
SLIDE 69

Distortion costs and benefits

  • benefits

–combine focus and context information in single view

  • costs

–length comparisons impaired

  • network/tree topology comparisons unaffected: connection, containment

–effects of distortion unclear if original structure unfamiliar
–object constancy/tracking maybe impaired

69

[Living Flows: Enhanced Exploration of Edge-Bundled Graphs Based on GPU-Intensive Edge Rendering. Lambert, Auber, and Melançon. Proc. Intl. Conf. Information Visualisation (IV), pp. 523–530, 2010.]

[Figure panels: fisheye lens, magnifying lens, neighborhood layering, Bring and Go]

slide-70
SLIDE 70

Credits

  • Visualization Analysis and Design (Ch 13, 14)
  • Alex Lex & Miriah Meyer, http://dataviscourse.net/

70