Visualization Max Turgeon STAT 4690–Applied Multivariate Analysis

Tidyverse
• For graphics, I personally prefer ggplot2 to base R functions.
• Of course, you’re free to use whatever you prefer!
• Therefore, I often use the tidyverse packages to prepare data for visualization.
• Great resources:
  • The book R for Data Science
  • RStudio’s cheatsheets

Pipe operator
• One of the important features of the tidyverse is the pipe operator %>%.
• It takes the output of a function (or of an expression) and uses it as input for the next function (or expression).

    library(tidyverse)

    count(mtcars, cyl)

    # Or with the pipe
    mtcars %>% count(cyl)

Pipe operator
• Note that the LHS (mtcars) becomes the first argument of the function appearing on the RHS (count).
• In more complex examples, where multiple transformations are applied one after another, the pipe operator improves readability and avoids creating too many intermediate variables.
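As a rough illustration of the readability point, the sketch below writes the same three-step transformation twice: once as nested calls and once with the pipe. It assumes the dplyr package is installed; the object names nested and piped exist only for this example.

```r
library(dplyr)

# Nested calls: the steps have to be read from the inside out
nested <- summarise(
  group_by(filter(mtcars, cyl %in% c(6, 8)), cyl),
  avg_mpg = mean(mpg)
)

# Same steps with the pipe: read from top to bottom, no intermediate variables
piped <- mtcars %>%
  filter(cyl %in% c(6, 8)) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))

# Both versions should produce the same summary table
all.equal(nested, piped)
```

In both versions the data frame ends up as the first argument of each function; the pipe only changes how the chain is written.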

Main tidyverse functions
• mutate: Create a new variable as a function of the other variables.
    mutate(mtcars, liters_per_100km = 235.215 / mpg)
• filter: Keep only rows for which some condition is TRUE.
    filter(mtcars, cyl %in% c(6, 8))
• summarise: Apply a summary function to some variables. Often used with group_by.
    mtcars %>%
      group_by(cyl) %>%
      summarise(avg_mpg = mean(mpg))
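For reference, the constant 235.215 comes from a unit conversion. Fuel consumption in litres per 100 km is inversely proportional to miles per gallon, so the constant is divided by mpg. The sketch below derives it from the exact definitions of the US liquid gallon and the international mile, assuming (as the mtcars documentation states) that mpg is measured in US gallons. It uses base R's transform() in place of mutate() so that it runs without extra packages; cars_metric is a name made up for this example.

```r
# Exact unit definitions (US liquid gallon, international mile)
litres_per_gallon <- 3.785411784
km_per_mile <- 1.609344

# L/100 km = 100 * litres_per_gallon / (km_per_mile * mpg) = conversion / mpg
conversion <- 100 * litres_per_gallon / km_per_mile
conversion

# Base R equivalent of the mutate() call, using the built-in mtcars data
cars_metric <- transform(mtcars, liters_per_100km = conversion / mpg)
head(cars_metric[, c("mpg", "liters_per_100km")])
```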

Data Visualization

Main principles
Why would we want to visualize data?
• Quality control
• Identify outliers
• Find patterns of interest (EDA)

Visualizing multivariate data
• To start, you can visualize multivariate data one variable at a time.
• Therefore, you can use the same visualization tools you’re likely familiar with.

Histogram i
    library(tidyverse)
    library(dslabs)

    dim(olive)
    ## [1] 572 10

    olive %>%
      ggplot(aes(oleic)) +
      geom_histogram()

Histogram ii
[Figure: histogram; x-axis: oleic, y-axis: count]

Histogram iii
    olive %>%
      ggplot(aes(oleic, fill = region)) +
      geom_histogram() +
      theme(legend.position = 'top')

Histogram iv
[Figure: histogram with fill by region (Northern Italy, Sardinia, Southern Italy); x-axis: oleic, y-axis: count]

Histogram v
    # Or with facets
    olive_bg <- olive %>% dplyr::select(-region)

    olive %>%
      ggplot(aes(oleic, fill = region)) +
      geom_histogram(data = olive_bg, fill = 'grey') +
      geom_histogram() +
      facet_grid(. ~ region) +
      theme(legend.position = 'top')

Histogram vi
[Figure: histograms in three panels (Northern Italy, Sardinia, Southern Italy), with fill by region; x-axis: oleic, y-axis: count]

Density plot i
• Another way to estimate the density is with kernel density estimators (e.g. see Wikipedia).
• Let $X_1, \ldots, X_n$ be our IID sample. For $K$ a non-negative function and $h > 0$ a smoothing parameter, we have
  $$\hat{f}_n(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right).$$
• Many functions $K$ can be used: gaussian, rectangular, triangular, Epanechnikov, biweight, cosine or optcosine.
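To make the estimator concrete, here is a minimal sketch that evaluates the formula directly with a Gaussian kernel and compares it with R's built-in density(). The simulated sample, the bandwidth value, and the helper name kde_gaussian are assumptions made for this example. Since density() uses a binned approximation, the two results should agree closely but not exactly.

```r
# Direct evaluation of the kernel density estimator with a Gaussian kernel K
kde_gaussian <- function(x, data, h) {
  sapply(x, function(x0) sum(dnorm((x0 - data) / h)) / (length(data) * h))
}

set.seed(4690)
sample_x <- rnorm(200)          # simulated IID sample
h <- 0.4                        # smoothing parameter, chosen arbitrarily
grid <- seq(-2, 2, by = 0.5)    # points at which to evaluate the estimate

manual <- kde_gaussian(grid, sample_x, h)

# Built-in estimate with the same kernel and bandwidth, interpolated onto the grid
fit <- density(sample_x, bw = h, kernel = "gaussian")
builtin <- approx(fit$x, fit$y, xout = grid)$y

# Largest discrepancy between the two; expected to be small
max(abs(manual - builtin))
```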

Density plot ii
    olive %>%
      ggplot(aes(oleic)) +
      geom_density()

Density plot iii
[Figure: density plot; x-axis: oleic, y-axis: density]

Density plot iv
    olive %>%
      ggplot(aes(oleic, fill = region)) +
      geom_density(alpha = 0.5) +
      theme(legend.position = 'top')

Density plot v
[Figure: overlaid density plots with fill by region (Northern Italy, Sardinia, Southern Italy); x-axis: oleic, y-axis: density]

ECDF plot i
• Density plots are “smoothed histograms”.
• The smoothing can hide important details, or even create artifacts.
• Another way of looking at the distribution: empirical CDFs.
  • Easily compute/compare quantiles
  • Steepness reflects spread: the steeper the curve, the more concentrated the data (i.e. the smaller the variance)
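As a small sketch of how quantiles relate to the empirical CDF, the example below uses base R's ecdf() on a toy vector (made up for this illustration). The ECDF at t is the proportion of observations less than or equal to t, and a quantile is obtained by inverting that step function.

```r
x <- c(3, 1, 4, 1, 5, 9, 2, 6)   # toy data, for illustration only

Fn <- ecdf(x)   # step function: Fn(t) = proportion of observations <= t
Fn(4)           # 5 of the 8 values are <= 4, so this is 0.625

# Quantiles are read off by inverting the ECDF (type = 1 in quantile())
quantile(x, probs = 0.5, type = 1)
```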

ECDF plot ii
    olive %>%
      ggplot(aes(oleic)) +
      stat_ecdf() +
      ylab("Cumulative Probability")

ECDF plot iii
[Figure: ECDF plot; x-axis: oleic, y-axis: Cumulative Probability]

ECDF plot iv
    # You can add a "rug"
    olive %>%
      ggplot(aes(oleic)) +
      stat_ecdf() +
      geom_rug(sides = "b") +
      ylab("Cumulative Probability")

ECDF plot v
[Figure: ECDF plot with a rug along the bottom; x-axis: oleic, y-axis: Cumulative Probability]

ECDF plot vi
    olive %>%
      ggplot(aes(oleic, colour = region)) +
      stat_ecdf() +
      ylab("Cumulative Probability") +
      theme(legend.position = 'top')

ECDF plot vii
[Figure: ECDF plots coloured by region (Northern Italy, Sardinia, Southern Italy); x-axis: oleic, y-axis: Cumulative Probability]

Boxplot i
• Box plots are a simple way to display important quantiles and identify outliers.
• Components (per Tukey):
  • A box delimiting the first and third quartiles;
  • A line indicating the median;
  • Whiskers corresponding to the lowest datum still within 1.5 IQR of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile;
  • Any datum that falls outside the whiskers is considered a (potential) outlier.
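The components listed above can be computed by hand. The sketch below does so for a toy vector (made up for this illustration) and compares the result with base R's boxplot.stats(). It uses Tukey's hinges from fivenum() as the quartiles; plotting functions may compute quartiles slightly differently, so treat this as an illustration of the rule, not of any particular implementation.

```r
x <- c(2, 3, 4, 5, 6, 7, 8, 9, 30)   # toy data with one extreme value

hinges <- fivenum(x)[c(2, 4)]        # Tukey's hinges (close to the quartiles)
iqr <- diff(hinges)
lower_fence <- hinges[1] - 1.5 * iqr
upper_fence <- hinges[2] + 1.5 * iqr

inside <- x >= lower_fence & x <= upper_fence
whiskers <- range(x[inside])         # most extreme data still within 1.5 IQR
outliers <- x[!inside]               # flagged as potential outliers

# boxplot.stats() applies the same rule:
# $stats = lower whisker, lower hinge, median, upper hinge, upper whisker
bs <- boxplot.stats(x, coef = 1.5)
bs$stats
bs$out
```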

Boxplot ii
    olive %>%
      ggplot(aes(y = oleic)) +
      geom_boxplot(x = 0)

Boxplot iii
[Figure: single boxplot; y-axis: oleic]

Boxplot iv
    olive %>%
      ggplot(aes(x = region, y = oleic)) +
      geom_boxplot()

Boxplot v
[Figure: boxplots by region (Northern Italy, Sardinia, Southern Italy); x-axis: region, y-axis: oleic]

Boxplot vi
    # Add all points on top of boxplots
    # Note: need to remove outliers or you will get
    # duplicates
    olive %>%
      ggplot(aes(x = region, y = oleic)) +
      geom_boxplot(outlier.colour = NA) +
      geom_jitter(width = 0.25, height = 0)

Boxplot vii
[Figure: boxplots by region (Northern Italy, Sardinia, Southern Italy) with jittered points on top; x-axis: region, y-axis: oleic]

Bivariate plots

Scatter plot i
• The plots above displayed information on a single variable at a time.
• The simplest way to represent the relationship between two variables is a scatter plot.
• Technically still possible with three variables, but typically more difficult to read.
    stars %>%
      ggplot(aes(magnitude, temp)) +
      geom_point()

Scatter plot ii
[Figure: scatter plot; x-axis: magnitude, y-axis: temp]

Scatter plot iii
    stars %>%
      ggplot(aes(magnitude, temp)) +
      geom_point(aes(colour = type))

Scatter plot iv
[Figure: scatter plot coloured by type (A, B, DA, DB, DF, F, G, K, M, O); x-axis: magnitude, y-axis: temp]

Scatter plot v
    library(scatterplot3d)

    greenhouse_gases %>%
      spread(gas, concentration) %>%
      with(scatterplot3d(CH4,  # x axis
                         CO2,  # y axis
                         N2O   # z axis
      ))

Scatter plot vi
[Figure: 3D scatter plot; x-axis: CH4, y-axis: CO2, z-axis: N2O]

Bivariate density plot i
    stars %>%
      ggplot(aes(magnitude, temp)) +
      geom_point(aes(colour = type)) +
      geom_density_2d()

Bivariate density plot ii
[Figure: scatter plot coloured by type (A, B, DA, DB, DF, F, G, K, M, O) with 2D density contours; x-axis: magnitude, y-axis: temp]

Bagplot i
• Introduced in 1999 by Rousseeuw et al. as a bivariate generalization of Tukey’s boxplot.
• Helps visualize location, spread, and skewness, and identify potential outliers.
• Components (details omitted):
  • The bag, a polygon “at the center of the data cloud” that contains at most 50% of the data points.
  • The fence, corresponding to an inflation of the bag (typically by a factor of 3). Observations outside the fence are potential outliers.
  • The loop, which is the convex hull of the non-outliers.

Bagplot ii
    devtools::source_gist("00772ccea2dd0b0f1745",
                          filename = "000_geom_bag.r")
    devtools::source_gist("00772ccea2dd0b0f1745",
                          filename = "001_bag_functions.r")

    stars %>%
      ggplot(aes(magnitude, temp)) +
      geom_bag() +
      theme_bw()

Bagplot iii
[Figure: bagplot; x-axis: magnitude, y-axis: temp]

Bagplot iv
    stars %>%
      ggplot(aes(magnitude, temp)) +
      geom_bag() +
      geom_point(aes(colour = type)) +
      theme_bw()

Bagplot v
[Figure: bagplot with points coloured by type (A, B, DA, DB, DF, F, G, K, M, O); x-axis: magnitude, y-axis: temp]

Bagplot vi
    gapminder %>%
      filter(year == 2012,
             !is.na(infant_mortality)) %>%
      ggplot(aes(infant_mortality, life_expectancy)) +
      geom_bag(aes(fill = continent)) +
      geom_point(aes(colour = continent)) +
      theme_bw()

Bagplot vii
[Figure: bagplots by continent (Africa, Americas, Asia, Europe, Oceania); x-axis: infant_mortality, y-axis: life_expectancy]

Bagplot viii
    gapminder %>%
      filter(year == 2012,
             !is.na(infant_mortality)) %>%
      ggplot(aes(infant_mortality, life_expectancy)) +
      geom_bag(aes(fill = continent)) +
      geom_point(aes(colour = continent)) +
      facet_wrap(~ continent) +
      theme_bw()

Bagplot ix
[Figure: bagplots in one panel per continent (Africa, Americas, Asia, Europe, Oceania); x-axis: infant_mortality, y-axis: life_expectancy]

Beyond two variables
