Introduction to Data Science: Basic Plotting

  1. Introduction to Data Science: Basic Plotting
     Héctor Corrada Bravo
     University of Maryland, College Park, USA
     2020-01-31

  2. Data Visualization
     We will spend a good amount of time in the course discussing data visualization. It serves many important roles in data analysis: we use it to understand dataset characteristics throughout an analysis, and it is a key element of communicating the insights derived from that analysis to our target audience.

  3. Grammar of Graphics (ggplot)
     In this section, we introduce the basic functionality of the ggplot package (ggplot2 in R; ports are also available in Python) to start our discussion of visualization throughout the course.
     The ggplot package is designed around the Entity-Attribute data model. A ggplot call can also be included as part of a data frame operation pipeline.

  4. Grammar of Graphics (ggplot)
     Let's create a dot plot of the number of arrests per district in our dataset:

         arrest_tab %>%
           group_by(district) %>%
           summarize(num_arrests=n()) %>%
           ggplot(mapping=aes(y=district, x=num_arrests)) +
           geom_point()
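The slides do not show how arrest_tab is loaded, so the pipeline above cannot be run on its own. As a minimal sketch, the version below replaces it with a tiny made-up data frame (the district values are purely illustrative) and keeps the summarized table in a variable, here called arrests_per_district, so it can be inspected before plotting.

```r
library(dplyr)    # group_by(), summarize(), n(), and the %>% pipe
library(ggplot2)

# Hypothetical stand-in for arrest_tab: one row per arrest.
# The district values are made up for illustration only.
arrest_tab <- data.frame(
  district = c("Northern", "Southern", "Southern",
               "Eastern", "Eastern", "Eastern")
)

# Count arrests per district (result has one row per district).
arrests_per_district <- arrest_tab %>%
  group_by(district) %>%
  summarize(num_arrests = n())

# Map the counts to x and the district names to y, then draw points.
dot_plot <- arrests_per_district %>%
  ggplot(mapping = aes(y = district, x = num_arrests)) +
  geom_point()

print(dot_plot)
```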

  5. Grammar of Graphics (ggplot)
     The ggplot design is very elegant; it takes some thinking to get used to, but is extremely powerful.
     The central premise is to characterize the building pieces behind ggplot plots as follows:
       1. The data that goes into a plot: a data frame of entities and attributes
       2. The mapping between data attributes and graphical (aesthetic) characteristics
       3. The geometric representation of these graphical characteristics

  6. Grammar of Graphics (ggplot)
     So in our example we can fill in these three parts as follows:
       1) Data: We pass a data frame to the ggplot function with the %>% operator at the end of the group_by-summarize pipeline.
       2) Mapping: Here we map the num_arrests attribute to the x position in the plot and the district attribute to the y position in the plot. Every ggplot will contain one or more aes calls.
       3) Geometry: Here we choose points as the geometric representation of our chosen graphical characteristics, using the geom_point function.

  7. Grammar of Graphics (ggplot)
     In general, the ggplot call will have the following structure:

         <data_frame> %>%
           ggplot(mapping=aes(<graphical_characteristic>=<attribute>)) +
           geom_<representation>()
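To make the template concrete, here is a small sketch that fills it in with the mpg dataset bundled with ggplot2. It also shows that the pipe simply supplies the data frame as the first argument of ggplot, so passing it explicitly through the data argument should give the same plot. The choice of cty and hwy as attributes is arbitrary.

```r
library(dplyr)    # for the %>% pipe
library(ggplot2)  # provides the mpg dataset

# Template filled in with the pipe: data, then mapping, then geometry.
piped <- mpg %>%
  ggplot(mapping = aes(x = cty, y = hwy)) +
  geom_point()

# Same plot with the data frame passed explicitly instead of piped.
explicit <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point()

print(piped)
```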

  8. Plot Construction Details
     Mappings

         Argument   Definition
         x          position along x axis
         y          position along y axis
         color      color
         shape      shape (applicable to e.g., points)
         size       size
         label      string used as label (applicable to text)
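As an illustration of the mapping arguments in the table, the sketch below maps five of them at once in a single aes call on the mpg dataset. Which attribute goes to which characteristic is an arbitrary choice made here for demonstration; note that shape needs a discrete attribute such as drv.

```r
library(dplyr)    # for the %>% pipe
library(ggplot2)  # provides the mpg dataset

# Each argument of aes() ties one graphical characteristic to one attribute.
many_mappings <- mpg %>%
  ggplot(mapping = aes(
    x = displ,      # position along x axis
    y = hwy,        # position along y axis
    color = class,  # point color
    shape = drv,    # point shape (discrete attribute)
    size = cty      # point size
  )) +
  geom_point()

print(many_mappings)
```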

  9. Plot Construction Details
     Representations

         Function      Representation
         geom_point    points
         geom_bar      rectangles
         geom_text     strings
         geom_smooth   smoothed line (advanced)
         geom_hex      hexagonal binning
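Because data and mapping are separate from the geometry, the same base plot can be drawn with different representations from the table. The sketch below, using mpg, stores the data and mapping once in a variable (named base here for illustration) and adds two different geom functions to it. With default settings geom_smooth picks a smoothing method itself and prints a message saying which one it used.

```r
library(dplyr)    # for the %>% pipe
library(ggplot2)  # provides the mpg dataset

# Data and mapping only; no geometric representation yet.
base <- mpg %>%
  ggplot(mapping = aes(x = displ, y = hwy))

# Same data and mapping, two different representations.
as_points <- base + geom_point()   # one point per row of mpg
as_smooth <- base + geom_smooth()  # a fitted smooth line with a band

print(as_points)
print(as_smooth)
```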

  10. Plot Construction Details
     We can include multiple geometric representations in a single plot, for example points and text, by adding (+) multiple geom_<representation> functions.
     We can also include mappings inside a geom_ call to map characteristics to attributes strictly for that specific representation. For example, geom_point(mapping=aes(color=<attribute>)) maps color to some attribute only for the point representation specified by that call.
     Mappings given in the ggplot call apply to all representations added to the plot.
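The points-and-text case can be sketched as follows. The per-manufacturer averages (by_maker, mean_displ, mean_hwy) are names introduced here for illustration. The x and y mappings in the ggplot call are shared by both layers, while the label mapping is given inside geom_text and so applies to the text layer only.

```r
library(dplyr)    # group_by(), summarize(), and the %>% pipe
library(ggplot2)  # provides the mpg dataset

# One row per manufacturer with average displacement and highway mpg.
by_maker <- mpg %>%
  group_by(manufacturer) %>%
  summarize(mean_displ = mean(displ), mean_hwy = mean(hwy))

points_and_text <- by_maker %>%
  ggplot(mapping = aes(x = mean_displ, y = mean_hwy)) +  # shared by all layers
  geom_point() +
  geom_text(mapping = aes(label = manufacturer),         # text layer only
            vjust = -0.8)                                # nudge labels above points

print(points_and_text)
```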

  11. Frequently Used Plots
     We will look at data visualization in more detail later in the course, but for now we list a few common plots used in data analysis and how they are created using ggplot.
     Let's switch data frame to the mpg dataset for our examples:

         mpg
         ## # A tibble: 234 x 11
         ##   manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
         ##   <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
         ## 1 audi         a4      1.8  1999     4 auto… f        18    29 p     comp…

  12. Scatter plot
     Used to visualize the relationship between two attributes.

         mpg %>%
           ggplot(mapping=aes(x=displ, y=hwy)) +
           geom_point(mapping=aes(color=cyl))
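In mpg, cyl is stored as an integer, so mapping it to color as above produces a continuous color gradient. If one distinct color per cylinder count is preferred, a common variant (not part of the original slides) is to wrap the attribute in factor() so ggplot treats it as discrete:

```r
library(dplyr)    # for the %>% pipe
library(ggplot2)  # provides the mpg dataset

# factor(cyl) makes the attribute discrete, so each cylinder count
# gets its own color instead of a position on a gradient.
scatter_discrete <- mpg %>%
  ggplot(mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = factor(cyl)))

print(scatter_discrete)
```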

  13. Bar graph
     Used to visualize the relationship between a continuous variable and a categorical (or discrete) attribute.

         mpg %>%
           group_by(cyl) %>%
           summarize(mean_mpg=mean(hwy)) %>%
           ggplot(mapping=aes(x=cyl, y=mean_mpg)) +
           geom_bar(stat="identity")
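By default geom_bar counts rows, which is why stat="identity" is needed when the bar heights have already been computed. ggplot2 also provides geom_col, which uses the y values as given. The sketch below builds the plot both ways from the same summarized table (stored in a variable named mean_by_cyl for illustration) so the two can be compared.

```r
library(dplyr)    # group_by(), summarize(), and the %>% pipe
library(ggplot2)  # provides the mpg dataset

# Average highway mpg for each cylinder count.
mean_by_cyl <- mpg %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(hwy))

# Version from the slide: tell geom_bar to use y values as-is.
bars_identity <- mean_by_cyl %>%
  ggplot(mapping = aes(x = cyl, y = mean_mpg)) +
  geom_bar(stat = "identity")

# Shorter alternative: geom_col() uses the y values as bar heights.
bars_col <- mean_by_cyl %>%
  ggplot(mapping = aes(x = cyl, y = mean_mpg)) +
  geom_col()

print(bars_col)
```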

  14. Histogram
     Used to visualize the distribution of the values of a numeric attribute.

         mpg %>%
           ggplot(mapping=aes(x=hwy)) +
           geom_histogram()
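Called without arguments, geom_histogram chooses a default number of bins and prints a message suggesting a better value be picked. A minimal sketch of setting the bin width explicitly follows; the value 2 (miles per gallon) is an arbitrary choice for illustration.

```r
library(dplyr)    # for the %>% pipe
library(ggplot2)  # provides the mpg dataset

# Each bar covers a 2 mpg wide interval of highway mileage.
hwy_hist <- mpg %>%
  ggplot(mapping = aes(x = hwy)) +
  geom_histogram(binwidth = 2)

print(hwy_hist)
```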

  15. Boxplot
     Used to visualize the distribution of a numeric attribute based on a categorical attribute.

         mpg %>%
           ggplot(mapping=aes(x=class, y=hwy)) +
           geom_boxplot()
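The boxes above appear in alphabetical order of class. One optional variant, not in the original slides, orders them by median highway mileage using base R's reorder(), which can make comparisons across categories easier to read. The labs call only restores a readable axis title.

```r
library(dplyr)    # for the %>% pipe
library(ggplot2)  # provides the mpg dataset

# reorder() sorts the class categories by the median of hwy within each class.
box_ordered <- mpg %>%
  ggplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot() +
  labs(x = "class")

print(box_ordered)
```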
