INFO 1998: Introduction to Machine Learning, Lecture 3: Data Visualization



slide-1
SLIDE 1

INFO 1998: Introduction to Machine Learning

slide-2
SLIDE 2

Lecture 3: Data Visualization

INFO 1998: Introduction to Machine Learning

slide-3
SLIDE 3

Agenda

  • 1. Why Data Visualization is Important
  • 2. Data Visualization Libraries
  • 3. Basic Visualizations
  • 4. Advanced Visualizations
  • 5. Challenges of Visualization
slide-4
SLIDE 4

The Data Pipeline

Stages: Problem Statement, Raw data, Usable data, Statistical and predictive results, Meaningful output, Solution

Steps:

  • Data cleaning, imputation, normalization
  • Data analysis, predictive modeling, etc.
  • Debugging, improving models and analysis
  • Summary and visualization

Diagram markers: "We are here!", "We are also here!"

Source: https://towardsdatascience.com/5-steps-of-a-data-science-project-lifecycle-26c50372b492

slide-5
SLIDE 5

Why Is Data Visualization Important?

Source

Image labels: "Data Visualization", "me", "Raw CSV file"

slide-6
SLIDE 6

Why Is Data Visualization Important?

  • Informative
  • Appealing
  • Universal
  • Predictive

slide-7
SLIDE 7

Why Is Data Visualization Important?

  • Same summary stats (mean, median, mode) but different distributions!
  • We need to see how the actual data looks!
  • df.describe() is not enough

Source
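
A minimal sketch of this point, using synthetic data that is not from the lecture. Two columns are rescaled to share the same mean and standard deviation, so the mean and std rows of df.describe() match, while the histograms show one bell shape and one pair of peaks. The column names and the `standardize` helper are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def standardize(x, mean=50.0, std=10.0):
    """Rescale x to the requested mean and sample standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1) * std + mean

n = 1000
two_peaks = np.concatenate([rng.normal(-3, 0.5, n // 2), rng.normal(3, 0.5, n // 2)])
df = pd.DataFrame({
    "bell": standardize(rng.normal(size=n)),
    "two_peaks": standardize(two_peaks),
})

# The mean and std rows agree for both columns...
print(df.describe().loc[["mean", "std"]].round(2))

# ...but the histograms look nothing alike.
fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharex=True)
for ax, column in zip(axes, df.columns):
    ax.hist(df[column], bins=30)
    ax.set_title(column)
```

Call fig.savefig(...) or switch to an interactive backend to view the two panels.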

slide-8
SLIDE 8

Data Visualization Simple Example: Yelp

Question: What do you notice? What trends do you see?

slide-9
SLIDE 9

Data Visualization Libraries

  • matplotlib
      ○ Python data visualization package
      ○ Capable of handling most data visualization needs
      ○ Simple object-oriented library inspired by MATLAB
      ○ Cheatsheet
  • seaborn
      ○ Another visualization package built on matplotlib
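
A short sketch of matplotlib's object-oriented style, assuming a sine curve as placeholder data: a Figure holds one or more Axes, and each plotting call is a method on an Axes object.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# plt.subplots() returns the Figure and its Axes; we draw on the Axes.
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Object-oriented matplotlib")
ax.legend()
```

seaborn functions generally accept an `ax=` argument, so the same Figure and Axes objects can be reused when mixing the two libraries.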
slide-10
SLIDE 10

Bar Graph

  • Represent magnitude or frequency of discrete variables
  • Allows us to compare features

Source
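
A possible bar graph sketch, assuming a small made-up list of category labels: count how often each discrete value occurs, then draw one bar per category.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from collections import Counter

# Hypothetical discrete observations (not from the lecture).
ratings = ["good", "bad", "good", "ok", "good", "ok"]

counts = Counter(ratings)
labels = sorted(counts)                      # fixed, reproducible bar order
heights = [counts[label] for label in labels]

fig, ax = plt.subplots()
ax.bar(labels, heights)
ax.set_xlabel("rating")
ax.set_ylabel("frequency")
```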

slide-11
SLIDE 11

Histograms

  • Used to observe frequency distribution of continuous variables
  • Data split into bins

Source

slide-12
SLIDE 12

Histograms: Different Bin Sizes

Source
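
A sketch of how the bin count changes the picture, assuming 500 synthetic normally distributed values. The same data is drawn three times; only the `bins` argument differs.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
values = rng.normal(loc=0.0, scale=1.0, size=500)

bin_choices = [5, 20, 100]   # too coarse, reasonable, too fine (for this sample)
totals = []

fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharex=True)
for ax, bins in zip(axes, bin_choices):
    counts, edges, _ = ax.hist(values, bins=bins)
    totals.append(int(counts.sum()))   # every point lands in exactly one bin
    ax.set_title(f"{bins} bins")
```

Too few bins hide the shape of the distribution, while too many make it look noisy.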

slide-13
SLIDE 13

Density Plot

  • Like a histogram, but smooths the shape of the distribution

Source

slide-14
SLIDE 14

Histogram vs Density Plot

Source
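
One way to compare the two plots, assuming synthetic data and SciPy's `gaussian_kde` as the smoother (the lecture does not say which estimator it uses): draw a histogram normalized to unit area and overlay the kernel density estimate.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
values = rng.normal(size=300)

# Evaluate the smoothed density on a grid slightly wider than the data.
kde = gaussian_kde(values)
grid = np.linspace(values.min() - 1, values.max() + 1, 400)
density = kde(grid)

fig, ax = plt.subplots()
ax.hist(values, bins=20, density=True, alpha=0.4, label="histogram")
ax.plot(grid, density, label="density estimate")
ax.legend()

# Like a normalized histogram, the density curve encloses an area of about 1.
area = float(np.sum(density) * (grid[1] - grid[0]))
```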

slide-15
SLIDE 15

Boxplot (a.k.a. box-and-whisker plot)

  • Summary of data
  • Shows spread of data
  • Gives range, interquartile range, median, and outlier information

Source
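
A sketch of the numbers behind a boxplot, assuming a small made-up sample and the common 1.5 × IQR rule for flagging outliers (this is also matplotlib's default whisker setting).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample with one unusually large value.
data = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 30], dtype=float)

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1                                  # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # outlier fences
outliers = data[(data < lower) | (data > upper)]

fig, ax = plt.subplots()
bp = ax.boxplot(data)   # box = Q1..Q3, line = median, points = outliers
```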

slide-16
SLIDE 16

Violin Plot

  • Combination of boxplot and density plot to show the spread and shape of the data
  • Can show whether the data is normally distributed
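
A violin plot sketch on two synthetic samples (assumed here for illustration): one drawn from a normal distribution and one from a skewed exponential distribution. A symmetric, bell-like outline suggests normality but does not prove it.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
normal_like = rng.normal(loc=0.0, scale=1.0, size=300)
skewed = rng.exponential(scale=1.0, size=300)

fig, ax = plt.subplots()
# Each violin is a mirrored density estimate; showmedians adds a median line.
parts = ax.violinplot([normal_like, skewed], showmedians=True)
ax.set_xticks([1, 2])
ax.set_xticklabels(["normal-like", "skewed"])
```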

slide-17
SLIDE 17

Demo 1

slide-18
SLIDE 18

Scatterplot

  • See relationship between two features
  • Can be useful for extrapolating information
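
A sketch of both bullets, assuming synthetic data with a roughly linear relationship: scatter the two features, fit a straight line with `np.polyfit`, and use the line to estimate a value outside the observed range. A linear fit is an assumption of this example, and extrapolation is only as good as that assumption.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)  # noisy line

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares straight line

x_new = 12.0                                 # outside the observed 0..10 range
y_pred = slope * x_new + intercept

fig, ax = plt.subplots()
ax.scatter(x, y, s=15, label="observations")
ax.plot([0, x_new], [intercept, y_pred], color="red", label="fitted line")
ax.legend()
```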

slide-19
SLIDE 19

Heatmap

  • Varying degrees of one metric are represented using color
  • Especially useful in the context of maps to show geographical variation
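
A minimal heatmap sketch, assuming a made-up 5 × 7 grid whose metric peaks near the middle: `imshow` maps each cell's value to a color, and the colorbar gives the scale.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rows, cols = 5, 7
yy, xx = np.mgrid[0:rows, 0:cols]
# Hypothetical metric: highest at row 2, column 3, fading with distance.
metric = np.exp(-((xx - 3) ** 2 + (yy - 2) ** 2) / 4.0)

fig, ax = plt.subplots()
im = ax.imshow(metric, cmap="viridis")
fig.colorbar(im, ax=ax, label="metric value")
```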

slide-20
SLIDE 20

Heatmap: Click Density / Website Heatmaps

slide-21
SLIDE 21

Correlation Plots

  • 2D matrix with all variables on each axis
  • Entries represent the correlation coefficients between each pair of variables

Source

Why are all entries on the diagonal ‘1’?
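
A sketch of a correlation plot on three synthetic columns (the names a, b, c are placeholders). It also answers the question above: the diagonal holds each variable's correlation with itself, and corr(X, X) = cov(X, X) / (σ_X · σ_X) = 1.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.5, size=200),  # strongly tied to a
    "c": rng.normal(size=200),                     # unrelated noise
})

corr = df.corr()   # Pearson correlation for every pair of columns

fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="correlation coefficient")
```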

slide-22
SLIDE 22

Using Maps

➢ Map visualization → contextual information
    ○ Trends are not always apparent in the data itself
    ○ Ex) Longitudes + Latitudes → Geographical Map
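
A rough sketch of the longitude/latitude idea, assuming randomly generated points scattered around Pittsburgh's approximate coordinates (not the lecture's dataset). No basemap is drawn; plotting longitude against latitude with a corrected aspect ratio already gives a map-like picture.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(6)
center_lat, center_lon = 40.44, -79.99   # approximate location of Pittsburgh
lon = center_lon + rng.normal(scale=0.03, size=200)
lat = center_lat + rng.normal(scale=0.02, size=200)

fig, ax = plt.subplots()
ax.scatter(lon, lat, s=8, alpha=0.6)
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
# A degree of longitude is shorter than a degree of latitude away from the
# equator; this aspect ratio is a simple correction for that.
ax.set_aspect(1 / np.cos(np.radians(center_lat)))
```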

slide-23
SLIDE 23

Example: Pittsburgh Data

slide-24
SLIDE 24

Demo 2

slide-25
SLIDE 25

Challenges of Visualization

  • Higher Dimension
  • Hard to Show Uncertainty
  • Time Consuming
  • Non-Trivial

slide-26
SLIDE 26

High Dimensional Data

  • Color, time animations, or point shape can be used for higher dimensions
  • There is a limit to the number of features that can be displayed

4D Plot For Earthquake Data
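
A sketch of packing four features into one 2D scatterplot, assuming synthetic earthquake-like records (the ranges and column meanings are invented for illustration): position carries two features, color a third, and marker size a fourth.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)
n = 100
lon = rng.uniform(-125, -115, n)     # feature 1: x position
lat = rng.uniform(32, 42, n)         # feature 2: y position
depth_km = rng.uniform(0, 50, n)     # feature 3: color
magnitude = rng.uniform(2, 7, n)     # feature 4: marker size

fig, ax = plt.subplots()
sc = ax.scatter(lon, lat, c=depth_km, s=5 * magnitude ** 2,
                cmap="viridis", alpha=0.7)
fig.colorbar(sc, ax=ax, label="depth (km)")
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
```

Adding more channels (shape, animation over time) is possible, but each extra one makes the plot harder to read.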

slide-27
SLIDE 27

Error Bars

  • Used to show uncertainty
  • Usually display a 95 percent confidence interval
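
A sketch of 95 percent error bars, assuming three synthetic groups and the normal approximation mean ± 1.96 × standard error (for small samples a t-based interval would be wider).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(8)
# Hypothetical measurements for three groups.
groups = {
    "A": rng.normal(10, 2, 40),
    "B": rng.normal(12, 2, 40),
    "C": rng.normal(11, 4, 40),
}

names = list(groups)
means = [g.mean() for g in groups.values()]
# Half-width of the interval: 1.96 standard errors of the mean.
half_widths = [1.96 * g.std(ddof=1) / np.sqrt(g.size) for g in groups.values()]

positions = np.arange(len(names))
fig, ax = plt.subplots()
ax.errorbar(positions, means, yerr=half_widths, fmt="o", capsize=4)
ax.set_xticks(positions)
ax.set_xticklabels(names)
ax.set_ylabel("mean with 95% confidence interval")
```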
slide-28
SLIDE 28

Coming Up

  • Assignment 2: Due at 5:30pm on Mar 4, 2020
  • Next Lecture: Fundamentals of Machine Learning
  • Data Scraping Workshop: March 2 (Mon), 4:30pm – 5:30pm, Rhodes 406