Introduction to Exploratory Data Analysis: Statistical Thinking in Python (Part 1)

SLIDE 1

Introduction to Exploratory Data Analysis

STATISTICAL THINKING IN PYTHON (PART 1)

Justin Bois

Lecturer at the California Institute of Technology

SLIDE 2

Exploratory data analysis

The process of organizing, plotting, and summarizing a data set

SLIDE 3

“Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.” —John Tukey

SLIDE 4

2008 US swing state election results

Data retrieved from Data.gov (https://www.data.gov/)

SLIDE 5

2008 US swing state election results

import pandas as pd
# Load the county-level results and show three of the columns
df_swing = pd.read_csv('2008_swing_states.csv')
df_swing[['state', 'county', 'dem_share']]

  state              county  dem_share
0    PA         Erie County      60.08
1    PA     Bradford County      40.64
2    PA        Tioga County      36.07
3    PA       McKean County      41.21
4    PA       Potter County      31.04
5    PA        Wayne County      43.78
6    PA  Susquehanna County      44.08
7    PA       Warren County      46.85
8    OH    Ashtabula County      56.94
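The "organizing" and "summarizing" parts of exploratory data analysis can be tried without the CSV file. The sketch below rebuilds only the nine rows shown on the slide; the name `df_preview` is hypothetical, and the real `2008_swing_states.csv` is assumed to hold many more rows and columns than this excerpt.

```python
import pandas as pd

# The nine rows printed on the slide, typed in by hand (an excerpt only,
# not the full data set).
rows = [
    ("PA", "Erie County", 60.08),
    ("PA", "Bradford County", 40.64),
    ("PA", "Tioga County", 36.07),
    ("PA", "McKean County", 41.21),
    ("PA", "Potter County", 31.04),
    ("PA", "Wayne County", 43.78),
    ("PA", "Susquehanna County", 44.08),
    ("PA", "Warren County", 46.85),
    ("OH", "Ashtabula County", 56.94),
]
df_preview = pd.DataFrame(rows, columns=["state", "county", "dem_share"])

# Organize: number of counties per state in this excerpt
counts_by_state = df_preview["state"].value_counts()

# Summarize: count, mean, spread, and quartiles of the Democratic vote share
summary = df_preview["dem_share"].describe()

print(counts_by_state)
print(summary)
```

Numbers like these are a starting point; the plots on the following slides show what a table of summaries can hide.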

SLIDE 6

2008 US swing state election results

SLIDE 7

Let's practice!

SLIDE 8

Plotting a histogram

Justin Bois

Lecturer at the California Institute of Technology

SLIDE 9

2008 US swing state election results

SLIDE 10

Generating a histogram

import matplotlib.pyplot as plt
# Histogram of the Democratic vote share, one count per county
_ = plt.hist(df_swing['dem_share'])
_ = plt.xlabel('percent of vote for Obama')
_ = plt.ylabel('number of counties')
plt.show()

SLIDE 11

Always label your axes

SLIDE 12

2008 US swing state election results

SLIDE 13

Histograms with different binning

SLIDE 14

Setting the bins of a histogram

# A list passed to bins gives the bin edges explicitly
bin_edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
_ = plt.hist(df_swing['dem_share'], bins=bin_edges)
plt.show()

SLIDE 15

Setting the bins of a histogram

# An integer passed to bins gives the number of equal-width bins
_ = plt.hist(df_swing['dem_share'], bins=20)
plt.show()
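The two forms of the `bins` argument can be compared without drawing anything, because `plt.hist` uses `np.histogram` to bin the data. The sketch below is a minimal illustration that uses only the nine `dem_share` values visible on the earlier slide, not the full data set.

```python
import numpy as np

# The nine dem_share values shown on the slide (an excerpt of the data)
dem_share = np.array([60.08, 40.64, 36.07, 41.21, 31.04,
                      43.78, 44.08, 46.85, 56.94])

# Form 1: a list of bin edges. Eleven edges define ten bins of width 10.
bin_edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
counts_edges, edges_out = np.histogram(dem_share, bins=bin_edges)

# Form 2: an integer. NumPy builds that many equal-width bins spanning
# the range from the smallest to the largest value in the data.
counts_20, edges_20 = np.histogram(dem_share, bins=20)

print(counts_edges)               # counts for each of the ten explicit bins
print(edges_20[0], edges_20[-1])  # outermost edges match the data range
```

With explicit edges the bins stay fixed from one data set to the next; with an integer they adapt to the range of whatever data is passed in.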

SLIDE 16

Seaborn

An excellent Matplotlib-based statistical data visualization package written by Michael Waskom

SLIDE 17

Setting Seaborn styling

import seaborn as sns
# Apply Seaborn's default styling to all subsequent Matplotlib plots
sns.set()
_ = plt.hist(df_swing['dem_share'])
_ = plt.xlabel('percent of vote for Obama')
_ = plt.ylabel('number of counties')
plt.show()

SLIDE 18

A Seaborn-styled histogram

SLIDE 19

Let's practice!

SLIDE 20

Plot all of your data: Bee swarm plots

Justin Bois

Lecturer at the California Institute of Technology

SLIDE 21

2008 US swing state election results

SLIDE 22

2008 US swing state election results

SLIDE 23

Binning bias

The same data may be interpreted differently depending on choice of bins
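A small numerical illustration of binning bias, assuming only the nine `dem_share` values shown on the earlier slide: two sets of bins with the same width, offset by five percentage points, are applied to the same data. The edge lists `edges_a` and `edges_b` are hypothetical choices made for this sketch.

```python
import numpy as np

# The nine dem_share values shown on the slide (an excerpt of the data)
dem_share = np.array([60.08, 40.64, 36.07, 41.21, 31.04,
                      43.78, 44.08, 46.85, 56.94])

# Two binnings with identical width (10 points), shifted by 5 points
edges_a = [30, 40, 50, 60, 70]
edges_b = [25, 35, 45, 55, 65]

counts_a, _ = np.histogram(dem_share, bins=edges_a)
counts_b, _ = np.histogram(dem_share, bins=edges_b)

print("edges", edges_a, "-> counts", counts_a.tolist())
print("edges", edges_b, "-> counts", counts_b.tolist())
```

Both binnings put the peak in the second bin, but the first makes the low tail look heavier and the second makes the high tail look heavier, even though the data are identical. A plot that shows every data point avoids this ambiguity.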

SLIDE 24

Bee swarm plot

SLIDE 25

Organization of the data frame

SLIDE 26

Organization of the data frame

SLIDE 27

Organization of the data frame

SLIDE 28

Generating a bee swarm plot

# One point per county, grouped along the x-axis by state
_ = sns.swarmplot(x='state', y='dem_share', data=df_swing)
_ = plt.xlabel('state')
_ = plt.ylabel('percent of vote for Obama')
plt.show()

SLIDE 29

2008 US swing state election results

SLIDE 30

Let's practice!

SLIDE 31

Plot all of your data: ECDFs

Justin Bois

Lecturer at the California Institute of Technology

SLIDE 32

2008 US swing state election results

SLIDE 33

2008 US election results: East and West

SLIDE 34

Empirical cumulative distribution function (ECDF)

SLIDE 35

Empirical cumulative distribution function (ECDF)

SLIDE 36

Empirical cumulative distribution function (ECDF)

SLIDE 37

Making an ECDF

import numpy as np
# x: the data in ascending order; y: fraction of points at or below each x
x = np.sort(df_swing['dem_share'])
y = np.arange(1, len(x)+1) / len(x)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('percent of vote for Obama')
_ = plt.ylabel('ECDF')
plt.margins(0.02)  # Keeps data off plot edges
plt.show()
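The two lines that build `x` and `y` can be wrapped in a small reusable function. The sketch below is one way to do that; the function name `ecdf` is a hypothetical helper, and the input is just the nine `dem_share` values visible on the earlier slide. It also shows how to read a value off the ECDF: the fraction of observations at or below a given threshold.

```python
import numpy as np


def ecdf(data):
    """Return (x, y) for the ECDF of a one-dimensional data set.

    x holds the data sorted in ascending order; y[i] is the fraction of
    data points less than or equal to x[i].
    """
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    y = np.arange(1, n + 1) / n
    return x, y


# The nine dem_share values shown on the slide (an excerpt of the data)
dem_share = [60.08, 40.64, 36.07, 41.21, 31.04, 43.78, 44.08, 46.85, 56.94]
x, y = ecdf(dem_share)

# Fraction of these counties where the Democratic share was at most 50%.
# searchsorted with side="right" counts the sorted values <= 50.
frac_at_most_50 = np.searchsorted(x, 50.0, side="right") / len(x)

print(x)
print(y)
print(frac_at_most_50)
```

Because every data point appears on the plot and no bins are chosen, the ECDF does not suffer from binning bias.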

SLIDE 38

2008 US swing state election ECDF

SLIDE 39

2008 US swing state election ECDFs
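Several ECDFs can share one set of axes by computing an `(x, y)` pair per group. The sketch below is only an illustration under stated assumptions: it groups the nine-row excerpt from the earlier slide by `state` (so Ohio has a single county here), and the names `df_preview` and `ecdf_by_state` are hypothetical. The plotting call is left as a comment so the block runs without a display.

```python
import numpy as np
import pandas as pd

# Excerpt of the data as shown on the earlier slide (not the full file)
df_preview = pd.DataFrame({
    "state": ["PA", "PA", "PA", "PA", "PA", "PA", "PA", "PA", "OH"],
    "dem_share": [60.08, 40.64, 36.07, 41.21, 31.04,
                  43.78, 44.08, 46.85, 56.94],
})

# One ECDF per state, stored as state -> (x, y)
ecdf_by_state = {}
for state, group in df_preview.groupby("state"):
    x = np.sort(group["dem_share"].to_numpy())
    y = np.arange(1, len(x) + 1) / len(x)
    ecdf_by_state[state] = (x, y)
    # With Matplotlib imported as plt, each pair could be overlaid with:
    # plt.plot(x, y, marker='.', linestyle='none', label=state)

for state, (x, y) in ecdf_by_state.items():
    print(state, len(x), "counties")
```

Each state's `y` values still run up to 1 regardless of how many counties it has, which is what makes overlaid ECDFs directly comparable.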

SLIDE 40

Let's practice!

SLIDE 41

Onward toward the whole story!

Justin Bois

Lecturer at the California Institute of Technology

SLIDE 42

SLIDE 43

“Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.” — John Tukey

SLIDE 44

Coming up…

Thinking probabilistically
Discrete and continuous distributions
The power of hacker statistics using np.random

SLIDE 45

Let's practice!
