Introduction to Exploratory Data Analysis


  1. Introduction to Exploratory Data Analysis. STATISTICAL THINKING IN PYTHON (PART 1). Justin Bois, Lecturer at the California Institute of Technology

  2. Exploratory data analysis: the process of organizing, plotting, and summarizing a data set

  3. “Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.” - John Tukey

  4. 2008 US swing state election results. Data retrieved from Data.gov (https://www.data.gov/)

  5. 2008 US swing state election results

      import pandas as pd

      df_swing = pd.read_csv('2008_swing_states.csv')
      df_swing[['state', 'county', 'dem_share']]

         state              county  dem_share
      0     PA         Erie County      60.08
      1     PA     Bradford County      40.64
      2     PA        Tioga County      36.07
      3     PA       McKean County      41.21
      4     PA       Potter County      31.04
      5     PA        Wayne County      43.78
      6     PA  Susquehanna County      44.08
      7     PA       Warren County      46.85
      8     OH    Ashtabula County      56.94

  6. 2008 US swing state election results

  7. Let's practice!

  8. Plotting a histogram

  9. 2008 US swing state election results

  10. Generating a histogram

      import matplotlib.pyplot as plt

      _ = plt.hist(df_swing['dem_share'])
      _ = plt.xlabel('percent of vote for Obama')
      _ = plt.ylabel('number of counties')
      plt.show()

  11. Always label your axes

  12. 2008 US swing state election results

  13. Histograms with different binning

  14. Setting the bins of a histogram

      bin_edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
      _ = plt.hist(df_swing['dem_share'], bins=bin_edges)
      plt.show()

  15. Setting the bins of a histogram

      _ = plt.hist(df_swing['dem_share'], bins=20)
      plt.show()
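
Both forms of the `bins` argument can be checked without drawing anything, because `plt.hist` computes its counts with `np.histogram`. The sketch below is illustrative only: it assumes just the nine `dem_share` values printed in the data frame preview on slide 5, not the full CSV, and the name `dem_share_sample` is made up for this example.

```python
import numpy as np

# Assumption: only the nine dem_share values shown in the slide 5 preview,
# not the full 2008_swing_states.csv data set.
dem_share_sample = np.array(
    [60.08, 40.64, 36.07, 41.21, 31.04, 43.78, 44.08, 46.85, 56.94]
)

# A list passed as bins is read as bin edges: ten bins of width 10 from 0 to 100.
bin_edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
counts_edges, _ = np.histogram(dem_share_sample, bins=bin_edges)

# An integer passed as bins asks for that many equal-width bins
# between the smallest and largest value in the data.
counts_int, edges_int = np.histogram(dem_share_sample, bins=3)

print(counts_edges)
print(counts_int, edges_int)
```

With explicit edges the bins stay fixed no matter what the data look like; with an integer the edges move with the minimum and maximum of the data.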

  16. Seaborn: an excellent Matplotlib-based statistical data visualization package written by Michael Waskom

  17. Setting Seaborn styling

      import seaborn as sns

      sns.set()
      _ = plt.hist(df_swing['dem_share'])
      _ = plt.xlabel('percent of vote for Obama')
      _ = plt.ylabel('number of counties')
      plt.show()

  18. A Seaborn-styled histogram

  19. Let's practice!

  20. Plot all of your data: Bee swarm plots

  21. 2008 US swing state election results

  22. 2008 US swing state election results

  23. Binning bias: the same data may be interpreted differently depending on choice of bins
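
A small numerical sketch of binning bias, under stated assumptions: the nine `dem_share` values from the slide 5 preview are counted twice, using bins of the same width (10 percentage points) whose edges are shifted by 5. The variable names and both sets of edges are chosen for illustration and do not come from the slides.

```python
import numpy as np

# Assumption: only the nine dem_share values shown in the slide 5 preview.
dem_share_sample = np.array(
    [60.08, 40.64, 36.07, 41.21, 31.04, 43.78, 44.08, 46.85, 56.94]
)

# Same bin width, edges shifted by 5 percentage points.
edges_a = [30, 40, 50, 60, 70]
edges_b = [25, 35, 45, 55, 65]

counts_a, _ = np.histogram(dem_share_sample, bins=edges_a)
counts_b, _ = np.histogram(dem_share_sample, bins=edges_b)

print(counts_a)  # counts per bin with the first set of edges
print(counts_b)  # counts per bin with the shifted edges
```

With the first set of edges the counts fall off toward the right; with the shifted edges the last bin rises again, although the data are identical.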

  24. Bee swarm plot

  25. Organization of the data frame

  26. Organization of the data frame

  27. Organization of the data frame
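
The preview of `df_swing` on slide 5 shows the layout in question: one row per county, with `state`, `county`, and `dem_share` as columns. Below is a minimal sketch of that layout, rebuilt by hand from the nine previewed rows. The name `df_preview` is hypothetical, and this is not the full data set.

```python
import pandas as pd

# Assumption: only the nine rows visible in the slide 5 preview.
df_preview = pd.DataFrame({
    'state': ['PA'] * 8 + ['OH'],
    'county': [
        'Erie County', 'Bradford County', 'Tioga County', 'McKean County',
        'Potter County', 'Wayne County', 'Susquehanna County',
        'Warren County', 'Ashtabula County',
    ],
    'dem_share': [60.08, 40.64, 36.07, 41.21, 31.04, 43.78, 44.08, 46.85, 56.94],
})

# Each row is one county; each column is one property of that county.
print(df_preview.shape)

# Because state is its own column, splitting the data by state is one call.
counties_per_state = df_preview.groupby('state')['dem_share'].count()
print(counties_per_state)
```

Plotting calls that take column names, such as `sns.swarmplot(x='state', y='dem_share', data=df_swing)`, rely on this layout: the column names select what goes on each axis.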

  28. Generating a bee swarm plot

      _ = sns.swarmplot(x='state', y='dem_share', data=df_swing)
      _ = plt.xlabel('state')
      _ = plt.ylabel('percent of vote for Obama')
      plt.show()

  29. 2008 US swing state election results

  30. Let's practice!

  31. Plot all of your data: ECDFs

  32. 2008 US swing state election results

  33. 2008 US election results: East and West

  34. Empirical cumulative distribution function (ECDF)

  35. Empirical cumulative distribution function (ECDF)

  36. Empirical cumulative distribution function (ECDF)

  37. Making an ECDF

      import numpy as np

      x = np.sort(df_swing['dem_share'])
      y = np.arange(1, len(x)+1) / len(x)
      _ = plt.plot(x, y, marker='.', linestyle='none')
      _ = plt.xlabel('percent of vote for Obama')
      _ = plt.ylabel('ECDF')
      plt.margins(0.02)  # Keeps data off plot edges
      plt.show()
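
The sorting and counting steps on this slide can be wrapped in a small helper so an ECDF can be computed for any one-dimensional array. This is a sketch rather than part of the slides: the function name `ecdf` and the sample array are assumptions, and the sample holds only the nine `dem_share` values previewed on slide 5.

```python
import numpy as np


def ecdf(data):
    """Return sorted x values and matching ECDF y values for a 1-D array."""
    x = np.sort(data)
    # The i-th smallest value (counting from 1) has i/n of the data at or below it.
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


# Assumption: only the nine dem_share values shown in the slide 5 preview.
dem_share_sample = np.array(
    [60.08, 40.64, 36.07, 41.21, 31.04, 43.78, 44.08, 46.85, 56.94]
)
x, y = ecdf(dem_share_sample)

# Fraction of the sampled counties where the share was at most 50 percent.
frac_at_most_50 = np.searchsorted(x, 50, side='right') / len(x)
print(frac_at_most_50)
```

Each point (x, y) reads as: a fraction y of the observations are less than or equal to x. In this nine-county sample, seven of the nine values are at or below 50.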

  38. 2008 US swing state election ECDF

  39. 2008 US swing state election ECDFs

  40. Let's practice!

  41. Onward toward the whole story!

  43. “Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.” - John Tukey

  44. Coming up: thinking probabilistically; discrete and continuous distributions; the power of hacker statistics using np.random

  45. Let's practice!
