Introduction to spreadsheets
STREAMLINED DATA INGESTION WITH PANDAS
Amany Mahfouz, Instructor

Spreadsheets
- Also known as Excel files
- Data stored in tabular form, with cells arranged in rows and columns
- Unlike flat files, can have formatting and formulas
- Multiple spreadsheets can exist in a workbook

Loading Spreadsheets
- Spreadsheets have their own loading function in pandas: read_excel()

Loading Spreadsheets

    import pandas as pd

    # Read the Excel file
    survey_data = pd.read_excel("fcc_survey.xlsx")

    # View the first 5 lines of data
    print(survey_data.head())

        Age  AttendedBootcamp  ...              SchoolMajor  StudentDebtOwe
    0  28.0               0.0  ...                      NaN           20000
    1  22.0               0.0  ...                      NaN             NaN
    2  19.0               0.0  ...                      NaN             NaN
    3  26.0               0.0  ...  Cinematography And Film            7000
    4  20.0               0.0  ...                      NaN             NaN

    [5 rows x 98 columns]

Loading Select Columns and Rows
- read_excel() has many keyword arguments in common with read_csv()
- nrows: limit number of rows to load
- skiprows: specify number of rows or row numbers to skip
- usecols: choose columns by name, positional number, or letter (e.g. "A:P")

Loading Select Columns and Rows

    # Read columns W-AB and AR of file, skipping metadata header
    survey_data = pd.read_excel("fcc_survey_with_headers.xlsx",
                                skiprows=2,
                                usecols="W:AB, AR")

    # View data
    print(survey_data.head())

       CommuteTime            CountryCitizen  ...  EmploymentFieldOther    EmploymentStatus   Income
    0         35.0  United States of America  ...                   NaN  Employed for wages  32000.0
    1         90.0  United States of America  ...                   NaN  Employed for wages  15000.0
    2         45.0  United States of America  ...                   NaN  Employed for wages  48000.0
    3         45.0  United States of America  ...                   NaN  Employed for wages  43000.0
    4         10.0  United States of America  ...                   NaN  Employed for wages   6000.0

    [5 rows x 7 columns]
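Because nrows, skiprows, and usecols are shared with read_csv(), their effect can be tried without a workbook. The following is a minimal sketch, assuming a small made-up CSV held in memory; the column names and values are hypothetical and do not come from the survey file. Excel letter ranges such as "W:AB" apply only to read_excel() and are not shown here.

```python
import io

import pandas as pd

# Hypothetical data: two metadata lines, then a header row and four records
raw_text = """Survey export
Generated for illustration only
Age,Income,Country,Status
28,32000,USA,Employed
22,15000,USA,Student
19,48000,Canada,Employed
26,43000,USA,Employed
"""

subset = pd.read_csv(
    io.StringIO(raw_text),
    skiprows=2,                 # skip the two metadata lines above the header
    nrows=3,                    # load only the first three records
    usecols=["Age", "Income"],  # keep columns by name
)

print(subset)
```

The result should hold three rows and the two requested columns; the fourth record and the Country and Status columns are never loaded.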

Let's practice!

Getting data from multiple worksheets

Selecting Sheets to Load
- read_excel() loads the first sheet in an Excel file by default
- Use the sheet_name keyword argument to load other sheets
- Specify spreadsheets by name and/or (zero-indexed) position number
- Pass a list of names/numbers to load more than one sheet at a time
- Any arguments passed to read_excel() apply to all sheets read

Loading Select Sheets

    # Get the second sheet by position index
    survey_data_sheet2 = pd.read_excel('fcc_survey.xlsx',
                                       sheet_name=1)

    # Get the second sheet by name
    survey_data_2017 = pd.read_excel('fcc_survey.xlsx',
                                     sheet_name='2017')

    print(survey_data_sheet2.equals(survey_data_2017))

    True

Loading All Sheets
- Passing sheet_name=None to read_excel() reads all sheets in a workbook

    survey_responses = pd.read_excel("fcc_survey.xlsx", sheet_name=None)
    print(type(survey_responses))

    <class 'collections.OrderedDict'>

    for key, value in survey_responses.items():
        print(key, type(value))

    2016 <class 'pandas.core.frame.DataFrame'>
    2017 <class 'pandas.core.frame.DataFrame'>

Putting It All Together

    # Collect the data frames from every loaded sheet in a list
    frames = []

    # Iterate through data frames in dictionary
    for sheet_name, frame in survey_responses.items():
        # Add a column so we know which year data is from
        frame["Year"] = sheet_name
        # Add each data frame to the list
        frames.append(frame)

    # Combine the list into one data frame
    # (DataFrame.append() was removed in pandas 2.0; pd.concat() replaces it)
    all_responses = pd.concat(frames)

    # View years in data
    print(all_responses.Year.unique())

    ['2016' '2017']
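As an alternative sketch of the combining step, pd.concat() can take the dictionary of data frames directly and use its keys as an outer index level. The example below assumes a hand-built dictionary shaped like the one returned with sheet_name=None; the frames and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the dictionary returned by sheet_name=None
survey_responses = {
    "2016": pd.DataFrame({"Age": [28, 22], "Income": [32000, 15000]}),
    "2017": pd.DataFrame({"Age": [19, 26, 20], "Income": [48000, 43000, 6000]}),
}

# Passing the dictionary to concat uses its keys as an outer index level
all_responses = pd.concat(survey_responses, names=["Year", "Row"])

# Move the sheet name out of the index into a regular column
all_responses = all_responses.reset_index(level="Year")

print(all_responses.Year.unique())
print(all_responses.shape)
```

This avoids modifying each frame inside a loop, at the cost of an extra reset_index() call to turn the sheet name into a column.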

Let's practice!

Modifying imports: true/false data

Boolean Data
- True/False data

pandas and Booleans

    bootcamp_data = pd.read_excel("fcc_survey_booleans.xlsx")
    print(bootcamp_data.dtypes)

    ID.x                      object
    AttendedBootcamp         float64
    AttendedBootCampYesNo     object
    AttendedBootcampTF       float64
    BootcampLoan             float64
    LoanYesNo                 object
    LoanTF                   float64
    dtype: object

pandas and Booleans

    # Count True values
    print(bootcamp_data.sum())

    AttendedBootcamp      38
    AttendedBootcampTF    38
    BootcampLoan          14
    LoanTF                14
    dtype: object

    # Count NAs
    print(bootcamp_data.isna().sum())

    ID.x                       0
    AttendedBootcamp           0
    AttendedBootCampYesNo      0
    AttendedBootcampTF         0
    BootcampLoan             964
    LoanYesNo                964
    LoanTF                   964
    dtype: int64

    # Load data, casting True/False columns as Boolean
    bool_data = pd.read_excel("fcc_survey_booleans.xlsx",
                              dtype={"AttendedBootcamp": bool,
                                     "AttendedBootCampYesNo": bool,
                                     "AttendedBootcampTF": bool,
                                     "BootcampLoan": bool,
                                     "LoanYesNo": bool,
                                     "LoanTF": bool})
    print(bool_data.dtypes)

    ID.x                     object
    AttendedBootcamp           bool
    AttendedBootCampYesNo      bool
    AttendedBootcampTF         bool
    BootcampLoan               bool
    LoanYesNo                  bool
    LoanTF                     bool
    dtype: object

    # Count True values
    print(bool_data.sum())

    AttendedBootcamp           38
    AttendedBootCampYesNo    1000
    AttendedBootcampTF         38
    BootcampLoan              978
    LoanYesNo                1000
    LoanTF                    978
    dtype: object

    # Count NA values
    print(bool_data.isna().sum())

    ID.x                     0
    AttendedBootcamp         0
    AttendedBootCampYesNo    0
    AttendedBootcampTF       0
    BootcampLoan             0
    LoanYesNo                0
    LoanTF                   0
    dtype: int64

pandas and Booleans
- pandas loads True/False columns as float data by default
- Specify a column should be bool with read_excel()'s dtype argument
- Boolean columns can only have True and False values
- NA/missing values in Boolean columns are changed to True
- pandas automatically recognizes some values as True/False in Boolean columns
- Unrecognized values in a Boolean column are also changed to True
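The points about missing and unrecognized values can be checked on small hand-made columns. This sketch assumes that casting with astype(bool) follows the same truthiness rules as the dtype argument described above; the values are invented for illustration and no workbook is read.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: 1/0 codes with a missing value, and Yes/No text
# stored with object dtype, as in the dtypes output shown earlier
coded = pd.Series([1.0, 0.0, np.nan])
text = pd.Series(["Yes", "No", "No"], dtype=object)

# Cast both columns to Boolean
coded_bool = coded.astype(bool)
text_bool = text.astype(bool)

print(coded_bool.tolist())  # the missing value becomes True
print(text_bool.tolist())   # every non-empty string becomes True
print(coded.isna().sum(), coded_bool.isna().sum())
```

After the cast the missing value can no longer be told apart from a genuine True, which is why the NA counts drop to zero once columns are loaded as bool.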

Setting Custom True/False Values
- Use read_excel()'s true_values argument to set custom True values
- Use false_values to set custom False values
- Each takes a list of values to treat as True/False, respectively
- Custom True/False values are only applied to columns set as Boolean

Setting Custom True/False Values

    # Load data with Boolean dtypes and custom T/F values
    bool_data = pd.read_excel("fcc_survey_booleans.xlsx",
                              dtype={"AttendedBootcamp": bool,
                                     "AttendedBootCampYesNo": bool,
                                     "AttendedBootcampTF": bool,
                                     "BootcampLoan": bool,
                                     "LoanYesNo": bool,
                                     "LoanTF": bool},
                              true_values=["Yes"],
                              false_values=["No"])

Setting Custom True/False Values

    print(bool_data.sum())

    AttendedBootcamp          38
    AttendedBootCampYesNo     38
    AttendedBootcampTF        38
    BootcampLoan             978
    LoanYesNo                978
    LoanTF                   978
    dtype: object
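read_csv() also accepts true_values and false_values, so the idea can be tried on in-memory text without a workbook. This is a minimal sketch with invented column names and values; it does not call read_excel(), it leaves the dtype argument out, and details such as how each function treats missing values may differ.

```python
import io

import pandas as pd

# Hypothetical Yes/No data with no missing values
raw_text = """ID,AttendedYesNo,LoanYesNo
a1,Yes,No
a2,No,No
a3,Yes,Yes
"""

bool_data = pd.read_csv(
    io.StringIO(raw_text),
    true_values=["Yes"],   # strings to read as True
    false_values=["No"],   # strings to read as False
)

print(bool_data.dtypes)
print(bool_data[["AttendedYesNo", "LoanYesNo"]].sum())
```

With the custom values in place, summing a column counts its "Yes" answers instead of treating every non-empty string as True.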
