Introduction to databases

  1. Introduction to databases
     Streamlined Data Ingestion with pandas
     Amany Mahfouz, Instructor

  2. Relational Databases
     - Data about entities is organized into tables
     - Each row or record is an instance of an entity
     - Each column has information about an attribute
     - Tables can be linked to each other via unique keys
     - Support more data, multiple simultaneous users, and data quality controls
     - Data types are specified for each column
     - SQL (Structured Query Language) to interact with databases

  3. Common Relational Databases
     - SQLite databases are computer files

  4. Connecting to Databases
     Two-step process:
     1. Create way to connect to database
     2. Query database

  5. Creating a Database Engine
     - sqlalchemy's create_engine() makes an engine to handle database connections
     - Needs string URL of database to connect to
     - SQLite URL format: sqlite:///filename.db

  6. Querying Databases
     - pd.read_sql(query, engine) to load in data from a database
     - Arguments:
       - query: String containing SQL query to run or table to load
       - engine: Connection/database engine object

  7. SQL Review: SELECT
     - Used to query data from a database
     - Basic syntax: SELECT [column_names] FROM [table_name];
     - To get all data in a table: SELECT * FROM [table_name];
     - Code style: keywords in ALL CAPS, semicolon (;) to end a statement

  8. Getting Data from a Database

         # Load pandas and sqlalchemy's create_engine
         import pandas as pd
         from sqlalchemy import create_engine

         # Create database engine to manage connections
         engine = create_engine("sqlite:///data.db")

         # Load entire weather table by table name
         weather = pd.read_sql("weather", engine)

  9. Getting Data from a Database

         # Create database engine to manage connections
         engine = create_engine("sqlite:///data.db")

         # Load entire weather table with SQL
         weather = pd.read_sql("SELECT * FROM weather", engine)
         print(weather.head())

                station                         name  latitude  ...  prcp  snow  tavg  tmax  tmin
         0  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...  0.00   0.0          52    42
         1  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...  0.00   0.0          48    39
         2  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...  0.00   0.0          48    42
         3  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...  0.00   0.0          51    40
         4  USW00094728  NY CITY CENTRAL PARK, NY US  40.77898  ...  0.75   0.0          61    50

         [5 rows x 13 columns]
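
The two loading styles shown on the slides (by table name and with SELECT *) can be tried side by side. The following is a minimal sketch, assuming pandas and SQLAlchemy are installed. Since data.db is not available here, it uses an in-memory SQLite engine (URL `sqlite://`) and a few made-up weather rows; the dates and the reduced column set are hypothetical.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite engine; stands in for "sqlite:///data.db" (assumption)
engine = create_engine("sqlite://")

# Hypothetical sample rows, only to give the queries something to read
sample = pd.DataFrame({
    "station": ["USW00094728"] * 3,
    "date": ["12/01/2017", "12/02/2017", "12/03/2017"],
    "tmax": [52, 48, 48],
    "tmin": [42, 39, 42],
})
sample.to_sql("weather", engine, index=False)

# Load the whole table by name, then with an explicit SQL query
by_name = pd.read_sql("weather", engine)
by_query = pd.read_sql("SELECT * FROM weather", engine)

print(by_name.shape, by_query.shape)
```

Both calls should return the same rows and columns; passing a table name is simply a shortcut for selecting everything.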

  10. Let's practice!

  11. Refining imports with SQL queries

  12. SELECTing Columns
      - SELECT [column_names] FROM [table_name];
      - Example: SELECT date, tavg FROM weather;
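
Selecting named columns can be checked with a small experiment. This is a minimal sketch, assuming pandas is installed; it passes a standard-library sqlite3 connection to pd.read_sql in place of a SQLAlchemy engine so that it runs without data.db, and the rows are hypothetical.

```python
import sqlite3
import pandas as pd

# In-memory database with a few hypothetical weather rows (not the course data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (date TEXT, tavg REAL, tmax INTEGER, tmin INTEGER)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?, ?)",
    [("2018-01-01", 13.0, 19, 7), ("2018-01-02", 20.0, 26, 13)],
)

# Only the named columns come back, in the order they are listed
temps = pd.read_sql("SELECT date, tavg FROM weather;", conn)
print(temps.columns.tolist())  # ['date', 'tavg']
```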

  13. WHERE Clauses
      - Use a WHERE clause to selectively import records
      - SELECT [column_names] FROM [table_name] WHERE [condition];

  14. Filtering by Numbers
      - Compare numbers with mathematical operators:
        - =
        - > and >=
        - < and <=
        - <> (not equal to)
      - Example: SELECT * FROM weather WHERE tmax > 32;
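
The difference between the comparison operators is easiest to see on a boundary value. A minimal sketch, assuming pandas is installed and using a standard-library sqlite3 connection with three hypothetical rows, one of which has tmax exactly 32:

```python
import sqlite3
import pandas as pd

# Hypothetical rows; one sits exactly on the threshold of 32
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (date TEXT, tmax INTEGER)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [("2018-01-01", 19), ("2018-01-02", 32), ("2018-01-03", 44)],
)

above = pd.read_sql("SELECT * FROM weather WHERE tmax > 32;", conn)
at_or_above = pd.read_sql("SELECT * FROM weather WHERE tmax >= 32;", conn)
not_32 = pd.read_sql("SELECT * FROM weather WHERE tmax <> 32;", conn)

print(len(above), len(at_or_above), len(not_32))  # 1 2 2
```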

  15. Filtering Text
      - Match exact strings with the = sign and the text to match
      - String matching is case-sensitive
      - Example:

          /* Get records about incidents in Brooklyn */
          SELECT *
            FROM hpd311calls
           WHERE borough = 'BROOKLYN';
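
Case sensitivity of the = comparison can be verified directly in SQLite. A minimal sketch, assuming pandas is installed; the table contents and the unique_key column are hypothetical, and a standard-library sqlite3 connection stands in for the database engine.

```python
import sqlite3
import pandas as pd

# Hypothetical rows that differ only in letter case
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hpd311calls (unique_key INTEGER, borough TEXT)")
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?, ?)",
    [(1, "BROOKLYN"), (2, "Brooklyn"), (3, "BRONX")],
)

exact = pd.read_sql("SELECT * FROM hpd311calls WHERE borough = 'BROOKLYN';", conn)
lower = pd.read_sql("SELECT * FROM hpd311calls WHERE borough = 'brooklyn';", conn)

print(len(exact), len(lower))  # 1 0
```

The lowercase pattern matches nothing, so the text in the query has to match the stored capitalization exactly.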

  16. SQL and pandas

          # Load libraries
          import pandas as pd
          from sqlalchemy import create_engine

          # Create database engine
          engine = create_engine("sqlite:///data.db")

          # Write query to get records from Brooklyn
          query = """SELECT *
                       FROM hpd311calls
                      WHERE borough = 'BROOKLYN';"""

          # Query the database
          brooklyn_calls = pd.read_sql(query, engine)
          print(brooklyn_calls.borough.unique())

          ['BROOKLYN']

  17. Combining Conditions: AND
      - WHERE clauses with AND return records that meet all conditions

          # Write query to get records about plumbing in the Bronx
          and_query = """SELECT *
                           FROM hpd311calls
                          WHERE borough = 'BRONX'
                            AND complaint_type = 'PLUMBING';"""

          # Get calls about plumbing issues in the Bronx
          bx_plumbing_calls = pd.read_sql(and_query, engine)

          # Check record count
          print(bx_plumbing_calls.shape)

          (2016, 8)

  18. Combining Conditions: OR
      - WHERE clauses with OR return records that meet at least one condition

          # Write query to get records about water leaks or plumbing
          or_query = """SELECT *
                          FROM hpd311calls
                         WHERE complaint_type = 'WATER LEAK'
                            OR complaint_type = 'PLUMBING';"""

          # Get calls that are about plumbing or water leaks
          leaks_or_plumbing = pd.read_sql(or_query, engine)

          # Check record count
          print(leaks_or_plumbing.shape)

          (10684, 8)
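
AND narrows a result while OR widens it, which a four-row toy table makes visible. A minimal sketch, assuming pandas is installed; the rows are hypothetical and a standard-library sqlite3 connection is used in place of the engine.

```python
import sqlite3
import pandas as pd

# Hypothetical calls covering several borough/complaint combinations
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hpd311calls (borough TEXT, complaint_type TEXT)")
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?, ?)",
    [
        ("BRONX", "PLUMBING"),
        ("BRONX", "WATER LEAK"),
        ("BROOKLYN", "PLUMBING"),
        ("QUEENS", "HEAT/HOT WATER"),
    ],
)

and_query = """SELECT *
                 FROM hpd311calls
                WHERE borough = 'BRONX'
                  AND complaint_type = 'PLUMBING';"""
or_query = """SELECT *
                FROM hpd311calls
               WHERE complaint_type = 'WATER LEAK'
                  OR complaint_type = 'PLUMBING';"""

both_conditions = pd.read_sql(and_query, conn)
either_condition = pd.read_sql(or_query, conn)

print(both_conditions.shape, either_condition.shape)  # (1, 2) (3, 2)
```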

  19. Let's practice!

  20. More complex SQL queries

  21. Getting DISTINCT Values
      - Get unique values for one or more columns with SELECT DISTINCT
      - Syntax: SELECT DISTINCT [column_names] FROM [table];
      - Remove duplicate records: SELECT DISTINCT * FROM [table];

          /* Get unique street addresses and boroughs */
          SELECT DISTINCT incident_address, borough
            FROM hpd311calls;
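
To see what SELECT DISTINCT removes, compare row counts with and without it. A minimal sketch, assuming pandas is installed; the addresses are invented and a standard-library sqlite3 connection stands in for the engine.

```python
import sqlite3
import pandas as pd

# Hypothetical calls: two complaints share the same address and borough
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hpd311calls (incident_address TEXT, borough TEXT, complaint_type TEXT)"
)
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?, ?, ?)",
    [
        ("1 MAIN STREET", "BRONX", "PLUMBING"),
        ("1 MAIN STREET", "BRONX", "WATER LEAK"),
        ("2 OAK AVENUE", "BROOKLYN", "PLUMBING"),
    ],
)

all_rows = pd.read_sql("SELECT * FROM hpd311calls;", conn)
addresses = pd.read_sql(
    "SELECT DISTINCT incident_address, borough FROM hpd311calls;", conn
)

print(len(all_rows), len(addresses))  # 3 2
```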

  22. Aggregate Functions
      - Query a database directly for descriptive statistics
      - Aggregate functions:
        - SUM
        - AVG
        - MAX
        - MIN
        - COUNT

  23. Aggregate Functions
      - SUM, AVG, MAX, MIN
        - Each takes a single column name
        - SELECT AVG(tmax) FROM weather;
      - COUNT
        - Get number of rows that meet query conditions:
          SELECT COUNT(*) FROM [table_name];
        - Get number of unique values in a column:
          SELECT COUNT(DISTINCT [column_names]) FROM [table_name];
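
Several aggregates can be requested in one query. A minimal sketch, assuming pandas is installed; the tmax values are hypothetical, a standard-library sqlite3 connection is used in place of the engine, and the AS aliases are an addition for readable column names rather than something the slides use.

```python
import sqlite3
import pandas as pd

# Hypothetical temperatures; the value 40 appears twice
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (tmax INTEGER)")
conn.executemany("INSERT INTO weather VALUES (?)", [(30,), (40,), (50,), (40,)])

stats = pd.read_sql(
    """SELECT AVG(tmax) AS avg_tmax,
              COUNT(*) AS n_rows,
              COUNT(DISTINCT tmax) AS n_unique
         FROM weather;""",
    conn,
)

# One summary row: average 40, four rows, three distinct values
print(stats)
```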

  24. GROUP BY
      - Aggregate functions calculate a single summary statistic by default
      - Summarize data by categories with GROUP BY statements
      - Remember to also select the column you're grouping by!

          /* Get counts of plumbing calls by borough */
          SELECT borough, COUNT(*)
            FROM hpd311calls
           WHERE complaint_type = 'PLUMBING'
           GROUP BY borough;

  25. Counting by Groups

          # Create database engine
          engine = create_engine("sqlite:///data.db")

          # Write query to get plumbing call counts by borough
          query = """SELECT borough, COUNT(*)
                       FROM hpd311calls
                      WHERE complaint_type = 'PLUMBING'
                      GROUP BY borough;"""

          # Query database and create data frame
          plumbing_call_counts = pd.read_sql(query, engine)

  26. Counting by Groups

          print(plumbing_call_counts)

                   borough  COUNT(*)
          0          BRONX      2016
          1       BROOKLYN      2702
          2      MANHATTAN      1413
          3         QUEENS       808
          4  STATEN ISLAND       178
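
The grouped count can be reproduced on toy data. A minimal sketch, assuming pandas is installed; the rows are hypothetical and a standard-library sqlite3 connection replaces the engine. Two things here go beyond the slides: the alias `n_calls`, which avoids a column literally named `COUNT(*)`, and ORDER BY, which makes the row order deterministic.

```python
import sqlite3
import pandas as pd

# Hypothetical calls; only the PLUMBING rows should be counted
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hpd311calls (borough TEXT, complaint_type TEXT)")
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?, ?)",
    [
        ("BRONX", "PLUMBING"),
        ("BRONX", "PLUMBING"),
        ("BRONX", "HEAT/HOT WATER"),
        ("BROOKLYN", "PLUMBING"),
    ],
)

query = """SELECT borough, COUNT(*) AS n_calls
             FROM hpd311calls
            WHERE complaint_type = 'PLUMBING'
            GROUP BY borough
            ORDER BY borough;"""

plumbing_call_counts = pd.read_sql(query, conn)
print(dict(zip(plumbing_call_counts.borough, plumbing_call_counts.n_calls)))
```

The WHERE clause filters rows before grouping, so the Bronx heat complaint does not contribute to the Bronx count.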

  27. Let's practice!

  28. Loading multiple tables with joins

  29. Keys
      - Database records have unique identifiers, or keys

  36. Joining Tables

          SELECT *
            FROM hpd311calls
                 JOIN weather
                 ON hpd311calls.created_date = weather.date;

      - Use dot notation (table.column) when working with multiple tables
      - Default join only returns records whose key values appear in both tables
      - Make sure join keys are the same data type or nothing will match
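
The default join's behavior of keeping only keys present in both tables can be demonstrated with a few rows. A minimal sketch, assuming pandas is installed; all rows are hypothetical, both date columns are stored as text in the same format, and a standard-library sqlite3 connection stands in for the engine.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hpd311calls (created_date TEXT, complaint_type TEXT)")
conn.execute("CREATE TABLE weather (date TEXT, tmax INTEGER)")

# Hypothetical data: 01/05/2018 has a call but no weather record,
# and 01/03/2018 has weather but no call
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?, ?)",
    [
        ("01/01/2018", "PLUMBING"),
        ("01/02/2018", "HEAT/HOT WATER"),
        ("01/05/2018", "PLUMBING"),
    ],
)
conn.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [("01/01/2018", 19), ("01/02/2018", 26), ("01/03/2018", 30)],
)

query = """SELECT *
             FROM hpd311calls
                  JOIN weather
                  ON hpd311calls.created_date = weather.date;"""

joined = pd.read_sql(query, conn)
print(joined.shape)  # (2, 4)
```

Only the two dates that appear in both tables survive, and the result carries the columns of both tables.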

  37. Joining and Filtering

          /* Get only heat/hot water calls and join in weather data */
          SELECT *
            FROM hpd311calls
                 JOIN weather
                 ON hpd311calls.created_date = weather.date
           WHERE hpd311calls.complaint_type = 'HEAT/HOT WATER';

  38. Joining and Aggregating

          /* Get call counts by borough */
          SELECT hpd311calls.borough,
                 COUNT(*)
            FROM hpd311calls
           GROUP BY hpd311calls.borough;

  39. Joining and Aggregating

          /* Get call counts by borough
             and join in population and housing counts */
          SELECT hpd311calls.borough,
                 COUNT(*),
                 boro_census.total_population,
                 boro_census.housing_units
            FROM hpd311calls
           GROUP BY hpd311calls.borough
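
The query on the last slide selects columns from boro_census, but the transcript ends before any JOIN clause for that table appears. One way it could plausibly be completed is sketched below, assuming pandas is installed and assuming (this is not confirmed by the slides) that boro_census has a `borough` column to join on. The population and housing figures are invented, and a standard-library sqlite3 connection replaces the engine.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hpd311calls (borough TEXT, complaint_type TEXT)")
# Assumed schema: the borough join column is a guess, not taken from the slides
conn.execute(
    "CREATE TABLE boro_census (borough TEXT, total_population INTEGER, housing_units INTEGER)"
)

# Invented rows for illustration only
conn.executemany(
    "INSERT INTO hpd311calls VALUES (?, ?)",
    [("BRONX", "PLUMBING"), ("BRONX", "WATER LEAK"), ("BROOKLYN", "PLUMBING")],
)
conn.executemany(
    "INSERT INTO boro_census VALUES (?, ?, ?)",
    [("BRONX", 1000, 400), ("BROOKLYN", 2000, 900)],
)

query = """SELECT hpd311calls.borough,
                  COUNT(*),
                  boro_census.total_population,
                  boro_census.housing_units
             FROM hpd311calls
                  JOIN boro_census
                  ON hpd311calls.borough = boro_census.borough
            GROUP BY hpd311calls.borough;"""

call_counts = pd.read_sql(query, conn)
print(call_counts)
```

Because each borough has exactly one census row, the join does not multiply the call rows, and the counts stay the same as in the unjoined GROUP BY query.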
