Intro to data cleaning with Apache Spark
CLEANING DATA WITH PYSPARK
Mike Metzger
Data Engineering Consultant
What is data cleaning? Data cleaning: preparing raw data for use in data processing pipelines.

Possible tasks in data cleaning:
- Reformatting or replacing text
- Performing calculations
- Removing garbage or incomplete data
Problems with typical data systems:
- Performance
- Organizing data flow

Advantages of Spark:
- Scalable
- Powerful framework for data handling
Raw data:

name         age (years)  city
Smith, John  37           Dallas
Wilson, A.   59           Chicago
null         215

Cleaned data:

last name  first name  age (months)  state
Smith      John        444           TX
Wilson     A.          708           IL
Schemas:
- Define the format of a DataFrame
- May contain various data types: strings, dates, integers, arrays
- Can filter garbage data during import
- Improve read performance
Import the schema types and define a schema:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

peopleSchema = StructType([
    # Define the name field
    StructField('name', StringType(), True),
    # Add the age field
    StructField('age', IntegerType(), True),
    # Add the city field
    StructField('city', StringType(), True)
])
Read a CSV file containing data:

people_df = spark.read.format('csv').load(path='rawdata.csv', schema=peopleSchema)
Python variables are:
- Mutable
- Flexible
- A potential source of concurrency issues
- Likely to add complexity
Immutable variables are:
- A component of functional programming
- Defined once
- Unable to be directly modified
- Re-created if reassigned
- Able to be shared efficiently
Define a new DataFrame:
voter_df = spark.read.csv('voterdata.csv')
Making changes:
voter_df = voter_df.withColumn('fullyear', voter_df.year + 2000)
voter_df = voter_df.drop(voter_df.year)
Isn't this slow?
- Transformations are lazy: they only record steps in a plan
- Actions trigger the actual computation
- This split allows efficient planning
voter_df = voter_df.withColumn('fullyear', voter_df.year + 2000)
voter_df = voter_df.drop(voter_df.year)
voter_df.count()
Difficulties with CSV files:
- No defined schema
- Nested data requires special handling
- Limited encoding formats
Spark and CSV files:
- Slow to parse
- Files cannot be filtered before loading (no "predicate pushdown")
- Any intermediate use requires redefining the schema
The Parquet format:
- A columnar data format
- Supported in Spark and other data processing frameworks
- Supports predicate pushdown
- Automatically stores schema information
Reading Parquet files:

df = spark.read.format('parquet').load('filename.parquet')
df = spark.read.parquet('filename.parquet')
Writing Parquet files:

df.write.format('parquet').save('filename.parquet')
df.write.parquet('filename.parquet')
Parquet as backing stores for SparkSQL operations
flight_df = spark.read.parquet('flights.parquet')
flight_df.createOrReplaceTempView('flights')
short_flights_df = spark.sql('SELECT * FROM flights WHERE flightduration < 100')