SLIDE 1

Intro to data cleaning with Apache Spark

CLEANING DATA WITH PYSPARK

Mike Metzger

Data Engineering Consultant

SLIDE 2

CLEANING DATA WITH PYSPARK

What is Data Cleaning?

Data Cleaning: Preparing raw data for use in data processing pipelines.

Possible tasks in data cleaning:

- Reformatting or replacing text
- Performing calculations
- Removing garbage or incomplete data
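As a quick sketch of these tasks in PySpark (the DataFrame df and its column names are assumptions for illustration, not from the slides):

from pyspark.sql.functions import col, regexp_replace

# Reformatting or replacing text: strip stray punctuation from names
df = df.withColumn('name', regexp_replace(col('name'), '[^A-Za-z, .]', ''))

# Performing calculations: derive age in months from age in years
df = df.withColumn('age_months', col('age_years') * 12)

# Removing garbage or incomplete data: drop rows missing a name
df = df.dropna(subset=['name'])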

SLIDE 3

CLEANING DATA WITH PYSPARK

Why perform data cleaning with Spark?

Problems with typical data systems:

- Performance
- Organizing data flow

Advantages of Spark:

- Scalable
- Powerful framework for data handling

SLIDE 4

CLEANING DATA WITH PYSPARK

Data cleaning example

Raw data:

name         age (years)  city
Smith, John  37           Dallas
Wilson, A.   59           Chicago
null         215

Cleaned data:

last name  first name  age (months)  state
Smith      John        444           TX
Wilson     A.          708           IL
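One possible chain of transformations from the raw rows to the cleaned rows (the underscore column names and the city-to-state lookup are assumptions for illustration):

from pyspark.sql.functions import col, split

# Hypothetical lookup table mapping each city to its state
city_states = spark.createDataFrame([('Dallas', 'TX'), ('Chicago', 'IL')], ['city', 'state'])

cleaned_df = (raw_df
    .dropna(subset=['name'])                                # remove the incomplete row
    .withColumn('last_name', split(col('name'), ', ').getItem(0))
    .withColumn('first_name', split(col('name'), ', ').getItem(1))
    .withColumn('age_months', col('age_years') * 12)        # 37 years -> 444 months
    .join(city_states, on='city', how='left')               # Dallas -> TX, Chicago -> IL
    .select('last_name', 'first_name', 'age_months', 'state'))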

SLIDE 5

CLEANING DATA WITH PYSPARK

Spark Schemas

- Define the format of a DataFrame
- May contain various data types: strings, dates, integers, arrays
- Can filter garbage data during import
- Improves read performance

SLIDE 6

CLEANING DATA WITH PYSPARK

Example Spark Schema

Import schema

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

peopleSchema = StructType([
    # Define the name field
    StructField('name', StringType(), True),
    # Add the age field
    StructField('age', IntegerType(), True),
    # Add the city field
    StructField('city', StringType(), True)
])

Read CSV file containing data

people_df = spark.read.format('csv').load('rawdata.csv', schema=peopleSchema)
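The schema also enables the "filter garbage data during import" point above. One way to do this (standard CSV reader behavior, though applying it to this dataset is an assumption) is the reader's DROPMALFORMED mode, which drops rows that do not match the schema at read time:

# Drop any rows that fail to match peopleSchema instead of importing them
people_df = (spark.read.format('csv')
    .option('mode', 'DROPMALFORMED')
    .load('rawdata.csv', schema=peopleSchema))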

SLIDE 7

Let's practice!

CLEANING DATA WITH PYSPARK

SLIDE 8

Immutability and Lazy Processing

CLEANING DATA WITH PYSPARK

Mike Metzger

Data Engineering Consultant

SLIDE 9

CLEANING DATA WITH PYSPARK

Variable review

Python variables:

- Mutable
- Flexibility
- Potential for issues with concurrency
- Likely adds complexity

SLIDE 10

CLEANING DATA WITH PYSPARK

Immutability

Immutable variables are:

- A component of functional programming
- Defined once
- Unable to be directly modified
- Re-created if reassigned
- Able to be shared efficiently

SLIDE 11

CLEANING DATA WITH PYSPARK

Immutability Example

Define a new DataFrame:

voter_df = spark.read.csv('voterdata.csv')

Making changes:

voter_df = voter_df.withColumn('fullyear', voter_df.year + 2000)
voter_df = voter_df.drop(voter_df.year)
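Neither call modifies a DataFrame in place: withColumn() and drop() each return a new DataFrame, and the assignment simply rebinds the voter_df name to that new result.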

SLIDE 12

CLEANING DATA WITH PYSPARK

Lazy Processing

Isn't this slow?

- Transformations (such as withColumn() and drop()) are lazy: they only update the plan
- Actions (such as count()) trigger Spark to actually process the data
- This allows efficient planning

voter_df = voter_df.withColumn('fullyear', voter_df.year + 2000)
voter_df = voter_df.drop(voter_df.year)
voter_df.count()
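To see the split between planning and execution yourself (explain() is a standard DataFrame method), inspect the plan before calling an action:

voter_df.explain()  # prints the planned operations; no data is processed
voter_df.count()    # an action: only now does Spark actually run the plan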

SLIDE 13

Let's practice!

CLEANING DATA WITH PYSPARK

SLIDE 14

Understanding Parquet

CLEANING DATA WITH PYSPARK

Mike Metzger

Data Engineering Consultant

SLIDE 15

CLEANING DATA WITH PYSPARK

Difficulties with CSV files

- No defined schema
- Nested data requires special handling
- Encoding format limited

SLIDE 16

CLEANING DATA WITH PYSPARK

Spark and CSV les

- Slow to parse
- Files cannot be filtered (no "predicate pushdown")
- Any intermediate use requires redefining schema

SLIDE 17

CLEANING DATA WITH PYSPARK

The Parquet Format

- A columnar data format
- Supported in Spark and other data processing frameworks
- Supports predicate pushdown
- Automatically stores schema information
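As a brief sketch of what predicate pushdown buys you (reusing the flights.parquet file and flightduration column from a later slide; explain() output varies by Spark version), a filter on a Parquet-backed DataFrame can be evaluated inside the file scan itself, skipping data that cannot match:

flight_df = spark.read.parquet('flights.parquet')

# This filter can be pushed into the Parquet scan rather than being
# applied after every row has been loaded
flight_df.filter(flight_df.flightduration < 100).explain()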

SLIDE 18

CLEANING DATA WITH PYSPARK

Working with Parquet

Reading Parquet files

df = spark.read.format('parquet').load('filename.parquet')
df = spark.read.parquet('filename.parquet')

Writing Parquet files

df.write.format('parquet').save('filename.parquet')
df.write.parquet('filename.parquet')
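One practical note (standard DataFrameWriter behavior, not from the slide): a write fails by default if the destination already exists, so repeated runs typically set a save mode first:

df.write.mode('overwrite').parquet('filename.parquet')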

SLIDE 19

CLEANING DATA WITH PYSPARK

Parquet and SQL

Parquet files as backing stores for Spark SQL operations:

flight_df = spark.read.parquet('flights.parquet')
flight_df.createOrReplaceTempView('flights')
short_flights_df = spark.sql('SELECT * FROM flights WHERE flightduration < 100')
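Since spark.sql() returns an ordinary DataFrame, the result can be written straight back to Parquet (the output path here is hypothetical):

short_flights_df.write.parquet('short_flights.parquet')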

SLIDE 20

Let's practice!

CLEANING DATA WITH PYSPARK