DataFrame column operations | Cleaning Data with PySpark | Mike Metzger | PowerPoint PPT Presentation



SLIDE 1

DataFrame column operations

CLEANING DATA WITH PYSPARK

Mike Metzger

Data Engineering Consultant

SLIDE 2

CLEANING DATA WITH PYSPARK

DataFrame refresher

DataFrames:
• Made up of rows & columns
• Immutable
• Use various transformation operations to modify data

# Return rows where name starts with "M"
voter_df.filter(voter_df.name.like('M%'))

# Return name and position only
voters = voter_df.select('name', 'position')

SLIDE 3

CLEANING DATA WITH PYSPARK

Common DataFrame transformations

Filter / Where
voter_df.filter(voter_df.date > '1/1/2019')  # or voter_df.where(...)

Select
voter_df.select(voter_df.name)

withColumn
voter_df.withColumn('year', F.year(voter_df.date))

drop
voter_df.drop('unused_column')
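A minimal sketch chaining these together; the column names are assumed from the course's voter dataset and not verified here:

import pyspark.sql.functions as F

voter_df = (voter_df
    .filter(voter_df.date > '1/1/2019')   # keep only recent rows
    .select('date', 'name', 'position')   # keep only the needed columns
    .withColumn('year', F.year('date'))   # derive a new column from an existing one
    .drop('date'))                        # remove the now-unneeded original

Because every call returns a new DataFrame, transformations chain naturally despite immutability.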

SLIDE 4

CLEANING DATA WITH PYSPARK

Filtering data

Remove nulls
voter_df.filter(voter_df['name'].isNotNull())

Remove odd entries
voter_df.filter(F.year(voter_df.date) > 1800)

Split data from combined sources
voter_df.where(voter_df['_c0'].contains('VOTE'))

Negate with ~
voter_df.where(~ voter_df._c1.isNull())

SLIDE 5

CLEANING DATA WITH PYSPARK

Column string transformations

Contained in pyspark.sql.functions

import pyspark.sql.functions as F

Applied per column as a transformation

voter_df.withColumn('upper', F.upper('name'))

Can create intermediary columns

voter_df.withColumn('splits', F.split('name', ' '))

Can cast to other types

from pyspark.sql.types import IntegerType

voter_df.withColumn('year', voter_df['_c4'].cast(IntegerType()))

SLIDE 6

CLEANING DATA WITH PYSPARK

ArrayType() column functions

Various utility functions / transformations to interact with ArrayType()

.size(<column>) - returns the length of an ArrayType() column
.getItem(<index>) - used to retrieve a specific item at the index of a list column
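A minimal sketch of both, reusing the splits column created on the previous slide; voter_df and its name column are assumed from the earlier examples, and the output column names here are hypothetical:

import pyspark.sql.functions as F

# Create an ArrayType() column by splitting the name on spaces
voter_df = voter_df.withColumn('splits', F.split('name', ' '))

# F.size() returns the number of elements in each array
voter_df = voter_df.withColumn('name_parts', F.size('splits'))

# .getItem() retrieves the element at the given index
voter_df = voter_df.withColumn('first_name', voter_df.splits.getItem(0))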

SLIDE 7

Let's practice!

CLEANING DATA WITH PYSPARK

SLIDE 8

Conditional DataFrame column operations

CLEANING DATA WITH PYSPARK

Mike Metzger

Data Engineering Consultant

SLIDE 9

CLEANING DATA WITH PYSPARK

Conditional clauses

Conditional clauses are an inline version of if / then / else:

.when()
.otherwise()

SLIDE 10

CLEANING DATA WITH PYSPARK

Conditional example

.when(<if condition>, <then x>)

df.select(df.Name, df.Age, F.when(df.Age >= 18, "Adult"))

Name      Age
Alice     14
Bob       18    Adult
Candice   38    Adult

SLIDE 11

CLEANING DATA WITH PYSPARK

Another example

Multiple .when()

df.select(df.Name, df.Age,
          F.when(df.Age >= 18, "Adult")
           .when(df.Age < 18, "Minor"))

Name      Age
Alice     14    Minor
Bob       18    Adult
Candice   38    Adult

SLIDE 12

CLEANING DATA WITH PYSPARK

Otherwise

.otherwise() is like else:

df.select(df.Name, df.Age,
          F.when(df.Age >= 18, "Adult")
           .otherwise("Minor"))

Name      Age
Alice     14    Minor
Bob       18    Adult
Candice   38    Adult
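Putting the last three slides together, a self-contained sketch; the DataFrame here is hypothetical, built only to match the tables above, and the 'Category' alias is an added name:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data matching the tables on these slides
df = spark.createDataFrame(
    [('Alice', 14), ('Bob', 18), ('Candice', 38)],
    ['Name', 'Age'])

# when/otherwise builds a single conditional column
df.select(df.Name, df.Age,
          F.when(df.Age >= 18, 'Adult')
           .otherwise('Minor')
           .alias('Category')).show()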

SLIDE 13

Let's practice!

CLEANING DATA WITH PYSPARK

SLIDE 14

User defined functions

CLEANING DATA WITH PYSPARK

Mike Metzger

Data Engineering Consultant

SLIDE 15

CLEANING DATA WITH PYSPARK

Defined...

User defined functions or UDFs:
• Python method
• Wrapped via the pyspark.sql.functions.udf method
• Stored as a variable
• Called like a normal Spark function

SLIDE 16

CLEANING DATA WITH PYSPARK

Reverse string UDF

Define a Python method

def reverseString(mystr):
    return mystr[::-1]

Wrap the function and store as a variable

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

udfReverseString = udf(reverseString, StringType())

Use with Spark

user_df = user_df.withColumn('ReverseName', udfReverseString(user_df.Name))

SLIDE 17

CLEANING DATA WITH PYSPARK

Argument-less example

import random
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def sortingCap():
    return random.choice(['G', 'H', 'R', 'S'])

udfSortingCap = udf(sortingCap, StringType())
user_df = user_df.withColumn('Class', udfSortingCap())

Name      Age   Class
Alice     14    H
Bob       18    S
Candice   63    G

SLIDE 18

Let's practice!

CLEANING DATA WITH PYSPARK

SLIDE 19

Partitioning and lazy processing

CLEANING DATA WITH PYSPARK

Mike Metzger

Data Engineering Consultant

SLIDE 20

CLEANING DATA WITH PYSPARK

Partitioning

• DataFrames are broken up into partitions
• Partition size can vary
• Each partition is handled independently
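A quick way to see this in practice, assuming the voter_df used throughout; .rdd.getNumPartitions() and .repartition() are standard DataFrame APIs:

# How many partitions back this DataFrame?
print(voter_df.rdd.getNumPartitions())

# Redistribute the data across 8 partitions; returns a new DataFrame (immutability)
voter_df = voter_df.repartition(8)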

SLIDE 21

CLEANING DATA WITH PYSPARK

Lazy processing

Transformations are lazy:
• .withColumn(...)
• .select(...)

Nothing is actually done until an action is performed:
• .count()
• .write(...)

Transformations can be re-ordered for best performance

Sometimes causes unexpected behavior
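A small sketch of laziness in action; voter_df is assumed from earlier slides and 'upper_name' is a hypothetical column name:

import pyspark.sql.functions as F

# These transformations only build a query plan; no data is processed yet
df = voter_df.withColumn('upper_name', F.upper('name'))
df = df.select('name', 'upper_name')

# The .count() action is what actually triggers execution of the plan
print(df.count())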

SLIDE 22

CLEANING DATA WITH PYSPARK

Adding IDs

Normal ID fields:
• Common in relational databases
• Most often an integer: increasing, sequential, and unique
• Not very parallel

id   last name   first name   state
0    Smith       John         TX
1    Wilson      A.           IL
2    Adams       Wendy        OR

SLIDE 23

CLEANING DATA WITH PYSPARK

Monotonically increasing IDs

pyspark.sql.functions.monotonically_increasing_id()
• 64-bit integer, increases in value, unique
• Not necessarily sequential (gaps exist)
• Completely parallel

id          last name   first name   state
0           Smith       John         TX
134520871   Wilson      A.           IL
675824594   Adams       Wendy        OR
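A minimal sketch of adding such an ID; voter_df is assumed from earlier slides and 'ROW_ID' is a hypothetical column name:

from pyspark.sql.functions import monotonically_increasing_id

# Each row receives a unique, increasing 64-bit ID; gaps between values are normal
voter_df = voter_df.withColumn('ROW_ID', monotonically_increasing_id())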

SLIDE 24

CLEANING DATA WITH PYSPARK

Notes

Remember, Spark is lazy!
• Operations can occasionally run out of order
• If performing a join, the ID may be assigned after the join
• Test your transformations

SLIDE 25

Let's practice!

CLEANING DATA WITH PYSPARK