Extract Transform Select: Introduction to Spark SQL in Python (PowerPoint PPT Presentation)



SLIDE 1

Extract Transform Select

INTRODUCTION TO SPARK SQL IN PYTHON

Mark Plutowski

Data Scientist


SLIDE 4

Extract, Transform, and Select

Extraction Transformation Selection

SLIDE 5

Built-in functions

from pyspark.sql.functions import split, explode
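
A minimal sketch of these two built-in functions working together, assuming a toy DataFrame with a single 'sentence' column (the data here is illustrative, not from the course):

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("hello world",), ("how are you",)], ["sentence"])

# split turns each sentence into an array of words
df_words = df.select(split('sentence', ' ').alias('word_array'))

# explode gives each element of the array its own row
df_words.select(explode('word_array').alias('word')).show()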

SLIDE 6

The length function

from pyspark.sql.functions import length

df.where(length('sentence') == 0)

SLIDE 7

Creating a custom function

User Defined Function (UDF)

SLIDE 8

Importing the udf function

from pyspark.sql.functions import udf

SLIDE 9

Creating a boolean UDF

print(df)
DataFrame[textdata: string]

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

SLIDE 10

Creating a boolean UDF

short_udf = udf(lambda x: True if not x or len(x) < 10 else False,
                BooleanType())

df.select(short_udf('textdata')\
          .alias("is short"))\
  .show(3)
+--------+
|is short|
+--------+
|   false|
|    true|
|   false|
+--------+

SLIDE 11

Important UDF return types

from pyspark.sql.types import StringType, IntegerType, FloatType, ArrayType
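
A sketch of how each of these types pairs with a UDF; the lambdas and names here are illustrative, not from the course:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, IntegerType, FloatType, ArrayType

# The second argument to udf() declares the Spark SQL type the lambda returns
upper_udf = udf(lambda s: s.upper() if s else None, StringType())
count_udf = udf(lambda arr: len(arr) if arr else 0, IntegerType())
ratio_udf = udf(lambda n, d: float(n) / d if d else 0.0, FloatType())
tail_udf = udf(lambda arr: arr[1:] if arr else [], ArrayType(StringType()))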

SLIDE 12

Creating an array UDF

df3.select('word array',
           in_udf('word array').alias('without endword'))\
   .show(5, truncate=30)
+-----------------------------+----------------------+
|                   word array|       without endword|
+-----------------------------+----------------------+
|[then, how, many, are, there]|[then, how, many, are]|
|                  [how, many]|                 [how]|
|             [i, donot, know]|            [i, donot]|
|                  [quite, so]|               [quite]|
|   [you, have, not, observed]|      [you, have, not]|
+-----------------------------+----------------------+

SLIDE 13

Creating an array UDF

from pyspark.sql.types import StringType, ArrayType

# Removes the last item in the array
in_udf = udf(lambda x: x[0:len(x)-1] if x and len(x) > 1 else [],
             ArrayType(StringType()))

SLIDE 14

Sparse vector format

  • Indices
  • Values

Example:

Array: [1.0, 0.0, 0.0, 3.0]
Sparse vector: (4, [0, 3], [1.0, 3.0])
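
The same example can be reproduced directly with pyspark.ml.linalg (a minimal sketch; SparseVector(size, indices, values) is the standard constructor):

from pyspark.ml.linalg import SparseVector

# 4 = length, [0, 3] = positions of the nonzero entries, [1.0, 3.0] = their values
sv = SparseVector(4, [0, 3], [1.0, 3.0])
print(sv)            # (4,[0,3],[1.0,3.0])
print(sv.toArray())  # [1. 0. 0. 3.]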

SLIDE 15

Working with vector data

hasattr(x, "toArray")
x.numNonzeros()
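
These two checks typically guard a UDF over a vector column, since the column may contain nulls or non-vector values. A minimal sketch (nnz_udf is an illustrative name):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Count nonzero entries, defaulting to 0 when x is null or not a vector
nnz_udf = udf(lambda x: x.numNonzeros()
              if (x is not None and hasattr(x, "toArray"))
              else 0,
              IntegerType())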

SLIDE 16

Let's practice!

INTRODUCTION TO SPARK SQL IN PYTHON

SLIDE 17

Creating feature data for classification

INTRODUCTION TO SPARK SQL IN PYTHON

Mark Plutowski

Data Scientist

SLIDE 18

Transforming a dense array

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

bad_udf = udf(lambda x: x.indices[0]
              if (x and hasattr(x, "toArray") and x.numNonzeros())
              else 0,
              IntegerType())

SLIDE 19

Transforming a dense array

try:
    df.select(bad_udf('outvec').alias('label')).first()
except Exception as e:
    print(e.__class__)
    print(e.errmsg)

<class 'py4j.protocol.Py4JJavaError'>
An error occurred while calling o90.collectToPython.

SLIDE 20

UDF return type must be properly cast

first_udf = udf(lambda x: int(x.indices[0])
                if (x and hasattr(x, "toArray") and x.numNonzeros())
                else 0,
                IntegerType())

SLIDE 21

The UDF in action

+-------+--------------------+-----+--------------------+-------------------+
|endword|                 doc|count|            features|             outvec|
+-------+--------------------+-----+--------------------+-------------------+
|     it|[please, do, not,...| 1149|(12847,[15,47,502...|  (12847,[7],[1.0])|
| holmes|[start, of, the, ...|  107|(12847,[0,3,183,1...|(12847,[145],[1.0])|
|      i|[the, adventures,...|  103|(12847,[0,3,35,14...| (12847,[11],[1.0])|
+-------+--------------------+-----+--------------------+-------------------+

df.withColumn('label', k_udf('outvec')).drop('outvec').show(3)
+-------+--------------------+-----+--------------------+-----+
|endword|                 doc|count|            features|label|
+-------+--------------------+-----+--------------------+-----+
|     it|[please, do, not,...| 1149|(12847,[15,47,502...|    7|
| holmes|[start, of, the, ...|  107|(12847,[0,3,183,1...|  145|
|      i|[the, adventures,...|  103|(12847,[0,3,35,14...|   11|
+-------+--------------------+-----+--------------------+-----+

SLIDE 22

CountVectorizer

ETS: Extract, Transform, Select

  • CountVectorizer is a Feature Extractor
  • Its input is an array of strings
  • Its output is a vector

SLIDE 23

Fitting the CountVectorizer

from pyspark.ml.feature import CountVectorizer

cv = CountVectorizer(inputCol='words', outputCol='features')
model = cv.fit(df)
result = model.transform(df)
print(result)
DataFrame[words: array<string>, features: vector]

# String array on the left, sparse count vector on the right
+-------------------------+--------------------------------------+
|words                    |features                              |
+-------------------------+--------------------------------------+
|[Hello, world]           |(10,[7,9],[1.0,1.0])                  |
|[How, are, you?]         |(10,[1,3,4],[1.0,1.0,1.0])            |
|[I, am, fine, thank, you]|(10,[0,2,5,6,8],[1.0,1.0,1.0,1.0,1.0])|
+-------------------------+--------------------------------------+
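
The fitted model also exposes the vocabulary it learned; each word's position in this list is its index in the output vector. A usage sketch:

print(model.vocabulary[:5])  # e.g. the five most frequent terms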

SLIDE 24

Let's practice!

INTRODUCTION TO SPARK SQL IN PYTHON

SLIDE 25

Text Classification

INTRODUCTION TO SPARK SQL IN PYTHON

Mark Plutowski

Data Scientist


SLIDE 34

Selecting the data

from pyspark.sql.functions import lit

df_true = df.where("endword in ('she', 'he', 'hers', 'his', 'her', 'him')")\
            .withColumn('label', lit(1))

df_false = df.where("endword not in ('she', 'he', 'hers', 'his', 'her', 'him')")\
             .withColumn('label', lit(0))

SLIDE 35

Combining the positive and negative data

df_examples = df_true.union(df_false)

SLIDE 36

Splitting the data into training and evaluation sets

df_train, df_eval = df_examples.randomSplit((0.60, 0.40), 42)
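
Since the split is random (here seeded with 42), the resulting sizes are only approximately 60/40; a quick sanity check:

print(df_train.count(), df_eval.count())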

SLIDE 37

Training

from pyspark.ml.classification import LogisticRegression

logistic = LogisticRegression(maxIter=50, regParam=0.6, elasticNetParam=0.3)
model = logistic.fit(df_train)

print("Training iterations: ", model.summary.totalIterations)

SLIDE 38

Let's practice!

INTRODUCTION TO SPARK SQL IN PYTHON

SLIDE 39

Predicting and evaluating

INTRODUCTION TO SPARK SQL IN PYTHON

Mark Plutowski

Data Scientist

SLIDE 40

Applying a model to evaluation data

predicted = df_trained.transform(df_test)

  • prediction column: double
  • probability column: vector of length two

x = predicted.first()
print("Right!" if x.label == int(x.prediction) else "Wrong")
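
The probability column can be inspected the same way; index 1 holds the model's probability for the positive class (a sketch):

print(x.probability[1])  # P(label == 1) for the first row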

SLIDE 41

Evaluating classification accuracy

model_stats = model.evaluate(df_eval)

type(model_stats)
pyspark.ml.classification.BinaryLogisticRegressionSummary

print("\nTest AUC: %.2f" % model_stats.areaUnderROC)

SLIDE 42

Example of classifying text

Positive labels: ['her', 'him', 'he', 'she', 'them', 'us', 'they', 'himself', 'herself', 'we']
Number of examples: 5746
Number of examples: 2873 positive, 2873 negative
Number of training examples: 4607
Number of test examples: 1139
Training iterations: 21
Test AUC: 0.87

SLIDE 43

Predicting the endword

Positive label: 'it'
Number of examples: 438
Number of examples: 219 positive, 219 negative
Number of training examples: 340
Number of test examples: 98
Test AUC: 0.85

SLIDE 44

Let's practice!

INTRODUCTION TO SPARK SQL IN PYTHON

SLIDE 45

Recap

INTRODUCTION TO SPARK SQL IN PYTHON

Mark Plutowski

Data Scientist

SLIDE 46

Recap

  • Window function SQL
  • Extract, Transform, Select
  • Train, Predict, Evaluate

SLIDE 47

Congratulations!

INTRODUCTION TO SPARK SQL IN PYTHON