slide-1
SLIDE 1

(C) 2018, SoftLang Team, University of Koblenz-Landau

Python & Spark PTT18/19

  • Prof. Dr. Ralf Lämmel
  • M.Sc. Johannes Härtel
  • M.Sc. Marcel Heinz
slide-2
SLIDE 2

The ‘Big Picture’

[Aggarwal15]

slide-3
SLIDE 3

Plenty of building blocks are involved in this ‘Big Picture’

slide-4
SLIDE 4

Back to the ‘Big Picture’

[Aggarwal15]

slide-5
SLIDE 5

Foundations

slide-6
SLIDE 6

Technologies and APIs

Several technologies and APIs support data analysis in Python; the most convenient one is Pandas. The following tutorial is inspired by the book ‘Python for Data Analysis’ [McKinney12].

slide-7
SLIDE 7

What is contained in this CSV?

Some imports and configuration needed to read and print a CSV with Pandas.

[Figure panels: CSV file, Python code]

slide-8
SLIDE 8

What is contained in this CSV?

Reading and printing CSV data with Pandas.
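
The slide's code is not preserved in this transcript. A minimal sketch of reading and printing CSV data with Pandas could look as follows; the inline CSV text and its column names (user_id, gender, title, genre, rating) are assumptions made for illustration, not the original data set.

```python
import io

import pandas as pd

# Hypothetical stand-in for the ratings CSV; the column names are assumptions.
CSV_TEXT = """user_id,gender,title,genre,rating
1,F,Toy Story,Animation,5
1,F,Heat,Action,3
2,M,Toy Story,Animation,4
2,M,Heat,Action,5
3,F,Heat,Action,2
3,F,Casino,Drama,4
"""

# Configuration: show all columns when printing wide tables.
pd.set_option("display.max_columns", None)

# read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(CSV_TEXT))
print(data)
```

With a real file, the call would take a path instead of the StringIO object.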

slide-9
SLIDE 9

What are the first 5 ratings in this CSV?

Selecting a range of rows returns another DataFrame.
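
As a small sketch (the sample data and column names are assumptions), slicing the rows of a DataFrame yields a DataFrame again:

```python
import pandas as pd

# Hypothetical ratings; titles and values are made up for illustration.
data = pd.DataFrame({
    "title": ["Toy Story", "Heat", "Casino", "Alien", "Fargo", "Seven", "Babe"],
    "rating": [5, 3, 4, 4, 5, 2, 3],
})

# A row slice selects the first five rows and returns a DataFrame.
first_five = data[:5]
print(type(first_five))
print(first_five)
```

The slice `data[:5]` selects the same rows as `data.head(5)`.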

slide-10
SLIDE 10

What is the title a rating refers to?

Selecting one column returns a Series (╯°□°)╯︵ ┻━┻

slide-11
SLIDE 11

What is the gender and the genre of a rating?

Selecting columns by passing a list returns a DataFrame ┬──┬◡ノ(° -°ノ)
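
The difference between selecting one column and selecting a list of columns can be sketched like this; the sample data and column names are assumptions.

```python
import pandas as pd

# Hypothetical ratings table; the column names are assumptions.
data = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "title": ["Toy Story", "Heat", "Casino"],
    "genre": ["Animation", "Action", "Drama"],
    "rating": [5, 4, 3],
})

# A single column label returns a Series.
titles = data["title"]

# A list of column labels returns a DataFrame, even for a one-element list.
gender_and_genre = data[["gender", "genre"]]

print(type(titles).__name__)
print(type(gender_and_genre).__name__)
```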

slide-12
SLIDE 12

What are the ratings given by female users?

First, we need a condition for filtering. Such a condition can be stated as a Series of booleans.

slide-13
SLIDE 13

What are the ratings given by female users?

We can use this condition as a selection mechanism for rows.
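
A minimal sketch of both steps, the boolean condition and the row selection, assuming a ‘gender’ column with the values ‘F’ and ‘M’:

```python
import pandas as pd

# Hypothetical ratings table; the column names and values are assumptions.
data = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "title": ["Toy Story", "Toy Story", "Heat", "Heat"],
    "rating": [5, 4, 3, 5],
})

# Comparing a column with a value yields a Series of booleans.
is_female = data["gender"] == "F"

# Indexing with that Series keeps only the rows where it is True.
female_ratings = data[is_female]
print(female_ratings)
```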

slide-14
SLIDE 14

What is the number of female and male ratings?

Let’s try this!

slide-15
SLIDE 15

What is the number of female and male ratings?

But we can also use dedicated Pandas functionality to create a Series that is indexed by the distinct values.
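
One way to sketch this is to compare manual counting via filtering with `value_counts`, which returns a Series indexed by the distinct values. The sample data is an assumption.

```python
import pandas as pd

# Hypothetical ratings table; the column names and values are assumptions.
data = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F"],
    "rating": [5, 4, 3, 5, 2],
})

# Manual approach: filter per gender and count the rows.
female_count = len(data[data["gender"] == "F"])
male_count = len(data[data["gender"] == "M"])

# Dedicated approach: one Series indexed by the distinct gender values.
counts = data["gender"].value_counts()
print(counts)
```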

slide-16
SLIDE 16

What is the number of female and male ratings?

… and we can make Python plot this.

slide-17
SLIDE 17

What is the average rating given by a user?

First, we need to group the ratings by user. The following shows how to get all ratings of one user.

slide-18
SLIDE 18

What is the average rating given by a user?

After grouping we can select the rating column and take the mean for each group.
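
The grouping and the per-group mean might be sketched as follows; the ‘user_id’ and ‘rating’ column names and the sample values are assumptions.

```python
import pandas as pd

# Hypothetical ratings table; the column names and values are assumptions.
data = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "title": ["Toy Story", "Heat", "Toy Story", "Heat", "Heat", "Casino"],
    "rating": [5, 3, 4, 5, 2, 4],
})

# Group the rows by user.
grouped = data.groupby("user_id")

# All ratings of one user (here the user with id 1).
ratings_of_user_1 = grouped.get_group(1)

# Select the rating column and take the mean within each group.
mean_per_user = grouped["rating"].mean()
print(mean_per_user)
```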

slide-19
SLIDE 19

What is the average rating given by a user?

We can also summarize the result as a boxplot.

slide-20
SLIDE 20

What is each gender’s average rating of a film?

A pivot table specifies rows and columns and aggregates the values using a passed function.
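
A minimal sketch of such a pivot table, with titles as rows, genders as columns, and the mean as the aggregation function. The sample data is an assumption; the name ‘film_mean_ratings’ follows the name used later in the slides.

```python
import pandas as pd

# Hypothetical ratings table; the column names and values are assumptions.
data = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "F"],
    "title": ["Toy Story", "Toy Story", "Heat", "Heat", "Heat", "Casino"],
    "rating": [5, 4, 3, 5, 2, 4],
})

# Rows: titles. Columns: genders. Cell: mean rating of that gender for that title.
film_mean_ratings = data.pivot_table(
    values="rating", index="title", columns="gender", aggfunc="mean"
)
print(film_mean_ratings)
```

Cells without any rating, such as a film rated by one gender only, are filled with NaN.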

slide-21
SLIDE 21

What are the top female-rated films?

i) We filter out films below a rating count of 250 to concentrate on the important candidates.

ii) We increase the max rows since this is serious data!

iii) We sort by column ‘F’ containing the average female ratings.
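
The three steps could be sketched as below. The sample data is an assumption, and the threshold is lowered to 2 (a hypothetical value) because the toy table is far smaller than the data set behind the slides, which use 250.

```python
import pandas as pd

# Hypothetical ratings table; the column names and values are assumptions.
data = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "F"],
    "title": ["Toy Story", "Toy Story", "Heat", "Heat", "Heat", "Casino"],
    "rating": [5, 4, 3, 5, 2, 4],
})

MIN_RATINGS = 2  # assumption for the toy data; the slides use 250

# i) Keep only titles with enough ratings.
ratings_per_title = data.groupby("title").size()
popular_titles = ratings_per_title[ratings_per_title >= MIN_RATINGS].index
film_mean_ratings = data.pivot_table(
    values="rating", index="title", columns="gender", aggfunc="mean"
)
popular = film_mean_ratings.loc[popular_titles]

# ii) Allow more rows to be printed.
pd.set_option("display.max_rows", 500)

# iii) Sort by the average female rating, best first.
top_female = popular.sort_values(by="F", ascending=False)
print(top_female)
```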

slide-22
SLIDE 22

What are the top female-rated films?

slide-23
SLIDE 23

What is the film with the biggest disagreement between female and male ratings?

We add a new column to the ‘film_mean_ratings’ DataFrame that holds the difference between the female and male columns.
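
A sketch of adding such a column. The sample data is an assumption, and so is the direction of the subtraction (male minus female), since the slide does not state it.

```python
import pandas as pd

# Hypothetical ratings table; the column names and values are assumptions.
data = pd.DataFrame({
    "gender": ["F", "M", "F", "F", "M", "F", "M"],
    "title": ["Toy Story", "Toy Story", "Heat", "Heat", "Heat", "Casino", "Casino"],
    "rating": [5, 4, 3, 2, 5, 4, 2],
})

film_mean_ratings = data.pivot_table(
    values="rating", index="title", columns="gender", aggfunc="mean"
)

# New column: male mean minus female mean (the sign convention is an assumption).
film_mean_ratings["diff"] = film_mean_ratings["M"] - film_mean_ratings["F"]

# Sorting by 'diff' puts films preferred by female users first,
# and films preferred by male users last.
sorted_by_diff = film_mean_ratings.sort_values(by="diff")
print(sorted_by_diff)
```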

slide-24
SLIDE 24

What is the film with the biggest disagreement between female and male ratings?

slide-25
SLIDE 25

What is the movie with the most disagreement among all viewers?

The standard deviation can be used to describe such disagreement in ratings.
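
As a sketch, the standard deviation of the ratings can be computed per title and then sorted; the sample data and column names are assumptions.

```python
import pandas as pd

# Hypothetical ratings table; the column names and values are assumptions.
data = pd.DataFrame({
    "title": ["Toy Story", "Toy Story", "Heat", "Heat", "Heat", "Casino", "Casino"],
    "rating": [5, 4, 3, 2, 5, 4, 2],
})

# Sample standard deviation of the ratings for each title.
rating_std = data.groupby("title")["rating"].std()

# The title whose ratings spread the most.
most_disputed = rating_std.idxmax()
print(rating_std.sort_values(ascending=False))
print(most_disputed)
```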

slide-26
SLIDE 26

What is the movie with the most disagreement among all viewers?

slide-27
SLIDE 27

Back to the ‘Big Picture’

[Aggarwal15]

slide-28
SLIDE 28

Data

slide-29
SLIDE 29

Data Integration (JSON)

JSON data can be loaded from a file and then accessed much like a dictionary.

[Figure panels: JSON file, Python code]

  • cf. [web_json]
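
A minimal sketch with the standard ‘json’ module; the file name and the JSON content are assumptions made so that the example is self-contained.

```python
import json
import os
import tempfile

# Hypothetical JSON content; the structure is an assumption for illustration.
JSON_TEXT = '{"title": "Heat", "year": 1995, "genres": ["Action", "Crime"]}'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "film.json")
    # Write the sample file so there is something to load.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(JSON_TEXT)
    # json.load turns JSON objects into dicts and JSON arrays into lists.
    with open(path, encoding="utf-8") as handle:
        film = json.load(handle)

print(film["title"])
print(film["genres"][0])
```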
slide-30
SLIDE 30

Data Integration (SQL)

An SQLite package (for example, Python’s built-in ‘sqlite3’ module) provides, for instance, an in-memory database.

  • cf. [web_sql]
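
A small sketch using the standard ‘sqlite3’ module with an in-memory database; the table layout and rows are assumptions.

```python
import sqlite3

# ":memory:" creates a database that lives only as long as the connection.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Hypothetical table and rows, made up for illustration.
cursor.execute("CREATE TABLE ratings (user_id INTEGER, title TEXT, rating INTEGER)")
cursor.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(1, "Heat", 3), (2, "Heat", 5), (2, "Toy Story", 4)],
)
connection.commit()

# Average rating per title, computed by the database.
cursor.execute("SELECT title, AVG(rating) FROM ratings GROUP BY title ORDER BY title")
rows = cursor.fetchall()
connection.close()

print(rows)
```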
slide-31
SLIDE 31

Data Integration (CSV)

Some CSV data needs to be combined before being processed.

  • cf. [McKinney12]
slide-32
SLIDE 32

Data Integration (CSV)

Comparable to joining tables in SQL, Pandas can merge different DataFrames.

  • cf. [McKinney12]
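
The merge could be sketched as follows; the three tables and their key columns (‘user_id’, ‘movie_id’) are assumptions.

```python
import pandas as pd

# Hypothetical tables that would normally come from separate CSV files.
users = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
movies = pd.DataFrame({"movie_id": [10, 20], "title": ["Toy Story", "Heat"]})
ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 20],
    "rating": [5, 3, 4],
})

# Inner joins on the shared key columns, similar to SQL JOIN ... ON.
data = ratings.merge(users, on="user_id").merge(movies, on="movie_id")
print(data)
```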
slide-33
SLIDE 33

Feature Extraction (Java)

The ‘right’ features need to be extracted from artifacts for further processing.

[AntoniolCCD00]

slide-34
SLIDE 34

Feature Extraction (Java)

The ‘javalang’ package provides a Java parser written in Python; it can be installed from git.

[web_jl]

slide-35
SLIDE 35

Feature Extraction (Java)

The Java abstract syntax tree can be created from a file using ‘javalang’.

slide-36
SLIDE 36

Feature Extraction (Java)

Intuitively, the most relevant feature in this artifact is the class name.

slide-37
SLIDE 37

Feature Extraction (Java)

Camel-case is split and strings are made lower-case.

SomeClassDoingNothing → Some Class Doing Nothing → some class doing nothing
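
One possible way to implement this step uses a regular expression; the function name `split_camel_case` and the pattern are illustrative choices, not taken from the slides.

```python
import re


def split_camel_case(identifier):
    """Split a camel-case identifier into lower-case words (hypothetical helper)."""
    # Alternatives: an acronym run (upper-case letters not followed by a
    # lower-case letter), a capitalized or lower-case word, or a run of digits.
    words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [word.lower() for word in words]


tokens = split_camel_case("SomeClassDoingNothing")
print(" ".join(tokens))
```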

slide-38
SLIDE 38

Back to the ‘Big Picture’

[Aggarwal15]

slide-39
SLIDE 39

Analytical Processing

slide-40
SLIDE 40

Classification

The ‘scikit-learn’ package provides support vector machines, a supervised machine learning technique for classification.

  • cf. [scikit_cls]

[Aggarwal15]
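
A minimal sketch of classification with a support vector machine in ‘scikit-learn’. The two-dimensional training points, the labels, and the choice of a linear kernel are assumptions for illustration.

```python
from sklearn import svm

# Hypothetical, clearly separated training data: two features per sample.
features = [[0, 0], [0, 1], [1, 0], [4, 4], [4, 5], [5, 4]]
labels = [0, 0, 0, 1, 1, 1]

# Supervised learning: fit the classifier on labeled samples.
classifier = svm.SVC(kernel="linear")
classifier.fit(features, labels)

# Classify unseen points.
predictions = classifier.predict([[0.5, 0.5], [4.5, 4.5]])
print(predictions)
```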

slide-41
SLIDE 41

Classification

Support vector machines in Python Spark.

[spark]

slide-42
SLIDE 42

Clustering

The ‘scipy’ package provides hierarchical clustering, an unsupervised machine learning technique, used here to group two-dimensional data.

  • cf. [web_cluster]
slide-43
SLIDE 43

Clustering

Hierarchical clustering outputs a linkage array that can be depicted as a dendrogram.

  • cf. [web_cluster]
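
A sketch of hierarchical clustering with ‘scipy’ on two-dimensional points. The points, the Ward linkage method, and the cut into two clusters are assumptions; `no_plot=True` computes the dendrogram layout without drawing it.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Hypothetical two-dimensional data forming two obvious groups.
points = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)

# The linkage array has one row per merge: the two merged clusters,
# their distance, and the size of the new cluster.
linkage_array = linkage(points, method="ward")

# Cut the hierarchy into at most two flat clusters.
cluster_ids = fcluster(linkage_array, t=2, criterion="maxclust")

# Compute the dendrogram structure; drop no_plot to draw it with Matplotlib.
tree = dendrogram(linkage_array, no_plot=True)

print(linkage_array)
print(cluster_ids)
```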
slide-44
SLIDE 44

Clustering

K-means clustering in Python Spark.

[spark]

slide-45
SLIDE 45

Back to the ‘Big Picture’

[Aggarwal15]

slide-46
SLIDE 46

Output

slide-47
SLIDE 47

Plot Types (Boxplot)

Gives a summary of the distribution of numeric variables. Packages:

  • Matplotlib
  • Seaborn
  • cf. [seaborn]
slide-48
SLIDE 48

Plot Types (Line chart)

Depicts the evolution of one or many columns. Package:

  • Matplotlib
slide-49
SLIDE 49

Plot Types (Bar chart)

Depicts the ranking present in one column. Package:

  • Matplotlib
slide-50
SLIDE 50

Plot Types (Scatter plot)

Depicts the correlation of two columns. Packages:

  • Matplotlib
  • Seaborn
slide-51
SLIDE 51

Plot Types (Pie plot)

Depicts the part-whole relation.

  • cf. [py_pie]

Package:

  • Matplotlib
slide-52
SLIDE 52

Scaling and Axis

The table shows metrics on, e.g., the code contributed by developers (column ‘DCon_PE_d’). While a few developers have very high contribution values, most developers’ contributions to a single project are very low.

slide-53
SLIDE 53

Scaling and Axis

Axes can have different scales to depict the data correctly.

slide-54
SLIDE 54

Scaling and Axis

Setting the axis to a log scale does not work because of the 0 entries.

slide-55
SLIDE 55

Scaling and Axis

However, symlog works because it scales linearly below a given threshold.
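
A sketch of a symlog axis in Matplotlib. The values are made up to mimic a skewed metric with zero entries, and the threshold of 1 is an assumption; the `linthresh` keyword is the spelling used by Matplotlib 3.3 and later.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical contribution values: many small ones, a few very large, some zero.
values = [0, 0, 1, 3, 10, 250, 4000]

fig, ax = plt.subplots()
ax.plot(range(len(values)), values, marker="o")

# Linear within [-linthresh, linthresh], logarithmic outside, so 0 can be drawn.
ax.set_yscale("symlog", linthresh=1)
scale_name = ax.get_yscale()

plt.close(fig)
print(scale_name)
```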

slide-56
SLIDE 56

Subplots

Subplots can be used to group multiple plots that optionally share axes.
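
A minimal sketch of two stacked plots that share their x-axis; the data and the choice of a line chart above a bar chart are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical data for two related plots.
x = [1, 2, 3, 4, 5]
line_values = [2, 4, 3, 6, 5]
bar_values = [10, 7, 12, 5, 9]

# Two rows, one column; sharex=True links the x-axes of both plots.
fig, (top, bottom) = plt.subplots(nrows=2, ncols=1, sharex=True)
top.plot(x, line_values)
bottom.bar(x, bar_values)

# Shared axes report the same limits.
same_x_limits = top.get_xlim() == bottom.get_xlim()
axes_count = len(fig.axes)

plt.close(fig)
print(same_x_limits)
```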

slide-57
SLIDE 57

Subplots

A sample of subplots showing the relation between API usage and lines of code for individual APIs.

slide-58
SLIDE 58

Subplots

Another sample with different kinds of subplots sharing axes.

slide-59
SLIDE 59

Back to the ‘Big Picture’

[Aggarwal15]

slide-60
SLIDE 60

References

  • [Aggarwal15] Aggarwal, Charu C. "Data Mining: The Textbook", Springer, 2015.
  • [McKinney12] McKinney, Wes. "Python for Data Analysis", 2012.
  • [AntoniolCCD00] Antoniol, Giuliano, et al. "Information retrieval models for recovering traceability links between code and documentation", ICSM, IEEE, 2000.
  • [Haslwanter16] Haslwanter, Thomas. "An Introduction to Statistics with Python", Springer, 2016.
  • [web_json] https://developer.rhino3d.com/guides/rhinopython/python-xml-json/
  • [web_sql] https://www.pythoncentral.io/introduction-to-sqlite-in-python/
  • [webGG] https://python-graph-gallery.com/
  • [web_jl] https://github.com/c2nes/javalang
  • [pandas_interpolate] https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.interpolate.html
  • [scikit_cls] http://scikit-learn.org/stable/modules/svm.html
  • [web_cluster] https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/
  • [NL_reuters] https://github.com/fergiemcdowall/reuters-21578-json.git
  • [seaborn] https://seaborn.pydata.org/
  • [py_pie] https://pythonspot.com/matplotlib-pie-chart/
  • [spark] https://spark.apache.org/docs/latest/
  • [spark_bp] https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/avoiding_shuffle_less_stage,_more_fast.html
