Data Understanding Compendium slides for Guide to Intelligent Data - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Data Understanding

Compendium slides for “Guide to Intelligent Data Analysis”, Springer 2011. © Michael R. Berthold, Christian Borgelt, Frank Höppner, Frank Klawonn and Iris Adä

slide-2
SLIDE 2

Questions in Data Understanding

Goal

Gain insight into your data

1. with respect to your project goals
2. and in general

Find answers to the questions

1. What kind of attributes do we have?
2. How is the data quality?
3. Does a visualization help?
4. Are attributes correlated?
5. What about outliers?
6. How are missing values handled?


slide-3
SLIDE 3

Attribute understanding

We (often) assume that the data set is provided in the form of a simple table.

            attribute_1   ...   attribute_m
record_1
...
record_n

The rows of the table are called instances, records or data objects.

The columns of the table are called attributes, features or variables.


slide-4
SLIDE 4

Types of attributes

categorical (nominal): finite domain. The values of a categorical attribute are often called classes or categories.
Examples: {female, male}, {ordered, sent, received}

ordinal: finite domain with a linear ordering on the domain.
Examples: {B.Sc., M.Sc., Ph.D.}

numerical: values are numbers.

discrete: categorical attribute, or numerical attribute whose domain is a subset of the integers.

continuous: numerical attribute with values in the real numbers or in an interval.


slide-5
SLIDE 5

Data quality

Low data quality makes it impossible to trust analysis results: “Garbage in, garbage out”

Accuracy: Closeness between the value in the data and the true value.

Reasons for low accuracy of numerical attributes: noisy measurements, limited precision, wrong measurements, transposition of digits (when entered manually).

Reasons for low accuracy of categorical attributes: erroneous entries, typos.


slide-6
SLIDE 6

Data quality

Syntactic accuracy: is violated if an entry is not in the domain.
Examples: fmale in gender, text in numerical attributes, ...
Can be checked quite easily.

Semantic accuracy: is violated if an entry is in the domain but not correct.
Example: John Smith is female.
Needs more information to be checked (e.g. “business rules”).

Completeness: is violated if attribute values or complete records are missing.
Example: Complete records are missing, the data is biased (a bank has rejected customers with low income).

Unbalanced data: The data set might be extremely biased towards one type of record.
Example: Defective goods are a very small fraction of all goods.

Timeliness: Is the available data up to date?
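A syntactic accuracy check can be sketched in a few lines of Python. This is only an illustration: the domain `GENDER_DOMAIN`, the function names and the sample values are assumptions made for the example, not part of the slides.

```python
# Hypothetical domain of the attribute "gender" (illustrative only).
GENDER_DOMAIN = {"female", "male"}

def syntactic_errors(values, domain):
    """Return (index, value) pairs whose value is not in the attribute's domain."""
    return [(i, v) for i, v in enumerate(values) if v not in domain]

def non_numeric_entries(values):
    """Return (index, value) pairs that cannot be parsed as a number."""
    bad = []
    for i, v in enumerate(values):
        try:
            float(v)
        except (TypeError, ValueError):
            bad.append((i, v))
    return bad

gender = ["female", "male", "fmale", "male"]
print(syntactic_errors(gender, GENDER_DOMAIN))   # [(2, 'fmale')]
print(non_numeric_entries(["1.5", "abc", "7"]))  # [(1, 'abc')]
```

Semantic accuracy cannot be checked this way, because a wrong entry such as "female" for John Smith is still inside the domain.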


slide-7
SLIDE 7

Data visualisation

Tukey: There is no excuse for failing to plot and look.


slide-8
SLIDE 8

Hidden missing values

[Figure: wind speed plotted over time]

The zero values might come from a broken or blocked sensor and might be considered missing values.
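One way to flag such hidden missing values is to treat only longer runs of zeros as suspicious, since a single zero may be a genuine calm moment. The sketch below assumes a hypothetical threshold `min_run` and made-up readings; whether a run of zeros really indicates a sensor fault has to be decided from domain knowledge.

```python
def mark_zero_runs(values, min_run=3):
    """Replace runs of at least `min_run` consecutive zeros with None (missing)."""
    result = list(values)
    run_start = None
    # A trailing None closes a run that reaches the end of the series.
    for i, v in enumerate(list(values) + [None]):
        if v == 0:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_run:
                for j in range(run_start, i):
                    result[j] = None
            run_start = None
    return result

wind_speed = [3.2, 0, 4.1, 0, 0, 0, 2.7]  # made-up readings
print(mark_zero_runs(wind_speed))
# [3.2, 0, 4.1, None, None, None, 2.7]
```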


slide-9
SLIDE 9

Bar charts

A bar chart is a simple way to depict the frequencies of the values of a categorical attribute.

[Figure: bar chart; x-axis: categorical attribute (values a to f), y-axis: frequency]
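The frequencies behind a bar chart are plain counts per category. A minimal sketch using only the standard library follows; the function name `text_bar_chart` and the sample values are illustrative assumptions, and the input is assumed to be non-empty.

```python
from collections import Counter

def text_bar_chart(values, width=40):
    """Render the frequency of each category as a row of '#' characters."""
    counts = Counter(values)
    largest = max(counts.values())
    lines = []
    for category in sorted(counts):
        bar = "#" * round(width * counts[category] / largest)
        lines.append(f"{category:>10} | {bar} {counts[category]}")
    return "\n".join(lines)

sample = ["a", "b", "a", "c", "a", "b"]  # made-up categorical attribute
print(text_bar_chart(sample, width=12))
```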


slide-10
SLIDE 10

Histograms

A histogram shows the frequency distribution for a numerical attribute. The range of the numerical attribute is discretized into a fixed number of intervals (called bins), usually of equal length. For each interval the (absolute) frequency of values falling into it is indicated by the height of a bar.

[Figure: histogram; x-axis: numerical attribute, y-axis: frequency]
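The binning described above can be sketched as follows. The function name and sample data are assumptions for illustration; the sketch uses equal-width bins, assigns the maximum value to the last bin, and requires that the minimum and maximum differ.

```python
def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum lies on the right edge, so it is put into the last bin.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

edges, counts = histogram([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], bins=5)
print(edges)   # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
print(counts)  # [2, 2, 2, 2, 3]
```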


slide-11
SLIDE 11

Histograms: Number of bins

[Figure: probability density of the distribution (attribute value vs. probability density), followed by three histograms (attribute value vs. frequency)]

Three histograms with 5, 17 and 200 bins for a sample from the same bimodal distribution.


slide-12
SLIDE 12

Histograms: Number of bins

Number of bins according to Sturges’ rule:

k = ⌈log2(n) + 1⌉

where n is the sample size. (Sturges’ rule is suitable for normally distributed data and for data sets of moderate size.)
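Sturges’ rule is easy to evaluate directly. The function name below is an assumption; the formula is the one stated above.

```python
import math

def sturges_bins(n):
    """Number of histogram bins for a sample of size n according to Sturges' rule."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    return math.ceil(math.log2(n) + 1)

print(sturges_bins(150))   # 9, e.g. for the 150 records of the Iris data
print(sturges_bins(1024))  # 11
```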


slide-13
SLIDE 13

Reminder: Median, quantiles, quartiles, interquartile range

Median: The value in the middle (for the values given in increasing order).

q%-quantile (0 < q < 100): The value for which q% of the values are smaller and (100−q)% are larger. The median is the 50%-quantile.

Quartiles: 25%-quantile (1st quartile), median (2nd quartile), 75%-quantile (3rd quartile).

Interquartile range (IQR): 3rd quartile − 1st quartile.
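These quantities can be computed with Python's standard `statistics` module (Python 3.8 or later). Note that software packages differ in how they interpolate between data points, so other tools may report slightly different quartiles; the sketch uses the module's default ("exclusive") method, and the function name is an assumption.

```python
import statistics

def quartile_summary(values):
    """Return the three quartiles and the interquartile range of the values."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {"q1": q1, "median": q2, "q3": q3, "iqr": q3 - q1}

print(quartile_summary([1, 2, 3, 4, 5, 6, 7]))
# {'q1': 2.0, 'median': 4.0, 'q3': 6.0, 'iqr': 4.0}
```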


slide-14
SLIDE 14

Example data set: Iris data

[Figures: Iris setosa, Iris versicolor, Iris virginica]

collected by E. Anderson in 1935

contains measurements of four real-valued variables: sepal length, sepal width, petal length and petal width of 150 iris flowers of types Iris Setosa, Iris Versicolor, Iris Virginica (50 each)

The fifth attribute is the name of the flower type.


slide-15
SLIDE 15

Example data set: Iris data

Sepal Length   Sepal Width   Petal Length   Petal Width   Species
5.1            3.5           1.4            0.2           Iris-setosa
...            ...
5.0            3.3           1.4            0.2           Iris-setosa
7.0            3.2           4.7            1.4           Iris-versicolor
...            ...
5.1            2.5           3.0            1.1           Iris-versicolor
5.7            2.8           4.1            1.3           Iris-versicolor
...            ...
5.9            3.0           5.1            1.8           Iris-virginica


slide-16
SLIDE 16

Boxplots


slide-17
SLIDE 17

Boxplots


slide-18
SLIDE 18

Boxplots


slide-19
SLIDE 19

Boxplots


slide-20
SLIDE 20

Boxplots


slide-21
SLIDE 21

Scatter plots

Scatter plots visualize two variables in a two-dimensional plot. Each axis corresponds to one variable.


slide-22
SLIDE 22

Scatter plots

[Figure: scatter plot of sepal width / cm against sepal length / cm; classes: Iris setosa, Iris versicolor, Iris virginica]

Scatter plots can be enriched with additional information: Colour or different symbols to incorporate a third attribute in the scatter plot.


slide-23
SLIDE 23

Scatter plots

[Figure: scatter plot of petal width / cm against petal length / cm; classes: Iris setosa, Iris versicolor, Iris virginica]

The two attributes petal length and width provide a better separation of the classes Iris versicolor and Iris virginica than the sepal length and width.


slide-24
SLIDE 24

Scatter plots

[Figure: scatter plot of petal width / cm against petal length / cm with jitter; classes: Iris setosa, Iris versicolor, Iris virginica]

Data objects with the same values cannot be distinguished in a scatter plot. To avoid this effect, jitter is used, i.e. before plotting the points, small random values are added to the coordinates. Jitter is essential for categorical attributes.
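Jitter can be sketched as adding a small uniform random offset to each coordinate before plotting. The function name, the default `amount` and the use of a uniform distribution are assumptions for this example; the offset should stay small relative to the resolution of the attribute.

```python
import random

def jitter(values, amount=0.05, seed=None):
    """Return a copy of the values with a random offset in [-amount, amount] added."""
    rng = random.Random(seed)  # a fixed seed makes the plot reproducible
    return [v + rng.uniform(-amount, amount) for v in values]

petal_length = [1.4, 1.4, 1.4, 1.5]  # identical values would overlap in the plot
print(jitter(petal_length, amount=0.05, seed=0))
```

Only the plotted coordinates are changed; the original data should be kept untouched for any further analysis.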


slide-25
SLIDE 25

Scatter plots

[Figure: scatter plot of petal width / cm against petal length / cm; classes: Iris setosa, Iris versicolor, Iris virginica]

The Iris data set with two (additional artificial) outliers. One is an outlier for the whole data set, one for the class Iris setosa.


slide-26
SLIDE 26

3D scatter plots

For data sets of moderate size, scatter plots can be extended to three dimensions.

[Figure: 3D scatter plot; axes: x1, x2, x3]

A 3D scatter plot of an artificial data set filling a cube in a chessboard-like manner with one outlier.


slide-27
SLIDE 27

Methods for higher-dimensional data

Fact:

We have only 2 to 3 dimensions for plotting data (the third can be, e.g., colour).

Principal approach for incorporating all attributes in a plot: Preserve as much of the “structure” as possible.

Define a measure that evaluates lower-dimensional representations (plots) of the data in terms of how well a representation preserves the original “structure” of the high-dimensional data set.

Find the representation (plot) that gives the best value for the defined measure.

There is no unique measure for “structure” preservation. Two very popular approaches are PCA and MDS.


slide-28
SLIDE 28

Visualisation as a test

When visualisations reveal patterns or exceptions, then there is “something” in the data set. When visualisations do not indicate anything specific, there might still be patterns or structures in the data that cannot be revealed by the corresponding (simple) visualisation techniques.


slide-29
SLIDE 29

Parallel coordinates

Parallel coordinates draw the coordinate axes parallel to each other, so that there is no limitation on the number of axes to be displayed.

For a data object, a polyline is drawn connecting the values of the data object for the attributes on the corresponding axes.


slide-30
SLIDE 30

Parallel coordinates: Iris data

[Figure: parallel coordinates plots of the Iris data; axes: sepal length, sepal width, petal length, petal width, species; classes: Iris setosa, Iris versicolor, Iris virginica]


slide-31
SLIDE 31

Parallel coordinates: “Cube data”

[Figure: parallel coordinates plot of the cube data; axes: x1, x2, x3]


slide-32
SLIDE 32

Radar plots

Radar plots are based on a similar idea as parallel coordinates, with the difference that the coordinate axes are not drawn as parallel lines but in a star-like fashion, intersecting in one point.

[Figure: radar plot for the Iris data set; axes: sepal length, sepal width, petal length, petal width; classes: Iris setosa, Iris versicolor, Iris virginica]


slide-33
SLIDE 33

Star plots

Star plots are the same as radar plots where each data object is drawn separately.

[Figure: star plot for the Iris data set; axes: sepal length, sepal width, petal length, petal width]


slide-34
SLIDE 34

Correlation Analysis

How can the similar behaviour of two attributes be shown?

Pearson’s correlation coefficient

Spearman’s rank correlation coefficient (Spearman’s rho)

more in the book ...


slide-35
SLIDE 35

Pearson’s correlation coefficient

The (sample) Pearson’s correlation coefficient is a measure for a linear relationship between two numerical attributes X and Y and is defined as

\[ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y} \]

where $\bar{x}$ and $\bar{y}$ are the mean values of the attributes X and Y, respectively. $s_x$ and $s_y$ are the corresponding (sample) standard deviations.

\[ -1 \le r_{xy} \le 1 \]

The larger the absolute value of the Pearson correlation coefficient, the stronger the linear relationship between the two attributes.

For $|r_{xy}| = 1$ the values of X and Y lie exactly on a line.

Positive (negative) correlation indicates a line with positive (negative) slope.
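The definition translates directly into code. The following is a plain-Python sketch with an assumed function name; it uses the sample standard deviations (division by n − 1) as in the formula and will fail with a division by zero if one attribute is constant.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sample standard deviations s_x and s_y.
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cross / ((n - 1) * s_x * s_y)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # close to 1: points on a rising line
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # close to -1: points on a falling line
```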


slide-36
SLIDE 36

Pearson’s correlation coefficient: Iris data set

               sepal length   sepal width   petal length   petal width
sepal length          1.000        −0.118          0.872         0.818
sepal width          −0.118         1.000         −0.428        −0.366
petal length          0.872        −0.428          1.000         0.963
petal width           0.818        −0.366          0.963         1.000


slide-37
SLIDE 37

Spearman’s rank correlation coefficient (Spearman’s rho)

Spearman’s rank correlation coefficient (Spearman’s rho) is defined as

\[ \rho = 1 - \frac{6 \sum_{i=1}^{n} \bigl(r(x_i) - r(y_i)\bigr)^2}{n(n^2 - 1)} \]

where $r(x_i)$ is the rank of value $x_i$ when we sort the list $(x_1, \ldots, x_n)$ in increasing order. $r(y_i)$ is defined analogously.

When the rankings of the x- and y-values are exactly in the same order, Spearman’s rho will yield the value 1.

If they are in reverse order, we will obtain the value −1.
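A short sketch of this formula in Python follows. The function names are assumptions, and the sketch assumes that there are no ties within the x-values or within the y-values; with ties, ranks would have to be averaged and the simple formula is no longer exact.

```python
def ranks(values):
    """Rank of each value (1 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's rho via the rank-difference formula (no ties assumed)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

print(spearman_rho([10, 20, 30, 40], [1, 4, 9, 16]))  # 1.0 (same order)
print(spearman_rho([10, 20, 30, 40], [4, 3, 2, 1]))   # -1.0 (reverse order)
```

The first call shows that rho reaches 1 for any monotonically increasing relationship, even though the relationship (here quadratic) is not linear.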


slide-38
SLIDE 38

Spearman’s rho: Iris data set

               sepal length   sepal width   petal length   petal width
sepal length          1.000        −0.167          0.882         0.834
sepal width          −0.167         1.000         −0.289        −0.289
petal length          0.882        −0.289          1.000         0.938
petal width           0.834        −0.289          0.938         1.000


slide-39
SLIDE 39

Outlier detection

An outlier is a value or data object that is far away or very different from all or most of the other data.

Causes for outliers:

1. Data quality problems (erroneous data coming from wrong measurements or typing mistakes)
2. Exceptional or unusual situations/data objects.

Outliers coming from erroneous data should be excluded from the analysis.

Even if the outliers are correct (exceptional data), it is sometimes useful to exclude them from the analysis. For example, a single extremely large outlier can lead to a completely misleading mean value.
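The effect of a single large outlier on the mean can be seen in a small made-up example; the median is shown for comparison because it is far less sensitive to extreme values.

```python
import statistics

values = [1, 2, 3, 4, 5]          # made-up measurements
with_outlier = values + [1000]    # the same data plus one extreme value

print(statistics.mean(values), statistics.median(values))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```

The mean moves from 3 to roughly 169, while the median only moves from 3 to 3.5.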


slide-40
SLIDE 40

Outlier detection: Single attributes

Categorical attributes: An outlier is a value that occurs with a frequency much lower than the frequency of all other values.

Numerical attributes:

Outliers in boxplots.

Statistical tests, for example Grubbs’ test: ... (see the exercise)
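The slides do not spell out how a boxplot marks outliers. A widely used convention, assumed here, flags values lying more than 1.5 times the interquartile range below the 1st quartile or above the 3rd quartile. The function name and the factor `k` are assumptions of this sketch, and the quartiles use the default interpolation of Python's `statistics.quantiles`.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

print(boxplot_outliers([1, 2, 3, 4, 5, 6, 7, 100]))  # [100]
```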


slide-41
SLIDE 41

Outlier detection for multidimensional data

Scatter plots for (visually detecting) outliers w.r.t. two attributes.

PCA or MDS plots for (visually detecting) outliers.

Cluster analysis techniques: Outliers are those points which cannot be assigned to any cluster.


slide-42
SLIDE 42

Missing values

For some instances, values of single attributes might be missing.

Causes for missing values:

1. broken sensors
2. refusal to answer a question
3. irrelevant attribute for the corresponding object (pregnant (yes/no) for men)

Missing values might not necessarily be indicated as missing (instead: zero or default values).


slide-43
SLIDE 43

Types of missing values

Consider the attribute $X_{\mathrm{obs}}$. A missing value is denoted by ?. X is the true value, i.e. we have $X_{\mathrm{obs}} = X$ if $X_{\mathrm{obs}} \neq {?}$. Let Y be all other attributes apart from X.

Missing completely at random (MCAR): The probability that a value for X is missing depends neither on the true value of X nor on other variables.

\[ P(X_{\mathrm{obs}} = {?}) = P(X_{\mathrm{obs}} = {?} \mid X, Y) \]

Missing at random (MAR): The probability that a value for X is missing does not depend on the true value of X (but may depend on the other attributes Y).

\[ P(X_{\mathrm{obs}} = {?} \mid Y) = P(X_{\mathrm{obs}} = {?} \mid X, Y) \]

Nonignorable: The probability that a value for X is missing depends on the true value of X.
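The three mechanisms can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: the function name, the probabilities `p_low` and `p_high`, the boolean attribute standing in for Y, and the use of the median of X as the cut-off in the nonignorable case.

```python
import random
import statistics

def make_missing(xs, ys, mechanism, rng, p_low=0.1, p_high=0.6):
    """Return a copy of xs in which some entries are replaced by None."""
    cutoff = statistics.median(xs)
    observed = []
    for x, y in zip(xs, ys):
        if mechanism == "MCAR":
            p = (p_low + p_high) / 2             # same probability for every record
        elif mechanism == "MAR":
            p = p_high if y else p_low           # depends only on the other attribute Y
        elif mechanism == "nonignorable":
            p = p_high if x > cutoff else p_low  # depends on the true value of X
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        observed.append(None if rng.random() < p else x)
    return observed

rng = random.Random(42)
x_true = [rng.gauss(50, 15) for _ in range(1000)]  # made-up true values of X
y_flag = [rng.random() < 0.5 for _ in range(1000)] # made-up boolean attribute Y
for mechanism in ("MCAR", "MAR", "nonignorable"):
    x_obs = make_missing(x_true, y_flag, mechanism, rng)
    seen = [v for v in x_obs if v is not None]
    print(mechanism, round(statistics.mean(seen), 1))
```

In the nonignorable case large values are removed more often, so the mean of the observed values tends to underestimate the true mean. In this particular simulation Y is independent of X, so the MAR case behaves much like MCAR with respect to the mean of X.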


slide-44
SLIDE 44

A checklist for data understanding

Determine the quality of the data (e.g. syntactic accuracy).

Find outliers (e.g. using visualization techniques).

Detect and examine missing values. They are possibly hidden by default values.

Discover new or confirm expected dependencies or correlations between attributes.

Check specific application-dependent assumptions (e.g. the attribute follows a normal distribution).

Compare statistics with the expected behavior.


slide-45
SLIDE 45

A checklist for data understanding: Must Do

Check the distributions for each attribute

(unexpected properties like outliers, correct domains, correct medians)

Check correlations or dependencies between pairs of attributes
