Data Exploration & Visualization MAT 6480W / STT 6705V Guy Wolf

SLIDE 1

Geometric Data Analysis

Data Exploration & Visualization

MAT 6480W / STT 6705V

Guy Wolf guy.wolf@umontreal.ca

Université de Montréal Fall 2019

MAT 6480W (Guy Wolf) Data Exploration UdeM - Fall 2019 1 / 46

SLIDE 2

Outline

1. Tabular data
  • Observations/Data-Points vs. Features/Attributes
  • Qualitative vs. Quantitative attributes
  • Qualitative: Nominal vs. Ordinal
  • Quantitative: Interval vs. Ratio

2. Summary statistics
  • Frequency, mode, & percentiles
  • Mean & median
  • Range & variance
  • Covariance & correlation
  • Data quality

3. Visualizations
  • Box plots
  • Histograms
  • Star plots

SLIDE 3

Outline (cont.)

  • Parallel coordinate plots
  • Scatter plots
  • Quiver plots

4. Transactional data
  • Term matrix
  • Text documents

5. Structured signals (e.g., audio and EEG)
  • Fourier & wavelets
  • Spectrogram & scalogram

6. Multidimensional signals (e.g., images and videos)
  • Visualization with contour plots

7. Nonparametric (affinity-/distance-based) representations
  • Graph data
  • Visualization with matrix plots

SLIDE 5

What is data?

Experimental vs. observational data

Experimental data

Data collected from strictly controlled/designed experiments with efforts made to ensure statistical validity.

Examples

  • Medical clinical trials
  • Election polls

Observational data

Data collected from “real-world” settings without control over the captured underlying phenomena. It is easier to collect and obtain, but results and conclusions from such data may be biased or inconclusive. Most data in “data science” is observational data.

SLIDE 6

Tabular Data

SLIDE 7

Tabular data

Organizing data in a table of observations-by-features is considered the most convenient and standard format for data analysis.

Example

Consider the following procedure:

1. From each machine, collect 3 temperature measurements (MOBO, CPU, GPU), 4 software parameters (CPU, RAM, HDD, # processes), and 2 power consumption values (MOBO, GPU).
2. Attach unique identifiers of the machine, OS, and hardware manufacturer.
3. Every second, store a record with these values from every machine in the system.

We end up with hundreds of thousands of records, each containing 12 fields.
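As a rough illustration of the procedure above, one record could be sketched as a Python dataclass. All field names here are hypothetical (the slide only lists what is measured); the 12 fields are the 9 measurements plus the 3 identifiers.

```python
from dataclasses import dataclass, fields


@dataclass
class MachineRecord:
    # Identifiers (hypothetical field names)
    machine_id: str
    os: str
    manufacturer: str
    # Temperature measurements
    temp_mobo: float
    temp_cpu: float
    temp_gpu: float
    # Software parameters
    cpu_usage: float
    ram_usage: float
    hdd_usage: float
    num_processes: int
    # Power consumption values
    power_mobo: float
    power_gpu: float


# One made-up observation (row); each field is a feature (column).
record = MachineRecord("m-001", "LNX", "acme", 45.0, 52.0, 60.0,
                       65.0, 40.0, 71.0, 23, 30.0, 120.0)
num_fields = len(fields(MachineRecord))
```

A collection of such records, one per machine per second, is exactly the observations-by-features table described above.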

SLIDE 8

Tabular data

Observations/Data-Points vs. Features/Attributes

Rows: observations/objects/data-points/samples/records
Columns: features/attributes/properties/fields

Timestamp        OS     Temp    · · ·   CPU    # proc
. . .            . . .  . . .           . . .  . . .
9/1/16 1:00AM    LNX    45°C    · · ·   65%    23
. . .            . . .  . . .           . . .  . . .

SLIDE 9

Tabular data

Types of features/attributes

It is important to recognize the types of values each feature/attribute takes in order to understand which operations make sense for it.

Examples

  • Can we compute an average eye color?
  • How do we compute the difference between phone numbers?
  • Can we say today is “twice as hot/cold” as yesterday?

This is similar to problems like 6 apples / 4 people = 1.5 apples per person, but 10 people / 4 car seats = 3 cars.

SLIDE 10

Tabular data

Qualitative vs. Quantitative attributes

Attribute values can be split into two types:

Qualitative attributes

Attributes that take values from a (finite) set of categories are called categorical or qualitative attributes. In some sense, they describe an object/observation, rather than measure its properties.

Quantitative attributes

Attributes that represent quantities are called numerical or quantitative attributes. They provide concrete quantifiable measurements of an object/observation.

SLIDE 11

Tabular data

Qualitative: Nominal vs. Ordinal

Qualitative attributes can be split further into two types:

Nominal attributes

Examples: zip codes, eye color, operating system, gender.
Values of such attributes just specify names without any particular order or relation between them (except for = and ≠).

Binary attributes are nominal attributes with only two values (Yes/No or 0/1). They can be symmetric or asymmetric based on whether or not their values are equally informative.

Ordinal attributes

Examples: ratings, grades, street/avenue numbers.
Values of such attributes have some order, even though they don’t specify an exact quantity.

SLIDE 12

Tabular data

Quantitative: Interval vs. Ratio

Quantitative attributes can also be split into two types:

Interval attributes

Examples: calendar dates, azimuth direction, Fahrenheit temperatures.
Such attributes represent quantities with meaningful differences (or fixed intervals) between their values (but no multiplicative relations).

Ratio attributes

Examples: mass, length, distance, currency, age, electrical current.
Such attributes represent quantities that have meaningful ratios between their values. Unlike interval attributes, ratio ones usually have an “absolute zero”.

We can also split quantities into discrete and continuous ones. All qualitative attributes are considered discrete.

SLIDE 13

Tabular data

Summary of attribute types

The types of attributes can be regarded via the operations that can be applied to them:

  • Comparison (= and ≠) - every type
  • Ordering (> and <) - every type except nominal
  • Differences (−) and addition (+) - only quantitative
  • Division (/) and multiplication (×, ·) - only ratio

Other operations (e.g., mean, median, correlation) may also be inapplicable for some types while applicable to others.
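The hierarchy of operations above can be sketched as a small lookup. This is only an illustration: the enum, the operation names, and the helper `is_meaningful` are hypothetical, chosen to mirror the four rows of the summary.

```python
from enum import IntEnum


class AttributeType(IntEnum):
    # Ordered so that each type supports every operation of the types before it.
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Weakest attribute type for which each family of operations makes sense.
MIN_TYPE_FOR_OPERATION = {
    "comparison": AttributeType.NOMINAL,   # = and !=
    "ordering": AttributeType.ORDINAL,     # > and <
    "difference": AttributeType.INTERVAL,  # - and +
    "ratio": AttributeType.RATIO,          # / and *
}


def is_meaningful(operation: str, attribute_type: AttributeType) -> bool:
    """Return True if the operation is meaningful for the attribute type."""
    return attribute_type >= MIN_TYPE_FOR_OPERATION[operation]
```

For instance, `is_meaningful("ratio", AttributeType.INTERVAL)` is false, which matches the earlier question of whether today can be “twice as hot” as yesterday on the Fahrenheit scale.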

SLIDE 14

Tabular data

Technical formats

Tabular data can be stored, collected, or given in several standard formats, such as:

  • Comma separated file (CSV)
  • Flat file or delimited text file (e.g., space or tab delimited)
  • XML or other log files
  • Proprietary formats (e.g., FCS for biological data or MAT files for Matlab data)
  • Database tables

There are several techniques and standard designs to collect and store big data in databases. Data warehouse, ETL (extract-transform-load), and OLAP (Online Analytical Processing) are some related terms encountered frequently in the IT industry.

SLIDE 15

Tabular data

Data warehouse: star and snowflake schemas

Star schema


Snowflake schema

SLIDE 17

Summary statistics

The raw representation of the data is often not convenient for initial exploration and understanding of the data. How do we get general insights into the data and its attributes as a whole?

Summary statistics

Properties that summarize global information, such as central tendency, spread, and variations of observations and features. These statistics provide an important first step in data analysis, and most of them are not difficult to compute in linear time w.r.t. the size of the data.

SLIDE 21

Summary statistics

Frequency, mode, & percentiles

Frequency

The portion (e.g., percentage) of the observations with each specific value of a categorical or discrete attribute.

Mode

The most frequent value of an attribute in the data.

Percentiles

The p-th percentile (with 0 ≤ p ≤ 100) of an attribute is a value $P_p$ such that p% of the observed values of this attribute are less than $P_p$. We typically take $P_p$ as one of the observed values of the attribute. Alternatives: quartile $Q_i$ (i = 1, 2, 3), quantile, etc.

Visual examples: stem-and-leaf displays; quantile & Q − Q plots.
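A minimal sketch of these three statistics in plain Python. The percentile uses the nearest-rank convention, which always returns one of the observed values; this choice is an assumption, since the slide does not fix a convention, and the function names are illustrative.

```python
import math
from collections import Counter


def frequencies(values):
    """Portion of the observations taking each distinct value."""
    counts = Counter(values)
    n = len(values)
    return {value: count / n for value, count in counts.items()}


def mode(values):
    """Most frequent value (ties broken by first occurrence)."""
    return Counter(values).most_common(1)[0][0]


def percentile_nearest_rank(values, p):
    """p-th percentile as an observed value (nearest-rank convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


os_values = ["LNX", "LNX", "WIN", "MAC"]
temperatures = [15, 20, 35, 40, 50]
```

With these made-up inputs, `frequencies(os_values)` assigns 0.5 to `"LNX"`, and `percentile_nearest_rank(temperatures, 50)` returns the middle observation.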

SLIDE 26

Summary statistics

Mean & median

Mean

The mean (or average) $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the most common way to measure the central location or value of data points. However, it is very sensitive to outliers. A trimmed mean is more robust to outliers by disregarding extreme values. A weighted mean also takes into account weights for each observation.

Median

The median of an attribute is a value such that half of the observed values are above it and half are below it. It is the middle value for an odd number of observations, or the average (when it makes sense) between the two middle numbers for an even number of observations. The median corresponds to $P_{50}$ and $Q_2$.
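The sensitivity of the mean to outliers, and the robustness of the trimmed mean and the median, can be seen in a small sketch. The helpers `trimmed_mean` and `weighted_mean` are illustrative names, and trimming a fixed number `k` of values from each end is one possible convention.

```python
from statistics import mean, median


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for this sample")
    return mean(ordered[k:len(ordered) - k])


def weighted_mean(values, weights):
    """Mean where each observation counts proportionally to its weight."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


data = [1, 2, 3, 4, 100]  # 100 plays the role of an outlier
```

Here `mean(data)` is pulled up to 22 by the single outlier, while `median(data)` and `trimmed_mean(data, 1)` both stay at 3.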

SLIDE 27

Summary statistics

Centrality and skewed data

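Relations between three measures of centrality (mean, median, and mode) can indicate symmetric or skewed distributions of attributes: symmetric, positively skewed, or negatively skewed. For unimodal data, a positively skewed distribution typically has mode < median < mean, and a negatively skewed one typically has mean < median < mode.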

SLIDE 30

Summary statistics

Range & variance

Range

Range is the difference between max and min observed values of an attribute

Variance

Variance $s_x^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ and standard deviation (STD) $s_x = \sqrt{s_x^2}$ are the most common ways to measure the spread of values. However, like the mean, they are sensitive to outliers.

Other spread measures include:

  • average absolute deviation - the average of $|x_i - \bar{x}|$
  • median absolute deviation - the median of $|x_i - \bar{x}|$
  • interquartile range - the difference $x_{75\%} - x_{25\%}$
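A minimal sketch of the spread measures above, using the 1/n normalization of the variance formula on this slide and deviations taken around the mean, as written above. The function name and dictionary keys are illustrative.

```python
from math import sqrt
from statistics import median


def spread_measures(values):
    """Range, variance (1/n), STD, and absolute-deviation summaries."""
    n = len(values)
    x_bar = sum(values) / n
    deviations = [x - x_bar for x in values]
    variance = sum(d * d for d in deviations) / n
    abs_deviations = [abs(d) for d in deviations]
    return {
        "range": max(values) - min(values),
        "variance": variance,
        "std": sqrt(variance),
        "average_abs_dev": sum(abs_deviations) / n,
        "median_abs_dev": median(abs_deviations),
    }


measures = spread_measures([2, 4, 4, 4, 5, 5, 7, 9])
```

The interquartile range is omitted here because it depends on the percentile convention in use.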

SLIDE 33

Summary statistics

Covariance & correlation

Covariance

Measures the degree to which attributes vary together and is computed by $\mathrm{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$. This value depends on the magnitude/spread of the attribute values.

For k attributes, these form a k × k covariance matrix, with variances $s_x^2 = \mathrm{cov}(x, x)$ on its diagonal.

Correlation

A value between −1 and 1 that indicates how strongly two attributes are (linearly) related; values near ±1 indicate a strong relation and values near 0 a weak one. Pearson correlation: $\mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{s_x s_y}$. Notice that it is independent of magnitudes/spreads and $\mathrm{corr}(x, x) = 1$. Notice also that Pearson correlation is the covariance (or dot product, up to the factor 1/n) between standardized attributes.

SLIDE 34

Summary statistics

Covariance & correlation

Correlation

Writing out the Pearson correlation shows that it is the covariance between standardized attributes:

$$\mathrm{corr}(x, y) = s_x^{-1} s_y^{-1} \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
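A minimal sketch of covariance and Pearson correlation with the 1/n normalization used in these slides. The function names and the sample data are illustrative.

```python
from math import sqrt


def covariance(x, y):
    """cov(x, y) with 1/n normalization."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / n


def pearson(x, y):
    """Pearson correlation: cov(x, y) / (s_x * s_y)."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]   # x rescaled by 10
neg = [-v for v in x]          # x with the sign flipped
```

Rescaling changes the covariance but not the correlation: `pearson(x, y)` is 1 up to rounding, while `pearson(x, neg)` is −1, which shows that the correlation takes negative values for inversely related attributes.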

SLIDE 35

Summary statistics

Misleading example: pirates & global warming

Taken from Wikipedia

SLIDE 36

Summary statistics

Data quality

Summary statistics enable identification of various data quality issues, such as

Precision

The closeness of repeated measurements to one another.

Bias

A systematic variation of measurements from the quantity being measured.

Accuracy

The closeness of measurements to the true value of the quantity being measured.

Other issues include missing values, outliers, and duplicate values.
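One common way to put numbers on precision and bias, sketched here as an assumption rather than as the definition used in the course: precision as the standard deviation of repeated measurements, and bias as the difference between their mean and a known true value. The measurement values are made up.

```python
from statistics import mean, pstdev


def bias(measurements, true_value):
    """Systematic deviation: mean of the measurements minus the true value."""
    return mean(measurements) - true_value


def precision(measurements):
    """Closeness of repeated measurements, as their standard deviation (1/n)."""
    return pstdev(measurements)


# Five repeated weighings of a reference mass known to be exactly 1 g.
measurements = [1.015, 0.990, 1.013, 1.001, 0.986]
measured_bias = bias(measurements, 1.0)
measured_precision = precision(measurements)
```

A small bias with a large standard deviation means the measurements are correct on average but not repeatable; accuracy requires both to be small.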

SLIDE 37

Visualizations

Why do we need visualizations?

While summary statistics provide useful information about the data, they can be overwhelming and hard to track when many attributes are considered.

Visualization

Conversion of data into visual elements that express characteristics, relationships, and information about data points and attributes. Visualizations provide graphic representations that enable us to draw insights at a single glance.

SLIDE 39

Visualizations

Why do we need visualizations?

Example (TreeMap)

Taken from Wikipedia

SLIDE 41

Visualizations

What constitutes a good visualization?

No really good answer... but there are some general guidelines:

ACCENT principles

  • Apprehension: we can correctly perceive relations among variables.
  • Clarity: visually distinguish important relations and elements.
  • Consistency: comparing graphical elements/displays shows faithful (dis)similarities in the data.
  • Efficiency: simplify complex relations and patterns in the visualization.
  • Necessity: only include necessary graphical elements - no extraneous elements.
  • Truthfulness: true values (absolute or relative) can be determined from graphical elements.

SLIDE 42

Visualizations

Box plots

Box plots (invented by J. Tukey) show a five-number summary of the distribution of attribute values based on percentiles. The plot marks, from top to bottom:

  • Outliers (points beyond the 10th/90th percentiles)
  • 90th percentile
  • 75th percentile
  • Median
  • 25th percentile
  • 10th percentile
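The five numbers behind such a box plot can be computed directly. This sketch assumes the nearest-rank percentile convention and treats every value outside the 10th-90th percentile band as an outlier, following the labels of this slide; other box plot variants place the whiskers differently. Function names are illustrative.

```python
import math


def percentile(values, p):
    """p-th percentile as an observed value (nearest-rank convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def box_plot_summary(values, low=10, high=90):
    """Five-number summary plus the values drawn as outliers."""
    summary = {p: percentile(values, p) for p in (low, 25, 50, 75, high)}
    outliers = [v for v in values if v < summary[low] or v > summary[high]]
    return summary, outliers


summary, outliers = box_plot_summary(list(range(1, 101)))
```

For the integers 1 to 100, the box spans 25 to 75 with the median at 50, and the whiskers end at 10 and 90.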

SLIDE 46

Visualizations

Histograms

Taken from: Pierchala, C. “The choice of age groupings may affect the quality of tabular presentations.” 2002.

SLIDE 48

Visualizations

Star plots

Taken from: www.coffeeanalysts.com/2011/11/coffee-spider-graphs-explained/

SLIDE 49

Visualizations

Parallel coordinate plots

Notice that attributes in this case do not have a particular order.

SLIDE 50

Visualizations

Scatter plots

SLIDE 51

Visualizations

Quiver plots

SLIDE 52

Non-tabular Data

SLIDE 53

Transactional data

In transactional data, each observation is a transaction that contains a collection of items or sequence of events.

Example

Market basket data

  • Customer #1: {milk, bread, butter}
  • Customer #2: {orange juice, milk}
  • Customer #3: {orange juice, peanut butter, jelly, bread}
  • . . .

Transaction items can also contain numerical attributes, such as the number of purchased items (e.g., 3 boxes of cookies) or their price. When sequences (e.g., events, actions, or genes) are considered, temporal/order information may also be included.

SLIDE 54

Transactional data

Term matrix

In some cases, transactional data can be converted to tabular form by considering a term matrix (a.k.a. bag of words/features techniques).

Example

CustomerID   milk  bread  butter  O.J.  cheese  P.B.  jelly
Customer#1   1     1      1       0     0       0     0
Customer#2   1     0      0       1     0       0     0
Customer#3   0     1      0       1     0       1     1
. . .        . . . . . . . . . . . . . . . . . . . . .

This representation loses sequential information, and applying it to continuous values requires a discretization step.
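The conversion from market baskets to a term matrix can be sketched as follows. The baskets are the three customers from the market basket example; the vocabulary order and the function name `term_matrix` are illustrative.

```python
transactions = {
    "Customer#1": ["milk", "bread", "butter"],
    "Customer#2": ["orange juice", "milk"],
    "Customer#3": ["orange juice", "peanut butter", "jelly", "bread"],
}

# Columns of the term matrix (one per known item).
vocabulary = ["milk", "bread", "butter", "orange juice",
              "cheese", "peanut butter", "jelly"]


def term_matrix(transactions, vocabulary):
    """Binary row per transaction: 1 if the item appears in it, else 0."""
    return {
        customer: [1 if term in items else 0 for term in vocabulary]
        for customer, items in transactions.items()
    }


matrix = term_matrix(transactions, vocabulary)
```

Each row is now a fixed-length feature vector, so tabular tools apply, but the order in which items were bought is no longer recoverable.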

SLIDE 57

Transactional data

Text documents

Text documents can be considered as transactional data in one of two ways:

1. Each document can be considered as a big transaction containing words. Bag of words techniques ignore grammatical structures and represent a document as a histogram of word occurrences. Similar approaches can also be applied to images, questionnaires, etc., with an appropriate dictionary-building clustering step.

2. A document can be considered as a transactional dataset on its own, which contains word contexts (e.g., with n-grams or skip-grams). Word2vec techniques use this approach to associate numerical coordinates (typically in $\mathbb{R}^{300}$) to words based on contexts in which they appear.
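Both views of a text document can be sketched in a few lines. The whitespace tokenizer and the function names are simplifying assumptions; the second function only produces the (word, context word) pairs that skip-gram style methods train on, not the embedding itself.

```python
from collections import Counter


def bag_of_words(text):
    """Histogram of word occurrences, ignoring order and grammar."""
    return Counter(text.lower().split())


def context_pairs(tokens, window=2):
    """(center, context) pairs for every word within the given window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


doc = "the cat sat on the mat"
histogram = bag_of_words(doc)
pairs = context_pairs(["a", "b", "c"], window=1)
```

In the first view the whole document is one transaction (`histogram`); in the second, each pair in `pairs` is a small transaction of its own.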

SLIDE 58

Transactional data

Text documents

Example (Term analysis of Donald Trump’s tweets)

Figure panels: most frequent words; iPhone vs. Android.

Taken from varianceexplained.org/r/trump-tweets/

SLIDE 59

Transactional data

Text documents

Taken from github.com/aubry74/visual-word2vec/

SLIDE 60

Structured signals

Structured signals have well-known relations between their “attributes”. They are typically numerical, with temporal or spatial ordering.

Examples

  • Audio recordings
  • EEG signals
  • Heart rate
  • Room temperatures

Each data-point is then a signal collected over time (or space), and can be analyzed with signal processing tools.

SLIDE 62

Structured signals

Fourier & power spectrum

Figure panels: time series; power spectrum.
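To make the relation between a time series and its power spectrum concrete, here is a minimal sketch using a naive discrete Fourier transform. The normalization |X_k|²/n is one of several conventions and is an assumption here, as are the function name and the test signal; a real analysis would use an FFT routine.

```python
import cmath
import math


def power_spectrum(signal):
    """Power |X_k|^2 / n of each DFT coefficient X_k (naive O(n^2) DFT)."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


# A pure cosine completing 2 cycles over 16 samples.
signal = [math.cos(2 * math.pi * 2 * t / 16) for t in range(16)]
power = power_spectrum(signal)
```

All the power of this signal concentrates at frequency index 2 (and its mirror image at index 14), which is what a peak in a power spectrum plot conveys.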

SLIDE 64

Structured signals

STFT & wavelets

STFT

SLIDE 65

Structured signals

STFT & wavelets

Wavelets

SLIDE 66

Structured signals

STFT & wavelets

Haar wavelets (figure panels: lowpass, scale 1, scale 2)

SLIDE 68

Structured signals

Spectrogram & scalogram

Figure panels: spectrogram; scalogram.

SLIDE 69

Multidimensional signals

Multidimensional signals are similar, but have several coordinates that specify the relations between their “attributes”.

Examples

  • Grayscale images have two spatial coordinates that determine pixel positions.
  • Videos have two spatial coordinates and one temporal coordinate that determine pixel positions.
  • Geographic data has two or three coordinates determining longitude, latitude, and elevation.
  • Colored and hyperspectral images have two spatial coordinates and one spectral/channel coordinate.

In general, many signal processing approaches can be extended from one-dimensional signals to multidimensional ones.

SLIDE 70

Multidimensional signals

Two-dimensional wavelets

SLIDE 74

Multidimensional signals

Visualization with contour plots

SLIDE 76

Nonparametric representations

In some cases, the important information in the data is the relations between data points, rather than their attributes.

Examples

  • Spatial locations and trajectories
  • Phone calls and email correspondences
  • Gene interactions and cell progressions

In these cases an affinity matrix between data points, based on similarity or distances, can be used for analysis. Essentially, each data point is represented by its relations to other data points rather than by its own attributes.
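As an illustration of an affinity matrix, the sketch below converts pairwise Euclidean distances into similarities with a Gaussian kernel. The kernel and its bandwidth `sigma` are assumptions made for this example (the slide does not prescribe a particular affinity), and the points are made up.

```python
import math


def gaussian_affinity(points, sigma=1.0):
    """Affinity K[i][j] = exp(-||p_i - p_j||^2 / (2 * sigma^2))."""
    n = len(points)
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            squared_distance = sum((a - b) ** 2
                                   for a, b in zip(points[i], points[j]))
            affinity[i][j] = math.exp(-squared_distance / (2 * sigma ** 2))
    return affinity


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
K = gaussian_affinity(points)
```

Row `i` of `K` describes point `i` purely through its similarity to every other point: nearby points get values close to 1, distant ones values close to 0.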

SLIDE 82

Nonparametric representations

Graph data

Graphs can be used to formalize relations in data in two ways:

1. Relationships between attributes can form graphs (e.g., molecule data, such as benzene, C6H6). In this case each data point is a graph on its own, and this is a more complicated example of structured data.

2. The graph is considered as the dataset, and each node is a data point (e.g., social networks and web-reference data). In this case, an adjacency matrix can form an affinity matrix. Conversely, affinity matrices can form adjacency matrices, so nonparametric data is often considered as graph data.

Spectral graph methods (e.g., SVD of the graph Laplacian) can be used to associate coordinates to data points in the second case, for visualization with scatter plots and further analysis.
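For the second case, the starting point of spectral methods can be sketched by building a graph Laplacian from an adjacency (or affinity) matrix. The unnormalized form L = D − A is used here as an assumption, since the slide does not say which Laplacian is meant, and the decomposition step that yields the coordinates is not shown.

```python
def graph_laplacian(adjacency):
    """Unnormalized Laplacian L = D - A, with D the diagonal degree matrix."""
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    return [
        [(degrees[i] if i == j else 0) - adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]


# Path graph on three nodes: 0 - 1 - 2
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
L = graph_laplacian(A)
```

Every row of `L` sums to zero, so the constant vector is always in its null space; the remaining eigenvectors are the ones spectral methods use as coordinates.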

SLIDE 83

Nonparametric representations

Visualization with matrix plots

SLIDE 84

Summary

We considered the following data and attribute types, and briefly showed how to handle, process, and visualize them:

Types of attributes

  • Nominal
  • Ordinal
  • Interval
  • Ratio

Types of data

  • Tabular data
  • Transactional & text data
  • Structured (1D, 2D, & more) data
  • Nonparametric & graph data

Exploratory data analysis is crucial for obtaining intelligible results, e.g., by identifying valid applicable operations on the data and possibly transforming it to a more amenable representation for analysis. Other preprocessing steps include normalization/standardization, sampling, discretization, aggregation, and dimensionality reduction.
