Introduction to Data Science MAT 6480W / STT 6705V Guy Wolf



slide-1
SLIDE 1

Geometric Data Analysis

Introduction to Data Science

MAT 6480W / STT 6705V

Guy Wolf guy.wolf@umontreal.ca

Université de Montréal Fall 2019

MAT 6480W (Guy Wolf) Introduction UdeM - Fall 2019 1 / 19

slide-2
SLIDE 2

Outline

1. What is Data Science?
  • From data to information
  • Predictive vs. descriptive information
  • Supervised vs. unsupervised learning

2. Data Analysis Tasks
  • Classification & regression
  • Clustering & anomaly detection
  • Association rules & sequential patterns
  • Visualization & dimensionality reduction

3. Data Analysis Process

4. Software for Data Analysis


slide-3
SLIDE 3

Optional textbooks on data mining


slide-4
SLIDE 4

What is data science?

Data Mining

Non-trivial extraction of useful, new, hidden, and/or implicit information from data.

Deep Learning

A set of algorithms that attempt to model high-level data abstractions in data by using multiple processing layers, composed of multiple linear and non-linear transformations.

Machine Learning

Field of study that gives computers the ability to learn without being explicitly programmed.

Big Data

Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.

Related terms: knowledge discovery in databases (KDD), pattern recognition, data warehousing, OLAP, ETL, IT, etc.


slide-7
SLIDE 7

What is data science?

From data to information

[Figure: collected data, followed by a chain of arrows ending at the label ⊆ ℝ^{O(100+)}; the images were not preserved.]

slide-16
SLIDE 16

What is data science?

From data to information

Examples of data mining / analysis tasks:

  • Recommend movies on Netflix or books on Amazon
  • Object recognition in images and automatic image tagging
  • Community detection in social networks (e.g., Facebook)
  • Automatic medical diagnosis and treatment recommendation

Examples of data processing tasks that do not require data mining:

  • Signature-based anti-virus
  • Retrieving details from a contact list
  • Text-based search in a document or on the web
  • Quicksort, balanced trees, heaps, etc.


slide-17
SLIDE 17

What is data science?

Predictive vs. descriptive methods

Predictive methods

Predict unknown information from known data.

  • How much would my house sell for, based on sales stats?
  • Will Bob like Ghostbusters, based on his Netflix history?

Descriptive methods

Infer or extract interpretable patterns to describe data.

  • What consumer profiles should my ads target?
  • If Jim’s card is being charged $300 in a Disney store today, is that reasonable or is it fraud?


slide-18
SLIDE 18

What is data science?

Supervised vs. unsupervised learning

Machine learning data analysis tasks are roughly divided into:

Supervised learning

Inferring information from labeled training data.

Unsupervised learning

Finding hidden patterns in unlabeled data.

Semi-supervised learning

Combining information from labeled and unlabeled data to model and deduce information.


slide-19
SLIDE 19

Data analysis tasks

Classification

Classification

Classify “items” into a finite set of classes, or “categories”.

Training phase

Labeled data: $\{(x_1, \ell_1), \ldots, (x_n, \ell_n)\} \subset X \times L$, where $|L| < \infty$

⇒ Classification model: $F : X \to L$, $F(x_i) = \ell_i$

Testing phase

New data: $y_1, y_2, \ldots \in X$ → classification model

⇒ Classification result: $F(y_1), F(y_2), \ldots \in L$
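The training and testing phases above can be sketched with a toy model. This is only an illustration: the slides do not name a specific classifier, so a 1-nearest-neighbor rule is assumed here, and the data points and the labels "low" and "high" are made up.

```python
import math

def train_nearest_neighbor(labeled_data):
    """Return a classification model F: X -> L built from (x_i, label_i) pairs."""
    training = list(labeled_data)  # the model simply memorizes the labeled data

    def F(y):
        # Assign y the label of the closest training item (Euclidean distance).
        _, label = min(training, key=lambda pair: math.dist(pair[0], y))
        return label

    return F

# Training phase: labeled data with a finite label set L = {"low", "high"}.
labeled = [((0.0, 0.0), "low"), ((0.2, 0.1), "low"),
           ((1.0, 1.0), "high"), ((0.9, 1.2), "high")]
F = train_nearest_neighbor(labeled)

# Testing phase: new items y_1, y_2 in X are passed through the model.
new_items = [(0.1, 0.0), (1.1, 0.9)]
predictions = [F(y) for y in new_items]
print(predictions)
```

With distinct training points, this model reproduces F(x_i) = ℓ_i on the training set, since each training item is its own nearest neighbor.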


slide-20
SLIDE 20

Data analysis tasks

Classification - examples

Example (MNIST digit classification)


slide-21
SLIDE 21

Data analysis tasks

Classification - examples

Example (CalTech 101 image classification)

[Figure: image panels labeled Anchor, Joshua-Tree, Beaver, Lotus, Water-Lily]


slide-22
SLIDE 22

Data analysis tasks

Regression

Regression

Compute (or infer) the value of a (piecewise) continuous function from a finite number of sampled “items” & values. This task is similar to classification, but here the model F can have an infinite range (e.g., $\mathbb{R}$ or $[0, 1]$).

Examples

  • Market pricing of a house/apartment/car based on its features
  • Trend line & model fitting from collected experimental data
  • Weather predictions, such as temperature and probability of rain/snow
  • Confidence rating in diagnostics (or binary classifier)
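As a small illustration of the trend-line example, the sketch below fits a straight line by ordinary least squares. The choice of a linear model and the size/price numbers are assumptions made for this example only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical samples: apartment size (m^2) and price (thousands of dollars).
sizes = [50.0, 60.0, 80.0, 100.0]
prices = [150.0, 180.0, 240.0, 300.0]  # made-up values lying on price = 3 * size

slope, intercept = fit_line(sizes, prices)

def predict(size):
    """The regression model F: its output ranges over the real numbers."""
    return slope * size + intercept

print(predict(70.0))
```

Unlike the classifier, the fitted model returns a real number for any input, not one of finitely many labels.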


slide-25
SLIDE 25

Data analysis tasks

Clustering

Clustering

Group together similar “items” while separating ones that are different from each other. The quality of the obtained clusters stems from their interpretability. Variations include a known or unknown number of clusters, as well as multiscale hierarchical clustering structures.

Examples

  • Clustering stocks to diversify stock market investment
  • Community detection in social networks by clustering profiles
  • Clustering genes and cells to uncover activities, reactions, and interactions
  • Network activity profiling by clustering packets/sessions
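For the variant with a known number of clusters, a minimal sketch is given below. It assumes k-means (Lloyd's algorithm), which the slides do not name, with a naive initialization and made-up 2D points.

```python
import math

def k_means(points, k, iterations=20):
    """Lloyd's algorithm for a known number of clusters k (a simple sketch)."""
    centers = list(points[:k])  # naive initialization: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, clusters

# Made-up items forming two visibly separated groups.
points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3), (4.8, 5.2)]
centers, clusters = k_means(points, k=2)
print([len(c) for c in clusters])
```

The result depends on the initialization, so practical implementations usually restart from several initial centers.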


slide-26
SLIDE 26

Data analysis tasks

Anomaly detection

Anomaly/outlier detection

Detect significant deviations from normal behavior expressed by inferred data patterns. The notion of “normal behavior” can be defined in several ways, such as clustering or model fitting.

Examples

  • Fraud detection in credit cards
  • Intrusion detection in cybersecurity
  • Detecting bot traffic in online advertising
  • Malfunction detection in process monitoring
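One way to define "normal behavior" by model fitting is sketched below. It assumes a simple mean and standard deviation model with a three-sigma threshold; both the model and the charge amounts are illustrative assumptions, not the course's method.

```python
import statistics

def fit_normal_behavior(history):
    """Model "normal behavior" by the mean and standard deviation of past values."""
    return statistics.mean(history), statistics.stdev(history)

def is_anomaly(value, model, threshold=3.0):
    """Flag values more than `threshold` standard deviations away from the mean."""
    mean, std = model
    return abs(value - mean) > threshold * std

# Hypothetical past card charges (in dollars) used to infer the normal pattern.
past_charges = [22.0, 35.0, 18.0, 40.0, 27.0, 31.0, 25.0, 29.0]
model = fit_normal_behavior(past_charges)

# New charges are compared against the inferred pattern.
flags = [is_anomaly(v, model) for v in [30.0, 300.0]]
print(flags)
```

The model is fitted on past data only, so that a new extreme value cannot inflate the estimate of what counts as normal.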


slide-27
SLIDE 27

Data analysis tasks

Association rules

Association rule discovery

Produce dependency rules that model input co-occurrences of “items” to predict, given a partial “transaction”, the remaining “items” in it.

Training phase

Observed transactions: $T_1, \ldots, T_n \subseteq X$

⇒ Association rules: $F : 2^X \to 2^X$, with $T \subseteq T_i \to F(T) \approx T_i \setminus T$

Testing phase

Partial transactions: $S_1, S_2, \ldots \subseteq X$ → association rules

⇒ Predicted information: $\forall i,\ S_i \to F(S_i) \subseteq X \setminus S_i$


Examples

  • Active advertisements & recommendations (e.g., “Users who liked/bought this product also liked/bought that product”)
  • Support decision making on shelf organization in stores & supermarkets
  • Name completions in emails, social networks, etc.

Unlike classification, the actual testing phase is often less important than the discovered rules in this case.
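A rough sketch of a map F from a partial transaction to predicted remaining items is given below. It assumes a plain confidence threshold on co-occurrence counts; the threshold value and the shopping baskets are made up for illustration.

```python
def learn_association_model(transactions, min_confidence=0.6):
    """Return F: given a partial transaction, predict items likely to complete it."""
    observed = [frozenset(t) for t in transactions]

    def F(partial):
        partial = frozenset(partial)
        # Observed transactions T_i that contain the partial transaction.
        matching = [t for t in observed if partial <= t]
        if not matching:
            return set()
        candidates = set().union(*matching) - partial
        # Keep items whose confidence (fraction of matching transactions
        # containing the item) reaches the threshold.
        return {item for item in candidates
                if sum(item in t for t in matching) / len(matching) >= min_confidence}

    return F

# Training phase: hypothetical observed transactions.
baskets = [{"bread", "butter", "milk"}, {"bread", "butter"}, {"bread", "jam"},
           {"milk", "cereal"}, {"bread", "butter", "jam"}]
F = learn_association_model(baskets)

# Testing phase: partial transactions S_i, with F(S_i) disjoint from S_i.
predicted_for_bread = F({"bread"})
predicted_for_milk = F({"milk"})
print(predicted_for_bread, predicted_for_milk)
```

Here butter appears in three of the four baskets containing bread, so it passes the 0.6 threshold, while no item co-occurs with milk often enough to be predicted.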


slide-29
SLIDE 29

Data analysis tasks

Sequential patterns

Sequential pattern discovery

Given a set of ordered event sequences, produce rules to predict unknown/missing/future events from prior and/or subsequent events. Similar in some sense to association rule discovery, but with an order or timeline aspect to each transaction.

Examples

  • String mining:
      • Natural language processing
      • Gene sequencing in DNA and RNA
  • Frequent item purchase sequences
  • Predicting outcomes of medical treatment
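The purchase-sequence example can be sketched with a very simple rule: predict the event that most often followed the current one. This first-order counting scheme and the purchase histories are assumptions for illustration; real sequential pattern mining uses richer patterns.

```python
from collections import Counter, defaultdict

def learn_next_event_rules(sequences):
    """For each event, find the event that most often directly follows it."""
    followers = defaultdict(Counter)
    for seq in sequences:
        # The order within each sequence matters: count (current, next) pairs.
        for current, nxt in zip(seq, seq[1:]):
            followers[current][nxt] += 1
    # Rule: event -> its most frequent successor.
    return {event: counts.most_common(1)[0][0] for event, counts in followers.items()}

# Hypothetical ordered purchase histories.
purchases = [["phone", "case", "charger"], ["phone", "case"],
             ["laptop", "mouse"], ["phone", "charger"]]
rules = learn_next_event_rules(purchases)
print(rules["phone"])
```

Unlike the association-rule setting, swapping the order of events inside a sequence changes the learned rules.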


slide-30
SLIDE 30

Data analysis tasks

Dimensionality reduction & visualization

Dimensionality reduction

Find low dimensional coordinates (e.g., in $\mathbb{R}^d$ with $d < 10$) to represent “items”. Used as a helpful, sometimes critical, preprocessing step to alleviate data analysis challenges arising from the curse of dimensionality.

Visualization

Find human interpretable 2D or 3D representations of the data via elements, patterns, trends, structures, etc., in it. Used to enable manual data processing, letting a human user draw conclusions from the data, support decision making, or guide further data exploration.
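As an illustration of finding low dimensional coordinates, the sketch below reduces 3D points to one coordinate each. It assumes principal component analysis (PCA) computed by power iteration, a technique chosen for this example rather than taken from the slides, and the data are made-up points lying on a line.

```python
import math

def first_principal_component(points, iterations=100):
    """Top principal direction via power iteration on the covariance matrix."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[i] for p in points) / n for i in range(dim)]
    centered = [[p[i] - mean[i] for i in range(dim)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(dim)]
           for i in range(dim)]
    v = [1.0] * dim  # arbitrary starting vector
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize to unit length.
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def project(points, mean, direction):
    """One coordinate per point: its position along the principal direction."""
    return [sum((p[i] - mean[i]) * direction[i] for i in range(len(mean)))
            for p in points]

# Made-up 3D items that actually vary along a single direction (1, 2, -1).
points = [(1.0, 2.0, -1.0), (2.0, 4.0, -2.0), (3.0, 6.0, -3.0), (4.0, 8.0, -4.0)]
mean, direction = first_principal_component(points)
coords = project(points, mean, direction)
print(coords)
```

The sketch assumes the data are not all identical and that the starting vector is not orthogonal to the principal direction; otherwise the normalization step would fail or converge to a different direction.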

A combination of these techniques can help create interactive data processing algorithms that utilize unsupervised descriptive elements to request and enable human input, and then use semi-supervised predictive approaches to produce stronger results.


slide-33
SLIDE 33

Data analysis tasks

Dimensionality reduction & visualization - example

Modeling lip motions in speech:

Dominating parameters: lips opening and teeth showing


slide-35
SLIDE 35

Data Analysis Process

Typical steps in a data analysis process

1. Recognizing the specific task

2. Knowing your data

3. Preprocessing

4. Applying algorithms

5. Postprocessing & getting interpretable results

6. Evaluation & cross validation
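For the last step, a minimal sketch of how items can be split for k-fold cross validation is shown below. The contiguous, unshuffled folds and the function name are choices made for this example; in practice the items are usually shuffled first.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_items))
    # Distribute items as evenly as possible: the first folds get one extra item.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Each item is used for testing exactly once and for training k - 1 times.
splits = list(k_fold_splits(10, 3))
for train, test in splits:
    print(len(train), len(test))
```

A model would be fitted on each training part and evaluated on the matching test part, with the k scores averaged at the end.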


slide-38
SLIDE 38

Software for Data Analysis

Software recommended in this course:

  • Matlab
  • Python (with numpy, scipy, scikit-learn)

Other software:

  • R (especially popular in statistics)
  • Scilab & Octave (can be used in lieu of Matlab)
  • C/C++, Java, & C# (.Net)
  • Weka
  • Fortran (sometimes still used in numerical analysis)
  • Many other scripting and programming platforms
