slide-1
SLIDE 1

Introduction to Data Mining

CPSC/AMTH 445a/545a

Guy Wolf guy.wolf@yale.edu

Yale University Fall 2016

CPSC 445 (Guy Wolf) Introduction Yale - Fall 2016 1 / 19

slide-2
SLIDE 2

Outline

1. What is Data Mining?
  • From data to information
  • Predictive vs. descriptive information
  • Supervised vs. unsupervised learning

2. Data Mining Tasks
  • Classification & regression
  • Clustering & anomaly detection
  • Association rules & sequential patterns
  • Visualization & dimensionality reduction

3. Data mining process

4. Software for data mining

slide-3
SLIDE 3

Recommended textbooks

slide-4
SLIDE 4

What is data mining?

Data Mining

Non-trivial extraction of useful, new, hidden, and/or implicit information from data.

Deep Learning

A set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, composed of multiple linear and non-linear transformations.

Machine Learning

Field of study that gives computers the ability to learn without being explicitly programmed.

Big Data

Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.

Related terms: knowledge discovery in databases (KDD), pattern recognition, data warehousing, OLAP, ETL, IT, etc.

slide-7
SLIDE 7

What is data mining?

From data to information

[Figure: collected data, a subset of $\mathbb{R}^{O(100+)}$, passing through a sequence of processing stages drawn as arrows; the stage labels were not recovered.]

slide-16
SLIDE 16

What is data mining?

From data to information

Examples of data mining tasks:

  • Recommend movies on Netflix or books on Amazon
  • Object recognition in images and automatic image tagging
  • Community detection in social networks (e.g., Facebook)
  • Automatic medical diagnosis and treatment recommendation

Examples of data processing tasks that are not considered data mining:

  • Signature-based anti-virus
  • Retrieving details from a contact list
  • Text-based search in a document or on the web
  • Quicksort, balanced trees, heaps, etc.

slide-17
SLIDE 17

What is data mining?

Predictive vs. descriptive methods

Predictive methods

Predict unknown information from known data.

  • How much would my house sell for, based on sales stats?
  • Will Bob like Ghostbusters, based on his Netflix history?

Descriptive methods

Infer or extract interpretable patterns to describe data.

  • What consumer profiles should my ads target?
  • If Jim’s card is charged $300 in a Disney store today, is that a reasonable charge or fraud?

slide-18
SLIDE 18

What is data mining?

Supervised vs. unsupervised learning

Machine learning and data analysis tasks are roughly divided into:

Supervised learning

Inferring information from labeled training data.

Unsupervised learning

Finding hidden patterns in unlabeled data.

Semi-supervised learning

Combining information from labeled and unlabeled data to model and deduce information.

slide-19
SLIDE 19

Data mining tasks

Classification

Classification

Classify “items” into a finite set of classes, or “categories”.

Training phase

Labeled data:

  • $\{(x_1, \ell_1), \ldots, (x_n, \ell_n)\} \subset X \times L$, with $|L| < \infty$

⇒ Classification model:

  • $F : X \to L$, $F(x_i) = \ell_i$

Testing phase

New data:

  • $y_1, y_2, \ldots \in X$ → classification model

⇒ Classification result:

  • $F(y_1), F(y_2), \ldots \in L$
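The two phases can be illustrated with a small sketch. The slides do not name a specific classifier, so the nearest-centroid rule used here is only one simple choice of model F; the function names and the toy data are hypothetical.

```python
import numpy as np

def train_nearest_centroid(X, labels):
    """Training phase: compute one centroid per class from labeled data."""
    classes = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def classify(model, Y):
    """Testing phase: give each new point the label of its closest centroid."""
    classes, centroids = model
    # Pairwise distances, shape (number of new points, number of classes).
    dists = np.linalg.norm(Y[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

# Toy labeled data (x_i, l_i): two classes in the plane (made-up values).
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
l_train = np.array([0, 0, 1, 1])

model = train_nearest_centroid(X_train, l_train)
predicted = classify(model, np.array([[0.1, 0.0], [1.0, 0.9]]))
print(predicted)
```

Here the label set L is {0, 1}, so the finiteness condition on L holds trivially; the same code works for any finite number of classes.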

slide-20
SLIDE 20

Data mining tasks

Classification - examples

Example (MNIST digit classification)

slide-21
SLIDE 21

Data mining tasks

Classification - examples

Example (CalTech 101 image classification)

[Figure: images labeled with the categories Anchor, Joshua-Tree, Beaver, Lotus, and Water-Lily]

slide-22
SLIDE 22

Data mining tasks

Regression

Regression

Compute (or infer) the value of a (piecewise) continuous function from a finite number of sampled “items” & values.

This task is similar to classification, but here the model F can have an infinite range (e.g., $\mathbb{R}$ or $[0, 1]$).

Examples

  • Market pricing of a house/apartment/car based on its features.
  • Trend line & model fitting from collected experimental data.
  • Weather predictions, such as temperature and probability of rain/snow.
  • Confidence rating in diagnostics (or binary classifier).
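As a minimal sketch of the trend-line example, the code below fits a straight line by ordinary least squares with NumPy. The linear model, the variable names, and the house sizes and prices are all assumptions made for illustration; they do not come from the slides.

```python
import numpy as np

# Hypothetical samples: house size (square meters) and price (thousands).
size = np.array([50.0, 70.0, 90.0, 110.0])
price = np.array([150.0, 210.0, 270.0, 330.0])

# Design matrix with a column of ones so the fit includes an intercept.
A = np.column_stack([size, np.ones_like(size)])
(slope, intercept), *_ = np.linalg.lstsq(A, price, rcond=None)

def predict_price(s):
    """Evaluate the fitted model F at a new size s."""
    return slope * s + intercept

print(predict_price(100.0))
```

Unlike a classifier, the fitted F returns a real number for any input size, including sizes that never appeared in the samples.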

slide-25
SLIDE 25

Data mining tasks

Clustering

Clustering

Group together similar “items” while separating ones that are different from each other.

The quality of obtained clusters stems from their interpretability. Variations include a known or unknown number of clusters, as well as multiscale hierarchical clustering structures.

Examples

  • Clustering stocks to diversify stock market investment
  • Community detection in social networks by clustering profiles
  • Clustering genes and cells to uncover activities, reactions, and interactions
  • Network activity profiling by clustering packets/sessions
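For the variant with a known number of clusters, one common approach is k-means (Lloyd's algorithm). The slides do not single out an algorithm, so this is only an illustrative sketch; the function name, parameters, and toy points are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Lloyd's algorithm: alternate point assignment and center update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)
        # Keep the old center if a cluster happens to become empty.
        new_centers = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign, centers

# Two well-separated groups of made-up points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
assign, centers = kmeans(X, k=2)
print(assign)
```

The cluster indices themselves are arbitrary; only the grouping is meaningful, which is why the result has to be interpreted rather than compared against labels.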

slide-26
SLIDE 26

Data mining tasks

Anomaly detection

Anomaly/outlier detection

Detect significant deviations from normal behavior expressed by inferred data patterns. The notion of “normal behavior” can be defined in several ways, such as clustering or model fitting.

Examples

  • Fraud detection in credit cards
  • Intrusion detection in cybersecurity
  • Detecting bot traffic in online advertising
  • Malfunction detection in process monitoring
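A minimal sketch of the model-fitting route to “normal behavior”: fit a mean and standard deviation to past observations, then flag values that lie many standard deviations away. The threshold of 3, the function names, and the charge amounts are assumptions for illustration only.

```python
from statistics import mean, stdev

def fit_normal_model(history):
    """Describe normal behavior by the sample mean and standard deviation."""
    return mean(history), stdev(history)

def is_anomaly(value, model, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = model
    return abs(value - mu) / sigma > threshold

# Hypothetical past card charges in dollars.
history = [22.0, 25.0, 19.0, 24.0, 21.0, 23.0, 20.0, 26.0]
model = fit_normal_model(history)

print(is_anomaly(24.0, model), is_anomaly(300.0, model))
```

A real detector would usually need a richer notion of normal behavior (per-merchant, per-time-of-day, and so on); the point here is only the fit-then-compare structure.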

slide-27
SLIDE 27

Data mining tasks

Association rules

Association rule discovery

Produce dependency rules that model input co-occurrences of “items” to predict, given a partial “transaction”, the remaining “items” in it.

Training phase

Observed transactions:

  • $T_1, \ldots, T_n \subseteq X$

⇒ Association rules:

  • $F : 2^X \to 2^X$, $T \subseteq T_i \mapsto F(T) \approx T_i \setminus T$

Testing phase

Partial transactions:

  • $S_1, S_2, \ldots \subseteq X$ → association rules

⇒ Predicted information:

  • $\forall i$, $S_i \mapsto F(S_i) \subseteq X \setminus S_i$

Examples

  • Active advertisements & recommendations (e.g., “Users who liked/bought this product also liked/bought that product”)
  • Support decision making on shelf organization in stores & supermarkets
  • Name completions in emails, social networks, etc.

Unlike classification, the actual testing phase is often less important than the discovered rules in this case.
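To make the idea of discovered rules concrete, here is a small sketch that mines single-item rules "a implies b" using the usual support and confidence measures. Those two measures, the thresholds, the function names, and the shopping baskets are assumptions for illustration; the slides do not specify a mining algorithm.

```python
from collections import Counter
from itertools import permutations

def pair_rules(transactions, min_support=0.3, min_confidence=0.7):
    """Return rules (a, b), read as 'a implies b', with (support, confidence)."""
    n = len(transactions)
    item_count = Counter()
    pair_count = Counter()
    for t in transactions:
        items = set(t)
        item_count.update(items)
        # Ordered pairs, so that a->b and b->a are scored separately.
        pair_count.update(permutations(sorted(items), 2))
    rules = {}
    for (a, b), count in pair_count.items():
        support = count / n                # fraction of transactions with a and b
        confidence = count / item_count[a]  # fraction of a-transactions that have b
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = (support, confidence)
    return rules

def predict_remaining(partial, rules):
    """Testing phase: items suggested by the rules for a partial transaction."""
    return {b for (a, b) in rules if a in partial and b not in partial}

# Hypothetical observed transactions.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"beer", "chips"},
    {"bread", "butter", "jam"},
]
rules = pair_rules(transactions)
print(sorted(rules))
print(predict_remaining({"bread"}, rules))
```

The rule table itself (which items go together, and how reliably) is typically the deliverable, for instance when deciding which products to place near each other.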

slide-29
SLIDE 29

Data mining tasks

Sequential patterns

Sequential pattern discovery

Given a set of ordered event sequences, produce rules to predict unknown/missing/future events from prior and/or subsequent events.

Similar in some sense to association rule discovery, but with an order or timeline aspect to each transaction.

Examples

  • String mining:
      • Natural language processing
      • Gene sequencing in DNA and RNA
  • Frequent item purchase sequences
  • Predicting outcomes of medical treatment
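One very simple way to exploit ordering is to count which event most often follows each event and predict from those counts. This first-order (one step back) model is an assumption chosen for brevity, not a method named in the slides; the function names and purchase sequences are hypothetical.

```python
from collections import Counter, defaultdict

def fit_transitions(sequences):
    """Count how often each event is immediately followed by each other event."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, event):
    """Predict the most frequent successor of `event`, or None if it was never seen."""
    if event not in counts:
        return None
    return counts[event].most_common(1)[0][0]

# Hypothetical ordered purchase sequences.
sequences = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["laptop", "mouse"],
    ["phone", "charger"],
]
counts = fit_transitions(sequences)
print(predict_next(counts, "phone"))
```

Swapping the order of items inside a sequence changes the counts, which is exactly the difference from association rules, where a transaction is an unordered set.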

slide-30
SLIDE 30

Data mining tasks

Dimensionality reduction & visualization

Dimensionality reduction

Find low dimensional coordinates (e.g., in $\mathbb{R}^d$ with $d < 10$) to represent “items”.

Used as a helpful, sometimes critical, preprocessing step to alleviate data analysis challenges arising from the curse of dimensionality.

Visualization

Find human interpretable 2D or 3D representations of the data via elements, patterns, trends, structures, etc., in it.

Used to enable manual data processing, so that a human user can draw conclusions, support decision making, or guide further data exploration.
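As one concrete instance of finding low dimensional coordinates, the sketch below uses principal component analysis (PCA) computed through the singular value decomposition. PCA is a linear method picked here as an assumption; the slides do not commit to a particular technique, and the synthetic data and function name are hypothetical.

```python
import numpy as np

def pca(X, d):
    """Project centered data onto its d leading principal directions."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:d]
    coords = Xc @ components.T
    # Fraction of total variance captured by the kept directions.
    explained = (S[:d] ** 2).sum() / (S ** 2).sum()
    return coords, components, explained

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Synthetic 3-D data that varies almost entirely along one direction.
X = np.outer(t, [1.0, 2.0, 2.0]) + 0.01 * rng.normal(size=(200, 3))

coords, components, explained = pca(X, d=1)
print(coords.shape, explained)
```

With d set to 2 or 3, the returned coordinates can be passed directly to a scatter plot, which is the usual bridge from dimensionality reduction to visualization.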

slide-31
SLIDE 31

Data mining tasks

Dimensionality reduction & visualization

Dimensionality reduction

Find low dimensional coordinates (e.g., in $\mathbb{R}^d$ with $d < 10$) to represent “items”.

Visualization

Find human interpretable 2D or 3D representations of the data via elements, patterns, trends, structures, etc., in it.

A combination of these techniques can help create interactive data processing algorithms that utilize unsupervised descriptive elements to request and enable human input, and then use semi-supervised predictive approaches to produce stronger results.

slide-33
SLIDE 33

Data mining tasks

Dimensionality reduction & visualization - example

Modeling lip motions in speech:

Dominating parameters: lips opening and teeth showing

slide-35
SLIDE 35

Data mining process

Typical steps in a data mining process

1. Recognizing the specific task

2. Knowing your data

3. Preprocessing

4. Apply algorithms

5. Postprocessing & getting interpretable results

6. Evaluation & cross validation
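The evaluation step can be sketched with k-fold cross validation: split the data into k parts, hold each part out once, train on the rest, and average the held-out accuracy. The helper names, the choice of k, and the 1-nearest-neighbor placeholder model are assumptions for illustration and are not taken from the slides.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, labels, train_fn, predict_fn, k=5):
    """Average held-out accuracy over k train/test splits."""
    accuracies = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = train_fn([xs[i] for i in train], [labels[i] for i in train])
        correct = sum(predict_fn(model, xs[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k

# Placeholder model: 1-nearest-neighbor on scalar features.
def train_1nn(xs, labels):
    return list(zip(xs, labels))

def predict_1nn(model, x):
    return min(model, key=lambda pair: abs(pair[0] - x))[1]

# Made-up, well-separated data with two classes.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 5.1, 5.2, 5.3, 5.4, 5.5]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

accuracy = cross_validate(xs, labels, train_1nn, predict_1nn, k=5)
print(accuracy)
```

Because every point is held out exactly once, the averaged score reflects performance on data the model did not see during training, which a single train-and-score pass on the same data cannot show.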

slide-38
SLIDE 38

Software for data mining

Software used in this course:

  • Matlab
  • Python (with NumPy & SciPy)

Other software:

  • Weka
  • R (especially for statisticians)
  • Scilab & Octave (can be used in lieu of Matlab)
  • C/C++, Java, & C# (.Net)
  • Fortran
  • Many other scripting and programming platforms
