Introduction to Data Mining (Umberto Nanni, Seminars of Software and Services for the Information Society)





SLIDE 1

Master of Science in Engineering in Computer Science (MSE-CS)

DIPARTIMENTO DI INGEGNERIA INFORMATICA AUTOMATICA E GESTIONALE ANTONIO RUBERTI


Seminars in Software and Services for the Information Society

Umberto Nanni


Introduction to Data Mining

SLIDE 2

Data Mining

  • born before the Data Warehouse
  • collection of techniques from Artificial Intelligence, Pattern Recognition, Statistics (e.g., genetic algorithms, fuzzy logic, expert systems, neural networks, etc.)
  • targets:
    – descriptive goals: identify patterns of behavior, cause-effect relationships, classify individuals, etc.
    – predictive goals: predict trends, classify individuals according to risk, etc.

SLIDE 3

Some applications for Data Mining

  • Data Analysis and Decision Support Systems
  • Market Analysis and Marketing
    – Target Marketing, Customer Relationship Management (CRM), Market Basket Analysis (MBA), market segmentation
  • Analysis and risk management
    – reliability forecasts, user loyalty, quality control, ...
    – detection of frauds and unusual patterns (outliers)
  • Text Mining
  • Web Mining, ClickStream Analysis
  • Genetic engineering, DNA interpretation, ...
SLIDE 4

Data Mining: association rules

IF X (“the customer purchases beer”) THEN Y (“the customer purchases diapers”): X → Y

Support (what fraction of all individuals follows the rule):
s(X → Y) = |X ∩ Y| / |all| = F(X ∧ Y)

Confidence (what fraction of the individuals to whom the rule applies follows the rule):
c(X → Y) = |X ∩ Y| / |X| = F(Y | X)

Range: economics (e.g., market basket analysis), telecommunications, health care, ...
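As a minimal sketch, the two measures can be computed directly from a list of transactions; the helper names and the toy data below are illustrative, not part of the slides:

```python
# Support and confidence of a rule X -> Y over a list of transactions
# (each transaction is a set of purchased items).

def support(transactions, x, y):
    """Fraction of ALL transactions that contain both X and Y."""
    both = sum(1 for t in transactions if x in t and y in t)
    return both / len(transactions)

def confidence(transactions, x, y):
    """Among transactions containing X, the fraction that also contain Y."""
    with_x = [t for t in transactions if x in t]
    both = sum(1 for t in with_x if y in t)
    return both / len(with_x)

transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "bread"},
    {"milk", "diapers"},
]

s = support(transactions, "beer", "diapers")     # 2/4 = 0.5
c = confidence(transactions, "beer", "diapers")  # 2/3
```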

SLIDE 5

Data Mining: clustering

  • identify similarities and spot heterogeneity in the distribution, in order to define homogeneous groups (unsupervised learning)
  • search for clusters based on:
    – the distribution of the population
    – a notion of “distance”

Example: DFI – Disease-Free Interval (5 years) (collaboration with Ist. Regina Elena, Roma)

[Figure: s-phase vs. 5-year disease-free interval, before and after clustering]
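A minimal k-means sketch in pure Python illustrates clustering by a notion of distance; the data points and the naive initialization below are illustrative assumptions, not the slide's DFI data:

```python
# Minimal k-means: alternate (1) assigning each point to its nearest
# centroid and (2) recomputing each centroid as its cluster's mean.
import math

def kmeans(points, k, iters=20):
    centroids = points[:k]  # naive initialization: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # new centroid = coordinate-wise mean; keep old one if cluster is empty
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(points, k=2)  # two well-separated groups of 3
```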

SLIDE 6

Data Mining: decision tree

Determine the causes of an interesting phenomenon (with a set of output values), sorted by relevance:

  – internal node: attribute to be appraised
  – branching: value (or value interval) for an attribute
  – leaf: one of the possible output values

Example: will the customer buy a computer?

[Figure: decision tree with root “age?” branching on <=30 (then “student?”: no / yes), 30..40 (yes), >40 (then “credit?”: high / low)]
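The example tree can be sketched as nested conditionals. The leaf labels below follow the classic “buys computer” textbook example and are an assumption, since the slide's extraction only preserves the tree's shape:

```python
# The slide's decision tree as nested conditionals.
# Internal nodes test an attribute; leaves are the output values.

def will_buy_computer(age, student, credit):
    if age <= 30:
        return student           # "student?" branch: no -> no, yes -> yes
    elif age <= 40:
        return True              # 30..40 branch: always buys (assumed leaf)
    else:
        return credit == "low"   # "credit?" branch: high -> no, low -> yes (assumed)

will_buy_computer(25, student=True, credit="high")   # True
will_buy_computer(45, student=False, credit="high")  # False
```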

SLIDE 7

Data Mining: time sequences

  • spot recurrent / unusual patterns in time sequences
  • feature prediction

Example (Least Cost Routing): routing a telephone call over the cheapest available connection (cooperation with Between – consulting firm)

KEY QUESTION: given an outbound call from an internal line X toward an external number Y, how long will the call last?

[Figure: rate structures plotted as cost vs. duration, with flat rate and connection fee]

SLIDE 8

Neural Networks

Problem: can you write a program which recognizes human handwriting of capital letters?

SLIDE 9

Data Mining: “interesting” results

  • Simplicity - For example:
    – length of rules (association rules)
    – size (decision tree)
  • Certainty - For example:
    – confidence (association rules): c(X → Y) = #(X and Y) / #(X)
    – reliability of classification
  • Usefulness - For example:
    – support (association rules): s(X → Y) = #(X and Y) / #(ALL)
  • Novelty - For example:
    – not known previously
    – surprising
    – subsumption of other rules (included as special cases)

[Figure: confusion matrix of predicted value vs. effective value (right / wrong)]

SLIDE 10

Confusion matrix


SLIDE 11

Confusion matrix & Terminology

Positive (P), Negative (N); True Positive (TP), True Negative (TN); False Positive (FP), False Negative (FN)

True Positive Rate [sensitivity, recall]: TPR = TP / P = TP / (TP + FN)
False Positive Rate: FPR = FP / N = FP / (FP + TN)
Accuracy: ACC = (TP + TN) / (P + N)
Specificity (True Negative Rate): SPC = TN / N = TN / (FP + TN) = 1 - FPR
Positive Predictive Value [precision]: PPV = TP / (TP + FP)
Negative Predictive Value: NPV = TN / (TN + FN)
False Discovery Rate: FDR = FP / (FP + TP)
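The definitions above can be bundled into one helper, a sketch whose names follow the slide's abbreviations (the example counts are made up):

```python
# All slide metrics from the four cells of a binary confusion matrix.

def confusion_metrics(tp, fn, fp, tn):
    p, n = tp + fn, fp + tn          # actual positives / actual negatives
    return {
        "TPR": tp / p,               # sensitivity, recall
        "FPR": fp / n,
        "ACC": (tp + tn) / (p + n),
        "SPC": tn / n,               # specificity = 1 - FPR
        "PPV": tp / (tp + fp),       # precision
        "NPV": tn / (tn + fn),
        "FDR": fp / (fp + tp),
    }

m = confusion_metrics(tp=40, fn=10, fp=5, tn=45)
# m["TPR"] = 0.8, m["FPR"] = 0.1, m["ACC"] = 0.85, m["SPC"] = 0.9
```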

SLIDE 12

ROC curve

Receiver Operating Characteristic (from signal detection theory)

Fundamental tool for the evaluation of a learning algorithm.
Y axis: True Positive Rate (Sensitivity)
X axis: False Positive Rate (1 - Specificity)

Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. The Area Under the ROC Curve (AUC) is a measure of how well a parameter can distinguish between two groups (YES/NO decision).
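One way to sketch the construction: sweep a decision threshold over classifier scores to obtain one (FPR, TPR) point per threshold, then estimate AUC with the trapezoidal rule. The scores and labels below are made-up illustrative data:

```python
# Trace ROC points by thresholding scores, then integrate for AUC.

def roc_points(scores, labels):
    p = sum(labels)                # number of actual positives
    n = len(labels) - p            # number of actual negatives
    pts = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):  # one point per threshold
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        pts.append((fp / n, tp / p))               # (FPR, TPR)
    return pts

def auc(points):
    pts = sorted(points)
    # trapezoidal rule over consecutive ROC points
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
area = auc(roc_points(scores, labels))  # 15/16 = 0.9375
```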

SLIDE 13

ROC curve: examples


SLIDE 14

Mining Rules from Databases – Algorithm: APRIORI

Rakesh Agrawal, Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. 20th International Conference on Very Large Data Bases (VLDB), pp.487-499, Santiago, Chile, September 1994.

APRIORI Algorithm:
 1. L1 = { large 1-itemsets }
 2. for ( k = 2; Lk-1 ≠ ∅; k++ ) do begin
 3.   Ck = apriori-generate(Lk-1)            // candidate generation (extending previous itemsets)
 4.   forall transactions t ∈ D do begin
 5.     Ct = subset(Ck, t)                   // candidates contained in t
 6.     forall candidates c ∈ Ct do
 7.       c.count++
 8.   end
 9.   Lk = { c ∈ Ck | c.count ≥ minsupport } // pruning by support
10. end
11. ANSWER = ∪k Lk
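The level-wise idea can be sketched in Python as follows; this is an illustrative simplification (itemsets as frozensets, minsupport as an absolute transaction count, generate-and-prune done with set comprehensions rather than the paper's join):

```python
# Level-wise Apriori sketch: build frequent k-itemsets from frequent
# (k-1)-itemsets, pruning candidates with any infrequent subset.
from itertools import combinations

def apriori(transactions, minsupport):
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    lk = {frozenset([i]) for i in items
          if sum(1 for t in transactions if i in t) >= minsupport}
    answer = set(lk)
    k = 2
    while lk:
        # candidate generation: unions of two frequent (k-1)-itemsets of size k
        ck = {a | b for a in lk for b in lk if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must be frequent
        ck = {c for c in ck
              if all(frozenset(s) in lk for s in combinations(c, k - 1))}
        # support counting: keep candidates contained in enough transactions
        lk = {c for c in ck
              if sum(1 for t in transactions if c <= t) >= minsupport}
        answer |= lk
        k += 1
    return answer

transactions = [{"beer", "diapers"}, {"beer", "diapers", "chips"},
                {"beer", "bread"}, {"diapers", "milk"}]
frequent = apriori(transactions, minsupport=2)
# {beer}, {diapers}, and {beer, diapers} are the frequent itemsets
```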