1. Dipartimento di Ingegneria Informatica Automatica e Gestionale Antonio Ruberti
Master of Science in Engineering in Computer Science (MSE-CS)
Seminars in Software and Services for the Information Society
Introduction to Data Mining
Umberto Nanni, Seminars of Software and Services for the Information Society

2. Data Mining
• born before the Data Warehouse
• a collection of techniques from Artificial Intelligence, Pattern Recognition, and Statistics (e.g., genetic algorithms, fuzzy logic, expert systems, neural networks, etc.)
• targets:
  – descriptive goals: identify patterns of behavior, cause-effect relationships, classify individuals, etc.
  – predictive goals: predict trends, classify individuals according to risk, etc.

3. Some applications of Data Mining
• Data Analysis and Decision Support Systems
• Market Analysis and Marketing: Target Marketing, Customer Relationship Management (CRM), Market Basket Analysis (MBA), market segmentation
• Analysis and risk management: reliability forecasts, user loyalty, quality control, detection of frauds and unusual patterns (outliers)
• Text Mining
• Web Mining, ClickStream Analysis
• Genetic engineering, DNA interpretation, ...

4. Data Mining: associative rules
IF X ("the customer purchases beer") THEN Y ("the customer purchases diapers"): X → Y

Support (what fraction of all individuals follows the rule):
s(X → Y) = |X ∩ Y| / |all| = F(X ∧ Y)

Confidence (what fraction of the individuals to whom the rule applies follows it):
c(X → Y) = |X ∩ Y| / |X| = F(Y | X)

Application range: economics (e.g., market basket analysis), telecommunications, health care, ...
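The two measures above can be computed directly by counting transactions. A minimal sketch in Python, using the slide's beer/diapers example; the transaction data is made up for illustration:

```python
def support(transactions, itemset):
    """Fraction of all transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, x, y):
    """Among transactions containing X, the fraction also containing Y."""
    s_x = support(transactions, x)
    return support(transactions, x | y) / s_x if s_x else 0.0

# Illustrative market-basket data (not from the slides)
transactions = [
    {"beer", "diaper", "chips"},
    {"beer", "diaper"},
    {"beer", "bread"},
    {"milk", "diaper"},
]
x, y = {"beer"}, {"diaper"}
print(support(transactions, x | y))    # s(X -> Y) = 2/4 = 0.5
print(confidence(transactions, x, y))  # c(X -> Y) = 2/3
```

Note how confidence is exactly the support of X ∪ Y renormalized by the support of X, matching F(Y | X).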

5. Data Mining: clustering
• identify similarities and spot heterogeneity in the distribution, in order to define homogeneous groups (unsupervised learning)
• search for clusters based on:
  – the distribution of the population
  – a notion of "distance"
Example: DFI, the Disease-Free Interval (5 years), in a collaboration with Ist. Regina Elena, Roma.
[Figure: clustering of patients by 5-year disease-free interval versus s-phase]
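A distance-based clustering of the kind described above can be sketched with a naive k-means in pure Python; the points and the first-k initialisation are illustrative assumptions, not the method or data of the Regina Elena study:

```python
import math

def kmeans(points, k, iters=20):
    # Naive initialisation: the first k points (a real implementation
    # would use random restarts or k-means++ seeding).
    centers = list(points[:k])
    for _ in range(iters):
        # Assign each point to its nearest center (Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster
        centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# Two well-separated blobs (made-up data)
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (5.0, 5.1), (5.2, 4.9)]
centers, clusters = kmeans(points, k=2)
```

On this toy data the two blobs end up in separate clusters after a couple of iterations.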

6. Data Mining: decision tree
Determine the causes of an interesting phenomenon (with a set of output values), sorted by relevance:
– internal node: attribute to be appraised
– branching: value (or value interval) for an attribute
– leaf: one of the possible output values

Example: will the customer buy a computer?
– age <= 30: student? (no → no, yes → yes)
– age 30..40: yes
– age > 40: credit? (low → no, high → yes)
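Classifying with such a tree means walking from the root to a leaf. A sketch of the slide's example tree as nested conditionals; the yes/no labels on the credit branch follow one plausible reading of the slide:

```python
def will_buy_computer(age, student, credit):
    """Walk the example tree top-down and return the leaf value."""
    if age <= 30:
        return student            # student? no -> no, yes -> yes
    if age <= 40:                 # age 30..40
        return True               # yes
    return credit == "high"       # credit? low -> no, high -> yes (assumed reading)

print(will_buy_computer(25, student=True, credit="high"))  # True
```

A learned tree (e.g., via ID3 or C4.5) would have the same shape; here the splits are hard-coded to mirror the drawing.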

7. Data Mining: time sequences
• spot recurrent / unusual patterns in time sequences
• feature prediction
Example (Least Cost Routing): routing a telephone call over the cheapest available connection (cooperation with Between, a consulting firm).
KEY QUESTION: given an outbound call from an internal line X toward an external number Y, how long will the call last?
Rates: connection fee, flat rate, duration.
[Figure: cost as a function of call duration under the different rates]

8. Neural Networks
Problem: can you write a program which recognizes handwritten capital letters?

9. Data Mining: "interesting" results
[Figure: confusion matrix, predicted value vs. effective value (right/wrong)]
• Simplicity. For example:
  – length of rules (associative)
  – size (decision tree)
• Certainty. For example:
  – confidence (association rules): c(X → Y) = #(X and Y) / #(X)
  – reliability of classification
• Usefulness. For example:
  – support (association rules): s(X → Y) = #(X and Y) / #(ALL)
• Novelty. For example:
  – not known previously
  – surprising
  – subsumption of other rules (included as special cases)

10. Confusion matrix

11. Confusion matrix & terminology
Positive (P), Negative (N); True Positive (TP), True Negative (TN); False Positive (FP), False Negative (FN).
• True Positive Rate [sensitivity, recall]: TPR = TP / P = TP / (TP + FN)
• False Positive Rate: FPR = FP / N = FP / (FP + TN)
• Accuracy: ACC = (TP + TN) / (P + N)
• Specificity (True Negative Rate): SPC = TN / N = TN / (FP + TN) = 1 - FPR
• Positive Predictive Value [precision]: PPV = TP / (TP + FP)
• Negative Predictive Value: NPV = TN / (TN + FN)
• False Discovery Rate: FDR = FP / (FP + TP)
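All of these rates derive from the four confusion-matrix counts. A small sketch that computes them; the counts below are made-up illustration values:

```python
def metrics(tp, fp, tn, fn):
    """Compute the slide's rates from the four confusion-matrix counts."""
    p, n = tp + fn, fp + tn                 # actual positives / negatives
    return {
        "TPR": tp / p,                      # sensitivity, recall
        "FPR": fp / n,
        "ACC": (tp + tn) / (p + n),
        "SPC": tn / n,                      # specificity = 1 - FPR
        "PPV": tp / (tp + fp),              # precision
        "NPV": tn / (tn + fn),
        "FDR": fp / (fp + tp),
    }

# Illustrative counts (not from the slides)
m = metrics(tp=40, fp=10, tn=45, fn=5)
print(m["ACC"])  # 0.85
```

A quick sanity check on any such implementation: SPC + FPR must equal 1, and PPV + FDR must equal 1.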

12. ROC curve
Receiver Operating Characteristic (from signal detection theory): a fundamental tool for the evaluation of a learning algorithm.
Y axis: True Positive Rate (sensitivity)
X axis: False Positive Rate (100 - specificity, on a percentage scale)
Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. The Area Under the ROC Curve (AUC) is a measure of how well a parameter can distinguish between two groups (YES/NO decision).
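The AUC can be computed without drawing the curve at all, via its probabilistic interpretation: the chance that a randomly chosen positive example scores higher than a randomly chosen negative one. A sketch, with made-up scores and labels:

```python
def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties count as half a win. O(|pos| * |neg|); fine for a sketch,
    a real implementation would sort once and use ranks.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative classifier scores and true labels (not from the slides)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,   0]
print(auc(scores, labels))  # 8/9
```

An AUC of 1.0 means perfect separation of the two groups; 0.5 means the scores carry no information.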

13. ROC curve: examples

14. Mining Rules from Databases: the APRIORI algorithm
Rakesh Agrawal, Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. 20th International Conference on Very Large Data Bases (VLDB), pp. 487-499, Santiago, Chile, September 1994.

APRIORI algorithm:
 1. L1 = { large 1-itemsets }
 2. for ( k = 2; Lk-1 ≠ ∅; k++ ) do begin
 3.   Ck = apriori-generate(Lk-1)      // candidate generation (extending previous tuples)
 4.   forall transactions t ∈ D do begin
 5.     Ct = subset(Ck, t)             // candidates contained in t
 6.     forall candidates c ∈ Ct do
 7.       c.count++
 8.   end
 9.   Lk = { c ∈ Ck | c.count ≥ minsupport }   // pruning
10. end
11. ANSWER = ∪k Lk
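The pseudocode above can be rendered compactly in Python. A sketch that follows the same level-wise scheme (generate candidates from the previous level, prune by the subset property, count support, keep the frequent ones); `min_support` is an absolute count here, and the transactions are illustrative:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all itemsets appearing in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    L = [{frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_support}]
    answer = set(L[0])
    k = 2
    while L[-1]:
        prev = L[-1]
        # apriori-generate: join L_{k-1} with itself, keep size-k results
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # prune: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # count support and keep the large k-itemsets
        Lk = {c for c in candidates
              if sum(c <= t for t in transactions) >= min_support}
        answer |= Lk
        L.append(Lk)
        k += 1
    return answer

# Illustrative transactions (not from the paper)
trans = [{"beer", "diaper"}, {"beer", "diaper", "chips"},
         {"beer", "bread"}, {"diaper", "milk"}]
frequent = apriori(trans, min_support=2)
```

On this data the frequent itemsets are {beer}, {diaper}, and {beer, diaper}. The original paper counts support via the `subset(Ck, t)` step with a hash tree; the linear scan here is the naive equivalent.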
