Introduction to Data Mining CPSC/AMTH 445a/545a Guy Wolf

Introduction to Data Mining CPSC/AMTH 445a/545a Guy Wolf guy.wolf@yale.edu Yale University Fall 2016 CPSC 445 (Guy Wolf) Introduction Yale - Fall 2016 1 / 19

Outline
1. What is Data Mining?
   - From data to information
   - Predictive vs. descriptive information
   - Supervised vs. unsupervised learning
2. Data Mining Tasks
   - Classification & regression
   - Clustering & anomaly detection
   - Association rules & sequential patterns
   - Visualization & dimensionality reduction
3. Data mining process
4. Software for data mining

Recommended textbooks [book covers not preserved]

What is data mining?

Data Mining: Non-trivial extraction of useful, new, hidden, and/or implicit information from data.

Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.

Deep Learning: A set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, composed of multiple linear and non-linear transformations.

Big Data: Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.

Related terms: knowledge discovery in databases (KDD), pattern recognition, data warehousing, OLAP, ETL, IT, etc.

What is data mining? - From data to information

[Figure: a pipeline diagram starting from collected data, represented as a subset of $\mathbb{R}^{O(100+)}$; the remaining stages are not preserved.]

What is data mining? - From data to information

Examples of data mining tasks:
- Recommend movies on Netflix or books on Amazon.
- Object recognition in images and automatic image tagging
- Community detection in social networks (e.g., Facebook)
- Automatic medical diagnosis and treatment recommendation

Examples of data processing tasks that are not considered data mining:
- Signature-based anti-virus
- Retrieving details from a contact list
- Text-based search in a document or on the web
- Quicksort, balanced trees, heaps, etc.

What is data mining? - Predictive vs. descriptive methods

Predictive methods: Predict unknown information from known data.
- How much would my house sell for, based on sales stats?
- Will Bob like Ghostbusters, based on his Netflix history?

Descriptive methods: Infer or extract interpretable patterns to describe data.
- What consumer profiles should my ads target?
- If Jim’s card is being charged $300 at a Disney store today, is that reasonable or fraud?

What is data mining? - Supervised vs. unsupervised learning

Machine learning data analysis tasks are roughly divided into:
- Supervised learning: Inferring information from labeled training data.
- Unsupervised learning: Finding hidden patterns in unlabeled data.
- Semi-supervised learning: Combine information from labeled and unlabeled data to model and deduce information.

Data mining tasks - Classification

Classification: Classify “items” into a finite set of classes, or “categories”.

Training phase:
  Labeled data: $\{(x_1, \ell_1), \ldots, (x_n, \ell_n)\} \subset X \times L$, with $|L| < \infty$
  $\Rightarrow$ Classification model: $F : X \to L$, $F(x_i) = \ell_i$

Testing phase:
  New data: $y_1, y_2, \ldots \in X$
  $\mapsto$ classification model $\Rightarrow$ Classification result: $F(y_1), F(y_2), \ldots \in L$
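The training and testing phases can be sketched with a 1-nearest-neighbor classifier. This is only one possible choice of model, not one named on the slide; the helper name `train_nearest_neighbor`, the 2-D items, and the labels are hypothetical.

```python
import math

def train_nearest_neighbor(labeled_data):
    """Training phase: build F: X -> L from labeled pairs (x_i, l_i)."""
    examples = list(labeled_data)

    def classify(x):
        # Return the label of the closest training item (Euclidean distance).
        _, label = min(examples, key=lambda pair: math.dist(pair[0], x))
        return label

    return classify

# Hypothetical labeled data: 2-D items with labels from a finite set L.
training = [((1.0, 1.0), "small"), ((1.5, 2.0), "small"),
            ((8.0, 8.0), "large"), ((9.0, 7.5), "large")]
F = train_nearest_neighbor(training)

# Testing phase: apply the model to new, unlabeled items y_1, y_2, ...
new_items = [(2.0, 1.0), (7.0, 9.0)]
predictions = [F(y) for y in new_items]
```

With distinct training items, this model reproduces the training labels exactly, matching the condition $F(x_i) = \ell_i$; many practical classifiers only approximate it.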

Data mining tasks - Classification - examples

Example (MNIST digit classification) [image not preserved]

Example (CalTech 101 image classification): Anchor, Joshua-Tree, Beaver, Lotus, Water-Lily [images not preserved]

Data mining tasks - Regression

Regression: Compute (or infer) the value of a (piecewise) continuous function from a finite number of sampled “items” & values.

This task is similar to classification, but here the model $F$ can have an infinite range (e.g., $\mathbb{R}$ or $[0, 1]$).

Examples:
- Market pricing of a house/apartment/car based on its features.
- Trend line & model fitting from collected experimental data.
- Weather predictions, such as temperature and probability of rain/snow.
- Confidence rating in diagnostics (or binary classifier).
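As a minimal sketch of the house-pricing example, an ordinary least-squares line can serve as the model $F$. The linear form, the function name `fit_line`, and the sample numbers are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical samples: floor area and sale price (arbitrary units).
areas = [50.0, 70.0, 90.0, 110.0]
prices = [150.0, 210.0, 270.0, 330.0]
slope, intercept = fit_line(areas, prices)

def F(x):
    """Regression model with a continuous (real-valued) range."""
    return slope * x + intercept

# Estimate the price of an unseen item with area 100.
estimate = F(100.0)
```

Unlike the classifier, the output here is a real number rather than one of finitely many labels.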

Data mining tasks - Clustering

Clustering: Group together similar “items” while separating ones that are different from each other.

The quality of obtained clusters stems from their interpretability. Variations include a known or unknown number of clusters, as well as multiscale hierarchical clustering structures.

Examples:
- Clustering stocks to diversify stock market investment
- Community detection in social networks by clustering profiles
- Clustering genes and cells to uncover activities, reactions, and interactions.
- Network activity profiling by clustering packets/sessions.
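For the variation with a known number of clusters, a minimal sketch using k-means (Lloyd's algorithm) is shown here. The slide does not name an algorithm; k-means, the function name `k_means`, the deterministic initialization, and the sample points are assumptions.

```python
import math

def k_means(points, k, iterations=20):
    """Lloyd's algorithm, using the first k points as initial centers."""
    centers = [points[i] for i in range(k)]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        assignment = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centers

# Hypothetical 2-D items forming two well-separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.9, 1.1), (9.1, 8.9)]
labels, centers = k_means(points, k=2)
```

Picking the first k points as centers keeps the sketch deterministic; practical implementations usually randomize the initialization and rerun several times.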

Data mining tasks - Anomaly detection

Anomaly/outlier detection: Detect significant deviations from normal behavior expressed by inferred data patterns.

The notion of “normal behavior” can be defined in several ways, such as clustering or model fitting.

Examples:
- Fraud detection in credit cards
- Intrusion detection in cybersecurity
- Detecting bot traffic in online advertising
- Malfunction detection in process monitoring
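One simple way to define “normal behavior” by model fitting is to summarize the data with a mean and standard deviation and flag values that lie far from the mean. The sketch below assumes that model; the function name `find_outliers`, the threshold of 2 standard deviations, and the charge amounts are hypothetical.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # All values are identical, so nothing deviates.
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical card charges; the last one is far from the usual amounts.
charges = [20.0, 25.0, 22.0, 30.0, 18.0, 27.0, 24.0, 300.0]
outliers = find_outliers(charges)
```

Because the outlier itself inflates the mean and standard deviation, this approach becomes unreliable on small samples or with several anomalies; robust statistics such as the median are a common alternative.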

Data mining tasks - Association rules

Association rule discovery: Produce dependency rules that model input co-occurrences of “items” to predict, given a partial “transaction”, the remaining “items” in it.

Training phase:
  Observed transactions: $T_1, \ldots, T_n \subseteq X$
  $\Rightarrow$ Association rules: $F : 2^X \to 2^X$, $T \subseteq T_i \mapsto F(T) \approx T_i \setminus T$

Testing phase:
  Partial transactions: $S_1, S_2, \ldots \subseteq X$
  $\mapsto$ association rules $\Rightarrow$ Predicted information: $\forall i,\ S_i \mapsto F(S_i) \subseteq X \setminus S_i$

Examples:
- Active advertisements & recommendations (e.g., “Users who liked/bought this product also liked/bought that product”)
- Support decision making on shelf organization in stores & supermarkets
- Name completions in emails, social networks, etc.

Unlike classification, the actual testing phase is often less important than the discovered rules in this case.
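A minimal sketch of the idea, restricted to single-item rules of the form "a implies b" and scored by confidence (the fraction of transactions containing a that also contain b). The restriction to pairs, the names `mine_pair_rules` and `predict`, the 0.6 threshold, and the shopping baskets are all assumptions for illustration.

```python
from collections import Counter
from itertools import permutations

def mine_pair_rules(transactions, min_confidence=0.6):
    """Training phase: keep rules a -> b with confidence >= min_confidence."""
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = set(transaction)
        item_counts.update(items)
        # Count every ordered pair (a, b) that co-occurs in this transaction.
        pair_counts.update(permutations(items, 2))
    rules = {}
    for (a, b), count in pair_counts.items():
        if count / item_counts[a] >= min_confidence:
            rules.setdefault(a, set()).add(b)
    return rules

def predict(rules, partial):
    """Testing phase: F(S), the suggested items not already in S."""
    suggested = set()
    for item in partial:
        suggested |= rules.get(item, set())
    return suggested - set(partial)

# Hypothetical observed transactions T_1, ..., T_n.
transactions = [{"bread", "butter", "milk"}, {"bread", "butter"},
                {"bread", "jam"}, {"milk", "cereal"}, {"bread", "butter", "jam"}]
rules = mine_pair_rules(transactions)
completion = predict(rules, {"butter"})
```

The `rules` dictionary is itself the interesting output here, since it can be read directly as "customers who buy butter also buy bread".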

Data mining tasks - Sequential patterns

Sequential pattern discovery: Given a set of ordered event sequences, produce rules to predict unknown/missing/future events from prior and/or subsequent events.

Similar in some sense to association rule discovery, but with an order or timeline aspect to each transaction.

Examples:
- String mining:
  - Natural language processing
  - Gene sequencing in DNA and RNA
- Frequent item purchase sequences
- Predicting outcomes of medical treatment
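As a rough sketch of predicting a future event from a prior one, the code below counts which event most often directly follows each event in the observed sequences. This first-order (one step back) model is a simplifying assumption, and the names `learn_transitions` and `predict_next` and the purchase sequences are hypothetical.

```python
from collections import Counter, defaultdict

def learn_transitions(sequences):
    """Count how often each event is immediately followed by another."""
    followers = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            followers[current][nxt] += 1
    return followers

def predict_next(followers, event):
    """Most frequent successor of `event`, or None if it was never seen."""
    if event not in followers:
        return None
    return followers[event].most_common(1)[0][0]

# Hypothetical ordered purchase sequences.
purchases = [["phone", "case", "charger"], ["phone", "case"],
             ["laptop", "mouse"], ["phone", "charger"]]
model = learn_transitions(purchases)
next_item = predict_next(model, "phone")
```

The order of events matters here, unlike in the association-rule setting: the model learns that a case tends to follow a phone, not the reverse.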
