Unsupervised Machine Learning and Data Mining
DS 5230 / DS 4420 - Fall 2018
Lecture 1
Jan-Willem van de Meent
What is Data Mining? An intersection of disciplines, including Machine Learning, Statistics, Optimization, Algorithms, Databases, Distributed Computing, Visualization, and Applications.
(slide adapted from Han et al. Data Mining Concepts and Techniques)
The knowledge discovery process:
Databases -> Data Cleaning -> Data Integration -> Data Warehouse -> Data / Feature Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation -> Knowledge
(slide adapted from Han et al. Data Mining Concepts and Techniques)
The Data Science pipeline:
1. Data Collection: sensors, scrapers
2. Cleaning / Preprocessing: integration, normalization, feature selection, dimension reduction
3. Machine Learning: association rules, classification, clustering, outlier detection
4. Posthoc Analysis: evaluation, visualization, interpretation
5. Deployment: pipelining, scaling, APIs
(slide adapted from Nate Derbinsky)
[Overview table: the data types, methods, and tasks covered in this course]
Supervised Learning: Given labeled examples, learn to make predictions for unlabeled examples. Example: image classification.
Unsupervised Learning (This Course): Given unlabeled examples, learn to identify structure. Example: community detection in social networks.
Reinforcement Learning: Learn to take actions that maximize future reward. Example: targeting advertisements.
Goal: Predict a continuous label.
Example: Boston Housing Data (source: UCI ML datasets, https://archive.ics.uci.edu/ml/datasets/Housing)
- Target variable: MEDV, the median value of owner-occupied homes in $1000's
- Real-valued features, e.g. CRIM: per capita crime rate by town
- Discrete / categorical features, e.g. CHAS: Charles River variable (= 1 if tract bounds river; 0 otherwise)
- Hand-engineered features, e.g. DIS: weighted distances to five Boston employment centers
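A regression task like the one above can be sketched in a few lines. This is a minimal illustration with synthetic data, not the actual UCI housing data: fit a linear model to real-valued features by ordinary least squares.

```python
import numpy as np

# Toy stand-in for the housing task: predict a continuous target
# (like MEDV) from real-valued features via ordinary least squares.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # three synthetic features
true_w = np.array([2.0, -1.0, 0.5])                 # ground-truth weights
y = X @ true_w + rng.normal(scale=0.1, size=100)    # noisy continuous label

# Closed-form least-squares fit of the weights
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With enough examples and little noise, the recovered weights `w_hat` land close to the generating weights.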
Time-series data (source: https://am241.wordpress.com/tag/time-series/)
Goal: Use past labels (red) to learn trends that generalize to future data points (green).
Goal: Predict a discrete label.
Network: input images (28 x 28) -> hidden units (256) -> label, one-hot (10)
[0 0 0 0 0 0 0 0 0 1]: 9
[0 0 0 0 0 0 0 0 1 0]: 8
[0 0 0 0 0 1 0 0 0 0]: 5
[0 0 0 0 0 0 0 1 0 0]: 7
[0 0 0 0 0 0 0 0 1 0]: 8
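The one-hot encoding above places a single 1 at the index of the digit. A small helper (hypothetical, for illustration) makes the convention explicit:

```python
import numpy as np

def one_hot(label, num_classes=10):
    """Encode a digit label as a one-hot vector with a 1 at index `label`."""
    v = np.zeros(num_classes, dtype=int)
    v[label] = 1
    return v

one_hot(9)  # -> [0 0 0 0 0 0 0 0 0 1]
```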
Example: Iris Data (https://en.wikipedia.org/wiki/Iris_flower_data_set)
Classes: Iris setosa, Iris versicolor, Iris virginica; features are measurements of the petal and sepal.
Goal: Can we make predictions in the absence of labels? Methods in this course:
Association Rules: given baskets of items, learn rules such as
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
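Rules like these are scored by support and confidence. A minimal sketch over the basket table above (the helper names are illustrative, not from any particular library):

```python
# The five baskets from the table above
baskets = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of baskets containing every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    """Confidence of lhs --> rhs: support(lhs U rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)
```

For instance, {Diaper, Milk} appears in 3 of 5 baskets and {Diaper, Milk, Beer} in 2 of 5, so the rule {Diaper, Milk} --> {Beer} has confidence 2/3.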
Goal: Map high-dimensional data onto a lower-dimensional representation in a manner that preserves distances/similarities. Example: original data (4 dims) projected with PCA (2 dims).
Example: MNIST input images embedded with PCA (linear) and t-SNE (non-linear).
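Linear dimension reduction with PCA can be sketched directly via the SVD. This minimal example uses synthetic 4-D data standing in for the four Iris features:

```python
import numpy as np

# PCA sketch: project centered data onto its top two principal components.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))        # synthetic data, 4 dims
Xc = X - X.mean(axis=0)              # center each feature

# Right singular vectors Vt are the principal directions,
# ordered by decreasing singular value (explained variance).
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                    # 2-D projection (4 dims -> 2 dims)
```

The projection `Z` keeps the two directions of greatest variance, which is what makes the 2-D scatter plots on slides like these informative.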
Goal: Learn categories of examples (i.e. classification without labels). Example: Iris data (after PCA) and the inferred clusters.
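The classic algorithm for this task is k-means. A hypothetical sketch of Lloyd's algorithm, with an optional `init` argument (an assumption for reproducibility, not part of the standard formulation):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0, init=None):
    """Cluster unlabeled points X into k groups via Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    centers = (np.asarray(init, dtype=float) if init is not None
               else X[rng.choice(len(X), k, replace=False)])
    for _ in range(iters):
        # assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers
```

On well-separated data the assignments recover the underlying groups; in general k-means converges only to a local optimum, so multiple restarts are common.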
Goal: Learn categories of time points (i.e. clustering of points within a time series), producing a sequence of states.
Topics (distributions over words):
- gene 0.04, dna 0.02, genetic 0.01, ...
- life 0.02, evolve 0.01, ...
- brain 0.04, neuron 0.02, nerve 0.01, ...
- data 0.02, number 0.02, computer 0.01, ...
Documents are modeled through topic proportions and per-word topic assignments.
Goal: Learn topics (categories of words) and quantify topic frequency for each document
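The generative picture behind topic models can be sketched in a few lines. The vocabulary, topic distributions, and proportions below are made-up illustrations, not fitted values:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["gene", "dna", "life", "brain", "neuron", "data", "computer"]
# Each topic is a distribution over the vocabulary (rows sum to 1)
topics = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0, 0.0],   # "genetics" topic
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0],   # "neuroscience" topic
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5],   # "computing" topic
])
theta = np.array([0.7, 0.1, 0.2])           # one document's topic proportions

# Generate one word: pick a topic z ~ theta, then a word w ~ topics[z]
z = rng.choice(3, p=theta)
w = vocab[rng.choice(len(vocab), p=topics[z])]
```

Fitting a topic model (e.g. LDA) is the inverse problem: given only the words of many documents, infer the topics and each document's proportions.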
Goal: Identify groups of connected nodes (i.e. clustering on graphs).
Nodes: Football Teams, Edges: Matches, Communities: Conferences
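In the simplest case, "groups of connected nodes" are the connected components of the graph. Real community detection (as in the football example) uses richer criteria such as modularity, but a minimal sketch of the simplest case, on a tiny illustrative edge list:

```python
from collections import deque

# Tiny undirected graph: two obvious groups, {0,1,2} and {3,4}
edges = [(0, 1), (1, 2), (3, 4)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def components(adj):
    """Return the connected components via breadth-first search."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), set()
        while queue:
            u = queue.popleft()
            comp.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps
```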
Goal: Predict which website is the most authoritative. Pages with many inbound links from important pages rank higher than pages with few or no inbound links, or with links only from unimportant pages.
(adapted from: Mining of Massive Datasets, http://www.mmds.org)
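The standard formalization of this idea is PageRank. A power-iteration sketch on a hypothetical 3-page web (the link structure and damping factor 0.85 are illustrative choices):

```python
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0]}    # page -> outbound links
n, beta = 3, 0.85                       # beta: damping factor

# Column-stochastic transition matrix: M[j, i] = 1/outdeg(i) if i links to j
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

r = np.full(n, 1.0 / n)                 # start from the uniform distribution
for _ in range(100):
    r = beta * M @ r + (1 - beta) / n   # random surfer with teleportation
```

Here page 2, which receives links from both other pages, ends up with the highest rank, and the ranks remain a probability distribution.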
Goal: Take actions that maximize future reward. Example: Google (DeepMind) plays Atari. Action: joystick direction / buttons. Reward: game score.
Goal: Take actions that maximize future reward. Example: Netflix website design. Action: which movies to show. Reward: user retention.
Goal: Predict user preferences for unseen items. Methods: Supervised learning (predict ratings), Reinforcement learning (rating is reward), Unsupervised learning (e.g. community detection on users / items)
Supervised Learning: minimize a regression or classification loss.
Unsupervised Learning: maximize the expected probability of the data.
Reinforcement Learning: maximize the expected reward.
Common theme in Machine Learning: Using data-driven algorithms to make predictions that are optimal according to some objective.
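The three objectives can be made concrete with toy numbers (all values below are illustrative, not from any dataset):

```python
import numpy as np

# Supervised: a classification loss to MINIMIZE (cross-entropy)
p = np.array([0.1, 0.7, 0.2])   # predicted class probabilities
y = np.array([0, 1, 0])         # true one-hot label
loss = -np.sum(y * np.log(p))   # = -log(0.7)

# Unsupervised: (log-)probability of the data to MAXIMIZE,
# e.g. log-likelihood of a point x under a standard Gaussian
x = 0.5
loglik = -0.5 * np.log(2 * np.pi) - 0.5 * x**2

# Reinforcement: expected reward to MAXIMIZE over actions
action_probs = np.array([0.5, 0.5])
rewards = np.array([1.0, 3.0])
expected_reward = action_probs @ rewards
```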
https://course.ccs.neu.edu/ds5230f18/
(within reason – the TA must be able to run your code)
Homework must be completed individually (absolutely no sharing of code).
Assignments are due by 11:59pm on the day of the deadline (no late submissions).
and pre-processing
Class participation is used to adjust grades upwards (at the discretion of the instructor).
Textbooks:
- Hastie, Tibshirani, Friedman – The Elements of Statistical Learning (PDF freely available)
- Leskovec, Rajaraman, Ullman – Mining of Massive Datasets (PDF freely available)
- Aggarwal – Data Mining (available)
- Bishop – Pattern Recognition and Machine Learning
- Murphy – Machine Learning: A Probabilistic Perspective
Recommend you buy one.