Lecture 1 Jan-Willem van de Meent What is Data Mining? - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Unsupervised Machine Learning and Data Mining

DS 5230 / DS 4420 - Fall 2018

Lecture 1

Jan-Willem van de Meent

slide-2
SLIDE 2

What is Data Mining?

slide-3
SLIDE 3

Intersection of Disciplines

Data Mining

  • Machine Learning
  • Optimization
  • Applications
  • Statistics
  • Visualization
  • Algorithms
  • Distributed Computing
  • Databases

(slide adapted from Han et al. Data Mining Concepts and Techniques)

slide-4
SLIDE 4

Databases Perspective

Databases --> Data Cleaning / Data Integration --> Data Warehouse --> Data / Feature Selection --> Task-relevant Data --> Data Mining --> Pattern Evaluation --> Knowledge

(slide adapted from Han et al. Data Mining Concepts and Techniques)

slide-5
SLIDE 5

Machine Learning Perspective

Data Science pipeline:

  • Data Collection: API, Sensors, Scraper
  • Cleaning / Preprocessing: Integration, Normalization, Feature selection, Dimension reduction
  • Machine Learning: Association Rules, Classification, Clustering, Outlier detection
  • Posthoc Analysis: Evaluation, Visualization, Interpretation
  • Deployment: Pipelining, Scaling

(slide adapted from Nate Derbinsky)

slide-6
SLIDE 6

3 Aspects of Data Mining

Data Types

  • Sets
  • Matrices / Tables
  • Graphs
  • Time series
  • Sequences
  • Text
  • Images

Methods

  • Association Rules
  • Dimensionality Reduction
  • Regression
  • Classification
  • Clustering
  • Topic Models
  • Bandits

Tasks

  • Exploratory Analysis
  • Market Basket Analysis
  • Recommender Systems
  • Community Detection
  • Link Analysis

slide-7
SLIDE 7

Machine Learning Methods

  • Supervised Learning: Given labeled examples, learn to make predictions for unlabeled examples. Example: Image classification.
  • Unsupervised Learning (This Course): Given unlabeled examples, learn to identify structure. Example: Community detection in social networks.
  • Reinforcement Learning: Learn to take actions that maximize future reward. Example: Targeting advertisements.

slide-8
SLIDE 8

Regression

Goal: Predict a Continuous Label

Boston Housing Data (source: UCI ML datasets)
https://archive.ics.uci.edu/ml/datasets/Housing

slide-9
SLIDE 9

Target Variable

MEDV: Median value of owner-occupied homes in $1000's

Regression

slide-10
SLIDE 10

Real-valued Features

CRIM: per capita crime rate by town

Regression

slide-11
SLIDE 11

Discrete / Categorical Features

CHAS: Charles River variable (= 1 if tract bounds river; 0 otherwise)

Regression

slide-12
SLIDE 12

Hand-Engineered Features

DIS: weighted distances to five Boston employment centers

Regression
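
As a minimal sketch of what "predict a continuous label" means, the following fits a straight line to one real-valued feature by least squares. The toy numbers and the helper name `fit_line` are illustrative assumptions; this is not the Boston Housing data, and the slides do not prescribe a particular regression method.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the intercept follows from the means.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up toy data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # continuous label for an unseen x = 5
```

With several features (such as CRIM, CHAS and DIS together) the same idea generalizes to multivariate least squares, typically solved with a numerical library.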

slide-13
SLIDE 13

Regression

source: https://am241.wordpress.com/tag/time-series/

Time-series Data

Goal: Use past labels (red) to learn trends that generalize to future data points (green)

slide-14
SLIDE 14

Classification

Goal: Predict a discrete label.

Input Images (28 x 28) --> Hidden Units (256) --> Label (one-hot, 10)

Example labels:

  • [0 0 0 0 0 0 0 0 1 0]: 9
  • [0 0 0 0 0 0 0 1 0 0]: 8
  • [0 0 0 0 1 0 0 0 0 0]: 5
  • [0 0 0 0 0 0 1 0 0 0]: 7
  • [0 0 0 0 0 0 0 1 0 0]: 8
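
A one-hot label can be sketched as follows. This assumes zero-based class indices (class k sets position k), which is a common convention; the indexing used in the slide's figure may differ. The helper names `one_hot` and `decode` are hypothetical.

```python
def one_hot(label, num_classes=10):
    """Return a list with a 1 at index `label` and 0 everywhere else."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    vec = [0] * num_classes
    vec[label] = 1
    return vec

def decode(vec):
    """Recover the class index as the position of the largest entry."""
    return max(range(len(vec)), key=lambda i: vec[i])

encoded = one_hot(7)
decoded = decode(encoded)
```

`decode` also works on a vector of predicted scores, where the predicted class is the position with the highest score.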

slide-15
SLIDE 15

Example: Iris Data

https://en.wikipedia.org/wiki/Iris_flower_data_set

Species: Iris setosa, Iris versicolor, Iris virginica

Flower parts labeled in the figure: Petal, Sepal

Classification

slide-16
SLIDE 16

Unsupervised Learning

Goal: Can we make predictions in the absence of labels?

Methods in this Course:

  • Frequent Itemsets and Association rule mining
  • Dimensionality Reduction
  • Clustering
  • Topic Modeling
  • Community Detection
  • Link Analysis
  • Recommender Systems
slide-17
SLIDE 17

Association Rule Mining

Baskets of items:

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Association Rules:

{Milk} --> {Coke}

{Diaper, Milk} --> {Beer}
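
Support and confidence are the standard measures for scoring rules like these; the slide does not define them, so treat the sketch below as one conventional way to evaluate the two rules on the five baskets listed above. The function names are illustrative.

```python
baskets = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """Fraction of baskets containing `lhs` that also contain `rhs`."""
    return support(lhs | rhs, baskets) / support(lhs, baskets)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs | rhs, baskets):.2f}",
          f"confidence={confidence(lhs, rhs, baskets):.2f}")
```

For {Milk} --> {Coke}, three of the five baskets contain both items (support 0.60) and three of the four Milk baskets contain Coke (confidence 0.75). For {Diaper, Milk} --> {Beer}, the support is 0.40 and the confidence is 2/3.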

slide-18
SLIDE 18

Dimensionality Reduction

Goal: Map high dimensional data onto lower-dimensional data in a manner that preserves distances/similarities

Figure panels: Original Data (4 dims), Projection with PCA (2 dims)
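
A projection from 4 dimensions to 2 with PCA can be sketched using only the standard library. This version finds the leading eigenvectors of the covariance matrix by power iteration with deflation, which is one simple technique chosen here for self-containment; it is not necessarily how the slide's figure was produced, and in practice a numerical library would normally be used. The toy data and all names are assumptions.

```python
import math

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pca(data, k=2, iters=500):
    """Return (components, projected data) for the top-k principal directions."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _component in range(k):
        # Power iteration: assumes the start vector is not orthogonal to the
        # leading eigenvector and that the top eigenvalues are distinct.
        v = [1.0 / math.sqrt(d)] * d
        for _step in range(iters):
            w = mat_vec(cov, v)
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        eigenvalue = dot(v, mat_vec(cov, v))
        components.append(v)
        # Deflation: remove the direction just found so the next pass
        # converges to the next-largest eigenvector.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    projected = [[dot(row, c) for c in components] for row in centered]
    return components, projected

# Made-up 4-dimensional points whose variance lies mostly in the first two axes.
data = [
    [2.0, 0.0, 0.1, 0.0], [-2.0, 0.0, -0.1, 0.0],
    [0.0, 1.0, 0.0, 0.1], [0.0, -1.0, 0.0, -0.1],
    [1.0, 0.5, 0.0, 0.0], [-1.0, -0.5, 0.0, 0.0],
]
components, projected = pca(data, k=2)
```

Each row of `projected` is the 2-dimensional coordinate of one original point, with the first coordinate carrying the most variance.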

slide-19
SLIDE 19

Dimensionality Reduction

Goal: Map high dimensional data onto lower-dimensional data in a manner that preserves distances/similarities

Figure panels (MNIST): Input Images, PCA (Linear), t-SNE (Non-linear)

slide-20
SLIDE 20

Clustering

Goal: Learn categories of examples (i.e. classification without labels)

Figure panels: Iris Data (after PCA), Inferred Clusters
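
The slide does not name a clustering algorithm. As one common example, k-means (Lloyd's algorithm) can be sketched as below; the toy points, the choice of the first k points as initial centroids, and the function name are all assumptions made for illustration.

```python
def kmeans(points, k, iters=20):
    """Lloyd's algorithm, using the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments

# Two made-up, well-separated groups of 2-D points.
points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3), (4.8, 5.2)]
centroids, assignments = kmeans(points, k=2)
```

No labels are used anywhere: the cluster indices in `assignments` are inferred purely from distances between points.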

slide-21
SLIDE 21

Hidden Markov Models

Goal: Learn categories of time points (i.e. clustering of points within time series)

Figure panels: Sequence of States, Time Series

slide-22
SLIDE 22

Topic Models

Topics (word: probability):

  • gene 0.04, dna 0.02, genetic 0.01, ...
  • life 0.02, evolve 0.01, organism 0.01, ...
  • brain 0.04, neuron 0.02, nerve 0.01, ...
  • data 0.02, number 0.02, computer 0.01, ...

Figure panels: Topics, Documents, Topic proportions and assignments

Goal: Learn topics (categories of words) and quantify topic frequency for each document

slide-23
SLIDE 23

Community Detection

Goal: Identify groups of connected nodes (i.e. clustering on graphs)

slide-24
SLIDE 24

Community Detection

Goal: Identify groups of connected nodes (i.e. clustering on graphs)

Nodes: Football Teams, Edges: Matches, Communities: Conferences

slide-25
SLIDE 25

Link Analysis

Goal: Predict which website is the most authoritative.

  • Pages with more inbound links are more important
  • Inbound links from important pages carry more weight

Figure labels: Many inbound links, Few/no inbound links, Links from unimportant pages, Links from important pages

(adapted from: Mining of Massive Datasets, http://www.mmds.org)
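
PageRank is one well-known algorithm built on these two ideas. The slide does not name it, so the following is only an illustrative sketch: a power iteration with a damping factor of 0.85 (a conventional default) on a hypothetical four-page web.

```python
def pagerank(links, damping=0.85, iters=100):
    """Power iteration on a dict mapping each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical web: A, B and D all link to C; C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
```

Here C ranks highest because it has the most inbound links, and A outranks B and D because its single inbound link comes from the important page C.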

slide-26
SLIDE 26

Reinforcement Learning

Goal: Take action that maximizes future reward.

Example: Google Plays Atari

  • Action: Joystick direction / Buttons.
  • Reward: Score.

slide-27
SLIDE 27

Reinforcement Learning

Goal: Take action that maximizes future reward.

Example: Netflix Website Design

  • Action: Which movies to show.
  • Reward: User Retention.

slide-28
SLIDE 28

Recommender Systems

Goal: Predict user preferences for unseen items.

Methods:

  • Supervised learning (predict ratings)
  • Reinforcement learning (rating is reward)
  • Unsupervised learning (e.g. community detection on users / items)

slide-29
SLIDE 29

Theme: Optimization of Objectives

  • Supervised Learning: Minimize regression or classification loss
  • Unsupervised Learning: Maximize expected probability of data
  • Reinforcement Learning: Maximize expected reward

Common theme in Machine Learning: Using data-driven algorithms to make predictions that are optimal according to some objective.
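
One common way to write these three objectives is sketched below. The notation is an assumption rather than something taken from the slides: $\theta$ denotes model parameters, $\ell$ a loss function, $p_\theta$ a model density, $\pi$ a policy, $r_t$ the reward at step $t$, and $\gamma$ a discount factor. The average log-likelihood is shown as one typical instance of "maximize probability of data".

```latex
\begin{align*}
\text{Supervised:} \quad & \min_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \ell\big(y_n, f_\theta(x_n)\big) \\
\text{Unsupervised:} \quad & \max_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \log p_\theta(x_n) \\
\text{Reinforcement:} \quad & \max_{\pi} \; \mathbb{E}_{\pi}\Big[ \sum_{t=0}^{\infty} \gamma^{t} r_t \Big]
\end{align*}
```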

slide-30
SLIDE 30

Syllabus

https://course.ccs.neu.edu/ds5230f18/

slide-31
SLIDE 31

Homework Problems

  • 4 problem sets, 40% of grade
  • Python encouraged, but can use any language (within reason: the TA must be able to run your code)
  • Discussion is encouraged, but submissions must be completed individually (absolutely no sharing of code)
  • Submission via zip file on Blackboard by 11:59pm on day of deadline (no late submissions)
  • Please follow submission guidelines.
slide-32
SLIDE 32

Project Goals

  • Select a dataset / prediction problem
  • Perform exploratory analysis and pre-processing
  • Apply one or more algorithms
  • Critically evaluate results
  • Submit a report and present project
slide-33
SLIDE 33

Project Deadlines

  • 21 Sep: Form teams of 2-4 people
  • 24 Oct: Submit abstract (1 paragraph)
  • 5 Nov: Milestone 1 (exploratory analysis)
  • 19 Nov: Milestone 2 (statistical analysis)
  • 28 Nov: Project Presentations
  • 7 Dec: Project Reports Due

slide-34
SLIDE 34

Grading

  • Homework: 40%
  • Midterm: 15%
  • Final: 15%
  • Project: 30%
slide-35
SLIDE 35

Participation

  • 1. Read the materials
  • 2. Attend the Lectures
  • 3. Ask questions
  • 4. Come to office hours
  • 5. Help Others

Class participation is used to adjust grade upwards (at the discretion of the instructor)

slide-36
SLIDE 36

Textbooks

Data Mining / Statistics:

  • Hastie, Tibshirani, Friedman (PDF freely available)
  • Leskovec, Rajaraman, Ullman (PDF freely available)
  • Aggarwal (available on campus network)

Machine Learning (recommend you buy one):

  • Bishop
  • Murphy