CS570 Introduction to Data Mining



SLIDE 1

CS570 Introduction to Data Mining

Department of Mathematics and Computer Science
Li Xiong

SLIDE 2

Today

  • Meeting everybody in class
  • Course topics
  • Course logistics

SLIDE 3

Instructor

  • Li Xiong
      – Web: http://www.mathcs.emory.edu/~lxiong
      – Email: lxiong@emory.edu
      – Office Hours: MW 11:15-12:15pm or by appointment
      – Office: MSC E412

SLIDE 4

About Me

http://www.mathcs.emory.edu/~lxiong

  • Undergraduate teaching
      – CS170 Intro to CS I
      – CS171 Intro to CS II
      – CS377 Database systems
      – CS378 Data mining
  • Graduate teaching
      – CS550 Database systems
      – CS570 Data mining
      – CS573 Data privacy and security
      – CS730R/CS584 Topics in data management – big data analytics
  • Research
      – Data privacy and security
      – Spatiotemporal data management
      – Health informatics
  • Industry experience (software engineer)
      – SRA International
      – IBM Internet Security Systems

SLIDE 5

TA

  • TA: Farnaz Tahmasebian
      – Email: farnaz.tahmasebian@emory.edu
      – Office Hours: TBA
      – Office: N414

SLIDE 6

Meet everyone in class

  • Group introduction (2-3 people)
  • Introducing your group
      – Name
      – Goals for taking the course
      – Something interesting to share with the class

8/23/2017

SLIDE 7

Today

  • Meeting everybody in class
  • Course topics
  • Course logistics

SLIDE 8

Evolution of Sciences

  • Before 1600, empirical science
      – Knowledge must be based on observable phenomena
      – Natural science vs. social sciences
  • 1600-1950s, theoretical science
      – Motivate experiments and generalize our understanding (e.g., theoretical physics)
  • 1950s-now, computational science
      – Traditionally meant simulation (e.g., computational physics)
      – Evolving to include information management
  • 1960-now, data science
      – Flood of data from new scientific instruments and simulations
      – Ability to economically store and manage petabytes of data online
      – Accessibility of the data through the Internet and computing Grid
      – Scientific information management poses Computer Science challenges: acquisition, organization, query, analysis, and visualization of the data

Jim Gray and Alex Szalay, The World Wide Telescope, Comm. ACM, 45(11): 50-54, Nov. 2002

SLIDE 9

Data Mining: Concepts and Techniques

Evolution of Data and Information Science

  • 1960s:
      – Data collection, database creation, network DBMS
  • 1970s:
      – Relational data model, relational DBMS implementation
  • 1980s:
      – RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
      – Application-oriented DBMS (spatial, scientific, engineering, etc.)
  • 1990s:
      – Data mining, data warehousing, multimedia databases, and Web databases
  • 2000s:
      – Stream data management and mining
      – Data mining and its applications
      – Web technology (XML, data integration) and global information systems
      – Social networks

SLIDE 10

Big Data Tsunami

SLIDE 11

The 5 V’s of Big Data

SLIDE 12

Transforming the world with data

  • Precision medicine
  • Enriched daily lives and social systems

SLIDE 13

Value of Data

  • Precision medicine

SLIDE 14

Value of Data

  • GPS traces, call records
  • Syndromic surveillance, social relationships

SLIDE 15

Value of Data

  • Shopping history
  • Recommendations

SLIDE 16

What the class is about

SLIDE 17

What Is Data Mining?

  • Data mining (knowledge discovery from data)
      – Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
  • Data mining really means knowledge mining
      – We are drowning in data, but starving for knowledge!
  • Alternative names
      – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, information harvesting, business intelligence, etc.

SLIDE 18

Knowledge Discovery (KDD) Process

[Figure: KDD process diagram. Labels: Databases; Data Cleaning; Data Integration; Data Warehouse; Selection and transformation; Task-relevant Data; Data Mining; Pattern Evaluation]

SLIDE 19

Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Artificial Intelligence, and Other Disciplines]

SLIDE 20

Data Mining Functionalities

  • Predictive: predict the value of a particular attribute based on the values of other attributes
      – Classification
      – Regression
  • Descriptive: derive patterns that summarize the underlying relationships in data
      – Pattern mining and association analysis
      – Cluster analysis
      – Ranking queries and skyline

SLIDE 21

Class Topics

  • Classical data mining and machine learning algorithms
      – Frequent pattern mining
      – Classification
      – Clustering
  • Data exploration techniques
      – Ranking (kNN, skyline)
  • Data mining applications and emerging challenges
      – Spatiotemporal data mining (data variety)
      – Truth discovery (data veracity)
      – Privacy preserving data mining (data privacy)

SLIDE 22

Classification and prediction

  • Classification: construct models (functions) that describe and distinguish classes for future prediction
  • Prediction/regression: predict unknown or missing numerical values
  • Derived models can be represented as rules, mathematical formulas, etc.
  • Topics
      – Classification: Decision tree, Bayesian classification, Neural networks, Support vector machines, kNN
      – Regression: linear and non-linear regression
      – Ensemble methods
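As a rough illustration of one listed classifier, kNN, the sketch below labels a query point by majority vote among its k nearest training points. The function name `knn_predict`, the toy data, and the choice of squared Euclidean distance are assumptions made for this example, not material from the course.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training points.

    train: list of (feature_tuple, label) pairs (hypothetical toy format).
    """
    def sq_dist(a, b):
        # Squared Euclidean distance; ordering is the same as for true distance.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set with two well-separated classes.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((9, 8), "B"), ((8, 9), "B")]

label_near_a = knn_predict(train, (2, 2))
label_near_b = knn_predict(train, (7, 8))
```

With these well-separated toy classes, each query takes the label of the group it sits next to.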

SLIDE 23

Frequent pattern mining and association analysis

  • Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
      – Frequent sequential pattern
      – Frequent structured pattern
  • Applications
      – Basket data analysis: beer and diapers
      – Web log (click stream) analysis
      – DNA sequence analysis
  • Challenge: efficient algorithms to handle the exponential size of the search space
  • Topics
      – Algorithms: Apriori, Frequent pattern growth, Vertical format
      – Closed and maximal patterns
      – Association rules mining
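To make the search-space challenge concrete, here is a minimal level-wise sketch in the spirit of Apriori: a k-itemset is counted only if all of its (k-1)-subsets were already found frequent. The function name `frequent_itemsets`, the basket data, and the absolute `min_support` count are illustrative assumptions, and the code is not tuned for large data.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for itemsets in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        # Join: merge pairs of frequent (k-1)-itemsets into k-itemset candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: drop candidates with any infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Count the surviving candidates against the data.
        current = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t)
            if support >= min_support:
                current[c] = support
        result.update(current)
        k += 1
    return result

# Hypothetical basket data.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
]
patterns = frequent_itemsets(baskets, min_support=2)
```

The prune step is what keeps the candidate set small: the triple {beer, diapers, chips} is never counted because {diapers, chips} is not frequent.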

SLIDE 24

Cluster and outlier analysis

  • Cluster analysis
      – Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
      – Unsupervised learning (vs. supervised learning)
      – Maximizing intra-class similarity and minimizing inter-class similarity
  • Outlier analysis
      – Outlier: data object that does not comply with the general behavior of the data
      – Noise or exception? Useful in fraud detection, rare events analysis
      – E.g., an extremely large purchase
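One simple way to flag a purchase that departs from general behavior is a median-based rule, sketched below. The rule (deviation from the median greater than `factor` times the median absolute deviation), the name `mad_outliers`, the default `factor`, and the purchase amounts are all assumptions chosen for illustration; the course may well use other definitions of an outlier.

```python
from statistics import median

def mad_outliers(values, factor=5.0):
    """Flag values whose distance from the median exceeds factor * MAD."""
    center = median(values)
    deviations = [abs(v - center) for v in values]
    mad = median(deviations)  # median absolute deviation
    if mad == 0:
        # Assumption: with no spread, treat nothing as an outlier.
        return []
    return [v for v, d in zip(values, deviations) if d > factor * mad]

# Hypothetical purchase amounts with one extreme value.
purchases = [20, 25, 22, 30, 27, 24, 5000]
flagged = mad_outliers(purchases)
```

Whether a flagged value is noise or a meaningful exception (for example, fraud) still has to be decided by the application.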

SLIDE 25

Clustering Analysis

  • Topics
      – Partitioning based clustering: k-means
      – Hierarchical clustering: classical, BIRCH
      – Density based clustering: DBSCAN
      – Model-based clustering: EM
      – Cluster evaluation
      – Outlier analysis
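The first topic, k-means, can be sketched in a few lines of plain Python. This is a minimal version of the standard assign-then-update loop (Lloyd's algorithm). Seeding the centroids with the first k points, the fixed iteration count, and the toy data are simplifying assumptions for the example.

```python
def kmeans(points, k, iterations=10):
    """Cluster points (tuples of numbers) into k groups.

    Assumption: the first k points serve as initial centroids.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                dims = len(members[0])
                centroids[i] = tuple(
                    sum(m[d] for m in members) / len(members) for d in range(dims)
                )
    return centroids, clusters

# Toy data: two visibly separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
```

On this toy data the loop settles after the first pass, with three points per cluster.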

SLIDE 26

Ranking queries and skyline

  • Top-k and kNN queries
  • Skyline
  • Algorithms and various definitions
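Under the common definition, the skyline of a set of points is the subset not dominated by any other point. The sketch below is a naive quadratic version that assumes smaller values are better in every dimension. The names `dominates` and `skyline` and the hotel data are illustrative, and the definitions used in class may differ.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere.

    Assumption: smaller is better in every dimension.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def skyline(points):
    """Naive O(n^2) skyline: keep points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical hotels as (price, distance to beach).
hotels = [(50, 8), (80, 2), (60, 5), (90, 3), (70, 9)]
best = skyline(hotels)
```

Here (90, 3) drops out because (80, 2) is both cheaper and closer, and (70, 9) drops out because of (50, 8).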

SLIDE 27

Spatiotemporal data mining

  • Trajectory mining
  • Time series
  • Applications
      – Mobility study
      – Traffic prediction
      – Location recommendation

SLIDE 28

Big Data and Privacy

SLIDE 29

Privacy Risks

SLIDE 30

Privacy Risks

  • Tracking
  • Identification
  • Profiling

SLIDE 31

Privacy preserving data mining

  • Topics: algorithms that allow data mining while protecting individuals' information
  • Challenge: tradeoff between privacy, accuracy, and efficiency

SLIDE 32

Today

  • Meeting everybody in class
  • Course topics
  • Course logistics

SLIDE 33

Textbooks

  • J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 3rd edition
  • J. Leskovec, A. Rajaraman, J. Ullman, Mining of Massive Datasets
      – Online version: http://www.mmds.org
  • G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning, 2013
  • P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison-Wesley, 2005

SLIDE 34

Data (Mining) Conferences

  • Data mining
      – SIGKDD, ICDM, SDM, CIKM, PAKDD …
  • Data management
      – SIGMOD, VLDB, ICDE, EDBT, CIKM …
  • Machine learning
      – ICML, NIPS, AAAI, …

SLIDE 35

Workload

  • ~3 programming assignments
      – Implementation of classical algorithms and competition!
  • ~3 reading assignments/paper reviews
  • ~1 paper presentation
  • 1 course project (team of up to 3 students)
  • 1 midterm
  • No final exam

SLIDE 36

Paper reviews

  • 1 page
  • NOT just a summary of the paper, but your critical opinion of the paper
  • Format
      – Summary
      – 3 strengths or things you like (S1, S2, S3 …)
      – 3 weaknesses (W1, W2, W3 …)
      – Potential extensions/ideas
  • Connect and contrast the paper to what we have learned/read so far

SLIDE 37

Course Project

  • Options
      – Comparative study and evaluation of existing algorithms
      – Design of new algorithms to solve new problems
      – Data mining challenges
  • Timeline
      – 10/16: proposal
      – 11/27, 11/29, 12/4: project workshop/presentation
      – 12/16: project report/deliverables

SLIDE 38

Late Policy

  • Late assignments will be accepted within 3 days of the due date and penalized 10% per day
  • 2 late assignment allowances; each can be used to turn in a single late assignment within 3 days of the due date without penalty
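The arithmetic of the policy can be sketched as below. Two points are assumptions, since the slide does not specify them: the 10% per day is taken off the earned score, and work more than 3 days late scores zero. The function name `late_adjusted_score` is hypothetical.

```python
def late_adjusted_score(raw_score, days_late, use_allowance=False):
    """Apply the late policy to a raw score.

    Assumptions: the penalty is 10% of the earned score per late day,
    and submissions more than 3 days late receive zero.
    """
    if days_late <= 0:
        return float(raw_score)
    if days_late > 3:
        return 0.0
    if use_allowance:
        # A late allowance waives the penalty within the 3-day window.
        return float(raw_score)
    return raw_score * (100 - 10 * days_late) / 100

two_days_late = late_adjusted_score(90, 2)
with_allowance = late_adjusted_score(90, 2, use_allowance=True)
```

Under these assumptions, a 90 turned in two days late becomes 72 without an allowance and stays 90 with one.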

SLIDE 39

Learning Objectives (Non technical)

  • Read papers and write paper critiques
  • Present papers and lead discussions
  • Learn/practice the life cycle of a research project
      – literature review
      – problem formulation
      – project proposal writing
      – algorithm design
      – experimental studies
      – paper/project report writing

SLIDE 40

Grading

  • Assignments/presentations: 40%
  • Final project: 30%
  • Midterm: 30%
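As a worked example of the weighting, the sketch below combines three component scores into a course score. It assumes each component is already on a 0-100 scale; the function name `course_score` and the sample scores are hypothetical.

```python
# Weights in percent, taken from the grading breakdown.
WEIGHTS = {"assignments": 40, "project": 30, "midterm": 30}

def course_score(assignments, project, midterm):
    """Weighted average of component scores (each assumed to be 0-100)."""
    scores = {"assignments": assignments, "project": project, "midterm": midterm}
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100

example = course_score(assignments=90, project=80, midterm=70)
```

For the sample scores this works out to 0.4 × 90 + 0.3 × 80 + 0.3 × 70 = 81.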

SLIDE 41

Some expectations

  • Participate in class, think critically, ask questions
  • Read and write reviews critically
  • Start on assignments and projects early
  • Enjoy the class!

SLIDE 42

Today

  • Meeting everybody in class
  • Course topics
  • Course logistics
  • Next lecture: data preprocessing