Introduction Motivation: Business Intelligence Customer information - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Introduction

slide-2
SLIDE 2

Motivation: Business Intelligence

Jian Pei: CMPT 741/459 Data Mining -- Introduction 2

Customer information (customer-id, gender, age, home-address, occupation, income, family-size, …)

Product information (product-id, category, manufacturer, made-in, stock-price, …)

Sales information (customer-id, product-id, #units, unit-price, sales-representative, …)

Business queries:

slide-3
SLIDE 3

Techniques: Business Intelligence

  • Multidimensional data analysis
  • Online query answering
  • Interactive data exploration
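As a minimal sketch of multidimensional data analysis over the customer, product, and sales information from the motivation slide, the following rolls up revenue along any chosen combination of dimensions. The sample rows and the `revenue_by` helper are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Hypothetical sample data following the schema on the motivation slide.
customers = {
    1: {"gender": "F", "occupation": "engineer"},
    2: {"gender": "M", "occupation": "teacher"},
}
products = {
    10: {"category": "dairy"},
    11: {"category": "bakery"},
}
# Each sale: (customer-id, product-id, #units, unit-price)
sales = [
    (1, 10, 3, 2.0),
    (1, 11, 1, 3.0),
    (2, 10, 2, 2.0),
    (2, 11, 4, 3.0),
]

def revenue_by(dimensions):
    """Roll up total revenue along the given customer/product attributes."""
    totals = defaultdict(float)
    for customer_id, product_id, units, unit_price in sales:
        # Join each sale with the attributes of its customer and product.
        row = {**customers[customer_id], **products[product_id]}
        key = tuple(row[d] for d in dimensions)
        totals[key] += units * unit_price
    return dict(totals)

print(revenue_by(["category"]))            # one dimension
print(revenue_by(["gender", "category"]))  # drill down into two dimensions
print(revenue_by([]))                      # grand total (roll up everything)
```

Changing the list of dimensions corresponds to rolling up or drilling down interactively, which is the kind of exploration the bullets above refer to.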


slide-4
SLIDE 4

Motivation: Store Layout Design


http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg

slide-5
SLIDE 5

Techniques: Store Layout Design

  • Customer purchase patterns
  • Business strategies


slide-6
SLIDE 6

Motivation: Community Detection

http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-social-media-1-728.jpg?cb=1308736811

slide-7
SLIDE 7

Techniques: Community Detection

  • Similarity between objects
  • Partitioning objects into groups

– No guidance about what a group is
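One way to make "similarity" and "partitioning into groups" concrete, sketched under stated assumptions: objects are described by sets of interests, similarity is the Jaccard coefficient, and two objects land in the same group when a chain of sufficiently similar pairs connects them. This is only one of many possible choices, not necessarily the method used in the course; the data, the `threshold` value, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def group_by_similarity(items, threshold):
    """Partition item names into groups linked by pairwise similarity >= threshold."""
    names = list(items)
    parent = {name: name for name in names}  # union-find forest

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(items[a], items[b]) >= threshold:
                parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=sorted)

# Hypothetical users and their interests.
interests = {
    "ann": {"hockey", "skiing", "hiking"},
    "bob": {"hockey", "skiing"},
    "cat": {"opera", "ballet"},
    "dan": {"opera", "ballet", "jazz"},
    "eve": {"chess"},
}
print(group_by_similarity(interests, threshold=0.5))
```

No labels are supplied anywhere: the groups emerge from the similarity measure and the threshold alone, which is what "no guidance about what a group is" means in practice.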


slide-8
SLIDE 8

Motivation: Disease Prediction

Symptoms: overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat, …

What medical problems does this patient have?

slide-9
SLIDE 9

Techniques: Disease Prediction

  • Features
  • Model


slide-10
SLIDE 10

Motivation: Fraud Detection


http://i.imgur.com/ckkoAOp.gif

slide-11
SLIDE 11

Techniques: Fraud Detection

  • Features
  • Dissimilarity
  • Groups and noise
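A rough sketch of how features, dissimilarity, and the split into groups and noise could fit together: each transaction is a feature vector, dissimilarity is Euclidean distance, and a point with no neighbor within a given radius is flagged as noise. The features (amount, hour of day), the `radius` value, and the helper names are assumptions made for illustration; a real system would at least rescale the features first.

```python
import math

def nearest_neighbor_distance(point, others):
    """Distance from `point` to its closest neighbor among `others`."""
    return min(math.dist(point, other) for other in others)

def split_groups_and_noise(points, radius):
    """Separate points that have a neighbor within `radius` from isolated ones."""
    normal, noise = [], []
    for i, point in enumerate(points):
        others = points[:i] + points[i + 1:]
        if nearest_neighbor_distance(point, others) > radius:
            noise.append(point)   # far from everything else: a candidate fraud case
        else:
            normal.append(point)  # belongs to some dense group
    return normal, noise

# Hypothetical transactions as (amount in dollars, hour of day).
transactions = [(20.0, 12.0), (22.0, 13.0), (19.0, 12.0), (21.0, 14.0), (950.0, 3.0)]
normal, noise = split_groups_and_noise(transactions, radius=5.0)
print("normal:", normal)
print("noise:", noise)
```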


http://i.stack.imgur.com/tRDGU.png

slide-12
SLIDE 12

What Is Data Science About?

  • Data
  • Extraction of knowledge from data
  • Continuation of data mining and knowledge discovery from data (KDD)

slide-13
SLIDE 13

What Is Data?

  • Values of qualitative or quantitative variables belonging to a set of items
  • Represented in a structure, e.g., tabular, tree, or graph structure
  • Typically the results of measurements
  • As an abstract concept, can be viewed as the lowest level of abstraction from which information and then knowledge are derived

slide-14
SLIDE 14

What Is Information?

  • “Knowledge communicated or received concerning a particular fact or circumstance”
  • Conceptually, information is the message (utterance or expression) being conveyed
  • Cannot be predicted
  • Can resolve uncertainty

slide-15
SLIDE 15

What Is Knowledge?

  • Familiarity with someone or something, which can include facts, information, descriptions, or skills acquired through experience or education
  • Implicit knowledge: practical skill or expertise
  • Explicit knowledge: theoretical understanding of a subject

slide-16
SLIDE 16

Data Systems

  • A data system answers queries based on data acquired in the past
  • Base data – the rawest data not derived from anywhere else
  • Knowledge – information derived from the base data

slide-17
SLIDE 17

Dealing with Data – Querying

  • Given a set of student records about name, age, courses taken and grades
  • Simple queries
    – What is John Doe’s age?
  • Aggregate queries
    – What is the average GPA of all students at this school?
  • Queries can be arbitrarily complicated
    – Find the students X and Y whose grades are less than 3% apart in as many courses as possible
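The three kinds of queries can be sketched on a toy data set. The records below are hypothetical, grades are percentages rather than GPA points, and "less than 3% apart" is read as "fewer than 3 percentage points apart"; all of these are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical student records: name, age, and percentage grade per course.
students = [
    {"name": "John Doe", "age": 20,
     "grades": {"CMPT 120": 80, "CMPT 225": 71, "MATH 151": 90}},
    {"name": "Jane Roe", "age": 22,
     "grades": {"CMPT 120": 82, "CMPT 225": 70, "MATH 151": 65}},
    {"name": "Sam Poe", "age": 21,
     "grades": {"CMPT 120": 60, "CMPT 225": 72, "MATH 151": 85}},
]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Simple query: look up one attribute of one individual.
john_age = next(s["age"] for s in students if s["name"] == "John Doe")

# Aggregate query: average over all students of each student's mean grade.
average_grade = mean(mean(s["grades"].values()) for s in students)

# Complicated query: the pair of students whose grades are close
# (fewer than `tolerance` points apart) in as many shared courses as possible.
def close_courses(x, y, tolerance=3):
    shared = x["grades"].keys() & y["grades"].keys()
    return sum(1 for c in shared if abs(x["grades"][c] - y["grades"][c]) < tolerance)

best_pair = max(combinations(students, 2), key=lambda pair: close_courses(*pair))

print(john_age)
print(average_grade)
print([s["name"] for s in best_pair])
```

The last query compares every pair of students, so its cost grows quadratically with the number of records, whereas the first two need only a single pass.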


slide-18
SLIDE 18

Queries

  • A precise request for information
  • Subjects in databases and information retrieval
    – Databases: structured queries on structured (e.g., relational) data
    – Information retrieval: unstructured queries on unstructured (e.g., text, image) data
  • Important assumptions
    – Information needs
    – Query languages

slide-19
SLIDE 19

Data-driven Exploration

  • What should be the next strategy of a company?
    – A lot of data: sales, human resources, production, tax, service cost, …
  • The question cannot be translated into a precise request for information (i.e., a query)
  • Developing familiarity (knowledge) and actionable items (decisions) by interactively analyzing data

slide-20
SLIDE 20

Data-driven Thinking

  • Starting with some simple queries
  • New queries are raised by consuming the results of previous queries
  • No ultimate query in design!
    – But many queries can be answered using DB/IR techniques

slide-21
SLIDE 21

The Art of Data-driven Thinking

  • The way of generating queries remains an art!
    – Different people may derive different results using the same data
    – “If you torture the data long enough, it will confess” – Ronald H. Coase
  • More often than not, more data may be needed – datafication

slide-22
SLIDE 22

Queries for Data-driven Thinking

  • Probe queries – finding information about specific individuals
  • Aggregation – finding information about groups
  • Pattern finding – finding commonality in population
  • Association and correlation – finding connections among individuals and groups
  • Causality analysis – finding causes and consequences

slide-23
SLIDE 23

What Is Data Mining?

  • Broader sense: the art of data-driven thinking
  • Technical sense: the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [Fayyad, Piatetsky-Shapiro, Smyth, 96]
    – Methods and tools of answering various types of queries in the data mining process in the broader sense

slide-24
SLIDE 24

Machine Learning

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” – Tom M. Mitchell

  • Essentially, learn the distribution of data


slide-25
SLIDE 25

Data mining vs. Machine Learning

  • Machine learning focuses on prediction, based on known properties learned from the training data
  • Data mining focuses on the discovery of (previously) unknown properties in the data

slide-26
SLIDE 26

The KDD Process

Data → [Selection] → Target data → [Preprocessing] → Preprocessed data → [Transformation] → Transformed data → [Data mining] → Patterns → [Interpretation/evaluation] → Knowledge
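The stages of the KDD process (selection, preprocessing, transformation, data mining, interpretation/evaluation) can be sketched as a chain of small functions. Everything below is a toy under stated assumptions: the records, the age bands, the choice of simple co-occurrence counting as the "mining" step, and the `min_count` threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical raw data.
raw = [
    {"customer": 1, "age": 23, "category": "snacks", "store": "A"},
    {"customer": 2, "age": 27, "category": "snacks", "store": "A"},
    {"customer": 3, "age": None, "category": "dairy", "store": "A"},
    {"customer": 4, "age": 61, "category": "dairy", "store": "A"},
    {"customer": 5, "age": 66, "category": "dairy", "store": "A"},
    {"customer": 6, "age": 35, "category": "snacks", "store": "B"},
]

def select(data):
    """Selection: data -> target data (here, only store A)."""
    return [r for r in data if r["store"] == "A"]

def preprocess(target):
    """Preprocessing: drop records with missing values."""
    return [r for r in target if r["age"] is not None]

def transform(clean):
    """Transformation: discretize age into two bands."""
    return [("under 40" if r["age"] < 40 else "40 and over", r["category"])
            for r in clean]

def mine(transformed):
    """Data mining: count how often each (age band, category) pair occurs."""
    return Counter(transformed)

def interpret(patterns, min_count=2):
    """Interpretation/evaluation: keep only patterns seen at least min_count times."""
    return {pattern: count for pattern, count in patterns.items() if count >= min_count}

knowledge = interpret(mine(transform(preprocess(select(raw)))))
print(knowledge)
```

In practice the process is iterative: what is learned at the interpretation step often sends the analyst back to reselect or retransform the data.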

slide-27
SLIDE 27

Data Mining R&D

  • New problem identification
  • Data collection and transformation
  • Algorithm design and implementation
  • Evaluation
    – Effectiveness evaluation
    – Efficiency & scalability evaluation
  • Deployment and business solution

slide-28
SLIDE 28

Data Mining on Big Data

“Data is so widely available and so strategically important that the scarce thing is the knowledge to extract wisdom from it” – Hal Varian, Google’s Chief Economist


slide-29
SLIDE 29

What Is Big Data?

  • No quantitative definition!
  • “Big data is like teenage sex
    – everyone talks about it,
    – nobody really knows how to do it,
    – everyone thinks everyone else is doing it,
    – so everyone claims they are doing it...” – Dan Ariely

slide-30
SLIDE 30

Data Volume vs. Storage Cost

  • The unit cost of disk storage decreases dramatically

  Year   Unit cost
  1956   $10,000/MB
  1980   $193/MB
  1990   $9/MB
  2000   $6.9/GB
  2010   $0.08/GB
  2013   $0.06/GB

http://ns1758.ca/winch/winchest.html

slide-31
SLIDE 31

Big Data – Volume

“Data sets with sizes beyond the ability of commonly-used software tools to capture, curate, manage, and process the data within a tolerable elapsed time” — Wikipedia


slide-32
SLIDE 32

H1N1 Pandemic Crisis (2009)

  • A new flu virus combining elements of the viruses that cause bird flu and swine flu
  • The US Centers for Disease Control and Prevention (CDC) requested reports from doctors and tabulated the numbers once a week – a two-week lag in data collection
  • Google used user search keywords to predict the spread of winter flu
    – A supervised approach based on more than 3 billion search queries every day, examining 450 million different models, using 2007-2008 data from CDC
  • Some things can be done with large-scale data but cannot be done with smaller-scale data

slide-33
SLIDE 33

Detecting Hurricane – Unsupervised

  • D. Kang, D. Jiang, J. Pei, Z. Liao, X. Sun, and H-J. Choi. "Multidimensional Mining of Large-Scale Search Logs: A Topic-Concept Cube Approach". In Proceedings of the 4th ACM International Conference on Web Search and Data Mining (WSDM'11), Hong Kong, China, February 9-12, 2011.

slide-34
SLIDE 34

Big Data: Volume

  • Every day, about 7 billion shares change hands on US equity markets
    – About 2/3 is traded by computer algorithms based on huge amounts of data to predict gains and risk
  • In Q2 2015
    – Facebook has 1.49 billion active users
    – WeChat has 600 million active users, 100 million outside China
    – LinkedIn has 380 million active users
    – Twitter has 304 million active users

slide-35
SLIDE 35

Velocity

  • Google processes 24+ petabytes of data per day
  • Facebook gets 10+ million new photos uploaded every hour
  • Facebook members like or leave a comment 3+ billion times per day
  • YouTube users upload 1+ hour of video every second
  • 400+ million tweets per day

slide-36
SLIDE 36

What Has Been Changed?

  • The 1880 census in the US took 8 years to complete
    – The 1890 census would need 13 years – using punch cards, it was reduced to less than 1 year
  • It is essential to get not only accurate but also timely data
    – Statisticians use sampling to estimate
  • Recently, with new technologies, the ways of data collection and transmission have fundamentally changed

slide-37
SLIDE 37

Sampling for Volume/Velocity?

  • Sampling idea: the marginal new information brought by a larger amount of data shrinks quickly
    – The sample should be truly random
  • On a data set of hundreds or thousands of attributes, can sampling help in
    – Finding subcategories of attribute combinations?
    – Finding outliers and exceptions?
  • Big data contains signals of different strengths
    – No noise; instead, weaker and weaker signals that may still be interesting and important
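Both sides of the sampling question can be illustrated with a little arithmetic. The sketch below assumes independent draws for the standard error of a sample mean, and simple random sampling without replacement for the chance of missing rare items; the population size, the number of rare items, and the function names are hypothetical.

```python
import math

def standard_error(sigma, n):
    """Standard error of a sample mean over n independent draws with std. dev. sigma."""
    return sigma / math.sqrt(n)

def prob_sample_misses_all(population, rare, sample):
    """Probability that a simple random sample without replacement
    contains none of the `rare` special items in the population."""
    p = 1.0
    for i in range(rare):
        # Chance that the i-th rare item falls outside the sample,
        # given that the previous rare items did too.
        p *= (population - sample - i) / (population - i)
    return p

# Diminishing returns: 100x more data only buys a 10x smaller error.
for n in (100, 10_000, 1_000_000):
    print(n, standard_error(10.0, n))

# A 1% sample of 100,000 records that hide 5 outliers misses all of them
# with probability of about 0.95.
print(prob_sample_misses_all(100_000, 5, 1_000))
```

So a small random sample estimates averages well, but it is a poor tool for finding outliers, exceptions, and other weak signals.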


slide-38
SLIDE 38

Big Data – Lytro Pictures

  • Lytro pictures record the whole light field
    – Photographers can decide later which parts to focus on
  • Big data tries to record as much information as possible
    – Analysts can decide later what to extract from big data
    – Both advantages and challenges

slide-39
SLIDE 39

Veracity

  • “1 in 3 business leaders don't trust the information they use to make decisions”
  • Assuming a slowly growing total cost budget, tradeoff between data volume and data quality
  • Loss of veracity in combining different types of information from different sources
  • Loss of veracity in data extraction, transformation, and processing

slide-40
SLIDE 40

Variety

  • Integrating data capturing different aspects of a data object
    – Vancouver Canucks: game video, technical statistics, social media, …
    – Different pieces are in different formats
  • Different views of the same data object from different sources
    – Did the soccer ball pass the goal line?
    – The views may not be consistent

slide-41
SLIDE 41

Four V-challenges

  • Volume: massive scale and growth, 40% per year in global data generated
  • Velocity: real-time data generation and consumption
  • Variety: heterogeneous data, mainly unstructured or semi-structured, from many sources
  • Veracity
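To get a feel for the 40% per year figure quoted for volume, the short calculation below converts it into a doubling time and a ten-year growth factor. It assumes the rate stays constant and compounds annually, which is a simplification.

```python
import math

annual_growth = 0.40  # the 40% per year figure quoted for Volume

# Years needed for the yearly data volume to double at a constant compound rate.
doubling_time_years = math.log(2) / math.log(1 + annual_growth)

# Growth factor after ten years of compounding.
growth_over_decade = (1 + annual_growth) ** 10

print(f"doubling time: {doubling_time_years:.2f} years")
print(f"growth over 10 years: {growth_over_decade:.1f}x")
```

Under these assumptions the volume doubles roughly every two years and grows nearly thirtyfold in a decade.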


slide-42
SLIDE 42

Is Big Data Really New?

  • People were aware of the existence of big data a long time ago, but no one could access it until very recently
    – (Genesis 28:15) “I am with you and will watch over you wherever you go”
    – “密室私语,天闻如雷;暗室欺心,神目如电;善恶之报,如影随行” (“Whispers in a secret room are heard in heaven like thunder; deceit in a dark room is seen by divine eyes like lightning; the rewards of good and evil follow like a shadow”)
    – Similar statements in Quran and Sutra
  • What has been changed?
    – How data is connected with people

slide-43
SLIDE 43

Diversity in Data Usage

  • In the past, only very few projects could afford to be data-intensive
  • Nowadays, a great many applications are (naturally) data-intensive

slide-44
SLIDE 44


slide-45
SLIDE 45

Datafication

  • Extract data about an object or event in a quantified way so that it can be analyzed
    – Different from digitalization
  • An important feature of big data
  • Key: new data, new applications, new opportunities

slide-46
SLIDE 46

New Values of Datafication

  • Example: Captcha and ReCaptcha (Luis von Ahn)
  • How to create new values of data and datafication?
    – Connecting data with new users
    – Connecting different pieces of data to present a bigger picture
  • Important techniques
    – Data aggregation
    – Extended datafication

slide-47
SLIDE 47

Big Data Players

  • Data holders
  • Data specialists
  • Big-data mindset leaders
  • A capable company may play 2 or 3 roles at the same time
  • What is most important: big-data mindset, skills, or data itself?

slide-48
SLIDE 48

Privacy

  • “… big data analytics have the potential to eclipse longstanding civil rights protections in how personal information is used in housing, credit, employment, health, education, and the marketplace” – Executive Office of the (US) President

slide-49
SLIDE 49

A Beautiful Story about Big Data

Source: http://abcnews.go.com/blogs/lifestyle/2014/01/the-genius-okcupid-hack-that-led-to-true-love/

slide-50
SLIDE 50

Romantics in the Big Data Age

  • Datafication and feature selection
  • Using data about many people (e.g., 20,000 women in McKinlay’s story)
  • Ranking and drilling down into groups
  • Connecting data analytics with practice (Chris McKinlay dated 88 until he met Christine Wang)

slide-51
SLIDE 51

Keep in Mind

“Our industry does not respect tradition – it only respects innovation.”

– Satya Nadella


slide-52
SLIDE 52


Goals of This Course

  • Data-driven thinking – towards being a (big) data scientist
  • Principles and hands-on skills of data mining, particularly in the context of big data
    – Identifying new data mining problems
    – Data mining algorithm design
    – Data mining applications
  • Novel problems for upcoming research
slide-53
SLIDE 53

Format

  • Due to the fast progress in data mining, we will go beyond the textbook substantially
  • Active classroom discussion
  • Open questions and brainstorming
  • Textbook: Data Mining – Concepts and Techniques (3rd ed)

slide-54
SLIDE 54

Read – Try – Think

  • Reading
    – (required) Textbook and a small number of research papers
    – You have to have the 3rd ed of the textbook!
    – (open-ended, not covered by the exam) Technical and non-technical materials
  • Trying
    – Assignments and a project
  • Thinking
    – Examine everything from a data scientist's angle, starting today

slide-55
SLIDE 55

Data Mining: History

  • 1989 IJCAI Workshop on Knowledge Discovery in Databases
    – Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
  • 1991-1994 Workshops on Knowledge Discovery in Databases
    – Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)

slide-56
SLIDE 56

Data Mining: History (cont’d)

  • 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
    – Journal of Data Mining and Knowledge Discovery (1997)
  • ACM SIGKDD conferences since 1998 and SIGKDD Explorations
  • More conferences on data mining
    – PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
  • ACM Transactions on KDD starting in 2007
slide-57
SLIDE 57

KDD Conferences

  • ACM SIGKDD International Conference on Knowledge Discovery in Databases and Data Mining (KDD) (best)
  • IEEE International Conference on Data Mining (ICDM)
  • SIAM Data Mining Conference (SDM)
slide-58
SLIDE 58

Regional Conferences

  • Conference on Principles and Practices of Knowledge Discovery and Data Mining (PKDD) – European KDD
    – Co-organized with ECML (European Conference on Machine Learning)
  • Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) – Asian KDD

slide-59
SLIDE 59

Journals

  • ACM Transactions on KDD
  • IEEE Transactions on Knowledge and Data Engineering (TKDE)
  • Data Mining and Knowledge Discovery (DAMI or DMKD)
  • Knowledge and Information Systems
  • KDD Explorations
slide-60
SLIDE 60

Differences between 459 and 741

  • CMPT 459 – undergraduate version
    – Basic concepts and methods
    – What, why, how
    – Focus: essential data mining methods and variations
  • CMPT 741 – graduate version
    – Focus: how to use the principles and ideas to solve new problems – new methods may be needed!
    – For course-based/Big Data Professional program students, something in between

slide-61
SLIDE 61

Student Groups

  • 459 students
  • 741 course-based/Big Data Professional program students
  • 741 thesis-based students
  • Different groups will be trained differently to meet their objectives

slide-62
SLIDE 62

Knowing Your Peers


slide-63
SLIDE 63

Preparation


slide-64
SLIDE 64

Workload


slide-65
SLIDE 65

Evaluation

  • 5 regular assignments
    – Exam questions will be similar to those in regular assignments
  • 5 mini assignments
    – Team work (2 students at a time)
    – One has to team up with different students in different mini assignments
  • Project
    – Mining a real data set
  • Exam
    – Solving questions using the materials covered in the class or their simple combinations

slide-66
SLIDE 66

Lectures

  • Cover major ideas and critical details
  • To-do-list specifies the materials one should understand
  • Assignments are the hints for the final exam
  • Extended materials are only for students who want to learn more, and are not required in the exam

slide-67
SLIDE 67

To-Do-List

  • Read Chapter 1 in the textbook
  • Understand the concepts mentioned in Section 1.8 (some of them are omitted in the lecture notes)