Introduction
Motivation: Business Intelligence
Jian Pei: CMPT 741/459 Data Mining -- Introduction 2
- Customer information (customer-id, gender, age, home-address, occupation, income, family-size, …)
- Product information (product-id, category, manufacturer, made-in, stock-price, …)
- Sales information (customer-id, product-id, #units, unit-price, sales-representative, …)
- Business queries:
Techniques: Business Intelligence
- Multidimensional data analysis
- Online query answering
- Interactive data exploration
Motivation: Store Layout Design
http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg
Techniques: Store Layout Design
- Customer purchase patterns
- Business strategies
Motivation: Community Detection
http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-social-media-1-728.jpg?cb=1308736811
Techniques: Community Detection
- Similarity between objects
- Partitioning objects into groups
– No guidance about what a group is
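A minimal sketch of the two ingredients above, assuming 2-D points, Euclidean distance as the (dis)similarity measure, and a hypothetical threshold `max_distance`; none of these choices come from the slides. Points end up in the same group whenever a chain of close pairs connects them, so no group labels are supplied in advance.

```python
from itertools import combinations
from math import dist


def group_by_similarity(points, max_distance):
    """Partition points into groups: two points share a group when a chain of
    pairwise distances <= max_distance connects them (single-link grouping)."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative of i's group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps chains short
            i = parent[i]
        return i

    # Merge the groups of every pair of points that are close enough.
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= max_distance:
            parent[find(i)] = find(j)

    groups = {}
    for i, point in enumerate(points):
        groups.setdefault(find(i), []).append(point)
    return sorted(groups.values())


# Hypothetical objects described by two numeric features each.
points = [(0.0, 0.0), (0.5, 0.2), (0.3, 0.8), (5.0, 5.0), (5.4, 4.9), (9.0, 0.5)]
groups = group_by_similarity(points, max_distance=1.0)
for members in groups:
    print(members)
```

Changing `max_distance` changes the groups that are found, which illustrates why the notion of a group has to be chosen by the analyst rather than read off the data.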
Motivation: Disease Prediction
Symptoms: overweight, high blood pressure, back pain, short of breath, chest pain, cold sweat, …
What medical problems does this patient have?
Techniques: Disease Prediction
- Features
- Model
Motivation: Fraud Detection
http://i.imgur.com/ckkoAOp.gif
Techniques: Fraud Detection
- Features
- Dissimilarity
- Groups and noise
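One possible reading of the three items above, sketched under assumptions: each transaction is a hypothetical pair of numeric features, dissimilarity is Euclidean distance, and a record is treated as noise when even its k-th nearest neighbour is far away. The feature scaling, `k`, and `threshold` are illustrative values, not taken from the slides.

```python
from math import dist


def knn_distance_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbour.
    Points inside a dense group get small scores; isolated points get large ones."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores


# Hypothetical features per transaction: (amount in $100s, hour of day / 10).
transactions = [(0.2, 1.0), (0.3, 1.1), (0.25, 0.9), (0.4, 1.2), (9.5, 0.3)]

scores = knn_distance_scores(transactions, k=2)
threshold = 1.0  # assumed cut-off separating group members from noise
suspicious = [t for t, s in zip(transactions, scores) if s > threshold]
print(suspicious)
```

A flagged record is only a candidate for review; whether it is fraud is a separate question that the dissimilarity score alone cannot answer.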
http://i.stack.imgur.com/tRDGU.png
What Is Data Science About?
- Data
- Extraction of knowledge from data
- Continuation of data mining and knowledge
discovery from data (KDD)
What Is Data?
- Values of qualitative or quantitative variables
belonging to a set of items
- Represented in a structure, e.g., a tabular, tree, or graph structure
- Typically the results of measurements
- As an abstract concept can be viewed as the
lowest level of abstraction from which information and then knowledge are derived
What Is Information?
- “Knowledge communicated or received
concerning a particular fact or circumstance”
- Conceptually, information is the message
(utterance or expression) being conveyed
- Cannot be predicted
- Can resolve uncertainty
What Is Knowledge?
- Familiarity with someone or something,
which can include facts, information, descriptions, or skills acquired through experience or education
- Implicit knowledge: practical skill or expertise
- Explicit knowledge: theoretical
understanding of a subject
Data Systems
- A data system answers queries based on
data acquired in the past
- Base data – the rawest data not derived
from anywhere else
- Knowledge – information derived from the
base data
Dealing with Data – Querying
- Given a set of student records about name,
age, courses taken and grades
- Simple queries
– What is John Doe’s age?
- Aggregate queries
– What is the average GPA of all students at this school?
- Queries can be arbitrarily complicated
– Find the students X and Y whose grades are less than 3% apart in as many courses as possible
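The three kinds of queries above can be sketched on a toy data set. The records, field names, and the reading of "less than 3% apart" as "fewer than 3 percentage points apart on a 0-100 scale" are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical student records: name, age, GPA, and course grades (0-100).
students = [
    {"name": "John Doe", "age": 21, "gpa": 3.2,
     "grades": {"CMPT 101": 80, "MATH 150": 72, "STAT 270": 90}},
    {"name": "Jane Roe", "age": 22, "gpa": 3.6,
     "grades": {"CMPT 101": 82, "MATH 150": 73, "STAT 270": 85}},
    {"name": "Alex Poe", "age": 20, "gpa": 2.8,
     "grades": {"CMPT 101": 60, "MATH 150": 78, "STAT 270": 91}},
]

# Simple query: look up one attribute of one record.
john_age = next(s["age"] for s in students if s["name"] == "John Doe")

# Aggregate query: summarize one attribute over all records.
average_gpa = sum(s["gpa"] for s in students) / len(students)


def close_courses(a, b, tolerance=3):
    """Count shared courses in which the two grades differ by less than tolerance."""
    shared = a["grades"].keys() & b["grades"].keys()
    return sum(1 for c in shared if abs(a["grades"][c] - b["grades"][c]) < tolerance)


# Complex query: the pair of students with the most "close" courses.
best_pair = max(combinations(students, 2), key=lambda pair: close_courses(*pair))
best_names = tuple(s["name"] for s in best_pair)

print(john_age, round(average_gpa, 2), best_names)
```

The complex query compares every pair of students, so its cost grows quadratically with the number of records, unlike the first two queries, which need a single pass.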
Queries
- A precise request for information
- Subjects in databases and information retrieval
– Databases: structured queries on structured (e.g., relational) data
– Information retrieval: unstructured queries on unstructured (e.g., text, image) data
- Important assumptions
– Information needs
– Query languages
Data-driven Exploration
- What should be the next strategy of a
company?
– A lot of data: sales, human resources, production, tax, service cost, …
- The question cannot be translated into a
precise request for information (i.e., a query)
- Developing familiarity (knowledge) and
actionable items (decisions) by interactively analyzing data
Data-driven Thinking
- Starting with some simple queries
- New queries are raised by consuming the
results of previous queries
- No ultimate query in design!
– But many queries can be answered using DB/IR techniques
The Art of Data-driven Thinking
- The way of generating queries remains an art!
– Different people may derive different results using the same data
– “If you torture the data long enough, it will confess” – Ronald H. Coase
- More often than not, more data may be needed – datafication
Queries for Data-driven Thinking
- Probe queries – finding information about
specific individuals
- Aggregation – finding information about groups
- Pattern finding – finding commonality in
population
- Association and correlation – finding
connections among individuals and groups
- Causality analysis – finding causes and
consequences
What Is Data Mining?
- Broader sense: the art of data-driven
thinking
- Technical sense: the non-trivial process of
identifying valid, novel, potentially useful, and ultimately understandable patterns in data [Fayyad, Piatetsky-Shapiro, Smyth, 96]
– Methods and tools of answering various types of queries in the data mining process in the broader sense
Machine Learning
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” – Tom M. Mitchell
- Essentially, learn the distribution of data
Data mining vs. Machine Learning
- Machine learning focuses on prediction,
based on known properties learned from the training data
- Data mining focuses on the discovery of (previously) unknown properties in the data
The KDD Process
Data → (Selection) → Target data → (Preprocessing) → Preprocessed data → (Transformation) → Transformed data → (Data mining) → Patterns → (Interpretation/evaluation) → Knowledge
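The stages of the KDD process can be sketched as a chain of small functions. This is only an illustration under assumptions: the sales records, the choice of frequent item pairs as the "patterns", and every function name are hypothetical rather than part of the process definition.

```python
from collections import Counter
from itertools import combinations

# Data: hypothetical raw sales records with inconsistent item spellings.
raw_data = [
    {"store": "A", "items": ["Milk", "bread "]},
    {"store": "A", "items": ["milk", "Bread", "eggs"]},
    {"store": "B", "items": ["milk", "bread"]},
    {"store": "A", "items": []},
    {"store": "A", "items": ["eggs", "milk"]},
]


def select(records, store):
    """Selection: keep only the records relevant to the question."""
    return [r for r in records if r["store"] == store]


def preprocess(records):
    """Preprocessing: normalize item names and drop empty baskets."""
    cleaned = []
    for r in records:
        items = [item.strip().lower() for item in r["items"] if item.strip()]
        if items:
            cleaned.append({**r, "items": items})
    return cleaned


def transform(records):
    """Transformation: represent each record as a set of items."""
    return [frozenset(r["items"]) for r in records]


def mine(baskets, min_count):
    """Data mining: find item pairs bought together at least min_count times."""
    counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    return {pair: n for pair, n in counts.items() if n >= min_count}


def evaluate(patterns, n_baskets):
    """Interpretation/evaluation: rank patterns by the share of baskets they cover."""
    ranked = sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(pair, count / n_baskets) for pair, count in ranked]


target = select(raw_data, store="A")
baskets = transform(preprocess(target))
knowledge = evaluate(mine(baskets, min_count=2), len(baskets))
for pair, support in knowledge:
    print(pair, round(support, 2))
```

In practice the process is iterative: the evaluation step often sends the analyst back to selection or preprocessing with a revised question.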
Data Mining R&D
- New problem identification
- Data collection and transformation
- Algorithm design and implementation
- Evaluation
– Effectiveness evaluation
– Efficiency & scalability evaluation
- Deployment and business solution
Data Mining on Big Data
“Data is so widely available and so strategically important that the scarce thing is the knowledge to extract wisdom from it” – Hal Varian, Google’s Chief Economist
What Is Big Data?
- No quantitative definition!
- “Big data is like teenage sex:
– everyone talks about it,
– nobody really knows how to do it,
– everyone thinks everyone else is doing it,
– so everyone claims they are doing it...” – Dan Ariely
Data Volume vs. Storage Cost
- The unit cost of disk storage decreases
dramatically
Year    Unit cost
1956    $10,000/MB
1980    $193/MB
1990    $9/MB
2000    $6.9/GB
2010    $0.08/GB
2013    $0.06/GB
http://ns1758.ca/winch/winchest.html
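A rough worked calculation on the unit costs listed above, assuming all prices are in US dollars (the 2013 entry is read as $0.06/GB) and 1 GB = 1000 MB. The helper names are illustrative only.

```python
MB_PER_GB = 1000  # assumption: decimal units

# Unit cost converted to dollars per GB so that all years are comparable.
cost_per_gb = {
    1956: 10_000 * MB_PER_GB,
    1980: 193 * MB_PER_GB,
    1990: 9 * MB_PER_GB,
    2000: 6.9,
    2010: 0.08,
    2013: 0.06,
}


def reduction_factor(costs, start, end):
    """How many times cheaper one GB became between two years."""
    return costs[start] / costs[end]


def annual_decline(costs, start, end):
    """Average yearly fractional price drop, assuming a constant rate of decline."""
    ratio = costs[end] / costs[start]
    return 1 - ratio ** (1 / (end - start))


factor = reduction_factor(cost_per_gb, 1956, 2013)
decline = annual_decline(cost_per_gb, 1956, 2013)
print(f"1956 to 2013: {factor:.3g} times cheaper, about {decline:.0%} per year")
```

Under these assumptions the unit cost fell by a factor on the order of 10^8 over the period, which is the sense in which the decrease is "dramatic".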
Big Data – Volume
“Data sets with sizes beyond the ability of commonly-used software tools to capture, curate, manage, and process the data within a tolerable elapsed time” — Wikipedia
H1N1 Pandemic Crisis (2009)
- A new flu virus combining elements of the viruses
that cause bird flu and swine flu
- The US Centers for Disease Control and Prevention
(CDC) requested reports from doctors, tabulated the numbers once a week – a two-week lag in data collection
- Google used user search keywords to predict the
spread of winter flu
– A supervised approach based on more than 3 billion search queries every day, examining 450 million different models, using 2007-2008 data from CDC
- Some things can be done with large-scale data but not with smaller-scale data
Detecting Hurricane – Unsupervised
- D. Kang, D. Jiang, J. Pei, Z. Liao, X. Sun, and H-J. Choi. "Multidimensional Mining of Large-Scale Search Logs: A Topic-Concept Cube Approach". In Proceedings of the 4th ACM International Conference on Web Search and Data Mining (WSDM'11), Hong Kong, China, February 9-12, 2011.
Big Data: Volume
- Every day, about 7 billion shares change hands in US equity markets
– About 2/3 is traded by computer algorithms based on huge amounts of data to predict gains and risk
- In Q2 2015
– Facebook has 1.49 billion active users
– WeChat has 600 million active users, 100 million outside China
– LinkedIn has 380 million active users
– Twitter has 304 million active users
Velocity
- Google processes 24+ petabytes of data per
day
- Facebook gets 10+ million new photos
uploaded every hour
- Facebook members like or leave a comment
3+ billion times per day
- YouTube users upload 1+ hour of video
every second
- 400+ million tweets per day
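To make the rates above easier to compare, the sketch below converts them to a common per-second scale. It assumes the lower bounds quoted above (the "+" is ignored) and decimal units (1 PB = 10^6 GB); the dictionary keys are made up for this example.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def per_second(count, period_seconds):
    """Average rate per second for a count observed over a period."""
    return count / period_seconds


rates = {
    "google_gb": per_second(24 * 1e6, SECONDS_PER_DAY),       # 24 PB/day in GB/s
    "facebook_photos": per_second(10e6, SECONDS_PER_HOUR),    # photos/s
    "facebook_reactions": per_second(3e9, SECONDS_PER_DAY),   # likes or comments/s
    "youtube_video_seconds": per_second(3600, 1),             # seconds of video/s
    "tweets": per_second(400e6, SECONDS_PER_DAY),             # tweets/s
}

for name, value in rates.items():
    print(f"{name}: {value:,.0f} per second")
```

These are averages; real traffic is bursty, so peak rates can be well above the values computed here.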
What Has Changed?
- The 1880 census in the US took 8 years to complete
– The 1890 census would have needed 13 years
– Using punch cards, it was reduced to less than 1 year
- It is essential to get data that is not only accurate but also timely
– Statisticians use sampling to estimate
- Recently, new technologies have fundamentally changed the ways data is collected and transmitted
Sampling for Volume/Velocity?
- Sampling idea: the marginal new information brought by a larger amount of data shrinks quickly
– The sample should be truly random
- On a data set of hundreds or thousands of attributes, can sampling help in
– Finding subcategories of attribute combinations?
– Finding outliers and exceptions?
- Big data contains signals of different strengths
– No noise; instead, weaker and weaker signals that may still be interesting and important
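Both sides of the sampling argument can be illustrated with a small experiment. The sketch below assumes synthetic, normally distributed values and a hypothetical set of 5 rare records among 100,000; it shows that a random sample estimates an average well, while the chance of the sample missing every rare record stays high.

```python
import random


def miss_probability(n_total, n_rare, sample_size):
    """Probability that a simple random sample (without replacement)
    contains none of the n_rare special records."""
    p = 1.0
    for i in range(sample_size):
        p *= (n_total - n_rare - i) / (n_total - i)
    return p


rng = random.Random(42)  # fixed seed so the sketch is repeatable
values = [rng.gauss(100, 15) for _ in range(100_000)]
true_mean = sum(values) / len(values)

# Strong signal: the mean is estimated well even from small samples.
errors = {}
for size in (100, 1_000, 10_000):
    sample = rng.sample(values, size)
    errors[size] = abs(sum(sample) / size - true_mean)

# Weak signal: a 1% sample very likely misses all 5 rare records.
p_miss = miss_probability(n_total=100_000, n_rare=5, sample_size=1_000)

print(errors)
print(f"chance that a 1% sample contains no rare record: {p_miss:.2f}")
```

The estimate of the mean barely benefits from more data, whereas finding the rare records essentially requires looking at all of it.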
Big Data – Lytro Pictures
- Lytro pictures record the whole light field
– Photographers can decide later which parts to focus on
- Big data tries to record as much information as possible
– Analysts can decide later what to extract from big data
– Both advantages and challenges
Veracity
- “1 in 3 business leaders don't trust the
information they use to make decisions”
- Assuming a slowly growing total cost budget,
tradeoff between data volume and data quality
- Loss of veracity in combining different types of information from different sources
- Loss of veracity in data extraction, transformation, and processing
Variety
- Integrating data capturing different aspects of a data object
– Vancouver Canucks: game video, technical statistics, social media, …
– Different pieces are in different formats
- Different views of the same data object from different sources
– Did the soccer ball pass the goal line?
– The views may not be consistent
Four V-challenges
- Volume: massive scale and growth, 40% per
year in global data generated
- Velocity: real time data generation and
consumption
- Variety: heterogeneous data, mainly
unstructured or semi-structured, from many sources
- Veracity
Is Big Data Really New?
- People were aware of the existence of big data a long time ago, but no one could access it until very recently
– (Genesis 28:15) “I am with you and will watch over you wherever you go”
– Chinese proverb: “Whispers in a secret room are heard in heaven like thunder; deceit in a dark room is seen by divine eyes like lightning; the consequences of good and evil follow like a shadow”
– Similar statements in Quran and Sutra
- What has changed?
– How data is connected with people
Diversity in Data Usage
- In the past, only very few projects could afford to be data-intensive
- Nowadays, a great many applications are (naturally) data-intensive
Datafication
- Extract data about an object or event in a
quantified way so that it can be analyzed
– Different from digitalization
- An important feature of big data
- Key: new data, new applications, new opportunities
New Values of Datafication
- Example: Captcha and ReCaptcha (Luis von Ahn)
- How to create new values of data and datafication?
– Connecting data with new users
– Connecting different pieces of data to present a bigger picture
- Important techniques
– Data aggregation
– Extended datafication
Big Data Players
- Data holders
- Data specialists
- Big-data mindset leaders
- A capable company may play 2 or 3 roles at
the same time
- What is most important, big-data mindset,
skills, or data itself?
Privacy
- “… big data analytics have the potential to
eclipse longstanding civil rights protections in how personal information is used in housing, credit, employment, health, education, and the marketplace” — Executive Office of the (US) President
A Beautiful Story about Big Data
Source: http://abcnews.go.com/blogs/lifestyle/2014/01/the-genius-okcupid-hack-that-led-to-true-love/
Romantics in the Big Data Age
- Datafication and feature selection
- Using data about many people (e.g., 20,000
women in McKinlay’s story)
- Ranking and drilling down into groups
- Connecting data analytics with practice
(Chris McKinlay dated 88 until he met Christine Wang)
Keep in Mind
“Our industry does not respect tradition – it only respects innovation.”
– Satya Nadella
Goals of This Course
- Data-driven thinking – towards being a (big) data scientist
- Principles and hands-on skills of data mining, particularly in the context of big data
– Identifying new data mining problems
– Data mining algorithm design
– Data mining applications
- Novel problems for upcoming research
Format
- Due to the fast progress in data mining, we
will go beyond the textbook substantially
- Active classroom discussion
- Open questions and brainstorming
- Textbook: Data Mining – Concepts and
Techniques (3rd ed)
Read – Try – Think
- Reading
– (required) Textbook and a small number of research papers
– You have to have the 3rd ed of the textbook!
– (open-ended, not covered by the exam) Technical and non-technical materials
- Trying
– Assignments and a project
- Thinking
– Starting today, examine everything from a data scientist's angle
Data Mining: History
- 1989 IJCAI Workshop on Knowledge
Discovery in Databases
– Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
- 91-94 Workshops on Knowledge
Discovery in Databases
– Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
Data Mining: History (cont’d)
- 95-98 International Conferences on Knowledge
Discovery in Databases and Data Mining (KDD’95-98)
– Journal of Data Mining and Knowledge Discovery (1997)
- ACM SIGKDD conferences since 1998 and
SIGKDD Explorations
- More conferences on data mining
– PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
- ACM Transactions on KDD starting in 2007
KDD Conferences
- ACM SIGKDD International Conference on
Knowledge Discovery in Databases and Data Mining (KDD) (best)
- IEEE International Conference on Data
Mining (ICDM)
- SIAM Data Mining Conference (SDM)
Regional Conferences
- Conference on Principles and Practices of Knowledge Discovery and Data Mining (PKDD) – European KDD
– Co-organized with ECML (European Conference on Machine Learning)
- Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) – Asian KDD
Journals
- ACM Transactions on KDD
- IEEE Transactions on Knowledge and Data
Engineering (TKDE)
- Data Mining and Knowledge Discovery
(DAMI or DMKD)
- Knowledge and Information Systems
- KDD Explorations
Differences between 459 and 741
- CMPT 459 – undergraduate version
– Basic concepts and methods
– What, why, how
– Focus: essential data mining methods and variations
- CMPT 741 – graduate version
– Focus: how to use the principles and ideas to solve new problems – new methods may be needed!
– For course-based/Big Data Professional program students, something in between
Student Groups
- 459 students
- 741 course-based/Big Data Professional
program students
- 741 thesis-based students
- Different groups will be trained differently to
meet their objectives
Knowing Your Peers
Preparation
Workload
Evaluation
- 5 regular assignments
– Exam questions will be similar to those in regular assignments
- 5 mini assignments
– Team work (2 students at a time)
– One has to team up with different students in different mini assignments
- Project
– Mining a real data set
- Exam
– Solving questions using the materials covered in the class or their simple combinations
Lectures
- Cover major ideas and critical details
- To-do-list specifies the materials one should
understand
- Assignments are the hints for the final exam
- Extended materials are only for students
who want to learn more, and are not required in the exam
To-Do-List
- Read Chapter 1 in the textbook
- Understand the concepts mentioned in
Section 1.8 (some of them are omitted in the lecture notes)