Introduction Motivation: Business Intelligence Customer information - PowerPoint PPT Presentation


  1. Introduction

  2. Motivation: Business Intelligence
     Customer information (customer-id, gender, age, home-address, occupation, income, family-size, …)
     Product information (product-id, category, manufacturer, made-in, stock-price, …)
     Sales information (customer-id, product-id, #units, unit-price, sales-representative, …)
     Business queries:
     Jian Pei: CMPT 741/459 Data Mining -- Introduction

  3. Techniques: Business Intelligence
     • Multidimensional data analysis
     • Online query answering
     • Interactive data exploration
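
A minimal sketch of what multidimensional data analysis could look like on the customer, product and sales tables from the Motivation slide. The attribute names follow those schemas; the records, the `roll_up` helper and the revenue measure (units times unit price) are assumptions made up for illustration, not part of the course material.

```python
from collections import defaultdict

# Hypothetical toy data: attribute names follow the slide schemas,
# the values are invented for illustration.
customers = {1: {"gender": "F", "age": 34}, 2: {"gender": "M", "age": 51}}
products = {10: {"category": "dairy"}, 11: {"category": "bakery"}}
sales = [
    {"customer_id": 1, "product_id": 10, "units": 2, "unit_price": 3.0},
    {"customer_id": 1, "product_id": 11, "units": 1, "unit_price": 4.0},
    {"customer_id": 2, "product_id": 10, "units": 5, "unit_price": 3.0},
]

def roll_up(dimensions):
    """Total revenue grouped by the given dimensions (a simple roll-up)."""
    totals = defaultdict(float)
    for s in sales:
        # Join each sale with its customer and product attributes.
        row = {**customers[s["customer_id"]], **products[s["product_id"]]}
        key = tuple(row[d] for d in dimensions)
        totals[key] += s["units"] * s["unit_price"]
    return dict(totals)

by_category = roll_up(["category"])
by_gender_category = roll_up(["gender", "category"])
```

Changing the list of dimensions gives a coarser or finer view of the same sales, which is the kind of interactive exploration the slide refers to.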

  4. Motivation: Store Layout Design
     http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg

  5. Techniques: Store Layout Design
     • Customer purchase patterns
     • Business strategies
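
One simple way to look for customer purchase patterns is to count which items are bought together. The sketch below assumes each purchase is a set of item names; the baskets, the `frequent_pairs` helper and the `min_support` threshold are hypothetical and only illustrate the idea.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter"},
]

def frequent_pairs(baskets, min_support):
    """Return item pairs bought together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

pairs = frequent_pairs(baskets, min_support=2)
```

Pairs that co-occur often are candidates for being shelved near each other, or deliberately far apart, depending on the business strategy.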

  6. Motivation: Community Detection
     http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-social-media-1-728.jpg?cb=1308736811

  7. Techniques: Community Detection
     • Similarity between objects
     • Partitioning objects into groups
       – No guidance about what a group is
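
The two ingredients above, a similarity measure and a partitioning step with no predefined groups, can be sketched as follows. This is only one possible choice: it assumes objects are described by sets, uses Jaccard similarity, and links any two objects whose similarity reaches a threshold. The users, the `group_by_similarity` helper and the threshold value are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    return len(a & b) / len(a | b) if a | b else 0.0

def group_by_similarity(objects, threshold):
    """Link objects whose similarity reaches the threshold; return the connected groups."""
    names = list(objects)
    parent = {n: n for n in names}  # union-find forest

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, x in enumerate(names):
        for y in names[i + 1:]:
            if jaccard(objects[x], objects[y]) >= threshold:
                parent[find(x)] = find(y)

    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=lambda g: sorted(g))

# Hypothetical users described by sets of interests.
users = {
    "ann": {"jazz", "chess", "hiking"},
    "bob": {"jazz", "chess"},
    "cat": {"hiking", "chess", "jazz"},
    "dan": {"soccer", "poker"},
    "eve": {"poker", "soccer", "golf"},
}
communities = group_by_similarity(users, threshold=0.5)
```

Nothing in the code says what a community is; the groups emerge from the similarity measure and the threshold alone.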

  8. Motivation: Disease Prediction
     What medical problems does this patient have?
     Symptoms: overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat, …

  9. Techniques: Disease Prediction
     • Features
     • Model

  10. Motivation: Fraud Detection
      http://i.imgur.com/ckkoAOp.gif

  11. Techniques: Fraud Detection
      • Features
      • Dissimilarity
      • Groups and noise
      http://i.stack.imgur.com/tRDGU.png
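
A minimal sketch of the "groups and noise" idea, assuming each transaction has already been turned into a numeric feature vector and that Euclidean distance is an acceptable dissimilarity. A point with too few close neighbours is treated as noise. The feature scaling, the `noise_points` helper and both thresholds are hypothetical choices for illustration.

```python
import math

def noise_points(points, radius, min_neighbors):
    """Flag points that have fewer than min_neighbors other points within radius."""
    noise = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbors < min_neighbors:
            noise.append(p)
    return noise

# Hypothetical transactions as two scaled features (invented values).
transactions = [(1.0, 1.2), (1.1, 1.3), (0.9, 1.1), (1.2, 1.2), (9.5, 0.3)]
suspicious = noise_points(transactions, radius=0.5, min_neighbors=2)
```

The four similar transactions form a group, while the isolated one is reported as a candidate for closer inspection rather than as proven fraud.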

  12. What Is Data Science About?
      • Data
      • Extraction of knowledge from data
      • Continuation of data mining and knowledge discovery from data (KDD)

  13. What Is Data?
      • Values of qualitative or quantitative variables belonging to a set of items
      • Represented in a structure, e.g., tabular, tree or graph structure
      • Typically the results of measurements
      • As an abstract concept, can be viewed as the lowest level of abstraction from which information and then knowledge are derived

  14. What Is Information?
      • “Knowledge communicated or received concerning a particular fact or circumstance”
      • Conceptually, information is the message (utterance or expression) being conveyed
      • Cannot be predicted
      • Can resolve uncertainty

  15. What Is Knowledge?
      • Familiarity with someone or something, which can include facts, information, descriptions, or skills acquired through experience or education
      • Implicit knowledge: practical skill or expertise
      • Explicit knowledge: theoretical understanding of a subject

  16. Data Systems
      • A data system answers queries based on data acquired in the past
      • Base data – the rawest data, not derived from anywhere else
      • Knowledge – information derived from the base data

  17. Dealing with Data – Querying
      • Given a set of student records about name, age, courses taken and grades
      • Simple queries
        – What is John Doe’s age?
      • Aggregate queries
        – What is the average GPA of all students at this school?
      • Queries can be arbitrarily complicated
        – Find the students X and Y whose grades are less than 3% apart in as many courses as possible
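
The three kinds of queries above can be sketched on a toy set of student records. Everything below is an assumption for illustration: the records are invented, grades are stored as percentages, the aggregate is a plain average grade rather than a real GPA, and "less than 3% apart" is read as "differing by less than 3 percentage points".

```python
from itertools import combinations

# Hypothetical student records (invented values, grades in percent).
students = {
    "John Doe": {"age": 20, "grades": {"math": 80, "db": 90, "ml": 70}},
    "Ann Lee": {"age": 22, "grades": {"math": 82, "db": 91, "ml": 85}},
    "Raj Rao": {"age": 21, "grades": {"math": 60, "db": 89}},
}

# Simple query: one attribute of one record.
john_age = students["John Doe"]["age"]

# Aggregate query: average grade over all students and courses.
all_grades = [g for s in students.values() for g in s["grades"].values()]
average_grade = sum(all_grades) / len(all_grades)

def close_courses(a, b, gap=3):
    """Number of shared courses in which the two grades differ by less than gap points."""
    ga, gb = students[a]["grades"], students[b]["grades"]
    return sum(1 for c in ga.keys() & gb.keys() if abs(ga[c] - gb[c]) < gap)

# Complicated query: the pair of students that is close in the most courses.
closest_pair = max(combinations(students, 2), key=lambda p: close_courses(*p))
```

The last query compares every pair of students, so its cost grows quadratically with the number of records, which hints at why complicated queries need more care than simple lookups.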

  18. Queries
      • A precise request for information
      • Subjects in databases and information retrieval
        – Databases: structured queries on structured (e.g., relational) data
        – Information retrieval: unstructured queries on unstructured (e.g., text, image) data
      • Important assumptions
        – Information needs
        – Query languages

  19. Data-driven Exploration
      • What should be the next strategy of a company?
        – A lot of data: sales, human resources, production, tax, service cost, …
      • The question cannot be translated into a precise request for information (i.e., a query)
      • Developing familiarity (knowledge) and actionable items (decisions) by interactively analyzing data

  20. Data-driven Thinking
      • Starting with some simple queries
      • New queries are raised by consuming the results of previous queries
      • No ultimate query in design!
        – But many queries can be answered using DB/IR techniques

  21. The Art of Data-driven Thinking
      • The way of generating queries remains an art!
        – Different people may derive different results using the same data
      “If you torture the data long enough, it will confess” – Ronald H. Coase
      • More often than not, more data may be needed
        – datafication

  22. Queries for Data-driven Thinking
      • Probe queries – finding information about specific individuals
      • Aggregation – finding information about groups
      • Pattern finding – finding commonality in population
      • Association and correlation – finding connections among individuals and groups
      • Causality analysis – finding causes and consequences

  23. What Is Data Mining?
      • Broader sense: the art of data-driven thinking
      • Technical sense: the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [Fayyad, Piatetsky-Shapiro, Smyth, 96]
        – Methods and tools of answering various types of queries in the data mining process in the broader sense

  24. Machine Learning
      “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” – Tom M. Mitchell
      • Essentially, learn the distribution of data

  25. Data Mining vs. Machine Learning
      • Machine learning focuses on prediction, based on known properties learned from the training data
      • Data mining focuses on the discovery of (previously) unknown properties in the data

  26. The KDD Process
      Data → Selection → Target data → Preprocessing → Preprocessed data → Transformation → Transformed data → Data mining → Patterns → Interpretation/evaluation → Knowledge
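
The stages of the KDD process can be read as a chain of functions, each consuming the output of the previous one. The sketch below is a deliberately trivial illustration under that reading: the records, the function names and the "pattern" being mined (average daily change in sales) are all hypothetical.

```python
# Hypothetical raw records; None marks a missing measurement.
data = [
    {"store": "A", "day": 1, "sales": 120.0},
    {"store": "A", "day": 2, "sales": None},
    {"store": "A", "day": 3, "sales": 180.0},
    {"store": "B", "day": 1, "sales": 90.0},
]

def select(records, store):
    """Selection: keep only the records relevant to the question (target data)."""
    return [r for r in records if r["store"] == store]

def preprocess(records):
    """Preprocessing: drop records with missing values."""
    return [r for r in records if r["sales"] is not None]

def transform(records):
    """Transformation: reduce each record to a (day, sales) pair."""
    return [(r["day"], r["sales"]) for r in records]

def mine(pairs):
    """Data mining: a trivial pattern, the average daily change between first and last day."""
    (d0, s0), (d1, s1) = pairs[0], pairs[-1]
    return (s1 - s0) / (d1 - d0)

def evaluate(trend):
    """Interpretation/evaluation: turn the pattern into a statement."""
    return "sales rising" if trend > 0 else "sales not rising"

trend = mine(transform(preprocess(select(data, "A"))))
knowledge = evaluate(trend)
```

In practice the process is iterative: an unconvincing result at the evaluation stage usually sends the analyst back to an earlier stage.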

  27. Data Mining R&D
      • New problem identification
      • Data collection and transformation
      • Algorithm design and implementation
      • Evaluation
        – Effectiveness evaluation
        – Efficiency & scalability evaluation
      • Deployment and business solution

  28. Data Mining on Big Data
      “Data is so widely available and so strategically important that the scarce thing is the knowledge to extract wisdom from it” – Hal Varian, Google’s Chief Economist

  29. What Is Big Data?
      • No quantitative definition!
      • “Big data is like teenage sex
        – everyone talks about it,
        – nobody really knows how to do it,
        – everyone thinks everyone else is doing it,
        – so everyone claims they are doing it...” – Dan Ariely

  30. Data Volume vs. Storage Cost
      • The unit cost of disk storage has decreased dramatically

      Year   Unit cost
      1956   $10,000/MB
      1980   $193/MB
      1990   $9/MB
      2000   $6.9/GB
      2010   $0.08/GB
      2013   $0.06/GB

      http://ns1758.ca/winch/winchest.html
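
To compare the entries in the cost table, the figures need a common unit. The short calculation below converts everything to dollars per GB and computes the overall drop between 1956 and 2013. Two assumptions are made here: 1 GB is taken as 1000 MB, since the slide does not say which convention it uses, and the 2013 entry is read as $0.06/GB.

```python
# Unit costs from the table, normalised to dollars per GB.
# Assumption: 1 GB = 1000 MB (decimal units).
MB_PER_GB = 1000
cost_per_gb = {
    1956: 10_000 * MB_PER_GB,
    1980: 193 * MB_PER_GB,
    1990: 9 * MB_PER_GB,
    2000: 6.9,
    2010: 0.08,
    2013: 0.06,  # assumption: the entry is in dollars like the others
}

# How many times cheaper a GB of disk was in 2013 than in 1956.
drop_factor = cost_per_gb[1956] / cost_per_gb[2013]
```

Under these assumptions the unit cost fell by a factor of roughly 1.7 × 10^8 over 57 years.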

  31. Big Data – Volume
      “Data sets with sizes beyond the ability of commonly-used software tools to capture, curate, manage, and process the data within a tolerable elapsed time” – Wikipedia

  32. H1N1 Pandemic Crisis (2009)
      • A new flu virus combining elements of the viruses that cause bird flu and swine flu
      • The US Centers for Disease Control and Prevention (CDC) requested reports from doctors and tabulated the numbers once a week
        – A two-week lag in data collection
      • Google used user search keywords to predict the spread of winter flu
        – A supervised approach based on more than 3 billion search queries every day, examining 450 million different models, using 2007-2008 data from CDC
      • Some things can be done with large-scale data but cannot be done with smaller-scale data
