Concept and Applications of Data Mining, Week 1



slide-1
SLIDE 1

Concept and Applications of Data Mining

Week 1

slide-2
SLIDE 2

Topics

  • Introduction

  • Syllabus
  • Data Mining Concepts
  • Team Organization
slide-3
SLIDE 3

Introduction Session

  • Your name and major
  • The definition of data mining
  • Your expectations for this course

slide-4
SLIDE 4

Course Syllabus

  • Syllabus
slide-5
SLIDE 5

Data Mining Applications

slide-6
SLIDE 6

Classes of Data-Mining Applications in 2003

Data-Mining Applications        Percentage
Banking                         13
Bioinformatics/biotech          10
Direct marketing/fundraising    10
Fraud detection                 9
Scientific data                 9
Insurance                       8
Telecommunication               8
Medical/pharmaceuticals         6
Retail                          6
e-Commerce/Web                  5
Other                           4
Investment/stocks               3
Manufacturing                   2
Security                        2
Supply chain analysis           2
Travel                          2
Entertainment                   1

Source: www.kdnuggets.com

slide-7
SLIDE 7
slide-8
SLIDE 8
slide-9
SLIDE 9

Newsweek, May 22, 2006

slide-10
SLIDE 10

Market Basket Analysis

slide-11
SLIDE 11

Chemistry Informatics

Figure 9.14 A chemical database.

slide-12
SLIDE 12

What is Data Mining?

Source: Cover page of Advances in Knowledge Discovery and Data Mining, edited by U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, MIT Press

slide-13
SLIDE 13

How Much Information in 2003

  • http://www.sims.berkeley.edu/research/projects/how-much-info-2003/

slide-14
SLIDE 14

What is Data Mining?

  • Misnomer?
  • Gold Mining vs. Sand (Rock) Mining
  • Knowledge Discovery from Data (KDD)
  • Knowledge extraction
  • Data/pattern analysis
  • Data archaeology
  • Data dredging

slide-15
SLIDE 15

Data Mining is an Interdisciplinary and Multidisciplinary Field

[Diagram: DATA MINING at the center, surrounded by DATABASE TECHNOLOGY, MACHINE LEARNING, STATISTICS & MATH, INFORMATION THEORY, INFORMATION RETRIEVAL, and OTHER DISCIPLINES]

slide-16
SLIDE 16
Figure 1.1 The evolution of database system technology

slide-17
SLIDE 17

Data Mining is a Process of Knowledge Discovery

Figure 1.4 Data mining as a step in the process of knowledge discovery

slide-18
SLIDE 18

Architecture of a Data Mining System

[Diagram components, top to bottom]

  • Graphical User Interface
  • Pattern/Model Evaluation
  • Data Mining Engine
  • Database or Data Warehouse Server
  • Knowledge-Base
  • Data cleaning, integration, and selection
  • Database, Data Warehouse, World-Wide Web, Other Info Repositories

Figure 1.5 Architecture of a typical data mining system

slide-19
SLIDE 19

Data Mining and Stakeholders

Increasing potential to support business decisions (bottom to top):

  • Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
  • Data Warehouses / Data Marts, OLAP (DBA)
  • Data Exploration: Statistical Analysis, Querying and Reporting
  • Data Mining: Knowledge Discovery (Data Analyst)
  • Data Presentation: Visualization Techniques (Business Analyst)
  • Making Decisions (End User)

slide-20
SLIDE 20

Data Types - Perspective on Structure

  • Structured
  • Semi-structured
  • Unstructured

slide-21
SLIDE 21

Structured Data (1)

  • Data is organized in semantic entities
  • Similar entities are grouped together (relations or classes)
  • Entities in the same group have the same descriptions (attributes, features)

slide-22
SLIDE 22

Structured Data (2)

  • Descriptions for all entities in a group (schema)
  • Attributes
      – Have same defined formats
      – Have predefined lengths
      – Follow same orders
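
The idea of a schema shared by every entity in a group can be sketched in a few lines of Python. This is only an illustration: the `Customer` class, its three attributes, and the helper names `schema_of` and `conforms` are assumptions made for the example and do not come from the slides. It checks two of the listed properties (same formats, same order); predefined lengths are left out for brevity.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Customer:
    """Hypothetical group of entities: every record has the same attributes."""
    customer_id: int   # hypothetical attribute with a fixed format (integer)
    name: str          # hypothetical attribute with a fixed format (text)
    city: str          # hypothetical attribute with a fixed format (text)


def schema_of(entity_class):
    """Return the ordered (attribute name, type name) pairs shared by the group."""
    return [(f.name, f.type.__name__) for f in fields(entity_class)]


def conforms(record, entity_class):
    """True if a raw record has the schema's attributes, in order, with matching types."""
    expected = fields(entity_class)
    if list(record.keys()) != [f.name for f in expected]:
        return False
    return all(isinstance(record[f.name], f.type) for f in expected)


print(schema_of(Customer))
print(conforms({"customer_id": 1, "name": "Hayes", "city": "Harrison"}, Customer))
print(conforms({"customer_id": 1, "name": "Hayes"}, Customer))  # missing attribute
```

A record that drops an attribute or reorders them fails the check, which is the strictness that separates structured from semi-structured data.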

slide-23
SLIDE 23

Semi-structured Data (1)

  • Semi-structured data are organized in semantic entities
  • Similar entities are grouped together
  • Entities in same group may not have same attributes

slide-24
SLIDE 24

Semi-structured Data (2)

  • Attributes
      – Order of attributes not necessarily important
      – Not all attributes may be required
      – Size of same attributes in a group may differ
      – Type of same attributes in a group may differ

slide-25
SLIDE 25

XML

<bank-1>
  <customer>
    <customer_name> Hayes </customer_name>
    <customer_street> Main </customer_street>
    <customer_city> Harrison </customer_city>
    <account>
      <account_number> A-102 </account_number>
      <branch_name> Perryridge </branch_name>
      <balance> 400 </balance>
    </account>
    <account>
      …
    </account>
  </customer>
  .
  .
</bank-1>
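
A short Python sketch of how such semi-structured XML might be read follows. It uses the standard-library `xml.etree.ElementTree` parser. The first customer is taken from the slide (with the elided second account left out); the second customer, "Doe", is a hypothetical addition with no street and no accounts, included only to show that entities in the same group need not share the same attributes. The function name `customer_records` is also an assumption.

```python
import xml.etree.ElementTree as ET

# First customer: from the slide. Second customer: hypothetical, added to show
# that attributes (street, accounts) may be absent in semi-structured data.
BANK_XML = """
<bank-1>
  <customer>
    <customer_name> Hayes </customer_name>
    <customer_street> Main </customer_street>
    <customer_city> Harrison </customer_city>
    <account>
      <account_number> A-102 </account_number>
      <branch_name> Perryridge </branch_name>
      <balance> 400 </balance>
    </account>
  </customer>
  <customer>
    <customer_name> Doe </customer_name>
    <customer_city> Harrison </customer_city>
  </customer>
</bank-1>
"""


def customer_records(xml_text):
    """Turn each <customer> element into a dict holding whatever attributes it has."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for customer in root.findall("customer"):
        record = {"accounts": []}
        for child in customer:
            if child.tag == "account":
                # Nested entity: collect its own fields into a sub-record.
                record["accounts"].append(
                    {field.tag: field.text.strip() for field in child}
                )
            else:
                record[child.tag] = child.text.strip()
        records.append(record)
    return records


records = customer_records(BANK_XML)
for record in records:
    print(record)
```

Note that every value arrives as text; the XML itself does not fix a type or length for `balance`, so any conversion is left to the consumer.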

slide-26
SLIDE 26

Unstructured Data (1)

  • Masses of computerized data that do not have a data structure easily readable by a machine

slide-27
SLIDE 27

Unstructured Data (2)

“Merrill Lynch estimates that more than 85 percent of all business information exists as unstructured data – commonly appearing in e-mails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations and Web pages.” -- DM Review Magazine, February 2003 Issue

slide-28
SLIDE 28

Data Types - Perspective on Representation

  • Numeric and categorical
  • Quantitative and qualitative
  • Nominal and ordinal
  • Static and dynamic (temporal)

slide-29
SLIDE 29

Numeric and Categorical Data (1)

  • Numeric data
      – Real number data, integer number data
      – Properties
          • Order relation (2 < 5)
          • Distance relation (d(2.3, 4.2) = 1.9)
          • Equality relation (2 = 2)

slide-30
SLIDE 30

Numeric and Categorical Data (2)

  • Categorical (symbolic) values
      – Equality relation
          • Blue = Blue or Red <> Blue
      – Categorical values can be converted to numeric values
          • Gender (male, female) → (0, 1)
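
The relations on the two slides above can be tried out directly. The sketch below is a minimal illustration in Python; the function names `distance` and `encode_categories` are assumptions, and taking d(a, b) to be the absolute difference is one common choice that is consistent with the slide's d(2.3, 4.2) = 1.9.

```python
def distance(a, b):
    """Distance relation for numeric values, taken here as the absolute difference."""
    return abs(a - b)


def encode_categories(values, categories):
    """Map categorical values to integer codes in the order the categories are listed."""
    code_of = {category: code for code, category in enumerate(categories)}
    return [code_of[value] for value in values]


# Numeric data supports order, distance, and equality.
print(2 < 5)                          # order relation
print(round(distance(2.3, 4.2), 10))  # distance relation
print(2 == 2)                         # equality relation

# Categorical data supports only equality.
print("Blue" == "Blue", "Red" != "Blue")

# Conversion to numeric codes, as in Gender (male, female) -> (0, 1).
print(encode_categories(["male", "female", "male"], ["male", "female"]))
```

The codes produced by `encode_categories` are labels only: the fact that 0 < 1 says nothing about the categories themselves, and the distance between two codes has no meaning.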

slide-31
SLIDE 31

Quantitative and Qualitative Data

  • Quantitative data
      – Numeric values are quantitative values
      – Height, weight, salary
  • Qualitative data
      – Nominal
      – Ordinal

slide-32
SLIDE 32

Nominal Data

  • Utility customer type (residential, commercial, industrial, governmental)
  • Use different symbols, characters, and numbers
  • These values can be coded alphabetically as A, B, and C, or numerically as 1, 2, and 3
  • Order-less

slide-33
SLIDE 33

Ordinal Data

  • The rank of the student in a class
  • An ordinal variable is a categorical variable for which an order relation is defined but not a distance relation
  • The ordered scale need not be linear: the difference between the 4th and 5th students is not necessarily the same as the difference between the 14th and 15th students
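
To make the point about order without distance concrete, here is a small Python sketch. The student names and exam scores are hypothetical, invented for this example, and the helper `ranks` ignores ties for simplicity.

```python
def ranks(scores):
    """Assign rank 1 to the highest score; ties are not handled in this sketch."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


# Hypothetical exam scores, used only for illustration.
scores = {"Ann": 98, "Bob": 97, "Cy": 80, "Dee": 79}
rank = ranks(scores)

# Adjacent ranks always differ by exactly 1 ...
rank_gap_1_2 = rank["Bob"] - rank["Ann"]
rank_gap_2_3 = rank["Cy"] - rank["Bob"]

# ... while the underlying score gaps can be very different.
score_gap_1_2 = scores["Ann"] - scores["Bob"]
score_gap_2_3 = scores["Bob"] - scores["Cy"]

print(rank)
print(rank_gap_1_2, rank_gap_2_3)
print(score_gap_1_2, score_gap_2_3)
```

Comparing ranks (who is ahead of whom) is meaningful; subtracting them is not, because the step from one rank to the next does not correspond to a fixed amount of the underlying quantity.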

slide-34
SLIDE 34

Static and Dynamic Data

  • Static data
      – Attribute values do not change with time
  • Dynamic data
      – Attribute values change with time

slide-35
SLIDE 35

Data Repositories

  • Transactional database
  • Relational database
  • Data warehouse
  • Advanced database
  • Data stream
  • The World Wide Web

slide-36
SLIDE 36

Transactional Database

TID     List of item_IDs
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3

Table 5.1 Transactional data for an AllElectronics branch
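
As a rough illustration of what market basket analysis does with a table like this, the Python sketch below counts how many transactions contain each item and each pair of items (the support count). The transactions are those of Table 5.1; the function name `support_counts` and the brute-force counting approach are assumptions made for the example, not a method stated on the slides.

```python
from collections import Counter
from itertools import combinations

# Transactions from Table 5.1.
TRANSACTIONS = {
    "T100": ["I1", "I2", "I5"],
    "T200": ["I2", "I4"],
    "T300": ["I2", "I3"],
    "T400": ["I1", "I2", "I4"],
    "T500": ["I1", "I3"],
    "T600": ["I2", "I3"],
    "T700": ["I1", "I3"],
    "T800": ["I1", "I2", "I3", "I5"],
    "T900": ["I1", "I2", "I3"],
}


def support_counts(transactions, size):
    """Count, for every itemset of the given size, the transactions that contain it."""
    counts = Counter()
    for items in transactions.values():
        # Sorting gives each itemset one canonical tuple form.
        for itemset in combinations(sorted(set(items)), size):
            counts[itemset] += 1
    return counts


item_support = support_counts(TRANSACTIONS, 1)
pair_support = support_counts(TRANSACTIONS, 2)

print(sorted(item_support.items()))
print(pair_support.most_common(3))
```

Enumerating every combination is fine for nine transactions but grows quickly with basket size; practical frequent-itemset algorithms prune candidates instead of counting them all.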

slide-37
SLIDE 37

Figure 1.6. Fragments of relations from a relational database for AllElectronics

slide-38
SLIDE 38

Data Warehouse (Mart)

Figure 1.7 Typical framework of a data warehouse for AllElectronics

slide-39
SLIDE 39

Table 3.1 Comparison between OLTP and OLAP systems

slide-40
SLIDE 40

Star Schema of a Data Warehouse for Sales

Figure 3.4 Star schema of a data warehouse for sales

slide-41
SLIDE 41

Table 3.3 A 3-D view of sales data for AllElectronics, according to the dimensions time, item, and location. The measure displayed is dollars_sold (in thousands).

slide-42
SLIDE 42

Data Cube for Sales

Figure 3.1 A 3-D data cube representation of the data in Table 3.3, according to the dimensions time, item, and location. The measure displayed is dollars_sold (in thousands).

slide-43
SLIDE 43

Figure 3.10. Examples of typical OLAP operations on multidimensional data cube, commonly used for data warehousing
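
Two common OLAP operations can be sketched on a tiny cube with the three dimensions named earlier (time, item, location). Everything concrete below is an assumption for illustration: the cell values are hypothetical rather than the figures from Table 3.3, and `roll_up` and `slice_cube` are made-up helper names. The roll-up shown here is roll-up by dimension reduction (summing a dimension away), not by climbing a concept hierarchy.

```python
from collections import defaultdict

DIMENSIONS = ("time", "item", "location")

# (time, item, location) -> sales measure in thousands. All values hypothetical.
CUBE = {
    ("Q1", "TV", "Vancouver"): 10,
    ("Q1", "TV", "Toronto"): 20,
    ("Q1", "phone", "Vancouver"): 5,
    ("Q2", "TV", "Vancouver"): 15,
    ("Q2", "phone", "Toronto"): 8,
}


def roll_up(cube, keep):
    """Sum the measure over every dimension that is not listed in keep."""
    kept = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for cell, value in cube.items():
        totals[tuple(cell[i] for i in kept)] += value
    return dict(totals)


def slice_cube(cube, dimension, value):
    """Select the cells where one dimension equals a fixed value."""
    index = DIMENSIONS.index(dimension)
    return {cell: v for cell, v in cube.items() if cell[index] == value}


print(roll_up(CUBE, ["time"]))              # totals per quarter
print(roll_up(CUBE, ["item", "location"]))  # time dimension summed away
print(slice_cube(CUBE, "time", "Q2"))       # only the Q2 cells
```

Storing the cube as a dictionary keyed by coordinate tuples keeps the sketch short; a real warehouse would precompute and index such aggregates.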

slide-44
SLIDE 44

Advanced Databases

  • Object-relational databases
  • Temporal databases
  • Sequence databases
  • Time-series databases
  • Spatial databases
  • Spatio-temporal databases
  • Text databases
  • Heterogeneous databases
slide-45
SLIDE 45

Data Streams

  • The features of data stream: huge or possibly infinite volume, dynamically changing, flowing in and out in a fixed order, allowing only one or a small number of scans, and demanding fast (often real-time) response time
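
The "only one scan" constraint changes how even simple statistics are computed. The Python sketch below keeps a running mean while reading each value exactly once and storing nothing but a count and the current mean. The generator name `running_mean` and the sample readings are assumptions for illustration.

```python
def running_mean(stream):
    """Yield the mean of all values seen so far, reading each value exactly once."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: no past values need to be stored.
        mean += (value - mean) / count
        yield mean


# Hypothetical sensor readings arriving one at a time.
readings = iter([2, 4, 6])
means = list(running_mean(readings))
print(means)
```

Because the generator holds constant state, it works the same way whether the stream has three values or never ends, and an up-to-date answer is available after every arrival.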

slide-46
SLIDE 46

The World Wide Web (1)

  • The WWW serves as a huge, distributed, global information service center for news, advertisements, consumer information, financial management, education, government, e-commerce, and many other information services

slide-47
SLIDE 47

The WWW (2)

  • The challenges for KD
      – Size
      – Complexity
      – Dynamic
      – Diversity
      – Relevance

slide-48
SLIDE 48

Lab Activities

  • Introduction to R
  • Organize your team
      – Each team consists of three (or four) students
      – Email your team information (names and email addresses) to the instructor by the end of today’s lab session
  • Read chapter 2 of the lecture textbook and do team homework assignment #1
  • Read chapters 1, 2, and 3 of the lab textbook
  • Brainstorm on the topic of your group project

slide-49
SLIDE 49

(Team) Homework Assignment #1

  • Do Examples 2.1, 2.6, 2.7, and Exercise 2.18. Note that you need to use R for 2.18 (b).
  • Prepare for the results of the homework assignment
  • Due date
      – Beginning of the lecture on Friday, February 4th.

slide-50
SLIDE 50

Next Week Topics

  • Data types and data repositories (Section 1.3)
  • Data preprocessing (Ch. 2)