Part I: Introductory Materials Introduction to Data Mining Dr. - - PowerPoint PPT Presentation


Part I: Introductory Materials. Introduction to Data Mining. Dr. Nagiza F. Samatova, Department of Computer Science, North Carolina State University, and Computer Science and Mathematics Division, Oak Ridge National Laboratory.


SLIDE 1

Part I: Introductory Materials

Introduction to Data Mining

  • Dr. Nagiza F. Samatova

Department of Computer Science, North Carolina State University, and Computer Science and Mathematics Division, Oak Ridge National Laboratory

SLIDE 2

What is common among all of them?

SLIDE 3

Who are the data producers? What data?

Application → Data

  • Application Category: Finance
  • Producer: Wall Street
  • Data: stocks, stock prices, stock purchases,…
  • Application Category: Academia
  • Producer: NCSU
  • Data: student admission data (name, DOB, GRE scores, transcripts, GPA, university/school attended, recommendation letters, personal statement, etc.)

SLIDE 4

Application Categories

  • Finance (e.g., banks)
  • Entertainment (e.g., games)
  • Science (e.g., weather forecasting)
  • Medicine (e.g., disease diagnostics)
  • Cybersecurity (e.g., terrorists, identity theft)
  • Commerce (e.g., e-Commerce)

SLIDE 5

What questions to ask about the data?

Data → Questions

  • Academia:NCSU:Admission data

1. Is there any correlation between the students’ GRE scores and their successful completion of a PhD program?
2. What are the groups of students that share common academic performance?
3. Are there any admitted students who would stand out as an anomaly? What type of anomaly is that?
4. If the student majors in Physics, what other major is he/she likely to double-major in?

SLIDE 6

Questions by Types?

  • Correlation, similarity, comparison,…
  • Association, causality, co-occurrence,…
  • Grouping, clustering,…
  • Categorization, classification,…
  • Frequency or rarity of occurrence,…
  • Anomalous or normal objects, events, behaviors,…
  • Forecasting: future classes, future activity,…

SLIDE 7

What information do we need to answer the questions?

Questions → Data Objects and Object Features

  • Academia:NCSU:Admission data

– Objects: Students
– Object’s Features=Variables=Attributes=Dimensions & Types

  • Name:String (e.g., Name=Neil Shah)
  • GPA:Numeric (e.g., GPA=5.0)
  • Recommendation:Text (e.g., … the top 2% in my career…)
  • Etc.

SLIDE 8

How to compare two objects?

Data Object → Object Pairs

  • Academia:NCSU:Admission data

– Objects: Students

– Based on a single feature:

  • Similar GPA
  • The same first letter in the last name

– Based on a set of features:

  • Similar academic records (GPA, GRE, etc.)
  • Similar demographic records

– Can you compute a numerical value for the similarity measure you use for comparison? Why or why not?
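One possible way to answer the question above for numeric features is to turn the difference between two values into a score between 0 and 1. The sketch below is only an illustration: the function names, the two students, and the feature ranges (a 0-5 GPA scale and a 130-170 GRE scale) are assumptions, not taken from the slides.

```python
def feature_similarity(a, b, value_range):
    """Similarity in [0, 1] for one numeric feature: 1 means identical values."""
    return 1.0 - abs(a - b) / value_range


def same_initial(last_a, last_b):
    """Categorical feature: 1 if the last names share a first letter, else 0."""
    return 1.0 if last_a[:1].upper() == last_b[:1].upper() else 0.0


def record_similarity(x, y, ranges):
    """Average the per-feature similarities over a set of numeric features."""
    scores = [feature_similarity(x[k], y[k], ranges[k]) for k in ranges]
    return sum(scores) / len(scores)


# Hypothetical students and feature ranges (assumptions for illustration).
RANGES = {"gpa": 5.0, "gre": 170.0 - 130.0}
alice = {"gpa": 4.5, "gre": 165.0}
bob = {"gpa": 4.0, "gre": 155.0}

gpa_only = feature_similarity(alice["gpa"], bob["gpa"], RANGES["gpa"])  # single feature
academic = record_similarity(alice, bob, RANGES)                        # set of features
initial = same_initial("Shah", "Smith")                                 # categorical feature
```

Free-text features such as recommendation letters have no obvious numeric difference, which is one reason the answer to "why not?" can be non-trivial.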

SLIDE 9

How to represent data mathematically?

Data Object & its Features → Data Model

  • What mathematical objects have you studied?

– Scalar
– Points
– Vectors
– Vector spaces
– Matrices
– Sets
– Graphs, networks (maybe)
– Tensors (maybe)
– Time series (maybe)
– Topological manifolds (maybe)
– …

SLIDE 10

Data object as vector with components…

City=(Latitude, Longitude): a 2-dimensional object

Vector components:

  • Features, or
  • Attributes, or
  • Dimensions

Raleigh=(35.46, 78.39)
Boston=(42.21, 71.5)

Proximity(Raleigh, Boston)=?

  • Geodesic distance
  • Euclidean distance
  • Length of the interstate route
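
Two of the proximity measures listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the coordinates are used exactly as printed on the slide and treated as decimal degrees, the Earth is modeled as a sphere of radius 6371 km, and the haversine formula stands in for the geodesic distance. The route length is not computed because it would need road-network data.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def euclidean(p, q):
    """Straight-line distance in raw coordinate units (degrees here)."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)


def haversine_km(p, q):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Coordinates as printed on the slide, assumed to be decimal degrees.
raleigh = (35.46, 78.39)
boston = (42.21, 71.5)

d_euclid = euclidean(raleigh, boston)  # in degrees, not a physical length
d_geo = haversine_km(raleigh, boston)  # in kilometers
```

The two numbers are in different units, which already shows why the choice of proximity measure has to match the question being asked.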
SLIDE 11

A set of data objects as vector spaces

3-dimensional vector space with axes Latitude, Longitude, and Altitude

Example points: Raleigh, Moscow

Mining such data ~ studying vector spaces

SLIDE 12

Multi-dimensional vectors…

Student=(Name, GPA, Weight, Height, Income in K, …): a multi-dimensional object

S1=(John Smith, 5.0, 180, 6.0, 200)
S2=(Jane Doe, 3.0, 140, 5.4, 70)

Vector components:

  • Features, or
  • Attributes, or
  • Dimensions

Proximity(S1, S2)=?

  • How to compare when vector components are of heterogeneous types or on different scales?
  • How to show the results of the comparison?
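
The scale problem raised above can be made concrete with S1 and S2. In the sketch below, min-max rescaling is just one common option and not necessarily the approach the course goes on to use; the (min, max) range for each feature is an assumption chosen for illustration, and the name is left out because it is not numeric.

```python
import math

FEATURES = ("gpa", "weight", "height", "income_k")

s1 = {"name": "John Smith", "gpa": 5.0, "weight": 180, "height": 6.0, "income_k": 200}
s2 = {"name": "Jane Doe", "gpa": 3.0, "weight": 140, "height": 5.4, "income_k": 70}

# Assumed (min, max) per feature; hypothetical values, not from the slides.
RANGES = {"gpa": (0.0, 5.0), "weight": (80, 300), "height": (4.0, 7.0), "income_k": (0, 500)}


def rescale(student):
    """Min-max scale each numeric feature to [0, 1]; the name is skipped."""
    return [(student[f] - RANGES[f][0]) / (RANGES[f][1] - RANGES[f][0]) for f in FEATURES]


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


raw = euclidean([s1[f] for f in FEATURES], [s2[f] for f in FEATURES])
scaled = euclidean(rescale(s1), rescale(s2))
```

On the raw values the distance is dominated by income and weight simply because their numbers are larger; after rescaling, the GPA difference contributes the most.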
SLIDE 13

Or as matrices…

Example: A collection of text documents on the Web

Original Documents

D1: Child Safety at Home
D2: Infant & Toddler First Aid
D3: Your Baby's Health and Safety: From Infant to Toddler

Parsed Documents

D1: Child Safety Home
D2: Infant Toddler
D3: Bab Health Safety Infant Toddler

t-d term-document matrix (Terms=Features=Dimensions)

Term          D1  D2  D3
T1: Bab        0   0   1
T2: Child      1   0   0
T3: Health     0   0   1
T4: Home       1   0   0
T5: Infant     0   1   1
T6: Safety     1   0   1
T7: Toddler    0   1   1

Mining such data ~ studying matrices
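
The term-document matrix for the three parsed documents can be built mechanically. The following is a minimal sketch; the variable names are illustrative, and the parsing step (stop-word removal and stemming, e.g. "Baby's" to "Bab") is taken as already done, as on the slide.

```python
# Parsed documents as listed on the slide.
parsed = {
    "D1": ["Child", "Safety", "Home"],
    "D2": ["Infant", "Toddler"],
    "D3": ["Bab", "Health", "Safety", "Infant", "Toddler"],
}

# Rows are the distinct terms in alphabetical order; columns are documents.
terms = sorted({t for words in parsed.values() for t in words})
docs = sorted(parsed)

# matrix[i][j] is 1 when term i occurs in document j, else 0.
matrix = [[1 if t in parsed[d] else 0 for d in docs] for t in terms]

for i, (term, row) in enumerate(zip(terms, matrix), start=1):
    print(f"T{i}: {term:<8}", *row)
```

Each column of the matrix is the vector representation of one document, so the vector view from the earlier slides carries over directly.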

SLIDE 14
Or as trees

t-d term-document matrix

      D1  D2  D3
T1:    0   0   1
T2:    1   0   0
T3:    0   0   1
T4:    1   0   0
T5:    0   1   1
T6:    1   0   1
T7:    0   1   1

[Figure: tree over documents (D2, D3) and document terms, with these term groups:]

  • president, government, party, election, political, elected, national, districts, held, district, independence, vice, minister, parties
  • population, area, climate, city, miles, province, land, topography, total, season, 1999, square, rate
  • economy, million, products, 1996, growth, copra, economic, 1997, food, scale, exports, rice, fish

Is D2 similar to D3?

What if there are 10,000 terms?
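
One way to put a number on "Is D2 similar to D3?" is cosine similarity between the document columns of the term-document matrix. This is an assumed choice of measure for illustration, not necessarily the one behind the tree on the slide. Storing only the non-zero entries keeps the cost tied to the number of terms a document actually contains, which matters once there are 10,000 terms.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Columns of the term-document matrix, keeping only the non-zero entries.
d1 = {"T2": 1, "T4": 1, "T6": 1}
d2 = {"T5": 1, "T7": 1}
d3 = {"T1": 1, "T3": 1, "T5": 1, "T6": 1, "T7": 1}

sim_d2_d3 = cosine(d2, d3)  # D2 and D3 share T5 and T7
sim_d1_d2 = cosine(d1, d2)  # D1 and D2 share no terms
```

Here D2 and D3 share two terms, so their similarity is 2 / (sqrt(2) * sqrt(5)), about 0.63, while D1 and D2 share none and score 0.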

Mining such data ~ studying trees

SLIDE 15

Or as networks, or graphs w/ nodes & links

Nodes=Documents
Links=Document similarity (e.g., if a document references another document)

[Figure: graph of documents, with these term groups:]

  • population, area, climate, city, miles, province, land, topography, total, season, 1999, square, rate
  • president, government, party, election, political, elected, national, districts, held, district, independence, vice, minister, parties
  • economy, million, products, 1996, growth, copra, economic, 1997, food, scale, exports, rice, fish

Mining such data ~ studying graphs, or graph mining

SLIDE 16

What apps naturally deal w/ graphs?

Credit: Images are from Google Images via keyword search

  • Semantic Web
  • Social Networks
  • World Wide Web
  • Drug Design, Chemical compounds
  • Computer networks
  • Sensor networks

SLIDE 17

What questions to ask about graph data?

Graph Data → Graph Mining Questions

  • Academia:NCSU:Admission data

1. Nodes=students; links=similar academics/demographics
2. How many distinct academic-performance groups of students were admitted to NCSU?
3. Which academic group is the largest?
4. Given a new student applicant, can we predict which academic group the student will likely belong to?
5. Do groups of students with similar demographics usually share similar academic performance?
6. Over the last decade, has the diversity in demographics of accepted student groups increased or decreased?
7. …
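Questions 2 and 3 can be sketched on a toy graph. The example below assumes that a "group" is a connected component of the similarity graph, which is a simplification (clustering methods generally define groups differently); the students and links are hypothetical.

```python
from collections import defaultdict

# Hypothetical students (nodes) and similarity links (edges).
students = ["s1", "s2", "s3", "s4", "s5", "s6"]
links = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]

# Undirected adjacency structure: node -> set of neighbors.
adjacency = defaultdict(set)
for a, b in links:
    adjacency[a].add(b)
    adjacency[b].add(a)


def groups(nodes, adjacency):
    """Return the connected components, each as a sorted list of nodes."""
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first traversal from the start node
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        result.append(sorted(component))
    return result


components = groups(students, adjacency)  # question 2: how many groups?
largest = max(components, key=len)        # question 3: which group is largest?
```

With these links there are three groups, and the largest one contains s1, s2, and s3; s6 has no links and forms a group of its own.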

SLIDE 18

Recap: Data Mining and Graph Mining

Data Application → Questions → Data Objects + Features → Mathematical Data Representation (Data Model)

Data models:

  • Vectors
  • Matrices
  • Graphs
  • Time series
  • Tensors
  • Sets
  • Manifolds

One hat does not fit all:

  • More than one model is needed
  • Models are related

SLIDE 19

How much data?

Domains: Astrophysics, Cosmology, Climate, Biology, Ecology, Web

Data volumes shown: 30TB/day, 20-40TB/simulation, 1PB/year, 850TB

1 TB (TeraByte) = 10^12 Bytes
1 PB (PetaByte) = 10^15 Bytes

My laptop: 60 GB; 1 GB (GigaByte) = 10^9 Bytes

SLIDE 20

It is not just the Size

Petabytes Data

– but the Complexity

SLIDE 21

Data Describes Complex Patterns/Phenomena

How to untangle the riddles of the complexity?

[Figure labels: Complex regulation; Single gene; ~30k genes; 50 trans elements control single gene expression]

Challenge: How to “connect the dots” to answer important science/business questions?

Analytical tools that find the “dots” from data significantly reduce data.

SLIDE 22

Connecting the Dots

Stages: Finding the Dots, Connecting the Dots, Understanding the Dots

Sheer Volume of Data

  • Climate: Now: 20-40 Terabytes/year; 5 years: 5-10 Petabytes/year
  • Fusion: Now: 100 Megabytes/15 min; 5 years: 1000 Megabytes/2 min

Advanced Math+Algorithms

  • Huge dimensional space
  • Combinatorial challenge
  • Complicated by noisy data
  • Requires high-performance computers

Providing Predictive Understanding

  • Produce bioenergy
  • Stabilize CO2
  • Clean toxic waste

SLIDE 23

Why Would Data Mining Matter?

Enables solving many large-scale data problems

Finding the Dots → Connecting the Dots → Understanding the Dots

Science Questions

  • How to effectively produce bioenergy?
  • How to stabilize carbon dioxide?
  • How to convert toxic into non-toxic waste?
  • ...

SLIDE 24

How to Move and Access the Data?

Technology trends are a rate limiting factor

Most of these data will NEVER be touched!

  • Naturally distributed but effectively immovable
  • Streaming/Dynamic but not re-computable
  • Data doubles every 9 months; CPU every 18 months.

[Figure: CPU, Disk, Network Trend (MIPS/$M, GB/$M, kB/s). Doubling: CPU every 1.2 years; Disk every 1.4 years; WAN every 0.7 years. Src: Richard Mount, SLAC]

[Figure: Latency and Speed – Storage Performance. Retrieval Rate (Mbytes/s) vs. log10(Object Size, Bytes) for Memory, Disk, and Tape]

  • J. W. Toigo, Avoiding a Data Crunch, Scientific American, May 2000

SLIDE 25

How to Make Sense of Data?

Know Your Limits & Be Smart

Not humanly possible to browse a petabyte of data. Analysis must reduce data to quantities of interest.

  • Ultrascale Computations: Must be smart about which probe combinations to see!
  • Physical Experiments: Must be smart about probe placement!

[Figure: data scale from Megabytes through Gigabytes and Terabytes to Petabytes; More data, More analysis]

To see 1 percent of a petabyte at 10 megabytes per second takes: 35 8-hour days!
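
The "35 8-hour days" figure can be reproduced with a short calculation. The sketch below assumes decimal units (1 petabyte = 10^15 bytes, 1 megabyte = 10^6 bytes), consistent with the unit definitions given earlier in the deck.

```python
PETABYTE = 10**15        # bytes
RATE = 10 * 10**6        # 10 megabytes per second, in bytes per second

bytes_to_view = PETABYTE // 100    # 1 percent of a petabyte = 10^13 bytes
seconds = bytes_to_view / RATE     # 10^6 seconds
hours = seconds / 3600             # about 278 hours
eight_hour_days = hours / 8        # about 34.7, i.e. roughly 35 working days
```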

SLIDE 26

What Analysis Algorithms to Use?

Even a simple big O analysis can elucidate simplicity.

Algorithmic Complexity:

  • Calculate means: O(n)
  • Calculate FFT: O(n log(n))
  • Calculate SVD: O(r • c)
  • Clustering algorithms: O(n^2)

For illustration, the chart assumes 10^-12 sec. (1 Tflop/sec) calculation time per data point.

Algorithm Complexity vs. Data size n

Data size n   n             n log(n)      n^2
100B          10^-10 sec.   10^-10 sec.   10^-8 sec.
10KB          10^-8 sec.    10^-8 sec.    10^-4 sec.
1MB           10^-6 sec.    10^-5 sec.    1 sec.
100MB         10^-4 sec.    10^-3 sec.    3 hrs.
10GB          10^-2 sec.    0.1 sec.      3 yrs.

Analysis algorithms fail for a few gigabytes.

If n=10GB, then what is O(n) or O(n^2) on a teraflop computer?

1GB = 10^9 bytes
1Tflop = 10^12 op/sec
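
The question above can be worked through numerically. This sketch assumes one operation per unit of work and one data point per byte, so n = 10^10; the base-2 logarithm is also an assumption, since the slide does not state a base.

```python
import math

OPS_PER_SECOND = 10**12   # 1 Tflop/sec, as stated on the slide
n = 10 * 10**9            # 10 GB, assuming one data point per byte

t_linear = n / OPS_PER_SECOND                 # O(n): 0.01 sec
t_nlogn = n * math.log2(n) / OPS_PER_SECOND   # O(n log n): about 0.33 sec
t_quadratic = n**2 / OPS_PER_SECOND           # O(n^2): 10^8 sec

SECONDS_PER_YEAR = 365 * 24 * 3600
years_quadratic = t_quadratic / SECONDS_PER_YEAR  # about 3.2 years
```

Under these assumptions a linear pass over 10 GB takes a hundredth of a second, while a quadratic algorithm needs roughly three years, which is the order of magnitude the chart reports.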