Data Mining: What is it? (Ch. 1, IRDM)



slide-1
SLIDE 1

IRDM ‘15/16

Data Mining: What is it?

Ch. 1

I-DM- 1

slide-2
SLIDE 2

What is Data Mining?

“Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.”

“Data mining, in a broad sense, is the set of techniques for analyzing and understanding data.”

“An Unethical Econometric practice of massaging and manipulating the data to obtain the desired results.”

“Data mining is the process of extracting hidden patterns from data.”

slide-3
SLIDE 3

What is Data Mining?

“An Unethical Econometric practice of massaging and manipulating the data to obtain the desired results.”

“Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.”

“Data mining, in a broad sense, is the set of techniques for analyzing and understanding data.”

“Data mining is the process of extracting hidden patterns from data.”

slide-4
SLIDE 4

The KDD (Knowledge Discovery from Data) Process

Input data → Data pre-processing → Data mining → Post-processing → Information

Data pre-processing: Normalisation, Dimensionality reduction, Feature selection, Handling missing values

Post-processing: Filtering patterns, Visualisation, Pattern interpretation

Interactive data mining
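As a rough illustration of two of the pre-processing steps named on this slide (handling missing values and normalisation), here is a minimal Python sketch. Mean imputation and min-max scaling are assumptions chosen for brevity; the slide does not say which techniques are meant, and the function names are hypothetical.

```python
from typing import List, Optional


def impute_mean(column: List[Optional[float]]) -> List[float]:
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def min_max_normalise(column: List[float]) -> List[float]:
    """Rescale values linearly to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Toy column with one missing value (hypothetical data).
raw = [2.0, None, 4.0, 6.0]
clean = min_max_normalise(impute_mean(raw))
print(clean)
```

The order matters here: imputing first means the filled-in value is rescaled together with the observed ones.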

slide-5
SLIDE 5


Data Mining vs. Information Retrieval

IR is answering questions the user asked

DM is answering questions the user didn’t ask

“Show me the web pages relevant to this query”

vs.

“Show me the interesting patterns in the contents of these web pages”

Vague problem…

How to define interestingness?

How to evaluate results?

slide-6
SLIDE 6

Data Mining’s position in Science

Data mining uses statistics to infer from data

  • is data mining just a fancy name for statistics?

Data mining uses methods to learn unseen patterns

  • is data mining just a boring name for machine learning?

Is data mining voodoo science?

slide-7
SLIDE 7
slide-8
SLIDE 8

Why Data Mining?

slide-9
SLIDE 9

Why Data Mining?

The ”PHT” Pirate wanted all information of the world. But before he realized most of it was useless, he was already buried under it. —Stanisław Lem, The Cyberiad

slide-10
SLIDE 10

Big Data, Bigger Data, Biggest Data

slide-11
SLIDE 11

Data, data, data

1 250 000 transactions per hour

≈ 5GB of climate data

350 000 000 photos. Per day.

500 000 000 tweets per day

slide-12
SLIDE 12

Data, data, data

To use this data, we need tools to analyse and understand it. We need data mining.

slide-13
SLIDE 13

Data Mining Applications

Business Intelligence

slide-14
SLIDE 14

Shopping Data

Which products are often bought together?

slide-15
SLIDE 15

Train Delays

Which trains are delayed because of other trains?
slide-16
SLIDE 16

Drug Discovery

What part of the molecule makes the drug work?

slide-17
SLIDE 17


Data Mining Applications

Business Intelligence

  • what do customers buy together?
  • what are the seasonal trends?

Scientific Data Analysis

  • what genes cause diseases?
  • what are the differences between languages?

And anything else where you have data…

  • who should Hillary Clinton try to persuade to vote?
  • is everything alright with my space object?

slide-18
SLIDE 18

Faster! Do it faster!

Nobody likes exponential time algorithms.

In data mining, we don’t even like polynomial time

  • Your solution is $O(n^3)$? Great… my data is only 10M records!

(Sub-)Linear runtime is what we strive for

  • this means cutting corners: good enough is good enough
  • often the search space is so complicated that there is no guarantee: hopefully good enough is good enough, hopefully
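To put the cubic-runtime remark on this slide into numbers, the following back-of-the-envelope sketch compares a cubic and a linear algorithm on 10M records. The rate of one elementary operation per nanosecond is an assumption for illustration, and constant factors are ignored.

```python
# Rough cost of an O(n^3) versus an O(n) algorithm on n = 10M records.
# Assumption: 10^9 elementary operations per second, constants ignored.
n = 10_000_000
ops_per_second = 10**9
seconds_per_year = 365 * 24 * 60 * 60

cubic_years = n**3 / ops_per_second / seconds_per_year
linear_seconds = n / ops_per_second

print(f"O(n^3): about {cubic_years:,.0f} years")
print(f"O(n):   about {linear_seconds:.2f} seconds")
```

Under these assumptions the cubic algorithm needs on the order of tens of thousands of years, while the linear one finishes in a fraction of a second.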

slide-19
SLIDE 19

Sampling from Static Data

Can we trim Big Data to Reasonably Sized Data?

Without bias

  • by sampling uniformly, every row has the same probability
  • often without replacement: duplicate rows may be a nuisance

With bias to recent records

  • by sampling with exponential decay, $p(X) \propto e^{-\lambda \cdot \delta}$
  • where $\lambda$ is the decay rate, and $\delta$ the age of element $X$

With bias to certain (e.g. rare) classes

  • by stratified sampling, often uniform with a probability per class
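The three sampling schemes on this slide can be sketched in a few lines of Python, assuming the data fits in an in-memory list. The function names and toy data are hypothetical. The decay-biased variant draws with replacement through `random.choices`, which is a simplification rather than something the slide prescribes.

```python
import math
import random


def uniform_sample(rows, size, rng):
    """Uniform sample without replacement: every row is equally likely."""
    return rng.sample(rows, size)


def decay_biased_sample(rows, ages, decay_rate, size, rng):
    """Sample with replacement, weighting each row by exp(-decay_rate * age),
    so that recent rows (small age) are favoured."""
    weights = [math.exp(-decay_rate * age) for age in ages]
    return rng.choices(rows, weights=weights, k=size)


def stratified_sample(rows, labels, keep_prob, rng):
    """Keep each row independently with a probability chosen per class."""
    return [row for row, label in zip(rows, labels)
            if rng.random() < keep_prob[label]]


rng = random.Random(0)
rows = list(range(1000))
ages = [999 - r for r in rows]  # hypothetical: row 999 is the newest
labels = ["rare" if r % 100 == 0 else "common" for r in rows]

print(uniform_sample(rows, 5, rng))
print(decay_biased_sample(rows, ages, decay_rate=0.01, size=5, rng=rng))
print(stratified_sample(rows, labels, {"rare": 1.0, "common": 0.01}, rng))
```

With a keep probability of 1.0 for the rare class, every rare row survives while the common class is thinned to roughly 1%.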

slide-20
SLIDE 20

How much should we ask for?

How much data should we sample?

depends on the sample complexity of your problem space

No Free Lunch Theorem: the number of samples needed for error $\epsilon$ depends on the actual distribution $p$ of the data, and there always exists some $p$ with arbitrarily high sample complexity

Vapnik-Chervonenkis (VC) dimensionality and Rademacher complexity instead show how rich a set of hypotheses $\mathcal{H}$ is for your data. Promising, but often difficult to use in practice.

So, for many practical problems, we simply don’t know, and just sample as much as we can handle

slide-21
SLIDE 21

Streaming Data

Lots of data comes in over time, as a data stream

  • e.g. sensor networks, telemetry data, CERN

Often, more data comes in than we can/want to store

  • to analyse this data, we need specialised algorithms that have a memory complexity $m \ll |S|$

Static databases are also streams

  • streaming data is simply non-random access, e.g. we allow only one pass (or $n$) over your data
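As a small illustration of what a one-pass algorithm with memory far below the stream size can look like, here is a sketch of a running mean that stores only two numbers. The running mean is an example chosen here, not one from the slides, and the function name is hypothetical.

```python
from typing import Iterable, Tuple


def running_mean_and_count(stream: Iterable[float]) -> Tuple[float, int]:
    """Make one pass over the stream, keeping only two numbers in memory."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental mean update
    return mean, count


# A generator stands in for a stream: elements are produced one at a time
# and are never stored as a whole.
readings = (float(i % 10) for i in range(100_000))
print(running_mean_and_count(readings))
```

The same function works unchanged on a list or a file iterator, which reflects the point that a static database can be treated as a stream.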

How can we sample from a stream?

  • without bias?

slide-22
SLIDE 22

Sampling from Streams

How can we get a uniform sample $R$ of $k$ elements over a stream $S$?

  • that is, how do we make sure that after $n$ elements of $S$, each of those has the same probability to be in $R$?

Reservoir Sampling, The Key Idea:

  • initialise reservoir $R$ with the first $k$ elements of $S$
  • insert the $n$th element into $R$ with probability $k/n$
  • if successful, remove one of the $k$ old points uniformly at random
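The three steps of the key idea can be sketched in Python as follows, with `k` the reservoir size and `n` the number of stream elements seen so far. This is a minimal sketch: the function name and the explicit `rng` parameter are choices made here, not part of the slide.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Return a uniform sample of k elements from a stream in one pass."""
    reservoir: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k elements.
            reservoir.append(item)
        elif rng.random() < k / n:
            # Accept the n-th element with probability k/n and evict
            # one of the k current members uniformly at random.
            reservoir[rng.randrange(k)] = item
    return reservoir


sample = reservoir_sample(range(100_000), k=5, rng=random.Random(42))
print(sample)
```

Memory use is fixed at `k` elements regardless of the stream length, and the stream is read exactly once. If the stream has fewer than `k` elements, all of them are returned.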

Now, every element of $S$ has the probability $k/n$ to be in $R$ (!)
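One way to see why this probability holds is induction on the number of elements seen. The following is the standard argument, sketched here for convenience rather than taken from the slide, with $k$ the reservoir size, $n$ the number of elements seen so far, and $R$ the reservoir.

```latex
% Claim: after n >= k elements, each element seen so far is in R with probability k/n.
% Base case (n = k): all k elements are in R, so the probability is k/k = 1.
% Inductive step (n-1 -> n), assuming the claim holds after n-1 elements:
\begin{align*}
\Pr[\text{element } n \in R] &= \frac{k}{n}, \\
\Pr[\text{older element} \in R]
  &= \underbrace{\frac{k}{n-1}}_{\text{in } R \text{ before step } n}
     \cdot
     \underbrace{\left(1 - \frac{k}{n}\cdot\frac{1}{k}\right)}_{\text{not evicted at step } n}
   = \frac{k}{n-1}\cdot\frac{n-1}{n}
   = \frac{k}{n}.
\end{align*}
```

An older element is evicted only if the new element is accepted (probability $k/n$) and that particular slot is chosen (probability $1/k$), which gives the factor $1 - 1/n$.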

(Aggarwal Ch. 2.4.1)

slide-23
SLIDE 23

Conclusions

We’re collecting more and more data

  • most of it is boring; how do we find out what part is interesting?

Scientific Method

  • form hypothesis, collect data, test hypothesis

Data Mining

  • collect data first, ask questions later; let the computer find what (interesting) hypotheses hold in it

Efficiency is very important

  • a good answer now is much better than the perfect answer when we’re all dead.

slide-24
SLIDE 24

Thank you!
