Data Mining: What is it?
Ch. 1
I-DM-1 IRDM ‘15/16

What is Data Mining?
“Data mining is the process of extracting hidden patterns from data.”
“An Unethical Econometric practice of massaging and manipulating the data to obtain the desired results.”
“Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.”
“Data mining, in a broad sense, is the set of techniques for analyzing and understanding data.”


The KDD (Knowledge Discovery from Data) Process
[Figure: Input data → Data pre-processing → Data mining → Post-processing → Information]
Data pre-processing: normalisation, dimensionality reduction, feature selection, handling missing values
Post-processing: filtering patterns, visualisation, pattern interpretation
Interactive data mining

Data Mining vs. Information Retrieval
IR is answering questions the user asked.
DM is answering questions the user didn’t ask.
“Show me the web pages relevant to this query” vs. “Show me the interesting patterns in the contents of these web pages”
Vague problem…
How to define interestingness?
How to evaluate results?

Data Mining’s position in Science
Data mining uses statistics to infer from data
  → is data mining just a fancy name for statistics?
Data mining uses methods to learn unseen patterns
  → is data mining just a boring name for machine learning?
Is data mining voodoo science?

Why Data Mining?

Why Data Mining?
The “PHT” Pirate wanted all information of the world. But before he realized most of it was useless, he was already buried under it.
(Stanisław Lem, The Cyberiad)

Big Data, Bigger Data, Biggest Data


Data, data, data
1 250 000 transactions per hour
≈ 5GB of climate data
350 000 000 photos per day
500 000 000 tweets per day
To use this data, we need tools to analyse and understand it. We need data mining.

Data Mining Applications
Business Intelligence

Shopping Data Which products are often bought together?

Train Delays Which trains are delayed because of other trains?

Drug Discovery What part of the molecule makes the drug work?

Data Mining Applications
Business Intelligence
  → what do customers buy together?
  → what are the seasonal trends?
Scientific Data Analysis
  → what genes cause diseases?
  → what are the differences between languages?
And anything else where you have data…
  → who should Hillary Clinton try to persuade to vote?
  → is everything alright with my space object?

Faster! Do it faster!
Nobody likes exponential time algorithms. In data mining, we don’t even like polynomial time
  → Your solution is $O(n^3)$? Great… my data is only 10M records!
(Sub-)Linear runtime is what we strive for
  → this means cutting corners: good enough is good enough
  → often the search space is so complicated there is no guarantee: hopefully good enough is good enough, hopefully
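To make the 10M-record remark concrete, here is a rough back-of-the-envelope sketch. The rate of one billion basic operations per second is an assumption chosen for illustration, not a figure from the slides, and `runtime_seconds` is a hypothetical helper.

```python
def runtime_seconds(n_records: int, exponent: int, ops_per_second: float = 1e9) -> float:
    """Estimate seconds for an algorithm doing n_records**exponent basic operations.

    ops_per_second = 1e9 is an assumed machine speed, for illustration only.
    """
    return n_records ** exponent / ops_per_second


SECONDS_PER_YEAR = 365 * 24 * 3600

n = 10_000_000  # the "only 10M records" from the slide
linear_seconds = runtime_seconds(n, 1)
cubic_years = runtime_seconds(n, 3) / SECONDS_PER_YEAR

print(f"linear: {linear_seconds:.2f} s, cubic: {cubic_years:,.0f} years")
```

Under that assumed speed, a linear pass finishes in a fraction of a second, while a cubic algorithm would need tens of thousands of years.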

Sampling from Static Data
Can we trim Big Data to Reasonably Sized Data?
Without bias
  → by sampling uniformly, every row has the same probability
  → often without replacement: duplicate rows may be a nuisance
With bias to recent records
  → by sampling with exponential decay, $p(X) \propto e^{-\lambda \cdot \delta t}$
  → where $\lambda$ is the decay rate, and $\delta t$ the age of element $X$
With bias to certain (e.g. rare) classes
  → by stratified sampling, often uniform with a probability per class
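The three sampling schemes on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names, the `ages` list and the `keep_prob` mapping are hypothetical, and the decay-biased variant samples with replacement purely to keep the sketch short.

```python
import math
import random


def uniform_sample(rows, k, rng):
    """Unbiased: every row is equally likely; sampled without replacement."""
    return rng.sample(rows, k)


def decay_biased_sample(rows, ages, k, decay_rate, rng):
    """Biased to recent rows: weight of a row is exp(-decay_rate * age).

    Assumption: sampling WITH replacement, for simplicity.
    """
    weights = [math.exp(-decay_rate * age) for age in ages]
    return rng.choices(rows, weights=weights, k=k)


def stratified_sample(rows, labels, keep_prob, rng):
    """Biased to chosen classes: keep each row with its class's probability."""
    return [row for row, label in zip(rows, labels)
            if rng.random() < keep_prob[label]]


rng = random.Random(0)
rows = list(range(1000))
ages = [999 - r for r in rows]  # row 999 is the newest (age 0)
labels = ["rare" if r % 100 == 0 else "common" for r in rows]

uniform = uniform_sample(rows, 50, rng)
recent = decay_biased_sample(rows, ages, 50, 0.05, rng)
strat = stratified_sample(rows, labels, {"rare": 1.0, "common": 0.05}, rng)
```

With these toy inputs, `recent` concentrates on the newest rows, and `strat` keeps every rare row while thinning out the common ones.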

How much should we ask for?
How much data should we sample?
Depends on the sample complexity of your problem space
  → No Free Lunch Theorem: the number of samples needed for error $\epsilon$ depends on the actual distribution $p$ of the data, and there always exists some $p$ with arbitrarily high sample complexity
Vapnik-Chervonenkis (VC) dimensionality and Rademacher complexity instead show how rich a set of hypotheses $\mathcal{H}$ is for your data. Promising, but often difficult to use in practice.
So, for many practical problems, we simply don’t know, and just sample as much as we can handle

Streaming Data
Lots of data comes in over time, as a data stream
  → e.g. sensor networks, telemetry data, CERN
Often, more data comes in than we can/want to store
  → to analyse this data, we need specialised algorithms that have a memory complexity $m \ll |S|$
Static databases are also streams
  → streaming data is simply non-random access, e.g. we allow only one pass (or $n$ passes) over your data
How can we sample from a stream?
  → without bias?

Sampling from Streams
How can we get a uniform sample $R$ of $k$ elements over a stream $S$?
  → that is, how do we make sure that after $n$ elements of $S$, each of those has the same probability to be in $R$?
Reservoir Sampling, The Key Idea:
  → initialise reservoir $R$ with the first $k$ elements of $S$
  → insert the $n$-th element into $R$ with probability $k/n$
  → if successful, remove one of the $k$ old points uniformly at random
Now, every element of $S$ has the probability $k/n$ to be in $R$ (!)
(Aggarwal Ch. 2.4.1)
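The reservoir sampling steps above can be sketched as follows. This is a minimal single-pass illustration; the function name `reservoir_sample` and its parameters (`stream`, `k` for the reservoir size, `rng`) are chosen here and do not come from the slides.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Keep a uniform sample of up to k items from a stream, in one pass."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for count, item in enumerate(stream, start=1):
        if count <= k:
            # Initialise the reservoir with the first k items.
            reservoir.append(item)
        elif rng.random() < k / count:
            # Accept the new item with probability k / count and
            # evict one of the k current items uniformly at random.
            reservoir[rng.randrange(k)] = item
    return reservoir


sample = reservoir_sample(range(10_000), 10, random.Random(42))
```

Only the reservoir is held in memory, so the memory use does not depend on the stream length. The inclusion probability follows by induction: an item already in the reservoir survives the arrival of item number `count` with probability `1 - (k / count) * (1 / k) = (count - 1) / count`, which turns `k / (count - 1)` into `k / count`.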

Conclusions
We’re collecting more and more data
  → most of it is boring: how to find out what part is interesting?
Scientific Method
  → form hypothesis, collect data, test hypothesis
Data Mining
  → collect data first, ask questions later; let the computer find what (interesting) hypotheses hold in it
Efficiency is very important
  → a good answer now is much better than the perfect answer when we’re all dead.

Thank you!
