Information Retrieval & Data Mining



SLIDE 1

18 October 2011 IR&DM, WS'11/12

Nine Credit-Point Core Lecture on
Information Retrieval & Data Mining

Tuesday 14–16 & Thursday 16–18 @ HS003 (E1.3)
Martin Theobald, Pauli Miettinen

Data Mining is…

  • About finding new and interesting information from data
  • Association rules
  • Clusterings
  • Latent models
  • Classifiers

SLIDE 2

Data Mining — motivation

What to do with the information you’ve retrieved?

The “PHT” Pirate wanted all information of the world. But before he realized most of it was useless, he was already buried under it.
(Stanisław Lem, The Cyberiad)

SLIDE 3

Data Mining — definition

Data mining is the process of extracting hidden patterns from data.
(Wikipedia)

Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.
(Hand, Mannila & Smyth: Principles of Data Mining)

Data mining, in a broad sense, is the set of techniques for analyzing and understanding data.
(Zaki & Meira: Fundamentals of Data Mining Algorithms)

SLIDE 16

Data Mining Applications

  • Business intelligence
      – What do customers buy together?
      – What are the seasonal trends?
      – How to make more money?
  • Scientific data analysis
      – Which genes cause diseases?
      – Which species co-inhabit areas?
      – What happens if the average temperature rises?
  • And anything else where you have data…
      – Whom should Barack Obama persuade to vote for him?
      – Is there a problem on the International Space Station?

SLIDE 18

What do you need to do data mining?

  • Data
  • Domain knowledge
  • Data mining techniques (covered in this course)

SLIDE 19

The Techniques

  • Frequent itemset mining & association rules
  • Clustering
  • Dimensionality reduction
  • Matrix factorization & latent factor models
  • Classifiers

SLIDE 20

Frequent itemset mining demo
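The demo itself is not part of this transcript. As a stand-in, here is a minimal Python sketch of what frequent itemset mining and association rules compute, under stated assumptions: the shopping baskets, the thresholds, and the function names `frequent_itemsets` and `association_rules` are all hypothetical, and the counting is brute force rather than a pruned search such as Apriori.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: count} for itemsets found in at least min_support transactions.

    Brute-force counting of every itemset up to max_size items;
    fine for a toy example, too slow for real data.
    """
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # sorted, so each itemset has one canonical tuple
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


def association_rules(frequent, min_confidence):
    """Return rules (antecedent, consequent, confidence) with a single-item consequent."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = tuple(i for i in itemset if i != consequent)
            # Every subset of a frequent itemset is frequent, so this lookup succeeds.
            confidence = count / frequent[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, confidence))
    return rules


# Hypothetical baskets, invented for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["beer", "bread"],
    ["butter", "milk"],
    ["bread", "butter", "milk"],
]

frequent = frequent_itemsets(baskets, min_support=3)
rules = association_rules(frequent, min_confidence=0.8)
for antecedent, consequent, confidence in rules:
    print(f"{antecedent} -> {consequent} (confidence {confidence:.2f})")
```

With these toy baskets, the only rule that clears both thresholds says that baskets containing butter also contain milk. This is the kind of answer the "What do customers buy together?" question is after.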

SLIDE 21

Clustering for Medical Data

  • Temperament data
      – Individuals are assigned values on different scales
          • Fear of uncertainty, shyness, impulsiveness, etc.
      – Data is clustered (people with similar value combinations go to the same cluster)
      – Results:
          • 4 clusters are enough
          • strong association between temperament and socio-economic status and education
          • males and females cluster similarly, even if clustered independently

Wessman: Clustering methods in the Analysis of Complex Diseases, manuscript
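The slide does not say which clustering method the study used. To illustrate the idea that people with similar value combinations end up in the same cluster, here is a minimal k-means sketch in plain Python. The choice of k-means, the two scales, the scores, and the naive "first k points" initialisation are all assumptions made for this example. It uses two clusters for brevity, whereas the study reports four.

```python
import math


def kmeans(points, k, iterations=100):
    """Lloyd's k-means on a list of equal-length tuples.

    Naive initialisation (assumption for this sketch): the first k points
    serve as the starting centroids.
    """
    centroids = list(points[:k])
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Hypothetical (shyness, impulsiveness) scores, invented for illustration.
people = [
    (1.0, 9.0),
    (9.0, 2.0),
    (2.0, 8.0),
    (8.0, 1.0),
    (1.5, 8.5),
    (9.0, 1.0),
]

assignment, centroids = kmeans(people, k=2)
for person, cluster in zip(people, assignment):
    print(person, "-> cluster", cluster)
```

Each centroid can then be read as a profile of its cluster, much like the per-cluster descriptions on the following slides.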

SLIDE 22

Clustering for Medical Data

Stable, persistent, not very impulsive
High socio-economic status and education

SLIDE 23

Clustering for Medical Data

Outgoing, impulsive, energetic
High socio-economic status and education

SLIDE 24

Clustering for Medical Data

No extreme scales
High hypomania and psychosis proneness

SLIDE 25

Clustering for Medical Data

Shy, pessimistic, prefer routines and privacy
Low socio-economic status, high levels of depression and schizophrenia

SLIDE 26

Ecological Niche Modeling

  • Goal: Describe the area species inhabit using bio-ecological variables
      – Temperature, rainfall, etc.
  • Application: Forecast what happens to species if the bio-ecological environment changes
      – Consequences of global warming
  • Data Mining Problem: classification
      – Classify the areas inhabited by species using the bio-ecological variables

SLIDE 27

Ecological Niche Modeling

European Elk

  • Either
      – February’s max temperature is between -9.8°C and 0.4°C
      – July’s max temperature is between 12.2°C and 24.6°C
      – August’s average rainfall is between 56.85 mm and 136.46 mm
  • Or
      – September’s average rainfall is between 183.27 mm and 238.78 mm

Galbrun & Miettinen: From Black and White to Full Colour: Extending Redescription Mining Outside the Boolean World. SDM ’11
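A rule like this one for the European Elk can be applied directly as a classifier over areas. The sketch below assumes that the three conditions under "Either" must all hold together, which the slide layout suggests but does not state. The field names and the sample areas are hypothetical, and only the numeric ranges come from the slide.

```python
def in_range(value, low, high):
    """True if value lies in the closed interval [low, high]."""
    return low <= value <= high


def elk_rule(area):
    """Apply the European Elk rule from the slide to one area's climate record.

    Assumption: the "Either" branch is a conjunction of its three conditions.
    """
    either_branch = (
        in_range(area["feb_max_temp_c"], -9.8, 0.4)
        and in_range(area["jul_max_temp_c"], 12.2, 24.6)
        and in_range(area["aug_avg_rain_mm"], 56.85, 136.46)
    )
    or_branch = in_range(area["sep_avg_rain_mm"], 183.27, 238.78)
    return either_branch or or_branch


# Hypothetical areas, invented for illustration only.
areas = {
    "cold_inland": {
        "feb_max_temp_c": -5.0, "jul_max_temp_c": 20.0,
        "aug_avg_rain_mm": 80.0, "sep_avg_rain_mm": 60.0,
    },
    "warm_dry": {
        "feb_max_temp_c": 8.0, "jul_max_temp_c": 28.0,
        "aug_avg_rain_mm": 30.0, "sep_avg_rain_mm": 50.0,
    },
    "mild_wet": {
        "feb_max_temp_c": 3.0, "jul_max_temp_c": 18.0,
        "aug_avg_rain_mm": 90.0, "sep_avg_rain_mm": 200.0,
    },
}

for name, climate in areas.items():
    print(name, "predicted elk habitat:", elk_rule(climate))
```

Forecasting under a changed environment then amounts to re-running the same rule on adjusted climate values, for example with every temperature raised by a fixed amount.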

SLIDE 28

Ecological Niche Modeling