Redescription Mining, Pauli Miettinen, 17 November 2010

SLIDE 1

Redescription Mining

Pauli Miettinen 17 November 2010

SLIDE 2

An Example

Pauli Miettinen 17 Nov 2010

VLDB ∧ ICDM ∧ SDM ∧ SIGMOD ⇔ (J. Han ∧ P. S. Yu) ∨ C.-R. Lin ∨ S. Lonardi

SLIDE 3

[Diagram labels: Authors, Conferences, Co-Authors]

VLDB ∧ ICDM ∧ SDM ∧ SIGMOD ⇔ (J. Han ∧ P. S. Yu) ∨ C.-R. Lin ∨ S. Lonardi

SLIDE 4

[Diagram labels: Authors, Conferences, Co-Authors]

VLDB ∧ ICDM ∧ SDM ∧ SIGMOD ⇔ (J. Han ∧ P. S. Yu) ∨ C.-R. Lin ∨ S. Lonardi

Dimitrios Gunopulos, Charu C. Aggarwal, Philip S. Yu, Eamonn J. Keogh, ...

SLIDE 5

Definitions

SLIDE 6

The Definitions

  • Redescription. Given two data sets with a bijection between the rows, a redescription is a pair of queries (Q1, Q2) over the columns such that (Q1, Q2) satisfies certain constraints and supp(Q1) ≈ supp(Q2).
  • Redescription mining. Given the data sets as above, find the (k) best redescriptions.

SLIDE 7

More Concrete Definition

  • Data sets: Boolean
  • Queries: Arbitrary Boolean formulae
  • Similarity function: Jaccard, |supp(Q1) ∩ supp(Q2)| / |supp(Q1) ∪ supp(Q2)|
  • Constraints: minimum support σmin, maximum support σmax, minimum similarity Jmin, maximum size of formula k, maximum p-value pmax (more on this later)
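The definitions above can be illustrated with a minimal sketch for Boolean data. The toy rows, the column names A and B, and the helper names supp, jaccard and satisfies_constraints are all hypothetical and not from the talk. Applying the support bounds to the size of the shared support is also an assumption, since the slides do not say which support the bounds refer to.

```python
from typing import Callable, Dict, List, Set

Row = Dict[str, bool]


def supp(query: Callable[[Row], bool], rows: List[Row]) -> Set[int]:
    """Return the indices of the rows on which the query is true."""
    return {i for i, row in enumerate(rows) if query(row)}


def jaccard(s1: Set[int], s2: Set[int]) -> float:
    """Jaccard similarity |s1 ∩ s2| / |s1 ∪ s2| (0.0 if both are empty)."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


def satisfies_constraints(s1: Set[int], s2: Set[int],
                          sigma_min: int, sigma_max: int, j_min: float) -> bool:
    """Assumption: the support bounds apply to the shared support |s1 ∩ s2|."""
    shared = len(s1 & s2)
    return sigma_min <= shared <= sigma_max and jaccard(s1, s2) >= j_min


# Two toy data sets; row i of `left` and row i of `right` describe the same entity.
left = [
    {"VLDB": True, "ICDM": True},
    {"VLDB": True, "ICDM": False},
    {"VLDB": True, "ICDM": True},
    {"VLDB": False, "ICDM": False},
]
right = [
    {"A": True, "B": False},
    {"A": False, "B": False},
    {"A": False, "B": True},
    {"A": False, "B": True},
]

s1 = supp(lambda r: r["VLDB"] and r["ICDM"], left)   # rows 0 and 2
s2 = supp(lambda r: r["A"] or r["B"], right)         # rows 0, 2 and 3
j = jaccard(s1, s2)                                  # 2 shared rows out of 3
```

Here the two queries agree on two of the three rows that either of them covers, so the pair would pass a similarity threshold of, say, Jmin = 0.6.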

SLIDE 8

Special Cases

  • Only conjunctive queries
      • “Bi-directional” association rule mining: Q1 ⇒ Q2 and Q2 ⇒ Q1
  • One query given
      • Classification task
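The link to association rules can be made concrete with a small sketch. The confidence of a rule is the standard association-rule measure and is not defined on the slides; the support sets below are made up. When both confidences are non-zero, the Jaccard similarity is determined by the confidences of the two directions: J = 1 / (1/c12 + 1/c21 − 1).

```python
from typing import Set


def confidence(s_from: Set[int], s_to: Set[int]) -> float:
    """Confidence of the rule s_from => s_to: |s_from ∩ s_to| / |s_from|."""
    return len(s_from & s_to) / len(s_from) if s_from else 0.0


# Hypothetical supports of two conjunctive queries Q1 and Q2.
s1 = {0, 1, 2, 3}
s2 = {2, 3, 4}

c12 = confidence(s1, s2)             # Q1 => Q2
c21 = confidence(s2, s1)             # Q2 => Q1
j = len(s1 & s2) / len(s1 | s2)      # Jaccard similarity of the supports

# Jaccard recovered from the two confidences (valid when both are non-zero).
j_from_conf = 1.0 / (1.0 / c12 + 1.0 / c21 - 1.0)
```

A high Jaccard similarity therefore requires both rule directions to have high confidence, which is what makes the rule "bi-directional".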

SLIDE 9

p-values

  • On-line: assuming independence, what is the probability of the observed support intersection size given the support sizes?
      • Binomial distribution
  • Off-line: what is the (empirical) probability of finding redescriptions that are as good, given the column and row margins?
      • Swap randomization
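Both kinds of p-value can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure used in the talk: the on-line value is taken to be the upper tail of a binomial distribution with success probability (|supp(Q1)|/n)·(|supp(Q2)|/n), and the swap step is the usual margin-preserving swap on a 0/1 matrix. The function names and the toy matrix are hypothetical.

```python
import math
import random
from typing import List


def binomial_p_value(n: int, size1: int, size2: int, overlap: int) -> float:
    """P(X >= overlap) for X ~ Binomial(n, p), with p = (size1/n) * (size2/n).

    Assumption: under independence, a row falls in both supports with
    probability p, and the p-value is the upper tail of the distribution.
    """
    p = (size1 / n) * (size2 / n)
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(overlap, n + 1))


def swap_randomize(matrix: List[List[int]], attempts: int,
                   rng: random.Random) -> List[List[int]]:
    """Return a randomized copy of a 0/1 matrix with the same row and column sums."""
    m = [row[:] for row in matrix]
    ones = [(r, c) for r, row in enumerate(m) for c, v in enumerate(row) if v]
    for _ in range(attempts):
        if len(ones) < 2:
            break
        i, j = rng.sample(range(len(ones)), 2)
        (r1, c1), (r2, c2) = ones[i], ones[j]
        # Swap only if the two opposite corners are empty; this keeps margins fixed.
        if m[r1][c2] == 0 and m[r2][c1] == 0:
            m[r1][c1] = m[r2][c2] = 0
            m[r1][c2] = m[r2][c1] = 1
            ones[i], ones[j] = (r1, c2), (r2, c1)
    return m


p_value = binomial_p_value(n=4, size1=2, size2=2, overlap=2)

original = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
shuffled = swap_randomize(original, attempts=100, rng=random.Random(0))
```

For the off-line test one would, under the same assumptions, mine redescriptions from many such randomized copies and report the fraction of copies that yield a redescription at least as good as the observed one.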

SLIDE 10

Algorithms

SLIDE 11

Some Algorithms

  • CARTwheels [Ramakrishnan et al. 2004, Kumar 2007]

  • Greedy [Gallo, M. & Mannila 2008]
  • Both are for Boolean data
  • Neither finds arbitrary Boolean formulae

SLIDE 12

Generalizations

SLIDE 13

Non-Boolean Data

  • Queries of type x1 ∈ [−0.2, 1.3] ∨ x2 ∈ (−∞, 20]
  • “Standard” way: binarize data via bucketing
      • Allows using existing algorithms
      • Has many problems
  • Bucketing can be done on the fly [Galbrun & M., submitted]
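To make the two options concrete, the sketch below evaluates the interval query from this slide directly on numerical rows, and also shows a naive fixed-edge bucketing that turns one numerical column into Boolean columns. The data values, the bucket edges and the helper names are hypothetical; the on-the-fly bucketing of the cited work is not reproduced here.

```python
import math
from typing import Dict, List


def in_interval(x: float, lo: float, hi: float) -> bool:
    """True if lo <= x <= hi; use -math.inf or math.inf for an open-ended side."""
    return lo <= x <= hi


def query(row: Dict[str, float]) -> bool:
    """The query x1 ∈ [−0.2, 1.3] ∨ x2 ∈ (−∞, 20] from the slide."""
    return in_interval(row["x1"], -0.2, 1.3) or in_interval(row["x2"], -math.inf, 20.0)


def binarize(values: List[float], edges: List[float]) -> List[List[bool]]:
    """One Boolean column per bucket [edges[i], edges[i+1]); the last bucket is closed."""
    last = len(edges) - 2
    return [
        [edges[i] <= v < edges[i + 1] or (i == last and v == edges[-1])
         for i in range(len(edges) - 1)]
        for v in values
    ]


rows = [
    {"x1": 0.0, "x2": 50.0},   # x1 is inside its interval
    {"x1": 2.0, "x2": 10.0},   # x2 is inside its interval
    {"x1": 2.0, "x2": 30.0},   # neither condition holds
]
matches = [query(r) for r in rows]

buckets = binarize([0.5, 5.0, 10.0], edges=[0.0, 5.0, 10.0])
```

One likely source of the "many problems" is visible here: with edges fixed in advance, an interval such as [−0.2, 1.3] can only be expressed if its end points happen to coincide with bucket edges.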

SLIDE 14

Example: Bioclimatic Niche Finding

  • Data: (1) presence/absence data for mammals in Europe; (2) climatic data (temperature and rainfall)
  • Question: find a description over climatic variables that describes the area inhabited by (a group of) mammals (and vice versa)

SLIDE 15

Bioclimatic Niche Finding: Background

  • A.k.a. bioclimatic envelope finding
  • Has been done for a long time by biologists
  • Only single, hand-selected species
  • Methods used include regression, neural networks, and genetic algorithms

  • Niche: realized niche in Grinnellian sense

SLIDE 16

Niche Finding: Our Contributions

  • Automate niche finding
  • Easy-to-understand method (in contrast to genetic algorithms and neural networks)
  • Allow for more complex sets of species
  • Can be generalized from species to traits
      • Traits are more stable on a palaeontological scale

SLIDE 17

Niche Finding: Example Results

European Elk ⇔ ([−9.80 ≤ tmax(Feb) ≤ 0.40] ∧ [12.20 ≤ tmax(Jul) ≤ 24.60] ∧ [56.852 ≤ pavg(Aug) ≤ 136.46]) ∨ [183.27 ≤ pavg(Sep) ≤ 238.78]
Jaccard = 0.814; support = 582

Wood Mouse ∧ Natterer’s Bat ∧ Eurasian Pygmy Shrew ⇔ ([3.20 ≤ tmax(Mar) ≤ 14.50] ∧ [17.30 ≤ tmax(Aug) ≤ 25.20] ∧ [14.90 ≤ tmax(Sep) ≤ 22.80]) ∨ [19.60 ≤ tavg(Jul) ≤ 19.956]
Jaccard = 0.623; support = 681

SLIDE 18

European Elk ⇔ ([−9.80 ≤ tmax(Feb) ≤ 0.40] ∧ [12.20 ≤ tmax(Jul) ≤ 24.60] ∧ [56.852 ≤ pavg(Aug) ≤ 136.46]) ∨ [183.27 ≤ pavg(Sep) ≤ 238.78]

SLIDE 20

Wood Mouse ∧ Natterer’s Bat ∧ Eurasian Pygmy Shrew ⇔ ([3.20 ≤ tmax(Mar) ≤ 14.50] ∧ [17.30 ≤ tmax(Aug) ≤ 25.20] ∧ [14.90 ≤ tmax(Sep) ≤ 22.80]) ∨ [19.60 ≤ tavg(Jul) ≤ 19.956]

SLIDE 21

Discussion

SLIDE 22

Pattern Mining or Subgroup Discovery?

PM                 SD
Binary data        Numerical data
Unsupervised       Supervised
Frequency          Interestingness
Exhaustive         Heuristic
Reconstructive     Descriptive

SLIDE 24

Conclusions

  • Redescription mining is a promising research direction
  • (SD ∩ PM) ∩ RDM ≠ ∅
  • Still a new direction
  • There are nails for this hammer

SLIDE 25

CARTwheels

  • Grows two classification and regression trees (CARTs)
  • Fix one tree and grow the other to match; alternate
  • Leaves are matched and paths are the descriptions:

(ICDM) ∨ (¬ICDM ∧ ¬STOC) ⇔ (C. Olston ∧ ¬C. Chekuri) ∨ (¬C. Olston ∧ ¬A. Wigderson)
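How paths turn into descriptions can be shown with a small sketch that reads the left-hand formula off a decision tree. The tree shape below is inferred from the formula and is an assumption, as are the class and function names. Tree growing and the alternation between the two trees are not covered.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Leaf:
    positive: bool  # whether the leaf belongs to the matched (positive) class


@dataclass
class Node:
    variable: str
    yes: "Tree"     # subtree followed when the variable is true
    no: "Tree"      # subtree followed when the variable is false


Tree = Union[Leaf, Node]


def positive_paths(tree: Tree, prefix: Tuple[str, ...] = ()) -> List[List[str]]:
    """Collect the literals along every root-to-leaf path that ends in a positive leaf."""
    if isinstance(tree, Leaf):
        return [list(prefix)] if tree.positive else []
    return (positive_paths(tree.yes, prefix + (tree.variable,))
            + positive_paths(tree.no, prefix + ("¬" + tree.variable,)))


def to_formula(paths: List[List[str]]) -> str:
    """Each path is a conjunction; the description is the disjunction of the paths."""
    return " ∨ ".join("(" + " ∧ ".join(path) + ")" for path in paths)


# Assumed tree: split on ICDM first; on the "no" branch, split on STOC.
left_tree = Node("ICDM", Leaf(True), Node("STOC", Leaf(False), Leaf(True)))
formula = to_formula(positive_paths(left_tree))
```

Because every description is a disjunction of root-to-leaf paths, the formulae have a restricted shape, which fits the earlier remark that the method does not find arbitrary Boolean formulae.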

SLIDE 26

(ICDM) ∨ (¬ICDM ∧ ¬STOC) ⇔ (C. Olston ∧ ¬C. Chekuri) ∨ (¬C. Olston ∧ ¬A. Wigderson)

[Figure: two decision trees with Yes/No branches. One tree splits on ICDM and STOC; the other splits on C. Olston, C. Chekuri and A. Wigderson.]

SLIDE 27

Greedy

  • Grows formulae in a greedy fashion using beam search
  • Prunes the search space as if monotonicity held
      • If adding a variable does not help now, it will not help later, either
      • False in general
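The idea can be sketched as a toy beam search over column supports. This is not the algorithm of the cited paper: the candidate representation, the ∧/∨ extension step, the parameters beam_width and max_vars, and the toy columns are all assumptions made for illustration. Extensions that do not improve the Jaccard similarity right away are discarded, which is the "as if monotonicity held" pruning.

```python
from itertools import product
from typing import Dict, FrozenSet, Iterator, Set, Tuple

Columns = Dict[str, Set[int]]             # variable name -> rows where it is true
Query = Tuple[str, Set[int], FrozenSet[str]]   # (formula text, support, variables used)
Pair = Tuple[Query, Query]


def jaccard(a: Set[int], b: Set[int]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score(pair: Pair) -> float:
    return jaccard(pair[0][1], pair[1][1])


def extensions(q: Query, columns: Columns) -> Iterator[Query]:
    """Extend a query by one unused variable, joined with ∧ or with ∨."""
    text, support, used = q
    for name, col in columns.items():
        if name not in used:
            yield (f"({text} ∧ {name})", support & col, used | {name})
            yield (f"({text} ∨ {name})", support | col, used | {name})


def greedy_beam(left: Columns, right: Columns,
                beam_width: int = 2, max_vars: int = 3) -> Pair:
    """Assumes both column sets are non-empty. Returns the best pair found."""
    start = [((n1, s1, frozenset([n1])), (n2, s2, frozenset([n2])))
             for (n1, s1), (n2, s2) in product(left.items(), right.items())]
    beam = sorted(start, key=score, reverse=True)[:beam_width]
    best = beam[0]
    while True:
        candidates = []
        for q1, q2 in beam:
            base = score((q1, q2))
            # Keep an extension only if it improves the score immediately.
            if len(q1[2]) < max_vars:
                candidates += [(e, q2) for e in extensions(q1, left)
                               if jaccard(e[1], q2[1]) > base]
            if len(q2[2]) < max_vars:
                candidates += [(q1, e) for e in extensions(q2, right)
                               if jaccard(q1[1], e[1]) > base]
        if not candidates:
            return best
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
        if score(beam[0]) > score(best):
            best = beam[0]


left_cols = {"a": {0, 1, 2, 3}}
right_cols = {"x": {0, 1}, "y": {2, 3}}
best_pair = greedy_beam(left_cols, right_cols)
```

On these toy columns the search starts from a ⇔ x with similarity 0.5 and reaches a pair with similarity 1.0 by adding y with a disjunction. On other inputs the immediate-improvement rule can discard a variable that would only pay off together with a later one, which is why the monotonicity assumption is false in general.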
