slide-1
SLIDE 1

BOOLEAN MATRIX FACTORISATIONS IN DATA MINING (AND ELSEWHERE)

Pauli Miettinen 15 April 2013

slide-2
SLIDE 2

In the sleepy days when the provinces of France were still quietly provincial, matrices with Boolean entries were a favored occupation of aging professors at the universities of Bordeaux and Clermont-Ferrand. But one day…

Gian-Carlo Rota, foreword to Boolean Matrix Theory and Applications by K. H. Kim, 1982

slide-3
SLIDE 3

BACKGROUND

slide-4
SLIDE 4

FREQUENT ITEMSET MINING

A frequent itemset

slide-6
SLIDE 6

FREQUENT ITEMSET MINING

A frequent itemset Many frequent itemsets

slide-7
SLIDE 7

FREQUENT ITEMSET MINING

slide-9
SLIDE 9

TILING DATABASES

  • Goal: Find itemsets that cover the transaction data
  • Itemset I covers item i in transaction T if i ∈ I ⊆ T
  • Minimum tiling: Find the smallest number of tiles that cover all items in all transactions
  • Maximum k-tiling: Find k tiles that cover the maximum number of item–transaction pairs

  • If you have a set of tiles, these reduce to the Set Cover problem
  • F. Geerts et al., Tiling databases, in: DS '04, 77–122.
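
The covering rule and the Set Cover connection above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy transactions, the candidate itemsets and the function names `covered_cells` and `greedy_k_tiling` are invented here, and the greedy selection is the standard Set Cover heuristic, not necessarily the method of the cited paper.

```python
# Toy transaction data (invented for illustration).
transactions = [
    {"a", "b"},
    {"a", "b", "c"},
    {"b", "c"},
]

def covered_cells(itemset, transactions):
    """Item-transaction pairs (t, i) covered by `itemset`.

    Itemset I covers item i in transaction T when i is in I and I is a subset of T.
    """
    return {(t, i) for t, T in enumerate(transactions) if itemset <= T for i in itemset}

def greedy_k_tiling(candidates, transactions, k):
    """Pick up to k candidate tiles, each time taking the one covering most new cells."""
    chosen, covered = [], set()
    for _ in range(k):
        best = max(candidates, key=lambda I: len(covered_cells(I, transactions) - covered))
        gain = covered_cells(best, transactions) - covered
        if not gain:  # nothing new can be covered
            break
        chosen.append(best)
        covered |= gain
    return chosen, covered

# Hypothetical candidate tiles; in practice these would come from an itemset miner.
candidates = [frozenset("ab"), frozenset("bc"), frozenset("b")]
tiles, covered = greedy_k_tiling(candidates, transactions, k=2)
total = sum(len(T) for T in transactions)
print(tiles, len(covered), "of", total, "cells covered")
```

With these toy inputs two tiles are enough to cover every item in every transaction.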
slide-10
SLIDE 10

TILING AS A MATRIX FACTORISATION

( 1 1 0 )
( 1 1 1 )
( 0 1 1 )

slide-12
SLIDE 12

TILING AS A MATRIX FACTORISATION

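( 1 1 0 )   ( 1 0 )
( 1 1 1 ) = ( 1 1 ) × ( 1 1 0 )
( 0 1 1 )   ( 0 1 )   ( 0 1 1 )

With the ordinary product the two tiles overlap in the middle entry, which would become 1 + 1 = 2; the Boolean product ○ keeps it at 1.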

slide-15
SLIDE 15

TILING AS A MATRIX FACTORISATION

( 1 1 0 )   ( 1 0 )
( 1 1 1 ) = ( 1 1 ) ○ ( 1 1 0 )
( 0 1 1 )   ( 0 1 )   ( 0 1 1 )

slide-16
SLIDE 16

BOOLEAN PRODUCTS AND FACTORISATIONS

  • The Boolean matrix product of two binary matrices A and B is their matrix product over the Boolean semiring:

$(A \circ B)_{ij} = \bigvee_{k} a_{ik} b_{kj}$

  • The Boolean matrix factorisation of a binary matrix A expresses it as a Boolean product of two binary factor matrices B and C, that is, A = B○C
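
A minimal sketch of the Boolean product in plain Python, assuming matrices are stored as lists of 0/1 rows. The function name `boolean_product` and the small example matrices are chosen here for illustration.

```python
def boolean_product(B, C):
    """Boolean matrix product: entry (i, j) is the OR over k of (B[i][k] AND C[k][j])."""
    inner = len(C)
    assert all(len(row) == inner for row in B), "inner dimensions must agree"
    return [
        [int(any(B[i][k] and C[k][j] for k in range(inner))) for j in range(len(C[0]))]
        for i in range(len(B))
    ]

# Two small binary factors (illustrative values).
B = [[1, 0],
     [1, 1],
     [0, 1]]
C = [[1, 1, 0],
     [0, 1, 1]]

A = boolean_product(B, C)
for row in A:
    print(row)
```

The only difference from the ordinary product is that the sum is replaced by a logical OR, so overlapping contributions never exceed 1.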

slide-17
SLIDE 17

MATRIX RANKS

  • The (Schein) rank of a matrix A is the least number of rank-1 matrices whose sum is A
  • A = R_1 + R_2 + … + R_k
  • A matrix is rank-1 if it is an outer product of two vectors
  • The Boolean rank of a binary matrix A is the least number of binary rank-1 matrices whose element-wise OR is A
  • The least k such that A = B○C with B having k columns
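
For very small matrices the definition can be checked directly by exhaustive search. The sketch below is an assumption-laden illustration (the names `expressible` and `boolean_rank` are invented here, and the running time is exponential), not a practical algorithm.

```python
from itertools import product

def expressible(col, basis):
    """True if `col` is the element-wise OR of some of the basis vectors."""
    acc = [0] * len(col)
    for b in basis:
        # Only basis vectors with no 1 outside col's 1s can be used.
        if all(bi <= ci for bi, ci in zip(b, col)):
            acc = [x | y for x, y in zip(acc, b)]
    return tuple(acc) == tuple(col)

def boolean_rank(A):
    """Least k such that A = B o C with B having k columns (brute force)."""
    n, m = len(A), len(A[0])
    cols = [tuple(A[i][j] for i in range(n)) for j in range(m)]
    for k in range(min(n, m) + 1):
        # Every choice of k binary column vectors is a candidate B.
        for basis in product(product((0, 1), repeat=n), repeat=k):
            if all(expressible(col, basis) for col in cols):
                return k

A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
identity = [[1, 0, 0],
            [0, 1, 0],
            [0, 0, 1]]
print(boolean_rank(A), boolean_rank(identity))
```

The search always terminates by k = min(n, m), because the unit vectors (or the columns of A themselves) form a valid basis of that size.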
slide-18
SLIDE 18

THE MANY NAMES OF BOOLEAN RANK

  • Minimum tiling (data mining)
  • Rectangle covering number (communication complexity)
  • Minimum bi-clique edge covering number (Garey & Johnson GT18)
  • Minimum set basis (Garey & Johnson SP7)
  • Optimum key generation (cryptography)
  • Minimum set of roles (access control)
slide-19
SLIDE 19

COMPARISON OF RANKS

  • Boolean rank is NP-hard to compute
  • And as hard to approximate as the minimum clique partition
  • Boolean rank can be less than normal rank
  • rank_B(A) = O(log_2(rank(A))) for certain A
  • Boolean rank is never more than the non-negative rank

( 1 1 0 )
( 1 1 1 )
( 0 1 1 )
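
A small check that the Boolean rank can be below the normal rank. The 3-by-3 matrix is an illustrative example with seven 1s, the helper names are invented here, and the normal rank is computed with exact fractions to avoid any floating-point doubt.

```python
from fractions import Fraction

def real_rank(A):
    """Rank over the rationals via Gaussian elimination with exact fractions."""
    M = [[Fraction(x) for x in row] for row in A]
    rank = 0
    for col in range(len(M[0])):
        pivot = next((r for r in range(rank, len(M)) if M[r][col] != 0), None)
        if pivot is None:
            continue
        M[rank], M[pivot] = M[pivot], M[rank]
        for r in range(rank + 1, len(M)):
            factor = M[r][col] / M[rank][col]
            M[r] = [x - factor * y for x, y in zip(M[r], M[rank])]
        rank += 1
        if rank == len(M):
            break
    return rank

def boolean_product(B, C):
    """Matrix product where the sum is replaced by logical OR."""
    return [[int(any(b and c for b, c in zip(row, col))) for col in zip(*C)] for row in B]

A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
B = [[1, 0], [1, 1], [0, 1]]   # two columns, so Boolean rank is at most 2
C = [[1, 1, 0], [0, 1, 1]]

print("normal rank:", real_rank(A))
print("B o C == A:", boolean_product(B, C) == A)
```

Here the normal rank is 3 while a Boolean factorisation with two factors reproduces the matrix exactly.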

  

slide-20
SLIDE 20

APPROXIMATE FACTORISATIONS

  • Noise usually makes real-world matrices (almost) full rank
  • We want to find a good low-rank approximation
  • The goodness is measured using the Hamming distance
  • Given A and k, find B and C such that B has k columns and |A – B○C| is minimised

  • No easier than finding the Boolean rank
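
The objective |A – B○C| is simply the number of cells where the Boolean product disagrees with A. A minimal sketch, with invented helper names and illustrative matrices:

```python
def boolean_product(B, C):
    """Matrix product where the sum is replaced by logical OR."""
    return [[int(any(b and c for b, c in zip(row, col))) for col in zip(*C)] for row in B]

def hamming_error(A, B, C):
    """|A - B o C|: number of cells where the Boolean product differs from A."""
    approx = boolean_product(B, C)
    return sum(a != p for ra, rp in zip(A, approx) for a, p in zip(ra, rp))

A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]

# Two rank-1 approximations and one rank-2 factorisation (illustrative choices).
err_all_ones = hamming_error(A, [[1], [1], [1]], [[1, 1, 1]])
err_one_tile = hamming_error(A, [[1], [1], [0]], [[1, 1, 0]])
err_rank_two = hamming_error(A, [[1, 0], [1, 1], [0, 1]], [[1, 1, 0], [0, 1, 1]])
print(err_all_ones, err_one_tile, err_rank_two)
```

Note that the error counts both kinds of mistakes: 1s of A that are left uncovered and 0s of A that are covered.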
slide-22
SLIDE 22

THE BASIS USAGE PROBLEM

  • Finding the factorisation is hard even if we know one factor matrix
  • Problem. Given B and A, find X such that |A○X – B| is minimised
  • We can replace B and X with column vectors
  • |A○x – b| versus ||Ax – b||
  • Normal algebra: Moore–Penrose pseudo-inverse
  • Boolean algebra: no polylogarithmic approximation
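
For a single column the problem can be solved by trying every usage vector, which is feasible only for a handful of basis vectors. The sketch below assumes the basis is given as a list of 0/1 column tuples; the names `bool_combine`, `hamming` and `best_usage` are invented for this illustration.

```python
from itertools import product

def bool_combine(basis_cols, x):
    """OR together the basis columns selected by the 0/1 usage vector x."""
    out = [0] * len(basis_cols[0])
    for col, use in zip(basis_cols, x):
        if use:
            out = [o | c for o, c in zip(out, col)]
    return out

def hamming(u, v):
    """Number of positions where u and v differ."""
    return sum(a != b for a, b in zip(u, v))

def best_usage(basis_cols, target):
    """Try all 2^k usage vectors and keep one with the fewest mismatches."""
    k = len(basis_cols)
    return min(product((0, 1), repeat=k),
               key=lambda x: hamming(bool_combine(basis_cols, x), target))

basis_cols = [(1, 1, 0), (0, 1, 1)]   # illustrative basis vectors
x_exact = best_usage(basis_cols, (1, 1, 1))
x_noisy = best_usage(basis_cols, (1, 0, 0))
print(x_exact, x_noisy, hamming(bool_combine(basis_cols, x_noisy), (1, 0, 0)))
```

Unlike the least-squares case there is no closed-form solution to fall back on; the search space is the set of all subsets of basis vectors.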
slide-23
SLIDE 23

BIPARTITE GRAPHS

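Matrix A (rows 1, 2, 3; columns A, B, C):

      A B C
  1 ( 1 1 0 )
  2 ( 1 1 1 )
  3 ( 0 1 1 )

G(A): bipartite graph on vertices 1, 2, 3 and A, B, C with edges 1A, 1B, 2A, 2B, 2C, 3B, 3C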

slide-24
SLIDE 24

BOOLEAN RANK AND BICLIQUES

  • The Boolean rank of a matrix A is the least number of complete bipartite subgraphs needed to cover every edge of the induced bipartite graph G(A)

[Figure: G(A) on vertices 1, 2, 3 and A, B, C]
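
The correspondence can be checked mechanically. The sketch below builds the edge set of G(A) for a small example matrix with rows 1, 2, 3 and columns A, B, C, and verifies that two hand-picked bicliques cover it; the variable and function names are invented for this illustration.

```python
A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
row_names, col_names = "123", "ABC"

# Edges of the bipartite graph G(A): one edge per 1 in the matrix.
edges = {(row_names[i], col_names[j])
         for i, row in enumerate(A) for j, value in enumerate(row) if value}

def biclique_edges(rows, cols):
    """All edges of the complete bipartite graph on rows x cols."""
    return {(r, c) for r in rows for c in cols}

# Hand-picked candidate cover (an assumption of this example).
bicliques = [({"1", "2"}, {"A", "B"}), ({"2", "3"}, {"B", "C"})]

all_inside = all(biclique_edges(r, c) <= edges for r, c in bicliques)
union = set().union(*(biclique_edges(r, c) for r, c in bicliques))
is_cover = all_inside and union == edges
print(sorted(edges), is_cover)
```

Each biclique corresponds to one rank-1 factor: its row set is a column of B and its column set is the matching row of C.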

slide-25
SLIDE 25

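BOOLEAN RANK AND BICLIQUES

( 1 1 0 )   ( 1 0 )
( 1 1 1 ) = ( 1 1 ) ○ ( 1 1 0 )
( 0 1 1 )   ( 0 1 )   ( 0 1 1 )

[Figure: the bipartite graph on vertices 1, 2, 3 and A, B, C drawn next to the matrices; the two rank-1 factors correspond to the bicliques {1, 2} × {A, B} and {2, 3} × {B, C}]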

slide-29
SLIDE 29

ALGORITHMS

Images by Wikipedia users Arab Ace and Sheilalau
slide-30
SLIDE 30

THE BASIS USAGE

  • Peleg’s algorithm approximates within 2√[(k+a)log a]
  • a is the maximum number of 1s in A’s columns
  • Optimal solution
  • Either an O(2^k knm) exhaustive search, or an integer program
  • Greedy algorithm: select each column of B if it improves the residual error
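
A minimal sketch of the greedy rule for a single target column, assuming the basis is a list of 0/1 column tuples and that columns are visited once, in the given order. The function names and the example are invented here; the example is chosen to show that the greedy choice can miss the optimum.

```python
def hamming(u, v):
    """Number of positions where u and v differ."""
    return sum(a != b for a, b in zip(u, v))

def greedy_usage(basis_cols, target):
    """Visit each basis column once; keep it only if it lowers the current error."""
    current = [0] * len(target)
    usage = []
    for col in basis_cols:
        candidate = [c | b for c, b in zip(current, col)]
        if hamming(candidate, target) < hamming(current, target):
            current = candidate
            usage.append(1)
        else:
            usage.append(0)
    return usage, hamming(current, target)

# The first column covers both target 1s but also one 0; the second is a perfect fit.
basis_cols = [(1, 1, 1), (1, 1, 0)]
target = (1, 1, 0)
usage, error = greedy_usage(basis_cols, target)
print(usage, error)
```

Here the greedy pass keeps the first column and ends with one error, although using only the second column would give zero error.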

slide-31
SLIDE 31

EXACT ALGORITHM FOR THE BOOLEAN RANK

  • Consider an edge-dual of the bipartite graph G
  • Edges of G ⤳ vertices of edge-dual G’
  • Connect two vertices of G’ if the endpoints of the corresponding edges in G induce a biclique
  • A clique partition of G’ is a biclique cover of G
  • A coloring of the complement of G’ is a clique partition of G’
  • A. Ene et al., Fast exact and heuristic methods for role minimization problems, in: SACMAT '08, 1–10.
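
The construction above can be sketched as follows. Note the swap: a simple greedy colouring stands in for the exact colouring step, so the result is only an upper bound on the Boolean rank. The function names are invented for this illustration and are not taken from the cited paper.

```python
from itertools import combinations

def edge_dual(edges):
    """Adjacency of G': two edges of G are adjacent if their endpoints induce a biclique."""
    E = set(edges)
    adj = {e: set() for e in E}
    for (r1, c1), (r2, c2) in combinations(sorted(E), 2):
        # The four endpoints form a biclique when the two "cross" edges also exist.
        if (r1, c2) in E and (r2, c1) in E:
            adj[(r1, c1)].add((r2, c2))
            adj[(r2, c2)].add((r1, c1))
    return adj

def greedy_clique_partition(adj):
    """Greedy colouring of the complement of G': every colour class is a clique of G'."""
    classes = []
    for v in sorted(adj):
        for cls in classes:
            if all(u in adj[v] for u in cls):   # v is adjacent in G' to the whole class
                cls.append(v)
                break
        else:
            classes.append([v])
    return classes

edges = [("1", "A"), ("1", "B"), ("2", "A"), ("2", "B"),
         ("2", "C"), ("3", "B"), ("3", "C")]
classes = greedy_clique_partition(edge_dual(edges))
print(len(classes), classes)
```

Each class is a set of pairwise compatible edges, so the rows and columns it touches induce one biclique of the cover.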

slide-32
SLIDE 32

EXAMPLE

[Figure: G on vertices 1, 2, 3 and A, B, C; its edge-dual G’ on vertices 1A, 1B, 2A, 2B, 2C, 3B, 3C]

slide-35
SLIDE 35

EXACT ALGORITHM FOR THE BOOLEAN RANK

  • Eliminate vertices of G’ if:
  • vertex has no neighbours (is a clique of its own)
  • vertex v together with its neighbours forms a superset of some vertex u together with u’s neighbours
  • Solve graph coloring in the complement of the resulting irreducible kernel
  • Add the removed vertices appropriately
  • A. Ene et al., Fast exact and heuristic methods for role minimization problems, in: SACMAT '08, 1–10.

slide-36
SLIDE 36

THE ASSO ALGORITHM

  • Heuristic: there are too many hardness results to hope for good provable results in any case
  • Intuition: If two columns share a factor, they have 1s in the same rows
  • Noise makes detecting this harder
  • Pairwise row association rules reveal (some of) the factors
  • Pr[a_ik = 1 | a_jk = 1]

P. Miettinen et al., The Discrete Basis Problem, IEEE Trans. Knowl. Data Eng. 20 (2008) 1348–1362.
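
The association confidences can be estimated by counting over the columns. The sketch below follows the row-wise form written above and then rounds the confidences at a threshold `tau` to obtain candidate basis vectors; the function names, the toy matrix and the value of `tau` are assumptions of this illustration, and the remaining steps of Asso (choosing among the candidates) are not shown.

```python
def association_matrix(A):
    """conf[j][i] estimates Pr[a_ik = 1 | a_jk = 1] by counting over columns k."""
    n = len(A)
    conf = [[0.0] * n for _ in range(n)]
    for j in range(n):
        support_j = sum(A[j])
        if support_j == 0:
            continue   # no evidence for an all-zero row
        for i in range(n):
            both = sum(a and b for a, b in zip(A[i], A[j]))
            conf[j][i] = both / support_j
    return conf

def candidate_vectors(A, tau):
    """Round the confidences at threshold tau to get binary candidate vectors."""
    return [[int(c >= tau) for c in row] for row in association_matrix(A)]

A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
candidates = candidate_vectors(A, tau=0.9)
for vec in candidates:
    print(vec)
```

On this noise-free toy matrix the first and last candidates are exactly the two tiles' row sets, which is the behaviour the intuition above describes.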

slide-47
SLIDE 47

THE PANDA ALGORITHM

  • Intuition: every good factor has a noise-free core
  • Two-phase algorithm:
  • 1. Find error-free core pattern (maximum area itemset/tile)
  • 2. Extend the core with noisy rows/columns
  • The core patterns are found using a greedy method
  • The 1s already belonging to some factor/tile are removed from the residual data where the cores are mined
  • C. Lucchese et al., Mining Top-K Patterns from Binary Datasets in presence of Noise, in: SDM '10, 165–176.

slide-48
SLIDE 48

EXAMPLE

slide-50
SLIDE 50

SELECTING THE RANK

[Figure: six binary matrices, each with seven 1s]

slide-51
SLIDE 51

PRINCIPLES OF GOOD K

  • Goal: Separate noise from structure
  • We assume the data has the correct type of structure
  • There are k factors explaining the structure
  • The rest of the data does not follow the structure (noise)
  • But how to decide where structure ends and noise starts?
slide-52
SLIDE 52

MINIMUM DESCRIPTION LENGTH PRINCIPLE

  • The best model (order) is the one that allows you to explain your data with the least number of bits
  • Two-part (crude) MDL: the cost of the model L(H) plus the cost of the data given the model L(D | H)
  • Problem: how to do the encoding
  • All involved matrices are binary: well-known encoding schemes
slide-53
SLIDE 53

FITTING BMF TO MDL

B C E

  • Two-part MDL: minimise L(H) + L(D | H)

P. Miettinen, J. Vreeken, Model Order Selection for Boolean Matrix Factorization, in: KDD '11, 51–59.
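
A rough sketch of a two-part description length for a factorisation, assuming B and C form the model and E is the cell-wise difference between A and B○C. The encoding used here (number of 1s, then an index into all placements of that many 1s) is a simple textbook choice made for this example and is not claimed to be the encoding of the cited paper; all function names are invented.

```python
from math import comb, log2

def boolean_product(B, C):
    """Matrix product where the sum is replaced by logical OR."""
    return [[int(any(b and c for b, c in zip(row, col))) for col in zip(*C)] for row in B]

def matrix_bits(M):
    """Bits for a binary matrix: how many 1s it has, then which cells hold them."""
    cells = len(M) * len(M[0])
    ones = sum(sum(row) for row in M)
    return log2(cells + 1) + log2(comb(cells, ones))

def description_length(A, B, C):
    """L(H) + L(D | H) with H = (B, C) and D | H given by the error matrix E."""
    approx = boolean_product(B, C)
    E = [[a ^ p for a, p in zip(ra, rp)] for ra, rp in zip(A, approx)]
    return matrix_bits(B) + matrix_bits(C) + matrix_bits(E)

A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
dl_rank1 = description_length(A, [[1], [1], [1]], [[1, 1, 1]])
dl_rank2 = description_length(A, [[1, 0], [1, 1], [0, 1]], [[1, 1, 0], [0, 1, 1]])
print(round(dl_rank1, 2), round(dl_rank2, 2))
```

Model order selection would evaluate this total for each candidate k and keep the smallest; on a matrix this small the numbers mainly show the mechanics of the trade-off between model cost and error cost.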

slide-54
SLIDE 54

FITTING BMF TO MDL

  • model L(H)

B C E

  • Two-part MDL: minimise L(H) + L(D | H)

P. Miettinen, J. Vreeken, Model Order Selection for Boolean Matrix Factorization, in: KDD '11, 51–59.

slide-55
SLIDE 55

FITTING BMF TO MDL

B C E

  • Two-part MDL: minimise L(H) + L(D | H)

data given model L(D | H)

P. Miettinen, J. Vreeken, Model Order Selection for Boolean Matrix Factorization, in: KDD '11, 51–59.

slide-56
SLIDE 56 Pauli Miettinen 24 September 2012

EXAMPLE: ASSO & MDL

[Figure: four plots, one per dataset, each labelled with a value of k: Paleo k = 19, Mammals k = 13, DBLP k = 4, Dialect k = 37]
slide-57
SLIDE 57

SPARSE MATRICES

slide-58
SLIDE 58

MOTIVATION

  • Many real-world binary matrices are sparse
  • Representing sparse matrices with sparse factors is desirable
  • Saves space, improves usability, …
  • Sparse matrices should be computationally easier
slide-59
SLIDE 59

APPROXIMATING THE BOOLEAN RANK

  • Let A be a binary n-by-m matrix that has f(m) columns with more than log_2(n) 1s
  • Lemma. We can approximate the Boolean rank of A within O(f(m)ln(|A|)).

P. Miettinen, Sparse Boolean Matrix Factorizations, in: ICDM '10, 935–940.

slide-60
SLIDE 60

SPARSE FACTORISATIONS

  • Any binary matrix A that admits a rank-k BMF has a factorisation into matrices B and C such that |B| + |C| ≤ 2|A|
  • |A| is the number of non-zeros in A
  • Can be extended to approximate factorisations
  • Tight result (consider the case where A has exactly one 1)

P. Miettinen, Sparse Boolean Matrix Factorizations, in: ICDM '10, 935–940.
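
The bound is easy to check on concrete factorisations. The sketch below only verifies the inequality for two hand-written examples, including the single-1 case mentioned above; it does not construct the sparse factorisation whose existence the result guarantees, and the helper names are invented here.

```python
def boolean_product(B, C):
    """Matrix product where the sum is replaced by logical OR."""
    return [[int(any(b and c for b, c in zip(row, col))) for col in zip(*C)] for row in B]

def nnz(M):
    """|M|: number of non-zeros in a binary matrix."""
    return sum(sum(row) for row in M)

def within_bound(A, B, C):
    """True if B o C reproduces A and |B| + |C| <= 2|A|."""
    return boolean_product(B, C) == A and nnz(B) + nnz(C) <= 2 * nnz(A)

# Example with seven 1s: |B| + |C| = 8, well below 2|A| = 14.
A = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
B = [[1, 0], [1, 1], [0, 1]]
C = [[1, 1, 0], [0, 1, 1]]

# Single-1 case: |B| + |C| = 2 = 2|A|, so the bound cannot be improved.
A1 = [[0, 0], [0, 1]]
B1 = [[0], [1]]
C1 = [[0, 1]]

print(within_bound(A, B, C), within_bound(A1, B1, C1), nnz(B1) + nnz(C1) == 2 * nnz(A1))
```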

slide-61
SLIDE 61

CONCLUSIONS

  • Boolean matrix factorisations are a topic older than I am
  • Work has been done in many fields of CS
  • Not just Data Mining
  • Approximate factorisations are an interesting tool for data mining

  • Work is not done yet…
slide-62
SLIDE 62

CONCLUSIONS

  • Thank you!