BOOLEAN MATRIX FACTORISATIONS IN DATA MINING (AND ELSEWHERE)
Pauli Miettinen 15 April 2013
In the sleepy days when the provinces of France were still quietly provincial, matrices with Boolean entries were a favored occupation of aging professors at the universities of Bordeaux and Clermont-
Gian-Carlo Rota, foreword to Boolean Matrix Theory and Applications by K. H. Kim, 1982
BACKGROUND
FREQUENT ITEMSET MINING
A frequent itemset
Many frequent itemsets
TILING DATABASES
items in all transactions
number of item–transaction pairs
TILING AS A MATRIX FACTORISATION
[Figure: a binary matrix with seven 1s written as the Boolean product (○) of two binary factor matrices with four 1s each]
BOOLEAN PRODUCTS AND FACTORISATIONS
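The Boolean matrix product of two binary matrices A and B is their matrix product under the Boolean semi-ring:
$(A \circ B)_{ij} = \bigvee_{k=1}^{K} a_{ik} b_{kj}$, where K is the number of columns of A
The Boolean matrix factorisation of a binary matrix A expresses it as a Boolean product of two binary factor matrices B and C, that is, A = B○C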
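A minimal NumPy sketch of the Boolean product may help here. The function name `boolean_product` and the two small factor matrices are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def boolean_product(B, C):
    """Boolean matrix product: entry (i, j) is the OR over k of B[i, k] AND C[k, j]."""
    B = np.asarray(B, dtype=bool)
    C = np.asarray(C, dtype=bool)
    # The integer product counts the matching k; any positive count means "true".
    return (B.astype(int) @ C.astype(int)) > 0

# Hypothetical factor matrices with four 1s each.
B = np.array([[1, 0],
              [1, 1],
              [0, 1]])
C = np.array([[1, 1, 0],
              [0, 1, 1]])

A = boolean_product(B, C)
print(A.astype(int))
```

Under the ordinary matrix product the middle entry of `B @ C` is 2; under the Boolean semi-ring 1 + 1 = 1, so the result stays binary.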
MATRIX RANKS
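The rank of a matrix A is the least number of rank-1 matrices whose sum is A
The Boolean rank of a binary matrix A is the least number of binary rank-1 matrices whose element-wise or is A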
THE MANY NAMES OF BOOLEAN RANK
COMPARISON OF RANKS
[Figure: example binary matrix with seven 1s]
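The two ranks can differ, and a small computation shows it. The sketch below assumes the example matrix is the 3×3 matrix whose 1s match the edge list 1A, 1B, 2A, 2B, 2C, 3B, 3C used elsewhere in the deck; the zero positions were lost in extraction, so this is an assumption. The helper `boolean_rank_bruteforce` is a hypothetical exhaustive search that is only usable on tiny matrices.

```python
import itertools
import numpy as np

def boolean_rank_bruteforce(A):
    """Smallest k such that A = B o C for binary B (n x k) and C (k x m).

    Exhaustive search over all factor pairs: exponential, tiny inputs only.
    """
    A = np.asarray(A, dtype=bool)
    n, m = A.shape
    if not A.any():
        return 0
    for k in range(1, min(n, m) + 1):
        for bits_b in itertools.product([0, 1], repeat=n * k):
            B = np.array(bits_b).reshape(n, k)
            for bits_c in itertools.product([0, 1], repeat=k * m):
                C = np.array(bits_c).reshape(k, m)
                if np.array_equal((B @ C) > 0, A):
                    return k
    return min(n, m)  # not reached: the Boolean rank never exceeds min(n, m)

# Assumed example matrix (rows 1, 2, 3; columns A, B, C).
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])

print("real rank:   ", np.linalg.matrix_rank(A))
print("Boolean rank:", boolean_rank_bruteforce(A))
```

For this matrix the determinant is non-zero, so the real rank is 3, while two overlapping all-ones blocks suffice under the element-wise or.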
APPROXIMATE FACTORISATIONS
Given a binary matrix A and a rank k, find binary factor matrices B and C such that |A – B○C| is minimised
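As a rough illustration of the objective, the sketch below reads |A – B○C| as the number of entries where A and the Boolean product disagree. The function name `bmf_error` and the example matrices are assumptions made for illustration.

```python
import numpy as np

def bmf_error(A, B, C):
    """Number of entries where A and the Boolean product B o C disagree."""
    A = np.asarray(A, dtype=bool)
    approx = (np.asarray(B, dtype=int) @ np.asarray(C, dtype=int)) > 0
    return int(np.sum(A ^ approx))

# Assumed example matrix with seven 1s.
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])

# Two rank-1 candidates: one covers everything, one covers only a 2 x 2 block.
all_ones = bmf_error(A, [[1], [1], [1]], [[1, 1, 1]])
one_block = bmf_error(A, [[1], [1], [0]], [[1, 1, 0]])
print(all_ones, one_block)
```

The all-ones candidate pays for the two 0s it covers, while the single block pays for the three 1s it misses, so the first is the better rank-1 approximation of this matrix.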
THE BASIS USAGE PROBLEM
BIPARTITE GRAPHS
[Figure: a binary matrix A with rows 1, 2, 3 and columns A, B, C, next to its induced bipartite graph G(A)]
BOOLEAN RANK AND BICLIQUES
The Boolean rank of a matrix A is the least number of complete bipartite subgraphs needed to cover every edge of the induced bipartite graph G(A)
[Figure: the edges of G(A) on vertices 1, 2, 3 and A, B, C covered by bicliques, shown alongside the corresponding binary matrices]
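The correspondence between biclique covers and factorisations can be sketched in a few lines. The edge list is the one given in the deck's example (1A, 1B, 2A, 2B, 2C, 3B, 3C); the two bicliques and the helper `factors_from_bicliques` are assumptions chosen for illustration.

```python
import numpy as np

rows = ["1", "2", "3"]
cols = ["A", "B", "C"]
edges = {("1", "A"), ("1", "B"), ("2", "A"), ("2", "B"),
         ("2", "C"), ("3", "B"), ("3", "C")}

# Assumed cover: each biclique is (set of row vertices, set of column vertices).
bicliques = [({"1", "2"}, {"A", "B"}),
             ({"2", "3"}, {"B", "C"})]

def factors_from_bicliques(bicliques, rows, cols):
    """Column k of B marks the rows of biclique k; row k of C marks its columns."""
    B = np.array([[r in left for left, _ in bicliques] for r in rows], dtype=int)
    C = np.array([[c in right for c in cols] for _, right in bicliques], dtype=int)
    return B, C

B, C = factors_from_bicliques(bicliques, rows, cols)

# Edges covered by the bicliques are the non-zero entries of the product.
covered = {(rows[i], cols[j]) for i, j in zip(*np.nonzero(B @ C))}
print(covered == edges)
```

Each biclique becomes one binary rank-1 matrix, so a cover with two bicliques is a Boolean factorisation of rank 2; the edge 2B lies in both bicliques, which the element-wise or permits.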
ALGORITHMS
Images by Wikipedia users Arab Ace and Sheilalau
THE BASIS USAGE
residual error
EXACT ALGORITHM FOR THE BOOLEAN RANK
corresponding edges in G induce a biclique
role minimization problems, in: SACMAT '08, 1–10.
EXAMPLE
[Figure: the bipartite graph on vertices 1, 2, 3 and A, B, C is transformed (⤳) into a graph whose vertices are its edges 1A, 1B, 2A, 2B, 2C, 3B, 3C]
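One reading of this example, stated as an assumption: build a graph whose vertices are the edges of G and connect two of them when the corresponding edges in G induce a biclique; a smallest partition of that graph into cliques then gives the Boolean rank. The brute-force search below is a stand-in for illustration only and is not the method of the cited paper. All function names are hypothetical.

```python
import itertools

edges = ["1A", "1B", "2A", "2B", "2C", "3B", "3C"]
edge_set = set(edges)

def compatible(e, f):
    """Edges (i, a) and (j, b) induce a biclique iff (i, b) and (j, a) are edges too."""
    return (e[0] + f[1]) in edge_set and (f[0] + e[1]) in edge_set

def is_clique(group):
    return all(compatible(e, f) for e, f in itertools.combinations(group, 2))

def min_clique_partition(vertices):
    """Smallest number of cliques that partition the vertices (exhaustive search)."""
    for k in range(1, len(vertices) + 1):
        for labels in itertools.product(range(k), repeat=len(vertices)):
            groups = [[v for v, lab in zip(vertices, labels) if lab == g]
                      for g in range(k)]
            if all(is_clique(g) for g in groups):
                return k, [g for g in groups if g]
    return 0, []  # only reached for an empty vertex list

k, groups = min_clique_partition(edges)
print(k, groups)
```

Pairwise compatibility within a group means every row of the group is joined to every column of the group, so each clique of the derived graph is a biclique of G.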
EXACT ALGORITHM FOR THE BOOLEAN RANK
vertex u with all its neighbours
irreducible kernel
role minimization problems, in: SACMAT '08, 1–10.
THE ASSO ALGORITHM
results in any case
rows
P. Miettinen et al., The Discrete Basis Problem, IEEE Trans. Knowl. Data Eng. 20 (2008) 1348–1362.
[Figure sequence: step-by-step example of an approximation (≈)]
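A heavily simplified sketch in the spirit of Asso, under stated assumptions: candidate vectors are obtained by thresholding pairwise column association confidences, and factors are then picked greedily by how much they improve the cover. The parameter names `tau`, `w_plus` and `w_minus`, the tie-breaking, and the function name `asso_sketch` are illustrative choices; the cited paper should be consulted for the actual algorithm.

```python
import numpy as np

def asso_sketch(A, k, tau=0.5, w_plus=1.0, w_minus=1.0):
    """Greedy Boolean factorisation A ~ B o C (simplified, for illustration)."""
    A = np.asarray(A, dtype=bool)
    # Pairwise column co-occurrence counts; the diagonal holds column sizes.
    counts = A.T.astype(int) @ A.astype(int)
    sizes = np.maximum(np.diag(counts), 1)        # guard against empty columns
    confidence = counts / sizes[:, None]          # confidence[i, j] = conf(i -> j)
    candidates = confidence >= tau                # row i is one candidate vector

    covered = np.zeros_like(A)
    usage_columns, basis_rows = [], []
    for _ in range(k):
        best_gain, best_cand, best_use = -np.inf, None, None
        for cand in candidates:
            newly = cand[None, :] & ~covered      # cells this candidate would add
            gain = (w_plus * (newly & A).sum(axis=1)
                    - w_minus * (newly & ~A).sum(axis=1))
            use = gain > 0                        # rows that benefit from it
            total = gain[use].sum()
            if total > best_gain:
                best_gain, best_cand, best_use = total, cand, use
        usage_columns.append(best_use)
        basis_rows.append(best_cand)
        covered |= best_use[:, None] & best_cand[None, :]

    B = np.array(usage_columns, dtype=int).T.reshape(A.shape[0], k)
    C = np.array(basis_rows, dtype=int).reshape(k, A.shape[1])
    return B, C

# Assumed example matrix with seven 1s.
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])
B, C = asso_sketch(A, k=2, tau=1.0)
print(B, C, sep="\n")
```

On this small matrix with `tau=1.0` the greedy choice happens to recover an exact rank-2 factorisation; on noisy data the result depends strongly on the threshold.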
THE PANDA ALGORITHM
from the residual data where the cores are mined
Datasets in presence of Noise, in: SDM '10, 165–176.
EXAMPLE
[Figure: example approximation (≈)]
SELECTING THE RANK
[Figure: the binary example matrix with seven 1s, shown six times]
PRINCIPLES OF GOOD K
MINIMUM DESCRIPTION LENGTH PRINCIPLE
The best model is the one that describes your data with the least number of bits
FITTING BMF TO MDL
[Figure: the data matrix split into factor matrices B and C and an error matrix E, combined with ⊕]
data given model L(D | H)
P. Miettinen, J. Vreeken, Model Order Selection for Boolean Matrix Factorization, in: KDD '11, 51–59.
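The split into factors B and C (the model) and an error matrix E (the data given the model) can be sketched with a deliberately naive encoding. Everything below is an assumption for illustration: each binary matrix is charged for its number of 1s plus the choice of their positions, which is not claimed to be the encoding used in the cited paper. The names `matrix_bits` and `description_length` are hypothetical.

```python
from math import comb, log2
import numpy as np

def matrix_bits(M):
    """Bits for a binary matrix: how many 1s it has, then which cells hold them."""
    M = np.asarray(M, dtype=bool)
    cells, ones = M.size, int(M.sum())
    return log2(cells + 1) + log2(comb(cells, ones))

def description_length(A, B, C):
    """Two-part length: model cost for B and C, plus cost of the error E."""
    A = np.asarray(A, dtype=bool)
    approx = (np.asarray(B, dtype=int) @ np.asarray(C, dtype=int)) > 0
    E = A ^ approx                       # A = (B o C) xor E
    model_bits = matrix_bits(B) + matrix_bits(C)
    data_bits = matrix_bits(E)           # plays the role of L(D | H)
    return model_bits + data_bits

# Synthetic 20 x 20 matrix made of two 10 x 10 all-ones blocks.
I2 = np.eye(2, dtype=int)
A = np.kron(I2, np.ones((10, 10), dtype=int))
B2 = np.kron(I2, np.ones((10, 1), dtype=int))   # rank-2 model, exact
C2 = np.kron(I2, np.ones((1, 10), dtype=int))
B0 = np.zeros((20, 0), dtype=int)               # rank-0 model, all in E
C0 = np.zeros((0, 20), dtype=int)

print(description_length(A, B2, C2), description_length(A, B0, C0))
```

With this encoding the rank-2 model is far cheaper than putting every 1 into the error matrix; choosing k then amounts to comparing such totals across candidate ranks.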
EXAMPLE: ASSO & MDL
[Figure: four plots over k; panel titles: Paleo k = 19, Mammals k = 13, DBLP k = 4, Dialect k = 37]
SPARSE MATRICES
MOTIVATION
APPROXIMATING THE BOOLEAN RANK
more than log2(n) 1s
O(f(m)ln(|A|)).
P. Miettinen, Sparse Boolean Matrix Factorizations, in: ICDM '10, 935–940.
SPARSE FACTORISATIONS
to matrices B and C such that |B| + |C| ≤ 2|A|
P. Miettinen, Sparse Boolean Matrix Factorizations, in: ICDM '10, 935–940.
CONCLUSIONS
mining