BOOLEAN MATRIX FACTORISATIONS & DATA MINING
Pauli Miettinen 6 February 2013
In the sleepy days when the provinces of France were still quietly provincial, matrices with Boolean entries were a favored occupation of aging professors at the universities of Bordeaux and Clermont-
Gian-Carlo Rota, foreword to Boolean Matrix Theory and Applications by K. H. Kim, 1982
BACKGROUND
FREQUENT ITEMSET MINING
transactions
STILL TOO MANY ITEMSETS
TILING DATABASES
items in all transactions
item–transaction pairs
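A minimal sketch of the tile notion, assuming (this reading is not spelled out in the surviving text) that a tile is a set of items together with a set of transactions that all contain those items, and that its area is the number of item–transaction pairs it covers. The database `db` and the helper names are hypothetical.

```python
# Toy transaction database: transaction id -> set of items (hypothetical data).
db = {
    1: {"a", "b", "c"},
    2: {"a", "b"},
    3: {"b", "c"},
}

def is_tile(db, items, tids):
    """True if every item in `items` occurs in every transaction in `tids`."""
    return all(items <= db[t] for t in tids)

def tile_area(items, tids):
    """Number of item-transaction pairs covered by the tile."""
    return len(items) * len(tids)

good = is_tile(db, {"a", "b"}, {1, 2})   # both transactions contain a and b
bad = is_tile(db, {"a", "b"}, {1, 3})    # transaction 3 lacks item a
area = tile_area({"a", "b"}, {1, 2})
```

Under this reading, a tiling is a collection of such tiles, and a tile of large area summarises many 1s of the data at once.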
TILING AS A MATRIX FACTORISATION
[Figure: a binary data matrix expressed as the Boolean product (○) of two binary factor matrices]
BOOLEAN PRODUCTS AND FACTORISATIONS
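The Boolean product A○B of two binary matrices A and B is their matrix product under the Boolean semiring, where 1 + 1 = 1
A Boolean matrix factorisation of a binary matrix A expresses it as a Boolean product of two binary factor matrices B and C, that is, A = B○C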
$(A \circ B)_{ij} = \bigvee_{k=1}^{K} a_{ik} b_{kj}$
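A small sketch of the Boolean product in plain Python, to make the difference from the ordinary product concrete. The function name `boolean_product` and the example matrices are illustrative assumptions, not taken from the slides.

```python
def boolean_product(B, C):
    """Boolean matrix product of 0/1 matrices given as lists of lists.

    Entry (i, j) is the OR over k of (B[i][k] AND C[k][j]).
    """
    inner = len(C)
    assert all(len(row) == inner for row in B), "inner dimensions must agree"
    n_cols = len(C[0])
    return [
        [int(any(B[i][k] and C[k][j] for k in range(inner))) for j in range(n_cols)]
        for i in range(len(B))
    ]

B = [[1, 0],
     [1, 1],
     [0, 1]]
C = [[1, 1, 0],
     [0, 1, 1]]

A = boolean_product(B, C)
# The centre entry is 1 OR 1 = 1; the ordinary product would give 1 + 1 = 2.
```

The only change from the ordinary matrix product is that the sum is replaced by a logical OR, so entries never exceed 1.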
MATRIX RANKS
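The normal rank of a matrix A is the least number of rank-1 matrices whose sum is A
The Boolean rank of a binary matrix A is the least number of binary rank-1 matrices whose element-wise or is A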
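To illustrate how the two ranks can differ, the sketch below takes a 3×3 binary matrix (chosen here for illustration, an assumption rather than a matrix recovered from the slides) and shows that two binary rank-1 matrices cover it under element-wise or, while its rank over the rationals is 3. The helpers `real_rank`, `outer` and `elementwise_or` are hypothetical names.

```python
from fractions import Fraction

def real_rank(M):
    """Rank over the rationals, computed by exact Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in M]
    rank = 0
    for col in range(len(rows[0])):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank

def outer(u, v):
    """Binary rank-1 matrix: outer product of two 0/1 vectors."""
    return [[ui & vj for vj in v] for ui in u]

def elementwise_or(X, Y):
    return [[x | y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]

R1 = outer([1, 1, 0], [1, 1, 0])   # upper-left 2x2 block of 1s
R2 = outer([0, 1, 1], [0, 1, 1])   # lower-right 2x2 block of 1s

covered = elementwise_or(R1, R2) == A   # two rank-1 matrices suffice under OR
normal = real_rank(A)                   # rank under ordinary arithmetic
```

Since A is not itself a rank-1 matrix, its Boolean rank is exactly 2 here, which is smaller than its normal rank.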
THE MANY NAMES OF BOOLEAN RANK
COMPARISON OF RANKS
APPROXIMATE FACTORISATIONS
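find binary factor matrices B and C such that the error |A – B○C|, the number of entries where A and B○C differ, is minimised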
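A short sketch of how the error of an approximate factorisation can be evaluated, assuming |A – B○C| counts the entries where A and the Boolean product disagree. The function names and the example matrices are illustrative assumptions; no search for good factors is attempted here.

```python
def boolean_product(B, C):
    """Entry (i, j) is the OR over k of (B[i][k] AND C[k][j])."""
    inner = len(C)
    return [
        [int(any(row[k] and C[k][j] for k in range(inner))) for j in range(len(C[0]))]
        for row in B
    ]

def reconstruction_error(A, B, C):
    """Number of positions where A and the Boolean product of B and C differ."""
    P = boolean_product(B, C)
    return sum(a != p for row_a, row_p in zip(A, P) for a, p in zip(row_a, row_p))

A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]

# Rank-1 approximation: a single all-ones tile covers both 0s of A by mistake.
err_rank1 = reconstruction_error(A, [[1], [1], [1]], [[1, 1, 1]])

# Rank-2 factorisation that reproduces A exactly.
err_rank2 = reconstruction_error(A, [[1, 0], [1, 1], [0, 1]], [[1, 1, 0], [0, 1, 1]])
```

The error counts both kinds of mistake: 1s of A that are left uncovered and 0s of A that are covered.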
THE BASIS USAGE PROBLEM
ALGORITHMS
Images by Wikipedia users Arab Ace and Sheilalau
THE BASIS USAGE
residual error
THE ASSO ALGORITHM
results in any case
THE PANDA ALGORITHM
from the residual data where the cores are mined
EXAMPLE
SELECTING THE RANK
PRINCIPLES OF GOOD K
MINIMUM DESCRIPTION LENGTH PRINCIPLE
the best model is the one that lets you describe your data with the least number of bits
FITTING BMF TO MDL
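A = (B○C) ⊕ E, where E is the error matrix and ⊕ is the element-wise exclusive or
model H: the factor matrices B and C, with encoded length L(H)
data given model: the error matrix E, with encoded length L(D | H)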
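A rough sketch of a two-part description length for a Boolean factorisation. The encoding below is a deliberately naive assumption made for illustration (each matrix is charged its empirical-entropy cost, ignoring the cost of dimensions and parameters); it is not the encoding used in the presentation. All function names are hypothetical.

```python
from math import log2

def boolean_product(B, C):
    """Entry (i, j) is the OR over k of (B[i][k] AND C[k][j])."""
    inner = len(C)
    return [
        [int(any(row[k] and C[k][j] for k in range(inner))) for j in range(len(C[0]))]
        for row in B
    ]

def entropy_bits(M):
    """Naive cost of a 0/1 matrix: number of cells times the entropy of its density."""
    cells = [x for row in M for x in row]
    n, ones = len(cells), sum(cells)
    if ones == 0 or ones == n:
        return 0.0  # a constant matrix is free in this toy encoding
    p = ones / n
    return -n * (p * log2(p) + (1 - p) * log2(1 - p))

def description_length(A, B, C):
    """Return (L(H), L(D | H)) with H = (B, C) and the data encoded via E."""
    P = boolean_product(B, C)
    E = [[a ^ p for a, p in zip(row_a, row_p)] for row_a, row_p in zip(A, P)]
    return entropy_bits(B) + entropy_bits(C), entropy_bits(E)

A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]

# Exact rank-2 model: the model costs bits, the error matrix is empty.
model2, data2 = description_length(A, [[1, 0], [1, 1], [0, 1]], [[1, 1, 0], [0, 1, 1]])

# Rank-1 all-ones model: trivial model, but the two errors must be paid for.
model1, data1 = description_length(A, [[1], [1], [1]], [[1, 1, 1]])
```

To select a rank with such a score, one would compute L(H) + L(D | H) for the factorisation obtained at each candidate k and keep the k with the smallest total.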
EXAMPLE: ASSO & MDL
[Figure: four panels plotted against k. Panel titles: Paleo k = 19, Mammals k = 13, DBLP k = 4, Dialect k = 37]
SPARSE MATRICES
MOTIVATION
APPROXIMATING THE BOOLEAN RANK
more than log2(n) 1s
O(f(m)ln(|A|)).
SPARSE FACTORISATIONS
to matrices B and C such that |B| + |C| ≤ 2|A|
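A quick sketch for checking the bound |B| + |C| ≤ 2|A| on a given factorisation, where |·| counts the 1s of a matrix. It only tests the inequality for factors supplied by the caller; it does not construct factors that satisfy it. The names and example matrices are illustrative assumptions.

```python
def count_ones(M):
    """Number of 1s in a 0/1 matrix given as a list of lists."""
    return sum(sum(row) for row in M)

def within_sparsity_bound(A, B, C):
    """True if the factors together have at most twice as many 1s as A."""
    return count_ones(B) + count_ones(C) <= 2 * count_ones(A)

A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
B = [[1, 0],
     [1, 1],
     [0, 1]]
C = [[1, 1, 0],
     [0, 1, 1]]

sparse_ok = within_sparsity_bound(A, B, C)   # 4 + 4 <= 2 * 7

# A wasteful factorisation of the 1x1 matrix [[1]] breaks the bound: 3 + 3 > 2.
wasteful_ok = within_sparsity_bound([[1]], [[1, 1, 1]], [[1], [1], [1]])
```

The second example shows that the bound is a property of well-chosen factors, not of every factorisation.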
CONCLUSIONS
mining