The 8M Algorithm from Today's Perspective
Radim Belohlavek, Martin Trnecka
Department of Computer Science, Palacký University Olomouc
CLA 2018, 14th International Conference on Concept Lattices and Their Applications, Olomouc, Czech Republic
Our Contributions
Boolean matrix factorization (BMF)
Current research = design of new factorization algorithms
Present and analyze the 8M method
– unknown in present research on BMF
– (first) complete description of the 8M algorithm
– improvement of the 8M algorithm (8M+)
– lessons for improving the performance of existing algorithms
- R. Belohlavek, M. Trnecka (Palacký University Olomouc)
Boolean Matrix Factorization
A general aim: for a given matrix I ∈ {0,1}^{n×m}, find matrices A ∈ {0,1}^{n×k} and B ∈ {0,1}^{k×m} for which I (approximately) equals A ○ B, with k reasonably small.
○ is the Boolean matrix product:
\[ (A \circ B)_{ij} = \max_{l=1}^{k} \min(A_{il}, B_{lj}). \]
Example:
\[
\begin{pmatrix} 1&0&1&1&1 \\ 0&1&1&0&1 \\ 0&1&0&0&1 \\ 1&0&1&1&0 \end{pmatrix}
=
\begin{pmatrix} 1&1&0 \\ 0&1&1 \\ 0&0&1 \\ 1&0&0 \end{pmatrix}
\circ
\begin{pmatrix} 1&0&1&1&0 \\ 0&0&1&0&1 \\ 0&1&0&0&1 \end{pmatrix}
\]
Various terminology and notation (including FCA)
Factors = interesting patterns that help explain data
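The Boolean product can be checked on the example above with a few lines of code. The following is a minimal sketch in plain Python; the function name `boolean_product` and the list-of-lists representation are illustrative choices, not part of the original slides.

```python
def boolean_product(A, B):
    """Boolean matrix product: (A o B)[i][j] = max over l of min(A[i][l], B[l][j])."""
    k = len(B)        # number of factors
    m = len(B[0])     # number of columns of the result
    return [
        [max(min(row[l], B[l][j]) for l in range(k)) for j in range(m)]
        for row in A
    ]


# Matrices A and B from the example on this slide.
A = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
B = [[1, 0, 1, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 1, 0, 0, 1]]

I = boolean_product(A, B)
for row in I:
    print("".join(str(x) for x in row))
```

Each row of the product is the entrywise maximum (logical OR) of the rows of B selected by the corresponding row of A.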
Error Measure
I (approximately) equals A ○ B
Assessed by means of the metric E(⋅,⋅):
\[ E(C, D) = \sum_{i=1}^{n} \sum_{j=1}^{m} |C_{ij} - D_{ij}|. \]
Two components of E:
\[ E(I, A \circ B) = E_u(I, A \circ B) + E_o(I, A \circ B), \]
where
\[ E_u(I, A \circ B) = |\{\langle i,j \rangle ;\ I_{ij} = 1,\ (A \circ B)_{ij} = 0\}|, \]
\[ E_o(I, A \circ B) = |\{\langle i,j \rangle ;\ I_{ij} = 0,\ (A \circ B)_{ij} = 1\}|. \]
Non-symmetry of undercovering and overcovering error
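The decomposition of E into undercovering and overcovering error can be illustrated as follows. This is a small sketch in plain Python; the function name `coverage_errors` and the toy matrices are hypothetical, and `P` stands for an already computed product A ○ B.

```python
def coverage_errors(I, P):
    """Return (E_u, E_o, E) for an input matrix I and a reconstruction P = A o B."""
    e_u = 0  # undercovering: I has 1, reconstruction has 0
    e_o = 0  # overcovering:  I has 0, reconstruction has 1
    for row_i, row_p in zip(I, P):
        for x, y in zip(row_i, row_p):
            if x == 1 and y == 0:
                e_u += 1
            elif x == 0 and y == 1:
                e_o += 1
    return e_u, e_o, e_u + e_o


# Toy data (not from the slides): one uncovered 1 and one spurious 1.
I_toy = [[1, 0, 1],
         [0, 1, 1]]
P_toy = [[1, 1, 0],
         [0, 1, 1]]

e_u, e_o, e = coverage_errors(I_toy, P_toy)
print(e_u, e_o, e)
```

For Boolean matrices the total `e` coincides with the sum of absolute differences in the definition of E.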
8M
Statistical software package known as BMDP
Developed in the 1960s at the University of California in Los Angeles (W. J. Dixon)
8M developed by: M. R. Mickey, L. Engelman and P. Mudle
The 8M method was added to BMDP in the late 1970s
Probably the oldest BMF method
No longer available
Dixon, W. J. (ed.): BMDP Statistical Software Manual. Berkeley, CA: University of California Press (1992)
Incomplete description → several blind spots
Partially black-box analysis of 8M
Basic Idea of 8M
Input:
– I ∈ {0,1}^{n×m} . . . Boolean matrix
– k . . . number of desired factors
– init . . . number of initial factors
– cost . . . determines significance of overcovering
Output:
– A ∈ {0,1}^{n×k} and B ∈ {0,1}^{k×m}
Basic Idea of 8M: main procedure
Algorithm 1: 8M
B ← ComputeInitialFactors(init)
A ← 0_{n×init}
f ← init
RefineMatricesAB(A, B, I, cost)
kReached ← 0
while kReached < 2 or I ≤ A ○ B do
    foreach ⟨i, j⟩ do
        if I_ij > (A ○ B)_ij then Δ+_ij ← 1
        else Δ+_ij ← 0
    add column j of Δ+ with the largest count of 1s as new column to A
    add row of 0s as new row to B and set entry j of this row to 1
    f ← f + 1
    RefineMatricesAB(A, B, I, cost)
    if another two new factors were added then
        remove column A_{_(f−2)} from A and row B_{(f−2)_} from B
        f ← f − 1
        RefineMatricesAB(A, B, I, cost)
    if f = k then kReached ← kReached + 1
return A, B
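The step that creates a new factor from the residual matrix Δ+ can be sketched as follows. This is an illustrative reading of the pseudocode in plain Python, not the original BMDP code; the function name `add_factor_from_residual`, the tie-breaking rule (first column with the largest count), and the starting matrices are assumptions.

```python
def add_factor_from_residual(I, A, B):
    """Append one factor built from the uncovered 1s of I (the matrix Delta+)."""
    n, m, f = len(I), len(I[0]), len(B)
    # Delta+[i][j] = 1 exactly when I has a 1 that A o B does not cover.
    delta = [
        [1 if I[i][j] == 1 and not any(A[i][l] and B[l][j] for l in range(f)) else 0
         for j in range(m)]
        for i in range(n)
    ]
    counts = [sum(delta[i][j] for i in range(n)) for j in range(m)]
    # Assumption: ties are broken in favour of the first such column.
    j_best = max(range(m), key=counts.__getitem__)
    new_A = [A[i] + [delta[i][j_best]] for i in range(n)]            # new column of A
    new_B = [row[:] for row in B] + [[1 if j == j_best else 0 for j in range(m)]]
    return new_A, new_B, j_best


I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]
A0 = [[1], [0], [0], [1]]          # hypothetical current factorization
B0 = [[1, 0, 1, 1, 0]]

A1, B1, j = add_factor_from_residual(I, A0, B0)
print(j, A1, B1)
```

In the full algorithm this step is followed by RefineMatricesAB, which may change both the new and the existing factors.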
Basic Idea of 8M: refine matrices
Algorithm 2: RefineMatricesAB
repeat
    RefineMatrixA(A, B, I, cost)
    RefineMatrixB(A, B, I, cost)
until loop executed 3 times or A and B did not change
Algorithm 3: RefineMatrixA
foreach row i ∈ {1, …, n} do
    y ← I_{i_}; Z ← B; A_{i_} ← 0
    repeat
        foreach factor l ∈ {1, …, f} do
            m_l ← ∑_{j=1}^{m} y_j ⋅ Z_lj − cost ⋅ ∑_{j=1}^{m} (1 − y_j) ⋅ Z_lj
        select p for which m_p = max_l m_l
        if m_p > 0 then
            A_ip ← 1
            foreach j ∈ {1, …, m} do
                if Z_pj = 1 then Z_{_j} ← 0; y_j ← 0
    until m_p ≤ 0
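The row-wise refinement of A can be sketched in plain Python. The sketch assumes that the inner loop stops as soon as the best gain m_p is no longer positive (the only reading under which the loop both terminates and selects more than one factor per row); the function name `refine_matrix_a` and the choice `cost = 1` are illustrative.

```python
def refine_matrix_a(I, B, cost):
    """Greedy Boolean regression of each row of I on the rows of B."""
    n, m, f = len(I), len(I[0]), len(B)
    A = [[0] * f for _ in range(n)]
    if f == 0:
        return A
    for i in range(n):
        y = list(I[i])                  # 1s of row i still to be covered
        Z = [list(row) for row in B]    # working copy of the factors
        while True:
            # Gain of factor l: newly covered 1s minus cost times overcovered 0s.
            gains = [
                sum(y[j] * Z[l][j] for j in range(m))
                - cost * sum((1 - y[j]) * Z[l][j] for j in range(m))
                for l in range(f)
            ]
            p = max(range(f), key=gains.__getitem__)
            if gains[p] <= 0:           # assumed stopping rule
                break
            A[i][p] = 1
            for j in range(m):
                if Z[p][j] == 1:        # column j is now handled: clear it everywhere
                    for l in range(f):
                        Z[l][j] = 0
                    y[j] = 0
    return A


I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]
B = [[1, 0, 1, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 1, 0, 0, 1]]

A = refine_matrix_a(I, B, cost=1)
print(A)
```

On the example matrix from the Boolean Matrix Factorization slide, with B fixed to the factor matrix shown there, this recovers the matrix A of that example. A larger `cost` penalizes overcovering more strongly, so fewer factors are assigned to each row.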
Basic Idea of 8M: initialization
Algorithm 4: ComputeInitialFactors
C ← m × m Boolean matrix with all entries equal to 0
foreach C_ij do
    if I_{_i} ≤ I_{_j} and |I_{_i}| > 0 then C_ij ← 1
remove all duplicate and empty rows from C
f ← 0
foreach row i ∈ {1, …, m} of matrix C do
    if row C_{i_} has entry j for which C_ij = 1 and C_kj = 0 for all k < i then
        f ← f + 1
        add row C_{i_} as a new row to B
    if f = init then return B
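The initialization can be sketched in plain Python as follows. The sketch reads I_{_i} ≤ I_{_j} as componentwise containment of column i in column j, keeps the first copy of duplicate rows, and returns fewer than `init` rows if no more qualify; these readings, and the function name `compute_initial_factors`, are assumptions.

```python
def compute_initial_factors(I, init):
    """Build up to `init` initial rows of B from column containment in I."""
    n, m = len(I), len(I[0])
    cols = [[I[r][j] for r in range(n)] for j in range(m)]
    # Row i of C marks every column j that contains (non-empty) column i.
    C = []
    for i in range(m):
        if sum(cols[i]) == 0:
            continue                    # empty column gives an empty row of C
        row = [1 if all(a <= b for a, b in zip(cols[i], cols[j])) else 0
               for j in range(m)]
        if row not in C:                # drop duplicate rows
            C.append(row)
    B, seen = [], [0] * m
    for row in C:
        if len(B) == init:
            break
        # Keep the row only if it has a 1 in a column no earlier row of C touches.
        if any(r == 1 and s == 0 for r, s in zip(row, seen)):
            B.append(row)
        seen = [max(r, s) for r, s in zip(row, seen)]
    return B


I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]

print(compute_initial_factors(I, 2))
```

On the example matrix from the Boolean Matrix Factorization slide, the two rows obtained this way coincide with two of the three rows of B shown there.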
Basic Idea of 8M
1. Computing init initial factors
– similarity with Asso algorithm
2. Iteratively computes new factors until k factors are obtained
3. Generating new factor via Boolean regression
4. Previously generated factors are revisited and dropped
– adds two factors, then removes factor generated two steps back
– k = 6, sequence: 2, 3, 4, 3, 4, 5, 4, 5, 6, 5, 6
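The "add two, remove one" schedule can be reproduced with a short simulation that tracks only the number of factors. The sketch below assumes init = 2 for the k = 6 example and assumes the process stops when the count reaches k for the second time; both assumptions are chosen because they reproduce the sequence on this slide, and the function name `factor_count_sequence` is hypothetical.

```python
def factor_count_sequence(init, k):
    """Number of factors after each add/remove step of the 8M schedule."""
    if not 0 < init < k:
        raise ValueError("expected 0 < init < k")
    seq = [init]
    f = init
    added = 0        # factors added since the last removal
    reached = 0      # how many times f has hit k
    while reached < 2:
        if added == 2:
            f -= 1   # drop the factor generated two steps back
            added = 0
        else:
            f += 1   # generate a new factor
            added += 1
        seq.append(f)
        if f == k:
            reached += 1
    return seq


print(factor_count_sequence(2, 6))
```

The net effect is one additional factor per three steps, with each surviving factor re-fitted after an older one has been dropped.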
Comparison with Other Methods
Tiling: Geerts, Goethals, Mielikainen: Tiling databases. In: Discovery Science 2004 (2004).
Asso: Miettinen, Mielikainen, Gionis, Das, Mannila: The discrete basis problem. IEEE Trans. Knowledge and Data Eng. (2008).
GreConD: Belohlavek, Vychodil: Discovery of optimal factors in binary data via a novel method of matrix decomposition. J. Comput. Syst. Sci. (2010).
Hyper: Xiang, Jin, Fuhry, Dragan: Summarizing transactional databases with overlapped hyperrectangles. Data Mining and Knowledge Discovery (2011).
PaNDa: Lucchese, Orlando, Perego: Mining top-K patterns from binary datasets in presence of noise. In: SIAM DM 2010 (2010).
Comparison with Other Methods: results
Figure: Coverage quality of the first l factors on real and synthetic data. Panels: (a) Mushroom, (b) Set X1. Axes: k (number of factors) vs. coverage. Methods compared: 8M, Tiling, Asso, GreConD, PaNDa, Hyper.
8M from Today’s Perspective
– Improvements of 8M
– Lessons from 8M
Improvements of 8M
8M+
– New initialization step
– Very fast strategy of GreConD algorithm
– No overcovering error
Comparison of 8M and 8M+
Figure: Coverage quality of the first l factors on real data: 8M vs. 8M+. Panels: (a) Mushroom, (b) DNA. Axes: l (number of factors) vs. coverage.
Lesson from 8M
Revisiting the previously generated factors
Significant aspect
Non-symmetry of undercovering and overcovering error
Existing algorithms do not use any kind of revisiting
Improvement of existing algorithms
Removes factors driven by parameter p
Lesson from 8M: improvement of GreConD
k = number of factors, c = coverage; orig. = original GreConD; remaining columns = values of parameter p

Dataset         orig.   0       0.01    0.02    0.03    0.04    0.05
Emea        k   42      34      29      26      25      24      23
            c   1.000   1.000   0.992   0.981   0.975   0.963   0.956
Chess       k   124     119     72      62      55      51      47
            c   1.000   1.000   0.991   0.981   0.970   0.962   0.952
Firewall 1  k   66      65      17      10      8       7       6
            c   1.000   1.000   0.990   0.981   0.972   0.964   0.953
Firewall 2  k   10      10      4       4       4       4       3
            c   1.000   1.000   0.998   0.998   0.998   0.998   0.958
Mushroom    k   120     113     81      73      69      65      61
            c   1.000   1.000   0.990   0.980   0.970   0.960   0.951
Conclusions
– Detailed description of 8M
– Improvement of 8M
– New ideas for current BMF algorithms
– Explore revisiting of factors
Thank you