
SLIDE 1

The 8M Algorithm from Today’s Perspective

Radim Belohlavek, Martin Trnecka

DEPARTMENT OF COMPUTER SCIENCE PALACKÝ UNIVERSITY OLOMOUC CLA 2018 14th International Conference on Concept Lattices and Their Applications Olomouc, Czech Republic, June 12–14, 2018

SLIDE 2

Our Contributions

Boolean matrix factorization (BMF)
Current research = design of new factorization algorithms
Present and analyze the 8M method

– unknown in present research on BMF
– (first) complete description of the 8M algorithm
– improvement of the 8M algorithm (8M+)
– lessons for the performance of existing algorithms

R. Belohlavek, M. Trnecka (Palacký University Olomouc)


SLIDE 3

Boolean Matrix Factorization

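A general aim: for a given matrix $I \in \{0,1\}^{n \times m}$, find matrices $A \in \{0,1\}^{n \times k}$ and $B \in \{0,1\}^{k \times m}$ for which $I$ (approximately) equals $A \circ B$, with $k$ reasonably small.

$\circ$ is the Boolean matrix product:

\[
(A \circ B)_{ij} = \max_{l=1}^{k} \min(A_{il}, B_{lj}).
\]

\[
\begin{pmatrix}
1 & 0 & 1 & 1 & 1 \\
0 & 1 & 1 & 0 & 1 \\
0 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 1 & 0
\end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 0 \\
0 & 1 & 1 \\
0 & 0 & 1 \\
1 & 0 & 0
\end{pmatrix}
\circ
\begin{pmatrix}
1 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 \\
0 & 1 & 0 & 0 & 1
\end{pmatrix}
\]

Various terminology and notation (including FCA)
Factors = interesting patterns that help explain data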
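The Boolean product defined on this slide can be checked on the example matrices with a minimal Python sketch. The function name `boolean_product` and the list-of-lists representation are choices made here for illustration, not part of the slides.

```python
def boolean_product(A, B):
    """Boolean matrix product: (A o B)[i][j] = max over l of min(A[i][l], B[l][j])."""
    k = len(B)      # number of factors
    m = len(B[0])   # number of columns of the result
    return [[max(min(row[l], B[l][j]) for l in range(k)) for j in range(m)]
            for row in A]


# Example matrices copied from the slide.
I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]
A = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
B = [[1, 0, 1, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 1, 0, 0, 1]]

print(boolean_product(A, B) == I)
```

Each row of the product is the entrywise OR of the rows of B selected by the 1s in the corresponding row of A.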


SLIDE 4

Error Measure

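$I$ (approximately) equals $A \circ B$
Assessed by means of the metric $E(\cdot,\cdot)$:

\[
E(C, D) = \sum_{i=1}^{n} \sum_{j=1}^{m} |C_{ij} - D_{ij}|.
\]

Two components of $E$:

\[
E(I, A \circ B) = E_u(I, A \circ B) + E_o(I, A \circ B),
\]

where

\[
E_u(I, A \circ B) = |\{\langle i,j \rangle ;\ I_{ij} = 1,\ (A \circ B)_{ij} = 0\}|,
\]
\[
E_o(I, A \circ B) = |\{\langle i,j \rangle ;\ I_{ij} = 0,\ (A \circ B)_{ij} = 1\}|.
\]

Non-symmetry of undercovering and overcovering error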
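As a small illustration of the two error components, the following Python sketch counts undercovered and overcovered entries separately. The function name `covering_errors` and the approximating matrix `P` are made up for this example; only `I` comes from the slides.

```python
def covering_errors(I, P):
    """Return (E_u, E_o) for an input matrix I and an approximation P.

    E_u counts entries with I = 1 and P = 0 (undercovering),
    E_o counts entries with I = 0 and P = 1 (overcovering).
    """
    e_u = sum(1 for ri, rp in zip(I, P) for x, y in zip(ri, rp) if x == 1 and y == 0)
    e_o = sum(1 for ri, rp in zip(I, P) for x, y in zip(ri, rp) if x == 0 and y == 1)
    return e_u, e_o


I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]
# Hypothetical approximation: one 1 of I is missed, one 0 of I is covered.
P = [[1, 0, 1, 1, 0],
     [0, 1, 1, 0, 1],
     [1, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]

e_u, e_o = covering_errors(I, P)
total = e_u + e_o  # E(I, P), the number of differing entries
print(e_u, e_o, total)
```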


SLIDE 5

8M

Statistical software package known as BMDP
Developed in the 1960s at the University of California in Los Angeles (W. J. Dixon)
Developed by: M. R. Mickey, L. Engelman and P. Mundle
The 8M method was added to BMDP in the late 1970s
Probably the oldest BMF method
No longer available
Dixon, W. J. (ed.): BMDP Statistical Software Manual. Berkeley, CA: University of California Press (1992)
Incomplete description → several blind spots
Partially black-box analysis of 8M


SLIDE 6

Basic Idea of 8M

Input:

– $I \in \{0,1\}^{n \times m}$ . . . Boolean matrix
– k . . . number of desired factors
– init . . . number of initial factors
– cost . . . determines significance of overcovering

Output:

– $A \in \{0,1\}^{n \times k}$ and $B \in \{0,1\}^{k \times m}$


SLIDE 7

Basic Idea of 8M: main procedure

Algorithm 1: 8M

B ← ComputeInitialFactors(init)
A ← 0n×init
f ← init
RefineMatricesAB(A, B, I, cost)
kReached ← 0
while kReached < 2 or I ≤ A ○ B do
    foreach ⟨i, j⟩ do
        if Iij > (A ○ B)ij then ∆+ij ← 1
        else ∆+ij ← 0
    add column j of ∆+ with the largest count of 1s as new column to A
    add row of 0s as new row to B and set entry j of this row to 1
    f ← f + 1
    RefineMatricesAB(A, B, I, cost)
    if another two new factors were added then
        remove column A_(f−2) from A and row B(f−2)_ from B
        f ← f − 1
        RefineMatricesAB(A, B, I, cost)
    if f = k then kReached ← kReached + 1
return A, B
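The step of Algorithm 1 that creates a new factor from the uncovered entries can be sketched in Python as follows. This is only an illustration of that single step, without the refinement calls. The function names are invented here, ties between columns are broken by the lowest index (an assumption, since the slides do not say), and the starting matrices `A` and `B` are a made-up one-factor state.

```python
def boolean_product(A, B):
    """(A o B)[i][j] = max_l min(A[i][l], B[l][j]); assumes at least one factor."""
    return [[max(min(a, b_row[j]) for a, b_row in zip(row, B))
             for j in range(len(B[0]))] for row in A]


def add_factor_from_residual(I, A, B):
    """Append one factor built from the uncovered 1s of I (the matrix Delta+)."""
    n, m = len(I), len(I[0])
    P = boolean_product(A, B)
    # Delta+ marks entries that are 1 in I but not yet covered by A o B.
    delta = [[1 if I[i][j] > P[i][j] else 0 for j in range(m)] for i in range(n)]
    counts = [sum(delta[i][j] for i in range(n)) for j in range(m)]
    # Column with the most uncovered 1s; lowest index wins ties (assumption).
    j_best = max(range(m), key=lambda j: counts[j])
    new_A = [row + [delta[i][j_best]] for i, row in enumerate(A)]
    new_B = [list(r) for r in B] + [[1 if j == j_best else 0 for j in range(m)]]
    return new_A, new_B, j_best


I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]
A = [[1], [0], [0], [1]]   # hypothetical current state with one factor
B = [[1, 0, 1, 1, 0]]

new_A, new_B, j_best = add_factor_from_residual(I, A, B)
print(j_best, new_A, new_B)
```

In the algorithm itself, the new column and row are only a starting point: the subsequent call to RefineMatricesAB may change both.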


SLIDE 8

Basic Idea of 8M: refine matrices

Algorithm 2: RefineMatricesAB

repeat
    RefineMatrixA(A, B, I, cost)
    RefineMatrixB(A, B, I, cost)
until loop executed 3 times or A and B did not change

Algorithm 3: RefineMatrixA

foreach row i ∈ {1, . . . , n} do
    y ← Ii_; Z ← B; Ai_ ← 0
    repeat
        foreach factor l ∈ {1, . . . , f} do
            ml ← ∑_{j=1}^{m} yj ⋅ Zlj − cost ⋅ ∑_{j=1}^{m} (1 − yj) ⋅ Zlj
        select p for which mp = maxl ml
        if mp > 0 then
            Aip ← 1
            foreach j ∈ {1, . . . , m} do
                if Zpj = 1 then Z_j ← 0; yj ← 0
    until mp ≤ 0
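A hedged Python sketch of the per-row regression in RefineMatrixA is given below. It assumes the inner loop stops once the best gain is no longer positive, and it breaks ties between factors by the lowest index; both are assumptions made here. The function name `refine_row` is invented for the example, and `cost = 1` is an arbitrary choice.

```python
def refine_row(row_of_I, B, cost):
    """Greedily choose factors (rows of B) that explain one row of I.

    A factor's gain is the number of still-uncovered 1s it covers minus
    cost times the number of 0s of the row it would overcover.
    """
    y = list(row_of_I)
    Z = [list(r) for r in B]        # working copy of B
    f, m = len(Z), len(y)
    a = [0] * f                     # the new row of A
    if f == 0:
        return a
    while True:
        gains = [sum(y[j] * Z[l][j] for j in range(m))
                 - cost * sum((1 - y[j]) * Z[l][j] for j in range(m))
                 for l in range(f)]
        p = max(range(f), key=lambda l: gains[l])  # lowest index wins ties
        if gains[p] <= 0:
            break
        a[p] = 1
        for j in range(m):
            if Z[p][j] == 1:
                for l in range(f):  # column j is now covered: clear it in Z
                    Z[l][j] = 0
                y[j] = 0
    return a


I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]
B = [[1, 0, 1, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 1, 0, 0, 1]]

A = [refine_row(row, B, cost=1) for row in I]
print(A)
```

With the matrix B from the Boolean Matrix Factorization slide and cost 1, this sketch recovers the matrix A shown there.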


SLIDE 9

Basic Idea of 8M: initialization

Algorithm 4: ComputeInitialFactors

C ← m × m Boolean matrix with all entries equal to 0
foreach Cij do
    if I_i ≤ I_j and ∣I_i∣ > 0 then Cij ← 1
remove all duplicate and empty rows from C
f ← 0
foreach row i ∈ {1, . . . , m} of matrix C do
    if row Ci_ has entry j for which Cij = 1 and Ckj = 0 for all k < i then
        f ← f + 1
        add row Ci_ as a new row to B
        if f = init then return B
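The initialization can be sketched in Python under a few stated assumptions: I_i is read as column i of I, ≤ is the entrywise comparison of columns, duplicates are removed by keeping the first occurrence, and B is returned as is when fewer than init rows qualify. The function name and the example value `init = 2` are chosen here for illustration.

```python
def compute_initial_factors(I, init):
    """Sketch of ComputeInitialFactors: rows of a column-inclusion matrix C."""
    n, m = len(I), len(I[0])
    cols = [[I[r][c] for r in range(n)] for c in range(m)]
    # C[i][j] = 1 iff column i is nonempty and contained in column j.
    C = [[1 if any(cols[i]) and all(a <= b for a, b in zip(cols[i], cols[j])) else 0
          for j in range(m)] for i in range(m)]
    # Remove empty and duplicate rows, keeping first occurrences (assumption).
    seen, rows = set(), []
    for row in C:
        key = tuple(row)
        if any(row) and key not in seen:
            seen.add(key)
            rows.append(row)
    B, covered = [], [0] * m
    for row in rows:
        # Keep the row only if it has a 1 in a column where no earlier row has one.
        if any(x == 1 and c == 0 for x, c in zip(row, covered)):
            B.append(row)
            if len(B) == init:
                return B
        covered = [max(c, x) for c, x in zip(covered, row)]
    return B  # fewer than init rows qualified (behaviour assumed here)


I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]

B0 = compute_initial_factors(I, init=2)
print(B0)
```

On the example matrix from the Boolean Matrix Factorization slide, the two rows found this way coincide with two of the three rows of B shown there.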


SLIDE 10

Basic Idea of 8M

1. Computes init initial factors
   – similarity with the Asso algorithm
2. Iteratively computes new factors until k factors are obtained
3. Generates each new factor via Boolean regression
4. Revisits and drops previously generated factors
   – adds two factors, then removes the factor generated two steps back
   – k = 6, sequence: 2, 3, 4, 3, 4, 5, 4, 5, 6, 5, 6
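The add-two, remove-one pattern can be traced with a short Python simulation that tracks only the number of factors. Two assumptions are made here so that the trace matches the listed sequence: the run starts from init = 2, and a visit to k is counted immediately after a factor is added, with the run ending on the second visit. The function name `factor_count_trace` is invented for the example.

```python
def factor_count_trace(init, k):
    """Trace the number of factors under the add-two, remove-one schedule."""
    if init >= k:
        raise ValueError("this sketch assumes init < k")
    f = init
    trace = [f]
    added_since_removal = 0
    times_k_reached = 0
    while True:
        f += 1                        # a new factor is added
        added_since_removal += 1
        trace.append(f)
        if f == k:
            times_k_reached += 1
            if times_k_reached == 2:  # second visit to k ends the run
                return trace
        if added_since_removal == 2:
            f -= 1                    # drop the factor generated two steps back
            added_since_removal = 0
            trace.append(f)


trace = factor_count_trace(init=2, k=6)
print(trace)
```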


SLIDE 11

Comparison with Other Methods

Tiling: Geerts, Goethals, Mielikainen: Tiling databases. In: Discovery Science 2004 (2004).
Asso: Miettinen, Mielikainen, Gionis, Das, Mannila: The discrete basis problem. IEEE Trans. Knowledge and Data Eng. (2008).
GreConD: Belohlavek, Vychodil: Discovery of optimal factors in binary data via a novel method of matrix decomposition. J. Comput. Syst. Sci. (2010).
Hyper: Xiang, Jin, Fuhry, Dragan: Summarizing transactional databases with overlapped hyperrectangles. Data Mining and Know. Discovery (2011).
PaNDa: Lucchese, Orlando, Perego: Mining top-K patterns from binary datasets in presence of noise. In: SIAM DM 2010 (2010).

SLIDE 12

Comparison with Other Methods: results

Figure: Coverage quality of the first l factors on real and synthetic data. Panels: (a) Mushroom, (b) Set X1. Axes: k (number of factors) vs. coverage. Methods compared: 8M, Tiling, Asso, GreConD, PaNDa, Hyper.


SLIDE 13

8M from Today’s Perspective

Improvements of 8M
Lessons from 8M


SLIDE 14

Improvements of 8M

8M+
New initialization step
Very fast strategy of the GreConD algorithm
No overcovering error


SLIDE 15

Comparison of 8M and 8M+

Figure: Coverage quality of the first l factors on real data: 8M vs. 8M+. Panels: (a) Mushroom, (b) DNA. Axes: l (number of factors) vs. coverage.


SLIDE 16

Lesson from 8M

Revisiting the previously generated factors
Significant aspect
Non-symmetry of undercovering and overcovering error
Existing algorithms do not use any kind of revisiting
Improvement of existing algorithms
Removes factors driven by parameter p


SLIDE 17

Lesson from 8M: improvement of GreConD

                                        p
Dataset          orig.   0       0.01    0.02    0.03    0.04    0.05
Emea         k   42      34      29      26      25      24      23
             c   1.000   1.000   0.992   0.981   0.975   0.963   0.956
Chess        k   124     119     72      62      55      51      47
             c   1.000   1.000   0.991   0.981   0.970   0.962   0.952
Firewall 1   k   66      65      17      10      8       7       6
             c   1.000   1.000   0.990   0.981   0.972   0.964   0.953
Firewall 2   k   10      10      4       4       4       4       3
             c   1.000   1.000   0.998   0.998   0.998   0.998   0.958
Mushroom     k   120     113     81      73      69      65      61
             c   1.000   1.000   0.990   0.980   0.970   0.960   0.951


SLIDE 18

Conclusions

Detailed description of 8M
Improvement of 8M
New ideas for current BMF algorithms
Explore revisiting of factors


SLIDE 19

Thank you
