Evaluating Association Rules in Boolean Matrix Factorization
Jan Outrata, Martin Trnecka
Department of Computer Science, Palacký University Olomouc
4th International Workshop on Computational Intelligence and Data Mining
Tatranské Matliare, Slovakia, September 17-18, 2016

Boolean Matrix Factorization (BMF)
A method for the analysis of Boolean data.
General aim: for a given matrix $I \in \{0,1\}^{n \times m}$, find matrices $A \in \{0,1\}^{n \times k}$ and $B \in \{0,1\}^{k \times m}$ for which $I$ (approximately) equals $A \circ B$, where $\circ$ is the Boolean matrix product
$$(A \circ B)_{ij} = \max_{l=1}^{k} \min(A_{il}, B_{lj}).$$
Example:
$$\begin{pmatrix} 1&0&1&1&1 \\ 0&1&1&0&1 \\ 0&1&0&0&1 \\ 1&0&1&1&0 \end{pmatrix} = \begin{pmatrix} 1&1&0 \\ 0&1&1 \\ 0&0&1 \\ 1&0&0 \end{pmatrix} \circ \begin{pmatrix} 1&0&1&1&0 \\ 0&0&1&0&1 \\ 0&1&0&0&1 \end{pmatrix}$$
- Discovery of $k$ factors that exactly or approximately explain the data.
- Factors = interesting patterns (rectangles) in the data.
J. Outrata, M. Trnecka (Palacký University Olomouc), Tatranské Matliare, Slovakia, Sep 2016
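The Boolean product on this slide can be checked with a few lines of code. The following is a minimal sketch in plain Python; the function name `bool_product` and the list-of-lists representation are illustrative assumptions, not part of the original slides.

```python
def bool_product(A, B):
    """Boolean matrix product: (A o B)[i][j] = max over l of min(A[i][l], B[l][j])."""
    k = len(B)        # number of factors
    m = len(B[0])     # number of attributes (columns of I)
    return [[max(min(row[l], B[l][j]) for l in range(k)) for j in range(m)]
            for row in A]

# The example matrices from the slide.
A = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
B = [[1, 0, 1, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 1, 0, 0, 1]]
I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]

product = bool_product(A, B)
print(product == I)
```

Here the factorization is exact, so the product reproduces I entry by entry.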

Geometry of BMF
Geometry of factorization → coverage of the entries containing 1s by rectangles.
$$\begin{pmatrix} 1&0&1&1&1 \\ 0&1&1&0&1 \\ 0&1&0&0&1 \\ 1&0&1&1&0 \end{pmatrix} = \begin{pmatrix} 1&1&0 \\ 0&1&1 \\ 0&0&1 \\ 1&0&0 \end{pmatrix} \circ \begin{pmatrix} 1&0&1&1&0 \\ 0&0&1&0&1 \\ 0&1&0&0&1 \end{pmatrix}$$
Each factor $l$ contributes one rectangle, the Boolean outer product of column $l$ of $A$ and row $l$ of $B$, and $I$ is the entrywise maximum ($\vee$) of these rectangles:
$$\begin{pmatrix} 1&0&1&1&1 \\ 0&1&1&0&1 \\ 0&1&0&0&1 \\ 1&0&1&1&0 \end{pmatrix} = \begin{pmatrix} 1&0&1&1&0 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \\ 1&0&1&1&0 \end{pmatrix} \vee \begin{pmatrix} 0&0&1&0&1 \\ 0&0&1&0&1 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \end{pmatrix} \vee \begin{pmatrix} 0&0&0&0&0 \\ 0&1&0&0&1 \\ 0&1&0&0&1 \\ 0&0&0&0&0 \end{pmatrix}$$
Belohlavek R., Vychodil V., Discovery of optimal factors in binary data via a novel method of matrix decomposition, Journal of Computer and System Sciences 76(1)(2010), 3–20.
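The rectangle view can be sketched as follows: each factor yields the Boolean outer product of one column of A with the matching row of B, and the entrywise OR of these rectangles gives A ○ B. The helper names `rectangle` and `union` are hypothetical and serve only this illustration.

```python
def rectangle(col, row):
    """Boolean outer product of a column vector and a row vector."""
    return [[c & r for r in row] for c in col]

def union(X, Y):
    """Entrywise OR of two Boolean matrices of equal shape."""
    return [[x | y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

A = [[1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 0]]
B = [[1, 0, 1, 1, 0], [0, 0, 1, 0, 1], [0, 1, 0, 0, 1]]
I = [[1, 0, 1, 1, 1], [0, 1, 1, 0, 1], [0, 1, 0, 0, 1], [1, 0, 1, 1, 0]]

# One rectangle per factor l: column l of A times row l of B.
rects = [rectangle([A[i][l] for i in range(len(A))], B[l]) for l in range(len(B))]

covered = rects[0]
for R in rects[1:]:
    covered = union(covered, R)

print(covered == I)
```

Because every rectangle lies inside the 1s of I and together they reach every 1, the three rectangles cover I exactly.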

Explanation of Data by Factors
How large a portion of the data is explained by the factors?
Distance (error function):
$$E(C, D) = \|C - D\| = \sum_{i=1}^{n} \sum_{j=1}^{m} |C_{ij} - D_{ij}|.$$
Two components of $E$ (uncovered and overcovered entries):
$$E(I, A \circ B) = E_u(I, A \circ B) + E_o(I, A \circ B),$$
where
$$E_u(I, A \circ B) = |\{\langle i,j \rangle ;\ I_{ij} = 1,\ (A \circ B)_{ij} = 0\}|,$$
$$E_o(I, A \circ B) = |\{\langle i,j \rangle ;\ I_{ij} = 0,\ (A \circ B)_{ij} = 1\}|.$$
Coverage quality of the first $l$ factors, for $A \in \{0,1\}^{n \times l}$ and $B \in \{0,1\}^{l \times m}$:
$$c(l) = 1 - E(I, A \circ B) / \|I\|.$$
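As a small worked example of these quantities, the sketch below counts uncovered and overcovered entries and evaluates the coverage quality. The matrix `P` is an assumed approximation of the earlier example matrix `I` obtained by keeping only its first two factors; the function names are illustrative.

```python
def errors(I, P):
    """Return (E_u, E_o): 1s of I missed by P, and 0s of I set to 1 in P."""
    pairs = [(x, y) for ri, rp in zip(I, P) for x, y in zip(ri, rp)]
    e_u = sum(1 for x, y in pairs if x == 1 and y == 0)
    e_o = sum(1 for x, y in pairs if x == 0 and y == 1)
    return e_u, e_o

def coverage(I, P):
    """Coverage quality c = 1 - E(I, P) / ||I||, where ||I|| is the number of 1s in I."""
    e_u, e_o = errors(I, P)
    return 1 - (e_u + e_o) / sum(map(sum, I))

I = [[1, 0, 1, 1, 1], [0, 1, 1, 0, 1], [0, 1, 0, 0, 1], [1, 0, 1, 1, 0]]
# Assumed approximation: Boolean product of the first two factors only.
P = [[1, 0, 1, 1, 1], [0, 0, 1, 0, 1], [0, 0, 0, 0, 0], [1, 0, 1, 1, 0]]

print(errors(I, P))    # three 1s of I stay uncovered, nothing is overcovered
print(coverage(I, P))
```

With 3 of the 12 ones left uncovered and no overcover, the coverage is 1 - 3/12 = 0.75.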

Two Basic Viewpoints on BMF
Discrete Basis Problem
- Given $I \in \{0,1\}^{n \times m}$ and a positive integer $k$, find $A \in \{0,1\}^{n \times k}$ and $B \in \{0,1\}^{k \times m}$ that minimize $\|I - A \circ B\|$.
- Emphasizes the importance of the first few (presumably most important) factors.
- Miettinen P., Mielikainen T., Gionis A., Das G., Mannila H., The discrete basis problem, IEEE Transactions on Knowledge and Data Engineering 20(10)(2008), 1348–1362.
Approximate Factorization Problem
- Given $I$ and a prescribed error $\varepsilon \geq 0$, find $A \in \{0,1\}^{n \times k}$ and $B \in \{0,1\}^{k \times m}$ with $k$ as small as possible such that $\|I - A \circ B\| \leq \varepsilon$.
- Emphasizes the need to account for (and thus to explain) a prescribed (presumably reasonably large) portion of the data.
- Belohlavek R., Trnecka M., From-below approximations in Boolean matrix factorization: Geometry and new algorithm, Journal of Computer and System Sciences 81(8)(2015), 1678–1697.

Our Work
- Association rules form the basis of the Asso algorithm.
  Miettinen P., Mielikainen T., Gionis A., Das G., Mannila H., The discrete basis problem, IEEE Transactions on Knowledge and Data Engineering 20(10)(2008), 1348–1362.
- The confidence parameter influences the quality of the factorization.
- Can other types of association rules improve Asso?
- Can association rules be used in other BMF algorithms? We consider the GreConD algorithm.
  Belohlavek R., Vychodil V., Discovery of optimal factors in binary data via a novel method of matrix decomposition, Journal of Computer and System Sciences 76(1)(2010), 3–20.

Association Rules in GUHA
GUHA (General Unary Hypotheses Automaton)
- For Boolean data, an association rule (over a given set of attributes) is an expression $i \approx j$, where $i$ and $j$ are attributes.
- A general GUHA association rule is an expression $\varphi \approx \psi$, where $\varphi$ and $\psi$ are arbitrarily complex logical formulas over the attributes.
Four-fold table 4ft($i$, $j$, $I$), where fr(·) is the number of objects (rows) of $I$ satisfying the formula:
$$\langle a, b, c, d \rangle = \langle fr(i \wedge j),\ fr(i \wedge \neg j),\ fr(\neg i \wedge j),\ fr(\neg i \wedge \neg j) \rangle$$
  I     j                  ¬j
  i     a = fr(i ∧ j)      b = fr(i ∧ ¬j)
  ¬i    c = fr(¬i ∧ j)     d = fr(¬i ∧ ¬j)
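The four-fold table can be computed directly from two columns of I. The sketch below uses the example matrix from the first slides and 0-based column indices; the function name `four_fold` is an illustrative assumption.

```python
def four_fold(I, i, j):
    """Four-fold table <a, b, c, d> of attributes (columns) i and j in Boolean matrix I."""
    a = sum(1 for row in I if row[i] and row[j])          # objects with both i and j
    b = sum(1 for row in I if row[i] and not row[j])      # objects with i but not j
    c = sum(1 for row in I if not row[i] and row[j])      # objects with j but not i
    d = sum(1 for row in I if not row[i] and not row[j])  # objects with neither
    return a, b, c, d

I = [[1, 0, 1, 1, 1], [0, 1, 1, 0, 1], [0, 1, 0, 0, 1], [1, 0, 1, 1, 0]]

# First and third attribute of the example matrix (0-based columns 0 and 2).
table = four_fold(I, 0, 2)
print(table)
```

The four counts always sum to the number of objects, which is a quick sanity check on any implementation.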

(Generalized) Quantifiers
A function $q$ that assigns a logical value 0 or 1 to any four-fold table 4ft($i$, $j$, $I$) defines a so-called (generalized, GUHA) quantifier.
- Quantifiers can be studied from both a logical and a statistical viewpoint.
- Different quantifiers interpret different types of association rules (with different meanings of the association $\approx$ between attributes).

(Generalized) Quantifiers
Founded ($p$-)implication, $\Rightarrow_p$ (for $\approx$):
$$q(a,b,c,d) = \begin{cases} 1 & \text{if } \dfrac{a}{a+b} \geq p, \\ 0 & \text{otherwise.} \end{cases}$$
Used in Asso.
Double founded implication, $\Leftrightarrow_p$:
$$q(a,b,c,d) = \begin{cases} 1 & \text{if } \dfrac{a}{a+b+c} \geq p, \\ 0 & \text{otherwise.} \end{cases}$$
Meaning: the number of objects having both $i$ and $j$ in $I$ is at least $100 \cdot p\,\%$ of the number of objects having $i$ or $j$.
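A minimal sketch of these two quantifiers as functions of the four-fold table follows. The slides do not say what happens when a denominator is zero; returning 0 in that case is an assumption made here only to keep the code total.

```python
def founded_implication(a, b, c, d, p):
    """1 if a / (a + b) >= p, else 0 (assumption: 0 when a + b == 0)."""
    return 1 if a + b > 0 and a / (a + b) >= p else 0

def double_founded_implication(a, b, c, d, p):
    """1 if a / (a + b + c) >= p, else 0 (assumption: 0 when a + b + c == 0)."""
    return 1 if a + b + c > 0 and a / (a + b + c) >= p else 0

# Illustrative four-fold table: a = 2, b = 0, c = 1, d = 1.
table = (2, 0, 1, 1)
fi = founded_implication(*table, p=0.7)           # a/(a+b)   = 2/2 = 1.0
dfi = double_founded_implication(*table, p=0.7)   # a/(a+b+c) = 2/3
print(fi, dfi)
```

The same table satisfies the founded implication but not the double founded implication at p = 0.7, because the latter also penalizes objects that have j without i.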

(Generalized) Quantifiers
Founded equivalence, $\equiv_p$:
$$q(a,b,c,d) = \begin{cases} 1 & \text{if } \dfrac{a+d}{a+b+c+d} \geq p, \\ 0 & \text{otherwise.} \end{cases}$$
Meaning: for at least $100 \cdot p\,\%$ of all objects in $I$, the attributes $i$ and $j$ have the same value.
E-equivalence, $\sim^{E}_{\delta}$:
$$q(a,b,c,d) = \begin{cases} 1 & \text{if } \max\left(\dfrac{b}{a+b}, \dfrac{c}{c+d}\right) < \delta, \\ 0 & \text{otherwise.} \end{cases}$$
Negative Jaccard distance:
$$q(a,b,c,d) = \begin{cases} 1 & \text{if } \dfrac{b+c}{b+c+d} \geq p, \\ 0 & \text{otherwise.} \end{cases}$$
Our new quantifier, resembling the Jaccard distance dissimilarity measure used in data mining.
Meaning: among the objects that do not have both $i$ and $j$, at least $100 \cdot p\,\%$ have $i$ or $j$.
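The three quantifiers on this slide can be sketched in the same style. The helper `ratio`, which treats a fraction with a zero denominator as 0, is an assumption of this sketch and is not specified in the slides.

```python
def ratio(x, y):
    """x / y, with the (assumed) convention that the value is 0.0 when y == 0."""
    return x / y if y else 0.0

def founded_equivalence(a, b, c, d, p):
    """1 if (a + d) / (a + b + c + d) >= p."""
    return 1 if ratio(a + d, a + b + c + d) >= p else 0

def e_equivalence(a, b, c, d, delta):
    """1 if max(b / (a + b), c / (c + d)) < delta."""
    return 1 if max(ratio(b, a + b), ratio(c, c + d)) < delta else 0

def negative_jaccard(a, b, c, d, p):
    """1 if (b + c) / (b + c + d) >= p."""
    return 1 if ratio(b + c, b + c + d) >= p else 0

# Illustrative four-fold table: a = 2, b = 0, c = 1, d = 1.
t = (2, 0, 1, 1)
print(founded_equivalence(*t, p=0.7))   # (a+d)/n = 3/4
print(e_equivalence(*t, delta=0.6))     # max(0/2, 1/2) = 0.5
print(negative_jaccard(*t, p=0.5))      # (b+c)/(b+c+d) = 1/2
```

Note that E-equivalence uses a strict inequality with an upper bound, while the other quantifiers use a non-strict lower bound.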

Modified Asso algorithm
Input: a Boolean matrix I ∈ {0,1}^(n×m), a positive integer k, a threshold value τ ∈ (0,1], real-valued weights w+, w−, and a quantifier q_τ (with parameter τ) interpreting i ≈ j
Output: Boolean matrices A ∈ {0,1}^(n×k) and B ∈ {0,1}^(k×m)
  for i = 1, ..., m do
    for j = 1, ..., m do
      Q_ij = q_τ(a, b, c, d)        where ⟨a, b, c, d⟩ = 4ft(i, j, I)
    end
  end
  A ← empty n × k Boolean matrix
  B ← empty k × m Boolean matrix
  for l = 1, ..., k do
    (Q_i_, e) ← arg max over rows Q_i_ of Q and e ∈ {0,1}^(n×1) of cover([B; Q_i_], [A e], I, w+, w−)
    A ← [A e], B ← [B; Q_i_]        (append column e to A and row Q_i_ to B)
  end
  return A and B
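The pseudocode above can be sketched in plain Python as follows. This is a simplified illustration, not the authors' implementation: the column e is chosen row by row from the weighted gain on entries not yet covered, ties are broken by first occurrence, and all function names are assumptions. The quantifier is passed in as a function of the four-fold table, so any of the quantifiers from the preceding slides could be plugged in.

```python
def four_fold(I, i, j):
    a = sum(1 for r in I if r[i] and r[j])
    b = sum(1 for r in I if r[i] and not r[j])
    c = sum(1 for r in I if not r[i] and r[j])
    return a, b, c, len(I) - a - b - c

def founded_implication(tau):
    """Quantifier q_tau: 1 if a / (a + b) >= tau (assumed 0 when a + b == 0)."""
    return lambda a, b, c, d: 1 if a + b > 0 and a >= tau * (a + b) else 0

def modified_asso(I, k, quantifier, w_plus=1.0, w_minus=1.0):
    n, m = len(I), len(I[0])
    # Candidate rows of B: Q[i][j] = q(4ft(i, j, I)).
    Q = [[quantifier(*four_fold(I, i, j)) for j in range(m)] for i in range(m)]
    A, B = [[] for _ in range(n)], []
    covered = [[0] * m for _ in range(n)]
    for _ in range(k):
        best = None
        for q in Q:
            e, total = [], 0.0
            for r in range(n):
                # Weighted gain of using candidate q for object r (new entries only).
                gain = sum(w_plus if I[r][j] else -w_minus
                           for j in range(m) if q[j] and not covered[r][j])
                e.append(1 if gain > 0 else 0)
                total += gain if gain > 0 else 0.0
            if best is None or total > best[0]:
                best = (total, q, e)
        _, q, e = best
        B.append(list(q))
        for r in range(n):
            A[r].append(e[r])
            if e[r]:
                for j in range(m):
                    covered[r][j] |= q[j]
    return A, B

I = [[1, 0, 1, 1, 1], [0, 1, 1, 0, 1], [0, 1, 0, 0, 1], [1, 0, 1, 1, 0]]
A, B = modified_asso(I, k=3, quantifier=founded_implication(1.0))
print(A)
print(B)
```

On this small example with τ = 1 the sketch leaves a single 1 of I uncovered, which illustrates that the candidate rows in Q restrict which factors the greedy search can find.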

Modified GreConD algorithm
Input: a Boolean matrix I ∈ {0,1}^(n×m) and a prescribed error ε ≥ 0
Output: Boolean matrices A ∈ {0,1}^(n×k) and B ∈ {0,1}^(k×m)
  Q ← empty m × m Boolean matrix
  for i = 1, ..., m do
    for j = 1, ..., m do
      if i ⇒_1 j is true in I then
        Q_ij = 1
      end
    end
  end
  A ← empty n × k Boolean matrix
  B ← empty k × m Boolean matrix
  while ||I − A ○ B|| > ε do
    D ← arg max over rows Q_i_ of cover(Q_i_, I, A, B)
    V ← cover(D, I, A, B)
    while there is j such that D_j = 0 and cover(D + [j], I, A, B) > V do
      j ← arg max over j with D_j = 0 of cover(D + [j], I, A, B)
      D ← (D + [j])↓↑
      V ← cover(D, I, A, B)
    end
    A ← [A D↓], B ← [B; D]
  end
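A compact sketch of this procedure in plain Python is given below, under stated assumptions: attribute sets are represented as Python sets of 0-based column indices, `cover` is taken to be the number of still-uncovered 1s inside the rectangle D↓ × D↓↑, and ties are broken by first occurrence. The function names are illustrative and not taken from the slides.

```python
def down(I, D):
    """D↓: objects (row indices) that have every attribute in D."""
    return {r for r, row in enumerate(I) if all(row[j] for j in D)}

def up(I, rows):
    """Attributes (column indices) shared by all objects in rows."""
    return {j for j in range(len(I[0])) if all(I[r][j] for r in rows)}

def cover(I, D, uncovered):
    """Number of still-uncovered 1s in the rectangle D↓ × D↓↑ (assumed definition)."""
    rows = down(I, D)
    cols = up(I, rows)
    return sum(1 for r in rows for j in cols if (r, j) in uncovered)

def modified_grecond(I, eps=0):
    n, m = len(I), len(I[0])
    # Row i of Q: attributes j for which "i implies j" holds with confidence 1.
    Q = [frozenset(up(I, down(I, {i}))) for i in range(m)]
    uncovered = {(r, j) for r in range(n) for j in range(m) if I[r][j]}
    factors = []
    while len(uncovered) > eps:
        D = set(max(Q, key=lambda q: cover(I, q, uncovered)))
        V = cover(I, D, uncovered)
        while True:
            candidates = [j for j in range(m) if j not in D]
            if not candidates:
                break
            j = max(candidates, key=lambda c: cover(I, D | {c}, uncovered))
            if cover(I, D | {j}, uncovered) <= V:
                break
            D = up(I, down(I, D | {j}))      # closure (D + [j])↓↑
            V = cover(I, D, uncovered)
        rows = down(I, D)
        cols = up(I, rows)
        factors.append((rows, cols))
        uncovered -= {(r, j) for r in rows for j in cols}
    A = [[1 if r in rows else 0 for rows, _ in factors] for r in range(n)]
    B = [[1 if j in cols else 0 for j in range(m)] for _, cols in factors]
    return A, B

I = [[1, 0, 1, 1, 1], [0, 1, 1, 0, 1], [0, 1, 0, 0, 1], [1, 0, 1, 1, 0]]
A, B = modified_grecond(I)
print(A)
print(B)
```

Because every rectangle D↓ × D↓↑ contains only 1s of I, the error equals the number of uncovered 1s, and each outer iteration covers at least one new entry, so the loop terminates.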

Experimental Evaluation
Synthetic data: 1000 randomly generated datasets (500 rows and 250 columns).
  Dataset   k    dens A   dens B   dens I
  Set C1    40   0.07     0.04     0.10
  Set C2    40   0.07     0.06     0.15
  Set C3    40   0.11     0.05     0.20
Table: Synthetic data
Real data:
  Dataset    Size         ||I||
  DNA        4590 × 392   26527
  Mushroom   8124 × 119   186852
  Zoo        101 × 28     862
Table: Real data

Results C1
Figure: Coverage for synthetic dataset C1 (coverage vs. number of factors, 0 to 40).
Figure: Overcoverage for synthetic dataset C1 (overcoverage vs. number of factors, 0 to 40).
Curves in both plots: founded implication, double founded implication, founded equivalence, negative Jaccard distance, E-equivalence.