SAT-Based Data Mining



  1. SAT-Based Data Mining Saïd Jabbour CRIL - CNRS UMR 8188 Université d’Artois, France GDR-IA - GT CAVIAR Orléans May 27, 2019

  2. Outline
     ◮ Frequent Itemsets Mining
     ◮ Propositional Logic and SAT problem
     ◮ (Parallel) SAT-based Solvers for Enumerating all (C, M)FIM on (Uncertain) Transaction Databases
     ◮ Association Rules Mining
     ◮ Gradual Itemsets Mining
     ◮ Symmetry Breaking in Frequent Itemsets Mining
     ◮ FIM for CNF Formulas compression

  3. Data Mining
     ◮ Discovering interesting knowledge from large amounts of data:
        ◮ Frequent itemsets
        ◮ Sequential patterns
        ◮ Association rules
        ◮ Emerging patterns
        ◮ . . .
     ◮ Frequent itemset mining is an important part of data mining.
     ◮ A wide variety of applications: healthcare, business, education, disaster prevention, etc.

  4. Frequent Itemset Mining
     ◮ A set of items: $\Omega = \{a, b, c, \ldots\}$.
     ◮ An itemset $I$ over $\Omega$ is a subset of $\Omega$, i.e., $I \subseteq \Omega$.
     ◮ A transaction is a couple $(tid, I)$, where $tid$ is the transaction identifier and $I$ is an itemset, i.e., $I \subseteq \Omega$.
     ◮ A transaction database $D$ is a set of transactions.
     ◮ A transaction $(tid, I)$ supports an itemset $J$ if $J \subseteq I$.
     ◮ The cover of an itemset $I$: $\mathit{Cover}(I, D) = \{tid \mid (tid, J) \in D,\ I \subseteq J\}$.
        ◮ $\mathit{Cover}(\{ab\}, D) = \{T_1, T_2, T_5\}$
     ◮ The support of an itemset $I$ in $D$: $\mathit{Supp}(I, D) = |\mathit{Cover}(I, D)|$.
        ◮ $\mathit{Supp}(\{ab\}, D) = 3$

     | TID | Transactions |
     |-----|--------------|
     | T1  | a b c d      |
     | T2  | a b c e      |
     | T3  | a e          |
     | T4  | a d e        |
     | T5  | a b          |
     | T6  | b d          |
     | T7  | b e          |
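The cover and support definitions above can be sketched in a few lines of Python. This is only an illustration on the seven-transaction example; the names `D`, `cover` and `support` are chosen here and do not come from the slides.

```python
# Toy database from the slide: transaction id -> set of items.
D = {
    "T1": {"a", "b", "c", "d"},
    "T2": {"a", "b", "c", "e"},
    "T3": {"a", "e"},
    "T4": {"a", "d", "e"},
    "T5": {"a", "b"},
    "T6": {"b", "d"},
    "T7": {"b", "e"},
}


def cover(itemset, db):
    """Ids of the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return {tid for tid, transaction in db.items() if items <= transaction}


def support(itemset, db):
    """Support = size of the cover."""
    return len(cover(itemset, db))


print(sorted(cover("ab", D)))  # ['T1', 'T2', 'T5']
print(support("ab", D))        # 3
```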

  5. Frequent Itemset Mining
     ◮ $\mathit{FIM}(D, \theta) = \{I \subseteq \Omega \mid \mathit{Supp}(I, D) \geq \theta\}$
     ◮ An itemset $I$ is frequent if its support is greater than or equal to a minsup threshold $\theta$.

  6. Frequent Itemset Mining
     ◮ $\mathit{FIM}(D, \theta) = \{I \subseteq \Omega \mid \mathit{Supp}(I, D) \geq \theta\}$
     ◮ $\mathit{CFIM}(D, \theta) = \{I \in \mathit{FIM}(D, \theta) \mid \forall J \supset I,\ \mathit{Supp}(I, D) > \mathit{Supp}(J, D)\}$
     ◮ An itemset $I$ is closed if $I$ is frequent and there exists no super-pattern $J \supset I$ with the same support as $I$.

  7. Frequent Itemset Mining
     ◮ $\mathit{FIM}(D, \theta) = \{I \subseteq \Omega \mid \mathit{Supp}(I, D) \geq \theta\}$
     ◮ $\mathit{CFIM}(D, \theta) = \{I \in \mathit{FIM}(D, \theta) \mid \forall J \supset I,\ \mathit{Supp}(I, D) > \mathit{Supp}(J, D)\}$
     ◮ $\mathit{MFIM}(D, \theta) = \{I \in \mathit{FIM}(D, \theta) \mid \forall J \supset I,\ \mathit{Supp}(J, D) < \theta\}$
     ◮ An itemset $I$ is a max-pattern if $I$ is frequent and there exists no frequent super-pattern $J \supset I$.
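The three sets FIM, CFIM and MFIM can be compared on the seven-transaction example with a brute-force sketch. It assumes a threshold of θ = 2 (an arbitrary choice for illustration), skips the empty itemset, and only checks one-item extensions of each itemset, which is enough because support can only decrease when items are added. The function names are hypothetical.

```python
from itertools import combinations

# Seven-transaction example database (T1..T7).
D = [set("abcd"), set("abce"), set("ae"), set("ade"),
     set("ab"), set("bd"), set("be")]


def supp(itemset, db):
    """Number of transactions containing `itemset`."""
    return sum(1 for t in db if itemset <= t)


def mine(db, theta):
    """Return (FIM, CFIM, MFIM) over non-empty itemsets, by exhaustive search."""
    items = sorted(set().union(*db))
    fim = [frozenset(c)
           for k in range(1, len(items) + 1)
           for c in combinations(items, k)
           if supp(set(c), db) >= theta]

    def extensions(i):
        # All supersets obtained by adding exactly one item.
        return [i | {x} for x in items if x not in i]

    cfim = [i for i in fim
            if all(supp(i, db) > supp(j, db) for j in extensions(i))]
    mfim = [i for i in fim
            if all(supp(j, db) < theta for j in extensions(i))]
    return fim, cfim, mfim


def label(itemsets):
    return sorted("".join(sorted(i)) for i in itemsets)


fim, cfim, mfim = mine(D, theta=2)
print(len(fim))     # 13
print(label(cfim))  # ['a', 'ab', 'abc', 'ad', 'ae', 'b', 'bd', 'be', 'd', 'e']
print(label(mfim))  # ['abc', 'ad', 'ae', 'bd', 'be']
```

Every maximal itemset is closed and every closed itemset is frequent, so the three lists are nested.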

  8. Frequent Itemset Mining: FIM Approaches
     Specialized Approaches
     ◮ Apriori [Agrawal’93]
     ◮ FP-growth [Han’00]
     ◮ ECLAT [Zaki’00]
     ◮ LCM [Uno’04], . . .
     Declarative Approaches
     ◮ CP [De Raedt’08]
     ◮ SAT [Jabbour’13]
     ◮ ASP [Gebser’16]
     ◮ . . .

  9. Propositional Logic
     Formal language of propositional formulas: $\mathit{Prop}$
     Syntax
     ◮ Logical constants: $\bot$, $\top$
     ◮ Propositional symbols: $a, b, c, \ldots$ (atomic sentences)
     ◮ Wrapping parentheses: $( \ldots )$
     ◮ Sentences are combined by connectives: $\neg, \wedge, \vee, \rightarrow, \Leftrightarrow$.
     If $\Phi_1, \Phi_2 \in \mathit{Prop}$, then the following formulas are in $\mathit{Prop}$: $\neg\Phi_1$, $(\Phi_1 \wedge \Phi_2)$, $(\Phi_1 \vee \Phi_2)$, $(\Phi_1 \rightarrow \Phi_2)$, $(\Phi_1 \Leftrightarrow \Phi_2)$.

  10. Propositional Logic: SAT
      Semantics: an interpretation is a function from $\mathit{Prop}$ to $\{0, 1\}$ (0: false; 1: true), defined inductively as:

      $$B : \mathit{Prop} \to \{0, 1\}, \qquad \begin{cases} \bot \mapsto 0 \\ \top \mapsto 1 \\ F \wedge G \mapsto \min(B(F), B(G)) \\ \neg F \mapsto 1 - B(F) \\ F \vee G \mapsto \max(B(F), B(G)) \end{cases}$$

      ◮ A model of $\Phi$ is an interpretation $B$ satisfying $\Phi$, i.e., $B(\Phi) = 1$.
      ◮ A formula $\Phi$ is satisfiable if there exists a model of $\Phi$.
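The inductive definition of an interpretation translates directly into a recursive evaluator. The sketch below assumes a representation chosen only for illustration: variables are strings, the constants are the strings `"BOT"` and `"TOP"`, and compound formulas are tuples such as `("and", F, G)`. It covers only the cases listed on the slide.

```python
def evaluate(formula, interpretation):
    """Return B(formula) in {0, 1} for the interpretation B given as a dict."""
    if formula == "BOT":
        return 0
    if formula == "TOP":
        return 1
    if isinstance(formula, str):
        # Propositional symbol: look up its truth value.
        return interpretation[formula]
    op = formula[0]
    if op == "not":
        return 1 - evaluate(formula[1], interpretation)
    if op == "and":
        return min(evaluate(formula[1], interpretation),
                   evaluate(formula[2], interpretation))
    if op == "or":
        return max(evaluate(formula[1], interpretation),
                   evaluate(formula[2], interpretation))
    raise ValueError(f"unknown connective: {op!r}")


# (a or b) and (not c)
phi = ("and", ("or", "a", "b"), ("not", "c"))
print(evaluate(phi, {"a": 0, "b": 1, "c": 0}))  # 1, so this interpretation is a model
print(evaluate(phi, {"a": 0, "b": 1, "c": 1}))  # 0
```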

  11. Propositional Logic: SAT
      SAT problem: decide whether a formula in CNF is satisfiable or not [NP-Complete’71].
      ◮ CNF: conjunction of clauses $c_1 \wedge \ldots \wedge c_n$
      ◮ Clause: disjunction of literals $(l_1 \vee \ldots \vee l_k)$
      ◮ Literal: a variable or its negation, $\{l_i, \neg l_i\}$

      $$\Phi = \underbrace{(a \vee b \vee c)}_{C_1} \wedge \underbrace{(\neg a \vee b)}_{C_2} \wedge \underbrace{(b \vee c)}_{C_3} \wedge \underbrace{(\neg c \vee a)}_{C_4}$$

      Various applications: Model Checking, Planning, Data Mining, etc. → easier formulation → efficient solving

  12. SAT Problem
      ◮ Models enumeration problem
      ◮ Variant of the propositional satisfiability problem (SAT)

      $$\Phi = \underbrace{(a \vee b \vee c)}_{C_1} \wedge \underbrace{(\neg a \vee b)}_{C_2} \wedge \underbrace{(b \vee c)}_{C_3} \wedge \underbrace{(\neg c \vee a)}_{C_4}$$

      $$M(\Phi) = \big\{\ \{a = 1, b = 1, c = 1\},\ \{a = 1, b = 1, c = 0\},\ \{a = 0, b = 1, c = 0\}\ \big\}$$

      ◮ Different application domains:
         ◮ Data mining
         ◮ Bounded model checking
         ◮ Knowledge compilation
         ◮ . . .
      ◮ The models enumeration problem has received little attention compared to other SAT issues.
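For a formula this small, the set of models can be checked by trying all eight interpretations. The sketch below is a naive truth-table enumeration, not a SAT solver; the clause representation (pairs of variable name and required value) is an assumption made for illustration.

```python
from itertools import product

# Phi = (a v b v c) ^ (~a v b) ^ (b v c) ^ (~c v a)
# A literal is (variable, value): ("a", 0) stands for "not a".
PHI = [
    [("a", 1), ("b", 1), ("c", 1)],
    [("a", 0), ("b", 1)],
    [("b", 1), ("c", 1)],
    [("c", 0), ("a", 1)],
]


def models(cnf):
    """Enumerate all total interpretations satisfying every clause."""
    variables = sorted({var for clause in cnf for var, _ in clause})
    found = []
    for values in product([0, 1], repeat=len(variables)):
        b = dict(zip(variables, values))
        if all(any(b[var] == val for var, val in clause) for clause in cnf):
            found.append(b)
    return found


for m in models(PHI):
    print(m)
# {'a': 0, 'b': 1, 'c': 0}
# {'a': 1, 'b': 1, 'c': 0}
# {'a': 1, 'b': 1, 'c': 1}
```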

  13. Itemsets Mining
      ◮ $\Omega$: items (finite set of symbols)
      ◮ $I$: itemset (subset of $\Omega$)
      ◮ $T_i = (i, I_i)$: transaction, with $i \in \mathbb{N}$ the transaction identifier and $I_i$ an itemset
      ◮ $D$: transactional database (set of transactions)

      | id | a | b | c | d | e | f | g | transaction |
      |----|---|---|---|---|---|---|---|-------------|
      | 1  | 0 | 0 | 1 | 1 | 1 | 1 | 1 | c d e f g   |
      | 2  | 0 | 0 | 1 | 1 | 1 | 1 | 1 | c d e f g   |
      | 3  | 1 | 1 | 1 | 1 | 0 | 0 | 0 | a b c d     |
      | 4  | 1 | 1 | 1 | 1 | 0 | 1 | 0 | a b c d f   |
      | 5  | 1 | 1 | 1 | 1 | 0 | 0 | 0 | a b c d     |
      | 6  | 0 | 0 | 1 | 0 | 1 | 0 | 0 | c e         |

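  14. Symbolic approach [ECML/PKDD’13]
      Find $\{I \subseteq \Omega \mid \mathit{Supp}(I, D) \geq \theta\}$, $\theta \in \mathbb{N}$.
      Cast frequent itemset extraction as the enumeration of the models of a CNF formula ((anti-)monotonicity). A variable $p_a$ states that item $a$ belongs to the itemset, and a variable $q_i$ states that transaction $T_i$ supports it.

      $$\underbrace{\bigwedge_{i=1}^{m} \Big(\neg q_i \leftrightarrow \bigvee_{a \in \Omega \setminus T_i} p_a\Big)}_{\text{cover: } \Phi_{cov}} \qquad \underbrace{\sum_{i=1}^{m} q_i \geq \theta}_{\text{frequency: } \Phi_{freq}} \qquad \underbrace{\bigwedge_{a \in \Omega} \Big(p_a \vee \bigvee_{T_i \in D \,\mid\, a \notin T_i} q_i\Big)}_{\text{closeness: } \Phi_{clos}}$$

      On the six-transaction database:

      Cover:
      $\neg q_1 \leftrightarrow (p_a \vee p_b)$
      $\neg q_2 \leftrightarrow (p_a \vee p_b)$
      $\neg q_3 \leftrightarrow (p_e \vee p_f \vee p_g)$
      $\neg q_4 \leftrightarrow (p_e \vee p_g)$
      $\neg q_5 \leftrightarrow (p_e \vee p_f \vee p_g)$
      $\neg q_6 \leftrightarrow (p_a \vee p_b \vee p_d \vee p_f \vee p_g)$

      Frequency:
      $q_1 + q_2 + q_3 + q_4 + q_5 + q_6 \geq \theta$

      Closeness:
      $(q_1 \vee q_2 \vee q_6 \vee p_a) \wedge (q_1 \vee q_2 \vee q_6 \vee p_b) \wedge (p_c) \wedge (q_6 \vee p_d) \wedge (q_3 \vee q_4 \vee q_5 \vee p_e) \wedge (q_3 \vee q_5 \vee q_6 \vee p_f) \wedge (q_3 \vee q_4 \vee q_5 \vee q_6 \vee p_g)$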
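To see that the models of the cover, frequency and closeness constraints are the closed frequent itemsets, the constraints can be evaluated directly over all assignments of the `p` and `q` variables. The sketch below does this by exhaustive enumeration on the six-transaction database, with θ = 3 picked arbitrarily for illustration. It checks the semantics of the encoding and is not how a SAT solver would process it; the function names are hypothetical.

```python
from itertools import product

# Six-transaction database: T1..T6.
D = [set("cdefg"), set("cdefg"), set("abcd"),
     set("abcdf"), set("abcd"), set("ce")]
OMEGA = sorted(set().union(*D))


def satisfies(p, q, db, omega, theta):
    """Check cover, frequency and closeness for one assignment (p, q)."""
    # Cover: not q_i  <->  some selected item lies outside T_i.
    cov = all((not q[i]) == any(p[a] for a in omega if a not in t)
              for i, t in enumerate(db))
    # Frequency: at least theta transactions support the itemset.
    freq = sum(q) >= theta
    # Closeness: p_a or some supporting transaction lacks a.
    clos = all(p[a] or any(q[i] for i, t in enumerate(db) if a not in t)
               for a in omega)
    return cov and freq and clos


def closed_frequent_itemsets(db, omega, theta):
    found = []
    for p_values in product([False, True], repeat=len(omega)):
        p = dict(zip(omega, p_values))
        for q in product([False, True], repeat=len(db)):
            if satisfies(p, q, db, omega, theta):
                found.append("".join(a for a in omega if p[a]))
    return sorted(found)


print(closed_frequent_itemsets(D, OMEGA, theta=3))
# ['abcd', 'c', 'cd', 'cdf', 'ce']
```

Each model fixes the `q` variables uniquely once the `p` variables are chosen, so every closed frequent itemset appears exactly once.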

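  15. Symbolic approach
      Declarativity: easy extension to mine particular patterns (add new constraints).
      ◮ $\Phi_{cov} = \bigwedge_{i=1}^{m} \big(\neg q_i \leftrightarrow \bigvee_{a \in \Omega \setminus T_i} p_a\big)$, of size $\sum_{T \in D} (|\Omega| - |T| + 1) \approx |D| \times |\Omega|$
      ◮ $\Phi_{freq} = \sum_{i=1}^{m} q_i \geq \theta$, of size $O(m \log_2(\mathit{min\_supp}))$
      ◮ $\Phi_{clos} = \bigwedge_{a \in \Omega} \big(p_a \vee \bigvee_{T_i \in D \,\mid\, a \notin T_i} q_i\big)$, with $|D| - \mathit{Supp}(\{a\}, D)$ literals $q_i$ in the clause of item $a$
      ◮ $\Phi_{len} = \sum_{a \in \Omega} p_a \geq \mathit{min\_length}$

      | Instance  | θ     | #Tran  | #Items | Type of data                 | #CFIM              |
      |-----------|-------|--------|--------|------------------------------|--------------------|
      | Retail    | 10    | 88162  | 16470  | market basket data           | $> 1 \cdot 10^5$   |
      | Kosarak   | 1000  | 990002 | 41267  | Hungarian on-line news portal | $\simeq 5 \cdot 10^5$ |
      | Accidents | 40000 | 340183 | 468    | traffic accidents            | $\simeq 6 \cdot 10^6$ |

      ◮ The number of closed frequent itemsets is often significant.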
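The size terms above can be evaluated on the six-transaction database from the Itemsets Mining slide. This is a small sketch with hypothetical function names; it only plugs the database into the expressions for the cover and closeness constraints.

```python
# Six-transaction database: T1..T6.
D = [set("cdefg"), set("cdefg"), set("abcd"),
     set("abcdf"), set("abcd"), set("ce")]
OMEGA = set().union(*D)


def cover_size(db, omega):
    """Sum over transactions of |Omega| - |T| + 1."""
    return sum(len(omega) - len(t) + 1 for t in db)


def closure_literals(db, omega):
    """For each item a, the number of transactions that do not contain a."""
    return {a: sum(1 for t in db if a not in t) for a in sorted(omega)}


print(cover_size(D, OMEGA))         # 23
print(len(D) * len(OMEGA))          # 42, the |D| x |Omega| upper estimate
print(closure_literals(D, OMEGA))
# {'a': 3, 'b': 3, 'c': 0, 'd': 1, 'e': 3, 'f': 3, 'g': 4}
```

On such a tiny, dense database the estimate |D| × |Ω| is loose; it becomes tight when transactions are short compared to the number of items.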

  16. SAT-based Solvers for Enumerating all CFIM
      [Figure: solver architecture diagram with the components Decisions (VSIDS), Boolean Propagation, Conflict Analysis, Backjumping, Restarts and Model analysis, and the labelled elements (1) Literal model, (2) Implication Graph, (3) Conflict clause, (4) Backtrack.]
      ◮ A DPLL-based SAT solver is more efficient for enumerating CFIM.


  19. DPLL-based procedure for CFIM [SGAI’16]
      ◮ DPLL-Enum+VSIDS: Variable State Independent Decaying Sum branching heuristic
      ◮ DPLL-Enum+JW: branching heuristic based on the maximum number of occurrences of the variables
      ◮ DPLL-Enum+RAND: random variable selection
      [Figure: running time in seconds (0 to 1000) as a function of the quorum (50 to 300) for CDCL+Enum, DPLL-Enum+RAND, DPLL-Enum+VSIDS and DPLL-Enum+JW.]
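A DPLL-style enumeration procedure can be sketched as unit propagation plus chronological backtracking, with no clause learning and no restarts. The code below is a minimal illustration under stated assumptions, not the solver evaluated in the slides: clauses are lists of signed integers (DIMACS style), and the branching rule is a simple static occurrence count standing in for heuristics such as VSIDS or JW.

```python
def unit_propagate(clauses, assignment):
    """Extend the assignment with forced literals; return None on conflict."""
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned, satisfied = [], False
            for lit in clause:
                var = abs(lit)
                if var in assignment:
                    if assignment[var] == (lit > 0):
                        satisfied = True
                        break
                else:
                    unassigned.append(lit)
            if satisfied:
                continue
            if not unassigned:
                return None  # every literal is false: conflict
            if len(unassigned) == 1:
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0  # unit clause forces this literal
                changed = True
    return assignment


def dpll_enum(clauses, variables, assignment=None):
    """Yield every total model of the CNF `clauses` over `variables`."""
    assignment = unit_propagate(clauses, assignment or {})
    if assignment is None:
        return
    free = [v for v in variables if v not in assignment]
    if not free:
        yield assignment
        return
    # Branch on the free variable with the most occurrences (illustrative rule).
    var = max(free, key=lambda v: sum(abs(l) == v for c in clauses for l in c))
    for value in (True, False):
        yield from dpll_enum(clauses, variables, {**assignment, var: value})


# Phi = (a v b v c) ^ (~a v b) ^ (b v c) ^ (~c v a), with a=1, b=2, c=3.
PHI = [[1, 2, 3], [-1, 2], [2, 3], [-3, 1]]
for model in dpll_enum(PHI, [1, 2, 3]):
    print(model)
```

Because the search backtracks chronologically after each model, no blocking clauses are needed to avoid reporting the same model twice.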

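  20. Limitations
      ◮ $\Phi_{cov} = \bigwedge_{i=1}^{m} \big(\neg q_i \leftrightarrow \bigvee_{a \in \Omega \setminus T_i} p_a\big)$, of size $\sum_{T \in D} (|\Omega| - |T| + 1) \approx |D| \times |\Omega|$
      ◮ $\Phi_{freq} = \sum_{i=1}^{m} q_i \geq \theta$, of size $O(m \log_2(\mathit{min\_supp}))$
      ◮ $\Phi_{clos} = \bigwedge_{a \in \Omega} \big(p_a \vee \bigvee_{T_i \in D \,\mid\, a \notin T_i} q_i\big)$, with $|D| - \mathit{Supp}(\{a\}, D)$ literals $q_i$ in the clause of item $a$
      ◮ $\Phi_{len} = \sum_{a \in \Omega} p_a \geq \mathit{min\_length}$

      | Instance  | θ     | #Tran  | #Items | Type of data                 | #Clauses    | #CFIM              |
      |-----------|-------|--------|--------|------------------------------|-------------|--------------------|
      | Retail    | 10    | 88162  | 16470  | market basket data           | 1451119564  | $> 1 \cdot 10^5$   |
      | Kosarak   | 1000  | 990002 | 41267  | Hungarian on-line news portal | 40846393519 | $\simeq 5 \cdot 10^5$ |
      | Accidents | 40000 | 340183 | 468    | traffic accidents            | 147704774   | $\simeq 6 \cdot 10^6$ |

      ◮ Scalability problem: the number of clauses of the SAT encodings is very large.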
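The approximation |D| × |Ω| can be compared with the clause counts reported in the table. The short sketch below only multiplies the table's own #Tran and #Items figures and prints the ratio to the reported #Clauses; the dictionary layout is an assumption made for illustration.

```python
# (number of transactions, number of items, reported number of clauses)
REPORTED = {
    "Retail": (88162, 16470, 1451119564),
    "Kosarak": (990002, 41267, 40846393519),
    "Accidents": (340183, 468, 147704774),
}


def size_estimate(n_transactions, n_items):
    """|D| x |Omega|, the approximate size of the cover constraint."""
    return n_transactions * n_items


for name, (n_tran, n_items, n_clauses) in REPORTED.items():
    estimate = size_estimate(n_tran, n_items)
    print(f"{name}: estimate {estimate}, reported {n_clauses}, "
          f"ratio {n_clauses / estimate:.3f}")
```

For the two sparse datasets the reported count is within a fraction of a percent of the estimate, which illustrates why the encoding size grows with the product of the number of transactions and the number of items.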
