SPOWL: Spark-based OWL 2 Reasoning Materialisation
Yu Liu and Peter McBrien
Department of Computing, Imperial College London
Y.Liu & P.McBrien BeyondMR17

Table of Contents
◮ Introduction
◮ SPOWL Overview
◮ SPOWL Features
◮ Evaluation
◮ Summary

Reasoning materialisation for OWL 2 ontologies
◮ LUBM T-Box:
    Student ⊑ Person                  (1)
    Student ⊑ ∃takesCourse.Course     (2)
◮ LUBM A-Box:
    Student(John)   (3)    Person(Lewis)   (5)
    Student(Tom)    (4)    Person(Mary)    (6)
◮ Reasoning materialisation:
    Student := { John, Tom }
    Person := { Lewis, Mary, John, Tom }
    takesCourse := { (John, ?C1), (Tom, ?C2) }
    Course := { ?C1, ?C2 }
◮ Querying the ontology:
    ◮ Not only explicit but also implicit facts will be returned.

Reasoning materialisation for OWL 2 ontologies
Materialising reasoning results:
    Student := { John, Tom }
    Person := { Lewis, Mary, John, Tom }
    takesCourse := { (John, ?C1), (Tom, ?C2) }
    Course := { ?C1, ?C2 }
◮ Queries directly read the materialised results.
◮ Query processing is faster, at the cost of more storage space.
◮ Maintenance of the materialisation is difficult.
◮ Ideal case: queries are much more frequent than updates.
◮ Example systems: SPOWL, Oracle’s RDF Store, WebPIE, etc.

Rule evaluation for reasoning materialisation
◮ Rule format: if ⟨antecedent⟩ then ⟨consequent⟩
    Example: if C ⊑ D, C(x) then D(x)
    ⟹ if Student ⊑ Person, Student(x) then Person(x)
◮ Well-known rulesets:
    ◮ RDFS entailment rules.
    ◮ OWL ter Horst rules.
    ◮ OWL 2 RL/RDF rules.
◮ Limitations:
    ◮ No use of tableaux reasoners (e.g. Pellet and HermiT).
    ◮ The inferences obtained depend on which set of entailment rules is chosen.
    ◮ Inefficient rule matching process.

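The rule format above can be illustrated with a minimal sketch. It is not the method of any of the listed rulesets' implementations: plain Python sets stand in for class extensions, and the function name `apply_subclass_rules` and the dictionary layout are assumptions made for illustration. The data is the LUBM example from the earlier slides.

```python
# Illustrative sketch only: Python sets stand in for class extensions.
def apply_subclass_rules(tbox, abox):
    """Apply 'if C ⊑ D, C(x) then D(x)' repeatedly until nothing new is derived."""
    facts = {cls: set(members) for cls, members in abox.items()}
    changed = True
    while changed:
        changed = False
        for sub, sup in tbox:
            # Individuals of the subclass that are not yet in the superclass.
            new = facts.get(sub, set()) - facts.get(sup, set())
            if new:
                facts.setdefault(sup, set()).update(new)
                changed = True
    return facts


tbox = [("Student", "Person")]  # Student ⊑ Person
abox = {"Student": {"John", "Tom"}, "Person": {"Lewis", "Mary"}}
result = apply_subclass_rules(tbox, abox)
print(sorted(result["Person"]))
```

Every rule is re-checked on every pass, which hints at why naive rule matching becomes expensive on large A-Boxes.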
SPOWL architecture
◮ T-Box is small enough for tableaux reasoners.
◮ The number of queries is much larger than the number of updates.
Figure (architecture diagram): OWL T-Box Documents, Classified T-Box, A-Box 1 ... A-Box n, Distributed Data Storage (e.g. HDFS); steps ① Spark Programme Generation, ② Initial Load, ③ Programme Execution.

SPOWL overview
1. Classes & properties are mapped to Spark RDDs:
    C ⇝ C_rdd(id)
    P ⇝ P_rdd(domain, range)
2. T-Box axioms are mapped to entailment rules R_axiom:
    C ⊑ D ⇝ R_{C ⊑ D} ::= if C_rdd(x) then D_rdd(x)
3. R_axiom are further implemented as Spark programmes P_axiom:
    R_{C ⊑ D} ⇝ P_{C ⊑ D} ::= D_rdd = D_rdd.union(C_rdd)
4. P_axiom are iteratively executed to build up the RDDs.

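Steps 2 and 3 can be pictured as simple text generation. The sketch below is only an illustration under assumptions: the helper names `rule_for_subclass` and `programme_for_subclass` are hypothetical, and the way SPOWL actually generates its Spark programmes is not shown in these slides.

```python
# Hypothetical helpers: turn a classified axiom C ⊑ D into rule and programme text.
def rule_for_subclass(sub, sup):
    """Entailment rule R_{C ⊑ D} as text."""
    return f"if {sub}_rdd(x) then {sup}_rdd(x)"


def programme_for_subclass(sub, sup):
    """Spark programme P_{C ⊑ D} as text, following the mapping on the slide."""
    return f"{sup}_rdd = {sup}_rdd.union({sub}_rdd)"


axioms = [("Student", "Person"), ("GraduateStudent", "Student")]
rules = [rule_for_subclass(c, d) for c, d in axioms]
programmes = [programme_for_subclass(c, d) for c, d in axioms]
for line in programmes:
    print(line)
```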
SPOWL uses a tableaux reasoner
◮ More complete T-Box reasoning:
    e.g. classifying C ⊑ D ⊔ E and C ⊓ D ⊑ ⊥ gives us C ⊑ E
◮ Entailment rules are specific to the A-Box data:
    ◮ No need to evaluate rules that are irrelevant to the ontological data.

SPOWL partitions reasoning materialisation
◮ Data of each class or property is stored separately in HDFS:
    C ⇝ hdfs://${C_PATH}/
    P ⇝ hdfs://${P_PATH}/
◮ A variant of the vertical partitioning model.
◮ Only the partitions storing the relevant data need to be accessed.
    e.g. Student_rdd = sc.textFile("hdfs://${Student_PATH}/")
◮ Otherwise, the whole ontology would have to be read and the relevant fragment filtered out.

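The partitioning idea can be sketched in a few lines, assuming the A-Box arrives as (subject, predicate, object) triples with class membership written as `rdf:type`. That input layout, the function name `partition`, and the sample individual `Course1` are assumptions for illustration; in-memory dictionaries stand in for the separate HDFS directories.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # assumed marker for class-membership triples


def partition(triples):
    """Split triples into one partition per class and one per property."""
    classes = defaultdict(set)     # class name -> {id}
    properties = defaultdict(set)  # property name -> {(domain, range)}
    for s, p, o in triples:
        if p == RDF_TYPE:
            classes[o].add(s)
        else:
            properties[p].add((s, o))
    return classes, properties


triples = [
    ("John", RDF_TYPE, "Student"),
    ("Tom", RDF_TYPE, "Student"),
    ("Mary", RDF_TYPE, "Person"),
    ("John", "takesCourse", "Course1"),  # hypothetical sample fact
]
classes, properties = partition(triples)
print(sorted(classes["Student"]))
```

A rule that mentions only Student then touches only `classes["Student"]`, rather than scanning every triple.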
SPOWL handles axioms beyond OWL 2 RL
◮ SomeValuesFrom forms a superclass expression (i.e. C ⊑ ∃P.D)
    e.g. Student ⊑ ∃takesCourse.Course (2)
◮ Non-deterministic reasoning (OWL 2 RL Interpretation I):
    I ⊨ C ⊑ ∃P.D iff C^I ⊆ { x | ∃y : ⟨x, y⟩ ∈ P^I and y ∈ D^I }
◮ Entailment rule R_{C ⊑ ∃P.D}:
    if C_rdd(x), ¬P_rdd(x, y) then P_rdd(x, null)
◮ Spark programme P_{C ⊑ ∃P.D}:
    P_rdd = P_rdd.union(C_rdd.subtract(P_rdd.map(lambda (x, y): x))
                             .map(lambda x: (x, null)))

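The entailment rule above can be tried out locally with a small sketch. It assumes plain Python sets in place of RDDs and `None` in place of `null`; the function name `some_values_from` and the sample course `Course1` are hypothetical.

```python
# Sets stand in for RDDs; None plays the role of null on the slide.
def some_values_from(c, p):
    """Add (x, None) to P for every x in C that has no P-successor yet."""
    has_successor = {x for x, _ in p}
    return p | {(x, None) for x in c - has_successor}


student = {"John", "Tom"}
takes_course = {("John", "Course1")}  # hypothetical explicit fact
takes_course = some_values_from(student, takes_course)
print(sorted(takes_course, key=str))
```

John already takes a course, so only Tom receives a placeholder pair.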
The advantage of using Spark (1)
Spark caches RDDs in distributed memory as much as possible, which:
◮ reduces the need to write/read intermediate results to/from disk.
◮ reduces I/O overhead.
◮ suits iterative computation (e.g. computing transitive closure).

Data caching in distributed memory
Iterative computation:
◮ TransitiveProperty P (P ∘ P ⊑ P).
    subOrganisationOf ∘ subOrganisationOf ⊑ subOrganisationOf (7)
◮ Entailment rule R_{P ∘ P ⊑ P}:
    if P_rdd(x, y), P_rdd(y, z) then P_rdd(x, z)
◮ Spark programme P_{P ∘ P ⊑ P}:
    while True do
        P_tmp = P_rdd.map(lambda (x_p, y_p): (y_p, x_p))
                     .join(P_rdd)
                     .map(lambda (y_k, (x_p, z_p)): (x_p, z_p))
        P_tmp.cache()
        if P_tmp.isEmpty() then break
        P_rdd = P_rdd.union(P_tmp)
    end

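The iterative rule for transitive properties can be sketched without a cluster, using Python sets as stand-ins for RDDs. One assumption is made explicit here: pairs that are already known are removed before the emptiness test, so that the loop stops once nothing new is derived. The function name and the organisation names are hypothetical.

```python
def transitive_closure(p):
    """Repeat the join P(x, y), P(y, z) => P(x, z) until no new pair appears."""
    p = set(p)
    while True:
        # Index P by its first component so the join on y is a lookup.
        successors = {}
        for y, z in p:
            successors.setdefault(y, set()).add(z)
        derived = {(x, z) for x, y in p for z in successors.get(y, ())}
        new_pairs = derived - p  # assumption: drop pairs that are already known
        if not new_pairs:
            break
        p |= new_pairs
    return p


# Hypothetical subOrganisationOf facts.
sub_organisation_of = {("Group1", "Dept1"), ("Dept1", "Univ1")}
closure = transitive_closure(sub_organisation_of)
print(sorted(closure))
```

Each pass reuses the result of the previous one, which is the access pattern that benefits from keeping intermediate data in memory.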
Data caching in distributed memory
◮ GraduateStudent_rdd will be used three times:
    job a: R_{GraduateStudent ⊑ Person} → Person_rdd
    job b: R_{GraduateStudent ⊑ ∃takesCourse.GraduateCourse} → takesCourse_rdd
    job c: R_{GraduateStudent ⊑ Student} → Student_rdd
Figure: Caching GraduateStudent_rdd for Repeated Usage

The advantage of using Spark (2)
More flexible job scheduling as compared to Hadoop:
Figure: Job Scheduling between Hadoop (left) and Spark (right)

DAG for parallelising reasoning
Consider Person ⊓ ∃takesCourse.Course ⊑ Student:
◮ R_{Person ⊓ ∃takesCourse.Course ⊑ Student}:
    if Person_rdd(x), takesCourse_rdd(x, y), Course_rdd(y) then Student_rdd(x)
◮ P_{Person ⊓ ∃takesCourse.Course ⊑ Student}:
    Student_tmp1 = takesCourse_rdd.map(lambda (x_t, y_t): (y_t, x_t))
                       .join(Course_rdd.map(lambda y_c: (y_c, y_c)))
                       .map(lambda (y_k, (x_t, y_c)): x_t)
    Student_tmp2 = Student_tmp1.intersection(Person_rdd)
    Student_rdd = Student_rdd.union(Student_tmp2)

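The three steps of this programme (join, intersection, union) can be followed on toy data with a minimal sketch, assuming Python sets in place of RDDs. The function name and all sample individuals are hypothetical.

```python
def persons_taking_a_course(person, takes_course, course):
    """Individuals x with Person(x), takesCourse(x, y) and Course(y)."""
    # Join takesCourse with Course on y, keeping x (Student_tmp1 on the slide).
    tmp1 = {x for x, y in takes_course if y in course}
    # Intersect with Person (Student_tmp2 on the slide).
    return tmp1 & person


person = {"John", "Mary"}
course = {"C1"}
takes_course = {("John", "C1"), ("Mary", "Seminar1"), ("Rex", "C1")}

student = set()
student |= persons_taking_a_course(person, takes_course, course)  # final union
print(sorted(student))
```

Mary is excluded because `Seminar1` is not a known Course, and Rex because he is not a known Person.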
DAG for parallelising reasoning
    job a: R_{Student ⊑ Person}, R_{GraduateStudent ⊑ Person} → Person_rdd
    job b: R_{Student ⊑ ∃takesCourse.Course} → takesCourse_rdd
    job c: R_{GraduateCourse ⊑ Course} → Course_rdd
    job d: R_{Person ⊓ ∃takesCourse.Course ⊑ Student} → Student_rdd
Figure: DAG Scheduling for R_{Person ⊓ ∃takesCourse.Course ⊑ Student}

Optimising programme execution order
Executing job a, job b and job c before job d is the best order.
Figure: DAG Scheduling for R_{Person ⊓ ∃takesCourse.Course ⊑ Student}

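An order of this kind can be derived mechanically from the dependencies between jobs. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) as a stand-in; it is not how Spark or SPOWL schedules jobs, and the `job_*` names simply mirror the labels in the figure.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs whose output RDDs it reads.
# job_d reads Person_rdd, takesCourse_rdd and Course_rdd.
dependencies = {
    "job_a": set(),
    "job_b": set(),
    "job_c": set(),
    "job_d": {"job_a", "job_b", "job_c"},
}

# static_order() yields every job after all of the jobs it depends on.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Jobs a, b and c have no mutual dependencies, so they can appear in any relative order, or run in parallel, as long as job d comes last.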
Ordering Spark Programmes
Consider P1 ⊑ P2, P2 ∘ P2 ⊑ P2 and P2 ⊑ P3:
Figure: Acyclic property hierarchy
What happens if we add an additional axiom P3 ≡ P1⁻?
Figure: Cyclic property hierarchy

Evaluating reasoning materialisation in SPOWL
◮ Evaluation environment
    ◮ A cluster of 9 machines running in a private cloud environment.
    ◮ Each node has a 2.5 GHz CPU with 4 cores and 16 GB of memory.
◮ Benchmarking dataset: LUBM
    ◮ LUBM-2000: about 270 million A-Box facts, 44 GB in size.
◮ Comparison system: WebPIE
    ◮ Uses MapReduce as the computation framework.
    ◮ Does not use tableaux reasoners.
    ◮ Does not partition the reasoning materialisation.
    ◮ Compresses data before reasoning materialisation.

Performance of reasoning materialisation
◮ Reasoning materialisation by SPOWL

    SPOWL          LUBM-400   LUBM-800   LUBM-1200   LUBM-1600   LUBM-2000
    Initial Load   9m08s      20m30s     27m50s      41m20s      54m10s
    Reasoning      10m19s     16m28s     33m20s      38m58s      58m08s
    Total Time     19m27s     36m58s     1h01m10s    1h20m18s    1h52m18s

Figure: Time (hh:mm:ss) per dataset, LUBM-400 to LUBM-2000; series: Initial Load, Type Inference.
