Pushing XPath Accelerator to its Limits
Christian Grün, Marc Kramis

1st International Workshop on Performance and Evaluation of Data Management Systems (EXPDB 2006), June 30
Pushing XPath Accelerator to its Limits
Christian Grün, Marc Kramis, Alexander Holupirek, Marc H. Scholl, Marcel Waldvogel
Department of Computer and Information Science, University of Konstanz

Slide 2: Overview
• processing large XML documents
• our two prototypes
• a benchmark framework
• performance results
• what we will do next

Slide 3: Motivation
Observation
• sizes of XML instances are continuously growing:
  - Library Data, U Konstanz: 2 GB
  - DBLP: 300 MB
  - Wikipedia: 5 GB up to 500 GB
  - Log files: > 10 GB
  - …
Fact
• XML processors are needed to handle these documents
• current XML processors usually fail:
  - by design (on-the-fly parsing, 2 GB limit, indexing overhead, …)
  - by technical limits (main memory barrier, swapping, …)

Slide 4: Motivation
MonetDB/XQuery
• based on the Pathfinder project, developed in Konstanz
• XPath Accelerator: relational XML encoding
• StairCase Join: very efficient path traversal
• Loop-Lifting: linear execution of nested loops
Identified Bottlenecks (Challenges…)
• main memory limitation
• no content/value indexes

Slide 5: Motivation – Two Approaches
BaseX: shrink main memory representation
  → pure main memory processing
  → compressed representation of XPath Accelerator encoding
  → introduction of an inherent value index
Idefix: optimize disk layout
  → persistent native XML storage
  → constant scalability
  → logarithmic updateability

Slide 6: BaseX – Memory Architecture
Node Table
  Pre  Par  Tag/Content        Kind  AttName  AttVal
  1    0    db                 elem
  2    1    address            elem  id       add0
  3    2    name               elem  title    Prof.
  4    3    Hack Hacklinson    text
  5    2    street             elem
  6    3    Alley Road 43      text
  7    2    city               elem
  8    3    Chicago, IL 60611  text
  9    1    address            elem  id       add1
  10   9    name               elem
  11   10   Jack Johnson       text
  12   9    street             elem
  13   10   Pick St. 43        text
  14   9    city               elem
  15   10   Phoenix, AZ 85043  text

Representation
  Parent (32 bit)  Kind/Token (1/31 bit)  Attributes (10/22 bit)
  ...0000          0.....0000             nil
  ...0001          0.....0001             0000...0000
  ...0010          0.....0010             0001...0001
  ...0011          1.....0000             nil
  ...0010          0.....0011             nil
  ...0011          1.....0001             nil
  ...0010          0.....0100             nil
  ...0011          1.....0010             nil
  ...0001          0.....0001             0000...0010
  ...0010          0.....0010             nil
  ...              ...                    ...

Index storage (→ numeric references)
  ID    Tag
  0000  db
  0001  address
  0010  name
  0011  street
  0100  city

  ID    Text
  0000  Hack Hacklinson
  0001  Alley Road 43
  0010  Chicago, IL 60611
  0011  Jack Johnson
  0100  Pick St. 43
  0101  Phoenix, AZ 85043

  ID    AttName
  0000  id
  0001  title

  ID    AttValue
  0000  add0
  0001  Prof.
  0010  add1
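The bit widths in the column headers (1/31 bit for Kind/Token, 10/22 bit for Attributes) suggest that each of these columns fits into a single Java int. The following is a minimal sketch of such packing, not BaseX's actual code. The class name NodeTableSketch, the method names, and the choice of the top bit for the node kind are assumptions made for illustration.

```java
/** Hypothetical sketch: packs node table columns into ints, following the bit widths on the slide. */
final class NodeTableSketch {
    static final int ELEM = 0; // assumed encoding: kind bit 0 = element
    static final int TEXT = 1; // assumed encoding: kind bit 1 = text

    /** 1 bit kind (assumed to be the top bit) + 31 bit token ID (tag ID or text ID). */
    static int packKindToken(int kind, int tokenId) {
        return (kind << 31) | (tokenId & 0x7FFFFFFF);
    }

    static int kind(int packed) {
        return packed >>> 31; // unsigned shift, so a set top bit yields 1
    }

    static int tokenId(int packed) {
        return packed & 0x7FFFFFFF;
    }

    /** 10 bit attribute name ID + 22 bit attribute value ID. */
    static int packAttribute(int nameId, int valueId) {
        return (nameId << 22) | (valueId & 0x3FFFFF);
    }

    static int attName(int packed) {
        return packed >>> 22;
    }

    static int attValue(int packed) {
        return packed & 0x3FFFFF;
    }

    public static void main(String[] args) {
        // Node 3 of the example: element "name" (tag ID 2) with title (ID 1) = "Prof." (ID 1)
        int kindToken = packKindToken(ELEM, 2);
        int attribute = packAttribute(1, 1);
        System.out.println(kind(kindToken) + " " + tokenId(kindToken));
        System.out.println(attName(attribute) + " " + attValue(attribute));
    }
}
```

With this layout a node without attributes still needs a marker such as the nil entries on the slide; how that marker is represented is left open here.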

Slide 7: BaseX – Querying
Value Indexing
• Text and AttributeValue indexes are extended by references to Pre values (→ inverted index)
• small memory overhead (12–18%)
Query Optimization
• predicates are evaluated first (selection pushdown)
• internal index axis and cs() kind test are added for predicate evaluation
• queries are inverted & rewritten
Example
  /db/address[@id = "add0"]/name
  → index::node()[@id = "add0"]/parent::address[parent::db/parent::cs()]/child::name
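The inverted query starts at the value index and verifies the path upward through parent references instead of scanning downward from the root. A minimal sketch of that idea in Java follows. Everything in it is an assumption for illustration: the class name ValueIndexSketch, the map-based structures, and the reading of cs() as "the original context, here the document root". The node numbers are the element rows of the slide 6 example; text nodes are omitted, and the check that the attribute is named id is skipped for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of index-first evaluation of /db/address[@id = value]/name. */
final class ValueIndexSketch {
    // pre -> parent pre for the element nodes of the example (0 = document root)
    static final Map<Integer, Integer> PARENT =
        Map.of(1, 0, 2, 1, 3, 2, 5, 2, 7, 2, 9, 1, 10, 9);
    // pre -> tag name
    static final Map<Integer, String> TAG =
        Map.of(1, "db", 2, "address", 3, "name", 5, "street", 7, "city", 9, "address", 10, "name");
    // inverted index: attribute value -> pre values of the elements carrying it
    static final Map<String, List<Integer>> ATT_VALUE_INDEX =
        Map.of("add0", List.of(2), "Prof.", List.of(3), "add1", List.of(9));

    static List<Integer> query(String value) {
        List<Integer> result = new ArrayList<>();
        // Step 1: predicate first, answered directly by the index
        for (int pre : ATT_VALUE_INDEX.getOrDefault(value, List.of())) {
            // Step 2: verify the inverted path address -> db -> root
            boolean isAddress = "address".equals(TAG.get(pre));
            int par = PARENT.getOrDefault(pre, -1);
            boolean underDb = "db".equals(TAG.get(par)) && PARENT.getOrDefault(par, -1) == 0;
            if (!isAddress || !underDb) continue;
            // Step 3: remaining forward step child::name
            for (Map.Entry<Integer, Integer> e : PARENT.entrySet()) {
                if (e.getValue() == pre && "name".equals(TAG.get(e.getKey()))) {
                    result.add(e.getKey());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(query("add0"));
    }
}
```

Only the nodes reached through the index are touched, which is why the predicate is worth evaluating first when it is selective.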

Slide 8: Idefix – Data Structures
[Figure with three parts: Concept, Shredding, Block Storage]

Slide 9: Perfidix – Java Benchmarking Framework
Task
• automate tedious manual benchmarking tasks
  - generic
  - à la JUnit
  - integration (Eclipse, Ant, …)
Output
• console or XML
• per benchmark (n runs):
  - minimum, maximum, average, standard deviation, 95% confidence interval
Discussion
• Java memory management
• benchmark history

Slide 10: Perfidix – Java Benchmarking Framework (cont.)
Code Example
  public class DemoBench extends Benchmarkable {
    public DemoBench() {...}      // one-time initialization
    public void setUp() {...}     // per-run & method preparation
    public void tearDown() {...}  // per-run & method cleanup
    public void benchFoo() {...}  // method Foo to bench
    public void benchBar() {...}  // method Bar to bench
  }
Output Example
  ===================================================================================================================
  | -        | unit | sum      | min     | max     | avg        | stddev     | conf95                    | runs |
  ===================================================================================================================
  | benchFoo | ns   |  8023000 |   19000 | 3822000 |   80230.00 |  376167.54 | [6501.16,153958.84]       | 100  |
  | benchBar | ns   |  3951000 |   15000 |  778000 |   39510.00 |   74585.05 | [24891.33,54128.67]       | 100  |
  |_________________________________________________________________________________________________________________|
  | TOTAL    | ns   | 11974000 | 3951000 | 8023000 | 5987000.00 | 2036000.00 | [3165247.96,8808752.04]   |      |
  ===================================================================================================================
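The statistics columns can be reproduced with a few lines of Java. The sketch below is an illustration, not Perfidix source code: the class name BenchStatsSketch and its methods are hypothetical. It assumes a population standard deviation (dividing by n) and a normal-approximation interval, avg ± 1.96 · stddev / √n. Both assumptions are consistent with the numbers in the output example, for instance the TOTAL row computed over the two per-method sums, but whether Perfidix computes them this way internally is not stated on the slide.

```java
import java.util.Locale;

/** Hypothetical sketch of the per-benchmark statistics shown in the output example. */
final class BenchStatsSketch {
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    /** Population standard deviation (divides by n, not n - 1). */
    static double stddev(double[] xs) {
        double m = mean(xs);
        double squares = 0;
        for (double x : xs) squares += (x - m) * (x - m);
        return Math.sqrt(squares / xs.length);
    }

    /** 95% confidence interval under a normal approximation: mean ± 1.96 * stddev / sqrt(n). */
    static double[] conf95(double[] xs) {
        double m = mean(xs);
        double half = 1.96 * stddev(xs) / Math.sqrt(xs.length);
        return new double[] {m - half, m + half};
    }

    public static void main(String[] args) {
        // The two per-method sums from the output example (TOTAL row input)
        double[] sums = {8023000, 3951000};
        double[] ci = conf95(sums);
        System.out.println(String.format(Locale.ROOT, "avg=%.2f stddev=%.2f conf95=[%.2f,%.2f]",
            mean(sums), stddev(sums), ci[0], ci[1]));
    }
}
```

With only two samples, as in the TOTAL row, the normal approximation is crude; a Student's t quantile would give a wider interval.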

Slide 11: Evaluation
Systems
• MonetDB & BaseX
  - main memory based processing
  - similar data structures
• X-Hive & Idefix
  - persistent disk storage
  - comparable scalability
Benchmark Queries
• XMark, 110 KB – 22 GB
• six value-oriented DBLP queries, 300 MB

Slide 12: Evaluation – Scalability
[Figure: XMark queries; x-axis: number of query, y-axis: execution time in sec. Panels MonetDB and BaseX with document sizes 111 KB, 1 MB, 11 MB, 111 MB, 1 GB, 11 GB; panels X-Hive and Idefix with document sizes 111 KB, 1 MB, 11 MB, 111 MB, 1 GB, 11 GB, 22 GB]

Slide 13: Evaluation – XMark
[Figure: XMark queries; x-axis: number of query, y-axis: execution time in sec. Systems: MonetDB, BaseX, X-Hive, Idefix; document sizes: 111 MB, 1 GB, 11 GB]

Slide 14: Evaluation – DBLP
DBLP queries (x-axis: number of query, y-axis: execution time in sec.)
contains() function:
  [1] /dblp/*[contains(title, 'XPath')]
range query:
  [2] /dblp/*[year/text() < 1940]/title
exact predicate match:
  [3] /dblp//inproceedings[contains(@key, '/edbt/')][year/text() = 2004]
  [4] /dblp/article[author/text() = 'Alan M. Turing']
  [5] //inproceedings[author/text() = 'Jim Gray']/title
  [6] //article[author/text() = 'Donald D. Chamberlin'][contains(title, 'XQuery')]
[Figure legend: MonetDB, BaseX (no index), BaseX (with value index)]
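To make the semantics of these value-oriented queries concrete, the sketch below runs three of them with the JDK's built-in XPath 1.0 engine (javax.xml.xpath) against a tiny, made-up document in DBLP's shape. The class name DblpQuerySketch and the sample records are hypothetical and are not real DBLP data. This engine parses the whole document on every call, so the sketch says nothing about the performance of the evaluated systems.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical sketch: evaluates DBLP-style queries on a tiny in-memory document. */
final class DblpQuerySketch {
    // Made-up sample records, only shaped like DBLP entries
    static final String XML =
        "<dblp>"
        + "<article key='a1'><author>Alan M. Turing</author>"
        + "<title>Sample title A</title><year>1937</year></article>"
        + "<article key='a2'><author>Jane Doe</author>"
        + "<title>Sample title on XPath</title><year>2004</year></article>"
        + "</dblp>";

    /** Returns the number of nodes selected by the query. */
    static int count(String query) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                query, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("invalid query: " + query, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("/dblp/*[contains(title, 'XPath')]"));                // query [1]
        System.out.println(count("/dblp/*[year/text() < 1940]/title"));                // query [2]
        System.out.println(count("/dblp/article[author/text() = 'Alan M. Turing']"));  // query [4]
    }
}
```

Queries [1] and [4] both depend on text content, which is the case the value index on slide 7 targets.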

Slide 15: Lessons Learned
• hard-coded queries might blur evaluation results
• comparison troublesome with different systems
  - granularity of measurements (shredding, compilation, serialization, …)
  - impact of different system components (storage, query)
  - availability of different features (updates, complete query implementation)
• handling of serialization output
• assure correctness of large results
• many factors to measure:
  - CPU load
  - memory I/O
  - disk I/O
  - memory consumption

Slide 16: Future Work
Merge BaseX & Idefix
• comprehensive support for value-based queries
• full-text queries, including scoring algorithms (like SRA/INEX)
• optimize XML table compression
• optimize disk layout (hybrid, networked, and holographic storage)
• write Pathfinder plugin to support XQuery
• complete update implementation
Benchmarking
• use of virtual machines for benchmark reproducibility
• specify benchmark for XML updates
• application benchmark for email storage
