5 storage and manipulation
play

5. Storage and Manipulation Foundations of XML Data of SSD - PDF document

5. Storage and Manipulation Foundations of XML Data of SSD Manipulation Shamelessly inspired by Ioana Manolescu tutorial Giorgio Ghelli The problem Classifying stores Consider the queries Essential criteria: $doc //


  1. 5. Storage and Manipulation Foundations of XML Data of SSD Manipulation Shamelessly “inspired” by Ioana Manolescu tutorial Giorgio Ghelli The problem Classifying stores • Consider the queries • Essential criteria: – $doc // e-mail – Clustering – $doc //.[name = 'ghelli']/e-mail – Encoding of parent/child relationship • We do not want to bring the whole $doc in main memory • Set manipulation rather than tuple manipulation OEM data model Storing OEM Bib • No schema! &o1 complex object • Storing objects in LORE: paper paper book – Store the graph, clustered in depth-first references references &o12 &o24 &o29 order author pages author author title http titlepublisher title author year author – Operator: Scan(document,path), returns a &o43 &25 &96 set of objects 1997 last firstname firstname lastname first lastname atomic object &243 &206 “Abiteboul” “Serge” “Victor” “Vianu” 122 133

  2. Indexing in LORE Access plans • Top down or bottom up navigation? • VIndex( l , o, pred ): all objects o with an select x incoming l-edge, satisfying the predicate from A.B x • LIndex( o , l , p): all parents of o via an l - where x.C = 5 edge A A A D D • BIndex(x, l , y): all edges labeled l D D B B B B B B B B B C C C C C C C C C 5 5 5 4 4 5 4 4 5 Access plans: bottom up Bottom up and top down select x select x from A.B x from A.B x where x.C = 5 where x.C = 5 Return(n2) Return(n2) Name(n3,'A') Name(n3,'A') Return(n1) NLJoin LIndex(n2,'B',n3) Select(n2,'=5') NLJoin LIndex(n2,'B',n3) LIndex(n1,'C',n2) Scan(n1,'C',n2) VIndex('C',n1,'=5') LIndex(n1,'C',n2) VIndex('C',n1,'=5') Scan('A','B',n1) Hybrid access plans Path indexes select x • PIndex( p , o): all objects reachable by from A.B x the path p where x.C = 5 Intersect Project(n2) Project(n2) Name(n1,'A') LIndex(n3,'C',n2) BIndex(n1,'B',n2) VIndex('C',n3,'=5')

  3. Using a path index DataGuides select x • Introduced in Lore: a compact representation of all paths in the graph from A.B x • A DG for s is an OEM object d such that: where x.C = 5 – Every path of s reaches exactly one object in d – Every path in d is a path for s Intersect • DG: a schema a posteriori • Used to: PIndex('A.B',n2) Project(n2) – Inform the user LIndex(n1,'C',n2) – Expand wildcards in paths – Inform the optimizer VIndex('C',n1,'=5') DataGuides for trees Building a DataGuide A A A • Similar to converting a NFA to a DFA • Linear time for trees D D B D B D B • Exponential in time and space for C C C C C C graphs A A A • In practice, works well for regular ? structures D D B D B D B A C C A C C A C Which dataguide is better? Optimization via dataguide • Minimal dataguide • Expanding paths: a//z => a/b/z | a/c/d/z • Strong dataguide: if p1 and p2 both reach the • Deleting paths that are not in the data same node in d , then p1 and p2 have the • Contracting paths: a/c/e/d/z => //d/z same target set in s – Each target-set in the source has its own node – May be more efficient for a bottom-up and in the guide evaluation – Easy to build • Keeping statistic information – Easy to maintain, by keeping track of the many-to- many node correspondence between s and d – However, statistic are needed for every k- length path, not just for rooted paths

  4. Graph Schemas Graph schemas A A A • Each edge in the scheme is labeled by a label predicate (a set of labels) D D B D B D ∨ B • Predicates are deterministic A C C A C C A C • Conformance is defined by a simulation between s and d: A – Root of data in root of schema – For every n1 in d1 with n1-l-n2, we have d1-l-d2 B C ¬ (B ∨ C) with n2 in d2 B • No request for surjectivity, or injectivity D E Graph indexing • Group nodes in sets, possibly disjoint • Store the extent of each set XML Storage in RDBMS • Grouping criteria: – Reachable by exactly the same Forward paths: 1- index – Indistinguishable by any F&B path: FB-index – Indistinguishable by the paths in a set Q: covering indexes – Indistinguishable by any path longer than k: A(k) index Using RDBMS for XML Storing data • Advantages • Data may be schema-less – Transactions • Data may have a schema – Optimization • Data may change its schema over time • Issues – Data storage – Query translation

  5. Approaches Unknown schema: the edge table From Pos Tag To Data employees 1 • Based on schema: - 1 employees 1 1 1 emp 2 – Based only on schema 2 1 ID 3 1 emp 11 emp 21 2 2 FN 4 Nancy emp – Based on schema + cost informations 2 2 3 BD 5 5 1 Day 6 8 • Unknown schema: FstNm BirthD HiringD ID 5 2 Month 7 dec 3 4 5 9 5 3 Year 8 1968 – Derive schema from data Day 6 M 7 1 N Y 8 2 4 HD 9 8 dec 1968 – Generic approach 9 … … … 1 2 emp 11 • User defined 11 … … … 1 3 emp 21 Navigating the edge table Partitioned edge table employees From Pos Tag To Data • //FN/text(): - 1 employees 1 From Pos To Data select e.Data from edge e where e.Tag = 'FN' 1 1 emp 2 - 1 1 2 1 ID 3 1 • //emp[ID=‘1’]/FN/text() 2 2 FN 4 Nancy emp select e3.Data 2 3 BD 5 From Pos To Data srom edge e1, edge e2, edge e3 5 1 Day 6 8 1 1 2 5 2 Month 7 dec where e1.Tag = ‘emp’ and e1.to = e2.from 1 2 11 and e2.Tag = ‘ID’ and e2.Data = 1 5 3 Year 8 1968 1 3 21 and e1.to = e3.from and e3.Tag = ‘FN’ 2 4 HD 9 9 … … … ID • Navigation through multi-way join (XPath to 1 2 emp 11 FO translation) From Pos To Data 11 … … … 2 1 3 1 1 3 emp 21 Navigation Related storage schemes • //emp[ID=‘1’]/FN/text() • The universal relation: select e1.Data – employees � � � emp � � �� ID � � �� FN � � �� … from edge e1, edge e2, edge e3 • Materialized views over edges: where e1.Tag = ‘emp’ and e1.to = e2.from and e2.Tag = ‘ID’ and e2.Data = 1 – emp � � ID � � FN � � HD … and e1.to = e3.from and e3.Tag = ‘FN’ � select FN.Data • The STORED approach: from emp, ID, FN – Materialized views based on pattern where emp.to = ID.from and ID.Data = 1 frequencies in the database and emp.to = FN.from – Overflow tables for the rest • Joining smaller tables

  6. Flat storage Path partitioning in Monet employees • For each root-to-inner-node path: emp emp emp – Path(n1,n2,ord): FstNm BirthD HiringD ID • employees.emp{(1,2,1);(1,11,2);(1,21,3)} • employees.emp.ID{(2,3,1);…} Day 6 M 7 1 N Y 8 8 dec 1968 • For each root-to-leaf path: – Path(n1,val) ID First Name BD-D BD-M BD-Y • employees.emp.ID.text{(3,’1’);…} 1 Nancy 8 dec 1968 2 Andrew 19 feb 1952 • Path summary 3 Janet 30 aug 1963 • No join for linear path expressions 4 Margaret 19 sep 1958 XRel approach: interval coding XParent • Tables: • Tables: – Path(PathID,PathExpr) – LabelPath(ID, lengh, pathExpr) – Element(DocID, PathID, Start, End, Ordinal) – Data(pathID, nID, ord, value) – Text(DocID, PathID, Start, End, Value) – Element(pathID, ord, nID) – Attribute(DocID, PathID, Start, End, Value) – ParentChild(pID, cID) • Ancestor relation: • Use ParentChild instead of interval – N1.start< N2.start and N2.end > N1.end coding: equi-join instead of <-join • Path expression: regexp matching with Path table, join the result with the data tables Schema-driven storage Sharing the address DB Employees Depts DB Employees Depts Employee* Dep* Name Employee* Dep* Address Name Street Number E-mail* Address GN FN Street GN FN Number Phone? Employee(PId,Id,Name,Name.GN,Name.FN) Dep(PId,Id) DB(Id,Employees,Depts) Address(PId,Id,Address,Addr.Street,Addr.Number) Employee(PId,Id,Name,Name.GN,Name.FN,Phone) E-Mail(PId,Id,E-mail) Employee(PId,Id,…,Address,Addr.Street,Addr.Number) Depts(PId,Id,Address,Addr.Street,Addr.Number) Dep(PId,Id,Address,Addr.Street,Addr.Number)

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend