Optimization of Linear Recursive Queries in SQL
Carlos Ordonez
Abstract—Recursion is a fundamental computation mechanism which has been incorporated into the SQL language. This work focuses on the optimization of linear recursive queries in SQL. Query optimization is studied with two important graph problems: computing the transitive closure of a graph and getting the power matrix of its adjacency matrix. We present SQL implementations for two fundamental algorithms: Seminaive and Direct. Five query optimizations are studied: 1) Storage and indexing; 2) early selection; 3) early evaluation of nonrecursive joins; 4) pushing duplicate elimination; and 5) pushing aggregation. Experiments compare both evaluation algorithms and systematically evaluate the impact of optimizations with large input tables. Optimizations are evaluated on four types of graphs: binary trees, lists, cyclic graphs, and complete graphs, going from the best to worst case. In general, Seminaive is faster than Direct, except for complete graphs. Storing and indexing rows by vertex and pushing aggregation work well on trees, lists, and cyclic graphs. Pushing duplicate elimination is essential for complete graphs, but slows computation for acyclic graphs. Early selection with equality predicates significantly accelerates computation for all types of graphs. Index Terms—Recursive query, SQL, query optimization, transitive closure.
Ç 1 INTRODUCTION
R
ECURSION is fundamental in computer science. Most data
structures, like trees or lists, are recursive. Most importantly, many search algorithms have a natural recursive definition. Despite its prominent importance, recursion was not available in SQL for a long time. But the ANSI ’99 SQL standard introduced recursion into SQL with syntactic constructs to define recursive views and recursive-derived tables. This paper studies the optimiza- tion of linear recursive queries [24], [26] in SQL, which constitute a broad class of queries used in practice [2], [3], [17], [19]. Typical problems solved by linear recursive queries include parent/child relationships, path computa- tions in a graph, and bill of materials. Linear recursive queries have many applications in relational databases. Consider an ancestor/descendant example, with a table containing employee/manager information with the em- ployee id of the manager. Examples are “who are all the employees that are managed directly or indirectly by person X?” or “is person X under person Y in the
- rganization?”. Suppose we have a table relating pairs of
parts in a manufacturing environment where one column identifies one part and the second column corresponds to a subpart in a hierarchical fashion; this is the so-called bill of materials example. Examples are “list all subparts of part X” and “how many subparts does part X have two levels below?”. Assume that there is a geographical table with locations where each row indicates that there exists a road (with distance as an attribute) between two locations. Examples are “which is the shortest path between X and Y?,” “how many different routes are there between X and Y?,” “what is the distance between X and Y ?,” or “what locations cannot be reached from X?”. Although recursive query optimization has been exten- sively studied in the past, mostly in deductive databases [3], [9], [19], [20], [24] and to a lesser extent in relational database systems [2], [5], [6], [11], [13], there is no recent work that studies the optimization of linear recursive queries in SQL. Most research has proposed complex algorithms using sophisticated data structures. Instead,
- ur work studies how to optimize recursive queries with
existing storage organization and indexing mechanisms and relational algebra transformations. Thus, this paper revisits the classical problem of optimizing linear recursive queries, but focusing on SQL. This is a summary of contributions. We present implementations in SQL of two classical algorithms to evaluate linear recursive queries: Seminaive [3] and Direct [2]. We study the optimization of selection-projection-join- aggregation (SPJA) queries, following traditional query
- ptimization principles [7]. SPJA queries represent the most
common and general queries in a relational DBMS. Specifically, five query optimizations are studied: 1. storage and indexing of input, intermediate, and result tables; 2. early selection of rows by pushing predicates; 3. early or late evaluation of nonrecursive (external) joins; 4. pushing duplicate and cycle elimination into inter- mediate recursive steps; and 5. pushing aggregation through recursion. We also study how to improve the performance when there are deep recursion levels, many duplicate rows and cycles. We perform a systematic experimental evaluation with large tables storing graphs with different structure and levels of connectivity.
264 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,
- VOL. 22,
- NO. 2,
FEBRUARY 2010
. The author is with the Department of Computer Science, University of Houston, Houston, TX 77204. E-mail: ordonez@cs.uh.edu. Manuscript received 13 Mar. 2008; revised 7 Nov. 2008; accepted 16 Mar. 2009; published online 31 Mar. 2009. Recommended for acceptance by D. Papadias. For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-2008-03-0142. Digital Object Identifier no. 10.1109/TKDE.2009.83.
1041-4347/10/$26.00 2010 IEEE Published by the IEEE Computer Society Authorized licensed use limited to: University of Houston. Downloaded on January 11, 2010 at 18:00 from IEEE Xplore. Restrictions apply.