

  1. Freddies: DHT-Based Adaptive Query Processing via Federated Eddies
     Ryan Huebsch, Shawn Jeffery
     CS 294-4: Peer-to-Peer Systems, 12/9/03

  2. Outline
     - Background: PIER
     - Motivation: Adaptive Query Processing (Eddies)
     - Federated Eddies (Freddies)
     - System Model
     - Routing Policies
     - Implementation
     - Experimental Results
     - Conclusions and Continuing Work

  3. PIER
     - Fully decentralized relational query processing engine
     - Principles:
       - Relaxed consistency
       - Organic scaling
       - Data in its natural habitat
       - Standard schemas via grassroots software
     - Relational queries can be executed in a number of logically equivalent ways
       - An optimization step chooses the best-performing plan
     - Currently, PIER has no means to optimize queries

  4. Adaptive Query Processing
     - Traditional query optimization occurs at query time and is based on statistics. This is hard because:
       - The catalog (statistics) must be accurate and maintained
       - The optimizer cannot recover from poor choices
     - The story gets worse!
       - Long-running queries:
         - Changing selectivities/costs of operators
         - Assumptions made at query time may no longer hold
       - Federated/autonomous data sources:
         - No control over, or knowledge of, statistics
       - Heterogeneous data sources:
         - Different arrival rates
     - Thus, adaptive query processing systems attempt to change the execution order during the query
       - Examples: Query Scrambling, Tukwila, Wisconsin, Eddies

  5. Eddies
     - An eddy is a tuple router that dynamically chooses the order of operators in a query plan
       - Optimizes the query at runtime on a per-tuple basis
       - Monitors the selectivities and costs of operators to determine where to send each tuple next (a minimal sketch follows below)
     - Currently centralized in design and implementation
       - Some other efforts on distributed eddies from Wisconsin and Singapore (neither uses a DHT)
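To make the per-tuple routing idea concrete, here is a minimal, self-contained sketch of a centralized eddy-style router. It is illustrative only: the class, the operator representation, and the greedy lowest-selectivity heuristic are assumptions made for this sketch; the original eddy of Avnur and Hellerstein instead uses lottery scheduling over per-operator tickets.

```java
import java.util.BitSet;
import java.util.List;
import java.util.function.Predicate;

// Minimal sketch of a centralized eddy-style tuple router (illustrative
// only; not PIER's or the original eddy's actual code).
class MiniEddy {
    static class Op {
        final String name;
        final Predicate<int[]> filter;
        long in = 1, out = 1;                        // running counts -> selectivity estimate
        Op(String name, Predicate<int[]> filter) { this.name = name; this.filter = filter; }
        double selectivity() { return (double) out / in; }
    }

    final List<Op> ops;
    MiniEddy(List<Op> ops) { this.ops = ops; }

    // Route one tuple through all operators, visiting the operator with
    // the lowest observed pass rate first (a simple greedy heuristic).
    boolean route(int[] tuple) {
        BitSet done = new BitSet(ops.size());        // which operators have seen this tuple
        while (done.cardinality() < ops.size()) {
            int next = -1;
            for (int i = 0; i < ops.size(); i++)
                if (!done.get(i) && (next < 0
                        || ops.get(i).selectivity() < ops.get(next).selectivity()))
                    next = i;
            Op op = ops.get(next);
            op.in++;
            done.set(next);
            if (!op.filter.test(tuple)) return false;  // operator drops the tuple
            op.out++;
        }
        return true;                                 // tuple survives every operator: emit it
    }
}
```

Routing a stream of tuples through this loop reorders the operators automatically as their observed selectivities drift, which is the behavior the slide describes.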

  6. Why use Eddies in P2P? (The easy answers)
     - Much of the promise of P2P lies in its fully distributed nature
       - No central point of synchronization, hence no central catalog
       - A distributed catalog with statistics helps, but does not solve all problems
         - Possibly stale, hard to maintain
         - Need CAP to do the best optimization
       - No knowledge of available resources or the current state of the system (load, etc.)
       - This is the PIER philosophy!
     - Eddies were designed for a federated query processor
       - Changing operator selectivities and costs
       - Federated/heterogeneous data sources

  7. Why Eddies in P2P? (The not-so-obvious answers)
     - Available compute resources in a P2P network are heterogeneous and dynamically changing
       - Where should the query be processed?
     - In a large P2P system, local data distributions, arrival rates, etc. may differ from the global ones

  8. Freddies: Federated Eddies
     - A Freddy is an adaptive query processing operator within the PIER framework
     - Goals:
       - Show the feasibility of adaptive query processing in PIER
       - Build a foundation and infrastructure for smarter adaptive query processing
       - Establish a performance baseline for Freddies to improve upon with smarter routing policies

  9. An Example Freddy
     [Dataflow diagram: Get(R), Get(S), and Get(T) pull tuples for tables R, S, T from the DHT into the Freddy; local operators compute R join S and S join T; Put (join value RS) and Put (join value ST) rehash tuples back to the DHT; results flow to the Output operator.]

  10. System Model
     - Same functionality as a centralized eddy
       - Allows easy concept reuse
     - A Freddy uses its routing policy to determine the next operator for a tuple
     - Tuples in a Freddy are tagged with DoneBits indicating which operators have already processed them
     - The Freddy does all state management, so existing operators require no modifications
     - Local processing comes first (in most cases)
       - Conserves network bandwidth
       - Not as simple as it seems
     - The Freddy must decide how to rehash each tuple
       - This determines the join order
     - Challenge: the routing decision is decoupled from the operator, so most eddy techniques are no longer valid

  11. Query Processing in Freddies
     - The query origin creates a query plan containing a Freddy
       - The possible routings are determined at this time, but not their order
     - Freddy operators on all participating nodes initiate data flow
     - As tuples arrive, the Freddy determines the next operator for each tuple based on its DoneBits and the routing policy (see the sketch below)
       - Source tuples are tagged with clean DoneBits and routed appropriately
     - When all DoneBits are set, the tuple is sent to the output operator (returned to the query origin)
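A sketch of this per-tuple flow, in the same illustrative style as before (Dht, RoutingPolicy, joinKey, and sendToOutput are assumed placeholder names, not PIER's real interfaces):

```java
import java.util.BitSet;

// Sketch of the per-tuple Freddy flow described on this slide. All
// names here are illustrative assumptions, not PIER's actual API.
class FreddySketch {
    interface Dht { void put(String key, Object value); }
    interface RoutingPolicy { int choose(BitSet doneBits); }
    static class Tuple { BitSet doneBits = new BitSet(); Object[] fields; }

    final int numOperators;
    final Dht dht;
    final RoutingPolicy policy;

    FreddySketch(int numOperators, Dht dht, RoutingPolicy policy) {
        this.numOperators = numOperators;
        this.dht = dht;
        this.policy = policy;
    }

    void onTupleArrival(Tuple t) {
        if (t.doneBits.cardinality() == numOperators) {
            sendToOutput(t);                   // all DoneBits set: return to the query origin
            return;
        }
        int next = policy.choose(t.doneBits);  // routing policy picks an eligible operator
        t.doneBits.set(next);
        // Rehashing on the chosen join's key ships the tuple, via a DHT
        // put, to the node responsible for that key; this per-tuple
        // choice is what fixes the join order.
        dht.put(joinKey(t, next), t);
    }

    String joinKey(Tuple t, int op) { return op + ":" + t.fields[op]; }  // placeholder key scheme
    void sendToOutput(Tuple t) { /* deliver the result to the query origin */ }
}
```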

  12. Tuple Routing Policy
     - Determines to which operator to send a tuple
     - Local information
       - Messages are expensive, so monitor local usage and adjust locally
     - "Processing buddy" information
       - During processing, discover general trends in input/output nodes' processing capabilities, output rates, etc.
       - For instance, we want to alert the previous Freddy of poor PUT decisions
     - The design space is huge: a large research area

  13. Freddy Routing Policies
     - Simple (KISS):
       - Static
       - Random: not as bad as you may think
       - Local statistics monitoring (sampling)
     - More complex:
       - Queue lengths (sketched below)
         - Somewhat analogous to the "back-pressure" effect
         - Monitors DHT PUT ACKs
       - Load balancing through "learning" of the global join-key distribution
       - Piggyback statistics on other messages
         - No global information needed, only statistics about processing buddies (the nodes with which we communicate)
         - A different sample than local statistics: may or may not be better
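As one concrete instance from the "more complex" list, a queue-length policy can track outstanding DHT PUTs per destination operator and prefer the shortest queue, giving the back-pressure effect mentioned above. A minimal sketch, with assumed counter and callback names; it matches the RoutingPolicy shape from the earlier FreddySketch, so it could be plugged in there:

```java
import java.util.BitSet;
import java.util.Random;

// Sketch of a queue-length ("back-pressure") routing policy: prefer the
// eligible operator with the fewest unacknowledged DHT PUTs. The
// counters and callback names are illustrative assumptions.
class QueueLengthPolicy {
    final int[] outstandingPuts;   // PUTs sent minus ACKs received, per operator
    final Random rnd = new Random();

    QueueLengthPolicy(int numOperators) {
        this.outstandingPuts = new int[numOperators];
    }

    // Called by the Freddy when a tuple needs its next operator.
    int choose(BitSet doneBits) {
        int best = -1;
        for (int i = 0; i < outstandingPuts.length; i++) {
            if (doneBits.get(i)) continue;             // operator already applied to this tuple
            if (best < 0 || outstandingPuts[i] < outstandingPuts[best]
                    || (outstandingPuts[i] == outstandingPuts[best] && rnd.nextBoolean()))
                best = i;                              // shortest queue wins; ties broken randomly
        }
        return best;
    }

    void onPutSent(int op)  { outstandingPuts[op]++; } // monitor outgoing DHT PUTs
    void onPutAcked(int op) { outstandingPuts[op]--; } // an ACK drains the queue
}
```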

  14. Implementation & Experimental Setup
     - Design decisions:
       - Simplicity is key: roughly 300 NCSS (PIER is about 5300)
       - A single query processing operator
       - A separate routing policy module, loaded at query time
       - Possible routing orders are determined by a simple optimizer
     - Required generalizing the PIER execution engine to deal with generic operators
       - Allows PIER to run any dataflow operator
     - Simulator with 256 nodes and 100 tuples/table/node
       - Goal is feasibility, not scalability
     - In the absence of global knowledge (or with stale knowledge), a static optimizer could choose any join ordering, so we compare Freddy performance against all possible static plans

  15. 3-Way Join
     - R join S join T
     - R join S is expensive (multiplies the tuple count by 25)
     - S join T is highly selective (drops 90% of tuples)
     - Possible static join orderings: (R join S) join T, and (S join T) join R
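To see why the ordering matters, assume each base table holds N tuples and the stated factors compose independently (an illustrative simplification): the plan (R join S) join T first materializes 25N intermediate tuples that must be rehashed through the DHT before the selective join shrinks them to 2.5N, while (S join T) join R first shrinks the input to 0.1N and then grows it to the same 2.5N result. Both orders produce identical output, but the first ships roughly 250x more intermediate tuples across the network.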

  16. 3-Way Join Results
     [Chart: completion time (s, 0-1000) vs. bandwidth per node (25, 50, 100, 150 KB/s) for the static plans RST and STR and for the Eddy.]

  17. 4-Way Join
     - R join S join T join U
     - S join T is expensive
     - Possible static join orderings: [diagram of the static join trees over R, S, T, and U, including a bushy plan]
     - Note: a traditional optimizer can't produce the bushy plan

  18. 4-Way Join Results
     [Chart: completion time (s, 0-350) vs. bandwidth per node (50, 75, 100, 125, 150 KB/s) for the static plans RSTU, STRU, STUR, TUSR, the bushy plan, and the Eddy.]

  19. The Promise of Routing Policy
     - An illustrative example of how routing policy can improve performance
     - Not meant to be an exhaustive comparison of policies, but rather to show the possibilities
     - EddyQL considers the number of outstanding PUTs (queue length) to decide where to send each tuple
     [Chart: aggregate bandwidth (MB/s, 0-120) for RST, STR, Eddy, and EddyQL.]

  20. Conclusions and Continuing Work
     - Freddies provide adaptive query processing in a P2P system
       - They require no global knowledge
       - Baseline performance shows promise for smarter policies
     - In the future...
       - Explore Freddy performance in a dynamic environment
       - Explore more complex routing policies

  21. Questions? Comments? Snide remarks for Ryan? Glorious praise for Shawn? Thanks!
