Freddies: DHT-Based Adaptive Query Processing via Federated Eddies
Ryan Huebsch, Shawn Jeffery
CS 294-4 Peer-to-Peer Systems
12/9/03
Outline
Background: PIER
Motivation: Adaptive Query Processing (Eddies)
Federated Eddies (Freddies)
- System Model
- Routing Policies
- Implementation
Experimental Results
Conclusions and Continuing Work
PIER
Fully decentralized relational query processing engine
Principles:
- Relaxed consistency
- Organic Scaling
- Data in its Natural Habitat
- Standard Schemas via Grassroots software
Relational queries can be executed in a number of logically equivalent ways
- Optimization step chooses the best performance-wise
- Currently, PIER has no means to optimize queries
Adaptive Query Processing
Traditional query optimization occurs at query time and is based on statistics. This is hard because:
- Catalog (statistics) must be accurate and maintained
- Cannot recover from poor choices
The story gets worse!
- Long running queries:
  Changing selectivity/costs of operators
  Assumptions made at query time may no longer hold
- Federated/autonomous data sources:
  No control/knowledge of statistics
- Heterogeneous data sources:
  Different arrival rates
Thus, Adaptive Query Processing systems attempt to change execution order during the query
- Query Scrambling, Tukwila, Wisconsin, Eddies
Eddies
Eddy: A tuple router that dynamically chooses the order of operators in a query plan
- Optimizes the query at runtime on a per-tuple basis
- Monitors selectivities and costs of operators to determine where to send a tuple next
Currently centralized in design and implementation
- Some other efforts for distributed Eddies from Wisconsin & Singapore (neither uses a DHT)
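
The per-tuple routing idea above can be illustrated with a small sketch. This is not the Eddies implementation: it assumes filter-only operators and a greedy "lowest observed pass rate first" rule, and the names `OpStats`, `Eddy`, and `route` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OpStats:
    """Running counts for one operator, as observed by the router."""
    seen: int = 0    # tuples routed to the operator
    passed: int = 0  # tuples the operator let through

    @property
    def selectivity(self) -> float:
        # Assumption: treat an unobserved operator as passing everything.
        return self.passed / self.seen if self.seen else 1.0


class Eddy:
    """Toy tuple router: sends each tuple to the most selective pending filter."""

    def __init__(self, operators):
        self.operators = operators  # name -> predicate
        self.stats = {name: OpStats() for name in operators}

    def route(self, tup):
        remaining = set(self.operators)
        while remaining:
            # Greedy choice; ties are broken by operator name.
            name = min(remaining, key=lambda n: (self.stats[n].selectivity, n))
            remaining.discard(name)
            stats = self.stats[name]
            stats.seen += 1
            if not self.operators[name](tup):
                return None  # tuple dropped by this operator
            stats.passed += 1
        return tup  # tuple survived every operator
```

Because the statistics are updated on every tuple, the operator order can change in the middle of a query, which a plan fixed at query time cannot do.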
Why use Eddies in P2P? (The easy answers)
Much of the promise of P2P lies in its fully distributed nature
- No central point of synchronization → no central catalog
- Distributed catalog with statistics helps, but does not solve all problems
  Possibly stale, hard to maintain
  Need CAP to do the best optimization
  No knowledge of available resources or the current state of the system (load, etc.)
- This is the PIER Philosophy!
Eddies were designed for a federated query processor
- Changing operator selectivities and costs
- Federated/heterogeneous data sources
Why Eddies in P2P? (The not so obvious answers)
Available compute resources in a P2P network are heterogeneous and dynamically changing
- Where should the query be processed?
In a large P2P system, local data distributions, arrival rates, etc. may be different than global
Freddies: Federated Eddies
A Freddy is an adaptive query processing operator within the PIER framework
Goals:
- Show feasibility of adaptive query processing in PIER
- Build foundation and infrastructure for smarter adaptive query processing
- Establish baseline for Freddy performance to improve upon with smarter routing policies
An Example Freddy
[Figure: an example Freddy with local operators R join S and S join T. From DHT: Get(R), Get(S), Get(T). To DHT: Put (Join Value RS), Put (Join Value ST). Finished tuples go to Output.]
System Model
Same functionality as centralized Eddy
- Allows easy concept reuse
- Freddy uses its Routing Policy to determine the next operator for a tuple
- Tuples in a Freddy are tagged with DoneBits indicating which operators have processed them
- Freddy does all state management, thus existing operators require no modifications
Local processing comes first (in most cases)
- Conserve network bandwidth
- Not as simple as it seems
Freddy: decide how to rehash a tuple
- This determines join order
- Challenge: Decoupling of routing decision and operator. Most Eddy techniques no longer valid
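
To make the rehash decision concrete, here is a minimal sketch under stated assumptions: `ToyDHT` is an in-memory stand-in rather than PIER's DHT interface, and the join attributes `a` (for R join S) and `b` (for S join T) are a hypothetical schema. An S tuple can be published under either join's key, and that choice decides which join it takes part in first.

```python
from collections import defaultdict


class ToyDHT:
    """In-memory stand-in for a DHT: (namespace, key) -> list of tuples."""

    def __init__(self):
        self.store = defaultdict(list)

    def put(self, namespace, key, value):
        self.store[(namespace, key)].append(value)

    def get(self, namespace, key):
        return list(self.store[(namespace, key)])


# Hypothetical schema: S joins R on attribute "a" and joins T on attribute "b".
JOIN_ATTR = {"RS": "a", "ST": "b"}


def rehash_s_tuple(dht, s_tuple, target):
    """Publish an S tuple under the join value of the chosen join.

    Choosing target="RS" means the tuple meets R tuples first; choosing
    target="ST" means it meets T tuples first. This is the routing decision.
    """
    dht.put(target, s_tuple[JOIN_ATTR[target]], s_tuple)
```

In this sketch the decision is made where the tuple is sent from, while the join itself runs wherever the key lands, which is one way to picture the decoupling of routing decision and operator.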
Query Processing in Freddies
Query origin creates a query plan with a Freddy
- Possible routings determined at this time, but not the order
Freddy operators on all participating nodes initiate data flow
As tuples arrive, the Freddy determines the next operator for this tuple based on the DoneBits and routing policy
- Source tuples tagged with clean DoneBits and routed appropriately
When all DoneBits are set, the tuple is sent to the output operator (return to query origin)
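
The DoneBits bookkeeping described above can be sketched as a bitmask. This is an illustration only: the bit assignments, the `Tagged` type, and the rule that every operator with an unset bit is eligible are assumptions, not details taken from PIER.

```python
from dataclasses import dataclass

# Hypothetical bit positions for a query with two join operators.
JOIN_RS = 0b01
JOIN_ST = 0b10
ALL_DONE = JOIN_RS | JOIN_ST


@dataclass(frozen=True)
class Tagged:
    """A tuple plus its DoneBits; source tuples start with clean bits."""
    values: dict
    done_bits: int = 0


def eligible(tup):
    """Operators whose bit is not yet set for this tuple."""
    return [op for op in (JOIN_RS, JOIN_ST) if not tup.done_bits & op]


def next_hop(tup, policy):
    """Send finished tuples to the output; otherwise ask the routing policy."""
    if tup.done_bits == ALL_DONE:
        return "output"
    return policy(eligible(tup))


def mark_done(tup, op):
    """Return a copy of the tuple with the operator's bit set."""
    return Tagged(tup.values, tup.done_bits | op)
```

The `policy` argument is any function that picks one operator from the eligible list, so the routing policy stays separate from the bookkeeping.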
Tuple Routing Policy
Determines to which operator to send a tuple
Local information
- Messages expensive
- Monitor local usage and adjust locally
“Processing Buddy” information
- During processing, discover general trends in input/output nodes’ processing capabilities/output rates, etc.
- For instance, want to alert previous Freddy of poor PUT decisions
Design space is huge → large research area
Freddy Routing Policies
Simple (KISS):
- Static
- Random: Not as bad as you may think
- Local Stat Monitoring (sampling)
More complex:
- Queue lengths
  Somewhat analogous to the “back-pressure” effect
  Monitors DHT PUT ACKs
- Load balancing through “learning” of global join key distribution
- Piggyback stats on other messages
  Don’t need global information, only stats about processing buddies (nodes with which we communicate)
  Different sample than local; may or may not be better
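
Three of the policies listed above can be sketched as interchangeable classes. The class names, the `choose` method, and the `on_put`/`on_ack` hooks are hypothetical; the queue-length variant assumes that "queue length" means PUTs sent minus ACKs received for each target operator.

```python
import random
from collections import Counter


class StaticPolicy:
    """Always prefers operators in one fixed order."""

    def __init__(self, order):
        self.order = list(order)

    def choose(self, candidates):
        return next(op for op in self.order if op in candidates)


class RandomPolicy:
    """Picks uniformly at random among the candidate operators."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)

    def choose(self, candidates):
        return self.rng.choice(sorted(candidates))


class QueueLengthPolicy:
    """Prefers the operator with the fewest unacknowledged PUTs."""

    def __init__(self):
        self.outstanding = Counter()

    def on_put(self, op):
        self.outstanding[op] += 1

    def on_ack(self, op):
        self.outstanding[op] = max(0, self.outstanding[op] - 1)

    def choose(self, candidates):
        # Ties are broken by operator name so the choice is deterministic.
        return min(sorted(candidates), key=lambda op: self.outstanding[op])
```

A slow destination acknowledges PUTs late, so its outstanding count grows and the policy steers new tuples elsewhere, which is the back-pressure analogy in miniature.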
Implementation & Experimental Setup
Design Decisions:
- Simplicity is key
- Roughly 300 NCSS (PIER is about 5300)
- Single query processing operator
- Separate routing policy module loaded at query time
- Possible routing orders determined by simple optimizer
Required generalizations to the PIER execution engine to deal with generic operators
- Allow PIER to run any dataflow operator
Simulator with 256 nodes, 100 tuples/table/node
- Feasibility, not scalability
- In the absence of global knowledge (or with stale knowledge), a static optimizer could choose any join ordering → we compare Freddy performance to all possible static plans
3-way join
R join S join T
- R join S is expensive (multiplies tuple count by 25)
- S join T is highly selective (drops 90%)
Possible static join orderings:
[Figure: join trees for the two static orderings, labeled RST and STR in the results.]
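
A rough calculation shows why the ordering matters for the 3-way join above. The sketch assumes each join scales the current tuple count by a fixed factor (x25 for R join S, x1/10 for S join T) regardless of when it runs. That is a simplification made for illustration, not a model taken from the experiments, and `intermediate_sizes` is a hypothetical helper.

```python
from fractions import Fraction

# Scale factors read off the slide: x25, and "drops 90%" read as x1/10.
FACTOR = {"RS": Fraction(25), "ST": Fraction(1, 10)}


def intermediate_sizes(n, order):
    """Tuple counts after each join step, starting from n input tuples."""
    sizes = []
    current = Fraction(n)
    for join in order:
        current *= FACTOR[join]
        sizes.append(current)
    return sizes


rst = intermediate_sizes(100, ["RS", "ST"])  # R join S first
stu = intermediate_sizes(100, ["ST", "RS"])  # S join T first
print([int(x) for x in rst])  # [2500, 250]
print([int(x) for x in stu])  # [10, 250]
```

Under these assumptions both orderings produce the same final result size, but the first one creates 2500 intermediate tuples where the second creates 10. Those intermediate tuples are what has to be rehashed across the network.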
3 Way Join Results
[Figure: Completion Time (s) and Bandwidth/Node (KB/s) for RST, STR, and Eddy.]
4-way join
R join S join T join U
- S join T is expensive
Possible static join orderings:
[Figure: join trees for five static orderings over R, S, T, and U. One tree is annotated "Note: A traditional optimizer can’t make this plan".]
4-Way Join
[Figure: Completion Time (s) and Bandwidth/Node (KB/s) for RSTU, STRU, STUR, TUSR, Bushy, and Eddy.]
The Promise of Routing Policy
Illustrative example of how routing policy can improve performance
This is not meant to be an exhaustive comparison of policies, but rather to show the possibilities
EddyQL considers the number of outstanding PUTs (queue length) to decide where to send a tuple
[Figure: Aggregate Bandwidth (MB/s) for RST, STR, Eddy, and EddyQL.]
Conclusions and Continuing Work
Freddies provide adaptable query processing in a P2P system
- Require no global knowledge
- Baseline performance shows promise for smarter policies
In the future…
- Explore Freddy performance in a dynamic environment
- Explore more complex routing policies