SLIDE 1

Technische Universität München

Collaborative Query Coordination in Community-Driven Data Grids

Tobias Scholl, Angelika Reiser, and Alfons Kemper
Department of Computer Science, Technische Universität München, Germany

HPDC '09

SLIDE 2

Community-Driven Data Grids (HiSbase)

SLIDE 3

2009-06-13 HPDC 2009 – Collaborative Query Processing

The AstroGrid-D Project

  • German Astronomy Community Grid (http://www.gac-grid.org/)
  • Funded by the German Ministry of Education and Research
  • Part of D-Grid

SLIDE 4

Up-Coming Data-Intensive Applications

  • Alex Szalay, Jim Gray (Nature, 2006): “Science in an exponential world”
  • Data rates
    – Terabytes a day/night
    – Petabytes a year
  • LHC
  • LSST
  • LOFAR
  • Pan-STARRS

Figures: LOFAR, LHC

SLIDE 5

The Multiwavelength Milky Way

http://adc.gsfc.nasa.gov/mw/

SLIDE 6

Research Challenges

  • Directly deal with Terabyte/Petabyte-scale data sets
  • Integrate with existing community infrastructures
  • High throughput for growing user communities

SLIDE 7

Current Sharing in Data Grids

  • Data autonomy
  • Policies allow partners to access data
  • Each institution ensures
    – Availability (replication)
    – Scalability
  • Various organizational structures [Venugopal et al. 2006]:
    – Centralized
    – Hierarchical
    – Federated
    – Hybrid

SLIDE 8

Community-Driven Data Grids (HiSbase)

SLIDES 9-12

“Distribute by Region – not by Archive!”

SLIDE 13

Mapping Data to Nodes

SLIDE 14

Submission Characteristics

  • Portal-based submission
  • Browser in every researcher's "tool box"
  • Scalability depends on portal
  • Institution-based submission
  • All data nodes accept queries
  • Submission via local data node

SLIDE 15

Coordinator Selection Strategies

  • The node submitting the query
    – SelfStrategy (SS)
  • A node containing relevant data (region-based strategies)
    – FirstRegionStrategy (FRS)
    – SelfOrFirstRegionStrategy (SOFRS)
    – CenterOfGravityStrategy (COGS)
    – RandomRegionStrategy (RRS)
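
The first three strategies can be sketched as follows. This is a minimal illustration, not the HiSbase implementation: the `Query` type, the `region_owner` mapping from region id to node name, and the idea that the "first" region is simply the first entry of an ordered list of relevant regions are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Query:
    """Hypothetical query descriptor (not from the slides)."""
    submit_node: str                   # node that received the query
    relevant_regions: Tuple[int, ...]  # ids of regions overlapping the query, in a fixed order


def self_strategy(q: Query, region_owner: Dict[int, str]) -> str:
    # SS: the submitting node always coordinates.
    return q.submit_node


def first_region_strategy(q: Query, region_owner: Dict[int, str]) -> str:
    # FRS: the node covering the first relevant region coordinates.
    return region_owner[q.relevant_regions[0]]


def self_or_first_region_strategy(q: Query, region_owner: Dict[int, str]) -> str:
    # SOFRS: keep coordination local if the submitting node covers
    # relevant data, otherwise fall back to FRS.
    owners = {region_owner[r] for r in q.relevant_regions}
    if q.submit_node in owners:
        return q.submit_node
    return first_region_strategy(q, region_owner)
```

With `region_owner = {1: "a", 2: "b", 3: "b", 4: "c"}` and a query submitted at node `"c"` over regions `(2, 3, 4)`, SS and SOFRS pick `"c"`, while FRS picks `"b"`.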

SLIDE 16

SelfStrategy (SS)

SLIDE 17

FirstRegionStrategy (FRS)

SLIDE 18

SelfOrFirstRegionStrategy (SOFRS)

  • Combination of SelfStrategy and FirstRegionStrategy
  • The submitting node is the coordinator if it covers relevant data
  • Avoids unnecessary data transport
  • With many partitions and many nodes, basically the same as FirstRegionStrategy (as the probability of the Self-case decreases)
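
Why the Self-case becomes rare can be illustrated with a back-of-the-envelope calculation. The model below is an assumption, not taken from the slides: queries are submitted at a node chosen uniformly at random, and the regions relevant to a query are hosted by a fixed number of distinct nodes.

```python
from fractions import Fraction


def self_case_probability(owner_nodes: int, total_nodes: int) -> Fraction:
    """Probability that a uniformly chosen submitting node is one of the
    nodes hosting data relevant to the query (simplified model)."""
    if total_nodes <= 0 or not 0 <= owner_nodes <= total_nodes:
        raise ValueError("need 0 <= owner_nodes <= total_nodes and total_nodes > 0")
    return Fraction(owner_nodes, total_nodes)
```

Under this model, a query whose data sits on 3 nodes hits the Self-case with probability 3/10 in a 10-node network but only 3/1000 in a 1000-node network, so SOFRS falls back to FirstRegionStrategy for almost every query.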

SLIDE 19

CenterOfGravityStrategy (COGS)

  • Further reduce amount of data shipping
  • "Perfect spot" for minimizing data transfer
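
One way to read "perfect spot" is as the candidate node that minimizes the total volume-weighted shipping cost. The slides do not spell out the cost model, so the sketch below is an assumption: `volume` holds a hypothetical estimate of the relevant data per node, and `distance` is a hypothetical cost function between two nodes.

```python
from typing import Callable, Dict


def center_of_gravity(volume: Dict[str, float],
                      distance: Callable[[str, str], float]) -> str:
    """Pick the node that minimizes the sum of (data volume * distance)
    over all other nodes holding relevant data. Ties are broken by
    node name so the result is deterministic."""
    def shipping_cost(candidate: str) -> float:
        return sum(v * distance(src, candidate)
                   for src, v in volume.items() if src != candidate)

    return min(sorted(volume), key=shipping_cost)


# Example with three nodes placed on a line: a - b - c.
position = {"a": 0, "b": 1, "c": 2}
hops = lambda x, y: abs(position[x] - position[y])
coordinator = center_of_gravity({"a": 2.0, "b": 5.0, "c": 2.0}, hops)
```

If every transfer costs the same regardless of the endpoints, this reduces to choosing the node that holds the most relevant data, since that data never has to be shipped.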

SLIDE 20

RandomRegionStrategy (RRS)

  • Select random relevant region
  • Tradeoff between balancing coordination load and reducing data shipping
  • Probability(a) = 2/9
  • Probability(b) = 5/9
  • Probability(c) = 2/9
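
The probabilities above are consistent with picking one of the relevant regions uniformly at random and letting its node coordinate. The figure is not part of this transcript, so the layout used below (nine relevant regions, with nodes a, b, and c covering two, five, and two of them) is an assumed reading, and `region_owner` is a hypothetical mapping.

```python
import random
from collections import Counter
from fractions import Fraction
from typing import Dict, Sequence


def coordinator_probabilities(relevant_regions: Sequence[int],
                              region_owner: Dict[int, str]) -> Dict[str, Fraction]:
    """Probability that each node becomes coordinator when one relevant
    region is chosen uniformly at random."""
    counts = Counter(region_owner[r] for r in relevant_regions)
    total = len(relevant_regions)
    return {node: Fraction(n, total) for node, n in counts.items()}


def random_region_strategy(relevant_regions: Sequence[int],
                           region_owner: Dict[int, str],
                           rng=random) -> str:
    # RRS: choose a relevant region at random; its node coordinates.
    return region_owner[rng.choice(relevant_regions)]


# Assumed layout: regions 0-1 on node a, 2-6 on node b, 7-8 on node c.
region_owner = {0: "a", 1: "a", 2: "b", 3: "b", 4: "b", 5: "b", 6: "b", 7: "c", 8: "c"}
probabilities = coordinator_probabilities(list(range(9)), region_owner)
```

A node covering more of the query's regions coordinates more often, which tends to reduce data shipping, while the randomness spreads coordination load across all nodes with relevant data.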

SLIDE 21

Evaluation

  • Coordination strategies: SS, FRS, SOFRS, COGS, RRS
  • Submission strategies: portal-based, institution-based
  • Observational data sets
  • Two workloads
    – SDSS query log (Qobs)
    – Synthetic (Qscaled)
  • Network size
  • Network traffic measurements
    – Number of routed messages
    – Coordination load balancing
  • Throughput measurements

Pobs

SLIDE 22

Query Workloads

SLIDE 23

Routed Messages per Query (Qobs)

SLIDE 24

Routed Messages per Query (Qscaled)

SLIDE 25

Portal-based Coordination Load

SLIDE 26

Institution-based Coordination Load

SLIDE 27

Throughput

  • Throughput dependent on query complexity
  • No clear winner in terms of throughput

Figure panels: Qobs, Qscaled

SLIDE 28

Workload-Aware Data Partitioning

  • Query skew (hot spots) triggered by increased interest in particular subsets of the data
  • Two well-known query load balancing techniques:
    – Data partitioning
    – Data replication
  • Finding trade-offs between both (see EDBT ’09 paper)

SLIDE 29

Load Balancing During Runtime

  • Complement workload-aware partitioning with runtime load balancing
  • Short-term peaks
    – Master-slave approach
    – Load monitoring
  • Long-term trends
    – Based on load monitoring
    – Histogram evolution

SLIDE 30

Related Work

  • On-line load balancing
  • Hundreds of thousands to millions of nodes
  • Reacting fast
  • Treating objects individually

HiSbase

SLIDE 31

Who Is the Query Coordinator?

  • Many challenges and opportunities in e-science for distributed computing and database research
    – High-throughput data management
    – Correlation of distributed data sources
  • Collaborative Query Coordination
    – Region-based strategies reduce number of messages
    – Load balancing independent of submission characteristic

SLIDE 32

Special Thanks To …

  • Ella Qiu, University of British Columbia
    – DAAD RISE internship
    – Support during implementation
    – Initial measurements

SLIDE 33

Get in Touch

  • Database systems group, TU München
    – Web site: http://www-db.in.tum.de
    – E-mail: scholl@in.tum.de
  • The HiSbase project
    – http://www-db.in.tum.de/research/projects/hisbase/

Thank You for Your Attention