WANalytics: Analytics for a geo-distributed data-intensive world

SLIDE 1

WANalytics: Analytics for a geo-distributed data-intensive world

Ashish Vulimiri*, Carlo Curino+,
Brighten Godfrey*, Konstantinos Karanasos+, George Varghese+

*UIUC    +Microsoft

SLIDE 2

Large organizations today: Massive data volumes

• Data collected across several data centers for low end-user latency
• Use cases:
  – User activity logs
  – Telemetry
  – …

[Diagram: data collected across data centers DC1, DC2, DC3]

SLIDE 3

Current scales: 10s-100s TB/day

  Microsoft   n × 10s TB/day
  Twitter     100 TB/day
  Facebook    15 TB/day
  Yahoo       10 TB/day
  LinkedIn    10 TB/day

across up to 10s of data centers

SLIDE 4

Data must be analyzed as a whole

• Need to analyze all this data to extract insight
• Production workloads today:
  – Mix of SQL, MapReduce, machine learning, …

[Diagram: an analytics DAG mixing SQL, MapReduce, ML, and k-means tasks]

SLIDE 5

Analytics on geo-distributed data: Centralized approach inadequate

Current solution: copy all data to a central DC, run analytics there

1. Consumes a lot of bandwidth
   – Cross-DC bandwidth is expensive and very scarce
   – "Total Internet capacity" is only ≈ 100 Tbps
2. Incompatible with sovereignty
   – Many countries are considering making it illegal to copy citizens' data out of the country
   – Speculation: derived information will still be OK
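
A back-of-envelope scale check (ours, not from the original slide): continuously
shipping 100 TB/day means 100 × 10^12 B / 86,400 s ≈ 1.16 GB/s ≈ 9.3 Gbps of
sustained cross-DC throughput for a single such workload, a meaningful slice of
expensive WAN capacity.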

SLIDE 6

Alternative: Geo-distributed analytics

We build a system supporting geo-distributed analytics execution:

• Leave data partitioned across DCs
• Push compute down (distribute workflow execution)

SLIDE 7

Geo-distributed analytics

[Diagram: adserve_log and click_log are partitioned across DC1…DCn.
 t = 0: push down the MapReduce preprocess step to every DC
 t = 1: distributed semi-join (SQL)
 t = 2: centralized k-means clustering (Mahout)]

Distributed execution:   0.03 TB/day
Centralized execution:  10 TB/day
→ 333x cost reduction
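
(The 333x figure is the ratio of the two transfer volumes:
10 TB/day ÷ 0.03 TB/day ≈ 333.)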


SLIDE 10

Building a system for geo-distributed analytics

• Possible challenges to address:
  – Bandwidth
  – Fault tolerance
  – Sovereignty
  – Latency
  – Consistency
• Starting point: the system we build targets the batch applications considered earlier

SLIDE 11

PROBLEM DEFINITION

SLIDE 12

Computational model

• DAGs of arbitrary tasks over geo-distributed data
• Tasks can be white-box or black-box (see the sketch below)

[Diagram: a DAG over adserve_log and click_log partitioned across DC1…DCn,
 combining white-box tasks (a MapReduce preprocess step, a SQL step) with a
 black-box correlation-analysis task built from user-provided code]
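
To make the model concrete, here is a minimal Python sketch (ours; Task, Table,
and all field names are illustrative, not the system's actual API) of a DAG
mixing white-box and black-box tasks over externally-fixed partitions:

    # Minimal sketch of the computational model: a DAG of tasks over
    # geo-distributed inputs. All names are illustrative.
    from dataclasses import dataclass, field

    @dataclass
    class Table:
        name: str
        partitions: dict    # DC name -> partition size in TB, fixed externally

    @dataclass
    class Task:
        name: str
        kind: str           # "white-box" (SQL, MapReduce) or "black-box" (user code)
        inputs: list = field(default_factory=list)   # upstream Tasks or Tables

    adserve_log = Table("adserve_log", {"DC1": 4.0, "DCn": 6.0})
    click_log   = Table("click_log",   {"DC1": 2.5, "DCn": 1.5})

    preprocess = Task("preprocess", "white-box", [adserve_log, click_log])  # MapReduce
    join       = Task("semi_join",  "white-box", [preprocess])              # SQL
    correlate  = Task("correlation_analysis", "black-box", [join])          # user code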

SLIDE 13

Unique characteristics (what makes this problem novel)

1. Arbitrary DAGs of computational tasks
2. No control over data partitioning
   – Partitioning dictated by external factors, e.g. end-user latency
3. Cross-DC bandwidth is the only scarce resource
   – CPU and storage within DCs are relatively cheap
4. Unusual constraints:
   – heterogeneous bandwidth cost/availability
   – sovereignty
5. Bulk of the load is a stable, recurring workload
   – Consistent with production logs

SLIDE 14

Problem statement

• Support arbitrary DAG workflows on geo-distributed data
  – Minimize bandwidth cost
  – Handle fault tolerance and sovereignty
• Configure the system to optimize a given ~stable recurring workload (set of DAGs)

SLIDE 15

KEY TAKE-AWAY 1:


Geo-distributed analytics is a fun and industrially relevant new instance of classic DB problems

SLIDE 16

OUR APPROACH

SLIDE 17

Architecture

[Diagram: in each DC, an end-user-facing DB (handling OLTP) feeds a local ETL
 step into the analytics stack (Hive, Mahout, MapReduce), which sits behind a
 data-transfer optimization layer. A central Coordinator accepts DAGs and
 returns results through a reporting pipeline; a Workload Optimizer consumes
 execution logs and pushes execution and replication policies back to the DCs.]

SLIDE 18

Data transfer optimization: Trading CPU/storage for bandwidth

• A runtime optimization that works irrespective of the computation
• CPU and storage within DCs are cheap
• Bandwidth crossing DCs is expensive
• This is one way we trade CPU/storage for a reduction in bandwidth

SLIDE 19

Data transfer optimization: Caching

• We use aggressive caching: cache all intermediate output
• If a computation recurs:
  – recompute the results
  – send diff(new results, old results) (see the sketch below)
• This actually worsens CPU and storage use
• But it saves cross-DC bandwidth
  – all we care about

[Diagram: the source DC holds r_old and r_new; it ships only diff(r_new, r_old),
 and the destination DC applies the diff to its cached copy of r_old]
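
A minimal sketch of the diff idea (ours; it assumes set-structured,
line-oriented task output, and the system's real diff mechanism may differ):

    # Diff-based transfer: instead of re-shipping a recurring task's full
    # output, ship only the change against the cached previous output.
    def diff(new: set, old: set):
        return new - old, old - new            # (rows added, rows removed)

    def apply_diff(old: set, added: set, removed: set) -> set:
        return (old - removed) | added

    # Source DC: recompute the task, then send only the delta.
    cached = {"row1", "row2"}                  # r_old, cached at both DCs
    recomputed = {"row1", "row2", "row3"}      # r_new
    added, removed = diff(recomputed, cached)
    # Cross-DC transfer is proportional to |added| + |removed|, not |r_new|.

    # Destination DC: rebuild the full result from its own cached copy.
    assert apply_diff(cached, added, removed) == recomputed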

SLIDE 20

Data transfer optimization: Caching

• Caching naturally helps if one DAG arrives repeatedly (intra-DAG)
• But interestingly, it also helps inter-DAG
  – when multiple DAGs share common sub-operations
  – (because we cache all intermediate output)
• E.g. TPC-CH: a 5.99x reduction for part of the workload
SLIDE 21

Data transfer optimization: Caching ≈ View maintenance

• Caching is a low-level, mechanical form of (materialized) view maintenance
  + Works for arbitrary computation
• Compared to relational view maintenance:
  – Less efficient (CPU, storage)
  – Misses some opportunities

SLIDE 22

KEY TAKE-AWAY 2:


The extreme cost ratio of cross-DC bandwidth to in-DC CPU/storage allows for novel optimizations

SLIDE 23

WORKLOAD OPTIMIZER

SLIDE 24

Robust evolutionary approach

• Start by supporting the existing "centralized" plan
• Continuous adaptation (loop; a sketch follows below):
  – Come up with a set of alternative hypotheses
  – Measure their costs using pseudo-distributed execution
    • a novel mechanism with zero bandwidth-cost overhead
  – Compute the new best plan
    • execution strategy (today's focus; for the rest, see the paper)
    • data replication strategy
  – Deploy the new best plan
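
A runnable sketch of the loop (ours; the helper functions are illustrative
stand-ins, not the system's real API):

    import random

    def propose_alternatives(plan):
        # Hypothesize a few variants of the current plan (placeholder logic).
        return [plan + [f"variant-{i}"] for i in range(3)]

    def pseudo_distributed_cost(plan):
        # Stand-in for pseudo-distributed execution, which measures a candidate
        # plan's cross-DC transfer cost without real WAN overhead.
        return random.uniform(0, 10) + len(plan)

    def adaptation_step(current_plan, current_cost):
        scored = [(pseudo_distributed_cost(p), p)
                  for p in propose_alternatives(current_plan)]
        best_cost, best_plan = min(scored, key=lambda cp: cp[0])
        if best_cost < current_cost:           # deploy only strict improvements
            return best_plan, best_cost
        return current_plan, current_cost

    plan, cost = ["centralized"], float("inf")   # start from the centralized plan
    for _ in range(10):                          # continuous loop, truncated here
        plan, cost = adaptation_step(plan, cost)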


SLIDE 27

Optimizing execution: Subproblem definition

• Given:
  – Core workload: a set of recurring DAGs
  – Sovereignty and fault-tolerance requirements
• Need to decide the best choice of:
  – Strategy for each task (e.g. hash join vs. semi-join; a toy cost
    comparison follows below)
  – Which task goes to which DC
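
A toy WAN-cost comparison for one such decision (ours; all sizes are made up):

    # Join P (at DC1) with Q (at DC2), counting only bytes crossing the WAN.
    TB, GB = 10**12, 10**9

    def ship_whole_relation(p_bytes):
        # Hash join at Q's DC: ship all of P across the WAN.
        return p_bytes

    def semi_join(q_key_bytes, matching_p_bytes):
        # Semi-join: ship Q's join keys to P's DC, then ship back only the
        # rows of P that actually match.
        return q_key_bytes + matching_p_bytes

    options = {
        "ship P, hash join": ship_whole_relation(1 * TB),
        "semi-join": semi_join(10 * GB, int(0.05 * TB)),   # 5% of P matches
    }
    print(min(options, key=options.get))   # pick the cheaper strategy per task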

SLIDE 28

Optimizing execution: Difficulties

1. Optimizing even one task in isolation is very hard
2. Should jointly optimize all tasks in each DAG
3. Should jointly optimize all DAGs in the workload
   – Recall: caching helps when DAGs share sub-operations
4. Sovereignty, fault tolerance

SLIDE 29

Optimizing execution: Difficulties

1. Optimizing even one task in isolation is very hard

[Diagram: a single join task P ⋈ Q, where both P and Q are partitioned
 across DC1…DCn as P1…Pn and Q1…Qn]

SLIDE 31

Optimizing execution: Greedy heuristic

• Process all DAGs in parallel, separately. In each DAG:
  – Go over tasks in topological order
  – For each task, greedily pick the lowest-cost available strategy
    (a sketch follows below)
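
A minimal sketch of the greedy pass (ours; strategies() and cost() are
illustrative placeholders for the system's real strategy enumeration and
cost model):

    from graphlib import TopologicalSorter   # Python 3.9+

    def greedy_plan(dag, strategies, cost):
        """dag maps task -> set of upstream tasks; cost() sees the choices
        already made upstream via the partial plan."""
        plan = {}
        for task in TopologicalSorter(dag).static_order():
            plan[task] = min(strategies(task), key=lambda s: cost(task, s, plan))
        return plan

    # Toy inputs:
    dag = {"preprocess": set(), "join": {"preprocess"}, "kmeans": {"join"}}
    strategies = lambda t: ["centralized", "push-down", "semi-join"]
    cost = lambda t, s, plan: {"centralized": 10.0, "push-down": 0.5,
                               "semi-join": 0.03}[s]
    print(greedy_plan(dag, strategies, cost))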



SLIDE 33

When does the greedy heuristic work?

• Contractive DAGs: picks the optimal strategy
  – these make up 98% of DAGs in our experiments
• DAGs that expand then contract: may not [the remaining 2%]

[Diagram: data size shrinks monotonically through filter → aggr → summarize
 (contractive), but expands before contracting through extract-features → combine]

SLIDE 34

Optimizing execution: Beyond the heuristic

• Have a precise ILP formulation for special cases
  – SQL-only DAGs
  – MapReduce-only DAGs
  – (handles fault tolerance and sovereignty as constraints)
  – (a toy formulation in the same spirit follows below)
• Alternate heuristics
• The general problem remains open
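
A toy ILP in the same spirit (ours, written with the PuLP library; the paper's
actual formulation is richer): place each task in exactly one DC, minimize
cross-DC transfer, and encode sovereignty as forbidden placements.

    # Requires: pip install pulp
    import pulp

    tasks, dcs = ["preprocess", "join"], ["DC-US", "DC-DE"]
    # transfer[t][d]: TB crossing the WAN if task t runs in DC d (made up)
    transfer = {"preprocess": {"DC-US": 6.0, "DC-DE": 4.0},
                "join":       {"DC-US": 0.5, "DC-DE": 2.0}}
    forbidden = {("join", "DC-US")}   # e.g. a join over German data stays in DE

    prob = pulp.LpProblem("placement", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("x", (tasks, dcs), cat="Binary")
    prob += pulp.lpSum(transfer[t][d] * x[t][d] for t in tasks for d in dcs)
    for t in tasks:
        prob += pulp.lpSum(x[t][d] for d in dcs) == 1    # each task placed once
    for (t, d) in forbidden:
        prob += x[t][d] == 0                             # sovereignty constraint
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    print({t: d for t in tasks for d in dcs if x[t][d].value() == 1})
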
SLIDE 35

KEY TAKE-AWAY 3:


The optimization space is massive, yet simple heuristics seem to yield good results

SLIDE 36

EVALUATION

SLIDE 37

Prototype: WANalytics

• Implemented a Hadoop-stack prototype
  – MapReduce, Hive, OpenNLP, Mahout, …
• Experiments up to 10s-of-TB scale
  – Real Microsoft production workload
  – Three standard synthetic benchmarks: BigBench, TPC-CH, Berkeley Big-Data
  – Mix of relational and non-relational

SLIDE 38

Results: BigBench

[Chart: data transferred (TB; compressed and raw, uncompressed) vs. size of
 OLTP updates since the last OLAP run, for Centralized, Distributed without
 caching, and Distributed with caching. With caching: up to a 330x reduction.]

SLIDE 39

Results: TPC-CH

[Chart: same axes and configurations as the BigBench chart. With caching: up
 to a 360x reduction.]

SLIDE 40

Results: Microsoft production workload

[Chart: same axes and configurations as above. With caching: up to a 257x
 reduction.]

SLIDE 41

Results: Berkeley Big-Data

[Chart: same axes and configurations as above. With caching: a 3.5x reduction.]

SLIDE 42

KEY TAKE-AWAY 4:


The opportunity here is substantial: more than two orders of magnitude in
 many workloads

SLIDE 43

OPEN PROBLEMS

SLIDE 44

Open Problems

• Evolve the optimizer beyond greedy
• Even more general computational models
  – e.g. iteration
• Latency
• Consistency
• Sovereignty / privacy

SLIDE 46

Sovereignty: Partial support

• Our system respects "data-at-rest" regulations (e.g., German data should not
  be stored outside of Germany); a sketch of this check follows below
• But we allow arbitrary queries on the data
• Limitation: we don't differentiate between
  – acceptable queries, e.g. "what's the total revenue from each city?"
  – problematic queries, e.g. SELECT * FROM Germany
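
A minimal sketch (ours, purely illustrative) of the data-at-rest check the
system does enforce, as a pre-deployment test on a candidate replication plan:

    # Reject any plan that stores a jurisdiction's raw data outside it.
    RESIDENCY = {"users_de": "Germany", "users_us": "US"}   # table -> home region
    DC_REGION = {"DC-DE": "Germany", "DC-US": "US"}         # DC -> region

    def sovereignty_violations(plan):
        """plan maps each table to the set of DCs holding a raw copy."""
        return [(table, dc)
                for table, dcs in plan.items()
                for dc in dcs
                if DC_REGION[dc] != RESIDENCY[table]]

    plan = {"users_de": {"DC-DE"}, "users_us": {"DC-US"}}
    assert sovereignty_violations(plan) == []   # this plan is deployable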

SLIDE 47

Sovereignty: Partial support

• Solution: either
  – legally vet the core workload of queries/views, or
  – use a differential-privacy mechanism
• Open problem

SLIDE 48

KEY TAKE-AWAY 5:


This is just the first step: lots of related work, and lots of fun work ahead

SLIDE 49

Related Work

  • Distributed and parallel databases
  • Single-DC frameworks (Hadoop/Spark/…)
  • Data warehouses
  • Scientific workflow systems
  • Sensor networks
  • Stream-processing systems

SLIDE 50

Unique characteristics (what makes this problem novel) (recap)

1. Arbitrary DAGs of computational tasks
2. No control over data partitioning
   – Partitioning dictated by external factors, e.g. end-user latency
3. Cross-DC bandwidth is the only scarce resource
   – CPU and storage within DCs are relatively cheap
4. Unusual constraints:
   – heterogeneous bandwidth cost/availability
   – sovereignty
5. Bulk of the load is a stable, recurring workload
   – Consistent with production logs

SLIDE 51

Summary

• Centralized analytics is becoming untenable
• Proposal: geo-distributed analytics execution
• WANalytics, our system, introduces:
  – pseudo-distributed measurement
  – joint multi-query + redundancy optimization
  – caching
• On real and synthetic workloads: up to 360x less bandwidth than centralized
• Many challenges remain