Wave Computing in the Cloud Bingsheng He Microsoft Research Asia - - PowerPoint PPT Presentation



SLIDE 1

Wave Computing in the Cloud

Bingsheng He Microsoft Research Asia

Joint work with Mao Yang, Zhenyu Guo, Rishan Chen, Wei Lin, Bing Su, Hongyi Wang, Lidong Zhou

5/18/2009 1

SLIDE 2

My Dream Wave Computing

SLIDE 3

But, Today, Wave Computing is Actually…


The Wave model is a new paradigm for cloud computing.

SLIDE 4

State-of-the-art in the Cloud


(MapReduce and its brothers: G.Y.M.)

  • We provide scalability and fault-tolerance on thousands of machines.
  • We provide the query interface using high-level languages.

SLIDE 5

Are G.Y.M.’s Executions Optimal?


  • We looked at a query trace from a production system (20 thousand queries, 29 million machine hours).
  • We focused on the I/O and computation efficiency.

(Mr. Leopard)

SLIDE 6

Our Finding: “Far From Ideal”


[Pie chart: I/O on input data, 33% / 67% (redundant I/O / distinct I/O)]

[Pie chart: computation steps, 30% / 70% (common computation steps / other computation steps)]

[Bar chart: normalized total I/O, current production system vs. ideal system, annotated “46%” (results from simulation)]

SLIDE 7

I/O Redundancy

  • Two sample workloads

– Obtaining the top ten hottest Chinese pages daily
– Obtaining the top ten hottest English pages daily

Current system:

Extract → Filter: “Chinese” → Compute Top Ten → Output
Extract → Filter: “English” → Compute Top Ten → Output

Ideal system:

Extract → Filter: “Chinese” → Compute Top Ten → Output
        → Filter: “English” → Compute Top Ten → Output
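The contrast between the current and the ideal system on this slide can be sketched in a few lines of Python. This is only an illustration, not the production system: the record layout `(language, page)`, the sample log, and the function names are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical log records: (language, page). The schema is an assumption.
LOG = [
    ("Chinese", "a.cn"), ("English", "x.com"), ("Chinese", "a.cn"),
    ("Chinese", "b.cn"), ("English", "y.com"), ("English", "x.com"),
]


def extract(log, scans):
    """Yield every record once; scans[0] counts full passes over the input."""
    scans[0] += 1
    yield from log


def top_ten(records, language):
    """Filter by language, then rank pages by hit count."""
    hits = Counter(page for lang, page in records if lang == language)
    return hits.most_common(10)


def run_separately(log):
    """Current system: each query extracts the input on its own."""
    scans = [0]
    chinese = top_ten(extract(log, scans), "Chinese")
    english = top_ten(extract(log, scans), "English")
    return chinese, english, scans[0]


def run_shared_scan(log):
    """Ideal system: one extract feeds both filters."""
    scans = [0]
    hits = {"Chinese": Counter(), "English": Counter()}
    for lang, page in extract(log, scans):
        if lang in hits:
            hits[lang][page] += 1
    chinese = hits["Chinese"].most_common(10)
    english = hits["English"].most_common(10)
    return chinese, english, scans[0]
```

Both functions return the same two rankings; the separate version reads the log twice, the shared version once.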

SLIDE 8

Computation Redundancy

  • Two sample workloads

– Obtaining the top ten hottest Chinese pages daily
– Obtaining the top ten hottest Chinese pages weekly

Every day: Extract → Filter: “Chinese” → Compute Top Ten
Every week: Extract → Filter: “Chinese” → Compute Top Ten

Common computation on per-day log (ideally)

SLIDE 9

Why?

Correlations among queries

– Temporal correlations among queries (A series of queries with recurrent computation)

[Pie chart: 98% / 2% (recurring queries / non-recurring queries)]

SLIDE 10

Why?

Correlations among queries

– Spatial correlations among queries (Input data are targeted by multiple individual queries)

[Pie chart: 75% / 25% (accesses to top ten files / accesses to other files)]

SLIDE 11

How To Exploit the Correlations?

(G.Y.M.) Err… This is a little tricky. What about developing these?

  • a probabilistic model on scheduling the input data access
  • a predictive cache server
  • a speculative query decomposer.

(Mr. Leopard) No… Let’s K.I.S.S.:

  • Since correlations are inherent, we need a notion to capture them.
  • Our solution is the Wave model to capture the correlation for both the user and the system.

SLIDE 12

The Wave Model

  • Key concepts capturing the correlation among queries

– Data: not a static file, but a stream that is periodically updated (append-only)
– Query: computation on the input stream
– Query series: recurrent computation on the stream
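One way to picture the three concepts is as small Python data types. This is a sketch under stated assumptions, not the Wave model's actual definition: the class names, the idea of one segment per update period, and the non-overlapping (tumbling) windows are all choices made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Stream:
    """Data: an append-only stream, one segment per update period (assumed)."""
    name: str
    segments: List[list] = field(default_factory=list)

    def append(self, segment: list) -> int:
        """Add a new segment and return its index; old segments never change."""
        self.segments.append(segment)
        return len(self.segments) - 1


@dataclass
class Query:
    """Query: one computation over segments first..last (inclusive)."""
    stream: Stream
    first: int
    last: int
    compute: Callable[[list], object]

    def run(self):
        window = self.stream.segments[self.first:self.last + 1]
        records = [record for segment in window for record in segment]
        return self.compute(records)


@dataclass
class QuerySeries:
    """Query series: the same computation repeated every `window` segments."""
    stream: Stream
    window: int
    compute: Callable[[list], object]

    def due(self, newest: int) -> Optional[Query]:
        """Return the query triggered by segment `newest`, or None."""
        if (newest + 1) % self.window != 0:
            return None
        return Query(self.stream, newest - self.window + 1, newest, self.compute)
```

With a daily series (`window=1`) and a weekly series (`window=7`) on the same stream, every seventh segment triggers both, which is where the correlation between the two series becomes visible to the system.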

SLIDE 13

Optimization Opportunities in Waves

  • Shared scan

– Identifies the same input stream accesses among queries

  • Shared computation

– Identifies common computation steps among queries

  • Query decomposition

– Decomposes a query into a series of smaller queries
– Uncovers more opportunities for shared scan and computation

SLIDE 14

Query Optimizations in Wave Computing

[Figure: timeline with ticks 1 to 9 showing Series 1 (daily), Series 2 (daily), and Series 3 (weekly); one group of queries is marked “a jumbo query”]

  • Decomposition
  • Form jumbo queries
  • Optimizations on jumbo queries
  • Shared scan and computation

Query series 1: Obtaining the top ten hottest Chinese pages daily
Query series 2: Obtaining the top ten hottest English pages daily
Query series 3: Obtaining the top ten hottest Chinese pages weekly
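The first two steps on this slide (decomposition, then forming jumbo queries) might look roughly like the Python sketch below. It is an assumption about how the grouping could work, not Comet's algorithm: the series names, the shared `page_log` stream, and the rule "one jumbo query per stream per day" are invented for illustration.

```python
from collections import defaultdict

# (series name, input stream, period in days); all names are hypothetical.
SERIES = [
    ("series1_chinese_daily", "page_log", 1),
    ("series2_english_daily", "page_log", 1),
    ("series3_chinese_weekly", "page_log", 7),
]


def form_jumbo_queries(series_list, num_days):
    """Group the per-day pieces of all series that read the same stream.

    Every series is first decomposed into per-day work; a series whose
    period is longer than one day also gets a combining step on the day
    its period ends.
    """
    jumbos = []
    for day in range(1, num_days + 1):
        by_stream = defaultdict(lambda: {"per_day": [], "combine": []})
        for name, stream, period in series_list:
            by_stream[stream]["per_day"].append(name)
            if period > 1 and day % period == 0:
                by_stream[stream]["combine"].append(name)
        for stream, parts in by_stream.items():
            jumbos.append({"day": day, "stream": stream, **parts})
    return jumbos
```

Each jumbo query then becomes the unit on which shared scan and shared computation are applied, since all of its members read the same day of the same stream.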

SLIDE 15

Ultimate (Wave+Cloud)

[Figure: an equation-style graphic (“+”, “=”) and a timeline (axis: Time) with the labels “Individual query series” and “Jumbo queries”]

SLIDE 16

Comet: Integration into DryadLINQ

  • Translation: query to logical representation (expression tree)
  • Transformation: logical -> physical
  • Encapsulation: physical -> Dryad execution graph
  • Code generation

[Diagram annotations: query normalization; more rules; views; shared scan/partitioning; cost model]

SLIDE 17

An Example of Query Decomposition in DryadLINQ

Decompose an operator

Q → seven daily queries + one combining query

Automatic query decomposition is challenging.

[Diagram labels: daily query; combining; views (cost estimation); combine all the views]
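The decomposition "Q → seven daily queries + one combining query" can be illustrated for the weekly top-ten workload. The Python below is a sketch, not DryadLINQ or Comet code; the function names and the choice of per-page hit counts as the daily view are assumptions.

```python
from collections import Counter


def daily_view(day_log):
    """Daily query: keep full per-page hit counts for one day (the view)."""
    return Counter(day_log)


def combine(views, k=10):
    """Combining query: merge all the views, then take the top k."""
    total = Counter()
    for view in views:
        total.update(view)
    return total.most_common(k)


def weekly_direct(week_logs, k=10):
    """Original query Q: one pass over the whole week of logs."""
    return Counter(page for day in week_logs for page in day).most_common(k)
```

The daily view keeps every page's count rather than that day's top ten, because a page that never reaches a daily top ten can still reach the weekly one. Choosing a view that makes the combined result equal to the original query is one reason automatic decomposition is hard.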

SLIDE 18

Micro Benchmark

  • Overall effectiveness

– Logical optimization in Comet reduces total I/O by 12.3%.
– Full optimization (logical + physical) in Comet reduces total I/O by 42.3%.

[Chart: total I/O (GB) per day over days 1 to 7, for Original, Logical, and Full]

(Running three sample queries on one week of data, around 120 GB, on a cluster of 40 machines)

SLIDE 19

Summary

  • The Wave model is a new paradigm for capturing the query correlations in the cloud.
  • The Wave model enables significant opportunities for improving performance and resource utilization.
  • Comet: our ongoing project integrating Wave computing into DryadLINQ.