

SLIDE 1

CLOUD PROGRAMMING

Andrew Harris & Long Kai


SLIDE 2

MOTIVATION

• Research problem: How to write distributed data-parallel programs for a compute cluster?
• Drawback of Parallel Databases (SQL): Too limited for many applications.
  • Very restrictive type system.
  • The declarative query style is unnatural.
• Drawback of Map-Reduce: Too low-level and rigid; it leads to a great deal of custom user code that is hard to maintain and reuse.

SLIDE 3

LAYERS

[Diagram: the software stack, top to bottom]
• Applications: Machine Learning, Graph Analysis, Data Mining, Image Processing, Other Applications
• High-level languages: Pig Latin / DryadLINQ (alongside Other Languages)
• Execution engines: Hadoop Map-Reduce / Dryad
• Cluster Services, running across the servers of the cluster

SLIDE 4

PIG LATIN:

A Not-So-Foreign Language for Data Processing


SLIDE 5

DATAFLOW LANGUAGE

• The user specifies a sequence of steps, where each step specifies only a single, high-level data transformation. This is similar to relational algebra and procedural in style, which is desirable for programmers.
• With SQL, the user instead specifies a set of declarative constraints. This is non-procedural and desirable for non-programmers.

SLIDE 6

A SAMPLE OF PIG LATIN CODE

SQL:

SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 10^6

Pig Latin:

good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

A Pig Latin program is a sequence of steps, each of which carries out a single data transformation.

SLIDE 7

DATA MODEL

• Atom: Contains a simple atomic value such as a string or a number, e.g., 'Joe'.
• Tuple: Sequence of fields, each of which might be any data type, e.g., ('Joe', 'lakers').
• Bag: A collection of tuples with possible duplicates. The schema of a bag is flexible.
• Map: A collection of data items, where each item has an associated key through which it can be looked up. Keys must be data atoms.
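
Putting these four types together, a single field can hold arbitrarily nested data. Below is a small illustrative value (the names and numbers are invented for illustration): a tuple whose three fields are an atom, a bag of tuples, and a map.

('alice',
 {('lakers', 1), ('iPod', 2)},   -- a bag of two tuples
 ['age' -> 20])                  -- a map whose key is an atom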

SLIDE 8

A COMPARISON WITH RELATIONAL ALGEBRA

Pig Latin:
• Everything is a bag.
• Dataflow language.
• FILTER is the same as the Select operator (see the example below).

Relational Algebra:
• Everything is a table.
• Dataflow language.
• The Select operator is the same as the FILTER command.

Pig Latin includes only a small set of carefully chosen primitives that can be easily parallelized.
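
To make the FILTER/Select correspondence concrete, the first statement of the Slide 6 example can be read both ways (a sketch; urls and pagerank come from that example):

good_urls = FILTER urls BY pagerank > 0.2;
-- in relational algebra, roughly: good_urls = σ(pagerank > 0.2)(urls)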

SLIDE 9

SPECIFYING INPUT DATA: LOAD

queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);

• The input file is “query_log.txt”.
• The input file should be converted into tuples by using the custom myLoad deserializer.
• The loaded tuples have three fields named userId, queryString, and timestamp.

Note that the LOAD command does not imply database-style loading into tables. It’s only logical.

SLIDE 10

PER-TUPLE PROCESSING: FOREACH

expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);

• expandQuery is a User Defined Function.
• Nesting can be eliminated by the use of the FLATTEN keyword in the GENERATE clause:
  expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));
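
As a sketch of what FLATTEN changes (the user and query strings are invented for illustration): if expandQuery('lakers') returns the bag {('lakers rumors'), ('lakers news')}, the two forms of the FOREACH produce:

-- without FLATTEN: one output tuple with a nested bag
('alice', {('lakers rumors'), ('lakers news')})
-- with FLATTEN: one flat output tuple per element of the bag
('alice', 'lakers rumors')
('alice', 'lakers news')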

SLIDE 11

DISCARDING UNWANTED DATA: FILTER

real_queries = FILTER queries BY userId neq 'bot';
real_queries = FILTER queries BY NOT isBot(userId);

• Again, isBot is a User Defined Function.
• Comparison operators include ==, eq, !=, neq, <, >, <=, >=.
• A comparison may combine several expressions with Boolean operators (AND, OR, NOT).

SLIDE 12

GETTING RELATED DATA TOGETHER: COGROUP

grouped_data = COGROUP results BY queryString, revenue BY queryString;

• Groups together tuples from one or more data sets that are related in some way, so that they can subsequently be processed together.
• In general, the output of a COGROUP contains one tuple for each group.
• The first field of the tuple (named group) is the group identifier. Each of the next fields is a bag, one for each input being cogrouped.
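
As an illustration of that output shape, one tuple of grouped_data might look like the following (the tuples and extra fields are invented; only queryString comes from the statement above):

('lakers',                                          -- group: the shared queryString value
 {('lakers', 'nba.com'), ('lakers', 'espn.com')},   -- bag of results tuples with that queryString
 {('lakers', 'top_ad', 0.50)})                      -- bag of revenue tuples with that queryString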

SLIDE 13

MORE ABOUT COGROUP


COGROUP + FLATTEN = JOIN
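
A sketch of that equivalence, reusing the grouped_data relation from Slide 12 (the join_result name is chosen only for illustration):

-- flattening both bags pairs up the tuples within each group, i.e. an equi-join on queryString
join_result = FOREACH grouped_data GENERATE FLATTEN(results), FLATTEN(revenue);
-- equivalent, using the built-in JOIN command:
join_result = JOIN results BY queryString, revenue BY queryString;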

SLIDE 14

EXAMPLE: MAP-REDUCE IN PIG LATIN

map_result = FOREACH input GENERATE FLATTEN(map(*));
key_groups = GROUP map_result BY $0;
output = FOREACH key_groups GENERATE reduce(*);

• A map function operates on one input tuple at a time, and outputs a bag of key-value pairs.
• The reduce function operates on all values for a key at a time to produce the final results.

SLIDE 15

IMPLEMENTATION

• Building a logical plan:
  • Pig builds a logical plan for every bag that the user defines.
  • No processing is carried out when the logical plans are constructed. Processing is triggered only when the user invokes a STORE command on a bag (example below).
• Compilation of the logical plan into a physical plan.
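
For example, nothing executes until a statement like the following asks for a bag to be written out (the output path and serializer name here are placeholders):

STORE big_groups INTO 'myoutput' USING myStore();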

SLIDE 16

MAP-REDUCE PLAN COMPILATION

• The map-reduce primitive essentially provides the ability to do a large-scale group-by, where the map tasks assign keys for grouping, and the reduce tasks process a group at a time.
• Each (CO)GROUP command in the logical plan is converted into a distinct map-reduce job with its own map and reduce functions.
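
Read against the Slide 6 program, this rule yields a single map-reduce job, since there is one GROUP command. The annotations below are my reading of where each command would run, not something stated on the slides:

good_urls  = FILTER urls BY pagerank > 0.2;                                  -- map phase
groups     = GROUP good_urls BY category;                                    -- shuffle on category
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;                       -- reduce phase
output     = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);  -- reduce phase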

SLIDE 17

OTHER FEATURES

• Fully nested data model.
• Extensive support for user-defined functions.
• Manages plain input files without any schema information.
• A novel debugging environment.

SLIDE 18

DISCUSSION: PIG LATIN MEETS MAP-REDUCE

• Is it necessary to run Pig Latin on a Map-Reduce platform?
• Is Map-Reduce a perfect platform for Pig Latin? Any drawbacks?
  • Data must be materialized and replicated on the distributed file system between successive map-reduce jobs.
  • Not flexible enough.
  • Well, it does work fine: parallelism, load-balancing, and fault-tolerance...

SLIDE 19

DRYADLINQ

A SYSTEM FOR GENERAL-PURPOSE DISTRIBUTED DATA-PARALLEL COMPUTING


SLIDE 20

DRYAD EXECUTION PLATFORM

• The job execution plan is a dataflow graph.
• A Dryad application combines computational “vertices” with communication “channels” to form a dataflow graph.

SLIDE 21

MAP-REDUCE IN DRYADLINQ

SLIDE 22

IMPLEMENTATION - OPTIMIZATIONS

• Static Optimizations
  • Pipelining: Multiple operators may be executed in a single process.
  • Removing redundancy: DryadLINQ removes unnecessary partitioning steps.
  • Eager Aggregation: Aggregations are moved in front of partitioning operators where possible.
  • I/O reduction: Where possible, uses TCP-pipe and in-memory FIFO channels instead of persisting temporary data to files.
• Dynamic Optimizations
  • Dynamically sets the number of vertices in each stage at run time, based on the size of its input data.
  • Dynamically mutates the execution graph as information from the running job becomes available.

SLIDE 23

MAP-REDUCE IN DRYADLINQ


Step (1) is static, (2) and (3) are dynamic based on the volume and location of the data in the inputs.

SLIDE 24

Incremental Processing with Percolator

Long Kai and Andrew Harris

SLIDE 25

We optimized the flow of processing... Now what? Make it update faster!

SLIDE 26

Incremental Processing

• Instead of processing the entire dataset, only process what needs to be updated
• Requires random read/write access to data
• Suitable for data that is independent (data pieces do not depend on other data pieces) or only marginally dependent
• Reduces seeking time, processing overhead, insertion/update costs
SLIDE 27

Google Percolator

• Introduced at OSDI ’10
• Core tech behind Google Caffeine search platform - driving app: Google’s indexer
• Allows random access and incremental updates to petabyte-scale data sets
• Dramatically reduces cost of updates, allowing for “fresher” search results

SLIDE 28

Previous Google System

• Same number of documents (billions per day)
• 100 MapReduces to compile web index for these documents
• Each document spent 2-3 days being indexed

SLIDE 29

How It Works

[Diagram: applications built with the Percolator library, plus observers, talk to Bigtable (tablet servers on top of chunkservers), which holds the documents database.]
• All communication handled via RPCs
• Single lines of code in observer
• Google indexing system uses ~10 observers

SLIDE 30

Transactions

• Observer-Bigtable communication is handled as an ACID transaction
• Observer nodes themselves handle deadlock resolution
• Simple lock cleanup synchronization
• All writes are timestamped in increasing order via a coordinated timestamp oracle

SLIDE 31

Fault Tolerance

[Chart: result of dropping 33% of the tablet servers in use]

SLIDE 32

Pushing Updates

• Percolator clients open a write-only connection with Bigtable
• Obtain write lock for specific table location
• If locked, determine if lock is from a previously failed transaction
• Overhead:
SLIDE 33

Notifying the Observers

• Handled separately from writes (data connections are unidirectional)
• Otherwise similar to database triggers
• Multiple Bigtable changes may produce only one notification
SLIDE 34

Notifying the Observers

[Diagram: an observed column in Bigtable is changed one or more times; Bigtable sends a NOTIFY to the observer, which receives the most recent column data and starts a new update transaction.]

SLIDE 35

Keeping Clean

[Diagram: a Bigtable table with Key, Value, and Notify columns; Percolator workers spawn search threads which search randomly (sequential search) and report changed cells to the observer (transactions).]

SLIDE 36

Benefits!

  • Closer to DBMS performance
  • “Only” 30x processing overhead

against comparison DBMS (TPC-E, a stock market trading backend)

  • Fresher data pushed for lower costs
  • 100x faster document movement
  • 1000x faster document processing
  • Data set is also 3x larger than

previous!

  • Fixes stragglers - everything updates
SLIDE 37

Discussion

• Transactions introduce read/write overhead relative to Bigtable size - when does scaling break down?
• Not suitable for updating heavily dependent or rapidly mutating data sets
• How do you adapt for these?
• In lightly dependent data sets, causally linked children may report updates before their parents - implications?