Machine Image Graph Data Learning Processing Analysis Mining - PowerPoint PPT Presentation

CLOUD PROGRAMMING
Andrew Harris & Long Kai

MOTIVATION
• Research problem: How to write distributed data-parallel programs for a compute cluster?
• Drawback of Parallel Databases (SQL): Too limited for many applications.
  - Very restrictive type system.
  - The declarative query is unnatural.
• Drawback of Map-Reduce: Too low-level and rigid; it leads to a great deal of custom user code that is hard to maintain and reuse.

LAYERS (top to bottom)
• Applications: Machine Learning, Image Processing, Graph Analysis, Data Mining, Other Applications, …
• Pig Latin / DryadLINQ, Other Languages
• Hadoop Map-Reduce / Dryad
• Cluster Services
• Server, Server, Server, Server

PIG LATIN: A Not-So-Foreign Language for Data Processing

DATAFLOW LANGUAGE
• The user specifies a sequence of steps, where each step specifies only a single, high-level data transformation. This is similar to relational algebra and is procedural, which is desirable for programmers.
• With SQL, the user specifies a set of declarative constraints. This is non-procedural and desirable for non-programmers.

A SAMPLE OF PIG LATIN CODE
SQL:
    SELECT category, AVG(pagerank)
    FROM urls WHERE pagerank > 0.2
    GROUP BY category HAVING COUNT(*) > 10^6
Pig Latin:
    good_urls = FILTER urls BY pagerank > 0.2;
    groups = GROUP good_urls BY category;
    big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
    output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
A Pig Latin program is a sequence of steps, each of which carries out a single data transformation.
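The four Pig Latin steps on this slide can be mirrored one-for-one in ordinary Python. This is only a single-machine sketch for intuition, not how Pig executes the program; the function name, the dictionary-based record layout, and the `min_group_size` parameter are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def avg_pagerank_of_big_categories(urls, min_pagerank=0.2, min_group_size=10**6):
    """Mirror the slide's four steps; `urls` is a list of dicts (assumed layout)."""
    # good_urls = FILTER urls BY pagerank > 0.2;
    good_urls = [u for u in urls if u["pagerank"] > min_pagerank]

    # groups = GROUP good_urls BY category;
    groups = defaultdict(list)
    for u in good_urls:
        groups[u["category"]].append(u)

    # big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
    big_groups = {c: bag for c, bag in groups.items() if len(bag) > min_group_size}

    # output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
    return {c: mean(u["pagerank"] for u in bag) for c, bag in big_groups.items()}
```

Each intermediate name corresponds to one Pig Latin statement, which is the point of the dataflow style: every step is a single transformation that can be inspected on its own.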

DATA MODEL
• Atom: Contains a simple atomic value such as a string or a number, e.g., ‘Joe’.
• Tuple: Sequence of fields, each of which might be any data type, e.g., (‘Joe’, ‘lakers’).
• Bag: A collection of tuples with possible duplicates. The schema of a bag is flexible.
• Map: A collection of data items, where each item has an associated key through which it can be looked up. Keys must be data atoms.
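As a rough analogy, the four data types can be pictured with built-in Python values. The mapping below is an assumption made for illustration only (Pig has its own type system), and the sample values beyond ‘Joe’ and ‘lakers’ are invented.

```python
# Atom: a simple atomic value.
atom = "Joe"

# Tuple: a sequence of fields, each of which may be of any type.
tup = ("Joe", "lakers")

# Bag: a collection of tuples; duplicates are allowed and the tuples
# need not share a schema (the third tuple has a different shape).
bag = [("Joe", "lakers"), ("Joe", "lakers"), ("Ann", 7, "extra field")]

# Map: items looked up by key; keys must be atoms, values may be any type,
# including a nested bag.
mapping = {"fan of": [("lakers",), ("iPod",)], "age": 20}
```

A Python list stands in for the bag here because, unlike a set, it preserves duplicates.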

A COMPARISON WITH RELATIONAL ALGEBRA
Pig Latin:
• Everything is a bag.
• Dataflow language.
• FILTER is the same as the Select operator.
Relational Algebra:
• Everything is a table.
• Dataflow language.
• The Select operator is the same as the FILTER command.
Pig Latin includes only a small set of carefully chosen primitives that can be easily parallelized.

SPECIFYING INPUT DATA: LOAD
    queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);
• The input file is “query_log.txt”.
• The input file is converted into tuples by the custom myLoad deserializer.
• The loaded tuples have three fields named userId, queryString, and timestamp.
Note that the LOAD command does not imply database-style loading into tables. The load is only logical.

PER-TUPLE PROCESSING: FOREACH
    expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);
• expandQuery is a User Defined Function.
• Nesting can be eliminated by using the FLATTEN keyword in the GENERATE clause, which then reads:
    userId, FLATTEN(expandQuery(queryString));
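The difference between the nested and the flattened form of FOREACH can be sketched in Python as follows. The body of `expand_query` is a hypothetical stand-in for the expandQuery UDF (the slide does not say what it returns beyond a nested result), and the helper names are assumptions.

```python
def expand_query(query_string):
    # Hypothetical stand-in for the expandQuery UDF: returns a bag of expansions.
    return [query_string, query_string + " online"]


def foreach_generate(queries):
    # GENERATE userId, expandQuery(queryString):
    # one output tuple per input tuple; the second field is a nested bag.
    return [(user_id, expand_query(q)) for user_id, q, _timestamp in queries]


def foreach_generate_flatten(queries):
    # GENERATE userId, FLATTEN(expandQuery(queryString)):
    # the nested bag is unrolled, giving one output tuple per bag element.
    return [
        (user_id, expansion)
        for user_id, q, _timestamp in queries
        for expansion in expand_query(q)
    ]
```

With a single input tuple ("alice", "lakers", 1), the first form yields one tuple holding a two-element bag, while the second yields two flat tuples.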

DISCARDING UNWANTED DATA: FILTER
    real_queries = FILTER queries BY userId neq 'bot';
    real_queries = FILTER queries BY NOT isBot(userId);
• Again, isBot is a User Defined Function.
• Comparison operators include ==, eq, !=, neq, <, >, <=, >=.
• A condition may combine several expressions with Boolean operators (AND, OR, NOT).

GETTING RELATED DATA TOGETHER: COGROUP
    grouped_data = COGROUP results BY queryString, revenue BY queryString;
• COGROUP groups together tuples from one or more data sets that are related in some way, so that they can subsequently be processed together.
• In general, the output of a COGROUP contains one tuple for each group.
• The first field of the tuple (named group) is the group identifier. Each of the following fields is a bag, one for each input being cogrouped.

MORE ABOUT COGROUP
COGROUP + FLATTEN = JOIN
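The identity COGROUP + FLATTEN = JOIN can be checked with a small in-memory sketch. The function names, the tuple layouts, and the key-extractor arguments are assumptions for illustration; the sketch models an equi-join on one key and says nothing about how Pig implements it.

```python
from collections import defaultdict
from itertools import product


def cogroup(left, right, left_key, right_key):
    """Return one (group, left_bag, right_bag) tuple per distinct key."""
    groups = defaultdict(lambda: ([], []))
    for t in left:
        groups[left_key(t)][0].append(t)
    for t in right:
        groups[right_key(t)][1].append(t)
    return [(key, lbag, rbag) for key, (lbag, rbag) in groups.items()]


def join(left, right, left_key, right_key):
    """Flatten both bags of every group: the cross product within each group.

    A group whose bag is empty on one side contributes no output tuples,
    which is exactly the behaviour of an inner equi-join.
    """
    return [
        l + r
        for _key, lbag, rbag in cogroup(left, right, left_key, right_key)
        for l, r in product(lbag, rbag)
    ]
```

Keeping the intermediate cogrouped form is what allows processing other than a plain join, for example applying a custom function to the two bags of each group.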

EXAMPLE: MAP-REDUCE IN PIG LATIN
    map_result = FOREACH input GENERATE FLATTEN(map(*));
    key_groups = GROUP map_result BY $0;
    output = FOREACH key_groups GENERATE reduce(*);
• A map function operates on one input tuple at a time and outputs a bag of key-value pairs.
• The reduce function operates on all values for a key at a time to produce the final results.
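The same three steps (flatten the map output, group by key, reduce each group) can be sketched in Python. The names `map_reduce`, `word_count_map`, and `word_count_reduce` are illustrative assumptions, and word counting is chosen here only as a familiar example. One simplification: the sketch passes only the values of each group to the reduce function, whereas the Pig Latin GROUP keeps whole tuples in the bag.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    # map_result = FOREACH input GENERATE FLATTEN(map(*));
    map_result = [pair for record in records for pair in map_fn(record)]

    # key_groups = GROUP map_result BY $0;
    key_groups = defaultdict(list)
    for key, value in map_result:
        key_groups[key].append(value)

    # output = FOREACH key_groups GENERATE reduce(*);
    return [reduce_fn(key, values) for key, values in key_groups.items()]


def word_count_map(line):
    # Emits a bag of (key, value) pairs for one input tuple.
    return [(word, 1) for word in line.split()]


def word_count_reduce(word, counts):
    # Sees all values for one key at a time.
    return (word, sum(counts))
```

For example, `map_reduce(["a b a", "b c"], word_count_map, word_count_reduce)` produces one (word, count) pair per distinct word.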

IMPLEMENTATION
• Building a logical plan:
  - Pig builds a logical plan for every bag that the user defines.
  - No processing is carried out when the logical plans are constructed. Processing is triggered only when the user invokes a STORE command on a bag.
• Compilation of the logical plan into a physical plan.
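The idea that defining a bag only records a plan, and that work happens when STORE is invoked, can be illustrated with a toy lazy pipeline. The `LazyBag` class and its methods are hypothetical and far simpler than Pig's logical plans; the sketch shows deferred execution only.

```python
class LazyBag:
    """Toy logical plan: records steps and runs nothing until store()."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def filter(self, predicate):
        # Extends the plan; the predicate is not called here.
        return LazyBag(self.source, self.steps + (("FILTER", predicate),))

    def foreach(self, fn):
        # Extends the plan; fn is not called here.
        return LazyBag(self.source, self.steps + (("FOREACH", fn),))

    def store(self):
        # Only now is the plan executed, step by step.
        data = list(self.source)
        for op, fn in self.steps:
            if op == "FILTER":
                data = [t for t in data if fn(t)]
            else:
                data = [fn(t) for t in data]
        return data
```

Because every operation returns a new plan instead of data, a system built this way can inspect or reorder the whole sequence of steps before running any of them.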

MAP-REDUCE PLAN COMPILATION
• The map-reduce primitive essentially provides the ability to do a large-scale group-by, where the map tasks assign keys for grouping and the reduce tasks process one group at a time.
• Each (CO)GROUP command in the logical plan is converted into a distinct map-reduce job with its own map and reduce functions.

OTHER FEATURES
• Fully nested data model.
• Extensive support for user-defined functions.
• Manages plain input files without any schema information.
• A novel debugging environment.

DISCUSSION: PIG LATIN MEETS MAP-REDUCE
• Is it necessary to run Pig Latin on a Map-Reduce platform?
• Is Map-Reduce a perfect platform for Pig Latin? Any drawbacks?
  - Data must be materialized and replicated on the distributed file system between successive map-reduce jobs.
  - Not flexible enough.
• Still, it does work fine: parallelism, load-balancing, fault-tolerance, ...

DRYADLINQ: A SYSTEM FOR GENERAL-PURPOSE DISTRIBUTED DATA-PARALLEL COMPUTING

DRYAD EXECUTION PLATFORM
• The job execution plan is a dataflow graph.
• A Dryad application combines computational “vertices” with communication “channels” to form a dataflow graph.

MAP-REDUCE IN DRYADLINQ

IMPLEMENTATION - OPTIMIZATIONS
• Static Optimizations
  - Pipelining: Multiple operators may be executed in a single process.
  - Removing redundancy: DryadLINQ removes unnecessary partitioning steps.
  - Eager Aggregation: Aggregations are moved in front of partitioning operators where possible.
  - I/O reduction: Where possible, uses TCP-pipe and in-memory FIFO channels instead of persisting temporary data to files.
• Dynamic Optimizations
  - Dynamically sets the number of vertices in each stage at run time based on the size of its input data.
  - Dynamically mutates the execution graph as information from the running job becomes available.
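The benefit of eager aggregation can be seen in a small counting example: aggregating inside each partition before repartitioning shrinks the amount of data that has to cross the partitioning step. This is a plain-Python sketch under assumed names (it does not use DryadLINQ or any of its APIs), with in-memory lists standing in for partitions.

```python
from collections import Counter


def count_without_eager_aggregation(partitions):
    # Every record passes through the partitioning (shuffle) step,
    # and aggregation happens only afterwards.
    shuffled = [key for part in partitions for key in part]
    return Counter(shuffled), len(shuffled)


def count_with_eager_aggregation(partitions):
    # Aggregate locally first, so only one (key, partial_count) pair
    # per distinct key per partition is shuffled.
    partials = [Counter(part) for part in partitions]
    shuffled = [(key, n) for partial in partials for key, n in partial.items()]
    totals = Counter()
    for key, n in shuffled:
        totals[key] += n
    return totals, len(shuffled)
```

Both functions return the same totals together with the number of records shuffled; the second number is never larger with eager aggregation, and it is much smaller when keys repeat within a partition. The rewrite is valid here because counting is associative and commutative.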

MAP-REDUCE IN DRYADLINQ
Step (1) is static; (2) and (3) are dynamic, based on the volume and location of the data in the inputs.

Incremental Processing with Percolator
Long Kai and Andrew Harris

We optimized the flow of processing... Now what? Make it update faster!

Incremental Processing
• Instead of processing the entire dataset, only process what needs to be updated
• Requires random read/write access to data
• Suitable for data that is independent (data pieces do not depend on other data pieces) or only marginally dependent
• Reduces seeking time, processing overhead, insertion/update costs
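A toy inverted index shows what "only process what needs to be updated" means in practice. The `IncrementalIndex` class below is a hypothetical single-process illustration, not Percolator's design: when one document changes, only the index entries for words that were added or removed are touched.

```python
class IncrementalIndex:
    """Word -> set of document ids, maintained one document at a time."""

    def __init__(self):
        self.docs = {}    # doc_id -> last indexed text
        self.index = {}   # word -> set of doc_ids

    def update(self, doc_id, text):
        """Apply one document change; return how many index entries were touched."""
        old_words = set(self.docs.get(doc_id, "").split())
        new_words = set(text.split())

        for word in old_words - new_words:      # words removed from the document
            self.index[word].discard(doc_id)
            if not self.index[word]:
                del self.index[word]
        for word in new_words - old_words:      # words added to the document
            self.index.setdefault(word, set()).add(doc_id)

        self.docs[doc_id] = text
        return len(old_words ^ new_words)
```

The dictionaries play the role of the random read/write store mentioned above: each update reads and writes a handful of entries instead of rescanning every document.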

Google Percolator
• Introduced at OSDI ’10
• Core tech behind Google Caffeine search platform
  - driving app: Google’s indexer
• Allows random access and incremental updates to petabyte-scale data sets
• Dramatically reduces cost of updates, allowing for “fresher” search results

Previous Google System
• Same number of documents (billions per day)
• 100 MapReduces to compile web index for these documents
• Each document spent 2-3 days being indexed

How It Works
[Diagram: each node runs an app with the Percolator library, a Bigtable tablet server, and a chunkserver; labels: observer, database, documents]
• All communication handled via RPCs
• Single lines of code in observer
• Google indexing system uses ~10 observers

Transactions
• Observer-Bigtable communication is handled as an ACID transaction
• Observer nodes themselves handle deadlock resolution
• Simple lock cleanup synchronization
• All writes are increasingly timestamped via coordinated timestamp oracle
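The role of a timestamp oracle, handing out strictly increasing timestamps so that writes can be ordered, can be sketched with a lock-protected counter. This single-process `TimestampOracle` class is an assumption made for illustration; it is not Percolator's implementation, which is a distributed service.

```python
import threading


class TimestampOracle:
    """Hands out strictly increasing integer timestamps to concurrent callers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last = 0

    def get_timestamp(self):
        # The lock guarantees that no two callers ever receive the same value
        # and that values are issued in increasing order.
        with self._lock:
            self._last += 1
            return self._last
```

Because every write carries a unique timestamp from one source, two transactions can always be ordered by comparing their timestamps.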

Fault Tolerance
[Figure: result of dropping 33% of tablet servers in use]

Pushing Updates
• Percolator clients open a write-only connection with Bigtable
• Obtain write lock for specific table location
• If locked, determine if lock is from a previously failed transaction
• Overhead: [figure not preserved]

Notifying the Observers
• Handled separately from writes (data connections are unidirectional)
• Otherwise similar to database triggers
• Multiple Bigtable changes may produce only one notification
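The last point, that several changes to the same cell may collapse into a single notification, can be illustrated with a "dirty set" kept apart from the data itself. The `NotifyingTable` class and its method names are hypothetical; the sketch captures the coalescing behaviour only, not Bigtable's or Percolator's actual mechanism.

```python
class NotifyingTable:
    """Key-value table that records which keys changed since the last notification pass."""

    def __init__(self):
        self.cells = {}
        self.dirty = set()   # notification state, kept separate from the data

    def write(self, key, value):
        self.cells[key] = value
        self.dirty.add(key)  # repeated writes to one key leave a single mark

    def run_observers(self, observer):
        """Call observer(key, latest_value) once per changed key; return the call count."""
        pending = sorted(self.dirty)
        self.dirty.clear()   # writes made by the observer are picked up next pass
        for key in pending:
            observer(key, self.cells[key])
        return len(pending)
```

An observer invoked this way always sees the most recent value of the cell, no matter how many writes happened since the previous pass.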

Notifying the Observers
[Diagram: observed column is changed one or more times -> Bigtable sends NOTIFY to Observer -> observer receives most recent column data -> new update transaction]

Keeping Clean (sequential search)
[Diagram: table with Key, Value, and Notify columns; search threads; Observer (transactions)]
Percolator workers spawn threads which search randomly and report changed cells to the observer
