Introduction to Data Management CSE 344 Section: Parallel Database

Introduction to Data Management CSE 344
Section: Parallel Database and Pig Latin
CSE 344 - Fall 2015

Announcements
• HW8 is out. The last HW! ☺
  – You'll have $100 in credits for Amazon AWS, which is enough for the homework. Make sure you terminate your clusters immediately every time you do not need them. Amazon will charge your own credit card once the $100 is used up.
  – Start setting up your account now(!). It takes time.
• And follow the instructions!! That is usually the biggest problem.

Pig Latin Mini-Tutorial

Pig Latin Overview
• Data model = loosely typed nested relations
• Query model = a SQL-like, dataflow language
• Execution model:
  – Option 1: run locally on your machine, e.g. to debug
  – Option 2: compile into parallel computing graphs, e.g. MapReduce jobs in Hadoop

Example
• Input: a table of urls: (url, category, pagerank)
• Compute the average pagerank of all sufficiently high pageranks, for each category
• Return the answers only for categories with sufficiently many such pages

Page(url, category, pagerank)
First in SQL …
  SELECT category, AVG(pagerank)
  FROM Page
  WHERE pagerank > 0.2
  GROUP BY category
  HAVING COUNT(*) > 1000000

Page(url, category, pagerank)
… then in Pig Latin
  good_urls = FILTER urls BY pagerank > 0.2
  groups = GROUP good_urls BY category
  big_groups = FILTER groups BY COUNT(good_urls) > 1000000
  output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
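To make the dataflow reading of the Pig Latin version concrete, here is a minimal Python sketch that mirrors the four steps one by one. It only models the semantics on an in-memory list; the sample rows, the function name avg_pagerank_by_category, and its parameters are assumptions made for illustration and are not part of the slides or of Pig.

```python
from collections import defaultdict

# Hypothetical sample rows of (url, category, pagerank); not from the slides.
urls = [
    ("a.com", "news", 0.75),
    ("b.com", "news", 0.25),
    ("c.com", "news", 0.1),
    ("d.com", "sports", 0.5),
]


def avg_pagerank_by_category(rows, min_rank=0.2, min_count=10**6):
    # good_urls = FILTER urls BY pagerank > min_rank
    good_urls = [row for row in rows if row[2] > min_rank]

    # groups = GROUP good_urls BY category
    groups = defaultdict(list)
    for row in good_urls:
        groups[row[1]].append(row)

    # big_groups = FILTER groups BY COUNT(good_urls) > min_count
    # output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
    return {
        category: sum(row[2] for row in bag) / len(bag)
        for category, bag in groups.items()
        if len(bag) > min_count
    }


# A tiny threshold so the toy data produces an answer.
result = avg_pagerank_by_category(urls, min_count=1)
```

With the toy threshold of 1, only the "news" category survives the second filter, because "sports" has a single page above the pagerank cutoff.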

Types in Pig Latin
• Atomic: string or number, e.g. 'Alice' or 55
• Tuple: ('Alice', 55, 'salesperson')
• Bag: {('Alice', 55, 'salesperson'), ('Betty', 44, 'manager'), … }
• Maps: we will try not to use these

Types in Pig Latin
Tuple components can be referenced by number
• $0, $1, $2, …
Bags can be nested! This is non-1st Normal Form
• {('a', {1,4,3}), ('c', { }), ('d', {2,2,5,3,2})}
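As a rough mental model only, the nested bag above can be written with Python tuples and lists. This is an analogy chosen for illustration, not how Pig stores data: lists stand in for bags because a bag may contain duplicates, which a Python set would drop.

```python
# A Pig tuple modeled as a Python tuple; $0, $1, $2 become indexes 0, 1, 2.
employee = ("Alice", 55, "salesperson")
name = employee[0]  # $0
age = employee[1]   # $1

# The nested bag from the slide; inner bags are lists so that the
# duplicate 2s in the last one are preserved.
nested = [
    ("a", [1, 4, 3]),
    ("c", []),
    ("d", [2, 2, 5, 3, 2]),
]

# Size of each inner bag, keyed by the first component of the tuple.
sizes = [(key, len(inner)) for key, inner in nested]
```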

[Olston’2008]
Loading data
• Input data = FILES!
  – Heard that before?
• The LOAD command parses an input file into a bag of records
• Both the parser (= “deserializer”) and the output type are provided by the user
For HW8: simply use the code provided

[Olston’2008]
Loading data
  queries = LOAD 'query_log.txt'
            USING userLoadFcn( )
            AS (userId, queryString, timeStamp)
Pig provides a set of built-in load/store functions
  A = LOAD 'student' USING PigStorage('\t') AS (name: chararray, age: int, gpa: float);
same as
  A = LOAD 'student' AS (name: chararray, age: int, gpa: float);

[Olston’2008]
Loading data
• USING userFunction( ) is optional
  – The default deserializer expects a tab-delimited file
• AS type is optional
  – The default is a record with unnamed fields; refer to them as $0, $1, …
• The return value of LOAD is just a handle to a bag
  – The actual reading is done in pull mode, or parallelized
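The "handle to a bag" and "pull mode" points can be illustrated with a Python generator. This is only a sketch of the idea under stated assumptions: the function load, its delimiter parameter, and the in-memory sample file are hypothetical and do not reflect how Pig implements LOAD.

```python
import io


def load(stream, delimiter="\t"):
    """Return a lazy iterator of tuples; nothing is read until it is consumed."""
    for line in stream:
        # Without an AS clause the fields are unnamed and untyped (plain strings).
        yield tuple(line.rstrip("\n").split(delimiter))


# Hypothetical tab-delimited input standing in for a file on disk.
source = io.StringIO("alice\t20\t3.5\nbob\t22\t3.9\n")

handle = load(source)            # just a handle: no lines have been read yet
position_before = source.tell()  # still at the start of the input

first = next(handle)             # pulling a record triggers the actual read
name = first[0]                  # positional reference, like $0
```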

[Olston’2008]
FOREACH
  expanded_queries = FOREACH queries
                     GENERATE userId, expandQuery(queryString)
expandQuery( ) is a UDF that produces likely expansions
Note: it returns a bag, hence expanded_queries is a nested bag

[Olston’2008]
FOREACH
  expanded_queries = FOREACH queries
                     GENERATE userId, FLATTEN(expandQuery(queryString))
Now we get a flat collection

[Olston’2008]
FLATTEN
Note that it is NOT a normal function! (That’s one thing I don’t like about Pig Latin.)
• A normal FLATTEN would do this:
  – FLATTEN({{2,3},{5},{},{4,5,6}}) = {2,3,5,4,5,6}
  – Its type is: {{T}} → {T}
• The Pig Latin FLATTEN does this:
  – FLATTEN({4,5,6}) = 4, 5, 6
  – What is its type? {T} → T, T, T, … , T ?????
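The difference between the two readings of FLATTEN may be easier to see side by side. The Python sketch below models both on lists; the function names and sample data are assumptions for illustration. The second function models FLATTEN as it behaves inside FOREACH … GENERATE: each element of the bag is paired with the other generated fields, so a tuple with an empty bag contributes no output rows.

```python
def flatten_bag_of_bags(bags):
    """The 'normal' flatten: {{T}} -> {T}."""
    return [item for bag in bags for item in bag]


def foreach_generate_flatten(rows):
    """Model of FOREACH rows GENERATE key, FLATTEN(bag).

    Produces one output tuple per element of each bag, so a row whose
    bag is empty disappears from the output.
    """
    return [(key, item) for key, bag in rows for item in bag]


normal = flatten_bag_of_bags([[2, 3], [5], [], [4, 5, 6]])

# Hypothetical nested result of expandQuery: (userId, bag of expansions).
expanded = [("u1", ["lakers", "lakers rumors"]), ("u2", [])]
flat = foreach_generate_flatten(expanded)
```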

[Olston’2008]
FILTER
Remove all queries from Web bots:
  real_queries = FILTER queries BY userId neq 'bot'
Better: use a complex UDF to detect Web bots:
  real_queries = FILTER queries BY NOT isBot(userId)

[Olston’2008]
JOIN
  results: {(queryString, url, position)}
  revenue: {(queryString, adSlot, amount)}
  join_result = JOIN results BY queryString,
                     revenue BY queryString
  join_result: {(queryString, url, position, adSlot, amount)}

[Olston’2008]
GROUP BY
  revenue: {(queryString, adSlot, amount)}
  grouped_revenue = GROUP revenue BY queryString
  query_revenues = FOREACH grouped_revenue
                   GENERATE queryString, SUM(revenue.amount) AS totalRevenue
  grouped_revenue: {(queryString, {(adSlot, amount)})}
  query_revenues: {(queryString, totalRevenue)}
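As a small worked example of the two types above, the following Python sketch groups a toy revenue bag and then sums each group. It follows the simplified types shown on the slide (the key is stripped from the inner tuples); the sample rows and the helper name group_by_query are made up for illustration.

```python
from collections import defaultdict

# Hypothetical revenue rows of (queryString, adSlot, amount).
revenue = [
    ("lakers", "top", 50),
    ("lakers", "side", 20),
    ("kings", "top", 30),
]


def group_by_query(rows):
    """Model of GROUP revenue BY queryString: key -> bag of (adSlot, amount)."""
    grouped = defaultdict(list)
    for query_string, ad_slot, amount in rows:
        grouped[query_string].append((ad_slot, amount))
    return dict(grouped)


grouped_revenue = group_by_query(revenue)

# Model of FOREACH grouped_revenue GENERATE queryString, SUM(revenue.amount).
query_revenues = {
    query_string: sum(amount for _, amount in bag)
    for query_string, bag in grouped_revenue.items()
}
```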

[Olston’2008]
Co-Group
  results: {(queryString, url, position)}
  revenue: {(queryString, adSlot, amount)}
  grouped_data = COGROUP results BY queryString,
                         revenue BY queryString;
  grouped_data: {(queryString, results:{(url, position)}, revenue:{(adSlot, amount)})}
What is the output type in general?

[Olston’2008]
Co-Group
Is this an inner join, or an outer join?
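One way to explore the question is to model COGROUP in Python and look at keys that appear on only one side. In the sketch below (toy data and a hypothetical cogroup helper, not Pig's implementation) such keys still produce an output tuple, with an empty bag for the missing side, which is closer in spirit to an outer join than to an inner one.

```python
def cogroup(results, revenue):
    """Model of COGROUP results BY queryString, revenue BY queryString.

    Emits one tuple per key found in either input; a side with no
    matching tuples contributes an empty bag.
    """
    keys = sorted({row[0] for row in results} | {row[0] for row in revenue})
    return [
        (
            key,
            [row[1:] for row in results if row[0] == key],
            [row[1:] for row in revenue if row[0] == key],
        )
        for key in keys
    ]


# Hypothetical inputs: "ipod" has no revenue, "kings" has no results.
results = [("lakers", "nba.com", 1), ("ipod", "apple.com", 1)]
revenue = [("lakers", "top", 50), ("kings", "top", 30)]

grouped_data = cogroup(results, revenue)
```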

[Olston’2008]
Co-Group
  grouped_data: {(queryString, results:{(url, position)}, revenue:{(adSlot, amount)})}
  url_revenues = FOREACH grouped_data
                 GENERATE FLATTEN(distributeRevenue(results, revenue));
distributeRevenue is a UDF that accepts the search results and revenue information for one query string at a time, and outputs a bag of urls and the revenue attributed to them.

[Olston’2008]
Co-Group vs. Join
  grouped_data: {(queryString, results:{(url, position)}, revenue:{(adSlot, amount)})}
  grouped_data = COGROUP results BY queryString,
                         revenue BY queryString;
  join_result = FOREACH grouped_data
                GENERATE FLATTEN(results), FLATTEN(revenue);
Result is the same as JOIN
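The claim that COGROUP followed by two FLATTENs gives the same result as JOIN can be checked on toy data. The Python sketch below models flattening two bags in one GENERATE as their cross product, so a key with an empty bag on either side yields nothing, just as in an inner join. All names and rows here are assumptions for illustration, and the output tuples carry the key once, following the simplified types on the slides.

```python
from itertools import product


def cogroup(results, revenue):
    """One tuple per key in either input, with a bag per input."""
    keys = sorted({row[0] for row in results} | {row[0] for row in revenue})
    return [
        (
            key,
            [row[1:] for row in results if row[0] == key],
            [row[1:] for row in revenue if row[0] == key],
        )
        for key in keys
    ]


def flatten_both(grouped):
    """Model of GENERATE FLATTEN(results), FLATTEN(revenue): per-key cross product."""
    return [
        (key,) + left + right
        for key, left_bag, right_bag in grouped
        for left, right in product(left_bag, right_bag)
    ]


def nested_loop_join(results, revenue):
    """Reference inner join on the first field."""
    return [r + v[1:] for r in results for v in revenue if r[0] == v[0]]


results = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("ipod", "apple.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("kings", "top", 30)]

via_cogroup = flatten_both(cogroup(results, revenue))
direct = nested_loop_join(results, revenue)
```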

[Olston’2008]
Asking for Output: STORE
  STORE query_revenues INTO 'theoutput'
        USING userStoreFcn();
Meaning: write query_revenues to the file 'theoutput'
