Topics in Data Science, Cheng Ren, Lixing Lian

MapReduce Extension
Topics in Data Science
Cheng Ren, Lixing Lian

Outline
• Scientific Workloads
  - Hadoop's Adolescence: An Analysis of Hadoop Usage in Scientific Workloads
• Iterative extension
  - HaLoop: Efficient Iterative Data Processing on Large Clusters
• Adaptive Indexes
  - Only Aggressive Elephants are Fast Elephants (HAIL)

Hadoop's Adolescence: An Analysis of Hadoop Usage in Scientific Workloads

HaLoop: Efficient Iterative Data Processing on Large Clusters
Yingyi Bu, Bill Howe, Magda Balazinska, Michael D. Ernst

Outline
• Motivation
• Examples that cannot be executed efficiently
• Architecture
• Caching ideas

Motivation
• MapReduce can't express recursion/iteration
• Lots of interesting programs need loops
  - graph algorithms
  - clustering
  - machine learning
  - recursive queries (CTEs, Datalog, WITH clause)
• Dominant solution: use a driver program outside of MapReduce
• Hypothesis: making MapReduce loop-aware affords optimization
  - lays a foundation for scalable implementations of recursive languages

Example 1: PageRank

PageRank Implementation on MapReduce

What's the problem?
L and Count are loop invariants, but
1. They are loaded on each iteration
2. They are shuffled on each iteration
3. Also, the fixpoint is evaluated as a separate MapReduce job per iteration
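The pattern criticized above can be sketched in a few lines. This is a toy, single-process illustration and not the implementation from the slides: `run_mapreduce`, `LINKS`, `DAMPING`, and `pagerank_driver` are hypothetical names, and the link table stands in for the invariant relation L.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy single-process MapReduce: map, group by key (the 'shuffle'), reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]

# Hypothetical linkage table (loop-invariant): source page -> destination pages.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
DAMPING = 0.85  # assumed damping factor

def pagerank_step(ranks):
    """One iteration expressed as a single MapReduce job."""
    n = len(LINKS)

    def mapper(record):
        page, rank = record
        # The invariant link list is consulted again on every iteration.
        out = LINKS[page]
        yield page, 0.0  # make sure every page reaches the reduce phase
        for dest in out:
            yield dest, rank / len(out)

    def reducer(page, contributions):
        yield page, (1 - DAMPING) / n + DAMPING * sum(contributions)

    return dict(run_mapreduce(ranks.items(), mapper, reducer))

def pagerank_driver(max_iters=100, tol=1e-6):
    """Driver loop that lives outside MapReduce."""
    ranks = {page: 1.0 / len(LINKS) for page in LINKS}
    for iteration in range(1, max_iters + 1):
        new_ranks = pagerank_step(ranks)
        # Fixpoint test; on a cluster this would be a separate job per iteration.
        delta = sum(abs(new_ranks[p] - ranks[p]) for p in ranks)
        ranks = new_ranks
        if delta < tol:
            break
    return ranks, iteration
```

Calling `pagerank_driver()` returns the rank table and the number of iterations run. Every pass through `pagerank_step` touches `LINKS` again, which is the repeated load and shuffle the slide points at.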

Example 2: Transitive Closure

Transitive Closure on MapReduce

What's the problem?
Friend is loop invariant, but
1. Friend is loaded on each iteration
2. Friend is shuffled on each iteration
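A minimal sketch of the iterative computation, assuming Friend is a set of directed `(x, y)` pairs. The function name and the in-memory set representation are illustrative choices, not taken from the slides.

```python
def transitive_closure(friend):
    """Iterate join + duplicate elimination until no new pairs appear."""
    closure = set(friend)
    frontier = set(friend)
    iterations = 0
    while frontier:
        iterations += 1
        # Join step: extend each newly found pair (x, y) by one Friend edge (y, z).
        # Friend never changes, yet a plain MapReduce job would re-read and
        # re-shuffle it at this point in every iteration.
        joined = {(x, z) for (x, y) in frontier for (y2, z) in friend if y == y2}
        # Duplicate-elimination step: keep only pairs not seen before.
        frontier = joined - closure
        closure |= frontier
    return closure, iterations
```

For the chain `{("a", "b"), ("b", "c"), ("c", "d")}` the loop discovers `("a", "c")` and `("b", "d")` first, then `("a", "d")`, and stops once an iteration yields nothing new.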

Push loops into MapReduce!
• Architecture
• Cache loop-invariant data
• Programming Model

HaLoop Architecture

Inter-iteration caching

RI: Reducer Input Cache
• Provides:
  - Access to loop-invariant data without map/shuffle
• Data for:
  - Reducer function
• Assumes:
  1. Static partitioning (implies: no new nodes)
  2. Deterministic mapper implementation
• PageRank
  - Avoid loading and shuffling the web graph at every iteration
• Transitive Closure
  - Avoid loading and shuffling the friends graph at every iteration
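The idea can be illustrated with a small simulation. `ReducerInputCache` and its methods are hypothetical names for this sketch and are not HaLoop's API; the byte-sum partitioner is only a stand-in for a static, deterministic partitioning function.

```python
from collections import defaultdict

class ReducerInputCache:
    """Keeps loop-invariant reducer input grouped by key on each simulated reducer."""

    def __init__(self, num_reducers):
        self.num_reducers = num_reducers
        self.partitions = [defaultdict(list) for _ in range(num_reducers)]
        self.loaded = False
        self.shuffled_records = 0  # records moved through the simulated shuffle

    def partition(self, key):
        # Static, deterministic partitioning: the same key always lands on the
        # same reducer, which is what makes the cached data reusable.
        return sum(key.encode()) % self.num_reducers

    def load_invariant(self, records):
        """Map and shuffle the invariant data once; later calls are no-ops."""
        if self.loaded:
            return
        for key, value in records:
            self.partitions[self.partition(key)][key].append(value)
            self.shuffled_records += 1
        self.loaded = True

    def reduce_inputs(self, variant_records):
        """Shuffle only the per-iteration data and pair it with cached values."""
        for key, value in variant_records:
            self.shuffled_records += 1
            cached = self.partitions[self.partition(key)].get(key, [])
            yield key, value, cached
```

With three invariant link records and two rank records per iteration, three iterations move 3 + 3 × 2 = 9 records through the simulated shuffle, instead of 3 × 5 = 15 without the cache.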

RO: Reducer Output Cache
• Provides:
  - Distributed access to output of previous iterations
• Used by:
  - Fixpoint evaluation
• Assumes:
  1. Partitioning constant across iterations
  2. Reducer output key functionally determines reducer input key
• PageRank
  - Allows distributed fixpoint evaluation
  - Obviates extra MapReduce job
• Transitive Closure
  - No help
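A rough sketch of distributed fixpoint evaluation, assuming numeric reducer output and an absolute-difference distance. The class and function names are hypothetical. Each simulated reducer compares its new output with its own cached copy, so only one number per reducer has to be collected.

```python
class ReducerOutputCache:
    """Holds one reducer's output from the previous iteration."""

    def __init__(self):
        self.previous = {}  # key -> value emitted in the previous iteration

    def local_distance(self, current):
        """Compare new output with the cached output, then replace the cache."""
        dist = sum(abs(v - self.previous.get(k, 0.0)) for k, v in current.items())
        self.previous = dict(current)
        return dist

def reached_fixpoint(caches, outputs, threshold):
    """Sum the per-reducer distances; no extra job over the full output is needed."""
    total = sum(cache.local_distance(out) for cache, out in zip(caches, outputs))
    return total < threshold
```

This relies on the slide's assumptions: partitioning stays constant, so the same key is compared on the same reducer in consecutive iterations.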

MI: Mapper Input Cache
• Provides:
  - Access to non-local mapper input on later iterations
• Data for:
  - Map function
• Assumes:
  Mapper input does not change
  - Avoids non-local data reads on iterations > 0

Programming Model
• Mapper/reducer stay the same!
• Touch points
  - Input/Output: for each <iteration, step>
  - Cache filter: which tuple to cache?
  - Distance function: optional
• Nested job containing child jobs as loop body
• Minimize extra programming effort
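To make the touch points concrete, here is a hedged sketch of a loop job with a multi-step body and an optional distance function. `LoopJob` and its fields are invented for illustration and do not reproduce HaLoop's actual interface; the cache filter is left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class LoopJob:
    """Nested job: the listed steps form the loop body (hypothetical API)."""
    steps: list                           # each step: f(state, iteration) -> new state
    max_iterations: int
    distance: Optional[Callable] = None   # optional fixpoint test
    threshold: float = 0.0

    def run(self, state):
        for iteration in range(self.max_iterations):
            previous = state
            # Run every child step of the loop body in order.
            for step in self.steps:
                state = step(state, iteration)
            # Stop early once consecutive results are close enough.
            if self.distance is not None and self.distance(previous, state) <= self.threshold:
                return state, iteration + 1
        return state, self.max_iterations
```

The steps themselves stay ordinary functions, mirroring the point that mappers and reducers do not change; only the loop control is new.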

Conclusions
• Relatively simple changes to MapReduce/Hadoop can support iterative/recursive programs
  - TaskTracker (cache management)
  - Scheduler (cache awareness)
  - Programming model (multi-step loop bodies, cache control)
• Optimizations
  - Caching reducer input realizes the largest gain
  - Good to eliminate extra MapReduce step for termination checks
  - Mapper input cache benefit inconclusive; need a busier cluster

Only Aggressive Elephants are Fast Elephants
Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jörg Schad

Outline
• Motivation
• Comparison between Hadoop and HAIL
• Upload pipeline
• Query pipeline

Bob
• Analyzes a large web log using filter conditions (source IP, web address).
• He uses a sequence of different filter conditions, each one triggering a new MapReduce job.
• He is not exactly sure what he is looking for.
• “Let’s see what I am going to encounter on the way.”

Bob
• This kind of use case illustrates an exploratory usage of Hadoop MapReduce.
• It is a major use case of Hadoop MapReduce.
  - One major problem: slow query runtimes.
  - Time dominated by the I/O for reading all input data.

Hadoop Aggressive Indexing Library (HAIL)

HDFS + MapReduce vs. HAIL + MapReduce

HDFS + MapReduce

HDFS
• HDFS blocks: 64 MB (default), horizontal partitions
• Datanodes

HDFS
Allows two failovers

MapReduce
map(row) -> set of (ikey, value)

MapReduce
map(docID, document) -> set of (term, docID)
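The map signature above is the classic inverted-index example. A minimal sketch follows, assuming whitespace tokenization and lowercase terms; the grouping in `build_inverted_index` plays the role of the shuffle and reduce phases, and both function names are illustrative.

```python
from collections import defaultdict

def map_document(doc_id, document):
    """map(docID, document) -> set of (term, docID)."""
    return {(term, doc_id) for term in document.lower().split()}

def build_inverted_index(documents):
    """Group the map output by term, as shuffle + reduce would."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term, d in map_document(doc_id, text):
            index[term].add(d)
    # Sort the posting lists so the result is deterministic.
    return {term: sorted(ids) for term, ids in index.items()}
```

For example, `build_inverted_index({1: "big data", 2: "big elephants"})` maps `"big"` to both documents and each remaining term to the single document that contains it.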

HAIL + MapReduce

HAIL
• HDFS blocks: 64 MB (default), horizontal partitions

HAIL
1. Convert the input file into binary PAX
2. Create a series of different sort orders
3. Create multiple clustered indexes
- If indexes cannot help, fall back to standard Hadoop scanning.

HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica.

HAIL Upload Pipeline

HAIL Upload Pipeline
Why Clustered Indexes?
- Unclustered indexes are only competitive for very selective queries, as they may trigger considerable random I/O for non-selective index traversals.
- Clustered indexes do not have that problem. Whatever the selectivity, we read the clustered index and scan the qualifying blocks.
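The per-replica indexing idea can be sketched as follows. This is a simplified in-memory illustration, not HAIL's code: `Replica` and `query` are hypothetical names, rows are plain dictionaries rather than binary PAX, and a sorted key list with binary search stands in for a clustered index.

```python
import bisect

class Replica:
    """One replica of a block, physically sorted on a single attribute."""

    def __init__(self, rows, sort_attr):
        self.sort_attr = sort_attr
        # Clustered layout: the rows themselves are stored in index order.
        self.rows = sorted(rows, key=lambda r: r[sort_attr])
        self.keys = [r[sort_attr] for r in self.rows]

    def range_lookup(self, low, high):
        """Binary-search the key list, then read one contiguous run of rows."""
        lo = bisect.bisect_left(self.keys, low)
        hi = bisect.bisect_right(self.keys, high)
        return self.rows[lo:hi]

def query(replicas, attr, low, high):
    """Use the replica indexed on attr if one exists, otherwise scan."""
    for replica in replicas:
        if replica.sort_attr == attr:
            return replica.range_lookup(low, high), "index"
    # No suitable index: fall back to a full scan of any replica.
    rows = [r for r in replicas[0].rows if low <= r[attr] <= high]
    return rows, "scan"
```

Because every replica holds the same rows in a different sort order, a filter on any indexed attribute reads one contiguous range, while a filter on an unindexed attribute still works through the scan path.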
