BOOM Analytics: Exploring Data-Centric, Declarative Programming for the Cloud
Stefan Istrate
University of Cambridge
February 10, 2011

Outline
1. Introduction
   - The problem
   - The solution
2. Background
   - Overlog
3. BOOM Analytics
   - HDFS Rewrite (BOOM-FS)
   - The Availability Rev
   - The Scalability Rev
   - The Monitoring Rev
   - MapReduce Port (BOOM-MR)
4. Performance
5. Conclusions
6. Questions / Comments

The problem
Building and debugging distributed software is extremely difficult.
The developer spends time on:
- orchestrating concurrent computation and communication across machines
- minimizing delays
- handling failures
instead of being creative.

The solution
A broad range of distributed software can be recast in a data-parallel programming model.
Solution:
- adopt a data-centric approach to system design
- switch to declarative programming languages
Advantages:
- raised level of abstraction for programmers
- improved code simplicity
- faster development
- ease of software evolution
- program correctness

BOOM Analytics
BOOM = Berkeley Orders Of Magnitude
BOOM Analytics = reimplementation of HDFS and Hadoop MapReduce in Overlog
Why Hadoop?
1. It shows the distributed power of a cluster.
2. Significant distributed features are missing, so it can be extended.

Overlog
- declarative language (logic of computation, not the control flow)
- based on Datalog
  - defined over relational tables
  - query language that makes no changes to the stored tables
  - rules: r_head(<col-list>) :- r_1(<col-list>), ..., r_n(<col-list>)
- extends Datalog
  - can specify location of data
  - primary keys and aggregation
  - defines a model for processing and generating changes to tables
- relational tables may be partitioned across a set of machines
- implementations: P2, JOL (Java-based Overlog)
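
To make the rule form concrete, here is a minimal Python sketch of how a pair of Datalog-style rules can be evaluated to a fixpoint. The `link` and `reach` relations and the `evaluate` helper are hypothetical names chosen for illustration. This is naive bottom-up evaluation over in-memory sets, not a description of how P2 or JOL works internally.

```python
def evaluate(link):
    """Naive bottom-up evaluation of two Datalog-style rules:

        reach(X, Y) :- link(X, Y)
        reach(X, Z) :- link(X, Y), reach(Y, Z)

    `link` is a set of (src, dst) tuples. The stored table is never
    modified; the derived `reach` relation is returned as a new set.
    """
    reach = set(link)  # first rule: every link is a reachable pair
    while True:
        # second rule: join link and reach on the shared variable Y
        derived = {(x, z) for (x, y) in link for (y2, z) in reach if y == y2}
        new = derived - reach
        if not new:  # fixpoint: no rule produces a new tuple
            return reach
        reach |= new
```

Calling `evaluate({("a", "b"), ("b", "c")})` derives the extra tuple `("a", "c")` and leaves the input set unchanged, which mirrors the point that Datalog queries make no changes to the stored tables.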

HDFS
- file system metadata stored at a centralized NameNode
- file data distributed across DataNodes
- by default, data chunks of 64 MB replicated three times
- DataNodes send heartbeat messages to the NameNode
- clients contact the NameNode only for metadata operations; file data is exchanged directly with DataNodes
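
As a quick worked example of the defaults above, the following Python sketch computes how many chunks a file occupies and how many chunk replicas the cluster stores for it. The `chunk_layout` helper is a hypothetical name, and the sketch assumes "64 MB" means 64 * 1024 * 1024 bytes.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # default chunk size from the slide (assumed MiB)
REPLICATION = 3                # default replication factor from the slide


def chunk_layout(file_size, chunk_size=CHUNK_SIZE, replication=REPLICATION):
    """Return (number of chunks, number of stored chunk replicas) for a file."""
    # integer ceiling division: a partially filled last chunk still counts
    chunks = (file_size + chunk_size - 1) // chunk_size
    return chunks, chunks * replication
```

Under these assumptions a 200 MB file needs 4 chunks, so 12 chunk replicas are spread across the DataNodes.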

BOOM-FS
- represents file system metadata as a collection of relations

Name      Description                Relevant attributes
file      Files                      fileid, parentfileid, name, isDir
fqpath    Fully-qualified pathnames  path, fileid
fchunk    Chunks per file            chunkid, fileid
datanode  DataNode heartbeats        nodeAddr, lastHeartbeatTime
hb_chunk  Chunk heartbeats           nodeAddr, chunkid, length

- metadata and heartbeat protocols implemented with Overlog rules
- data protocol implemented in Java
- 4 person-months of work

System   Lines of Java  Lines of Overlog
HDFS     21,700         0
BOOM-FS  1,431          469
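
The relationship between the file and fqpath relations listed above can be sketched in Python as a derived view: fqpath is computed from file by following parent links until no new paths appear. The `derive_fqpath` helper is a hypothetical name, and the sketch assumes the root directory is the tuple whose parentfileid is `None`; the actual BOOM-FS rules are written in Overlog and may encode the root differently.

```python
def derive_fqpath(file_rel):
    """Derive fqpath(path, fileid) from file(fileid, parentfileid, name, isDir).

    Assumption for this sketch: the root directory has parentfileid None.
    """
    # base case: the root directory maps to "/"
    fqpath = {("/", fid) for (fid, parent, name, is_dir) in file_rel
              if parent is None}
    while True:
        known = {fid: path for (path, fid) in fqpath}
        derived = set()
        # recursive case: join each file with the known path of its parent
        for fid, parent, name, is_dir in file_rel:
            if parent in known:
                prefix = known[parent].rstrip("/")
                derived.add((prefix + "/" + name, fid))
        new = derived - fqpath
        if not new:  # fixpoint reached
            return fqpath
        fqpath |= new
```

For a root directory with id 0, a directory `usr` under it with id 1, and a file `data.txt` under `usr` with id 2, the derived relation contains `("/", 0)`, `("/usr", 1)` and `("/usr/data.txt", 2)`.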

The Availability Rev
Goal: hot standby replication for NameNodes
Solution: Paxos algorithm
- solves consensus in the network
- is a collection of logical invariants
- messages and disk writes → insertions into tables
- invariants → rules
Results:
- 400 lines of code
- 6 person-weeks of development time
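
The mapping from messages to table insertions and from invariants to rules can be illustrated with a single Paxos-style invariant: a value counts as chosen once a majority of acceptors have accepted it under the same ballot. The Python sketch below is only that one invariant, with hypothetical names (`accepted`, `chosen`, the tuple layout); it is not a full Paxos implementation and is not taken from the BOOM code.

```python
from collections import defaultdict


def chosen(accepted, acceptors):
    """Rule-like view over the `accepted` table.

    accepted:  set of (acceptor, ballot, value) tuples, one per received message
    acceptors: set of all acceptor ids
    Returns the set of (ballot, value) pairs accepted by a majority.
    """
    votes = defaultdict(set)
    for acceptor, ballot, value in accepted:
        if acceptor in acceptors:  # ignore messages from unknown nodes
            votes[(ballot, value)].add(acceptor)
    majority = len(acceptors) // 2 + 1
    return {key for key, who in votes.items() if len(who) >= majority}
```

Receiving a message is just `accepted.add(...)`; the "rule" is re-evaluated over the table, so with three acceptors the pair `(1, "x")` becomes chosen as soon as the second matching tuple is inserted.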

The Scalability Rev
Goal: scale out the NameNode across multiple partitions
Solution: add a "partition" column to tables to split them across nodes
Results: 8 hours of development time
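
A minimal Python sketch of the partition-column idea follows. It assumes tuples are assigned to partitions by hashing the fully-qualified path, which is one simple strategy; the slide does not say which key is used. The helper names are hypothetical.

```python
import hashlib


def partition_of(path, num_partitions):
    """Map a path to a partition id with a hash that is stable across runs."""
    digest = hashlib.sha1(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def add_partition_column(fqpath, num_partitions):
    """Turn fqpath(path, fileid) into fqpath(partition, path, fileid)."""
    return {(partition_of(path, num_partitions), path, fid)
            for (path, fid) in fqpath}


def tuples_for(partitioned, partition):
    """Select the tuples that one NameNode partition is responsible for."""
    return {t for t in partitioned if t[0] == partition}
```

`hashlib` is used instead of the built-in `hash` because Python randomizes string hashes per process, and every node has to agree on where a given path lives.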

The Monitoring Rev
Goal: develop performance monitoring and debugging tools
Solution:
- replicate the body of each rule and send it to a log table
- add a relation called "die" to JOL; when a tuple is added to "die", a Java exception is thrown
Results:
- performance monitoring: 64 lines of code, less than 1 day
- debugging: 60 lines of code, 8 person-hours
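
The two mechanisms can be sketched in Python under loud assumptions: the `Tables` class, the `DieError` exception, and the under-replication check are all hypothetical stand-ins, and the sketch raises a Python exception where JOL throws a Java one. It shows the shape of the idea (every rule evaluation leaves a row in a log table, and an insertion into "die" aborts the program), not the actual BOOM rules.

```python
class DieError(RuntimeError):
    """Raised when a tuple is inserted into the hypothetical `die` table."""


class Tables:
    def __init__(self):
        self.rows = {"log": [], "die": []}

    def insert(self, table, row):
        self.rows.setdefault(table, []).append(row)
        if table == "die":  # the row is stored first, then execution aborts
            raise DieError(f"invariant violated: {row!r}")


def check_replication(tables, chunk_replicas, minimum=1):
    """Debug-rule sketch: log every evaluation, insert into `die` on violation.

    chunk_replicas maps a chunk id to its current replica count (hypothetical).
    """
    for chunkid, count in sorted(chunk_replicas.items()):
        tables.insert("log", ("check_replication", chunkid, count))
        if count < minimum:
            tables.insert("die", ("under_replicated", chunkid, count))
```

Because the violating tuple is stored before the exception is raised, the "die" table still holds the evidence when the failure is inspected afterwards.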

Hadoop MapReduce
- single master node (JobTracker)
- many worker nodes (TaskTrackers)
- a job is divided into maps and reduces
  - map: reads an input chunk, runs a function, partitions the output into buckets
  - reduce: fetches hash buckets, sorts by key, runs a function, writes to the distributed file system
- fixed number of slots for every TaskTracker
- heartbeat protocol between each TaskTracker and the JobTracker
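
The map and reduce steps described above can be sketched as a single-process word count in Python. All function names are hypothetical, the bucket hash (`zlib.crc32`) is an arbitrary choice for this sketch, and the reduce output is returned as a dictionary instead of being written to a distributed file system.

```python
from collections import defaultdict
from itertools import groupby
import zlib


def run_map(chunk, map_fn, num_reduces):
    """Run map_fn over one input chunk and hash-partition its output."""
    buckets = defaultdict(list)
    for key, value in map_fn(chunk):
        bucket = zlib.crc32(key.encode("utf-8")) % num_reduces
        buckets[bucket].append((key, value))
    return buckets


def run_reduce(bucket_id, map_outputs, reduce_fn):
    """Fetch one bucket from every map output, sort by key, apply reduce_fn."""
    pairs = [kv for out in map_outputs for kv in out.get(bucket_id, [])]
    pairs.sort(key=lambda kv: kv[0])
    return [(key, reduce_fn(key, [v for _, v in group]))
            for key, group in groupby(pairs, key=lambda kv: kv[0])]


def word_count(chunks, num_reduces=2):
    """Word count as a map function plus a reduce function."""
    def map_fn(chunk):
        return ((word, 1) for word in chunk.split())

    def reduce_fn(key, values):
        return sum(values)

    outputs = [run_map(chunk, map_fn, num_reduces) for chunk in chunks]
    result = {}
    for bucket_id in range(num_reduces):
        result.update(run_reduce(bucket_id, outputs, reduce_fn))
    return result
```

Since every occurrence of a key hashes to the same bucket, each word is summed by exactly one reduce, so `word_count(["a b a", "b c"])` yields counts of 2 for `a`, 2 for `b` and 1 for `c`.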

BOOM-MR

Name         Description              Relevant attributes
job          Job definitions          jobid, priority, submit_time, status, jobConf
task         Task definitions         jobid, taskid, type, partition, status
taskAttempt  Task attempts            jobid, taskid, attemptid, progress, state, phase, tracker, input_loc, start, finish
taskTracker  TaskTracker definitions  name, hostname, state, map_count, reduce_count, max_map, max_reduce

- evaluation on Hadoop's default First-Come-First-Served (FCFS) policy and the LATE (Longest Approximate Time to End) policy
- better results for LATE
Results:
- initial version: one person-month
- debugging and tuning: two person-months
- 55 Overlog rules
- 6573 lines removed from Hadoop
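
The core estimate behind LATE can be sketched in Python from the `progress` and `start` attributes of taskAttempt: the progress rate is progress divided by elapsed time, and the estimated time to end is the remaining progress divided by that rate. This is a simplified sketch. The dictionary layout, the `"RUNNING"` state value, and both function names are assumptions, and the thresholds and caps the full LATE policy applies before speculating are omitted.

```python
def time_to_end(progress, start, now):
    """Estimate remaining time for an attempt with progress in [0, 1]."""
    elapsed = now - start
    if progress <= 0 or elapsed <= 0:
        return float("inf")  # no progress observed yet: no finite estimate
    rate = progress / elapsed
    return (1.0 - progress) / rate


def pick_speculation_candidate(attempts, now):
    """Return the attemptid of the running attempt expected to finish last."""
    running = [a for a in attempts if a["state"] == "RUNNING"]
    if not running:
        return None
    slowest = max(running,
                  key=lambda a: time_to_end(a["progress"], a["start"], now))
    return slowest["attemptid"]
```

With two attempts started at time 0 and observed at time 10, one at progress 0.9 and one at 0.2, the estimates are about 1.1 and 40 time units, so the second attempt is the one selected.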

Performance

Performance (cont.)

Conclusions
Good things:
- focus on what, not on how
- simplified code
- faster development
- program correctness
Bad things:
- system load averages higher with BOOM Analytics
- Overlog needs some other features
- difficult and time-consuming to read the code
- hard for programmers to switch to declarative programming

Questions / Comments?
