

SLIDE 1

BOOM Analytics: Exploring Data-Centric, Declarative Programming for the Cloud

Stefan Istrate

University of Cambridge

February 10, 2011

Stefan Istrate (University of Cambridge) BOOM Analytics February 10, 2011 1 / 17

SLIDE 2

Outline

1. Introduction
  • The problem
  • The solution
2. Background
  • Overlog
3. BOOM Analytics
  • HDFS Rewrite (BOOM-FS)
  • The Availability Rev
  • The Scalability Rev
  • The Monitoring Rev
  • MapReduce Port (BOOM-MR)
4. Performance
5. Conclusions
6. Questions / Comments

SLIDE 3

The problem

Building and debugging distributed software is extremely difficult. The developer spends time on:

  • orchestrating concurrent computation and communication across machines
  • minimizing delays
  • handling failures

instead of being creative.

SLIDE 4

The solution

A broad range of distributed software can be recast in a data-parallel programming model.

Solution:

  • adopt a data-centric approach to system design
  • switch to declarative programming languages

Advantages:

  • raised level of abstraction for programmers
  • improved code simplicity
  • better speed of development
  • ease of software evolution
  • program correctness

SLIDE 5

BOOM Analytics

  • BOOM = Berkeley Orders Of Magnitude
  • BOOM Analytics = reimplementation of HDFS and Hadoop MapReduce in Overlog
  • Why Hadoop?
      1. It shows the distributed power of a cluster.
      2. Significant distributed features are missing => It can be extended.

SLIDE 6

Overlog

  • declarative language (logic of computation, not the control flow)
  • based on Datalog
      • defined over relational tables
      • query language that makes no changes to the stored tables
      • rules: rhead(col-list) :- r1(col-list), ..., rn(col-list)
  • extends Datalog
      • can specify location of data
      • primary keys and aggregation
      • defines a model for processing and generating changes to tables
  • relational tables may be partitioned across a set of machines
  • implementations: P2, JOL (Java-based Overlog)
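
To make the rule form above concrete, here is a minimal Python sketch of how a Datalog-style engine might evaluate a recursive rule bottom-up until nothing new can be derived. The `link` and `path` relations and the `evaluate` function are hypothetical names chosen for illustration; this is not how P2 or JOL is implemented, and it ignores Overlog's extensions (data location, primary keys, aggregation).

```python
def evaluate(link):
    """Naive bottom-up evaluation of two Datalog-style rules:

         path(X, Y) :- link(X, Y).
         path(X, Z) :- link(X, Y), path(Y, Z).

    `link` is a set of (source, destination) tuples. The stored table is
    only read, never modified; derived tuples go into a separate set.
    """
    path = set(link)  # first rule: every link is a path
    while True:
        # second rule: join link and path on the shared variable Y
        derived = {(x, z) for (x, y) in link for (y2, z) in path if y == y2}
        if derived <= path:  # fixpoint reached, nothing new was derived
            return path
        path |= derived


if __name__ == "__main__":
    links = {("a", "b"), ("b", "c"), ("c", "d")}
    for row in sorted(evaluate(links)):
        print(row)
```

The loop re-applies the rules until the derived set stops growing, which is the usual fixpoint reading of a recursive rule; real engines typically use incremental (semi-naive) evaluation instead of recomputing the full join each round.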

SLIDE 7

HDFS

  • file system metadata stored at a centralized NameNode
  • file data distributed across DataNodes
  • by default, data is split into 64 MB chunks, each replicated three times
  • DataNodes send heartbeat messages to the NameNode
  • clients contact the NameNode only for metadata operations; file data is read from and written to the DataNodes directly
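
As a small worked example of the defaults listed above, the following Python sketch computes how many chunks a file occupies and how much raw storage its replicas consume. The function names are hypothetical, and the constants simply restate the 64 MB chunk size and replication factor of three from the slide.

```python
MB = 1024 * 1024
CHUNK_SIZE = 64 * MB  # default chunk size stated on the slide
REPLICATION = 3       # default replication factor stated on the slide


def chunk_count(file_size_bytes):
    """Number of chunks needed for a file (the last chunk may be partial)."""
    return -(-file_size_bytes // CHUNK_SIZE)  # integer ceiling division


def raw_storage_bytes(file_size_bytes):
    """Bytes stored across all DataNodes once every chunk is replicated."""
    return file_size_bytes * REPLICATION


if __name__ == "__main__":
    size = 200 * MB
    print(chunk_count(size), "chunks,", raw_storage_bytes(size) // MB, "MB raw")
```

A 200 MB file therefore needs three full chunks plus one 8 MB chunk, and its replicas occupy 600 MB in total.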

SLIDE 8

BOOM-FS

  • represent file system metadata as a collection of relations

Name       Description                 Relevant attributes
file       Files                       fileid, parentfileid, name, isDir
fqpath     Fully-qualified pathnames   path, fileid
fchunk     Chunks per file             chunkid, fileid
datanode   DataNode heartbeats         nodeAddr, lastHeartbeatTime
hb_chunk   Chunk heartbeats            nodeAddr, chunkid, length

  • metadata and heartbeat protocols implemented with Overlog rules
  • data protocol implemented in Java
  • 4 person-months of work

System    Lines of Java   Lines of Overlog
HDFS      21,700          -
BOOM-FS   1,431           469
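
The relationship between the file and fqpath relations can be sketched in Python as a derived view: each fully-qualified path is the parent's path joined with the file's name. This is only an illustration of the idea, not the BOOM-FS rules themselves; the sample rows, the convention that the root has no parent (`None`), and the `derive_fqpath` function are all assumptions.

```python
# file(fileid, parentfileid, name, isDir); sample rows are hypothetical.
# Assumption: the root directory is the row whose parentfileid is None.
FILES = {
    (0, None, "", True),
    (1, 0, "usr", True),
    (2, 1, "data.txt", False),
    (3, 0, "tmp", True),
}


def derive_fqpath(files):
    """Derive fqpath(path, fileid) from file rows by repeated joining."""
    # Base case: the root directory has the path "/".
    fqpath = {("/", fid) for fid, parent, _, _ in files if parent is None}
    changed = True
    while changed:
        changed = False
        for fid, parent, name, _ in files:
            for parent_path, parent_id in list(fqpath):
                if parent_id == parent:
                    # Child path = parent path + "/" + name.
                    row = (parent_path.rstrip("/") + "/" + name, fid)
                    if row not in fqpath:
                        fqpath.add(row)
                        changed = True
    return fqpath


if __name__ == "__main__":
    for path, fid in sorted(derive_fqpath(FILES)):
        print(fid, path)
```

Because fqpath is computed from file rather than stored independently, adding or renaming a file only requires changing the file relation and re-deriving the view.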

SLIDE 9

The Availability Rev

Goal: hot standby replication for NameNodes

Solution: Paxos algorithm

  • solves consensus in the network
  • is a collection of logical invariants
  • messages and disk writes → insertions into tables
  • invariants → rules

Results:

  • 400 lines of code
  • 6 person-weeks of development time
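
To illustrate what "a collection of logical invariants" can look like, here is a textbook single-decree Paxos sketch in Python, run synchronously in one process. It is not the BOOM implementation, which is written in Overlog and runs over a network; the `Acceptor` class, the `propose` function and the message tuples are hypothetical. Each received message is appended to a `log` list, loosely mirroring the "messages → insertions into tables" mapping.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1    # highest ballot this acceptor has promised
        self.accepted = None  # (ballot, value) most recently accepted, if any
        self.log = []         # insert-only record of received messages

    def on_prepare(self, ballot):
        self.log.append(("prepare", ballot))
        # Invariant: never promise a ballot lower than one already promised.
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", ballot, self.accepted)
        return ("nack", self.promised)

    def on_accept(self, ballot, value):
        self.log.append(("accept", ballot, value))
        # Invariant: accept only if no higher ballot has been promised.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return ("accepted", ballot, value)
        return ("nack", self.promised)


def propose(acceptors, ballot, value):
    """Run both phases; return the chosen value, or None without a majority."""
    majority = len(acceptors) // 2 + 1
    promises = [a.on_prepare(ballot) for a in acceptors]
    granted = [p for p in promises if p[0] == "promise"]
    if len(granted) < majority:
        return None
    # Invariant: adopt the value of the highest-ballot prior acceptance.
    prior = [p[2] for p in granted if p[2] is not None]
    if prior:
        value = max(prior, key=lambda bv: bv[0])[1]
    replies = [a.on_accept(ballot, value) for a in acceptors]
    accepted = [r for r in replies if r[0] == "accepted"]
    return value if len(accepted) >= majority else None
```

With three acceptors, a first proposal with ballot 1 and value "A" is chosen; a later proposal with ballot 2 and value "B" still returns "A", because the proposer must adopt the previously accepted value. Each commented invariant corresponds to the kind of condition that a declarative rule would state directly.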

SLIDE 10

The Scalability Rev

Goal: scale out the NameNode across multiple partitions

Solution: add a "partition" column to tables to split them across nodes

Results: 8 hours of development time
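
A minimal Python sketch of the partition-column idea follows. It assumes rows are assigned to a partition by hashing the path, which is one plausible choice and not necessarily the key used in BOOM-FS; `partition_of`, `add_partition_column` and `rows_for_node` are hypothetical helpers.

```python
import zlib


def partition_of(path, num_partitions):
    """Map a path to a partition id using a hash that is stable across runs."""
    return zlib.crc32(path.encode("utf-8")) % num_partitions


def add_partition_column(fqpath_rows, num_partitions):
    """Turn (path, fileid) rows into (partition, path, fileid) rows."""
    return [(partition_of(path, num_partitions), path, fileid)
            for path, fileid in fqpath_rows]


def rows_for_node(rows, node_id):
    """Select the rows that the NameNode partition `node_id` is responsible for."""
    return [row for row in rows if row[0] == node_id]


if __name__ == "__main__":
    rows = add_partition_column([("/usr", 1), ("/usr/data.txt", 2), ("/tmp", 3)], 2)
    for node in range(2):
        print(node, rows_for_node(rows, node))
```

`zlib.crc32` is used instead of Python's built-in `hash` because the latter is randomized per process for strings, so different nodes would disagree on where a row lives.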

SLIDE 11

The Monitoring Rev

Goal: develop performance monitoring and debugging tools

Solution:

  • replicate the body of each rule and send it to a log table
  • add a relation called "die" to JOL
  • when a tuple is inserted into "die", a Java exception is thrown

Results:

  • performance monitoring: 64 lines of code, less than 1 day
  • debugging: 60 lines of code, 8 person-hours
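
The two mechanisms above can be approximated in a few lines of Python. This is only an analogy for the JOL behaviour described on the slide: the `Tables` class, the `traced` wrapper, the `replica_limit` rule and its threshold are all hypothetical.

```python
class InvariantViolation(Exception):
    """Raised when a tuple is inserted into the 'die' table."""


class Tables:
    def __init__(self):
        self.rows = {"log": [], "die": []}

    def insert(self, table, row):
        self.rows.setdefault(table, []).append(row)
        if table == "die":
            # Mirrors the slide: inserting into "die" raises an exception.
            raise InvariantViolation(f"invariant violated: {row}")


def traced(tables, rule_name, rule):
    """Wrap a rule so that every firing copies its inputs to the log table."""
    def wrapper(*body):
        tables.insert("log", (rule_name, body))
        return rule(tables, *body)
    return wrapper


MAX_REPLICAS = 3  # hypothetical invariant threshold, for illustration only


def replica_limit(tables, chunkid, replica_count):
    """Hypothetical invariant: no chunk may have more than MAX_REPLICAS copies."""
    if replica_count > MAX_REPLICAS:
        tables.insert("die", (chunkid, replica_count))


if __name__ == "__main__":
    tables = Tables()
    rule = traced(tables, "replica_limit", replica_limit)
    rule("c1", 3)
    print(tables.rows["log"])
```

Because the logging lives in a wrapper rather than in each rule, tracing can be switched on for any rule without touching its logic, which is in the spirit of the small line counts reported above.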

SLIDE 12

Hadoop MapReduce

  • single master node (JobTracker)
  • many worker nodes (TaskTrackers)
  • a job is divided into map tasks and reduce tasks
  • map: reads an input chunk, runs a function, partitions the output into buckets
  • reduce: fetches hash buckets, sorts by key, runs a function, writes to the distributed file system
  • fixed number of slots for every TaskTracker
  • heartbeat protocol between each TaskTracker and the JobTracker
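
The map and reduce steps listed above can be sketched as a single-process word count in Python. This only illustrates the data flow (map, partition into hash buckets, sort, reduce); it does not use Hadoop's API, and `map_task`, `reduce_task` and `run_job` are hypothetical names.

```python
import zlib
from collections import defaultdict


def map_task(chunk, num_buckets):
    """Read one input chunk, emit (word, 1) pairs, partition them into buckets."""
    buckets = defaultdict(list)
    for word in chunk.split():
        bucket = zlib.crc32(word.encode("utf-8")) % num_buckets
        buckets[bucket].append((word, 1))
    return buckets


def reduce_task(bucket_id, map_outputs):
    """Fetch one hash bucket from every map output, sort by key, sum the counts."""
    pairs = []
    for output in map_outputs:
        pairs.extend(output.get(bucket_id, []))
    pairs.sort(key=lambda kv: kv[0])
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts


def run_job(chunks, num_buckets=2):
    """Run every map task, then one reduce task per bucket, and merge the results."""
    map_outputs = [map_task(chunk, num_buckets) for chunk in chunks]
    result = {}
    for bucket_id in range(num_buckets):
        result.update(reduce_task(bucket_id, map_outputs))
    return result


if __name__ == "__main__":
    print(run_job(["a b a", "b c"]))
```

Hashing the key guarantees that all pairs for a given word land in the same bucket, so each reduce task can compute its counts without seeing the other buckets.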

SLIDE 13

BOOM-MR

Name          Description               Relevant attributes
job           Job definitions           jobid, priority, submit_time, status, jobConf
task          Task definitions          jobid, taskid, type, partition, status
taskAttempt   Task attempts             jobid, taskid, attemptid, progress, state, phase, tracker, input_loc, start, finish
taskTracker   TaskTracker definitions   name, hostname, state, map_count, reduce_count, max_map, max_reduce

  • evaluation on Hadoop's default First-Come-First-Serve (FCFS) policy and the LATE (Longest Approximate Time to End) policy
  • better results for LATE

Results:

  • initial version: one person-month
  • debugging and tuning: two person-months
  • 55 Overlog rules
  • 6573 lines removed from Hadoop
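
The core estimate behind LATE can be sketched in Python using the progress, start and state attributes of the taskAttempt relation above. The sketch assumes progress is a fraction between 0 and 1 and that a running attempt has the state "RUNNING"; both are assumptions, as are the function names. It also omits the thresholds and caps that the full LATE policy applies before launching a speculative copy.

```python
def time_to_end(progress, start, now):
    """Estimate remaining time as (1 - progress) / progress_rate,
    where progress_rate = progress / elapsed time."""
    elapsed = now - start
    if progress <= 0 or elapsed <= 0:
        return float("inf")  # no progress observed yet, so no finite estimate
    rate = progress / elapsed
    return (1.0 - progress) / rate


def pick_speculative(attempts, now):
    """Return the attemptid of the running attempt expected to finish last."""
    running = [a for a in attempts if a["state"] == "RUNNING"]
    if not running:
        return None
    slowest = max(running,
                  key=lambda a: time_to_end(a["progress"], a["start"], now))
    return slowest["attemptid"]


if __name__ == "__main__":
    attempts = [
        {"attemptid": "a1", "progress": 0.5, "start": 0, "state": "RUNNING"},
        {"attemptid": "a2", "progress": 0.9, "start": 0, "state": "RUNNING"},
        {"attemptid": "a3", "progress": 0.2, "start": 50, "state": "RUNNING"},
    ]
    print(pick_speculative(attempts, now=100))
```

Under FCFS the scheduler would ignore these estimates; LATE instead ranks running attempts by expected finish time, so the attempt with the least progress is not automatically the one re-executed.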

SLIDE 14

Performance

SLIDE 15

Performance (cont.)

SLIDE 16

Conclusions

Good things:

  • focus on what, not on how
  • simplified code
  • faster development
  • program correctness

Bad things:

  • system load averages higher with BOOM Analytics
  • Overlog needs some other features
  • difficult and time-consuming to read the code
  • hard for programmers to switch to declarative programming

SLIDE 17

Questions / Comments?
