[PPT] - Scalable data store and analytics platform for monitoring PowerPoint Presentation

SLIDE 1

Scalable data store and analytics platform for monitoring WLCG, a distributed data-intensive scientific infrastructure

Uthay Suthakar

Brunel University eepguus@brunel.ac.uk

SLIDE 2

Topics

Introduction to current architecture
Proposed architecture
Lambda architecture
Review of technologies

SLIDE 3

Current architecture:

Robust architecture.
It does the job!

But

Expensive.
Does not scale well.
Does not support real-time analytics.

SLIDE 4

Proposed architecture:

Batch Layer
Stores a constantly growing dataset.

Real-Time Processing Layer
Performs analytics on fresh data.

Serving Layer
Stores the batch-processed views for interactive querying.

SLIDE 5

Lambda Architecture

Three-layer architecture:

Batch Layer – for batch processing on Big Data and producing queryable views.

Serving Layer – for ad-hoc queries (ideally from views generated by the batch layer).

Speed Layer – for real-time views based on incremental algorithms.
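
The interplay of the three layers can be illustrated with a minimal Python sketch. It is not taken from the slides: the event shape, the site names and the function names are all hypothetical, and a simple per-site counter stands in for a real view.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute the view from the full master dataset."""
    return Counter(event["site"] for event in master_dataset)


def update_realtime_view(view, event):
    """Speed layer: fold one fresh event into the view incrementally."""
    view[event["site"]] += 1


def query(site, batch, realtime):
    """Serving side: merge the batch view with the real-time increments."""
    return batch[site] + realtime[site]


# Hypothetical sample events; the site names are placeholders.
master = [{"site": "site_a"}, {"site": "site_b"}, {"site": "site_a"}]
batch = batch_view(master)

realtime = Counter()
for fresh in [{"site": "site_a"}, {"site": "site_c"}]:
    update_realtime_view(realtime, fresh)

print(query("site_a", batch, realtime))
```

In this sketch the real-time view only has to cover events that arrived after the last batch run, so it can be discarded once the batch layer catches up.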

SLIDE 6

SLIDE 7

Batch Layer (i): Hadoop & MapReduce

Programming model proposed by Google.
Solves the complex issues (parallel computation, load balancing & fault tolerance).
Two primitive parallel methods (Map and Reduce).
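
The two primitives can be sketched in plain Python as a single-process word count. This is an illustration of the programming model only, not Hadoop code: the function names are hypothetical, and the parallelism, load balancing and fault tolerance that the framework provides are left out.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (key, value) pair for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share a key into one result."""
    return key, sum(values)


def map_reduce(records):
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(map_reduce(["job ok", "job failed", "Job ok"]))
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, a framework is free to run them on different machines.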

SLIDE 8

Batch Layer (ii): Stratosphere

Stratosphere extends the well-known MapReduce model with new operators.
All operators start working in memory.
Supports Java or Scala.
Scales horizontally.
Integrates seamlessly into existing Hadoop installations.
Built-in optimizer.

SLIDE 9

SLIDE 10

Serving Layer (i): Apache Drill

Inspired by Google's Dremel.
Drill provides a distributed execution engine for interactive queries.
Low-latency ad-hoc queries to many different data sources.
Goal is to scale to 10,000 servers and process petabytes of data within seconds.
Supports multiple data models:

 - Schema: Protocol Buffers & Apache Avro
 - Schema-less: JSON, BSON, etc.

SLIDE 11

Serving Layer (ii): Cloudera Impala

Massively Parallel Processing query engine.
Low-latency SQL queries.
Interactive analytics directly on data stored in Hadoop without data movement or predefined schemas.
Shares workload management, metadata, ODBC driver, SQL syntax and user interface with Apache Hive.
SQL-92 features of Hive Query Language including SELECT, joins, and aggregate functions.
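
The kind of query listed above (a SELECT with a join and aggregate functions) might look like the following. Python's built-in `sqlite3` module is used purely as a local stand-in so the example runs anywhere; it is not Impala, and the table and column names are hypothetical.

```python
import sqlite3

# In-memory database as a stand-in for tables that would live in Hadoop.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sites (site_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE transfers (site_id INTEGER, bytes INTEGER);
    INSERT INTO sites VALUES (1, 'site_a'), (2, 'site_b');
    INSERT INTO transfers VALUES (1, 100), (1, 250), (2, 40);
""")

# SELECT + JOIN + aggregate functions, all within SQL-92.
QUERY = """
    SELECT s.name, COUNT(*) AS n_transfers, SUM(t.bytes) AS total_bytes
    FROM transfers t
    JOIN sites s ON s.site_id = t.site_id
    GROUP BY s.name
    ORDER BY s.name
"""

rows = conn.execute(QUERY).fetchall()
for name, n_transfers, total_bytes in rows:
    print(name, n_transfers, total_bytes)
```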

SLIDE 12

Serving Layer (iii): Presto (Facebook)

Distributed SQL query engine optimized for ad-hoc analysis.
Supports complex queries, aggregations, joins, and window functions.
Read-only.

SLIDE 13

SLIDE 14

Speed Layer (i): Storm

Exposes a parallel real-time computation model.
Highly scalable.
Guarantees that every message will be processed.
Transactional topologies.
Stream processing.
Continuous computation.
Distributed RPC.
Stream groupings.
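
Stream groupings decide which task receives each tuple. The sketch below illustrates two of them in plain Python rather than through the Storm API: a fields grouping that sends tuples with the same field value to the same task, and a shuffle grouping that spreads tuples evenly. Round-robin is used here as a deterministic stand-in for Storm's random shuffle, and the tuple layout and function names are hypothetical.

```python
import zlib


def fields_grouping(tuples, field, n_tasks):
    """Route tuples with an equal field value to the same task."""
    tasks = [[] for _ in range(n_tasks)]
    for tup in tuples:
        key = str(tup[field]).encode("utf-8")
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        tasks[zlib.crc32(key) % n_tasks].append(tup)
    return tasks


def shuffle_grouping(tuples, n_tasks):
    """Spread tuples evenly over tasks (round-robin stand-in for random)."""
    tasks = [[] for _ in range(n_tasks)]
    for i, tup in enumerate(tuples):
        tasks[i % n_tasks].append(tup)
    return tasks


# Hypothetical monitoring events.
events = [
    {"site": "site_a", "status": "ok"},
    {"site": "site_b", "status": "failed"},
    {"site": "site_a", "status": "ok"},
    {"site": "site_c", "status": "ok"},
    {"site": "site_b", "status": "ok"},
]

by_site = fields_grouping(events, "site", 3)
balanced = shuffle_grouping(events, 3)
print([len(t) for t in by_site], [len(t) for t in balanced])
```

A fields grouping suits per-key aggregation such as counting failures per site, while a shuffle grouping suits stateless work where only an even load matters.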

SLIDE 15

Speed Layer (ii): Amazon Kinesis

Streaming data as a managed service (Cloud Service).

Based on a metering system (charged based on shards and HTTP PUT transactions).

Capacity of a stream is configured as shards (throughput capacity).

Kinesis Client Library – responsible for load balancing, coordination and error handling.
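
Since capacity is provisioned in shards, a rough sizing calculation can be sketched as follows. The slides give no figures: the default per-shard write limits below (1 MB per second and 1,000 records per second) are assumptions that should be checked against the current AWS documentation, and the function name and workload numbers are hypothetical.

```python
import math


def shards_needed(records_per_sec, avg_record_kb,
                  shard_write_kb_per_sec=1024.0,
                  shard_write_records_per_sec=1000.0):
    """Estimate the shard count for a write workload.

    The per-shard limits are assumed defaults, not values from the slides.
    """
    by_bytes = records_per_sec * avg_record_kb / shard_write_kb_per_sec
    by_count = records_per_sec / shard_write_records_per_sec
    # The tighter of the two limits decides; a stream needs at least one shard.
    return max(1, math.ceil(max(by_bytes, by_count)))


# Many small records: the record-count limit dominates.
print(shards_needed(records_per_sec=5000, avg_record_kb=0.5))
# Few large records: the byte-throughput limit dominates.
print(shards_needed(records_per_sec=100, avg_record_kb=50))
```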

SLIDE 16

Speed Layer (iii): Samza

Three layers: stream layer, execution layer and processing layer.
Samza is pluggable.
Streams are partitioned and ordered sequentially.
A stream is composed of immutable messages of a similar type (Kafka topics).
State is co-located with each task.
Checkpointing for failure recovery.
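
How an ordered partition, task-local state and checkpointing fit together can be sketched in a few lines of Python. This is a toy model, not the Samza API: the class and method names are hypothetical, and as a simplification the sketch snapshots the state together with the offset.

```python
class PartitionTask:
    """Consumes one ordered partition, keeping local state and a checkpoint."""

    def __init__(self, partition, checkpoint_every=2):
        self.partition = tuple(partition)  # immutable, ordered messages
        self.checkpoint_every = checkpoint_every
        self.offset = 0                    # next message to read
        self.state = 0                     # task-local state: running sum
        self.checkpoint = (0, 0)           # last persisted (offset, state)

    def process(self, max_messages):
        """Process up to max_messages, checkpointing periodically."""
        for _ in range(max_messages):
            if self.offset >= len(self.partition):
                break
            self.state += self.partition[self.offset]
            self.offset += 1
            if self.offset % self.checkpoint_every == 0:
                self.checkpoint = (self.offset, self.state)

    def recover(self):
        """After a failure, resume from the last checkpoint."""
        self.offset, self.state = self.checkpoint


task = PartitionTask([1, 2, 3, 4, 5])
task.process(3)      # a failure is assumed to happen after this call
task.recover()       # roll back to the last checkpoint
task.process(10)     # replay the remaining messages in order
print(task.state)
```

Because the partition is immutable and ordered, replaying from the checkpointed offset reproduces the same state that uninterrupted processing would have reached.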

SLIDE 17

Speed Layer (iv): S4

Distributed stream processing engine inspired by MapReduce.
Combination of MapReduce and the Actors model.
Provides a simple programming interface.
Decentralized and symmetric architecture (managed by ZooKeeper).
Pluggable architecture.
Lossy failover is acceptable – processes are moved to standby.
Several PEs (Processing Elements) are available for standard tasks such as count, aggregate, join, and so on…
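
The combination of keyed events and actor-like processing elements can be sketched as below, with one PE instance created per distinct key. This is plain Python rather than the S4 API, and the class names are hypothetical.

```python
class CountPE:
    """A counting processing element that owns the state for one key."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def on_event(self, event):
        self.count += 1


class Dispatcher:
    """Routes each keyed event to the PE instance for that key."""

    def __init__(self, pe_class):
        self.pe_class = pe_class
        self.instances = {}

    def dispatch(self, key, event):
        pe = self.instances.get(key)
        if pe is None:
            # A PE instance is created lazily the first time its key appears.
            pe = self.instances[key] = self.pe_class(key)
        pe.on_event(event)


dispatcher = Dispatcher(CountPE)
for key in ["site_a", "site_b", "site_a"]:
    dispatcher.dispatch(key, {"status": "ok"})

print({k: pe.count for k, pe in dispatcher.instances.items()})
```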

SLIDE 18

Spark, Shark, Spark Streaming, etc… (i)

In-memory distributed computing framework.
Provides a general programming model (operators such as Map, Reduce, Join, Filter, GroupBy, Sort, LeftOuterJoin, RightOuterJoin, Count, Union, Cross, etc.).

Low-latency computations by caching the working dataset in memory.

Fault tolerance by lineage or checkpointing.
Spark extends its engine for stream processing.
Provides the same Spark APIs for processing streams.
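
Caching and lineage-based fault tolerance can be illustrated with a toy dataset class. This is not the Spark API: `Dataset` and its methods are hypothetical names, and everything runs in a single process. Each dataset remembers its parent and the transformation that produced it, so a lost cache can be rebuilt by recomputation.

```python
class Dataset:
    """Toy dataset that records its lineage (parent plus transformation)."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source
        self._parent = parent
        self._fn = fn
        self._cache = None

    def map(self, f):
        return Dataset(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return Dataset(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        """Keep the computed rows in memory for low-latency reuse."""
        self._cache = self.collect()
        return self

    def drop_cache(self):
        """Simulate losing the in-memory copy, for example on node failure."""
        self._cache = None

    def collect(self):
        if self._cache is not None:
            return list(self._cache)
        if self._parent is None:
            return list(self._source)
        # No cached copy: recompute from the parent by replaying the lineage.
        return self._fn(self._parent.collect())


base = Dataset(source=[1, 2, 3, 4])
derived = base.map(lambda x: x * 10).filter(lambda x: x > 10).cache()
first = derived.collect()

derived.drop_cache()
recomputed = derived.collect()
print(first, recomputed)
```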
