
Apache Storm: Hands-on Session

A.A. 2018/19 Fabiana Rossi Laurea Magistrale in Ingegneria Informatica - II anno

Macroareadi Ingegneria Dipartimento di Ingegneria Civile e Ingegneria Informatica

The reference Big Data stack

Fabiana Rossi - SABD 2018/19

Resource Management Data Storage Data Processing High-level Interfaces Support / Integration


Apache Storm

  • Open-source, real-time, scalable stream processing system
  • Provides an abstraction layer to execute DSP applications
  • Initially developed by Twitter
  • Topology: DAG of spouts (sources of streams) and bolts (operators and data sinks)
  • Stream: sequence of key-value pairs


Stream grouping in Storm

  • Data parallelism in Storm: how are streams partitioned among multiple tasks (threads of execution)?
  • Shuffle grouping
  • Randomly partitions the tuples
  • Fields grouping
  • Hashes on a subset of the tuple attributes
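As a rough plain-Java sketch (independent of the Storm API; the task count and the words are invented for illustration), the difference between the two groupings is which consumer task index a tuple is routed to:

```java
import java.util.Random;

public class GroupingSketch {
    static final int NUM_TASKS = 4;

    // Shuffle grouping: pick a consumer task (pseudo-)randomly.
    static int shuffleGrouping(Random rnd) {
        return rnd.nextInt(NUM_TASKS);
    }

    // Fields grouping: hash the grouping attribute, so tuples with
    // the same key always reach the same task.
    static int fieldsGrouping(String key) {
        return Math.floorMod(key.hashCode(), NUM_TASKS);
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        System.out.println("shuffle: task " + shuffleGrouping(rnd));
        // Two tuples with the same key land on the same task.
        System.out.println("fields:  task " + fieldsGrouping("storm")
                + " and task " + fieldsGrouping("storm"));
    }
}
```

This is why the WordCount example later in the session groups by the word attribute: all occurrences of a word must be counted by the same task.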


Stream grouping in Storm

  • All grouping (i.e., broadcast)
  • Replicates the entire stream to all the consumer tasks
  • Global grouping
  • Sends the entire stream to a single task of the consumer bolt
  • Direct grouping
  • The producer of the tuple decides which task of the consumer receives it
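In the same plain-Java style (consumer task ids are invented; this is not Storm code), all and global grouping can be sketched as:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BroadcastSketch {
    // All grouping: every consumer task receives its own copy of the tuple.
    static List<Integer> allGrouping(List<Integer> consumerTaskIds) {
        return new ArrayList<>(consumerTaskIds); // one delivery per task
    }

    // Global grouping: the entire stream goes to one task
    // (Storm picks the consumer task with the lowest id).
    static int globalGrouping(List<Integer> consumerTaskIds) {
        return Collections.min(consumerTaskIds);
    }

    public static void main(String[] args) {
        List<Integer> tasks = List.of(7, 3, 5);
        System.out.println(allGrouping(tasks));    // every task gets a copy
        System.out.println(globalGrouping(tasks)); // 3
    }
}
```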


Storm architecture


  • Master-worker architecture


Storm components: Nimbus and Zookeeper

  • Nimbus

– The master node
– Clients submit topologies to it
– Responsible for distributing and coordinating the topology execution

  • Zookeeper

– Nimbus uses a combination of the local disk(s) and Zookeeper to store state about the topology


Storm components: worker

  • Task: operator instance
  – The actual work for a bolt or a spout is done in the task
  • Executor: smallest schedulable entity
  – Executes one or more tasks related to the same operator
  • Worker process: Java process running one or more executors
  • Worker node: computing resource, a container for one or more worker processes
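The relation between these entities is simple arithmetic: Storm distributes the tasks of an operator evenly among its executors. A tiny sketch (the numbers are illustrative, not from the slides):

```java
public class ParallelismSketch {
    public static void main(String[] args) {
        int numTasks = 8;     // instances of the operator (tasks)
        int numExecutors = 4; // parallelism hint (threads)

        // Each executor runs the same number of tasks of that operator.
        int tasksPerExecutor = numTasks / numExecutors;
        System.out.println("tasks per executor: " + tasksPerExecutor); // 2
    }
}
```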


Storm components: supervisor

  • Each worker node runs a supervisor

The supervisor:

  • receives assignments from Nimbus (through ZooKeeper) and spawns workers based on the assignment;
  • sends to Nimbus (through ZooKeeper) a periodic heartbeat;
  • advertises the topologies that it is currently running, and any vacancies that are available to run more topologies


Running a Topology in Storm

Storm allows two running modes: local and cluster.

  • Local mode: the topology is executed on a single node
  • local mode is usually used for testing purposes
  • we can check whether our application runs as expected
  • Cluster mode: the topology is distributed by Storm on multiple workers
  • cluster mode should be used to run our application on the real dataset
  • better exploits parallelism
  • the application code is transparently distributed
  • the topology is managed and monitored at run-time


Running a Topology in Storm

To run a topology in local mode, we just need to create an in-process cluster:

  • it is a simplification of a cluster
  • lightweight Storm functions wrap our code
  • it can be instantiated using the LocalCluster class.

For example:


...
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("myTopology", conf, topology);
Utils.sleep(10000); // wait [param] ms
cluster.killTopology("myTopology");
cluster.shutdown();
...


Running a Topology in Storm

To run a topology in cluster mode, we need to perform the following steps:

1. Configure the application for the submission, using the StormSubmitter class. For example:


...
Config conf = new Config();
conf.setNumWorkers(NUM_WORKERS);
StormSubmitter.submitTopology("mytopology", conf, topology);
...


NUM_WORKERS

  • number of worker processes to be used for running the topology

Running a Topology in Storm

2. Create a jar containing your code and all the dependencies of your code

  • do not include the Storm library
  • this can be easily done using Maven: use the Maven Assembly Plugin and configure your pom.xml:


<plugin>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
    <archive>
      <manifest>
        <mainClass>com.path.to.main.Class</mainClass>
      </manifest>
    </archive>
  </configuration>
</plugin>

Running a Topology in Storm

3. Submit the topology to the cluster using the storm client, as follows


$ $STORM_HOME/bin/storm jar path/to/allmycode.jar full.classname.Topology arg1 arg2 arg3


Running a Topology in Storm

[Figure: application code and control messages exchanged among the Storm components]

A container-based Storm cluster


Running a Topology in Storm

We are going to create a (local) Storm cluster using Docker.

We need to run several containers, each of which will manage a service of our system:

  • Zookeeper
  • Nimbus
  • Worker1, Worker2, Worker3
  • Storm Client (storm-cli): we use storm-cli to run topologies or scripts that feed our DSP application

Auxiliary services that will be useful to interact with our Storm topologies:

  • Redis
  • RabbitMQ: a message queue service


Docker Compose

To easily coordinate the execution of these multiple services, we use Docker Compose

  • Read more at https://docs.docker.com/compose/

Docker Compose:

  • is not bundled within the installation of Docker
  • it can be installed following the official Docker documentation: https://docs.docker.com/compose/install/
  • allows to easily express the containers to be instantiated at once, and the relations among them
  • by itself, Docker Compose runs the composition on a single machine; however, in combination with Docker Swarm, containers can be deployed on multiple nodes
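For a cluster like ours, a compose file might be sketched as follows (the image names and the `storm nimbus` / `storm supervisor` commands follow the official zookeeper and storm images on Docker Hub; the actual file used in the session may differ):

```yaml
version: "3"
services:
  zookeeper:
    image: zookeeper

  nimbus:
    image: storm
    command: storm nimbus
    depends_on:
      - zookeeper

  worker1:
    image: storm
    command: storm supervisor
    depends_on:
      - nimbus
```

The nimbus and supervisor containers locate ZooKeeper through the `zookeeper` service name, which Docker Compose resolves on the shared network it creates.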


Docker Compose

  • We specify how to compose containers in an easy-to-read file, by default named docker-compose.yml
  • To start the docker composition (in background with -d):

$ docker-compose up -d

  • To stop the docker composition:

$ docker-compose down

  • By default, docker-compose looks for the docker-compose.yml file in the current working directory; we can change the file with the configuration using the -f flag


Docker Compose

  • There are different versions of the Docker Compose file format
  • We will use version 3, supported since Docker Compose 1.13


On the docker compose file format: https://docs.docker.com/compose/compose-file/



Example: Exclamation

  • Problem: Suppose we have a random source of words. Create a DSP application that adds two exclamation points to each word.

  • Solution (1):


A simple topology: ExclamationTopology


...
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("word", new RandomNamesSpout(), 1);
builder.setBolt("exclaim1", new ExclamationBolt(), 1)
       .shuffleGrouping("word");
builder.setBolt("exclaim2", new ExclamationBolt(), 1)
       .shuffleGrouping("exclaim1");

Config conf = new Config();
conf.setNumWorkers(3);
StormSubmitter.submitTopologyWithProgressBar(
    "ExclamationTopology", conf, builder.createTopology()
);
...


Example: Exclamation

  • Solution (2):



Example: WordCount

  • Problem: Suppose we have a random source of sentences. Create a DSP application that counts the number of occurrences of each word.

  • Solution:


WordCount


...
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentenceBolt(), 8)
       .shuffleGrouping("spout");
builder.setBolt("count", new WordCountBolt(), 12)
       .fieldsGrouping("split", new Fields("word"));

Config conf = new Config();
...
StormSubmitter.submitTopologyWithProgressBar(
    "WordCount", conf, builder.createTopology()
);
...



Example: Rolling Count

  • Problem: Suppose we have a random source of words. Create a DSP application that determines the top-N rank of words within a sliding window of 9 secs and a sliding interval of 3 secs.

  • Solution:


Rolling Count


...
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(spoutId, new RandomNamesSpout(), 5);
builder.setBolt(counterId, new RollingCountBolt(), 4)
       .fieldsGrouping(spoutId, new Fields("word"));
builder.setBolt(intermediateRankerId, new IntermediateRankingBolt(TOP_N), 4)
       .fieldsGrouping(counterId, new Fields("obj"));
builder.setBolt(totalRankerId, new TotalRankingsBolt(TOP_N), 1)
       .globalGrouping(intermediateRankerId);
StormSubmitter.submitTopologyWithProgressBar(...);
...


Rolling Count on a Window (1)

  • Storm 1.0 has explicitly introduced the concept of Window.
  • We revise a simplified version of the previous application (Rolling Count), relying on the window primitives provided by Storm.
  • The idea is to identify the top-5 most popular elements (words) in a sliding window.


Rolling Count on a Window (2)


  • We create a data stream processing application which comprises the following operators:
  • a data source, which emits sentences
  • a splitter
  • a rolling count operator with a sliding window; the length of the sliding window is 9 secs and it slides every 3 secs
  • To better visualize the results, we include an auxiliary operator that exports results on a message queue, implemented with RabbitMQ.


Rolling Count on a Window (3)


...
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentenceBolt(), 8)
       .shuffleGrouping("spout");
builder.setBolt("count", new WordCountWindowBasedBolt()
           .withWindow(
               BaseWindowedBolt.Duration.seconds(9), // length
               BaseWindowedBolt.Duration.seconds(3)  // sliding interval
           ), 12)
       .fieldsGrouping("split", new Fields("word"));
StormSubmitter.submitTopologyWithProgressBar(...);
...

Rolling Count on a Window (4)


public class WordCountWindowBasedBolt extends BaseWindowedBolt {
    ...
    public void execute(TupleWindow tuples) {
        List<Tuple> incoming = tuples.getNew();
        for (Tuple tuple : incoming) {
            ...
        }
        List<Tuple> expired = tuples.getExpired();
        for (Tuple tuple : expired) {
            ...
        }
    }
    ...
}

Implementation of the windowed operator
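Inside execute(), the typical bookkeeping is: increment the count for tuples that entered the window, decrement for tuples that expired. A plain-Java sketch of that logic (using strings in place of Storm Tuple objects; not the session's actual bolt):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RollingCountSketch {
    // TreeMap only to get a deterministic print order.
    final Map<String, Integer> counts = new TreeMap<>();

    // Called once per window activation with the delta since the
    // previous activation, mirroring getNew() / getExpired().
    void onWindow(List<String> incoming, List<String> expired) {
        for (String w : incoming)
            counts.merge(w, 1, Integer::sum);
        for (String w : expired) {
            counts.merge(w, -1, Integer::sum);
            if (counts.get(w) <= 0) counts.remove(w); // drop dead keys
        }
    }

    public static void main(String[] args) {
        RollingCountSketch rc = new RollingCountSketch();
        rc.onWindow(List.of("a", "b", "a"), List.of());
        rc.onWindow(List.of("b"), List.of("a"));
        System.out.println(rc.counts); // {a=1, b=2}
    }
}
```

Processing only the delta (new and expired tuples) is what makes the windowed bolt cheap: the counts are updated incrementally instead of rescanning the whole window on every slide.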


DEBS Grand Challenge 2015 (1)


  • Analysis of taxi trips based on data streams originating from New York City taxis
  • Input data streams include starting point, drop-off point, timestamps, and information related to the payment
  • Query 1: identify the top 10 most frequent routes during the last 30 minutes (sliding window)
  • Use geo-spatial grids to define the events of interest
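A grid cell id can be computed by discretizing latitude and longitude. The sketch below is illustrative: the origin and cell size are hypothetical values, not the parameters of the DEBS reference solution or of the session's ComputeCellID bolt:

```java
public class CellIdSketch {
    // Hypothetical grid origin (north-west corner) and cell size.
    static final double ORIGIN_LAT = 41.0;
    static final double ORIGIN_LON = -74.25;
    static final double CELL_SIZE_DEG = 0.005;

    // Map a coordinate to a "row.col" cell id on the grid.
    static String cellId(double lat, double lon) {
        int row = (int) Math.floor((ORIGIN_LAT - lat) / CELL_SIZE_DEG);
        int col = (int) Math.floor((lon - ORIGIN_LON) / CELL_SIZE_DEG);
        return row + "." + col;
    }

    public static void main(String[] args) {
        // A "route" is the pair (pickup cell, drop-off cell), so nearby
        // pickup/drop-off points collapse onto the same route key.
        String route = cellId(40.75, -73.99) + "->" + cellId(40.76, -73.98);
        System.out.println(route);
    }
}
```

Collapsing coordinates onto cells is what makes "top 10 most frequent routes" well defined: exact GPS points almost never repeat, but cell pairs do.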

DEBS Grand Challenge 2015 (2)


TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("datasource", new RedisSpout(redisUrl, redisPort));
builder.setBolt("parser", new ParseLine())
       .setNumTasks(numTasks)
       .shuffleGrouping("datasource");
builder.setBolt("filterByCoordinates", new FilterByCoordinates())
       .setNumTasks(numTasks)
       .shuffleGrouping("parser");
builder.setBolt("metronome", new Metronome())
       .setNumTasks(numTasksMetronome)
       .shuffleGrouping("filterByCoordinates");
builder.setBolt("computeCellID", new ComputeCellID())
       .setNumTasks(numTasks)
       .shuffleGrouping("filterByCoordinates");


DEBS Grand Challenge 2015 (3)


builder.setBolt("countByWindow", new CountByWindow())
       .setNumTasks(numTasks)
       .fieldsGrouping("computeCellID", new Fields(ComputeCellID.F_ROUTE))
       .allGrouping("metronome", Metronome.S_METRONOME);
builder.setBolt("partialRank", new PartialRank(10))
       .setNumTasks(numTasks)
       .fieldsGrouping("countByWindow", new Fields(ComputeCellID.F_ROUTE));
builder.setBolt("globalRank", new GlobalRank(...), 1)
       .setNumTasks(numTasksGlobalRank)
       .shuffleGrouping("partialRank");
StormTopology stormTopology = builder.createTopology();