Apache Storm: Hands-on Session
A.A. 2019/20
Fabiana Rossi
Macroarea di Ingegneria
Dipartimento di Ingegneria Civile e Ingegneria Informatica
Laurea Magistrale in Ingegneria Informatica - II anno
The reference Big Data stack
Fabiana Rossi - SABD 2019/20
[Figure: the reference Big Data stack - Resource Management, Data Storage, Data Processing, High-level Interfaces, Support / Integration]
Apache Storm
- Apache Storm
- Open-source, real-time, scalable streaming system
- Provides an abstraction layer to execute DSP applications
- Initially developed by Twitter
- Topology
- DAG of spouts (sources of streams) and bolts (operators and data sinks)
- stream: sequence of key-value pairs
[Figure: an example topology of spouts and bolts]
Stream grouping in Storm
- Data parallelism in Storm: how are streams
partitioned among multiple tasks (threads of execution)?
- Shuffle grouping
- Randomly partitions the tuples
- Field grouping
- Hashes on a subset of the tuple attributes
Stream grouping in Storm
- All grouping (i.e., broadcast)
- Replicates the entire stream to all the consumer
tasks
- Global grouping
- Sends the entire stream to a single bolt
- Direct grouping
- The producer of the tuple decides which task of the consumer bolt will receive it
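The key idea behind fields grouping can be sketched in plain Java: the grouping attribute is hashed and the result taken modulo the number of consumer tasks, so tuples with the same key always reach the same task. This is only an illustration of the idea, not Storm's actual implementation; the class and method names are invented for the example.

```java
public class FieldsGroupingSketch {
    // Illustrative only: route a tuple to a consumer task by hashing
    // the value of the grouping field, as a fields grouping does.
    static int targetTask(Object fieldValue, int numTasks) {
        return Math.abs(fieldValue.hashCode() % numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        // Tuples carrying the same key are routed to the same task
        System.out.println(targetTask("storm", numTasks)
                == targetTask("storm", numTasks)); // true
    }
}
```

Shuffle grouping, by contrast, would pick the target task at random, balancing load but offering no key locality.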
Storm architecture
- Master-worker architecture
Storm components: Nimbus and Zookeeper
- Nimbus
– The master node
– Clients submit topologies to it
– Responsible for distributing and coordinating the topology execution
- Zookeeper
– Nimbus uses a combination of the local disk(s) and Zookeeper to store state about the topology
Storm components: worker
- Task: operator instance
– The actual work for a bolt or a spout is done in the task
- Executor: smallest schedulable entity
– Executes one or more tasks related to the same operator
- Worker process: Java process running one or more executors
- Worker node: computing resource, a container for one or more worker processes
Storm components: supervisor
- Each worker node runs a supervisor
The supervisor:
- receives assignments from Nimbus (through ZooKeeper) and spawns workers based on the assignment;
- sends a periodic heartbeat to Nimbus (through ZooKeeper);
- advertises the topologies that it is currently running, and any vacancies available to run more topologies
Running a Topology in Storm
Storm allows two running modes: local and cluster
- Local mode: the topology is executed on a single node
- the local mode is usually used for testing purposes
- we can check whether our application runs as expected
- Cluster mode: the topology is distributed by Storm on multiple workers
- The cluster mode should be used to run our application on the real dataset
- Better exploits parallelism
- The application code is transparently distributed
- The topology is managed and monitored at run-time
Running a Topology in Storm
To run a topology in local mode, we just need to create an in-process cluster
- it is a simplification of a cluster
- lightweight Storm functions wrap our code
- It can be instantiated using the LocalCluster class.
For example:
...
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("myTopology", conf, topology);
Utils.sleep(10000); // wait [param] ms
cluster.killTopology("myTopology");
cluster.shutdown();
...
Running a Topology in Storm
To run a topology in cluster mode, we need to perform the following steps:
1. Configure the application for the submission, using the StormSubmitter class. For example:
...
Config conf = new Config();
conf.setNumWorkers(NUM_WORKERS);
StormSubmitter.submitTopology("mytopology", conf, topology);
...
NUM_WORKERS: number of worker processes to be used for running the topology
Running a Topology in Storm
2. Create a jar containing your code and all the dependencies of your code
- do not include the Storm library
- this can be easily done using Maven: use the Maven Assembly Plugin and
configure your pom.xml:
<plugin>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
    <archive>
      <manifest>
        <mainClass>com.path.to.main.Class</mainClass>
      </manifest>
    </archive>
  </configuration>
</plugin>
Running a Topology in Storm
3. Submit the topology to the cluster using the storm client, as follows
$ $STORM_HOME/bin/storm jar path/to/allmycode.jar full.classname.Topology arg1 arg2 arg3
Running a Topology in Storm
[Figure: application code and control messages]
A container-based Storm cluster
Running a Topology in Storm
We are going to create a (local) Storm cluster using Docker
We need to run several containers, each of which will manage a service of our system:
- Zookeeper
- Nimbus
- Worker1, Worker2, Worker3
- Storm Client (storm-cli): we use storm-cli to run topologies or scripts that feed our DSP application
Auxiliary services that will be useful to interact with our Storm topologies:
- Redis
- RabbitMQ: a message queue service
Docker Compose
To easily coordinate the execution of these multiple services, we use Docker Compose
- Read more at https://docs.docker.com/compose/
Docker Compose:
- is not bundled within the installation of Docker
- it can be installed following the official Docker documentation
- https://docs.docker.com/compose/install/
- Allows us to easily express the containers to be instantiated at once, and the relations among them
- By itself, docker compose runs the composition on a single
machine; however, in combination with Docker Swarm, containers can be deployed on multiple nodes
Docker Compose
- We specify how to compose containers in an easy-to-read file, by default named docker-compose.yml
- To start the docker composition (in background with -d):
  $ docker-compose up -d
- To stop the docker composition:
  $ docker-compose down
- By default, docker-compose looks for the docker-compose.yml file in the current working directory; we can specify a different configuration file using the -f flag
Docker Compose
- There are different versions of the docker compose file format
- We will use version 3, supported since Docker Compose 1.13
On the docker compose file format: https://docs.docker.com/compose/compose-file/
Example: Exclamation
- Problem: Suppose we have a random source of words. Create a DSP application that adds two exclamation points to each word.
Example: Exclamation
- Problem: Suppose we have a random source of words. Create a DSP application that adds two exclamation points to each word.
- Solution (1):
A simple topology: ExclamationTopology
...
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("word", new RandomNamesSpout(), 1);
builder.setBolt("exclaim1", new ExclamationBolt(), 1)
       .shuffleGrouping("word");
builder.setBolt("exclaim2", new ExclamationBolt(), 1)
       .shuffleGrouping("exclaim1");

Config conf = new Config();
conf.setNumWorkers(3);
StormSubmitter.submitTopologyWithProgressBar(
    "ExclamationTopology", conf, builder.createTopology());
...
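The per-tuple transformation that each ExclamationBolt applies can be sketched without the Storm API: read the word carried by the tuple and emit it with two exclamation points appended (in the real bolt this happens inside execute(), via collector.emit()). The class and method names below are illustrative.

```java
public class ExclamationSketch {
    // The per-tuple logic an ExclamationBolt would run in execute():
    // append two exclamation points to the incoming word.
    static String addExclamations(String word) {
        return word + "!!";
    }

    public static void main(String[] args) {
        // Two chained bolts, as in the topology above:
        System.out.println(addExclamations(addExclamations("nathan"))); // nathan!!!!
    }
}
```

Since the topology chains two ExclamationBolt instances, every word leaves the pipeline with four exclamation points.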
Example: Exclamation
- Problem: Suppose we have a random source of words. Create a DSP application that adds two exclamation points to each word.
- Solution (2):
Example: WordCount
- Problem: Suppose we have a random source of sentences. Create a DSP application that counts the number of occurrences of each word.
Example: WordCount
- Problem: Suppose we have a random source of sentences. Create a DSP application that counts the number of occurrences of each word.
- Solution:
WordCount
...
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentenceBolt(), 8)
       .shuffleGrouping("spout");
builder.setBolt("count", new WordCountBolt(), 12)
       .fieldsGrouping("split", new Fields("word"));

Config conf = new Config();
...
StormSubmitter.submitTopologyWithProgressBar(
    "WordCount", conf, builder.createTopology());
...
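The counting step can be sketched as plain Java: because of the fields grouping on "word", all tuples carrying the same word reach the same task, so each task can safely keep a local map of counts. This is an illustration of the logic a WordCountBolt would run per tuple, with invented names and without the Storm API.

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    // What a word-count bolt would do per tuple: increment the counter
    // of the received word and emit the updated (word, count) pair.
    int countWord(String word) {
        int updated = counts.getOrDefault(word, 0) + 1;
        counts.put(word, updated);
        return updated;
    }

    public static void main(String[] args) {
        WordCountSketch bolt = new WordCountSketch();
        bolt.countWord("storm");
        System.out.println(bolt.countWord("storm")); // 2
    }
}
```

With a shuffle grouping instead, occurrences of the same word would be scattered across the 12 tasks and the local counts would be wrong; the fields grouping is what makes the stateful count correct.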
Example: Rolling Count
- Problem: Suppose we have a random source of words. Create a DSP application that determines the top-N rank of words within a sliding window of 9 secs and a sliding interval of 3 secs.
Example: Rolling Count
- Problem: Suppose we have a random source of words. Create a DSP application that determines the top-N rank of words within a sliding window of 9 secs and a sliding interval of 3 secs.
- Solution:
Rolling Count
...
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(spoutId, new RandomNamesSpout(), 5);
builder.setBolt(counterId, new RollingCountBolt(), 4)
       .fieldsGrouping(spoutId, new Fields("word"));
builder.setBolt(intermediateRankerId,
        new IntermediateRankingBolt(TOP_N), 4)
       .fieldsGrouping(counterId, new Fields("obj"));
builder.setBolt(totalRankerId, new TotalRankingsBolt(TOP_N), 1)
       .globalGrouping(intermediateRankerId);
StormSubmitter.submitTopologyWithProgressBar(...);
...
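A rolling counter like the one kept by RollingCountBolt is commonly implemented as a circular buffer of per-slot counts: with a 9 s window and a 3 s sliding interval there are 3 slots; at every interval the window total is read and the slot that just expired is wiped. The sketch below shows that bookkeeping in plain Java; the class and method names are illustrative, not Storm's.

```java
import java.util.HashMap;
import java.util.Map;

public class SlidingWindowCounterSketch {
    private final Map<String, long[]> slots = new HashMap<>();
    private final int numSlots; // e.g. 9 s window / 3 s interval = 3 slots
    private int head = 0;       // slot receiving the current increments

    SlidingWindowCounterSketch(int numSlots) { this.numSlots = numSlots; }

    // Called for every incoming tuple.
    void incrementCount(String obj) {
        slots.computeIfAbsent(obj, k -> new long[numSlots])[head]++;
    }

    // Called at every sliding interval: return totals over the whole
    // window, then advance the head and wipe the slot that expired.
    Map<String, Long> getCountsThenAdvanceWindow() {
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<String, long[]> e : slots.entrySet()) {
            long sum = 0;
            for (long c : e.getValue()) sum += c;
            totals.put(e.getKey(), sum);
        }
        head = (head + 1) % numSlots;
        for (long[] counts : slots.values()) counts[head] = 0;
        return totals;
    }
}
```

The emitted totals then feed the per-key IntermediateRankingBolt instances, whose partial top-N lists are merged by the single TotalRankingsBolt via the global grouping.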
Word Count on a Window (1)
- Storm 1.0 has explicitly introduced the concept of
Window.
- We revise a simplified version of the previous
Word Count application relying on the window primitives by Storm.
- The idea is to compute the word count in a sliding
window.
Word Count on a Window (2)
- We create a data stream processing application
which comprises the following operators:
- a data source, which emits sentences
- a splitter
- a word count operator with a sliding window; the length of the sliding window is 9 secs and it slides every 3 secs
- To better visualize the results, we include an auxiliary operator that exports results on a message queue, implemented with RabbitMQ.
Word Count on a Window (3)
...
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentenceBolt(), 8)
       .shuffleGrouping("spout");
builder.setBolt("count", new WordCountWindowBasedBolt()
        .withWindow(
            BaseWindowedBolt.Duration.seconds(9), // length
            BaseWindowedBolt.Duration.seconds(3)  // sliding interval
        ), 12)
       .fieldsGrouping("split", new Fields("word"));
StormSubmitter.submitTopologyWithProgressBar(...);
...
Word Count on a Window (4)
public class WordCountWindowBasedBolt extends BaseWindowedBolt {
    ...
    public void execute(TupleWindow tuples) {
        List<Tuple> incoming = tuples.getNew();
        for (Tuple tuple : incoming) {
            ...
        }
        List<Tuple> expired = tuples.getExpired();
        for (Tuple tuple : expired) {
            ...
        }
    }
    ...
}
Implementation of the windowed operator
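With getNew() and getExpired(), the word count can be maintained incrementally instead of rescanning the whole window at every slide: increment a counter for each tuple that entered the window and decrement it for each tuple that left. A sketch of that bookkeeping, with plain strings standing in for Storm tuples and invented names:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WindowedCountSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    // Incremental update, mirroring execute(TupleWindow):
    // add the new tuples, subtract the expired ones.
    void onWindow(List<String> incoming, List<String> expired) {
        for (String word : incoming)
            counts.merge(word, 1, Integer::sum);
        for (String word : expired) {
            int c = counts.merge(word, -1, Integer::sum);
            if (c <= 0) counts.remove(word); // drop words no longer in window
        }
    }

    int count(String word) { return counts.getOrDefault(word, 0); }

    public static void main(String[] args) {
        WindowedCountSketch w = new WindowedCountSketch();
        w.onWindow(List.of("a", "a", "b"), List.of());
        w.onWindow(List.of("b"), List.of("a")); // one "a" expired
        System.out.println(w.count("a")); // 1
        System.out.println(w.count("b")); // 2
    }
}
```

The incremental variant does O(new + expired) work per slide instead of O(window size), which matters when the window is much longer than the sliding interval.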
DEBS Grand Challenge 2015 (1)
- Analysis of taxi trips based on data streams originating
from New York City taxis
- Input data streams: include starting point, drop-off point,
timestamps, and information related to the payment
- Query 1: identify the top 10 most frequent routes during
the last 30 minutes (sliding window)
- Use geo-spatial grids to define the events of interest
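The grid maps a (latitude, longitude) pair to a discrete cell ID, so a trip's route becomes a pair of cells (pickup cell, drop-off cell) that can be counted per window. The sketch below shows one way such a ComputeCellID-style mapping can work; the origin and cell-size constants are illustrative stand-ins, not the challenge's exact values.

```java
public class CellGridSketch {
    // Illustrative constants: grid origin (north-west corner) and
    // cell size in latitude degrees (roughly 500 m). The actual
    // DEBS 2015 grid defines its own origin and cell sizes.
    static final double ORIGIN_LAT = 41.474937;
    static final double ORIGIN_LON = -74.913585;
    static final double CELL_SIZE = 0.004491556;

    // Map coordinates to a "row.column" cell identifier.
    static String cellId(double lat, double lon) {
        int row = (int) ((ORIGIN_LAT - lat) / CELL_SIZE);
        int col = (int) ((lon - ORIGIN_LON) / CELL_SIZE);
        return row + "." + col;
    }

    public static void main(String[] args) {
        // Nearby points fall into the same cell, so small GPS jitter
        // does not change the route being counted.
        System.out.println(cellId(40.75, -73.99));
    }
}
```

FilterByCoordinates discards events outside the grid before this mapping, and the fields grouping on the route field then sends all occurrences of the same route to the same CountByWindow task.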
DEBS Grand Challenge 2015 (2)
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("datasource", new RedisSpout(redisUrl, redisPort));
builder.setBolt("parser", new ParseLine())
       .setNumTasks(numTasks)
       .shuffleGrouping("datasource");
builder.setBolt("filterByCoordinates", new FilterByCoordinates())
       .setNumTasks(numTasks)
       .shuffleGrouping("parser");
builder.setBolt("metronome", new Metronome())
       .setNumTasks(numTasksMetronome)
       .shuffleGrouping("filterByCoordinates");
builder.setBolt("computeCellID", new ComputeCellID())
       .setNumTasks(numTasks)
       .shuffleGrouping("filterByCoordinates");
DEBS Grand Challenge 2015 (3)
builder.setBolt("countByWindow", new CountByWindow())
       .setNumTasks(numTasks)
       .fieldsGrouping("computeCellID", new Fields(ComputeCellID.F_ROUTE))
       .allGrouping("metronome", Metronome.S_METRONOME);
builder.setBolt("partialRank", new PartialRank(10))
       .setNumTasks(numTasks)
       .fieldsGrouping("countByWindow", new Fields(ComputeCellID.F_ROUTE));
builder.setBolt("globalRank", new GlobalRank(...), 1)
       .setNumTasks(numTasksGlobalRank)
       .shuffleGrouping("partialRank");
StormTopology stormTopology = builder.createTopology();