Data Collection With Apache Flume
Md. Sadil Khan, Rohit Aich Bhowmick
Outline
- Data Collection
- Current Problem
- Introduction to Apache Flume
- Flume Features
- The Flume Architecture
- Data Flow in Flume
- Flume Goals
- Reliability
- Scalability
- Extensibility
- Manageability
- Twitter Data Collection
Data Collection
⚫ Data collection plays the most important role in the Big Data cycle. The Internet provides almost unlimited sources of data on a wide variety of topics. The importance of this area depends on the type of business, but even traditional industries can acquire diverse sources of external data and combine them with their transactional data.
⚫ For example, suppose we want to build a system that recommends restaurants. The first step would be to gather data, in this case reviews of restaurants from different websites, and store it in a database. Since we are interested in the raw text and would use it for analytics, it is not particularly relevant where the data for developing the model is stored. This may sound at odds with the main big data technologies, but to implement a big data application we simply need it to work in real time.
⚫ Big data describes voluminous amounts of structured, semi-structured and unstructured data collected
by organizations. But because it takes a lot of time and money to load big data into a traditional relational database for analysis, new approaches for collecting and analyzing data have emerged. To gather and then mine big data for information, raw data with extended metadata is aggregated in a data lake. From there, machine learning and artificial intelligence programs use complex algorithms to look for repeatable patterns.
Current Problem
- Situation: Due to the ease of access in the digital world, we generate more and more data every day from our smartphones, laptops, tablets, etc.
- Problem: Companies want to analyse this huge amount of data and gather insights from it for business goals. We need a reliable, scalable, extensible and manageable way to gather the data so that we can process it efficiently!
Introduction to Apache Flume
⚫ Apache Flume is a highly reliable, distributed and configurable tool for aggregating and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store (such as HDFS or HBase). It was developed by Cloudera.
Flume Features
⚫ Flume efficiently collects, aggregates and moves large amounts of log data from many different sources to a centralized store (HDFS, HBase).
⚫ It has a simple and flexible architecture based on streaming data flows, and it moves data from multiple machines within an enterprise into Hadoop.
⚫ Flume isn't restricted to log data aggregation; it can transport massive quantities of event data, including but not limited to network traffic data, social-media-generated data, email messages and almost any other data source.
⚫ It has built-in support for several source and destination platforms to integrate with.
The Flume Architecture
⚫ Data generators (such as Facebook, Twitter) generate data which gets
collected by individual Flume agents running on them.
⚫ Thereafter, a data collector (which is also an agent) collects the data from the agents, aggregates it and pushes it into a centralized store such as HDFS or HBase.
The Flume Architecture
- Event: An event is a single log entry or the basic unit of data that we transport. An event is composed of zero or more headers and a body.
- Logfile: In computing, a log file is a file that records either events that occur in an operating system or other software, or messages between different users of communication software. Logging is the act of keeping a log. In the simplest case, messages are written to a single log file.
- Agent: An agent is a JVM process which receives data (events) from clients or other agents and forwards it to its next destination (a sink or another agent). A Flume deployment may have more than one agent. An agent has three components - Flume Source, Flume Channel and Flume Sink (a minimal configuration skeleton follows this list).
- Processor: Intermediate processing (aggregation). This step is optional.
- Collector: Writes data to permanent storage.
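As a minimal sketch of how these pieces are declared (the agent and component names agent1, src1, ch1 and sink1 are illustrative, not taken from the slides), a Flume agent is defined in a properties file by naming its components and binding the source and the sink to a channel:
# Name the components of this agent (names are arbitrary)
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1
# A source writes into one or more channels; a sink drains exactly one channel
agent1.sources.src1.channels = ch1
agent1.sinks.sink1.channel = ch1
The source, channel and sink types themselves are set with further agent1.* properties, as the Twitter example later in the deck shows.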
Components of Agent - Source
➢ Source
➢ A Flume Source is configured within an agent. It listens for events from an external source (e.g. a web server), reads the data, translates it into events, and handles failure situations.
➢ The source does not know how to store the event. So, after receiving enough data to produce a Flume event, it sends the event to the channel(s) to which the source is connected.
➢ The external source sends events to Flume in a format that is recognised by the target Flume source.
➢ Examples: Spooling Directory Source, Exec Source, Netcat Source, HTTP Source, Twitter Source, Avro Source (a configuration sketch follows this list).
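As a hedged example of one of these sources (the port number and component names are assumptions), a Netcat source listens on a TCP port and turns every received line of text into a Flume event:
# Netcat source: each text line received on the port becomes one event
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1
Lines sent to the agent, for example with nc localhost 44444, would then be staged in the configured channel.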
Components of Agent - Channel
Channel
⚫ Channels are communication bridges between sources and sinks within an agent. Once a Flume source receives an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it is consumed by a Flume sink.
⚫ The memory channel stores events in an in-memory queue, from which they are consumed by the sink. If the agent process dies in the middle because of a software or hardware failure, all the events currently in the memory channel are lost forever.
⚫ The file channel is another example - it is backed by the local file system. Unlike the memory channel, the file channel writes its contents to files on the file system that are deleted only after successful delivery to the sink. Note: the memory channel is the fastest but carries the risk of data loss; the file channel is typically much slower but effectively provides guaranteed delivery to the sink (see the configuration sketch below).
Components of Agent - Sink
⚫ The sink removes the event from the channel and puts it into an external repository like HDFS, or forwards it to the Flume source of the next Flume agent in the flow. The source and the sink within a given agent run asynchronously with the events staged in the channel (a forwarding sketch follows below).
⚫ Examples: 1) HDFS Sink 2) Logger Sink 3) File Roll Sink 4) HBase Sink 5) MorphlineSolr Sink 6) Avro Sink
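A hedged sketch of the forwarding case (host name and port are assumptions): an Avro sink on one agent sends events to an Avro source on the next-hop agent, each part defined in its own agent's configuration:
# On the sending agent: drain ch1 and forward the events over Avro RPC
agent1.sinks.sink1.type = avro
agent1.sinks.sink1.hostname = collector.example.com
agent1.sinks.sink1.port = 4141
agent1.sinks.sink1.channel = ch1
# On the collector agent: receive the forwarded events
agent2.sources.src1.type = avro
agent2.sources.src1.bind = 0.0.0.0
agent2.sources.src1.port = 4141
agent2.sources.src1.channels = ch1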
Data Flow in Flume
⚫ Multi-hop Flow: Within Flume, an event may travel through more than one agent before reaching its final destination.
⚫ Fan-out Flow: In simple terms, when data flows from one source to multiple channels, we call it fan-out flow. In Flume, fan-out flow comes in two forms (a selector configuration sketch follows this list):
⚫ 1. Replicating: the data flow where the data is replicated in all the configured channels.
⚫ 2. Multiplexing: the data flow where the data is sent only to a selected channel, chosen according to a value in the header of the event.
⚫ Fan-in Flow: The data will be transferred from many sources to one channel.
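A hedged sketch of these flows in configuration form (channel names and the dataType header are assumptions):
# Fan-out: one source feeding two channels
agent1.sources.src1.channels = ch1 ch2
# Replicating (the default): every event is copied to both channels
agent1.sources.src1.selector.type = replicating
# Multiplexing alternative: route by the value of an event header
# agent1.sources.src1.selector.type = multiplexing
# agent1.sources.src1.selector.header = dataType
# agent1.sources.src1.selector.mapping.logs = ch1
# agent1.sources.src1.selector.mapping.metrics = ch2
# agent1.sources.src1.selector.default = ch1
# Fan-in: several sources simply name the same channel
# agent1.sources.src2.channels = ch1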
Flume: Failure Handling
⚫ For each event, two transactions take place: one at the sender and one at the receiver.
⚫ The sender sends events to the receiver.
⚫ Soon after receiving the data, the receiver commits its own transaction and sends a "received" signal to the sender.
⚫ The sender commits its transaction only after receiving that signal.
Flume Goals: Reliability
⚫ By default, Flume uses a transactional approach to data flow.
- Best Effort
- Store on Failure and Retry
- End to End Reliability
Flume Goals: Scalability
- Horizontally Scalable Data Path
- Load Balancing
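In current Flume, load balancing across the data path is usually expressed with a sink group; a hedged sketch (sink and group names are assumptions):
# Spread events from one channel across two sinks
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = k1 k2
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.selector = round_robin
agent1.sinkgroups.g1.processor.backoff = true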
Flume Goals: Scalability
- Horizontally Scalable Control Path
Flume Goals: Extensibility
- Simple Source and Sink API
- Event streaming and composition of simple operations
- Plug in Architecture
- Add your own sources, sinks, decorators
Flume Goals: Manageability
- Centralized Data Flow Management Interface
Flume Goals: Manageability
- Configuring Flume
- Output Bucketing
Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ];
/logs/web/2010/0715/1200/data-xxx.txt
/logs/web/2010/0715/1200/data-xxy.txt
/logs/web/2010/0715/1300/data-xxx.txt
/logs/web/2010/0715/1300/data-xxy.txt
/logs/web/2010/0715/1400/data-xxx.txt
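In current Flume configuration syntax, the same kind of time-based output bucketing is typically achieved with escape sequences in the HDFS sink path; a hedged sketch (the path itself is a placeholder):
# Bucket output into per-hour directories such as /logs/web/2010/0715/1200
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode/logs/web/%Y/%m%d/%H00
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true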
Twitter Data Collection
A web server generates log data and this data is collected by an agent in Flume. The channel buffers this data until the sink pushes it to a centralized store. To fetch Twitter data, we have to follow the steps given below:
⚫ Create a Twitter application
⚫ Install / start HDFS
⚫ Configure Flume
Creating a Twitter Application
Twitter Data Collection
⚫ To create a Twitter application, open the following link: https://apps.twitter.com/. Sign in to your Twitter account. You will see a Twitter Application Management window where you can create, delete, and manage Twitter Apps.
⚫ Click on the Create New App button. You will be redirected to a window with an application form in which you have to fill in your details in order to create the App. While filling in the website address, give the complete URL, for example, http://example.com.
Twitter Data Collection
⚫ Fill in the details, accept the Developer Agreement and, when finished, click on the Create your Twitter application button at the bottom of the page. If everything goes fine, an App will be created with the given details.
⚫ Under the Keys and Access Tokens tab at the bottom of the page, you will find a button named Create my access token. Click on it to generate the access token.
⚫ Finally, click on the Test OAuth button at the top right of the page. This will lead to a page which displays your Consumer key, Consumer secret, Access token, and Access token secret. Copy these details; they are needed to configure the agent in Flume.
Twitter Data Collection
⚫ Starting HDFS
docker run --hostname=quickstart.cloudera --privileged=true -t -i -p 8888:8888 -p 80:80 -p 8088:8088 -p 8080:80 -p 50075:50075 -p 7180:7180 -p 50070:50070 cloudera/quickstart /usr/bin/docker-quickstart
⚫ Creating Directory
hadoop fs -mkdir -p /hadoop/twitter
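A quick way to confirm the directory was created (assuming HDFS is reachable from the same shell):
hadoop fs -ls /hadoop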
Twitter Data Collection
Configuring Flume
⚫ We have to configure the source, the channel, and the sink using a configuration file in the conf folder. The example given in these slides uses an experimental source provided by Apache Flume named Twitter 1% Firehose, a memory channel and an HDFS sink.
Twitter 1% Firehose Source
⚫ This source is highly experimental. It connects to the 1% sample Twitter Firehose using
streaming API and continuously downloads tweets, converts them to Avro format, and sends Avro events to a downstream Flume sink.
⚫ We get this source by default along with the installation of Flume. The jar files corresponding to this source are located in Flume's lib folder.
⚫ Setting the classpath
Set the CLASSPATH variable to the lib folder of Flume in the flume-env.sh file:
export CLASSPATH=$CLASSPATH:$FLUME_HOME/lib/*
This source needs details such as the Consumer key, Consumer secret, Access token, and Access token secret of a Twitter application. While configuring this source, you have to provide values for the following properties:
⚫ Channels
⚫ Source type - org.apache.flume.source.twitter.TwitterSource
⚫ consumerKey - the OAuth consumer key
⚫ consumerSecret - the OAuth consumer secret
⚫ accessToken - the OAuth access token
⚫ accessTokenSecret - the OAuth access token secret
⚫ maxBatchSize - the maximum number of Twitter messages in one batch. The default value is 1000.
⚫ maxBatchDurationMillis - the maximum number of milliseconds to wait before closing a batch. The default value is 1000.
Twitter Data Collection
Channel
We are using the memory channel. To configure the memory channel, we must provide a value for the type of the channel.
⚫ Type - the type of the channel. In our example, the type is MemChannel (memory).
⚫ Capacity - the maximum number of events stored in the channel. Its default value is 100 (optional).
⚫ TransactionCapacity - the maximum number of events the channel accepts or sends per transaction. Its default value is 100 (optional).
HDFS Sink
This sink writes data into HDFS. To configure this sink, you must provide the following details:
⚫ Channel
⚫ Type - hdfs
⚫ hdfs.path - the path of the directory in HDFS where the data is to be stored.
Twitter Data Collection
# Naming the components on the current agent.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Describing/Configuring the source
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = Your OAuth consumer key
TwitterAgent.sources.Twitter.consumerSecret = Your OAuth consumer secret
TwitterAgent.sources.Twitter.accessToken = Your OAuth access token
TwitterAgent.sources.Twitter.accessTokenSecret = Your OAuth access token secret
TwitterAgent.sources.Twitter.keywords = tutorials point, java, bigdata, mapreduce, mahout, hbase, nosql

# Describing/Configuring the sink
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:50070/user/Hadoop/seqgen_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

# Describing/Configuring the channel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

# Binding the source and sink to the channel
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
Execution
$ cd $FLUME_HOME
$ bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
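Once the agent is running, one way to check that tweets are arriving (assuming the HDFS path used in the configuration above and Flume's default FlumeData file prefix) is to list and inspect the sink directory:
hadoop fs -ls /user/Hadoop/seqgen_data/
hadoop fs -cat /user/Hadoop/seqgen_data/FlumeData.*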