

SLIDE 1
  • Md. Sadil Khan

Rohit Aich Bhowmick

Data Collection With Apache Flume

SLIDE 2

Outline

  • Data Collection
  • Current Problem
  • Introduction to Apache Flume
  • Flume Features
  • The Flume Architecture
  • Data Flow in Flume
  • Flume Goals
  • Reliability
  • Scalability
  • Extensibility
  • Manageability
  • Twitter Data Collection
SLIDE 3

Data Collection

⚫ Data collection plays the most important role in the Big Data cycle. The Internet provides almost unlimited sources of data on a wide variety of topics. How important this area is depends on the type of business, but even traditional industries can acquire diverse external data and combine it with their transactional data.

⚫ For example, suppose we want to build a system that recommends restaurants. The first step is to gather data, in this case restaurant reviews from different websites, and store it in a database. Since we are interested in the raw text and would use it for analytics, where the data for developing the model is stored is less relevant. This may sound at odds with the main big data technologies, but to implement a big data application we simply need to make it work in real time.

⚫ Big data describes the voluminous amounts of structured, semi-structured and unstructured data collected by organizations. Because it takes a lot of time and money to load big data into a traditional relational database for analysis, new approaches for collecting and analyzing data have emerged. To gather and then mine big data for information, raw data with extended metadata is aggregated in a data lake. From there, machine learning and artificial intelligence programs use complex algorithms to look for repeatable patterns.

SLIDE 4

Current Problem

  • Situation: Thanks to the ease of access to the digital world, we generate more and more data every day from our smartphones, laptops, tablets, etc.

  • Problem: Companies want to analyse this huge amount of data and gather insights from it for business goals. We need a reliable, scalable, extensible and manageable way to gather the data somewhere we can efficiently process it!

SLIDE 5

Introduction to Apache Flume

⚫ Apache Flume is a highly reliable, distributed and configurable streaming tool for aggregating and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store (like HDFS or HBase). It was developed by Cloudera.

SLIDE 6

Flume Features

Flume efficiently collects, aggregates and moves large amounts of log data from many different sources to a centralized store (HDFS, HBase).

It has a simple and flexible architecture based on streaming data flows, and it handles data movement from many machines within an enterprise into Hadoop.

Flume isn't restricted to log-data aggregation: it can transport massive quantities of event data, including but not limited to network traffic data, social-media-generated data, email messages and almost any other data source.

It has built-in support for several source and destination platforms to integrate with.

SLIDE 7

The Flume Architecture

⚫ Data generators (such as Facebook or Twitter) generate data, which is collected by individual Flume agents running on them.

⚫ Thereafter, a data collector (which is also an agent) collects the data from the agents, aggregates it, and pushes it into a centralized store such as HDFS or HBase.

SLIDE 8

The Flume Architecture

  • Event: An event is a single log entry, or the basic unit of data that we transport. An event is composed of zero or more headers and a body.

  • Logfile: In computing, a log file is a file that records either events that occur in an operating system or other software, or messages between different users of communication software. Logging is the act of keeping a log. In the simplest case, messages are written to a single log file.

  • Agent: An agent is a JVM process which receives data (events) from clients or other agents and forwards it to its next destination (a sink or another agent). A Flume flow may have more than one agent. An agent has three components: a Flume source, a Flume channel and a Flume sink.

  • Processor: Intermediate processing (aggregation). This step is optional.

  • Collector: Writes data to permanent storage.
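These agent components map directly onto Flume's properties-file configuration. A minimal sketch, assuming an illustrative agent named a1 and illustrative component names (not taken from the slides):

```properties
# Minimal agent skeleton: one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Every component declares a type; the source and sink are then
# bound to the channel that stages the events between them.
a1.sources.r1.type = seq
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Note the asymmetry: a source can write to several channels (`channels`, plural), while each sink drains exactly one channel (`channel`, singular).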
SLIDE 9

Components of Agent -Source

➢ Source

➢ A Flume source is configured within an agent and listens for events from an external source (e.g. a web server): it reads data, translates it into events, and handles failure situations.

➢ A source doesn't know how to store the event, so after receiving enough data to produce a Flume event, it sends events to the channel to which it is connected.

➢ The external source sends events to Flume in a format that is recognised by the target Flume source.

➢ Spooling Directory Source
➢ Exec Source
➢ Netcat Source
➢ HTTP Source
➢ Twitter Source
➢ Avro Source
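As a sketch of how one of these source types is wired up, here is a Netcat source that turns each line received on a TCP port into one event (agent name, host and port are illustrative, not from the slides):

```properties
# Netcat source: listens on a TCP port and emits one event per line
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
```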

SLIDE 10

Components of Agent - Channel

Channel

⚫ Channels are communication bridges between sources and sinks within an agent. Once a Flume source receives an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it is consumed by a Flume sink.

⚫ The memory channel stores the events in an in-memory queue, from which they are consumed by the sink. If the agent process dies in the middle, because of a software or hardware failure, then all the events currently in the memory channel are lost forever.

⚫ The file channel is another example: it is backed by the local file system. Unlike the memory channel, the file channel writes events to files on the file system, which are deleted only after successful delivery to the sink. Note: the memory channel is the fastest but carries a risk of data loss. File channels are typically much slower but effectively provide guaranteed delivery to the sink.
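A file channel is configured with directories for its checkpoint and data files; a minimal sketch (the directory paths are illustrative assumptions):

```properties
# Durable file channel: staged events survive an agent restart
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
```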

SLIDE 11

Components of Agent - Sink

⚫ A sink removes the event from the channel and puts it into an external repository like HDFS, or forwards it to the Flume source of the next Flume agent in the flow. The source and sink within a given agent run asynchronously, with the events staged in the channel.

1) HDFS Sink
2) Logger Sink
3) File Roll Sink
4) HBase Sink
5) MorphlineSolr Sink
6) Avro Sink
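The forwarding case is usually done with an Avro sink, which sends events to the Avro source of the next-hop agent; a sketch (hostname and port are illustrative assumptions):

```properties
# Avro sink on agent a1 forwards events to the Avro source of a downstream agent
a1.sinks = k1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector.example.com
a1.sinks.k1.port = 4545
a1.sinks.k1.channel = c1
```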

SLIDE 12

Data Flow in Flume

⚫ Multi-hop Flow: Before reaching the final destination, an event may travel through more than one agent within Flume.

⚫ Fan-out Flow: In very simple terms, when the data flows from one source to multiple channels, we call it fan-out flow. In Flume, fan-out flow is of two categories:

⚫ 1. Replicating: The data flow where the data is replicated in all the configured channels.

⚫ 2. Multiplexing: The data flow where the data is sent to a selected channel, as mentioned in the header of the event.

⚫ Fan-in Flow: The data is transferred from many sources to one channel.
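The two fan-out categories are configured through a channel selector on the source; a sketch of both alternatives (the header name and its values are illustrative assumptions):

```properties
# Fan-out: one source feeding two channels
a1.sources.r1.channels = c1 c2

# Alternative 1 - replicating (the default): every event goes to both channels
a1.sources.r1.selector.type = replicating

# Alternative 2 - multiplexing: route on the value of an event header
# (comment out the replicating line above if you use this)
# a1.sources.r1.selector.type = multiplexing
# a1.sources.r1.selector.header = datacenter
# a1.sources.r1.selector.mapping.eu = c1
# a1.sources.r1.selector.mapping.us = c2
# a1.sources.r1.selector.default = c1
```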

SLIDE 13

Flume: Failure Handling

⚫ For each event, two transactions take place: one at the sender and one at the receiver.

⚫ The sender sends events to the receiver.

⚫ Soon after receiving the data, the receiver commits its own transaction and sends a "received" signal to the sender.

⚫ The sender commits its transaction only after receiving this signal.

SLIDE 14

Flume Goals: Reliability

⚫ Flume uses a transactional approach to data flow, by default.

  • Best Effort
  • Store on Failure and Retry
  • End to End Reliability
SLIDE 15

Flume Goals: Scalability

  • Horizontally Scalable Data Path
  • Load Balancing
SLIDE 16

Flume Goals: Scalability

  • Horizontally Scalable Control Path
SLIDE 17

Flume Goals: Extensibility

  • Simple Source and Sink API
  • Event streaming and composition of simple operations
  • Plug-in Architecture
  • Add your own sources, sinks, decorators
SLIDE 18

Flume Goals: Manageability

  • Centralized Data Flow Management Interface

SLIDE 19

Flume Goals: Manageability

  • Configuring Flume
  • Output Bucketing

Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ];

/logs/web/2010/0715/1200/data-xxx.txt
/logs/web/2010/0715/1200/data-xxy.txt
/logs/web/2010/0715/1300/data-xxx.txt
/logs/web/2010/0715/1300/data-xxy.txt
/logs/web/2010/0715/1400/data-xxx.txt
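In current Flume NG syntax, the same time-based output bucketing is expressed with escape sequences in the HDFS sink path; a sketch (agent and sink names are illustrative):

```properties
# Events are bucketed into per-hour directories, like the listing above
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /logs/web/%Y/%m%d/%H00
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```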

SLIDE 20

Twitter Data Collection

A web server generates log data, and this data is collected by an agent in Flume. The channel buffers this data to a sink, which finally pushes it to centralized stores. To fetch Twitter data, we have to follow the steps given below:

⚫ Create a Twitter application
⚫ Install / start HDFS
⚫ Configure Flume

Fetching Twitter Application

SLIDE 21

Twitter Data Collection

⚫ To create a Twitter application, go to https://apps.twitter.com/ and sign in to your Twitter account. You will see a Twitter Application Management window where you can create, delete, and manage Twitter apps.

⚫ Click on the Create New App button. You will be redirected to a window with an application form in which you have to fill in your details in order to create the app. When filling in the website address, give the complete URL pattern, for example, http://example.com.

SLIDE 22

Twitter Data Collection

⚫ Fill in the details and accept the Developer Agreement. When finished, click on the Create your Twitter application button at the bottom of the page. If everything goes fine, an app will be created with the given details.

⚫ Under the Keys and Access Tokens tab, you will see a button named Create my access token. Click on it to generate the access token.

SLIDE 23

⚫ Finally, click on the Test OAuth button at the top right of the page. This leads to a page which displays your Consumer key, Consumer secret, Access token, and Access token secret. Copy these details; they are needed to configure the agent in Flume.

Twitter Data Collection

SLIDE 24

⚫ Starting HDFS

docker run --hostname=quickstart.cloudera --privileged=true -t -i -p 8888:8888 -p 80:80 -p 8088:8088 -p 8080:80 -p 50075:50075 -p 7180:7180 -p 50070:50070 cloudera/quickstart /usr/bin/docker-quickstart

⚫ Creating Directory

hadoop fs -mkdir -p /hadoop/twitter

Twitter Data Collection

SLIDE 25

Configuring Flume

⚫ We have to configure the source, the channel, and the sink using the configuration file in the conf folder. The example in this slide uses an experimental source provided by Apache Flume named Twitter 1% Firehose, a memory channel and an HDFS sink.

Twitter 1% Firehose Source

⚫ This source is highly experimental. It connects to the 1% sample Twitter Firehose using the streaming API, continuously downloads tweets, converts them to Avro format, and sends Avro events to a downstream Flume sink.

⚫ We get this source by default with the installation of Flume. The jar files corresponding to this source are located in Flume's lib folder.

SLIDE 26

⚫ Setting the classpath

Set the classpath variable to the lib folder of Flume in the flume-env.sh file as shown below.

export CLASSPATH=$CLASSPATH:/FLUME_HOME/lib/*

This source needs details such as the Consumer key, Consumer secret, Access token, and Access token secret of a Twitter application. While configuring this source, you have to provide values for the following properties:

⚫ channels
⚫ type: org.apache.flume.source.twitter.TwitterSource
⚫ consumerKey – the OAuth consumer key
⚫ consumerSecret – the OAuth consumer secret
⚫ accessToken – the OAuth access token
⚫ accessTokenSecret – the OAuth access token secret
⚫ maxBatchSize – the maximum number of Twitter messages in a batch. The default value is 1000.
⚫ maxBatchDurationMillis – the maximum number of milliseconds to wait before closing a batch. The default value is 1000.

Twitter Data Collection

SLIDE 27

Twitter Data Collection

Channel

We are using the memory channel. To configure the memory channel, we must provide a value for the type of the channel.

⚫ type – the type of the channel. In our example, the type is memory.
⚫ capacity – the maximum number of events stored in the channel. Its default value is 100 (optional).
⚫ transactionCapacity – the maximum number of events the channel accepts or sends per transaction. Its default value is 100 (optional).

HDFS Sink

This sink writes data into HDFS. To configure this sink, you must provide the following details.

⚫ channel
⚫ type – hdfs
⚫ hdfs.path – the path of the directory in HDFS where the data is to be stored.

SLIDE 28

Twitter Data Collection

# Naming the components of the current agent
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Describing/configuring the source
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = Your OAuth consumer key
TwitterAgent.sources.Twitter.consumerSecret = Your OAuth consumer secret
TwitterAgent.sources.Twitter.accessToken = Your OAuth access token
TwitterAgent.sources.Twitter.accessTokenSecret = Your OAuth access token secret
TwitterAgent.sources.Twitter.keywords = tutorials point, java, bigdata, mapreduce, mahout, hbase, nosql

# Describing/configuring the sink
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:50070/user/Hadoop/seqgen_data/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

# Describing/configuring the channel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

# Binding the source and sink to the channel
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel

SLIDE 29

$ cd $FLUME_HOME
$ bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

Execution

Twitter Data Collection