 
              Data Collection With Apache Flume Md. Sadil Khan Rohit Aich Bhowmick
Outline
• Data Collection
• Current Problem
• Introduction to Apache Flume
• Flume Features
• The Flume Architecture
• Data Flow in Flume
• Flume Goals
• Reliability
• Scalability
• Extensibility
• Manageability
• Twitter Data Collection
Data Collection
⚫ Data collection plays the most important role in the Big Data cycle. The Internet provides almost unlimited sources of data on a wide variety of topics. How much this matters depends on the type of business, but even traditional industries can acquire diverse external data and combine it with their transactional data.
⚫ For example, suppose we want to build a system that recommends restaurants. The first step is to gather data, in this case restaurant reviews from different websites, and store them in a database. Because we are interested in the raw text and will use it for analytics, it matters little where the data used to develop the model is stored. This may seem to contradict the emphasis of the main big data technologies, but to implement a big data application we simply need to make it work in real time.
⚫ Big data describes the voluminous amounts of structured, semi-structured and unstructured data collected by organizations. Because loading big data into a traditional relational database for analysis takes a lot of time and money, new approaches for collecting and analyzing data have emerged. To gather and then mine big data for information, raw data with extended metadata is aggregated in a data lake. From there, machine learning and artificial intelligence programs use complex algorithms to look for repeatable patterns.
Current Problem
• Situation: Thanks to the ease of access in the digital world, we generate more and more data every day from our smartphones, laptops, tablets, etc.
• Problem: Companies want to analyse this huge amount of data and gather insights from it for business goals. We need a reliable, scalable, extensible and manageable way to gather the data somewhere we can process it efficiently.
Introduction to Apache Flume
⚫ Apache Flume is a highly reliable, distributed and configurable tool for aggregating and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store (like HDFS or HBase). It was originally developed by Cloudera.
Flume Features
⚫ Flume efficiently collects, aggregates and moves large amounts of log data from many different sources to a centralized store (HDFS, HBase).
⚫ It has a simple and flexible architecture based on streaming data flows, and it moves data from multiple machines within an enterprise into Hadoop.
⚫ Flume isn’t restricted to log data aggregation: it can transport massive quantities of event data, including but not limited to network traffic data, social-media-generated data, email messages and pretty much any other data source.
⚫ It has built-in support for several source and destination platforms to integrate with.
The Flume Architecture
⚫ Data generators (such as Facebook and Twitter) produce data, which is collected by individual Flume agents running on them.
⚫ A data collector (which is also an agent) then gathers the data from these agents, aggregates it and pushes it into a centralized store such as HDFS or HBase.
The Flume Architecture
• Event: An event is a single log entry, the basic unit of data that Flume transports. An event is composed of zero or more headers and a body.
• Log file: In computing, a log file records either events that occur while an operating system or other software runs, or messages between different users of communication software. Logging is the act of keeping a log. In the simplest case, messages are written to a single log file.
• Agent: A JVM process that receives data (events) from clients or other agents and forwards it to its next destination (a sink or another agent). A Flume deployment may have more than one agent. Each agent has three components: Flume Source, Flume Channel and Flume Sink.
• Processor: Intermediate processing (aggregation). This step is optional.
• Collector: Writes data to permanent storage.
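The terms above can be made concrete with a small Python model. This is only an illustrative sketch, not Flume's Java API: the names `Event`, `ToyAgent`, `source_receive` and `sink_drain` are hypothetical, and a plain list stands in for the external store.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Toy stand-in for a Flume event: zero or more headers plus a body."""
    body: bytes
    headers: dict = field(default_factory=dict)


class ToyAgent:
    """Hypothetical single-process agent: source -> channel -> sink."""

    def __init__(self):
        self.channel = deque()   # passive buffer between source and sink
        self.delivered = []      # stands in for the external store (e.g. HDFS)

    def source_receive(self, line, **headers):
        # The source turns raw input into an event and puts it on the channel.
        self.channel.append(Event(body=line.encode("utf-8"), headers=headers))

    def sink_drain(self):
        # The sink takes events off the channel and writes them out.
        while self.channel:
            self.delivered.append(self.channel.popleft())


agent = ToyAgent()
agent.source_receive("GET /index.html 200", host="web01")
agent.source_receive("GET /missing 404")   # an event with zero headers
agent.sink_drain()
```

After `sink_drain()` the channel is empty and both events sit in `agent.delivered`, in arrival order.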
Components of Agent - Source
➢ A Flume source is configured within an agent and listens for events from an external source (e.g. a web server). It reads data, translates it into events, and handles failure situations.
➢ The source does not know how to store an event. So, after receiving enough data to produce a Flume event, it sends the event to the channel to which it is connected.
➢ The external source sends events to Flume in a format that is recognised by the target Flume source.
Examples of Flume sources:
➢ Spooling Directory Source
➢ Exec Source
➢ Netcat Source
➢ HTTP Source
➢ Twitter Source
➢ Avro Source
Components of Agent - Channel
⚫ Channels are the communication bridges between sources and sinks within an agent. Once a Flume source receives an event, it stores it in one or more channels. The channel is a passive store that keeps the event until it is consumed by a Flume sink.
⚫ The memory channel stores events in an in-memory queue, from which the sink reads them. If the agent process dies because of a software or hardware failure, all the events currently in the memory channel are lost.
⚫ The file channel is another example. It is backed by the local file system. Unlike the memory channel, the file channel writes events to a file on the file system, and an event is deleted only after successful delivery to the sink.
Note: The memory channel is the fastest but carries the risk of data loss. The file channel is typically much slower but effectively provides guaranteed delivery to the sink.
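The durability trade-off between the two channel types can be sketched in Python. The classes below are hypothetical toys, not Flume's implementation: the file-backed version simply rewrites a JSON-lines file on every change, and removing an event on `take` stands in for "deleted after successful delivery". A restart is simulated by constructing a fresh object.

```python
import json
import os
import tempfile
from collections import deque


class MemoryChannel:
    """Events live only in process memory."""

    def __init__(self):
        self.queue = deque()

    def put(self, event):
        self.queue.append(event)

    def take(self):
        return self.queue.popleft() if self.queue else None


class FileChannel:
    """Events are mirrored to a local file so they survive a restart."""

    def __init__(self, path):
        self.path = path
        self.queue = deque()
        if os.path.exists(path):
            # Reload whatever was still pending when the process stopped.
            with open(path) as f:
                self.queue.extend(json.loads(line) for line in f if line.strip())

    def _flush(self):
        with open(self.path, "w") as f:
            for event in self.queue:
                f.write(json.dumps(event) + "\n")

    def put(self, event):
        self.queue.append(event)
        self._flush()

    def take(self):
        if not self.queue:
            return None
        event = self.queue.popleft()
        self._flush()   # the event disappears from disk once it is taken
        return event


path = os.path.join(tempfile.mkdtemp(), "channel.log")

mem = MemoryChannel()
mem.put({"body": "a"})
mem = MemoryChannel()      # simulated restart: the in-memory queue is gone
lost = mem.take()

fc = FileChannel(path)
fc.put({"body": "a"})
fc = FileChannel(path)     # simulated restart: pending events are reloaded
recovered = fc.take()
```

Here `lost` is `None` while `recovered` holds the original event, which mirrors the speed-versus-safety note above: the file-backed version pays for a disk write on every `put` and `take`.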
Components of Agent - Sink
⚫ A sink removes the event from the channel and either puts it into an external repository like HDFS or forwards it to the Flume source of the next Flume agent in the flow. The source and sink within a given agent run asynchronously, with the events staged in the channel.
Examples of Flume sinks:
1) HDFS Sink
2) Logger Sink
3) File Roll Sink
4) HBase Sink
5) MorphlineSolr Sink
6) Avro Sink
Data Flow in Flume
⚫ Multi-hop flow: An event may travel through more than one agent before reaching its final destination.
⚫ Fan-out flow: Data flows from one source to multiple channels. Fan-out comes in two categories:
⚫ 1. Replicating: The event is replicated to all the configured channels.
⚫ 2. Multiplexing: The event is sent only to a selected channel, chosen according to a value in the event’s header.
⚫ Fan-in flow: Data is transferred from many sources to one channel.
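The two fan-out categories differ only in how the target channels are chosen. The following Python sketch illustrates that choice; the function names, the `region` header and the channel names `c1`/`c2` are all assumptions made for the example, and events are modelled as plain dictionaries.

```python
def replicating_selector(event, channels):
    """Replicating fan-out: every configured channel gets a copy."""
    return list(channels)


def multiplexing_selector(event, channels, header, mapping, default):
    """Multiplexing fan-out: the value of one header picks the channel(s)."""
    names = mapping.get(event["headers"].get(header), default)
    return [name for name in names if name in channels]


def route(event, channels, targets):
    # Append the event to each selected channel.
    for name in targets:
        channels[name].append(event)


event = {"headers": {"region": "eu"}, "body": "order placed"}
other = {"headers": {}, "body": "no region header"}

# Replicating: both channels receive the event.
channels = {"c1": [], "c2": []}
route(event, channels, replicating_selector(event, channels))

# Multiplexing: "eu" goes to c1, "us" to c2, anything else to the default.
mux = {"c1": [], "c2": []}
mapping = {"eu": ["c1"], "us": ["c2"]}
route(event, mux, multiplexing_selector(event, mux, "region", mapping, default=["c2"]))
route(other, mux, multiplexing_selector(other, mux, "region", mapping, default=["c2"]))
```

With replication each channel ends up with its own reference to `event`; with multiplexing `event` lands only in `c1`, and the event without a `region` header falls through to the default channel `c2`.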
Flume: Failure Handling
⚫ For each event, two transactions take place: one at the sender and one at the receiver.
⚫ The sender sends events to the receiver.
⚫ Soon after receiving the data, the receiver commits its own transaction and sends a “received” signal to the sender.
⚫ The sender commits its transaction only after receiving that signal.
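The handshake described above can be sketched as follows. This is a simplified Python model under stated assumptions, not Flume's transaction API: `Receiver`, `accept` and `send_batch` are hypothetical names, a boolean return value plays the role of the “received” signal, and a failed receiver is simulated with a flag.

```python
from collections import deque


class Receiver:
    def __init__(self, fail=False):
        self.channel = deque()
        self.fail = fail

    def accept(self, batch):
        """Commit the batch locally, then acknowledge it."""
        if self.fail:
            return False            # nothing committed, no acknowledgement
        self.channel.extend(batch)  # receiver-side commit
        return True                 # the "received" signal


def send_batch(sender_channel, receiver, batch_size=2):
    """Sender-side transaction: events leave the channel only after an ack."""
    n = min(batch_size, len(sender_channel))
    batch = [sender_channel[i] for i in range(n)]   # read without removing
    if receiver.accept(batch):
        for _ in batch:
            sender_channel.popleft()   # commit: drop the delivered events
        return True
    return False                       # rollback: events stay for a retry


outbox = deque(["e1", "e2", "e3"])

down = Receiver(fail=True)
first_try = send_batch(outbox, down)   # no ack, so nothing is removed

up = Receiver()
second_try = send_batch(outbox, up)    # ack received, batch is committed
```

The failed attempt leaves all three events in `outbox`, so the retry delivers `e1` and `e2` without loss; only `e3` remains queued afterwards.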
Flume Goals: Reliability
⚫ By default, Flume uses a transactional approach to data flow.
• Best Effort
• Store on Failure and Retry
• End-to-End Reliability
Flume Goals: Scalability
• Horizontally Scalable Data Path
• Load Balancing
• Horizontally Scalable Control Path
Flume Goals: Extensibility
• Simple Source and Sink API
• Event streaming and composition of simple operations
• Plug-in architecture
• Add your own sources, sinks and decorators
Flume Goals: Manageability
• Centralized Data Flow Management Interface
Flume Goals: Manageability
• Configuring a Flume node:
  tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ;
• Output bucketing:
  /logs/web/2010/0715/1200/data-xxx.txt
  /logs/web/2010/0715/1200/data-xxy.txt
  /logs/web/2010/0715/1300/data-xxx.txt
  /logs/web/2010/0715/1300/data-xxy.txt
  /logs/web/2010/0715/1400/data-xxx.txt
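The bucketed paths above appear to follow a year / month-day / hour layout. Assuming that reading is right, a path of this shape could be derived from an event timestamp as in the Python sketch below; `bucket_path` and its default arguments are hypothetical and are not part of Flume.

```python
from datetime import datetime


def bucket_path(ts, root="/logs/web", name="data-xxx.txt"):
    """Build an hourly bucket path from a timestamp (assumed layout)."""
    # %Y -> year, %m%d -> month and day, %H -> hour; minutes are fixed to 00.
    return f"{root}/{ts:%Y}/{ts:%m%d}/{ts:%H}00/{name}"


path = bucket_path(datetime(2010, 7, 15, 12, 34))
# path == "/logs/web/2010/0715/1200/data-xxx.txt"
```

Every event stamped between 12:00 and 12:59 on 15 July 2010 maps to the same directory, which is what lets downstream jobs process one hour of data at a time.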
Twitter Data Collection
Fetching Twitter Data
A web server generates log data, and this data is collected by a Flume agent. The channel buffers the data until a sink takes it and pushes it to a centralized store. To fetch Twitter data, follow these steps:
⚫ Create a Twitter application
⚫ Install and start HDFS
⚫ Configure Flume
Twitter Data Collection
⚫ To create a Twitter application, go to https://apps.twitter.com/ and sign in to your Twitter account. You will see the Twitter Application Management window, where you can create, delete, and manage Twitter apps.
⚫ Click the Create New App button. You will be redirected to an application form where you fill in your details to create the app. When entering the website address, give the complete URL pattern, for example, http://example.com.