Logging with Log4j and Log Aggregation with Apache Flume
By Arivoli K. (MDS201903), Naveen Kumar Reddy (MDS201909), Saager Babu NG (MDS201917), Suman Polley (MDS201935), Avinash Kumar (MDS201907)
Overview: Why is logging necessary? Here comes Log4j.
2) Filter: The Filter object is used to analyze logging information and to decide whether or not that information should be logged.
3) ObjectRenderer: The ObjectRenderer object is specialized in providing a String representation of different objects passed to the logging framework.
4) LogManager: The LogManager object manages the logging framework.
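To make these objects concrete, here is a minimal Log4j 1.x usage sketch (the class and messages are our own illustration, not from the slides):

  import org.apache.log4j.BasicConfigurator;
  import org.apache.log4j.Logger;

  public class HelloLog4j {
      // Each class typically obtains its own named Logger
      private static final Logger logger = Logger.getLogger(HelloLog4j.class);

      public static void main(String[] args) {
          BasicConfigurator.configure(); // default console appender and layout
          logger.info("application started");
          logger.debug("fine-grained details, usually filtered out in production");
          logger.error("something went wrong", new RuntimeException("example"));
      }
  }

Behind these calls, the LogManager wires up the framework, and any configured Filters decide which messages actually reach the output.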
Flume's load-balancing Log4j appender can send events to multiple Flume agents, using a round-robin or random selection strategy. These Log4j appenders come bundled with Flume and don't require us to write any code, which is another reason Flume is so popular.
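For illustration, a log4j.properties sketch wiring an application to two Flume agents through the bundled load-balancing appender (hostnames and ports are hypothetical):

  log4j.rootLogger = INFO, flume
  log4j.appender.flume = org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender
  # Space-separated list of Flume Avro source endpoints
  log4j.appender.flume.Hosts = agent1.example.com:4141 agent2.example.com:4141
  # ROUND_ROBIN (the default) or RANDOM
  log4j.appender.flume.Selector = ROUND_ROBIN

Each agent named in Hosts must run an Avro source listening on the given port.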
PHILOSOPHY:
Flume acts as a buffer between source and destination. By balancing out any mismatch between the rate at which data is produced and the rate at which it can be consumed, Flume maintains a smooth flow of data.
The events are staged in a channel on each agent and are then delivered to the next agent or to the terminal repository (such as HDFS) in the flow. An event is removed from a channel only after it has been stored in the channel of the next agent or in the terminal repository. This hop-by-hop handoff ensures reliable data transfer and recoverability.
Solution:
By connecting multiple agents to each other, Flume creates a data pipeline. It is possible to scale down the number of servers that write to HDFS by adding intermediate tiers of Flume agents. This structure has its own problems: if the n-th tier must absorb the same aggregate volume as the (n-1)-th tier with fewer agents, the n-th tier can easily be overwhelmed. The web servers feed the outermost tier, the number of agents shrinks as the flow converges toward HDFS, and the load per agent is therefore greatest in the innermost tier.
A flow that uses the File Channel, or another durable channel, will resume processing events where it left off after a failure.
These buffers have a fixed capacity; once that capacity is full, they create back pressure on earlier points in the flow. If this pressure propagates all the way back to the source of the flow, Flume becomes unavailable and may lose data. Rule of thumb: channel capacity must be sized to sustain the worst-case (maximum) data ingestion rate.
What if a single node goes down? Adding another Flume agent balances the load and improves handling of downstream failures.
Flume started out as a real-time log aggregator, but it has evolved to handle many types of streaming data, and it now finds application beyond logging (for example, IoT and instant-messaging services).
NAME NODE CLOGGING
What if all the web servers collecting log data tried to connect to HDFS and write at the same time?
[Diagram: MapReduce, Spark, and Impala jobs all contacting a single Name Node at once]
An Event is the fundamental unit of data transported by flume from its point of origination to its final destination.
A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes.
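Flume's client SDK exposes this unit directly. A small sketch using EventBuilder (the header names and payload are our own example):

  import java.nio.charset.StandardCharsets;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.flume.Event;
  import org.apache.flume.event.EventBuilder;

  public class EventSketch {
      public static void main(String[] args) {
          // Optional string attributes (headers) carried with the event
          Map<String, String> headers = new HashMap<>();
          headers.put("host", "web01.example.com"); // hypothetical header

          // Byte payload + headers = one Flume event
          Event event = EventBuilder.withBody(
                  "user logged in".getBytes(StandardCharsets.UTF_8), headers);

          System.out.println(new String(event.getBody(), StandardCharsets.UTF_8));
      }
  }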
Avro is a row-based data format and data serialization system, developed within Apache's Hadoop project and released in 2009. The data schema is stored as JSON in the file header, while the rest of the data is stored in a compact binary format. One strong point of Avro is its robust support for schema evolution. Row-based data formats are generally better for storing write-intensive data, because appending new records is easier. An Avro Object Container File consists of a file header followed by one or more data blocks.
The file header consists of:
- four bytes, ASCII 'O', 'b', 'j', followed by the format version number 1 (i.e., the byte sequence 0x4F 0x62 0x6A 0x01);
- file metadata, including the schema definition;
- a 16-byte, randomly generated sync marker for this file.
Ref: https://en.wikipedia.org/wiki/Apache_Avro
Client: an entity that generates events and sends them to one or more Agents.
Agent: a container for hosting sources, channels, sinks, and other components that enable the transportation of events from one place to another. It is a self-contained JVM process. Connecting multiple Flume agents to each other establishes a flow, and this flow moves the data. Each Flume agent has three components: the source, the channel, and the sink. The source is responsible for getting events into the Flume agent, while the sink is responsible for removing events from the agent and forwarding them to the next agent in the topology, or to HDFS. The channel is a buffer that stores the data the source has received until a sink has successfully written it out to the next hop or to its eventual destination.
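A minimal single-agent configuration sketch wiring the three components together (the netcat source and logger sink are stock Flume components; the names and port are our own choices):

  agent.sources  = src
  agent.channels = ch
  agent.sinks    = snk

  # Source: listens on a TCP port and turns each line into an event
  agent.sources.src.type = netcat
  agent.sources.src.bind = localhost
  agent.sources.src.port = 44444
  agent.sources.src.channels = ch

  # Channel: in-memory buffer between source and sink
  agent.channels.ch.type = memory

  # Sink: logs each event (useful for testing)
  agent.sinks.snk.type = logger
  agent.sinks.snk.channel = ch

Started with "flume-ng agent --name agent --conf-file agent.conf", this process is one complete hop.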
Source: an active component that receives events from a specialized location or mechanism and places them on one or more channels. Sources are active components that receive data from some other application that is producing the data. (There are sources that produce data themselves, but these are mostly used for testing.) Sources can listen on one or more network ports to receive data or can read data from the local file system. Each source must be connected to at least one channel; a source can also write to several channels, replicating the events to all or some of them based on some criteria. Flume's primary RPC source is the Avro Source: a highly scalable RPC server that accepts data into a Flume agent, either from another Flume agent's Avro Sink or from a client application that uses Flume's SDK to send data. Together, the Avro Source and the Avro Sink form Flume's internal communication mechanism between agents. With the scalability of the Avro Source, combined with channels that act as buffers, Flume agents can handle significant load spikes.
A source named usingFlumeSource of type avro, running in an agent started with the name usingFlume, would be configured with a file that looks like:
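The slide's original listing is not preserved in this extract; a sketch of such a configuration, assuming a memory channel and an arbitrary port, could look like:

  usingFlume.sources  = usingFlumeSource
  usingFlume.channels = memory

  usingFlume.sources.usingFlumeSource.type = avro
  usingFlume.sources.usingFlumeSource.channels = memory
  # Interface and port to listen on (values are our own assumption)
  usingFlume.sources.usingFlumeSource.bind = 0.0.0.0
  usingFlume.sources.usingFlumeSource.port = 4353

  usingFlume.channels.memory.type = memory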
Channels are passive components that buffer data that has been received by the agent but not yet written out to its destination.
Channels behave like queues, with sources writing to them and sinks reading from them. Multiple sources can safely write to the same channel, and multiple sinks can read from the same channel, though each sink can read from exactly one channel. Having a channel as a buffer between sources and sinks has several advantages: since writes happen at the tail of the buffer and reads at the head, sources and sinks can operate at different rates, and sources can absorb sudden spikes in load even when the sinks are unable to drain the channels immediately. Channels are transactional in nature. Each write to a channel and each read from a channel happens within the context of a transaction. Only once a write transaction is committed are the events from that transaction readable by any sink. Likewise, if a sink has successfully taken an event, that event is not available for other sinks to take unless the sink rolls back the transaction.
ALL OR NOTHING
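A sketch of what all-or-nothing means in code, using Flume's channel transaction API (the helper method is our own illustration):

  import org.apache.flume.Channel;
  import org.apache.flume.ChannelException;
  import org.apache.flume.Event;
  import org.apache.flume.Transaction;

  public class ChannelWriteSketch {
      // Puts one event into a channel inside a transaction. Until commit()
      // succeeds, no sink can see the event; on failure, rollback() ensures
      // the channel keeps nothing from this transaction.
      static void putEvent(Channel channel, Event event) {
          Transaction tx = channel.getTransaction();
          tx.begin();
          try {
              channel.put(event);
              tx.commit(); // only now does the event become visible to sinks
          } catch (ChannelException e) {
              tx.rollback(); // all or nothing: discard the partial write
              throw e;
          } finally {
              tx.close();
          }
      }
  }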
The following configuration shows a Memory Channel configured to hold up to 100,000 events, with each transaction being able to hold up to 1,000 events. The total memory occupied by all events in the channel can be a maximum of approximately 5 GB of space. Of this 5 GB, the channel considers 10% to be reserved for event headers (as defined by the byteCapacityBufferPercentage parameter), making 4.5 GB available for event bodies:
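A sketch of that configuration (the agent and channel names are our own; the parameters are the standard Memory Channel settings):

  agent.channels = memChannel
  agent.channels.memChannel.type = memory

  # Up to 100,000 events in the channel, up to 1,000 per transaction
  agent.channels.memChannel.capacity = 100000
  agent.channels.memChannel.transactionCapacity = 1000

  # Roughly 5 GB in total, with 10% reserved for event headers
  agent.channels.memChannel.byteCapacity = 5000000000
  agent.channels.memChannel.byteCapacityBufferPercentage = 10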
Sink: the component that removes data from a Flume agent and writes it to another agent, a data store, or some other system; it could be one of the sinks that come bundled with Flume or a custom sink. Sinks are the components in a Flume agent that keep draining the channel, so that the sources can continue receiving events and writing to the channel. Sinks continuously poll the channel for events and remove them in batches. These batches of events are either written out to a storage or indexing system, or sent to another Flume agent. Sinks are fully transactional: each sink starts a transaction with the channel before removing a batch of events from it. Once the batch has been successfully written out to storage or to the next Flume agent, the sink commits the transaction, and the channel removes those events from its own internal buffers.
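For example, an HDFS sink draining a channel in batches might be configured like this (the path and names are our own illustration):

  agent.sinks = hdfsSink
  agent.sinks.hdfsSink.type = hdfs
  agent.sinks.hdfsSink.channel = ch

  # Bucket events into date-based directories
  agent.sinks.hdfsSink.hdfs.path = /flume/events/%Y/%m/%d
  agent.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
  agent.sinks.hdfsSink.hdfs.fileType = DataStream
  # Events taken from the channel per transaction
  agent.sinks.hdfsSink.hdfs.batchSize = 1000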
Multi-agent flow: the data passes through multiple agents, or hops. The sink of the previous agent and the source of the current hop must both be of type avro, with the sink pointing to the hostname (or IP address) and port of the source.
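A two-agent sketch of this hop (the agent names, hostname, and port are our own assumptions; the first agent's own source is omitted):

  # Agent "weblog" on the web server: its avro sink points at the collector
  weblog.channels = ch1
  weblog.sinks = avroSink
  weblog.channels.ch1.type = memory
  weblog.sinks.avroSink.type = avro
  weblog.sinks.avroSink.channel = ch1
  weblog.sinks.avroSink.hostname = collector.example.com
  weblog.sinks.avroSink.port = 4545

  # Agent "collector" on the next hop: its avro source listens on that port
  collector.sources = avroSrc
  collector.channels = ch2
  collector.channels.ch2.type = memory
  collector.sources.avroSrc.type = avro
  collector.sources.avroSrc.channels = ch2
  collector.sources.avroSrc.bind = 0.0.0.0
  collector.sources.avroSrc.port = 4545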
A typical log-aggregation scenario: multiple web servers produce log data, and logs collected from hundreds of web servers are sent to a dozen or so agents that write to the HDFS cluster. A number of first-tier agents are configured with an avro sink, all pointing to the avro source of a single agent. This source on the second-tier agent consolidates the received events into a single channel, which is consumed by a sink that writes to the final destination.
Multiplexing flow: Flume supports multiplexing an event flow to one or more destinations. This is achieved by defining a flow multiplexer that can replicate an event, or selectively route it, to one or more channels. For example, a source on agent "foo" can fan the flow out to three different channels. This fan-out can be replicating (every event is sent to all three channels) or multiplexing (an event is delivered to a subset of the channels based on the value of one of its headers).
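A multiplexing selector sketch in the style of the Flume User Guide (agent "foo"; the header name and mappings are our own assumptions):

  foo.sources  = r1
  foo.channels = c1 c2 c3

  # Route each event by the value of its "state" header
  foo.sources.r1.selector.type = multiplexing
  foo.sources.r1.selector.header = state
  foo.sources.r1.selector.mapping.CZ = c1
  foo.sources.r1.selector.mapping.US = c2 c3
  foo.sources.r1.selector.default = c3

  # For a replicating fan-out (the default), use instead:
  # foo.sources.r1.selector.type = replicating

Events whose "state" header matches no mapping fall through to the default channel.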
REFERENCES
1. "Using Flume" by Hari Shreedharan
2. Flume User Guide: https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html
3. Real-Time Data Ingest into Hadoop Using Flume: https://www.youtube.com/watch?v=SR__hkCINNc