Apache Flume: Getting data into Hadoop
Problem
- Getting data into HDFS is not difficult:
– % hadoop fs -put data.csv .
– works great when data is neatly packaged and ready to upload
- Unfortunately, sources such as a web server create data all the time
– How often should a batch load to HDFS happen? Daily? Hourly?
- The real need is a solution that can deal with streaming logs/data
1.9.2016 2
Solution: Apache Flume
- Introduced in Cloudera's CDH3 distribution
- versions 0.x: flume, 1.x: flume-ng
Overview
- Stream data (events, not files) from clients to sinks
- Clients: files, syslog, avro, …
- Sinks: HDFS files, HBase, …
- Configurable reliability levels
– Best effort: “Fast and loose”
– Guaranteed delivery: “Deliver no matter what”
- Configurable routing / topology
Architecture
Component | Function
Agent     | The JVM process running Flume. One per machine. Runs many sources and sinks.
Client    | Produces data in the form of events. Runs in a separate thread.
Sink      | Receives events from a channel. Runs in a separate thread.
Channel   | Connects sources to sinks (like a queue). Implements the reliability semantics.
Event     | A single datum: a log record, an Avro object, etc. Normally around ~4 KB.
Events
- Payload of the data is called an event
– composed of zero or more headers and a body
- Headers are key/value pairs
– used for making routing decisions or
– to carry other structured information
Channels
- Provides a buffer for in-flight events
– after they are read from sources
– until they can be written to sinks in the data processing pipelines
- Two (three) primary types are
– a memory-backed/nondurable channel
– a local-filesystem-backed/durable channel
– (hybrid)
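A minimal sketch of configuring each primary channel type (the channel names c1/c2 and the directory paths are illustrative, not from the slides):

```properties
# memory channel: fast, but in-flight events are lost if the agent dies
agent.channels.c1.type=memory
agent.channels.c1.capacity=10000

# file channel: durable, events survive an agent restart
agent.channels.c2.type=file
agent.channels.c2.checkpointDir=/var/flume/checkpoint
agent.channels.c2.dataDirs=/var/flume/data
```

The memory channel trades durability for throughput; the file channel persists events to local disk at the cost of extra I/O.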
Channels
- The write rate of the sink should be faster than the ingest rate of the sources
– otherwise the channel fills up, and the resulting ChannelException might lead to data loss
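The point at which a full channel starts rejecting puts is governed by its capacity settings; a hedged sketch (values illustrative):

```properties
# when sinks fall behind and the channel holds `capacity` events,
# further source puts fail with ChannelException
agent.channels.c1.type=memory
agent.channels.c1.capacity=10000
# max events per transaction between a source/sink and the channel
agent.channels.c1.transactionCapacity=1000
```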
Interceptors
- An interceptor is a point in the data flow where events can be inspected and altered
- Zero or more interceptors can be chained after a source creates an event
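As a sketch, chaining two of Flume's built-in interceptors on a source (source name s1 and interceptor names i1/i2 are illustrative):

```properties
# interceptors run in the order listed
agent.sources.s1.interceptors=i1 i2
# adds a "timestamp" header with the event's creation time
agent.sources.s1.interceptors.i1.type=timestamp
# adds a "host" header with the agent's hostname
agent.sources.s1.interceptors.i2.type=host
```

Headers added here are what channel selectors later use to make routing decisions.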
Channel Selectors
- Responsible for how data moves from a source to one or more channels
- Flume comes with two selectors
– replicating channel selector (the default) puts a copy of the event into each channel
– multiplexing channel selector writes to different channels depending on headers
- Combined with interceptors, this forms the foundation for routing
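A sketch of the multiplexing selector routing on a header (the header name datacenter and channel names c1/c2 are illustrative):

```properties
agent.sources.s1.selector.type=multiplexing
# route by the value of the "datacenter" header
agent.sources.s1.selector.header=datacenter
agent.sources.s1.selector.mapping.east=c1
agent.sources.s1.selector.mapping.west=c2
# events with no matching header value go here
agent.sources.s1.selector.default=c1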
Sinks
- Flume supports a set of sinks
– HDFS, ElasticSearch, Solr, HBase, IRC, MongoDB, Cassandra, RabbitMQ, Redis, …
- The HDFS sink continuously
– opens a file in HDFS,
– streams data into it,
– at some point closes that file,
– and starts a new one

agent.sinks.k1.type=hdfs
agent.sinks.k1.hdfs.path=/path/in/hdfs
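When the file is closed and a new one started can be tuned via the sink's roll properties; a hedged sketch (values illustrative):

```properties
agent.sinks.k1.type=hdfs
agent.sinks.k1.hdfs.path=/path/in/hdfs
# roll (close and reopen) every 300 seconds or 128 MB, whichever first
agent.sinks.k1.hdfs.rollInterval=300
agent.sinks.k1.hdfs.rollSize=134217728
# 0 disables event-count-based rolling
agent.sinks.k1.hdfs.rollCount=0
```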
Sources
- A Flume source consumes events delivered to it by an external source
– such as a web server
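For example, ingesting a web server's log via the exec source (the log path and names s1/c1 are illustrative):

```properties
# run an external command and turn each output line into an event
agent.sources.s1.type=exec
agent.sources.s1.command=tail -F /var/log/nginx/access.log
agent.sources.s1.channels=c1
```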
Tiered Collection
- Send events from agents to another tier of agents to aggregate
- Use an Avro sink (really just a client) to send events to an Avro source (really just a server) on another machine
- Failover supported
- Load balancing (soon)
- Transactions guarantee handoff
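A sketch of wiring two tiers together (agent names, hostname, and port are illustrative):

```properties
# tier 1: Avro sink pointing at the collector machine
agent1.sinks.k1.type=avro
agent1.sinks.k1.hostname=collector.example.com
agent1.sinks.k1.port=4141

# tier 2: Avro source listening on the collector machine
agent2.sources.s1.type=avro
agent2.sources.s1.bind=0.0.0.0
agent2.sources.s1.port=4141
```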
Tiered Collection Handoff
- Agent 1: Tx begin
- Agent 1: Channel take event
- Agent 1: Sink send
- Agent 2: Tx begin
- Agent 2: Channel put
- Agent 2: Tx commit, respond OK
- Agent 1: Tx commit (or rollback)
Tiered Data Collection
Apache Flume
- A source writes events to one or more channels
- A channel is the holding area as events are passed from a source to a sink
- A sink receives events from one channel only
- An agent can have many channels
Flume Configuration File
- Simple Java property file of key/value pairs
- Several agents can be configured in a single file
– agents are identified by an agent identifier (called a name)
- Each agent is configured, starting with three parameters:

agent.sources=<list of sources>
agent.channels=<list of channels>
agent.sinks=<list of sinks>
- Each source, channel, and sink has a unique name within the context of that agent
– Prefix for a channel named access:
agent.channels.access
- Each item has a type
– E.g. the in-memory channel type is memory:
agent.channels.access.type=memory
Hello, World!
agent.sources=s1
agent.channels=c1
agent.sinks=k1

agent.sources.s1.type = spooldir
agent.sources.s1.spoolDir = /etc/spool
…
agent.sinks.k1.type = hdfs
agent.sinks.k1.hdfs.path = hdfs://localhost:9001/user/hduser/log-data
…
Hello, World!
- Config has one agent (called agent) with
– a source named s1
– a channel named c1
– a sink named k1
- The s1 source's type is spooldir
– Files appearing in /etc/spool are ingested
- The type of the sink named k1 is hdfs
– writes data files to log-data
Command Line Usage
% flume-ng help
Usage: /usr/local/flume/apache-flume-1.6.0-bin/bin/flume-ng <command> [options]...

commands:
  help          display this help text
  agent         run a Flume agent
  avro-client   run an avro Flume client
  version       show Flume version info
…
Command Line Usage
- The agent command requires two parameters
– a configuration file to use and
– the agent name
- Example
% flume-ng agent -n agent -f myConf.conf …
- Test
% cp log-data/* /etc/spool
Source Code
public class MySource implements PollableSource {
  public Status process() {
    // Do something to create an Event.
    Event e = EventBuilder.withBody(…).build();
    // A channel instance is injected by Flume.
    Transaction tx = channel.getTransaction();
    tx.begin();
    try {
      channel.put(e);
      tx.commit();
    } catch (ChannelException ex) {
      tx.rollback();
      return Status.BACKOFF;
    } finally {
      tx.close();
    }
    return Status.READY;
  }
}
Sink Code
public class MySink implements Sink {
  public Status process() {
    Transaction tx = channel.getTransaction();
    tx.begin();
    try {
      Event e = channel.take();
      if (e != null) {
        // …
        tx.commit();
      } else {
        // nothing to take: commit the empty transaction and back off
        tx.commit();
        return Status.BACKOFF;
      }
    } catch (ChannelException ex) {
      tx.rollback();
      return Status.BACKOFF;
    } finally {
      tx.close();
    }
    return Status.READY;
  }
}