Apache Hadoop Ingestion & Dispersal Framework


SLIDE 1

Apache Hadoop Ingestion & Dispersal Framework

Danny Chen (dannyc@uber.com), Omkar Joshi (omkar@uber.com), Eric Sayle (esayle@uber.com)
Uber Hadoop Platform Team
Strata NY 2018, September 12, 2018

SLIDE 2

Agenda

  • Mission
  • Overview
  • Need for a Hadoop ingestion & dispersal framework
  • Deep Dive
    ○ High-Level Architecture
    ○ Abstractions and Building Blocks
  • Configuration & Monitoring of Jobs
  • Completeness & Data Deletion
  • Learnings
SLIDE 3

Uber Apache Hadoop Platform Team Mission

Build products to support reliable, scalable, easy-to-use, compliant, and efficient data transfer (both ingestion & dispersal) as well as data storage leveraging the Hadoop ecosystem.

SLIDE 4

Overview

  • Any Source to Any Sink
  • Ease of onboarding
  • Business impact & importance of data & data store location
  • Suite of Hadoop ecosystem tools
SLIDE 5

Introducing Marmaray

SLIDE 6

Open sourced in September 2018
GitHub: https://github.com/uber/marmaray
Blog post: https://eng.uber.com/marmaray-hadoop-ingestion-open-source/

SLIDE 7

Marmaray (Ingestion): Why?

  • Raw data needed in Hadoop data lake
  • Ingested raw data -> Derived Datasets
  • Reliable and correct schematized data
  • Maintenance of multiple data pipelines
SLIDE 8

Marmaray (Dispersal): Why?

  • Derived datasets in Hive
  • Need arose to serve live traffic
  • Duplicate and ad hoc dispersal pipelines
  • Future dispersal needs
SLIDE 9

Marmaray: Main Features

  • Released to production at the end of 2017
  • Automated schema management
  • Integration w/ monitoring & alerting systems
  • Fully integrated with workflow orchestration tool
  • Extensible architecture
  • Open sourced
SLIDE 10

Marmaray: Uber Eats Use Case

SLIDE 11

Hadoop Data Ecosystem at Uber

SLIDE 12

Hadoop Data Ecosystem at Uber

[Diagram: Marmaray ingestion feeds the Hadoop data lake and Marmaray dispersal serves data out of it, connecting sources such as Schemaless with analytical processing.]

SLIDE 13

High-Level Architecture & Technical Deep Dive

SLIDE 14

High-Level Architecture

[Diagram: data flows from the Input Storage System through a Source Connector, a Work Unit Calculator, and a chain of converters (Converter 1, Converter 2, ...) into a Sink Connector and the Output Storage System. Supporting components: Schema Service, Metadata Manager (checkpoint store), M3 Monitoring & Alerting System, Error Tables, and the Datafeed Config Store.]

SLIDE 15

[High-level architecture diagram repeated from Slide 14.]

SLIDE 16

Schema Service

[Diagram: the Schema Service serves schemas by name & version. A Schema Service Reader (decoder) turns binary data into a GenericRecord; a Schema Service Writer (encoder) turns a GenericRecord back into binary data.]
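As a minimal sketch of this decode/encode path, assuming Avro as the record format and a hypothetical SchemaService lookup interface (illustrative names, not Marmaray's actual API):

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.*;
    import org.apache.avro.io.*;

    interface SchemaService {                        // hypothetical lookup API
      Schema getSchema(String name, int version);
    }

    class SchemaServiceReader {
      private final Schema schema;
      SchemaServiceReader(SchemaService svc, String name, int version) {
        this.schema = svc.getSchema(name, version);  // get schema by name & version
      }
      GenericRecord decode(byte[] bytes) throws Exception {
        DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        return reader.read(null, decoder);           // binary data -> GenericRecord
      }
    }

    class SchemaServiceWriter {
      private final Schema schema;
      SchemaServiceWriter(SchemaService svc, String name, int version) {
        this.schema = svc.getSchema(name, version);
      }
      byte[] encode(GenericRecord record) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();                             // GenericRecord -> binary data
        return out.toByteArray();
      }
    }

Centralizing schema lookup this way is what keeps pipelines schema-agnostic: readers and writers are constructed per (name, version) pair instead of being hand-written per topic.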

SLIDE 17

[High-level architecture diagram repeated from Slide 14, here with a Topic Config Store.]

SLIDE 18

Metadata Manager

[Diagram: the Metadata Manager keeps an in-memory copy of job metadata backed by persistent storage (e.g. HDFS). init() is called on job start to load state; the different Job DAG components call set(key, value) and get(key) zero or more times; persist() is called after the job finishes.]
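A minimal sketch of this checkpoint pattern, assuming Hadoop's FileSystem API and Java serialization for the persisted map (class and method names mirror the slide, not Marmaray's actual implementation):

    import java.io.*;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class MetadataManager {
      private final Path path;                  // e.g. hdfs://.../checkpoints/topic1
      private final FileSystem fs;
      private Map<String, String> state = new HashMap<>();

      MetadataManager(Configuration conf, String location) throws IOException {
        this.path = new Path(location);
        this.fs = FileSystem.get(conf);
      }

      @SuppressWarnings("unchecked")
      void init() throws Exception {            // called on job start
        if (fs.exists(path)) {
          try (ObjectInputStream in = new ObjectInputStream(fs.open(path))) {
            state = (Map<String, String>) in.readObject();
          }
        }
      }

      void set(String key, String value) { state.put(key, value); }  // 0+ times
      String get(String key) { return state.get(key); }              // 0+ times

      void persist() throws IOException {       // called after job finish
        try (ObjectOutputStream out = new ObjectOutputStream(fs.create(path, true))) {
          out.writeObject(state);
        }
      }
    }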

SLIDE 19

Fork Operator - Why is it needed?

[Diagram: input records are split into schema-conforming records and error records.]

  • Avoid reprocessing input records
  • Avoid re-reading input records (or, in Spark, re-executing input transformations)

SLIDE 20

Fork Operator & Fork Function

[Diagram: a fork function tags each input record as success (schema-conforming) or failure (error), producing tagged records (r1, S/F), (r2, S/F), ..., (rx, S/F). The tagged set is persisted using Spark's disk/memory persistence level, and a success filter function and a failure filter function then split it without re-reading the input.]
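A minimal sketch of the fork pattern against Spark's Java API, with a hypothetical validate() predicate standing in for the real fork function:

    import org.apache.avro.generic.GenericRecord;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.storage.StorageLevel;
    import scala.Tuple2;

    class ForkExample {
      // Hypothetical schema check standing in for the real fork function.
      static boolean validate(GenericRecord r) { return r.getSchema() != null; }

      static void fork(JavaRDD<GenericRecord> input) {
        // Tag every record exactly once, then persist the tagged RDD so the
        // two filters below do not re-read the source or re-run upstream
        // transformations.
        JavaRDD<Tuple2<Boolean, GenericRecord>> tagged =
            input.map(r -> new Tuple2<>(validate(r), r));
        tagged.persist(StorageLevel.MEMORY_AND_DISK());

        JavaRDD<GenericRecord> success = tagged.filter(t -> t._1()).map(Tuple2::_2);
        JavaRDD<GenericRecord> errors  = tagged.filter(t -> !t._1()).map(Tuple2::_2);
        // success -> sink connector, errors -> error tables
      }
    }

Persisting the tagged RDD is the key step: both filters read the cached tagged data, so the input is processed exactly once.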

SLIDE 21

Easy to Add Support for new Source & Sink

[Diagram: sources and sinks (Kafka, Hive, S3, Cassandra, a new source) all plug into a data lake that uses GenericRecord as the common in-memory format.]
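Because everything meets in the middle as GenericRecord, adding a system means implementing one narrow connector, not one pipeline per source-sink pairing. A hypothetical sketch of that contract (not Marmaray's actual interfaces):

    import org.apache.avro.generic.GenericRecord;
    import org.apache.spark.api.java.JavaRDD;

    // Every source produces GenericRecords; every sink consumes them.
    interface SourceConnector {
      JavaRDD<GenericRecord> read(WorkUnit workUnit);
    }

    interface SinkConnector {
      void write(JavaRDD<GenericRecord> records);
    }

    // Marker for the slice of work computed by the Work Unit Calculator
    // (e.g. a Kafka offset range); illustrative only.
    interface WorkUnit {}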

SLIDE 22

Support for Writing into Multiple Systems

[Diagram: a single Kafka source is ingested once into the GenericRecord data lake and written out to multiple sinks, e.g. Hive Table 1 and Hive Table 2.]

SLIDE 23

JobDag & JobDagActions

[Diagram: after a JobDag run, JobDagActions fire, e.g. report metrics for monitoring and register the table in Hive.]
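A minimal sketch of this post-run hook pattern (names are illustrative):

    // Runs after a JobDag finishes; the success flag lets each action decide
    // what to do.
    interface JobDagAction {
      void execute(boolean dagSucceeded);
    }

    class ReportMetricsAction implements JobDagAction {
      public void execute(boolean dagSucceeded) {
        // emit success/failure counters to the monitoring system (e.g. M3)
      }
    }

    class RegisterHiveTableAction implements JobDagAction {
      public void execute(boolean dagSucceeded) {
        if (dagSucceeded) {
          // create/update the Hive table and partitions for the new data
        }
      }
    }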

SLIDE 24

Need for running multiple JobDags together

  • Frequency of data arrival
  • Number of messages
  • Avg record size & complexity of schema
  • A Spark job has a driver + executors (1 or more)
  • Not an efficient model for handling spikes
  • Too many topics to ingest (2,000+)
SLIDE 25

JobManager

  • Single Spark job runs ingestion for 300+ topics
  • Executes multiple JobDAGs
  • Manages execution ordering for multiple JobDAGs
  • Manages the shared Spark context
  • Enables job- and tier-level locking

[Diagram: one JobManager owns a single Spark job; JobDAGs (ingesting kafka-topic 1 through kafka-topic N) run inside it, with a waiting queue for pending JobDAGs.]
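A minimal sketch of the scheduling idea, assuming a fixed thread pool whose internal queue plays the role of the waiting queue (illustrative, not Marmaray's actual JobManager):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.function.Consumer;
    import org.apache.spark.api.java.JavaSparkContext;

    class JobManager {
      private final JavaSparkContext sharedContext; // one Spark app for all DAGs
      private final ExecutorService pool;           // bounds concurrent DAGs; its
                                                    // internal queue is the waiting Q

      JobManager(JavaSparkContext context, int maxConcurrentDags) {
        this.sharedContext = context;
        this.pool = Executors.newFixedThreadPool(maxConcurrentDags);
      }

      // Each JobDAG (e.g. "ingest kafka-topic N") runs as a task inside the
      // shared Spark application; excess DAGs wait in the pool's queue.
      // Job- and tier-level locking is omitted for brevity.
      void submit(Consumer<JavaSparkContext> jobDag) {
        pool.submit(() -> jobDag.accept(sharedContext));
      }

      void awaitCompletion() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
      }
    }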

SLIDE 26

Completeness

[Diagram: records at the source (Kafka) and at the sink (Hive) are counted in 10-minute buckets, and the latest buckets on both sides are compared.]

SLIDE 27

Completeness (contd.)

  • Why not run queries on the source and sink datasets periodically?
    ○ Possible for very small datasets
    ○ Won't work for billions of records; far too expensive
  • Bucketizing records
    ○ Create time-based buckets, say one for every 2 or 10 minutes
    ○ Count records at source and sink during every run
    ○ Does this give a 100% guarantee? No, but with high probability it is close (see the sketch below)
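A minimal sketch of the bucketized comparison, assuming each record carries an event timestamp in epoch milliseconds (plain Java for clarity; at scale this would be a distributed aggregation):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class CompletenessCheck {
      static final long BUCKET_MS = 10 * 60 * 1000L;   // 10-minute buckets

      // Count records per time bucket: bucketId = timestamp / BUCKET_MS.
      static Map<Long, Long> bucketCounts(List<Long> timestampsMs) {
        Map<Long, Long> counts = new HashMap<>();
        for (long ts : timestampsMs) {
          counts.merge(ts / BUCKET_MS, 1L, Long::sum);
        }
        return counts;
      }

      // A bucket is complete when the sink saw as many records as the source.
      static boolean isComplete(Map<Long, Long> source, Map<Long, Long> sink) {
        return source.entrySet().stream().allMatch(
            e -> e.getValue().equals(sink.getOrDefault(e.getKey(), 0L)));
      }
    }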

SLIDE 28

Completeness - High level approach

[Diagram: Marmaray reads from Kafka through a source converter and writes to Hoodie (Hive) through a sink converter, with an error table on the side. Counts are tracked at each stage: Input Records (IR), Input Success Records (ISR), Input Error Records (IER), Output Error Records (OER), and Output Records (OR). Presumably these reconcile as an accounting identity: every input record ends up as an output record or in an error bucket, i.e. IR = OR + IER + OER.]

SLIDE 29

The old Hadoop way of storing Kafka data

[Diagram: a Kafka topic laid out as date partitions (2014, 2015, ..., 2018, with month subdirectories). Older partitions hold stitched parquet files (~4 GB each, ~400 files per partition); the latest date partition holds non-stitched parquet files (~40 MB each, 20-40K files per partition).]

SLIDE 30

Data Deletion (Kafka)

  • The old architecture was designed to be append/read-only
  • No indexes
    ○ Need to scan an entire partition to find out whether a record is present
  • The only way to update is to rewrite the entire partition
    ○ Rewriting an entire partition to delete a handful of records is extremely expensive
  • GDPR requires all data to be cleaned up once a user requests deletion
  • This is a big architectural change, and many companies are struggling to solve it

SLIDE 31

Marmaray + Hudi (Hoodie) to the rescue

SLIDE 32

Hoodie Data layout

[Diagram: the same Kafka topic tree (2014, 2015, 2018, with month subdirectories), plus a .hoodie metadata directory holding the commit timeline: ts1.commit, ts2.commit, ts3.commit. Data files are versioned per file group: f1_ts1.parquet, f2_ts1.parquet, f3_ts1.parquet, f4_ts1.parquet, f5_ts2.parquet, f6_ts2.parquet, f7_ts2.parquet, f1_ts3.parquet, f8_ts3.parquet. An update to file group f1 at commit ts3 writes a new version f1_ts3.parquet instead of rewriting the partition.]
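The payoff is that readers only need the commit timeline to pick the newest version of each file group. A toy sketch of that resolution logic, assuming the fX_tsY.parquet naming from the slide (this models the idea, not Hudi's actual implementation):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    class LatestVersionResolver {
      // Given data files named <fileGroup>_<commit>.parquet and the set of
      // completed commits from .hoodie, keep the newest committed version of
      // each file group. For the slide's layout this yields f1_ts3 (not
      // f1_ts1), plus f2_ts1, f3_ts1, f4_ts1, f5_ts2, f6_ts2, f7_ts2, f8_ts3.
      static Map<String, String> latestFiles(List<String> files, Set<String> commits) {
        Map<String, String> latest = new HashMap<>();  // fileGroup -> file name
        for (String file : files) {
          String base = file.replace(".parquet", "");
          int split = base.lastIndexOf('_');
          String group = base.substring(0, split);
          String commit = base.substring(split + 1);
          if (!commits.contains(commit)) continue;     // ignore uncommitted files
          String current = latest.get(group);
          if (current == null || commit.compareTo(commitOf(current)) > 0) {
            latest.put(group, file);
          }
        }
        return latest;
      }

      private static String commitOf(String file) {
        String base = file.replace(".parquet", "");
        return base.substring(base.lastIndexOf('_') + 1);
      }
    }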

SLIDE 33

Configuration

common:
  hadoop:
    fs.defaultFS: "hdfs://namenode/"
hoodie:
  table_name: "mydb.table1"
  base_path: "/path/to/my.db/table1"
  metrics_prefix: "marmaray"
  enable_metrics: true
  parallelism: 64
kafka:
  conn:
    bootstrap.servers: "kafkanode1:9092,kafkanode2:9092"
    fetch.wait.max.ms: 1000
    socket.receive.buffer.bytes: 5242880
    fetch.message.max.bytes: 20971520
    auto.commit.enable: false
    fetch.min.bytes: 5242880
  source:
    topic_name: "topic1"
    max_messages: 1024
    read_parallelism: 64
error_table:
  enabled: true
  dest_path: "/path/to/my.db/table1/.error"
  date_partitioned: true

SLIDE 34

Monitoring & Alerting

SLIDE 35

Learnings

  • Spark
    ○ Off-heap memory usage of Spark causing YARN to kill our containers
    ○ External shuffle server overloading
  • Parquet
    ○ Better record compression with column alignment
  • Kafka
    ○ Be gentle when reading from Kafka brokers
  • Cassandra
    ○ Cassandra SSTable streaming has no throttling and no monitoring
    ○ No backfill for dispersal
SLIDE 36

External Acknowledgments

Apache Gobblin: https://gobblin.readthedocs.io/en/latest/

SLIDE 37

Other Relevant Talks

  • Your 5 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber (Wed 11:20am)
  • Hudi: Unifying storage and serving for batch and near-real-time analytics (Wed 5:25pm)

SLIDE 38

We are hiring!

Positions available in Seattle, Palo Alto & San Francisco
Email: hadoop-platform-jobs@uber.com

SLIDE 39

Useful links

  • https://github.com/uber/marmaray
  • https://eng.uber.com/marmaray-hadoop-ingestion-open-source/
  • https://github.com/uber/hudi
  • https://eng.uber.com/michelangelo/
  • https://eng.uber.com/m3/
SLIDE 40

Q & A

SLIDE 41

Follow our Facebook page: www.facebook.com/uberopensource