Migrating from Oracle to Espresso David Max Senior Software - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Migrating from Oracle to Espresso

David Max

Senior Software Engineer LinkedIn

slide-2
SLIDE 2

About LinkedIn New York Engineering

  • Located in Empire State Building
  • Approximately 100 engineers and 1000 employees total
  • Multiple teams: front end, back end, and data science

New York Engineering

slide-3
SLIDE 3

About Me

  • Software Engineer at LinkedIn NYC since 2015
  • Content Ingestion team
  • Office Hours – Thursday 11:30-12:00

David Max

Senior Software Engineer LinkedIn

www.linkedin.com/in/davidpmax/

slide-4
SLIDE 4

What is Content Ingestion?

Content Ingestion Babylonia

slide-5
SLIDE 5

Content Ingestion

Babylonia

slide-6
SLIDE 6

Content Ingestion

Babylonia

slide-7
SLIDE 7

Content Ingestion

Babylonia

Video: "SATURN 2017 Keynote: Software is Details" (https://www.youtube.com/watch?v=MS3c9hz0bRg)

slide-8
SLIDE 8

Content Ingestion

Babylonia

slide-9
SLIDE 9

What is Content Ingestion?

Content Ingestion

Babylonia

  • Extracts metadata from web pages
  • Source of Truth for 3rd party content
  • Also contains metadata for some public 1st party content
  • Used by LinkedIn services for sharing, decorating, and embedding content
  • Data also feeds into content understanding and relevance models

slide-10
SLIDE 10

Babylonia Datasets

Database HDFS ETL

Content Ingestion

Babylonia

Data Change Events

slide-11
SLIDE 11

Downstream and Upstream Datasets

Database HDFS ETL

Near Line Offline

Data Change Events Content Ingestion

Babylonia

slide-12
SLIDE 12

Babylonia use of Oracle (before migration)

  • Schema – Metadata extracted from each URL stored in individual rows
  • Client – Babylonia the main (but not only) client to directly execute queries on Oracle DB
  • Rest.li – Most online interaction with dataset in Oracle via Babylonia’s Rest.li API
  • RDBMS – Relational Database Management System
  • Databus – Platform for streaming data change events to near line consumers
  • Offline – ETL to HDFS for offline consumers

slide-13
SLIDE 13

What is Espresso?

Espresso is LinkedIn’s strategic distributed, fault-tolerant NoSQL database that powers many of LinkedIn’s services

  • ~100 clusters in use*
  • ~420TB of SoT data*
  • ~2 million qps at peak load*

* as of August 1, 2017

slide-14
SLIDE 14

What is Espresso?

  • Document – A table is a container for documents of the same schema (defined in Avro)
  • Keys – Documents indexed by key fields, which are defined in the table schema
  • NoSQL – Non relational
  • Distributed – A single database can be distributed over a cluster of machines
  • Scalable – Able to scale clusters horizontally by adding more nodes
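
The document model above can be pictured with a small in-memory sketch. This is only an illustration, not Espresso's API: the `DocumentTable` class, the `UrlMetadata` schema, and the `urlHash` key field are all hypothetical names chosen for the example.

```python
import json

# Hypothetical Avro record schema for one document type (names are illustrative).
URL_METADATA_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "UrlMetadata",
  "namespace": "com.example.ingestion",
  "fields": [
    {"name": "title", "type": "string"},
    {"name": "description", "type": ["null", "string"], "default": null}
  ]
}
""")


class DocumentTable:
    """Toy in-memory table: documents of one schema, looked up by key fields."""

    def __init__(self, key_fields, document_schema):
        self.key_fields = tuple(key_fields)
        self.field_names = {f["name"] for f in document_schema["fields"]}
        self._documents = {}

    def _key(self, key):
        # The key must supply exactly the key fields declared for the table.
        if set(key) != set(self.key_fields):
            raise KeyError(f"expected key fields {self.key_fields}, got {sorted(key)}")
        return tuple(key[name] for name in self.key_fields)

    def put(self, key, document):
        # Reject fields the document schema does not declare.
        unknown = set(document) - self.field_names
        if unknown:
            raise ValueError(f"fields not in schema: {sorted(unknown)}")
        self._documents[self._key(key)] = dict(document)

    def get(self, key):
        # Returns None when no document is stored under the key.
        return self._documents.get(self._key(key))
```

The key fields live with the table rather than inside the document schema, mirroring the split between table schema and document schema described on the slide.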

slide-15
SLIDE 15

Why Migrate?

  • Integration – Support for Espresso integrated with other tools and systems at LinkedIn
  • Rest.li – Espresso’s API is based on Rest.li, which makes it easier to treat Espresso endpoints like other LinkedIn Rest.li endpoints
  • Schema Evolution – Supported with zero downtime and no coordination with DBA teams
  • Maintenance – Babylonia’s Oracle tables required periodic jobs to be run that involved downtime for each server
  • Cost – Oracle more expensive to run
  • Strategy – Espresso is the preferred platform at LinkedIn for data of this type
  • Support – Espresso team part of LinkedIn

slide-16
SLIDE 16

Data Formats (Oracle)

Oracle Database HDFS ETL

Near Line Offline

Oracle Databus Events Content Ingestion

Babylonia

Rest.li Endpoints

Oracle Row Pegasus Object Pegasus Data Oracle Row Oracle Row Oracle Row

  • Complex transformation between Oracle format and Pegasus format

slide-17
SLIDE 17

Pegasus and Avro

Pegasus Schema Avro Schema

Java Objects Java Objects

  • Both can be used to generate Java objects with very similar interfaces
  • Pegasus schema can be used to auto-generate the Avro schema
  • Pegasus and Avro schema definitions are very similar
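
To show how close the two schema formats are, here is a minimal sketch of a Pegasus-to-Avro translation for flat records. It is not Rest.li's real translator, which handles many more cases. It covers only one well-known difference: a Pegasus field marked `"optional": true` becomes an Avro union with `null` and a `null` default. The function name `pegasus_to_avro` and the sample record are hypothetical.

```python
def pegasus_to_avro(pegasus_record):
    """Translate a flat Pegasus-style record schema (as a dict) to an Avro-style one."""
    avro_fields = []
    for field in pegasus_record["fields"]:
        avro_field = {"name": field["name"], "type": field["type"]}
        if field.get("optional", False):
            # Avro has no "optional" flag: use a union with null, defaulting to null.
            avro_field["type"] = ["null", field["type"]]
            avro_field["default"] = None
        elif "default" in field:
            avro_field["default"] = field["default"]
        avro_fields.append(avro_field)

    avro_record = {
        "type": "record",
        "name": pegasus_record["name"],
        "fields": avro_fields,
    }
    if "namespace" in pegasus_record:
        avro_record["namespace"] = pegasus_record["namespace"]
    return avro_record


# Hypothetical Pegasus record used only for illustration.
PEGASUS_URL_METADATA = {
    "type": "record",
    "name": "UrlMetadata",
    "namespace": "com.example.ingestion",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "description", "type": "string", "optional": True},
    ],
}
```

Everything except the optional field passes through unchanged, which is the sense in which the two definitions are "very similar".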

slide-18
SLIDE 18

Data Formats (Espresso)

Espresso Database HDFS ETL

Near Line Offline

Espresso Brooklin Events Content Ingestion

Babylonia

Rest.li Endpoints

Espresso Avro Pegasus Object Pegasus Data Espresso Avro Espresso Avro Espresso Avro

  • Simple transformation between Avro format and Pegasus format

slide-19
SLIDE 19

Why Migrate? Schema Evolution

Oracle

  • ALTER TABLE
  • Not tied to code deployment – need to coordinate with DBAs
  • Schema change involves server downtime
  • In practice, developers go to great lengths to avoid the hassle
  • Schema accumulates tech debt

Espresso

  • Document schema auto-registration
  • Schema changes are registered automatically as part of the Babylonia deployment process
  • Backwards compatibility is enforced – existing data does not need to be transformed
  • Avro schema more natural fit with Rest.li Pegasus schema
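
The slides do not spell out how backwards compatibility is checked. As a rough sketch, the usual Avro rule is that a new schema can read old data when every added field has a default. The checker below applies that rule to flat record schemas and, as a simplification, treats any type change as incompatible. The name `is_backward_compatible` is hypothetical.

```python
def is_backward_compatible(old_schema, new_schema):
    """Return (ok, problems): can new_schema read documents written with old_schema?"""
    old_fields = {f["name"]: f for f in old_schema["fields"]}
    problems = []
    for field in new_schema["fields"]:
        name = field["name"]
        if name not in old_fields:
            # Old documents lack this field, so the reader needs a default for it.
            if "default" not in field:
                problems.append(f"new field '{name}' has no default")
        elif field["type"] != old_fields[name]["type"]:
            # Simplification: real Avro allows some type promotions.
            problems.append(f"field '{name}' changed type")
    # Fields dropped from the new schema are simply ignored when reading old data.
    return (not problems, problems)


V1 = {"type": "record", "name": "UrlMetadata",
      "fields": [{"name": "title", "type": "string"}]}
V2_OK = {"type": "record", "name": "UrlMetadata",
         "fields": [{"name": "title", "type": "string"},
                    {"name": "language", "type": ["null", "string"], "default": None}]}
V2_BAD = {"type": "record", "name": "UrlMetadata",
          "fields": [{"name": "title", "type": "string"},
                     {"name": "language", "type": "string"}]}
```

With a check like this run at registration time, a deployment carrying `V2_BAD` would be rejected before it could touch stored documents, while `V2_OK` needs no data rewrite.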

slide-20
SLIDE 20

Goals for Migration Process

  • Zero downtime
  • Transparent to Rest.li clients
  • Give offline and nearline consumers time to migrate
  • Validate each step
  • Mirroring in real time

slide-21
SLIDE 21

Pre-Migration State of Babylonia

Oracle Database HDFS ETL

Near Line Offline

Oracle Databus Events Content Ingestion

Babylonia

slide-22
SLIDE 22

Pre-Migration State of Babylonia

Oracle Database

Oracle Databus Events Rest.li Endpoints

Other Services

Rest.li Calls

slide-23
SLIDE 23

Pre-Migration Cleanup

Oracle Database

Oracle Databus Events Rest.li Endpoints

Other Services

Rest.li Calls

  • Identify code that is tightly-coupled to the database
  • Decide which code should be reimplemented for Espresso, and which code should be decoupled or eliminated
  • Reduce number of code paths to migrate

The easiest lines of code to migrate are the lines of code that don’t exist

slide-24
SLIDE 24

Bootstrap Espresso Database

Oracle Database HDFS ETL

Offline Convert Job

Espresso Database

Espresso Bulk Loader Avro Data File
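
The slide shows the bootstrap pipeline only as a diagram. A minimal sketch of the offline convert step might look like the following, assuming the ETL rows arrive as dictionaries. The column names (`URL_HASH`, `URL`, `TITLE`, `IS_PUBLIC`) and document fields are hypothetical, and JSON lines stand in for the Avro data file so the example needs only the standard library.

```python
import json


def row_to_document(row):
    """Map one relational row (column -> value) to a (key, document) pair."""
    key = {"urlHash": row["URL_HASH"]}
    document = {
        "url": row["URL"],
        "title": row.get("TITLE"),                # NULL column becomes null in the document
        "isPublic": row.get("IS_PUBLIC") == "Y",  # 'Y'/'N' flag column becomes a boolean
    }
    return key, document


def convert_rows(rows):
    """Yield one serialized record per input row, ready to hand to a bulk loader."""
    for row in rows:
        key, document = row_to_document(row)
        yield json.dumps({"key": key, "document": document}, sort_keys=True)
```

Keeping the per-row mapping in one small function makes it easy to reuse the same conversion later for change events.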

slide-25
SLIDE 25

Bootstrap Espresso Database

Oracle Database HDFS ETL Espresso Database

slide-26
SLIDE 26

Shadow Read Validation

Databus Listener, Shadow Read Validation

Oracle Database

Oracle Databus Events

Espresso Database Databus Listener
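
Shadow read validation is shown here only as a diagram. One way to sketch the idea: every read is served from the current source of truth (Oracle at this stage), the same key is also read from the new store, and any difference is counted and logged without affecting the caller. The `ShadowReader` class and its dict-like stores are hypothetical stand-ins.

```python
import logging

logger = logging.getLogger("shadow_read")


class ShadowReader:
    """Serve reads from the primary store while validating a shadow store."""

    def __init__(self, primary, shadow):
        self.primary = primary   # current source of truth
        self.shadow = shadow     # store being validated
        self.reads = 0
        self.mismatches = 0

    def get(self, key):
        expected = self.primary.get(key)
        self.reads += 1
        try:
            actual = self.shadow.get(key)
        except Exception:
            # A failing shadow read must never break the real request.
            logger.exception("shadow read failed for key %r", key)
            self.mismatches += 1
            return expected
        if actual != expected:
            self.mismatches += 1
            logger.warning("shadow mismatch for key %r", key)
        # Callers always get the primary store's answer.
        return expected
```

Tracking `mismatches` against `reads` gives a simple signal for when the new store is trustworthy enough to move to the next step.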

slide-27
SLIDE 27

Direct Writes to Espresso

Oracle Database

Oracle Databus Events

Espresso Database Databus Listener

Shadow Read Validation Direct Write

slide-28
SLIDE 28

Resolving Write Conflicts

Oracle Databus Events

Espresso Database Databus Listener

Direct Write

  • Dual Write Conflict – Databus Listener and Babylonia updating same record
  • Migration Control – optional field added to schema indicating which process wrote the record: Bulk Loader, Databus listener, or Babylonia
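
The slide names the migration control field but not the exact resolution rule. The sketch below assumes one plausible rule: the Databus listener does not overwrite a record whose last writer was Babylonia, since that direct write is taken to be at least as fresh as the mirrored event. The field name `migrationControl`, the enum values, and both functions are hypothetical.

```python
from enum import Enum


class Writer(Enum):
    BULK_LOADER = "BULK_LOADER"
    DATABUS_LISTENER = "DATABUS_LISTENER"
    BABYLONIA = "BABYLONIA"


def direct_write(store, key, document):
    """Babylonia writing straight to the new store, tagging itself as the writer."""
    store[key] = {**document, "migrationControl": Writer.BABYLONIA.value}


def apply_databus_event(store, key, document):
    """Mirror a change event into the new store unless a direct write owns the record.

    Returns True if the event was applied, False if it was skipped.
    """
    existing = store.get(key)
    if existing is not None and existing.get("migrationControl") == Writer.BABYLONIA.value:
        # Assumed rule: never clobber a record last written directly by Babylonia.
        return False
    store[key] = {**document, "migrationControl": Writer.DATABUS_LISTENER.value}
    return True
```

Because the field is optional, records written before it existed simply have no value for it, and this sketch treats them like bulk-loaded or listener-written records.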

slide-29
SLIDE 29

Espresso New SoT

Oracle Database

Oracle Databus Events

Espresso Database

Direct Read/Write Dual Writes Espresso Brooklin Events Deprecated

slide-30
SLIDE 30

Oracle Turnoff

Espresso Database

Direct Read/Write Espresso Brooklin Events

slide-31
SLIDE 31

Thank you