Migrating from Oracle to Espresso

Dec 10, 2022

SLIDE 1

Migrating from Oracle to Espresso

David Max

Senior Software Engineer LinkedIn

SLIDE 2

About LinkedIn New York Engineering

Located in Empire State Building
Approximately 100 engineers and 1000 employees total
Multiple teams: front end, back end, and data science

New York Engineering

SLIDE 3

About Me

Software Engineer at LinkedIn NYC since 2015
Content Ingestion team
Office Hours – Thursday 11:30-12:00

David Max

Senior Software Engineer LinkedIn

www.linkedin.com/in/davidpmax/

SLIDE 4

What is Content Ingestion?

Content Ingestion Babylonia

SLIDE 7

Content Ingestion

Babylonia

url: https://www.youtube.com/watch?v=MS3c9hz0bRg
title: "SATURN 2017 Keynote: Software is Details"
image: https://i.ytimg.com/vi/MS3c9hz0bRg/hqdefault.jpg?sqp=-oaymwEYCKgBEF5IVfKriqkDCwgBFQAAiEIYAXAB\u0026rs=AOn4CLClwjQlBmMeoRCePtHaThN-qXRHqg

SLIDE 9

What is Content Ingestion?

Content Ingestion

Babylonia

Extracts metadata from web pages
Source of Truth for 3rd party content
Also contains metadata for some public 1st party content
Used by LinkedIn services for sharing, decorating, and embedding content
Data also feeds into content understanding and relevance models
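
As a rough illustration of what "extracting metadata from a web page" can mean, the sketch below pulls Open Graph tags out of an HTML document and produces a record shaped like the url/title/image example on the earlier slide. The talk does not describe how Babylonia actually extracts metadata; the use of Open Graph tags and the names `OpenGraphParser` and `extract_metadata` are assumptions made for this example.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..."> values and the <title> text."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            prop = attrs.get("property") or ""
            content = attrs.get("content")
            if prop.startswith("og:") and content is not None:
                # "og:title" is stored under "title", "og:image" under "image", ...
                self.metadata[prop[3:]] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Use <title> only when no title has been recorded yet;
        # a later og:title still overrides it.
        if self._in_title and "title" not in self.metadata:
            self.metadata["title"] = data.strip()


def extract_metadata(url, html):
    """Return a flat metadata record for one page (hypothetical record shape)."""
    parser = OpenGraphParser()
    parser.feed(html)
    return {"url": url, **parser.metadata}
```

A real ingestion service would also need to fetch the page, follow redirects, and handle malformed markup, none of which is shown here.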

SLIDE 10

Babylonia Datasets

[Diagram: Content Ingestion (Babylonia) and its datasets: Database, HDFS (via ETL), and Data Change Events]

SLIDE 11

Downstream and Upstream Datasets

[Diagram: Content Ingestion (Babylonia) and its Database, with Near Line consumers fed by Data Change Events and Offline consumers fed by ETL to HDFS]

SLIDE 12

Babylonia use of Oracle (before migration)

Schema – Metadata extracted from each URL is stored in individual rows
Client – Babylonia is the main (but not only) client that directly executes queries on the Oracle DB
Rest.li – Most online interaction with the dataset in Oracle goes through Babylonia’s Rest.li API
RDBMS – Relational Database Management System
Databus – Platform for streaming data change events to near line consumers
Offline – ETL to HDFS for offline consumers

SLIDE 13

What is Espresso?

Espresso is LinkedIn’s strategic distributed, fault-tolerant NoSQL database that powers many of LinkedIn’s services

~100 clusters in use*
~420TB of SoT data*
~2 million qps at peak load*

* as of August 1, 2017

SLIDE 14

What is Espresso?

Document – A table is a container for documents of the same schema (defined in Avro)
Keys – Documents are indexed by key fields, which are defined in the table schema
NoSQL – Non-relational
Distributed – A single database can be distributed over a cluster of machines
Scalable – Able to scale clusters horizontally by adding more nodes
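
To make the document/key split concrete, here is a small in-memory sketch: an Avro-style record schema describes the document body, a separate table definition names the key fields, and documents are stored and fetched by key. The field names, the `keyFields` layout, and the `ToyTable` class are hypothetical and are not Espresso's actual schema format or client API.

```python
# Hypothetical Avro record schema for the document body.
DOCUMENT_SCHEMA = {
    "type": "record",
    "name": "IngestedContent",
    "namespace": "com.example.babylonia",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "image", "type": ["null", "string"], "default": None},
    ],
}

# Hypothetical table definition: key fields are declared apart from the body.
TABLE_SCHEMA = {"name": "IngestedContent", "keyFields": ["url"]}


class ToyTable:
    """In-memory stand-in for a table of same-schema documents indexed by key."""

    def __init__(self, table_schema, document_schema):
        self.key_fields = table_schema["keyFields"]
        self.field_names = {f["name"] for f in document_schema["fields"]}
        self.rows = {}

    def _key(self, key):
        if set(key) != set(self.key_fields):
            raise ValueError(f"key must contain exactly {self.key_fields}")
        return tuple(key[name] for name in self.key_fields)

    def put(self, key, document):
        unknown = set(document) - self.field_names
        if unknown:
            raise ValueError(f"fields not in schema: {sorted(unknown)}")
        self.rows[self._key(key)] = document

    def get(self, key):
        return self.rows.get(self._key(key))
```

Usage: `ToyTable(TABLE_SCHEMA, DOCUMENT_SCHEMA).put({"url": "https://example.com/"}, {"title": "Example"})` stores one document under its URL key. In a distributed store, the key is also what lets documents be spread across the nodes of a cluster.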

SLIDE 15

Why Migrate?

Integration – Support for Espresso is integrated with other tools and systems at LinkedIn
Rest.li – Espresso’s API is based on Rest.li, which makes it easier to treat Espresso endpoints like other LinkedIn Rest.li endpoints
Schema Evolution – Supported with zero downtime and no coordination with DBA teams
Maintenance – Babylonia’s Oracle tables required periodic jobs to be run that involved downtime for each server
Cost – Oracle is more expensive to run
Strategy – Espresso is the preferred platform at LinkedIn for data of this type

Support – Espresso team part of

SLIDE 16

Data Formats (Oracle)

[Diagram: Oracle Database, HDFS ETL (Offline), Oracle Databus Events (Near Line), Content Ingestion (Babylonia), and Rest.li Endpoints. Data-format labels: Oracle Row on the storage, event, and ETL paths; Pegasus Object and Pegasus Data on the Rest.li side]

Complex transformation between Oracle format and Pegasus format

SLIDE 17

Pegasus and Avro

[Diagram: Pegasus Schema → Java Objects; Avro Schema → Java Objects]

Both can be used to generate Java objects with very similar interfaces
Pegasus schema can be used to auto-generate the Avro schema
Pegasus and Avro schema definitions are very similar
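
The similarity between the two schema languages can be shown with a toy translation. Both describe records as JSON with a list of named, typed fields; one visible difference is that Pegasus marks a field with `"optional": true`, while Avro expresses the same idea as a union with `null` plus a default. The function below handles only that one rule and is a simplified sketch, not the translator that ships with Rest.li; the schema and field names are made up for the example.

```python
import copy


def pegasus_to_avro(pegasus_schema):
    """Translate a flat Pegasus-style record schema into an Avro-style one.

    Only the optional-field rule is handled; nested records, unions, and
    typerefs are out of scope for this sketch.
    """
    avro = copy.deepcopy(pegasus_schema)  # leave the input untouched
    for field in avro["fields"]:
        if field.pop("optional", False):
            field["type"] = ["null", field["type"]]
            field["default"] = None
    return avro


# Hypothetical Pegasus-style schema.
PEGASUS_SCHEMA = {
    "type": "record",
    "name": "IngestedContent",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "image", "type": "string", "optional": True},
    ],
}
```

Because required fields pass through unchanged, most of the schema is identical on both sides, which is what keeps the conversion between the stored form and the Rest.li form simple.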

SLIDE 18

Data Formats (Espresso)

[Diagram: Espresso Database, HDFS ETL (Offline), Espresso Brooklin Events (Near Line), Content Ingestion (Babylonia), and Rest.li Endpoints. Data-format labels: Espresso Avro on the storage, event, and ETL paths; Pegasus Object and Pegasus Data on the Rest.li side]

Simple transformation between Avro format and Pegasus format

SLIDE 19

Why Migrate? Schema Evolution

ALTER TABLE (Oracle)
Not tied to code deployment – need to coordinate with DBAs
Schema change involves server downtime
In practice, developers go to great lengths to avoid the hassle
Schema accumulates tech debt

Document schema auto-registration (Espresso)
Schema changes are registered automatically as part of the Babylonia deployment process
Backwards compatibility is enforced – existing data does not need to be transformed
Avro schema is a more natural fit with the Rest.li Pegasus schema
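
One way to see why enforced backwards compatibility means existing data needs no transformation: if every field added in a new schema version carries a default, a reader using the new schema can still read documents written with the old one. The check below is a minimal sketch of that rule for flat Avro-style record schemas. The exact rules Espresso enforces are not given in the talk, so the function name and the specific checks are assumptions.

```python
def compatibility_problems(old_schema, new_schema):
    """List reasons why documents written with old_schema could not be read
    with new_schema. An empty list means the change looks safe under the
    simplified rules of this sketch."""
    old_fields = {f["name"]: f for f in old_schema["fields"]}
    problems = []
    for field in new_schema["fields"]:
        old = old_fields.get(field["name"])
        if old is None:
            # A new field is fine only if old documents can be given a default.
            if "default" not in field:
                problems.append(f"new field '{field['name']}' has no default")
        elif old["type"] != field["type"]:
            # Real Avro permits some type promotions; this sketch rejects all changes.
            problems.append(f"field '{field['name']}' changed type")
    return problems


V1 = {"type": "record", "name": "Content", "fields": [
    {"name": "title", "type": "string"},
]}
V2_OK = {"type": "record", "name": "Content", "fields": [
    {"name": "title", "type": "string"},
    {"name": "image", "type": ["null", "string"], "default": None},
]}
V2_BAD = {"type": "record", "name": "Content", "fields": [
    {"name": "title", "type": "string"},
    {"name": "image", "type": "string"},
]}
```

A check of this kind could run as part of deployment, rejecting `V2_BAD` before it is registered while letting `V2_OK` through without any migration of stored documents.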

SLIDE 20

Goals for Migration Process

Zero down time
Transparent to Rest.li clients
Give offline and nearline consumers time to migrate

Validate each step
Mirroring in real time

SLIDE 21

Pre-Migration State of Babylonia

[Diagram: Oracle Database, HDFS ETL (Offline), Oracle Databus Events (Near Line), and Content Ingestion (Babylonia)]

SLIDE 22

Pre-Migration State of Babylonia

[Diagram: Other Services make Rest.li Calls to the Rest.li Endpoints, which sit in front of the Oracle Database and Oracle Databus Events]

SLIDE 23

Pre-Migration Cleanup

[Diagram: Other Services make Rest.li Calls to the Rest.li Endpoints, which sit in front of the Oracle Database and Oracle Databus Events]

Identify code that is tightly coupled to the database
Decide which code should be reimplemented for Espresso, and which code should be decoupled or eliminated
Reduce number of code paths to migrate

The easiest lines of code to migrate are the lines of code that don’t exist

SLIDE 24

Bootstrap Espresso Database

[Diagram: Oracle Database → HDFS ETL → Offline Convert Job → Avro Data File → Espresso Bulk Loader → Espresso Database]
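
The offline convert step can be pictured as a function from flat rows to documents. The sketch below maps rows with hypothetical Oracle column names (`URL`, `TITLE`, `IMAGE_URL`) to document-shaped dictionaries and stamps each one as written by the bulk loader, in the spirit of the migration control field described on a later slide. Writing the result as an Avro data file and running the actual bulk loader are not shown, and the column names, document fields, and the `writer` field name are all assumptions.

```python
def convert_row(row):
    """Map one flat row (hypothetical column names) to a document."""
    return {
        "url": row["URL"],
        "title": row.get("TITLE"),
        "image": row.get("IMAGE_URL"),
        # Records which migration process produced this document.
        "writer": "BULK_LOADER",
    }


def convert_rows(rows):
    """Yield one document per row, skipping rows that have no URL key."""
    for row in rows:
        if not row.get("URL"):
            continue
        yield convert_row(row)
```

Keeping the conversion a pure function of the row makes it easy to rerun the job and to compare its output against what the online system later writes for the same URL.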

SLIDE 25

Bootstrap Espresso Database

[Diagram: Oracle Database → HDFS ETL → Espresso Database]

SLIDE 26

Shadow Read Validation

Databus Listener, Shadow Read Validation

[Diagram: Oracle Database → Oracle Databus Events → Databus Listener → Espresso Database]
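
Shadow read validation can be sketched as a reader that serves every request from the current source of truth while also reading the same key from the new store and recording whether the two agree. The talk names the technique but gives no implementation details, so the `ShadowReader` class, the dict-like store interface, and the counters below are assumptions for illustration.

```python
import logging

logger = logging.getLogger("shadow_read")


class ShadowReader:
    """Serve reads from the primary store; compare against the shadow store."""

    def __init__(self, primary, shadow):
        self.primary = primary  # source of truth during this phase (Oracle)
        self.shadow = shadow    # store being validated (Espresso)
        self.matches = 0
        self.mismatches = 0

    def get(self, key):
        expected = self.primary.get(key)
        try:
            actual = self.shadow.get(key)
        except Exception:
            # A failing shadow read must never break the real request.
            logger.exception("shadow read failed for %r", key)
            self.mismatches += 1
            return expected
        if actual == expected:
            self.matches += 1
        else:
            self.mismatches += 1
            logger.warning("shadow mismatch for %r: %r != %r", key, actual, expected)
        return expected
```

Callers always receive the primary store's answer, so a mismatch is only observed, never served. Watching the mismatch count approach zero is one way to decide when the mirrored data is trustworthy enough for the next step.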

SLIDE 27

Direct Writes to Espresso

[Diagram: Oracle Database → Oracle Databus Events → Databus Listener → Espresso Database, with Shadow Read Validation and a Direct Write path into Espresso]

SLIDE 28

Resolving Write Conflicts

[Diagram: Oracle Databus Events → Databus Listener → Espresso Database, alongside the Direct Write path into Espresso]

Dual Write Conflict – Databus Listener and Babylonia updating same record
Migration Control – optional field added to schema indicating which process wrote the record: Bulk Loader, Databus Listener, or Babylonia
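
The slide says which processes can write a record but not how conflicts between them are settled. The sketch below shows one possible rule, stated here as an assumption: writers are ranked Bulk Loader < Databus Listener < Babylonia, and a write is skipped when the stored record came from a higher-ranked writer. Under that rule a late Databus event cannot overwrite a fresher direct write from Babylonia. The `Writer` enum, the `writer` field name, and `apply_write` are hypothetical.

```python
from enum import IntEnum


class Writer(IntEnum):
    """Migration control values; the ordering is an assumed precedence."""
    BULK_LOADER = 1
    DATABUS_LISTENER = 2
    BABYLONIA = 3


def apply_write(store, key, document, writer):
    """Write document unless a higher-precedence writer already owns the record.

    Returns True if the write was applied, False if it was skipped.
    """
    existing = store.get(key)
    if existing is not None and existing["writer"] > writer:
        return False
    store[key] = {**document, "writer": writer}
    return True
```

Because the field is optional in the schema, it could be left unset once the migration is complete. A production version would also need the read-check-write sequence to be atomic, for example through a conditional update, which this sketch does not attempt.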

SLIDE 29

Espresso New SoT

[Diagram: Espresso Database with Direct Read/Write and Espresso Brooklin Events; Oracle Database and Oracle Databus Events kept in sync through Dual Writes and marked Deprecated]

SLIDE 30

Oracle Turnoff

[Diagram: Espresso Database with Direct Read/Write and Espresso Brooklin Events]

SLIDE 31

Thank you