Handling Personal Information in LinkedIn’s Content Ingestion System



slide-1
SLIDE 1

Handling Personal Information in LinkedIn’s Content Ingestion System

David Max

Senior Software Engineer LinkedIn

slide-2
SLIDE 2

About Me

  • Software Engineer at LinkedIn NYC since 2015
  • Content Ingestion team
  • Office Hours: Thursday 11:30-12:00

David Max

Senior Software Engineer LinkedIn

www.linkedin.com/in/davidpmax/

slide-3
SLIDE 3

About LinkedIn New York Engineering

  • Located in Empire State Building
  • Approximately 100 engineers and 1000 employees total
  • Multiple teams: front end, back end, and data science

New York Engineering

slide-4
SLIDE 4

Disclaimers

  • I’m not a lawyer
  • Some details omitted
  • I am not a spokesperson for official LinkedIn policy

slide-5
SLIDE 5

OUR MISSION

Create economic opportunity for every member of the global workforce
slide-6
SLIDE 6

LinkedIn

  • World’s largest professional network
  • >546M members
  • >70% of members reside outside the U.S.
  • More than 200 countries and territories worldwide

slide-7
SLIDE 7

General Data Protection Regulation

  • Applies to all companies worldwide that process personal data of individuals in the EU.
  • Widens definition of personal data.
  • Introduces restrictive data handling principles.
  • Enforceable from May 25, 2018.

slide-8
SLIDE 8

Handling Personally Identifiable Information (PII)

  • Data Minimization: Limit personal data collection, storage, usage
  • Consent: Cannot use collected data for a different purpose
  • Retention: Do not hold data longer than necessary
  • Deletion: Must delete data upon request

slide-9
SLIDE 9

Handling PII in Content Ingestion

Content Ingestion (Babylonia)

Data Protection: Data Minimization, Consent, Retention, Deletion

slide-10
SLIDE 10

What is Content Ingestion?

Content Ingestion Babylonia

slide-11
SLIDE 11

Content Ingestion

Babylonia

slide-12
SLIDE 12

Content Ingestion

Babylonia

slide-13
SLIDE 13

Content Ingestion

Babylonia

url: https://www.youtube.com/watch?v=MS3c9hz0bRg
title: "SATURN 2017 Keynote: Software is Details"
image: https://i.ytimg.com/vi/MS3c9hz0bRg/hqdefault.jpg?sq poaymwEYCKgBEF5IVfKriqkDCwgBFQAAiEIYAXAB\\u00 26rs=AOn4CLClwjQlBmMeoRCePtHaThN-qXRHqg

slide-14
SLIDE 14

Content Ingestion

Babylonia

slide-15
SLIDE 15

What is Content Ingestion?

Content Ingestion

Babylonia

  • Extracts metadata from web pages
  • Source of Truth for 3rd party content
  • Also contains metadata for some public 1st party content
  • Used by LinkedIn services for sharing, decorating, and embedding content
  • Data also feeds into content understanding and relevance models
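
To make "extracts metadata from web pages" concrete, here is a minimal Python sketch. It assumes the page exposes Open Graph `<meta>` tags, which is an assumption for illustration only; the slides do not say how Babylonia extracts metadata, and `OpenGraphParser` and `extract_metadata` are hypothetical names.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:*" content="..."> values from a page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        content = attributes.get("content")
        if prop.startswith("og:") and content is not None:
            # Store "og:title" under the key "title", and so on.
            self.metadata[prop[3:]] = content


def extract_metadata(url, html):
    """Return a url/title/image style record for one fetched page."""
    parser = OpenGraphParser()
    parser.feed(html)
    return {"url": url, **parser.metadata}
```

The resulting record has the same shape as the url/title/image example shown on the earlier slide. Whatever a page puts into these tags, including a member's name or photo, is copied into the store, which is how PII can enter an ingestion system.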

slide-16
SLIDE 16

How does PII get into Babylonia?

slide-17
SLIDE 17

Ingesting 1st party pages containing publicly viewable member PII

  • Profile pages
  • Publish posts
  • SlideShare content
slide-18
SLIDE 18

When a Member Account is Closed

What happens

  • Babylonia (along with other systems) is notified that the member’s account is closed
  • Other systems take down the member’s content (e.g. public profile page, publish posts, etc.)

What Babylonia needs to do

  • Remove scraped data relating to the member pages that have been taken down
  • Notify downstream systems that might be holding a copy of the data

slide-19
SLIDE 19

Babylonia Datasets

[Diagram labels: Content Ingestion (Babylonia), Datasets, Espresso Database, HDFS ETL, Brooklin Data Change Events]

slide-20
SLIDE 20

Downstream and Upstream Datasets

[Diagram labels: 1st party web pages (profile, job, article, publishing); Espresso Database, HDFS ETL, Brooklin Data Change Events; Online Service, Near Line, Offline]

slide-21
SLIDE 21
Challenges of member PII in Babylonia

  • Need to identify URLs that contain a member’s PII.
  • My post might contain your PII
  • Connection between member and the URL resides in the upstream system

slide-22
SLIDE 22

Option #1: Require Upstream Systems to Notify Babylonia

Pros

  • Simple: Babylonia waits to be told specifically which URLs should be purged
  • Babylonia only does extra work when a URL needs to be purged
  • Puts responsibility where the knowledge is

Cons

  • Requires additional work by every system that exposes PII in publicly accessible web pages
  • If the notification is missed, how will Babylonia know?
  • 1st party URLs sometimes change as upstream systems are changed, so old URLs need to be handled correctly too

slide-23
SLIDE 23

Option #2: Actively Refetch Every 1st Party URL

Pros

  • Simple logic: Page gone? Purge the page.
  • Requires little additional work from upstream systems
  • Works also for old 1st party URLs

Cons

  • There are a lot of 1st party URLs in Babylonia
  • Continuous polling of all 1st party URLs consumes a lot of resources just for the sake of the very few URLs that are actually affected
  • Extra work to avoid false positives or false negatives

slide-24
SLIDE 24

Option #3: Eliminate Member PII in Babylonia

Pros

  • The easiest data to delete is data that isn’t in your system to begin with
  • Gets closer to Single Source of Truth (SSOT) for all 1st party content, which is better for consistency, not only for compliance

Cons

  • Babylonia is relied upon by numerous systems to have content for URLs, so excluding 1st party content will affect member experience
  • No substitute currently available
  • Difficult to achieve based on URL: can’t always tell by looking at a URL if it resolves to 1st party content (e.g. shortlinks)

slide-25
SLIDE 25

Blended Approach

  • Option 1: Having upstream systems notify is best, but might miss some pages
  • Option 2: Active refetch is thorough but expensive. Must use to catch pages that won’t support notifications
  • Option 3: Some pages won’t work with active refetch. For example, pages that still return an HTTP status code 200 even when the data has been removed. These must be blocked

slide-26
SLIDE 26

Classification of Ingested URLs

  • URL
      • 3rd Party
      • 1st Party
          • Blocked
          • Whitelisted
              • Actively Refetched
              • Notified by Upstream
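
The classification on this slide can be sketched as a small decision function. This is an illustrative Python sketch, not LinkedIn's implementation: the host list, the path prefixes, and the class assigned to each prefix are all hypothetical, and the sketch assumes the URL has already been resolved (shortlinks cannot be classified by inspection alone).

```python
from enum import Enum
from urllib.parse import urlparse


class UrlClass(Enum):
    THIRD_PARTY = "3rd party"
    BLOCKED = "1st party, blocked"
    ACTIVELY_REFETCHED = "1st party, whitelisted, actively refetched"
    NOTIFIED_BY_UPSTREAM = "1st party, whitelisted, notified by upstream"


# Hypothetical configuration, for illustration only.
FIRST_PARTY_HOSTS = {"www.linkedin.com", "www.slideshare.net"}
WHITELIST = {
    ("www.linkedin.com", "/in/"): UrlClass.ACTIVELY_REFETCHED,
    ("www.linkedin.com", "/pulse/"): UrlClass.NOTIFIED_BY_UPSTREAM,
}


def classify(url):
    """Classify a resolved URL; 1st party URLs are blocked unless whitelisted."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host not in FIRST_PARTY_HOSTS:
        return UrlClass.THIRD_PARTY
    for (wl_host, prefix), url_class in WHITELIST.items():
        if host == wl_host and parsed.path.startswith(prefix):
            return url_class
    # Restriction is the default: anything not explicitly onboarded is blocked.
    return UrlClass.BLOCKED
```

The final `return` is the important design choice: an unrecognized 1st party URL falls through to `BLOCKED` rather than being ingested.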

slide-27
SLIDE 27

Option 1 – Upstream Notification

  • Upstream system sends a Kafka message
  • Babylonia consumes message and purges data
  • Open source: kafka.apache.org
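
A minimal Python sketch of the consume-and-purge step follows. To stay self-contained it uses a plain dict as the store and a JSON string in place of a real Kafka consumer; the message schema (`{"urls": [...]}`) and the names `handle_takedown` and `notify_downstream` are assumptions for illustration, not the actual message format.

```python
import json


def handle_takedown(raw_message, store, notify_downstream):
    """Purge the URLs named in one takedown message; return the URLs removed."""
    message = json.loads(raw_message)
    purged = []
    for url in message["urls"]:
        # Skipping unknown URLs keeps the handler idempotent, so a
        # redelivered message for an already purged URL is harmless.
        if url in store:
            del store[url]
            purged.append(url)
    if purged:
        # Downstream systems may hold copies of the purged data.
        notify_downstream(purged)
    return purged
```

Idempotence matters here because message brokers commonly redeliver messages, and a purge that fails on the second delivery would turn a harmless retry into an error.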

slide-28
SLIDE 28

Option 2 – Active Refetching

[Diagram labels: Espresso Database, HDFS ETL, Refetch URL table, Offline job, Kafka Push job, Refetch messages, Refetch process, UPDATE, Takedown Requests for deleted pages]
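
The decision made by the refetch process can be sketched in Python as below. The mapping is an assumption for illustration: the slides only require a 404 for deleted pages, so treating 410 as "gone" and treating every other non-2xx status as "retry later" (to avoid purging on a transient error) are choices made for this sketch. `fetch_status` is a hypothetical callable that returns an HTTP status code for a URL.

```python
from enum import Enum


class RefetchAction(Enum):
    UPDATE = "update"
    TAKEDOWN = "takedown"
    RETRY_LATER = "retry later"


def decide(status_code):
    """Map the HTTP status of a refetch to an action."""
    if status_code in (404, 410):
        return RefetchAction.TAKEDOWN
    if 200 <= status_code < 300:
        return RefetchAction.UPDATE
    # Server errors, redirects, and the like: do not purge on weak evidence.
    return RefetchAction.RETRY_LATER


def run_refetch(urls, fetch_status, store):
    """Refetch each URL and purge stored data for pages that are gone."""
    takedowns = []
    for url in urls:
        if decide(fetch_status(url)) is RefetchAction.TAKEDOWN:
            store.pop(url, None)
            takedowns.append(url)
    return takedowns
```

A page that returns 200 after its data has been removed would be classified as `UPDATE` here, which is why such pages cannot rely on active refetching.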

slide-29
SLIDE 29

Option 3 – Whitelist

  • Block all 1st party URLs that can’t meet minimal requirements
  • Mainly must return a 404 for an invalid or deleted URL
  • Ensures new 1st party URLs are onboarded before being ingested

slide-30
SLIDE 30

Managing PII in Datasets

[Diagram labels: Espresso Database, HDFS ETL, Offline Datasets]

slide-31
SLIDE 31

Espresso Datasets

What is Espresso?

  • LinkedIn distributed NoSQL database
  • Data stored in Avro format (JSON)
  • Indexed by specific primary key fields

Challenges

  • Reference to PII not always in the key
  • ETL snapshots of Espresso Dataset become offline Datasets

slide-32
SLIDE 32

Offline (HDFS) Datasets

HDFS ETL

Offline Datasets

Challenges

  • Files of Avro (JSON) records
  • Need to read whole record to see if it has PII
  • Files not conducive to removing one record from the middle
  • Dataset can be source for downstream jobs that also need to be purged
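
Because a record cannot be cut out of the middle of a file, a purge in practice means rewriting the file without the affected records. The Python sketch below illustrates this under simplifying assumptions: it uses JSON Lines files rather than Avro so that it needs only the standard library, and `contains_value` and `purge_file` are hypothetical names.

```python
import json
import os
import tempfile


def contains_value(node, target):
    """Search a whole decoded record, at any nesting depth, for target."""
    if isinstance(node, dict):
        return any(contains_value(v, target) for v in node.values())
    if isinstance(node, list):
        return any(contains_value(v, target) for v in node)
    return node == target


def purge_file(path, purged_url):
    """Rewrite path without records mentioning purged_url; return count dropped."""
    dropped = 0
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, encoding="utf-8") as src, tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", dir=directory, delete=False
    ) as dst:
        for line in src:
            if not line.strip():
                continue
            if contains_value(json.loads(line), purged_url):
                dropped += 1
            else:
                dst.write(line)
    # Swap the rewritten file into place only after it is complete.
    os.replace(dst.name, path)
    return dropped
```

Every record has to be decoded and scanned in full, which is the cost the slide points at; knowing in advance which fields can hold a URL or member reference narrows that search.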

slide-33
SLIDE 33

Which datasets contain member PII?

Data Discovery

WhereHows

  • Data discovery and lineage tool
  • Central location for all schema
  • Document meanings of each column
  • Trace downstream/upstream lineage of datasets
  • Tag every column that can contain member reference or PII.
  • Open Source: github.com/linkedin/wherehows
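
Lineage tracing combined with column tags answers the question on this slide: starting from one dataset, which derived datasets could also hold the data? The Python sketch below is not the WhereHows API. It models lineage as a plain dict of edges and tags as a dict of sets, and every dataset, column, and tag name in the usage example is hypothetical.

```python
from collections import deque


def downstream_datasets(lineage, source):
    """Breadth-first walk of lineage edges; every dataset derived from source."""
    seen = {source}
    order = []
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, ()):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def datasets_with_tag(names, column_tags, tag):
    """Keep only the datasets that have at least one column carrying tag."""
    return [
        name
        for name in names
        if any(tag in tags for tags in column_tags.get(name, {}).values())
    ]
```

For example, with `lineage = {"espresso.content": ["hdfs.content_snapshot"], "hdfs.content_snapshot": ["hdfs.share_features"]}`, calling `downstream_datasets(lineage, "espresso.content")` walks both hops, and `datasets_with_tag` then narrows the result to datasets whose columns are tagged as holding an ingested URL.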

slide-34
SLIDE 34
Dali (Data Access at LinkedIn)

  • Interface for accessing datasets
  • Combines dataset schema with WhereHows metadata
  • Defines output virtual dataset while preserving data tags
  • Supports defining virtual datasets where PII is excluded or obfuscated

Raw Dataset

WhereHows Metadata

Dali Reader
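
The idea of a virtual dataset that excludes or obfuscates PII while keeping tags attached can be sketched in Python as follows. This is not Dali's interface: the tag names, the `policy` mapping, and the use of a truncated SHA-256 digest as the obfuscation are all assumptions chosen for illustration.

```python
import hashlib


def action_for(column, column_tags, policy):
    """Pick the strictest action that any of the column's tags calls for."""
    actions = {policy[tag] for tag in column_tags.get(column, ()) if tag in policy}
    if "exclude" in actions:
        return "exclude"
    if "obfuscate" in actions:
        return "obfuscate"
    return "keep"


def obfuscate(value):
    """Replace a value with a short digest (illustrative, not a privacy guarantee)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def virtual_dataset(rows, column_tags, policy):
    """Return (view_rows, view_tags); tags follow the columns that survive."""
    view_rows = []
    for row in rows:
        out = {}
        for column, value in row.items():
            action = action_for(column, column_tags, policy)
            if action == "exclude":
                continue
            out[column] = obfuscate(value) if action == "obfuscate" else value
        view_rows.append(out)
    view_tags = {
        column: tags
        for column, tags in column_tags.items()
        if action_for(column, column_tags, policy) != "exclude"
    }
    return view_rows, view_tags
```

Returning the tags alongside the rows is the point of the sketch: a reader of the view still knows which of its columns reference a member or a URL.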

slide-35
SLIDE 35

Access Control

Only systems that handle PII properly are allowed access

Access Control List (ACL)

  • Restricts access to PII data to a known list of authorized systems
  • We only approve access to systems that can handle PII properly
  • Ensures that member PII can’t leak into untracked systems/datasets
  • Acts as a list of downstream services
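
The double role of the ACL, as gatekeeper and as a registry of downstream readers, can be sketched in a few lines of Python. `DatasetAcl` and its methods are hypothetical names, and the `handles_pii_properly` flag stands in for whatever review process actually approves a system.

```python
class DatasetAcl:
    """Tracks which systems may read each dataset."""

    def __init__(self):
        self._readers = {}  # dataset name -> set of authorized system names

    def grant(self, dataset, system, handles_pii_properly):
        """Authorize a system, but only if it has shown it handles PII properly."""
        if not handles_pii_properly:
            raise PermissionError(f"{system} is not approved to read {dataset}")
        self._readers.setdefault(dataset, set()).add(system)

    def can_read(self, dataset, system):
        """Default deny: unknown datasets and unknown systems get no access."""
        return system in self._readers.get(dataset, set())

    def downstream_systems(self, dataset):
        """The same list doubles as the set of systems to notify on a purge."""
        return sorted(self._readers.get(dataset, set()))
```

Because access is impossible without an entry, the list cannot silently fall out of date the way a separately maintained notification list could.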

slide-36
SLIDE 36

Keeping Track of Personal Information in Babylonia

WhereHows

  • Field tagging for fields containing PII
  • Know where the PII is

Dali

  • Downstreams use Dali, which preserves the WhereHows tagging on new virtual datasets
  • Keeps tags with the data as it moves from one dataset to another

ACL

  • Control the spread of PII data only to authorized readers
  • Serves as a list of current downstream systems to notify when data is purged

slide-37
SLIDE 37

Apache Gobblin

  • Framework for transforming large datasets
  • Data lifecycle management
  • Uses WhereHows tags to identify data in our Espresso or offline datasets that need to be purged
  • Open source: gobblin.apache.org
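
A tag-driven purge can be sketched in Python as below. This is not Gobblin's API: datasets are modeled as in-memory lists of dicts, the tags as a dict of sets, and `purge_by_tag` and the tag name `INGESTED_URL` are hypothetical. The sketch only shows how tags let one generic job find the right column in every dataset.

```python
def purge_by_tag(datasets, column_tags, tag, value):
    """Remove rows whose tagged column equals value; return rows removed per dataset."""
    removed = {}
    for name, rows in datasets.items():
        tagged = [
            column
            for column, tags in column_tags.get(name, {}).items()
            if tag in tags
        ]
        if not tagged:
            # No column carries the tag, so this dataset is left untouched.
            continue
        kept = [row for row in rows if all(row.get(c) != value for c in tagged)]
        removed[name] = len(rows) - len(kept)
        rows[:] = kept  # purge in place
    return removed
```

The purge job needs no knowledge of any individual schema; a dataset opts in simply by tagging its columns.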
slide-38
SLIDE 38
Tagging in WhereHows

WhereHows and Gobblin

  • Created tags representing ingested content URLs in WhereHows
  • Enables downstream systems to onboard with Espresso auto purge and Gobblin by tagging columns in their tables as containing a URL or Ingested Content URN (Uniform Resource Name)

slide-39
SLIDE 39
Compliance Comes First

  • Choose an implementation where restriction is the default until proven safe
  • Whitelisting ensures all allowed 1st party URLs meet a minimum technical bar for ingestion
  • Simplicity of active refetching helps keep the bar low enough to include most content safely

slide-40
SLIDE 40
Constraints

Bigger Picture

  • Added constraints to the system
  • Developer restrictions
  • Made certain kinds of things harder to do

slide-41
SLIDE 41

“Constraints can act as guide rails that point a system where you want it to go.”

George Fairbanks

slide-42
SLIDE 42
Constraints / Guide Rails

Bigger Picture

  • A constrained system is easier to predict and control
  • Make the wrong things harder to do
  • Give guidance to all developers on how things are supposed to be done

slide-43
SLIDE 43
Manifest Guide Rails in the Code

Bigger Picture

  • Constraints should manifest in some explicit way
  • Counter-Example: “No backwards incompatible schema changes”
  • Hard to tell what developers refrained from doing
  • WhereHows, Dali, and ACLs make metadata and the rules explicit and thus easier to perpetuate

slide-44
SLIDE 44

Architecture Hoisting

Bigger Picture

A design technique where the responsibility for a guide rail is moved away from developer vigilance into code, with the goal of achieving a global property on the system.

slide-45
SLIDE 45

Architecture Hoisting

Bigger Picture

  • Make use of the framework to manage PII
  • Requires developers to think about PII concerns up front to access the data
  • Once set up, developers can focus less on managing PII because the architecture is handling it
  • Users of the framework can automatically benefit from future enhancements
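
As a toy illustration of architecture hoisting, the Python sketch below moves one guide rail ("anyone who reads PII must handle purges") out of developer vigilance and into the data access code itself. `PiiAwareDataset` and its methods are hypothetical and are not part of WhereHows, Dali, or Gobblin.

```python
class PiiAwareDataset:
    """A dataset wrapper that refuses PII reads without a purge handler."""

    def __init__(self, rows, pii_columns):
        self._rows = list(rows)
        self._pii_columns = set(pii_columns)
        self._purge_handlers = []

    def read(self, columns, on_purge=None):
        """Return the requested columns; PII columns require an on_purge handler."""
        requested_pii = self._pii_columns.intersection(columns)
        if requested_pii:
            if on_purge is None:
                raise ValueError(
                    "reading PII columns %s requires an on_purge handler"
                    % sorted(requested_pii)
                )
            self._purge_handlers.append(on_purge)
        return [{c: row[c] for c in columns} for row in self._rows]

    def purge(self, url):
        """Drop rows for url and tell every registered PII reader about it."""
        self._rows = [row for row in self._rows if row["url"] != url]
        for handler in self._purge_handlers:
            handler(url)
```

A developer has to think about purging once, at the moment of asking for PII columns. After that the framework delivers purge notifications without further effort, and any later improvement to `purge` reaches every reader automatically.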

slide-46
SLIDE 46

Thank you