Handling Personal Information in LinkedIn’s Content Ingestion System
David Max
Senior Software Engineer LinkedIn
About Me
- Software Engineer at LinkedIn NYC since 2015
- Content Ingestion team
- Office Hours: Thursday 11:30-12:00
- www.linkedin.com/in/davidpmax/
About LinkedIn NYC
- 1000 employees total, spanning engineering, data science, and policy

Our Mission
- Connect the world's professionals to make them more productive and successful
- A network of members across countries and territories worldwide
GDPR
- Applies to any organization that processes personal data of EU citizens
- Based on data protection principles
- In effect since May 25, 2018
- Data Minimization: limit personal data collection, storage, and usage
- Consent: collected data cannot be used for a different purpose
- Retention: do not hold data longer than necessary
- Deletion: must delete data upon request
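As a rough illustration of the Retention and Deletion principles, a store can track when each record was collected and support both automatic expiry and on-demand deletion. This is a minimal sketch; the class, the 180-day window, and all names are hypothetical, not LinkedIn's implementation.

```python
from datetime import datetime, timedelta

# Hypothetical retention window; real values depend on policy and data type.
RETENTION = timedelta(days=180)

class RecordStore:
    """Toy store illustrating the Retention and Deletion principles."""

    def __init__(self):
        self._records = {}  # member_id -> (data, stored_at)

    def put(self, member_id, data, stored_at):
        self._records[member_id] = (data, stored_at)

    def purge_expired(self, now):
        """Retention: drop anything held longer than necessary."""
        expired = [m for m, (_, t) in self._records.items() if now - t > RETENTION]
        for m in expired:
            del self._records[m]
        return expired

    def delete_member(self, member_id):
        """Deletion: must delete data upon request."""
        self._records.pop(member_id, None)

    def __contains__(self, member_id):
        return member_id in self._records
```

The same two code paths (time-based expiry, member-initiated deletion) appear throughout the rest of the talk at much larger scale.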
Data Protection in Content Ingestion: Babylonia
Example of ingested content metadata:

url: https://www.youtube.com/watch?v=MS3c9hz0bRg
title: "SATURN 2017 Keynote: Software is Details"
image: https://i.ytimg.com/vi/MS3c9hz0bRg/hqdefault.jpg?sqp=-oaymwEYCKgBEF5IVfKriqkDCwgBFQAAiEIYAXAB&rs=AOn4CLClwjQlBmMeoRCePtHaThN-qXRHqg
What Babylonia does
- Ingests public 1st party content
- Fetching, decorating, and embedding content
- Feeds content understanding and relevance models

What happens | What Babylonia needs to do
- The member's account is closed | Babylonia (among other systems) is notified
- The member's content (i.e. public profile page, published posts, etc.) is taken down | Babylonia might be holding a copy of the data and must purge the member pages that have been taken down
[Architecture diagram: Espresso Database → Brooklin Data Change Events → Babylonia (Content Ingestion); Espresso Database → HDFS ETL → Offline Datasets]
[Diagram: 1st party web pages (profile, job, article, publishing) served across Online Service, Nearline, and Offline tiers]
- Ingested pages may contain a member's PII
- My post might contain your PII
- The mapping between a member and the URL resides in the upstream system
- When a member's data needs to be purged, Babylonia needs to know specifically which URLs should be purged
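Since the member-to-URL mapping lives upstream, a purge request reaching Babylonia must spell out the URLs themselves. A minimal sketch of that contract (all URLs and names hypothetical):

```python
# Babylonia's store is keyed by URL, so a purge request must name the exact
# URLs to remove; it cannot be keyed by member.
ingested = {
    "https://www.linkedin.com/in/example-member/": {"title": "Profile"},
    "https://www.linkedin.com/pulse/example-post/": {"title": "Post"},
    "https://example.com/unrelated": {"title": "Other"},
}

def purge(urls):
    """Remove every named URL that we actually hold; report what was purged."""
    return [u for u in urls if ingested.pop(u, None) is not None]
```

URLs we never ingested are silently skipped, so upstream systems can over-report without harm.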
Option 1: Upstream systems notify Babylonia

Pros
- The upstream system knows exactly which content exposes PII in publicly accessible web pages

Cons
- If an upstream system fails to notify, how would Babylonia know?
- URL formats in upstream systems are changed over time – need to correctly handle old URLs too
- Requires work in many upstream systems
Option 2: Babylonia actively refetches content

Pros
- No need to know whose PII is in your system to begin with
- Works for all 1st party content – better for consistency, not only for compliance

Cons
- Consumes a lot of resources just for the sake of the very few URLs that are actually affected
- Risk of false negatives
Option 3: Don't ingest 1st party content at all

Pros
- No 1st party PII to purge

Cons
- Members expect to have content for URLs – excluding 1st party content will affect member experience
- Hard to tell by looking at a URL if it resolves to 1st party content (e.g. shortlinks)
Combining the approaches
- Option 1 – Having upstream systems notify is best, but might miss some pages
- Option 2 – Active refetch is thorough, and covers systems that won't support notifications
- Option 3 – Some pages won't work with active refetch: they return an HTTP status code 200 even when the data has been removed. These must be blocked.
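The "soft 200" problem can be expressed as a guard in the refetch decision. A sketch, assuming a hypothetical blocklist of URL patterns whose pages return 200 even after removal and therefore cannot be verified by refetch:

```python
# Hypothetical blocklist: page families known to serve HTTP 200 with a
# placeholder body even after the underlying data has been removed.
BLOCKED_URL_PATTERNS = ("/in/",)

def refetch_can_verify(url):
    """Active refetch only proves deletion when it is visible in the status code."""
    return not any(p in url for p in BLOCKED_URL_PATTERNS)

def should_purge_on_refetch(url, status):
    """Purge when the refetched page is verifiably gone (404/410)."""
    return refetch_can_verify(url) and status in (404, 410)
```

Blocked URL families must rely on upstream notifications instead of refetch; treating their 200 responses as "still alive" would be a false negative.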
Purge notifications
- Upstream systems send a Kafka message; Babylonia consumes it and purges data
- Kafka is open source – kafka.apache.org
[Architecture diagram: an offline job reads the HDFS ETL copy of the Espresso Database and builds the Refetch URL table; a Kafka push job publishes refetch messages; the refetch process consumes them]
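The offline-to-nearline handoff above might look like the following sketch; a deque stands in for the Kafka topic, and every name here is hypothetical rather than the production API.

```python
from collections import deque

# Stand-in for the Kafka refetch topic.
refetch_topic = deque()

def push_job(refetch_url_table):
    """Offline push job: publish one refetch message per URL."""
    for url in refetch_url_table:
        refetch_topic.append({"url": url})

def refetch_process(fetch):
    """Consume refetch messages; purge URLs whose content is gone."""
    purged = []
    while refetch_topic:
        msg = refetch_topic.popleft()
        status = fetch(msg["url"])  # fetch returns an HTTP status code
        if status in (404, 410):
            purged.append(msg["url"])
    return purged
```

Separating the push job from the consumer mirrors the diagram: the offline side only decides *what* to check, while the nearline side decides *whether* each URL must be purged.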
- UPDATE: takedown requests for deleted pages
- Pages that can't meet minimal requirements
- An invalid or deleted URL is never ingested
[Diagram: Espresso Database → HDFS ETL → Offline Datasets (Espresso datasets mirrored offline)]
What is Espresso?
- LinkedIn's NoSQL database
- Stores documents in a flexible format (JSON)
- Records are addressed by primary key fields
- The ingested URL is always in the key
Challenges
- Espresso datasets become offline datasets via HDFS ETL
- The offline copy also has PII
- There are downstream jobs that also need to be purged
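Because the ingested URL is always part of the Espresso key, a purge can be driven entirely by key matching, with the purged keys logged for the derived offline datasets to consume. A sketch with a hypothetical (url, field) key schema:

```python
# Hypothetical schema: each Espresso record is keyed by (url, field), so
# purging a URL means deleting every record under that key prefix and
# recording the URL so offline purge jobs can follow suit.
espresso = {
    ("https://a.example/post", "metadata"): {"title": "A"},
    ("https://a.example/post", "image"): {"bytes": "..."},
    ("https://b.example/page", "metadata"): {"title": "B"},
}
purge_log = []  # consumed by downstream/offline purge jobs

def purge_url(url):
    """Delete all records keyed by this URL; return how many were removed."""
    keys = [k for k in espresso if k[0] == url]
    for k in keys:
        del espresso[k]
    purge_log.append(url)
    return len(keys)
```

Keeping the URL in the key is what makes this cheap: no full scan of record values is needed to find a member's ingested pages.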
Data Discovery
- Which datasets contain member PII?
- Datasets are tagged in WhereHows metadata as containing a member reference or PII
- WhereHows is open source – github.com/linkedin/wherehows
- Derived datasets inherit the tags of the raw dataset, preserving data tags except where PII is excluded or obfuscated
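Tag propagation to derived datasets could be sketched like this (structures hypothetical): tags flow from parent to child, except for columns that are dropped or obfuscated downstream.

```python
def derive_tags(parent_tags, kept_columns, obfuscated_columns):
    """Compute a derived dataset's PII tags from its parent's tags.

    parent_tags: column name -> tag (e.g. "PII", "MEMBER_REF", "INGESTED_URL")
    kept_columns: columns the derived dataset retains
    obfuscated_columns: columns whose PII was hashed/removed downstream
    """
    tags = {}
    for col, tag in parent_tags.items():
        if col not in kept_columns:
            continue  # column excluded downstream: nothing to tag
        if col in obfuscated_columns:
            continue  # PII obfuscated: tag not carried forward
        tags[col] = tag
    return tags
```

Making propagation the default (rather than asking each team to re-tag) is what keeps the metadata trustworthy as datasets multiply.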
Access Control
- Only systems that handle PII properly are allowed access
- ACLs keep a list of authorized systems; a system must demonstrate that it can handle PII properly
- Prevents PII from leaking into untracked systems/datasets
- Dali, together with WhereHows and the ACLs, preserves the WhereHows tagging on new virtual datasets as data moves from system to system, exposing PII data only to authorized readers
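The access check itself reduces to a small predicate. A sketch, assuming hypothetical system names and a "PII" tag value coming from the metadata store:

```python
# Hypothetical list of systems vetted to handle PII properly.
AUTHORIZED_FOR_PII = {"purge-service", "compliance-audit"}

def can_read(system, dataset_tags):
    """Allow access unless the dataset holds PII and the reader isn't vetted."""
    has_pii = any(tag == "PII" for tag in dataset_tags.values())
    return (not has_pii) or system in AUTHORIZED_FOR_PII
```

Datasets with no PII tags stay open to everyone, so the restriction only bites where the metadata says it must.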
Purging downstream
- Current downstream systems are notified when data in our Espresso or offline datasets needs to be purged
- Teams register ingested content URLs in WhereHows for purge, and Gobblin (open source) performs it, by tagging columns in their tables as containing a URL or Ingested Content URN (Uniform Resource Name)
Lessons
- Restriction is the default until proven safe
- Ensure every 1st party URL meets a minimum technical bar for ingestion – keep the bar low enough to include most content safely
- Make the right thing easy and the wrong thing harder to do
George Fairbanks
- Software architecture makes a system easier to predict and control
- Makes the right thing easier to do
- States how things are supposed to be done in an explicit way
Takeaways
- Prevent consumers from making "incompatible schema changes"
- Keeping the metadata and the rules explicit makes them easier to perpetuate
- Systems must address PII concerns up front to access the data
- Because the architecture is handling compliance, systems automatically benefit from future enhancements