Handling Personal Information in LinkedIn’s Content Ingestion System

Handling Personal Information in LinkedIn’s Content Ingestion System David Max Senior Software Engineer LinkedIn

About Me • Software Engineer at LinkedIn NYC since 2015 • Content Ingestion team • Office Hours – Thursday 11:30-12:00 David Max Senior Software Engineer LinkedIn www.linkedin.com/in/davidpmax/

About LinkedIn New York Engineering • Located in Empire State Building • Approximately 100 engineers and 1000 employees total • Multiple teams: front end, back end, and data science

Disclaimers • I’m not a lawyer • Some details omitted • I am not a spokesperson for official LinkedIn policy

Our Mission: Create economic opportunity for every member of the global workforce

LinkedIn • More than 546 million members • More than 70% of members reside outside the U.S. • World’s largest professional network • More than 200 countries and territories worldwide

General Data Protection Regulation • Applies to all companies worldwide that process personal data of EU citizens. • Widens definition of personal data. • Introduces restrictive data handling principles. • Enforceable from May 25, 2018.

Handling Personally Identifiable Information (PII) • Data Minimization: Limit personal data collection, storage, usage • Consent: Cannot use collected data for a different purpose • Retention: Do not hold data longer than necessary • Deletion: Must delete data upon request

Handling PII in Content Ingestion • Content Ingestion: Babylonia • Data Protection: Data Minimization, Consent, Retention, Deletion

What is Content Ingestion? Content Ingestion Babylonia

Babylonia Content Ingestion

Babylonia Content Ingestion example • url: https://www.youtube.com/watch?v=MS3c9hz0bRg • title: "SATURN 2017 Keynote: Software is Details" • image: https://i.ytimg.com/vi/MS3c9hz0bRg/hqdefault.jpg?sq poaymwEYCKgBEF5IVfKriqkDCwgBFQAAiEIYAXAB\\u0026rs=AOn4CLClwjQlBmMeoRCePtHaThN-qXRHqg

What is Content Ingestion? • Extracts metadata from web pages • Source of Truth for 3rd party content • Also contains metadata for some public 1st party content • Used by LinkedIn services for sharing, decorating, and embedding content • Data also feeds into content understanding and relevance models
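As an illustration of "extracts metadata from web pages", here is a minimal sketch that pulls a title and image out of a page's HTML. It assumes the page exposes Open Graph `<meta>` tags; the slides do not say how Babylonia actually extracts metadata, and the names `MetadataParser` and `extract_metadata` are hypothetical.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects <title> text and Open Graph <meta> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content"):
                # og:title -> "title", og:image -> "image", and so on.
                self.metadata[prop[3:]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # <title> is only a fallback; an og:title value always wins.
        if self._in_title and data.strip():
            self.metadata.setdefault("title", data.strip())


def extract_metadata(url, html):
    """Return a flat metadata record (hypothetical shape) for one page."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {"url": url, **parser.metadata}
```

The record it returns has the same flat url/title/image shape as the example slide, which is the only part taken from the talk.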

How does PII get into Babylonia?

Ingesting 1st party pages containing publicly viewable member PII • Profile pages • Publish posts • SlideShare content

When a Member Account is Closed • What happens: Babylonia (along with other systems) is notified that the member’s account is closed; other systems take down the member’s content (i.e. public profile page, publish posts, etc.) • What Babylonia needs to do: Remove scraped data relating to the member pages that have been taken down; notify downstream systems that might be holding a copy of the data

Babylonia Datasets • Diagram components: Babylonia Content Ingestion, Espresso Database, ETL, HDFS Datasets, Brooklin, Data Change Events

Downstream and Upstream Datasets • Diagram components: Online Service, Espresso Database, ETL, HDFS, Offline and Nearline jobs, Brooklin, Data Change Events • Examples shown: profile, 1st party web page, article publishing

Challenges of member PII in Babylonia • Need to identify URLs that contain a member’s PII. • My post might contain your PII • Connection between member and the URL resides in the upstream system

Option #1: Require Upstream Systems to Notify Babylonia • Pros: Simple – Babylonia waits to be told specifically which URLs should be purged; Babylonia only does extra work when a URL needs to be purged; puts responsibility where the knowledge is • Cons: Requires additional work by every system that exposes PII in publicly accessible web pages; if the notification is missed, how will Babylonia know?; 1st party URLs sometimes change as upstream systems are changed – need to correctly handle old URLs too

Option #2: Actively Refetch Every 1st Party URL • Pros: Simple logic: Page gone? Purge the page; requires little additional work from upstream systems; works also for old 1st party URLs • Cons: There are a lot of 1st party URLs in Babylonia; continuous polling of all 1st party URLs consumes a lot of resources just for the sake of the very few URLs that are actually affected; extra work to avoid false positives or false negatives

Option #3: Eliminate Member PII in Babylonia • Pros: The easiest data to delete is data that isn’t in your system to begin with; gets closer to Single Source of Truth (SSOT) – better for consistency, not only for compliance • Cons: Babylonia is relied upon by numerous systems to have content for URLs – excluding 1st party content will affect member experience; no substitute currently available for all 1st party content; difficult to achieve based on URL – can’t always tell by looking at a URL if it resolves to 1st party content (e.g. shortlinks)

Blended Approach • Option 1 - Having upstream systems notify is best, but might miss some pages • Option 2 - Active refetch is thorough but expensive. Must use to catch pages that won’t support notifications • Option 3 - Some pages won’t work with active refetch. For example, pages that still return an HTTP status code 200 even when the data has been removed. These must be blocked

Classification of Ingested URLs • 3rd Party URL • 1st Party URL: Blocked, or Whitelisted • Whitelisted: Actively Refetched, or Notified by Upstream
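The classification can be sketched as a small decision function. The host set and path prefixes below are hypothetical placeholders (the slides do not give the real rules), and the sketch assumes the URL has already been resolved to its final form, since shortlinks cannot be classified by inspection.

```python
from enum import Enum
from urllib.parse import urlsplit


class UrlClass(Enum):
    THIRD_PARTY = "3rd party"
    BLOCKED = "1st party, blocked"
    ACTIVELY_REFETCHED = "1st party, whitelisted, actively refetched"
    NOTIFIED_BY_UPSTREAM = "1st party, whitelisted, notified by upstream"


# Hypothetical configuration, for illustration only.
FIRST_PARTY_HOSTS = {"www.example.com"}
NOTIFIED_PREFIXES = ("/articles/",)   # upstream sends purge notifications
REFETCHED_PREFIXES = ("/profiles/",)  # page returns 404 once deleted


def classify(url):
    """Place a resolved URL into one of the four categories."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in FIRST_PARTY_HOSTS:
        return UrlClass.THIRD_PARTY
    if parts.path.startswith(NOTIFIED_PREFIXES):
        return UrlClass.NOTIFIED_BY_UPSTREAM
    if parts.path.startswith(REFETCHED_PREFIXES):
        return UrlClass.ACTIVELY_REFETCHED
    # Restriction is the default: unknown 1st party URLs are blocked.
    return UrlClass.BLOCKED
```

Falling through to `BLOCKED` means a new kind of 1st party page is excluded until someone adds it to one of the whitelisted groups.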

Option 1 – Upstream Notification • Upstream system sends a Kafka message • Babylonia consumes message and purges data • Open source - kafka.apache.org
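A minimal sketch of the consuming side, leaving out the Kafka client itself. The message format (JSON with a `urls` list), the dict-like `store`, and the `notify_downstream` callback are all assumptions made for illustration; the slides do not describe the real message schema.

```python
import json


def handle_purge_message(raw_message, store, notify_downstream):
    """Purge the URLs named in one upstream notification.

    raw_message: JSON bytes or str, assumed to look like {"urls": [...]}.
    store: dict-like mapping of URL -> scraped metadata (hypothetical).
    notify_downstream: callable that receives the list of purged URLs.
    """
    message = json.loads(raw_message)
    purged = []
    for url in message.get("urls", []):
        # pop() with a default makes a repeated notification a harmless no-op.
        if store.pop(url, None) is not None:
            purged.append(url)
    if purged:
        notify_downstream(purged)
    return purged
```

Making the handler idempotent matters because a message queue may deliver the same notification more than once.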

Option 2 – Active Refetching • Diagram components: Espresso Database, ETL, HDFS, Offline job, Refetch URL table, Refetch job, Refetch process, Kafka Push messages, UPDATE, Takedown Requests for deleted pages
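The "Page gone? Purge the page" rule can be sketched as below. The function names, the dict-like `store`, and the choice of 404 and 410 as "gone" statuses are assumptions for illustration, not the actual refetch job. Network failures and server errors are deliberately not treated as deletions, which is one way to reduce false positives.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for a URL (follows redirects)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def refetch_and_purge(urls, fetch_status, store, gone_statuses=(404, 410)):
    """Refetch each URL and purge stored data for pages that are gone."""
    purged = []
    for url in urls:
        try:
            status = fetch_status(url)
        except OSError:
            # A network failure says nothing about the page; retry later.
            continue
        if status in gone_statuses:
            store.pop(url, None)
            purged.append(url)
    return purged
```

In use, `refetch_and_purge(urls, http_status, store)` would run over the refetch URL list; passing the fetcher as a parameter keeps the purge rule testable without network access.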

Option 3 – Whitelist • Block all 1st party URLs that can’t meet minimal requirements • Main requirement: must return a 404 for an invalid or deleted URL • Ensures new 1st party URLs are onboarded before being ingested

Managing PII in Datasets • Diagram components: Espresso Database, ETL, HDFS Offline Datasets

Espresso Datasets • What is Espresso? LinkedIn distributed NoSQL database; data stored in Avro format (JSON); indexed by specific primary key fields • Challenges: Reference to PII not always in the key; ETL snapshots of Espresso Dataset become offline Datasets

Offline (HDFS) Datasets • Challenges: Files of Avro (JSON) records • Need to read whole record to see if it has PII • Files not conducive to removing one record from the middle • Dataset can be source for downstream jobs that also need to be purged
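Because a record cannot simply be cut out of the middle of a file, a purge amounts to rewriting the file without the affected records. The sketch below shows that pattern on a local JSON-lines file, used here as a stand-in for Avro files on HDFS; the function names are hypothetical and matching is by exact string equality only.

```python
import json
import os
import tempfile


def contains_url(value, purged_urls):
    """Recursively check whether any string in a decoded record is a purged URL."""
    if isinstance(value, str):
        return value in purged_urls
    if isinstance(value, dict):
        return any(contains_url(v, purged_urls) for v in value.values())
    if isinstance(value, list):
        return any(contains_url(v, purged_urls) for v in value)
    return False


def purge_records(path, purged_urls):
    """Rewrite a JSON-lines file without records that reference a purged URL."""
    removed = 0
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, encoding="utf-8") as src, tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", dir=directory, delete=False
    ) as dst:
        for line in src:
            if not line.strip():
                continue
            # The whole record must be decoded: the URL may sit in any field.
            if contains_url(json.loads(line), purged_urls):
                removed += 1
            else:
                dst.write(line if line.endswith("\n") else line + "\n")
    os.replace(dst.name, path)  # swap in the filtered copy
    return removed
```

Every file has to be read in full even when only one record is affected, which is the cost the slide points to.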

WhereHows • Data discovery and lineage tool • Central location for all schemas • Document meanings of each column • Trace downstream/upstream lineage of datasets • Tag every column that can contain member reference or PII • Data Discovery: Which datasets contain member PII? • Open Source - github.com/linkedin/wherehows

Dali (Data Access at LinkedIn) • Interface for accessing datasets • Combines dataset schema with WhereHows metadata • Defines output virtual dataset while preserving data tags • Supports defining virtual datasets where PII is excluded or obfuscated • Diagram components: WhereHows Metadata, Raw Dataset, Dali Reader
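The idea of a virtual dataset that excludes or obfuscates PII can be sketched with plain dictionaries. This is not Dali's API; the tag names, `FIELD_TAGS`, and `virtual_view` are hypothetical, and hashing is shown only as one simple form of obfuscation.

```python
import hashlib

# Hypothetical field tags of the kind a metadata catalog might hold.
FIELD_TAGS = {"url": "INGESTED_URL", "author_name": "MEMBER_PII", "title": None}


def virtual_view(records, tags, mode="exclude"):
    """Yield copies of records with PII-tagged fields dropped or hashed."""
    for record in records:
        view = {}
        for field, value in record.items():
            if tags.get(field) != "MEMBER_PII":
                view[field] = value
            elif mode == "obfuscate":
                digest = hashlib.sha256(str(value).encode("utf-8"))
                view[field] = digest.hexdigest()
            # mode == "exclude": the tagged field is left out entirely.
        yield view
```

Note that this sketch passes untagged fields through unchanged, so it is only as good as the tagging; a stricter variant would drop any field that has no tag at all.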

Access Control List (ACL) • Restricts access to PII data to a known list of authorized systems • We only approve access to systems that can handle PII properly • Ensures that member PII can’t leak into untracked systems/datasets • Acts as a list of downstream services
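The double role of the ACL (gatekeeper and list of downstreams) can be sketched in a few lines. The class and the system names are hypothetical and only illustrate the idea.

```python
class AccessControlList:
    """Tracks which systems are approved to read PII data."""

    def __init__(self, authorized):
        self._authorized = set(authorized)

    def check(self, system):
        """Raise PermissionError unless the system has been approved."""
        if system not in self._authorized:
            raise PermissionError(f"{system} is not approved to read PII data")

    def downstreams(self):
        """The same list doubles as the set of systems to notify on purge."""
        return sorted(self._authorized)
```

For example, `AccessControlList({"feed", "search"}).downstreams()` gives the systems that would need to hear about a purge.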

Keeping Track of Personal Information in Babylonia • WhereHows: Field tagging for fields containing PII; know where the PII is • Dali: Downstreams use Dali, which preserves the WhereHows tagging on new virtual datasets; keeps tags with the data as it moves from one dataset to another • ACL: Control the spread of PII data only to authorized readers; serves as a list of the current downstream systems to notify when data is purged

Apache Gobblin • Framework for transforming large datasets • Data lifecycle management • Uses WhereHows tags to identify data in our Espresso or offline datasets that needs to be purged • Open source - gobblin.apache.org

WhereHows and Gobblin • Tagging in WhereHows: Created tags representing ingested content URLs in WhereHows • Enables downstream systems to onboard with Espresso auto purge and Gobblin by tagging columns in their tables as containing a URL or Ingested Content URN (Uniform Resource Name)
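Tag-driven purging can be sketched as follows: once a downstream table declares which columns hold an ingested URL, a generic job can purge it without knowing anything else about the table. This is not the Gobblin or WhereHows API; the tag name, the in-memory tables, and `purge_by_tags` are assumptions for illustration.

```python
def purge_by_tags(tables, column_tags, purged_urls, url_tag="INGESTED_URL"):
    """Delete rows whose URL-tagged columns hold a purged URL.

    tables: {table_name: [row_dict, ...]} (in-memory stand-in for datasets).
    column_tags: {table_name: {column_name: tag}} (hypothetical tag catalog).
    Returns the number of rows removed per table.
    """
    removed = {}
    for name, rows in tables.items():
        tags = column_tags.get(name, {})
        url_columns = [col for col, tag in tags.items() if tag == url_tag]
        kept = [
            row for row in rows
            if not any(row.get(col) in purged_urls for col in url_columns)
        ]
        removed[name] = len(rows) - len(kept)
        rows[:] = kept  # mutate in place so callers see the purge
    return removed
```

A table with no tagged columns is left untouched, which is why onboarding consists of adding the tags.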

Compliance Comes First • Choose an implementation where restriction is the default until proven safe • Whitelisting ensures all allowed 1st party URLs meet a minimum technical bar for ingestion • Simplicity of active refetching helps keep the bar low enough to include most content safely

Bigger Picture • Added constraints to the system • Developer restrictions • Made certain kinds of things harder to do
