Parquet Modular Encryption
Gidon Gershinsky, IBM Research – Haifa Lab


SLIDE 1

Parquet Modular Encryption

Gidon Gershinsky

IBM Research – Haifa Lab

SLIDE 2

Speaker

Senior Architect at IBM Research – Haifa Lab, gidon@il.ibm.com
Leading role in the Apache Parquet work on the definition of the encryption format and its implementation

  • community work; folks from many companies are involved

A number of projects on secure analytics over encrypted data

  • connected-car and healthcare use cases
  • Apache Spark with Parquet encryption
  • Spark+AI Summit talk, 2018
SLIDE 3

Overview

  • Goals of this technology
  • Parquet encryption features
  • Sample use cases
  • How to use the Parquet encryption API
  • Basic integration with Apache Spark
  • Performance implications
  • Roadmap
SLIDE 4

Apache Parquet

A popular columnar storage format: encoding, compression, advanced data filtering

  • columnar projection: skip columns
  • predicate pushdown: skip files, row groups, or data pages

Performance benefits

  • less data to fetch from storage: I/O, latency
  • less data to process: CPU, latency

How to protect sensitive Parquet data?

  • in any storage – keeping projection/predicates, supporting column access control, data tamper-proofing, etc.

SLIDE 5

Parquet Encryption: Goals

Protect sensitive data-at-rest (in storage)

  • data privacy/confidentiality: encryption - hiding sensitive information
  • data integrity: tamper-proofing sensitive information
  • in any storage - untrusted, cloud or private, file system, object store, archives

Preserve performance of analytic engines

  • full Parquet capabilities (columnar projection, predicate pushdown, etc.) with encrypted data

Leverage encryption for fine-grained access control

  • per-column encryption keys
  • key-based access in any storage: private -> cloud -> archive
SLIDE 6

Parquet Encryption: Features

Privacy: Hiding sensitive information

  • Full encryption: all data and metadata modules
    • min/max values, schema, encryption key IDs, list of sensitive columns, etc.
  • Separate keys for sensitive columns
    • column data and metadata
    • column access control
  • Separate key for file-wide metadata
    • Parquet file footer – encrypted with the footer key
  • Storage server / admin never sees encryption keys or unencrypted data
    • "client-side" encryption
SLIDE 7

Parquet Encryption: Features

Privacy: Hiding sensitive information (continued)

  • Multiple encryption algorithms
    • different security and performance trade-offs
    • currently, two algorithms are defined and implemented
      • AES_GCM: encrypts and tamper-proofs everything (data and metadata)
      • AES_GCM_CTR: encrypts everything, tamper-proofs metadata only; can be useful on platforms without AES hardware acceleration, like Java 8
    • if you need a new one, talk to us
  • Optional plaintext footer mode for legacy readers
    • any (old) Parquet reader can access unencrypted columns
    • footer is unencrypted – but tamper-proofed (signed with the footer key)
SLIDE 8

Parquet Encryption: Features

Data integrity verification

  • File data and metadata are not tampered with
    • modifying data page contents
    • replacing one data page with another
  • File not replaced with a wrong file
    • unmodified – but e.g. outdated
    • sign file contents and file ID
  • Example: altering customer / billing data
  • Example: altering healthcare data (!) – patient record or medical sensor readings
  • AES GCM: "authenticated encryption"
    • implemented in hardware

Illustration: customers-sept-2019.part0.parquet replaced with the outdated customers-jan-2014.part0.parquet
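The tamper-proofing above comes from AES-GCM's authentication tag. A minimal, self-contained Java sketch (plain JDK crypto, not Parquet code) showing that GCM rejects a ciphertext with even a single flipped bit:

```java
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

public class GcmTamperDemo {
    public static boolean tamperDetected() throws Exception {
        byte[] key = new byte[16];                   // 128-bit key
        new SecureRandom().nextBytes(key);
        byte[] iv = new byte[12];                    // 96-bit GCM nonce (zero: demo only)
        SecretKeySpec keySpec = new SecretKeySpec(key, "AES");

        Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, keySpec, new GCMParameterSpec(128, iv));
        byte[] ct = enc.doFinal("patient record".getBytes(StandardCharsets.UTF_8));

        ct[3] ^= 1;                                  // attacker flips one bit of the ciphertext

        Cipher dec = Cipher.getInstance("AES/GCM/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, keySpec, new GCMParameterSpec(128, iv));
        try {
            dec.doFinal(ct);
            return false;                            // tampering went unnoticed
        } catch (AEADBadTagException e) {
            return true;                             // GCM tag verification failed, as expected
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(tamperDetected() ? "tampering detected" : "tampering NOT detected");
    }
}
```

The same mechanism protects each encrypted Parquet module: a modified page fails tag verification at read time.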

SLIDE 9

Current Status

  • Apache Parquet community work
  • Encryption specification approved in January 2019
    • signed off by the PMC
  • Specification and Thrift format merged
    • in apache/parquet-format master
    • part of the parquet-format-2.7.0 release pull request (also merged)
  • Implementation
    • C++ and Java code
    • pull requests being reviewed; some already merged
    • implementation and API closely follow the encryption specification
SLIDE 10

Parquet Encryption Usecases

Same as "Parquet use cases" – with sensitive column data

  • Data queries, analytic applications – in any industry
    • Spark/Hive/Presto with Parquet: a horizontal platform, not a vertical solution
  • Protect data privacy / confidentiality
    • personal data privacy
    • sensitive business data
    • regulations
  • Protect data integrity
    • business processes
      • wrong billing due to tampering with e.g. customer data
    • personal health
      • wrong treatment due to tampering with patient records or sensor readings
SLIDE 11

Connected Car Usecase

"RestAssured" – EU Horizon 2020 research project (N 731678)

Project partners

IBM, Adaptant, OCC, Thales, UDE, IT Innovation

Project use cases

  • usage-based car insurance, social services
  • encryption: protect personal data
  • integrity: prevent billing tampering

Spark+AI Summit EU 2018: demo shots with Spark/Parquet encryption

SLIDE 12

Healthcare Usecase

"ProTego" – EU Horizon 2020 research project (N 826284)

Project partners

St Raffaele hospital, Marina Salud hospital, IBM, GFI, ITI, UAH, IMEC, KUL, ICE

Project use cases

  • Queries / analytics on sensitive healthcare data
    • HL7 FHIR standard: maps nicely to Parquet
  • encryption: protect personal data
  • integrity: prevent tampering with diagnosis and treatment

SLIDE 13

Encryption API

  • Parquet API - without encryption

ParquetFileWriter fileWriter = new ParquetFileWriter(file_path, schema, …);

  • then write data

ParquetFileReader fileReader = ParquetFileReader.open(file_path, options);

  • then read data
  • Parquet API - with encryption

ParquetFileWriter fileWriter = new ParquetFileWriter(file_path, schema, …, fileEncryptionProperties);

  • then write data (just like before)

ParquetFileReader fileReader = ParquetFileReader.open(file_path, options, fileDecryptionProperties);

  • then read data (just like before)
SLIDE 14

File Encryption Properties

Trivial

  • encrypt all columns (and footer) with key0
  • tamper-proof encrypted content
  • enable columnar projection, predicate pushdown, etc

byte[] key0 = … // e.g. a 128-bit key (16 bytes)
FileEncryptionProperties fileEncryptionProps =
    FileEncryptionProperties.builder(key0).build();

SLIDE 15

File Encryption Properties

Basic

  • encrypt columnA with key1, columnB with key2 (and footer with key0)
  • differential column access control
  • assign key IDs (key metadata) for simplified key retrieval
  • tamper-proof encrypted content
  • enable columnar projection, predicate pushdown, etc
SLIDE 16

File Encryption Properties

Basic

  • encrypt columnA with key1, columnB with key2 (and the footer with key0)

byte[] key1 = … // e.g. a 128-bit key (16 bytes)
ColumnEncryptionProperties encrColumnA =
    ColumnEncryptionProperties.builder("columnA")
        .withKey(key1)
        .withKeyID("key1")
        .build();
// same for columnB; then the file properties:
FileEncryptionProperties fileEncryptionProps =
    FileEncryptionProperties.builder(key0)
        .withFooterKeyID("key0")
        .withEncryptedColumns(encryptedColumns) // map of column encryption properties
        .build();

SLIDE 17

File Encryption Properties

Advanced

  • Protect against file replacement attacks
  • Replacement with untampered but e.g. outdated file (table partition)

String fileID = "customers-sept-2019.part0";
byte[] aadPrefix = fileID.getBytes();
FileEncryptionProperties fileEncryptionProps =
    FileEncryptionProperties.builder(key0)
        .withFooterKeyID("key0")
        .withAADPrefix(aadPrefix)
        .withEncryptedColumns(encryptedColumns)
        .build();
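The AAD prefix protects against file replacement because GCM mixes "additional authenticated data" into the authentication tag. A stdlib-only sketch of the idea (the helper names are illustrative, not Parquet internals): the same ciphertext fails to decrypt when a different file ID is presented as AAD.

```java
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

public class AadPrefixDemo {
    // Encrypt with the file ID as AAD; decryption must present the same ID.
    public static byte[] encrypt(byte[] key, byte[] iv, String fileId, byte[] data) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv));
        c.updateAAD(fileId.getBytes(StandardCharsets.UTF_8));
        return c.doFinal(data);
    }

    // Returns true only if the ciphertext authenticates under the given file ID.
    public static boolean decrypts(byte[] key, byte[] iv, String fileId, byte[] ct) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv));
        c.updateAAD(fileId.getBytes(StandardCharsets.UTF_8));
        try { c.doFinal(ct); return true; } catch (AEADBadTagException e) { return false; }
    }

    public static void main(String[] args) throws Exception {
        byte[] key = new byte[16]; new SecureRandom().nextBytes(key);
        byte[] iv = new byte[12];  new SecureRandom().nextBytes(iv);
        byte[] ct = encrypt(key, iv, "customers-sept-2019.part0",
                            "row data".getBytes(StandardCharsets.UTF_8));
        System.out.println(decrypts(key, iv, "customers-sept-2019.part0", ct)); // true
        System.out.println(decrypts(key, iv, "customers-jan-2014.part0", ct));  // false: wrong file ID
    }
}
```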

SLIDE 18

File Encryption Properties

Advanced

  • Allow legacy clients to read unencrypted columns in encrypted files
  • plaintext (unencrypted) footer mode
  • visible file metadata (schema, names of secret columns and of their keys, etc)
  • tamper-proof (sign) file metadata with footer key

FileEncryptionProperties fileEncryptionProps =
    FileEncryptionProperties.builder(key0)
        .withFooterKeyID("key0")
        .withPlaintextFooter()
        .withEncryptedColumns(encryptedColumns)
        .build();

SLIDE 19

File Encryption Properties

Advanced

  • Use alternative encryption algorithm
  • better performance in old Java versions
  • tamper-proofing metadata only (not data)

FileEncryptionProperties fileEncryptionProps =
    FileEncryptionProperties.builder(key0)
        .withFooterKeyID("key0")
        .withAlgorithm(ParquetCipher.AES_GCM_CTR_V1)
        .withEncryptedColumns(encryptedColumns)
        .build();

SLIDE 20

File Decryption Properties

Simpler than encryption properties

  • most of the details are specified in the file metadata

StringKeyIdRetriever keyRetriever = new StringKeyIdRetriever();
keyRetriever.putKey("key0", key0);
keyRetriever.putKey("key1", key1);
keyRetriever.putKey("key2", key2);
FileDecryptionProperties fileDecryptionProps =
    FileDecryptionProperties.builder()
        .withKeyRetriever(keyRetriever)
        .build();

SLIDE 21

File Decryption Properties

Advanced

  • Protect against file replacement attacks

String fileID = "customers-sept-2019.part0";
byte[] aadPrefix = fileID.getBytes();
FileDecryptionProperties fileDecryptionProps =
    FileDecryptionProperties.builder()
        .withKeyRetriever(keyRetriever)
        .withAADPrefix(aadPrefix)
        .build();

SLIDE 22

Beyond Low Level API

Low-level API – the full power of Parquet encryption

  • directly implements the approved specification features
  • enables any key management scheme
    • work with a KMS instead of explicit keys
    • you need to build one – choosing from many options for KMS, auth, and envelope encryption (data key wrapping)
  • if you know how – the Parquet low-level encryption API is all you need
    • no one-size-fits-all solution for KMS/auth/wrapping
SLIDE 23

Beyond Low Level API

Low-level API – the full power of Parquet encryption. In addition, there are helper tools (*) on top, for tasks like:

  • Work with KMS
  • Key management system service, on-prem or in any cloud
  • Envelope encryption
  • Data encrypted with random DEKs, DEKs encrypted (wrapped) with Master keys (kept in KMS)
  • Even better: double envelope encryption (minimize KMS interaction)
  • Rotation of Master keys
  • Encryption setup via Hadoop properties, instead of API
  • (*) open PR code – functional but not merged yet, subject to change
  • Feel free to use as examples for your key management code
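The envelope-encryption pattern above can be sketched with plain JDK crypto. This is an illustrative sketch, not the Parquet helper-tool API: a random DEK encrypts the data, and only a wrapped (master-key-encrypted) copy of the DEK is stored alongside it.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;
import java.util.Arrays;

public class EnvelopeDemo {
    // Wrap (encrypt) a fresh data key with the master key; store the wrapped key with the data.
    public static byte[] wrap(byte[] masterKey, byte[] iv, byte[] dataKey) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(masterKey, "AES"), new GCMParameterSpec(128, iv));
        return c.doFinal(dataKey);
    }

    // Unwrap (decrypt) the data key with the master key, retrieved from the KMS.
    public static byte[] unwrap(byte[] masterKey, byte[] iv, byte[] wrapped) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(masterKey, "AES"), new GCMParameterSpec(128, iv));
        return c.doFinal(wrapped);
    }

    public static boolean roundTrips() throws Exception {
        byte[] masterKey = new byte[16]; new SecureRandom().nextBytes(masterKey); // lives in the KMS
        byte[] iv = new byte[12];        new SecureRandom().nextBytes(iv);
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        byte[] dataKey = kg.generateKey().getEncoded();  // random per-file DEK
        byte[] wrapped = wrap(masterKey, iv, dataKey);   // only the wrapped key leaves the client
        return Arrays.equals(dataKey, unwrap(masterKey, iv, wrapped));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrips());
    }
}
```

A nice property of this design: rotating a master key only requires re-wrapping the DEKs, not re-encrypting the data itself.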
SLIDE 24

Hadoop Encryption Parameters

Hide low level API (* prototype – subject to change)

Mandatory parameters

  • "encryption.column.keys"
    • list of columns to encrypt, with master key IDs
    • syntax: "<masterKeyID>:<colName>,<colName>;<masterKeyID>:<colName>,…"
    • jointly defined for Parquet and ORC column encryption (HIVE-21848)
  • "encryption.footer.key"
    • master key ID for footer encryption/signing
  • "encryption.kms.client.class"
    • name of a class implementing the KmsClient interface
  • "encryption.key.access.token"
    • auth token that will be passed to the KMS

Optional parameters

  • "encryption.algorithm"
  • "encryption.file.id"
    • file replacement protection
  • "encryption.plaintext.footer"
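To illustrate the "encryption.column.keys" syntax, here is a small hypothetical parser (not part of Parquet) that turns the parameter value into a column-to-master-key-ID map:

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnKeyConfig {
    // Parse "<masterKeyID>:<colName>,<colName>;<masterKeyID>:<colName>" into column -> key ID.
    public static Map<String, String> parse(String value) {
        Map<String, String> columnToKey = new HashMap<>();
        for (String group : value.split(";")) {          // each group: one master key
            String[] parts = group.split(":", 2);
            String keyId = parts[0].trim();
            for (String column : parts[1].split(",")) {  // columns encrypted with that key
                columnToKey.put(column.trim(), keyId);
            }
        }
        return columnToKey;
    }

    public static void main(String[] args) {
        Map<String, String> m = parse("key1:columnA;key2:columnB,columnC");
        System.out.println(m.get("columnA")); // key1
        System.out.println(m.get("columnC")); // key2
    }
}
```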

SLIDE 25

Hadoop Decryption Parameters

Fewer parameters than for encryption

Mandatory parameters

  • "encryption.kms.client.class"
    • name of a class implementing the KmsClient interface
  • "encryption.key.access.token"
    • auth token that will be passed to the KMS

Optional parameters

  • "encryption.file.id"
    • file replacement protection
SLIDE 26

KMS Client Interface

public interface KmsClient {
  // get encryption key
  byte[] getKeyFromServer(String keyIdentifier);
  // OR: encrypt data key with master key (envelope encryption)
  String wrapDataKeyInServer(byte[] dataKey, String masterKeyIdentifier);
  // decrypt data key
  byte[] unwrapDataKeyInServer(String wrappedDataKey, String masterKeyIdentifier);
}
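For local experiments, the interface can be satisfied by an in-memory mock. This is an illustrative sketch only (a real client would call an actual KMS service, and the interface itself comes from an open PR and is subject to change); wrapping here is AES-GCM plus Base64, with the nonce prepended.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.ByteBuffer;
import java.security.SecureRandom;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Interface as shown on the slide (open PR, subject to change).
interface KmsClient {
    byte[] getKeyFromServer(String keyIdentifier) throws Exception;
    String wrapDataKeyInServer(byte[] dataKey, String masterKeyIdentifier) throws Exception;
    byte[] unwrapDataKeyInServer(String wrappedDataKey, String masterKeyIdentifier) throws Exception;
}

// In-memory mock: master keys in a map; wrap = AES-GCM encrypt + Base64 (12-byte IV prepended).
public class InMemoryKmsClient implements KmsClient {
    private final Map<String, byte[]> masterKeys = new HashMap<>();
    private final SecureRandom random = new SecureRandom();

    public void putMasterKey(String id, byte[] key) { masterKeys.put(id, key); }

    public byte[] getKeyFromServer(String keyIdentifier) {
        return masterKeys.get(keyIdentifier);
    }

    public String wrapDataKeyInServer(byte[] dataKey, String masterKeyIdentifier) throws Exception {
        byte[] iv = new byte[12];
        random.nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(masterKeys.get(masterKeyIdentifier), "AES"),
               new GCMParameterSpec(128, iv));
        byte[] ct = c.doFinal(dataKey);
        return Base64.getEncoder().encodeToString(
            ByteBuffer.allocate(iv.length + ct.length).put(iv).put(ct).array());
    }

    public byte[] unwrapDataKeyInServer(String wrappedDataKey, String masterKeyIdentifier) throws Exception {
        byte[] all = Base64.getDecoder().decode(wrappedDataKey);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(masterKeys.get(masterKeyIdentifier), "AES"),
               new GCMParameterSpec(128, all, 0, 12)); // IV is the first 12 bytes
        return c.doFinal(all, 12, all.length - 12);
    }

    public static void main(String[] args) throws Exception {
        InMemoryKmsClient kms = new InMemoryKmsClient();
        kms.putMasterKey("key0", new byte[16]);      // all-zero master key, demo only
        byte[] dek = new byte[16];
        String wrapped = kms.wrapDataKeyInServer(dek, "key0");
        System.out.println(java.util.Arrays.equals(dek, kms.unwrapDataKeyInServer(wrapped, "key0")));
    }
}
```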

SLIDE 27

Spark with Parquet Encryption

No changes in Spark code

  • For example: Spark 2.3.0 – replace Parquet-1.8.2 with Parquet-1.8.2-E (a couple of jar files)

Writing Parquet files in standard encryption format!

  • signed-off by community

Invoke encryption via Hadoop parameters

  • the Hadoop configuration is already passed from Spark to Parquet

KMS and envelope encryption supported

[Diagram: the Spark client passes an auth token to Parquet, which interacts with the KMS]

SLIDE 28

Parquet and Spark in IBM Cloud

IBM Analytics Engine

  • on-demand Spark (and Hadoop) clusters in the IBM Cloud

Watson Studio Spark Environments

  • cloud tools for data scientists and application developers
  • dedicated Spark cluster per Notebook

DB2 Event Store

  • rapidly ingest and analyze streaming data for time-series, event-driven and IoT use cases

  • store and retrieve with C++ Parquet

SQL Query Service

  • SQL-as-a-service on TBs of data in object storage; uses Spark, a data-skipping index, and extenders for time-series and location data
  • SQL-based ETL & analytics on TBs of object-storage data, with automated data pipelines

SLIDE 29

Performance Effect of Encryption

AES ciphers implemented in CPU hardware (AES-NI)

  • Gigabyte(s) per second
  • order(s) of magnitude faster than the "application stack" (app / framework / Parquet / compression / I/O)

C++

  • OpenSSL EVP libraries tap into AES-NI directly

Java

  • AES-NI support in HotSpot since Java 9
  • Java 11.0.4 – enhanced AES GCM decryption
  • Thank you Java folks!

Sensitive columns: ~ one in ten in a typical table

  • further reduction in encryption time

Benchmark example

  • Java 11.0.4, Intel Core i7
  • Parquet with SNAPPY compression
  • AES_GCM algorithm
  • Decryption overhead
    • all (19) columns encrypted: 3.6%
    • 2 columns encrypted: 0.7%
  • Reader app that does nothing (blackhole)
    • real apps: lower overhead!

Bottom line: Encryption won’t be your bottleneck

  • the dominant costs remain app workload, data I/O, encoding, and compression
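To check AES-GCM throughput on your own JVM (results vary with CPU, AES-NI support, and Java version), a rough micro-benchmark sketch; the numbers it prints are machine-specific, so none are claimed here:

```java
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;

public class GcmThroughput {
    // Encrypt a bufMb-megabyte buffer `rounds` times and return MB/s.
    public static double mbPerSecond(int bufMb, int rounds) throws Exception {
        byte[] key = new byte[16];
        new SecureRandom().nextBytes(key);
        byte[] data = new byte[bufMb * 1024 * 1024];
        new SecureRandom().nextBytes(data);
        SecretKeySpec keySpec = new SecretKeySpec(key, "AES");
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            byte[] iv = new byte[12];
            iv[0] = (byte) i;                    // unique nonce per round (demo only)
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, keySpec, new GCMParameterSpec(128, iv));
            c.doFinal(data);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return (double) bufMb * rounds / seconds;
    }

    public static void main(String[] args) throws Exception {
        System.out.printf("AES-GCM encrypt: %.0f MB/s%n", mbPerSecond(4, 25));
    }
}
```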
SLIDE 30

Roadmap

  • Complete Java and C++ open source implementations
  • Key management tools (PARQUET-1373)
  • High level interface to Parquet encryption (PARQUET-1568)
  • Assist in Spark, Hive, Presto integration
  • Data obfuscation / anonymization (PARQUET-1376)
SLIDE 31

Questions?

SLIDE 32

Backup

SLIDE 33

Existing Solutions