Parquet Modular Encryption
Gidon Gershinsky
IBM Research – Haifa Lab
Speaker: Senior Architect at IBM Research – Haifa Lab (gidon@il.ibm.com). Leading role in the Apache Parquet work on the definition of the encryption format and its implementation.
A number of projects on secure analytics over encrypted data.
A popular columnar storage format: encoding, compression and advanced data filtering deliver performance benefits.
How to protect sensitive Parquet data?
Privacy, integrity, tamper-proofing, etc.
Protect sensitive data-at-rest (in storage)
Preserve performance of analytic engines with encrypted data
Leverage encryption for fine-grained access control
Privacy: hiding sensitive information. Only the sensitive columns, etc., need to be encrypted; the rest of the file can stay as unencrypted data.
Privacy: Hiding sensitive information (continued)
Could be useful on platforms without AES hardware acceleration, such as Java 8.
Data integrity verification – e.g., of a financial record or medical sensor readings.
customers-jan-2014.part0.parquet customers-sept-2019.part0.parquet
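The file names above matter for integrity: Parquet's AES-GCM mode lets a reader bind the ciphertext to additional authenticated data (AAD) such as the file identity, so a swapped or renamed file fails decryption. A minimal JDK-only sketch of this GCM/AAD behavior (the class name and key handling are illustrative, not Parquet internals):

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

public class GcmAadDemo {
    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);                       // 128-bit key, as in the examples below
        SecretKey key = kg.generateKey();

        byte[] nonce = new byte[12];        // 96-bit nonce, standard for GCM
        new SecureRandom().nextBytes(nonce);
        byte[] aad = "customers-sept-2019.part0".getBytes(StandardCharsets.UTF_8);
        byte[] plaintext = "sensitive column data".getBytes(StandardCharsets.UTF_8);

        // Encrypt: the ciphertext ends with a 128-bit authentication tag
        Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, nonce));
        enc.updateAAD(aad);                 // bind the file identity to the ciphertext
        byte[] ciphertext = enc.doFinal(plaintext);

        // Decrypt with the wrong AAD (e.g. a swapped file): the tag check fails
        Cipher dec = Cipher.getInstance("AES/GCM/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, nonce));
        dec.updateAAD("customers-jan-2014.part0".getBytes(StandardCharsets.UTF_8));
        try {
            dec.doFinal(ciphertext);
            System.out.println("tampering NOT detected");
        } catch (javax.crypto.AEADBadTagException e) {
            System.out.println("tampering detected");
        }
    }
}
```

Parquet builds its real AAD from, roughly, the file-level AAD prefix plus per-module coordinates; the sketch only shows why a wrong AAD makes the tag check fail.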
Same as "Parquet Use Cases" – with sensitive column data
"RestAssured" – EU Horizon 2020 research project (No. 731678)
Project partners
IBM, Adaptant, OCC, Thales, UDE, IT Innovation
Project use cases
Spark&AI Summit EU 2018: demo shots with Spark/Parquet Encryption
"ProTego" – EU Horizon 2020 research project (No. 826284)
Project partners
St Raffaele hospital, Marina Salud hospital, IBM, GFI, ITI, UAH, IMEC, KUL, ICE
Project use cases: healthcare data protection (e.g. during treatment)
ParquetFileWriter fileWriter = new ParquetFileWriter(file_path, schema, …);
ParquetFileReader fileReader = ParquetFileReader.open(file_path, options);
ParquetFileWriter fileWriter = new ParquetFileWriter(file_path, schema, …, fileEncryptionProperties);
ParquetFileReader fileReader = ParquetFileReader.open(file_path, options, fileDecryptionProperties);
Trivial
byte[] key0 = … // e.g. 128-bit key (16 bytes)
FileEncryptionProperties fileEncryptionProps =
    FileEncryptionProperties.builder(key0).build();
Basic
byte[] key1 = … // e.g. 128-bit key (16 bytes)
ColumnEncryptionProperties encrColumnA = ColumnEncryptionProperties
    .builder("columnA")
    .withKey(key1)
    .withKeyID("key1")
    .build();
// same for column B; then the file properties:
FileEncryptionProperties fileEncryptionProps = FileEncryptionProperties
    .builder(key0)
    .withFooterKeyID("key0")
    .withEncryptedColumns(encryptedColumns) // list (map) of column encryption properties
    .build();
Advanced
String fileID = "customers-sept-2019.part0";
byte[] aadPrefix = fileID.getBytes();
FileEncryptionProperties fileEncryptionProps = FileEncryptionProperties
    .builder(key0)
    .withFooterKeyID("key0")
    .withAADPrefix(aadPrefix)
    .withEncryptedColumns(encryptedColumns)
    .build();
Advanced
FileEncryptionProperties fileEncryptionProps = FileEncryptionProperties
    .builder(key0)
    .withFooterKeyID("key0")
    .withPlaintextFooter()
    .withEncryptedColumns(encryptedColumns)
    .build();
Advanced
FileEncryptionProperties fileEncryptionProps = FileEncryptionProperties
    .builder(key0)
    .withFooterKeyID("key0")
    .withAlgorithm(ParquetCipher.AES_GCM_CTR_V1)
    .withEncryptedColumns(encryptedColumns)
    .build();
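AES_GCM_CTR_V1 keeps GCM for the footer and module metadata but encrypts the data pages with AES-CTR, which skips the GCM authentication math and can be noticeably faster on JVMs without accelerated GCM (e.g. Java 8). A JDK-only round-trip illustration of CTR mode (the zeroed key and IV are demo placeholders, not how Parquet derives its nonces):

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CtrDemo {
    public static void main(String[] args) throws Exception {
        SecretKeySpec key = new SecretKeySpec(new byte[16], "AES"); // demo key only
        byte[] iv = new byte[16];                                   // demo IV only
        byte[] page = "page of column values".getBytes(StandardCharsets.UTF_8);

        // Encrypt a "page" with AES-CTR: a pure stream cipher, no authentication tag
        Cipher enc = Cipher.getInstance("AES/CTR/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] ct = enc.doFinal(page);

        // Decrypt with the same key/IV and verify the round trip
        Cipher dec = Cipher.getInstance("AES/CTR/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        System.out.println(Arrays.equals(dec.doFinal(ct), page)); // true
    }
}
```

The trade-off: CTR-encrypted pages lose per-page tamper detection, which is why the metadata modules stay under GCM in this mode.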
Simpler than encryption properties
StringKeyIdRetriever keyRetriever = new StringKeyIdRetriever();
keyRetriever.putKey("key0", key0);
keyRetriever.putKey("key1", key1);
keyRetriever.putKey("key2", key2);
FileDecryptionProperties fileDecryptionProps = FileDecryptionProperties
    .builder()
    .withKeyRetriever(keyRetriever)
    .build();
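Conceptually, the retriever is a lookup from the key ID stored in the file metadata to the key bytes; StringKeyIdRetriever plays this role in the API above. A hypothetical stand-alone sketch of that lookup (names are illustrative, not the Parquet interface):

```java
import java.util.HashMap;
import java.util.Map;

public class KeyLookupDemo {
    // Stand-in for the retriever concept: key ID (from file metadata) -> key bytes
    static final Map<String, byte[]> keys = new HashMap<>();

    static byte[] getKey(String keyId) {
        byte[] key = keys.get(keyId);
        if (key == null) {
            // A reader without this key cannot decrypt the column - this is
            // how encryption doubles as fine-grained access control
            throw new IllegalArgumentException("Unknown key ID: " + keyId);
        }
        return key;
    }

    public static void main(String[] args) {
        keys.put("key0", new byte[16]); // footer key
        keys.put("key1", new byte[16]); // column key
        System.out.println(getKey("key1").length); // 16
    }
}
```

In the real API the retriever can also call out to a key management service instead of an in-memory map.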
Advanced
String fileID = "customers-sept-2019.part0";
byte[] aadPrefix = fileID.getBytes();
FileDecryptionProperties fileDecryptionProps = FileDecryptionProperties
    .builder()
    .withKeyRetriever(keyRetriever)
    .withAADPrefix(aadPrefix)
    .build();
Low-level API – the full power of Parquet encryption, including envelope encryption (data key wrapping). In addition, helper tools (*) are provided on top for common tasks.
Helper tools hide the low-level API (* prototype – subject to change)
Mandatory parameters
Optional parameters
HIVE-21848
Fewer parameters than for encryption
Mandatory parameters
Optional parameters
public interface KmsClient {
    // get an encryption key from the KMS server
    byte[] getKeyFromServer(String keyIdentifier);

    // OR: envelope encryption -
    // encrypt (wrap) a data key with a master key
    String wrapDataKeyInServer(byte[] dataKey, String masterKeyIdentifier);

    // decrypt (unwrap) a data key
    byte[] unwrapDataKeyInServer(String wrappedDataKey, String masterKeyIdentifier);
}
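The wrap/unwrap pair above is envelope encryption: a per-file (or per-column) data key is encrypted under a master key that never leaves the KMS. In a local sketch, the JDK's AES key wrap (RFC 3394) can stand in for the server-side operation; a real KmsClient implementation would issue network calls to a KMS instead:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.util.Arrays;
import java.util.Base64;

public class EnvelopeDemo {
    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey masterKey = kg.generateKey(); // held by the KMS, never leaves it
        SecretKey dataKey = kg.generateKey();   // generated per file or column

        // "wrapDataKeyInServer": encrypt the data key under the master key
        Cipher wrap = Cipher.getInstance("AESWrap");
        wrap.init(Cipher.WRAP_MODE, masterKey);
        String wrapped = Base64.getEncoder().encodeToString(wrap.wrap(dataKey));
        // the wrapped key can now be stored alongside the Parquet file

        // "unwrapDataKeyInServer": recover the data key for a reader
        Cipher unwrap = Cipher.getInstance("AESWrap");
        unwrap.init(Cipher.UNWRAP_MODE, masterKey);
        SecretKey recovered = (SecretKey) unwrap.unwrap(
                Base64.getDecoder().decode(wrapped), "AES", Cipher.SECRET_KEY);

        System.out.println(Arrays.equals(recovered.getEncoded(), dataKey.getEncoded()));
    }
}
```

The benefit: master keys stay in one place, while cheap per-file data keys can be rotated or revoked by re-wrapping only the small wrapped-key blobs.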
No changes in Spark code – just replace the Parquet jars with Parquet-1.8.2-E (a couple of jar files).
Writing Parquet files in standard encryption format!
Invoke encryption via Hadoop parameters
[Slide diagram: the Spark client authenticates (AUTH) to the KMS and obtains a token; KMS and envelope encryption are supported in Spark/Parquet.]
IBM Analytics Engine
Watson Studio Spark Environments
DB2 Event Store – time series, event-driven and IoT use cases
SQL Query Service – data skipping index & extenders for time series & location data; data & automated data pipelines
AES ciphers implemented in CPU hardware (AES-NI)
Encryption is a small part of the full workload stack (App/Framework/Parquet/compression/IO), in both the C++ and Java implementations.
Sensitive columns: ~ one in ten in a typical table
Benchmark example
Bottom line: Encryption won’t be your bottleneck
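That claim is easy to sanity-check locally: a rough JDK-only micro-benchmark of AES-GCM over 1 MiB buffers typically shows throughput far above what a Parquet reader needs when only about one column in ten is encrypted (figures vary by JVM and AES-NI support; the class name and parameters are illustrative):

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;

public class GcmThroughput {
    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();
        byte[] page = new byte[1 << 20];            // 1 MiB "page" of data
        new SecureRandom().nextBytes(page);
        byte[] nonce = new byte[12];

        int rounds = 64;                            // 64 MiB total
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            nonce[0] = (byte) i;                    // unique nonce per round (demo only)
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, nonce));
            c.doFinal(page);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("AES-GCM: %.0f MiB/s%n", rounds / seconds);
    }
}
```

This is a crude single-threaded loop, not a rigorous benchmark (no warm-up, no JMH), but it gives a ballpark for the ciphers the format relies on.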