EFFICIENT COLUMNAR STORAGE WITH APACHE PARQUET
Ranganathan Balashanmugam, Aconex
Apache: Big Data North America 2017
“The Tables Have Turned.” - 90°
Context
ANALYTICAL OVER TRANSACTIONAL
SIMPLE STRUCTURE
Table:
  A   B   C
  a1  b1  c1
  a2  b2  c2
  a3  b3  c3
Row layout:      a1 b1 c1 a2 b2 c2 a3 b3 c3
Columnar layout: a1 a2 a3 b1 b2 b3 c1 c2 c3
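The two layouts on this slide can be sketched in a few lines of Python. This is only an illustration of the ordering of values; the function names `row_layout` and `columnar_layout` are made up for this sketch and are not part of Parquet.

```python
# The three-column table from the slide, one tuple per row.
rows = [
    ("a1", "b1", "c1"),
    ("a2", "b2", "c2"),
    ("a3", "b3", "c3"),
]


def row_layout(rows):
    """Store values record by record: a1 b1 c1 a2 b2 c2 ..."""
    return [value for row in rows for value in row]


def columnar_layout(rows):
    """Store values column by column: a1 a2 a3 b1 b2 b3 ..."""
    # zip(*rows) transposes the table, i.e. turns it by 90 degrees.
    return [value for column in zip(*rows) for value in column]


print(" ".join(row_layout(rows)))
print(" ".join(columnar_layout(rows)))
```

In the columnar order, all values of one column sit next to each other, so a query that needs only column A can read one contiguous run instead of skipping through every record.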
“Optimizing the disk seeks.”
“The Tables Have Turned.” - 90°
Row layout: efficient writes
Columnar layout: efficient reads
NESTED AND REPEATED STRUCTURES
How to preserve in column store?

Schema:
  message Document {
    required int64 DocId;
    optional group Links {
      repeated int64 Backward;
      repeated int64 Forward;
    }
    repeated group Name {
      repeated group Language {
        required string Code;
        optional string Country;
      }
      optional string Url;
    }
  }

Record:
  DocId: 10
  Links:
    Forward: 20
    Forward: 40
    Forward: 60
  Name:
    Language:
      Code: 'en-us'
      Country: 'us'
    Language:
      Code: 'en'
    Url: 'http://a'
  Name:
    Url: 'http://b'
  Name:
    Language:
      Code: 'en-gb'
      Country: 'gb'
Columnar
Problem statement: “Get these performance benefits for nested structures into the Hadoop ecosystem.”
MOTIVATION
* Allow complex nested data structures
* Very efficient compression and encoding schemes
* Support many frameworks
“Columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.”
DESIGN GOALS
* Interoperability
* Space efficiency
* Query efficiency
[Diagram: Parquet layers]
object models: avro, thrift, protobuf, pig, hive, scalding, ….
object model converters (parquet): avro, thrift, protobuf, pig, hive, ….
column readers
storage: parquet binary file format
src: https://techmagie.wordpress.com/2016/07/15/data-storage-and-modelling-in-hadoop/
parquet file
header - Magic number (4 bytes): “PAR1”
row group 0
  * group of rows
  * max size buffered while writing
  * 50 MB - 1 GB
  column a, column b, column c
    * data of one column in row group
    * can be read independently
    Page 1, Page 2, … Page n
      * 8 KB - 1 MB
      * good enough for compression efficiency
      * good enough to read
row group 1
… row group n
footer
src: https://github.com/Parquet/parquet-format
footer
file metadata (ThriftCompactProtocol)
  - version
  - schema
  row group 0 metadata
    - total byte size
    - total rows
    column 0
      * type/path/encodings/codec
      * num of values
      * compressed/uncompressed size
      * data page offset
      * index page offset
    column 1
      * …
  row group … metadata
    …
footer length (4 bytes)
Magic number (4 bytes): “PAR1”
src: https://github.com/Parquet/parquet-format
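The header and footer framing above is enough to locate the file metadata without scanning the row groups. The following is a minimal sketch using only the Python standard library. The function name `locate_footer` and the synthetic byte string are made up for illustration, and the little-endian byte order of the footer length is taken from the Parquet format specification rather than from the slide. Decoding the Thrift metadata itself is out of scope here.

```python
import struct

MAGIC = b"PAR1"  # 4-byte magic number at both ends of the file


def locate_footer(data: bytes):
    """Return (start, end) byte offsets of the file metadata inside `data`."""
    # Smallest possible framing: header magic + footer length + trailing magic.
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic number")
    # The 4 bytes before the trailing magic hold the footer length
    # (assumed little-endian unsigned 32-bit, per the format specification).
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    end = len(data) - 8
    start = end - footer_len
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return start, end


# Synthetic stand-in for a file: row-group and metadata bytes are fake.
fake_row_groups = b"\x00" * 20
fake_metadata = b"thrift-metadata"
blob = (
    MAGIC
    + fake_row_groups
    + fake_metadata
    + struct.pack("<I", len(fake_metadata))
    + MAGIC
)

start, end = locate_footer(blob)
print(start, end, blob[start:end])
```

Because the length sits at a fixed position from the end of the file, a reader can seek to the last 8 bytes, then to the metadata, and only then to the column chunks it actually needs.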
page
page header (ThriftCompactProtocol)
repetition levels
definition levels
values
(repetition levels, definition levels and values are encoded and compressed)
src: https://techmagie.wordpress.com/2016/07/15/data-storage-and-modelling-in-hadoop/
Schema
  message Document {
    required int64 DocId;
    optional group Links {
      repeated int64 Backward;
      repeated int64 Forward;
    }
    repeated group Name {
      repeated group Language {
        required string Code;
        optional string Country;
      }
      optional string Url;
    }
  }
Documents

Document 1:
  DocId: 10
  Links:
    Forward: 20
    Forward: 40
    Forward: 60
  Name:
    Language:
      Code: 'en-us'
      Country: 'us'
    Language:
      Code: 'en'
    Url: 'http://a'
  Name:
    Url: 'http://b'
  Name:
    Language:
      Code: 'en-gb'
      Country: 'gb'

Document 2:
  DocId: 20
  Links:
    Backward: 10
    Backward: 30
    Forward: 80
  Name:
    Url: 'http://c'
Problem statement: “Can we represent it in columnar format efficiently and read it back into its original nested data structure?”
Dremel encoding
fill all the nulls

Document 1:
  DocId: 10
  Links:
    Forward: 20
    Forward: 40
    Forward: 60
    <Backward>: NULL
  Name:
    Language:
      Code: 'en-us'
      Country: 'us'
    Language:
      Code: 'en'
      <Country>: NULL
    Url: 'http://a'
  Name:
    <Language>:
      <Code>: NULL
      <Country>: NULL
    Url: 'http://b'
  Name:
    Language:
      Code: 'en-gb'
      Country: 'gb'
    <Url>: NULL

Document 2:
  DocId: 20
  Links:
    Backward: 10
    Backward: 30
    Forward: 80
  Name:
    <Language>:
      <Code>: NULL
      <Country>: NULL
    Url: 'http://c'
R (repetition level): In the path to the field, what is the last repeated field?
D (definition level): In the path to the field, how many defined fields?

COLUMN             R  D  VALUE
Links.Forward      0  2  20
Links.Forward[1]   1  2  40
Links.Forward[2]   1  2  60
Links.Forward      0  2  80
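The R and D values for the Links.Forward column can be reproduced with a small sketch. It assumes the two documents are held as plain Python dicts with repeated fields as lists; the function name `shred_links_forward` and that dict representation are choices made for this illustration, not a Parquet API. Definition levels count only the optional group Links and the repeated field Forward, so the maximum is 2.

```python
doc1 = {"DocId": 10, "Links": {"Forward": [20, 40, 60]}}
doc2 = {"DocId": 20, "Links": {"Backward": [10, 30], "Forward": [80]}}


def shred_links_forward(record):
    """Return (R, D, value) triples for the Links.Forward column of one record."""
    links = record.get("Links")
    if links is None:
        # Optional group Links is missing: no field on the path is defined.
        return [(0, 0, None)]
    forward = links.get("Forward", [])
    if not forward:
        # Links is defined, but the repeated field Forward holds no values.
        return [(0, 1, None)]
    # The first value starts a new record (R=0); every further value
    # repeats at Forward, the only repeated field on the path (R=1).
    return [(0 if i == 0 else 1, 2, value) for i, value in enumerate(forward)]


column = shred_links_forward(doc1) + shred_links_forward(doc2)
for r, d, value in column:
    print(r, d, value)
```

An R of 0 marks the start of a new document, which is how a reader can split the flat column back into records.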
R (repetition level): In the path to the field, what is the last repeated field?
D (definition level): In the path to the field, how many defined fields?

COLUMN                      R  D  VALUE
Name.Language.Code          0  2  en-us
Name.Language[1].Code             en
Name[1].[Language].[Code]         null
Name[2].Language.Code             en-gb
Name.Language.Code                null
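The same two questions can be applied mechanically to Name.Language.Code, which has two repeated fields on its path (Name at level 1, Language at level 2). The sketch below again assumes the documents are plain Python dicts with repeated groups as lists; `shred_name_language_code` is a hypothetical helper, not a Parquet API. The slide fills in only the first row; the levels printed for the other rows are what this sketch computes from the stated rules.

```python
doc1 = {
    "DocId": 10,
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://a"},
        {"Url": "http://b"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}
doc2 = {"DocId": 20, "Name": [{"Url": "http://c"}]}


def shred_name_language_code(record):
    """Return (R, D, value) triples for the Name.Language.Code column."""
    names = record.get("Name", [])
    if not names:
        # No Name at all: nothing on the path is defined.
        return [(0, 0, None)]
    triples = []
    for name_index, name in enumerate(names):
        # First Name starts the record (R=0); later Names repeat at Name (R=1).
        r = 0 if name_index == 0 else 1
        languages = name.get("Language", [])
        if not languages:
            # Name is defined but Language is not: D=1, value is NULL.
            triples.append((r, 1, None))
            continue
        for lang_index, language in enumerate(languages):
            if lang_index > 0:
                r = 2  # repeating at Language, the second repeated field
            # Code is required, so it does not add to the definition level.
            triples.append((r, 2, language["Code"]))
    return triples


for r, d, value in shred_name_language_code(doc1) + shred_name_language_code(doc2):
    print(r, d, value)
```

The NULL rows carry no value but still record how much of the path existed, which is what lets a reader rebuild the empty Name entries in their original positions.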