Efficient Columnar Storage with Apache Parquet

Ranganathan Balashanmugam, Aconex. Apache: Big Data North America 2017.


  1. EFFICIENT COLUMNAR STORAGE WITH APACHE PARQUET. Ranganathan Balashanmugam, Aconex. Apache: Big Data North America 2017

  2. “The Tables Have Turned.”

  3. “The Tables Have Turned.” - 90°

  4. Context: ANALYTICAL OVER TRANSACTIONAL

  5. SIMPLE STRUCTURE

     Table:
       A   B   C
       a1  b1  c1
       a2  b2  c2
       a3  b3  c3

     row:      a1 b1 c1 a2 b2 c2 a3 b3 c3
     columnar: a1 a2 a3 b1 b2 b3 c1 c2 c3
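The two layouts on this slide can be sketched in a few lines of Python. This is only an illustration of the ordering of values, not of how Parquet stores bytes; the function names `row_layout` and `columnar_layout` are hypothetical.

```python
# The 3x3 table from the slide: columns A, B, C.
columns = ["A", "B", "C"]
rows = [
    ("a1", "b1", "c1"),
    ("a2", "b2", "c2"),
    ("a3", "b3", "c3"),
]

def row_layout(rows):
    """Store values record by record: every field of row 1, then row 2, ..."""
    return [value for row in rows for value in row]

def columnar_layout(rows):
    """Store values column by column: all of A, then all of B, then all of C."""
    return [value for column in zip(*rows) for value in column]

row_major = row_layout(rows)
column_major = columnar_layout(rows)
print(row_major)     # a1 b1 c1 a2 b2 c2 a3 b3 c3
print(column_major)  # a1 a2 a3 b1 b2 b3 c1 c2 c3

# Reading only column B is one contiguous slice in the columnar layout ...
n = len(rows)
b_index = columns.index("B")
b_columnar = column_major[b_index * n:(b_index + 1) * n]
# ... but a strided walk across every record in the row layout.
b_row = row_major[b_index::len(columns)]
print(b_columnar, b_row)
```

The contiguous slice is the reason an analytical query that touches few columns tends to favor the columnar layout.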

  6. “Optimizing the disk seeks.”

  7. “The Tables Have Turned.” - 90°. Efficient writes. Efficient reads.

  8. SIMPLE STRUCTURE

     Table:
       A   B   C
       a1  b1  c1
       a2  b2  c2
       a3  b3  c3

     row:      a1 b1 c1 a2 b2 c2 a3 b3 c3
     columnar: a1 a2 a3 b1 b2 b3 c1 c2 c3

  9. NESTED AND REPEATED STRUCTURES
     How to preserve in column store?

     Schema:
       message Document {
         required int64 DocId;
         optional group Links {
           repeated int64 Backward;
           repeated int64 Forward;
         }
         repeated group Name {
           repeated group Language {
             required string Code;
             optional string Country;
           }
           optional string Url;
         }
       }

     Record:
       DocId: 10
       Links:
         Forward: 20
         Forward: 40
         Forward: 60
       Name:
         Language:
           Code: 'en-us'
           Country: 'us'
         Language:
           Code: 'en'
         Url: 'http://a'
       Name:
         Url: 'http://b'
       Name:
         Language:
           Code: 'en-gb'
           Country: 'gb'

  10. Columnar

  11. Problem statement: “Get these performance benefits for nested structures into Hadoop ecosystem.”

  12. MOTIVATION
      * Allow complex nested data structures
      * Very efficient compression and encoding schemes
      * Support many frameworks

  13. “Columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.”

  14. DESIGN GOALS
      * Interoperability
      * Space efficiency
      * Query efficiency

  15. Layers (src: https://techmagie.wordpress.com/2016/07/15/data-storage-and-modelling-in-hadoop/)
      * object models: avro, thrift, protobuf, pig, hive, scalding, …
      * object model converters: parquet avro, thrift, protobuf, pig, hive, …
      * column readers
      * storage: parquet binary file format

  16. parquet file (src: https://github.com/Parquet/parquet-format)

      header - Magic number (4 bytes): “PAR1”
      row group 0
        column a: Page 1, Page 2, …, Page n
        column b: Page 1, Page 2, …, Page n
        column c: Page 1, Page 2, …, Page n
      row group 1
      …
      row group n
      footer

      Row group:
        * group of rows
        * max size buffered while writing
        * 50 MB - 1 GB
      Column (within a row group):
        * data of one column in row group
        * can be read independently
      Page:
        * good enough for compression efficiency
        * good enough to read
        * 8 KB - 1 MB

  17. footer (src: https://github.com/Parquet/parquet-format)

      file metadata (ThriftCompactProtocol)
        - version
        - schema
        row group 0 metadata
          - total byte size
          - total rows
          column 0
            * type/path/encodings/codec
            * num of values
            * compressed/uncompressed size
            * data page offset
            * index page offset
          column 1
            * …
        row group … metadata
          …
      footer length (4 bytes)
      Magic number (4 bytes): “PAR1”
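The tail of the file described on this slide (metadata, then a 4-byte footer length, then the 4-byte magic number) suggests how a reader can find the metadata without scanning the file. The following is a minimal sketch, assuming the footer length is a little-endian unsigned 32-bit integer as stated in the parquet-format specification. The helper `locate_footer` and the fake file contents are hypothetical; real metadata is Thrift-encoded and is not parsed here.

```python
import struct

MAGIC = b"PAR1"

def locate_footer(data: bytes):
    """Return (start, length) of the file metadata inside `data`."""
    # Smallest possible file: leading magic + footer length + trailing magic.
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic number")
    # The 4-byte footer length sits just before the trailing magic number.
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    start = len(data) - 8 - footer_len
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return start, footer_len

# Hypothetical stand-in for a file: the payloads below are placeholders,
# not real row groups or real Thrift metadata.
fake_row_groups = b"\x00" * 32
fake_metadata = b"not-real-thrift"
fake_file = (
    MAGIC
    + fake_row_groups
    + fake_metadata
    + struct.pack("<I", len(fake_metadata))
    + MAGIC
)

start, length = locate_footer(fake_file)
print(start, length)
print(fake_file[start:start + length])
```

Keeping the metadata at the end lets a writer stream row groups out first and record their offsets only once they are known.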

  18. page (src: https://techmagie.wordpress.com/2016/07/15/data-storage-and-modelling-in-hadoop/)
      * page header (ThriftCompactProtocol)
      * repetition levels (encoded and compressed)
      * definition levels (encoded and compressed)
      * values (encoded and compressed)
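To give a feel for "encoded and compressed", here is a deliberately simplified sketch. It uses plain run-length encoding for the levels and `zlib` for the values as stand-ins; Parquet's actual level encoding is an RLE/bit-packing hybrid and its codecs are configurable, so none of this reflects the real on-disk bytes. The level and value lists are the `Links.Forward` column from the Dremel example later in the deck.

```python
import zlib

def rle_encode(levels):
    """Collapse consecutive equal levels into [level, run_length] pairs."""
    runs = []
    for level in levels:
        if runs and runs[-1][0] == level:
            runs[-1][1] += 1
        else:
            runs.append([level, 1])
    return runs

def rle_decode(runs):
    """Expand [level, run_length] pairs back into the flat level list."""
    return [level for level, count in runs for _ in range(count)]

# Links.Forward across the two example documents.
repetition_levels = [0, 1, 1, 0]
definition_levels = [2, 2, 2, 2]
values = [20, 40, 60, 80]

rep_runs = rle_encode(repetition_levels)
def_runs = rle_encode(definition_levels)
print(rep_runs)  # [[0, 1], [1, 2], [0, 1]]
print(def_runs)  # [[2, 4]]

# Serialize the int64 values as 8-byte little-endian integers (an assumed
# plain layout for this sketch), then compress the buffer.
raw_values = b"".join(v.to_bytes(8, "little", signed=True) for v in values)
compressed = zlib.compress(raw_values)

# Both steps are lossless.
assert rle_decode(rep_runs) == repetition_levels
assert zlib.decompress(compressed) == raw_values
```

Levels are small integers with long runs, which is why a run-length style encoding suits them well.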

  19. Schema

      message Document {
        required int64 DocId;
        optional group Links {
          repeated int64 Backward;
          repeated int64 Forward;
        }
        repeated group Name {
          repeated group Language {
            required string Code;
            optional string Country;
          }
          optional string Url;
        }
      }
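Each leaf of this schema becomes one column, and the Dremel encoding used later in the deck needs two bounds per column: the maximum repetition level (how many `repeated` fields lie on the path) and the maximum definition level (how many non-`required` fields lie on the path). A minimal sketch of that computation follows; the tuple representation `SCHEMA` and the function `max_levels` are hypothetical, not a Parquet API.

```python
# (name, repetition, children) with children=None for a leaf column.
SCHEMA = [
    ("DocId", "required", None),
    ("Links", "optional", [
        ("Backward", "repeated", None),
        ("Forward", "repeated", None),
    ]),
    ("Name", "repeated", [
        ("Language", "repeated", [
            ("Code", "required", None),
            ("Country", "optional", None),
        ]),
        ("Url", "optional", None),
    ]),
]

def max_levels(fields, path=(), rep=0, definition=0):
    """Yield (column path, max repetition level, max definition level)."""
    for name, kind, children in fields:
        r = rep + (1 if kind == "repeated" else 0)
        d = definition + (0 if kind == "required" else 1)
        if children is None:
            yield ".".join(path + (name,)), r, d
        else:
            yield from max_levels(children, path + (name,), r, d)

levels = list(max_levels(SCHEMA))
for column, r, d in levels:
    print(f"{column}: max R={r}, max D={d}")
```

A required field never adds to the definition level because it cannot be absent once its parent is present.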

  20. Documents (schema as on slide 19)

      Document 1:
        DocId: 10
        Links:
          Forward: 20
          Forward: 40
          Forward: 60
        Name:
          Language:
            Code: 'en-us'
            Country: 'us'
          Language:
            Code: 'en'
          Url: 'http://a'
        Name:
          Url: 'http://b'
        Name:
          Language:
            Code: 'en-gb'
            Country: 'gb'

      Document 2:
        DocId: 20
        Links:
          Backward: 10
          Backward: 30
          Forward: 80
        Name:
          Url: 'http://c'

  21. Problem statement: “Can we represent it in columnar format efficiently and read them back to their original nested data structure?”

  22. “Can we represent it in columnar format efficiently and read them back to their original nested data structure?” Dremel encoding

  23. Fill all the nulls (schema as on slide 19)

      Document 1:
        DocId: 10
        Links:
          Forward: 20
          Forward: 40
          Forward: 60
          <Backward>: NULL
        Name:
          Language:
            Code: 'en-us'
            Country: 'us'
          Language:
            Code: 'en'
            <Country>: NULL
          Url: 'http://a'
        Name:
          <Language>:
            <Code>: NULL
            <Country>: NULL
          Url: 'http://b'
        Name:
          Language:
            Code: 'en-gb'
            Country: 'gb'
          <Url>: NULL

      Document 2:
        DocId: 20
        Links:
          Backward: 10
          Backward: 30
          Forward: 80
        Name:
          <Language>:
            <Code>: NULL
            <Country>: NULL
          Url: 'http://c'

  24. Repetition and definition levels (documents and schema as on slide 23)

      R: In the path to the field, what is the last repeated field?
      D: In the path to the field, how many defined fields?

      COLUMNS            R  D  VALUE
      Links.Forward      0  2  20
      Links.Forward[1]   1  2  40
      Links.Forward[2]   1  2  60
      Links.Forward      0  2  80
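The four `Links.Forward` rows can be reproduced, and then read back into per-document lists, with a short sketch. It assumes documents are plain Python dicts; `stripe_links_forward` and `assemble_links_forward` are hypothetical names and handle this one column only, not arbitrary schemas.

```python
doc1 = {"DocId": 10, "Links": {"Forward": [20, 40, 60]}}
doc2 = {"DocId": 20, "Links": {"Backward": [10, 30], "Forward": [80]}}

def stripe_links_forward(doc):
    """Turn one document into (R, D, value) triples for Links.Forward."""
    links = doc.get("Links")
    if links is None:
        return [(0, 0, None)]   # nothing on the path is defined
    forward = links.get("Forward", [])
    if not forward:
        return [(0, 1, None)]   # Links is defined, Forward is not
    # First value starts a new record (R=0); later ones repeat at Forward (R=1).
    return [(0 if i == 0 else 1, 2, value) for i, value in enumerate(forward)]

def assemble_links_forward(triples):
    """Rebuild the list of Forward values for each record from the triples."""
    records = []
    for r, d, value in triples:
        if r == 0:
            records.append([])          # R=0 marks the start of a new record
        if d == 2:
            records[-1].append(value)   # D=2 means the value is really there
    return records

triples = stripe_links_forward(doc1) + stripe_links_forward(doc2)
for triple in triples:
    print(triple)

print(assemble_links_forward(triples))  # [[20, 40, 60], [80]]
```

This simplified reader returns an empty list both when `Links` is missing and when `Forward` is empty; a full reader would use the definition level to tell those cases apart.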

  25. Repetition and definition levels (documents and schema as on slide 23)

      R: In the path to the field, what is the last repeated field?
      D: In the path to the field, how many defined fields?

      COLUMNS                      R  D  VALUE
      Name.Language.Code                 en-us
      Name.Language[1].Code              en
      Name[1].[Language].[Code]          null
      Name[2].Language.Code              en-gb
      Name.Language.Code                 null

  26. Repetition and definition levels (documents and schema as on slide 23)

      R: In the path to the field, what is the last repeated field?
      D: In the path to the field, how many defined fields?

      COLUMNS                      R  D  VALUE
      Name.Language.Code           0  2  en-us
      Name.Language[1].Code              en
      Name[1].[Language].[Code]          null
      Name[2].Language.Code              en-gb
      Name.Language.Code                 null
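The slide fills in R and D for the first row only. Applying the two questions above to the remaining rows can be sketched as follows. The levels printed by this code are derived from those definitions and are not taken from the slide; the dict representation and the function `stripe_name_language_code` are hypothetical.

```python
doc1 = {
    "DocId": 10,
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://a"},
        {"Url": "http://b"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}
doc2 = {"DocId": 20, "Name": [{"Url": "http://c"}]}

def stripe_name_language_code(doc):
    """Turn one document into (R, D, value) triples for Name.Language.Code."""
    names = doc.get("Name", [])
    if not names:
        return [(0, 0, None)]                  # not even Name is defined
    triples = []
    for i, name in enumerate(names):
        r_name = 0 if i == 0 else 1            # repeating at Name gives R=1
        languages = name.get("Language", [])
        if not languages:
            triples.append((r_name, 1, None))  # only Name is defined, so D=1
            continue
        for j, language in enumerate(languages):
            r = r_name if j == 0 else 2        # repeating at Language gives R=2
            # Code is required, so it adds nothing to D beyond Name and Language.
            triples.append((r, 2, language["Code"]))
    return triples

triples = stripe_name_language_code(doc1) + stripe_name_language_code(doc2)
for r, d, value in triples:
    print(r, d, value)
```

Under these assumptions the first triple is `(0, 2, 'en-us')`, which agrees with the row the slide does fill in.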
