zedstore
play

Zedstore Columnar storage for PostgreSQL Alexandra Wang, Soumyadeep - PowerPoint PPT Presentation

Zedstore Columnar storage for PostgreSQL Alexandra Wang, Soumyadeep Chakraborty VMware Greenplum 1 Agenda Goals Design internals with demo Performance Open areas of work Get involved! 2 Goals A column store


  1. Zedstore Columnar storage for PostgreSQL Alexandra Wang, Soumyadeep Chakraborty VMware Greenplum 1

  2. Agenda ● Goals ● Design internals with demo ● Performance ● Open areas of work ● Get involved! 2

  3. Goals ● A column store that every Postgres user can use ● Fully leverage the on-disk adjacency of column-wise storage ○ Efficient and extensive compression ○ Queries on subsets of columns should be fast ● Optimized for OLAP workloads and bulk data ingestion ● Reasonable OLTP performance and feature parity with heap ● Fully MVCC, crash-safe and supports replication 3

  4. Design 4

  5. Design ● Leverages the Table AM API ● Leverages PostgreSQL infrastructure ○ 8KB (BLCKSZ) fixed block size ○ Buffer manager ○ WAL logging - with custom WAL records ● Each column is a B-Tree 5

  6. Forest of B-Trees Attribute Tree Attribute Tree Attribute Tree Tid Tree attnum = 1 attnum = 2 attnum = n attnum = 0 A block on disk 6

  7. ZSTid & B-Tree pages ● 64-bit tuple identifier, key of the B-Trees block 9 (lokey=1, hikey=MaxZSTid) ● 1-1 mapping with ItemPointer: 1 15922 24616 ○ ZSTidFromItemPointer(blk, off) = blk * 128 + off ○ Only 48 bits is used ● Purely logical ○ Does not tie tuple to physical location block 6 block 8 block 12 (lokey=1, (lokey=15922, (lokey=24616, ○ A tuple’s tid never changes hikey=15922) hikey=24616) hikey=MaxZSTid)

  8. Demo 8

  9. Demo 9

  10. Page types Meta Page Free Pages 0 2 5 10 Blocks layout for table foo B-Tree Pages 0 1 2 3 4 5 6 7 Tid Tree Attribute Tree Attribute Tree 9 4 15 attnum = 0 attnum = 1 8 9 11 attnum = 2 10 12 13 14 15 16 17 18 19 1 3 7 11 13 14 16 29 6 8 12 20 21 22 23 { 17-28 24 25 26 27 28 29 5 10 2 UNDO Pages Toast Pages 10

  11. Page types Meta Page Free Pages 0 2 5 10 Blocks layout for table foo B-Tree Pages 0 1 2 3 4 5 6 7 Tid Tree Attribute Tree Attribute Tree 9 4 15 attnum = 0 attnum = 1 8 9 11 attnum = 2 10 12 13 14 15 16 17 18 19 1 3 7 11 13 14 16 29 6 8 12 20 21 22 23 { 17-28 24 25 26 27 28 29 5 10 2 UNDO Pages Toast Pages 11

  12. Page types Meta Page Free Pages 0 2 5 10 Blocks layout for table foo B-Tree Pages 0 1 2 3 4 5 6 7 Tid Tree Attribute Tree Attribute Tree 9 4 15 attnum = 0 attnum = 1 8 9 11 attnum = 2 10 12 13 14 15 16 17 18 19 1 3 7 11 13 14 16 29 6 8 12 20 21 22 23 { 17-28 24 25 26 27 28 29 5 10 2 UNDO Pages Toast Pages 12

  13. Page types Meta Page Free Pages 0 2 5 10 Blocks layout for table foo B-Tree Pages 0 1 2 3 4 5 6 7 Tid Tree Attribute Tree Attribute Tree 9 4 15 attnum = 0 attnum = 1 8 9 11 attnum = 2 10 12 13 14 15 16 17 18 19 1 3 7 11 13 14 16 29 6 8 12 20 21 22 23 { 17-28 24 25 26 27 28 29 5 10 2 UNDO Pages Toast Pages 13

  14. INSERT INTO foo VALUES(‘hello’, 42, …) Backend private memory zedstoream_insert_internal() ‘foo’::regclass: (‘hello’, 42, …) tuple_buffer tid relid (hash reservation attbuffers key) Shared buffers Tid tree Attr tree Attr tree Attnum = 0 Attnum = 1 Attnum = 2 14

  15. INSERT INTO foo VALUES(‘hello’, 42, …) Backend private memory zedstoream_insert_internal() ‘foo’::regclass: (‘hello’, 42, …) tuple_buffer tid relid (hash reservation attbuffers key) zsbt_tid_multi_insert(..,1,..) 5 Shared buffers Tid tree Attr tree Attr tree Attnum = 0 Attnum = 1 Attnum = 2 15

  16. INSERT INTO foo VALUES(‘hello’, 42, …) Backend private memory zedstoream_insert_internal() ‘foo’::regclass: (‘hello’, 42, …) tuple_buffer tid relid (hash reservation attbuffers key) zsbt_attbuffer_spool(5, 42) zsbt_attbuffer_spool(5, “hello”) attbuff1 attbuff2 zsbt_tid_multi_insert(..,1,..) 5 buffer[60] buffer[60] tids and attstream_buffer tids and attstream_buffer datums datums Shared buffers Tid tree Attr tree Attr tree Attnum = 0 Attnum = 1 Attnum = 2 16

  17. INSERT INTO foo VALUES(‘hello’, 42, …) Backend private memory zedstoream_insert_internal() ‘foo’::regclass: (‘hello’, 42, …) data cursor len tuple_buffer attstream_buffer2 tid firsttid, lasttid, attlen, relid (hash reservation attbuffers attbyval key) zsbt_attbuffer_spool(5, 42) zsbt_attbuffer_spool(5, “hello”) attbuff1 attbuff2 zsbt_tid_multi_insert(..,1,..) 5 buffer[60] buffer[60] tids and attstream_buffer tids and attstream_buffer datums datums Shared buffers Tid tree Attr tree Attr tree Attnum = 0 Attnum = 1 Attnum = 2 17

  18. INSERT INTO foo VALUES(‘hello’, 42, …) Backend private memory zedstoream_insert_internal() ‘foo’::regclass: (‘hello’, 42, …) data cursor len tuple_buffer attstream_buffer2 tid firsttid, lasttid, attlen, relid (hash reservation attbuffers attbyval key) zsbt_attbuffer_spool(5, 42) zsbt_attbuffer_spool(5, “hello”) attbuff1 attbuff2 zsbt_tid_multi_insert(..,1,..) 5 buffer[60] buffer[60] tids and attstream_buffer tids and attstream_buffer datums datums zsbt_attbuffer_flush() zsbt_attbuffer_flush() Shared buffers Tid tree Attr tree Attr tree Attnum = 0 Attnum = 1 Attnum = 2 18

  19. SELECT i, j FROM foo zedstoream_beginscan(...) zedstoream_beginscan_with_column_projection(...) zedstoream_getnextslot(...) zsbt_tid_begin_scan(...) zsbt_attr_begin_scan(...) zsbt_tid_scan_next(...) For all attributes in col list: zsbt_attr_fetch(...) /* fill in visibility info */ return slot 19

  20. SELECT i, j FROM foo Tid tree Attnum = 0 zedstoream_beginscan(...) zsbt_tid_scan_next() 1-10 11-20 zedstoream_beginscan_with_column_projection(...) 5 21-30 31-40 zedstoream_getnextslot(...) zedstoream_getnextslot() zsbt_tid_begin_scan(...) zsbt_attr_begin_scan(...) zsbt_tid_scan_next(...) For all attributes in col list: zsbt_attr_fetch(...) /* fill in visibility info */ return slot 20

  21. SELECT i, j FROM foo Tid tree Attnum = 0 zedstoream_beginscan(...) zsbt_tid_scan_next() 1-10 11-20 zedstoream_beginscan_with_column_projection(...) 5 21-30 31-40 zedstoream_getnextslot(...) zedstoream_getnextslot() Attr tree zsbt_tid_begin_scan(...) Attnum = 1 ‘a’ Attname = i zsbt_attr_begin_scan(...) Type = char(1) zsbt_tid_scan_next(...) zsbt_attr_fetch(5, 1) For all attributes in col list: 4-6 ‘a’ zsbt_attr_fetch(...) /* fill in visibility info */ return slot 21

  22. SELECT i, j FROM foo Tid tree Attnum = 0 zedstoream_beginscan(...) zsbt_tid_scan_next() 1-10 11-20 zedstoream_beginscan_with_column_projection(...) 5 21-30 31-40 zedstoream_getnextslot(...) zedstoream_getnextslot() Attr tree zsbt_tid_begin_scan(...) Attnum = 1 ‘a’ Attname = i zsbt_attr_begin_scan(...) Type = char(1) 42 zsbt_tid_scan_next(...) zsbt_attr_fetch(5, 1) For all attributes in col list: 4-6 ‘a’ zsbt_attr_fetch(...) /* fill in visibility info */ return slot Attr tree zsbt_attr_fetch(5, 7) Attnum = 7 Attname = j Type = int 4-6 42 22

  23. TID tree leaf page layout ZSTidArrayItem PageHeaderData Header TIDs: UNDO slots UNDO slotwords ItemId ItemId ItemId ... - t_firsttid, t_lasttid - 64 bit simple-8b codeword - 0-2 unique UNDO - 64 bit ints packed with pd_lower - Sizes of other sections - Delta encoded pointers to UNDO pages each TID’s UNDO pointer - size of item itself - Typically 1-10 bits per TID - 2 bits per TID Free Space pd_upper ... Item Item Item Logical content: Physical content: pd_special ZSBtreePageOpaque TID | UNDO pointer TIDs | UNDO pointers | UNDO indexes 119 | ALL_VISIBLE 119, 1, 1, 2, 1, 1 | 7599293, 8000222 | V, 0, D, 0, 1, 1 Tid leaf page 120 | 7599293 121 | DEAD 123 | 7599293 124 | 8000222 125 | 8000222 23

  24. Attribute tree leaf page layout ZSAttStream Header Chunk 0 Chunk 1 ... chunk N-1 Chunk N Header compressed chunks Header Content: - size - flags (compressed or not) - decompressed_size - t_lasttid Attr leaf page 24

  25. Attribute tree leaf page layout ZSAttStream Header Chunk 0 Chunk 1 ... chunk N-1 Chunk N Header compressed chunks TIDs: Header Content: - 1-60 TIDs - size Datum 0 Datum 1 ... Datum N - 4-bit mode selector encodes - flags (compressed or not) nullity, TID lengths & datum lengths - decompressed_size - t_lasttid Logical content: Physical content: TID | column values TIDs | column values 31 | 31 31, 1, 1, 1,..., 1 | 31, 32, 33, 34, … 60 Attr leaf page 32 | 32 33 | 33 34 | 34 … 60 | 60 25

  26. Demo 26

  27. Attribute tree leaf w/ oversized datum ZSAttStream Header Chunk 0 Chunk 1 ... chunk N-1 Chunk N Raw datum cannot fit in a block An inline-compressed chunk Datum 0 toast_compress_datum() Yes 1 48-bit zstid zs_toast_header_inline mode selector: toast PGLZ compressed - 32-bit compressed_size (varlena_modes[14]) datum - 32-bit rawsize toast type: inline PGLZ compressed datum fit in a block? An externally toasted chunk Datum 0 1 48-bit zstid No mode selector: toast zs_toast_header_external Toast Pages (varlena_modes[14]) (first toast block number) toast type: toast page 27

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend