Zedstore Columnar storage for PostgreSQL Alexandra Wang, Soumyadeep - PowerPoint PPT Presentation

Zedstore Columnar storage for PostgreSQL Alexandra Wang, Soumyadeep Chakraborty VMware Greenplum 1

Agenda ● Goals ● Design internals with demo ● Performance ● Open areas of work ● Get involved! 2

Goals ● A column store that every Postgres user can use ● Fully leverage the on-disk adjacency of column-wise storage ○ Efficient and extensive compression ○ Queries on subsets of columns should be fast ● Optimized for OLAP workloads and bulk data ingestion ● Reasonable OLTP performance and feature parity with heap ● Fully MVCC, crash-safe and supports replication 3

Design 4

Design ● Leverages the Table AM API ● Leverages PostgreSQL infrastructure ○ 8KB (BLCKSZ) fixed block size ○ Buffer manager ○ WAL logging - with custom WAL records ● Each column is a B-Tree 5

Forest of B-Trees Attribute Tree Attribute Tree Attribute Tree Tid Tree attnum = 1 attnum = 2 attnum = n attnum = 0 A block on disk 6

ZSTid & B-Tree pages ● 64-bit tuple identifier, key of the B-Trees block 9 (lokey=1, hikey=MaxZSTid) ● 1-1 mapping with ItemPointer: 1 15922 24616 ○ ZSTidFromItemPointer(blk, off) = blk * 128 + off ○ Only 48 bits is used ● Purely logical ○ Does not tie tuple to physical location block 6 block 8 block 12 (lokey=1, (lokey=15922, (lokey=24616, ○ A tuple’s tid never changes hikey=15922) hikey=24616) hikey=MaxZSTid)

Demo 8

Demo 9

Page types Meta Page Free Pages 0 2 5 10 Blocks layout for table foo B-Tree Pages 0 1 2 3 4 5 6 7 Tid Tree Attribute Tree Attribute Tree 9 4 15 attnum = 0 attnum = 1 8 9 11 attnum = 2 10 12 13 14 15 16 17 18 19 1 3 7 11 13 14 16 29 6 8 12 20 21 22 23 { 17-28 24 25 26 27 28 29 5 10 2 UNDO Pages Toast Pages 10

INSERT INTO foo VALUES(‘hello’, 42, …) Backend private memory zedstoream_insert_internal() ‘foo’::regclass: (‘hello’, 42, …) tuple_buffer tid relid (hash reservation attbuffers key) Shared buffers Tid tree Attr tree Attr tree Attnum = 0 Attnum = 1 Attnum = 2 14

INSERT INTO foo VALUES(‘hello’, 42, …) Backend private memory zedstoream_insert_internal() ‘foo’::regclass: (‘hello’, 42, …) tuple_buffer tid relid (hash reservation attbuffers key) zsbt_tid_multi_insert(..,1,..) 5 Shared buffers Tid tree Attr tree Attr tree Attnum = 0 Attnum = 1 Attnum = 2 15

INSERT INTO foo VALUES(‘hello’, 42, …) Backend private memory zedstoream_insert_internal() ‘foo’::regclass: (‘hello’, 42, …) tuple_buffer tid relid (hash reservation attbuffers key) zsbt_attbuffer_spool(5, 42) zsbt_attbuffer_spool(5, “hello”) attbuff1 attbuff2 zsbt_tid_multi_insert(..,1,..) 5 buffer[60] buffer[60] tids and attstream_buffer tids and attstream_buffer datums datums Shared buffers Tid tree Attr tree Attr tree Attnum = 0 Attnum = 1 Attnum = 2 16

INSERT INTO foo VALUES(‘hello’, 42, …) Backend private memory zedstoream_insert_internal() ‘foo’::regclass: (‘hello’, 42, …) data cursor len tuple_buffer attstream_buffer2 tid firsttid, lasttid, attlen, relid (hash reservation attbuffers attbyval key) zsbt_attbuffer_spool(5, 42) zsbt_attbuffer_spool(5, “hello”) attbuff1 attbuff2 zsbt_tid_multi_insert(..,1,..) 5 buffer[60] buffer[60] tids and attstream_buffer tids and attstream_buffer datums datums Shared buffers Tid tree Attr tree Attr tree Attnum = 0 Attnum = 1 Attnum = 2 17

INSERT INTO foo VALUES(‘hello’, 42, …) Backend private memory zedstoream_insert_internal() ‘foo’::regclass: (‘hello’, 42, …) data cursor len tuple_buffer attstream_buffer2 tid firsttid, lasttid, attlen, relid (hash reservation attbuffers attbyval key) zsbt_attbuffer_spool(5, 42) zsbt_attbuffer_spool(5, “hello”) attbuff1 attbuff2 zsbt_tid_multi_insert(..,1,..) 5 buffer[60] buffer[60] tids and attstream_buffer tids and attstream_buffer datums datums zsbt_attbuffer_flush() zsbt_attbuffer_flush() Shared buffers Tid tree Attr tree Attr tree Attnum = 0 Attnum = 1 Attnum = 2 18

SELECT i, j FROM foo zedstoream_beginscan(...) zedstoream_beginscan_with_column_projection(...) zedstoream_getnextslot(...) zsbt_tid_begin_scan(...) zsbt_attr_begin_scan(...) zsbt_tid_scan_next(...) For all attributes in col list: zsbt_attr_fetch(...) /* fill in visibility info */ return slot 19

SELECT i, j FROM foo Tid tree Attnum = 0 zedstoream_beginscan(...) zsbt_tid_scan_next() 1-10 11-20 zedstoream_beginscan_with_column_projection(...) 5 21-30 31-40 zedstoream_getnextslot(...) zedstoream_getnextslot() zsbt_tid_begin_scan(...) zsbt_attr_begin_scan(...) zsbt_tid_scan_next(...) For all attributes in col list: zsbt_attr_fetch(...) /* fill in visibility info */ return slot 20

SELECT i, j FROM foo Tid tree Attnum = 0 zedstoream_beginscan(...) zsbt_tid_scan_next() 1-10 11-20 zedstoream_beginscan_with_column_projection(...) 5 21-30 31-40 zedstoream_getnextslot(...) zedstoream_getnextslot() Attr tree zsbt_tid_begin_scan(...) Attnum = 1 ‘a’ Attname = i zsbt_attr_begin_scan(...) Type = char(1) zsbt_tid_scan_next(...) zsbt_attr_fetch(5, 1) For all attributes in col list: 4-6 ‘a’ zsbt_attr_fetch(...) /* fill in visibility info */ return slot 21

SELECT i, j FROM foo Tid tree Attnum = 0 zedstoream_beginscan(...) zsbt_tid_scan_next() 1-10 11-20 zedstoream_beginscan_with_column_projection(...) 5 21-30 31-40 zedstoream_getnextslot(...) zedstoream_getnextslot() Attr tree zsbt_tid_begin_scan(...) Attnum = 1 ‘a’ Attname = i zsbt_attr_begin_scan(...) Type = char(1) 42 zsbt_tid_scan_next(...) zsbt_attr_fetch(5, 1) For all attributes in col list: 4-6 ‘a’ zsbt_attr_fetch(...) /* fill in visibility info */ return slot Attr tree zsbt_attr_fetch(5, 7) Attnum = 7 Attname = j Type = int 4-6 42 22

TID tree leaf page layout ZSTidArrayItem PageHeaderData Header TIDs: UNDO slots UNDO slotwords ItemId ItemId ItemId ... - t_firsttid, t_lasttid - 64 bit simple-8b codeword - 0-2 unique UNDO - 64 bit ints packed with pd_lower - Sizes of other sections - Delta encoded pointers to UNDO pages each TID’s UNDO pointer - size of item itself - Typically 1-10 bits per TID - 2 bits per TID Free Space pd_upper ... Item Item Item Logical content: Physical content: pd_special ZSBtreePageOpaque TID | UNDO pointer TIDs | UNDO pointers | UNDO indexes 119 | ALL_VISIBLE 119, 1, 1, 2, 1, 1 | 7599293, 8000222 | V, 0, D, 0, 1, 1 Tid leaf page 120 | 7599293 121 | DEAD 123 | 7599293 124 | 8000222 125 | 8000222 23

Attribute tree leaf page layout ZSAttStream Header Chunk 0 Chunk 1 ... chunk N-1 Chunk N Header compressed chunks Header Content: - size - flags (compressed or not) - decompressed_size - t_lasttid Attr leaf page 24

Attribute tree leaf page layout ZSAttStream Header Chunk 0 Chunk 1 ... chunk N-1 Chunk N Header compressed chunks TIDs: Header Content: - 1-60 TIDs - size Datum 0 Datum 1 ... Datum N - 4-bit mode selector encodes - flags (compressed or not) nullity, TID lengths & datum lengths - decompressed_size - t_lasttid Logical content: Physical content: TID | column values TIDs | column values 31 | 31 31, 1, 1, 1,..., 1 | 31, 32, 33, 34, … 60 Attr leaf page 32 | 32 33 | 33 34 | 34 … 60 | 60 25

Demo 26

Attribute tree leaf w/ oversized datum ZSAttStream Header Chunk 0 Chunk 1 ... chunk N-1 Chunk N Raw datum cannot fit in a block An inline-compressed chunk Datum 0 toast_compress_datum() Yes 1 48-bit zstid zs_toast_header_inline mode selector: toast PGLZ compressed - 32-bit compressed_size (varlena_modes[14]) datum - 32-bit rawsize toast type: inline PGLZ compressed datum fit in a block? An externally toasted chunk Datum 0 1 48-bit zstid No mode selector: toast zs_toast_header_external Toast Pages (varlena_modes[14]) (first toast block number) toast type: toast page 27

Zedstore Columnar storage for PostgreSQL Alexandra Wang, Soumyadeep - PowerPoint PPT Presentation

Zedstore Columnar storage for PostgreSQL Alexandra Wang, Soumyadeep Chakraborty VMware Greenplum 1 Agenda Goals Design internals with demo Performance Open areas of work Get involved! 2 Goals A column store

R-Tree An R-tree is a depth-balanced tree Each node corresponds to a disk page Leaf

Overview of Storage and Indexing Chapter 8 Instructor: Vladimir Zadorozhny vladimir@sis.pitt.edu

Virtual banking beyond borders 2 M a r k e t C h a l l e n g e 68 MIL ILLION REFUGEES

Module 6: Planning Indexes Overview Introduction to Indexes Index Architecture How

Chapter 4 Tree-Structured Indexing ISAM and B + -trees Binary Search ISAM Architecture and

CS 61A Lecture 11: Trees Tammy Nguyen (tammynguyen@berkeley.edu) Thursday, 07/07 Announcements

Learning to Generate Fast Signal Processing Implementations Bryan Singer Joint work with Manuela

Modeling and Infering a Spatio-temporal Dynamic For Apple Scab in Orchards R. Senoussi * , C.

Parsimony Small Parsimony and Search Algorithms Genome 559: Introduction to Statistical and

Minimizing MAXLIVE For Spill-Free Register Allocation Shashi Deepa Arcot, Henry Gordon Dietz,

From the Consent of the Routed: Improving the Transparency of the RPKI. Ethan Heilman Danny

FINANCIAL AID 101 PRESENTER: CHERYL SUSZYNSKI LEAF ADVISOR csuszynski@leaf-ohio.org AGENDA

priorityQueue-ADTLPO.li Stores a collection of pairs ( item , priority ) are from - Priorities

e-Science Introduction Eric Yen e-Science Workshop, March 2011 Outline Workshop Overview

Snabb: Open Source Meets Dataplane RIPE77, October 2018, Amsterdam Andy Wingo | wingo@igalia.com

Bridging Theory and Practice of GANs MIX+GAN Ian Goodfellow, Sta ff Research Scientist, Google

1 ILP Ferrara sept 2018 Games 2 ILP Ferrara sept 2018 Interest of games for AI Excellent

Work-based Learning Why Now and Where to? Work rk-based Le Learnin ing: : Why Now? Interest

Tor and circumvention: Lessons learned Roger Dingledine The Tor Project https://torproject.org/

WEBINAR WEDNESDAY Negotiation and Navigation in a Post-Naviance World March 7, 2018 Todays

Simulation Data Mining for Supporting Bridge Design Ninth Australasian Data Mining Conference

Establishing Time for Professional Learning Transforming

Hardware accelerating Linux network functions Roopa Prabhu, Wilson Kok Proceedings of netdev

> > > > < > SANDALS CHURCH CAMPUS LEADERSHIP TRAINING > > >

Sambuz

Useful Links

Newsletter

Mail Us

Zedstore Columnar storage for PostgreSQL Alexandra Wang, Soumyadeep - PowerPoint PPT Presentation

Zedstore Columnar storage for PostgreSQL Alexandra Wang, Soumyadeep Chakraborty VMware Greenplum 1 Agenda Goals Design internals with demo Performance Open areas of work Get involved! 2 Goals A column store

R-Tree An R-tree is a depth-balanced tree Each node corresponds to a disk page Leaf

Overview of Storage and Indexing Chapter 8 Instructor: Vladimir Zadorozhny vladimir@sis.pitt.edu

Virtual banking beyond borders 2 M a r k e t C h a l l e n g e 68 MIL ILLION REFUGEES

Module 6: Planning Indexes Overview Introduction to Indexes Index Architecture How

Chapter 4 Tree-Structured Indexing ISAM and B + -trees Binary Search ISAM Architecture and

CS 61A Lecture 11: Trees Tammy Nguyen (tammynguyen@berkeley.edu) Thursday, 07/07 Announcements

Learning to Generate Fast Signal Processing Implementations Bryan Singer Joint work with Manuela

Modeling and Infering a Spatio-temporal Dynamic For Apple Scab in Orchards R. Senoussi * , C.

Parsimony Small Parsimony and Search Algorithms Genome 559: Introduction to Statistical and

Minimizing MAXLIVE For Spill-Free Register Allocation Shashi Deepa Arcot, Henry Gordon Dietz,

From the Consent of the Routed: Improving the Transparency of the RPKI. Ethan Heilman Danny

FINANCIAL AID 101 PRESENTER: CHERYL SUSZYNSKI LEAF ADVISOR csuszynski@leaf-ohio.org AGENDA

priorityQueue-ADTLPO.li Stores a collection of pairs ( item , priority ) are from - Priorities

e-Science Introduction Eric Yen e-Science Workshop, March 2011 Outline Workshop Overview

Snabb: Open Source Meets Dataplane RIPE77, October 2018, Amsterdam Andy Wingo | wingo@igalia.com

Bridging Theory and Practice of GANs MIX+GAN Ian Goodfellow, Sta ff Research Scientist, Google

1 ILP Ferrara sept 2018 Games 2 ILP Ferrara sept 2018 Interest of games for AI Excellent

Work-based Learning Why Now and Where to? Work rk-based Le Learnin ing: : Why Now? Interest

Tor and circumvention: Lessons learned Roger Dingledine The Tor Project https://torproject.org/

WEBINAR WEDNESDAY Negotiation and Navigation in a Post-Naviance World March 7, 2018 Todays

Simulation Data Mining for Supporting Bridge Design Ninth Australasian Data Mining Conference

Establishing Time for Professional Learning Transforming

Hardware accelerating Linux network functions Roopa Prabhu, Wilson Kok Proceedings of netdev

&gt; &gt; &gt; &gt; &lt; &gt; SANDALS CHURCH CAMPUS LEADERSHIP TRAINING &gt; &gt; &gt;

Sambuz

Useful Links

Newsletter

Mail Us

> > > > < > SANDALS CHURCH CAMPUS LEADERSHIP TRAINING > > >