Column-Stores vs. Row-Stores: How Different Are They Really? Daniel - PowerPoint PPT Presentation

Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale), Samuel Madden (MIT), Nabil Hachem (AvantGarde Consulting) June 12 th , 2008 Daniel Abadi -- Yale University

Row vs. Column-Stores Row-Store Column-Store Last First Street Last First Street Name Name E-mail Phone # Address Name Name E-mail Phone # Address + Only need to read + Easy to add a new in relevant data record − Might read in − Tuple writes might require unnecessary multiple seeks data Daniel Abadi -- Yale University

Column-Stores • Really good for read-mostly data warehouses � Lot’s of column scans and aggregations � Writes tend to be in batch � [CK85], [SAB+05], [ZBN+05], [HLA+06], [SBC+07] all verify this � Top 3 in TPC-H rankings (Exasol, ParAccel, and Kickfire) are column-stores � Factor of 5 faster on performance � Factor of 2 superior on price/performance Daniel Abadi -- Yale University

Data Warehouse DBMS Software • $4.5 billion industry (out of total $16 billion DBMS software industry) • Growing 10% annually Daniel Abadi -- Yale University

Momentum • Right solution for growing market � $$$$ • Vertica, ParAccel, Kickfire, Calpont, Infobright, and Exasol new entrants • Sybase IQ’s profits rapidly increasing • Yahoo’s world largest (multi-petabyte) data warehouse is a column-store (from Mahat Technologies acquisition) Daniel Abadi -- Yale University

Paper Looks At Key Question • How much of the buzz around column- stores just marketing hype? � Do you really need to buy Sybase IQ or Vertica? � How far will your current row-store take you? � Can you get column-store performance from a row- store? � Can you simulate a column-store in a row-store? Daniel Abadi -- Yale University

Paper Methodology • Comparing row-store vs. column-store is dangerous/borderline meaningless • Instead, compare row-store vs. row-store and column-store vs. column-store � Simulate a column-store inside of a row-store � Remove column-oriented features from column-store until it behaves like a row-store Daniel Abadi -- Yale University

Simulate Column-Store Inside Row-Store Last First Street Name Name E-mail Phone # Address Option A: Option B: Vertical Partitioning Index Every Column Last First Name Name E-mail Last Name Index First Name Index 1 1 1 2 2 2 3 3 3 … Daniel Abadi -- Yale University

Experiments • Star Schema Benchmark (SSBM) � Fact table contains 17 columns and 60,000,000 rows � 4 dimension tables, biggest one has 80,000 rows � Queries perform 2-4 joins between fact table and dimension tables, aggregate 1-2 columns from fact table � [OOC06] • Implemented by professional DBA � Original row-store plus 2 column-store simulations on same row-store product Daniel Abadi -- Yale University

SSBM Averages �� !�"�� #��$�� %��&�� Daniel Abadi -- Yale University

What’s Going On? • Vertically Partitioned Case � Tuple Sizes � Horizontal Partitioning • All Indexes Case � Tuple Reconstruction Daniel Abadi -- Yale University

Tuple Size TID Column Tuple TID Column TID Column Data Header Data Data 1 1 1 2 2 2 3 3 3 • Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns •Complete fact table takes up ~4 GB (compressed) •Vertically partitioned tables take up 0.7-1.1 GB (compressed) Daniel Abadi -- Yale University

Horizontal Partitioning • Fact table horizontally partitioned on year � Year is an element of the ‘Date’ dimension table � Most queries in SSBM have a predicate on year � Since vertically partitioned tables do not contain the ‘Date’ foreign key, row-store could not similarly partition them Daniel Abadi -- Yale University

What’s Going On? • Vertically Partitioned Case � Tuple Sizes � Horizontal Partitioning • All Indexes Case � Tuple Construction Daniel Abadi -- Yale University

Tuple Construction • Common type of query: � SELECT store_name, SUM(revenue) FROM Facts, Stores WHERE fact.store_id = stores.store_id AND stores.country = “Canada” GROUP BY store_name Daniel Abadi -- Yale University

Tuple Construction • Result of lower part of query plan is a set of TIDs that passed all predicates • Need to extract SELECT attributes at these TIDs � BUT: index maps value to TID � You really want to map TID to value (i.e., a vertical partition) � � Tuple construction is SLOW Daniel Abadi -- Yale University

So…. • All indexes approach is a poor way to simulate a column-store • Problems with vertical partitioning are NOT fundamental � Store tuple header in a separate partition � Allow virtual TIDs � Allow HP using a foreign key on a different VP • So can row-stores simulate column- stores? Daniel Abadi -- Yale University

Row-Store vs. Column-Store '�� Row-Store Row-Store (M V) C-Store 25.7 1 1 .7 4.4 Average Daniel Abadi -- Yale University

Column-Store Experiments • Start with column-store (C-Store) • Remove column-store-specific performance optimizations • End with column-store with a row-oriented query executer Daniel Abadi -- Yale University

Compression Quarter Quarter • Higher data value locality Q1 (Q1, 1, 300) Q1 in column-stores (Q2, 301, 350) Q1 � Better ratio � reduced I/O Q1 (Q3, 651, 500) Q1 • Can use schemes like (Q4, 1151, 600) Q1 run-length encoding Q1 … � Easy to operate on directly Q2 for improved performance Q2 ([AMF06]) Q2 Q2 … Daniel Abadi -- Yale University

Early vs. Late Materialization QUERY: Select + Aggregate SELECT custID,SUM(price) 4 2 2 7 FROM table 4 1 3 13 WHERE (prodID = 4) AND (storeID = 1) AND 4 3 3 42 GROUP BY custID 4 1 3 80 • Early Materialization: create Construct rows first. But: 4 2 2 7 � Poor memory bandwidth 4 1 3 13 utilization 4 3 3 42 � Lose opportunity for 4 1 3 80 price vectorized operation prodID storeIDcustID Daniel Abadi -- Yale University

Other Column-Store Optimizations • Invisible join � Column-store specific join � Optimizations for star schemas � Similar to a semi-join • Block Processing Daniel Abadi -- Yale University

Simplified Version of Results �� (�� '�� C-Store, No C-St ore, Early Original C-St ore Compression Mat erializat ion 4.4 14.9 40.7 Average Daniel Abadi -- Yale University

Conclusion • Might be possible to simulate a row-store in a column-store, BUT: � Need better support for vertical partitioning at the storage layer � Need support for column-specific optimizations at the executer level • Working with HP Labs to find out Daniel Abadi -- Yale University

Come Join the Yale DB Group! Daniel Abadi -- Yale University

Column-Stores vs. Row-Stores: How Different Are They Really? Daniel - PowerPoint PPT Presentation

Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale), Samuel Madden (MIT), Nabil Hachem (AvantGarde Consulting) June 12 th , 2008 Daniel Abadi -- Yale University Row vs. Column-Stores Row-Store Column-Store Last

Thursday, March 5 Director's Row 1 Director's Row 2 Director's Row 3 Director's Row 4

Friday, March 6 Director's Row 1 Director's Row 2 Director's Row 3 Director's Row 4 Director's

Row: The Third Scott Kircher Volition, Inc. Saints Row: The Third Saints Row 2 vs. The Third

Positional Update Handling in Column Stores Sandor Heman, Marcin Zukowski, Niels Nes,

A game of matching pennies column L R row T 2,0 0,1 B 0,1 1,0 People last names

The Creation of Saints Row Saints Row 's Open World Cityscape: 's Open World Cityscape: The

Session 12: git Citations and Slides in Markdown P . S. Langeslag 17 January 2019 No Headings

Row together Row in the right direction Row faster Jason Yip @jchyip jcyip@thoughtworks.com

Orthogonal projections of row spaces P B denotes the operator that projects the row space of a

Column vs. Row Stores for Data Manipulation in Hardware Oblivious CPU/GPU Database Systems Iya

Linear Algebra Vectors A column vector is a list of numbers stored vertically. The dimen-

Vectors and Matrices Vectors Defn. A matrix with one column is called a (column) vector . We

Right of Way Clean-up JPB Citizens Advisory Committee December 18, 2019 Agenda Item #10 Right

When (Low ) Pow er Really Matters When (Low ) Pow er Really Matters When (Low ) Pow er Really

Query Execution in Column-Stores Atte Hinkka Seminar on Columnar Databases, Fall 2012 1

Scaling Log-Structured KV-Stores featuring Monkey and Dostoevsky SIGMOD17 / SIGMOD18 Niv Dayan

robot 1 Constructing a decision network for the mail delivery 2 What should the robot do? mail

What is i i KP KP? ? i KP i KP: : i i - -Key Key- -Protocol Protocol What is i

Small time global exact null controllability of the incompressible Navier-Stokes equation with

Routing on rails with OpenStreetMap Michael Reichert (Nakaner) photo: Andre de, Wikimedia

Spam Is Bad John R. Levine Chair, IRTF ASRG Chair@asrg.sp.am http://asrg.sp.am Why is spam

Getting AAI WPs funded Who, how, why? Ann Harding, Jens

Strata: A Cross Media File System Youngjin Kwon , Henrique Fingler, Tyler Hunt, Simon Peter,

Advanced Topics of Mail Service Deal with Malicious Mail, including Virus, Phishing, Spam,