column stores vs row stores how different are they really
play

Column-Stores vs. Row-Stores: How Different Are They Really? Daniel - PowerPoint PPT Presentation

Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale), Samuel Madden (MIT), Nabil Hachem (AvantGarde Consulting) June 12 th , 2008 Daniel Abadi -- Yale University Row vs. Column-Stores Row-Store Column-Store Last


  1. Column-Stores vs. Row-Stores: How Different Are They Really? Daniel Abadi (Yale), Samuel Madden (MIT), Nabil Hachem (AvantGarde Consulting) June 12 th , 2008 Daniel Abadi -- Yale University

  2. Row vs. Column-Stores Row-Store Column-Store Last First Street Last First Street Name Name E-mail Phone # Address Name Name E-mail Phone # Address + Only need to read + Easy to add a new in relevant data record − Might read in − Tuple writes might require unnecessary multiple seeks data Daniel Abadi -- Yale University

  3. Column-Stores • Really good for read-mostly data warehouses � Lot’s of column scans and aggregations � Writes tend to be in batch � [CK85], [SAB+05], [ZBN+05], [HLA+06], [SBC+07] all verify this � Top 3 in TPC-H rankings (Exasol, ParAccel, and Kickfire) are column-stores � Factor of 5 faster on performance � Factor of 2 superior on price/performance Daniel Abadi -- Yale University

  4. Data Warehouse DBMS Software • $4.5 billion industry (out of total $16 billion DBMS software industry) • Growing 10% annually Daniel Abadi -- Yale University

  5. Momentum • Right solution for growing market � $$$$ • Vertica, ParAccel, Kickfire, Calpont, Infobright, and Exasol new entrants • Sybase IQ’s profits rapidly increasing • Yahoo’s world largest (multi-petabyte) data warehouse is a column-store (from Mahat Technologies acquisition) Daniel Abadi -- Yale University

  6. Paper Looks At Key Question • How much of the buzz around column- stores just marketing hype? � Do you really need to buy Sybase IQ or Vertica? � How far will your current row-store take you? � Can you get column-store performance from a row- store? � Can you simulate a column-store in a row-store? Daniel Abadi -- Yale University

  7. Paper Methodology • Comparing row-store vs. column-store is dangerous/borderline meaningless • Instead, compare row-store vs. row-store and column-store vs. column-store � Simulate a column-store inside of a row-store � Remove column-oriented features from column-store until it behaves like a row-store Daniel Abadi -- Yale University

  8. Simulate Column-Store Inside Row-Store Last First Street Name Name E-mail Phone # Address Option A: Option B: Vertical Partitioning Index Every Column Last First Name Name E-mail Last Name Index First Name Index 1 1 1 2 2 2 3 3 3 … Daniel Abadi -- Yale University

  9. Experiments • Star Schema Benchmark (SSBM) � Fact table contains 17 columns and 60,000,000 rows � 4 dimension tables, biggest one has 80,000 rows � Queries perform 2-4 joins between fact table and dimension tables, aggregate 1-2 columns from fact table � [OOC06] • Implemented by professional DBA � Original row-store plus 2 column-store simulations on same row-store product Daniel Abadi -- Yale University

  10. SSBM Averages ����� ����� �������������� ����� ����� ���� ��� ��������!�"����������� ����������#��$����� ���������������� ��������� %���&�� ���� ���� ����� ������� Daniel Abadi -- Yale University

  11. What’s Going On? • Vertically Partitioned Case � Tuple Sizes � Horizontal Partitioning • All Indexes Case � Tuple Reconstruction Daniel Abadi -- Yale University

  12. Tuple Size TID Column Tuple TID Column TID Column Data Header Data Data 1 1 1 2 2 2 3 3 3 • Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns •Complete fact table takes up ~4 GB (compressed) •Vertically partitioned tables take up 0.7-1.1 GB (compressed) Daniel Abadi -- Yale University

  13. Horizontal Partitioning • Fact table horizontally partitioned on year � Year is an element of the ‘Date’ dimension table � Most queries in SSBM have a predicate on year � Since vertically partitioned tables do not contain the ‘Date’ foreign key, row-store could not similarly partition them Daniel Abadi -- Yale University

  14. What’s Going On? • Vertically Partitioned Case � Tuple Sizes � Horizontal Partitioning • All Indexes Case � Tuple Construction Daniel Abadi -- Yale University

  15. Tuple Construction • Common type of query: � SELECT store_name, SUM(revenue) FROM Facts, Stores WHERE fact.store_id = stores.store_id AND stores.country = “Canada” GROUP BY store_name Daniel Abadi -- Yale University

  16. Tuple Construction • Result of lower part of query plan is a set of TIDs that passed all predicates • Need to extract SELECT attributes at these TIDs � BUT: index maps value to TID � You really want to map TID to value (i.e., a vertical partition) � � Tuple construction is SLOW Daniel Abadi -- Yale University

  17. So…. • All indexes approach is a poor way to simulate a column-store • Problems with vertical partitioning are NOT fundamental � Store tuple header in a separate partition � Allow virtual TIDs � Allow HP using a foreign key on a different VP • So can row-stores simulate column- stores? Daniel Abadi -- Yale University

  18. Row-Store vs. Column-Store '��� ���� ���� �������������� ���� ���� ��� ��� Row-Store Row-Store (M V) C-Store 25.7 1 1 .7 4.4 Average Daniel Abadi -- Yale University

  19. Row-Store vs. Column-Store '��� ���� ���� �������������� ���� ���� ��� ��� Row-Store Row-Store (M V) C-Store 25.7 1 1 .7 4.4 Average Daniel Abadi -- Yale University

  20. Column-Store Experiments • Start with column-store (C-Store) • Remove column-store-specific performance optimizations • End with column-store with a row-oriented query executer Daniel Abadi -- Yale University

  21. Compression Quarter Quarter • Higher data value locality Q1 (Q1, 1, 300) Q1 in column-stores (Q2, 301, 350) Q1 � Better ratio � reduced I/O Q1 (Q3, 651, 500) Q1 • Can use schemes like (Q4, 1151, 600) Q1 run-length encoding Q1 … � Easy to operate on directly Q2 for improved performance Q2 ([AMF06]) Q2 Q2 … Daniel Abadi -- Yale University

  22. Early vs. Late Materialization QUERY: Select + Aggregate SELECT custID,SUM(price) 4 2 2 7 FROM table 4 1 3 13 WHERE (prodID = 4) AND (storeID = 1) AND 4 3 3 42 GROUP BY custID 4 1 3 80 • Early Materialization: create Construct rows first. But: 4 2 2 7 � Poor memory bandwidth 4 1 3 13 utilization 4 3 3 42 � Lose opportunity for 4 1 3 80 price vectorized operation prodID storeIDcustID Daniel Abadi -- Yale University

  23. Other Column-Store Optimizations • Invisible join � Column-store specific join � Optimizations for star schemas � Similar to a semi-join • Block Processing Daniel Abadi -- Yale University

  24. Simplified Version of Results ���� (��� �������������� '��� ���� ���� ��� C-Store, No C-St ore, Early Original C-St ore Compression Mat erializat ion 4.4 14.9 40.7 Average Daniel Abadi -- Yale University

  25. Conclusion • Might be possible to simulate a row-store in a column-store, BUT: � Need better support for vertical partitioning at the storage layer � Need support for column-specific optimizations at the executer level • Working with HP Labs to find out Daniel Abadi -- Yale University

  26. Come Join the Yale DB Group! Daniel Abadi -- Yale University

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend