
Storage and Indexing 11/19/2018 1

Overview
- We covered storage of unstructured files in HDFS
  - Partition into blocks
  - Replicate to data nodes
- This lecture will cover the storage of structured and semi-structured data
  - Row vs column formats
  - Data-aware partitioning
  - Dynamic indexing

Challenges
- HDFS is a write-once, read-many file system
- Random access can be extremely slow because it might need to access data on another machine
- Data locality has to be taken into account to preserve the computation-to-data execution style
- Nested data structures must be supported

Row-oriented Stores
[Diagram: each row stores Field 1, Field 2, Field 3 side by side]
- CSV and JSON formats are examples of traditional row-oriented data formats
  - How is the schema stored in each one?
  - How flexible is each one for adding additional fields?
- Hybrid format of fixed columns + extensible columns
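
The two questions above can be explored with a small sketch. The sample records below are made up for illustration (the `phone` field in particular is an assumption, not part of the slides). CSV stores the schema once, as a header line of names without types, while JSON repeats the field names inside every record, which is what lets a single record carry an extra field.

```python
import csv
import io
import json

# Made-up sample data: CSV keeps field names once in the header line.
csv_text = "id,name,email\n1564,Paul,paul@gmail.com\n1567,Xu,xu@163.com\n"

# JSON (one object per line) repeats the names in every record, so one
# record may add a field ("phone", hypothetical) without touching the others.
json_lines = (
    '{"id": 1564, "name": "Paul", "email": "paul@gmail.com"}\n'
    '{"id": 1567, "name": "Xu", "phone": "555-0100"}\n'
)

csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = [json.loads(line) for line in json_lines.splitlines()]

# CSV carries no type information: every value comes back as a string.
print(type(csv_rows[0]["id"]).__name__)   # str
# JSON keeps basic types and tolerates per-record fields.
print(type(json_rows[0]["id"]).__name__)  # int
print(sorted(json_rows[1].keys()))
```

Adding a field to the CSV file would mean rewriting the header and every row, which is the inflexibility the hybrid fixed + extensible format tries to avoid.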

Extensible Row Format
Header: Name:type | Name:type | Name:type
Row:    Value | Value | Value | Name:type:value | Name:type:value | Name:type:value
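
A minimal sketch of how such a row might be parsed, assuming a comma-separated text encoding. The separator, the `HEADER` schema, the `CASTS` table, and the sample row are all assumptions made for illustration; the slides only define the layout (typed fixed columns followed by self-describing `name:type:value` extras).

```python
# Assumed fixed schema, playing the role of the "Name:type" header.
HEADER = [("id", int), ("name", str), ("email", str)]

# Assumed mapping from type names to Python converters.
CASTS = {"int": int, "string": str, "float": float}

def parse_row(line, header=HEADER, sep=","):
    """Parse fixed columns by position, then any name:type:value extras."""
    parts = line.split(sep)
    record = {}
    # Fixed columns: names and types come from the header.
    for (name, cast), raw in zip(header, parts):
        record[name] = cast(raw)
    # Extensible columns: each one carries its own name and type.
    for extra in parts[len(header):]:
        name, type_name, raw = extra.split(":", 2)
        record[name] = CASTS[type_name](raw)
    return record

row = parse_row("1564,Paul,paul@gmail.com,age:int:31,city:string:Riverside")
print(row)
```

Fixed columns stay compact because their names and types are stored once, while a new field costs extra bytes only in the rows that actually use it.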

Traditional Column Stores
Header:  ID:int | Name:string | Email:string
Column1: … 1564, 1567, 1568, 1569, 1572 …
Column2: Paul, Xu, Jyeshta, Nora, Alex
Column3: paul@gmail.com, xu@163.com, nil, nil, alex@live.com
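
The layout above can be sketched in a few lines using the five records from the slide. The helper names `to_columns` and `get_row` are hypothetical, and `None` stands in for `nil`.

```python
header = ["ID", "Name", "Email"]
rows = [
    (1564, "Paul", "paul@gmail.com"),
    (1567, "Xu", "xu@163.com"),
    (1568, "Jyeshta", None),
    (1569, "Nora", None),
    (1572, "Alex", "alex@live.com"),
]

def to_columns(rows, header):
    """Transpose row tuples into one list per column."""
    return {name: list(values) for name, values in zip(header, zip(*rows))}

def get_row(columns, header, i):
    """Rebuild record i by touching every column (the costly direction)."""
    return tuple(columns[name][i] for name in header)

columns = to_columns(rows, header)

names = columns["Name"]        # projection reads a single column
max_id = max(columns["ID"])    # aggregation reads a single column
third = get_row(columns, header, 2)
print(names, max_id, third)
```

Projection and aggregation each touch one list, while reassembling a full record has to visit all three, which previews the trade-offs discussed on the next slide.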

Pros/Cons of Column Formats
- Pros
  - Faster projection
  - Column compression
  - Efficient aggregation
- Cons
  - Not extensible: cannot easily add more fields
  - Slower when combining multiple columns
  - Slower joins
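
The slide does not name a compression scheme. As one assumed example, run-length encoding works well on a column because values of the same field sit next to each other and often repeat. The `country` column below is made up for illustration.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]

# Hypothetical column with long runs of repeated values.
country = ["US", "US", "US", "CN", "CN", "IN"]
runs = rle_encode(country)
print(runs)
```

The same encoding applied to a row-oriented file would rarely find runs, since neighboring values belong to different fields.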

Hybrid Row/Column Format
- Used in most big-data key-value stores
- Groups related columns together into column families to reduce the overhead of combining them
- Each column family is further partitioned horizontally into sets of rows
- Each set of rows is stored in a column-oriented format with appropriate compression and encoding

Hybrid Row/Column Format
[Diagram labels: ID, Name | ID, Email]
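
A rough sketch of the hybrid layout, assuming the diagram shows two column families, (ID, Name) and (ID, Email). The family names, the function `to_hybrid`, and the row-group size are hypothetical, and compression and encoding are left out.

```python
header = ["ID", "Name", "Email"]
rows = [
    (1564, "Paul", "paul@gmail.com"),
    (1567, "Xu", "xu@163.com"),
    (1568, "Jyeshta", None),
    (1569, "Nora", None),
    (1572, "Alex", "alex@live.com"),
]

def to_hybrid(rows, header, families, rows_per_group):
    """Return {family: [row_group, ...]}; each row group is column-oriented."""
    position = {name: i for i, name in enumerate(header)}
    layout = {}
    for family, cols in families.items():
        groups = []
        # Horizontal partitioning: cut the rows into fixed-size sets.
        for start in range(0, len(rows), rows_per_group):
            chunk = rows[start:start + rows_per_group]
            # Inside a set of rows, store one list per column.
            groups.append({c: [r[position[c]] for r in chunk] for c in cols})
        layout[family] = groups
    return layout

families = {"names": ["ID", "Name"], "emails": ["ID", "Email"]}
layout = to_hybrid(rows, header, families, rows_per_group=2)
print(layout["names"][0])
```

A query that needs only IDs and names reads the `names` family and never touches the emails, while the columns inside a family are already stored together and cheap to combine.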

Indexing
- A means for speeding up some queries
- Can help avoid full scans
- Traditional DBMS indexes
  - B+-tree
  - R-tree
  - Hash indexes
  - Bitmap indexes
- Drawbacks of traditional indexes
  - Existing implementations cannot scale to big data
  - They use random reads/writes, which are not supported in HDFS

Clustered/Unclustered Indexes
- Clustered indexes
  - Organize records to match the order of the index
  - Good for both point and range queries
  - Can only build one index per dataset
- Unclustered indexes
  - Records are kept as-is
  - Good only for point queries and very small ranges
  - Support multiple indexes per dataset
  - Rely on random access
- Unclustered indexes are less useful in HDFS. Why?
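
The difference can be sketched in memory, with list positions standing in for disk locations. The function names are hypothetical and the records reuse the names from the column-store slide.

```python
from bisect import bisect_left, bisect_right

# Records in arrival order (not sorted by any key).
records = [(1572, "Alex"), (1564, "Paul"), (1568, "Jyeshta"),
           (1567, "Xu"), (1569, "Nora")]

# Clustered index on ID: the records themselves are reordered by the key,
# so a range query reads one contiguous slice.
clustered = sorted(records)
ids = [r[0] for r in clustered]

def clustered_range(lo, hi):
    """Return all records with lo <= ID <= hi as one sequential read."""
    return clustered[bisect_left(ids, lo):bisect_right(ids, hi)]

# Unclustered index on name: records stay as-is and the index maps each
# key to a position, so every match costs one random access.
by_name = {name: pos for pos, (_, name) in enumerate(records)}

def unclustered_lookup(name):
    """Return the record for one name, or None if it is not indexed."""
    pos = by_name.get(name)
    return None if pos is None else records[pos]

print(clustered_range(1567, 1569))
print(unclustered_lookup("Nora"))
```

Each unclustered match is a jump to an arbitrary position, which in a distributed file system may mean a read from another machine.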

Distributed Indexes
[Diagram: Big Data is split by a global index (a.k.a. partitioning) into HDFS blocks; each block carries its own local index]

Hash Partitioning
- Advantages
  - Requires one scan over the data
  - Flexible on number of partitions
  - With a good hash function, provides a good load balance
- Drawbacks
  - Supports only point queries
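
A minimal sketch of hash partitioning in one scan. The choice of `zlib.crc32` as the hash function is an assumption (any stable hash would do; Python's built-in `hash` is avoided because it is randomized per process for strings), and the function names are hypothetical.

```python
import zlib

def partition_of(key_value, num_partitions):
    """Map a key to a partition number with a stable hash."""
    return zlib.crc32(str(key_value).encode("utf-8")) % num_partitions

def hash_partition(records, key, num_partitions):
    """Assign every record to a partition in a single scan."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[partition_of(key(record), num_partitions)].append(record)
    return partitions

records = [(1564, "Paul"), (1567, "Xu"), (1568, "Jyeshta"),
           (1569, "Nora"), (1572, "Alex")]
partitions = hash_partition(records, key=lambda r: r[0], num_partitions=3)

# A point query recomputes the hash and inspects one partition only.
target = partition_of(1568, 3)
hits = [r for r in partitions[target] if r[0] == 1568]
print(hits)
```

A range query such as "IDs between 1565 and 1570" gets no such shortcut, because neighboring keys hash to unrelated partitions and all of them must be scanned.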

Range Partitioning
- How to find partition boundaries?
  - Traditionally, partition boundaries evolve as records are inserted
  - Not possible in HDFS where random writes are not allowed
- A common solution
  - Sample the input data (one scan)
  - Calculate partition boundaries (driver machine)
  - Partition the data (one scan)
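
The three steps of the common solution can be sketched on a single machine. The function names, the sample size, and the use of evenly spaced sample quantiles as boundaries are assumptions for illustration; the slides do not prescribe how boundaries are derived from the sample.

```python
import random
from bisect import bisect_right

def sample_keys(keys, sample_size, seed=0):
    """Step 1: draw a random sample of the keys (one scan in practice)."""
    rng = random.Random(seed)
    return rng.sample(keys, min(sample_size, len(keys)))

def compute_boundaries(sample, num_partitions):
    """Step 2: pick evenly spaced quantiles of the sample as split points."""
    ordered = sorted(sample)
    return [ordered[len(ordered) * i // num_partitions]
            for i in range(1, num_partitions)]

def range_partition(keys, boundaries):
    """Step 3: route each key to the range that contains it (one scan)."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for k in keys:
        partitions[bisect_right(boundaries, k)].append(k)
    return partitions

# Made-up input: the integers 0..999 in shuffled order.
keys = list(range(1000))
random.Random(1).shuffle(keys)

boundaries = compute_boundaries(sample_keys(keys, 100), num_partitions=4)
parts = range_partition(keys, boundaries)
print(boundaries, [len(p) for p in parts])
```

Because the boundaries come from a sample, partition sizes are only approximately balanced; a larger sample tightens the balance at the cost of more work on the driver.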

Dynamic Partitioning
- Very challenging in big data
- Cannot modify existing blocks
- How to insert a record into closed ranges?
- Common solution: Log-structured merge-tree (LSM-tree)

LSM Tree
[Diagram: new records arrive at the master node and enter the memory component; when flushed, they become disk components on the slave nodes; disk components are compacted and merged (e.g., external merge sort)]
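
A single-process sketch of the idea, with Python lists standing in for immutable disk components. The class name `TinyLSM`, the flush threshold, and the in-memory merge used for compaction are assumptions; a real system would spread components over nodes and merge them with an external merge sort, as the diagram indicates.

```python
from bisect import bisect_left

class TinyLSM:
    """Toy LSM-tree: a mutable memory component plus immutable sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}            # memory component
        self.memtable_limit = memtable_limit
        self.runs = []                # "disk" components, oldest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memory component out as a new sorted, immutable run."""
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        """Check memory first, then runs from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            i = bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        """Merge all runs into one, keeping the newest value per key."""
        merged = {}
        for run in self.runs:         # oldest first, so newer values win
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []

db = TinyLSM(memtable_limit=2)
for k, v in [("a", 1), ("b", 2), ("a", 10), ("c", 3), ("d", 4)]:
    db.put(k, v)
print(db.get("a"), len(db.runs))
db.compact()
print(db.runs)
```

Existing runs are never modified: an update to `"a"` is simply written again in a newer run, and the stale value disappears only when compaction rewrites the runs into a new one.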

Local Indexing
- Relatively easier
- Computed locally in each block before it gets written to disk
- Appended/prepended to the data block
- Given the small size of the block, the index can be constructed entirely in main memory before the block is written
- Examples
  - Bloom filter
  - Sorting
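
Both examples can be sketched together: a block is sorted in memory and a Bloom filter over its keys is built before the block is "written". The filter size, the number of hash functions, the SHA-256 based hashing, and the `build_block` layout are all assumptions made for illustration.

```python
import hashlib

class BloomFilter:
    """Bit array with k hash functions; may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def build_block(records, key):
    """Sort the block's records and attach a Bloom filter over their keys."""
    ordered = sorted(records, key=key)
    bloom = BloomFilter()
    for record in ordered:
        bloom.add(key(record))
    return {"bloom": bloom, "records": ordered}

block = build_block([(1572, "Alex"), (1564, "Paul"), (1568, "Jyeshta")],
                    key=lambda r: r[0])
print(block["bloom"].might_contain(1568))
```

A reader can consult the filter first and skip the block entirely when `might_contain` returns `False`; a `True` answer still requires looking inside, where the sorted order allows a binary search.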

Summary
- Two orthogonal problems in big-data storage
  - File formats (row, column, or hybrid)
  - Indexing (global and local)
- File formats
  - Row: Flexible but inefficient
  - Column: Efficient for some queries but inflexible
  - Hybrid: Tries to be flexible and efficient
- Indexing
  - Global: Load-balanced partitioning
  - Local: Additional metadata affixed to each block
