Storage and Indexing 11/19/2018 1 Overview We covered storage of - - PowerPoint PPT Presentation



SLIDE 1

Storage and Indexing

11/19/2018 1

slide-2
SLIDE 2

Overview

We covered storage of unstructured files in HDFS
  Partition into blocks
  Replicate to data nodes
This lecture will cover the storage of structured and semi-structured data
  Row vs column formats
  Data-aware partitioning
  Dynamic indexing

SLIDE 3

Challenges

HDFS is a write-once, read-many file system
Random access can be extremely slow because it might need to access data on another machine
Data locality has to be taken into account to preserve the computation-to-data execution style
Nested data structures must be supported

SLIDE 4

Row-oriented Stores

CSV and JSON formats are examples of traditional row-oriented data formats
  How is the schema stored in each one?
  How flexible is each one for adding fields?
Hybrid format of fixed columns + extensible columns

Row: Field 1 | Field 2 | Field 3 | …
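The two questions above can be explored with a small sketch. It assumes a CSV file with a header line and a JSON-lines file (one object per line); the sample records, including the `phone` field, are made up for illustration.

```python
import csv
import io
import json

# CSV: field names appear once, in the header line. Types are not recorded,
# so every value is read back as a string.
csv_text = "id,name,email\n1564,Paul,paul@gmail.com\n1567,Xu,xu@163.com\n"
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON lines: field names are repeated in every record, basic types are kept,
# and each record may carry different fields (hypothetical "phone" below).
json_text = (
    '{"id": 1564, "name": "Paul", "email": "paul@gmail.com"}\n'
    '{"id": 1567, "name": "Xu", "phone": "555-0100"}\n'
)
json_rows = [json.loads(line) for line in json_text.splitlines()]

print(csv_rows[0]["id"], type(csv_rows[0]["id"]).__name__)
print(json_rows[0]["id"], type(json_rows[0]["id"]).__name__)
print(sorted(json_rows[1]))
```

Adding a field to the CSV file means rewriting the header and every row, while a JSON record can simply include the new key at the cost of repeating field names in each record.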

SLIDE 5

Extensible Row Format

Header: Name:type | Name:type | Name:type
Row: Value | Value | Value | Name:type:value | Name:type:value | Name:type:value
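A minimal sketch of how such a row could be parsed. The comma delimiter, the `CASTS` table, and the function names are assumptions made for this example; the slide only fixes the `Name:type` header and the `Name:type:value` extension cells.

```python
# Assumed type names and their Python converters (illustrative only).
CASTS = {"int": int, "float": float, "string": str}

def parse_header(header_line):
    """Parse 'name:type,name:type,...' into a list of (name, type) pairs."""
    return [tuple(col.split(":")) for col in header_line.split(",")]

def parse_row(header, row_line):
    """Read the fixed columns by position, then any self-describing extras."""
    cells = row_line.split(",")
    record = {}
    # Fixed columns: name and type come from the header.
    for (name, typ), raw in zip(header, cells):
        record[name] = CASTS[typ](raw)
    # Extensible columns: each cell carries its own name and type.
    for cell in cells[len(header):]:
        name, typ, raw = cell.split(":", 2)
        record[name] = CASTS[typ](raw)
    return record

header = parse_header("id:int,name:string")
print(parse_row(header, "1564,Paul,email:string:paul@gmail.com"))
print(parse_row(header, "1568,Jyeshta"))
```

The fixed columns pay no per-row naming overhead, and only rows that actually have an extra field pay for its name and type.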

SLIDE 6

Traditional Column Stores

Header: ID:int | Name:string | Email:string
Column1: 1564 1567 1568 1569 1572 …
Column2: Paul Xu Jyeshta Nora Alex …
Column3: paul@gmail.com xu@163.com nil alex@live.com nil
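The layout above can be sketched in a few lines. The example uses the first three records from the slide, with `None` standing in for `nil`; the helper names `project` and `reconstruct_row` are hypothetical.

```python
header = ["ID", "Name", "Email"]
rows = [
    (1564, "Paul", "paul@gmail.com"),
    (1567, "Xu", "xu@163.com"),
    (1568, "Jyeshta", None),
]

# Column store: one list per field; position i in every list belongs to row i.
columns = {name: [row[i] for row in rows] for i, name in enumerate(header)}

def project(columns, names):
    """Projection only touches the requested columns."""
    return {name: columns[name] for name in names}

def reconstruct_row(columns, i):
    """Rebuilding a full row needs one lookup per column."""
    return tuple(columns[name][i] for name in header)

print(project(columns, ["ID"]))
print(reconstruct_row(columns, 2))
```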

SLIDE 7

Pros/Cons of Column Formats

Pros
  Faster projection
  Column compression
  Efficient aggregation
Cons
  Not extensible: cannot easily add more fields
  Slower when combining multiple columns
  Slower joins
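As one illustration of column compression and aggregation, here is a run-length encoding sketch. RLE is only one of several encodings a column store might use and is chosen here as an assumption; it works well when a column holds long runs of repeated values.

```python
from itertools import groupby

def rle_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]

def rle_sum(runs):
    """Aggregate directly on the compressed form, one step per run."""
    return sum(value * count for value, count in runs)

state_column = ["CA", "CA", "CA", "NY", "NY"]
price_column = [5, 5, 5, 7]

print(rle_encode(state_column))
print(rle_sum(rle_encode(price_column)))
```

Values of the same field sit next to each other in a column layout, which is what makes such runs likely; in a row layout they would be interleaved with the other fields.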

SLIDE 8

Hybrid Row/Column Format

Used in most big-data key-value stores
Groups related columns together into column families to reduce the overhead of combining them
Each column family is further partitioned horizontally into sets of rows
Each set of rows is stored in a column-oriented format with appropriate compression and encoding

SLIDE 9

Hybrid Row/Column Format

[ID | Name]   [ID | Email]
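A minimal sketch of this layout, assuming two column families that both carry the ID, as in the diagram. The family names, the `rows_per_group` parameter, and the function name are hypothetical, and compression and encoding are left out.

```python
def to_hybrid(rows, families, rows_per_group):
    """Split rows into column families, then each family into row groups,
    and store every row group column by column."""
    layout = {}
    for family, cols in families.items():
        groups = []
        for start in range(0, len(rows), rows_per_group):
            chunk = rows[start:start + rows_per_group]
            groups.append({col: [row[col] for row in chunk] for col in cols})
        layout[family] = groups
    return layout

rows = [
    {"ID": 1564, "Name": "Paul", "Email": "paul@gmail.com"},
    {"ID": 1567, "Name": "Xu", "Email": "xu@163.com"},
    {"ID": 1568, "Name": "Jyeshta", "Email": None},
]
families = {"names": ["ID", "Name"], "contacts": ["ID", "Email"]}
layout = to_hybrid(rows, families, rows_per_group=2)

print(layout["names"][0])
print(layout["contacts"][1])
```

A query that needs only ID and Name reads the `names` family and never touches the email values.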

SLIDE 10

Indexing

A means for speeding up some queries
Can help avoid full scans
Traditional DBMS indexes
  B+-tree
  R-tree
  Hash indexes
  Bitmap indexes
Drawbacks of traditional indexes
  Existing implementations cannot scale to big data
  They rely on random reads/writes, which HDFS does not support

SLIDE 11

Clustered/Unclustered Indexes

Clustered indexes
  Organize records to match the order of the index
  Good for both point and range queries
  Can only build one index per dataset
Unclustered indexes
  Records are kept as-is
  Good only for point queries and very small ranges
  Support multiple indexes per dataset
  Rely on random access
Unclustered indexes are less useful in HDFS. Why?
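One way to approach the question is to count how many blocks a range query touches under each organization. The sketch below is a toy model: the block size, the record count, and the shuffled arrival order are all assumptions, and a "block" here is just 100 consecutive record positions.

```python
import random

BLOCK_SIZE = 100  # records per block (illustrative, far smaller than a real block)

def blocks_touched(positions):
    """Count the distinct blocks that hold the given record positions."""
    return len({pos // BLOCK_SIZE for pos in positions})

keys = list(range(1000))

# Clustered: records are stored in key order, so key k sits at position k.
clustered_pos = {k: k for k in keys}

# Unclustered: records stay in arrival order (shuffled here with a fixed seed);
# the index maps each key to wherever its record happens to be.
arrival = keys[:]
random.Random(0).shuffle(arrival)
unclustered_pos = {k: pos for pos, k in enumerate(arrival)}

query = range(250, 350)  # range query over 100 consecutive keys
clustered_blocks = blocks_touched(clustered_pos[k] for k in query)
unclustered_blocks = blocks_touched(unclustered_pos[k] for k in query)

print(clustered_blocks, unclustered_blocks)
```

With the clustered layout the matching records sit in adjacent blocks and can be read sequentially. With the unclustered layout each match may live in a different block, possibly on another machine, which is exactly the random access pattern that is slow in HDFS.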

SLIDE 12

Distributed Indexes

Big Data
Global Index (a.k.a. Partitioning)
HDFS Blocks: Local Index | Local Index | Local Index | Local Index | Local Index

SLIDE 13

Hash Partitioning

Advantages
  Requires one scan over the data
  Flexible on number of partitions
  With a good hash function, provides a good load balance
Drawbacks
  Supports only point queries
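A minimal single-machine sketch of hash partitioning. It uses `zlib.crc32` as the hash function only because it is stable across runs (Python's built-in `hash` is randomized for strings); the function names and sample records are hypothetical.

```python
import zlib
from collections import defaultdict

def partition_of(value, num_partitions):
    """Map a key value to a partition id with a stable hash."""
    return zlib.crc32(str(value).encode("utf-8")) % num_partitions

def hash_partition(records, key, num_partitions):
    """One scan over the data: each record goes to the partition of its key."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_of(record[key], num_partitions)].append(record)
    return partitions

def point_lookup(partitions, key, value, num_partitions):
    """A point query scans only the single partition the key hashes to."""
    candidates = partitions.get(partition_of(value, num_partitions), [])
    return [record for record in candidates if record[key] == value]

records = [{"id": i, "name": f"user{i}"} for i in range(1564, 1574)]
parts = hash_partition(records, "id", num_partitions=4)
print(point_lookup(parts, "id", 1567, num_partitions=4))
```

Neighboring keys such as 1567 and 1568 generally hash to unrelated partitions, so a range query has no choice but to scan every partition.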

SLIDE 14

Range Partitioning

How to find partition boundaries?
Traditionally, partition boundaries evolve as records are inserted
Not possible in HDFS where random writes are not allowed
A common solution
  Sample the input data (one scan)
  Calculate partition boundaries (driver machine)
  Partition the data (one scan)
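The three steps above can be sketched on a single machine as follows. The sample size, the evenly spaced choice of split points, and the function names are assumptions; a real system would run the sampling and partitioning scans in parallel and compute only the boundaries on the driver.

```python
import bisect
import random

def sample_boundaries(keys, num_partitions, sample_size, seed=0):
    """Steps 1 and 2: sample the keys, then pick evenly spaced split points."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    return [sample[len(sample) * i // num_partitions]
            for i in range(1, num_partitions)]

def range_partition(keys, boundaries):
    """Step 3: one scan; binary search finds the partition for each key."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        partitions[bisect.bisect_right(boundaries, key)].append(key)
    return partitions

keys = list(range(1000))
random.Random(1).shuffle(keys)

boundaries = sample_boundaries(keys, num_partitions=4, sample_size=100)
partitions = range_partition(keys, boundaries)
print(boundaries, [len(p) for p in partitions])
```

Because the boundaries come from a sample, the partitions are only approximately balanced; a larger sample tightens the balance at the cost of a more expensive first scan.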

SLIDE 15

Dynamic Partitioning

Very challenging in big data
Cannot modify existing blocks
How to insert a record into closed ranges?
Common solution: Log-structured merge-tree (LSM-tree)

SLIDE 16

LSM Tree

Master Node: Memory component (receives new records)
Slave Nodes: Disk components (flushed from the memory component) …
Compact and merge (e.g., external merge sort)
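A single-process sketch of the idea in the diagram, assuming an in-memory dict as the memory component and immutable sorted lists as the disk components. The class name `TinyLSM`, the `memtable_limit` parameter, and the in-memory k-way merge (standing in for an external merge sort) are all illustrative; the master/slave distribution is not modeled.

```python
import bisect
import heapq

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}   # memory component: accepts new records
        self.runs = []       # disk components: immutable sorted runs, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memory component out as a new sorted run; never edit old runs."""
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):          # newest run wins
            i = bisect.bisect_left(run, (key,))  # binary search on the key
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        """K-way merge of all runs; for equal keys the newest run sorts first."""
        tagged = [[(k, -age, v) for k, v in run]
                  for age, run in enumerate(self.runs)]
        merged = []
        for k, _, v in heapq.merge(*tagged):
            if not merged or merged[-1][0] != k:
                merged.append((k, v))
        self.runs = [merged] if merged else []

db = TinyLSM(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]:
    db.put(key, value)
db.compact()
print(db.runs, db.memtable)
```

An update never rewrites an existing run: the new value lands in a newer component and shadows the old one until compaction discards it, which fits a file system without random writes.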

SLIDE 17

Local Indexing

Relatively easier
Computed locally in each block before it gets written to disk
Appended/prepended to the data block
Given the small size of the block, it can be completely constructed in main memory before the block is written
Examples
  Bloom filter
  Sorting
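A minimal Bloom filter sketch as an example of a local index that could be built in memory and stored alongside a block. The bit-array size, the number of hash functions, and the use of salted SHA-256 digests are assumptions chosen for clarity rather than speed.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        """Derive num_hashes bit positions by salting the key with an index."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

block_keys = [1564, 1567, 1568, 1569, 1572]
block_filter = BloomFilter()
for key in block_keys:
    block_filter.add(key)

print(all(block_filter.might_contain(k) for k in block_keys))
```

A reader can check the filter first and skip the block entirely when the answer is False; a True answer may be a false positive, so the block still has to be searched.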

SLIDE 18

Summary

Two orthogonal problems in big-data storage
  File formats (row, column, or hybrid)
  Indexing (global and local)
File formats
  Row: Flexible but inefficient
  Column: Efficient for some queries but inflexible
  Hybrid: Tries to be flexible and efficient
Indexing
  Global: Load-balanced partitioning
  Local: Additional metadata affixed to each block
