Overview Carnegie Mellon Univ. Review Dept. of Computer Science - PowerPoint PPT Presentation

Faloutsos - Pavlo SCS 15-415/615 Overview Carnegie Mellon Univ. • Review Dept. of Computer Science • Index classification 15-415/615 – DB Applications • Cost estimation Faloutsos & Pavlo Lecture #10 (R&G ch8) File Organizations and Indexing 2 Faloutsos - Pavlo CMU SCS 15-415 Alternative File Organizations How to find records quickly? Many alternatives exist, each good for some • E.g., student.gpa = ‘ 3 ’ situations, and not so good in others: – Heap files: Suitable when typical access is a file scan retrieving all records. Q: On a heap organization, with B blocks, – Sorted Files: Best for retrieval in some order, how many disk accesses? or for retrieving a `range ’ of records. – Index File Organizations: (ISAM, or B+ trees) 3 4 Faloutsos - Pavlo Faloutsos - Pavlo 1

Faloutsos - Pavlo SCS 15-415/615 Heap File Implemented Using Lists How to find records quickly? • E.g., student.gpa = ‘ 3 ’ Data Data Data Full Pages Page Page Page Q: On a heap organization, with B blocks, Header Page how many disk accesses? Free Free Free Pages with Page Page Page A: B Free Space • The header page id and Heap file name must be stored someplace. • Each page contains 2 `pointers ’ plus data. 5 6 Faloutsos - Pavlo Faloutsos - Pavlo How to accelerate searches? Example: Simple Index on GPA • A: Indices, like: Directory 3 3.5 2 2.5 Data entries: 1.2* 1.7* 1.8* 1.9* 2.7* 2.7* 2.9* 3.2* 3.3* 3.3* 3.6* 3.8* 3.9* 4.0* 2.2* 2.4* (Index File) (Data file) Data Records An index contains a collection of data entries , and supports efficient retrieval of records matching a given search condition 7 8 Faloutsos - Pavlo Faloutsos - Pavlo 2

Faloutsos - Pavlo SCS 15-415/615 Overview Details • Review • ‘ data entries ’ == what we store at the bottom of the index pages • Index classification • what would you use as data entries? – Representation of data entries in index – Clustered vs. Unclustered • (3 alternatives here) – Primary vs. Secondary – Dense vs. Sparse – Single Key vs. Composite – Indexing technique • Cost estimation 11 11 12 Faloutsos Faloutsos - Pavlo Faloutsos - Pavlo CMU SCS 15-415 Alternatives for Data Entry k* in Index Example: Simple Index on GPA 1. Actual data record (with key value k ) Directory 3 3.5 2 2.5 123 Smith; Main str; 412-999.9999 Data entries: 2. < k , rid of matching data record> 1.2* 1.7* 1.8* 1.9* 2.7* 2.7* 2.9* 3.2* 3.3* 3.3* 3.6* 3.8* 3.9* 4.0* 2.2* 2.4* $40 Rid-1 (Index File) (Data file) $40 Rid-2 … Data Records 3. < k , list of rids of matching data records> An index contains a collection of data entries , $40 Rid-1 Rid-2 … and supports efficient retrieval of records matching a given search condition 13 14 Faloutsos - Pavlo Faloutsos - Pavlo 3

Faloutsos - Pavlo SCS 15-415/615 Alternatives for Data Entry k* in Index Alternatives for Data Entries (Contd.) 123 1. Actual data record (with key value k ) Smith; Main str; 412-999.9999 Alternative 1: 2. < k , rid of matching data record> Actual data record (with key value k ) 3. < k , list of rids of matching data records> – Then, this is a clustering/sparse index, and • Choice is orthogonal to the indexing technique. constitutes a file organization (like Heap – Examples of indexing techniques: B+ trees, files or sorted files). hash-based structures, R trees, … – At most one index on a given collection of – Typically, index contains auxiliary info that data records can use Alternative 1. directs searches to the desired data entries – Saves pointer lookups but can be • Can have multiple (different) indexes per file. expensive to maintain with insertions and – E.g. file sorted on age , with a hash index on deletions. name and a B+tree index on salary . 15 16 Faloutsos - Pavlo Faloutsos - Pavlo Alternatives for Data Entries (Contd.) Overview Alternative 2 $40 Rid-1 • Review < k , rid of matching data record> $40 Rid-2 • Index classification and Alternative 3 $40 Rid-1 Rid-2 … < k , list of rids of matching data records> – Representation of data entries in index – Easier to maintain than Alternative 1. – Clustered vs. Unclustered – If more than one index is required on a given file, at – Primary vs. Secondary most one index can use Alternative 1; rest must use – Dense vs. Sparse Alternatives 2 or 3. – Single Key vs. Composite – Alternative 3 more compact than Alternative 2, but leads to variable sized data entries even if search keys – Indexing technique are of fixed length. • Cost estimation – Even worse, for large rid lists the data entry would have to span multiple pages! 17 18 18 Faloutsos - Pavlo Faloutsos Faloutsos - Pavlo CMU SCS 15-415 4

Faloutsos - Pavlo SCS 15-415/615 Indexing - clustered index example Indexing - non-clustered Clustering/sparse index on ssn Non-clustering / dense index 123 123 456 234 >=123 … 345 STUDENT Ssn Name Address 456 Ssn Name Address 345 tomson main str 567 123 smith main str 234 jones forbes ave >=456 234 jones forbes ave 567 smith forbes ave 345 tomson main str 456 stevens forbes ave 456 stevens forbes ave 123 smith main str 567 smith forbes ave 19 20 Faloutsos - Pavlo Faloutsos - Pavlo Index Classification - clustered Index Classification - clustered • Clustered vs. unclustered : If order of data • Clustered vs. unclustered : If order of data records is the same as, or `close to ’ , order of records is the same as, or `close to ’ , order of index data entries, then called clustered index . index data entries, then called clustered index . Index entries Index entries UNCLUSTERED UNCLUSTERED CLUSTERED direct search for CLUSTERED direct search for data entries data entries Data entries Data entries Data entries (Index File) (Index File) (Data file) (Data file) Data Records Data Records Data Records Data Records Data entries 21 22 Faloutsos - Pavlo Faloutsos - Pavlo 5

Faloutsos - Pavlo SCS 15-415/615 Index Classification - clustered Clustered vs. Unclustered Index • Cost of retrieving records found in range scan: – A file can have a clustered index on at most one search key. – Clustered: cost = – Unclustered: cost ≈ – Cost of retrieving data records through index varies greatly based on whether index is • What are the tradeoffs???? clustered! – Note: Alternative 1 implies clustered, but not vice-versa . But, for simplicity, you may think of them as equivalent.. 23 24 Faloutsos - Pavlo Faloutsos - Pavlo Clustered vs. Unclustered Index Clustered vs. Unclustered Index • Cost of retrieving records found in range scan: • Cost of retrieving records found in range scan: – Clustered: cost = # pages in file w/matching records – Clustered: cost = # pages in file w/matching records – Unclustered: cost ≈ # of matching index data entries – Unclustered: cost ≈ # of matching index data entries • What are the tradeoffs???? • What are the tradeoffs???? – Clustered Pros: • Efficient for range searches • May be able to do some types of compression – Clustered Cons: • Expensive to maintain (on the fly or sloppy with reorganization) 25 26 Faloutsos - Pavlo Faloutsos - Pavlo 6

Faloutsos - Pavlo SCS 15-415/615 Overview Primary vs. Secondary Index • Review • Primary : index key includes the file ’ s • Index classification primary key – Representation of data entries in index • Secondary : any other index – Clustered vs. Unclustered – Primary vs. Secondary – Sometimes confused with Alt. 1 vs. Alt. 2/3 – Dense vs. Sparse – Primary index never contains duplicates – Single Key vs. Composite – Secondary index may contain duplicates – Indexing technique • If index key contains a candidate key, no • Cost estimation duplicates => unique index 27 27 28 Faloutsos - Pavlo Faloutsos Faloutsos - Pavlo CMU SCS 15-415 Overview Dense vs. Sparse Index • Dense: at least one • Review data entry per key • Index classification value Ashby, 25, 3000 22 – Representation of data entries in index • Sparse: an entry per Basu, 33, 4003 25 Bristow, 30, 2007 30 – Clustered vs. Unclustered data page in file Ashby 33 Cass, 50, 5004 Cass – Primary vs. Secondary Smith Daniels, 22, 6003 – Every sparse index is 40 Jones, 40, 6003 44 clustered ! – Dense vs. Sparse 44 Smith, 44, 3000 50 – Sparse indexes are Tracy, 44, 5004 – Single Key vs. Composite smaller; however, some Sparse Index Dense Index – Indexing technique on on Data File useful optimizations are Name Age • Cost estimation based on dense indexes. 29 29 30 Faloutsos - Pavlo Faloutsos Faloutsos - Pavlo CMU SCS 15-415 7

Overview Carnegie Mellon Univ. Review Dept. of Computer Science - PowerPoint PPT Presentation

Faloutsos - Pavlo SCS 15-415/615 Overview Carnegie Mellon Univ. Review Dept. of Computer Science Index classification 15-415/615 DB Applications Cost estimation Faloutsos & Pavlo Lecture #10 (R&G ch8) File

Carnegie Mellon University Search TRECVID 2004 Workshop November 2004 Mike Christel, Jun

Brendan Meeder Carnegie Mellon University Christos Faloutsos Carnegie Mellon University Given a

A First Look Franz Franchetti Carnegie Mellon University in collaboration with Daniele G.

HookFinder: Identifying and Understanding Malware Hooking Behaviors Heng Yin Zhenkai Liang Dawn

Kensuke Kakiuchi (Nagoya Univ./ The Univ. of Tokyo) Collaborators: Takeru K. Suzuki (The Univ. of

Feature Selection Matters for Anchor-Free Object Detection Chenchen Zhu Carnegie Mellon

15-213 Recitation: Attack Lab Jenna MacCarley 28 Sep 2015 Carnegie Mellon Reminder Bomb lab

15-213 Recitation: Bomb Lab 21 Sep 2015 Monil Shah, Shelton DSouza Carnegie Mellon Agenda

Cache Lab Implementation and Blocking Slides courtesy of: Aditya Shah, CMU 1 Carnegie Mellon

From Carnegie Mellon to Kyoto: How Far Can We Go? Project Courses at Carnegie Mellon Involve

Running Incomplete Programs Ian Voysey Carnegie Mellon University Cyrus Omar Carnegie Mellon

SPIRAL, FFTX, and the Path to SpectralPACK Franz Franchetti Carnegie Mellon University

for BlueGene/P Franz Franchetti 1 , Yevgen Voronenko 2 , Gheorghe Almasi 3 1 Carnegie Mellon

More is Less? Non-parametric Language Models and Efficiency Graham Neubig Carnegie Mellon

current interactions on iron in the NINJA experiment Contents Toho Univ. , Nagoya Univ., Kobe

A First Look Franz Franchetti, Daniele G. Spampinato, Anuva Kulkarni, Tze Meng Low Carnegie

Alternative File Organizations Many alternatives exist, each ideal for some situation , and not so

Homestake Atmospheric Flux John LoSecco University of Notre Dame February, 2016 Honda 2014

CS 105 Lecture 9: Fun with Files Craig Zilles (Computer Science)

Command Line Arguments COMP 1002/1402 Purpose At times a program needs to know some

Logistics Job related news C++ Classes Co-op orientation Friday, Sept 16 th 12:00

CS 333 Introduction to Operating Systems Class 19 - Security Jonathan Walpole Computer Science

Disaster Recovery Waivers From Pla n to Im p lem enta tion: Lessons Lea rned Suggestions to

Parallel I/O for the CGNS system Thomas Hauser thomas.hauser@usu.edu Center for High