OverviewofStorageandIndexing Chapter8 - - PDF document

overview of storage and indexing
SMART_READER_LITE
LIVE PREVIEW

OverviewofStorageandIndexing Chapter8 - - PDF document

OverviewofStorageandIndexing Chapter8 Howindex-learningturnsnostudentpale Yetholdstheeelofsciencebythetail. --


slide-1
SLIDE 1

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 1

OverviewofStorageandIndexing

Chapter8

“Howindex-learningturnsnostudentpale Yetholdstheeelofsciencebythetail.”

  • - AlexanderPope(1688-1744)

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 2

DataonExternalStorage

  • Disks: Canretrieverandompageatfixedcost

Butreadingseveralconsecutivepagesismuchcheaperthan readingtheminrandomorder

  • Tapes: Canonlyreadpagesinsequence

Cheaperthandisks;usedforarchivalstorage

  • Fileorganization: Methodofarrangingafileofrecords
  • nexternalstorage.

Recordid(rid) issufficienttophysicallylocaterecord

Indexes aredatastructuresthatallowustofindtherecordids

  • frecordswithgivenvaluesinindexsearchkey fields
  • Architecture: Buffermanager stagespagesfromexternal

storagetomainmemorybufferpool.Fileandindex layersmakecallstothebuffermanager.

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 3

AlternativeFileOrganizations

Manyalternativesexist,eachidealforsome situations,andnotsogoodinothers:

Heap(randomorder)files: Suitablewhentypical accessisafilescanretrievingallrecords.

SortedFiles: Bestifrecordsmustberetrievedin someorder,oronlya`range’ofrecordsisneeded.

Indexes: Datastructurestoorganizerecordsvia treesorhashing.

  • Likesortedfiles,theyspeedupsearchesforasubsetof

records,basedonvaluesincertain(“searchkey”)fields

  • Updatesaremuchfasterthaninsortedfiles.
slide-2
SLIDE 2

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 4

Indexes

  • Anindexonafilespeedsupselectionsonthe

searchkeyfieldsfortheindex.

Anysubsetofthefieldsofarelationcanbethe searchkeyforanindexontherelation.

Searchkeyisnot thesameaskey (minimalsetof fieldsthatuniquelyidentifyarecordinarelation).

  • Anindexcontainsacollectionofdataentries,

andsupportsefficientretrievalofalldata entriesk*withagivenkeyvaluek.

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 5

AlternativesforDataEntryk*inIndex

  • Threealternatives:

Datarecordwithkeyvalue k

<k,ridofdatarecordwithsearchkeyvalue k>

<k,listofridsofdatarecordswithsearchkeyk>

  • Choiceofalternativefordataentriesis
  • rthogonaltotheindexingtechniqueusedto

locatedataentrieswithagivenkeyvaluek.

Examplesofindexingtechniques:B+trees,hash- basedstructures

Typically,indexcontainsauxiliaryinformationthat directssearchestothedesireddataentries

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 6

AlternativesforDataEntries(Contd.)

  • Alternative1:

Ifthisisused,indexstructureisafileorganization fordatarecords(insteadofaHeapfileorsorted file).

Atmostoneindexonagivencollectionofdata recordscanuseAlternative1.(Otherwise,data recordsareduplicated,leadingtoredundant storageandpotentialinconsistency.)

Ifdatarecordsareverylarge,#ofpages containingdataentriesishigh.Impliessizeof auxiliaryinformationintheindexisalsolarge, typically.

slide-3
SLIDE 3

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 7

AlternativesforDataEntries(Contd.)

  • Alternatives2and3:

Dataentriestypicallymuchsmallerthandata records.So,betterthanAlternative1withlarge datarecords,especiallyifsearchkeysaresmall. (Portionofindexstructureusedtodirectsearch, whichdependsonsizeofdataentries,ismuch smallerthanwithAlternative1.)

Alternative3morecompactthanAlternative2,but leadstovariablesizeddataentriesevenifsearch keysareoffixedlength.

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 8

IndexClassification

  • Primary vs.secondary:Ifsearchkeycontains

primarykey,thencalledprimaryindex.

Unique index:Searchkeycontainsacandidatekey.

  • Clustered vs. unclustered:Iforderofdatarecords

isthesameas,or`closeto’,orderofdataentries, thencalledclusteredindex.

Alternative1impliesclustered;inpractice,clustered alsoimpliesAlternative1(sincesortedfilesarerare).

Afilecanbeclusteredonatmostonesearchkey.

Costofretrievingdatarecordsthroughindexvaries greatlybasedonwhetherindexisclusteredornot!

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 9

  • Clusteredvs. Unclustered Index
  • SupposethatAlternative(2)isusedfordataentries,

andthatthedatarecordsarestoredinaHeapfile.

Tobuildclusteredindex,firstsorttheHeapfile(with somefreespaceoneachpageforfutureinserts).

Overflowpagesmaybeneededforinserts.(Thus,orderof data recs is`closeto’,butnotidenticalto,thesortorder.)

Indexentries Dataentries directsearchfor (IndexFile) (Datafile) DataRecords dataentries Dataentries DataRecords CLUSTERED UNCLUSTERED

slide-4
SLIDE 4

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 10

Hash-BasedIndexes

  • Goodforequalityselections.
  • Indexisacollectionofbuckets.Bucket=primary

page pluszeroormore overflow pages.

  • Hashingfunction h:h(r)=bucketinwhich

recordr belongs.h looksatthesearchkey fields

  • fr.
  • IfAlternative(1) isused,thebucketscontain

thedatarecords;otherwise,theycontain<key, rid>or<key,rid-list>pairs.

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 11

B+TreeIndexes

  • Leafpagescontain dataentries,andarechained(prev &next)
  • Non-leafpagescontainindexentries anddirectsearches:

P0 K 1 P 1 K 2 P 2 K m P m

indexentry

Non-leaf Pages Pages Leaf DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 12

ExampleB+Tree

  • Find28*?29*?All>15*and<30*
  • Insert/delete:Finddataentryinleaf,then

changeit.Needtoadjustparentsometimes.

Andchangesometimesbubblesupthetree

2* 3* Root

17

30 14* 16* 33* 34* 38* 39* 13 5 7* 5* 8* 22* 24* 27 27* 29*

Entries<=17 Entries>17

slide-5
SLIDE 5

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 13

CostModelforOurAnalysis

WeignoreCPUcosts,forsimplicity:

B:Thenumberofdatapages

R:Numberofrecordsperpage

D:(Average)timetoreadorwritediskpage

MeasuringnumberofpageI/O’signoresgainsof pre-fetchingasequenceofpages;thus,evenI/O costisonlyapproximated.

Average-caseanalysis;basedonseveralsimplistic assumptions. * Goodenoughtoshowtheoveralltrends!

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 14

ComparingFileOrganizations

  • Heapfiles(randomorder;insertat eof)
  • Sortedfiles,sortedon<age, sal>
  • ClusteredB+treefile,Alternative(1),search

key<age, sal>

  • Heapfilewith unclustered B+treeindexon

searchkey<age, sal>

  • Heapfilewith unclustered hashindexon

searchkey<age, sal>

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 15

OperationstoCompare

  • Scan:Fetchallrecordsfromdisk
  • Equalitysearch
  • Rangeselection
  • Insertarecord
  • Deletearecord
slide-6
SLIDE 6

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 16

AssumptionsinOurAnalysis

  • HeapFiles:

Equalityselectiononkey;exactlyonematch.

  • SortedFiles:

Filescompactedafterdeletions.

  • Indexes:

Alt(2),(3):dataentrysize=10%sizeofrecord

Hash:Nooverflowbuckets.

  • 80%pageoccupancy=>Filesize=1.25datasize

Tree:67%occupancy(thisistypical).

  • Impliesfilesize=1.5datasize

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 17

CostofOperations

  • (a)Scan

(b) Equality (c)Range (d)Insert (e)Delete (1)Heap

  • (2)Sorted
  • (3)Clustered
  • (4)Unclustered

Treeindex

  • (5)Unclustered

Hashindex

  • * Severalassumptionsunderliethese(rough)estimates!

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 18

CostofOperations

  • (a)Scan

(b)Equality (c)Range (d)Insert (e)Delete (1)Heap BD

0.5BD BD 2D Search +D

(2)Sorted BD

Dlog2B Dlog2B+ #matches Search +BD Search +BD

(3)Clustered 1.5BD

DlogF1.5B DlogF1.5B +# matches Search +D Search +D

(4)Unclustered Treeindex BD(R+0.15) D(1+

logF0.15B) DlogF0.15B +# matches D(3+ logF0.15B) Search +2D

(5)Unclustered Hashindex

BD(R+0.1 25) 2D BD 4D Search +2D

  • * Severalassumptionsunderliethese(rough)estimates!
slide-7
SLIDE 7

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 19

UnderstandingtheWorkload

  • Foreachqueryintheworkload:

Whichrelationsdoesitaccess?

Whichattributesareretrieved?

Whichattributesareinvolvedinselection/joinconditions? Howselectivearetheseconditionslikelytobe?

  • Foreachupdateintheworkload:

Whichattributesareinvolvedinselection/joinconditions? Howselectivearetheseconditionslikelytobe?

Thetypeofupdate(INSERT/DELETE/UPDATE),andthe attributesthatareaffected.

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 20

ChoiceofIndexes

  • Whatindexesshouldwecreate?

Whichrelationsshouldhaveindexes?Whatfield(s) shouldbethesearchkey?Shouldwebuildseveral indexes?

  • Foreachindex,whatkindofanindexshouldit

be?

Clustered?Hash/tree?

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 21

ChoiceofIndexes(Contd.)

  • Oneapproach: Considerthemostimportantqueries

inturn.Considerthebestplanusingthecurrent indexes,andseeifabetterplanispossiblewithan additionalindex.Ifso,createit.

Obviously,thisimpliesthatwemustunderstandhowa DBMSevaluatesqueriesandcreatesqueryevaluationplans!

Fornow,wediscusssimple1-tablequeries.

  • Beforecreatinganindex,mustalsoconsiderthe

impactonupdatesintheworkload!

Trade-off: Indexescanmakequeriesgofaster,updates slower.Requirediskspace,too.

slide-8
SLIDE 8

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 22

IndexSelectionGuidelines

  • AttributesinWHEREclausearecandidatesforindexkeys.

Exactmatchconditionsuggestshashindex.

Rangequerysuggeststreeindex.

  • Clusteringisespeciallyusefulforrangequeries;canalsohelp on

equalityqueriesiftherearemanyduplicates.

  • Multi-attributesearchkeysshouldbeconsideredwhena

WHEREclausecontainsseveralconditions.

Orderofattributesisimportantforrangequeries.

Suchindexescansometimesenableindex-only strategiesfor importantqueries.

  • Forindex-onlystrategies,clusteringisnotimportant!
  • Trytochooseindexesthatbenefitasmanyqueriesas

possible.Sinceonlyoneindexcanbeclusteredperrelation, chooseitbasedonimportantqueriesthatwouldbenefitthe mostfromclustering.

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 23

ExamplesofClusteredIndexes

  • B+treeindexonE.agecanbe

usedtogetqualifying tuples.

Howselectiveisthecondition?

Istheindexclustered?

  • ConsidertheGROUPBYquery.

Ifmany tuples haveE.age >10,using E.age indexandsortingtheretrieved tuples maybecostly.

ClusteredE.dno indexmaybebetter!

  • Equalityqueriesandduplicates:

ClusteringonE.hobby helps!

SELECT E.dno FROM Emp E WHERE E.age>40 SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age>10 GROUPBYE.dno SELECT E.dno FROM Emp E WHERE E.hobby=Stamps

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 24

IndexeswithCompositeSearchKeys

  • CompositeSearchKeys:Search
  • nacombinationoffields.

Equalityquery:Everyfield valueisequaltoaconstant value.E.g. wrt <sal,age>index:

  • age=20and sal =75

Rangequery: Somefieldvalue isnotaconstant.E.g.:

  • age=20;orage=20and sal >10
  • Dataentriesinindexsorted

bysearchkeytosupport rangequeries.

Lexicographicorder,or

Spatialorder.

sue 13 75 bob cal joe 12 10 20 80 11 12 name age sal <sal,age> <age, sal> <age> <sal> 12,20 12,10 11,80 13,75 20,12 10,12 75,13 80,11 11 12 12 13 10 20 75 80

Datarecords sortedbyname Dataentriesinindex sortedby<sal,age> Dataentries sortedby<sal> Examplesofcompositekey indexesusinglexicographicorder.

slide-9
SLIDE 9

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 25

CompositeSearchKeys

  • Toretrieve Emp recordswithage=30AND sal=4000,

anindexon<age,sal>wouldbebetterthananindex

  • nage oranindexon sal.

Choiceofindexkeyorthogonaltoclusteringetc.

  • Ifconditionis:20<age<30AND 3000<sal<5000:

Clusteredtreeindexon<age,sal>or<sal,age>isbest.

  • Ifconditionis:age=30AND 3000<sal<5000:

Clustered<age,sal>indexmuchbetterthan<sal,age> index!

  • Compositeindexesarelarger,updatedmoreoften.

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 26

Index-OnlyPlans

  • Anumberof

queriescanbe answered without retrievingany tuples fromone

  • rmoreofthe

relations involvedifa suitableindex isavailable.

SELECT D.mgr FROM DeptD, Emp E WHERE D.dno=E.dno SELECT D.mgr,E.eid FROM DeptD, Emp E WHERE D.dno=E.dno SELECT E.dno,COUNT(*) FROM Emp E GROUPBYE.dno SELECT E.dno,MIN(E.sal) FROM Emp E GROUPBYE.dno SELECT AVG(E.sal) FROM Emp E WHEREE.age=25AND

E.sal BETWEEN 3000AND5000 <E.dno> <E.dno,E.eid> Treeindex! <E.dno> <E.dno,E.sal> Treeindex! <E.age,E.sal>

  • r

<E.sal,E.age> Tree!

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 27

Index-OnlyPlans(Contd.)

  • Index-onlyplans

arepossibleifthe keyis<dno,age>

  • rwehaveatree

indexwithkey <age,dno>

Whichisbetter?

Whatifwe considerthe secondquery?

SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age=30 GROUPBYE.dno SELECT E.dno, COUNT (*) FROM Emp E WHERE E.age>30 GROUPBYE.dno

slide-10
SLIDE 10

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 28

Summary

  • Manyalternativefileorganizationsexist,each

appropriateinsomesituation.

  • Ifselectionqueriesarefrequent,sortingthe

fileorbuildinganindex isimportant.

Hash-basedindexesonlygoodforequalitysearch.

Sortedfilesandtree-basedindexesbestforrange search;alsogoodforequalitysearch.(Filesrarely keptsortedinpractice;B+treeindexisbetter.)

  • Indexisacollectionofdataentriesplusaway

toquicklyfindentrieswithgivenkeyvalues.

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 29

Summary(Contd.)

  • Dataentriescanbeactualdatarecords,<key,

rid>pairs,or<key,rid-list>pairs.

Choiceorthogonaltoindexingtechniqueusedto locatedataentrieswithagivenkeyvalue.

  • Canhaveseveralindexesonagivenfileof

datarecords,eachwithadifferentsearchkey.

  • Indexescanbeclassifiedasclusteredvs.

unclustered,primaryvs.secondary,and densevs.sparse.Differenceshaveimportant consequencesforutility/performance.

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 30

Summary(Contd.)

  • Understandingthenatureoftheworkload forthe

application,andtheperformancegoals,isessential todevelopingagooddesign.

Whataretheimportantqueriesandupdates?What attributes/relationsareinvolved?

  • Indexesmustbechosentospeedupimportant

queries(andperhapssomeupdates!).

Indexmaintenanceoverheadonupdatestokeyfields.

Chooseindexesthatcanhelpmanyqueries,ifpossible.

Buildindexestosupportindex-onlystrategies.

Clusteringisanimportantdecision;onlyoneindexona givenrelationcanbeclustered!

Orderoffieldsincompositeindexkeycanbeimportant.