ExternalSorting Chapter13 - - PDF document

external sorting
SMART_READER_LITE
LIVE PREVIEW

ExternalSorting Chapter13 - - PDF document

ExternalSorting Chapter13 DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 1 WhySort? Aclassicproblemincomputerscience!


slide-1
SLIDE 1

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 1

ExternalSorting

Chapter13

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 2

WhySort?

  • Aclassicproblemincomputerscience!
  • Datarequestedinsortedorder

e.g.,findstudentsinincreasing gpa order

  • SortingisfirststepinbulkloadingB+treeindex.
  • Sortingusefulforeliminatingduplicatecopiesina

collectionofrecords(Why?)

  • Sort-merge joinalgorithminvolvessorting.
  • Problem:sort1Gbofdatawith1MbofRAM.

whynotvirtualmemory?

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 3

2-WaySort:Requires3Buffers

  • Pass1:Readapage,sortit,writeit.
  • nlyonebufferpageisused
  • Pass2,3,…,etc.:

threebufferpagesused.

Mainmemorybuffers INPUT1 INPUT2 OUTPUT Disk Disk

slide-2
SLIDE 2

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 4

Two-WayExternalMergeSort

  • Eachpassweread+write

eachpageinfile.

  • Npagesinthefile=>the

numberofpasses

  • So toal costis:
  • Idea: Divideandconquer:

sort subfiles andmerge

✁ ✂

= + log2 1 N

✄ ☎

( )

2 1

2

N N log +

Inputfile 1-pageruns 2-pageruns 4-pageruns 8-pageruns PASS0 PASS1 PASS2 PASS3 9 3,4 6,2 9,4 8,7 5,6 3,1 2 3,4 5,6 2,6 4,9 7,8 1,3 2 2,3 4,6 4,7 8,9 1,3 5,6 2 2,3 4,4 6,7 8,9 1,2 3,5 6 1,2 2,3 3,4 4,5 6,6 7,8 DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 5

GeneralExternalMergeSort

TosortafilewithN pagesusingB bufferpages:

Pass0:useBbufferpages.Producesortedrunsof B pageseach.

Pass2,…,etc.:mergeB-1runs.

✞ ✟

N B /

BMainmemorybuffers INPUT1 INPUTB-1 OUTPUT Disk Disk INPUT2

... ... ...

* Morethan3bufferpages.Howcanweutilizethem?

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 6

CostofExternalMergeSort

Numberofpasses:

Cost=2N*(#ofpasses)

E.g.,with5bufferpages,tosort108pagefile:

Pass0:=22sortedrunsof5pageseach (lastrunisonly3pages)

Pass1:=6sortedrunsof20pageseach (lastrunisonly8pages)

Pass2:2sortedruns,80pagesand28pages

Pass3:Sortedfileof108pages

✠ ✡ ☛ ☞

1

1

+

log /

B

N B

✌ ✍

108 5 /

✎ ✏

22 4 /

slide-3
SLIDE 3

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 7

NumberofPassesofExternalSort

N B=3 B=5 B=9 B=17 B=129 B=257 100 7 4 3 2 1 1 1,000 10 5 4 3 2 2 10,000 13 7 5 4 2 2 100,000 17 9 6 5 3 3 1,000,000 20 10 7 5 3 3 10,000,000 23 12 8 6 4 3 100,000,000 26 14 9 7 4 4 1,000,000,000 30 15 10 8 5 4

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 8

InternalSortAlgorithm

  • Quicksort isafastwaytosortinmemory.
  • Analternativeis“tournamentsort”(a.k.a.

“heapsort”)

Top: Readin Bblocks

Output: movesmallestrecordtooutputbuffer

Readinanewrecordr

insertrinto“heap”

ifrnotsmallest,thenGOTOOutput

elseremover from“heap”

  • utput“heap”inorder;GOTOTop

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 9

Moreon Heapsort

  • Fact:averagelengthofarunin heapsort is2B

The“snowplow”analogy

  • Worst-Case:

Whatisminlengthofarun?

Howdoesthisarise?

  • Best-Case:

Whatismaxlengthofarun?

Howdoesthisarise?

  • Quicksort isfaster,but...

B

slide-4
SLIDE 4

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 10

I/OforExternalMergeSort

  • …longerrunsoftenmeansfewerpasses!
  • Actually,doI/Oapageatatime
  • Infact,readablock ofpagessequentially!
  • Suggestsweshouldmakeeachbuffer

(input/output)beablock ofpages.

Butthiswillreducefan-outduringmergepasses!

Inpractice,mostfilesstillsortedin2-3passes.

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 11

NumberofPassesofOptimizedSort

N B=1,000 B=5,000 B=10,000 100 1 1 1 1,000 1 1 1 10,000 2 2 1 100,000 3 2 2 1,000,000 3 2 2 10,000,000 4 3 3 100,000,000 5 3 3 1,000,000,000 5 4 3

* Blocksize=32,initialpassproducesrunsofsize2B.

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 12

DoubleBuffering

  • ToreducewaittimeforI/Orequestto

complete,can prefetch into`shadowblock’.

Potentially,morepasses;inpractice,mostfilesstill sortedin2-3passes.

OUTPUT OUTPUT'

Disk Disk

INPUT1 INPUTk INPUT2 INPUT1' INPUT2' INPUTk' blocksize

b

Bmainmemorybuffers,k-waymerge

slide-5
SLIDE 5

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 13

SortingRecords!

  • Sortinghasbecomeabloodsport!

Parallelsortingisthenameofthegame...

  • Datamation:Sort1Mrecordsofsize100bytes

TypicalDBMS:15minutes

Worldrecord:3.5seconds

  • 12-CPUSGImachine,96disks,2GBofRAM
  • Newbenchmarksproposed:

MinuteSort:Howmanycanyousortin1minute?

DollarSort:Howmanycanyousortfor$1.00?

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 14

UsingB+TreesforSorting

  • Scenario:TabletobesortedhasB+treeindexon

sortingcolumn(s).

  • Idea:Canretrieverecordsinorderbytraversing

leafpages.

  • Isthisagoodidea?
  • Casestoconsider:

B+treeisclustered Goodidea!

B+treeisnotclustered Couldbeaverybadidea!

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 15

ClusteredB+TreeUsedforSorting

  • Cost:roottotheleft-

mostleaf,thenretrieve allleafpages (Alternative1)

  • IfAlternative2isused?

Additionalcostof retrievingdatarecords: eachpagefetchedjust

  • nce.

* Alwaysbetterthanexternalsorting!

✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✂ ✂ ✂ ✂ ✂ ✂ ✂ ✂ ✂ ✂ ✂ ✂ ✂ ✂ ✄ ✄ ✄ ✄ ✄ ✄ ✄ ✄ ✄ ✄ ✄ ✄ ✄ ✄

(Directssearch) DataRecords Index DataEntries ("Sequenceset")

slide-6
SLIDE 6

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 16

UnclusteredB+TreeUsedforSorting

  • Alternative(2)fordataentries;eachdata

entrycontainsrid ofadatarecord.Ingeneral,

  • neI/Operdatarecord!
✁ ✁ ✁ ✁ ✁ ✁ ✂ ✂ ✂ ✂ ✂ ✂ ✂ ✂

(Directssearch) DataRecords Index DataEntries ("Sequenceset")

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 17

ExternalSortingvs.UnclusteredIndex

N Sorting p=1 p=10 p=100 100 200 100 1,000 10,000 1,000 2,000 1,000 10,000 100,000 10,000 40,000 10,000 100,000 1,000,000 100,000 600,000 100,000 1,000,000 10,000,000 1,000,000 8,000,000 1,000,000 10,000,000 100,000,000 10,000,000 80,000,000 10,000,000 100,000,000 1,000,000,000

* p:#ofrecordsperpage * B=1,000andblocksize=32forsorting * p=100isthemorerealisticvalue.

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 18

Summary

  • Externalsortingisimportant;DBMSmaydedicate

partofbufferpoolforsorting!

  • ExternalmergesortminimizesdiskI/Ocost:

Pass0:Producessortedruns ofsizeB (#bufferpages). Laterpasses:merge runs.

#ofrunsmergedatatimedependsonB,andblocksize.

LargerblocksizemeanslessI/Ocostperpage.

Largerblocksizemeanssmaller#runsmerged.

Inpractice,#ofrunsrarelymorethan2or3.

slide-7
SLIDE 7

DatabaseManagementSystems3ed,R.RamakrishnanandJ.Gehrke 19

Summary,cont.

  • Choiceofinternalsortalgorithmmaymatter:

Quicksort:Quick!

Heap/tournamentsort:slower(2x),longerruns

  • Thebestsortsarewildlyfast:

Despite40+yearsofresearch,we’restill improving!

  • ClusteredB+treeisgoodforsorting;

unclusteredtreeisusuallyverybad.