
Lecture 10: Sorting algorithms

Sorting

  • Sorting: to arrange data in some sequential order
  • Sorting occurs as a part of many applications
    – Makes searching easier
  • Canonical form for sorted data?
    – Should the sorted list be on one node or distributed?
    – Internal or external?
    – Comparison-based or non-comparison-based?

Assumptions – in this course

  • Before the parallel sort
    – The data (n elements) are distributed on the nodes
    – The data are positive integers
    – Sorted by their numerical value
  • After the parallel sort
    – A sorted sublist on every node (distributed)
    – The sublists are sorted
    – The sublists are globally ordered in some way
  • For example in a ring: list0 < list1 < ...

What can we expect?

  • The best sequential algorithm (general data): O(n log n)
  • Efficiency: 100%
  • Can we do this? Well, hard to do with ”comparison-based” algorithms.

E = S/p = T1/(p·Tp) = (n log n)/(p·Tp) = 1 ⇒ Tp = (n log n)/p = log n (with p = n)

Rank sort

Sorting algorithm (not exactly a good sequential sorting algorithm). Parallel version, n processors for a[0]...a[n-1]:

  • Let all processors have all values
  • Parallel over i:
    – Count all numbers less than a[i]
    – If the count is k, then a[i] is placed in b[k]
  • All CPUs have made n-1 comparisons
  • Complexity?

n numbers, n processors, O(n). Efficiency: log n/n
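As a sketch, the ranking step above can be written sequentially (assuming distinct values, since equal keys would collide on the same rank):

```python
def rank_sort(a):
    # For each a[i], count the elements smaller than it; that
    # count is the final position of a[i] in the output.
    # Assumes distinct values (ties would collide on one slot).
    b = [None] * len(a)
    for i in range(len(a)):
        k = sum(1 for x in a if x < a[i])
        b[k] = a[i]
    return b
```

In the parallel version, each of the n processors computes one of these counts independently.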

Rank sort

Parallel version, n² processors (really n·(n-1) processors):

  • Parallel over i:
    – Count all numbers less than a[i]
  • Each CPU gets 0 or 1
  • Make n reductions on (n-1) processors in parallel
    – If the count is k, then a[i] is placed in b[k]
  • All CPUs have made 1 comparison
  • Complexity?
  • Bad processor efficiency

n numbers, n² processors, O(log n). Efficiency: 1/n


Merge strategy

Bitonic-, Shell-, Quick-sort

1) Sort every sublist by using a fast sequential method (e.g., quicksort)
2) Exchange data between neighbors so that all elements in the list on one node are smaller than all elements on the other (compare-exchange)
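Step 2 above is often implemented as a compare-split: both nodes merge the two sorted sublists, the "smaller" node keeps the lower half and the other keeps the upper half. A minimal sketch:

```python
def compare_split(left, right):
    # Both nodes hold sorted sublists; after the exchange the
    # left node keeps the smallest half and the right node the
    # largest half of the combined data.
    merged = sorted(left + right)
    return merged[:len(left)], merged[len(left):]
```

In a real implementation each node sends its sublist to its neighbor and merges locally, so no node needs the whole list.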

Compare-and-exchange

Bubble sort on an array/ring, n CPUs on n elements:

  • 1. The left node sends its largest number to the right; the right node sends its smallest to the left
  • 2. Each inserts the received element into its list, possibly pushing an element over the processor border
  • 3. When they can no longer insert a number, they are done
  • Communication is cheaper per exchange if larger parts are sent
  • But sending larger parts costs memory (the whole list ⇒ too much memory)

Bubble sort/Odd-even Transposition

  • The bubbling of the next element can be started before the previous one has finished: pipeline
  • Odd-even Transposition sort: every other (odd/even) comparison can be done in parallel
  • 1. Compare every odd element a2i-1 with the next, even, element a2i
  • 2. Compare every even element a2i with the next, odd, element a2i+1 …
  • 3. Last pass

n numbers, n processors, O(n). Efficiency: log n/n
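The alternation of odd and even phases can be sketched as follows (0-indexed, so the "odd" phase compares pairs starting at index 1):

```python
def odd_even_transposition_sort(a):
    # n phases in total; even phases compare-exchange pairs
    # (0,1), (2,3), ...; odd phases compare pairs (1,2), (3,4), ...
    # All pairs within one phase are independent, hence parallel.
    a = list(a)
    n = len(a)
    for phase in range(n):
        for i in range(phase % 2, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

On n processors each phase costs O(1) comparisons, giving the O(n) total above.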

Mergesort

Good sequentially, but parallel? Results in a tree structure!

Example (merge tree): 4 8 7 2 → 4 8 | 7 2 → 4 | 8 | 7 | 2 → 4 8 | 2 7 → 2 4 7 8

Unfortunately, we will not reach log(n); O(n) is the minimum.

n numbers, n processors, O(n). Efficiency: log(n)/n

2D sorting: Shearsort

”Snakesort”

1. Sort the rows, every other ascending (according to the arrow) and every other descending
2. Sort the columns, ascending
3. Sort the rows
4. Sort the columns. O(log n) phases. Finished.

Example 4×4 grid:
 4 14  8  2
10  3 13 16
 7 15  1  5
12  6 11  9

2D sorting: Shearsort

”Snakesort” with transposition

  • proc. 0:  2  4  8 14
  • proc. 1: 16 13 10  3
  • proc. 2:  1  5  7 15
  • proc. 3: 12 11  9  6

1. Sort the rows, every other ascending (according to the arrow), every other descending. Local sorting.


Shearsort

2. Transpose

  • proc. 0:  2 16  1 12
  • proc. 1:  4 13  5 11
  • proc. 2:  8 10  7  9
  • proc. 3: 14  3 15  6

Shearsort

3. Sort the columns

  • proc. 0:  1  2 12 16
  • proc. 1:  4  5 11 13
  • proc. 2:  7  8  9 10
  • proc. 3:  3  6 14 15

n numbers, √n processors
Time: O((√n log √n + T(n)) log n), with T(n) the transpose time
Efficiency: (n log n)/(√n (√n log √n + T(n)) log n)
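The alternating row/column phases can be sketched as a sequential simulation on a k×k grid (the round count 2·log2 k + 1 is the standard shearsort bound, not the parallel cost above):

```python
import math

def shearsort(grid):
    # grid: k rows of k numbers. Alternately sort rows snake-like
    # (even rows ascending, odd rows descending) and columns
    # ascending; 2*log2(k) + 1 phases suffice for snake order.
    k = len(grid)
    rounds = 2 * math.ceil(math.log2(k)) + 1
    for r in range(rounds):
        if r % 2 == 0:  # row phase
            for i in range(k):
                grid[i].sort(reverse=(i % 2 == 1))
        else:           # column phase
            for j in range(k):
                col = sorted(grid[i][j] for i in range(k))
                for i in range(k):
                    grid[i][j] = col[i]
    return grid
```

Reading the result row by row, reversing every other row, gives the fully sorted ("snake") order.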

Bitonic sort

A bitonic sequence is a sequence that, after a cyclic shift, consists of an increasing sequence followed by a decreasing one:

5 6 8 9 7 5 2 3 is, after a shift: 2 3 5 6 8 9 7 5

The interesting thing is that after a compare-exchange between a_i and a_(i+n/2) for all i, 0 ≤ i < n/2, we get two bitonic sequences, where every number in the first is less than or equal to every number in the second.

Bitonic sort

We can exploit this recursively by then sorting the two resulting lists bitonically as well!

5 6 8 9 7 5 2 3
5 5 2 3 7 6 8 9
2 3 5 5 7 6 8 9
2 3 5 5 6 7 8 9

Bitonic sort

First we build bitonic sequences, then we split them. Time: O(log² n)

n numbers, n processors, O(log² n). Efficiency: 1/log(n)
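The recursive build-then-split structure can be sketched for power-of-two lengths:

```python
def bitonic_merge(a, ascending=True):
    # a is bitonic; compare-exchange a[i] with a[i + n/2] for all
    # i < n/2, then recurse on the two resulting bitonic halves.
    n = len(a)
    if n == 1:
        return a
    half = n // 2
    for i in range(half):
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    return (bitonic_merge(a[:half], ascending) +
            bitonic_merge(a[half:], ascending))

def bitonic_sort(a, ascending=True):
    # Build a bitonic sequence (ascending half followed by a
    # descending half), then merge it. Length: a power of two.
    n = len(a)
    if n == 1:
        return list(a)
    half = n // 2
    left = bitonic_sort(a[:half], True)
    right = bitonic_sort(a[half:], False)
    return bitonic_merge(left + right, ascending)
```

Each compare-exchange loop is one parallel step; there are O(log² n) such steps, matching the bound above.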

Sorting on a HC

(Figure: hypercubes with nodes labelled 000–111)

Binary Reflected Gray Code (RGC):
S1 = 0, 1
Sk = 0[Sk-1], 1[Sk-1]R
S2 = 00, 01, 11, 10
S3 = 000, 001, 011, 010, 110, 111, 101, 100
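The recursive definition Sk = 0[Sk-1], 1[Sk-1]R can be sketched directly:

```python
def gray_code(k):
    # S1 = 0, 1;  Sk = "0" prefixed to S(k-1), followed by "1"
    # prefixed to S(k-1) reversed (the [...]R in the definition).
    if k == 1:
        return ["0", "1"]
    prev = gray_code(k - 1)
    return ["0" + s for s in prev] + ["1" + s for s in reversed(prev)]
```

Consecutive codes differ in one bit, so consecutive ring positions map to neighboring hypercube nodes.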


Shellsort

  • d = dimension of the HC
  • Exploits that a list is almost sorted after d compare-exchanges
  • The list will be sorted in ring order

1. Local quicksort on each sublist
2. d compare-exchange-and-merge steps, in the direction pointed to by the d:th bit
3. Mopping up

n·m numbers, n processors, O((k+log n)·m log m). Efficiency: log n/(k+log n)

Shellsort

(figure)

QuickSort

  • Divide and conquer algorithm
  • Sorted in ring order or binary order
  • Results in a tree structure (cf. Mergesort)
  • Sequentially:

1) Find a splitting key
2) Place all elements < key to the left, all larger to the right
3) Split the list
4) Go to 1

Parallel Quicksort 1 on a HC

Repeat (1, 2, 3) d times:

  • 1. Find splitting key, k
  • 2. Send all elements > k to one subcube
  • 3. Send all elements < k to the other subcube
  • 4. Sort sequentially on the node

Disadvantages:
  – a badly chosen k results in quadratic time in the sorting
  – a badly chosen k results in load imbalance

Parallel Quicksort 2

Sample Sort

1) Let all nodes sample l elements randomly
2) Sort the l·2^d elements (shellsort)
3) Choose 2^d - 1 splitting keys
4) Broadcast all keys
5) Perform d splits
6) Sort each sublist

  • The right number of elements in the sampling is important (depends on the length of the list)
  • A large l results in better load balance, but more overhead due to the sorting in step 2
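Steps 1–3 can be sketched as a sequential simulation over the nodes' sublists (the evenly spaced index choice is one reasonable option, not prescribed by the slide):

```python
import random

def choose_splitters(sublists, l):
    # Each node contributes l random samples; the pooled, sorted
    # samples yield p-1 evenly spaced splitting keys (p nodes).
    p = len(sublists)
    samples = []
    for sub in sublists:
        samples.extend(random.sample(sub, l))
    samples.sort()
    return [samples[(i + 1) * l - 1] for i in range(p - 1)]
```

With the keys broadcast, every node can route each of its elements to the correct bucket without further coordination.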

Quicksort 3

1) Make a global guess of the median
2) Split the local sublists according to the guess
3) Decide from the list lengths where the median is (one of the two global lists is longer!)

Repeat 1, 2, 3 until the median is found, then apply Parallel Quicksort 1


Sorting

  • Why is sorting on parallel systems so difficult?
  • Idea for an answer: the number of operations you make on the data set is too small to be easily parallelized without getting too large an overhead – the relation between the number of elements and the number of operations is not good!

Assignment 2

Parallel quicksort with MPI

  • Implement a special variant of parallel quicksort with MPI (similar to Quicksort 1):

1 Divide the data into p equal parts, one per processor
2 Sort the data locally for each processor
3 Perform global sort
  3.1 Select pivot element within each processor set
  3.2 Locally in each processor, divide the data into two sets according to the pivot (smaller or larger)
  3.3 Split the processors into two groups and exchange data pairwise between them so that all processors in one group get data less than the pivot and the others get data larger than the pivot
  3.4 Merge the two lists in each processor into one sorted list
4 Repeat 3.1–3.4 recursively for each half

The algorithm converges in log2 p steps. For steps 2 and 3.4 you should use a serial qsort routine. You may choose the pivot as you wish. Do not time step 1!
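One round of steps 3.2–3.4 can be simulated with plain lists standing in for processors (the pairing of processor i with processor i+k is an illustrative choice; the real assignment does the exchange with MPI):

```python
def quicksort_round(sublists, pivot):
    # One group of 2k "processors": processor i pairs with
    # processor i+k. After the exchange the lower k processors
    # hold values <= pivot and the upper k the rest; each keeps
    # its data merged into one sorted list (step 3.4).
    k = len(sublists) // 2
    lower, upper = [], []
    for i in range(k):
        a, b = sublists[i], sublists[i + k]
        lower.append(sorted([x for x in a if x <= pivot] +
                            [x for x in b if x <= pivot]))
        upper.append(sorted([x for x in a if x > pivot] +
                            [x for x in b if x > pivot]))
    return lower + upper
```

Applying the same round recursively to the lower and upper halves terminates after log2 p rounds, as stated above.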

Assignment 2

Parallel quicksort with MPI

For the numerical tests, implement the following pivot strategies:

  • 1. Select the median in one processor in each group of processors
  • 2. Select the median of all medians in each processor group
  • 3. Select the mean value of all medians in each processor group
  • You are free to test other pivot-selection techniques as well
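The three strategies can be sketched as follows (sublists again stand in for the processors in a group; the function names are illustrative, not prescribed by the assignment):

```python
def median(xs):
    # Upper median of a list (element at index len//2 when sorted).
    s = sorted(xs)
    return s[len(s) // 2]

def pivot_median_of_one(sublists):
    # Strategy 1: median of a single (here the first) processor.
    return median(sublists[0])

def pivot_median_of_medians(sublists):
    # Strategy 2: median of all the local medians.
    return median([median(s) for s in sublists])

def pivot_mean_of_medians(sublists):
    # Strategy 3: mean of all the local medians.
    meds = [median(s) for s in sublists]
    return sum(meds) / len(meds)
```

Since step 2 of the algorithm leaves every local list sorted, each local median is available in O(1) once the lists are sorted.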

Review questions

  • Sort the following list using parallel shellsort with mop-up: 4, 5, 2, 1, 7, 9, 9, 3, 3, 5, 6, 4, 1, 2, 8, 0. Assume that you have a hypercube of dimension 3 and that the initial data distribution is as follows:

node: 0    1    2    3    4    5    6    7
data: 4,5  2,1  7,9  9,3  3,5  6,4  1,2  8,0

  • Sort the list above using shearsort on four processors
  • Sort the following list using bitonic sort: 3, 5, 6, 4, 1, 2, 8, 0

Review questions

  • What does a compare-exchange between two processors cost if each processor holds m elements?
  • What is the complexity of merge sort of m·p numbers on p processors?
  • What is the smallest possible complexity for comparison-based sorting of n elements given 1 processor? Why? Given an arbitrary number of processors? Why?
  • How long does a transposition of p numbers take on a hypercube of dimension log p? On a mesh of dimension √p × √p?