SLIDE 1

The power of parallel computation

D. J. Bernstein

Thanks to: University of Illinois at Chicago, NSF CCR–9983950, Alfred P. Sloan Foundation

How fast is sorting?

Input: array of $n$ numbers. Each number in $\{1, 2, \ldots, n^2\}$, represented in binary.

Output: array of $n$ numbers, in increasing order, represented in binary; same multiset as input.

A machine is given the input and computes the output. How much time does it use?

SLIDE 3

The answer depends on how the machine works.

Possibility 1: The machine is a “1-tape Turing machine using selection sort.”

Specifically: The machine has a 1-dimensional array containing $\Theta(n)$ “cells.” Each cell stores $\Theta(\lg n)$ bits. Input and output are stored in these cells.

SLIDE 5

The machine also has a “head” moving through the array. Head contains $\Theta(1)$ cells. Head can see the cell at its current array position; perform arithmetic etc.; move to adjacent array position.

Selection sort: Head looks at each array position, picks up the largest number, moves it to the end of the array, picks up the second largest, etc.
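
The head-on-a-tape selection sort can be sketched as a small simulation that counts moves to adjacent array positions. This is only an illustration under assumed rules: one unit of cost per adjacent move, a head that remembers the largest value seen during a scan, and a swap done by carrying the displaced number back. The names `tape_selection_sort` and `walk_to` are hypothetical.

```python
def tape_selection_sort(tape):
    """Sort `tape` in place with a single moving head; return the number of
    moves to adjacent positions (assumed cost model: 1 unit per move)."""
    moves = 0
    position = 0

    def walk_to(target):
        nonlocal moves, position
        moves += abs(target - position)
        position = target

    # `end` is the last position of the still-unsorted part of the tape.
    for end in range(len(tape) - 1, 0, -1):
        walk_to(0)
        largest_at = 0
        for i in range(1, end + 1):
            walk_to(i)
            if tape[i] > tape[largest_at]:
                largest_at = i
        # Fetch the largest number, carry it to the end, then carry the
        # displaced number back to the vacated position.
        walk_to(largest_at)
        walk_to(end)
        tape[largest_at], tape[end] = tape[end], tape[largest_at]
        walk_to(largest_at)
    return moves


row = [3, 1, 4, 1, 5, 9, 2, 6]
tape_selection_sort(row)
print(row)                                         # [1, 1, 2, 3, 4, 5, 6, 9]
print(tape_selection_sort(list(range(1, 101))))    # 9899 moves for n = 100
```

Even on already-sorted input the head makes about $n^2$ moves in this model, because every pass rescans the unsorted part of the tape.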

SLIDE 7

Moving to adjacent array position takes $n^{o(1)}$ seconds. Moving a number to end of array takes $n^{1+o(1)}$ seconds. Same for comparisons etc.

Total sorting time: $n^{2+o(1)}$ seconds.

Cost of machine: $n^{1+o(1)}$ Euros for $n^{1+o(1)}$ cells. Negligible extra cost for head.

SLIDE 9

Possibility 2: The machine is a “2-dimensional RAM using merge sort.”

Machine has $\Theta(n)$ cells in a 2-dimensional array: $\Theta(\sqrt{n})$ rows, $\Theta(\sqrt{n})$ columns. Machine also has a head.

Merge sort: Head recursively sorts first $\lfloor n/2 \rfloor$ numbers; sorts last $\lceil n/2 \rceil$ numbers; merges the sorted lists.

SLIDE 11

Merging requires $n^{1+o(1)}$ jumps to “random” array positions. Average jump: $n^{0.5+o(1)}$ moves to adjacent array positions. Each move takes $n^{o(1)}$ seconds.

Total sorting time: $n^{1.5+o(1)}$ seconds.

Cost of machine: once again $n^{1+o(1)}$ Euros.
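
The size of an average jump can be checked with a short brute-force computation. The sketch below assumes an idealized model that the slides do not spell out: both endpoints of a jump are independent and uniformly random cells of a square grid, and distance is counted in moves to adjacent cells (Manhattan distance). The name `average_jump` is hypothetical.

```python
from itertools import product


def average_jump(side):
    """Average Manhattan distance between two uniformly random cells of a
    side x side grid, computed by brute force over all pairs of cells."""
    cells = list(product(range(side), repeat=2))
    total = sum(abs(r1 - r2) + abs(c1 - c2)
                for (r1, c1), (r2, c2) in product(cells, repeat=2))
    return total / len(cells) ** 2


print(average_jump(4))    # 2.5
print(average_jump(16))   # 10.625
```

Under this model the average is $2(s^2-1)/(3s)$ for side $s$, roughly $\tfrac{2}{3}\sqrt{n}$ when $s = \sqrt{n}$, which is consistent with jumps of $n^{0.5+o(1)}$ moves.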

SLIDE 13

Possibility 3: The machine is a “pipelined 2-dimensional RAM using radix-2 sort.”

Machine has $\Theta(n)$ cells in a 2-dimensional array. Each cell in the array has network links to the 2 adjacent cells in the same column. Each cell in the top row has network links to the 2 adjacent cells in the top row. Machine also has a CPU attached to top-left cell.

SLIDE 15

Radix-2 sort: CPU shuffles array using bit 0, even numbers before odd.

3 1 4 1 5 9 2 6 $\mapsto$ 4 2 6 3 1 1 5 9.

Then using bit 1: 4 1 1 5 9 2 6 3.
Then using bit 2: 1 1 9 2 3 4 5 6.
Then using bit 3: 1 1 2 3 4 5 6 9.

etc. $\Theta(\lg n)$ bits.
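
The passes above can be reproduced with a minimal sketch of radix-2 sort: one stable shuffle per bit, least significant bit first. This ignores the machine model entirely (no network, no pipelining); `radix2_passes` is a hypothetical name.

```python
def radix2_passes(numbers):
    """Yield the array after each stable pass on bit 0, bit 1, and so on."""
    array = list(numbers)
    bits = max(array).bit_length() if array else 0
    for bit in range(bits):
        # Stable shuffle: numbers with this bit clear keep their relative
        # order and go first, followed by numbers with this bit set.
        zeros = [x for x in array if not (x >> bit) & 1]
        ones = [x for x in array if (x >> bit) & 1]
        array = zeros + ones
        yield list(array)


for bit, array in enumerate(radix2_passes([3, 1, 4, 1, 5, 9, 2, 6])):
    print("bit", bit, array)
# bit 0 [4, 2, 6, 3, 1, 1, 5, 9]
# bit 1 [4, 1, 1, 5, 9, 2, 6, 3]
# bit 2 [1, 1, 9, 2, 3, 4, 5, 6]
# bit 3 [1, 1, 2, 3, 4, 5, 6, 9]
```

Stability of each pass is what makes the final array sorted: ties on the current bit preserve the order established by the lower bits.
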
SLIDE 17

CPU can read/write any cell by sending request through network. Does not need to wait for response before sending next request.

CPU can read an entire row of $n^{0.5+o(1)}$ cells in $n^{0.5+o(1)}$ seconds. Sends all requests, then receives responses.

Total sorting time: $n^{1+o(1)}$ seconds.

Cost of machine: once again $n^{1+o(1)}$ Euros.
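
Why pipelining matters can be seen in a toy latency model. The assumptions here are not from the slides: the cell at distance `d` from the CPU needs `d` ticks for a request to arrive and `d` ticks for the response to return, and the CPU can issue one request per tick. Both function names are hypothetical, and `side` must be at least 1.

```python
def read_row_waiting(side):
    """Ticks to read a row if the CPU waits for each response before
    sending the next request (round trip of 2 * distance per cell)."""
    return sum(2 * distance for distance in range(1, side + 1))


def read_row_pipelined(side):
    """Ticks to read a row if the CPU sends one request per tick and only
    then collects responses; the last response to arrive decides the time."""
    return max(send_tick + 2 * distance
               for send_tick, distance in enumerate(range(1, side + 1)))


print(read_row_waiting(8), read_row_pipelined(8))   # 72 23
```

In this model waiting costs about `side` squared ticks while pipelining costs about 3 times `side`, which matches the claim that a row of $n^{0.5+o(1)}$ cells can be read in $n^{0.5+o(1)}$ seconds.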

SLIDE 19

Possibility 4: The machine is a “2-dimensional mesh using Schimmler sort.”

Machine has $\Theta(n)$ cells in a 2-dimensional array. Each cell has network links to the 4 adjacent cells. Machine also has a CPU attached to top-left cell. CPU broadcasts instructions to all of the cells, but cells do most of the processing.

SLIDE 21

Schimmler sort: Recursively sort quadrants in parallel. Then four steps:

  • Sort each column in parallel.
  • Sort each row in parallel.
  • Sort each column in parallel.
  • Sort each row in parallel.

With proper choice of left-to-right/right-to-left for each row, can prove that this sorts whole array.

SLIDE 23

To sort one row: Sort each pair in parallel.

3 1 4 1 5 9 2 6 $\mapsto$ 1 3 1 4 5 9 2 6

Sort alternate pairs in parallel.

1 3 1 4 5 9 2 6 $\mapsto$ 1 1 3 4 5 2 9 6

Repeat. Can prove that row is sorted when number of steps equals row length.
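
The row-sorting procedure above is commonly known as odd-even transposition sort. The sketch below runs it sequentially; on the mesh, all pairs within one step would be compared at the same time. The function names are hypothetical.

```python
def transposition_step(row, offset):
    """Sort the adjacent pairs starting at positions offset, offset + 2, ...
    The pairs are disjoint, so a mesh could handle them all in parallel."""
    row = list(row)
    for i in range(offset, len(row) - 1, 2):
        if row[i] > row[i + 1]:
            row[i], row[i + 1] = row[i + 1], row[i]
    return row


def sort_row(row):
    """Alternate between the two pairings; use as many steps as the row is long."""
    row = list(row)
    for step in range(len(row)):
        row = transposition_step(row, step % 2)
    return row


first = transposition_step([3, 1, 4, 1, 5, 9, 2, 6], 0)
second = transposition_step(first, 1)
print(first)    # [1, 3, 1, 4, 5, 9, 2, 6]
print(second)   # [1, 1, 3, 4, 5, 2, 9, 6]
print(sort_row([3, 1, 4, 1, 5, 9, 2, 6]))   # [1, 1, 2, 3, 4, 5, 6, 9]
```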

SLIDE 25

Sort one row in $n^{0.5+o(1)}$ seconds. All rows in parallel: $n^{0.5+o(1)}$ seconds.

Total sorting time: $n^{0.5+o(1)}$ seconds.

Cost of machine: once again $n^{1+o(1)}$ Euros.
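
One way to see the total is a recurrence on the side length $s = \sqrt{n}$ of the array. This is a sketch, not taken from the slides: it assumes each of the four row/column phases costs about $s$ pair-sorting steps, that the four quadrants are sorted at the same time, and it writes $T(s)$ (a hypothetical name) for the number of steps needed on an $s \times s$ array.

```latex
T(s) \le T(s/2) + 4s
\quad\Longrightarrow\quad
T(s) \le 4s\left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \cdots\right) = 8s = 8\sqrt{n}.
```

With each step taking $n^{o(1)}$ seconds, this gives $n^{0.5+o(1)}$ seconds in total.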

SLIDE 27

Some philosophical notes

1-tape Turing machines, RAMs, 2-dimensional meshes compute the same functions. Prove this by proving that each machine can simulate computations on the others.

(We believe that every reasonable model of computation can be simulated by a 1-tape Turing machine. “Church-Turing thesis.”)

SLIDE 29

1-tape Turing machines, RAMs, 2-dimensional meshes compute the same functions in polynomial time at polynomial cost. Prove this by proving that simulations are polynomial.

(Is this true for every reasonable model of computation? Consider quantum computers.)

SLIDE 31

1-tape Turing machines, RAMs, 2-dimensional meshes do not compute the same functions within, e.g., time $n^{1+o(1)}$ and cost $n^{1+o(1)}$.

Example: 1-tape Turing machine cannot sort in time $n^{1+o(1)}$. Too local!

Example: 2-dimensional RAM cannot sort in time $n^{0.5+o(1)}$. Too sequential!

SLIDE 33

$o(1)$ is asymptotic. Speedup factor such as $n^{0.5+o(1)}$ might not be a speedup for small values of $n$.

When $n$ is small, RAM might seem to be a sensible machine design. But, for large $n$, having a huge memory waiting for a single CPU is a silly machine design.

SLIDE 35

Myth: Parallel computation cannot improve price-performance ratio; parallel computers may reduce time by some factor but increase cost by the same factor.

Reality: Can often convert a large serial computer into small parallel cells, so cost does not increase by that factor.

SLIDE 37

Myth: Designing a new machine cannot produce more than a small constant-factor improvement compared to, e.g., a Pentium. What matters is special-purpose streamlining, such as reducing instruction-decoding costs.

Reality: In 1997, DES Cracker was 1000 times faster than a set of Pentiums at the same price. What matters is parallelism.

SLIDE 39

Future computers will be massively parallel meshes.

Computer designers will laugh at today’s RAM-style machines, just as we laugh at a 1-tape Turing machine.

Algorithm experts will laugh at today’s dominant style of algorithm analysis, where we count CPU “operations” and view memory access as free.

SLIDE 41

Brute-force searches

For each 128-bit AES key $k$ define $H(k) = \mathrm{AES}_k(0)$.

Typical known-plaintext attack: given $H(k)$; want to find $k$.

Cryptanalyst builds machine with $p$ parallel AES circuits, each guessing $q$ keys, for a total of $pq$ keys.

Time: $q$ AES evaluations. Cost: $p$ AES circuits. Success chance: $pq/2^{128}$.
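
The trade-off in a parallel brute-force search can be tabulated with exact fractions. A minimal sketch: `circuits` and `keys_per_circuit` are hypothetical names for the two machine parameters, and the function only does the arithmetic described above (it does not run AES).

```python
from fractions import Fraction

KEY_BITS = 128  # AES-128 key size


def brute_force(circuits, keys_per_circuit):
    """Time, cost and success chance when every circuit guesses its own
    disjoint batch of keys, all circuits running in parallel."""
    total_keys = circuits * keys_per_circuit
    return {
        "time_in_aes_evaluations": keys_per_circuit,
        "cost_in_aes_circuits": circuits,
        "success_chance": min(Fraction(1), Fraction(total_keys, 2**KEY_BITS)),
    }


result = brute_force(circuits=2**20, keys_per_circuit=2**40)
print(result["success_chance"] == Fraction(1, 2**68))   # True
```

Doubling the number of circuits doubles the cost and the success chance while leaving the time unchanged; doubling the keys per circuit doubles the time instead.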

SLIDE 43

Cryptanalyst is actually attacking many AES keys. Wants to find $k_1, k_2, \ldots$ given $H(k_1), H(k_2), \ldots$.

Rivest’s “time-memory tradeoff using distinguished points” merges these computations.

For any 128-bit $x$: Compute $H(x), H(H(x)), \ldots$ until finding string that begins with 30 zero bits. Call that string $Z(x)$.

SLIDE 45

Given $H(k_1), H(k_2), \ldots, H(k_m)$:

Choose random $c_1, c_2, \ldots, c_m$. Store $Z(c_1), Z(c_2), \ldots, Z(c_m)$ in an array in RAM.

Compute each $Z(H(k_i))$; look up $Z(H(k_i))$ in the array.

If $Z(H(k_i)) = Z(c_j)$, check whether $H(k_i)$ matches any of $H(c_j), H(H(c_j)), \ldots$.

Details: avoid infinite loops; handle multiple collisions.
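
The distinguished-point procedure can be tried at toy scale. The sketch below is an illustration under loud assumptions, not the attack itself: the one-way function is a truncated SHA-256 standing in for AES, strings are 20 bits instead of 128, a distinguished point has 5 leading zero bits instead of 30, and all names (`H`, `Z`, `endpoint`, `attack`, the constants) are hypothetical.

```python
import hashlib

BITS = 20            # toy string size (assumption; the slides use 128 bits)
ZERO_BITS = 5        # toy distinguished-point rule (the slides use 30 zero bits)
MAX_STEPS = 1 << 10  # cap on walk length, to avoid infinite loops


def H(x):
    """Toy one-way function: truncated SHA-256. A stand-in, NOT AES."""
    digest = hashlib.sha256(x.to_bytes(4, "big")).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - BITS)


def is_distinguished(x):
    return x >> (BITS - ZERO_BITS) == 0


def Z(x):
    """First distinguished point among H(x), H(H(x)), ..., or None."""
    for _ in range(MAX_STEPS):
        x = H(x)
        if is_distinguished(x):
            return x
    return None


def endpoint(y):
    """Like Z, but a target that is itself distinguished counts as its own endpoint."""
    return y if is_distinguished(y) else Z(y)


def attack(targets, starts):
    """targets: known outputs H(k). starts: random starting points.
    Returns {target: x} for every target where some x with H(x) == target was found."""
    table = {}
    for c in starts:
        z = Z(c)
        if z is not None:
            table.setdefault(z, []).append(c)   # several starts may share an endpoint
    found = {}
    for target in targets:
        for c in table.get(endpoint(target), []):
            x = c
            for _ in range(MAX_STEPS):          # replay the walk that began at c
                y = H(x)
                if y == target:
                    found[target] = x
                    break
                if is_distinguished(y):
                    break
                x = y
            if target in found:
                break
    return found


targets = [H(k) for k in range(200)]
found = attack(targets, starts=range(5000, 5200))
print(all(H(x) == t for t, x in found.items()))   # True
```

The table lookup replaces a comparison of every target against every walk: only walks that end at the same distinguished point are replayed.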

SLIDE 47

Heuristic analysis: Computing $Z(c_1), Z(c_2), \ldots, Z(c_m)$ involves $\approx 2^{30} m$ outputs of $H$. If any of the inputs match $k_1$ then we’ll find $k_1$. Chance $\approx 2^{30} m / 2^{128}$. Same for $k_2, k_3, \ldots$.

Total chance $\approx 2^{30} m^2 / 2^{128}$ of finding at least one key.

On a serial computer, $\approx 2^{31} m$ AES evaluations. Cost: $128m$ bits of memory.
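
The heuristic estimates are plain arithmetic and can be tabulated exactly. This sketch assumes that the number of stored starting points equals the number of target keys (called `targets` here, a hypothetical name), that a walk takes about $2^{30}$ steps to reach a string with 30 leading zero bits, and that the chances are small enough to simply add up.

```python
from fractions import Fraction


def multi_target_estimates(targets, zero_bits=30, key_bits=128):
    """Heuristic estimates for the distinguished-point attack."""
    walk_length = 2**zero_bits            # expected steps per walk
    covered = walk_length * targets       # inputs touched by the stored walks
    return {
        "chance_per_key": Fraction(covered, 2**key_bits),
        "chance_any_key": Fraction(covered * targets, 2**key_bits),
        # one walk per starting point plus one walk per target
        "serial_aes_evaluations": 2 * walk_length * targets,
    }


estimates = multi_target_estimates(targets=2**40)
print(estimates["chance_any_key"] == Fraction(1, 2**18))   # True
print(estimates["serial_aes_evaluations"] == 2**71)        # True
```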

SLIDE 49

Much better: Massive parallelism. Compute all $Z$ values in parallel, using $m$ AES circuits. Use Schimmler sort to find collisions $Z(H(k_i)) = Z(c_j)$.

Time: $\approx 2^{31}$ AES evaluations, plus $8\sqrt{m}$ Schimmler steps. About $m$ times faster than serial.

Cost: $m$ AES circuits, plus network links. Maybe 100 times more expensive than serial. Can reduce the 100.

slide-51
SLIDE 51

Sieving

The “number-field sieve” (NFS) is today’s fastest method to factor a big RSA key $n$.

Most important NFS bottleneck: find small prime divisors of $a+1, a+2, \ldots, a+y$.

1000002: divisible by 2, 3
1000003:
1000004: divisible by 2, 2
1000005: divisible by 3, 5
1000006: divisible by 2, 7
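The example above can be reproduced with a small sieve. This is a sketch under stated assumptions, not the NFS sieving code: the function names are illustrative, and the prime bound of 7 is chosen only because it matches the divisors listed on the slide.

```python
def primes_up_to(bound):
    """Primes <= bound via the sieve of Eratosthenes."""
    if bound < 2:
        return []
    is_prime = [True] * (bound + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(bound ** 0.5) + 1):
        if is_prime[p]:
            for m in range(p * p, bound + 1, p):
                is_prime[m] = False
    return [p for p, flag in enumerate(is_prime) if flag]


def small_prime_divisors(first, last, prime_bound):
    """Map each integer in [first, last] to its prime divisors <= prime_bound.

    Divisors are listed with multiplicity, in increasing order.
    """
    divisors = {m: [] for m in range(first, last + 1)}
    for p in primes_up_to(prime_bound):
        start = ((first + p - 1) // p) * p  # first multiple of p >= first
        for m in range(start, last + 1, p):
            rest = m
            while rest % p == 0:  # record p once per power dividing m
                divisors[m].append(p)
                rest //= p
    return divisors


if __name__ == "__main__":
    for m, ps in small_prime_divisors(1000002, 1000006, 7).items():
        print(m, ps)
```

The sieve visits only multiples of each prime, so no trial division is wasted on integers the prime does not divide.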

slide-53
SLIDE 53

Conventional sieving/TWINKLE (e.g. 2000 Silverman, 2000 Lenstra Shamir): Generate pairs

(2, 1000002), (2, 1000004), (2, 1000006), ...,
(3, 1000002), (3, 1000005), ...,

etc. Use distribution sort to sort by second component.

For $y$ consecutive integers: $y^{1+o(1)}$ pairs. Sorting time $y^{1+o(1)}$; machine cost $y^{1+o(1)}$.
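The pair generation and the sort by second component can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the prime list is passed in by the caller, and the distribution sort is a plain bucket pass with one bucket per integer in the range.

```python
def generate_pairs(first, last, primes):
    """All pairs (p, m) with p in primes, first <= m <= last, and p dividing m."""
    pairs = []
    for p in primes:
        start = ((first + p - 1) // p) * p  # first multiple of p >= first
        for m in range(start, last + 1, p):
            pairs.append((p, m))
    return pairs


def distribution_sort_by_second(pairs, first, last):
    """Sort pairs by second component using one bucket per integer.

    The pass is stable, so within a bucket the primes keep their
    generation order.
    """
    buckets = [[] for _ in range(last - first + 1)]
    for p, m in pairs:
        buckets[m - first].append((p, m))
    return [pair for bucket in buckets for pair in bucket]


if __name__ == "__main__":
    pairs = generate_pairs(1000002, 1000006, [2, 3, 5, 7])
    for pair in distribution_sort_by_second(pairs, 1000002, 1000006):
        print(pair)
```

After sorting, all pairs for the same integer are adjacent, so the small prime divisors of each integer can be read off in one pass.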
slide-55
SLIDE 55

For same machine cost, achieve much higher speed by massive parallelism. e.g. Schimmler sort: sorting time $y^{0.5+o(1)}$; machine cost $y^{1+o(1)}$.

This drastically reduces overall NFS time for sufficiently large $n$. (2001 Bernstein)
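Schimmler's algorithm itself is not reproduced here. As a stand-in, the sketch below simulates shearsort, a simpler sorting method for a square mesh, to illustrate how a 2-D grid of cells can sort using only row and column operations. Shearsort needs on the order of $\sqrt{N}\log N$ parallel steps for $N$ cells, slightly more than a mesh sort whose step count is a constant times $\sqrt{N}$. The function name and the use of Python's built-in sort for each row or column phase are simplifications for illustration; a real mesh would perform each phase with neighbour-to-neighbour exchanges.

```python
import math


def shearsort(values, side):
    """Sort side*side values on a simulated side x side mesh.

    Alternates snake-order row sorts with column sorts, then reads
    the grid out in snake order.
    """
    if len(values) != side * side:
        raise ValueError("need exactly side*side values")
    grid = [list(values[r * side:(r + 1) * side]) for r in range(side)]

    def sort_rows():
        # Even rows ascending, odd rows descending (snake order).
        for r in range(side):
            grid[r].sort(reverse=(r % 2 == 1))

    def sort_columns():
        # Every column ascending from top to bottom.
        for c in range(side):
            column = sorted(grid[r][c] for r in range(side))
            for r in range(side):
                grid[r][c] = column[r]

    phases = math.ceil(math.log2(side)) + 1
    for _ in range(phases):
        sort_rows()
        sort_columns()
    sort_rows()

    # Read out in snake order: odd rows are stored right-to-left.
    result = []
    for r in range(side):
        result.extend(grid[r] if r % 2 == 0 else grid[r][::-1])
    return result


if __name__ == "__main__":
    data = [(i * 37) % 64 for i in range(64)]
    print(shearsort(data, 8))
```

Each row phase and each column phase touches every cell once in parallel on a real mesh, which is why the running time scales with the side length rather than with the total number of cells.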

slide-57
SLIDE 57

Can do even better with low-memory small-divisor algorithms, such as the elliptic-curve method (ECM). Time only $y^{0+o(1)}$; machine cost $y^{1+o(1)}$.

This further reduces overall NFS time for sufficiently large $n$. (2001 Bernstein)

Can also save time in another bottleneck, “linear algebra”; less important. (2001 Bernstein)

slide-59
SLIDE 59

NFS price-performance ratio is $\exp\bigl((\beta + o(1))\sqrt[3]{(\log n)(\log\log n)^2}\bigr)$ assuming standard conjectures.

sieving      linear algebra   $\beta$
RAM          RAM              2.85...
RAM          RAM              2.76...
Schimmler    RAM              2.37...
Schimmler    Schimmler        2.36...
ECM          RAM              2.08...
ECM          Schimmler        1.97...

(RAM 2.85: standard; 2.37, 1.97: 2001.11 Bernstein; RAM 2.76: 2002.04 Pomerance)
slide-61
SLIDE 61

Switching from RAM to a massively parallel machine produces gigantic NFS speedups for sufficiently large $n$.

Improvement from conventional RAM factorization, $\beta = 2.85\ldots$, to best machine, $\beta = 1.97\ldots$, corresponds to multiplying number of digits of $n$ by $3.009\ldots + o(1)$.
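A rough check of this multiplier, as a sketch under stated assumptions. If the cost is $\exp(c\,(\log n)^{1/3}(\log\log n)^{2/3})$ with constant $c$, and both the $o(1)$ term and the slow variation of $\log\log n$ are ignored, then equal cost for constants $c_1 > c_2$ means $c_1 (\log n_1)^{1/3} = c_2 (\log n_2)^{1/3}$, so the digit count grows by a factor $(c_1/c_2)^3$. The function name is illustrative, and the inputs are the truncated table values 2.85 and 1.97, so the result only approximates the slide's figure, which presumably uses more digits of each constant.

```python
def digit_multiplier(c_old, c_new):
    """Factor by which the number of digits can grow at equal cost.

    Ignores the o(1) term and treats log log n as constant.
    """
    return (c_old / c_new) ** 3


if __name__ == "__main__":
    print(round(digit_multiplier(2.85, 1.97), 3))
```

With the truncated constants the result is about 3.03, close to the $3.009\ldots$ quoted on the slide.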
slide-63
SLIDE 63

As always, $o(1)$ is asymptotic. Situation for small $n$ is much less clear.

How expensive is it to factor 1024-bit RSA keys? We still don’t know. Can now find many papers making wild predictions. None of the predictions can be taken seriously!

slide-65
SLIDE 65

NFS speed is complicated. Example: NFS factors $n$ using an auxiliary polynomial. Number of polynomial choices is huge. Effect of polynomial takes time to compute.

Some papers don’t put enough effort into polynomial choice, so they underestimate NFS speed. Some papers make unjustified optimal-polynomial extrapolations, so they overestimate NFS speed.

slide-67
SLIDE 67

At a lower level, today’s massively parallel computers are much less streamlined than today’s Pentiums.

Computer market will evolve. Massive parallelism will become the de-facto standard, and will be tuned carefully. How much speed will we gain? Today it’s hard to say. But we’ll find out!