How to use the new 65-megawatt Bluffdale supercomputer: a gentle introduction to cryptanalysis

D. J. Bernstein, University of Illinois at Chicago & Technische Universiteit Eindhoven


slide-1
SLIDE 1

Picture credit: Rick Bowmer/AP

How to use the new 65-megawatt Bluffdale supercomputer: a gentle introduction to cryptanalysis

  • D. J. Bernstein

University of Illinois at Chicago & Technische Universiteit Eindhoven

slide-2
SLIDE 2

Disclaimers

  • 1. I don’t work for NSA.
  • 2. NSA hasn’t told me anything.
  • 3. This is not a leak.
  • 4. I’m assuming that NSA is not stupid.
  • 5. Also assuming use of traditional transistors+wires, probably with some optics; plus long-term storage. Quantum computing would require different analysis.
slide-7
SLIDE 7

Cryptographic challenges

My mission: Cryptographically protect every Internet packet against espionage+sabotage.

User needs crypto to be fast on devices designed primarily for doing something else.

User also needs crypto to be secure. Some examples of crypto failing:

  ✎ 2009 exploit of RSA-512 signatures in TI calculators (small public computation);
  ✎ 2010 exploit of ECDSA signatures in PlayStation 3 (trivial—stupid Sony mistake);
  ✎ 2012 exploit of MD5-based signatures by Flame malware (somewhat larger computation).

Presumably many more examples not known to the public.
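The PlayStation 3 break above is simple enough to show in full. Sony reused the same ECDSA nonce across signatures, so two signatures leak the private key. The sketch below uses the real P-256 group order but a made-up private key, nonce, and stand-in r value (real ECDSA derives r from the curve point k*G); only the signing-equation algebra is the point here.

```python
# Toy sketch of the 2010 PS3 ECDSA failure: a reused nonce k means two
# signatures on different messages expose the private key d.
# All key/nonce/hash values below are made up for illustration.

n = 0xFFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551  # P-256 group order
d = 0x1234567890ABCDEF            # hypothetical private key
k = 0xCAFEBABE                    # the "random" nonce that never changed
r = pow(7, k, n)                  # stand-in for x(k*G) mod n

def sign(z):
    # ECDSA signing equation: s = k^-1 * (z + r*d) mod n
    return (pow(k, -1, n) * (z + r * d)) % n

z1, z2 = 0x11111111, 0x22222222   # hashes of two different messages
s1, s2 = sign(z1), sign(z2)

# Attacker sees (r, s1, z1) and (r, s2, z2) with identical r:
k_rec = ((z1 - z2) * pow(s1 - s2, -1, n)) % n   # s1*k - s2*k = z1 - z2
d_rec = ((s1 * k_rec - z1) * pow(r, -1, n)) % n # s*k = z + r*d
assert (k_rec, d_rec) == (k, d)
print(hex(d_rec))                 # → 0x1234567890abcdef
```

This is why the slide calls it "trivial": the computation is a handful of modular inversions, not a supercomputer job.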

slide-17
SLIDE 17

Critical questions:

Which cryptographic systems fit the user’s cost constraints?
  ✮ optimize choice of cryptosystem + algorithm for each user device.

Which cryptographic systems can be broken by attackers?
  ✮ optimize choice of attack algorithm + device for each cryptosystem.

Heavy interactions between high-level algorithms and low-level computer architecture.

slide-23
SLIDE 23

Theory vs. experiment

Predictions made by theoretical physicists are often disputed, sometimes wrong. Common sources of error: underlying models of physics; calculations from those models.

Experiments aren’t perfect but catch many errors; resolve many disputes; provide raw data leading to new theories; build more confidence than theory alone can ever produce.

Is physics uniquely error-prone? Of course not. Every field of science: theoreticians make predictions regarding observable phenomena; experimental scientists measure those phenomena; we compare the results.

What if measurements are too expensive to carry out? Measurements start with scaled-down experiments, work up towards the scale of interest.

slide-33
SLIDE 33

Algorithm analysis is another error-prone field of science. Theoreticians make predictions regarding algorithm performance. These predictions are often disputed, sometimes wrong.

Particularly error-prone: cryptanalytic extrapolations from an academic computation to a serious real-world attack.

We catch errors, resolve disputes by carrying out experiments: actually running these algorithms on the largest scale we can.
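The predict-then-run methodology can be shown in miniature. Pollard's rho should factor a semiprime p*q in roughly p^(1/2), i.e. about n^(1/4), steps; the sketch below (a naive textbook rho, not a serious cryptanalytic tool, with arbitrarily chosen test sizes) counts iterations at small scales and compares against that prediction, exactly the scaled-down-experiment loop the slides describe.

```python
# Predict, then actually run: count Pollard-rho iterations on random
# semiprimes of growing size and compare to the ~2^(bits/2) prediction.
import random
from math import gcd

def is_prime(m):
    # deterministic Miller-Rabin for the sizes used here
    if m < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if m % p == 0:
            return m == p
    d, s = m - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        x = pow(a, d, m)
        if x in (1, m - 1):
            continue
        for _ in range(s - 1):
            x = x * x % m
            if x == m - 1:
                break
        else:
            return False
    return True

def rand_prime(bits):
    while True:
        m = random.getrandbits(bits) | (1 << (bits - 1)) | 1
        if is_prime(m):
            return m

def rho(n):
    # Pollard's rho with Floyd cycle-finding; returns (factor, iterations)
    while True:
        x = y = random.randrange(2, n)
        c = random.randrange(1, n)
        steps = 0
        while True:
            x = (x * x + c) % n
            y = (y * y + c) % n
            y = (y * y + c) % n
            steps += 1
            g = gcd(abs(x - y), n)
            if g == n:
                break              # unlucky cycle: retry with a new c
            if g > 1:
                return g, steps

random.seed(1)
for bits in (14, 18, 22, 26):
    n = rand_prime(bits) * rand_prime(bits)
    avg = sum(rho(n)[1] for _ in range(5)) / 5
    # prediction: iterations grow like 2^(bits/2), i.e. roughly 4x per +4 bits
    print(f"{bits}-bit factors: {avg:.0f} steps, ratio {avg / 2**(bits/2):.2f}")
```

If the observed ratio drifts with size, the prediction (or the implementation) is wrong, which is precisely the kind of error these experiments catch.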
slide-39
SLIDE 39

1980s security evaluation: “QS” factorization algorithm costs 2^100 to break RSA-1024.

1990 Pollard: new “NFS”.

1991 Adleman: NFS won’t beat QS for RSA-1024.

Subsequent experiments ✮ NFS is much faster; maybe 2^80?

Actual security of RSA-1024 is still a matter of dispute: e.g., 2009 Bos–Kaihara–Kleinjung–Lenstra–Montgomery oppose NIST’s transition to RSA-2048.
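The slide's figures come from the standard heuristic cost formulas: QS costs L_n[1/2, 1] and NFS costs L_n[1/3, (64/9)^(1/3)], where L_n[a, c] = exp(c * (ln n)^a * (ln ln n)^(1-a)). The sketch below plugs in a 1024-bit modulus while ignoring the o(1) terms in the exponent; those ignored terms are exactly why such extrapolations remain disputed.

```python
# Heuristic asymptotic costs for factoring a 1024-bit modulus,
# with the o(1) terms dropped (so these are only rough indications).
from math import log

def L(n_bits, a, c):
    # returns log2 of L_n[a, c] = exp(c * (ln n)^a * (ln ln n)^(1-a))
    ln_n = n_bits * log(2)
    return c * ln_n ** a * log(ln_n) ** (1 - a) / log(2)

qs  = L(1024, 1/2, 1.0)               # quadratic sieve
nfs = L(1024, 1/3, (64/9) ** (1/3))   # number field sieve
print(f"QS : 2^{qs:.0f}")   # about 2^98, matching the slide's "2^100"
print(f"NFS: 2^{nfs:.0f}")  # about 2^87, in the direction of "maybe 2^80"
```

The gap between these two curves is the 1990s dispute in one line: for 1024-bit moduli the exponent drops by more than ten powers of two when QS is replaced by NFS.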

slide-47
SLIDE 47

The attacker’s supercomputer

Enough theory+experiment should reach consensus on amount of computation required to break a system. But can the attacker perform this amount of computation?

Hypothesize attacker resources. This talk: $2 billion, 65MW. Alternative: millions of compromised Internet computers.

The interesting part: analyze optimal use of those resources.
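To make the 65MW hypothesis concrete, here is a back-of-envelope sketch. The joules-per-operation figure is purely an illustrative assumption (efficient circa-2013 hardware is often quoted around a picojoule per low-level operation, and the slides do not state one); the point is only the order of magnitude, and the next slides are about how those joules are actually spent.

```python
# Rough budget: how many bit operations does 65 MW for one year buy,
# under an ASSUMED energy cost of ~1 pJ per bit operation?
from math import log2

power_w  = 65e6               # 65 MW, from the slide
seconds  = 365 * 24 * 3600    # one year
joules   = power_w * seconds
j_per_op = 1e-12              # assumption, not from the slides
ops      = joules / j_per_op
print(f"~2^{log2(ops):.0f} bit operations per year")
```

Under that assumption the budget lands near 2^90 bit operations per year, which is why the 2^80-vs-2^100 dispute about RSA-1024 on the previous slides matters so much.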
slide-54
SLIDE 54

Communication vs. arithmetic

Bill Dally, 2013.06.17: “Communication takes more energy than arithmetic”.

slide-56
SLIDE 56

Stephen S. Pawlowski, 2013.06.18: “The majority of energy that we spend today is on transferring data.”

slide-57
SLIDE 57

Depends what you’re doing! Computations fundamentally vary in amount of communication (distance and volume) and amount of arithmetic.

slide-60
SLIDE 60

Some algorithms using n^2 data: Square matrix-vector product: n^2 arithmetic.

slide-62
SLIDE 62

FFT for input size n^2: n^2 lg n arithmetic.

slide-63
SLIDE 63

Matrix-matrix product: typically n^3 arithmetic without Strassen etc.

slide-64
SLIDE 64

Integrals in quantum chemistry, many common iterations, graph algorithms, etc.: n^4 arithmetic, sometimes more.

slide-67
SLIDE 67

Chip area n^(2+o(1)) is enough to store all data for the size-n^2 FFT.

slide-69
SLIDE 69

Chip area n^(2+o(1)) is also enough for n^2 parallel ALUs.

slide-70
SLIDE 70

FFT takes time n^(o(1)), thanks to parallelism? No! Routing the FFT data occupies area n^(2+o(1)) for time n^(1+o(1)).

slide-71
SLIDE 71

1981 Brent–Kung: need time n^(1+o(1)) even without wire delays.
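A toy model of why parallelism cannot drive the FFT down to time n^(o(1)) — my construction, assuming a square 2D chip, not from the slides: data must physically cross the chip, and a chip holding m = n^2 cells has side length about sqrt(m).

```python
import math

# Toy 2D-chip model: m cells laid out on a square chip of side sqrt(m).

def ideal_parallel_time(m):
    # With free communication: lg m rounds of butterfly stages.
    return math.log2(m)

def mesh_routing_time(m):
    # On a 2D layout, data may have to travel ~sqrt(m) cell widths,
    # so routing alone forces time on the order of sqrt(m), i.e.
    # n^(1+o(1)) when m = n^2.
    return math.sqrt(m)

m = 2 ** 20
assert mesh_routing_time(m) > ideal_parallel_time(m)  # 1024.0 vs 20.0
```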

slide-74
SLIDE 74

Chip area n^(2+o(1)) is enough to store several n × n matrices. Routing the matrix product occupies area n^(2+o(1)) for time n^(1+o(1)). Typical n^3 arithmetic also occupies n^2 ALUs for time n^(1+o(1)). Closer look at the o(1): the ALU cost dominates, although not by much.

slide-76
SLIDE 76

rea ♥2+✎ enough to store data for size-♥2 FFT. rea ♥2+✎ enough for ♥ rallel ALUs. takes time ♥✎, to parallelism? No! Routing the FFT data ccupies area ♥2+✎ e ♥1+✎. Brent–Kung: need ♥1+✎ without wire delays. Chip area ♥2+✎ is enough to store several ♥ ✂ ♥ matrices. Routing matrix product

  • ccupies area ♥2+✎

for time ♥1+✎. Typical ♥3 arithmetic also occupies ♥2 ALUs for time ♥1+✎. Closer look at ✎: the ALU cost dominates, although not by much. ❃90% of

  • f typical

is spent ❁10% on Is Bluffdale

slide-78
SLIDE 78

>90% of the cost of typical supercomputers is spent on communication; <10% on ALUs. Is Bluffdale built this way?

slide-80
SLIDE 80

No; NSA is not stupid. Doubling the number of ALUs would cost <10% extra. It would ≈double performance of matrix-matrix product and heavier-arith computations. NSA’s computations have a mix of heavy arith and heavy comm.
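A back-of-envelope version of this argument, with illustrative numbers (the ~10% ALU share is the slide's figure; the normalization is my assumption):

```python
# If ALUs are ~10% of machine cost and communication is the rest,
# doubling the ALU count adds <10% to total cost but roughly doubles
# throughput on arithmetic-bound workloads.
alu_share = 0.10
comm_share = 1.0 - alu_share

cost_before = alu_share + comm_share       # 1.0 (normalized machine cost)
cost_after = 2 * alu_share + comm_share    # 1.1: under 10% extra
arith_throughput_gain = 2.0                # twice the ALUs

print(round(cost_after / cost_before, 3), arith_throughput_gain)
```

The asymmetry is the whole point: a communication-dominated budget makes extra arithmetic almost free, which only pays off if the workload is arith-heavy.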
slide-83
SLIDE 83

GPUs have many ALUs but relatively little communication capacity: a few long wires to RAM. Is Bluffdale built this way?

slide-85
SLIDE 85

No; NSA is not stupid. Adding communication between adjacent ALUs would cost very little. It would drastically speed up matrix-matrix product and heavier-comm computations: FFT, sorting, etc.

slide-88
SLIDE 88

Documentation tells me that the Intel Xeon Phi has many ALUs and a few long wires to RAM, plus adjacent one-dimensional communication (ring bus). Is Bluffdale built this way?

slide-90
SLIDE 90

No; NSA is not stupid. Adding a two-dimensional grid would drastically speed up heavy-comm computations; e.g., 1977 Thompson–Kung. Grid examples: MasPar; FPGAs. But FPGAs have other problems.

slide-93
SLIDE 93

Save even more time with a 3D arrangement of ALUs? e.g., 1983 Rosenberg. Huge engineering challenge. 2D allows easy scaling of energy input and heat output up to very large chip area; 3D is hard to scale. Some limited progress (most interesting: optics), presumably used by NSA. Progress is often exaggerated: e.g., 4 × 16384 × 16384 is often called “3D”.

slide-98
SLIDE 98

Special vs. general purpose

Typical cryptanalytic arith: between 100× and 1000× better performance per transistor from ASICs than from mass-market CPUs, GPUs. Some exceptions, but overall ASICs bring massive speedup.

slide-99
SLIDE 99

Only in cryptanalysis? No. Estimated ASIC improvement from a preliminary scan of other supercomputing arith problems: usually >10×, often >100×.

slide-102
SLIDE 102

Frequent observation: chips spend area, time, and energy on decoding+scheduling insns. ⇒ CPU/GPU design trend: reduce insn-handling cost by adding vectorization — apply the same instruction to multiple data/threads.
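A toy model of that trend (my construction, not from the slides): decode/schedule cost is paid once per vector instruction instead of once per scalar operation.

```python
VECTOR_WIDTH = 8  # lanes per vector register (illustrative)

def vadd(a, b):
    # One "decoded instruction": elementwise add across all lanes.
    assert len(a) == len(b) == VECTOR_WIDTH
    return [ai + bi for ai, bi in zip(a, b)]

def insns_decoded(num_scalar_ops, width=VECTOR_WIDTH):
    # Instructions decoded for the same work: vectorized vs scalar.
    return num_scalar_ops // width, num_scalar_ops

x = list(range(VECTOR_WIDTH))
y = list(range(VECTOR_WIDTH))
assert vadd(x, y) == [2 * i for i in range(VECTOR_WIDTH)]
assert insns_decoded(1024) == (128, 1024)  # 8x less insn handling
```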

slide-104
SLIDE 104

But this does nothing to reduce the costs of reading data from the reg file and writing data to the reg file.

slide-106
SLIDE 106

general purpose cryptanalytic arith: ✂ and 1000✂ rmance per transistor from CPUs, GPUs. , but overall massive speedup. lysis? No. improvement scan of other arith problems: ❃ ✂ often ❃100✂. Frequent observation: chips spend area, time, energy

  • n decoding+scheduling insns.

✮ CPU/GPU design trend: reduce insn-handling cost by adding vectorization— apply same instruction to multiple data/threads. But this does nothing to reduce costs of reading data from reg file, writing data to reg file. Obvious strategy to reduce these reg costs: combine arith operations, doing more arith between read and Example: Build circuit to compute ①② + ③ CPU reads regs ①❀ ②❀ ③ computes ①② + ③; With separate mul, CPU reads ①❀ ②; computes ①② writes; reads back; ③ computes ①② + ③;

slide-107
SLIDE 107
  • se

rith: ✂ ✂ ansistor GPUs.

  • verall

eedup. No. rovement

  • ther

roblems: ❃ ✂ ❃ ✂. Frequent observation: chips spend area, time, energy

  • n decoding+scheduling insns.

✮ CPU/GPU design trend: reduce insn-handling cost by adding vectorization— apply same instruction to multiple data/threads. But this does nothing to reduce costs of reading data from reg file, writing data to reg file. Obvious strategy to reduce these reg costs: combine arith operations, doing more arith between read and write. Example: Build circuit to compute ①② + ③. CPU reads regs ①❀ ②❀ ③; computes ①② + ③; writes. With separate mul, add: CPU reads ①❀ ②; computes ①② writes; reads back; reads ③; computes ①② + ③; writes.

slide-108
SLIDE 108

Obvious strategy to reduce these reg costs: combine arith operations, doing more arith between read and write.

Example: Build circuit to compute xy + z. CPU reads regs x, y, z; computes xy + z; writes. With separate mul, add: CPU reads x, y; computes xy; writes; reads back; reads z; computes xy + z; writes.


slide-114
SLIDE 114

Common fp operations evolved in this way. Chip designer saw many single-precision fp muls, eventually spent area on circuit for those muls. Then spent much more area to expand the multiplier to double-precision fp. But people still run many single-precision computations. The multiplier transistors are mostly sitting idle.


slide-120
SLIDE 120

Another example: Your application does mul-sub-sub-sub-sub in its inner loop. Should CPU designer include mul circuit, 4 separate sub circuits? Same CPU then runs another application. Subtraction circuits are mostly sitting idle. CPU designer says no, reduces area per core. ⇒ Your application runs slowly.


slide-125
SLIDE 125

Many ASIC fp speedups beyond today's CPUs/GPUs:

  • Squaring is cheaper than multiplication.
  • Skip most normalizations.
  • Reduce precision to what is actually needed.
  • Add very fast sqrt if application needs it.
  • etc.

Cryptanalysis involves many multiplications but also a much wider variety of operations. Even larger ASIC speedups.


slide-130
SLIDE 130

So NSA builds ASICs for each application? The small problem: ASIC design effort. Not a serious issue for $2 billion. The big problem: Unpredictable application mix. NSA will want some agility to adapt to new computations and stop old computations. Quantify using historical data: how long is an ASIC useful?


slide-134
SLIDE 134

Obvious solution for NSA: some ASICs, plus heterogeneous mix of application-tuned integrated circuits (ATICs). Take a general-purpose CPU. Add exactly the big insn XYZZY needed by application, plus some vectorization. Think ahead, add agility: XYZZ? XZZY? XYQZZY? Still similar cost to ASIC. New CPU for each application. Merge similar applications if not much cost in area.


slide-144
SLIDE 144

1-slide Bluffdale user guide

Critical for algorithm designer and implementor: Massive parallelism. Grid communication. Multiple instruction sets with very useful instructions. Some vectorization. Occasional faults. Need to understand cryptanalysis: ECM, sparse linear algebra, differentials, FFTs, much more.