SLIDE 1 Picture credit: Rick Bowmer/AP
How to use the new 65-megawatt Bluffdale supercomputer: a gentle introduction to cryptanalysis
D. J. Bernstein, University of Illinois at Chicago & Technische Universiteit Eindhoven
SLIDE 6 Disclaimers
- 1. I don’t work for NSA.
- 2. NSA hasn’t told me anything.
- 3. This is not a leak.
- 4. I’m assuming that NSA is not stupid.
Also assuming use of traditional transistors+wires, probably with some optics; plus long-term storage. Quantum computing would require different analysis.
SLIDE 9 Cryptographic challenges
My mission: Cryptographically protect every Internet packet against espionage+sabotage.
SLIDE 16
User needs crypto to be fast on devices designed primarily for doing something else.
User also needs crypto to be secure. Some examples of crypto failing:
✎ 2009 exploit of RSA-512 signatures in TI calculators (small public computation);
✎ 2010 exploit of ECDSA signatures in PlayStation 3 (trivial—stupid Sony mistake);
✎ 2012 exploit of MD5-based signatures by Flame malware (somewhat larger computation).
Presumably many more examples not known to the public.
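The PlayStation 3 break is simple enough to sketch: Sony reused the ECDSA nonce k across signatures, and two signatures sharing a nonce leak the private key by pure modular arithmetic. The sketch below simulates only that arithmetic over the secp256k1 group order, with a placeholder standing in for the curve-point computation that produces r; it is not a full ECDSA implementation.

```python
# Sketch: why reusing the ECDSA nonce k (the PlayStation 3 mistake) leaks the key.
# Only the modular arithmetic of signing is simulated; r stands in for the
# x-coordinate of k*G, whose exact value is irrelevant to the algebra.
import secrets

n = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141  # secp256k1 group order

def sign(d, k, z):
    r = pow(7, k, n)               # placeholder for x(k*G); any fixed function of k works here
    s = pow(k, -1, n) * (z + r * d) % n
    return r, s

d = secrets.randbelow(n - 1) + 1   # private key
k = secrets.randbelow(n - 1) + 1   # nonce -- reused for two messages
z1, z2 = secrets.randbelow(n), secrets.randbelow(n)
r, s1 = sign(d, k, z1)
_, s2 = sign(d, k, z2)

# Same k => same r; subtracting the two signing equations solves for k, then d.
k_rec = (z1 - z2) * pow(s1 - s2, -1, n) % n
d_rec = (s1 * k_rec - z1) * pow(r, -1, n) % n
assert d_rec == d
```

The real attack needed nothing more than this algebra; the only Sony-specific fact was the constant nonce.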
SLIDE 22 Critical questions
Which cryptographic systems fit the user’s cost constraints?
✮ optimize choice of cryptosystem + algorithm for each user device.
Which cryptographic systems can be broken by attackers?
✮ optimize choice of attack algorithm + device for each cryptosystem.
Heavy interactions between high-level algorithms and low-level computer architecture.
SLIDE 27 Theory vs. experiment
Predictions made by theoretical physicists are often disputed, sometimes wrong. Common sources of error: underlying models of physics; calculations from those models.
Experiments aren’t perfect but catch many errors; resolve many disputes; provide raw data leading to new theories; build more confidence than theory alone can ever produce.
SLIDE 32
Is physics uniquely error-prone? Of course not. Every field of science: theoreticians make predictions regarding observable phenomena; experimental scientists measure those phenomena; we compare the results.
What if measurements are too expensive to carry out? Measurements start with scaled-down experiments, work up towards the scale of interest.
SLIDE 38
Algorithm analysis is another error-prone field of science. Theoreticians make predictions regarding algorithm performance. These predictions are often disputed, sometimes wrong.
Particularly error-prone: cryptanalytic extrapolations from an academic computation to a serious real-world attack.
We catch errors, resolve disputes by carrying out experiments: actually running these algorithms on the largest scale we can.
SLIDE 46
1980s security evaluation: “QS” factorization algorithm costs 2^100 to break RSA-1024.
1990 Pollard: new “NFS”. 1991 Adleman: NFS won’t beat QS for RSA-1024.
Subsequent experiments ✮ NFS is much faster; maybe 2^80?
Actual security of RSA-1024 is still a matter of dispute: e.g., 2009 Bos–Kaihara–Kleinjung–Lenstra–Montgomery oppose NIST’s transition to RSA-2048.
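Estimates like these come from the heuristic L-notation cost of sieving algorithms, L_n[a,c] = exp((c+o(1)) (ln n)^a (ln ln n)^(1-a)). A rough sketch, dropping the o(1) term entirely, so the absolute numbers are only indicative of the extrapolation, not real attack costs:

```python
# Sketch: where figures like "QS costs 2^100 for RSA-1024" come from.
# L-notation with the o(1) term dropped -- a crude asymptotic extrapolation.
from math import log

def L(bits, a, c):
    """log2 of L_n[a,c] for an n of the given bit length, ignoring o(1)."""
    ln_n = bits * log(2)
    return (c * ln_n**a * log(ln_n)**(1 - a)) / log(2)

qs  = L(1024, 1/2, 1.0)              # quadratic sieve: L[1/2, 1]
nfs = L(1024, 1/3, (64/9)**(1/3))    # number field sieve: L[1/3, (64/9)^(1/3)]
print(f"QS : 2^{qs:.0f}")    # about 2^98
print(f"NFS: 2^{nfs:.0f}")   # about 2^87
```

Even this crude version reproduces the shape of the dispute: the NFS exponent is far below the QS exponent at 1024 bits, while neither number is trustworthy as an absolute cost.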
SLIDE 51 The attacker’s supercomputer
Enough theory+experiment should reach consensus on the amount of computation required to break a system. But can the attacker perform this amount of computation?
Hypothesize attacker resources. This talk: $2 billion, 65MW. Alternative: millions of compromised Internet computers.
The interesting part: analyze optimal use of those resources.
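A back-of-envelope conversion shows why stating the budget in watts is useful. Assuming, purely for illustration, around 1 pJ per low-level operation (and counting only energy, not hardware cost or time limits):

```python
# Sketch: turning a hypothesized 65 MW power budget into an operation count.
# The energy-per-operation figure is an assumption for illustration only.
from math import log2

watts = 65e6
seconds_per_year = 365.25 * 24 * 3600   # about 3.16e7 s
joules_per_op = 1e-12                   # assumed ~1 pJ per low-level operation

ops_per_year = watts * seconds_per_year / joules_per_op
print(f"~2^{log2(ops_per_year):.0f} operations/year")   # ~2^91 under these assumptions
```

Dropping the assumed per-operation energy by a factor of 1000 adds only about 10 to the exponent, which is why the interesting question is how the attacker organizes the computation, not the raw budget.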
SLIDE 57 The attacker’s supercomputer
Enough theory+experiment should reach consensus on the amount of computation required to break a system. But can the attacker perform this amount of computation? Hypothesize attacker resources. This talk: $2 billion, 65MW. Alternative: millions of compromised Internet computers. The interesting part: analyze optimal use of those resources.
Communication vs. arithmetic
Bill Dally, 2013.06.17: “Communication takes more energy than arithmetic”. Stephen S. Pawlowski, 2013.06.18: “The majority of energy that we spend today is on transferring data.” Depends what you’re doing! Computations fundamentally vary in amount of communication (distance and volume) and amount of arithmetic.
SLIDE 64
Some algorithms using n^2 data: Square matrix-vector product: n^2 arithmetic. FFT for input size n^2: n^2 lg n arithmetic. Matrix-matrix product: typically n^3 arithmetic without Strassen etc. Integrals in quantum chemistry, many common iterations, graph algorithms, etc.: n^4 arithmetic, sometimes more.
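The operation counts above are easy to check mechanically. A small sketch of my own (the counter functions are hypothetical, not from the talk) confirming that two algorithms touching the same n^2 data can need very different amounts of arithmetic:

```python
# Count multiplications for algorithms that all touch ~n^2 data.
# Toy counters (my own illustration): they confirm the n^2 vs n^3
# operation counts claimed above.

def matvec_mults(n):
    """Naive n x n matrix times length-n vector: one mult per matrix entry."""
    count = 0
    for i in range(n):
        for j in range(n):
            count += 1  # A[i][j] * x[j]
    return count

def matmul_mults(n):
    """Naive n x n matrix product (no Strassen): n mults per output entry."""
    count = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                count += 1  # A[i][k] * B[k][j]
    return count

n = 32
assert matvec_mults(n) == n**2  # n^2 arithmetic on n^2 data
assert matmul_mults(n) == n**3  # n^3 arithmetic on the same n^2 data
```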
SLIDE 71 Chip area n^(2+o(1)) is enough to store all data for size-n^2 FFT. Chip area n^(2+o(1)) is also enough for n^2 parallel ALUs. FFT takes time n^(o(1)), thanks to parallelism? No! Routing the FFT data occupies area n^(2+o(1)) for time n^(1+o(1)). 1981 Brent–Kung: need n^(1+o(1)) even without wire delays.
SLIDE 75 Chip area n^(2+o(1)) is enough to store several n × n matrices. Routing matrix product occupies area n^(2+o(1)) for time n^(1+o(1)). Typical n^3 arithmetic also occupies n^2 ALUs for time n^(1+o(1)). Closer look at the o(1): the ALU cost dominates, although not by much.
SLIDE 80 >90% of the cost of typical supercomputers is spent on communication; <10% on ALUs. Is Bluffdale built this way? No; NSA is not stupid. Doubling number of ALUs would cost <10% extra. Would ≈double performance for matrix-matrix product and heavier-arith computations. NSA’s computations have a mix of heavy arith and heavy comm.
SLIDE 85 GPUs have many ALUs but relatively little communication capacity: a few long wires to RAM. Is Bluffdale built this way? No; NSA is not stupid. Adding communication between adjacent ALUs would cost very little. Would drastically speed up matrix-matrix product and heavier-comm computations: FFT, sorting, etc.
SLIDE 90 Documentation tells me that Intel Xeon Phi has many ALUs and a few long wires to RAM plus adjacent one-dimensional communication (ring bus). Is Bluffdale built this way? No; NSA is not stupid. Adding a two-dimensional grid would drastically speed up heavy-comm computations. e.g. 1977 Thompson–Kung. Grid examples: MasPar; FPGAs. But FPGAs have other problems.
SLIDE 94 Save even more time with 3D arrangement of ALUs? e.g. 1983 Rosenberg. Huge engineering challenge. 2D allows easy scaling of energy input, heat output up to very large chip area. 3D is hard to scale. Some limited progress (most interesting: optics), presumably used by NSA. Progress often exaggerated: e.g., 4 × 16384 × 16384 is often called “3D”.
SLIDE 99 Special vs. general purpose
Typical cryptanalytic arith: between 100× and 1000× better performance per transistor from ASICs than from mass-market CPUs, GPUs. Some exceptions, but overall ASICs bring massive speedup. Only in cryptanalysis? No. Estimated ASIC improvement from preliminary scan of other supercomputing arith problems: usually >10×, often >100×.
SLIDE 104 Frequent observation: chips spend area, time, energy on decoding+scheduling insns. ⇒ CPU/GPU design trend: reduce insn-handling cost by adding vectorization: apply same instruction to multiple data/threads. But this does nothing to reduce costs of reading data from reg file, writing data to reg file.
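The asymmetry can be put in numbers with a toy cost model of my own (the functions and constants here are hypothetical, not from the talk): vectorization divides instruction decodes by the vector width, but per-element register traffic stays the same.

```python
# Toy cost model: decodes vs register-file accesses for n elementwise
# adds (two operand reads + one result write per element).

def scalar_costs(n):
    decodes = n           # one instruction decoded per element
    reg_accesses = 3 * n  # two reads + one write per add
    return decodes, reg_accesses

def vector_costs(n, width):
    decodes = n // width      # one instruction covers `width` elements
    reg_accesses = 3 * n      # unchanged: data still moves per element
    return decodes, reg_accesses

n = 1024
assert scalar_costs(n)[0] == 1024
assert vector_costs(n, 8)[0] == 128                   # 8x fewer decodes
assert scalar_costs(n)[1] == vector_costs(n, 8)[1]    # same reg traffic
```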
SLIDE 108 Obvious strategy to reduce these reg costs: combine arith operations, doing more arith between read and write. Example: Build circuit to compute xy + z. CPU reads regs x, y, z; computes xy + z; writes. With separate mul, add: CPU reads x, y; computes xy; writes; reads back; reads z; computes xy + z; writes.
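Counting the register-file traffic in the example above makes the saving concrete. A small sketch of my own (one access per operand read and per result write; the function names are hypothetical):

```python
# Register-file traffic for computing x*y + z, counted as described
# on the slide: one access per operand read, one per result write.

def fused_traffic():
    # reads x, y, z; computes x*y + z; writes result
    return 3 + 1                 # 4 accesses

def separate_traffic():
    # reads x, y; computes x*y; writes t;
    # reads t back; reads z; computes t + z; writes result
    return (2 + 1) + (2 + 1)     # 6 accesses

assert fused_traffic() == 4
assert separate_traffic() == 6   # 50% more reg-file traffic
```

This is exactly the argument behind fused multiply-add units: the arithmetic is the same, but combining the two operations removes one write and one read of the intermediate product.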
SLIDE 114
Obvious strategy to reduce these reg costs: combine arith operations, doing more arith between read and write. Example: Build circuit to compute ①② + ③. CPU reads regs ①❀ ②❀ ③; computes ①② + ③; writes. With separate mul, add: CPU reads ①❀ ②; computes ①②; writes; reads back; reads ③; computes ①② + ③; writes. Common fp operations evolved in this way. Chip designer saw many single-precision fp muls, eventually spent area on circuit for those muls. Then spent much more area to expand the multiplier to double-precision fp. But people still run many single-precision computations. The multiplier transistors are mostly sitting idle.
SLIDE 115
Obvious strategy to these reg costs: combine arith operations, more arith een read and write. Example: Build circuit compute ①② + ③. reads regs ①❀ ②❀ ③; computes ①② + ③; writes. separate mul, add: reads ①❀ ②; computes ①②; writes; reads back; reads ③; computes ①② + ③; writes. Common fp operations evolved in this way. Chip designer saw many single-precision fp muls, eventually spent area on circuit for those muls. Then spent much more area to expand the multiplier to double-precision fp. But people still run many single-precision computations. The multiplier transistors are mostly sitting idle. Another Your application mul-sub-sub-sub-sub in its inner Should CPU include mul 4 separate
SLIDE 116
to costs: erations, and write. circuit ①② ③. ①❀ ②❀ ③; ①② ③; writes. mul, add: ①❀ ② computes ①②; back; reads ③; ①② ③; writes. Common fp operations evolved in this way. Chip designer saw many single-precision fp muls, eventually spent area on circuit for those muls. Then spent much more area to expand the multiplier to double-precision fp. But people still run many single-precision computations. The multiplier transistors are mostly sitting idle. Another example: Your application do mul-sub-sub-sub-sub in its inner loop. Should CPU design include mul circuit, 4 separate sub circuits?
SLIDE 117
①② ③ ①❀ ②❀ ③ ①② ③ ①❀ ② ①②; ③; ①② ③ Common fp operations evolved in this way. Chip designer saw many single-precision fp muls, eventually spent area on circuit for those muls. Then spent much more area to expand the multiplier to double-precision fp. But people still run many single-precision computations. The multiplier transistors are mostly sitting idle. Another example: Your application does mul-sub-sub-sub-sub in its inner loop. Should CPU designer include mul circuit, 4 separate sub circuits?
SLIDE 118
Common fp operations evolved in this way. Chip designer saw many single-precision fp muls, eventually spent area on circuit for those muls. Then spent much more area to expand the multiplier to double-precision fp. But people still run many single-precision computations. The multiplier transistors are mostly sitting idle. Another example: Your application does mul-sub-sub-sub-sub in its inner loop. Should CPU designer include mul circuit, 4 separate sub circuits?
SLIDE 119
Common fp operations evolved in this way. Chip designer saw many single-precision fp muls, eventually spent area on circuit for those muls. Then spent much more area to expand the multiplier to double-precision fp. But people still run many single-precision computations. The multiplier transistors are mostly sitting idle. Another example: Your application does mul-sub-sub-sub-sub in its inner loop. Should CPU designer include mul circuit, 4 separate sub circuits? Same CPU then runs another application. Subtraction circuits are mostly sitting idle.
SLIDE 120
Common fp operations evolved in this way. Chip designer saw many single-precision fp muls, eventually spent area on circuit for those muls. Then spent much more area to expand the multiplier to double-precision fp. But people still run many single-precision computations. The multiplier transistors are mostly sitting idle.
Another example: Your application does mul-sub-sub-sub-sub in its inner loop. Should CPU designer include mul circuit, 4 separate sub circuits? Same CPU then runs another application. Subtraction circuits are mostly sitting idle. CPU designer says no, reduces area per core.
✮ Your application runs slowly.
SLIDE 130
Many ASIC fp speedups beyond today's CPUs/GPUs:
✎ Squaring is cheaper than multiplication.
✎ Skip most normalizations.
✎ Reduce precision to what is actually needed.
✎ Add very fast sqrt if application needs it.
✎ etc.
Cryptanalysis involves many multiplications but also a much wider variety of operations. Even larger ASIC speedups.
So NSA builds ASICs for each application? The small problem: ASIC design effort. Not a serious issue for $2 billion. The big problem: Unpredictable application mix. NSA will want some agility to adapt to new computations and stop old computations. Quantify using historical data: how long is an ASIC useful?
SLIDE 144
Obvious solution for NSA: some ASICs, plus heterogeneous mix of application-tuned integrated circuits (ATICs). Take a general-purpose CPU. Add exactly the big insn XYZZY needed by application, plus some vectorization. Think ahead, add agility: XYZZ? XZZY? XYQZZY? Still similar cost to ASIC. New CPU for each application. Merge similar applications if not much cost in area.
1-slide Bluffdale user guide
Critical for algorithm designer and implementor:
- Massive parallelism.
- Grid communication.
- Multiple instruction sets with very useful instructions.
- Some vectorization.
- Occasional faults.
Need to understand cryptanalysis: ECM, sparse linear algebra, differentials, FFTs, much more.