SLIDE 1 Verifiable ASICs
Michael Walfish
- Dept. of Computer Science, Courant Institute, NYU
Aarhus Workshop on Secure Multiparty Computation, 1 June 2016
SLIDE 2
This is joint work with:
Riad S. Wahby (Stanford), Max Howald (Cooper Union and NYU), Siddharth Garg (NYU), abhi shelat (U. of Virginia).
Riad recently presented this work at IEEE S&P (Oakland).
SLIDE 3
Problem: the manufacturer (“foundry” or “fab”) of a custom chip (“ASIC”) can undermine the chip’s execution.
[Figure: the principal (govt, chip vendor, …) sends a chip design to the chip manufacturer (“foundry” or “fab”); the fabricated chip ends up in encrypted phones, which an eavesdropper targets.]
Response: control the manufacturing chain with a trusted foundry
SLIDE 4 But trusted fabrication is not a panacea:
§ Only 5 countries have cutting-edge fabs on shore
§ Building a new fab takes $billions and years of R&D
§ With semiconductor technology, area and energy shrink with the square and cube of transistor dimension
§ So old fabs mean an enormous penalty. Example of India: 10^8×.
Still, trusted fabrication is the only solution with strong guarantees:
§ For example, post-fab detection can be thwarted
[A2: Analog Malicious Hardware. Yang et al., IEEE S&P 2016]
We thought: probabilistic proofs might let us get trust more cheaply!
SLIDE 5 An alternative: Verifiable ASICs
[Figure: the principal compiles F into designs for P and V; an integrator has the untrusted fab (fast) build P and the trusted fab (slow) build V; at run time, V sends input x to P, which returns y and a proof that F(x) = y.]
SLIDE 6 Makes sense if V + P is cheaper than trusted F. Reasons for hope:
§ Running time of V < F (asymptotically)
§ Implementations exist, and …
§ … though their costs for P are absurd, an advanced fab might make P cheaper than F (!)
[Figure: timeline of probabilistic proofs. Theory: GMR85, Babai85, BCC86, BFLS91, FGLSS91, Kilian92, ALMSS92, AS92, Micali94, BG02, GOS06, IKO07, GKR08, KR09, GGP10, Groth10, GLR11, Lipmaa11, BCCT12, GGPR12, BCCT13, KRR14, … Systems: SBW11, CMT12, SMBW12, TRMP12, SVPBBW12, SBVBPW13, VSBW13, PGHR13, Thaler13, BCGTV13, BFRSBW13, BFR13, DFKP13, BCTV14a, BCTV14b, BCGGMTV14, FL14, KPPSST14, FGP14, WSRHBW15, BBFR15, CFHKKNPZ15, CTV15, KZMQCPPsS15.]
SLIDE 7 Reasons for caution:
§ The theory is silent about feasibility (and the onus here is heavier than in prior work)
§ Costs must reflect hardware: energy, area, …
§ We need physically realizable designs and plausible computation sizes
Makes sense if V + P is cheaper than trusted F
SLIDE 8
Zebra: (1) a system that saves costs (2) … sometimes
SLIDE 9 Implementations of probabilistic proofs have two parts:
§ A program translator (front-end) compiles a C program (main(){ ... }) into an arithmetic circuit (AC) over 𝔽p
§ A probabilistic proof protocol (back-end) runs between prover and verifier over the AC: on input x, it produces y and a proof
Back-ends include: interactive proofs [GKR08]; interactive arguments [IKO07]; non-interactive arguments (CS proofs, SNARGs, SNARKs) [Micali94, Groth10, Lipmaa12, GGPR12]
SLIDE 10 Two families of back-ends:
arguments (interactive, SNARK, CS proof, etc.) [GGPR12, PGHR13, SBVBPW13, BCTV14]:
- non-deterministic ACs
- arbitrary AC geometry
- 1-round, 2-round protocols
→ unsuited to hardware
interactive proofs [GKR08, CMT12, VSBW13]:
- deterministic ACs only
- layered, low-depth ACs
- lots of rounds, communication
→ suited to hardware
SLIDE 11 Zebra builds on the GKR interactive proof [GKR08, CMT12, VSBW13]; computations are expressed as layered arithmetic circuits over 𝔽p
[Figure: V and P step through the AC layer by layer, with one sum-check invocation [LFKN90] per layer; at the end, V outputs ACCEPT or REJECT.]
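To make the representation concrete, here is a minimal sketch (Python, not part of Zebra; the prime and the example circuit are illustrative) of evaluating a layered AC over 𝔽p, where every gate reads two wires from the previous layer:

p = 2**61 - 1  # illustrative prime modulus, not Zebra's parameter

def eval_layered_ac(layers, x):
    # layers: list of layers; each gate is (op, left, right), indexing
    # the previous layer's wires. Returns the output layer's values.
    wires = x
    for layer in layers:
        wires = [(wires[l] + wires[r]) % p if op == 'add'
                 else (wires[l] * wires[r]) % p
                 for (op, l, r) in layer]
    return wires

# depth-2 example computing (x0 + x1) * (x2 + x3)
print(eval_layered_ac([[('add', 0, 1), ('add', 2, 3)],
                       [('mul', 0, 1)]], [1, 2, 3, 4]))  # [21]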
SLIDE 12 Zebra builds on the GKR interactive proof [GKR08, CMT12, VSBW13]; computations are expressed as layered arithmetic circuits over 𝔽p
V’s sequential running time: O(depth · log width + |x| + |y|), assuming precomputation of queries
Cost to execute F directly: O(depth · width)
Soundness error: minuscule for large p
SLIDE 13 Zebra builds on the GKR interactive proof [GKR08, CMT12, VSBW13]; computations are expressed as layered arithmetic circuits over 𝔽p
V’s sequential running time: O(depth · log width + |x| + |y|), assuming precomputation of queries
P’s sequential running time: O(depth · width · log width)
Cost to execute F directly: O(depth · width)
Soundness error: minuscule for large p
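To get a feel for these asymptotics, a back-of-the-envelope comparison (hypothetical circuit dimensions, constants ignored):

import math

depth, width, io = 32, 2**20, 2**10         # hypothetical layered AC; io = |x| + |y|
v_work = depth * math.log2(width) + io      # V: O(depth · log width + |x| + |y|)
p_work = depth * width * math.log2(width)   # P: O(depth · width · log width)
f_work = depth * width                      # executing F directly
print(f"V {v_work:.1e}, F {f_work:.1e}, P {p_work:.1e}")
# V does ~20,000x less work than executing F; P pays a ~20x (log width) factor over F.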
SLIDE 14
Zebra extracts parallelism.
Execution step: layers are sequential, but gates can be executed in parallel.
Proving step: can P and V parallelize the interaction?
§ No. V must ask questions in order
§ But parallelism is still available
SLIDE 15 V questions P about F(x1)’s output layer
Simultaneously, P returns F(x2)
SLIDE 16 V questions P about F(x1)’s next layer and F(x2)’s output layer
Meanwhile, P returns F(x3)
SLIDE 17 This process continues (P returns F(x4); proofs for F(x1)…F(x3) are in flight)
SLIDE 18 This process continues (P returns F(x5); proofs for F(x1)…F(x4) are in flight)
SLIDE 19 This process continues (P returns F(x7); proofs for F(x1)…F(x6) are in flight) until V and P are completing one proof in each time step.
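A small sketch of this schedule (Python; num_inputs and d are illustrative): at step t, P executes F on a fresh input while V questions P about one layer of each proof in flight. After the first d steps, one proof finishes per step.

def pipeline(num_inputs, d):
    # d = number of AC layers; the proof of F(x_i) starts one step after x_i runs
    for t in range(num_inputs + d):
        if t < num_inputs:
            print(f"step {t}: P executes F(x{t + 1})")
        for i in range(num_inputs):
            layer = t - i - 1
            if 0 <= layer < d:
                print(f"step {t}: V questions P about F(x{i + 1}), layer {layer}")

pipeline(num_inputs=4, d=3)  # step 1 matches slide 15, step 2 matches slide 16, …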
SLIDE 20 [Figure: the prover decomposes into sub-provers, one per layer: layer 0, layer 1, …, layer d-1.]
This is nothing other than pipelining, a classic hardware technique. It applies because layering organizes the work into stages. There are other opportunities along these lines.
SLIDE 21 sub-prover, layer i
Sub-prover’s obligation in round j of a sum-check invocation: return H_j(0), H_j(1), H_j(2), where
  H_j(k) = Σ_{gates g} u_j(g) · v_j(g, k)
and, on receiving V’s challenge r_j, update
  u_{j+1}(g) = u_j(g) · v_j(g, r_j)
As pseudocode:
  for k in {0,1,2}:
    H[k] ← 0
    for all gates g: H[k] ← H[k] + u[g]*v(g,k)
  for all gates g: u[g] ← u[g]*v(g,rj)
[Figure: a straightforward design: gate modules 1 … g each load u[g] from RAM, compute u[g]·v(g,k) for k = 0, 1, 2, store the new u[g], and feed an adder tree that forms H_j(0), H_j(1), H_j(2).]
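In software, the round-j obligation is only a few lines (a Python sketch, assuming the field modulus p and the per-gate function v are given; Zebra realizes this in Verilog, with the inner loop spread across gate modules and the sums formed by the adder tree):

p = 2**61 - 1  # illustrative modulus

def sumcheck_round(u, v, r_j):
    # Return H_j(0), H_j(1), H_j(2) and the updated per-gate state u_{j+1}
    H = [sum(u[g] * v(g, k) for g in range(len(u))) % p for k in (0, 1, 2)]
    u_next = [(u[g] * v(g, r_j)) % p for g in range(len(u))]
    return H, u_next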
SLIDE 22 sub-prover, layer i (same obligation as above)
[Figure: the same design, except each gate module now holds u[g] in a local register instead of loading and storing it through RAM.]
SLIDE 23 sub-prover, layer i (same obligation as above)
[Figure: the final design: gate module g keeps u[g] locally and computes u[g]·v(g,k) for k = 0, 1, 2; the adder tree combines the per-gate products into H_j(0), H_j(1), H_j(2).]
SLIDE 24
Summary of Zebra’s design approach:
§ Extract parallelism
§ Pipelined proving, adder tree, gate proving, etc.
§ Exploit locality: distribute state and control
§ Custom registers (no RAM): “data” wires are few and short
§ Latency-insensitive design: few “control” wires
§ Reduce and reuse
SLIDE 25 [Figure: the layer-i sub-prover of slide 23, shown again.]
SLIDE 26
Summary of Zebra’s design approach:
§ Extract parallelism
§ Pipelined proving, adder tree, gate proving, etc.
§ Exploit locality: distribute state and control
§ Custom registers (no RAM): “data” wires are few and short
§ Latency-insensitive design: few “control” wires
§ Reuse and recycle
§ Recycle hardware circuitry for different tasks
§ Save energy by adding memoization to P (see the sketch after this list)
§ Reuse block designs; optimizations thus have high pay-off
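For the memoization bullet, a software analogue (a sketch only; Zebra’s mechanism is a hardware optimization, and the cached function here is a stand-in for the actual per-gate work):

from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def v(g, k):
    global calls
    calls += 1                # count real evaluations
    return (3 * g + k) % 97   # stand-in for the actual per-gate computation

for _ in range(3):            # the same (g, k) queries recur
    for g in range(8):
        for k in (0, 1, 2):
            v(g, k)
print(calls)                  # 24, not 72: repeated queries cost no recomputation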
SLIDE 27 Architectural and operational challenges for Zebra
1. Communication between V and P is high bandwidth
§ V and P on a circuit board? Too much energy, circuit area
§ Zebra’s response: use 3D packaging
2. Protocol requires input-independent precomputation
§ Zebra’s response: amortize precomputations over many V-P pairs
3. Trusted storage would be prohibitive
§ Zebra’s response: use untrusted storage, with authenticated encryption (a sketch follows)
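A minimal sketch of response 3 (Python with the cryptography package; the cipher choice and 12-byte nonces are this sketch’s assumptions, not necessarily Zebra’s): V keeps only a key on-chip and detects any tampering with data parked in untrusted storage.

import os
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

key = ChaCha20Poly1305.generate_key()   # stays inside trusted V
aead = ChaCha20Poly1305(key)

def put(blob, label):
    # encrypt-then-store in untrusted memory; label binds the context
    nonce = os.urandom(12)
    return nonce, aead.encrypt(nonce, blob, label)

def get(nonce, ct, label):
    # raises InvalidTag if untrusted storage modified anything
    return aead.decrypt(nonce, ct, label)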
SLIDE 28
The implementation of Zebra includes:
§ An arithmetic circuit to synthesizable Verilog compiler for P
§ Composes with existing C to arithmetic circuit compilers
§ Two V implementations: hardware (Verilog) and software (C++)
§ A library to generate V’s precomputations
§ Verilog simulator extensions to model software or hardware V’s interactions with P and with storage
SLIDE 29
This implementation seemed to work great.
Zebra: 10^4 or 10^5 proofs per second. Existing implementations: at least 10 seconds per proof.
But that isn’t a serious evaluation …
SLIDE 30 Evaluation method
§ Baseline: direct implementation of F in the same technology as V
§ Metrics: energy, chip size per throughput (in paper)
§ Assessed with circuit synthesis and simulation, published chip designs, and CMOS scaling models
§ Charge for V, P, communication; retrieving and decrypting precomputations; PRNG; operator communicating with V
§ Constraints: trusted fab = 350 nm; untrusted fab = 7 nm; max chip area = 200 mm²; max total power = 150 W
SLIDE 31 Evaluation method (as above). For context on the constraints: 350 nm is roughly 1997 technology, and 7 nm is roughly 2017 [TSMC].
SLIDE 32
Application #1: number theoretic transform
NTT: a Fourier transform over 𝔽p
Used in signal processing, computer algebra, etc.
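For concreteness, a direct O(n²) definition of the NTT in Python; the prime and root here are toy parameters, and real implementations (and Zebra’s AC) use FFT-style structure:

p, n = 97, 8
w = pow(5, (p - 1) // n, p)   # 5 generates the multiplicative group mod 97,
                              # so w is a primitive 8th root of unity

def ntt(a):
    # DFT over F_p: the root of unity w plays the role of e^{-2*pi*i/n}
    return [sum(a[j] * pow(w, i * j, p) for j in range(n)) % p
            for i in range(n)]

print(ntt([1, 2, 3, 4, 0, 0, 0, 0]))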
SLIDE 33 Application #1: number theoretic transform
[Figure: ratio of baseline energy to Zebra energy (higher is better), on a log scale from 0.1 to 3, versus log2(NTT size) from 6 to 13.]
SLIDE 34 Application #2: Curve25519 point multiplication
Curve25519: a commonly-used elliptic curve
Point multiplication: the primitive used for ECDH
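As a reference for the computation being proved, an x-only Montgomery ladder for Curve25519 in Python (a textbook sketch following RFC 7748’s formulas, omitting scalar clamping and constant-time swaps; this is the underlying math, not Zebra’s circuit):

P25519 = 2**255 - 19
A24 = 121666   # (486662 + 2) / 4, for Curve25519's Montgomery coefficient

def point_mul(k, u):
    # returns the x-coordinate of k*P for a point P with x-coordinate u
    x1, z1, x2, z2 = 1, 0, u, 1   # (x1:z1) = identity, (x2:z2) = P
    for i in reversed(range(255)):
        bit = (k >> i) & 1
        if bit:
            x1, z1, x2, z2 = x2, z2, x1, z1   # conditional swap
        a, b = (x1 + z1) % P25519, (x1 - z1) % P25519
        c, d = (x2 + z2) % P25519, (x2 - z2) % P25519
        aa, bb = a * a % P25519, b * b % P25519
        e = (aa - bb) % P25519
        x1, z1 = aa * bb % P25519, e * (bb + A24 * e) % P25519   # double
        da, cb = d * a % P25519, c * b % P25519
        x2 = (da + cb) * (da + cb) % P25519                      # differential add
        z2 = u * (da - cb) * (da - cb) % P25519
        if bit:
            x1, z1, x2, z2 = x2, z2, x1, z1   # swap back
    return x1 * pow(z1, P25519 - 2, P25519) % P25519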
SLIDE 35 Application #2: Curve25519 point multiplication
[Figure: ratio of baseline energy to Zebra energy (higher is better), on a log scale from 0.1 to 3, versus the number of parallel Curve25519 point multiplications: 84, 170, 340, 682, 1147.]
SLIDE 36
(1) Zebra: a system that saves costs (2) … sometimes
SLIDE 37 Summary of Zebra’s applicability:
1. Computation F must have a layered, shallow, deterministic AC
2. Need a wide gap between the (fast) fab for P and the (trusted) fab for V
3. Computation F must be relatively large for V to save work
4. Computation F must be efficient as an arithmetic circuit (AC)
5. Must amortize precomputations over many chips
(Restriction 1 is a restriction of the interactive proof (IP) setup.)
SLIDE 38 Why did we build Zebra atop IPs instead of arguments? Because argument protocols seem unfriendly to hardware. Comparing design principles for interactive proofs [GKR08, CMT12, VSBW13] versus arguments (interactive, SNARK, CS proof, etc.) [GGPR12, PGHR13, SBVBPW13, BCTV14]:
Extract parallelism: IPs ✓, arguments ✓
Exploit locality: IPs ✓, arguments ✗
Reduce and reuse: IPs ✓, arguments ✗
In arguments, P computes over the entire AC at once ⟶ needs RAM, and P does crypto for every gate in the AC ⟶ special crypto circuits needed. We hope these issues are surmountable!
SLIDE 39 Reality check on the restrictions:
1. Computation F must have a layered, shallow, deterministic AC
2. Need a wide gap between the (fast) fab for P and the (trusted) fab for V
3. Computation F must be relatively large for V to save work
4. Computation F must be efficient as an arithmetic circuit (AC)
5. Must amortize precomputations over many chips
Restriction 1 applies to interactive proofs (IPs) but not arguments; restrictions 2-5 are common to all implementations of probabilistic proofs.
SLIDE 40 A limitation that is endemic to the area: need a wide gap between the (fast) fab for P and the (trusted) fab for V
[Figure: worker’s (prover’s) cost normalized to native C, on a log scale from 10^1 to 10^13, for matrix multiplication (m=128) and PAM clustering (m=20, d=128), across Pepper, Ginger, Pinocchio, Zaatar, CMT, Allspice, TinyRAM, and Thaler; N/A where a system cannot run the benchmark.]
SLIDE 41 Limitations that are endemic to the area: computation F must be relatively large for V to save work, and F must be efficient as an arithmetic circuit
§ Example: libsnark’s [BCTV14] optimized implementation of GGPR/Pinocchio [GGPR12, PGHR13]. Great work, but:
§ Verification time: 6 ms + (|x| + |y|)・3 µs on a 2.7 GHz CPU
§ That time is >16 million CPU ops, which is a break-even point
§ libsnark handles ≤ 16 million gates (with 32 GB of RAM), so to break even, F also needs on average CPU_ops/AC_gate > 1 (the arithmetic is checked in the sketch below)
§ Example of AC-efficiency: addition over 𝔽p instead of over fixed-width integers
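Checking the break-even arithmetic (illustrative only; the 6 ms, 2.7 GHz, and 16-million-gate figures are from the slide):

verify_cycles = 6e-3 * 2.7e9        # 6 ms on a 2.7 GHz CPU ~ 1.62e7 ops
max_gates = 16e6                    # libsnark's capacity with 32 GB of RAM
print(verify_cycles)                # > 16 million: the break-even point
print(verify_cycles / max_gates)    # ~1: so F needs > 1 CPU op per AC gate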
SLIDE 42 Built probabilistic proof protocols amortize precomputations*
*Exception: CMT [CMT12] applied to highly regular arithmetic circuits
System / amortizes precomputation over / size of advice:
§ Zebra: multiple V-P pairs; short advice
§ Allspice [VSBW13]: over a batch of instances of a given F; short advice
§ Bootstrapped SNARKs [BCTV14a, CTV15]: over all computations; long advice
§ BCTV [BCTV14b]: over all computations of the same length; long advice
§ Pinocchio [PGHR13]: over all future uses of a given F; long advice
§ Pepper [SMBW12], Ginger [SVPBBW12], Zaatar [SBVBPW13]: over a batch of instances of a given F; long advice
SLIDE 43
Lessons (re)learned:
§ Do careful feasibility studies first!
§ Hardware is a powerful tool for acceleration …
§ … but only if data flows are amenable
§ Theory of computation versus application of physics
§ General-purpose verifiable computation and succinct arguments are still far from practical
SLIDE 44
Summary and take-aways
§ Verifiable ASICs: a new approach to building trustworthy hardware under a strong threat model
§ First hardware design for a probabilistic proof protocol; first work to capture the cost of prover and verifier together
§ Improves performance compared to the trusted baseline
§ Improvement compared to the baseline is modest
§ Applicability is limited
§ Amortization, arithmetic circuits, “big enough” computations, a large gap between trusted and untrusted technology, etc.
§ Zebra is plausibly deployable (!), but work remains for this area
http://www.pepper-project.org