 
Verifiable ASICs
Aarhus Workshop on Secure Multiparty Computation, 1 June 2016
Michael Walfish, Dept. of Computer Science, Courant Institute, NYU
This is joint work with: Riad S. Wahby (Stanford), Max Howald (Cooper Union and NYU), Siddharth Garg (NYU), abhi shelat (U. of Virginia).
Riad recently presented this work at IEEE S&P (Oakland).
Problem: the manufacturer (“foundry” or “fab”) of a custom chip (“ASIC”) can undermine the chip’s execution.
[Figure: a principal (govt, chip vendor, …) sends a chip design to the chip manufacturer (“foundry” or “fab”); the chips end up in two encrypted phones, with an eavesdropper on the link between them.]
Response: control the manufacturing chain with a trusted foundry.
Trusted fabrication is the only solution with strong guarantees.
• For example, post-fab detection can be thwarted [A2: Analog Malicious Hardware. Yang et al., IEEE S&P 2016]
But trusted fabrication is not a panacea:
• Only 5 countries have cutting-edge fabs on shore
• Building a new fab takes $billions and years of R&D
• With semiconductor technology, area and energy reduce with the square and cube of transistor dimension
So: old fabs mean an enormous penalty. Example of India: 10^8×.
We thought: probabilistic proofs might let us get trust more cheaply!
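An alternative: Verifiable ASICs
• The principal turns F into designs for P and V.
• A trusted fab (slow) builds V; an untrusted fab (fast) builds P.
• An integrator combines P and V; an operator supplies the input x.
• P returns the output y together with a proof that F(x) = y, which V checks.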
[Figure: V and P (input x; P returns y and a proof that F(x) = y) vs. a trusted F.]
Makes sense if V + P is cheaper than trusted F.
Reasons for hope:
• Running time of V < F (asymptotically)
• Implementations exist, and …
• … though their costs for P are absurd, advanced fab might make P cheaper than F (!)
Prior work listed on the slide, first column: GMR85, Babai85, BCC86, BFLS91, FGLSS91, Kilian92, ALMSS92, AS92, Micali94, BG02, GOS06, IKO07, GKR08, KR09, GGP10, Groth10, GLR11, Lipmaa11, BCCT12, GGPR12, BCCT13, KRR14, …
Prior work listed on the slide, second column: SBW11, CMT12, SMBW12, TRMP12, SVPBBW12, SBVBPW13, VSBW13, PGHR13, Thaler13, BCGTV13, BFRSBW13, BFR13, DFKP13, BCTV14a, BCTV14b, BCGGMTV14, FL14, KPPSST14, FGP14, WSRHBW15, BBFR15, CFHKKNPZ15, CTV15, KZMQCPPsS15
Makes sense if V + P is cheaper than trusted F.
Reasons for caution:
• The theory is silent about feasibility (and the onus here is heavier than in prior work)
• Costs must reflect hardware: energy, area, …
• We need physically realizable designs and plausible computation sizes
(1) Zebra: a system that saves costs (2) … sometimes
Implementations of probabilistic proofs:
• Front-end (program translator): C program, main(){ ... } → arithmetic circuit (AC) over 𝔽_p
• Back-end (probabilistic proof protocol): the verifier sends x; the prover returns y and a proof
Back-end options:
• interactive proof [GKR08]
• interactive argument [IKO07]
• non-interactive argument (CS proof, SNARG, SNARK) [Micali94, Groth10, Lipmaa12, GGPR12]
Two families of back-ends (V sends x; P returns y and a proof):
Arguments (interactive, SNARK, CS proof, etc.) [GGPR12, PGHR13, SBVBPW13, BCTV14]
• non-deterministic ACs
• arbitrary AC geometry
• 1-round, 2-round protocols
• unsuited to hardware
Interactive proofs [GKR08, CMT12, VSBW13]
• deterministic ACs only
• layered, low-depth ACs
• lots of rounds, communication
• suited to hardware
Zebra builds on the GKR interactive proof [GKR08, CMT12, VSBW13]; computations are expressed as layered arithmetic circuits over 𝔽_p.
[Figure: the verifier sends x to the prover and the prover returns y; the two then run a sequence of sum-check invocations [LFKN90], after which the verifier outputs ACCEPT / REJECT.]
Zebra builds on the GKR interactive proof [GKR08, CMT12, VSBW13]; computations are expressed as layered arithmetic circuits over 𝔽_p.
• Soundness error: minuscule for large p
• Cost to execute F directly: O(depth · width)
• V’s sequential running time: O(depth · log width + |x| + |y|), assuming precomputation of queries
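To make the O(depth · width) cost of direct execution concrete, here is a minimal sketch of evaluating a layered arithmetic circuit over 𝔽_p. The circuit encoding (a list of layers, each a list of `(op, left, right)` gates that index into the previous layer), the function name `eval_layered_circuit`, and the modulus 2^61 − 1 are all assumptions made for illustration, not details taken from the slides.

```python
# Illustrative sketch only: the encoding and names below are hypothetical.
P = 2**61 - 1  # assumed prime modulus for the field F_p


def eval_layered_circuit(layers, inputs):
    """Evaluate a layered arithmetic circuit over F_p.

    `layers` is ordered from the input side to the output side. Each gate is
    a tuple (op, a, b) whose operands are indices into the previous layer.
    Work is one field operation per gate, i.e. O(depth * width) overall.
    """
    values = [x % P for x in inputs]
    for layer in layers:
        next_values = []
        for op, a, b in layer:
            if op == "add":
                next_values.append((values[a] + values[b]) % P)
            elif op == "mul":
                next_values.append((values[a] * values[b]) % P)
            else:
                raise ValueError(f"unknown gate type: {op}")
        values = next_values
    return values


# Example circuit computing (x0 * x1) + (x2 * x3): depth 2, width at most 2.
example_layers = [
    [("mul", 0, 1), ("mul", 2, 3)],
    [("add", 0, 1)],
]
example_output = eval_layered_circuit(example_layers, [2, 3, 4, 5])
```

Gates within one layer depend only on the previous layer, which is why the inner loop could run in parallel in hardware while the outer loop stays sequential.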
Zebra builds on the GKR interactive proof [GKR08, CMT12, VSBW13]; computations are expressed as layered arithmetic circuits over 𝔽_p.
• Soundness error: minuscule for large p
• Cost to execute F directly: O(depth · width)
• V’s sequential running time: O(depth · log width + |x| + |y|), assuming precomputation of queries
• P’s sequential running time: O(depth · width · log width)
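The sum-check invocations [LFKN90] at the heart of the protocol can be illustrated with a simplified sketch. This is not GKR itself: it runs sum-check on a multilinear polynomial given by its table of values on {0,1}^n, so each round message has only two evaluations (GKR's round polynomials have higher degree). The prover here is honest about everything except possibly the claimed sum, and the verifier's final evaluation of the polynomial is modeled by the folded table. The names `fold` and `sum_check` and the modulus are assumptions.

```python
# Simplified sum-check sketch for a multilinear polynomial (not full GKR).
import random

P = 2**61 - 1  # assumed prime modulus


def fold(table, r):
    """Fix the first variable to r: f(r, rest) = f(0, rest) + r*(f(1, rest) - f(0, rest))."""
    half = len(table) // 2
    return [(table[i] + r * (table[half + i] - table[i])) % P for i in range(half)]


def sum_check(table, claimed_sum, rng):
    """Return True if the verifier accepts the claim sum(table) == claimed_sum.

    `table` has length 2^n; its index bits are the variable assignment,
    with the first variable as the most significant bit.
    """
    claim = claimed_sum % P
    current = [t % P for t in table]
    while len(current) > 1:
        half = len(current) // 2
        # Prover's message: the round polynomial g evaluated at 0 and 1.
        g0 = sum(current[:half]) % P
        g1 = sum(current[half:]) % P
        # Verifier's check: g(0) + g(1) must equal the running claim.
        if (g0 + g1) % P != claim:
            return False
        # Verifier's random challenge; g is linear, so g(r) is interpolated.
        r = rng.randrange(P)
        claim = (g0 + r * (g1 - g0)) % P
        current = fold(current, r)
    # Final check: the polynomial evaluated at the random point must match.
    return current[0] == claim


values = [3, 1, 4, 1, 5, 9, 2, 6]
accepted = sum_check(values, sum(values), random.Random(0))
rejected = sum_check(values, sum(values) + 1, random.Random(0))
```

The number of rounds is log2 of the table size, which is where the log width factors in V's and P's running times come from.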
Zebra extracts parallelism
• Execution step: layers are sequential, but gates can be executed in parallel.
• Proving step: can P and V parallelize the interaction?
  • No: V must ask questions in order.
  • But parallelism is still available.
V questions P about F(x_1)’s output layer. Simultaneously, P returns F(x_2).
V questions P about F(x_1)’s next layer and F(x_2)’s output layer. Meanwhile, P returns F(x_3).
This process continues until V and P are completing one proof in each time step.
[Figure: successive builds of a pipeline diagram in which F(x_1), F(x_2), …, F(x_7) enter one per time step.]
[Figure: the prover is divided into sub-provers, one per layer: layer 0, layer 1, …, layer d-1.]
This is nothing other than pipelining, a classic hardware technique. It applies because layering organizes the work into stages. There are other opportunities along these lines.
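The pipelining described here can be sketched as a schedule: which sub-prover is working on which input at each time step. The sketch assumes that each layer takes exactly one time step and that a new input enters at every step; the function name `pipeline_schedule` is hypothetical.

```python
# Illustrative schedule for a pipelined prover; timing model is an assumption.
def pipeline_schedule(num_inputs, depth):
    """Return, for each time step, the list of (input_index, layer) pairs in flight.

    Input i enters the pipeline at step i and occupies sub-prover `layer`
    at step i + layer, so at most `depth` proofs are in flight at once.
    """
    steps = []
    for t in range(num_inputs + depth - 1):
        active = [(i, t - i) for i in range(num_inputs) if 0 <= t - i < depth]
        steps.append(active)
    return steps


def completions_per_step(num_inputs, depth):
    """Count how many proofs finish (reach the last layer) at each step."""
    return [
        sum(1 for _, layer in active if layer == depth - 1)
        for active in pipeline_schedule(num_inputs, depth)
    ]


schedule = pipeline_schedule(num_inputs=4, depth=3)
finished = completions_per_step(num_inputs=4, depth=3)
```

Once the pipeline has filled (after `depth - 1` steps), one proof completes per step, which matches the steady state described on the slides.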
Sub-prover’s obligation in round j of sum-check invocation: return $H_j(0)$, $H_j(1)$, $H_j(2)$, where

$H_j(k) = \sum_{\text{gates } g} u_j(g) \cdot v_j(g, k)$

$u_{j+1}(g) = u_j(g) \cdot v_j(g, r_j)$ for all gates g

In pseudocode:

    for k in {0,1,2}:
        H[k] ← 0
        for all gates g:
            H[k] ← H[k] + u[g]*v(g,k)
    for all gates g:
        u[g] ← u[g]*v(g,rj)

[Figure, three builds of the sub-prover for layer i:
(1) Gate modules 1 through g each load u[g] from a RAM, compute u[g]*v(g, k) for k = 0, 1, 2, and store the new u[g] using r_j; an adder tree sums the products into H_j(0), H_j(1), H_j(2).
(2) The RAM is replaced by a register u[g] inside each gate module.
(3) The products u[g]*v(g, k=0), u[g]*v(g, k=1), u[g]*v(g, k=2) are shown in flight, one after another, from each gate module into the adder tree.]
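The round computation above translates almost directly into software. The following is a sequential sketch of what the gate modules and adder tree compute together; the function name `sub_prover_round`, the modulus, and the toy `v` used in the example are assumptions, since the slides do not define v concretely.

```python
# Sequential sketch of one sub-prover round; `v` is supplied by the caller.
P = 2**61 - 1  # assumed prime modulus


def sub_prover_round(u, v, r_j):
    """Compute H[0..2] and the updated u for one sum-check round.

    u   : list of per-gate values u_j(g)
    v   : function v(g, k) returning a field element (hypothetical interface)
    r_j : the verifier's challenge for this round
    """
    H = [0, 0, 0]
    for k in (0, 1, 2):
        # In hardware, each gate module forms its product in parallel and
        # an adder tree sums them; here the sum is a plain loop.
        for g in range(len(u)):
            H[k] = (H[k] + u[g] * v(g, k)) % P
    # Each gate module then updates its own state with the challenge.
    new_u = [(u[g] * v(g, r_j)) % P for g in range(len(u))]
    return H, new_u


def toy_v(g, k):
    """A stand-in for v, chosen only to make the example checkable."""
    return (g + 1 + k) % P


H_values, next_u = sub_prover_round([1, 2, 3], toy_v, r_j=5)
```

Each gate's work depends only on its own u[g], which is what lets the design keep state in per-gate registers instead of a shared RAM.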
Summary of Zebra’s design approach:
• Extract parallelism
  • Pipelined proving, adder tree, gate proving, etc.
• Exploit locality: distribute state and control
  • Custom registers (no RAM): “data” wires are few and short
  • Latency-insensitive design: few “control” wires
• Reduce and reuse
[Figure: sub-prover for layer i, with gate modules 1 through g holding u[1] through u[g] and feeding the products u[g]*v(g, k) for k = 0, 1, 2 into an adder tree.]
Summary of Zebra’s design approach:
• Extract parallelism
  • Pipelined proving, adder tree, gate proving, etc.
• Exploit locality: distribute state and control
  • Custom registers (no RAM): “data” wires are few and short
  • Latency-insensitive design: few “control” wires
• Reuse and recycle
  • Recycle hardware circuitry for different tasks
  • Save energy by adding memoization to P
  • Reuse block designs; optimizations thus have high pay-off
Architectural and operational challenges for Zebra
1. Communication between V and P is high bandwidth
   • V and P on circuit board? Too much energy, circuit area
   • Zebra’s response: use 3D packaging
2. Protocol requires input-independent precomputation
   • Zebra’s response: amortize precomputations over many V-P pairs
3. Trusted storage would be prohibitive
   • Zebra’s response: use untrusted storage, with authenticated encryption
The implementation of Zebra includes:
• An arithmetic circuit to synthesizable Verilog compiler for P
  • Composes with existing C to arithmetic circuit compilers
• Two V implementations:
  • hardware (Verilog)
  • software (C++)
• Library to generate V’s precomputations
• Verilog simulator extensions to model software or hardware V’s interactions with P and with storage
This implementation seemed to work great.
• Existing implementations: 10 seconds per proof, at least
• Zebra: 10^4 or 10^5 proofs per second
But that isn’t a serious evaluation …