Verifiable ASICs (Aarhus Workshop on Secure Multiparty Computation, 1 June 2016)



SLIDE 1

Verifiable ASICs

Michael Walfish

  • Dept. of Computer Science, Courant Institute, NYU

Aarhus Workshop on Secure Multiparty Computation 1 June 2016

SLIDE 2

This is joint work with:

Riad S. Wahby (Stanford), Max Howald (Cooper Union and NYU), Siddharth Garg (NYU), and abhi shelat (U. of Virginia). Riad recently presented this work at IEEE S&P (Oakland).

SLIDE 3

Problem: the manufacturer (“foundry” or “fab”) of a custom chip (“ASIC”) can undermine the chip’s execution.

[Diagram: the principal (govt, chip vendor, …) sends a chip design to the chip manufacturer (“foundry” or “fab”); the resulting chip sits between two encrypted phones, where a subverted chip could aid an eavesdropper.]

Response: control the manufacturing chain with a trusted foundry

SLIDE 4

But trusted fabrication is not a panacea:
§ Only 5 countries have cutting-edge fabs on shore
§ Building a new fab takes $billions and years of R&D
§ With semiconductor technology, area and energy reduce with the square and cube of transistor dimension
§ So old fabs mean an enormous penalty. Example of India: 10⁸×

Trusted fabrication is the only solution with strong guarantees.
§ For example, post-fab detection can be thwarted

[A2: Analog Malicious Hardware. Yang et al., IEEE S&P 2016]

We thought: probabilistic proofs might let us get trust more cheaply!

SLIDE 5

An alternative: Verifiable ASICs

[Diagram: the principal designs F, yielding designs for P and V. An untrusted fab (fast) builds P; a trusted fab (slow) builds V; an integrator assembles them. The operator supplies input x and receives output y along with a proof that F(x) = y.]
SLIDE 6

Makes sense if V + P is cheaper than trusted F. Reasons for hope:
§ Running time of V < F (asymptotically)
§ Implementations exist, and …
§ … though their costs for P are absurd, an advanced fab might make P cheaper than F (!)

[Diagram: V and P vs. F, with input x, output y, and a proof that F(x) = y; the background shows a genealogy of probabilistic proof systems, from GMR85, Babai85, BCC86, BFLS91, FGLSS91, Kilian92, ALMSS92, AS92, and Micali94, through BG02, GOS06, IKO07, GKR08, GGP10, Groth10, GGPR12, BCCT12/13, and KRR14, to built systems such as SBW11, CMT12, Pinocchio (PGHR13), Thaler13, BCTV14a/b, and KZMQCPPsS15.]

SLIDE 7

Reasons for caution:
§ The theory is silent about feasibility (and the onus here is heavier than in prior work)
§ Costs must reflect hardware: energy, area, …
§ We need physically realizable designs and plausible computation sizes

Makes sense if V + P is cheaper than trusted F.

[Repeat of the Slide 6 diagram and citation background.]

SLIDE 8

Zebra: a system that saves costs … sometimes

SLIDE 9

Implementations of probabilistic proofs:

[Diagram: a C program (main(){ … }) passes through a program translator (front-end) to produce an arithmetic circuit (AC) over 𝔽p; a probabilistic proof protocol (back-end) then runs between prover and verifier, producing x, y, and a proof.]

Back-ends include:
§ interactive proofs [GKR08]
§ interactive arguments [IKO07]
§ non-interactive arguments (CS proofs, SNARGs, SNARKs) [Micali94, Groth10, Lipmaa12, GGPR12]

SLIDE 10

[Diagram: P sends V the output y and a proof for input x.]

arguments (interactive, SNARK, CS proof, etc.) [GGPR12, PGHR13, SBVBPW13, BCTV14]:
§ non-deterministic ACs
§ arbitrary AC geometry
§ 1-round, 2-round protocols
⟶ unsuited to hardware

interactive proofs [GKR08, CMT12, VSBW13]:
§ deterministic ACs only
§ layered, low-depth ACs
§ lots of rounds, communication
⟶ suited to hardware

SLIDE 11

Zebra builds on the GKR interactive proof [GKR08, CMT12, VSBW13]; computations are expressed as layered arithmetic circuits over 𝔽p.

[Diagram: verifier and prover exchange x and y; the proof proceeds layer by layer as a sequence of sum-check invocations [LFKN90]; the verifier ends with ACCEPT/REJECT.]

SLIDE 12

Zebra builds on the GKR interactive proof [GKR08, CMT12, VSBW13]; computations are expressed as layered arithmetic circuits over 𝔽p.

V’s sequential running time: O(depth · log width + |x| + |y|), assuming precomputation of queries
Cost to execute F directly: O(depth · width)
Soundness error: minuscule for large p
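To make these asymptotics concrete, here is a toy operation count (hedged: the depth, width, and |x| = |y| values below are made up for illustration, and every operation is given unit cost; none of these numbers are from the talk):

```python
# Toy operation counts for the costs on this slide. depth, width, and
# n_io (= |x| = |y|) are illustrative values, not from the talk.
import math

depth, width, n_io = 30, 2**20, 2**10

v_work = depth * math.log2(width) + 2 * n_io  # O(depth*log width + |x| + |y|)
f_work = depth * width                        # O(depth*width): executing F directly

print(v_work)           # 2648.0
print(f_work)           # 31457280
print(f_work / v_work)  # ~1.2e4: the verifier does far less work than F
```

The gap grows with width, which is why the computation must be "big enough" for V to save work (a point slides 37–41 return to).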

SLIDE 13

Zebra builds on the GKR interactive proof [GKR08, CMT12, VSBW13]; computations are expressed as layered arithmetic circuits over 𝔽p.

V’s sequential running time: O(depth · log width + |x| + |y|), assuming precomputation of queries
Cost to execute F directly: O(depth · width)
Soundness error: minuscule for large p
P’s sequential running time: O(depth · width · log width)

[Diagram: verifier and prover exchange x, y, and per-layer sum-check invocations [LFKN90].]

SLIDE 14

Zebra extracts parallelism.

Execution step: layers are sequential, but gates can be executed in parallel.

Proving step: can P and V parallelize the interaction?
§ No. V must ask questions in order.
§ But parallelism is still available.

SLIDE 15

V questions P about F(x1)’s output layer. Simultaneously, P returns F(x2).

SLIDE 16

V questions P about F(x1)’s next layer and F(x2)’s output layer. Meanwhile, P returns F(x3).

SLIDE 17

This process continues.

SLIDE 18

This process continues.

SLIDE 19

This process continues until V and P are completing one proof in each time step.
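The schedule on Slides 15–19 is a textbook pipeline fill. A minimal sketch of it (hedged: a scheduling toy, not Zebra's hardware; `pipeline_completions` is a name I made up):

```python
# Illustrative sketch of a d-stage proving pipeline: the instance that
# enters the pipeline at step j finishes its proof d steps later, so after
# an initial fill of d steps, one proof completes in every time step.

def pipeline_completions(num_instances, depth, steps):
    """Return, per time step, which instance (if any) finishes its proof."""
    done = []
    for t in range(steps):
        j = t - depth  # instance j entered at step j and exits at step j + depth
        done.append(j if 0 <= j < num_instances else None)
    return done

# With depth 3, the first proof completes at step 3; from then on, one
# proof completes per step -- the steady state shown on Slide 19.
print(pipeline_completions(num_instances=5, depth=3, steps=8))
# [None, None, None, 0, 1, 2, 3, 4]
```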

SLIDE 20

[Diagram: the prover is split into sub-provers, one per layer: layer 0, layer 1, …, layer d-1.]

This is nothing other than pipelining, a classic hardware technique. It applies because layering organizes the work into stages. There are other opportunities along these lines.

SLIDE 21

Sub-prover for layer i. Its obligation in round j of a sum-check invocation: return Hj(0), Hj(1), Hj(2), where

Hj(k) = Σ_{gates g} uj(g) ⋅ vj(g, k),

and, on receiving the verifier’s challenge rj, update uj+1(g) = uj(g) ⋅ vj(g, rj). In pseudocode:

for k in {0,1,2}:
    H[k] ← 0
    for all gates g:
        H[k] ← H[k] + u[g]*v(g,k)
for all gates g:
    u[g] ← u[g]*v(g,rj)

[Diagram: gate modules 1 … g each load u[g] from RAM, compute u[g]*v(g, k) for k = 0, 1, 2, and store the new u[g]; an adder tree sums the per-gate terms.]
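The round obligation above is easy to mirror in software (hedged: `v` below is a stand-in for the per-gate polynomial terms the real GKR prover evaluates; the field and values are toys chosen for illustration):

```python
# A minimal software sketch of the sub-prover's round-j obligation.
# v(g, k) is a placeholder function, NOT the real GKR per-gate term.

p = 2**61 - 1  # a Mersenne prime stand-in; Zebra uses a large prime field

def subprover_round(u, v, r_j):
    """Return (H_j(0), H_j(1), H_j(2)) and the updated table u_{j+1}, where
    H_j(k)     = sum over gates g of u_j[g] * v(g, k)  (mod p)
    u_{j+1}[g] = u_j[g] * v(g, r_j)                    (mod p)
    """
    H = [sum(u[g] * v(g, k) for g in range(len(u))) % p for k in (0, 1, 2)]
    u_next = [u[g] * v(g, r_j) % p for g in range(len(u))]
    return H, u_next

# Toy use: 4 "gates", a made-up v, and a verifier challenge r_j = 7.
u = [3, 1, 4, 1]
v = lambda g, k: (g + 2 * k + 1) % p   # placeholder term
H, u = subprover_round(u, v, r_j=7)
print(H)  # [21, 39, 57]
print(u)  # [45, 16, 68, 18]
```

In hardware, the inner sum is exactly the adder tree in the diagram, and each `u[g] * v(g, k)` product is one gate module's work for the round.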

SLIDE 22

[Animation build of the Slide 21 diagram: the u[1] … u[g] values now sit in the gate modules rather than RAM.]

SLIDE 23

[Animation build of the Slide 21 diagram: the per-gate products u[1]*v(1, k) for k = 0, 1, 2 flow into the adder tree.]

SLIDE 24

Summary of Zebra’s design approach:

§ Extract parallelism
§ Pipelined proving, adder tree, gate proving, etc.
§ Exploit locality: distribute state and control
§ Custom registers (no RAM): “data” wires are few and short
§ Latency-insensitive design: few “control” wires
§ Reduce and reuse

SLIDE 25

[Repeat of the Slide 23 diagram.]

SLIDE 26

Summary of Zebra’s design approach:

§ Extract parallelism
§ Pipelined proving, adder tree, gate proving, etc.
§ Exploit locality: distribute state and control
§ Custom registers (no RAM): “data” wires are few and short
§ Latency-insensitive design: few “control” wires
§ Reuse and recycle
§ Recycle hardware circuitry for different tasks
§ Save energy by adding memoization to P
§ Reuse block designs; optimizations thus have high pay-off

SLIDE 27

Architectural and operational challenges for Zebra

1. Communication between V and P is high bandwidth
§ V and P on a circuit board? Too much energy, circuit area
§ Zebra’s response: use 3D packaging

2. The protocol requires input-independent precomputation
§ Zebra’s response: amortize precomputations over many V-P pairs

3. Trusted storage would be prohibitive
§ Zebra’s response: use untrusted storage, with authenticated encryption
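Point 3 can be illustrated with a stdlib-only encrypt-then-MAC sketch (hedged: a teaching toy, not Zebra's actual scheme, and not production-grade cryptography; a real design should use a vetted AEAD such as AES-GCM or ChaCha20-Poly1305):

```python
# Illustrative only: authenticated encryption lets V keep records in
# untrusted storage and detect any tampering. Keystream and MAC are
# built from hashlib/hmac so the sketch is self-contained; do NOT use
# homemade constructions like this in practice.
import hashlib, hmac, secrets

def keystream(key, nonce, length):
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def seal(enc_key, mac_key, plaintext):
    nonce = secrets.token_bytes(16)
    ct = bytes(p ^ k for p, k in zip(plaintext, keystream(enc_key, nonce, len(plaintext))))
    tag = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    return nonce + ct + tag

def open_(enc_key, mac_key, blob):
    nonce, ct, tag = blob[:16], blob[16:-32], blob[-32:]
    expect = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expect):
        raise ValueError("untrusted storage modified this record")
    return bytes(c ^ k for c, k in zip(ct, keystream(enc_key, nonce, len(ct))))

ek, mk = secrets.token_bytes(32), secrets.token_bytes(32)
blob = seal(ek, mk, b"precomputed queries")
assert open_(ek, mk, blob) == b"precomputed queries"
```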

SLIDE 28

The implementation of Zebra includes:

§ An arithmetic-circuit-to-synthesizable-Verilog compiler for P
§ Composes with existing C-to-arithmetic-circuit compilers
§ Two V implementations: hardware (Verilog) and software (C++)
§ A library to generate V’s precomputations
§ Verilog simulator extensions to model software or hardware V’s interactions with P and with storage

SLIDE 29

This implementation seemed to work great.

Zebra: 10⁴ or 10⁵ proofs per second. Existing implementations: 10 seconds per proof, at least.

But that isn’t a serious evaluation …

SLIDE 30

§ Baseline: direct implementation of F in the same technology as V
§ Metrics: energy; chip size per throughput (in paper)
§ Assessed with circuit synthesis and simulation, published chip designs, and CMOS scaling models
§ Charge for V, P, communication; retrieving and decrypting precomputations; PRNG; operator communicating with V
§ Constraints: trusted fab = 350 nm; untrusted fab = 7 nm; max chip area = 200 mm²; max total power = 150 W

[Diagram: V + P vs. F, with input x, output y, and proof.]

SLIDE 31

[Repeat of the Slide 30 evaluation method; the diagram adds fab vintages: 350 nm is 1997-era, 7 nm is 2017-era [TSMC].]

SLIDE 32

Application #1: number theoretic transform

NTT: a Fourier transform over 𝔽p. Used in signal processing, computer algebra, etc.
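For reference, a naive O(n²) NTT and its inverse over a prime field (hedged: the toy parameters p = 17 and ω = 4 are mine, chosen so that ω is a primitive 4th root of unity mod 17; they are not from the talk, and real deployments use large primes and an FFT-style O(n log n) algorithm):

```python
# Number theoretic transform: the DFT with the complex roots of unity
# replaced by an n-th root of unity in a prime field.

def ntt(a, omega, p):
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, p) for j in range(n)) % p
            for i in range(n)]

def intt(A, omega, p):
    n = len(A)
    inv_n = pow(n, p - 2, p)          # n^{-1} via Fermat's little theorem
    inv_omega = pow(omega, p - 2, p)  # omega^{-1}
    return [x * inv_n % p for x in ntt(A, inv_omega, p)]

# omega = 4 is a primitive 4th root of unity mod 17: 4**2 = 16 != 1, 4**4 = 1.
a = [1, 2, 3, 4]
A = ntt(a, omega=4, p=17)
print(A)                          # [10, 7, 15, 6]
assert intt(A, omega=4, p=17) == a  # exact round-trip: no floating point
```

Unlike the complex FFT, everything here is exact field arithmetic, which is one reason the NTT maps so cleanly onto an arithmetic circuit over 𝔽p.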

SLIDE 33

Application #1: number theoretic transform

[Plot: ratio of baseline energy to Zebra energy (higher is better) vs. log₂(NTT size) from 6 to 13; the y-axis spans 0.1 to 3.]

SLIDE 34

Application #2: Curve25519 point multiplication

Curve25519: a commonly used elliptic curve
Point multiplication: a primitive used for ECDH
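For reference, the primitive can be sketched as an RFC 7748-style Montgomery ladder (hedged: an illustrative software sketch, not Zebra's circuit; it is not constant-time, so it is not side-channel safe):

```python
# X25519-style point multiplication on Curve25519, x-coordinate only.
# Follows the RFC 7748 ladder; illustrative, NOT constant-time.

P = 2**255 - 19
A24 = (486662 - 2) // 4  # (A - 2)/4 for Curve25519's A = 486662

def ladder(k, u):
    """u-coordinate of k * (point with u-coordinate u)."""
    x1, x2, z2, x3, z3, swap = u, 1, 0, u, 1, 0
    for t in reversed(range(255)):
        k_t = (k >> t) & 1
        swap ^= k_t
        if swap:
            x2, x3, z2, z3 = x3, x2, z3, z2
        swap = k_t
        a, b = (x2 + z2) % P, (x2 - z2) % P
        c, d = (x3 + z3) % P, (x3 - z3) % P
        e, f = (a * a) % P, (b * b) % P          # AA, BB
        da, cb = (d * a) % P, (c * b) % P
        x3 = (da + cb) % P; x3 = (x3 * x3) % P
        z3 = (da - cb) % P; z3 = (x1 * z3 * z3) % P
        x2 = (e * f) % P
        z2 = ((e - f) * (e + A24 * (e - f))) % P
    if swap:
        x2, z2 = x3, z3
    return (x2 * pow(z2, P - 2, P)) % P          # divide by z2

def clamp(k):
    """RFC 7748 scalar clamping."""
    return (k & ~7 & ~(128 << 248)) | (64 << 248)

# ECDH sanity check: both sides derive the same shared secret from base u = 9.
a, b = clamp(31415926535), clamp(27182818284)
assert ladder(a, ladder(b, 9)) == ladder(b, ladder(a, 9))
```

Each ladder step is a fixed pattern of field multiplications and additions, which is what makes this primitive a natural fit for a layered arithmetic circuit.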

SLIDE 35

Application #2: Curve25519 point multiplication

[Plot: ratio of baseline energy to Zebra energy (higher is better) vs. the number of parallel Curve25519 point multiplications (84, 170, 340, 682, 1147); the y-axis spans 0.1 to 3.]

SLIDE 36

(1) Zebra: a system that saves costs (2) … sometimes

SLIDE 37

Summary of Zebra’s applicability:

1. Computation F must have a layered, shallow, deterministic AC (a restriction of the interactive proof (IP) setup)
2. Need a wide gap between the (fast) fab for P and the (trusted) fab for V
3. Computation F must be relatively large for V to save work
4. Computation F must be efficient as an arithmetic circuit (AC)
5. Must amortize precomputations over many chips

SLIDE 38

Why did we build Zebra atop IPs instead of arguments?

Design principle | interactive proofs [GKR08, CMT12, VSBW13] | arguments (interactive, SNARK, CS proof, etc.) [GGPR12, PGHR13, SBVBPW13, BCTV14]
Extract parallelism | ✓ | ✓
Exploit locality | ✓ | ✗
Reduce and reuse | ✓ | ✗

Because argument protocols seem unfriendly to hardware:
§ In arguments, P computes over the entire AC at once ⟶ needs RAM
§ P does crypto for every gate in the AC ⟶ special crypto circuits needed

We hope these issues are surmountable!

SLIDE 39

Reality check on the restrictions:

1. Computation F must have a layered, shallow, deterministic AC (applies to interactive proofs (IPs) but not arguments)
2. Need a wide gap between the (fast) fab for P and the (trusted) fab for V
3. Computation F must be relatively large for V to save work
4. Computation F must be efficient as an arithmetic circuit (AC)
5. Must amortize precomputations over many chips

(Restrictions 2-5 are common to all implementations of probabilistic proofs.)

SLIDE 40

A limitation that is endemic to the area: Need wide gap between (fast) fab for P and (trusted) fab for V

[Plot: worker’s (prover’s) cost normalized to native C, on a log scale from 10¹ to 10¹³, for matrix multiplication (m=128) and PAM clustering (m=20, d=128), across Pepper, Ginger, Pinocchio, Zaatar, CMT, Allspice, TinyRAM, and Thaler.]

SLIDE 41

Limitations that are endemic to the area:
§ Computation F must be relatively large for V to save work
§ Computation F must be efficient as an arithmetic circuit

Example: libsnark’s [BCTV14] optimized implementation of GGPR/Pinocchio [GGPR12, PGHR13]. Great work, but:
§ Verification time: 6 ms + (|x| + |y|) ⋅ 3 µs on a 2.7 GHz CPU
§ That time is >16 million CPU ops, which is a break-even point
§ libsnark handles ≤ 16 million gates (with 32 GB of RAM), so to break even, F also needs on average CPU_ops/AC_gate > 1

Example of AC efficiency: addition over 𝔽p instead of over fixed-width integers
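The break-even arithmetic in the bullets above can be checked directly (hedged: this treats one CPU op as one cycle and ignores the small per-I/O-element term, as the slide implicitly does):

```python
# Sanity-check of the slide's break-even numbers: 6 ms of verification on
# a 2.7 GHz CPU costs about 16 million cycles, and libsnark tops out near
# 16 million gates in 32 GB of RAM.
cpu_hz = 2.7e9
verify_ops = 6e-3 * cpu_hz          # fixed verification cost in cycles
max_gates = 16e6                    # libsnark's practical circuit-size limit

print(round(verify_ops / 1e6, 1))   # 16.2 -- matches ">16 million CPU ops"
print(verify_ops / max_gates)       # 1.0125 -- need > ~1 CPU op per AC gate
```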

SLIDE 42

Built probabilistic proof protocols amortize precomputations.*

*Exception: CMT [CMT12] applied to highly regular arithmetic circuits

System | amortizes precomputation over | size of advice
Zebra | multiple V-P pairs | short
Allspice [VSBW13] | a batch of instances of a given F | short
Bootstrapped SNARKs [BCTV14a, CTV15] | all computations | long
BCTV [BCTV14b] | all computations of the same length | long
Pinocchio [PGHR13] | all future uses of a given F | long
Pepper [SMBW12], Ginger [SVPBBW12], Zaatar [SBVBPW13] | a batch of instances of a given F | long

SLIDE 43

Lessons (re)learned:

§ Do careful feasibility studies first!
§ Hardware is a powerful tool for acceleration …
§ … but only if data flows are amenable
§ Theory of computation versus application of physics
§ General-purpose verifiable computation and succinct arguments are still far from practical

SLIDE 44

Summary and take-aways

§ Verifiable ASICs: a new approach to building trustworthy hardware under a strong threat model
§ First hardware design for a probabilistic proof protocol; first work to capture the cost of prover and verifier together
§ Improves performance compared to the trusted baseline
§ Improvement compared to the baseline is modest
§ Applicability is limited
§ Amortization, arithmetic circuits, “big enough” computations, a large gap between trusted and untrusted technology, etc.
§ Zebra is plausibly deployable (!), but work remains for this area

http://www.pepper-project.org