
Verifiable ASICs: trustworthy hardware with untrusted components. Riad S. Wahby, Max Howald, Siddharth Garg, abhi shelat, and Michael Walfish. Stanford University, New York University, The Cooper Union.


  1. Zebra builds on the interactive proofs (IPs) of GKR [GKR08, CMT12, VSBW13] The computation F must be expressed as a layered arithmetic circuit (AC).

  2. Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13] 1. V sends inputs x

  3. Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13] 1. V sends inputs x 2. P evaluates

  6. Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13] 1. V sends inputs x 2. P evaluates, returns output y

  7. Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13] 1. V sends inputs x 2. P evaluates, returns output y 3. V constructs polynomial relating y to last layer’s input wires

  8. Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13] 1. V sends inputs x 2. P evaluates, returns output y 3. V constructs polynomial relating y to last layer’s input wires 4. V engages P in a sum-check [LFKN90]

  9. Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13] 1. V sends inputs x 2. P evaluates, returns output y 3. V constructs polynomial relating y to last layer’s input wires 4. V engages P in a sum-check [LFKN90], gets claim about second-to-last layer

  10. Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13] 1. V sends inputs x 2. P evaluates, returns output y 3. V constructs polynomial relating y to last layer’s input wires 4. V engages P in a sum-check [LFKN90], gets claim about second-to-last layer 5. V iterates with more sum-checks

  13. Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13] 1. V sends inputs x 2. P evaluates, returns output y 3. V constructs polynomial relating y to last layer’s input wires 4. V engages P in a sum-check [LFKN90], gets claim about second-to-last layer 5. V iterates, gets claim about inputs, which it can check

  14. Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13] Soundness error ∝ p⁻¹ (p is the field size)

  15. Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13] Soundness error ∝ p⁻¹ Cost to execute F directly: O(depth · width) V’s sequential running time: O(depth · log width + |x| + |y|) (assuming precomputed queries)

  16. Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13] Soundness error ∝ p⁻¹ Cost to execute F directly: O(depth · width) V’s sequential running time: O(depth · log width + |x| + |y|) (assuming precomputed queries) P’s sequential running time: O(depth · width · log width)
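The sum-check interaction described above can be sketched end to end in software. The following is a minimal, illustrative Python model of the [LFKN90] sum-check that GKR-style protocols invoke per layer; the example polynomial, the field modulus, and all function names here are assumptions chosen for illustration, not Zebra's hardware implementation.

```python
import random

P = 2**61 - 1  # illustrative prime field modulus (an assumption, not Zebra's field)

def sumcheck(f, n):
    """Sum-check [LFKN90] for the claim T = sum of f over {0,1}^n.

    f: a polynomial (as a Python function of n field elements) with
    degree <= 2 in each variable.  Each round, P sends the univariate
    restriction's evaluations at {0, 1, 2}; V checks consistency,
    replies with a random coin, and finally evaluates f once itself.
    """
    inv2 = pow(2, -1, P)
    # P's initial claim: the sum over the boolean hypercube
    claim = sum(f([(b >> i) & 1 for i in range(n)]) for b in range(2 ** n)) % P
    r, expected = [], claim
    for j in range(n):
        # P: h_j(k) = sum of f with first j vars fixed to coins, var j to k
        def h(k):
            return sum(f(r + [k] + [(b >> i) & 1 for i in range(n - j - 1)])
                       for b in range(2 ** (n - j - 1))) % P
        H = [h(0), h(1), h(2)]
        # V: the two boolean evaluations must add up to the running claim
        assert (H[0] + H[1]) % P == expected
        coin = random.randrange(P)
        # V: interpolate the degree-2 polynomial h_j at the random coin
        expected = (H[0] * (coin - 1) * (coin - 2) * inv2
                    - H[1] * coin * (coin - 2)
                    + H[2] * coin * (coin - 1) * inv2) % P
        r.append(coin)
    # V's final check: a single evaluation of f at the random point
    assert f(r) % P == expected
    return claim
```

Because the restriction has degree at most 2 in each variable, the three evaluations H[0], H[1], H[2] are exactly enough for V to interpolate P's message each round, which is why the per-layer computations later in the talk evaluate H[k] for k ∈ {0, 1, 2}.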

  17. Extracting parallelism in Zebra P executing AC: layers are sequential, but all gates at a layer can be executed in parallel

  18. Extracting parallelism in Zebra P executing AC: layers are sequential, but all gates at a layer can be executed in parallel Proving step: Can V and P interact about all of F’s layers at once?

  19. Extracting parallelism in Zebra P executing AC: layers are sequential, but all gates at a layer can be executed in parallel Proving step: Can V and P interact about all of F’s layers at once? No. V must ask questions in order or soundness is lost.

  20. Extracting parallelism in Zebra P executing AC: layers are sequential, but all gates at a layer can be executed in parallel Proving step: Can V and P interact about all of F’s layers at once? No. V must ask questions in order or soundness is lost. But: there is still parallelism to be extracted. . .

  21. Extracting parallelism in Zebra’s P V questions P about F(x1)’s output layer.

  22. Extracting parallelism in Zebra’s P V questions P about F(x1)’s output layer. Simultaneously, P returns F(x2).

  23. Extracting parallelism in Zebra’s P V questions P about F(x1)’s next layer.

  24. Extracting parallelism in Zebra’s P V questions P about F(x1)’s next layer, and F(x2)’s output layer.

  25. Extracting parallelism in Zebra’s P V questions P about F(x1)’s next layer, and F(x2)’s output layer. Meanwhile, P returns F(x3).

  26. Extracting parallelism in Zebra’s P This process continues with F(x4), F(x5), . . .

  28. Extracting parallelism in Zebra’s P This process continues until V and P interact about every layer simultaneously, but for different computations. V and P can complete one proof in each time step.

  29. Extracting parallelism in Zebra’s P with pipelining [diagram: input x enters P; one sub-prover per layer, from layer d − 1 down to layer 0, exchanges queries and responses with V; output y emerges] This approach is just a standard hardware technique, pipelining; it is possible because the protocol is naturally staged. There are other opportunities to leverage the protocol’s structure.
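The pipelined schedule can be modeled in a few lines. This is a hypothetical sketch of the staging only (the name `pipeline_schedule` and its arguments are invented for illustration); Zebra realizes the schedule with one hardware sub-prover per layer.

```python
def pipeline_schedule(depth, n_inputs):
    """Which instance each sub-prover works on at each time step.

    Stage l is the sub-prover for the l-th layer questioned (output
    layer first).  At time t it handles instance t - l, so instance i
    enters the pipeline at time i and its proof completes at time
    i + depth - 1.  Returns a list of (stage, instance) pairs per step.
    """
    total_steps = n_inputs + depth - 1
    return [[(l, t - l) for l in range(depth) if 0 <= t - l < n_inputs]
            for t in range(total_steps)]
```

With depth 4 and 8 inputs, steps 3 through 7 have all four stages busy at once; in that steady state one proof completes per time step, matching the slide's claim.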

  31. Per-layer computations For each sum-check round, P sums over each gate in a layer

  32. Per-layer computations H[k] = Σ_{g ∈ layer} δ(g, k) For each sum-check round, P sums over each gate in a layer, evaluating H[k], k ∈ {0, 1, 2}

  33. Per-layer computations H[k] = Σ_{g ∈ layer} δ(g, k) For each sum-check round, P sums over each gate in a layer, evaluating H[k], k ∈ {0, 1, 2} In software:
// compute H[0], H[1], H[2]
for k ∈ {0, 1, 2}:
    H[k] ← 0
    for g ∈ layer:
        H[k] ← H[k] + δ(g, k)   // δ uses state[g]
// update lookup table with V’s random coin
for g ∈ layer:
    state[g] ← δ(g, r_j)
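The software loops on this slide translate directly into a runnable sketch. Here `delta`, the per-gate `state` table, and the field modulus are placeholders standing in for Zebra's actual per-gate polynomial evaluations, which the slides do not spell out.

```python
P = 2**61 - 1  # illustrative field modulus (an assumption)

def sumcheck_round(layer_gates, state, r_j, delta):
    """One sum-check round for a layer, following the slide's loops.

    delta(g, k, state) stands in for P's per-gate polynomial term,
    which reads the lookup-table entry state[g]; r_j is V's random
    coin for round j.  Returns H[0..2] and the updated state table.
    """
    # compute H[0], H[1], H[2]
    H = [0, 0, 0]
    for k in (0, 1, 2):
        for g in layer_gates:
            H[k] = (H[k] + delta(g, k, state)) % P  # delta uses state[g]
    # update lookup table with V's random coin
    for g in layer_gates:
        # delta(g, .) reads only state[g], so per-gate in-place update is safe
        state[g] = delta(g, r_j, state)
    return H, state
```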

  34. Per-layer computations H[k] = Σ_{g ∈ layer} δ(g, k) In hardware: one gate prover per gate g evaluates δ(g, 0), δ(g, 1), δ(g, 2), and δ(g, r_j) in parallel; an adder tree sums the gate provers’ outputs into H[k]; and each gate prover holds state[g] in a local register rather than RAM, updating it with δ(g, r_j) once V’s random coin arrives.
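The adder tree's log-depth reduction can be modeled in a few lines; this is an illustrative software analogue of the reduction structure, not the hardware netlist.

```python
def adder_tree(values, p):
    """Log-depth pairwise reduction mirroring the hardware adder tree.

    Each level sums adjacent pairs mod p, so n inputs take
    ceil(log2(n)) levels instead of n - 1 sequential additions.
    """
    level = [v % p for v in values]
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd-sized levels with a zero input
        level = [(level[i] + level[i + 1]) % p
                 for i in range(0, len(level), 2)]
    return level[0]
```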


  44. Zebra’s design approach ✓ Extract parallelism e.g., pipelined proving e.g., parallel evaluation of δ by gate provers ✓ Exploit locality: distribute data and control e.g., no RAM: data is kept close to places it is needed e.g., latency-insensitive design: localized control ✓ Reduce, reuse, recycle e.g., computation: save energy by adding memoization to P e.g., hardware: save chip area by reusing the same circuits


  50. Architectural challenges Interaction between V and P requires a lot of bandwidth ✗ V and P on circuit board? Too much energy, circuit area ✓ Zebra uses 3D integration Protocol requires input-independent precomputation [VSBW13] ✓ Zebra amortizes precomputations over many V-P pairs Precomputations need secrecy, integrity ✗ Give V trusted storage? Cost would be prohibitive ✓ Zebra uses untrusted storage + authenticated encryption [diagram: V fetches E_k(pre_i) from untrusted storage, sends input x to P; P returns output y and a proof that y = F(x)]
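Sealing precomputations for untrusted storage might look like the following toy sketch. It uses an encrypt-then-MAC construction over a SHA-256 counter keystream purely for illustration; this is NOT a vetted cipher, and the talk does not specify which authenticated-encryption scheme Zebra uses.

```python
import hashlib
import hmac
import secrets

def _keystream(key, nonce, n):
    """Toy SHA-256 counter-mode keystream.  Illustration ONLY: not a
    vetted cipher, and not necessarily Zebra's construction."""
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def seal(enc_key, mac_key, precomp):
    """Encrypt-then-MAC a precomputation blob for untrusted storage."""
    nonce = secrets.token_bytes(16)
    ct = bytes(a ^ b for a, b in
               zip(precomp, _keystream(enc_key, nonce, len(precomp))))
    tag = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    return nonce, ct, tag

def unseal(enc_key, mac_key, nonce, ct, tag):
    """Verify integrity first, then decrypt; V rejects on any tampering."""
    expect = hmac.new(mac_key, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expect):
        raise ValueError("precomputation failed integrity check")
    return bytes(a ^ b for a, b in zip(ct, _keystream(enc_key, nonce, len(ct))))
```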

  51. Implementation Zebra’s implementation includes • a compiler that produces synthesizable Verilog for P • two V implementations: hardware (Verilog) and software (C++) • a library to generate V’s precomputations • Verilog simulator extensions to model a software or hardware V’s interactions with P


  53. . . . and it seemed to work really well! Zebra can produce 10k–100k proofs per second, while existing systems take tens of seconds per proof! But that’s not a serious evaluation. . .


  57. Evaluation method [diagram: native implementation of F vs. V paired with P producing y and a proof that y = F(x)] Baseline: direct implementation of F in the same technology as V Metrics: energy and chip size per throughput (discussed in paper) Measurements: based on circuit synthesis and simulation, published chip designs, and CMOS scaling models Charge for V, P, communication; retrieving and decrypting precomputations; PRNG; operator communicating with V Constraints: trusted fab = 350 nm (1997, Pentium II era); untrusted fab = 7 nm (≈2017 [TSMC]), a ≈20-year gap between trusted and untrusted fab; 200 mm² max chip area; 150 W max total power

  58. Application #1: number theoretic transform NTT: a Fourier transform over F_p. Widely used, e.g., in computer algebra.
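A naive NTT over F_p can be written down directly from the definition. This O(n²) sketch is for illustration only: the small prime and root of unity are arbitrary choices, and Zebra's arithmetic circuit would realize the usual O(n log n) butterfly structure instead.

```python
def ntt(a, omega, p):
    """Naive O(n^2) number-theoretic transform over F_p.

    omega must be a primitive n-th root of unity mod p (n = len(a)).
    """
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, p) for j in range(n)) % p
            for i in range(n)]

def intt(A, omega, p):
    """Inverse transform: forward NTT with omega^-1, scaled by n^-1."""
    n = len(A)
    inv_n = pow(n, -1, p)
    return [(x * inv_n) % p for x in ntt(A, pow(omega, -1, p), p)]
```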

  59. Application #1: number theoretic transform [plot: ratio of baseline energy to Zebra energy, baseline vs. Zebra (higher is better); y-axis 0.1–3 (log scale), x-axis log₂(NTT size) from 6 to 13]

  60. Application #2: Curve25519 point multiplication Curve25519: a commonly used elliptic curve. Point multiplication: a primitive used, e.g., for ECDH.

  61. Application #2: Curve25519 point multiplication [plot: ratio of baseline energy to Zebra energy, baseline vs. Zebra (higher is better); y-axis 0.1–3 (log scale), x-axis 84–1147 parallel Curve25519 point multiplications]

  62. A qualified success Zebra: a hardware design that saves costs . . . sometimes.

  63. Summary of Zebra’s applicability 1. Computation F must have a layered, shallow, deterministic AC 2. Must have a wide gap between cutting-edge fab (for P ) and trusted fab (for V ) 3. Amortizes precomputations over many instances 4. Computation F must be very large for V to save work 5. Computation F must be efficient as an arithmetic circuit

  64. Summary of Zebra’s applicability Applies to IPs, but not arguments 1. Computation F must have a layered, shallow, deterministic AC 2. Must have a wide gap between cutting-edge fab (for P ) and trusted fab (for V ) 3. Amortizes precomputations over many instances 4. Computation F must be very large for V to save work 5. Computation F must be efficient as an arithmetic circuit

  65. Arguments versus IPs, redux
Design principle          IPs [GKR08, CMT12, VSBW13]   Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14]
Extract parallelism       ✓                            ✓
Exploit locality          ✓                            ✗
Reduce, reuse, recycle    ✓                            ✗
Argument protocols seem unfriendly to hardware: P computes over the entire AC at once ⇒ needs RAM; P does crypto for every gate in the AC ⇒ special crypto circuits . . . but we hope these issues are surmountable!

  69. Summary of Zebra’s applicability 1. Computation F must have a layered, shallow, deterministic AC 2. Must have a wide gap between cutting-edge fab (for P ) and trusted fab (for V ) 3. Amortizes precomputations over many instances 4. Computation F must be very large for V to save work 5. Computation F must be efficient as an arithmetic circuit Common to essentially all built proof systems
