Verifiable ASICs: trustworthy hardware with untrusted components


SLIDE 1

Verifiable ASICs: trustworthy hardware with untrusted components

Riad S. Wahby◦⋆, Max Howald†⋆, Siddharth Garg⋆, abhi shelat‡, and Michael Walfish⋆

◦Stanford University ⋆New York University †The Cooper Union ‡The University of Virginia

June 10th, 2016

SLIDE 2

Setting: ASICs with mutually distrusting designer and manufacturer

The Principal (a government or chip designer) sends a chip design to the Manufacturer (a "foundry" or "fab").
SLIDE 3

Setting: ASICs with mutually distrusting designer and manufacturer

The Principal (a government or chip designer) sends a chip design to the Manufacturer (a "foundry" or "fab").

Here we are thinking about ASICs, not CPUs:

[Diagram: a CPU, with RAM, cache, register file, and ALU, versus an ASIC: inputs in[0] . . . in[n] feeding fixed logic and D-Q flip-flops]

SLIDE 4

Setting: ASICs with mutually distrusting designer, manufacturer

Firewall

e.g., a network firewall appliance, with a custom chip for packet processing

SLIDE 5

Untrusted manufacturers can craft hardware Trojans

Firewall

What if our packet processing chip has a back door?

SLIDE 6

Untrusted manufacturers can craft hardware Trojans

Firewall

What if our packet processing chip has a back door?

Threat: incorrect execution of the packet filter

(Other concerns, e.g., secret state, are important but orthogonal)


SLIDE 8

Untrusted manufacturers can craft hardware Trojans

Firewall

US DoD controls supply chain with trusted foundries.

SLIDE 9

Trusted fabs are the only way to get strong guarantees

For example, stealthy trojans can thwart post-fab detection [A2: Analog Malicious Hardware, Yang et al., IEEE S&P 2016; Stealthy Dopant-Level Trojans, Becker et al., CHES 2013]

SLIDE 10

Trusted fabs are the only way to get strong guarantees

For example, stealthy trojans can thwart post-fab detection [A2: Analog Malicious Hardware, Yang et al., IEEE S&P 2016; Stealthy Dopant-Level Trojans, Becker et al., CHES 2013]

But trusted fabrication is not a panacea:

✗ Only 5 countries have cutting-edge fabs on-shore
✗ Building a new fab takes $$$$$$, years of R&D

SLIDE 11

Trusted fabs are the only way to get strong guarantees

For example, stealthy trojans can thwart post-fab detection [A2: Analog Malicious Hardware, Yang et al., IEEE S&P 2016; Stealthy Dopant-Level Trojans, Becker et al., CHES 2013]

But trusted fabrication is not a panacea:

✗ Only 5 countries have cutting-edge fabs on-shore
✗ Building a new fab takes $$$$$$, years of R&D
✗ Semiconductor scaling: chip area and energy go with the square and cube of transistor length ("critical dimension")
✗ So using an old fab means an enormous performance hit

e.g., India's best on-shore fab is 10⁸× behind state of the art

SLIDE 12

Trusted fabs are the only way to get strong guarantees

For example, stealthy trojans can thwart post-fab detection [A2: Analog Malicious Hardware, Yang et al., IEEE S&P 2016; Stealthy Dopant-Level Trojans, Becker et al., CHES 2013]

But trusted fabrication is not a panacea:

✗ Only 5 countries have cutting-edge fabs on-shore
✗ Building a new fab takes $$$$$$, years of R&D
✗ Semiconductor scaling: chip area and energy go with the square and cube of transistor length ("critical dimension")
✗ So using an old fab means an enormous performance hit

e.g., India's best on-shore fab is 10⁸× behind state of the art

Can we get trust more cheaply?
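To put the scaling penalty in numbers (applying the quadratic/cubic rule above to the 350 nm and 7 nm nodes that appear later in this talk): the ratio of critical dimensions is s = 350/7 = 50, so a trusted 350 nm part pays roughly 50² = 2,500× in chip area and 50³ = 125,000× in energy relative to a 7 nm part.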

SLIDE 13

Verifiable ASICs

Principal

F → designs for P, V

SLIDE 14

Verifiable ASICs

Untrusted fab (fast) builds P; trusted fab (slow) builds V

Principal

F → designs for P, V

SLIDE 15

Verifiable ASICs

Untrusted fab (fast) builds P; trusted fab (slow) builds V

Principal

F → designs for P, V

[Diagram: an Integrator combines V and P]

SLIDE 16

Verifiable ASICs

Untrusted fab (fast) builds P; trusted fab (slow) builds V

Principal

F → designs for P, V

[Diagram: an Integrator combines V and P; the operator supplies input and receives output]
SLIDE 17

Verifiable ASICs

Untrusted fab (fast) builds P; trusted fab (slow) builds V

Principal

F → designs for P, V

[Diagram: an Integrator combines V and P; V sends x to P, and P returns y and a proof that y = F(x); the operator supplies input and receives output]
SLIDE 18

Can we build Verifiable ASICs?

[Diagram: V sends x to P; P returns y and a proof that y = F(x); compared against F built directly in a trusted fab]

Makes sense if V + P are cheaper than trusted F

SLIDE 19

Can we build Verifiable ASICs?

[Diagram: V sends x to P; P returns y and a proof that y = F(x); compared against F built directly in a trusted fab]

Makes sense if V + P are cheaper than trusted F. Reasons for hope:

  • running time of V < running time of F (asymptotically)

Babai85 GMR85 BCC86 BFLS91 FGLSS91 Kilian92 ALMSS92 AS92 Micali94 BG02 GOS06 IKO07 GKR08 KR09 GGP10 Groth10 GLR11 Lipmaa11 BCCT12 GGPR13 BCCT13 KRR14 . . .

SLIDE 20

Can we build Verifiable ASICs?

[Diagram: V sends x to P; P returns y and a proof that y = F(x); compared against F built directly in a trusted fab]

Makes sense if V + P are cheaper than trusted F. Reasons for hope:

  • running time of V < running time of F (asymptotically)
  • Implementations exist

Babai85 GMR85 BCC86 BFLS91 FGLSS91 Kilian92 ALMSS92 AS92 Micali94 BG02 GOS06 IKO07 GKR08 KR09 GGP10 Groth10 GLR11 Lipmaa11 BCCT12 GGPR13 BCCT13 KRR14 . . .
SBW11 CMT12 SMBW12 TRMP12 SVPBBW12 SBVBPW13 VSBW13 PGHR13 Thaler13 BCGTV13 BFRSBW13 BFR13 DFKP13 BCTV14a BCTV14b BCGGMTV14 FL14 KPPSST14 FTP14 WSRHBW15 BBFR15 CFHKNPZ15 CTV15 KZMQCPPsS15

SLIDE 21

Can we build Verifiable ASICs?

[Diagram: V sends x to P; P returns y and a proof that y = F(x); compared against F built directly in a trusted fab]

Makes sense if V + P are cheaper than trusted F. Reasons for hope:

  • running time of V < running time of F (asymptotically)
  • Implementations exist
  • P overheads are massive, but using an advanced fab might offset these costs

Babai85 GMR85 BCC86 BFLS91 FGLSS91 Kilian92 ALMSS92 AS92 Micali94 BG02 GOS06 IKO07 GKR08 KR09 GGP10 Groth10 GLR11 Lipmaa11 BCCT12 GGPR13 BCCT13 KRR14 . . .
SBW11 CMT12 SMBW12 TRMP12 SVPBBW12 SBVBPW13 VSBW13 PGHR13 Thaler13 BCGTV13 BFRSBW13 BFR13 DFKP13 BCTV14a BCTV14b BCGGMTV14 FL14 KPPSST14 FTP14 WSRHBW15 BBFR15 CFHKNPZ15 CTV15 KZMQCPPsS15

[Chart: worker's cost normalized to native C, for matrix multiplication (m = 128), on a log scale from 10¹ to 10¹³, for Pepper, Ginger, Pinocchio, Zaatar, CMT, Allspice, TinyRAM, Thaler, and native C; overheads are roughly 10⁸×]

SLIDE 22

Can we build Verifiable ASICs?

[Diagram: V sends x to P; P returns y and a proof that y = F(x); compared against F built directly in a trusted fab]

Makes sense if V + P are cheaper than trusted F. Reasons for caution:

  • Theory is silent about feasibility
  • Onus is heavier than in prior work
  • Hardware issues: energy, chip area
  • Need physically realizable circuit design
  • Need V to save costs at plausible computation sizes

Babai85 GMR85 BCC86 BFLS91 FGLSS91 Kilian92 ALMSS92 AS92 Micali94 BG02 GOS06 IKO07 GKR08 KR09 GGP10 Groth10 GLR11 Lipmaa11 BCCT12 GGPR13 BCCT13 KRR14 . . .
SBW11 CMT12 SMBW12 TRMP12 SVPBBW12 SBVBPW13 VSBW13 PGHR13 Thaler13 BCGTV13 BFRSBW13 BFR13 DFKP13 BCTV14a BCTV14b BCGGMTV14 FL14 KPPSST14 FTP14 WSRHBW15 BBFR15 CFHKNPZ15 CTV15 KZMQCPPsS15

SLIDE 23

Zebra: a hardware design that saves costs

SLIDE 24

A qualified success: Zebra, a hardware design that saves costs. . . sometimes.

SLIDE 25

Probabilistic proof protocols, briefly

[Diagram: V sends x to P; P returns y and a proof that y = F(x)]

F must be expressed as an arithmetic circuit (AC)
AC satisfiable ⟺ F was executed correctly
P convinces V that the AC is satisfiable
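To fix ideas, here is a minimal sketch (in Python, not part of Zebra) of a layered arithmetic circuit over F_p and its evaluation; the modulus and the example circuit are illustrative choices, not the paper's parameters.

# Minimal sketch (not Zebra's code): a layered arithmetic circuit over F_p.
# Each gate is ('add' | 'mul', left, right), indexing the previous layer.
P = 2**61 - 1   # illustrative prime modulus

def eval_layered_ac(layers, inputs):
    """Evaluate layer by layer; returns every layer's wire values."""
    wires = [list(inputs)]
    for layer in layers:
        prev = wires[-1]
        wires.append([(prev[l] + prev[r]) % P if op == 'add'
                      else (prev[l] * prev[r]) % P
                      for op, l, r in layer])
    return wires

# F(x0, x1, x2, x3) = x0*x1 + x2*x3 as a two-layer AC:
layers = [
    [('mul', 0, 1), ('mul', 2, 3)],   # multiplication layer
    [('add', 0, 1)],                  # output layer
]
print(eval_layered_ac(layers, [3, 4, 5, 6])[-1])   # [42]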

SLIDE 26

Probabilistic proof protocols, briefly

[Diagram: V sends x to P; P returns y and a proof that y = F(x)]

Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14], e.g., Zaatar, Pinocchio, libsnark
IPs [GKR08, CMT12, VSBW13], e.g., Muggles, CMT, Allspice

SLIDE 27

Probabilistic proof protocols, briefly

[Diagram: V sends x to P; P returns y and a proof that y = F(x)]

Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14], e.g., Zaatar, Pinocchio, libsnark
IPs [GKR08, CMT12, VSBW13], e.g., Muggles, CMT, Allspice

What about other schemes? e.g., FHE [GGP10], MIP+FHE [BC12], MIP [BTWV14], PCIP [RRR16], IOP [BCS16], PIR [BHK16], . . .

SLIDE 28

Probabilistic proof protocols, briefly

[Diagram: V sends x to P; P returns y and a proof that y = F(x)]

Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14], e.g., Zaatar, Pinocchio, libsnark
IPs [GKR08, CMT12, VSBW13], e.g., Muggles, CMT, Allspice

What about other schemes? e.g., FHE [GGP10], MIP+FHE [BC12], MIP [BTWV14], PCIP [RRR16], IOP [BCS16], PIR [BHK16], . . . These all seem a bit further from practicality.

SLIDE 29

Probabilistic proof protocols, briefly

[Diagram: V sends x to P; P returns y and a proof that y = F(x)]

Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14], e.g., Zaatar, Pinocchio, libsnark
  + nondeterministic ACs, arbitrary connectivity
  + few rounds (≤ 3)

IPs [GKR08, CMT12, VSBW13], e.g., Muggles, CMT, Allspice
  – deterministic ACs; layered, low depth
  – many rounds

SLIDE 30

Probabilistic proof protocols, briefly

[Diagram: V sends x to P; P returns y and a proof that y = F(x)]

Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14], e.g., Zaatar, Pinocchio, libsnark
  + nondeterministic ACs, arbitrary connectivity
  + few rounds (≤ 3)
  Unsuited to hardware implementation

IPs [GKR08, CMT12, VSBW13], e.g., Muggles, CMT, Allspice
  – deterministic ACs; layered, low depth
  – many rounds

SLIDE 31

Probabilistic proof protocols, briefly

[Diagram: V sends x to P; P returns y and a proof that y = F(x)]

Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14], e.g., Zaatar, Pinocchio, libsnark
  + nondeterministic ACs, arbitrary connectivity
  + few rounds (≤ 3)
  ✗ Unsuited to hardware implementation

IPs [GKR08, CMT12, VSBW13], e.g., Muggles, CMT, Allspice
  – deterministic ACs; layered, low depth
  – many rounds
  ✓ Suited to hardware implementation

SLIDE 32

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

F must be expressed as a layered arithmetic circuit.

SLIDE 33

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

  1. V sends inputs

[Diagram: V sends x to P]

SLIDE 34

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

  1. V sends inputs
  2. P evaluates

[Diagram: V sends x to P; P is thinking. . .]


SLIDE 37

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

  1. V sends inputs
  2. P evaluates, returns output y

[Diagram: V sends x to P; P returns y]

SLIDE 38

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

  1. V sends inputs
  2. P evaluates, returns output y
  3. V constructs polynomial relating y to last layer's input wires

[Diagram: V holds x and y; V is thinking. . .]

SLIDE 39

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

  1. V sends inputs
  2. P evaluates, returns output y
  3. V constructs polynomial relating y to last layer's input wires
  4. V engages P in a sum-check

[Diagram: V and P run a sum-check [LFKN90]]

SLIDE 40

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

  1. V sends inputs
  2. P evaluates, returns output y
  3. V constructs polynomial relating y to last layer's input wires
  4. V engages P in a sum-check, gets claim about second-last layer

[Diagram: V and P run a sum-check [LFKN90]]

SLIDE 41

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

  1. V sends inputs
  2. P evaluates, returns output y
  3. V constructs polynomial relating y to last layer's input wires
  4. V engages P in a sum-check, gets claim about second-last layer
  5. V iterates

[Diagram: a sum-check [LFKN90], then more sum-checks for earlier layers]


SLIDE 44

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

  1. V sends inputs
  2. P evaluates, returns output y
  3. V constructs polynomial relating y to last layer's input wires
  4. V engages P in a sum-check, gets claim about second-last layer
  5. V iterates, gets claim about inputs, which it can check

[Diagram: a sum-check [LFKN90], then more sum-checks for earlier layers]

SLIDE 45

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

Soundness error ∝ p⁻¹ (where p is the field size)

[Diagram: a sum-check [LFKN90], then more sum-checks for earlier layers]

SLIDE 46

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

Soundness error ∝ p⁻¹ (where p is the field size)
Cost to execute F directly: O(depth · width)
V's sequential running time: O(depth · log width + |x| + |y|) (assuming precomputed queries)

[Diagram: a sum-check [LFKN90], then more sum-checks for earlier layers]

SLIDE 47

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

Soundness error ∝ p⁻¹ (where p is the field size)
Cost to execute F directly: O(depth · width)
V's sequential running time: O(depth · log width + |x| + |y|) (assuming precomputed queries)
P's sequential running time: O(depth · width · log width)

[Diagram: a sum-check [LFKN90], then more sum-checks for earlier layers]
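To make the sum-check interaction concrete, here is a minimal, self-contained sketch in Python (illustrative, not Zebra's code): P proves a claim about Σ over b ∈ {0,1}ⁿ of f(b) for a polynomial f of degree ≤ 2 in each variable; each round, P sends the three evaluations H[0], H[1], H[2] of a univariate restriction, V checks H[0] + H[1] against its running claim, draws a random coin r, and updates the claim by interpolation.

# Minimal sum-check sketch over F_p (illustrative; not Zebra's code).
import random
from itertools import product

P = 2**61 - 1
INV2 = pow(2, P - 2, P)  # modular inverse of 2

def sumcheck(f, n, seed=0):
    rng = random.Random(seed)
    claim = sum(f(bits) for bits in product((0, 1), repeat=n)) % P
    prefix = []
    for _ in range(n):
        # P: send the degree-<=2 restriction H(k), summed over remaining bits
        rest = n - len(prefix) - 1
        H = [sum(f(prefix + [k] + list(b))
                 for b in product((0, 1), repeat=rest)) % P
             for k in (0, 1, 2)]
        # V: check consistency with the running claim, then send a coin
        assert (H[0] + H[1]) % P == claim, "P cheated"
        r = rng.randrange(P)
        # interpolate H through (0,H[0]), (1,H[1]), (2,H[2]) and evaluate at r
        claim = (H[0]*(r-1)*(r-2)*INV2 - H[1]*r*(r-2) + H[2]*r*(r-1)*INV2) % P
        prefix.append(r)
    # V: one oracle query to f at the random point finishes the check
    assert f(prefix) % P == claim
    return True

# Example: f(x) = (x0 + 2*x1) * (x2 + 1), degree 1 in each variable
f = lambda x: ((x[0] + 2*x[1]) * (x[2] + 1)) % P
print(sumcheck(f, 3))  # True

In GKR-style protocols, V's final oracle query is replaced by a claim about the next layer's wires, which is then settled by that layer's own sum-check; that layer-by-layer structure is what the pipeline on the following slides exploits.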

SLIDE 48

Extracting parallelism in Zebra

P executing AC: layers are sequential, but all gates at a layer can be executed in parallel

SLIDE 49

Extracting parallelism in Zebra

P executing the AC: layers are sequential, but all gates at a layer can be executed in parallel.

Proving step: can V and P interact about all of F's layers at once?

SLIDE 50

Extracting parallelism in Zebra

P executing the AC: layers are sequential, but all gates at a layer can be executed in parallel.

Proving step: can V and P interact about all of F's layers at once?

No. V must ask questions in order, or soundness is lost.
SLIDE 51

Extracting parallelism in Zebra

P executing the AC: layers are sequential, but all gates at a layer can be executed in parallel.

Proving step: can V and P interact about all of F's layers at once?

No. V must ask questions in order, or soundness is lost.

But: there is still parallelism to be extracted. . .

SLIDE 52

Extracting parallelism in Zebra’s P

V questions P about F(x1)’s output layer.

F(x1)

SLIDE 53

Extracting parallelism in Zebra’s P

V questions P about F(x1)’s output layer. Simultaneously, P returns F(x2).

F(x1) F(x2)

SLIDE 54

Extracting parallelism in Zebra’s P

V questions P about F(x1)’s next layer

F(x1)

SLIDE 55

Extracting parallelism in Zebra’s P

V questions P about F(x1)’s next layer, and F(x2)’s output layer.

F(x1) F(x2)

SLIDE 56

Extracting parallelism in Zebra’s P

V questions P about F(x1)’s next layer, and F(x2)’s output layer. Meanwhile, P returns F(x3).

F(x1) F(x2) F(x3)

SLIDE 57

Extracting parallelism in Zebra’s P

This process continues. . .

F(x1) F(x2) F(x3) F(x4)

SLIDE 58

Extracting parallelism in Zebra’s P

This process continues. . .

F(x1) F(x2) F(x3) F(x4) F(x5)

SLIDE 59

Extracting parallelism in Zebra's P

This process continues until V and P interact about every layer simultaneously, but for different computations. V and P can complete one proof in each time step.

F(x1) F(x2) F(x3) F(x4) F(x5) F(x6) F(x7) F(x8)

SLIDE 60

Extracting parallelism in Zebra's P with pipelining

[Diagram: input x enters the sub-prover for layer 0; sub-provers for layers 0, 1, . . . , d − 1 each run "prove" and exchange queries and responses with V; output y emerges from the final stage]

This approach is just a standard hardware technique, pipelining; it is possible because the protocol is naturally staged.

SLIDE 61

Extracting parallelism in Zebra's P with pipelining

[Diagram: input x enters the sub-prover for layer 0; sub-provers for layers 0, 1, . . . , d − 1 each run "prove" and exchange queries and responses with V; output y emerges from the final stage]

This approach is just a standard hardware technique, pipelining; it is possible because the protocol is naturally staged. There are other opportunities to leverage the protocol's structure.
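A toy schedule makes the pipelining concrete (a sketch, not Zebra's RTL): at time step t, the sub-prover for layer L works on computation t − L, so after an initial d-step latency the pipeline finishes one proof per step.

# Toy schedule for Zebra's pipelined proving (illustrative only).
def pipeline_schedule(num_layers, num_steps):
    for t in range(num_steps):
        # layer L is busy with computation t - L once that computation exists
        yield t, [(L, t - L) for L in range(num_layers) if t - L >= 0]

for t, active in pipeline_schedule(num_layers=3, num_steps=5):
    print(f"step {t}:", ", ".join(f"layer {L} proves F(x{c + 1})" for L, c in active))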

SLIDE 62

Per-layer computations

For each sum-check round, P sums over each gate in a layer

SLIDE 63

Per-layer computations

For each sum-check round, P sums over each gate in a layer, evaluating H[k], k ∈ {0, 1, 2}:

H[k] = Σ_{g ∈ layer} δ(g, k)

SLIDE 64

Per-layer computations

For each sum-check round, P sums over each gate in a layer, evaluating H[k], k ∈ {0, 1, 2}:

H[k] = Σ_{g ∈ layer} δ(g, k)

In software:

// compute H[0], H[1], H[2]
for k ∈ {0, 1, 2}:
    H[k] ← 0
    for g ∈ layer:
        H[k] ← H[k] + δ(g, k)   // δ uses state[g]

// update lookup table with V's random coin
for g ∈ layer:
    state[g] ← δ(g, rj)
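The pseudocode above, transcribed into runnable Python (illustrative only: the modulus is arbitrary, and delta() below is a stand-in for the real per-gate polynomial, which GKR-style protocols derive from the layer's wiring and wire values):

# Runnable transcription of the per-layer pseudocode (illustrative).
P = 2**61 - 1

def delta(g, k, state, coeffs):
    a, b, c = coeffs[g]                  # stand-in per-gate quadratic in k
    return state[g] * ((a * k * k + b * k + c) % P) % P

def sumcheck_round(state, coeffs, r_j):
    # compute H[0], H[1], H[2]: one pass over the layer's gates per point
    H = [sum(delta(g, k, state, coeffs) for g in range(len(state))) % P
         for k in (0, 1, 2)]
    # update the lookup table with V's random coin r_j
    for g in range(len(state)):
        state[g] = delta(g, r_j, state, coeffs)
    return H

coeffs = [(1, 2, 3), (0, 1, 1), (2, 0, 5), (1, 1, 1)]   # a layer of 4 gates
state = [1, 1, 1, 1]
print(sumcheck_round(state, coeffs, r_j=12345))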

SLIDE 65

Per-layer computations

For each sum-check round, P sums over each gate in a layer, evaluating H[k], k ∈ {0, 1, 2}:

H[k] = Σ_{g ∈ layer} δ(g, k)

In software:

// compute H[0], H[1], H[2]
for k ∈ {0, 1, 2}:
    H[k] ← 0
    for g ∈ layer:
        H[k] ← H[k] + δ(g, k)   // δ uses state[g]

// update lookup table with V's random coin
for g ∈ layer:
    state[g] ← δ(g, rj)

In hardware:

[Diagram: one gate prover per gate, each computing δ(g, 0) in parallel]

SLIDE 66

Per-layer computations

For each sum-check round, P sums over each gate in a layer, evaluating H[k], k ∈ {0, 1, 2}:

H[k] = Σ_{g ∈ layer} δ(g, k)

In software:

// compute H[0], H[1], H[2]
for k ∈ {0, 1, 2}:
    H[k] ← 0
    for g ∈ layer:
        H[k] ← H[k] + δ(g, k)   // δ uses state[g]

// update lookup table with V's random coin
for g ∈ layer:
    state[g] ← δ(g, rj)

In hardware:

[Diagram: parallel gate provers computing δ(g, 0), with gate state held in a RAM]

SLIDE 67

Per-layer computations

For each sum-check round, P sums over each gate in a layer, evaluating H[k], k ∈ {0, 1, 2}:

H[k] = Σ_{g ∈ layer} δ(g, k)

In software:

// compute H[0], H[1], H[2]
for k ∈ {0, 1, 2}:
    H[k] ← 0
    for g ∈ layer:
        H[k] ← H[k] + δ(g, k)   // δ uses state[g]

// update lookup table with V's random coin
for g ∈ layer:
    state[g] ← δ(g, rj)

In hardware:

[Diagram: parallel gate provers computing δ(g, 0), with gate state held in a RAM; an adder tree sums their outputs]

SLIDE 68

Per-layer computations

For each sum-check round, P sums over each gate in a layer, evaluating H[k], k ∈ {0, 1, 2}:

H[k] = Σ_{g ∈ layer} δ(g, k)

In software:

// compute H[0], H[1], H[2]
for k ∈ {0, 1, 2}:
    H[k] ← 0
    for g ∈ layer:
        H[k] ← H[k] + δ(g, k)   // δ uses state[g]

// update lookup table with V's random coin
for g ∈ layer:
    state[g] ← δ(g, rj)

In hardware:

[Diagram: parallel gate provers computing δ(g, 0) and δ(g, 1), with gate state held in a RAM; an adder tree sums their outputs]

SLIDE 69

Per-layer computations

For each sum-check round, P sums over each gate in a layer, evaluating H[k], k ∈ {0, 1, 2}:

H[k] = Σ_{g ∈ layer} δ(g, k)

In software:

// compute H[0], H[1], H[2]
for k ∈ {0, 1, 2}:
    H[k] ← 0
    for g ∈ layer:
        H[k] ← H[k] + δ(g, k)   // δ uses state[g]

// update lookup table with V's random coin
for g ∈ layer:
    state[g] ← δ(g, rj)

In hardware:

[Diagram: parallel gate provers computing δ(g, 0), δ(g, 1), and δ(g, 2), with gate state held in a RAM; an adder tree sums their outputs]

SLIDE 70

Per-layer computations

For each sum-check round, P sums over each gate in a layer, evaluating H[k], k ∈ {0, 1, 2}:

H[k] = Σ_{g ∈ layer} δ(g, k)

In software:

// compute H[0], H[1], H[2]
for k ∈ {0, 1, 2}:
    H[k] ← 0
    for g ∈ layer:
        H[k] ← H[k] + δ(g, k)   // δ uses state[g]

// update lookup table with V's random coin
for g ∈ layer:
    state[g] ← δ(g, rj)

In hardware:

[Diagram: parallel gate provers computing δ(g, 0), δ(g, 1), δ(g, 2), and δ(g, rj), with gate state held in a RAM; an adder tree sums their outputs]


SLIDE 72

Per-layer computations

For each sum-check round, P sums over each gate in a layer, evaluating H[k], k ∈ {0, 1, 2}:

H[k] = Σ_{g ∈ layer} δ(g, k)

In software:

// compute H[0], H[1], H[2]
for k ∈ {0, 1, 2}:
    H[k] ← 0
    for g ∈ layer:
        H[k] ← H[k] + δ(g, k)   // δ uses state[g]

// update lookup table with V's random coin
for g ∈ layer:
    state[g] ← δ(g, rj)

In hardware:

[Diagram: each gate prover keeps state[g] in a local register, replacing the RAM, and computes δ(g, 0), δ(g, 1), δ(g, 2), and δ(g, rj); an adder tree sums their outputs]

SLIDE 73

Zebra's design approach

✓ Extract parallelism
  e.g., pipelined proving
  e.g., parallel evaluation of δ by gate provers

✓ Exploit locality: distribute data and control
  e.g., no RAM: data is kept close to the places it is needed

SLIDE 74

Zebra's design approach

✓ Extract parallelism
  e.g., pipelined proving
  e.g., parallel evaluation of δ by gate provers

✓ Exploit locality: distribute data and control
  e.g., no RAM: data is kept close to the places it is needed
  e.g., latency-insensitive design: localized control

SLIDE 75

Zebra's design approach

✓ Extract parallelism
  e.g., pipelined proving
  e.g., parallel evaluation of δ by gate provers

✓ Exploit locality: distribute data and control
  e.g., no RAM: data is kept close to the places it is needed
  e.g., latency-insensitive design: localized control

✓ Reduce, reuse, recycle
  e.g., computation: save energy by adding memoization to P
  e.g., hardware: save chip area by reusing the same circuits

SLIDE 76

Architectural challenges

Interaction between V and P requires a lot of bandwidth

✗ V and P on circuit board? Too much energy, circuit area

SLIDE 77

Architectural challenges

Interaction between V and P requires a lot of bandwidth

✗ V and P on circuit board? Too much energy, circuit area
✓ Zebra uses 3D integration

SLIDE 78

Architectural challenges

Interaction between V and P requires a lot of bandwidth

✗ V and P on circuit board? Too much energy, circuit area
✓ Zebra uses 3D integration

Protocol requires input-independent precomputation [VSBW13]

SLIDE 79

Architectural challenges

Interaction between V and P requires a lot of bandwidth

✗ V and P on circuit board? Too much energy, circuit area
✓ Zebra uses 3D integration

Protocol requires input-independent precomputation [VSBW13]

✓ Zebra amortizes precomputations over many V-P pairs

SLIDE 80

Architectural challenges

Interaction between V and P requires a lot of bandwidth

✗ V and P on circuit board? Too much energy, circuit area
✓ Zebra uses 3D integration

Protocol requires input-independent precomputation [VSBW13]

✓ Zebra amortizes precomputations over many V-P pairs

Precomputations need secrecy, integrity

✗ Give V trusted storage? Cost would be prohibitive

[Diagram: V sends x to P; P returns y and a proof that y = F(x); V loads precomputation prei]

SLIDE 81

Architectural challenges

Interaction between V and P requires a lot of bandwidth

✗ V and P on circuit board? Too much energy, circuit area
✓ Zebra uses 3D integration

Protocol requires input-independent precomputation [VSBW13]

✓ Zebra amortizes precomputations over many V-P pairs

Precomputations need secrecy, integrity

✗ Give V trusted storage? Cost would be prohibitive
✓ Zebra uses untrusted storage + authenticated encryption

[Diagram: V sends x to P; P returns y and a proof that y = F(x); V loads Ek(prei) from untrusted storage]
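One way to realize "untrusted storage + authenticated encryption" is sketched below with AES-GCM (a sketch under assumed details: the blob layout, key handling, and the use of Python's cryptography package are illustrative, not Zebra's actual mechanism). Binding each blob to its index as associated data stops the storage from swapping or replaying entries.

# Sketch: precomputations on untrusted storage, sealed with AES-GCM.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=128)   # known only to the principal/V
aead = AESGCM(key)

def seal(index: int, pre_i: bytes) -> bytes:
    # bind the blob to its index so untrusted storage cannot swap entries
    nonce = os.urandom(12)
    return nonce + aead.encrypt(nonce, pre_i, index.to_bytes(8, "big"))

def open_(index: int, blob: bytes) -> bytes:
    # raises InvalidTag if the blob was tampered with or mis-indexed
    nonce, ct = blob[:12], blob[12:]
    return aead.decrypt(nonce, ct, index.to_bytes(8, "big"))

blob = seal(7, b"precomputed queries for instance 7")
assert open_(7, blob) == b"precomputed queries for instance 7"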

SLIDE 82

Implementation

Zebra's implementation includes:

  • a compiler that produces synthesizable Verilog for P
  • two V implementations: hardware (Verilog) and software (C++)
  • a library to generate V's precomputations
  • Verilog simulator extensions to model software or hardware V's interactions with P

SLIDE 83

. . . and it seemed to work really well!

Zebra can produce 10k–100k proofs per second, while existing systems take tens of seconds per proof!

SLIDE 84

. . . and it seemed to work really well!

Zebra can produce 10k–100k proofs per second, while existing systems take tens of seconds per proof!

But that’s not a serious evaluation. . .

SLIDE 85

Evaluation method

[Diagram: V sends x to P; P returns y and a proof that y = F(x); compared against F built directly in a trusted fab]

Baseline: direct implementation of F in the same technology as V

SLIDE 86

Evaluation method

[Diagram: V sends x to P; P returns y and a proof that y = F(x); compared against F built directly in a trusted fab]

Baseline: direct implementation of F in the same technology as V

Metrics: energy, chip size per throughput (discussed in paper)

SLIDE 87

Evaluation method

[Diagram: V sends x to P; P returns y and a proof that y = F(x); compared against F built directly in a trusted fab]

Baseline: direct implementation of F in the same technology as V

Metrics: energy, chip size per throughput (discussed in paper)

Measurements: based on circuit synthesis and simulation, published chip designs, and CMOS scaling models

Charge for V, P, communication; retrieving and decrypting precomputations; PRNG; operator communicating with V

SLIDE 88

Evaluation method

[Diagram: V sends x to P; P returns y and a proof that y = F(x); compared against F built directly in a trusted fab]

Baseline: direct implementation of F in the same technology as V

Metrics: energy, chip size per throughput (discussed in paper)

Measurements: based on circuit synthesis and simulation, published chip designs, and CMOS scaling models

Charge for V, P, communication; retrieving and decrypting precomputations; PRNG; operator communicating with V

Constraints: trusted fab = 350 nm; untrusted fab = 7 nm; 200 mm² max chip area; 150 W max total power

350 nm: 1997 (Pentium II); 7 nm: ≈ 2017 [TSMC]; ≈ 20-year gap between trusted and untrusted fab

SLIDE 89

Application #1: number theoretic transform

NTT: a Fourier transform over Fp. Widely used, e.g., in computer algebra.
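For reference, here is a compact radix-2 NTT sketch in Python (the tiny modulus p = 17 and root of unity are chosen only so the example can be checked by hand; Zebra's field and sizes are much larger):

# Minimal recursive radix-2 NTT over F_p (illustrative parameters).
# p = 17 has a primitive 8th root of unity: 2 (since 2^8 ≡ 1, 2^4 ≡ -1 mod 17).
P_MOD = 17

def ntt(a, omega):
    n = len(a)
    if n == 1:
        return a
    # Cooley-Tukey split: evens and odds use omega^2, a primitive (n/2)th root
    even = ntt(a[0::2], omega * omega % P_MOD)
    odd = ntt(a[1::2], omega * omega % P_MOD)
    out = [0] * n
    w = 1
    for i in range(n // 2):
        t = w * odd[i] % P_MOD
        out[i] = (even[i] + t) % P_MOD
        out[i + n // 2] = (even[i] - t) % P_MOD
        w = w * omega % P_MOD
    return out

print(ntt([1, 2, 3, 4, 0, 0, 0, 0], omega=2))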

SLIDE 90

Application #1: number theoretic transform

[Chart: ratio of baseline energy to Zebra energy (higher is better) versus log₂(NTT size) from 6 to 13; y-axis from 0.1 to 3]

SLIDE 91

Application #2: Curve25519 point multiplication

Curve25519: a commonly-used elliptic curve. Point multiplication: a primitive, e.g., for ECDH.
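As a reminder of what this chip computes, here is a textbook x-only Montgomery ladder for Curve25519 in Python (a reference sketch of the primitive following RFC 7748's formulas; it omits scalar clamping and byte-level encoding, and it is not Zebra's hardware design):

# Reference sketch: x-only Montgomery ladder for Curve25519.
P25519 = 2**255 - 19
A24 = 121665  # (486662 - 2) / 4

def ladder(k: int, u: int) -> int:
    x1 = u
    x2, z2 = 1, 0
    x3, z3 = u, 1
    swap = 0
    for t in reversed(range(255)):
        bit = (k >> t) & 1
        swap ^= bit
        if swap:  # conditional swap (constant-time in real implementations)
            x2, x3, z2, z3 = x3, x2, z3, z2
        swap = bit
        a = (x2 + z2) % P25519
        b = (x2 - z2) % P25519
        aa, bb = a * a % P25519, b * b % P25519
        e = (aa - bb) % P25519
        c = (x3 + z3) % P25519
        d = (x3 - z3) % P25519
        da, cb = d * a % P25519, c * b % P25519
        x3 = (da + cb) * (da + cb) % P25519
        z3 = x1 * (da - cb) * (da - cb) % P25519
        x2 = aa * bb % P25519
        z2 = e * (aa + A24 * e) % P25519
    if swap:
        x2, z2 = x3, z3
    return x2 * pow(z2, P25519 - 2, P25519) % P25519  # x2/z2 mod p

print(ladder(7, 9))  # multiply the base point (u = 9) by the scalar 7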

SLIDE 92

Application #2: Curve25519 point multiplication

[Chart: ratio of baseline energy to Zebra energy (higher is better) versus the number of parallel Curve25519 point multiplications: 84, 170, 340, 682, 1147; y-axis from 0.1 to 3]

SLIDE 93

A qualified success: Zebra, a hardware design that saves costs. . . sometimes.

SLIDE 94

Summary of Zebra's applicability

  1. Computation F must have a layered, shallow, deterministic AC
  2. Must have a wide gap between cutting-edge fab (for P) and trusted fab (for V)
  3. Amortizes precomputations over many instances
  4. Computation F must be very large for V to save work
  5. Computation F must be efficient as an arithmetic circuit
SLIDE 95

Summary of Zebra's applicability

Applies to IPs, but not arguments:

  1. Computation F must have a layered, shallow, deterministic AC
  2. Must have a wide gap between cutting-edge fab (for P) and trusted fab (for V)
  3. Amortizes precomputations over many instances
  4. Computation F must be very large for V to save work
  5. Computation F must be efficient as an arithmetic circuit
SLIDE 96

Arguments versus IPs, redux

Design principle          IPs [GKR08, CMT12, VSBW13]    Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14]
Extract parallelism       ✓                             ✓
Exploit locality          ✓
Reduce, reuse, recycle    ✓

Argument protocols seem friendly to hardware?

SLIDE 97

Arguments versus IPs, redux

Design principle          IPs [GKR08, CMT12, VSBW13]    Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14]
Extract parallelism       ✓                             ✓
Exploit locality          ✓                             ✗
Reduce, reuse, recycle    ✓

Argument protocols seem unfriendly to hardware: P computes over the entire AC at once ⇒ need RAM

SLIDE 98

Arguments versus IPs, redux

Design principle          IPs [GKR08, CMT12, VSBW13]    Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14]
Extract parallelism       ✓                             ✓
Exploit locality          ✓                             ✗
Reduce, reuse, recycle    ✓                             ✗

Argument protocols seem unfriendly to hardware:
P computes over the entire AC at once ⇒ need RAM
P does crypto for every gate in the AC ⇒ special crypto circuits

SLIDE 99

Arguments versus IPs, redux

Design principle          IPs [GKR08, CMT12, VSBW13]    Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14]
Extract parallelism       ✓                             ✓
Exploit locality          ✓                             ✗
Reduce, reuse, recycle    ✓                             ✗

Argument protocols seem unfriendly to hardware:
P computes over the entire AC at once ⇒ need RAM
P does crypto for every gate in the AC ⇒ special crypto circuits

. . . but we hope these issues are surmountable!

SLIDE 100

Summary of Zebra's applicability

  1. Computation F must have a layered, shallow, deterministic AC
  2. Must have a wide gap between cutting-edge fab (for P) and trusted fab (for V)
  3. Amortizes precomputations over many instances
  4. Computation F must be very large for V to save work
  5. Computation F must be efficient as an arithmetic circuit

Common to essentially all built proof systems

SLIDE 102

Summary of Zebra's applicability

  1. Computation F must have a layered, shallow, deterministic AC
  2. Must have a wide gap between cutting-edge fab (for P) and trusted fab (for V)
  3. Amortizes precomputations over many instances
  4. Computation F must be very large for V to save work
  5. Computation F must be efficient as an arithmetic circuit

System                                  Amortization regime                       Advice
Zebra                                   many V-P pairs                            short
Allspice [VSBW13]                       batch of instances of a particular F      short
Bootstrapped SNARKs [BCTV14a, CTV15]    all computations                          long
BCTV [BCTV14b]                          all computations of the same length       long
Pinocchio [PGHR13]                      all future instances of a particular F    long
Zaatar [SBVBPW13]                       batch of instances of a particular F      long

Exception: [CMT12] with logspace-uniform ACs

SLIDE 104

Summary of Zebra's applicability

  1. Computation F must have a layered, shallow, deterministic AC
  2. Must have a wide gap between cutting-edge fab (for P) and trusted fab (for V)
  3. Amortizes precomputations over many instances
  4. Computation F must be very large for V to save work
  5. Computation F must be efficient as an arithmetic circuit

For example, libsnark [BCTV14b], a highly optimized implementation of [GGPR13] and Pinocchio [PGHR13]:
V's work: 6 ms + (|x| + |y|) · 3 µs on a 2.7 GHz CPU

SLIDE 105

Summary of Zebra's applicability

  1. Computation F must have a layered, shallow, deterministic AC
  2. Must have a wide gap between cutting-edge fab (for P) and trusted fab (for V)
  3. Amortizes precomputations over many instances
  4. Computation F must be very large for V to save work
  5. Computation F must be efficient as an arithmetic circuit

For example, libsnark [BCTV14b], a highly optimized implementation of [GGPR13] and Pinocchio [PGHR13]:
V's work: 6 ms + (|x| + |y|) · 3 µs on a 2.7 GHz CPU
⇒ break-even point ≥ 16 × 10⁶ CPU ops

SLIDE 106

Summary of Zebra's applicability

  1. Computation F must have a layered, shallow, deterministic AC
  2. Must have a wide gap between cutting-edge fab (for P) and trusted fab (for V)
  3. Amortizes precomputations over many instances
  4. Computation F must be very large for V to save work
  5. Computation F must be efficient as an arithmetic circuit

For example, libsnark [BCTV14b], a highly optimized implementation of [GGPR13] and Pinocchio [PGHR13]:
V's work: 6 ms + (|x| + |y|) · 3 µs on a 2.7 GHz CPU
⇒ break-even point ≥ 16 × 10⁶ CPU ops
With 32 GB RAM, libsnark handles ACs with ≤ 16 × 10⁶ gates

SLIDE 107

Summary of Zebra's applicability

  1. Computation F must have a layered, shallow, deterministic AC
  2. Must have a wide gap between cutting-edge fab (for P) and trusted fab (for V)
  3. Amortizes precomputations over many instances
  4. Computation F must be very large for V to save work
  5. Computation F must be efficient as an arithmetic circuit

For example, libsnark [BCTV14b], a highly optimized implementation of [GGPR13] and Pinocchio [PGHR13]:
V's work: 6 ms + (|x| + |y|) · 3 µs on a 2.7 GHz CPU
⇒ break-even point ≥ 16 × 10⁶ CPU ops
With 32 GB RAM, libsnark handles ACs with ≤ 16 × 10⁶ gates
⇒ breaking even requires > 1 CPU op per AC gate, e.g., computations over Fp rather than machine integers

SLIDE 108

Recap

[Diagram: V sends x to P; P returns y and a proof that y = F(x)]

+ Verifiable ASICs: a new approach to building trustworthy hardware under a strong threat model
+ First hardware design for a probabilistic proof protocol
+ Improves performance compared to trusted baseline

SLIDE 109

Recap

[Diagram: V sends x to P; P returns y and a proof that y = F(x)]

+ Verifiable ASICs: a new approach to building trustworthy hardware under a strong threat model
+ First hardware design for a probabilistic proof protocol
+ Improves performance compared to trusted baseline
– Improvement compared to the baseline is modest
– Applicability is limited:
    precomputations must be amortized
    computation needs to be "big enough"
    large gap between trusted and untrusted technology
    does not apply to all computations

SLIDE 110

Recap

[Diagram: V sends x to P; P returns y and a proof that y = F(x)]

+ Verifiable ASICs: a new approach to building trustworthy hardware under a strong threat model
+ First hardware design for a probabilistic proof protocol
+ Improves performance compared to trusted baseline
– Improvement compared to the baseline is modest
– Applicability is limited:
    precomputations must be amortized
    computation needs to be "big enough"
    large gap between trusted and untrusted technology
    does not apply to all computations

Bottom line: Zebra is plausible when it applies

https://www.pepper-project.org/