Verifiable ASICs: trustworthy hardware with untrusted components

Verifiable ASICs: trustworthy hardware with untrusted components Riad S. Wahby , Max Howald , Siddharth Garg , abhi shelat , and Michael Walfish Stanford University New York University The Cooper Union


slide-1
SLIDE 1

Verifiable ASICs: trustworthy hardware with untrusted components

Riad S. Wahby◦⋆, Max Howald†⋆, Siddharth Garg⋆, abhi shelat‡, and Michael Walfish⋆

◦Stanford University ⋆New York University †The Cooper Union ‡The University of Virginia

May 25th, 2016

slide-2
SLIDE 2

Untrusted manufacturers can craft hardware Trojans

slide-6
SLIDE 6

Untrusted manufacturers can craft hardware Trojans

Trusted fabrication is not a panacea:

✗ Only 5 countries have cutting-edge fabs on-shore
✗ Building a new fab takes $$$$$$, years of R&D
✗ An old fab could mean a 10^8× performance hit, accounting for speed, chip area, and energy

Can we get trust more cheaply?

slide-7
SLIDE 7

Can we build Verifiable ASICs?

Principal

F → designs for P, V

slide-8
SLIDE 8

Can we build Verifiable ASICs?

Untrusted fab (fast) builds P
Trusted fab (slow) builds V

Principal

F → designs for P, V

slide-9
SLIDE 9

Can we build Verifiable ASICs?

Untrusted fab (fast) builds P
Trusted fab (slow) builds V

Principal

F → designs for P, V

Integrator V P

slide-10
SLIDE 10

Can we build Verifiable ASICs?

Untrusted fab (fast) builds P
Trusted fab (slow) builds V

Principal

F → designs for P, V

Integrator

V P

input

output

slide-11
SLIDE 11

Can we build Verifiable ASICs?

Untrusted fab (fast) builds P
Trusted fab (slow) builds V

Principal

F → designs for P, V

Integrator

V P

x y proof that y = F(x) input

output

slide-12
SLIDE 12

Can we build Verifiable ASICs?

V P

x y proof that y = F(x) input

output

F vs.

  • Makes sense if V + P are cheaper than trusted F
slide-13
SLIDE 13

Can we build Verifiable ASICs?

V P

x y proof that y = F(x) input

output

F vs.

  • Makes sense if V + P are cheaper than trusted F
  • Reasons for hope:
      • running time of V < running time of F (asymptotically)
      • speed of cutting-edge fab might offset P’s overheads
slide-14
SLIDE 14

Can we build Verifiable ASICs?

V P

x y proof that y = F(x) input

output

F vs.

  • Makes sense if V + P are cheaper than trusted F
  • Reasons for hope:
      • running time of V < running time of F (asymptotically)
      • speed of cutting-edge fab might offset P’s overheads
  • Challenges remain:
      • Hardware issues: energy, chip area
      • Need physically realizable circuit design
      • V needs to save work at plausible computation sizes
slide-15
SLIDE 15

Zebra: a hardware design that saves costs

slide-16
SLIDE 16

A qualified success

Zebra: a hardware design that saves costs... sometimes.

slide-17
SLIDE 17

Probabilistic proof systems, briefly

V P

x y proof that y = F(x) input

output

F must be expressed as an arithmetic circuit (AC)

generalized boolean circuit over 𝔽_p: ∧ → ×, ∨ → +
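
As a rough illustration of the gate mapping above, here is a minimal Python sketch. The modulus 97 and the function names are illustrative assumptions, not taken from the talk. One point worth keeping in mind: plain addition agrees with Boolean OR on every input pair except (1, 1), which is why the circuit is a generalization; an exact Boolean OR over a field is a + b − ab.

```python
# Illustrative sketch only: P and all function names are assumptions.
P = 97  # a small prime, so arithmetic mod P is a field

def and_gate(a, b):
    """AND maps to field multiplication."""
    return (a * b) % P

def or_gate_additive(a, b):
    """The slide's mapping of OR to field addition (differs from OR at 1, 1)."""
    return (a + b) % P

def or_gate_exact(a, b):
    """Exact Boolean OR written as a polynomial: a + b - a*b."""
    return (a + b - a * b) % P

def not_gate(a):
    """Exact Boolean NOT written as a polynomial: 1 - a."""
    return (1 - a) % P

if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, and_gate(a, b), or_gate_additive(a, b), or_gate_exact(a, b))
```

The gates accept any field element, not only 0 and 1, which is what lets an arithmetic circuit express more than a Boolean circuit of the same shape.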

slide-18
SLIDE 18

Probabilistic proof systems, briefly

V P

x y proof that y = F(x) input

output

F must be expressed as an arithmetic circuit (AC)
AC satisfiable ⟺ F was executed correctly
P convinces V that the AC is satisfiable

slide-19
SLIDE 19

Probabilistic proof systems, briefly

V P

x y proof that y = F(x) input

output

Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14]
e.g., Zaatar, Pinocchio, libsnark

IPs [GKR08, CMT12, VSBW13]
e.g., Muggles, CMT, Allspice

slide-20
SLIDE 20

Probabilistic proof systems, briefly

V P

x y proof that y = F(x) input

output

Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14]
e.g., Zaatar, Pinocchio, libsnark
+ F with RAM, complex control flow
+ Little V-P communication

IPs [GKR08, CMT12, VSBW13]
e.g., Muggles, CMT, Allspice
– “Quasi–straight line” F
– Lots of V-P communication

slide-21
SLIDE 21

Probabilistic proof systems, briefly

V P

x y proof that y = F(x) input

output

Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14]
e.g., Zaatar, Pinocchio, libsnark
+ F with RAM, complex control flow
+ Little V-P communication
Unsuited to hardware implementation

IPs [GKR08, CMT12, VSBW13]
e.g., Muggles, CMT, Allspice
– “Quasi–straight line” F
– Lots of V-P communication

slide-22
SLIDE 22

Probabilistic proof systems, briefly

V P

x y proof that y = F(x) input

output

Arguments [GGPR13, SBVBPW13, PGHR13, BCTV14]
e.g., Zaatar, Pinocchio, libsnark
+ F with RAM, complex control flow
+ Little V-P communication
✗ Unsuited to hardware implementation

IPs [GKR08, CMT12, VSBW13]
e.g., Muggles, CMT, Allspice
– “Quasi–straight line” F
– Lots of V-P communication
✓ Suited to hardware implementation

slide-23
SLIDE 23

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

F must be expressed as a layered arithmetic circuit.
Note: this is an abstraction of F, not a physical circuit!
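
To make "layered arithmetic circuit" concrete, here is a small Python sketch that evaluates one layer at a time, each gate reading only from the layer directly before it. The gate encoding `(op, left, right)`, the modulus, and the example function are assumptions chosen for illustration; they are not Zebra's representation.

```python
# Illustrative sketch only: the encoding and names are assumptions.
P = 2**61 - 1  # a prime modulus

def eval_layered(layers, inputs):
    """Evaluate a layered arithmetic circuit over the integers mod P.

    layers: list of layers, ordered from the input side to the output side.
            Each layer is a list of gates (op, i, j), where op is "add" or
            "mul" and i, j index wires in the previous layer.
    Returns the values of every layer, inputs first and outputs last.
    """
    values = [x % P for x in inputs]
    trace = [values]
    for layer in layers:
        nxt = []
        for op, i, j in layer:
            if op == "add":
                nxt.append((values[i] + values[j]) % P)
            elif op == "mul":
                nxt.append((values[i] * values[j]) % P)
            else:
                raise ValueError(f"unknown gate type: {op}")
        values = nxt
        trace.append(values)
    return trace

# Example: F(x0, x1, x2, x3) = (x0 + x1) * (x2 * x3), depth 2, width at most 4.
EXAMPLE_LAYERS = [
    [("add", 0, 1), ("mul", 2, 3)],
    [("mul", 0, 1)],
]

if __name__ == "__main__":
    print(eval_layered(EXAMPLE_LAYERS, [1, 2, 3, 4]))
```

The per-layer trace is the data P would be questioned about, layer by layer, in an interactive proof of this style.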

slide-24
SLIDE 24

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

  • 1. V sends inputs
slide-25
SLIDE 25

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

  • 1. V sends inputs
  • 2. P evaluates circuit
slide-28
SLIDE 28

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

  • 1. V sends inputs
  • 2. P evaluates circuit, returns output y


slide-29
SLIDE 29

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

  • 1. V sends inputs
  • 2. P evaluates circuit, returns output y
  • 3. V cross-examines P about the last layer

slide-30
SLIDE 30

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

  • 1. V sends inputs
  • 2. P evaluates circuit, returns output y
  • 3. V cross-examines P about the last layer, ends up with claim about second-last layer

slide-31
SLIDE 31

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

  • 1. V sends inputs
  • 2. P evaluates circuit, returns output y
  • 3. V cross-examines P about the last layer, ends up with claim about second-last layer
  • 4. V iterates
slide-34
SLIDE 34

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

  • 1. V sends inputs
  • 2. P evaluates circuit, returns output y
  • 3. V cross-examines P about the last layer, ends up with claim about second-last layer
  • 4. V iterates, ends up with claim about inputs


slide-35
SLIDE 35

Zebra builds on IPs of GKR [GKR08, CMT12, VSBW13]

  • 1. V sends inputs
  • 2. P evaluates circuit, returns output y
  • 3. V cross-examines P about the last layer, ends up with claim about second-last layer
  • 4. V iterates, ends up with claim about inputs
  • 5. V checks consistency with the inputs

V’s work ≈ O(depth · log width), so it saves work when width ≫ depth
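
The cross-examination in steps 3 and 4 is built on the sum-check protocol. The sketch below is a deliberately simplified version, assuming a multilinear polynomial given as a table of its values on the Boolean hypercube; GKR applies sum-check to a higher-degree per-layer polynomial, so this is not the protocol Zebra implements. All names, the modulus, and the `cheat` flag are illustrative assumptions. The verifier does one check per variable, that is, log2(table size) rounds, which is where the "log width" factor comes from.

```python
import random

# Illustrative sketch only: a toy sum-check, not Zebra's protocol.
P = 2**61 - 1  # prime modulus

def fold(table, r):
    """Fix the first variable of a multilinear table to the field element r."""
    half = len(table) // 2
    return [(table[i] + r * (table[half + i] - table[i])) % P
            for i in range(half)]

def mle_eval(table, point):
    """Evaluate the multilinear extension of table at point."""
    for r in point:
        table = fold(table, r)
    return table[0]

def sumcheck(table, claimed_sum, rng, cheat=False):
    """Return True if the verifier accepts that sum(table) == claimed_sum.

    The table length must be a power of two. With cheat=True, the prover
    lies in the first round so that the first consistency check passes.
    """
    claim = claimed_sum % P
    current = list(table)   # prover's working copy
    point = []              # verifier's random challenges so far
    while len(current) > 1:
        half = len(current) // 2
        # Prover sends the round polynomial g via g(0) and g(1).
        g0 = sum(current[:half]) % P
        g1 = sum(current[half:]) % P
        if cheat and not point:
            g0 = (claim - g1) % P
        # Verifier checks g(0) + g(1) against the running claim.
        if (g0 + g1) % P != claim:
            return False
        # Verifier picks a random challenge and reduces the claim to g(r).
        r = rng.randrange(P)
        claim = (g0 + r * (g1 - g0)) % P
        current = fold(current, r)
        point.append(r)
    # Final check: verifier evaluates the polynomial at the random point.
    return claim == mle_eval(table, point)

if __name__ == "__main__":
    rng = random.Random(0)
    table = [3, 1, 4, 1, 5, 9, 2, 6]
    print(sumcheck(table, sum(table), rng))                   # honest prover
    print(sumcheck(table, sum(table) + 1, rng, cheat=True))   # cheating prover
```

In this toy version the verifier's final evaluation touches the whole table; in GKR the corresponding final check is the consistency check against the inputs in step 5, and the intermediate layers are never evaluated by V.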

slide-36
SLIDE 36

Can we parallelize this interaction?

Can V and P interact about all of F’s layers at once?

No. V must ask questions in the correct order, or P can cheat!

slide-37
SLIDE 37

Can we parallelize this interaction?

Can V and P interact about all of F’s layers at once?

No. V must ask questions in the correct order, or P can cheat!

But: Zebra uses pipelining to parallelize several Fs.

slide-38
SLIDE 38

Extracting parallelism through pipelining

V questions P about F(x1)’s output layer.

F(x1)

slide-39
SLIDE 39

Extracting parallelism through pipelining

V questions P about F(x1)’s output layer. Simultaneously, P returns F(x2).

F(x1) F(x2)

slide-40
SLIDE 40

Extracting parallelism through pipelining

V questions P about F(x1)’s next layer

F(x1)

slide-41
SLIDE 41

Extracting parallelism through pipelining

V questions P about F(x1)’s next layer, and F(x2)’s output layer.

F(x1) F(x2)

slide-42
SLIDE 42

Extracting parallelism through pipelining

V questions P about F(x1)’s next layer, and F(x2)’s output layer. Meanwhile, P returns F(x3).

F(x1) F(x2) F(x3)

slide-43
SLIDE 43

Extracting parallelism through pipelining

This process continues until the pipeline is full.

F(x1) F(x2) F(x3) F(x4)

slide-44
SLIDE 44

Extracting parallelism through pipelining

This process continues until the pipeline is full.

F(x1) F(x2) F(x3) F(x4) F(x5)

slide-45
SLIDE 45

Extracting parallelism through pipelining

This process continues until the pipeline is full.
V and P can complete one proof in each time step.

F(x1) F(x2) F(x3) F(x4) F(x5) F(x6) F(x7) F(x8)
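
The schedule walked through on the preceding slides can be sketched as a small Python simulation. It assumes, purely for illustration, that each layer of each instance takes exactly one time step and that a new instance enters every step; the function names are hypothetical and layer 0 stands for the output layer.

```python
# Illustrative sketch only: a toy model of the pipeline schedule.

def pipeline_schedule(num_instances, depth):
    """Return, for each time step, the (instance, layer) pairs in flight.

    Instance i enters at step i; layer 0 is the output layer and layer
    depth - 1 is the last one V asks about before checking the inputs.
    """
    steps = []
    for t in range(num_instances + depth - 1):
        active = [(i, t - i) for i in range(num_instances)
                  if 0 <= t - i < depth]
        steps.append(active)
    return steps

def completions_per_step(steps, depth):
    """Count the instances that finish their last layer in each step."""
    return [sum(1 for _, layer in active if layer == depth - 1)
            for active in steps]

if __name__ == "__main__":
    schedule = pipeline_schedule(num_instances=8, depth=4)
    for t, active in enumerate(schedule):
        print(t, active)
    print(completions_per_step(schedule, depth=4))
```

Within a single instance the layers are still visited strictly in order, so the ordering requirement is respected; the parallelism comes only from overlapping different instances.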

slide-46
SLIDE 46

Zebra’s design approach

✓ Extract parallelism

e.g., pipelined proving

slide-47
SLIDE 47

Zebra’s design approach

✓ Extract parallelism

e.g., pipelined proving

✓ Exploit locality: distribute data and control

e.g., no RAM: data is kept close to places it is needed
e.g., latency-insensitive design: distributed state machine avoids bottlenecks associated with central controller

slide-48
SLIDE 48

Zebra’s design approach

✓ Extract parallelism

e.g., pipelined proving

✓ Exploit locality: distribute data and control

e.g., no RAM: data is kept close to places it is needed
e.g., latency-insensitive design: distributed state machine avoids bottlenecks associated with central controller

✓ Reduce, reuse, recycle

e.g., computation: save energy by adding memoization to P
e.g., hardware: save chip area by reusing the same circuits

slide-49
SLIDE 49

Architectural challenges

Interaction between V and P requires a lot of bandwidth

✗ V and P on circuit board? Too much energy, circuit area

Protocol requires input-independent precomputation [Allspice13]

slide-50
SLIDE 50

Architectural challenges

Interaction between V and P requires a lot of bandwidth

✗ V and P on circuit board? Too much energy, circuit area
✓ Zebra uses 3D integration

Protocol requires input-independent precomputation [Allspice13]

slide-51
SLIDE 51

Architectural challenges

Interaction between V and P requires a lot of bandwidth

✗ V and P on circuit board? Too much energy, circuit area
✓ Zebra uses 3D integration

Protocol requires input-independent precomputation [Allspice13]

✓ Zebra amortizes precomputations over many V-P pairs

slide-52
SLIDE 52

Architectural challenges

Interaction between V and P requires a lot of bandwidth

✗ V and P on circuit board? Too much energy, circuit area
✓ Zebra uses 3D integration

Protocol requires input-independent precomputation [Allspice13]

✓ Zebra amortizes precomputations over many V-P pairs

Several other details (see paper)

slide-53
SLIDE 53

Implementation

Zebra’s implementation includes

  • a compiler that produces synthesizable Verilog for P
  • two V implementations
      • hardware (Verilog)
      • software (C++)
  • library to generate V’s precomputations
  • Verilog simulator extensions to model software or hardware V’s interactions with P

slide-54
SLIDE 54

Evaluation method

V P

x y proof that y = F(x) input

output

F vs.

Baseline: direct implementation of F in same technology as V

slide-55
SLIDE 55

Evaluation method

V P

x y proof that y = F(x) input

output

F vs.

Baseline: direct implementation of F in same technology as V
Metrics: energy, chip size per throughput (see paper)

slide-56
SLIDE 56

Evaluation method

V P

x y proof that y = F(x) input

output

F vs.

Baseline: direct implementation of F in same technology as V
Metrics: energy, chip size per throughput (see paper)
Measurements: based on circuit synthesis and simulation, published chip designs, and CMOS scaling models
Charge for V, P, communication; retrieving and decrypting precomputations; PRNG; Operator communicating with V

slide-57
SLIDE 57

Evaluation method

V P

x y proof that y = F(x) input

output

F vs.

Baseline: direct implementation of F in same technology as V
Metrics: energy, chip size per throughput (see paper)
Measurements: based on circuit synthesis and simulation, published chip designs, and CMOS scaling models
Charge for V, P, communication; retrieving and decrypting precomputations; PRNG; Operator communicating with V
Constraints: trusted fab = 350 nm; untrusted fab = 7 nm; 200 mm² max chip area; 150 W max total power

350 nm: 1997 (Pentium II)
7 nm: ≈ 2017 [TSMC]
≈ 20 year gap between trusted and untrusted fab

slide-58
SLIDE 58

Application #1: number theoretic transform

NTT: a Fourier transform over 𝔽_p
Widely used, e.g., in computer algebra
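
For readers who have not met the NTT, here is a minimal Python sketch of the quadratic-time definition, with a tiny prime and transform size picked only for illustration (p = 17, size 4, root of unity 4). It shows what the transform computes, not how Zebra's benchmark circuit is built; practical implementations use a butterfly network like the FFT.

```python
# Illustrative sketch only: tiny parameters chosen for readability.
P = 17      # prime modulus
OMEGA = 4   # primitive 4th root of unity mod 17 (4**2 = 16 = -1, 4**4 = 1)

def ntt(values, omega=OMEGA):
    """Naive O(n^2) number theoretic transform of a length-4 vector mod P."""
    n = len(values)
    return [sum(v * pow(omega, j * k, P) for j, v in enumerate(values)) % P
            for k in range(n)]

def intt(values):
    """Inverse transform: use omega^-1 and scale by n^-1 (Fermat inverses)."""
    n_inv = pow(len(values), P - 2, P)
    omega_inv = pow(OMEGA, P - 2, P)
    return [(x * n_inv) % P for x in ntt(values, omega_inv)]

if __name__ == "__main__":
    x = [1, 2, 3, 4]
    print(ntt(x))
    print(intt(ntt(x)))
```

The structure is the same as a discrete Fourier transform, with the complex root of unity replaced by a root of unity in the field, so all arithmetic stays exact.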

slide-59
SLIDE 59

Application #1: number theoretic transform

[Figure: ratio of baseline energy to Zebra energy (baseline vs. Zebra, higher is better) plotted against log2(NTT size) from 6 to 13; vertical axis logarithmic, with ticks at 0.1, 0.3, 1, and 3.]

slide-60
SLIDE 60

Application #2: Curve25519 point multiplication

Curve25519: a commonly-used elliptic curve
Point multiplication: primitive used for ECDH

slide-61
SLIDE 61

Application #2: Curve25519 point multiplication

[Figure: ratio of baseline energy to Zebra energy (baseline vs. Zebra, higher is better) plotted against the number of parallel Curve25519 point multiplications (84, 170, 340, 682, 1147); vertical axis logarithmic, with ticks at 0.1, 0.3, 1, and 3.]

slide-62
SLIDE 62

A qualified success

Zebra: a hardware design that saves costs... sometimes.

slide-63
SLIDE 63

Summary of Zebra’s applicability

  • 1. Must have a wide gap between cutting-edge fab for P and trusted fab for V

  • 2. Must amortize precomputations over many instances
  • 3. Computation F must be very large for V to save work
  • 4. Computation F must be efficient as an arithmetic circuit
  • 5. Computation F must have a layered, shallow, deterministic AC
slide-64
SLIDE 64

Summary of Zebra’s applicability

Common to essentially all built proof systems

  • 1. Must have a wide gap between cutting-edge fab for P and trusted fab for V

  • 2. Must amortize precomputations over many instances
  • 3. Computation F must be very large for V to save work
  • 4. Computation F must be efficient as an arithmetic circuit
  • 5. Computation F must have a layered, shallow, deterministic AC
slide-65
SLIDE 65

Summary of Zebra’s applicability

Common to essentially all built proof systems

  • 1. Must have a wide gap between cutting-edge fab for P and trusted fab for V

  • 2. Must amortize precomputations over many instances
  • 3. Computation F must be very large for V to save work
  • 4. Computation F must be efficient as an arithmetic circuit
  • 5. Computation F must have a layered, shallow, deterministic AC

Applies to IPs, but not arguments

slide-66
SLIDE 66

Arguments versus IPs, redux

Design principle          IPs                       Arguments
                          [GKR08, CMT12, VSBW13]    [GGPR13, SBVBPW13, PGHR13, BCTV14]
Extract parallelism       ✓                         ✓
Exploit locality          ✓
Reduce, reuse, recycle    ✓

Argument protocols seem friendly to hardware?

slide-67
SLIDE 67

Arguments versus IPs, redux

Design principle          IPs                       Arguments
                          [GKR08, CMT12, VSBW13]    [GGPR13, SBVBPW13, PGHR13, BCTV14]
Extract parallelism       ✓                         ✓
Exploit locality          ✓                         ✗
Reduce, reuse, recycle    ✓

Argument protocols seem unfriendly to hardware:
P computes over entire AC at once ⟹ need RAM

slide-68
SLIDE 68

Arguments versus IPs, redux

Design principle          IPs                       Arguments
                          [GKR08, CMT12, VSBW13]    [GGPR13, SBVBPW13, PGHR13, BCTV14]
Extract parallelism       ✓                         ✓
Exploit locality          ✓                         ✗
Reduce, reuse, recycle    ✓                         ✗

Argument protocols seem unfriendly to hardware:
P computes over entire AC at once ⟹ need RAM
P does crypto for every gate in AC ⟹ special crypto circuits

slide-69
SLIDE 69

Arguments versus IPs, redux

Design principle          IPs                       Arguments
                          [GKR08, CMT12, VSBW13]    [GGPR13, SBVBPW13, PGHR13, BCTV14]
Extract parallelism       ✓                         ✓
Exploit locality          ✓                         ✗
Reduce, reuse, recycle    ✓                         ✗

Argument protocols seem unfriendly to hardware:
P computes over entire AC at once ⟹ need RAM
P does crypto for every gate in AC ⟹ special crypto circuits

. . . but we hope these issues are surmountable!

slide-70
SLIDE 70

Recap

V P

x y proof that y = F(x) input

output

+ Verifiable ASICs: a new approach to building trustworthy hardware under a strong threat model
+ First hardware design for a probabilistic proof protocol
+ Improves performance compared to trusted baseline

slide-71
SLIDE 71

Recap

V P

x y proof that y = F(x) input

output

+ Verifiable ASICs: a new approach to building trustworthy hardware under a strong threat model
+ First hardware design for a probabilistic proof protocol
+ Improves performance compared to trusted baseline
– Improvement compared to the baseline is modest
– Applicability is limited:
    precomputations must be amortized
    computation needs to be “big enough”
    large gap between trusted and untrusted technology
    does not apply to all computations

slide-72
SLIDE 72

Recap

V P

x y proof that y = F(x) input

output

+ Verifiable ASICs: a new approach to building trustworthy hardware under a strong threat model
+ First hardware design for a probabilistic proof protocol
+ Improves performance compared to trusted baseline
– Improvement compared to the baseline is modest
– Applicability is limited:
    precomputations must be amortized
    computation needs to be “big enough”
    large gap between trusted and untrusted technology
    does not apply to all computations

https://www.pepper-project.org/