Efficient Cryptography on the RISC-V Architecture Ko Stoffelen - - PowerPoint PPT Presentation

efficient cryptography on the risc v architecture
SMART_READER_LITE
LIVE PREVIEW

Efficient Cryptography on the RISC-V Architecture Ko Stoffelen - - PowerPoint PPT Presentation

Efficient Cryptography on the RISC-V Architecture Ko Stoffelen Tl;dr In this talk: Fast AES-128 assembly for RV32I Fast ChaCha20 assembly for RV32I Fast Keccak- f [1600] assembly for RV32I Fast arbitrary-precision integer


slide-1
SLIDE 1

Efficient Cryptography on the RISC-V Architecture

Ko Stoffelen

slide-2
SLIDE 2

Tl;dr

In this talk:

  • Fast AES-128 assembly for RV32I
  • Fast ChaCha20 assembly for RV32I
  • Fast Keccak-f [1600] assembly for RV32I
  • Fast arbitrary-precision integer arithmetic for RV32IM
  • Estimate potential speedup with several RISC-V extensions

2/18

slide-3
SLIDE 3

RISC-V is . . .

  • . . . a new open reduced instruction set architecture (ISA)
  • . . . a research project that started at UC Berkely in 2010
  • . . . a foundation with > 325 members, including Google, Infineon,

NXP, Qualcomm, Samsung, etc.

  • . . . a serious competitor to ARM?
  • . . . a big hype?
  • . . . a project with a lot of work in progress
  • . . . a frozen 32-bit base ISA (RV32I) and 64-bit (RV64I) with

standardized optional extensions

  • . . . a production-ready core design

3/18

slide-4
SLIDE 4

4/18

slide-5
SLIDE 5

RV32I

  • 32 32-bit registers x0–x31, but some are reserved
  • Basic three-operand arithmetic and bitwise instructions
  • Basic shift instructions
  • Basic load/store instructions
  • Basic jump, conditional jumps, comparison instructions
  • That’s more or less it — boring!
  • No: rotate instructions, carry flag, DSP/vector instructions, nice bit
  • peration instructions
  • Compensated by extensions: M, A, F, D, Q, C, . . .

– M: integer multiplication/division – B: bit operation instructions (WIP)

  • HiFive1: 5-stage single-issue in-order pipelined RV32IMAC E31 CPU

– < 384 MHz, 64 KiB RAM, 16 MiB flash, 16 KiB I$ – Most instructions single cycle result latency, except loads

5/18

slide-6
SLIDE 6

AES-128

  • Lookup tables or bitsliced?
  • Both! Depends on data caches
  • 4 KiB table-based fairly straightforward [BS08]

– Baseline of 704 instructions – LBU byte loads: ✓ (−4) – Everything else: ✗ – Can’t load from address with offset in registers (+160) – Key expansion in 340 cycles, encryption in 57 cycles/byte

  • Bitsliced based on Cortex-M3/M4 implementation [SS16]

– 2 blocks in parallel in CTR mode – RV32I advantage: no spills in SubBytes, enough registers! – RV32I disadvantage: no rotates, no byte extraction – Key expansion in 1239 cycles, encryption in 124.4 cycles/byte

6/18

slide-7
SLIDE 7

ChaCha20

  • Stream cipher with 512-bit state
  • RV32I advantage: fits in registers
  • RV32I disadvantage: no rotates
  • Encryption/decryption in 27.9 cycles/byte

7/18

slide-8
SLIDE 8

Keccak-f [1600]

  • Design space explored in Keccak Implementation Overview [BDP+12]

– Bit interleaving: ✓ – Lane complementing: ✓ – State extension for smoother scheduling: ✓ – Plane per plane: ✓ – In-place: ✓

  • Inspired by Cortex-M3/M4 implementation in XKCP
  • Permutation in 72.4 cycles/byte

8/18

slide-9
SLIDE 9

Speed comparison

Table-based AES Bitsliced AES-CTR ChaCha20 Keccak-f [1600] 50 100 Cycles/byte Cortex-M4 RV32I

9/18

slide-10
SLIDE 10

What if. . . single-cycle rotations?

Table-based AES Bitsliced AES-CTR ChaCha20 Keccak-f [1600] 50 100 Cycles/byte Cortex-M4 RV32I RV32I with rotate

10/18

slide-11
SLIDE 11

Arbitrary-precision arithmetic

  • A.k.a. big-integer arithmetic (well, only +, ×)
  • Used by RSA, ECC, some post-quantum, . . .
  • Split large number in 32-bit limbs
  • Addition of two 32-bit limbs may overflow
  • RV32I: no carry flag!
  • ADDS r0,a0,b0; ADC r1,a1,b1 on ARM becomes
  • ADD r0,a0,b0; SLTU c,r0,a0; ADD r1,a1,b1; ADD r1,r1,c
  • Reduced-radix representations appear attractive
  • Radix 2k: only fill k < 32 bits per limb
  • We keep it generic and don’t fix specific radix

11/18

slide-12
SLIDE 12

Arbitrary-precision addition

2 4 6 8 10 12 14 16 18 20 100 200 300 Number of limbs Cycles Reduced Full Note: reduced radix requires more limbs

12/18

slide-13
SLIDE 13

What if. . . carry flag?

2 4 6 8 10 12 14 16 18 20 100 200 300 Number of limbs Cycles Reduced Full Full + carry Note: reduced radix requires more limbs

13/18

slide-14
SLIDE 14

Arbitrary-precision multiplication

  • M extension provides MUL/MULHU instructions
  • Result latency of 2 cycles
  • Consider schoolbook and one level of (subtractive) Karatsuba
  • Instead of n-limb multiplication, do 3

n

2

  • multiplication and some

additions/subtractions

14/18

slide-15
SLIDE 15

Arbitrary-precision multiplication

2 4 6 8 10 12 14 16 18 20 2,000 4,000 6,000 8,000 10,000 Number of limbs Cycles Schoolbook reduced Schoolbook full Karatsuba reduced Karatsuba full

15/18

slide-16
SLIDE 16

What if. . . carry flag?

2 4 6 8 10 12 14 16 18 20 2,000 4,000 6,000 8,000 10,000 Number of limbs Cycles Schoolbook reduced Schoolbook full Schoolbook full + carry Karatsuba reduced Karatsuba full Karatsuba full + carry

16/18

slide-17
SLIDE 17

Some conclusions

  • The base RV32I ISA is not that interesting for optimization
  • Comparing speed results across different RISC-V cores is going to be a

pain in the future – More variation in clock cycle behavior – Different standardized and perhaps also proprietary extensions

  • Symmetric crypto would really benefit from nice bit operation

instructions

  • Carry-chain crypto would really benefit from a carry flag
  • Having more registers is always nice

17/18

slide-18
SLIDE 18
  • Thanks. . .

. . . for your attention! Slides/paper at https://ko.stoffelen.nl Code at https://github.com/Ko-/riscvcrypto

18/18

slide-19
SLIDE 19

References I

Guido Bertoni, Joan Daemen, Michaël Peeters, Gilles Van Assche, and Ronny Van Keer. Keccak implementation overview, May 2012. https://keccak.team/files/Keccak-implementation-3.2.pdf. Daniel J. Bernstein and Peter Schwabe. New AES software speed records. In Dipanwita Roy Chowdhury, Vincent Rijmen, and Abhijit Das, editors, Progress in Cryptology - INDOCRYPT 2008: 9th International Conference in Cryptology in India, volume 5365 of Lecture Notes in Computer Science, pages 322–336. Springer, Heidelberg, December 2008. Peter Schwabe and Ko Stoffelen. All the AES you need on Cortex-M3 and M4. In Roberto Avanzi and Howard M. Heys, editors, SAC 2016: 23rd Annual International Workshop on Selected Areas in Cryptography, volume 10532 of Lecture Notes in Computer Science, pages 180–194. Springer, Heidelberg, August 2016.

19/18