Efficient Cryptography on the RISC-V Architecture, by Ko Stoffelen


  1. Efficient Cryptography on the RISC-V Architecture
     Ko Stoffelen

  2. Tl;dr
     In this talk:
     • Fast AES-128 assembly for RV32I
     • Fast ChaCha20 assembly for RV32I
     • Fast Keccak-f[1600] assembly for RV32I
     • Fast arbitrary-precision integer arithmetic for RV32IM
     • Estimate potential speedup with several RISC-V extensions

  3. RISC-V is ...
     • ... a new open reduced instruction set architecture (ISA)
     • ... a research project that started at UC Berkeley in 2010
     • ... a foundation with > 325 members, including Google, Infineon, NXP, Qualcomm, Samsung, etc.
     • ... a serious competitor to ARM?
     • ... a big hype?
     • ... a project with a lot of work in progress
     • ... a frozen 32-bit base ISA (RV32I) and 64-bit (RV64I) with standardized optional extensions
     • ... a production-ready core design

  4. [Slide with no recoverable text]

  5. RV32I
     • 32 32-bit registers x0–x31, but some are reserved
     • Basic three-operand arithmetic and bitwise instructions
     • Basic shift instructions
     • Basic load/store instructions
     • Basic jump, conditional jump, and comparison instructions
     • That's more or less it: boring!
     • No rotate instructions, carry flag, DSP/vector instructions, or nice bit operation instructions
     • Compensated by extensions: M, A, F, D, Q, C, ...
       – M: integer multiplication/division
       – B: bit operation instructions (WIP)
     • HiFive1: 5-stage single-issue in-order pipelined RV32IMAC E31 CPU
       – < 384 MHz, 64 KiB RAM, 16 MiB flash, 16 KiB I$
       – Most instructions have single-cycle result latency, except loads

  6. AES-128
     • Lookup tables or bitsliced?
     • Both! Depends on data caches
     • 4 KiB table-based, fairly straightforward [BS08]
       – Baseline of 704 instructions
       – LBU byte loads: ✓ (−4)
       – Everything else: ✗
       – Can't load from address with offset in registers (+160)
       – Key expansion in 340 cycles, encryption in 57 cycles/byte
     • Bitsliced, based on Cortex-M3/M4 implementation [SS16]
       – 2 blocks in parallel in CTR mode
       – RV32I advantage: no spills in SubBytes, enough registers!
       – RV32I disadvantage: no rotates, no byte extraction
       – Key expansion in 1239 cycles, encryption in 124.4 cycles/byte
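
The address arithmetic behind a table lookup can be sketched as follows. This is a minimal Python model, not the talk's assembly. The names (`TABLE_BYTES`, `lookup_address`, `EXTRA_ADDS`) are hypothetical, and reading the "+160" as one extra ADD for each of 16 lookups in each of 10 rounds is an assumption that merely matches the number on the slide.

```python
# Hypothetical model of T-table addressing. Names and the per-round
# breakdown are illustrative assumptions, not taken from the talk's code.

TABLES = 4        # four lookup tables
ENTRIES = 256     # one entry per possible byte value
ENTRY_BYTES = 4   # each entry is one 32-bit word
TABLE_BYTES = TABLES * ENTRIES * ENTRY_BYTES  # 4 KiB in total


def lookup_address(table_base, state_word, byte_index):
    """Address of the table entry selected by one byte of a 32-bit word."""
    index = (state_word >> (8 * byte_index)) & 0xFF  # shift + mask
    offset = index * ENTRY_BYTES                     # scale to a byte offset
    # RV32I loads take base register + immediate only, so this sum
    # costs a separate ADD before the load can be issued.
    return table_base + offset


ROUNDS = 10             # AES-128 has 10 rounds
LOOKUPS_PER_ROUND = 16  # assumed: one lookup per state byte
EXTRA_ADDS = ROUNDS * LOOKUPS_PER_ROUND
```

For example, `lookup_address(0x1000, 0xAABBCCDD, 0)` selects byte `0xDD` and yields `0x1000 + 0xDD * 4`.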

  7. ChaCha20
     • Stream cipher with 512-bit state
     • RV32I advantage: fits in registers
     • RV32I disadvantage: no rotates
     • Encryption/decryption in 27.9 cycles/byte
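
To illustrate why missing rotates hurt, here is a sketch of the ChaCha quarter round in Python with each rotation built from two shifts and an OR, which is how a rotation has to be composed on plain RV32I. The function names `rotl32` and `quarter_round` are illustrative and not taken from the talk's code.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n (0 < n < 32) using shift, shift, OR."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def quarter_round(a, b, c, d):
    """One ChaCha quarter round on four 32-bit words."""
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 7)
    return a, b, c, d
```

Each quarter round contains four rotations, so every rotation that costs three instructions instead of one adds eight instructions per quarter round.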

  8. Keccak-f[1600]
     • Design space explored in Keccak Implementation Overview [BDP+12]
       – Bit interleaving: ✓
       – Lane complementing: ✓
       – State extension for smoother scheduling: ✓
       – Plane per plane: ✓
       – In-place: ✓
     • Inspired by Cortex-M3/M4 implementation in XKCP
     • Permutation in 72.4 cycles/byte
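
Bit interleaving can be sketched as follows: a 64-bit lane is stored as two 32-bit words, one holding the even-numbered bits and one the odd-numbered bits, so that a 64-bit rotation turns into 32-bit rotations. This is a minimal Python model with hypothetical names (`interleave`, `rotl_interleaved`); it is not the talk's implementation.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n (taken modulo 32)."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def rotl64(x, n):
    """Reference 64-bit rotation, used only for checking."""
    n %= 64
    return ((x << n) | (x >> (64 - n))) & MASK64


def interleave(lane):
    """Split a 64-bit lane into (even-bit word, odd-bit word)."""
    even = odd = 0
    for i in range(32):
        even |= ((lane >> (2 * i)) & 1) << i
        odd |= ((lane >> (2 * i + 1)) & 1) << i
    return even, odd


def rotl_interleaved(even, odd, n):
    """Rotate an interleaved lane left by n using only 32-bit rotations."""
    n %= 64
    if n % 2 == 0:
        # Even offsets keep bits in their own word.
        return rotl32(even, n // 2), rotl32(odd, n // 2)
    # Odd offsets swap the roles of the two words.
    return rotl32(odd, (n + 1) // 2), rotl32(even, n // 2)
```

Rotating the interleaved pair gives the same result as rotating the 64-bit lane first and interleaving afterwards.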

  9. Speed comparison
     [Bar chart, cycles/byte (axis ticks at 50 and 100): Cortex-M4 vs. RV32I for table-based AES, bitsliced AES-CTR, ChaCha20, and Keccak-f[1600]]

  10. What if... single-cycle rotations?
      [Bar chart, cycles/byte (axis ticks at 50 and 100): Cortex-M4, RV32I, and RV32I with rotate for table-based AES, bitsliced AES-CTR, ChaCha20, and Keccak-f[1600]]

  11. Arbitrary-precision arithmetic
      • A.k.a. big-integer arithmetic (well, only +, ×)
      • Used by RSA, ECC, some post-quantum, ...
      • Split a large number into 32-bit limbs
      • Addition of two 32-bit limbs may overflow
      • RV32I: no carry flag!
      • ADDS r0,a0,b0 ; ADC r1,a1,b1 on ARM becomes
        ADD r0,a0,b0 ; SLTU c,r0,a0 ; ADD r1,a1,b1 ; ADD r1,r1,c
      • Reduced-radix representations appear attractive
      • Radix 2^k: only fill k < 32 bits per limb
      • We keep it generic and don't fix a specific radix
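
The two addition strategies above can be sketched in Python as follows. This is an illustrative model, not the talk's assembly: the helper names are hypothetical, and the comparison `s < x` stands in for SLTU. The general carry chain needs more steps per limb than the two-limb example on the slide, because every limb after the first has to handle both an incoming and an outgoing carry.

```python
MASK32 = 0xFFFFFFFF


def to_limbs(x, n, bits):
    """Split x into n little-endian limbs of `bits` bits each."""
    return [(x >> (bits * i)) & ((1 << bits) - 1) for i in range(n)]


def from_limbs(limbs, bits):
    """Recombine limbs; also valid when limbs exceed `bits` bits."""
    return sum(limb << (bits * i) for i, limb in enumerate(limbs))


def add_full_radix(a, b):
    """Add equal-length lists of 32-bit limbs; returns (limbs, carry out)."""
    result = []
    carry = 0
    for x, y in zip(a, b):
        s = (x + y) & MASK32      # ADD
        c1 = 1 if s < x else 0    # SLTU: the sum wrapped iff it is below an operand
        t = (s + carry) & MASK32  # ADD the incoming carry
        c2 = 1 if t < s else 0    # SLTU
        result.append(t)
        carry = c1 | c2           # at most one of the two can be set
    return result, carry


def add_reduced_radix(a, b):
    """Limb-wise add for radix 2^k with k < 32: no carry handling needed
    as long as every limb stays below 2^32."""
    return [x + y for x, y in zip(a, b)]
```

In the reduced-radix case the spare top bits of each limb absorb the carries, which have to be propagated in a separate normalization step before they overflow.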

  12. Arbitrary-precision addition
      [Plot: cycles (axis ticks at 100, 200, 300) against number of limbs (2 to 20) for reduced-radix and full-radix addition]
      Note: reduced radix requires more limbs

  13. What if... carry flag?
      [Plot: cycles (axis ticks at 100, 200, 300) against number of limbs (2 to 20) for reduced-radix addition, full-radix addition, and full-radix addition with a carry flag]
      Note: reduced radix requires more limbs

  14. Arbitrary-precision multiplication
      • M extension provides MUL / MULHU instructions
      • Result latency of 2 cycles
      • Consider schoolbook and one level of (subtractive) Karatsuba
      • Instead of one n-limb multiplication, do 3 multiplications of about n/2 limbs each and some additions/subtractions
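
Both multiplication strategies can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the talk's code: `mul` and `mulhu` mimic the low and high 32-bit halves returned by MUL and MULHU, all other names are hypothetical, and the Karatsuba recombination uses Python integers for brevity where an assembly version would use limb-wise additions and subtractions.

```python
MASK32 = 0xFFFFFFFF


def mul(a, b):
    """Low 32 bits of the product, as MUL returns."""
    return (a * b) & MASK32


def mulhu(a, b):
    """High 32 bits of the unsigned product, as MULHU returns."""
    return (a * b) >> 32


def to_limbs(x, n):
    """Split x into n little-endian 32-bit limbs."""
    return [(x >> (32 * i)) & MASK32 for i in range(n)]


def from_limbs(limbs):
    """Recombine little-endian 32-bit limbs into an integer."""
    return sum(limb << (32 * i) for i, limb in enumerate(limbs))


def schoolbook(a, b):
    """Schoolbook product of two limb lists; returns len(a) + len(b) limbs."""
    r = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            t = r[i + j] + mul(ai, bj) + carry
            r[i + j] = t & MASK32
            carry = mulhu(ai, bj) + (t >> 32)
        r[i + len(b)] = carry
    return r


def karatsuba(a, b):
    """One level of subtractive Karatsuba on equal-length limb lists.
    Returns the product as an integer (a simplification)."""
    h = len(a) // 2
    hi = len(a) - h
    a0, a1, b0, b1 = a[:h], a[h:], b[:h], b[h:]
    z0 = from_limbs(schoolbook(a0, b0))
    z2 = from_limbs(schoolbook(a1, b1))
    da = from_limbs(a0) - from_limbs(a1)
    db = from_limbs(b0) - from_limbs(b1)
    # Third multiplication works on absolute differences; fix the sign after.
    m = from_limbs(schoolbook(to_limbs(abs(da), hi), to_limbs(abs(db), hi)))
    if (da < 0) != (db < 0):
        m = -m
    z1 = z0 + z2 - m  # equals a0*b1 + a1*b0
    return z0 + (z1 << (32 * h)) + (z2 << (64 * h))
```

The subtractive variant multiplies differences rather than sums, so the operands of the third multiplication never grow beyond the size of the halves.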

  15. Arbitrary-precision multiplication
      [Plot: cycles (0 to 10,000) against number of limbs (2 to 20) for schoolbook reduced, schoolbook full, Karatsuba reduced, and Karatsuba full]

  16. What if... carry flag?
      [Plot: cycles (0 to 10,000) against number of limbs (2 to 20) for schoolbook reduced, schoolbook full, schoolbook full + carry, Karatsuba reduced, Karatsuba full, and Karatsuba full + carry]

  17. Some conclusions
      • The base RV32I ISA is not that interesting for optimization
      • Comparing speed results across different RISC-V cores is going to be a pain in the future
        – More variation in clock cycle behavior
        – Different standardized and perhaps also proprietary extensions
      • Symmetric crypto would really benefit from nice bit operation instructions
      • Carry-chain crypto would really benefit from a carry flag
      • Having more registers is always nice

  18. Thanks...
      ... for your attention!
      Slides/paper at https://ko.stoffelen.nl
      Code at https://github.com/Ko-/riscvcrypto

  19. References
      [BDP+12] Guido Bertoni, Joan Daemen, Michaël Peeters, Gilles Van Assche, and Ronny Van Keer. Keccak implementation overview, May 2012. https://keccak.team/files/Keccak-implementation-3.2.pdf
      [BS08] Daniel J. Bernstein and Peter Schwabe. New AES software speed records. In Dipanwita Roy Chowdhury, Vincent Rijmen, and Abhijit Das, editors, Progress in Cryptology - INDOCRYPT 2008: 9th International Conference in Cryptology in India, volume 5365 of Lecture Notes in Computer Science, pages 322–336. Springer, Heidelberg, December 2008.
      [SS16] Peter Schwabe and Ko Stoffelen. All the AES you need on Cortex-M3 and M4. In Roberto Avanzi and Howard M. Heys, editors, SAC 2016: 23rd Annual International Workshop on Selected Areas in Cryptography, volume 10532 of Lecture Notes in Computer Science, pages 180–194. Springer, Heidelberg, August 2016.
