Pipeline Oriented Implementation of NORX for ARM Processors


  1. Pipeline Oriented Implementation of NORX for ARM Processors
     Luan Cardoso dos Santos (luan@lasca.ic.unicamp.br), Julio López (jlopez@ic.unicamp.br)
     November 7, 2017
     Institute of Computing - UNICAMP, LASCA

  2. Table of contents
     1. Introduction
     2. Target architecture
     3. NORX family of AEAD algorithms
     4. Pipeline optimization
     5. Results
     6. Future work

  3. Introduction

  4. Authenticated encryption (with additional data)
     • An AEAD scheme is an algorithm that uses a secret key and a public nonce to process a plaintext and additional plaintext data, and outputs a ciphertext and authentication data [Rog02].
     • Such a scheme is useful, for example, to encrypt the body of a message, keep its header in plaintext, and authenticate both.
     Figure 1: Basic block design of an AEAD.

  5. Authenticated encryption (with additional data)
     Formally:
     • An AEAD scheme is defined by $\Pi = (\mathcal{K}, \mathcal{E}, \mathcal{D})$ and the associated sets $\mathrm{Nonce} = \{0,1\}^n$, $\mathrm{Header} \subset \{0,1\}^*$ and $\mathrm{Message} \subseteq \{0,1\}^*$.
     • The keyspace $\mathcal{K}$ is a non-empty set of strings.
     • The message $M \in \mathrm{Message}$; the nonce $N \in \mathrm{Nonce}$; the header $H \in \mathrm{Header}$.
     • The encryption algorithm $\mathcal{E}_K^{N,H}(M) \to C$.
     • The decryption algorithm $\mathcal{D}_K^{N,H}(C) \to \{M, \bot\}$.
     • It is required that $\mathcal{D}_K^{N,H}(\mathcal{E}_K^{N,H}(M)) = M$ for all $K \in \mathcal{K}$, $N$, $H$ and $M$.
     • And $|\mathcal{E}_K^{N,H}(M)| = \ell(|M|)$ for some linear-time length function $\ell$.

  6. Cryptographic competitions: CAESAR
     • CAESAR (2013–) stands for "Competition for Authenticated Encryption: Security, Applicability, and Robustness" [CAE13].
     • CAESAR aims to select a portfolio of AEAD ciphers that are suited for widespread adoption and offer advantages over NIST's AES-GCM.
     • Following in the footsteps of other cryptographic competitions, such as AES (1997-2000), eSTREAM (2004-2008) and SHA-3 (2007-2012), CAESAR also aims to promote research on AEAD algorithms.

  7. Cryptographic sponges
     • A cryptographic sponge function is an algorithm with a finite internal state that receives input strings of any length and produces an output of any desired length [BDPA11].
     • Sponges can be used to create hash functions, MACs, stream ciphers, RNGs and AEAD schemes.
     Figure 2: The basic design of a sponge function [BDPA11].
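
The absorb/squeeze structure described on this slide can be sketched in a few lines of C. This is only a toy under stated assumptions: the state and rate sizes, the padding rule, and the names `toy_permute` and `toy_sponge` are all hypothetical, and the permutation is a placeholder with no security value. It is not the NORX or Keccak construction.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy sizes (assumption): 8-byte state, 4-byte rate, 4-byte capacity. */
enum { STATE_BYTES = 8, RATE_BYTES = 4 };

/* Placeholder permutation for illustration only; NOT cryptographically secure. */
static void toy_permute(uint8_t s[STATE_BYTES]) {
    for (unsigned round = 0; round < 8; round++) {
        for (unsigned i = 0; i < STATE_BYTES; i++) {
            uint8_t x = (uint8_t)(s[i] + s[(i + 1) % STATE_BYTES] + round);
            s[i] = (uint8_t)((x << 3) | (x >> 5)); /* rotate left by 3 */
        }
    }
}

/* Absorb `inlen` input bytes, then squeeze `outlen` output bytes. */
static void toy_sponge(const uint8_t *in, size_t inlen,
                       uint8_t *out, size_t outlen) {
    uint8_t s[STATE_BYTES] = {0};

    /* Absorb: XOR each full rate-sized block into the state, then permute. */
    while (inlen >= RATE_BYTES) {
        for (size_t i = 0; i < RATE_BYTES; i++) s[i] ^= in[i];
        toy_permute(s);
        in += RATE_BYTES;
        inlen -= RATE_BYTES;
    }

    /* Last (possibly empty) block with a simplified padding marker. */
    for (size_t i = 0; i < inlen; i++) s[i] ^= in[i];
    s[inlen] ^= 0x80;
    toy_permute(s);

    /* Squeeze: output the rate part, permuting between output blocks. */
    while (outlen > 0) {
        size_t n = outlen < RATE_BYTES ? outlen : RATE_BYTES;
        memcpy(out, s, n);
        out += n;
        outlen -= n;
        if (outlen > 0) toy_permute(s);
    }
}
```

Only the first `RATE_BYTES` of the state are ever read or written directly; the remaining capacity bytes are reached only through the permutation, which is the property a real sponge relies on.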

  8. Target architecture

  9. ARM processors
     • The Advanced RISC Machine (ARM) is a mainly 32-bit architecture owned by the British company ARM Holdings.
     • With more than 100 billion chips deployed up to 2017, it is one of the most widespread architectures nowadays.¹
     • ARM follows a load/store architecture, and most instructions execute in a single clock cycle.
     • In this work, we focused on the Cortex-A family: Cortex-A7, Cortex-A15 and Cortex-A53.
     ¹ https://community.arm.com/processors/b/blog/posts/inside-the-numbers-100-billion-arm-based-chips-1345571105

  10. ARM processors: Target cores i
     • Cortex-A7: The most efficient ARMv7-A core, with over a billion shipped units. It is capable of 40-bit physical addressing and features an eight-stage in-order pipeline. It can be paired with high-performance cores in big.LITTLE configurations.
     • Cortex-A15: A high-performance ARMv7-A core, well suited to consumer items such as smartphones and to embedded applications. Like other processors of the same line, it is capable of 40-bit physical addressing. It also features a fifteen-stage out-of-order superscalar pipeline for integer calculations.

  11. ARM processors: Target cores ii
     • Cortex-A53: An ARMv8-A core capable of seamlessly running both 32-bit and 64-bit code, designed as an efficient 64-bit core with a low area and power footprint. Like the Cortex-A7, it can be deployed together with high-end CPUs in chips with heterogeneous cores. The Cortex-A53 uses an efficient eight-stage, 2-way superscalar, in-order pipeline.
     Our tests were also carried out on Cortex-M4, M3 and M0, for completeness.

  12. NORX family of AEAD algorithms

  13. NORX AEAD
     • NORX is a family of AEAD algorithms, currently in the third round of CAESAR.
     • Based on a sponge design, it is a simple yet fast algorithm, optimized for both 32-bit and 64-bit architectures.
     • The design of NORX also allows for arbitrary parallelism in the payload processing.
     • Based on ARX² primitives, NORX is optimized for both software and hardware implementations, with a SIMD-friendly core permutation and no secret-dependent memory access.
     ² Addition-Rotation-XOR

  14. NORX AEAD
     • The naming convention for NORX is NORXw-l-p-t, where:
       • w is the size in bits of the words in the internal state.
       • l is the number of rounds.
       • p is the parallelism degree.
       • t is the length in bits of the authentication tag. When t = 4w, it is omitted.
     • The key length of NORX is k = 4w. Therefore the 32-bit algorithm has a security level of 128 bits, while the 64-bit algorithm has a security level of 256 bits.
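
As a small illustration of how the four parameters determine an instance, the sketch below stores them in a record, derives the key length k = 4w, and prints the instance name. The type and function names (`norx_params`, `norx_key_bits`, `norx_name`) are hypothetical helpers, and the hyphenated name format is an assumption based on how the NORX specification writes instance names.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical parameter record; field names follow the slide's w, l, p, t. */
typedef struct {
    unsigned w; /* word size in bits: 32 or 64 */
    unsigned l; /* number of rounds */
    unsigned p; /* parallelism degree */
    unsigned t; /* authentication tag length in bits */
} norx_params;

/* Key length in bits: k = 4w. */
static unsigned norx_key_bits(const norx_params *np) {
    return 4 * np->w;
}

/* Writes the instance name into buf; the tag size is appended only
 * when it differs from the default t = 4w. Returns snprintf's result. */
static int norx_name(const norx_params *np, char *buf, size_t len) {
    if (np->t == 4 * np->w)
        return snprintf(buf, len, "NORX%u-%u-%u", np->w, np->l, np->p);
    return snprintf(buf, len, "NORX%u-%u-%u-%u", np->w, np->l, np->p, np->t);
}
```

For example, the parameters w = 64, l = 4, p = 1, t = 256 give a 256-bit key, and the tag length is left out of the name because it equals 4w.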

  15. NORX's mode of operation i
     The state is transformed in each step of the cipher using a non-linear permutation $F^l$.
     Figure 3: The layout of NORX [AJN14].

  16. NORX's mode of operation ii
     Figure 4: The layout of NORX, with parallel payload processing [AJN14].

  17. NORX's core permutation
     • The core of NORX is a 16-word internal state S, which can be viewed as a 4 × 4 matrix:
     $$S = \begin{pmatrix} s_0 & s_1 & s_2 & s_3 \\ s_4 & s_5 & s_6 & s_7 \\ s_8 & s_9 & s_{10} & s_{11} \\ s_{12} & s_{13} & s_{14} & s_{15} \end{pmatrix}$$

  18. Pipeline optimization

  19. Original permutation
     The permutation can be visually represented as:
     Figure 5: Column (left) and diagonal (right) steps. Source: NORX v3.0 specification [AJN14].

  20. Original permutation
     The NORX permutation is built from a function G(), applied first to the columns and then to the diagonals of S:
     Algorithm 1: NORX F round function
       function F
         input: S, G()                              ▷ NORX state s0 ... s15 and the G() function
         s0, s4, s8,  s12 ← G(s0, s4, s8,  s12)     ▷ Processing the columns
         s1, s5, s9,  s13 ← G(s1, s5, s9,  s13)
         s2, s6, s10, s14 ← G(s2, s6, s10, s14)
         s3, s7, s11, s15 ← G(s3, s7, s11, s15)
         s0, s5, s10, s15 ← G(s0, s5, s10, s15)     ▷ Processing the diagonals
         s1, s6, s11, s12 ← G(s1, s6, s11, s12)
         s2, s7, s8,  s13 ← G(s2, s7, s8,  s13)
         s3, s4, s9,  s14 ← G(s3, s4, s9,  s14)
         output: S
       end function
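
A direct C transcription of the round function might look like the sketch below. It assumes the 32-bit variant with the rotation offsets 8, 11, 16 and 31 from the NORX specification, stores the state as a flat array in row-major order, and uses hypothetical function names (`norx_g`, `norx_f`); it is not the authors' optimized code.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n bits, 0 < n < 32. */
static inline uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32u - n));
}

/* Non-linear operation used inside G. */
static inline uint32_t h32(uint32_t x, uint32_t y) {
    return (x ^ y) ^ ((x & y) << 1);
}

/* G applied in place to the state words at indices a, b, c, d. */
static void norx_g(uint32_t s[16], int a, int b, int c, int d) {
    s[a] = h32(s[a], s[b]); s[d] = rotr32(s[a] ^ s[d], 8);
    s[c] = h32(s[c], s[d]); s[b] = rotr32(s[c] ^ s[b], 11);
    s[a] = h32(s[a], s[b]); s[d] = rotr32(s[a] ^ s[d], 16);
    s[c] = h32(s[c], s[d]); s[b] = rotr32(s[c] ^ s[b], 31);
}

/* One round F: four column steps followed by four diagonal steps. */
static void norx_f(uint32_t s[16]) {
    /* Columns */
    norx_g(s, 0, 4,  8, 12);
    norx_g(s, 1, 5,  9, 13);
    norx_g(s, 2, 6, 10, 14);
    norx_g(s, 3, 7, 11, 15);
    /* Diagonals */
    norx_g(s, 0, 5, 10, 15);
    norx_g(s, 1, 6, 11, 12);
    norx_g(s, 2, 7,  8, 13);
    norx_g(s, 3, 4,  9, 14);
}
```

Within each group of four calls, no two calls touch the same word, so the four calls are independent of each other. That independence is what allows them to be reordered or interleaved.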

  21. Original permutation
     With G(a, b, c, d) being defined as:
     Algorithm 2: NORX G permutation function
       function G
         input: a, b, c, d                  ▷ Four words of the state
         a ← (a ⊕ b) ⊕ ((a ∧ b) ≪ 1)
         d ← (a ⊕ d) ⋙ r0
         c ← (c ⊕ d) ⊕ ((c ∧ d) ≪ 1)
         b ← (c ⊕ b) ⋙ r1
         a ← (a ⊕ b) ⊕ ((a ∧ b) ≪ 1)
         d ← (a ⊕ d) ⋙ r2
         c ← (c ⊕ d) ⊕ ((c ∧ d) ≪ 1)
         b ← (c ⊕ b) ⋙ r3
         output: a, b, c, d
       end function
     Here ≪ is a left shift, ⋙ is a right rotation, and r0 to r3 are the rotation offsets.
     How can we improve this permutation?
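
The sketch below writes G in C for 32-bit words, assuming the rotation offsets (8, 11, 16, 31) given in the NORX specification. The helper names are hypothetical. The extra function `add_via_carries` is included only to show where the non-linear step comes from: integer addition satisfies x + y = (x ⊕ y) + ((x ∧ y) ≪ 1), and NORX's step replaces that remaining addition with an XOR.

```c
#include <stdint.h>

/* Rotation offsets for 32-bit words (assumption: values from the NORX spec). */
enum { R0 = 8, R1 = 11, R2 = 16, R3 = 31 };

/* Rotate a 32-bit word right by n bits, 0 < n < 32. */
static inline uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32u - n));
}

/* Exact addition mod 2^32, written as sum-without-carry plus carries. */
static inline uint32_t add_via_carries(uint32_t x, uint32_t y) {
    return (x ^ y) + ((x & y) << 1);
}

/* NORX's non-linear step: same shape, but the '+' becomes an XOR. */
static inline uint32_t norx_h(uint32_t x, uint32_t y) {
    return (x ^ y) ^ ((x & y) << 1);
}

/* G on four words, updated in place. */
static void norx_g(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d) {
    *a = norx_h(*a, *b); *d = rotr32(*a ^ *d, R0);
    *c = norx_h(*c, *d); *b = rotr32(*c ^ *b, R1);
    *a = norx_h(*a, *b); *d = rotr32(*a ^ *d, R2);
    *c = norx_h(*c, *d); *b = rotr32(*c ^ *b, R3);
}
```

Every statement in G depends on the result of the statement just before it, so a single G call gives the processor little independent work to overlap.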

  22. Code profiling
     A synthetic test that encrypts random data was profiled to identify hotspots. The roundF function is the best target for optimization.
     Figure 6: Profiling results

  23. Optimizing the F() function
     The G() function can be split and reorganized in order to better use the processor's pipeline.
     Figure 7: Column and diagonal steps, with two-way pipeline optimization. Notice that each call to function G2() operates over 8 words.
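
One way to read the two-way split is as two independent G computations interleaved statement by statement, so that consecutive operations do not depend on each other. The sketch below illustrates that reading for 32-bit words; the names `g1` and `g2`, the index-based interface, and the rotation offsets (8, 11, 16, 31, taken from the NORX specification) are assumptions, and the authors' actual code may be organized differently.

```c
#include <stdint.h>

static inline uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32u - n));
}

static inline uint32_t h32(uint32_t x, uint32_t y) {
    return (x ^ y) ^ ((x & y) << 1);
}

/* Reference: a single G over the state words at indices a, b, c, d. */
static void g1(uint32_t s[16], int a, int b, int c, int d) {
    s[a] = h32(s[a], s[b]); s[d] = rotr32(s[a] ^ s[d], 8);
    s[c] = h32(s[c], s[d]); s[b] = rotr32(s[c] ^ s[b], 11);
    s[a] = h32(s[a], s[b]); s[d] = rotr32(s[a] ^ s[d], 16);
    s[c] = h32(s[c], s[d]); s[b] = rotr32(s[c] ^ s[b], 31);
}

/* Two G computations over disjoint words, interleaved so that each
 * statement is independent of the one immediately before it. */
static void g2(uint32_t s[16],
               int a0, int b0, int c0, int d0,
               int a1, int b1, int c1, int d1) {
    uint32_t A0 = s[a0], B0 = s[b0], C0 = s[c0], D0 = s[d0];
    uint32_t A1 = s[a1], B1 = s[b1], C1 = s[c1], D1 = s[d1];

    A0 = h32(A0, B0);          A1 = h32(A1, B1);
    D0 = rotr32(A0 ^ D0, 8);   D1 = rotr32(A1 ^ D1, 8);
    C0 = h32(C0, D0);          C1 = h32(C1, D1);
    B0 = rotr32(C0 ^ B0, 11);  B1 = rotr32(C1 ^ B1, 11);
    A0 = h32(A0, B0);          A1 = h32(A1, B1);
    D0 = rotr32(A0 ^ D0, 16);  D1 = rotr32(A1 ^ D1, 16);
    C0 = h32(C0, D0);          C1 = h32(C1, D1);
    B0 = rotr32(C0 ^ B0, 31);  B1 = rotr32(C1 ^ B1, 31);

    s[a0] = A0; s[b0] = B0; s[c0] = C0; s[d0] = D0;
    s[a1] = A1; s[b1] = B1; s[c1] = C1; s[d1] = D1;
}
```

Calling `g2(s, 0, 4, 8, 12, 1, 5, 9, 13)` gives the same state as `g1` on column 0 followed by `g1` on column 1. An optimizing compiler may already reorder the reference version in a similar way, so any gain from writing the interleaving by hand has to be measured on the target core.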

  24. Optimizing the F() function
     Or even further, with a single G() function operating on the whole state at once.
     Figure 8: Column and diagonal steps, with four-way pipeline optimization, operating over the whole state at once.
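
The four-way variant can be sketched by treating the state as four rows and applying each G statement to all four lanes before moving to the next statement. For the diagonal step, the b, c and d rows are read in rotated order (s5 s6 s7 s4, s10 s11 s8 s9, s15 s12 s13 s14), as in the figure. The names `g4`, `f_g4` and `f_ref`, the index tables, and the 32-bit rotation offsets are assumptions made for illustration.

```c
#include <stdint.h>

static inline uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32u - n));
}

static inline uint32_t h32(uint32_t x, uint32_t y) {
    return (x ^ y) ^ ((x & y) << 1);
}

/* Reference G and F, one G call at a time. */
static void g1(uint32_t s[16], int a, int b, int c, int d) {
    s[a] = h32(s[a], s[b]); s[d] = rotr32(s[a] ^ s[d], 8);
    s[c] = h32(s[c], s[d]); s[b] = rotr32(s[c] ^ s[b], 11);
    s[a] = h32(s[a], s[b]); s[d] = rotr32(s[a] ^ s[d], 16);
    s[c] = h32(s[c], s[d]); s[b] = rotr32(s[c] ^ s[b], 31);
}

static void f_ref(uint32_t s[16]) {
    g1(s, 0, 4, 8, 12);  g1(s, 1, 5, 9, 13);
    g1(s, 2, 6, 10, 14); g1(s, 3, 7, 11, 15);
    g1(s, 0, 5, 10, 15); g1(s, 1, 6, 11, 12);
    g1(s, 2, 7, 8, 13);  g1(s, 3, 4, 9, 14);
}

/* G over the whole state: lane i uses words s[i], s[b[i]], s[c[i]], s[d[i]].
 * Each statement of G is applied to all four lanes before the next one. */
static void g4(uint32_t s[16], const int b[4], const int c[4], const int d[4]) {
    static const unsigned R[4] = {8, 11, 16, 31};
    for (int half = 0; half < 4; half += 2) {
        for (int i = 0; i < 4; i++) s[i]    = h32(s[i], s[b[i]]);
        for (int i = 0; i < 4; i++) s[d[i]] = rotr32(s[i] ^ s[d[i]], R[half]);
        for (int i = 0; i < 4; i++) s[c[i]] = h32(s[c[i]], s[d[i]]);
        for (int i = 0; i < 4; i++) s[b[i]] = rotr32(s[c[i]] ^ s[b[i]], R[half + 1]);
    }
}

/* One round F built from two whole-state G4 calls. */
static void f_g4(uint32_t s[16]) {
    static const int col_b[4] = {4, 5, 6, 7};
    static const int col_c[4] = {8, 9, 10, 11};
    static const int col_d[4] = {12, 13, 14, 15};
    static const int dia_b[4] = {5, 6, 7, 4};
    static const int dia_c[4] = {10, 11, 8, 9};
    static const int dia_d[4] = {15, 12, 13, 14};
    g4(s, col_b, col_c, col_d); /* column step */
    g4(s, dia_b, dia_c, dia_d); /* diagonal step */
}
```

Because the four lanes never share a word within a step, `f_g4` produces the same state as `f_ref`. The short fixed-count loops are written for clarity; a production version would more likely unroll them by hand or with macros.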

  25. Additional optimizations
     A few extra steps were taken to improve code performance:
     • Extensive use of preprocessor macros and code inlining.
     • Avoiding extra or temporary variables, encrypting and decrypting in place.
     • Initialization of the sponge via loads of constant values instead of evaluating $F^2(0 \,\|\, 1 \,\|\, 2 \,\|\, \cdots \,\|\, 15)$.
     • Where possible, combining shift and rotate operations with arithmetic ones, so as to allow the use of ARM's barrel shifter. For example, a = a + (b << 2) can compile into ADD r1, r1, r2, LSL #2. The parentheses matter: in C, a + b << 2 parses as (a + b) << 2.
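
The barrel-shifter point can be illustrated with two small functions. The instruction sequences in the comments are what an ARM compiler can emit for such expressions, but they are an assumption here: the actual output depends on the compiler, its version and the optimization flags, and should be checked in the generated assembly. The function names are hypothetical.

```c
#include <stdint.h>

/* Shift folded into the addition.
 * Possible ARM code: ADD r0, r0, r1, LSL #2 */
static inline uint32_t add_shifted(uint32_t a, uint32_t b) {
    return a + (b << 2);
}

/* The same idea applied to NORX's non-linear step: the "<< 1" can be
 * carried by the shifted operand of the last EOR instead of needing
 * its own instruction.
 * Possible ARM code: AND r2, r0, r1
 *                    EOR r0, r0, r1
 *                    EOR r0, r0, r2, LSL #1 */
static inline uint32_t h32(uint32_t x, uint32_t y) {
    uint32_t carries = x & y;
    return (x ^ y) ^ (carries << 1);
}
```

Written this way, each shift is an operand of an arithmetic or logical operation rather than a separate statement, which gives the compiler the chance to fold it into a single instruction.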

  26. Results

  27. Benchmark i
     • Benchmarks for the Cortex-A7 and Cortex-A15 were carried out on an Odroid XU4 device running Arch Linux.
     • An Odroid-C2 device, also running Arch Linux, was used for tests with the 64-bit Cortex-A53.
     • The code was compiled with gcc 6.3.1.

  28. Benchmark ii
     • Each test consists of the encryption of random data with lengths between 128 bytes and 1 megabyte, with a 128-bit key for NORX3261 and NORX3264, and a 256-bit key for NORX6461.
     • Our tests were also carried out on Cortex-M4, M3 and M0 for consistency.
     • All measurements were made using the processors' cycle counter.
