Pipeline Oriented Implementation of NORX for ARM Processors
Luan Cardoso dos Santos (luan@lasca.ic.unicamp.br), Julio López (jlopez@ic.unicamp.br)
November 7, 2017
Institute of Computing - UNICAMP, LASCA
Table of contents
1. Introduction
2. Target architecture
3. NORX family of AEAD algorithms
4. Pipeline optimization
5. Results
6. Future work
Introduction
Authenticated encryption (with additional data)
• An AEAD scheme is an algorithm that uses a secret key and public nonce to process a plaintext and additional plain data to output ciphertext and authentication data [Rog02].
• Such a scheme is useful, for example, to encrypt the body of a message, keep a header in plaintext and authenticate the whole.
Figure 1: Basic block design of an AEAD.
Authenticated encryption (with additional data)
Formally:
• An AEAD scheme is defined by $\Pi = (\mathcal{K}, \mathcal{E}, \mathcal{D})$ and the associated sets $\mathrm{Nonce} = \{0,1\}^n$, $\mathrm{Header} \subset \{0,1\}^*$ and $\mathrm{Message} \subseteq \{0,1\}^*$.
• The keyspace $\mathcal{K}$ is a non-empty set of strings.
• The message $M \in \mathrm{Message}$, the nonce $N \in \mathrm{Nonce}$ and the header $H \in \mathrm{Header}$.
• The encryption algorithm $\mathcal{E}_K^{N,H}(M) \to C$.
• The decryption algorithm $\mathcal{D}_K^{N,H}(C) \to \{M, \bot\}$.
• It is required that $\mathcal{D}_K^{N,H}(\mathcal{E}_K^{N,H}(M)) = M$ for all $K \in \mathcal{K}$, $N$, $H$ and $M$.
• And $|\mathcal{E}_K^{N,H}(M)| = \ell(|M|)$ for some linear-time length function $\ell$.
Cryptographic competitions: CAESAR
• CAESAR (2013–) stands for "Competition for Authenticated Encryption: Security, Applicability, and Robustness" [CAE13].
• CAESAR aims to select a portfolio of AEAD ciphers that are suited for widespread adoption and offer advantages over NIST's AES-GCM.
• Following in the footsteps of other cryptographic competitions, such as SHA-3 (2007-2012), AES (1997-2000) and eSTREAM (2004-2008), CAESAR also aims to promote research on AEAD algorithms.
Cryptographic sponges
• A cryptographic sponge function is an algorithm with a finite internal state that receives input strings of any length and produces an output of any desired length [BDPA11].
• Sponges can be used to create hash functions, MACs, stream ciphers, RNGs and AEAD schemes.
Figure 2: The basic design of a sponge function [BDPA11].
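The absorb-then-squeeze structure of a sponge can be sketched in a few lines of C. This is a toy under stated assumptions: the 4-word state, the one-word rate, and the `toy_permute` mixing function are all hypothetical and not cryptographically secure, and message padding is omitted. It is not the construction used by NORX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { STATE_WORDS = 4 };  /* word 0 is the rate, words 1..3 the capacity */

/* Rotate right; shifting by 0 or 32 would be undefined, so guard it. */
static uint32_t rotr32(uint32_t x, unsigned r) {
    assert(r > 0 && r < 32);
    return (x >> r) | (x << (32 - r));
}

/* Toy invertible mixing function: for illustration only, NOT secure. */
static void toy_permute(uint32_t s[STATE_WORDS]) {
    for (int round = 0; round < 8; round++) {
        s[0] += s[1]; s[3] = rotr32(s[3] ^ s[0], 16);
        s[2] += s[3]; s[1] = rotr32(s[1] ^ s[2], 12);
        s[0] ^= (uint32_t)round + 1;  /* round constant */
    }
}

/* Absorb in_words words of input, then squeeze out_words words of output. */
static void toy_sponge(const uint32_t *in, size_t in_words,
                       uint32_t *out, size_t out_words) {
    uint32_t s[STATE_WORDS] = {0};
    for (size_t i = 0; i < in_words; i++) {   /* absorbing phase */
        s[0] ^= in[i];                        /* XOR input into the rate */
        toy_permute(s);
    }
    for (size_t i = 0; i < out_words; i++) {  /* squeezing phase */
        out[i] = s[0];                        /* read output from the rate */
        toy_permute(s);
    }
}
```

The caller chooses both the input and the output length, which is what lets the same construction serve as a hash, a MAC or a stream cipher.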
Target architecture
ARM processors
• The Advanced RISC Machine (ARM) is a mainly 32-bit architecture owned by the British company ARM Holdings.
• With more than 100 billion chips deployed up to 2017, it is one of the most widespread architectures today.¹
• ARM follows a load/store architecture, with mostly single-cycle instruction execution.
• In this work, we focused on the Cortex-A family: Cortex-A7, Cortex-A15 and Cortex-A53.
¹ https://community.arm.com/processors/b/blog/posts/inside-the-numbers-100-billion-arm-based-chips-1345571105
ARM processors: Target cores i
• Cortex-A7: The most efficient ARMv7-A core, with over a billion shipped units. It is capable of 40-bit physical addressing and features an eight-stage in-order pipeline. It can be paired with high-performance cores in big.LITTLE configurations.
• Cortex-A15: A high-performance ARMv7-A core, well suited to consumer items such as smartphones and to embedded applications. Like other processors of the same line, it is capable of 40-bit physical addressing. It also features a fifteen-stage out-of-order superscalar pipeline for integer calculations.
ARM processors: Target cores ii
• Cortex-A53: An ARMv8-A core capable of seamlessly running both 32-bit and 64-bit code, designed as an efficient 64-bit core with a low area and power footprint. Like the Cortex-A7, it can be deployed together with high-end CPUs in chips with heterogeneous cores. The Cortex-A53 uses an efficient eight-stage, 2-way superscalar, in-order pipeline.
Our tests were also carried out on Cortex-M4, M3 and M0, for completeness.
NORX family of AEAD algorithms
NORX AEAD
• NORX is a family of AEAD algorithms, currently in the third round of CAESAR.
• Based on a sponge design, it is a simple yet fast algorithm, optimized for both 32-bit and 64-bit architectures.
• The design of NORX also allows for arbitrary parallelism in the payload processing.
• Based on ARX² primitives, NORX is optimized for both software and hardware implementations, with a SIMD-friendly core permutation and no secret-dependent memory access.
² Addition-Rotation-Xor
NORX AEAD
• The naming convention for NORX is NORXwlpt, where:
  • w is the bit size of the words in the internal state.
  • l is the number of rounds.
  • p is the parallelism degree.
  • t is the bit length of the authentication tag. When t = 4w, it is omitted.
• The key length of NORX is k = 4w. Therefore, the 32-bit algorithm has a security level of 128 bits, while the 64-bit algorithm has a security level of 256 bits.
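As a small illustration of how the parameters relate, the sketch below derives the key length from the word size and checks whether the tag length is the default one. The `norx_params` type and both helper names are hypothetical, introduced only for this example.

```c
#include <assert.h>

/* Hypothetical parameter record; not part of any NORX reference code. */
typedef struct {
    unsigned w;  /* word size in bits: 32 or 64 */
    unsigned l;  /* number of rounds */
    unsigned p;  /* parallelism degree */
    unsigned t;  /* authentication tag length in bits */
} norx_params;

/* Key length k = 4w, which is also the stated security level. */
static unsigned norx_key_bits(norx_params prm) {
    assert(prm.w == 32 || prm.w == 64);
    return 4 * prm.w;
}

/* Returns nonzero when t = 4w, the case in which t is omitted from the name. */
static int norx_tag_is_default(norx_params prm) {
    return prm.t == 4 * prm.w;
}
```

For example, w = 32, l = 6, p = 1 and t = 128 gives a 128-bit key and a default tag, so the instance is written NORX3261.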
NORX's mode of operation i
The state is transformed in each step of the cipher using a non-linear permutation $F^l$, where l is the number of rounds.
Figure 3: The layout of NORX [AJN14].
NORX's mode of operation ii
Figure 4: The layout of NORX, with parallel payload processing [AJN14].
NORX's core permutation
• The core of NORX is a 16-word internal state S, which can be viewed as a 4 × 4 matrix:

$$S = \begin{pmatrix} s_0 & s_1 & s_2 & s_3 \\ s_4 & s_5 & s_6 & s_7 \\ s_8 & s_9 & s_{10} & s_{11} \\ s_{12} & s_{13} & s_{14} & s_{15} \end{pmatrix}$$
Pipeline optimization
Original permutation
The permutation can be visually represented as:
Figure 5: Column (left) and diagonal (right) steps, with one G() call per column and per diagonal. Source: NORX v3.0 specification [AJN14].
Original permutation
The NORX permutation is built from a function G(), applied to the columns and diagonals of S:

Algorithm 1: NORX F round function
function F
    input: S, G()    ▷ NORX state s0 ... s15 and the G() function
    s0, s4, s8, s12 ← G(s0, s4, s8, s12)    ▷ Processing the columns
    s1, s5, s9, s13 ← G(s1, s5, s9, s13)
    s2, s6, s10, s14 ← G(s2, s6, s10, s14)
    s3, s7, s11, s15 ← G(s3, s7, s11, s15)
    s0, s5, s10, s15 ← G(s0, s5, s10, s15)    ▷ Processing the diagonals
    s1, s6, s11, s12 ← G(s1, s6, s11, s12)
    s2, s7, s8, s13 ← G(s2, s7, s8, s13)
    s3, s4, s9, s14 ← G(s3, s4, s9, s14)
    output: S
end function
Original permutation
With G(a, b, c, d) being defined as follows, where ≪ is a left shift and ⋙ is a right rotation:

Algorithm 2: NORX G permutation function
function G
    input: a, b, c, d    ▷ Four words of the state
    a ← (a ⊕ b) ⊕ ((a ∧ b) ≪ 1)
    d ← (a ⊕ d) ⋙ r0
    c ← (c ⊕ d) ⊕ ((c ∧ d) ≪ 1)
    b ← (c ⊕ b) ⋙ r1
    a ← (a ⊕ b) ⊕ ((a ∧ b) ≪ 1)
    d ← (a ⊕ d) ⋙ r2
    c ← (c ⊕ d) ⊕ ((c ∧ d) ≪ 1)
    b ← (c ⊕ b) ⋙ r3
    output: a, b, c, d
end function

How can we improve this permutation?
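Algorithms 1 and 2 translate almost line by line into C. The following is a minimal sketch for 32-bit words, not the authors' optimized code. The rotation offsets (8, 11, 16, 31) are assumed from the NORX specification for w = 32, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rotation offsets r0..r3 for 32-bit words (assumed from the NORX spec). */
enum { R0 = 8, R1 = 11, R2 = 16, R3 = 31 };

/* Rotate right; shifting by 0 or 32 would be undefined, so guard it. */
static inline uint32_t rotr32(uint32_t x, unsigned r) {
    assert(r > 0 && r < 32);
    return (x >> r) | (x << (32 - r));
}

/* The non-linear step (x XOR y) XOR ((x AND y) << 1). */
static inline uint32_t h32(uint32_t x, uint32_t y) {
    return (x ^ y) ^ ((x & y) << 1);
}

/* Algorithm 2: G applied in place to s[ia], s[ib], s[ic], s[id]. */
static void norx_g32(uint32_t s[16], int ia, int ib, int ic, int id) {
    uint32_t a = s[ia], b = s[ib], c = s[ic], d = s[id];
    a = h32(a, b); d = rotr32(a ^ d, R0);
    c = h32(c, d); b = rotr32(c ^ b, R1);
    a = h32(a, b); d = rotr32(a ^ d, R2);
    c = h32(c, d); b = rotr32(c ^ b, R3);
    s[ia] = a; s[ib] = b; s[ic] = c; s[id] = d;
}

/* Algorithm 1: one round F, columns first and then diagonals. */
static void norx_f32(uint32_t s[16]) {
    norx_g32(s, 0, 4,  8, 12);
    norx_g32(s, 1, 5,  9, 13);
    norx_g32(s, 2, 6, 10, 14);
    norx_g32(s, 3, 7, 11, 15);
    norx_g32(s, 0, 5, 10, 15);
    norx_g32(s, 1, 6, 11, 12);
    norx_g32(s, 2, 7,  8, 13);
    norx_g32(s, 3, 4,  9, 14);
}
```

Within one call to G every statement depends on the previous one, so a single G leaves pipeline slots idle. The four G calls of each step are independent of each other, which is what the reorganization on the following slides exploits.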
Code profiling
A synthetic test, consisting of encryptions of random data, was profiled to identify hotspots. roundF is the best target for optimization.
Figure 6: Profiling results
Optimizing the F() function
The G() function can be split and reorganized in order to make better use of the processor's pipeline:
Figure 7: Column and diagonal steps, with two-way pipeline optimization. Notice that each call to function G2() operates over 8 words.
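One way to read the two-way variant is as two G computations whose statements are interleaved, so that each instruction has an independent neighbor to overlap with. The sketch below assumes 32-bit words, the rotation offsets (8, 11, 16, 31) from the NORX specification, and a pairing of columns (0, 1) and (2, 3). The pairing and the names `norx_g2_32` and `norx_f32_2way` are assumptions, not taken from the authors' code.

```c
#include <assert.h>
#include <stdint.h>

enum { R0 = 8, R1 = 11, R2 = 16, R3 = 31 };  /* assumed NORX32 offsets */

static inline uint32_t rotr32(uint32_t x, unsigned r) {
    assert(r > 0 && r < 32);  /* other shift counts would be undefined */
    return (x >> r) | (x << (32 - r));
}

static inline uint32_t h32(uint32_t x, uint32_t y) {
    return (x ^ y) ^ ((x & y) << 1);
}

/* Two independent G computations (lanes x and y) interleaved step by step. */
static void norx_g2_32(uint32_t s[16],
                       int a0, int b0, int c0, int d0,
                       int a1, int b1, int c1, int d1) {
    uint32_t xa = s[a0], xb = s[b0], xc = s[c0], xd = s[d0];
    uint32_t ya = s[a1], yb = s[b1], yc = s[c1], yd = s[d1];

    xa = h32(xa, xb);          ya = h32(ya, yb);
    xd = rotr32(xa ^ xd, R0);  yd = rotr32(ya ^ yd, R0);
    xc = h32(xc, xd);          yc = h32(yc, yd);
    xb = rotr32(xc ^ xb, R1);  yb = rotr32(yc ^ yb, R1);
    xa = h32(xa, xb);          ya = h32(ya, yb);
    xd = rotr32(xa ^ xd, R2);  yd = rotr32(ya ^ yd, R2);
    xc = h32(xc, xd);          yc = h32(yc, yd);
    xb = rotr32(xc ^ xb, R3);  yb = rotr32(yc ^ yb, R3);

    s[a0] = xa; s[b0] = xb; s[c0] = xc; s[d0] = xd;
    s[a1] = ya; s[b1] = yb; s[c1] = yc; s[d1] = yd;
}

/* One round F built from four G2 calls, each covering 8 state words. */
static void norx_f32_2way(uint32_t s[16]) {
    norx_g2_32(s, 0, 4,  8, 12,  1, 5,  9, 13);  /* columns 0 and 1 */
    norx_g2_32(s, 2, 6, 10, 14,  3, 7, 11, 15);  /* columns 2 and 3 */
    norx_g2_32(s, 0, 5, 10, 15,  1, 6, 11, 12);  /* diagonals 0 and 1 */
    norx_g2_32(s, 2, 7,  8, 13,  3, 4,  9, 14);  /* diagonals 2 and 3 */
}
```

The result is the same as applying G to each column or diagonal separately. Only the order in which independent operations are presented to the compiler and the pipeline changes.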
Optimizing the F() function
Or even further, with a single function, G4(), operating on the whole state at once:
Figure 8: Column and diagonal steps, with four-way pipeline optimization, operating over the whole state at once. In the diagonal step, the second, third and fourth rows are used in rotated order (s5 s6 s7 s4, s10 s11 s8 s9, s15 s12 s13 s14), so that each diagonal lines up as a column.
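The four-way variant can be sketched by treating each row of the state as four lanes and applying every step of G to all lanes before moving on. The code below is an illustration under assumptions (32-bit words, rotation offsets from the NORX specification, hypothetical function names). It rotates the rows explicitly to line the diagonals up as columns, whereas a scalar implementation could simply use different indices.

```c
#include <assert.h>
#include <stdint.h>

enum { R0 = 8, R1 = 11, R2 = 16, R3 = 31 };  /* assumed NORX32 offsets */

static inline uint32_t rotr32(uint32_t x, unsigned r) {
    assert(r > 0 && r < 32);  /* other shift counts would be undefined */
    return (x >> r) | (x << (32 - r));
}

static inline uint32_t h32(uint32_t x, uint32_t y) {
    return (x ^ y) ^ ((x & y) << 1);
}

/* G on four lanes at once: rows a, b, c, d hold one word per column. */
static void norx_g4_32(uint32_t a[4], uint32_t b[4],
                       uint32_t c[4], uint32_t d[4]) {
    int i;
    for (i = 0; i < 4; i++) a[i] = h32(a[i], b[i]);
    for (i = 0; i < 4; i++) d[i] = rotr32(a[i] ^ d[i], R0);
    for (i = 0; i < 4; i++) c[i] = h32(c[i], d[i]);
    for (i = 0; i < 4; i++) b[i] = rotr32(c[i] ^ b[i], R1);
    for (i = 0; i < 4; i++) a[i] = h32(a[i], b[i]);
    for (i = 0; i < 4; i++) d[i] = rotr32(a[i] ^ d[i], R2);
    for (i = 0; i < 4; i++) c[i] = h32(c[i], d[i]);
    for (i = 0; i < 4; i++) b[i] = rotr32(c[i] ^ b[i], R3);
}

/* Rotate a row left by n positions: new r[i] = old r[(i + n) mod 4]. */
static void rotate_row(uint32_t r[4], int n) {
    uint32_t t[4];
    int i;
    for (i = 0; i < 4; i++) t[i] = r[(i + n) & 3];
    for (i = 0; i < 4; i++) r[i] = t[i];
}

/* One round F with two G4 calls covering the whole state. */
static void norx_f32_4way(uint32_t s[16]) {
    uint32_t *a = s, *b = s + 4, *c = s + 8, *d = s + 12;
    norx_g4_32(a, b, c, d);                                /* column step */
    rotate_row(b, 1); rotate_row(c, 2); rotate_row(d, 3);  /* diagonals -> columns */
    norx_g4_32(a, b, c, d);                                /* diagonal step */
    rotate_row(b, 3); rotate_row(c, 2); rotate_row(d, 1);  /* restore layout */
}
```

Each inner loop contains four independent operations, which gives the compiler and an in-order pipeline the most room to hide latencies. This lane-wise form is also the shape a SIMD implementation would take.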
Additional optimizations
A few extra steps were taken to improve code performance:
• Extensive use of preprocessor macros and code inlining.
• Avoiding extra or temporary variables, encrypting and decrypting in place.
• Initialization of the sponge via loads of constant values instead of evaluating $F^2(0 \parallel 1 \parallel 2 \parallel \cdots \parallel 15)$.
• Where possible, combining shift and rotate operations with arithmetic ones, so as to allow the use of ARM's barrel shifter. For example, a = a + (b << 2) can compile into ADD r1, r1, r2, LSL #2. The parentheses matter: in C, a + b << 2 parses as (a + b) << 2.
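The barrel-shifter point can be illustrated with a few one-line functions. This is a sketch with hypothetical function names. Whether a given compiler actually emits a single instruction with a shifted operand depends on the target and the optimization level, so the comments describe what is possible rather than guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* a + (b << 2): the shift can be folded into the ADD's second operand. */
static inline uint32_t add_shifted(uint32_t a, uint32_t b) {
    return a + (b << 2);
}

/* What a + b << 2 means without parentheses: the sum is shifted. */
static inline uint32_t add_then_shift(uint32_t a, uint32_t b) {
    return (a + b) << 2;
}

/* NORX non-linear step: the << 1 can ride along with the final XOR,
 * leaving one AND and two XOR-type instructions. */
static inline uint32_t h32(uint32_t x, uint32_t y) {
    return (x ^ y) ^ ((x & y) << 1);
}

/* XOR followed by a right rotation, as used in G. */
static inline uint32_t xor_rotr32(uint32_t x, uint32_t y, unsigned r) {
    assert(r > 0 && r < 32);  /* other shift counts would be undefined */
    uint32_t t = x ^ y;
    return (t >> r) | (t << (32 - r));
}
```

For example, add_shifted(1, 3) is 1 + 12 = 13, while add_then_shift(1, 3) is 4 << 2 = 16.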
Results
Benchmark i
• Benchmarks were carried out on an Odroid XU4 device running Arch Linux for Cortex-A7 and Cortex-A15.
• An Odroid-C2 device was used for tests with the 64-bit Cortex-A53, also running Arch Linux.
• Code was compiled with gcc 6.3.1.
Benchmark ii
• Each test consists of the encryption of random data with lengths between 128 bytes and 1 megabyte, with a 128-bit key for NORX3261 and NORX3264, and a 256-bit key for NORX6461.
• Our tests were also carried out on Cortex-M4, M3 and M0 for consistency.
• All measurements were taken using the processors' cycle counters.