
Pipeline Oriented Implementation of NORX for ARM Processors
Luan Cardoso dos Santos (luan@lasca.ic.unicamp.br), Julio López (jlopez@ic.unicamp.br)
November 7, 2017
Institute of Computing - UNICAMP, LASCA

Table of contents
1. Introduction
2. Target architecture
3. NORX family of AEAD algorithms
4. Pipeline optimization
5. Results
6. Future work

Introduction

Authenticated encryption (with additional data)
• An AEAD scheme is an algorithm that uses a secret key and a public nonce to process a plaintext and additional plain data, and outputs ciphertext and authentication data [Rog02].
• Such a scheme is useful, for example, to encrypt the body of a message, keep a header in plaintext and authenticate the whole.
Figure 1: Basic block design of an AEAD.

Authenticated encryption (with additional data)
Formally:
• An AEAD scheme is defined by $\Pi = (\mathcal{K}, \mathcal{E}, \mathcal{D})$ and the associated sets $\mathrm{Nonce} = \{0,1\}^n$, $\mathrm{Header} \subset \{0,1\}^*$ and $\mathrm{Message} \subseteq \{0,1\}^*$.
• The keyspace $\mathcal{K}$ is a non-empty set of strings.
• The message $M \in \mathrm{Message}$; the nonce $N \in \mathrm{Nonce}$; the header $H \in \mathrm{Header}$.
• The encryption algorithm $\mathcal{E}_K^{N,H}(M) \to C$.
• The decryption algorithm $\mathcal{D}_K^{N,H}(C) \to \{M, \bot\}$.
• It is required that $\mathcal{D}_K^{N,H}(\mathcal{E}_K^{N,H}(M)) = M$ for all $K \in \mathcal{K}$, $N$, $H$ and $M$.
• And $|\mathcal{E}_K^{N,H}(M)| = \ell(|M|)$ for some linear-time length function $\ell$.

Cryptographic competitions: CAESAR
• CAESAR (2013-present) stands for "Competition for Authenticated Encryption: Security, Applicability, and Robustness" [CAE13].
• CAESAR aims to select a portfolio of AEAD ciphers that are suited for widespread adoption and offer advantages over NIST's AES-GCM.
• Following in the footsteps of other cryptographic competitions, such as SHA-3 (2007-2012), AES (1997-2000) and eSTREAM (2004-2008), CAESAR also aims to promote research on AEAD algorithms.

Cryptographic sponges
• A cryptographic sponge function is an algorithm with a finite internal state that receives input strings of any length and produces an output of any desired length [BDPA11].
• Sponges can be used to create hash functions, MACs, stream ciphers, RNGs and AEAD schemes.
Figure 2: The basic design of a sponge function [BDPA11].
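The absorb/squeeze idea can be sketched in a few lines of C. This is a toy under loud assumptions: the 4-word state, the one-word rate, the placeholder permutation `toy_permute` and all function names are hypothetical, padding is omitted, and nothing here is NORX or the construction from [BDPA11] in full.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STATE_WORDS 4 /* toy sizes (assumed): 1 rate word + 3 capacity words */

static uint32_t rotl32(uint32_t x, unsigned n) { return (x << n) | (x >> (32 - n)); }

/* Placeholder permutation for illustration only; not a vetted primitive. */
static void toy_permute(uint32_t s[STATE_WORDS]) {
    for (int round = 0; round < 8; round++) {
        s[0] += s[1]; s[3] = rotl32(s[3] ^ s[0], 16);
        s[2] += s[3]; s[1] = rotl32(s[1] ^ s[2], 12);
        s[0] += s[1]; s[3] = rotl32(s[3] ^ s[0], 8);
        s[2] += s[3]; s[1] = rotl32(s[1] ^ s[2], 7);
    }
}

/* Absorb n_in words into the rate, then squeeze n_out words out of it.
   Padding is left out, so this is not a secure hash. */
void toy_sponge(const uint32_t *in, size_t n_in, uint32_t *out, size_t n_out) {
    uint32_t s[STATE_WORDS] = {0, 0, 0, 0};
    for (size_t i = 0; i < n_in; i++) {  /* absorbing phase */
        s[0] ^= in[i];
        toy_permute(s);
    }
    for (size_t i = 0; i < n_out; i++) { /* squeezing phase */
        out[i] = s[0];
        toy_permute(s);
    }
}

/* Usage sketch: asking for more output extends the shorter output. */
void toy_sponge_demo(void) {
    const uint32_t msg[3] = {1, 2, 3};
    uint32_t short_out[1], long_out[4];
    toy_sponge(msg, 3, short_out, 1);
    toy_sponge(msg, 3, long_out, 4);
    assert(short_out[0] == long_out[0]);
}
```

Only the rate word is ever touched by input or output; the remaining words act as hidden capacity, which is the property the sponge design relies on.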

Target architecture

ARM processors
• The Advanced RISC Machine is a mainly 32-bit architecture owned by the British company ARM Holdings.
• With more than 100 billion chips deployed up to 2017, it is one of the most widespread architectures today.¹
• ARM follows a load/store architecture, and most instructions execute in a single clock cycle.
• In this work, we focused on the Cortex-A family: Cortex-A7, Cortex-A15 and Cortex-A53.
¹ https://community.arm.com/processors/b/blog/posts/inside-the-numbers-100-billion-arm-based-chips-1345571105

ARM processors: Target cores i
• Cortex-A7: The most efficient ARMv7-A core, with over a billion shipped units. It is capable of 40-bit physical addressing and features an eight-stage in-order pipeline. It can be paired with high-performance cores in big.LITTLE configurations.
• Cortex-A15: A high-performance ARMv7-A core, well suited to consumer items such as smartphones and to embedded applications. As with other processors of the same line, it is capable of 40-bit physical addressing. It also features a fifteen-stage out-of-order superscalar pipeline for integer calculations.

ARM processors: Target cores ii
• Cortex-A53: An ARMv8-A core capable of seamlessly running both 32-bit and 64-bit code, designed as an efficient 64-bit core with a low area and power footprint. Like the Cortex-A7, it can be deployed together with high-end CPUs in chips with heterogeneous cores. The Cortex-A53 uses an efficient eight-stage, 2-way superscalar, in-order pipeline.
Our tests were also carried out on Cortex-M4, M3 and M0, for completeness.

NORX family of AEAD algorithms

NORX AEAD
• NORX is a family of AEAD algorithms, currently in the third round of CAESAR.
• Based on a sponge design, it is a simple yet fast algorithm, optimized for both 32-bit and 64-bit architectures.
• The design of NORX also allows for arbitrary parallelism in the payload processing.
• Based on ARX² primitives, NORX is optimized for both software and hardware implementations, with a SIMD-friendly core permutation and no secret-dependent memory access.
² Addition-Rotation-Xor

NORX AEAD
• The naming convention for NORX is NORXwlpt, where:
  • w is the bit size of the words in the internal state.
  • l is the number of rounds.
  • p is the parallelism degree.
  • t is the bit length of the authentication tag. When t = 4w, it is omitted.
• The key length of NORX is k = 4w. Therefore, the 32-bit algorithm has a security level of 128 bits, while the 64-bit algorithm has a security level of 256 bits.
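A minimal sketch of how these parameters relate, assuming a hypothetical `struct norx_params` and helper names that are not part of any NORX code base. The name formatting follows the unhyphenated spelling used on these slides (for example NORX3261), and the 16-word state size comes from the core permutation slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical parameter record; fields mirror the slide's w, l, p, t. */
struct norx_params {
    unsigned w; /* word size in bits */
    unsigned l; /* number of rounds */
    unsigned p; /* parallelism degree */
    unsigned t; /* tag size in bits */
};

/* Key length k = 4w, as stated on the slide. */
unsigned norx_key_bits(const struct norx_params *prm) { return 4 * prm->w; }

/* The state holds 16 words of w bits each. */
unsigned norx_state_bits(const struct norx_params *prm) { return 16 * prm->w; }

/* Build the instance name; the tag size is omitted when t = 4w. */
int norx_name(const struct norx_params *prm, char *buf, size_t len) {
    if (prm->t == 4 * prm->w)
        return snprintf(buf, len, "NORX%u%u%u", prm->w, prm->l, prm->p);
    return snprintf(buf, len, "NORX%u%u%u%u", prm->w, prm->l, prm->p, prm->t);
}

/* Usage sketch for the 32-bit, 6-round, sequential instance. */
void norx_params_demo(void) {
    const struct norx_params n32 = {32, 6, 1, 128};
    char name[32];
    norx_name(&n32, name, sizeof name);
    assert(strcmp(name, "NORX3261") == 0);
    assert(norx_key_bits(&n32) == 128);
    assert(norx_state_bits(&n32) == 512);
}
```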

NORX's mode of operation i
The state is transformed in each step of the cipher using a non-linear permutation $F^l$.
Figure 3: The layout of NORX [AJN14].

NORX's mode of operation ii
Figure 4: The layout of NORX, with parallel payload processing [AJN14].

NORX's core permutation
• The core of NORX is a 16-word internal state S, which can be viewed as a 4 × 4 matrix:
$$S = \begin{pmatrix} s_0 & s_1 & s_2 & s_3 \\ s_4 & s_5 & s_6 & s_7 \\ s_8 & s_9 & s_{10} & s_{11} \\ s_{12} & s_{13} & s_{14} & s_{15} \end{pmatrix}$$

Pipeline optimization

Original permutation
The permutation can be visually represented as:
Figure 5: Column (left) and diagonal (right) steps. Source: NORX v3.0 specification [AJN14].

Original permutation
The NORX permutation is subdivided into a function G(), applied to the columns and diagonals of S:
Algorithm 1 NORX F round function
function F
    input: S, G()    ▷ NORX state s_0 ... s_15 and G() function
    s_0, s_4, s_8, s_12 ← G(s_0, s_4, s_8, s_12)    ▷ Processing the columns
    s_1, s_5, s_9, s_13 ← G(s_1, s_5, s_9, s_13)
    s_2, s_6, s_10, s_14 ← G(s_2, s_6, s_10, s_14)
    s_3, s_7, s_11, s_15 ← G(s_3, s_7, s_11, s_15)
    s_0, s_5, s_10, s_15 ← G(s_0, s_5, s_10, s_15)    ▷ Processing the diagonals
    s_1, s_6, s_11, s_12 ← G(s_1, s_6, s_11, s_12)
    s_2, s_7, s_8, s_13 ← G(s_2, s_7, s_8, s_13)
    s_3, s_4, s_9, s_14 ← G(s_3, s_4, s_9, s_14)
    output: S
end function

Original permutation
With G(a, b, c, d) being defined as:
Algorithm 2 NORX G permutation function
function G
    input: a, b, c, d    ▷ Four words of the state
    a ← (a ⊕ b) ⊕ ((a ∧ b) ≪ 1)
    d ← (a ⊕ d) ≫ r_0
    c ← (c ⊕ d) ⊕ ((c ∧ d) ≪ 1)
    b ← (c ⊕ b) ≫ r_1
    a ← (a ⊕ b) ⊕ ((a ∧ b) ≪ 1)
    d ← (a ⊕ d) ≫ r_2
    c ← (c ⊕ d) ⊕ ((c ∧ d) ≪ 1)
    b ← (c ⊕ b) ≫ r_3
    output: a, b, c, d
end function
Here ≫ r_i denotes a right rotation by r_i bits.
How can we improve this permutation?
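Algorithms 1 and 2 translate almost line by line into C. The sketch below targets 32-bit words and is a straightforward reference version, not the optimized code from this work. The function names are hypothetical, and the rotation offsets (8, 11, 16, 31) are taken from the NORX specification for 32-bit words, since the slides keep r_0 to r_3 symbolic.

```c
#include <assert.h>
#include <stdint.h>

/* Rotation offsets for 32-bit words (assumption: values from the NORX spec). */
enum { R0 = 8, R1 = 11, R2 = 16, R3 = 31 };

static inline uint32_t rotr32(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }

/* Non-linear step of G: (x ^ y) ^ ((x & y) << 1). */
static inline uint32_t h32(uint32_t x, uint32_t y) { return (x ^ y) ^ ((x & y) << 1); }

/* Algorithm 2: G over four words of the state, updated in place. */
void norx_g(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d) {
    *a = h32(*a, *b); *d = rotr32(*a ^ *d, R0);
    *c = h32(*c, *d); *b = rotr32(*c ^ *b, R1);
    *a = h32(*a, *b); *d = rotr32(*a ^ *d, R2);
    *c = h32(*c, *d); *b = rotr32(*c ^ *b, R3);
}

/* Algorithm 1: one round F, columns first and then diagonals. */
void norx_f(uint32_t s[16]) {
    norx_g(&s[0], &s[4], &s[8],  &s[12]); /* columns */
    norx_g(&s[1], &s[5], &s[9],  &s[13]);
    norx_g(&s[2], &s[6], &s[10], &s[14]);
    norx_g(&s[3], &s[7], &s[11], &s[15]);
    norx_g(&s[0], &s[5], &s[10], &s[15]); /* diagonals */
    norx_g(&s[1], &s[6], &s[11], &s[12]);
    norx_g(&s[2], &s[7], &s[8],  &s[13]);
    norx_g(&s[3], &s[4], &s[9],  &s[14]);
}

/* Usage sketch: the all-zero state is a fixed point of F. */
void norx_f_demo(void) {
    uint32_t s[16] = {0};
    norx_f(s);
    for (int i = 0; i < 16; i++)
        assert(s[i] == 0);
}
```

Each call to `norx_g` is a chain of eight dependent operations, which is the property the pipeline-oriented variants try to hide.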

Code profiling
A synthetic test, encrypting random data, was profiled to identify hotspots. roundF is the best target for optimization.
Figure 6: Profiling results

Optimizing the F() function
The G() function can be split and reorganized to make better use of the processor's pipeline:
Figure 7: Column and diagonal steps, with two-way pipeline optimization. Notice that each call to function G2() operates over 8 words.
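One way to read the two-way variant is as two independent G computations whose statements are interleaved, so that consecutive instructions do not depend on each other. The sketch below assumes 32-bit words and the specification's rotation offsets (8, 11, 16, 31); `g1`, `g2` and the use of local variables are illustrative choices, not the authors' code.

```c
#include <assert.h>
#include <stdint.h>

/* Rotation offsets for 32-bit words (assumption: values from the NORX spec). */
enum { R0 = 8, R1 = 11, R2 = 16, R3 = 31 };

static inline uint32_t rotr32(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }
static inline uint32_t h32(uint32_t x, uint32_t y) { return (x ^ y) ^ ((x & y) << 1); }

/* Reference: a single G over v = (a, b, c, d). */
void g1(uint32_t v[4]) {
    uint32_t a = v[0], b = v[1], c = v[2], d = v[3];
    a = h32(a, b); d = rotr32(a ^ d, R0);
    c = h32(c, d); b = rotr32(c ^ b, R1);
    a = h32(a, b); d = rotr32(a ^ d, R2);
    c = h32(c, d); b = rotr32(c ^ b, R3);
    v[0] = a; v[1] = b; v[2] = c; v[3] = d;
}

/* G2: two G computations over 8 words, interleaved step by step.
   Each statement is independent of the one right before it. */
void g2(uint32_t x[4], uint32_t y[4]) {
    uint32_t a0 = x[0], b0 = x[1], c0 = x[2], d0 = x[3];
    uint32_t a1 = y[0], b1 = y[1], c1 = y[2], d1 = y[3];
    a0 = h32(a0, b0);          a1 = h32(a1, b1);
    d0 = rotr32(a0 ^ d0, R0);  d1 = rotr32(a1 ^ d1, R0);
    c0 = h32(c0, d0);          c1 = h32(c1, d1);
    b0 = rotr32(c0 ^ b0, R1);  b1 = rotr32(c1 ^ b1, R1);
    a0 = h32(a0, b0);          a1 = h32(a1, b1);
    d0 = rotr32(a0 ^ d0, R2);  d1 = rotr32(a1 ^ d1, R2);
    c0 = h32(c0, d0);          c1 = h32(c1, d1);
    b0 = rotr32(c0 ^ b0, R3);  b1 = rotr32(c1 ^ b1, R3);
    x[0] = a0; x[1] = b0; x[2] = c0; x[3] = d0;
    y[0] = a1; y[1] = b1; y[2] = c1; y[3] = d1;
}

/* Usage sketch: G2 gives the same result as two separate G calls. */
void g2_demo(void) {
    uint32_t x[4] = {1, 2, 3, 4}, y[4] = {5, 6, 7, 8};
    uint32_t xr[4] = {1, 2, 3, 4}, yr[4] = {5, 6, 7, 8};
    g2(x, y);
    g1(xr);
    g1(yr);
    for (int i = 0; i < 4; i++) {
        assert(x[i] == xr[i]);
        assert(y[i] == yr[i]);
    }
}
```

The interleaving changes only the instruction order, not the result, because the two columns (or diagonals) share no words.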

Optimizing the F() function
Or even further, with a single G() function operating on the whole state at once:
Figure 8: Column and diagonal steps, with four-way pipeline optimization, operating over the whole state at once.
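The four-way variant can be sketched with macros that apply each step of G to all four columns (or all four diagonals) before moving on to the next step. The macro names `H4`, `X4` and `G4`, the 32-bit word size and the rotation offsets (8, 11, 16, 31, from the NORX specification) are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rotation offsets for 32-bit words (assumption: values from the NORX spec). */
enum { R0 = 8, R1 = 11, R2 = 16, R3 = 31 };

static inline uint32_t rotr32(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }
static inline uint32_t h32(uint32_t x, uint32_t y) { return (x ^ y) ^ ((x & y) << 1); }

/* Four lanes of x <- H(x, y). */
#define H4(x0, y0, x1, y1, x2, y2, x3, y3) do { \
    x0 = h32(x0, y0); x1 = h32(x1, y1); x2 = h32(x2, y2); x3 = h32(x3, y3); } while (0)
/* Four lanes of x <- (x ^ y) rotated right by r. */
#define X4(x0, y0, x1, y1, x2, y2, x3, y3, r) do { \
    x0 = rotr32(x0 ^ y0, r); x1 = rotr32(x1 ^ y1, r); \
    x2 = rotr32(x2 ^ y2, r); x3 = rotr32(x3 ^ y3, r); } while (0)

/* G4: the four G computations of one step, interleaved operation by operation. */
#define G4(s, a0, a1, a2, a3, b0, b1, b2, b3, c0, c1, c2, c3, d0, d1, d2, d3) do { \
    H4(s[a0], s[b0], s[a1], s[b1], s[a2], s[b2], s[a3], s[b3]);     \
    X4(s[d0], s[a0], s[d1], s[a1], s[d2], s[a2], s[d3], s[a3], R0); \
    H4(s[c0], s[d0], s[c1], s[d1], s[c2], s[d2], s[c3], s[d3]);     \
    X4(s[b0], s[c0], s[b1], s[c1], s[b2], s[c2], s[b3], s[c3], R1); \
    H4(s[a0], s[b0], s[a1], s[b1], s[a2], s[b2], s[a3], s[b3]);     \
    X4(s[d0], s[a0], s[d1], s[a1], s[d2], s[a2], s[d3], s[a3], R2); \
    H4(s[c0], s[d0], s[c1], s[d1], s[c2], s[d2], s[c3], s[d3]);     \
    X4(s[b0], s[c0], s[b1], s[c1], s[b2], s[c2], s[b3], s[c3], R3); } while (0)

/* One round F over the whole state: column step, then diagonal step. */
void norx_f4(uint32_t s[16]) {
    G4(s, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15); /* columns */
    G4(s, 0, 1, 2, 3, 5, 6, 7, 4, 10, 11, 8, 9, 15, 12, 13, 14); /* diagonals */
}

/* Sequential reference, following Algorithms 1 and 2. */
static void g_ref(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d) {
    *a = h32(*a, *b); *d = rotr32(*a ^ *d, R0);
    *c = h32(*c, *d); *b = rotr32(*c ^ *b, R1);
    *a = h32(*a, *b); *d = rotr32(*a ^ *d, R2);
    *c = h32(*c, *d); *b = rotr32(*c ^ *b, R3);
}
void norx_f_ref(uint32_t s[16]) {
    for (int i = 0; i < 4; i++) /* columns */
        g_ref(&s[i], &s[4 + i], &s[8 + i], &s[12 + i]);
    for (int i = 0; i < 4; i++) /* diagonals */
        g_ref(&s[i], &s[4 + (i + 1) % 4], &s[8 + (i + 2) % 4], &s[12 + (i + 3) % 4]);
}

/* Usage sketch: both versions agree on an arbitrary test state. */
void norx_f4_demo(void) {
    uint32_t s[16], r[16];
    for (uint32_t i = 0; i < 16; i++)
        s[i] = i * 0x9E3779B9u + 1u;
    memcpy(r, s, sizeof s);
    norx_f4(s);
    norx_f_ref(r);
    assert(memcmp(s, r, sizeof s) == 0);
}
```

The second `G4` call encodes the diagonal step purely through its index arguments, which matches the rotated rows shown in Figure 8.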

Additional optimizations
A few extra steps were taken to improve code performance:
• Extensive use of preprocessor macros and code inlining.
• Avoiding extra or temporary variables, encrypting and decrypting in place.
• Initialization of the sponge via loads of constant values instead of evaluating $F^2(0 \| 1 \| 2 \| \cdots \| 15)$.
• Where possible, combine shift and rotate operations with arithmetic ones, so that ARM's barrel shifter can be used. For example, a = a + (b << 2) can compile into ADD r1, r1, r2, LSL #2. The parentheses matter: in C, a + b << 2 parses as (a + b) << 2.
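The barrel-shifter point can be illustrated with two small functions. The function names are hypothetical, and the assembly mentioned in the comments is only what a compiler may emit; the actual output depends on the compiler, target and optimization flags.

```c
#include <assert.h>
#include <stdint.h>

/* Shift folded into the add. On ARM this may become a single
   ADD r0, r0, r1, LSL #2 (not guaranteed). */
uint32_t add_shifted(uint32_t a, uint32_t b) { return a + (b << 2); }

/* The non-linear step used by NORX's G function. The << 1 can ride on the
   final XOR as a shifted operand, e.g. EOR r0, r0, r2, LSL #1, so no
   separate shift instruction is needed (again, compiler dependent). */
uint32_t norx_h(uint32_t a, uint32_t b) { return (a ^ b) ^ ((a & b) << 1); }

/* Usage sketch, including the operator-precedence pitfall. */
void barrel_shifter_demo(void) {
    uint32_t a = 1, b = 3;
    assert(add_shifted(a, b) == 13); /* 1 + (3 << 2) */
    assert((a + b << 2) == 16);      /* parses as (1 + 3) << 2 */
    assert(norx_h(1, 1) == 2);       /* (1 ^ 1) ^ ((1 & 1) << 1) */
}
```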

Results

Benchmark i
• Benchmarks were carried out on an Odroid XU4 device running Arch Linux for Cortex-A7 and Cortex-A15.
• An Odroid-C2 device was used for tests with the 64-bit Cortex-A53, also running Arch Linux.
• Code was compiled with gcc 6.3.1.

Benchmark ii
• Each test consists of the encryption of random data with lengths between 128 bytes and 1 megabyte, with a 128-bit key for NORX3261 and NORX3264, and a 256-bit key for NORX6461.
• Our tests were also carried out on Cortex-M4, M3 and M0 for consistency.
• All measurements were taken using the processors' cycle counter.
