Efficient Algorithms in Software
Julio López
jlopez@ic.unicamp.br
Institute of Computing, University of Campinas
September 2017, Habana, Cuba.
ASCrypto 2017
Agenda
1 Efficient Software Implementations
    Software Efficiency
    Parallel Computation - SIMD
2 Symmetric-Key Cryptography
    Data Encryption
    Hash Functions
    SHA2 Implementation
    SHA3 Implementation
3 Elliptic Curve Cryptography
    Elliptic Curves
    Elliptic Curve Diffie-Hellman
    Digital Signatures
    EdDSA Scheme
Julio López (IC-UNICAMP) Efficient Algorithms in Software ASCrypto 2017 2 / 83
Efficient Software Implementations Software Efficiency
The optimization of a software implementation of a cryptographic algorithm is a task with several goals:
characteristics
Sometimes these goals conflict with each other. For example, accelerating an operation with look-up tables increases code size, and it can leave the implementation vulnerable to cache-timing attacks (if not implemented carefully).
between different computers; instead, clock cycles are measured.
processor.
#include <stdint.h>

/* Read the x86 time-stamp counter (RDTSC returns it in EDX:EAX). */
uint64_t get_cycles(void) {
    uint32_t lo, hi;
    asm volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
recommended to turn off technologies such as Turbo Boost or Hyper-Threading.
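Individual cycle readings are noisy, so a measurement is usually repeated and summarized. The sketch below is one possible way to turn a set of readings into a cycles-per-byte figure; taking the median is an assumption made here, not something the slides prescribe, and the names median_cycles and cycles_per_byte are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Comparison callback for qsort over uint64_t samples. */
static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Median of n cycle counts; sorts the samples array in place. */
uint64_t median_cycles(uint64_t *samples, size_t n) {
    assert(n > 0);
    qsort(samples, n, sizeof samples[0], cmp_u64);
    return samples[n / 2];
}

/* Cycles per byte for a routine that processed `bytes` bytes in each run. */
double cycles_per_byte(uint64_t *samples, size_t n, size_t bytes) {
    assert(bytes > 0);
    return (double)median_cycles(samples, n) / (double)bytes;
}
```

Each sample would be the difference of two counter readings taken around one call of the routine under test. The median discards outliers caused by interrupts or cache misses on the first run.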
Efficient Software Implementations Parallel Computation -SIMD
instruction is applied simultaneously over a set of data.
registers, also known as vector registers.
Instructions that operate on vector registers are known as vector instructions. They process several words packed in a vector register at once.
[Figure: timeline of SIMD instruction-set extensions, 1994 to 2020.]
Instruction sets: MMX, SSE, SSE2, SSE3, SSE4, AES-NI + CLMUL, AVX, AVX2, BMI, SHA1-SHA2, AVX-512.
Functionality: Integer Arithmetic, Floating-point Arithmetic, String Manipulation, Cryptography, Bit Manipulation.
Registers: MMX (64 bits), XMM (128 bits), YMM (256 bits), ZMM (512 bits).
Integer arithmetic for 64-bit words: C = ADD(A, B), with c_i = a_i + b_i for 0 ≤ i < 4.
Variable logic shifts: C = VSHL(A, B), with c_i = a_i ≪ b_i.
Permutation of words: C = PERM(A, M), with c_i = a_{m_i} and m_i ∈ {0, 1, 2, 3}.
Combination/selection of registers: C = BLEND(A, B, M), where each mask bit (0/1) selects c_i from a_i or b_i.
without dependencies.
Full documentation available at: http://software.intel.com/sites/landingpage/IntrinsicsGuide
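The four vector operations (ADD, VSHL, PERM, BLEND) can be modeled in portable C to make their lane-wise semantics concrete. This is only a scalar sketch: the type vec4 and the function names are hypothetical, real code would use the intrinsics from the guide above, and the conventions that shift counts above 63 give zero and that a set mask bit selects the lane of B are assumptions of this model.

```c
#include <assert.h>
#include <stdint.h>

/* A 256-bit vector modeled as four 64-bit lanes (lane 0 is the lowest). */
typedef struct { uint64_t w[4]; } vec4;

/* C = ADD(A, B): lane-wise addition modulo 2^64. */
vec4 vec_add(vec4 a, vec4 b) {
    vec4 c;
    for (int i = 0; i < 4; i++) c.w[i] = a.w[i] + b.w[i];
    return c;
}

/* C = VSHL(A, B): each lane of A is shifted left by the count in B's lane. */
vec4 vec_vshl(vec4 a, vec4 b) {
    vec4 c;
    for (int i = 0; i < 4; i++) c.w[i] = (b.w[i] < 64) ? a.w[i] << b.w[i] : 0;
    return c;
}

/* C = PERM(A, M): lane i of C is lane m[i] of A, with m[i] in {0, 1, 2, 3}. */
vec4 vec_perm(vec4 a, const uint8_t m[4]) {
    vec4 c;
    for (int i = 0; i < 4; i++) {
        assert(m[i] < 4);
        c.w[i] = a.w[m[i]];
    }
    return c;
}

/* C = BLEND(A, B, M): lane i of C is b_i if mask bit i is set, else a_i. */
vec4 vec_blend(vec4 a, vec4 b, unsigned mask) {
    vec4 c;
    for (int i = 0; i < 4; i++) c.w[i] = ((mask >> i) & 1) ? b.w[i] : a.w[i];
    return c;
}
```

In each function the four loop iterations are independent of one another, which is what lets a single vector instruction process all lanes simultaneously.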
The Skylake processor has eight execution ports for instructions. This improves the Instruction-Level Parallelism (ILP).
Symmetric-Key Cryptography Data Encryption
channel.
interchanged by Alice and Bob.
Using a secret key k, Alice and Bob can exchange encrypted messages. Charles cannot read the messages without knowledge of the key k.
Key generation produces the shared key k.
Encryption: C = E_k(M).
Decryption: M = D_k(C).
symmetric key.
128-bit ciphertext (C) using a key k.
AES M C k
algorithms:
AES keeps track of a 128-bit state, which can be seen as a 4 × 4 matrix of bytes.
In each round, AES applies a series of transformations over the matrix, using the round keys k_0, ..., k_Nr. The number of rounds is Nr = 10 if |k| = 128, Nr = 12 if |k| = 192, and Nr = 14 if |k| = 256. After Nr rounds, the last state is returned as the ciphertext.
For decryption, transformations are inverted and applied in reverse order.
$$p_e = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}, \qquad c' = p_e \otimes c = M_e \otimes c$$

$$\begin{pmatrix} c'_0 \\ c'_1 \\ c'_2 \\ c'_3 \end{pmatrix} =
\begin{pmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix}$$
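$$p_d = \{0b\}x^3 + \{0d\}x^2 + \{09\}x + \{0e\}, \qquad c' = p_d \otimes c = M_d \otimes c$$

$$\begin{pmatrix} c'_0 \\ c'_1 \\ c'_2 \\ c'_3 \end{pmatrix} =
\begin{pmatrix} 0e & 0b & 0d & 09 \\ 09 & 0e & 0b & 0d \\ 0d & 09 & 0e & 0b \\ 0b & 0d & 09 & 0e \end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix}$$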
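The two matrix products above are the AES MixColumns and InvMixColumns transformations, with entries multiplied in GF(2^8). A minimal scalar sketch follows, assuming the standard AES reduction polynomial x^8 + x^4 + x^3 + x + 1; the function names are hypothetical and the code favors clarity over speed (it is not constant-time).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00)); /* a times {02} */
        b >>= 1;
    }
    return r;
}

/* Multiply a column by the circulant matrix whose first row is `row`,
   i.e. M[i][j] = row[(j - i) mod 4]. */
static void mul_circulant(const uint8_t row[4], uint8_t col[4]) {
    uint8_t out[4];
    for (int i = 0; i < 4; i++) {
        out[i] = 0;
        for (int j = 0; j < 4; j++)
            out[i] ^= gf_mul(row[(j - i + 4) % 4], col[j]);
    }
    for (int i = 0; i < 4; i++) col[i] = out[i];
}

/* c' = M_e (x) c, applied in place to one state column. */
void mix_column(uint8_t col[4]) {
    static const uint8_t me[4] = {0x02, 0x03, 0x01, 0x01};
    assert(col != NULL);
    mul_circulant(me, col);
}

/* c' = M_d (x) c, the inverse transformation. */
void inv_mix_column(uint8_t col[4]) {
    static const uint8_t md[4] = {0x0e, 0x0b, 0x0d, 0x09};
    assert(col != NULL);
    mul_circulant(md, col);
}
```

Applying mix_column to the column (db, 13, 53, 45) gives (8e, 4d, a1, bc), and inv_mix_column maps it back, which is a quick way to check that the two matrices are inverses of each other.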
In 2010, Intel released a set of instructions to perform the AES algorithm.
[Figure: AES-NI encryption flow.] Plaintext → AddRoundKey → (SubBytes, ShiftRows, MixColumns, AddRoundKey) repeated Nr − 1 times, each repetition being one AESENC → (SubBytes, ShiftRows, AddRoundKey), performed by AESENCLAST → Ciphertext.
[Figure: AES-NI decryption flow.] The same structure with the inverse transformations: AddRoundKey, then (InvSubBytes, InvShiftRows, AddRoundKey, InvMixColumns) repeated Nr − 1 times with AESDEC, then (InvSubBytes, InvShiftRows, AddRoundKey) with AESDECLAST.
Encrypting a 128-bit block (stored in xmm15) using the key schedule (stored in xmm0-xmm10). Nr = 10.
    MOVDQA     xmm15, (%rsi)    ; Load message block
    PXOR       xmm15, xmm0      ; AddRoundKey
    AESENC     xmm15, xmm1      ; Round 1
    AESENC     xmm15, xmm2      ; Round 2
    AESENC     xmm15, xmm3      ; Round 3
    AESENC     xmm15, xmm4      ; Round 4
    AESENC     xmm15, xmm5      ; Round 5
    AESENC     xmm15, xmm6      ; Round 6
    AESENC     xmm15, xmm7      ; Round 7
    AESENC     xmm15, xmm8      ; Round 8
    AESENC     xmm15, xmm9      ; Round 9
    AESENCLAST xmm15, xmm10     ; Round 10
    MOVDQA     (%rdi), xmm15    ; Store cipher block
Analogously, for decryption use AESDEC, AESDECLAST and invert the key schedule using AESIMC.
Splitting a long message into 128-bit blocks and encrypting each block independently (ECB mode) is not secure! Modes of operation are used for encrypting arbitrary-length messages using a block cipher as a building block.
CBC mode, with C_0 = IV:
Encryption (sequential execution): C_i = E_k(P_i ⊕ C_{i−1}).
Decryption (parallel execution): P_i = D_k(C_i) ⊕ C_{i−1}.
CTR mode:
Encryption: C_i = P_i ⊕ E_k(IV + i).
Decryption: P_i = C_i ⊕ E_k(IV + i).
Both encryption and decryption can be executed in parallel. Only the encryption direction of the block cipher is used.
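The data dependencies of the CBC and CTR modes can be seen directly in code. The sketch below uses a toy 64-bit keyed permutation (toy_E and toy_D, hypothetical and with no security at all) as a stand-in for AES, so that only the structure of the modes is illustrated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t rotl64(uint64_t x, unsigned n) { return (x << n) | (x >> (64 - n)); }
static uint64_t rotr64(uint64_t x, unsigned n) { return (x >> n) | (x << (64 - n)); }

/* Toy keyed permutation standing in for the block cipher; NOT secure. */
uint64_t toy_E(uint64_t k, uint64_t m) { return rotl64(m + k, 17) ^ k; }
uint64_t toy_D(uint64_t k, uint64_t c) { return rotr64(c ^ k, 17) - k; }

/* CBC encryption: block i needs C[i-1], so the loop is inherently sequential. */
void cbc_encrypt(uint64_t k, uint64_t iv, const uint64_t *p, uint64_t *c, size_t n) {
    uint64_t prev = iv;
    for (size_t i = 0; i < n; i++) {
        c[i] = toy_E(k, p[i] ^ prev);
        prev = c[i];
    }
}

/* CBC decryption: each iteration reads only ciphertext, so iterations
   are independent. Input and output buffers must not alias. */
void cbc_decrypt(uint64_t k, uint64_t iv, const uint64_t *c, uint64_t *p, size_t n) {
    assert(c != p);
    for (size_t i = 0; i < n; i++)
        p[i] = toy_D(k, c[i]) ^ (i ? c[i - 1] : iv);
}

/* CTR: one routine both encrypts and decrypts, using only toy_E;
   iterations are independent of each other. */
void ctr_crypt(uint64_t k, uint64_t iv, const uint64_t *in, uint64_t *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] ^ toy_E(k, iv + (uint64_t)i + 1);
}
```

Only cbc_encrypt carries a value from one iteration to the next. The other two loops can be unrolled or interleaved freely, which is what makes them candidates for instruction-level parallelism.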
The performance is determined by the latency of the AESENC instruction.
[Figure: clock-cycle timeline of dependent AESENC instructions, annotated with the instruction latency.]

µ-arch          Latency (cycles)   CBC-ENC (cycles/byte)
Intel Haswell   7                  4.49
Intel Skylake   4                  2.71
AMD Zen         4                  2.44
The execution of AESENC instruction can be overlapped with other instructions of the same type.
[Figure: clock-cycle timeline of w = 4 independent streams of AESENC instructions overlapped in the pipeline, annotated with latency and throughput.]
The processor's pipeline improves the performance of the CBC-DEC and CTR modes.
[Figure: running time (cycles per byte) on Haswell, Skylake and Zen for w = 1, 2, 4.]
Scheduling w = 4 independent AES-NI instructions improves the performance of decryption. Can we do better?
[Figure: the same plot extended with w = 8.]
Yes! Zen has two execution units for AES-NI instructions.
[Figure: running time (cycles per byte) on Haswell, Skylake and Zen for sequential execution and w = 2, 4, 8.]

µ-arch          Latency (cycles)   CBC-ENC   CBC-DEC   CTR     (cycles/byte)
Intel Haswell   7                  4.49      0.63      0.74
Intel Skylake   4                  2.71      0.62      0.62
AMD Zen         4                  2.44      0.37      0.39
Symmetric-Key Cryptography Hash Functions
A hash function maps an arbitrary-length bit string into an n-bit string: $h : \{0, 1\}^* \to \{0, 1\}^n$. The output of a hash function is called the digest or hash value.
1st pre-image. Given a hash value r, it should be difficult to find any message M such that r = h(M).
2nd pre-image. Given an input M1, it should be difficult to find a different input M2 such that h(M1) = h(M2).
Collision resistance. It should be difficult to find two different messages M1 and M2 such that h(M1) = h(M2).
There is a large number of applications of cryptographic hash functions:
1993: SHA-0, Secure Hash Algorithm (160 bits).
1995: SHA-1, output: 160 bits.
2001: SHA-2, output: 224, 256, 384, 512 bits.
2015: SHA-3 (Keccak), output: 224, 256, 384, 512 bits.
2015: SHA-3 (SHAKE128, SHAKE256),
Symmetric-Key Cryptography SHA2 Implementation
SHA2-256 operates as follows.
M1, . . . , Mn.
Sj = Update(Sj−1, Mj) for 1 ≤ j ≤ n
Update consists of two phases:
1 Message Schedule. 2 State Update.
Let $w_0, \dots, w_{15}$ be the message block $M_j$ split into 16 words of 32 bits. The message schedule then calculates 48 new words:

$$w_i \leftarrow \sigma_0(w_{i-15}) + \sigma_1(w_{i-2}) + w_{i-7} + w_{i-16}, \quad \text{for } 16 \le i < 64,$$

where Rot is a right rotation, Shr is a right shift, additions are modulo $2^{32}$, and

$$\sigma_0(x) = \mathrm{Rot}(x, 7) \oplus \mathrm{Rot}(x, 18) \oplus \mathrm{Shr}(x, 3)$$
$$\sigma_1(x) = \mathrm{Rot}(x, 17) \oplus \mathrm{Rot}(x, 19) \oplus \mathrm{Shr}(x, 10)$$
(a_0, b_0, c_0, d_0, e_0, f_0, g_0, h_0) ← S
for i ← 0 to 63 do
    T1 ← h_i ⊞ Σ1(e_i) ⊞ Ch(e_i, f_i, g_i) ⊞ k_i ⊞ w_i
    T2 ← Σ0(a_i) ⊞ Maj(a_i, b_i, c_i)
    h_{i+1} ← g_i,  g_{i+1} ← f_i
    f_{i+1} ← e_i,  e_{i+1} ← d_i ⊞ T1
    d_{i+1} ← c_i,  c_{i+1} ← b_i
    b_{i+1} ← a_i,  a_{i+1} ← T1 ⊞ T2
end for
S′ ← (a_0 ⊞ a_64, . . . , h_0 ⊞ h_64)
⊞ is addition modulo 2^32.
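The two phases of Update (message schedule, then state update) can be put together in a short scalar routine. The sketch below follows the pseudocode above; the definitions of Σ0, Σ1, Ch and Maj and the round constants K are taken from the SHA-2 standard rather than from these slides, and the function name sha256_update is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t rotr(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }

/* Round constants k_0, ..., k_63 of SHA2-256. */
static const uint32_t K[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
};

/* S <- Update(S, M) for one 512-bit block given as 16 big-endian words. */
void sha256_update(uint32_t S[8], const uint32_t M[16]) {
    uint32_t w[64], v[8];
    assert(S != NULL && M != NULL);

    /* Phase 1: message schedule. */
    for (int i = 0; i < 16; i++) w[i] = M[i];
    for (int i = 16; i < 64; i++) {
        uint32_t s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3);
        uint32_t s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10);
        w[i] = s0 + s1 + w[i - 7] + w[i - 16];
    }

    /* Phase 2: state update; v holds (a, b, c, d, e, f, g, h). */
    for (int i = 0; i < 8; i++) v[i] = S[i];
    for (int i = 0; i < 64; i++) {
        uint32_t a = v[0], b = v[1], c = v[2], e = v[4], f = v[5], g = v[6];
        uint32_t T1 = v[7] + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
                    + ((e & f) ^ (~e & g)) + K[i] + w[i];
        uint32_t T2 = (rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
                    + ((a & b) ^ (a & c) ^ (b & c));
        v[7] = g; v[6] = f; v[5] = e; v[4] = v[3] + T1;
        v[3] = c; v[2] = b; v[1] = a; v[0] = T1 + T2;
    }

    /* Feed-forward: add the initial state to the final working variables. */
    for (int i = 0; i < 8; i++) S[i] += v[i];
}
```

Starting from the standard initial state and the single padded block for the message "abc" (first word 0x61626380, last word 24), one call yields the digest that begins with ba7816bf.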
In 2013, Intel released the specification of the SHA New Instructions (SHA-NI).
SHA1:
SHA2-256 (and SHA2-224):
The SHA256MSG1 instruction performs the following operation: x_i = σ0(w_{i+1}) + w_i, for 0 ≤ i < 4.
Register layout: xmm0 = [w7, w6, w5, w4], xmm1 = [w3, w2, w1, w0], result xmm2 = [x3, x2, x1, x0].
The SHA256MSG2 instruction performs the following operation: w_{i+16} = σ1(w_{i+14}) + y_i, for 0 ≤ i < 4.
Register layout: xmm0 = [y3, y2, y1, y0], xmm1 = [w15, w14, w13, w12], result xmm2 holds w16, w17, w18, w19.
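A scalar model helps to see how the two instructions combine into four steps of the message schedule. In the sketch below, msg1_model and msg2_model are hypothetical stand-ins for SHA256MSG1 and SHA256MSG2; the intermediate value y_i = x_i + w_{i+9} is an assumption derived from the schedule recurrence, since the slides do not define y.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rotr(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }
static uint32_t sigma0(uint32_t x) { return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3); }
static uint32_t sigma1(uint32_t x) { return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10); }

/* Model of SHA256MSG1: x[i] = sigma0(w[i+1]) + w[i], 0 <= i < 4 (reads w[0..4]). */
void msg1_model(uint32_t x[4], const uint32_t w[5]) {
    for (int i = 0; i < 4; i++) x[i] = sigma0(w[i + 1]) + w[i];
}

/* Model of SHA256MSG2: out[i] = sigma1(w_{i+14}) + y[i], 0 <= i < 4.
   w14 and w15 are inputs; w16 and w17 are the first two outputs. */
void msg2_model(uint32_t out[4], const uint32_t y[4], uint32_t w14, uint32_t w15) {
    out[0] = sigma1(w14) + y[0];
    out[1] = sigma1(w15) + y[1];
    out[2] = sigma1(out[0]) + y[2];
    out[3] = sigma1(out[1]) + y[3];
}

/* Extend w[0..15] with w[16..19]. Adding w[i+9] between the two steps
   is left to ordinary vector instructions. */
void schedule_next4(uint32_t w[20]) {
    uint32_t x[4], y[4];
    msg1_model(x, w);
    for (int i = 0; i < 4; i++) y[i] = x[i] + w[i + 9];
    msg2_model(&w[16], y, w[14], w[15]);
}
```

The last two lanes of msg2_model depend on the first two outputs, so the four results are not fully independent even though they are produced by one instruction.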
Let A_i = [a_i, b_i, e_i, f_i] and C_i = [c_i, d_i, g_i, h_i] be the state at the i-th iteration. Then it holds that C_{i+2} = A_i. The remaining values A_{i+2} = [a_{i+2}, b_{i+2}, e_{i+2}, f_{i+2}] are calculated by the SHA256RNDS2 instruction: A_{i+2} = SHA256RNDS2(C_i, A_i, X), where X = [w_i + k_i, w_{i+1} + k_{i+1}].
[Figure: two consecutive SHA2-256 rounds, highlighting that c_{i+2} = a_i, d_{i+2} = b_i, g_{i+2} = e_i and h_{i+2} = f_i.]
Using two SHA256RNDS2 instructions, one can compute four iterations of the Update function:
    C_{i+2} = A_i
    A_{i+2} = SHA256RNDS2(C_i, A_i, X)
    C_{i+4} = A_{i+2}
    A_{i+4} = SHA256RNDS2(C_{i+2}, A_{i+2}, Y)
where X = [w_i + k_i, w_{i+1} + k_{i+1}] and Y = [w_{i+2} + k_{i+2}, w_{i+3} + k_{i+3}]. This is equivalent to:
    C_{i+4} = SHA256RNDS2(C_i, A_i, X)
    A_{i+4} = SHA256RNDS2(A_i, C_{i+4}, Y)
SHA-NI is 4-5× faster than 64-bit implementations of SHA2-256.
[Figure: running time (cycles per byte) and speedup versus message size (1 byte to 1 MB) for sphlib (SUPERCOP), OpenSSL and SHA-NI.]
Can we do better?
Like AES-NI, SHA-NI instructions can be pipelined.
[Figure: clock-cycle timeline of w = 4 independent streams of SHA256RNDS2 instructions overlapped in the pipeline, annotated with latency and throughput.]
Target scenario: multiple hashing ⇒ hash-based signatures (PQ-Crypto).
Example: Calculating four hashes (pipelined) is 20% faster than a sequential implementation.
[Figure: running time (cycles per byte) versus message size (256 bytes to 1 MB) on Zen (Ryzen 7 1800X processor) when hashing 1, 2, 4 and 8 messages.]
Symmetric-Key Cryptography SHA3 Implementation
SHA-3 is composed of four hash functions and two XOFs called SHAKE.

Function    Output size (n)   Bit-rate (r)   Security level
SHA3-224    224               1,152          112
SHA3-256    256               1,088          128
SHA3-384    384               832            192
SHA3-512    512               576            256
SHAKE128    n                 1,344          min(n/2, 128)
SHAKE256    n                 1,088          min(n/2, 256)

The input of a SHA-3 function is split into blocks of r bits. The larger the bit-rate, the faster the execution.
An extendable-output function (XOF) maps an arbitrary-length bit string to a digest whose length n is chosen by the caller:
$$\mathrm{XOF} : \{0, 1\}^* \times \mathbb{N} \to \{0, 1\}^*, \qquad (a, n) \mapsto \{0, 1\}^n$$
SHA-3 was designed using a sponge construction proposed in 2009 by Bertoni et al. The construction has three phases: initializing, absorbing and squeezing.
Initializing: The state has 1,600 bits that are initialized to 0; then, the input is split into blocks of r bits.
Absorbing: Each block is added to the first r bits of the state; then, the state is processed by a permutation function P.
Squeezing: After the input was consumed, the function P is used to produce ⌊n/r⌋ output blocks of r bits concatenated with n (mod r) bits taken from the last state.
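The three phases can be sketched in compact C. The code below is modeled on the structure of compact public Keccak implementations and is not the optimized code measured in these slides. Several points are assumptions of this sketch: the rate is given in bytes, lane (x, y) is stored at index x + 5y (which differs from the s_{5x+y} indexing used in the slides), and the padding suffix byte is 0x06 for the SHA-3 hash functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t rotl64(uint64_t x, unsigned n) { return (x << n) | (x >> ((64 - n) & 63)); }

/* Permutation P (Keccak-f[1600]); lane (x, y) lives at A[x + 5*y]. */
static void keccak_f1600(uint64_t A[25]) {
    uint8_t lfsr = 1;                        /* generates the round constants */
    for (int round = 0; round < 24; round++) {
        uint64_t C[5], T[5];
        /* theta */
        for (int x = 0; x < 5; x++) C[x] = A[x] ^ A[x + 5] ^ A[x + 10] ^ A[x + 15] ^ A[x + 20];
        for (int x = 0; x < 5; x++) {
            uint64_t D = C[(x + 4) % 5] ^ rotl64(C[(x + 1) % 5], 1);
            for (int y = 0; y < 5; y++) A[x + 5 * y] ^= D;
        }
        /* rho and pi, walking the lanes along (x, y) -> (y, 2x + 3y) */
        int px = 1, py = 0;
        uint64_t cur = A[1];
        for (int t = 0; t < 24; t++) {
            unsigned r = (unsigned)(((t + 1) * (t + 2) / 2) % 64);
            int ny = (2 * px + 3 * py) % 5;
            px = py; py = ny;
            uint64_t tmp = A[px + 5 * py];
            A[px + 5 * py] = rotl64(cur, r);
            cur = tmp;
        }
        /* chi */
        for (int y = 0; y < 5; y++) {
            for (int x = 0; x < 5; x++) T[x] = A[x + 5 * y];
            for (int x = 0; x < 5; x++) A[x + 5 * y] = T[x] ^ (~T[(x + 1) % 5] & T[(x + 2) % 5]);
        }
        /* iota */
        for (int j = 0; j < 7; j++) {
            lfsr = (uint8_t)((lfsr << 1) ^ ((lfsr >> 7) * 0x71));
            if (lfsr & 2) A[0] ^= (uint64_t)1 << ((1 << j) - 1);
        }
    }
}

/* XOR one byte into the state at byte offset pos (little-endian lanes). */
static void xor_byte(uint64_t A[25], size_t pos, uint8_t b) {
    A[pos / 8] ^= (uint64_t)b << (8 * (pos % 8));
}

/* Sponge with a rate of `rate` bytes producing outlen bytes of output. */
void sponge(size_t rate, uint8_t suffix, const uint8_t *in, size_t inlen,
            uint8_t *out, size_t outlen) {
    uint64_t A[25] = {0};                    /* initializing: all-zero state */
    size_t pos = 0;
    assert(rate > 0 && rate < 200);
    for (size_t i = 0; i < inlen; i++) {     /* absorbing */
        xor_byte(A, pos++, in[i]);
        if (pos == rate) { keccak_f1600(A); pos = 0; }
    }
    xor_byte(A, pos, suffix);                /* padding */
    xor_byte(A, rate - 1, 0x80);
    keccak_f1600(A);
    pos = 0;
    for (size_t i = 0; i < outlen; i++) {    /* squeezing */
        if (pos == rate) { keccak_f1600(A); pos = 0; }
        out[i] = (uint8_t)(A[pos / 8] >> (8 * (pos % 8)));
        pos++;
    }
}
```

With rate = 136 bytes (1,088 bits), suffix 0x06 and a 32-byte output, this corresponds to the SHA-3 function with a 256-bit digest.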
The state has 1,600 bits and is represented by a 5 × 5 matrix S, each entry being a 64-bit word:

$$S = \begin{pmatrix}
s_0 & s_1 & s_2 & s_3 & s_4 \\
s_5 & s_6 & s_7 & s_8 & s_9 \\
s_{10} & s_{11} & s_{12} & s_{13} & s_{14} \\
s_{15} & s_{16} & s_{17} & s_{18} & s_{19} \\
s_{20} & s_{21} & s_{22} & s_{23} & s_{24}
\end{pmatrix}; \qquad S[x, y] = s_{5x+y} \ \text{for } 0 \le x, y < 5.$$

The permutation P consists of 24 rounds applying the transformations θ, ρ, π, χ and ι.
The SHA-3 state is stored in seven 256-bit registers.
Y0 = [s0,  s1,  s2,  s3]
Y1 = [s5,  s6,  s7,  s8]
Y2 = [s10, s11, s12, s13]
Y3 = [s15, s16, s17, s18]
Y4 = [s20, s21, s22, s23]
Y5 = [s24, s24, s24, s24]
Y6 = [s4,  s9,  s14, s19]
Pros:
vector registers. Cons:
instructions of AVX-2 are expensive.
X0 = [s0, s1]      X1 = [s2, s3]      X2 = [s5, s6]      X3 = [s7, s8]      X4 = [s4, s9]
X5 = [s10, s11]    X6 = [s12, s13]    X7 = [s15, s16]    X8 = [s17, s18]    X9 = [s14, s19]
X10 = [s20, s21]   X11 = [s22, s23]   X12 = [s24, s24]
variables of 256 bits.
instructions of SSE4 are cheaper than AVX-2.
$$Y_j = [\, s^1_j,\ s^2_j,\ s^3_j,\ s^4_j \,] \quad \text{for } 0 \le j \le 24,$$
where $s^m_j$ denotes word j of the state of the m-th message.
variables of 256 bits.
permutations.
and the processor has
Cycles per byte taken for hashing a message of 4096 bytes.
[Figure: running time (cycles per byte) on Haswell, Skylake and Zen for the implementations x64, x64shld, AVX2, generic64, 2M-SSE and 4M-AVX2.]
Measurements were taken using the official Keccak Code Package.
(1M) 64-bit native instructions.
(2M) 128-bit vector instructions [SSE2/AVX].
(4M) 256-bit vector instructions [AVX2].
[Figure: speedup versus number of messages (1 to 4) on Haswell, Skylake and Zen.]
Performance of Zen does not scale well for hashing 4 messages.
Elliptic Curve Cryptography Elliptic Curves
elliptic curves for cryptographic purposes.
key sizes. For example, at the 128-bit security level:
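$$E/\mathbb{F}_p : y^2 + a_1xy + a_3y = x^3 + a_2x^2 + a_4x + a_6$$
where $a_1, a_2, a_3, a_4, a_6 \in \mathbb{F}_p$ and p is a prime number.
The points of the curve, together with a point at infinity $\mathcal{O}$ acting as the identity, form a group: $(E, +) = \{(x, y) \in E\} \cup \{\mathcal{O}\}$.
For the short form $y^2 = x^3 + a_4x + a_6$ and $x_1 \ne x_2$, the sum $(x_3, y_3) = (x_1, y_1) + (x_2, y_2)$ is calculated as:
$$x_3 = \left(\frac{y_2 - y_1}{x_2 - x_1}\right)^2 - x_1 - x_2, \qquad y_3 = \left(\frac{y_2 - y_1}{x_2 - x_1}\right)(x_1 - x_3) - y_1$$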
Let P and Q be two points on the curve; then we can compute P + Q using a geometric construction:
Draw the line through P and Q.
This line intersects the curve in a point R.
Draw the vertical line through R.
The other point where this vertical line intersects the curve is defined as the addition P + Q.
The addition of a point P with itself can be computed as follows:
Draw the tangent line to the curve at point P.
This line intersects the curve in a point R.
Draw the vertical line through R.
The other point where this vertical line intersects the curve is defined as 2P.
Given an integer k and a point P ∈ E, point multiplication is defined as: kP = P + P + · · · + P (k times).
$$15P = (1111)_2 P = (2^3 + 2^2 + 2^1 + 1)P = 2^3P + 2^2P + 2^1P + P$$
$$kP = k_{n-1}2^{n-1}P + \dots + k_1 2P + k_0 P$$
Input: P ∈ E and k ∈ Z+. Output: kP
(k_{n−1}, . . . , k_1, k_0)_2 ← k
Q ← O
for i ← n − 1 to 0 do
    Q ← 2Q
    Q ← Q + k_i P
end for
return Q
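The double-and-add loop above can be tried on a toy curve. The sketch below assumes the textbook curve y^2 = x^3 + 2x + 2 over F_17, whose group has prime order 19 and contains the point (5, 1); the curve, the type point and the function names are all illustrative choices, and the code is neither constant-time nor suitable for real parameters.

```c
#include <assert.h>
#include <stdint.h>

/* Toy curve y^2 = x^3 + 2x + 2 over F_17; far too small for real use. */
enum { P_MOD = 17, A_COEF = 2 };

typedef struct { int x, y, inf; } point;   /* inf != 0 encodes the identity O */

static int mod(int v) { return ((v % P_MOD) + P_MOD) % P_MOD; }

/* Modular inverse by Fermat's little theorem: v^(p-2) mod p. */
static int inv(int v) {
    int r = 1;
    v = mod(v);
    assert(v != 0);
    for (int i = 0; i < P_MOD - 2; i++) r = mod(r * v);
    return r;
}

/* Affine addition: handles O, P + (-P), doubling and the generic chord case. */
point ec_add(point p, point q) {
    if (p.inf) return q;
    if (q.inf) return p;
    if (p.x == q.x && mod(p.y + q.y) == 0) return (point){0, 0, 1};
    int lam = (p.x == q.x)
        ? mod((3 * p.x * p.x + A_COEF) * inv(2 * p.y))   /* tangent slope */
        : mod((q.y - p.y) * inv(q.x - p.x));             /* chord slope   */
    int x3 = mod(lam * lam - p.x - q.x);
    int y3 = mod(lam * (p.x - x3) - p.y);
    return (point){x3, y3, 0};
}

/* Left-to-right double-and-add, scanning k from its most significant bit. */
point ec_mul(uint32_t k, point p) {
    point q = {0, 0, 1};                   /* Q <- O */
    for (int i = 31; i >= 0; i--) {
        q = ec_add(q, q);                  /* Q <- 2Q */
        if ((k >> i) & 1) q = ec_add(q, p);/* Q <- Q + k_i P */
    }
    return q;
}
```

For the point P = (5, 1), ec_mul(2, P) gives (6, 3) and ec_mul(19, P) gives the identity. The branch on the bit k_i makes the running time depend on the scalar, which is one motivation for ladder-style algorithms.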
The operation kP can be performed using different techniques:
Given two points, P and Q, the problem of finding an integer k such that Q = kP is known as the elliptic curve discrete logarithm problem.
The best known generic algorithms solve this problem in $O(\sqrt{\#E(\mathbb{F}_p)})$ operations, where $\#E(\mathbb{F}_p) \approx p$ is the number of points on the curve.
For example, if $p \approx 2^{256}$ then about $2^{128}$ operations are required to solve ECDLP.
signatures (ECDSA) and the key-agreement protocol (ECDH) [10].
E/F_p : y^2 = x^3 − 3x + b

            P-256                                   P-384
Security    128-bit                                 192-bit
p           2^256 − 2^224 + 2^192 + 2^96 − 1        2^384 − 2^128 − 2^96 + 2^32 − 1
b           0x5ac635d...27d2604b                    0xb3312fa...d3ec2aef
#E          2^256 − 2^224 + 2^192 − 2^128 + t       2^384 − t
t           0xbce6faa...fc632551                    0x389cb27...333ad68d
In January 2016, RFC 7748 recommended the use of Curve25519 and Curve448 in two elliptic curve models:

            Curve25519 (Bernstein [1, 2])        Curve448 (Hamburg [5])
Security    128-bit                              224-bit
p           2^255 − 19                           2^448 − 2^224 − 1
(a, d, A)   (−1, −121665/121666, 486662)         (1, −39081, 156326)
#E          8ℓ                                   4ℓ
ℓ           2^252 + 0x14def9dea2f79cd65812631a5cf5d3ed
            2^446 − 0x8335dc163bb124b65129c96fde933d8d723a70aadc873d6d54a7bb0d
Elliptic Curve Cryptography Elliptic Curve Diffie-Hellman
RFC 7748 recommends the use of two functions to compute a shared secret:
X25519: keys of 32 bytes.
X448: keys of 56 bytes.

Alice: a ←$ {0, 1}^256;  K_A ← X25519(9, a);  K = X25519(K_B, a)
Bob:   b ←$ {0, 1}^256;  K_B ← X25519(9, b);  K = X25519(K_A, b)
K is the shared secret.
Internally, X is the calculation of an elliptic curve point multiplication kP using the Montgomery ladder algorithm.
Input: P ∈ E and k ∈ Z+. Output: kP
1: (k_{n−1}, . . . , k_0)_2 ← k, with k_n = 0
2: Q0 ← O
3: Q1 ← P
4: for i ← n − 1 to 0 do
5:     b ← k_i ⊕ k_{i+1}
6:     Q0, Q1 ← cswap(b, Q0, Q1)
7:     Q0, Q1 ← 2Q0, Q0 + Q1
8: end for
9: Q0, Q1 ← cswap(k_0, Q0, Q1)
10: return Q0
Example: 22P, with 22 = (10110)_2. Each row shows the pair (Q0, Q1) after processing bit k_i, with the pending swap undone.
k_i    Q0     Q1
       O      P
1      P      2P
0      2P     3P
1      5P     6P
1      11P    12P
0      22P    23P
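The control flow of the ladder, with its deferred conditional swaps, can be checked in a toy group where the result is easy to verify. The sketch below uses integers under addition modulo 2^64 in place of curve points and starts from the pair (O, P), as in the example table; in X25519 the doubling and addition would be the x-only Montgomery formulas. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Conditional swap without branches: exchanges *a and *b when bit == 1. */
static void cswap(uint64_t bit, uint64_t *a, uint64_t *b) {
    uint64_t mask = (uint64_t)0 - bit;     /* all ones if bit == 1, else 0 */
    uint64_t t = mask & (*a ^ *b);
    *a ^= t;
    *b ^= t;
}

/* Montgomery ladder computing k*P in the toy group (Z/2^64, +). */
uint64_t ladder_mul(uint32_t k, uint64_t P) {
    uint64_t q0 = 0, q1 = P;               /* (Q0, Q1) <- (O, P) */
    uint64_t prev = 0;                     /* previous bit, k_n = 0 */
    for (int i = 31; i >= 0; i--) {
        uint64_t ki = (k >> i) & 1;
        cswap(ki ^ prev, &q0, &q1);        /* b = k_i xor k_{i+1} */
        uint64_t sum = q0 + q1;            /* Q0 + Q1 */
        q0 = q0 + q0;                      /* 2 Q0    */
        q1 = sum;
        prev = ki;
    }
    cswap(prev, &q0, &q1);                 /* undo the last pending swap */
    return q0;
}
```

Every iteration performs one doubling, one addition and one masked swap regardless of the value of the bit, so the sequence of operations does not depend on the scalar.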
Elements of F_p are split into words of size w:
$$a \in \mathbb{F}_p, \qquad a = \sum_{i=0}^{t-1} a_i 2^{wi} = a_0 + a_1 2^{w} + a_2 2^{2w} + \dots, \qquad \text{where } t = \left\lceil \frac{|p|}{w} \right\rceil$$
Let W be the machine's word size; then there are two cases:
w = W: Full-radix or saturated arithmetic.
w < W: Reduced-radix, redundant representation, unsaturated arith...
E.g., for p = 2^255 − 19 and a W = 64 instruction set, use an array of t = 5 words storing coefficients of w = 51 bits.
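The reduced-radix case for p = 2^255 − 19 (t = 5, w = 51) can be sketched as follows. The functions convert between four saturated 64-bit words and five 51-bit limbs, add without carries, and propagate carries using 2^255 ≡ 19 (mod p). All names are hypothetical, and the sketch leaves out multiplication and the final reduction to a unique representative.

```c
#include <assert.h>
#include <stdint.h>

#define MASK51 ((UINT64_C(1) << 51) - 1)

/* Split a 255-bit integer (four 64-bit words, least significant first)
   into t = 5 limbs of w = 51 bits each. */
void to_radix51(uint64_t limb[5], const uint64_t a[4]) {
    limb[0] = a[0] & MASK51;
    limb[1] = ((a[0] >> 51) | (a[1] << 13)) & MASK51;
    limb[2] = ((a[1] >> 38) | (a[2] << 26)) & MASK51;
    limb[3] = ((a[2] >> 25) | (a[3] << 39)) & MASK51;
    limb[4] = (a[3] >> 12) & MASK51;
}

/* Inverse of to_radix51; every limb must be below 2^51. */
void from_radix51(uint64_t a[4], const uint64_t limb[5]) {
    for (int i = 0; i < 5; i++) assert(limb[i] <= MASK51);
    a[0] = limb[0] | (limb[1] << 51);
    a[1] = (limb[1] >> 13) | (limb[2] << 38);
    a[2] = (limb[2] >> 26) | (limb[3] << 25);
    a[3] = (limb[3] >> 39) | (limb[4] << 12);
}

/* Limb-wise addition without carries: the 13 spare bits of each
   64-bit word absorb the overflow of the 51-bit coefficients. */
void add_lazy(uint64_t r[5], const uint64_t x[5], const uint64_t y[5]) {
    for (int i = 0; i < 5; i++) r[i] = x[i] + y[i];
}

/* One pass of carry propagation; the carry out of limb 4 has weight
   2^255, which is congruent to 19 modulo p. */
void carry(uint64_t l[5]) {
    for (int i = 0; i < 4; i++) {
        l[i + 1] += l[i] >> 51;
        l[i] &= MASK51;
    }
    l[0] += 19 * (l[4] >> 51);
    l[4] &= MASK51;
}
```

Because add_lazy has no carry chain, its five additions are independent, which is what makes the unsaturated representation a good fit for vector instructions. After carry, limb 0 may slightly exceed 51 bits, so the result is only weakly reduced.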
Full-radix: using MULX with ADCX/ADOX reduces the running time by 11-14% with respect to the fastest implementation reported in SUPERCOP.
Reduced-radix: an additional 8-10% reduction is obtained by using AVX2.
[Figure: running time (10^3 cycles) on Haswell and Skylake for Moon (floodyberry) x64; Tung, SAC 2015, x64+SSE2; Oliveira et al., SAC 2017, x64 (MULX/ADCX); and our code, AVX2 and MULX/ADCX.]
We reduce the timings reported by Hamburg by 13% on Haswell and by 17% on Skylake.
[Figure: running time (10^3 clock cycles) on Haswell and Skylake for eBACS (SUPERCOP) x64, Hamburg x64, and our code AVX2.]
Elliptic Curve Cryptography Digital Signatures
Sign Given a message there is an algorithm that computes a bit string, called signature, associated to the private key
Verify This step determines whether a signature is valid, i.e. the signature for the message was created using the private key corresponding to the referenced public key.
[Figure: signing diagram with the blocks Hash and Signing and the input Private Key. The message is hashed to obtain a digest value, ... verification.]
[Figure: verification diagram with the block Verification, the input Public Key, and the outputs Valid and Reject. ... whether a signature is valid.]
1991: PKCS#1: Rivest-Shamir-Adleman scheme (RSA).
1993: FIPS 186: Digital Signature Algorithm (DSA).
1999: ANSI X9.62: Elliptic Curve Digital Signature Algorithm (ECDSA).
2011: Bernstein et al. proposed the Edwards-curve Digital Signature Algorithm (EdDSA).
2015: EdDSA appears in an IETF draft for discussion [6].
2017: EdDSA is described in RFC 8032 [7].

The use of EdDSA is increasing; for instance, OpenSSH now supports Ed25519 signatures.
Elliptic Curve Cryptography EdDSA Scheme
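Encoding of a point (x, y) as a b-bit string s:

$s = (x \bmod 2) \,\|\, y$

Decoding of s:

$y = s \bmod 2^{b-1}, \qquad x = \pm\sqrt{\dfrac{y^2 - 1}{d y^2 - a}}$ such that $x \equiv s_{b-1} \pmod 2$.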
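The encoding and decoding rules can be tried out on a toy curve. The sketch below assumes a twisted Edwards curve a·x² + y² = 1 + d·x²·y² over F_13 with a = −1 and d = 2, and b = 5 bits; these parameters, the function names, and the brute-force square root are all choices made for illustration and are far too small for real use.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy parameters: p = 13, a = -1 mod p, d = 2, b = 5. */
enum { ED_P = 13, ED_A = 12, ED_D = 2, ED_B = 5 };

/* Modular exponentiation, used for inversion via Fermat's little theorem. */
static uint32_t ed_pow(uint32_t base, uint32_t e) {
    uint32_t r = 1;
    base %= ED_P;
    while (e) {
        if (e & 1) r = r * base % ED_P;
        base = base * base % ED_P;
        e >>= 1;
    }
    return r;
}

/* s = (x mod 2) || y : parity of x in the top bit, y in the low b-1 bits. */
static uint32_t ed_encode(uint32_t x, uint32_t y) {
    assert(x < ED_P && y < ED_P);
    return ((x & 1u) << (ED_B - 1)) | y;
}

/* Recovers (x, y) from s. Returns 1 on success, 0 if s encodes no point. */
static int ed_decode(uint32_t s, uint32_t *x, uint32_t *y) {
    uint32_t sign = (s >> (ED_B - 1)) & 1u;
    uint32_t yy = s & ((1u << (ED_B - 1)) - 1);          /* y = s mod 2^(b-1) */
    if (yy >= ED_P) return 0;
    uint32_t y2 = yy * yy % ED_P;
    uint32_t num = (y2 + ED_P - 1) % ED_P;               /* y^2 - 1   */
    uint32_t den = (ED_D * y2 + ED_P - ED_A) % ED_P;     /* d y^2 - a */
    uint32_t x2 = num * ed_pow(den, ED_P - 2) % ED_P;    /* x^2       */
    for (uint32_t c = 0; c < ED_P; c++) {                /* toy square root */
        if (c * c % ED_P == x2 && (c & 1u) == sign) {
            *x = c;
            *y = yy;
            return 1;
        }
    }
    return 0;
}
```

The points (5, 0) and (8, 0) lie on this toy curve and differ only in the parity of x, so they encode to 16 and 0 respectively and decode back to themselves.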
Computing the secret and public keys, (sk, pk):
1: sk ∈_R [0, ℓ)
2: h = (h_{2b−1}, ..., h_0)_2 ← H(sk)
3: a ← 2^n + Σ_{c ≤ i < n} 2^i h_i    (a has n + 1 bits; the bottom c bits are cleared)
4: pk ← aB
5: return (sk, pk)
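Step 3 is usually implemented as a few byte masks. The sketch below assumes the Ed25519 parameters of RFC 8032 [7] (b = 256, c = 3, n = 254) and a little-endian byte encoding of h; the function name `derive_scalar` is hypothetical, and the slide itself keeps c and n generic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Derives the secret scalar a from h = H(sk), assuming Ed25519
   parameters (c = 3, n = 254) and little-endian bytes. */
static void derive_scalar(uint8_t a[32], const uint8_t h[64]) {
    assert(a != NULL && h != NULL);
    memcpy(a, h, 32);   /* lower half (h_{b-1}, ..., h_0) */
    a[0]  &= 0xF8;      /* clear the bottom c = 3 bits    */
    a[31] &= 0x7F;      /* drop bit 255                   */
    a[31] |= 0x40;      /* set bit n = 254                */
}
```

After these masks the scalar always has its top bit at position 254 and is a multiple of 8, whatever the hash output was.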
Given a message M and the key pair (sk, pk), compute the signature (R, S) as:
1: h = (h_{2b−1}, ..., h_b, h_{b−1}, ..., h_0)_2 ← H(sk), where h_H = (h_{2b−1}, ..., h_b) denotes the upper half
2: a ← 2^n + Σ_{c ≤ i < n} 2^i h_i
3: r ← H(h_H ∥ M) (mod ℓ)
4: R′ ← rB
5: R ← Encode(R′)
6: S ← r + H(R ∥ pk ∥ M) · a (mod ℓ)
7: return (R, S)
Given a message M, a signature (R, S) and a public key pk:
P ← Decode(pk)
h ← H(R ∥ pk ∥ M) (mod ℓ)
Accept the signature if the following holds:
P ∈ E_d(F_p) and S ∈ [0, ℓ) and SB = R + hP
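To see why the check SB = R + hP accepts honest signatures, the signing and verification equations can be run in a toy model. In the sketch below the point kB is represented by the integer k mod ℓ, ℓ = 1000003 is an arbitrary small prime, and FNV-1a stands in for the hash H. All of these are assumptions made for illustration: the model has no security at all and only exercises the algebra.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_ELL UINT64_C(1000003)   /* toy prime group order (assumption) */

/* FNV-1a, a non-cryptographic stand-in for the hash H. */
static uint64_t fnv1a(uint64_t state, const void *data, size_t len) {
    const unsigned char *p = (const unsigned char *)data;
    for (size_t i = 0; i < len; i++) {
        state ^= p[i];
        state *= UINT64_C(0x100000001b3);
    }
    return state;
}

/* Toy H(x || y || msg) mod ell. */
static uint64_t toy_hash3(uint64_t x, uint64_t y, const char *msg) {
    uint64_t s = UINT64_C(0xcbf29ce484222325);
    s = fnv1a(s, &x, sizeof x);
    s = fnv1a(s, &y, sizeof y);
    s = fnv1a(s, msg, strlen(msg));
    return s % TOY_ELL;
}

typedef struct { uint64_t R, S; } toy_sig;

/* Toy signing: the "point" kB is the integer k mod ell, so pk = a mod ell.
   prefix plays the role of h_H; the 0 is padding for the toy hash. */
static toy_sig toy_sign(uint64_t prefix, uint64_t a, uint64_t pk, const char *msg) {
    assert(a < TOY_ELL && pk < TOY_ELL);
    toy_sig sig;
    uint64_t r = toy_hash3(prefix, 0, msg);    /* r = H(h_H || M) mod ell  */
    sig.R = r;                                 /* R = rB in the toy group  */
    uint64_t h = toy_hash3(sig.R, pk, msg);    /* h = H(R || pk || M)      */
    sig.S = (r + h * a % TOY_ELL) % TOY_ELL;   /* S = r + h*a mod ell      */
    return sig;
}

/* Toy verification: checks S in [0, ell) and SB == R + hP. */
static int toy_verify(toy_sig sig, uint64_t pk, const char *msg) {
    if (sig.S >= TOY_ELL || pk >= TOY_ELL) return 0;
    uint64_t h = toy_hash3(sig.R, pk, msg);
    return sig.S == (sig.R + h * pk % TOY_ELL) % TOY_ELL;
}
```

Since P = aB and R = rB, the right-hand side R + hP equals (r + h·a)B, which is exactly SB; changing S by any amount breaks the equality.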
Focus on the optimization of two main operations:
Input: k, an n-bit integer; w, an integer window size; P, a fixed point of order ℓ.
Output: Q, a point such that Q = kP.
Off-line computation:
1: Compute the look-up tables {T_i ← d·2^{wi}·P} for odd d ∈ [1, 2^{w−1}] and all i ∈ [0, t).
On-line computation:
1: t ← ⌈n/w⌉
2: Q ← O
3: Let (K_0, K_1, ..., K_{t−1})_w be the signed radix-w representation of k.
4: for i ← 0 to t − 1 do
5:     P ← Query(T_i, K_i)
6:     Q ← Q + P
7: end for
8: return Q
Query must be protected against side-channel attacks.
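One common way to protect Query is to read every table entry and select the wanted one with a mask, so that neither the memory access pattern nor the control flow depends on the secret digit. The sketch below is one such approach under stated assumptions: table entries are modelled as 64-bit integers instead of curve points, `table[j]` is assumed to hold the entry for digit value j + 1, and the name `query` and this layout are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Constant-time signed lookup. table[j] holds the entry for |digit| = j + 1
   (layout is an assumption); digit lies in [-size, size]; digit == 0
   yields 0, the stand-in for the identity. */
static int64_t query(const int64_t *table, uint32_t size, int32_t digit) {
    assert(table != 0 && size > 0);
    uint32_t u = (uint32_t)digit;
    uint32_t neg = u >> 31;                    /* 1 if digit < 0          */
    uint32_t idx = (u ^ (0u - neg)) + neg;     /* |digit|, branch-free    */
    uint64_t acc = 0;
    for (uint32_t j = 0; j < size; j++) {      /* touch every entry       */
        uint64_t diff = (uint64_t)((j + 1) ^ idx);
        uint64_t eq = (diff - 1) >> 63;        /* 1 iff j + 1 == idx      */
        acc |= ((uint64_t)0 - eq) & (uint64_t)table[j];
    }
    uint64_t sign = (uint64_t)0 - (uint64_t)neg;
    acc = (acc ^ sign) + neg;                  /* conditional negation    */
    return (int64_t)acc;                       /* two's complement result */
}
```

With real points the conditional negation would be replaced by a conditional negation of the relevant coordinate, but the scan-and-mask structure stays the same.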
One efficient algorithm is the interleaving method using ω-NAF.
R ← O
for i ← n − 1 to 0 do
    R ← 2R
    if k_i ≠ 0 then R ← R + T_{k_i}
    if l_i ≠ 0 then R ← R + U_{l_i}
end for
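A minimal sketch of the interleaving method for computing kP + lQ, assuming window width w = 4 and modelling the points P and Q by plain integers so that the result can be checked arithmetically. The names `wnaf` and `interleaved_mul` are hypothetical, and the code is deliberately variable-time, which is acceptable only when both scalars are public, as in signature verification.

```c
#include <assert.h>
#include <stdint.h>

/* Width-w NAF of k, least significant digit first. Non-zero digits are odd
   and lie in (-2^(w-1), 2^(w-1)). Returns the number of digits written. */
static int wnaf(int8_t *digits, uint64_t k, unsigned w) {
    assert(w >= 2 && w <= 7 && k < (UINT64_C(1) << 63));
    int len = 0;
    while (k != 0) {
        int d = 0;
        if (k & 1) {
            d = (int)(k & ((UINT64_C(1) << w) - 1));  /* k mod 2^w         */
            if (d >= (1 << (w - 1))) d -= (1 << w);   /* signed residue    */
            k -= (uint64_t)(int64_t)d;                /* adds |d| if d < 0 */
        }
        digits[len++] = (int8_t)d;
        k >>= 1;
    }
    return len;
}

/* Toy double-scalar multiplication: returns k*p + l*q, where the integers
   p and q stand in for the points P and Q. */
static int64_t interleaved_mul(uint64_t k, int64_t p, uint64_t l, int64_t q) {
    enum { W = 4, TBL = 1 << (W - 2) };       /* odd multiples 1, 3, 5, 7 */
    int8_t kd[65] = {0}, ld[65] = {0};
    int64_t T[TBL], U[TBL];
    for (int j = 0; j < TBL; j++) {
        T[j] = (2 * j + 1) * p;               /* T holds odd multiples of P */
        U[j] = (2 * j + 1) * q;               /* U holds odd multiples of Q */
    }
    int nk = wnaf(kd, k, W), nl = wnaf(ld, l, W);
    int n = nk > nl ? nk : nl;
    int64_t r = 0;                            /* R = O */
    for (int i = n - 1; i >= 0; i--) {
        r = 2 * r;                            /* one shared doubling */
        if (kd[i] > 0) r += T[(kd[i] - 1) / 2];
        else if (kd[i] < 0) r -= T[(-kd[i] - 1) / 2];
        if (ld[i] > 0) r += U[(ld[i] - 1) / 2];
        else if (ld[i] < 0) r -= U[(-ld[i] - 1) / 2];
    }
    return r;
}
```

Both scalars share the single chain of doublings, and the sparse non-zero digits of the w-NAF keep the number of additions low; negative digits are handled by subtracting the table entry, which is cheap on Edwards curves.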
The synergy between AVX2, MULX, and ADCX/ADOX instructions increases the performance of the signing operation.
[Figure: bar chart of running time (10^3 clock cycles) on Haswell and Skylake for Moon (floodyberry), SSE2, 24 KB; Moon (floodyberry), x64, 24 KB; Schwabe (supercop), x64+SSE2, 30 KB; our code, AVX2 with MULX/ADCX, 12 KB; and our code, AVX2 with MULX/ADCX, 24 KB.]
Running time was reduced by around 16-18% on the Haswell and Skylake platforms.
[Figure: bar chart of running time (10^3 clock cycles) on Haswell and Skylake for supercop, x64; Hamburg, x64; and our code, AVX2.]
jlopez@ic.unicamp.br
[1] Daniel J. Bernstein. Curve25519: New Diffie-Hellman Speed Records. In Moti Yung, Yevgeniy Dodis, Aggelos Kiayias, and Tal Malkin, editors, Public Key Cryptography, volume 3958 of Lecture Notes in Computer Science. Springer, 2006.
[2] Daniel J. Bernstein, Niels Duif, Tanja Lange, Peter Schwabe, and Bo-Yin Yang. High-speed high-security signatures. Journal of Cryptographic Engineering, 2(2):77–89, 2012.
[3] Joppe W. Bos, J. Alex Halderman, Nadia Heninger, Jonathan Moore, Michael Naehrig, and Eric Wustrow. Elliptic Curve Cryptography in Practice. In Nicolas Christin and Reihaneh Safavi-Naini, editors, Financial Cryptography and Data Security: 18th International Conference, FC 2014, Christ Church, Barbados, March 3-7, 2014, Revised Selected Papers, pages 157–175, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg.
[4] Intel Corporation. Intel Instruction Set Architecture Extensions. Available at https://software.intel.com/en-us/intel-isa-extensions, July 2013.
[5] Mike Hamburg. Ed448-Goldilocks, a new elliptic curve. Cryptology ePrint Archive, Report 2015/625, 2015. http://eprint.iacr.org/.
[6] Simon Josefsson and Niels Moeller. EdDSA and Ed25519 draft-josefsson-eddsa-ed25519-03. Available at https://tools.ietf.org/html/draft-josefsson-eddsa-ed25519-03, May 2015.
[7] Simon Josefsson and Ilari Liusvaara. Edwards-Curve Digital Signature Algorithm (EdDSA). RFC 8032, January 2017.
[8] Neal Koblitz. Elliptic Curve Cryptosystems. Mathematics of Computation, 48(177):203–209, January 1987.
[9] Victor S. Miller. Use of Elliptic Curves in Cryptography. In Hugh C. Williams, editor, Advances in Cryptology - CRYPTO '85 Proceedings, volume 218 of Lecture Notes in Computer Science, pages 417–426. Springer Berlin Heidelberg, 1986.
[10] National Institute for Standards and Technology. Digital Signature Standard (DSS). http://csrc.nist.gov/publications/fips/archive/fips186-2/fips186-2.pdf, January 2000.