
SLIDE 1

Efficient Algorithms in Software

Julio López

jlopez@ic.unicamp.br

Institute of Computing, University of Campinas

September 2017, Habana, Cuba.

ASCrypto 2017

SLIDE 2

Agenda

1 Efficient Software Implementations
    Software Efficiency
    Parallel Computation - SIMD

2 Symmetric-Key Cryptography
    Data Encryption
    Hash Functions
    SHA2 Implementation
    SHA3 Implementation

3 Elliptic Curve Cryptography
    Elliptic Curves
    Elliptic Curve Diffie-Hellman
    Digital Signatures
    EdDSA Scheme

Julio López (IC-UNICAMP) Efficient Algorithms in Software ASCrypto 2017 2 / 83

SLIDE 3

Section 1

Efficient Software Implementations

SLIDE 4

1.1

Software Efficiency

SLIDE 5

Software Efficiency

The optimization of a software implementation of a cryptographic algorithm is a task with several goals:

Ensure security.
Running time.
Code size.
Memory consumption.
Computer platform characteristics.
Energy consumption.

Sometimes these goals conflict with each other. For example, accelerating an operation with look-up tables increases code size and, if not implemented carefully, can leave the code vulnerable to cache-timing attacks.

SLIDE 6

How Is Performance Measured?

Elapsed time cannot be compared across different computers; instead, clock cycles are measured.

Use the RDTSC instruction to read the processor's Time-Stamp Counter.

    #include <stdint.h>

    /* Read the 64-bit Time-Stamp Counter (x86 only, GCC/Clang inline assembly). */
    uint64_t get_cycles(void) {
        uint32_t lo, hi;
        /* RDTSC returns the low 32 bits in EAX and the high 32 bits in EDX. */
        __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

To reduce noise in the measurements, it is recommended to turn off technologies such as Turbo Boost and Hyper-Threading.

SLIDE 7

1.2

Parallel Computation - SIMD

SLIDE 8

Single Instruction Multiple Data

Single Instruction Multiple Data (SIMD) is a class of parallel computers in which a single instruction is applied simultaneously to a set of data elements.

Recent processors support SIMD through a bank of wide registers, also known as vector registers.

SLIDE 9

Vector instructions

Instructions that operate on vector registers are known as vector instructions. They operate on words packed into vector registers.


SLIDE 20

Releases of Vector Instructions

[Figure: timeline from 1994 to 2020 showing the successive releases of x86 vector instruction sets.]

Instruction sets, in order of release: MMX, SSE, SSE2, SSE3, SSE4, AES-NI + CLMUL, AVX, AVX2, BMI, SHA1-SHA2, AVX-512.

Functionality covered: integer arithmetic, floating-point arithmetic, string manipulation, cryptography, bit manipulation.

Vector registers: MMX (64 bits), XMM (128 bits), YMM (256 bits), ZMM (512 bits).

SLIDE 24

Relevant AVX2 Instructions

Integer arithmetic for 64-bit words:

    1 cycle for add/sub.
    5 cycles for multiplications.
    C = ADD(A, B): ci = ai + bi, for i = 0, ..., 3.

Variable logic shifts:

    1 cycle for fixed shifts.
    2 cycles for variable shifts.
    C = VSHL(A, B): ci = ai ≪ bi, for i = 0, ..., 3.

Permutation of words:

    3 cycles for permutations.
    C = PERM(A, M): ci is the word of A selected by mi, with mi ∈ {0, 1, 2, 3}.

Combination/selection of registers:

    Up to 3 instructions per cycle when there are no dependencies.
    C = BLEND(A, B, M): ci is either ai or bi, selected by the mask bit mi ∈ {0, 1}.
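The lane-wise behaviour of the four operations above can be modelled in portable C. This is only an illustrative sketch: the type `vec4x64` and the function names are hypothetical, real code would use AVX2 intrinsics instead, and the BLEND convention (mask value 1 selects B) and the zero result for shift counts of 64 or more are assumptions of this model.

```c
#include <stdint.h>

/* A 256-bit vector modelled as four 64-bit lanes (hypothetical type). */
typedef struct { uint64_t lane[4]; } vec4x64;

/* C = ADD(A, B): lane-wise addition modulo 2^64. */
vec4x64 vec_add(vec4x64 a, vec4x64 b) {
    vec4x64 c;
    for (int i = 0; i < 4; i++) c.lane[i] = a.lane[i] + b.lane[i];
    return c;
}

/* C = VSHL(A, B): each lane of A is shifted left by the count in the same lane of B.
   Assumption of this model: counts of 64 or more give zero. */
vec4x64 vec_vshl(vec4x64 a, vec4x64 b) {
    vec4x64 c;
    for (int i = 0; i < 4; i++)
        c.lane[i] = (b.lane[i] < 64) ? (a.lane[i] << b.lane[i]) : 0;
    return c;
}

/* C = PERM(A, M): lane i of C is lane m[i] of A, with m[i] in {0, 1, 2, 3}. */
vec4x64 vec_perm(vec4x64 a, const int m[4]) {
    vec4x64 c;
    for (int i = 0; i < 4; i++) c.lane[i] = a.lane[m[i] & 3];
    return c;
}

/* C = BLEND(A, B, M): lane i comes from B when mask[i] is nonzero, else from A
   (the polarity of the mask is an assumption of this sketch). */
vec4x64 vec_blend(vec4x64 a, vec4x64 b, const int mask[4]) {
    vec4x64 c;
    for (int i = 0; i < 4; i++) c.lane[i] = mask[i] ? b.lane[i] : a.lane[i];
    return c;
}
```

Each loop body is independent of the others, which is what allows the hardware to process the four lanes with a single instruction.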

SLIDE 25


Vector Instruction Guide

Full documentation available at: http://software.intel.com/sites/landingpage/IntrinsicsGuide


SLIDE 26


Skylake Execution Engine

The Skylake processor has eight execution ports for instructions. This improves the Instruction-Level Parallelism (ILP).


SLIDE 27

Section 2

Symmetric-Key Cryptography

SLIDE 28

2.1

Data Encryption

SLIDE 29

Secure Communication

Alice and Bob would like to communicate through an insecure channel.

Charles is a malicious third party who also has access to the channel.
Charles should not be able to read the messages exchanged by Alice and Bob.

SLIDE 30

Symmetric Data Encryption

Using a shared secret key k, Alice and Bob can exchange encrypted messages. Charles cannot read the messages without knowing the key k.

Encryption: C = Ek(M)
Decryption: M = Dk(C)

[Figure: key generation, encryption of M under k into C, and decryption of C under k back into M.]

SLIDE 31

Advanced Encryption Standard (AES)

AES was designed in 1998 by Daemen and Rijmen (under the name Rijndael).
Selected in 2000, AES is the current NIST standard for encrypting data using a symmetric key.

AES is a block cipher that encrypts a 128-bit plaintext (M), producing a 128-bit ciphertext (C), using a key k.

AES supports three key sizes, |k| ∈ {128, 192, 256} bits, leading to three algorithms:

AES-128.
AES-192.
AES-256.

SLIDE 32

AES State Representation

AES keeps track of a 128-bit state, which can be seen as a 4 × 4 matrix of bytes.

[Figure: M is transformed into C through a sequence of rounds that use the round keys k0, ..., kNr.]

In each round, AES applies a series of transformations to the matrix. The number of rounds is:

$$N_r = \begin{cases} 10 & \text{if } |k| = 128 \\ 12 & \text{if } |k| = 192 \\ 14 & \text{if } |k| = 256 \end{cases}$$

After Nr rounds, the last state is returned as the ciphertext.

SLIDE 33


AES State Transformations

SubBytes
ShiftRows
MixColumns
AddRoundKey

For decryption, transformations are inverted and applied in reverse order.


SLIDE 34

AES MixColumns: Encryption

$$p_e(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$$

$$c' = p_e \otimes c = M_e \otimes c$$

$$\begin{pmatrix} c'_0 \\ c'_1 \\ c'_2 \\ c'_3 \end{pmatrix} = \begin{pmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{pmatrix} \begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix}$$

SLIDE 35

AES MixColumns: Decryption

$$p_d(x) = \{0b\}x^3 + \{0d\}x^2 + \{09\}x + \{0e\}$$

$$c' = p_d \otimes c = M_d \otimes c$$

$$\begin{pmatrix} c'_0 \\ c'_1 \\ c'_2 \\ c'_3 \end{pmatrix} = \begin{pmatrix} 0e & 0b & 0d & 09 \\ 09 & 0e & 0b & 0d \\ 0d & 09 & 0e & 0b \\ 0b & 0d & 09 & 0e \end{pmatrix} \begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{pmatrix}$$
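The two MixColumns matrices are circulant, so one column can be transformed with a single helper that rotates the first row. The following is a minimal byte-oriented sketch, not the table-based or AES-NI code a real implementation would use; the function names are hypothetical, and the reduction polynomial x^8 + x^4 + x^3 + x + 1 is the one fixed by the AES standard.

```c
#include <stdint.h>

/* Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        /* Multiply a by x and reduce when the top bit falls off. */
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return r;
}

/* Multiply a 4-byte column by the circulant matrix whose first row is row[].
   in and out must not overlap. */
void mul_circulant(const uint8_t row[4], const uint8_t in[4], uint8_t out[4]) {
    for (int i = 0; i < 4; i++) {
        out[i] = 0;
        for (int j = 0; j < 4; j++)
            out[i] ^= gf_mul(row[(j - i + 4) % 4], in[j]);
    }
}

/* MixColumns on one column: multiplication by Me. */
void mix_column(const uint8_t in[4], uint8_t out[4]) {
    static const uint8_t me_row[4] = {0x02, 0x03, 0x01, 0x01};
    mul_circulant(me_row, in, out);
}

/* InvMixColumns on one column: multiplication by Md. */
void inv_mix_column(const uint8_t in[4], uint8_t out[4]) {
    static const uint8_t md_row[4] = {0x0e, 0x0b, 0x0d, 0x09};
    mul_circulant(md_row, in, out);
}
```

Applying `inv_mix_column` to the output of `mix_column` returns the original column, which is a quick way to check both matrices.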

SLIDE 36

The AES-NI Instruction Set

In 2010, Intel released a set of instructions to perform the AES algorithm.

Encryption (Plaintext → Ciphertext):

    AddRoundKey
    AESENC (repeated Nr − 1 times): SubBytes, ShiftRows, MixColumns, AddRoundKey
    AESENCLAST: SubBytes, ShiftRows, AddRoundKey

Decryption (Ciphertext → Plaintext):

    AddRoundKey
    AESDEC (repeated Nr − 1 times): InvShiftRows, InvSubBytes, InvMixColumns, AddRoundKey
    AESDECLAST: InvShiftRows, InvSubBytes, AddRoundKey

SLIDE 37

AES-128 Encryption

Encrypting a 128-bit block (held in xmm15) using the key schedule (stored in xmm0-xmm10). Nr = 10.

    MOVDQA     xmm15, [rsi]     ; Load message block (16-byte aligned)
    PXOR       xmm15, xmm0      ; AddRoundKey
    AESENC     xmm15, xmm1      ; Round 1
    AESENC     xmm15, xmm2      ; Round 2
    AESENC     xmm15, xmm3      ; Round 3
    AESENC     xmm15, xmm4      ; Round 4
    AESENC     xmm15, xmm5      ; Round 5
    AESENC     xmm15, xmm6      ; Round 6
    AESENC     xmm15, xmm7      ; Round 7
    AESENC     xmm15, xmm8      ; Round 8
    AESENC     xmm15, xmm9      ; Round 9
    AESENCLAST xmm15, xmm10     ; Round 10
    MOVDQA     [rdi], xmm15     ; Store cipher block (16-byte aligned)

Analogously, for decryption use AESDEC and AESDECLAST, with the round keys taken in reverse order and the intermediate round keys transformed with AESIMC.

SLIDE 38

Modes of Operation

Splitting a long message into 128-bit blocks and encrypting each block independently (ECB mode) is not secure!

Modes of operation are used for encrypting arbitrary-length messages using a block cipher as a building block.

CBC. Cipher block chaining.
CTR. Counter mode.
GCM. Galois/Counter mode. (Authenticated encryption)

SLIDE 39

Cipher Block Chaining (CBC)

[Figure: CBC encryption and decryption of four blocks P1, ..., P4 with initialization vector IV.]

Encryption (sequential execution):

$$C_0 = IV, \qquad C_i = E_k(P_i \oplus C_{i-1})$$

Decryption (parallel execution):

$$P_i = D_k(C_i) \oplus C_{i-1}$$

SLIDE 40

Counter mode (CTR)

[Figure: CTR encryption and decryption of four blocks using the counter values IV+1, ..., IV+4.]

Encryption:

$$C_i = P_i \oplus E_k(IV + i)$$

Decryption:

$$P_i = C_i \oplus E_k(IV + i)$$

Both encryption and decryption can be executed in parallel. Only the encryption direction of the block cipher is used.
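The structure of counter mode can be sketched in a few lines of C. To keep the example self-contained, Ek is replaced by a toy 64-bit mixing function `toy_encrypt`, which is a hypothetical placeholder and NOT a secure cipher; in practice Ek would be AES on 128-bit blocks. The function name `ctr_crypt` is also an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the block cipher Ek on 64-bit blocks. NOT secure. */
uint64_t toy_encrypt(uint64_t key, uint64_t block) {
    uint64_t x = block ^ key;
    x *= 0x9e3779b97f4a7c15ULL;   /* multiplication by an odd constant */
    x ^= x >> 32;
    return x + key;
}

/* CTR mode: out[i] = in[i] XOR Ek(iv + i + 1) for the 0-indexed block i,
   which matches the counters IV+1, IV+2, ... used for P1, P2, ...
   The same routine encrypts and decrypts. */
void ctr_crypt(uint64_t key, uint64_t iv,
               const uint64_t *in, uint64_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        /* No iteration depends on another, so blocks can be processed
           in any order or in parallel. */
        out[i] = in[i] ^ toy_encrypt(key, iv + (uint64_t)i + 1);
    }
}
```

Calling `ctr_crypt` twice with the same key and IV recovers the plaintext, and the decryption path never needs the inverse cipher Dk.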

SLIDE 41

Performance of AES-128-CBC Encryption

The performance is determined by the latency of the AESENC instruction.

[Figure: timeline over clock cycles 1 to 24 in which each AESENC starts only after the previous one has completed, one latency apart.]

µ-arch          Latency (cycles)   CBC-ENC (cycles per byte)
Intel Haswell   7                  4.49
Intel Skylake   4                  2.71
AMD Zen         4                  2.44
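A rough way to see why latency dominates: in CBC encryption every round of every block depends on the previous one, so a 16-byte block costs about 10 × latency cycles. The helper below is only a back-of-the-envelope model under that assumption (the function name is hypothetical, and the initial XORs, loads and stores are ignored).

```c
/* Rough estimate, in cycles per byte, of a fully dependent chain of AES rounds:
   (rounds * latency) cycles are spent on each 16-byte block. */
double cbc_enc_estimate_cpb(int latency_cycles, int rounds) {
    const double block_bytes = 16.0;   /* AES block size: 128 bits */
    return (double)(latency_cycles * rounds) / block_bytes;
}
```

For AES-128 (10 rounds) the model gives 4.375 cycles per byte for a latency of 7 and 2.5 for a latency of 4, close to the measured 4.49 (Haswell) and 2.71 (Skylake). The measured 2.44 on Zen is slightly below the estimate, which this simple model does not explain.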

SLIDE 42

Pipelined AES Implementation

The execution of an AESENC instruction can be overlapped with other independent instructions of the same type.

[Figure: timeline over clock cycles 1 to 24 showing w = 4 interleaved chains of AESENC instructions. A new instruction is issued at every throughput interval, while each instruction takes one latency to complete.]

The processor's pipeline improves the performance of the CBC-DEC and CTR modes.

SLIDE 44

Performance of AES-128-CBC Decryption

[Figure: bar chart of running time (cycles per byte, 0.0 to 1.4) on Haswell, Skylake and Zen for w = 1, 2, 4 and 8.]

Scheduling w = 4 AES-NI instructions improves the performance of decryption. Can we do better?

Yes! Zen has two execution units for AES-NI instructions.

µ-arch          Latency (cycles)   CBC-ENC   CBC-DEC   (cycles per byte)
Intel Haswell   7                  4.49      0.63
Intel Skylake   4                  2.71      0.62
AMD Zen         4                  2.44      0.37

SLIDE 45

Performance of AES-128-CTR Mode

[Figure: bar chart of running time (cycles per byte, 0.0 to 1.4) on Haswell, Skylake and Zen for the sequential version and for w = 2, 4 and 8.]

µ-arch          Latency (cycles)   CBC-ENC   CBC-DEC   CTR    (cycles per byte)
Intel Haswell   7                  4.49      0.63      0.74
Intel Skylake   4                  2.71      0.62      0.62
AMD Zen         4                  2.44      0.37      0.39

SLIDE 46

2.2

Hash Functions

SLIDE 47

Hash Function

A hash function maps an arbitrary-length bit string to an n-bit string:

$$h\colon \{0, 1\}^* \to \{0, 1\}^n$$

The output of a hash function is called the digest or hash value.

SLIDE 48

Cryptographic Properties

1st pre-image resistance. Given a hash value r, it should be difficult to find any message M such that r = h(M).
2nd pre-image resistance. Given an input M1, it should be difficult to find a different input M2 such that h(M1) = h(M2).
Collision resistance. It should be difficult to find two different messages M1 and M2 such that h(M1) = h(M2).

SLIDE 49


Applications of Hash Functions

There is a large number of applications of cryptographic hash functions:

Verifying the integrity of files or messages.
Password verification.
Pseudo-random number generation.
Key derivation functions.
Digital signatures.


SLIDE 50

NIST Hash Functions

1993: SHA-0, Secure Hash Algorithm, output: 160 bits.
1995: SHA-1, output: 160 bits.
2001: SHA-2, output: 224, 256, 384, 512 bits.
2015: SHA-3 (Keccak), output: 224, 256, 384, 512 bits.
2015: SHA-3 (SHAKE128, SHAKE256), output: m bits (arbitrary).

(FIPS) 180-4.

SLIDE 51

2.3

SHA2 Implementation

SLIDE 52

SHA2 Algorithm

SHA2-256 operates as follows.

Initialize state S0 with constant values.
After padding, the message is split into n 512-bit blocks: M1, ..., Mn.
For each block Mj: Sj = Update(Sj−1, Mj), for 1 ≤ j ≤ n.
The digest of M is H(M) = Sn.

Update consists of two phases:

1 Message Schedule.
2 State Update.

SLIDE 53

Update Phase 1: Message Schedule

Let w0, ..., w15 be the message block Mj split into 16 words of 32 bits. The message schedule then calculates 48 new words:

$$w_i \leftarrow \sigma_0(w_{i-15}) + \sigma_1(w_{i-2}) + w_{i-7} + w_{i-16}, \quad \text{for } 16 \le i < 64,$$

where

$$\sigma_0(x) = \mathrm{Rot}(x, 7) \oplus \mathrm{Rot}(x, 18) \oplus \mathrm{Shr}(x, 3)$$

$$\sigma_1(x) = \mathrm{Rot}(x, 17) \oplus \mathrm{Rot}(x, 19) \oplus \mathrm{Shr}(x, 10)$$

Rot is a right rotation and Shr a right shift of a 32-bit word; additions are modulo 2^32.
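The recurrence translates almost line by line into C. This is a plain scalar sketch (function names are hypothetical); it is the computation that SHA256MSG1 and SHA256MSG2 accelerate, not an optimized implementation.

```c
#include <stdint.h>

/* Right rotation of a 32-bit word by n bits, 0 < n < 32. */
static uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));
}

uint32_t sigma0(uint32_t x) { return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3); }
uint32_t sigma1(uint32_t x) { return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10); }

/* Expand w[0..15], the 16 words of one block, into w[16..63].
   Unsigned 32-bit arithmetic wraps around, giving addition modulo 2^32. */
void sha256_message_schedule(uint32_t w[64]) {
    for (int i = 16; i < 64; i++)
        w[i] = sigma0(w[i - 15]) + sigma1(w[i - 2]) + w[i - 7] + w[i - 16];
}
```

Each new word depends on w[i−2], so at most two consecutive words can be computed fully independently; this is one reason dedicated instructions help here.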

SLIDE 54

Update Phase 2: State Update

(a_0, b_0, c_0, d_0, e_0, f_0, g_0, h_0) ← S
for i ← 0 to 63 do
    T1 ← h_i ⊞ Σ1(e_i) ⊞ Ch(e_i, f_i, g_i) ⊞ k_i ⊞ w_i
    T2 ← Σ0(a_i) ⊞ Maj(a_i, b_i, c_i)
    h_{i+1} ← g_i,  g_{i+1} ← f_i
    f_{i+1} ← e_i,  e_{i+1} ← d_i ⊞ T1
    d_{i+1} ← c_i,  c_{i+1} ← b_i
    b_{i+1} ← a_i,  a_{i+1} ← T1 ⊞ T2
end for
S′ ← (a_0 ⊞ a_64, ..., h_0 ⊞ h_64)

⊞ is addition modulo 2^32.

[Figure: data flow of one iteration from (a_i, ..., h_i) to (a_{i+1}, ..., h_{i+1}), with T1 and T2 computed from the state, k_i and w_i.]
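One iteration of the loop can be written as a small C function. The definitions of Σ0, Σ1, Ch and Maj do not appear on the slide; the ones below are taken from FIPS 180-4. The function names and the array layout s = {a, b, c, d, e, f, g, h} are assumptions of this sketch.

```c
#include <stdint.h>

/* Right rotation of a 32-bit word by n bits, 0 < n < 32. */
static uint32_t rotr32(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));
}

/* Functions of FIPS 180-4 (not spelled out on the slide). */
static uint32_t big_sigma0(uint32_t x) { return rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22); }
static uint32_t big_sigma1(uint32_t x) { return rotr32(x, 6) ^ rotr32(x, 11) ^ rotr32(x, 25); }
static uint32_t ch(uint32_t e, uint32_t f, uint32_t g)  { return (e & f) ^ (~e & g); }
static uint32_t maj(uint32_t a, uint32_t b, uint32_t c) { return (a & b) ^ (a & c) ^ (b & c); }

/* One iteration of the state update on s = {a, b, c, d, e, f, g, h},
   using round constant k and schedule word w. */
void sha256_round(uint32_t s[8], uint32_t k, uint32_t w) {
    uint32_t t1 = s[7] + big_sigma1(s[4]) + ch(s[4], s[5], s[6]) + k + w;
    uint32_t t2 = big_sigma0(s[0]) + maj(s[0], s[1], s[2]);
    s[7] = s[6]; s[6] = s[5]; s[5] = s[4];   /* h, g, f shift down */
    s[4] = s[3] + t1;                        /* e = d + T1 */
    s[3] = s[2]; s[2] = s[1]; s[1] = s[0];   /* d, c, b shift down */
    s[0] = t1 + t2;                          /* a = T1 + T2 */
}
```

Only a and e receive genuinely new values in each iteration; the other six words are copies of their neighbours, so after two iterations the words c, d, g and h hold the old a, b, e and f.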

SLIDE 55

SHA New Instructions (SHA-NI)

In 2013, Intel released the specification of the SHA New Instructions (SHA-NI).

Intel's Goldmont micro-architecture has supported them since 2016.
AMD's Zen micro-architecture added support in 2017.

SHA1:

SHA1MSG1
SHA1MSG2
SHA1NEXTE
SHA1RNDS4

SHA2-256 (and SHA2-224):

SHA256MSG1
SHA256MSG2
SHA256RNDS2


SLIDE 56

Implementation of Phase 1a: Message Schedule

The SHA256MSG1 instruction performs the following operation:

$$x_i = \sigma_0(w_{i+1}) + w_i, \quad \text{for } 0 \le i < 4.$$

[Figure: xmm0 holds (w7, w6, w5, w4) and xmm1 holds (w3, w2, w1, w0); σ0 is applied lane by lane and the four sums (x3, x2, x1, x0) are written to xmm2.]

SLIDE 57

Implementation of Phase 1b: Message Schedule

The SHA256MSG2 instruction performs the following operation:

$$w_{i+16} = \sigma_1(w_{i+14}) + y_i, \quad \text{for } 0 \le i < 4.$$

Comparing with the message schedule recurrence, y_i = x_i + w_{i+9}.

[Figure: xmm0 holds (y3, y2, y1, y0) and xmm1 holds (w15, w14, w13, w12); σ1 is applied and the new words w16, w17, w18, w19 are written to xmm2.]

SLIDE 58

Implementation of Phase 2: Two Iterations

Let A_i = [a_i, b_i, e_i, f_i] and C_i = [c_i, d_i, g_i, h_i] be the state at the i-th iteration. Then it holds that:

$$C_{i+2} = A_i$$

The remaining values A_{i+2} = [a_{i+2}, b_{i+2}, e_{i+2}, f_{i+2}] are calculated by the SHA256RNDS2 instruction:

$$A_{i+2} = \mathrm{SHA256RNDS2}(C_i, A_i, X)$$

where X = [w_i + k_i, w_{i+1} + k_{i+1}].

SLIDE 60

Implementation of Phase 2: Two Iterations

[Figure: two consecutive iterations of the state update, from (a_i, ..., h_i) through (a_{i+1}, ..., h_{i+1}) to (a_{i+2}, ..., h_{i+2}), with inputs k_i, w_i, k_{i+1}, w_{i+1}. After two iterations, c_{i+2} = a_i, d_{i+2} = b_i, g_{i+2} = e_i and h_{i+2} = f_i.]

SLIDE 61

Implementation of Phase 2: Four Iterations

Using two SHA256RNDS2 instructions, one can compute four iterations of the Update function:

$$C_{i+2} = A_i$$
$$A_{i+2} = \mathrm{SHA256RNDS2}(C_i, A_i, X)$$
$$C_{i+4} = A_{i+2}$$
$$A_{i+4} = \mathrm{SHA256RNDS2}(C_{i+2}, A_{i+2}, Y)$$

where X = [w_i + k_i, w_{i+1} + k_{i+1}] and Y = [w_{i+2} + k_{i+2}, w_{i+3} + k_{i+3}].

This is equivalent to:

$$C_{i+4} = \mathrm{SHA256RNDS2}(C_i, A_i, X)$$
$$A_{i+4} = \mathrm{SHA256RNDS2}(A_i, C_{i+4}, Y)$$

SLIDE 62

Performance of SHA2-256 using SHA-NI

SHA-NI is 4-5× faster than 64-bit implementations of SHA2-256.

[Figure, first panel: running time (cycles per byte, logarithmic scale from 2^1 to 2^10) against message size (1 byte to 1 MB) for sphlib (supercop), OpenSSL and SHA-NI. Second panel: speedup (1× to 5×) against message size.]

Can we do better?

SLIDE 63

Pipelined Implementation of SHA-NI

Like AES-NI instructions, SHA-NI instructions can be pipelined.

[Figure: timeline over clock cycles 1 to 26 showing w = 4 interleaved chains of SHA256RNDS2 instructions. A new instruction is issued at every throughput interval, while each instruction takes one latency to complete.]

Target scenario: multiple hashing ⇒ hash-based signatures (PQ-Crypto).

SLIDE 64

Performance of Pipelined Implementation of SHA-NI

Example: Calculating four hashes (pipelined) is 20% faster than a sequential implementation.

[Figure: running time (cycles per byte, 1.0 to 2.5) against message size (256 bytes to 1 MB) on Zen (Ryzen 7 1800X processor) when hashing 1, 2, 4 and 8 messages.]

SLIDE 65

2.4

SHA3 Implementation

SLIDE 66

The SHA-3 Family of Functions

SHA-3 comprises four hash functions and two XOFs called SHAKE.

Function   Output size (n)   Bit-rate (r)   Security level
SHA3-224   224               1,152          112
SHA3-256   256               1,088          128
SHA3-384   384               832            192
SHA3-512   512               576            256
SHAKE128   n                 1,344          min(n/2, 128)
SHAKE256   n                 1,088          min(n/2, 256)

The input of a SHA-3 function is split into blocks of r bits. The larger the bit-rate, the faster the execution.

SLIDE 67

Extendable-Output Function

An extendable-output function (XOF) maps an arbitrary-length bit string to a digest of variable length:

$$\mathrm{XOF}\colon \{0, 1\}^* \times \mathbb{N} \to \{0, 1\}^*, \qquad (a, n) \mapsto \mathrm{XOF}(a, n) \in \{0, 1\}^n$$

SLIDE 68

The SHA-3 Design

SHA-3 was designed using a sponge construction proposed in 2009 by Bertoni et al. It has three phases:

Initializing
Absorbing
Squeezing

SLIDE 69


Sponge Construction

Initializing: The state has 1,600 bits that are initialized to 0; then, the input is split into blocks of r bits.


SLIDE 70

Sponge Construction

Absorbing: Each block is added (XORed) to the first r bits of the state; then, the state is processed by a permutation function P.

SLIDE 71

Sponge Construction

Squeezing: After the input has been consumed, the function P is used to produce ⌊n/r⌋ output blocks of r bits, concatenated with n (mod r) bits taken from the last state.
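The three phases of the sponge construction can be sketched generically in C, with the permutation P passed in as a function pointer. This is a byte-oriented illustration under several assumptions: the names `sponge`, `permutation_fn` and `toy_permutation` are hypothetical, padding is omitted (the input is assumed to be already padded to a whole number of blocks), the rate is given in bytes and must not exceed 200, and `toy_permutation` is a stand-in for demonstration only, not the Keccak permutation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STATE_BYTES 200   /* 1,600 bits */

typedef void (*permutation_fn)(uint8_t state[STATE_BYTES]);

/* Absorb nblocks blocks of `rate` bytes, then squeeze outlen bytes. */
void sponge(permutation_fn P, size_t rate,
            const uint8_t *in, size_t nblocks,
            uint8_t *out, size_t outlen) {
    uint8_t state[STATE_BYTES];
    memset(state, 0, sizeof state);              /* Initializing */

    for (size_t b = 0; b < nblocks; b++) {       /* Absorbing */
        for (size_t i = 0; i < rate; i++)
            state[i] ^= in[b * rate + i];
        P(state);
    }

    while (outlen > 0) {                         /* Squeezing */
        size_t take = outlen < rate ? outlen : rate;
        memcpy(out, state, take);
        out += take;
        outlen -= take;
        if (outlen > 0) P(state);                /* more output needed */
    }
}

/* Toy invertible mixing used only to exercise the code. NOT Keccak. */
void toy_permutation(uint8_t state[STATE_BYTES]) {
    uint8_t first = state[0];
    for (size_t i = 0; i + 1 < STATE_BYTES; i++)
        state[i] = (uint8_t)(state[i + 1] + (uint8_t)i);
    state[STATE_BYTES - 1] = (uint8_t)(first ^ 0x5a);
}
```

Because output is read block by block from successive states, a short output is always a prefix of a longer one for the same input, which is the property that makes the construction usable as an XOF.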

SLIDE 72

Permutation Function P

The state has 1,600 bits and is represented by a 5 × 5 matrix S; each entry of the matrix is a 64-bit word.

$$S = \begin{pmatrix} s_0 & s_1 & s_2 & s_3 & s_4 \\ s_5 & s_6 & s_7 & s_8 & s_9 \\ s_{10} & s_{11} & s_{12} & s_{13} & s_{14} \\ s_{15} & s_{16} & s_{17} & s_{18} & s_{19} \\ s_{20} & s_{21} & s_{22} & s_{23} & s_{24} \end{pmatrix}; \qquad S[x, y] = s_{5x+y} \text{ for } 0 \le x, y < 5.$$

The permutation P consists of 24 rounds applying the transformations:

θ → ρ → π → χ → ι (repeated 24 times)

SLIDE 73

Using 256-bit instructions

The SHA-3 state is stored in seven 256-bit registers.

Y0: s0  s1  s2  s3
Y1: s5  s6  s7  s8
Y2: s10 s11 s12 s13
Y3: s15 s16 s17 s18
Y4: s20 s21 s22 s23
Y5: s24 s24 s24 s24
Y6: s4  s9  s14 s19

Yi: 256-bit vector registers.
si: 64-bit words.

Pros:
It uses just a few 256-bit vector registers.

Cons:
The permutation instructions of AVX2 are expensive.

SLIDE 74

Using 128-bit instructions

State representation.

X0:  s0  s1
X1:  s2  s3
X2:  s5  s6
X3:  s7  s8
X4:  s4  s9
X5:  s10 s11
X6:  s12 s13
X7:  s15 s16
X8:  s17 s18
X9:  s14 s19
X10: s20 s21
X11: s22 s23
X12: s24 s24

Xi: 128-bit vector registers.
The state uses 13 variables of 128 bits.

Pros:
The permutation instructions of SSE4 are cheaper than those of AVX2.

Cons:
It uses more variables.

SLIDE 75

4-way implementation

State representation.

$$Y_j = [\, s_j^{1},\ s_j^{2},\ s_j^{3},\ s_j^{4} \,], \quad \text{for } 0 \le j \le 24,$$

where s_j^m denotes the word s_j of the m-th state.

Yi: 256-bit vector registers.
The state uses 25 variables of 256 bits.

Pros:
No 64-bit permutations are needed.

Cons:
It uses many variables and the processor has only 16 registers.

SLIDE 76

Performance of SHA3-128 Function

Cycles per byte taken for hashing a message of 4096 bytes.

[Figure: bar chart of running time (cycles per byte, 3 to 18) on Haswell, Skylake and Zen for the implementations x64, x64shld, AVX2, generic64, 2M-SSE and 4M-AVX2.]

Measurements were taken using the official Keccak Code Package.

SLIDE 77

SHA3 Parallel Hashing: Two and Four Messages

(1M) 64-bit native instructions.
(2M) 128-bit vector instructions [SSE2/AVX].
(4M) 256-bit vector instructions [AVX2].

[Figure: speedup (1 to 4) against number of messages (1 to 4) on Haswell, Skylake and Zen.]

The performance of Zen does not scale well when hashing 4 messages.

SLIDE 78

Section 3

Elliptic Curve Cryptography

SLIDE 79

3.1

Elliptic Curves

SLIDE 80


ECC: Software Implementation

Introduction
Point Multiplication kP
Elliptic Curve Diffie-Hellman (X25519, X448)
Digital Signature (EdDSA)
Performance (vector instructions on Intel Haswell/Skylake)


SLIDE 81

Elliptic Curve Cryptography (ECC)

In 1985, Koblitz [8] and Miller [9] independently suggested the use of elliptic curves for cryptographic purposes.

ECC achieves the same security as RSA-based protocols using shorter key sizes. For example, at the 128-bit security level:

RSA uses keys of 3,072 bits.
ECC uses keys of 256 bits.

Applications of ECC:

Key-agreement protocols.
Digital signatures.
Bitcoin.
End-to-end encryption.
Smart card security.

SLIDE 82

Mathematical Aspects of Elliptic Curves

An elliptic curve is defined by the following equation:

$$E/\mathbb{F}_p : y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6$$

where a_1, a_2, a_3, a_4, a_6 ∈ F_p and p is a prime number.

The points of an elliptic curve form a commutative group, with O as identity:

$$(E, +) = \{(x, y) \in E\} \cup \{O\}$$

The addition of two different points (x_3, y_3) = (x_1, y_1) + (x_2, y_2) is calculated as follows (for the short form of the equation, with a_1 = a_2 = a_3 = 0):

$$x_3 = \left( \frac{y_2 - y_1}{x_2 - x_1} \right)^2 - x_1 - x_2$$

$$y_3 = \left( \frac{y_2 - y_1}{x_2 - x_1} \right) (x_1 - x_3) - y_1$$
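The addition formulas can be checked numerically on a small curve. The sketch below works in affine coordinates over a small prime field and assumes the short form y^2 = x^3 + a4·x + a6, two distinct input points with x1 ≠ x2, coordinates already reduced modulo p, and p small enough that products fit in 64 bits. The names `point`, `mod_pow`, `mod_inv` and `ec_add` are hypothetical; real implementations use large primes, projective coordinates and constant-time arithmetic.

```c
#include <stdint.h>

/* base^exp mod p, for small p (products must fit in 64 bits). */
static uint64_t mod_pow(uint64_t base, uint64_t exp, uint64_t p) {
    uint64_t result = 1;
    base %= p;
    while (exp > 0) {
        if (exp & 1) result = result * base % p;
        base = base * base % p;
        exp >>= 1;
    }
    return result;
}

/* Inverse modulo a prime p via Fermat's little theorem: a^(p-2) mod p. */
static uint64_t mod_inv(uint64_t a, uint64_t p) {
    return mod_pow(a, p - 2, p);
}

typedef struct { uint64_t x, y; } point;

/* Add two affine points with different x-coordinates. */
point ec_add(point p1, point p2, uint64_t p) {
    uint64_t num = (p2.y + p - p1.y) % p;            /* y2 - y1 */
    uint64_t den = (p2.x + p - p1.x) % p;            /* x2 - x1 */
    uint64_t lambda = num * mod_inv(den, p) % p;     /* slope of the line */
    point r;
    r.x = (lambda * lambda + 2 * p - p1.x - p2.x) % p;
    r.y = (lambda * ((p1.x + p - r.x) % p) + p - p1.y) % p;
    return r;
}
```

As an illustration (the curve is an assumption of this example), on y^2 = x^3 + 2x + 2 over F_17 the points (5, 1) and (6, 3) give a slope of 2 and the sum (10, 6), which again satisfies the curve equation.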

SLIDE 87

Elliptic Curve Cryptography Elliptic Curves

Point Addition

Let P and Q be two points on the curve. Then P + Q can be computed using a geometric construction:

Trace a line passing through P and Q.
This line intersects the curve at a third point R.
Trace a vertical line passing through R.
The point where this vertical line intersects the curve again is defined as the sum P + Q.

SLIDE 92

Point Doubling

The addition of a point P with itself can be computed as follows:

Trace the line tangent to the curve at point P.
This line intersects the curve at a point R.
Trace a vertical line passing through R.
The point where this vertical line intersects the curve again is defined as 2P.

SLIDE 95

Point Multiplication kP

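Given an integer $k$ and a point $P \in E$, point multiplication is defined as:

$$kP = \underbrace{P + P + \cdots + P}_{k \text{ times}}$$

For example:

$$15P = (1111)_2 P = (2^3 + 2^2 + 2^1 + 1)P = 2^3P + 2^2P + 2^1P + P$$

In general, for $k = (k_{n-1}, \ldots, k_1, k_0)_2$:

$$kP = k_{n-1}2^{n-1}P + \cdots + k_1 2P + k_0 P$$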

SLIDE 96

Point Multiplication: Double-and-Add algorithm

Input: $P \in E$ and $k \in \mathbb{Z}^+$.
Output: $kP$

$(k_{n-1}, \ldots, k_1, k_0)_2 \leftarrow k$
$Q \leftarrow \mathcal{O}$
for $i \leftarrow n - 1$ to $0$ do
    $Q \leftarrow 2Q$
    $Q \leftarrow Q + k_i P$
end for
return $Q$
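The double-and-add loop can be sketched in C on a toy curve. Everything below is an assumption made for illustration and is not taken from the slides: the curve $y^2 = x^3 + 2x + 3$ over $\mathbb{F}_{97}$, the affine `point` type, and the helper names `fp_red`, `fp_inv`, `pt_add`, `pt_eq`, and `scalar_mul`. The parameters are far too small to be secure, and the code branches on the scalar bits, so it is not protected against side channels.

```c
#include <assert.h>
#include <stdint.h>

/* Toy parameters, assumed for illustration only (far too small to be
 * secure): the curve y^2 = x^3 + 2x + 3 over F_97. */
#define P_MOD 97
#define A_COEF 2

typedef struct { int64_t x, y; int inf; } point;   /* inf != 0 encodes O */

/* Reduce v into [0, P_MOD), including negative v. */
static int64_t fp_red(int64_t v) { v %= P_MOD; return v < 0 ? v + P_MOD : v; }

/* Modular inverse via Fermat's little theorem: v^(p-2) mod p. */
static int64_t fp_inv(int64_t v) {
    int64_t r = 1, b = fp_red(v);
    assert(b != 0);
    for (int e = P_MOD - 2; e > 0; e >>= 1) {
        if (e & 1) r = r * b % P_MOD;
        b = b * b % P_MOD;
    }
    return r;
}

/* Affine point addition; equal inputs use the tangent slope (doubling). */
static point pt_add(point p, point q) {
    if (p.inf) return q;
    if (q.inf) return p;
    if (p.x == q.x && fp_red(p.y + q.y) == 0)     /* q = -p, so p + q = O */
        return (point){0, 0, 1};
    int64_t lam = (p.x == q.x)
        ? fp_red((3 * p.x * p.x + A_COEF) * fp_inv(2 * p.y))
        : fp_red((q.y - p.y) * fp_inv(q.x - p.x));
    int64_t x3 = fp_red(lam * lam - p.x - q.x);
    int64_t y3 = fp_red(lam * (p.x - x3) - p.y);
    return (point){x3, y3, 0};
}

static int pt_eq(point p, point q) {
    if (p.inf || q.inf) return p.inf && q.inf;
    return p.x == q.x && p.y == q.y;
}

/* Left-to-right double-and-add over the 64 bits of k. */
static point scalar_mul(uint64_t k, point p) {
    point q = {0, 0, 1};                          /* Q <- O */
    for (int i = 63; i >= 0; i--) {
        q = pt_add(q, q);                         /* Q <- 2Q */
        if ((k >> i) & 1) q = pt_add(q, p);       /* Q <- Q + k_i P */
    }
    return q;
}
```

With the point `(3, 6)` on this toy curve, `scalar_mul(15, g)` agrees with adding `8g`, `4g`, `2g`, and `g`, which mirrors the binary expansion of 15. Leading zero bits of `k` only double the identity, so scanning all 64 bits is harmless.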

SLIDE 97

Techniques for kP

The operation kP can be performed using different techniques:

Double-and-Add Algorithm (right-to-left)
Montgomery Algorithm.
w-NAF representations.
Fixed recoding representations.
Elliptic curves with endomorphism, GLV/GLS curves.

SLIDE 98

Elliptic Curve Discrete Logarithm Problem (ECDLP)

Given two points, P and Q, the problem of finding an integer k such that Q = kP is known as the elliptic curve discrete logarithm problem.

Pollard's rho algorithm is the best known generic algorithm for solving the ECDLP. Its complexity is

$$O\left(\sqrt{\#E(\mathbb{F}_p)}\right),$$

where $\#E(\mathbb{F}_p) \approx p$ is the number of points on the curve.

For example, for an elliptic curve defined over a prime field with $p \approx 2^{256}$, about $2^{128}$ operations are required to solve the ECDLP.

SLIDE 99

The Standardized Elliptic Curves by NIST

In 1999, NIST standardized a set of elliptic curves for computing digital signatures (ECDSA) and for the key-agreement protocol (ECDH) [10].

NIST's curves have the following equation:

$$E/\mathbb{F}_p : y^2 = x^3 - 3x + b$$

Prime curves: P-256 and P-384

            P-256                                          P-384
Security    128-bit                                        192-bit
p           $2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$     $2^{384} - 2^{128} - 2^{96} + 2^{32} - 1$
b           0x5ac635d...27d2604b                           0xb3312fa...d3ec2aef
#E          $2^{256} - 2^{224} + 2^{192} - 2^{128} + t$    $2^{384} - t$
t           0xbce6faa...fc632551                           0x389cb27...333ad68d

SLIDE 100

RFC7748: Edwards/Montgomery Elliptic Curves

In January 2016, RFC 7748 recommended the use of Curve25519 and Curve448 in two elliptic curve models:

Edwards curves: $E : ax^2 + y^2 = 1 + dx^2y^2$.
Montgomery curves: $E : v^2 = u^3 + Au^2 + u$.

            Curve25519, Bernstein [1, 2]                Curve448, Hamburg [5]
Security    128-bit                                     224-bit
p           $2^{255} - 19$                              $2^{448} - 2^{224} - 1$
(a, d, A)   $(-1, -\frac{121665}{121666}, 486662)$      $(1, -39081, 156326)$
#E          $8\ell$                                     $4\ell$

The prime subgroup orders are:

Curve25519: $\ell = 2^{252} +$ 0x14def9dea2f79cd65812631a5cf5d3ed
Curve448: $\ell = 2^{446} -$ 0x8335dc163bb124b65129c96fde933d8d723a70aadc873d6d54a7bb0d

SLIDE 101

3.2

Elliptic Curve Diffie-Hellman

SLIDE 102

Elliptic Curve Cryptography Elliptic Curve Diffie-Hellman

Diffie-Hellman Protocol using Montgomery Curves

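RFC 7748 recommends the use of two functions to compute a shared secret:

X25519: keys of 32 bytes.
X448: keys of 56 bytes.

Party A                                      Party B
$a \xleftarrow{\$} \{0, 1\}^{256}$           $b \xleftarrow{\$} \{0, 1\}^{256}$
$K_A \leftarrow \mathrm{X25519}(9, a)$       $K_B \leftarrow \mathrm{X25519}(9, b)$
(A sends $K_A$ to B; B sends $K_B$ to A)
$K = \mathrm{X25519}(K_B, a)$                $K = \mathrm{X25519}(K_A, b)$

$K$ is the shared secret.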
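Both parties obtain the same $K$ because scalar multiplications commute. A short derivation, under two simplifying assumptions that are not stated on the slide: $P$ denotes the base point with $u$-coordinate 9, and $\mathrm{X25519}(u(Q), k)$ is read as the $u$-coordinate of $kQ$, with $a$ and $b$ standing for the scalars after the clamping defined in RFC 7748.

```latex
\begin{align*}
K_A &= u(aP), \qquad K_B = u(bP) \\
\mathrm{X25519}(K_B, a) &= u\bigl(a(bP)\bigr) = u\bigl((ab)P\bigr)
  = u\bigl(b(aP)\bigr) = \mathrm{X25519}(K_A, b)
\end{align*}
```

Working with $u$-coordinates alone is sufficient here because $Q$ and $-Q$ share the same $u$-coordinate, and so do $kQ$ and $k(-Q)$.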

SLIDE 103

The X Function

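Internally, X computes an elliptic curve point multiplication $kP$ using the Montgomery ladder algorithm.

Input: $P \in E$ and $k \in \mathbb{Z}^+$.
Output: $kP$

 1: $(k_{n-1} = 1, \ldots, k_0)_2 \leftarrow k$
 2: $Q_0 \leftarrow 2P$
 3: $Q_1 \leftarrow P$   (the pair starts in swapped order because the leading bit $k_{n-1}$ is 1)
 4: for $i \leftarrow n - 2$ to $0$ do
 5:     $b \leftarrow k_i \oplus k_{i+1}$
 6:     $Q_0, Q_1 \leftarrow \mathrm{cswap}(b, Q_0, Q_1)$
 7:     $Q_0, Q_1 \leftarrow 2Q_0, Q_0 + Q_1$
 8: end for
 9: $Q_0, Q_1 \leftarrow \mathrm{cswap}(k_0, Q_0, Q_1)$
10: return $Q_0$

Example: $22P$, with $22 = (10110)_2$. Each row shows the pair in unswapped order after processing bit $k_i$, starting from $(\mathcal{O}, P)$:

k_i     Q0      Q1
        O       P
1       P       2P
0       2P      3P
1       5P      6P
1       11P     12P
0       22P     23P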
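The ladder with a branch-free conditional swap can be sketched in C. This is an illustration under stated assumptions and not the X25519 implementation: it uses full affine points on a toy curve ($y^2 = x^3 + 2x + 3$ over $\mathbb{F}_{97}$, chosen here only to keep numbers small), whereas X25519 works with $u$-coordinates on a Montgomery curve. The sketch initializes the pair as $(2P, P)$, that is, already swapped for the leading 1 bit. All names (`point`, `pt_add`, `cswap`, `ladder`) are hypothetical. Only `cswap` is written without secret-dependent branches; the toy `pt_add` and the bit-length scan are not constant time.

```c
#include <assert.h>
#include <stdint.h>

#define P_MOD 97      /* toy prime, an assumption for illustration */
#define A_COEF 2      /* toy curve y^2 = x^3 + 2x + 3 over F_97     */

typedef struct { int64_t x, y, inf; } point;   /* inf = 1 encodes O */

static int64_t fp_red(int64_t v) { v %= P_MOD; return v < 0 ? v + P_MOD : v; }

/* Modular inverse via Fermat's little theorem: v^(p-2) mod p. */
static int64_t fp_inv(int64_t v) {
    int64_t r = 1, b = fp_red(v);
    assert(b != 0);
    for (int e = P_MOD - 2; e > 0; e >>= 1) {
        if (e & 1) r = r * b % P_MOD;
        b = b * b % P_MOD;
    }
    return r;
}

/* Affine addition with doubling and identity handling (not constant time). */
static point pt_add(point p, point q) {
    if (p.inf) return q;
    if (q.inf) return p;
    if (p.x == q.x && fp_red(p.y + q.y) == 0) return (point){0, 0, 1};
    int64_t lam = (p.x == q.x)
        ? fp_red((3 * p.x * p.x + A_COEF) * fp_inv(2 * p.y))
        : fp_red((q.y - p.y) * fp_inv(q.x - p.x));
    int64_t x3 = fp_red(lam * lam - p.x - q.x);
    return (point){x3, fp_red(lam * (p.x - x3) - p.y), 0};
}

static int pt_eq(point p, point q) {
    if (p.inf || q.inf) return p.inf && q.inf;
    return p.x == q.x && p.y == q.y;
}

/* Conditional swap without a branch on b: mask is all ones when b = 1
 * and all zeros when b = 0. */
static void cswap(int64_t b, point *p, point *q) {
    int64_t mask = -b, t;
    t = mask & (p->x ^ q->x);     p->x ^= t;   q->x ^= t;
    t = mask & (p->y ^ q->y);     p->y ^= t;   q->y ^= t;
    t = mask & (p->inf ^ q->inf); p->inf ^= t; q->inf ^= t;
}

/* Montgomery ladder for k >= 1, following the numbered steps. */
static point ladder(uint64_t k, point p) {
    assert(k >= 1);
    int n = 64;
    while (((k >> (n - 1)) & 1) == 0) n--;       /* n = bit length of k */
    point q0 = pt_add(p, p), q1 = p;             /* swapped pair (2P, P) */
    for (int i = n - 2; i >= 0; i--) {
        int64_t b = (int64_t)(((k >> i) ^ (k >> (i + 1))) & 1);
        cswap(b, &q0, &q1);
        point dbl = pt_add(q0, q0), sum = pt_add(q0, q1);
        q0 = dbl;                                /* Q0 <- 2 Q0    */
        q1 = sum;                                /* Q1 <- Q0 + Q1 */
    }
    cswap((int64_t)(k & 1), &q0, &q1);           /* undo the last swap state */
    return q0;
}
```

Every iteration performs exactly one doubling and one addition regardless of the key bit, which is the property that makes the ladder attractive for constant-time implementations. In X25519 the bit length is fixed by the scalar format, so the variable-length scan used here would not be needed.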

SLIDE 104

Representation of Prime Field Elements

Elements of $\mathbb{F}_p$ are split into words of size $w$:

$$a \in \mathbb{F}_p, \qquad a = \sum_{i=0}^{t-1} a_i 2^{wi} = a_0 + a_1 2^{w} + a_2 2^{2w} + \cdots, \qquad \text{where } t = \left\lceil \frac{|p|}{w} \right\rceil.$$

Let $W$ be the machine's word size. There are two cases:

$w = W$: full-radix or saturated arithmetic.
$w < W$: reduced-radix, redundant representation, unsaturated arithmetic, ...

E.g., for $p = 2^{255} - 19$ and a $W = 64$ instruction set, use an array of $t = 5$ words storing coefficients of $w = 51$ bits.
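The two representations for $p = 2^{255} - 19$ can be related with a few shifts and masks. The sketch below is a minimal illustration; the function names `to_radix51`, `from_radix51`, and `add_radix51` are hypothetical, and the little-endian word order is an assumption. It converts four saturated 64-bit words into five 51-bit limbs and back, and shows a limb-wise addition that needs no carry handling.

```c
#include <stdint.h>

#define LIMB_BITS 51
#define LIMB_MASK ((UINT64_C(1) << LIMB_BITS) - 1)

/* Full-radix (4 x 64-bit words, little-endian) to reduced-radix
 * (5 x 51-bit limbs). Bit 255 of the input is discarded. */
static void to_radix51(uint64_t r[5], const uint64_t a[4]) {
    r[0] =   a[0]                        & LIMB_MASK;   /* bits   0..50  */
    r[1] = ((a[0] >> 51) | (a[1] << 13)) & LIMB_MASK;   /* bits  51..101 */
    r[2] = ((a[1] >> 38) | (a[2] << 26)) & LIMB_MASK;   /* bits 102..152 */
    r[3] = ((a[2] >> 25) | (a[3] << 39)) & LIMB_MASK;   /* bits 153..203 */
    r[4] =  (a[3] >> 12)                 & LIMB_MASK;   /* bits 204..254 */
}

/* Inverse conversion; assumes every limb is already below 2^51. */
static void from_radix51(uint64_t a[4], const uint64_t r[5]) {
    a[0] =  r[0]        | (r[1] << 51);
    a[1] = (r[1] >> 13) | (r[2] << 38);
    a[2] = (r[2] >> 26) | (r[3] << 25);
    a[3] = (r[3] >> 39) | (r[4] << 12);
}

/* Limb-wise addition with no carry propagation: each limb has 13 spare
 * bits, so several additions can be accumulated before a carry pass. */
static void add_radix51(uint64_t r[5], const uint64_t x[5], const uint64_t y[5]) {
    for (int i = 0; i < 5; i++) r[i] = x[i] + y[i];
}
```

The spare bits are the reason the reduced-radix form suits SIMD code: the five limb additions are independent of each other, whereas a saturated addition must chain carries from word to word.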

SLIDE 105

X25519 Shared Secret Computation

Full-radix: using MULX and ADCX/ADOX reduces the running time by 11-14% relative to the fastest implementation reported in SUPERCOP.
Reduced-radix: using AVX2 yields an additional 8-10% reduction.

[Figure: bar chart of running time in 10^3 clock cycles on Haswell and Skylake for four implementations: Moon (floodyberry), x64; Tung, SAC 2015, x64+SSE2; Oliveira et al., SAC 2017, x64 (MULX/ADCX); our code, AVX2 and MULX/ADCX.]

SLIDE 106

X448 Shared Secret Computation

We reduce the timings reported by Hamburg by 13% on Haswell and by 17% on Skylake.

[Figure: bar chart of running time in 10^3 clock cycles on Haswell and Skylake for three implementations: eBACS (SUPERCOP), x64; Hamburg, x64; our code, AVX2.]

SLIDE 107

3.3

Digital Signatures

SLIDE 108

Elliptic Curve Cryptography Digital Signatures

Digital Signatures

Digital signatures are used to verify both the integrity and the authenticity of a message.
Basic operations:

Sign: given a message, an algorithm computes a bit string, called the signature, that is associated with the private key of the signer.

Verify: this step determines whether a signature is valid, i.e., whether the signature for the message was created using the private key corresponding to the referenced public key.

SLIDE 109

Signature Generation

[Diagram: the message passes through the Hash step, and the digest enters the Signing step together with the Private Key.]

The message is processed through a cryptographic hash function H to obtain a digest value.
The digest and the private key are used together to generate a signature.
Both the message and the signature must be sent together for later verification.

SLIDE 110

Signature Verification

[Diagram: the Verification step takes the Public Key and outputs Valid or Reject.]

Using the signer's public key, the verification algorithm determines whether a signature is valid.
A valid signature ensures the authenticity of the signer and the integrity of the message.

SLIDE 111

Digital Signatures

1991: PKCS#1: Rivest-Shamir-Adleman scheme (RSA).
1993: FIPS 186: Digital Signature Algorithm (DSA).
1999: ANSI X9.62: Elliptic Curve Digital Signature Algorithm (ECDSA).
2011: Bernstein et al. proposed the Edwards-curve Digital Signature Algorithm (EdDSA).
2015: EdDSA is in an IETF draft for discussion [6].
2017: EdDSA is described in RFC 8032 [7].

The use of EdDSA is increasing; for instance, OpenSSH now supports Ed25519 signatures.

SLIDE 112

3.4

EdDSA Scheme

SLIDE 113

Elliptic Curve Cryptography EdDSA Scheme

Edwards-curve Digital Signature Algorithm

This is a novel signature scheme based on Edwards curves.
RFC 8032 [7] describes the usage of two instances of EdDSA: Ed25519 and Ed448.
EdDSA produces digital signatures faster than ECDSA.
It consists of three primitive operations:
Key Generation.
Signing.
Verification.

SLIDE 114

EdDSA: Domain parameters

Public key of $b$ bits and signature size of $2b$ bits.
$E_d(\mathbb{F}_p)$, an Edwards curve over a prime field.
$\ell \cdot h = \#E_d(\mathbb{F}_p)$, the number of points on the curve.
$B \neq (0, 1)$, a generator point.
$c \in \{2, 3\}$ and $n = \log_2(\ell)$, two constants with $c \le n < b$.
$s = \mathrm{Encode}(P)$ converts a point $P = (x, y)$ into a string $s$:

$$s = (x \bmod 2) \,\|\, y$$

$(x, y) = \mathrm{Decode}(s)$ converts a string $s$ into a pair $(x, y)$:

$$y = s \bmod 2^{b-1}, \qquad x = \pm\sqrt{\frac{y^2 - 1}{dy^2 - a}} \quad \text{such that } x \equiv s_{b-1} \pmod 2.$$

$H$, a hash function producing $2b$ bits.
Example: the SHAKE128 function, which is part of the SHA-3 standard.
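The expression for $x$ in Decode follows from solving the curve equation for $x^2$. A short derivation, assuming the Edwards form $ax^2 + y^2 = 1 + dx^2y^2$ given earlier in the slides:

```latex
\begin{align*}
ax^2 + y^2 &= 1 + dx^2y^2 \\
x^2 (a - dy^2) &= 1 - y^2 \\
x^2 &= \frac{1 - y^2}{a - dy^2} = \frac{y^2 - 1}{dy^2 - a}
\end{align*}
```

For $x \neq 0$ the two square roots $x$ and $p - x$ have different parity because $p$ is odd, so the single bit $x \bmod 2$ stored in the encoding is enough to select the correct root.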

SLIDE 115

EdDSA: Key Generation

Computing the secret and public keys, $(sk, pk)$:

1: $sk \in_R [0, \ell)$
2: $h = (h_{2b-1}, \ldots, h_0)_2 \leftarrow H(sk)$
3: $a \leftarrow 2^n + \sum_{c \le i < n} 2^i h_i$   ($a$ has $n + 1$ bits, with the bottom $c$ bits cleared)
4: $pk \leftarrow aB$
5: return $(sk, pk)$
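Step 3 amounts to a few bit operations on the hash output. The sketch below assumes the Ed25519 parameters of RFC 8032 ($b = 256$, $c = 3$, $n = 254$) and a little-endian byte encoding; the function name `scalar_from_hash` is hypothetical, and computing the hash itself is left out.

```c
#include <stdint.h>
#include <string.h>

/* Assumed Ed25519 parameters from RFC 8032: b = 256, c = 3, n = 254. */

/* Derive the scalar a from the lower b bits of the hash (little-endian):
 * a = 2^n + sum of 2^i * h_i for c <= i < n. */
static void scalar_from_hash(uint8_t a[32], const uint8_t h_low[32]) {
    memcpy(a, h_low, 32);
    a[0]  &= 0xF8;   /* clear bits 0..2, the bottom c = 3 bits   */
    a[31] &= 0x7F;   /* clear bit 255, which lies above position n */
    a[31] |= 0x40;   /* set bit 254, the 2^n term                  */
}
```

Setting bit $n$ fixes the bit length of every scalar, so a ladder or fixed-window loop always runs the same number of iterations. Clearing the bottom $c$ bits makes the scalar a multiple of $2^c$, which matches the cofactor 8 of Curve25519.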

SLIDE 116

EdDSA: Signing

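Given a message $M$ and the pair of keys $(sk, pk)$, compute the signature $(R, S)$ as:

1: $h = (\underbrace{h_{2b-1}, \ldots, h_b}_{h_H}, \underbrace{h_{b-1}, \ldots, h_0}_{h_L})_2 \leftarrow H(sk)$
2: $a \leftarrow 2^n + \sum_{c \le i < n} 2^i h_i$
3: $r \leftarrow H(h_H \,\|\, M) \pmod{\ell}$
4: $R' \leftarrow rB$
5: $R \leftarrow \mathrm{Encode}(R')$
6: $S \leftarrow r + H(R \,\|\, pk \,\|\, M) \cdot a \pmod{\ell}$
7: return $(R, S)$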

SLIDE 117

EdDSA: Verification

Given a message $M$, a signature $(R, S)$, and a public key $pk$:

$P \leftarrow \mathrm{Decode}(pk)$
$h \leftarrow H(R \,\|\, pk \,\|\, M) \pmod{\ell}$
Accept the signature if all of the following hold: $P \in E_d(\mathbb{F}_p)$, $S \in [0, \ell)$, and $SB = R + hP$.
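Why a correctly generated signature passes this check can be seen by expanding $S$. A short derivation, writing $R' = rB$ for the decoded point $R$, $P = aB$ for the decoded public key, and $h = H(R \,\|\, pk \,\|\, M) \bmod \ell$, and using the fact that $B$ has order $\ell$ so that scalars can be reduced modulo $\ell$:

```latex
\begin{align*}
SB &= \bigl((r + h \cdot a) \bmod \ell\bigr) B \\
   &= rB + h\,(aB) \\
   &= R' + hP
\end{align*}
```

Here $h$ denotes the hash value used in signing and verification, not the cofactor that appears in the domain parameters.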

SLIDE 118

Optimization Techniques for EdDSA

Focus on the optimization of two main operations:

kP, when P is known.
kP + lQ, when P is known and Q is an arbitrary point.

SLIDE 119

Fixed-point mult: computing kP when P is known

Input: $k$, an $n$-bit integer; $w$, an integer window size; $P$, a fixed point of order $\ell$.
Output: $Q$, a point such that $Q = kP$.

Off-line computation:
1: Compute the look-up tables $T_i \leftarrow \{d\,2^{wi}P\}$ for odd $d \in [1, 2^{w-1}]$ and all $i \in [0, t)$.

On-line computation:
1: $t \leftarrow \lceil n/w \rceil$
2: $Q \leftarrow \mathcal{O}$
3: Let $(K_0, K_1, \ldots, K_{t-1})$ be the signed radix-$2^w$ representation of $k$.
4: for $i \leftarrow 0$ to $t - 1$ do
5:     $P \leftarrow \mathrm{Query}(T_i, K_i)$
6:     $Q \leftarrow Q + P$
7: end for
8: return $Q$

Query must be protected against side-channel attacks.
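One common way to protect a table query is to read every entry and select the wanted one with a mask, so that the memory access pattern does not depend on the secret index. The sketch below illustrates only that idea: the `entry` type with five limbs and the name `ct_query` are hypothetical, and handling a negative digit (negating the selected point) is omitted. Whether the compiled code stays branch-free depends on the compiler and should be checked on the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t limb[5]; } entry;   /* hypothetical table entry */

/* Return table[index] while touching every entry of the table. */
static entry ct_query(const entry *table, size_t len, size_t index) {
    entry out = {{0, 0, 0, 0, 0}};
    for (size_t j = 0; j < len; j++) {
        /* diff is zero only when j == index */
        uint64_t diff = (uint64_t)(j ^ index);
        /* mask is all ones when diff == 0 and all zeros otherwise */
        uint64_t mask = ((diff | (0 - diff)) >> 63) - 1;
        for (int l = 0; l < 5; l++)
            out.limb[l] |= table[j].limb[l] & mask;
    }
    return out;
}
```

The cost is a full pass over the table for each query, which is one reason the window size $w$ is a trade-off between table size and the number of point additions.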

SLIDE 120

Double-point mult: computing kP + lQ when P is known and Q is an arbitrary point

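One efficient algorithm is the interleaving method using $\omega$-NAF.

Obtain the $\omega$-NAF of $k$ and $l$: $\{k_i\} \leftarrow k$ and $\{l_i\} \leftarrow l$.
There exists a pair $(\omega_k, \omega_l)$ that minimizes the number of operations.
Precompute $T_d = dP$ for odd $d \in [1, 2^{\omega_k - 1}]$.
Compute $U_d = dQ$ for odd $d \in [1, 2^{\omega_l - 1}]$.

$R \leftarrow \mathcal{O}$
for $i \leftarrow n - 1$ to $0$ do
    $R \leftarrow 2R$
    if $k_i \neq 0$ then $R \leftarrow R + T_{k_i}$
    if $l_i \neq 0$ then $R \leftarrow R + U_{l_i}$
end for

For a negative digit, $T_{-d} = -T_d$ and $U_{-d} = -U_d$.
$R$ is the required point $kP + lQ$.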
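The $\omega$-NAF digits consumed by the interleaving loop can be produced with a short recoding routine. The sketch below is a minimal version for small scalars; the names `wnaf` and `wnaf_value` are hypothetical, and the limit $k < 2^{62}$ is an assumption that keeps the 64-bit arithmetic free of overflow. The recoding branches on the scalar, which is acceptable only when the scalars are public, as they are in signature verification.

```c
#include <assert.h>
#include <stdint.h>

/* Compute the width-w NAF of k, least significant digit first.
 * Every digit is zero or odd with |digit| < 2^(w-1).
 * Returns the number of digits written. */
static int wnaf(int8_t digits[65], uint64_t k, int w) {
    assert(w >= 2 && w <= 8);
    assert(k < (UINT64_C(1) << 62));
    int n = 0;
    while (k != 0) {
        int64_t d = 0;
        if (k & 1) {
            d = (int64_t)(k & ((1u << w) - 1));        /* k mod 2^w */
            if (d >= (1 << (w - 1))) d -= (1 << w);    /* make the digit signed */
            k -= (uint64_t)d;                          /* now k is divisible by 2^w */
        }
        digits[n++] = (int8_t)d;
        k >>= 1;
    }
    return n;
}

/* Recombine the digits: the sum of digits[i] * 2^i. */
static int64_t wnaf_value(const int8_t *digits, int n) {
    int64_t v = 0;
    for (int i = n - 1; i >= 0; i--) v = 2 * v + digits[i];
    return v;
}
```

For example, with $w = 2$ the scalar 7 is recoded as the digits $(-1, 0, 0, 1)$ from least to most significant, that is, $7 = 8 - 1$. After each nonzero digit the next $w - 1$ digits are zero, which is what reduces the number of additions in the main loop.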

SLIDE 121

Improvements on Ed25519 Signature Generation

The synergy between AVX2, MULX, and ADCX/ADOX instructions increases the performance of the signing operation.

[Figure: bar chart of running time in 10^3 clock cycles on Haswell and Skylake for five implementations: Moon (floodyberry), SSE2, 24 KB; Moon (floodyberry), x64, 24 KB; Schwabe (SUPERCOP), x64+SSE2, 30 KB; our code, AVX2 and MULX/ADCX, 12 KB; our code, AVX2 and MULX/ADCX, 24 KB.]

SLIDE 122

Improvements on Ed448 Signature Generation

Running time was reduced by around 16-18% on the Haswell and Skylake platforms.

[Figure: bar chart of running time in 10^3 clock cycles on Haswell and Skylake for three implementations: SUPERCOP, x64; Hamburg, x64; our code, AVX2.]

SLIDE 123

Thanks for your attention!

jlopez@ic.unicamp.br

SLIDE 124

References

[1] Daniel J. Bernstein. Curve25519: New Diffie-Hellman Speed Records. In Moti Yung, Yevgeniy Dodis, Aggelos Kiayias, and Tal Malkin, editors, Public Key Cryptography, volume 3958 of Lecture Notes in Computer Science, pages 207–228. Springer, 2006.
[2] Daniel J. Bernstein, Niels Duif, Tanja Lange, Peter Schwabe, and Bo-Yin Yang. High-speed high-security signatures. Journal of Cryptographic Engineering, 2(2):77–89, 2012.
[3] Joppe W. Bos, J. Alex Halderman, Nadia Heninger, Jonathan Moore, Michael Naehrig, and Eric Wustrow. Elliptic Curve Cryptography in Practice. In Nicolas Christin and Reihaneh Safavi-Naini, editors, Financial Cryptography and Data Security: 18th International Conference, FC 2014, Christ Church, Barbados, March 3-7, 2014, Revised Selected Papers, pages 157–175, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg.
[4] Intel Corporation. Intel Instruction Set Architecture Extensions. Available at https://software.intel.com/en-us/intel-isa-extensions, July 2013.
[5] Mike Hamburg. Ed448-Goldilocks, a new elliptic curve. Cryptology ePrint Archive, Report 2015/625, 2015. http://eprint.iacr.org/.
[6] Simon Josefsson and Niels Moeller. EdDSA and Ed25519, draft-josefsson-eddsa-ed25519-03. Available at https://tools.ietf.org/html/draft-josefsson-eddsa-ed25519-03, May 2015.
[7] Simon Josefsson and Ilari Liusvaara. Edwards-Curve Digital Signature Algorithm (EdDSA). RFC 8032, January 2017.
[8] Neal Koblitz. Elliptic Curve Cryptosystems. Mathematics of Computation, 48(177):203–209, January 1987.
[9] Victor S. Miller. Use of Elliptic Curves in Cryptography. In Hugh C. Williams, editor, Advances in Cryptology - CRYPTO '85 Proceedings, volume 218 of Lecture Notes in Computer Science, pages 417–426. Springer Berlin Heidelberg, 1986.
[10] National Institute of Standards and Technology. Digital Signature Standard (DSS). http://csrc.nist.gov/publications/fips/archive/fips186-2/fips186-2.pdf, January 2000.