SLIDE 1

Use of the AES instruction set

(ECRYPT II AES Day - Bruges, Belgium)

Ryad BENADJILA

Agence Nationale de la Sécurité des Systèmes d’Information

18 October 2012

SLIDE 2

1 AES-NI

a. Overview
b. Chronology

2 Instructions detail

a. xmm and SSE
b. Encrypt
c. Decrypt
d. Key Schedule

3 Performance

a. Latency/throughput
b. Core™ and µops
c. Results
d. GCM

4 Beyond AES

a. Rijndael
b. Building blocks
c. SHA-3

5 Conclusion

Use of the AES instruction set - 18 October 2012

SLIDE 5

AES-NI

AES-NI stands for “AES New Instructions”. Introduced by Intel as:

◮ A hardware-accelerated implementation of AES subparts
◮ Ways of implementing efficient versions of the algorithm with constant-time operations, offering a mitigation against timing side-channel attacks (especially cache-based attacks)

Access to the instructions from the userland (Ring 3) level

Six new instructions over the previous SSE4 set:

◮ 4 for encryption and decryption: aesenc, aesdec, aesenclast and aesdeclast
◮ 2 for the Key Schedule: aeskeygenassist and aesimc

Plus a companion carry-less multiplication instruction, clmul

SLIDE 9

Timeline, 2005 to 2013 (events in chronological order):

◮ “Cache-timing attacks on AES”: Bernstein’s remote cache attacks on table-based AES implementations
◮ “Cache Attacks and Countermeasures: The Case of AES”: Osvik, Shamir and Tromer, local cache-timing attacks
◮ Software countermeasures with a performance impact: table-based implementations using masking
◮ Bitslice implementations: interesting, but efficiency is related to block size. Matsui and Nakajima: ∼10 c/B for 2 KB data blocks on Core 2 CPU
◮ Intel’s proposal for AES-NI instructions: White Paper by Gueron
◮ New speed records for table-based AES implementations: Bernstein and Schwabe, ∼10 c/B on Core 2 (using a custom high-level assembler)
◮ Bitslice implementations, very efficient SSSE3 implementation: Käsper and Schwabe, ∼7 c/B on Core 2 and Nehalem Core i7 by using the new SSSE3 instructions, for 128-byte data blocks
◮ Intel’s first generation CPUs with AES-NI (and clmul): Westmere microarchitecture (not all CPUs concerned)
◮ Intel’s second generation CPUs with AES-NI, with AVX support: Sandy Bridge microarchitecture (almost all CPUs)
◮ AMD’s first generation CPUs with AES-NI (with clmul and AVX support): Bulldozer microarchitecture (all CPUs)
◮ Intel’s third generation CPUs with AES-NI: Ivy Bridge microarchitecture (almost all CPUs)

SLIDE 10

xmm and ymm registers

xmm are 128-bit registers:

◮ 8 in 32-bit mode: xmm0 to xmm7

[Figure: an xmm register is 128 bits wide, i.e. 16 bytes numbered 0 to 15]

SLIDE 11

xmm and ymm registers

xmm are 128-bit registers:

◮ 16 in 64-bit mode: xmm0 to xmm15

[Figure: an xmm register is 128 bits wide, i.e. 16 bytes numbered 0 to 15]

SLIDE 12

xmm and ymm registers

ymm are 256-bit registers (only in 64-bit mode):

◮ xmm extended to 256 bits with the new AVX extensions: ymm0 to ymm15

[Figure: a ymm register is 256 bits wide, i.e. 32 bytes numbered 0 to 31; xmm = low part of ymm]

SLIDE 13

Some useful SSE instructions

SSE = Streaming SIMD (Single Instruction Multiple Data) Extensions

Intel:

SSE (Pentium III), SSE2 (Pentium 4), SSE3 (Pentium 4 Prescott), SSSE3 (Core 2), SSE4.1 (Core 2 Penryn), SSE4.2 (Nehalem), AES-NI and clmul (Westmere), AVX (Sandy Bridge)

AMD:

K6 (3D Now!), AMD64, Athlon 64 Venice, Bobcat, Barcelona (SSE4a), Bulldozer (AES-NI, clmul, AVX, and FMA, XOP, CVT16 derived from SSE5)

SLIDE 14

Some useful SSE instructions

SSE instructions work on bytes, 16-bit shorts, 32-bit double words, 64-bit quad words and full 128-bit xmm words

Moving memory data to and from an xmm register:

xmm1 ← [mem128]
or
[mem128] ← xmm2
movdqu xmm1/[mem128], [mem128]/xmm2

“Xoring” two registers or a register and memory:

xmm1 ← xmm1 ⊕ (xmm2/[mem128])
pxor xmm1, xmm2/[mem128]

SLIDE 15

Some useful SSE instructions

Packed Shuffle Bytes: byte-wise shuffling in xmm according to a mask in xmm (SSSE3)

Tmp ← xmm1
for(i=0; i<16; i++){
    if(xmm2[i] & 0x80) xmm1[i] ← 0            /* bit 7 of the mask byte set: result byte zeroed */
    else               xmm1[i] ← Tmp[xmm2[i] & 0x0f] /* otherwise the low 4 bits select the source byte */
}
pshufb xmm1, xmm2/[mem128]

[Figure: example of a byte shuffle of xmm1 (bytes 0 to 15) by the mask in xmm2/[mem128]]
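The byte shuffle above can be checked with a small software model. This is only a sketch in Python (not the hardware instruction): the function name `pshufb` and the list-of-16-bytes representation, with index 0 as the lowest byte, are assumptions made for illustration.

```python
def pshufb(dst, mask):
    """Model of pshufb on two lists of 16 byte values (index 0 = lowest byte)."""
    out = []
    for m in mask:
        if m & 0x80:
            # Bit 7 of the control byte is set: the result byte is zeroed.
            out.append(0)
        else:
            # Otherwise only the low 4 bits are used as a source index.
            out.append(dst[m & 0x0F])
    return out


state = list(range(0x10, 0x20))      # bytes 0x10 .. 0x1f
reverse = list(range(15, -1, -1))    # mask that reverses the byte order
print([hex(b) for b in pshufb(state, reverse)])
```

Because the mask is data, the same instruction can implement any fixed byte permutation, which is what makes it usable for ShiftRows-like operations later in the talk.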

SLIDE 16

Some useful SSE instructions

Packed Shuffle Double words: shuffling in xmm according to an immediate bitmask (SSE2)

for(i=0; i<4; i++){
    (double word)xmm1[i] ← (double word)xmm2[(imm8>>(2*i)) & 0x3]
}
pshufd xmm1, xmm2/[mem128], imm8

[Figure: example with imm8 = 0xc0: double words 0, 1 and 2 of xmm1 receive double word 0 of the source, double word 3 receives double word 3]
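As a quick check of the selector arithmetic, here is a minimal Python model of the double-word shuffle. The function name `pshufd` and the list representation (index 0 = lowest double word) are illustrative assumptions, not an API.

```python
def pshufd(src, imm8):
    """Model of pshufd: src is a list of 4 double words (index 0 = lowest).

    Result double word i is the source double word selected by
    bits [2*i+1 : 2*i] of the immediate.
    """
    return [src[(imm8 >> (2 * i)) & 0x3] for i in range(4)]


words = ["d0", "d1", "d2", "d3"]
print(pshufd(words, 0xC0))  # the imm8 value used on the slide
print(pshufd(words, 0xFF))  # broadcast of the highest double word
```

The 0xff broadcast is the form used by the AES-NI Key Schedule code later in the deck.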

SLIDE 17

Some useful SSE instructions

Blending the 16-bit words of two xmm registers according to an immediate mask (SSE4.1):

for(i=0; i<8; i++){
    if(((imm8>>i) & 0x1) == 1){
        (short word)xmm1[i] ← (short word)xmm2[i]
    }
}
pblendw xmm1, xmm2/[mem128], imm8
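The blend can be modelled in a few lines. This Python sketch assumes registers are represented as lists of eight 16-bit words (index 0 = lowest); the name `pblendw` is reused only for readability.

```python
def pblendw(dst, src, imm8):
    """Model of pblendw: bit i of imm8 selects word i from src, else dst is kept."""
    return [src[i] if (imm8 >> i) & 1 else dst[i] for i in range(8)]


a = [0] * 8
b = [1, 2, 3, 4, 5, 6, 7, 8]
print(pblendw(a, b, 0x0F))  # low four words taken from b
print(pblendw(a, b, 0xAA))  # odd-indexed words taken from b
```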

SLIDE 20

AVX extensions

AVX extensions use the VEX prefix that is not compatible with 32-bit mode

Two main advantages over previous SSE:

◮ Twice as much data in ymm, which means twice as much “vectorization”
◮ New AVX extensions allow most compatible instructions to use 3 operands ⇒ non-destructive operations

Legacy pxor:

xmm1 ← xmm1 ⊕ (xmm2/[mem128])
pxor xmm1, xmm2/[mem128]

AVX extended vpxor:

xmm1 ← xmm2 ⊕ (xmm3/[mem128])
vpxor xmm1, xmm2, xmm3/[mem128]

However:

◮ Not all legacy instructions with VEX extension benefit from ymm (e.g. vpshufd does, vpxor doesn’t) ⇒ the ymm high part is then zeroed
◮ Possible latencies when switching between AVX and legacy SSE

SLIDE 21

AES-NI encryption instructions

aesenc for rounds:

Tmp ← xmm1
Tmp ← SubBytes(Tmp)
Tmp ← ShiftRows(Tmp)
Tmp ← MixColumns(Tmp)
xmm1 ← Tmp ⊕ xmm2/[mem128]
aesenc xmm1, xmm2/[mem128]

[Figure: aesenc xmm1, xmm2/[mem128] on the 16-byte state in xmm1: SubBytes (S-box S(.)), ShiftRows (rows rotated by 0, 1, 2 and 3 bytes), MixColumns (multiplication by the matrix with rows (2 3 1 1), (1 2 3 1), (1 1 2 3), (3 1 1 2)), then AddRoundKey with xmm2/[mem128]]

SLIDE 22

AES-NI encryption instructions

aesenclast for the last round:

Tmp ← xmm1
Tmp ← SubBytes(Tmp)
Tmp ← ShiftRows(Tmp)
xmm1 ← Tmp ⊕ xmm2/[mem128]
aesenclast xmm1, xmm2/[mem128]

[Figure: aesenclast xmm1, xmm2/[mem128] on the 16-byte state in xmm1: SubBytes (S-box S(.)), ShiftRows (rows rotated by 0, 1, 2 and 3 bytes), then AddRoundKey with xmm2/[mem128]; no MixColumns]

SLIDE 23

Block encryption

AES128 (naive) encryption of one plaintext block:

xmm0 ← plaintext
xmm1-xmm11 ← scheduled keys
pxor xmm0, xmm1        /* Round 0 (whitening) */
aesenc xmm0, xmm2      /* Round 1 */
aesenc xmm0, xmm3      /* Round 2 */
aesenc xmm0, xmm4      /* Round 3 */
aesenc xmm0, xmm5      /* Round 4 */
aesenc xmm0, xmm6      /* Round 5 */
aesenc xmm0, xmm7      /* Round 6 */
aesenc xmm0, xmm8      /* Round 7 */
aesenc xmm0, xmm9      /* Round 8 */
aesenc xmm0, xmm10     /* Round 9 */
aesenclast xmm0, xmm11 /* Round 10 */
AES128 Encryption (128-bit block)
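The instruction sequence above can be mirrored by a software model. The following Python sketch is an illustration only: `aesenc` and `aesenclast` here are plain functions on lists of 16 bytes (column-major state, index 0 = lowest byte), the S-box is computed rather than tabulated, and `expand_key_128` is a hypothetical helper following the standard AES-128 key schedule. It is neither constant-time nor fast.

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def make_sbox():
    """Build the AES S-box: multiplicative inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)
        s = inv
        for k in range(1, 5):
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = make_sbox()


def sub_shift(state):
    """SubBytes and ShiftRows (they commute) on a column-major 16-byte state."""
    shifted = [state[(i + 4 * (i % 4)) % 16] for i in range(16)]
    return [SBOX[b] for b in shifted]


def mix_columns(s):
    out = []
    for c in range(0, 16, 4):
        a = s[c:c + 4]
        out += [gmul(2, a[r]) ^ gmul(3, a[(r + 1) % 4]) ^ a[(r + 2) % 4] ^ a[(r + 3) % 4]
                for r in range(4)]
    return out


def aesenc(state, rk):
    return [a ^ b for a, b in zip(mix_columns(sub_shift(state)), rk)]


def aesenclast(state, rk):
    return [a ^ b for a, b in zip(sub_shift(state), rk)]


def expand_key_128(key):
    """Standard AES-128 key schedule; returns 11 round keys of 16 bytes each."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        t = list(w[i - 1])
        if i % 4 == 0:
            t = [SBOX[b] for b in t[1:] + t[:1]]
            t[0] ^= rcon
            rcon = gmul(rcon, 2)
        w.append([a ^ b for a, b in zip(w[i - 4], t)])
    return [sum(w[4 * r:4 * r + 4], []) for r in range(11)]


def encrypt_block(plaintext, round_keys):
    state = [a ^ b for a, b in zip(plaintext, round_keys[0])]  # pxor (whitening)
    for rk in round_keys[1:10]:                                # 9 x aesenc
        state = aesenc(state, rk)
    return aesenclast(state, round_keys[10])                   # aesenclast


key = bytes(range(16))
pt = bytes.fromhex("00112233445566778899aabbccddeeff")
print(bytes(encrypt_block(list(pt), expand_key_128(key))).hex())
```

The key and plaintext used here are the AES-128 example vector from FIPS-197, so the printed value can be compared against the published ciphertext.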

SLIDE 24

AES-NI decryption instructions

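AES-NI implements the equivalent inverse cipher for decryption

[Figure: Encryption (Plaintext to Ciphertext): first round key addition, then Nr-1 rounds of SubBytes, ShiftRows, MixColumns and RK, then a last round of SubBytes and ShiftRows with the last key. Decryption (Ciphertext to Plaintext): first round key addition, then Nr-1 rounds of InvSubBytes, InvShiftRows, InvMixColumns and InvMixColumns(RK), then a last round of InvSubBytes and InvShiftRows with the last key]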

SLIDE 25

AES-NI decryption instructions

aesdec:

Tmp ← xmm1
Tmp ← SubBytes^-1(Tmp)
Tmp ← ShiftRows^-1(Tmp)
Tmp ← MixColumns^-1(Tmp)
xmm1 ← Tmp ⊕ xmm2/[mem128]
aesdec xmm1, xmm2/[mem128]

aesdeclast:

Tmp ← xmm1
Tmp ← SubBytes^-1(Tmp)
Tmp ← ShiftRows^-1(Tmp)
xmm1 ← Tmp ⊕ xmm2/[mem128]
aesdeclast xmm1, xmm2/[mem128]

We feed aesdec with the equivalent inverse cipher keys

SLIDE 26

Block decryption

AES128 (naive) decryption of one ciphertext block:

xmm0 ← ciphertext
xmm1-xmm11 ← scheduled keys (inverse cipher)
pxor xmm0, xmm1        /* Round 0 (whitening) */
aesdec xmm0, xmm2      /* Round 1 */
aesdec xmm0, xmm3      /* Round 2 */
aesdec xmm0, xmm4      /* Round 3 */
aesdec xmm0, xmm5      /* Round 4 */
aesdec xmm0, xmm6      /* Round 5 */
aesdec xmm0, xmm7      /* Round 6 */
aesdec xmm0, xmm8      /* Round 7 */
aesdec xmm0, xmm9      /* Round 8 */
aesdec xmm0, xmm10     /* Round 9 */
aesdeclast xmm0, xmm11 /* Round 10 */
AES128 Decryption (128-bit block)

SLIDE 27

Rijndael Key Schedule

Key Schedule for AES128 and AES192

KeyExpansion(byte Key[4*Nk], word W[Nb*(Nr+1)])
{
    /* AES128 => (Nk=4, Nr=10, Nb=4) */
    for(i = 0; i < Nk; i++)
        W[i] = (Key[4*i], Key[4*i+1], Key[4*i+2], Key[4*i+3]);
    for(i = Nk; i < Nb * (Nr + 1); i++) {
        temp = W[i - 1];
        if (i % Nk == 0)
            temp = SubByte(RotByte(temp)) ^ Rcon[i / Nk];
        W[i] = W[i - Nk] ^ temp;
    }
}
Rijndael Key Schedule (Nk ≤ 6, i.e. AES128 and AES192)

[Figure: AES128: KEY0 = W[0] to W[3], KEY1 = W[4] to W[7], ...; each W[i] is 4 bytes; S(rot(.))+Rcon is applied to the last word of a key when computing the first word of the next key]

SLIDE 28

Rijndael Key Schedule

Key Schedule for AES128 and AES192

KeyExpansion(byte Key[4*Nk], word W[Nb*(Nr+1)])
{
    /* AES192 => (Nk=6, Nr=12, Nb=4) */
    for(i = 0; i < Nk; i++)
        W[i] = (Key[4*i], Key[4*i+1], Key[4*i+2], Key[4*i+3]);
    for(i = Nk; i < Nb * (Nr + 1); i++) {
        temp = W[i - 1];
        if (i % Nk == 0)
            temp = SubByte(RotByte(temp)) ^ Rcon[i / Nk];
        W[i] = W[i - Nk] ^ temp;
    }
}
Rijndael Key Schedule (Nk ≤ 6, i.e. AES128 and AES192)

[Figure: AES192: KEY0 = W[0] to W[3], KEY1 = W[4] to W[7], ...; each W[i] is 4 bytes; S(rot(.))+Rcon is applied every 6 words, i.e. to W[5] when computing W[6]]

SLIDE 29

Rijndael Key Schedule

Key Schedule for AES256

KeyExpansion(byte Key[4*Nk], word W[Nb*(Nr+1)])
{
    /* AES256 => (Nk=8, Nr=14, Nb=4) */
    for(i = 0; i < Nk; i++)
        W[i] = (Key[4*i], Key[4*i+1], Key[4*i+2], Key[4*i+3]);
    for(i = Nk; i < Nb * (Nr + 1); i++) {
        temp = W[i - 1];
        if (i % Nk == 0)
            temp = SubByte(RotByte(temp)) ^ Rcon[i / Nk];
        else if (i % Nk == 4)
            temp = SubByte(temp);
        W[i] = W[i - Nk] ^ temp;
    }
}
Rijndael Key Schedule (Nk > 6, i.e. AES256)

[Figure: AES256: KEY0 = W[0] to W[3], KEY1 = W[4] to W[7], ...; S(rot(.))+Rcon is applied to W[7] when computing W[8], and S(.) alone to W[11] when computing W[12]]
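The three KeyExpansion variants above differ only in Nk, so they can be folded into one routine. The sketch below is a Python transcription under stated assumptions: words are 4-byte lists, Nr is derived as Nk + 6, the S-box is computed on the fly, and the function name `key_expansion` is illustrative.

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def make_sbox():
    """Build the AES S-box: multiplicative inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)
        s = inv
        for k in range(1, 5):
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = make_sbox()


def key_expansion(key):
    """Return the expanded key as a list of words (hex strings) for Nk = 4, 6 or 8."""
    nk = len(key) // 4
    nr = nk + 6
    w = [list(key[4 * i:4 * i + 4]) for i in range(nk)]
    rcon = 1
    for i in range(nk, 4 * (nr + 1)):
        temp = list(w[i - 1])
        if i % nk == 0:
            # SubByte(RotByte(temp)) ^ Rcon
            temp = [SBOX[b] for b in temp[1:] + temp[:1]]
            temp[0] ^= rcon
            rcon = gmul(rcon, 2)
        elif nk > 6 and i % nk == 4:
            # Extra SubByte step, only for AES256
            temp = [SBOX[b] for b in temp]
        w.append([a ^ b for a, b in zip(w[i - nk], temp)])
    return [bytes(x).hex() for x in w]


words = key_expansion(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(words[4:8])
```

The key above is the AES-128 key expansion example from FIPS-197, which makes the first derived words easy to compare with the published table.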

SLIDE 30

AES-NI Key Schedule instructions

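aeskeygenassist: computes the SubByte and RotByte parts of a Key Schedule step

/* xmm2/[mem128] = (X3, X2, X1, X0), four double words */
xmm1 ← (SubByte(RotByte(X3)) ⊕ Rcon, SubByte(X3), SubByte(RotByte(X1)) ⊕ Rcon, SubByte(X1))
aeskeygenassist xmm1, xmm2/[mem128], Rcon

aesimc: for the equivalent inverse cipher key schedule (apply inverse MixColumns to all the keys scheduled for encryption, except the first and last ones)

/* Round key scheduled for encryption in xmm2 */
Tmp ← xmm2/[mem128]
xmm1 ← MixColumns^-1(Tmp)
aesimc xmm1, xmm2/[mem128]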
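To make the aesimc step concrete, here is a small Python model of MixColumns and its inverse, plus a hypothetical helper `decryption_keys` that derives the equivalent inverse cipher keys from the encryption round keys. The reversal of the key order is an assumption taken from the standard equivalent inverse cipher; the slide only states where InvMixColumns is applied.

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def _mix(state, coeffs):
    """Multiply each 4-byte column by the circulant matrix with first row coeffs."""
    out = []
    for c in range(0, 16, 4):
        a = state[c:c + 4]
        for r in range(4):
            v = 0
            for k in range(4):
                v ^= gmul(coeffs[k], a[(r + k) % 4])
            out.append(v)
    return out


def mix_columns(state):
    return _mix(state, (0x02, 0x03, 0x01, 0x01))


def aesimc(round_key):
    """Model of aesimc: InvMixColumns applied to a 16-byte round key."""
    return _mix(round_key, (0x0E, 0x0B, 0x0D, 0x09))


def decryption_keys(enc_keys):
    """Equivalent inverse cipher keys: reversed order (assumed), with
    InvMixColumns on every key except the first and last ones."""
    middle = [aesimc(k) for k in reversed(enc_keys[1:-1])]
    return [enc_keys[-1]] + middle + [enc_keys[0]]


column = [0xDB, 0x13, 0x53, 0x45]
print([hex(b) for b in mix_columns(column * 4)[:4]])
```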

SLIDE 31

AES-NI Key Schedule for AES128

/* Key in xmm1 */
xmm1 ← Key
/* Prepare value for W[4] in xmm2 */
aeskeygenassist xmm2, xmm1, Rcon /* Rcon=0x1 for the first iteration */
/* We only keep the last double word SubByte(RotByte(xmm2[3]))⊕Rcon */
pshufd xmm2, xmm2, 0xff
//
movdqa xmm3, xmm1
pslldq xmm3, 0x4 /* W[i-1] goes to W[i] place */
pxor xmm1, xmm3  /* xor all W[i] with W[i-1] */
//
pslldq xmm3, 0x4 /* W[i-2] goes to W[i] place */
pxor xmm1, xmm3  /* xor all W[i] with W[i-2] */
//
pslldq xmm3, 0x4 /* W[i-3] goes to W[i] place */
pxor xmm1, xmm3  /* xor all W[i] with W[i-3] */
/* Finalize */
pxor xmm1, xmm2
//
KeySchedule[16*i] ← xmm1
LOOP /* loop with next Rcon */
...
AES-NI Key Schedule for AES128
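The shift-and-xor trick in the listing above can be traced with a software model. In this Python sketch a register is a list of four double words, each a list of 4 bytes (index 0 = lowest); `aeskeygenassist`, `pslldq4`, `pxor` and `next_round_key` are illustrative functions that mimic the instructions, not bindings to them.

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def make_sbox():
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)
        s = inv
        for k in range(1, 5):
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = make_sbox()


def aeskeygenassist(src, rcon):
    """Result double words, lowest first: S(X1), SR(X1)^Rcon, S(X3), SR(X3)^Rcon."""
    def sub(w):
        return [SBOX[b] for b in w]

    def rot_rcon(w):
        return [w[1] ^ rcon, w[2], w[3], w[0]]

    s1, s3 = sub(src[1]), sub(src[3])
    return [s1, rot_rcon(s1), s3, rot_rcon(s3)]


def pslldq4(x):
    """Byte shift left by 4: every double word moves up one position."""
    return [[0, 0, 0, 0]] + x[:3]


def pxor(a, b):
    return [[p ^ q for p, q in zip(u, v)] for u, v in zip(a, b)]


def next_round_key(xmm1, rcon):
    xmm2 = aeskeygenassist(xmm1, rcon)
    xmm2 = [xmm2[3]] * 4            # pshufd xmm2, xmm2, 0xff
    xmm3 = xmm1                     # movdqa xmm3, xmm1
    for _ in range(3):              # three pslldq / pxor steps
        xmm3 = pslldq4(xmm3)
        xmm1 = pxor(xmm1, xmm3)
    return pxor(xmm1, xmm2)         # finalize


key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
key0 = [list(key[4 * i:4 * i + 4]) for i in range(4)]
key1 = next_round_key(key0, 0x01)
print("".join(bytes(w).hex() for w in key1))
```

After the three shifts, double word j of xmm1 holds the xor of W[0] to W[j], so a single xor with the broadcast SubByte(RotByte(W[3])) ^ Rcon value yields W[4] to W[7] at once.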

SLIDE 32

AES-NI Key Schedule for AES128

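[Figure: data flow of one AES128 Key Schedule iteration. xmm1 = scheduled key i = (W[3], W[2], W[1], W[0]). aeskeygenassist produces xmm2 = (SR(W[3]), S(W[3]), SR(W[1]), S(W[1])); pshufd broadcasts SR(W[3]) to the four double words of xmm2. xmm3 holds the successively shifted copies of the key (pslldq): (W[2], W[1], W[0]), then (W[1], W[0]), then (W[0]), each xored into xmm1 (pxor). The final pxor with xmm2 gives xmm1 = scheduled key i+1 = (W[7], W[6], W[5], W[4])]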

SLIDE 34

Some definitions

Interdependent instructions: scheduled instructions that share a data dependency forcing a “stall” in the pipeline

movdqu xmm1, xmm2
pxor xmm3, xmm1

[Figure: timeline (cycles): pxor starts only once movdqu has completed and xmm1 is updated]

Independent instructions: scheduled instructions that don’t share a data dependency, and that can be parallelized

movdqu xmm1, xmm2
pxor xmm3, xmm4

[Figure: timeline (cycles): movdqu and pxor overlap]

SLIDE 37

Some definitions

Latency of an instruction: number of cycles taken by the instruction to complete in the worst case, i.e. when the next instruction depends on its result

[Figure: latency]

(Reciprocal) Throughput of an instruction: number of cycles per instruction in the best case, i.e. when independent instructions of the same kind follow each other

[Figure: throughput]

Intel’s Optimization Manual states that aesenc, aesdec, aesenclast, aesdeclast have:

◮ latency of 6 cycles, throughput of 2 cycles (Westmere)
◮ latency of 8 cycles, throughput of 1 cycle (Sandy and Ivy Bridge)
◮ let’s understand why ...

SLIDE 44

Throughput and µops

Each Port (execution unit entry) has a 1 cycle latency

The latency of an instruction represents the sum of the latencies of its sequential µops

◮ pxor xmm1, [mem128] has to wait for the resulting data from the memory load before doing the xor operation

The throughput of an instruction is directly related to its independent µops decomposition, as well as to its port binding

◮ pxor is composed of a unique µop, and can be dispatched on Port0, 1 or 5
◮ latency of the µop is 1 cycle ⇒ throughput is 1/3 = 0.33 cycle

SLIDE 45

AES-NI: µop analysis for Westmere

IACA tool (Intel Architecture Code Analyzer):

Throughput Analysis Report
Block Throughput: 6.00 Cycles
Throughput Bottleneck: InterIteration
| Num Of |          Ports pressure in cycles         |    |
|  Uops  | 0 - DV | 1   | 2 - D | 3 - D | 4   | 5   |    |
|   3    | 2.0    |     |       |       |     | 1.0 | CP | aesenc xmm0, xmm1

Throughput Analysis Report
Block Throughput: 6.00 Cycles
Throughput Bottleneck: InterIteration
| Num Of |          Ports pressure in cycles         |    |
|  Uops  | 0 - DV | 1   | 2 - D | 3 - D | 4   | 5   |    |
|   3    | 2.0    |     |       |       |     | 1.0 | CP | aesdec xmm0, xmm1

3 µops: two on Port0 and one on Port5

SLIDE 46

AES-NI: µop analysis for Westmere

Decomposition and latency of aesenc:

[Figure: timeline (cycles, marks at 4, 5 and 6) of the aesenc µops on xmm1. Labels: Port0, Port5, implicit ShiftRows, SubBytes/MixColumns, AddRoundKey, data dependency; total latency 6 cycles]

SLIDE 47

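AES-NI: µop analysis for Westmere

Throughput of aesenc:

[Figure: timeline (cycles) of two consecutive aesenc on Port0 and Port5, with the implicit ShiftRows: the AES round on block i+1 starts 2 cycles after the AES round on block i (throughput 2), each round completing after 6 cycles (latency 6)]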

SLIDE 48

AES-NI: µop analysis for Sandy Bridge

IACA tool:

Throughput Analysis Report
Block Throughput: 7.00 Cycles
Throughput Bottleneck: InterIteration
| Num Of |          Ports pressure in cycles         |    |
|  Uops  | 0 - DV | 1   | 2 - D | 3 - D | 4   | 5   |    |
|   2    | 0.5    | 0.5 |       |       |     | 1.0 | CP | aesenc xmm0, xmm1

Latency is actually 8 cycles (Intel’s Optimization Manual)

An AES execution subunit has been added behind Port1

The two µops operating on half states seem to have been fused into one µop with a 7-cycle latency that can be dispatched on Port0 or Port1: pressure on Port0 is decreased

SLIDE 49

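AES-NI: µop analysis for Sandy Bridge

The throughput is reduced to 1 cycle

[Figure: timeline (cycles) of two consecutive aesenc on xmm1: the AES round on block i and the AES round on block i+1 each use a 7-cycle µop (Port0 or Port1) followed by a 1-cycle µop on Port5; latency 8 cycles, throughput 1 cycle]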

SLIDE 50

Latencies and throughputs summary

Westmere

Instruction        Latency   Throughput
aesenc             6         2
aesdec             6         2
aesenclast         6         2
aesdeclast         6         2
aeskeygenassist    6         2
aesimc             6         2
pxor               1         0.33

Instructions latencies and reciprocal throughputs (in cycles)

Sandy and Ivy Bridge

Instruction        Latency   Throughput
aesenc             8         1
aesdec             8         1
aesenclast         8         1
aesdeclast         8         1
aeskeygenassist    8 (1)     8 (1)
aesimc             2         2
pxor               1         0.33

Instructions latencies and reciprocal throughputs (in cycles)

(1) Only reported by Agner Fog’s experimental results

Other values: reported by Intel documentation and confirmed experimentally (Agner Fog)

SLIDE 51

Exploiting instruction-level parallelism

AES optimal parallel encryption for Westmere:

xmm4-xmm15 ← scheduled keys
LOOP:
xmm0, xmm1, xmm2, xmm3 ← 4 plaintext blocks
pxor xmm0, xmm4 /* Block0 whitening */
pxor xmm1, xmm4 /* Block1 whitening */
pxor xmm2, xmm4 /* Block2 whitening */
pxor xmm3, xmm4 /* Block3 whitening */
aesenc xmm0, xmm5 /* Block0 Round 1 */
aesenc xmm1, xmm5 /* Block1 Round 1 */
aesenc xmm2, xmm5 /* Block2 Round 1 */
aesenc xmm3, xmm5 /* Block3 Round 1 */
aesenc xmm0, xmm6 /* Block0 Round 2 */
aesenc xmm1, xmm6 /* Block1 Round 2 */
aesenc xmm2, xmm6 /* Block2 Round 2 */
aesenc xmm3, xmm6 /* Block3 Round 2 */
...
aesenclast xmm0, xmm15 /* Block0 Round 10 */
aesenclast xmm1, xmm15 /* Block1 Round 10 */
aesenclast xmm2, xmm15 /* Block2 Round 10 */
aesenclast xmm3, xmm15 /* Block3 Round 10 */
CHAIN
jmp LOOP
AES128 Parallel Encryption (4 blocks in parallel)

[Figure annotations: 6 cycles between Round 1 and Round 2 of Block0; 2 cycles (Port0 delay) between two consecutive aesenc]
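The choice of how many blocks to interleave follows from the latency and reciprocal throughput figures given earlier. The Python sketch below is a back-of-the-envelope model only: it assumes a single stream of aesenc instructions, ignores loads, stores and the loop overhead, and the function names are illustrative.

```python
import math


def blocks_to_hide_latency(latency, reciprocal_throughput):
    """Smallest number of independent blocks that keeps the AES unit busy."""
    return math.ceil(latency / reciprocal_throughput)


def cycles_per_round_per_block(n_blocks, latency, reciprocal_throughput):
    """Cycles spent per block for one round when n_blocks are interleaved."""
    round_time = max(latency, n_blocks * reciprocal_throughput)
    return round_time / n_blocks


for name, lat, rtp in (("Westmere", 6, 2), ("Sandy/Ivy Bridge", 8, 1)):
    n = blocks_to_hide_latency(lat, rtp)
    print(name, "minimum blocks:", n,
          "cycles/round/block:", cycles_per_round_per_block(n, lat, rtp))
```

Under this model three blocks would already saturate Westmere; the slide uses four, which costs nothing in the model since the per-block cost stays at the reciprocal throughput.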

SLIDE 52

Exploiting instruction-level parallelism

Parallel encryption for Sandy and Ivy Bridge:

xmm8-xmm15, [mem]-[mem+3*16] ← scheduled keys
LOOP:
xmm0, xmm1, xmm2, xmm3, xmm4,\
xmm5, xmm6, xmm7 ← 8 plaintext blocks
pxor xmm0, xmm8 /* Block0 whitening */
pxor xmm1, xmm8 /* Block1 whitening */
pxor xmm2, xmm8 /* Block2 whitening */
pxor xmm3, xmm8 /* Block3 whitening */
pxor xmm4, xmm8 /* Block4 whitening */
pxor xmm5, xmm8 /* Block5 whitening */
pxor xmm6, xmm8 /* Block6 whitening */
pxor xmm7, xmm8 /* Block7 whitening */
aesenc xmm0, xmm9 /* Block0 Round 1 */
aesenc xmm1, xmm9 /* Block1 Round 1 */
aesenc xmm2, xmm9 /* Block2 Round 1 */
aesenc xmm3, xmm9 /* Block3 Round 1 */
aesenc xmm4, xmm9 /* Block4 Round 1 */
aesenc xmm5, xmm9 /* Block5 Round 1 */
aesenc xmm6, xmm9 /* Block6 Round 1 */
aesenc xmm7, xmm9 /* Block7 Round 1 */
aesenc xmm0, xmm10 /* Block0 Round 2 */
...
AES128 Parallel Encryption (8 blocks in parallel)

[Figure annotations: 8 cycles between Round 1 and Round 2 of Block0; 1 cycle between two consecutive aesenc]

SLIDE 53

Theoretical performance (Westmere)

Mode              Formula                      AES128      AES192 (2)   AES256 (2)
ECB Enc/Dec       (Nrounds × 2 + 0.33)/16 =    1.27 c/B    1.52 c/B     1.77 c/B
CBC Encrypt (1)   (Nrounds × 6 + 0.33)/16 =    3.77 c/B    4.52 c/B     5.27 c/B
CBC Decrypt (1)   (Nrounds × 2 + 0.33)/16 =    1.27 c/B    1.52 c/B     1.77 c/B
CTR Enc/Dec (1)   (Nrounds × 2 + 0.33)/16 =    1.27 c/B    1.52 c/B     1.77 c/B

Theoretical performance for 4-parallel blocks encryption and decryption (in cycles per byte) for ECB and CBC modes (Westmere)

(1) Plus a small overhead for chaining operations
(2) Plus a small overhead because of register starvation

PCBC and CFB modes are like CBC (non-parallel encryption and possible parallel decryption)

OFB mode can’t be parallelized (but precomputed for a given key)

SLIDE 54

Theoretical performance (Sandy/Ivy Bridge)

Mode              Formula                      AES128 (2)   AES192 (2)   AES256 (2)
ECB Enc/Dec       (Nrounds × 1 + 0.33)/16 =    0.64 c/B     0.77 c/B     0.89 c/B
CBC Encrypt (1)   (Nrounds × 8 + 0.33)/16 =    5.02 c/B     6.02 c/B     7.02 c/B
CBC Decrypt (1)   (Nrounds × 1 + 0.33)/16 =    0.64 c/B     0.77 c/B     0.89 c/B
CTR Enc/Dec (1)   (Nrounds × 1 + 0.33)/16 =    0.64 c/B     0.77 c/B     0.89 c/B

Theoretical performance for 8-parallel blocks encryption and decryption (in cycles per byte) for ECB and CBC modes (Sandy and Ivy Bridge)

(1) Plus a small overhead for chaining operations
(2) Plus a small overhead because of register starvation
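The figures in both theoretical tables follow directly from the formula (Nrounds × cycles per round + 0.33)/16. The short Python sketch below recomputes them; the helper names are illustrative, and the truncation to two decimals is an assumption inferred from the printed values (0.645625 appears as 0.64). Parallel modes pay the reciprocal throughput per round, while CBC encryption pays the full latency because each block depends on the previous one.

```python
import math

ROUNDS = {"AES128": 10, "AES192": 12, "AES256": 14}


def cycles_per_byte(nrounds, cycles_per_round, whitening=0.33, block_bytes=16):
    """(Nrounds x cycles per round + whitening pxor) / block size in bytes."""
    return (nrounds * cycles_per_round + whitening) / block_bytes


def trunc2(x):
    """Truncate to two decimals, matching the values printed on the slides."""
    return math.floor(x * 100) / 100


def table(latency, reciprocal_throughput):
    rows = {}
    for name, nr in ROUNDS.items():
        rows[name] = {
            "parallel (ECB, CBC decrypt, CTR)": trunc2(cycles_per_byte(nr, reciprocal_throughput)),
            "CBC encrypt": trunc2(cycles_per_byte(nr, latency)),
        }
    return rows


westmere = table(latency=6, reciprocal_throughput=2)
sandy_ivy = table(latency=8, reciprocal_throughput=1)
for name in ROUNDS:
    print(name, westmere[name], sandy_ivy[name])
```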

SLIDE 55

Practical results

Official results in Intel’s White Paper on Westmere, 4 parallel blocks:

[Figure: bar chart “AES-NI results (Westmere@2.67GHz, 1KB data)”. Vertical axis: #cycles per byte (1 to 11). Groups: ECB, CBC, CTR, table based AES128 CTR, bitslice AES128 CTR. Series: Encrypt128, Encrypt192, Encrypt256, Decrypt128, Decrypt192, Decrypt256]

SLIDE 56

Practical results

Differences with theory:

[Figure: bar chart “AES-NI results (Westmere@2.67GHz, 1KB data)”, results versus theory. Vertical axis: #cycles per byte (1 to 11). Groups: ECB, CBC, CTR]

SLIDE 57

What about the Key Schedule?

Intel has developed AES-NI with encryption and decryption using the same key in mind

Key Schedule becomes negligible when encrypting multiple blocks with the same key

This explains why aeskeygenassist performs quite poorly on Ivy/Sandy Bridge

However, AES-NI provides better performance (with a constant-time implementation) than a table-based Key Schedule: ∼100 cycles against ∼160 cycles

SLIDE 58

VEX encoded AES (AVX extensions)

There are VEX extensions of AES-NI instructions: vaesenc, vaesdec ...

However, the instructions only work on the low part xmm of ymm registers

The advantage of using the three-operand versions of the instructions remains:

◮ the Key Schedule can benefit from the extended instructions ...
◮ ... at the cost of using VEX-only instructions (to avoid VEX/SSE switch latencies)

SLIDE 59

GCM mode

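[Figure: GCM mode. Counter0, Counter1 and Counter2 (linked by incr) are each encrypted under Key; the encrypted counters are xored with Plaintext1 and Plaintext2 to give Ciphertext1 and Ciphertext2; AuthData1, the ciphertexts and len(A)||len(C) go through a chain of mult steps to produce AuthTag. mult: multiplication of the input with the fixed hash key in GF(2^128)]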

SLIDE 60

pclmulqdq instruction

Not an AES-NI instruction per se

Performs a “carry-less multiplication” (polynomial multiplication over GF(2))

/* Split in two quadwords */
xmm1 := [xmm1[1]|xmm1[0]]
xmm2 := [xmm2[1]|xmm2[0]]
if(imm8 == 0x00) xmm1 ← xmm2[0] × xmm1[0]
if(imm8 == 0x01) xmm1 ← xmm2[0] × xmm1[1]
if(imm8 == 0x10) xmm1 ← xmm2[1] × xmm1[0]
if(imm8 == 0x11) xmm1 ← xmm2[1] × xmm1[1]
pclmulqdq xmm1, xmm2/[mem128], imm8
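A carry-less multiplication is an ordinary shift-and-add multiplication in which the additions are replaced by xor. The Python sketch below models the quadword selection described above; registers are represented as 128-bit integers, and the function names `clmul64` and `pclmulqdq` are used for illustration only.

```python
MASK64 = (1 << 64) - 1


def clmul64(a, b):
    """Carry-less product of two 64-bit values (polynomials over GF(2))."""
    r = 0
    for i in range(64):
        if (b >> i) & 1:
            r ^= a << i   # xor instead of add: no carries propagate
    return r


def pclmulqdq(xmm1, xmm2, imm8):
    """Model of pclmulqdq on 128-bit integers.

    Bit 0 of imm8 selects the quadword of xmm1, bit 4 the quadword of xmm2.
    """
    a = (xmm1 >> 64) & MASK64 if imm8 & 0x01 else xmm1 & MASK64
    b = (xmm2 >> 64) & MASK64 if imm8 & 0x10 else xmm2 & MASK64
    return clmul64(a, b)


x = (7 << 64) | 3
y = (5 << 64) | 2
for imm in (0x00, 0x01, 0x10, 0x11):
    print(hex(imm), hex(pclmulqdq(x, y, imm)))
```

For instance (x + 1)(x + 1) = x^2 + 1 over GF(2), so the carry-less square of 3 is 5 rather than 9.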

SLIDE 62

Using pclmulqdq

GCM multiplies two 128-bit values over GF(2^128)

Two issues:

◮ Carry-less multiplication of two 128-bit operands to give a 255-bit value ⇒ use schoolbook or Karatsuba algorithms
◮ Reduction of the resulting value over GF(2^128) with the GCM irreducible polynomial (x^128 + x^7 + x^2 + x + 1) ⇒ Intel’s manual gives many optimized reduction algorithms

Result on Westmere: AES GCM performs at 3.54 c/B with 4 parallel blocks CTR encryption, to compare with 10.68 c/B for bitsliced AES GCM with table lookups (21.99 c/B without table lookups, Käsper et al.)
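The two steps listed above (a 128 × 128 carry-less product, then a reduction) can be sketched as follows in Python. This is a plain polynomial-basis model under stated assumptions: integers hold the coefficients with bit i standing for x^i, GCM’s bit-reflected convention is deliberately ignored, the schoolbook variant uses four 64 × 64 products as pclmulqdq would provide, and all function names are illustrative.

```python
MASK64 = (1 << 64) - 1


def clmul(a, b):
    """Carry-less product of two non-negative integers."""
    r = 0
    i = 0
    while b >> i:
        if (b >> i) & 1:
            r ^= a << i
        i += 1
    return r


def clmul128_schoolbook(a, b):
    """128 x 128 carry-less product built from four 64 x 64 products."""
    a0, a1 = a & MASK64, a >> 64
    b0, b1 = b & MASK64, b >> 64
    low = clmul(a0, b0)
    mid = clmul(a0, b1) ^ clmul(a1, b0)
    high = clmul(a1, b1)
    return (high << 128) ^ (mid << 64) ^ low


def reduce_gcm_poly(v):
    """Reduce a polynomial modulo x^128 + x^7 + x^2 + x + 1."""
    for i in range(v.bit_length() - 1, 127, -1):
        if (v >> i) & 1:
            # Replace x^i by x^(i-128) * (x^7 + x^2 + x + 1)
            v ^= (1 << i) | (0x87 << (i - 128))
    return v


def gf128_mul(a, b):
    return reduce_gcm_poly(clmul128_schoolbook(a, b))


print(hex(gf128_mul(1 << 127, 2)))  # x^127 * x = x^128, reduced
```

Karatsuba would replace the two middle products by a single one, (a0 ^ a1) × (b0 ^ b1), xored with the low and high products.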

SLIDE 63

Rijndael

Rijndael uses the same building blocks as AES, with a state of up to 256 bits and extended possible key lengths

AES-NI Key Schedule instructions fit the Rijndael Key Schedule

The main issue comes from the ShiftRows on states longer than 128 bits, which doesn’t fit the AES one

SLIDE 64

Rijndael (256-bit state example)

Solution: prepare the state (xmm1, xmm2) with AESShiftRows^-1(RijndaelShiftRows(state))

[Figure: the 256-bit Rijndael state is split across xmm1 and xmm2; a pshufb, pblendw, pshufb sequence rearranges the bytes into AESInvSR(RijndaelSR(state)); one aesenc per register then applies AESSR, SubBytes and MixColumns]

SLIDE 65

Building blocks isolation

We can apply the same function composition strategy to isolate ShiftRows, MixColumns, SubBytes, RotByte

aesdeclast xmm1, 0x00...
aesenc xmm1, 0x00...
/*
Tmp ← xmm1
Tmp ← InvShiftRows(Tmp)
xmm1 ← InvSubBytes(Tmp) ⊕ 0x00...
Tmp ← xmm1
Tmp ← ShiftRows(Tmp)
Tmp ← SubBytes(Tmp)
xmm1 ← MixColumns(Tmp) ⊕ 0x00...
*/
MixColumns(xmm1)

CST=0x0306090c0f0205080b0e0104070a0d00
pshufb xmm1, CST /* pshufb = InvShiftRows */
aesenclast xmm1, 0x00...
SubBytes(xmm1)

[Figure labels: SWAP, InvShiftRows]
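Both compositions above can be verified with a software model of the four round instructions. The Python sketch below is illustrative only: states are lists of 16 bytes in column-major order, the S-box is computed rather than tabulated, and the function names mirror the instructions without being bindings to them.

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def make_sbox():
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)
        s = inv
        for k in range(1, 5):
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = make_sbox()
INV_SBOX = [0] * 256
for i, v in enumerate(SBOX):
    INV_SBOX[v] = i


def shift_rows(s):
    return [s[(i + 4 * (i % 4)) % 16] for i in range(16)]


def inv_shift_rows(s):
    return [s[(i - 4 * (i % 4)) % 16] for i in range(16)]


def sub_bytes(s):
    return [SBOX[b] for b in s]


def mix_columns(s):
    out = []
    for c in range(0, 16, 4):
        a = s[c:c + 4]
        out += [gmul(2, a[r]) ^ gmul(3, a[(r + 1) % 4]) ^ a[(r + 2) % 4] ^ a[(r + 3) % 4]
                for r in range(4)]
    return out


def xor(a, b):
    return [p ^ q for p, q in zip(a, b)]


def aesenc(s, rk):
    return xor(mix_columns(sub_bytes(shift_rows(s))), rk)


def aesenclast(s, rk):
    return xor(sub_bytes(shift_rows(s)), rk)


def aesdeclast(s, rk):
    return xor([INV_SBOX[b] for b in inv_shift_rows(s)], rk)


def pshufb(dst, mask):
    return [0 if m & 0x80 else dst[m & 0x0F] for m in mask]


ZERO = [0] * 16
CST = list((0x0306090C0F0205080B0E0104070A0D00).to_bytes(16, "little"))
state = [(29 * i + 7) % 256 for i in range(16)]

print(aesenc(aesdeclast(state, ZERO), ZERO) == mix_columns(state))
print(aesenclast(pshufb(state, CST), ZERO) == sub_bytes(state))
```

The first check works because aesdeclast undoes the SubBytes and ShiftRows that aesenc applies before MixColumns; the second because the pshufb constant undoes the ShiftRows inside aesenclast.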

SLIDE 67

Building blocks isolation

The same building block can have multiple decompositions: InvMixColumns = {aesimc} = {aesenclast + aesdec}

One must check the resulting latency and throughput, and use the optimal decomposition (or combine decompositions)

◮ using aesenclast and pshufb to compose SubBytes seems clearly more efficient than composing aesenc, aesimc and pshufb
◮ depends on the microarchitectural details

SLIDE 68

Building blocks isolation

In order to achieve maximum throughput, composed building blocks must be parallelized atomically for each instruction (remove critical paths)

aesdeclast xmm0, 0x00...
aesdeclast xmm1, 0x00...
aesdeclast xmm2, 0x00...
aesdeclast xmm3, 0x00...
aesenc xmm0, 0x00...
aesenc xmm1, 0x00...
aesenc xmm2, 0x00...
aesenc xmm3, 0x00...
Maximum throughput MixColumns (Westmere)

aesdeclast xmm0, 0x00...
aesenc xmm0, 0x00...
aesdeclast xmm1, 0x00...
aesenc xmm1, 0x00...
aesdeclast xmm2, 0x00...
aesenc xmm2, 0x00...
aesdeclast xmm3, 0x00...
aesenc xmm3, 0x00...
MixColumns with critical paths
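The cost of the critical paths can be estimated with a toy scheduler. The Python sketch below is a deliberately simplified model, not a description of the real pipeline: it assumes strictly in-order issue on a single port, a fixed latency and reciprocal throughput for every instruction (Westmere values from the earlier slides), and the function name `finish_time` is illustrative. Real CPUs execute out of order, so the measured gap would be smaller.

```python
def finish_time(program, latency, reciprocal_throughput):
    """program: destination register of each instruction, in issue order.

    An instruction starts when the port is free and its register is ready.
    """
    ready = {}       # cycle at which each register holds its latest value
    next_issue = 0   # earliest cycle at which the port accepts an instruction
    for reg in program:
        start = max(next_issue, ready.get(reg, 0))
        ready[reg] = start + latency
        next_issue = start + reciprocal_throughput
    return max(ready.values())


regs = ["xmm0", "xmm1", "xmm2", "xmm3"]
grouped = regs + regs                      # 4 x aesdeclast, then 4 x aesenc
paired = [r for r in regs for _ in (0, 1)]  # aesdeclast/aesenc back to back

print("grouped:", finish_time(grouped, latency=6, reciprocal_throughput=2))
print("paired: ", finish_time(paired, latency=6, reciprocal_throughput=2))
```

In the grouped order each aesenc finds its input already available, whereas in the paired order every aesenc waits for the full latency of the aesdeclast just before it.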

SLIDE 69

Building blocks isolation

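Parts of the building blocks can also be isolated

MixColumns sub-matrix multiplication isolation: the product

\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \times \begin{pmatrix} x_0 \\ x_1 \end{pmatrix}

is obtained from the full MixColumns matrix

\begin{pmatrix} 2 & 3 & 1 & 1 \\ 1 & 2 & 3 & 1 \\ 1 & 1 & 2 & 3 \\ 3 & 1 & 1 & 2 \end{pmatrix}

applied to a 4-byte column built from x_0 and x_1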

SLIDE 70

Hash functions

The versatility of AES-NI instructions allows them to be used in areas other than AES or Rijndael:

◮ All cryptographic algorithms that use AES building blocks can benefit from AES-NI ...
◮ ... with performance benefits and/or constant-time implementation

More specifically, many candidates of the recent SHA-3 competition have used AES-NI to improve performance or provide resistance against side-channel attacks

SLIDE 71

Hash functions

Some SHA-3 candidates results:

Algorithm    Previous 256/512    AES-NI 256/512
Grøstl       19.9 / 29.2         11.3 / 16.2
ECHO         28.5 / 53.5         6.8 / 12.6
SHAvite-3    26.7 / 38.2         5.6 / 5.5
Cheetah      9.3 / 13.1          7.6 /
Lane         25.7 / 56.5         4.9 / 13.5
Lesamnta     52.7 / 51.2         29.5 / 19.0
LUX          10.5 / 9.26         6.6 /
Vortex       46.3 / 56.1         4.4 / 5.2

AES-NI performance benefits on Westmere (results in cycles per byte)

Legend (color coding on the slide): Final Round, Round 2, Round 1

SLIDE 72

Concluding thoughts

Since Intel’s White Paper in 2008, AES-NI has become a reality with Westmere and Sandy/Ivy Bridge

Adding AES in the ISA rather than in a dedicated coprocessor has advantages (software compliance across platforms)

◮ ARM and SPARC plan to add similar instructions in their next generation CPUs

What could be the future of AES-NI?

◮ AVX2 (in the forthcoming Haswell microarchitecture) doesn’t include 256-bit AES ymm support: it might be planned for a future release (?)
◮ the latency (8 cycles) can be improved, and the µop decomposition reduced
