Implementing AES on a Bunch of Processors (ECRYPT AES Day, Bruges)



slide-1
SLIDE 1

Tim Güneysu Hardware Security Group Horst Görtz Institute for IT-Security

10/23/2012

Implementing AES on a Bunch of Processors

ECRYPT AES Day – Bruges, Belgium

slide-2
SLIDE 2

Outline

  • Introduction
  • Processor Platforms
  • Tricks, Tweaks and Codes
  • Benchmarks and Results
  • Conclusions
slide-3
SLIDE 3

The AES Crib Sheet

slide-4
SLIDE 4

AES Implementation: General Representation

  • Two different representations of AES in original proposal

– 8-bit standard implementation
– 32-bit T-Table implementation, e.g.,

$$
\begin{pmatrix} E_0 \\ E_1 \\ E_2 \\ E_3 \end{pmatrix}
= T_0(A_0) \oplus T_1(A_5) \oplus T_2(A_{10}) \oplus T_3(A_{15}) \oplus
\begin{pmatrix} k_{j,0} \\ k_{j,1} \\ k_{j,2} \\ k_{j,3} \end{pmatrix}
$$

  • ¼ round: 4 Table Look-Ups (TLU) + 4 x 32-bit XOR
  • 1 round: 4 x 4 = 16 TLU + XOR
  • AES: 160 TLU + XOR per block encryption
  • Memory: 4 T-Boxes, 1 kB each
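The quarter round above can be sketched in C as follows. This is a minimal illustration, not the reference code: the helper names (`gf_mul`, `sbox`, `build_t_tables`, `quarter_round`) are made up for this example, the S-box is computed by brute force instead of being stored, and the table entries are assumed to use the big-endian packing (2·S, S, S, 3·S).

```c
#include <assert.h>
#include <stdint.h>

/* Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return r;
}

/* AES S-box entry: multiplicative inverse followed by the affine map. */
static uint8_t sbox(uint8_t x)
{
    uint8_t inv = 0;
    for (unsigned c = 1; c < 256 && x != 0; c++) {
        if (gf_mul(x, (uint8_t)c) == 1) {
            inv = (uint8_t)c;
            break;
        }
    }
    uint8_t s = inv;
    for (int i = 1; i <= 4; i++)
        s ^= (uint8_t)((inv << i) | (inv >> (8 - i)));
    return (uint8_t)(s ^ 0x63);
}

/* Four T-tables: 4 x 256 entries x 4 bytes = 4 kB in total. */
static uint32_t T[4][256];

void build_t_tables(void)
{
    for (unsigned x = 0; x < 256; x++) {
        uint8_t s = sbox((uint8_t)x);
        uint32_t t0 = ((uint32_t)gf_mul(s, 2) << 24) | ((uint32_t)s << 16)
                    | ((uint32_t)s << 8) | gf_mul(s, 3);
        T[0][x] = t0;
        /* T1..T3 are byte rotations of T0 (rotate right by 8, 16, 24 bits). */
        for (unsigned i = 1; i < 4; i++)
            T[i][x] = (t0 >> (8 * i)) | (t0 << (32 - 8 * i));
    }
}

/* One output column: 4 table look-ups and 4 XORs (incl. the round key word). */
uint32_t quarter_round(uint8_t a0, uint8_t a5, uint8_t a10, uint8_t a15,
                       uint32_t round_key_word)
{
    assert(T[0][0] == 0xc66363a5u); /* build_t_tables() must have run */
    return T[0][a0] ^ T[1][a5] ^ T[2][a10] ^ T[3][a15] ^ round_key_word;
}
```

Calling `quarter_round` four times with the appropriate state bytes gives one full round, which is where the count of 16 TLU per round comes from.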

slide-5
SLIDE 5

AES Implementation : Choice of Processor

  • A dedicated AES processor in HW is not always the preferred option

– AES is often just a supplementary function to a software application
– HW development is too costly or necessary skills are not available

  • But when doing AES in software: which processor is the best?

slide-6
SLIDE 6

AES Implementation : Parameters

  • Key size of AES

– 128, 192, 256 bit

  • Applied mode of operation

– ECB, CBC, GCM, CTR,…

  • Blocks concurrently processed

– Single block (limited data transfers)
– Multiple blocks (overhead reduction, bitslicing)

  • Round key computation

– Precomputed (when processing bulk data)
– On-the-fly (when changing keys frequently)

slide-7
SLIDE 7

Outline

  • Introduction
  • Processor Platforms
  • Tricks, Tweaks and Codes
  • Benchmarks and Results
  • Conclusions
slide-8
SLIDE 8

Processors and Platforms

  • Native bit sizes of General-Purpose Processors (GPP)

– 4-bit, e.g., MARC-4, NEC uPD75X in pocket calculators/washing machines
– 8-bit, e.g., Atmel ATMegaXX, Intel 8051 in (many) embedded systems
– 16-bit, e.g., TI MSP430, DEC PDP-11 in (fewer) embedded systems
– 32-bit, e.g., ARM, TriCore in smart phones and automobiles
– 64-bit, e.g., Intel i3/5/7, AMD A-Series in PCs and workstations
– 128-bit, e.g., IBM Cell in PS3 (actually, only 128-bit SIMD on SPEs)

  • Myth or Fact: AES is always most efficient on native 8-bit and 32-bit processors!?

slide-9
SLIDE 9

Processor Architectures

  • General processor design

– RISC vs. CISC (Reduced/Complex Instruction Set Computer)
– Single-Instruction Multiple Data (SIMD) operation
– Super-scalar devices processing more than one instruction per cycle

  • Processor interface to memory

– Von Neumann vs. Harvard: shared memory for data and program?
– Cache for data and/or program? (→ cache attacks!)
– Static/dynamic external or built-in RAM?

  • Additional processor extensions

– Multimedia/integer co-processor
– Special/native Instruction Set Extensions (ISE)

slide-10
SLIDE 10

Other Processor Architectures

  • Streaming processors, such as GPUs

– Multi-processors run hundreds of concurrent threads
– High memory bandwidth, but high latency to global memory

  • Digital Signal Processors (DSP)

– Support fast combined arithmetic instructions
– Are improved arithmetic instructions useful for AES?

  • Other array/tile-based processors

– Synchronous/asynchronous processing cores
– Processor-based systolic array cores (Tilera, GreenArrays)

slide-11
SLIDE 11

Outline

  • Introduction
  • Processor Platforms
  • Tricks, Tweaks and Codes
  • Benchmarks and Results
  • Conclusions
slide-12
SLIDE 12

AES Software Optimization

  • General requirements for secure implementation in software

– Disable (or control) cache to prevent cache attacks
– Avoid conditional branches to counter timing attacks

  • Common tweaks to achieve high performance

– Make particular use of specialized instructions
– Unroll rounds and loops to reduce instruction cycle count
– Optimize register allocation
– Precompute and store values in tables (e.g., T-Tables, round keys and constants)

  • Common tweaks to minimize code size

– Reuse code by functions to minimize instruction count
– Limit amount of precomputed and stored values

  • Common tweaks for low energy consumption

– Reduce number of costly load and store operations to memory
– General approach often similar to the optimization for high performance
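As an illustration of "avoid conditional branches" and "limit precomputed values", the following C sketch computes MixColumns for one column without tables and without data-dependent branches. The function names `xtime` and `mix_column` are chosen for this example only, and whether the compiled code is truly constant-time still depends on compiler and target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply by {02} in GF(2^8) without a data-dependent branch:
 * the reduction constant 0x1b is selected by an all-ones/all-zeros mask. */
static uint8_t xtime(uint8_t a)
{
    uint8_t mask = (uint8_t)(0u - (a >> 7)); /* 0xff if top bit set, else 0x00 */
    return (uint8_t)((a << 1) ^ (mask & 0x1b));
}

/* MixColumns on one 4-byte column, in place, using only XOR and xtime. */
void mix_column(uint8_t c[4])
{
    assert(c != NULL);
    uint8_t all = (uint8_t)(c[0] ^ c[1] ^ c[2] ^ c[3]);
    uint8_t first = c[0];
    c[0] ^= all ^ xtime((uint8_t)(c[0] ^ c[1]));
    c[1] ^= all ^ xtime((uint8_t)(c[1] ^ c[2]));
    c[2] ^= all ^ xtime((uint8_t)(c[2] ^ c[3]));
    c[3] ^= all ^ xtime((uint8_t)(c[3] ^ first));
}
```

This trades the 4 kB of T-tables for a few extra ALU operations per column, which is the usual direction on small 8-bit devices.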

slide-13
SLIDE 13

Coding Intermezzo: Have you ever tried to implement AES on a Commodore C64?

encrypt
        ldx #$07
; initial AddRoundKey: tmpblock = aesblock XOR expkey (16 bytes)
.addfirst
        lda aesblock+0,x        ; 4
        eor expkey+0,x          ; 8
        sta tmpblock+0,x        ; 13
        lda aesblock+8,x        ; 17
        eor expkey+8,x          ; 21
        sta tmpblock+8,x        ; 26
        dex                     ; 28
        bpl .addfirst           ; 31
        ldy #$10
; each group below: round key byte XOR four table look-ups = one output byte
.round
        lda expkey+$00,y        ; 4
        ldx tmpblock+4*0+0      ; 7
        eor ssm0,x              ; 11
        ldx tmpblock+4*1+1      ; 14
        eor ssm3,x              ; 18
        ldx tmpblock+4*2+2      ; 21
        eor ssm2,x              ; 25
        ldx tmpblock+4*3+3      ; 28
        eor ssm1,x              ; 32
        sta aesblock+$00        ; 36

        lda expkey+$01,y
        ldx tmpblock+4*0+0
        eor ssm1,x
        ldx tmpblock+4*1+1
        eor ssm0,x
        ldx tmpblock+4*2+2
        eor ssm3,x
        ldx tmpblock+4*3+3
        eor ssm2,x
        sta aesblock+$01

        lda expkey+$02,y
        ldx tmpblock+4*0+0
        eor ssm2,x
        ldx tmpblock+4*1+1
        eor ssm1,x
        ldx tmpblock+4*2+2
        eor ssm0,x
        ldx tmpblock+4*3+3
        eor ssm3,x
        sta aesblock+$02

        lda expkey+$03,y
        ldx tmpblock+4*0+0
        eor ssm3,x
        ldx tmpblock+4*1+1
        eor ssm2,x
        ldx tmpblock+4*2+2
        eor ssm1,x
        ldx tmpblock+4*3+3
        eor ssm0,x
        sta aesblock+$03

        lda expkey+$04,y
        ldx tmpblock+4*1+0
        eor ssm0,x
        ldx tmpblock+4*2+1
        eor ssm3,x
        ldx tmpblock+4*3+2
        eor ssm2,x
        ldx tmpblock+4*0+3
        eor ssm1,x
        sta aesblock+$04

        lda expkey+$05,y
        ldx tmpblock+4*1+0
        eor ssm1,x
        ldx tmpblock+4*2+1
        eor ssm0,x
        ldx tmpblock+4*3+2
        eor ssm3,x
        ldx tmpblock+4*0+3
        eor ssm2,x
        sta aesblock+$05

AES-256 in ACME Assembler [Extract of source at http://www.robos.org/prog]

Commodore C64 8-bit CPU with 64 KB RAM

slide-14
SLIDE 14

Real Coding: Sample T-Table AES in C

(Reference code by Brian Gladman)

  • High-performance AES for processors ≥ 32 bit with interleaved T-tables
  • Per round, 4 instances of code snippet required
  • AES has 720 instructions (INS)

– 208 loads
– 4 stores
– 508 integer instructions, of which:
    • 160 shifts
    • 176 masks (+16 for last round)
    • 168 XORs
    • 4 overhead for CTR mode

/* Read round keys */
z0 = roundkeys[i * 4 + 0];
z1 = roundkeys[i * 4 + 1];
z2 = roundkeys[i * 4 + 2];
z3 = roundkeys[i * 4 + 3];

/* Extract and mask input bytes */
p00 = (uint32) y0 >> 20;
p01 = (uint32) y0 >> 12;
p02 = (uint32) y0 >> 4;
p03 = (uint32) y0 << 4;
p00 &= 0xff0;
p01 &= 0xff0;
p02 &= 0xff0;
p03 &= 0xff0;

/* Perform TLU */
p00 = *(uint32 *) (table0 + p00);
p01 = *(uint32 *) (table1 + p01);
p02 = *(uint32 *) (table2 + p02);
p03 = *(uint32 *) (table3 + p03);

/* Add TLU to round keys */
z0 ^= p00;
z3 ^= p01;
z2 ^= p02;
z1 ^= p03;
…

(only ¼ round shown)

[Figure: Interleaved memory layout (32-bit entries). Entries of table0, table1, table2 and table3 alternate in memory, so consecutive entries of the same table lie 16 bytes apart (byte offsets 0, 16, 32, 48, ...). Access the j-th table entry of table i via table<i>+16j.]
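The interleaved layout can be sketched in C as below. This is an illustration under stated assumptions, not the reference code: the names `tables`, `table_store`, `table_load` and `extract_offsets` are hypothetical, and table i is assumed to start at byte offset 4·i inside one 4 kB block.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { ENTRY_STRIDE = 16, TABLE_BYTES = 256 * ENTRY_STRIDE };

/* One 4 kB block; table i starts at byte offset 4*i (assumed layout). */
static uint8_t tables[TABLE_BYTES];

/* Store v as the j-th entry of table i (i = 0..3, j = 0..255). */
void table_store(unsigned i, unsigned j, uint32_t v)
{
    assert(i < 4 && j < 256);
    memcpy(tables + 4 * i + ENTRY_STRIDE * j, &v, sizeof v);
}

/* Load from table i, given a byte offset that is already scaled by 16. */
uint32_t table_load(unsigned i, uint32_t scaled_offset)
{
    uint32_t v;
    assert(i < 4 && scaled_offset <= 0xff0 && scaled_offset % 16 == 0);
    memcpy(&v, tables + 4 * i + scaled_offset, sizeof v);
    return v;
}

/* Scaled byte offsets (16 * byte value) for the four bytes of y,
 * most significant byte first, using the shift-then-mask pattern. */
void extract_offsets(uint32_t y, uint32_t p[4])
{
    p[0] = (y >> 20) & 0xff0;
    p[1] = (y >> 12) & 0xff0;
    p[2] = (y >> 4) & 0xff0;
    p[3] = (y << 4) & 0xff0;
}
```

Because the offsets come out of the shift already multiplied by 16, no separate index scaling is needed before the load.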

slide-15
SLIDE 15

Optimizing AES for High-Performance

  • Special instruction: Combined Shift-and-Mask

– On PPC, rlwinm is available as a single instruction
– Saves 160 instructions for separate masking [BS08]
– AES on PPC now has 560 instructions (720 − 160)

Extract and mask input bytes (separate shift and mask):

p00 = (uint32) y0 >> 20;
p01 = (uint32) y0 >> 12;
p02 = (uint32) y0 >> 4;
p03 = (uint32) y0 << 4;
p00 &= 0xff0;
p01 &= 0xff0;
p02 &= 0xff0;
p03 &= 0xff0;

Combined shift-and-mask:

p00 = (uint32) y0 >> 20 & 0xff0;
p01 = (uint32) y0 >> 12 & 0xff0;
p02 = (uint32) y0 >> 4 & 0xff0;
p03 = (uint32) y0 << 4 & 0xff0;
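PowerPC's rlwinm rotates a word left by an immediate amount and ANDs the result with a mask. The portable C sketch below shows that each shift-and-mask pair above can be written as one rotate-and-mask. The helper names are hypothetical, and whether a compiler actually emits a single instruction for `rot_mask` is an assumption that depends on the toolchain.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rotl32(uint32_t v, unsigned n)
{
    assert(n > 0 && n < 32);
    return (v << n) | (v >> (32 - n));
}

/* Rotate left by n, then AND with mask (the rlwinm pattern). */
static uint32_t rot_mask(uint32_t v, unsigned n, uint32_t mask)
{
    return rotl32(v, n) & mask;
}

/* Same four scaled offsets as the shift-then-mask version,
 * but each one is a single rotate-and-mask operation. */
void extract_rot_mask(uint32_t y, uint32_t p[4])
{
    p[0] = rot_mask(y, 12, 0xff0); /* equals (y >> 20) & 0xff0 */
    p[1] = rot_mask(y, 20, 0xff0); /* equals (y >> 12) & 0xff0 */
    p[2] = rot_mask(y, 28, 0xff0); /* equals (y >> 4)  & 0xff0 */
    p[3] = rot_mask(y, 4, 0xff0);  /* equals (y << 4)  & 0xff0 */
}
```

With four such extractions per quarter round, 16 per round and 10 rounds, this accounts for the 160 saved mask instructions.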

slide-16
SLIDE 16

Optimizing AES for High-Performance [cont.]

  • Special instruction: Scaled Index Loads

– On x86, shift and load instructions can be combined
– Saves 80 instructions for separate shifting of top and bottom bytes [BS08]
– AES on x86 has 640 instructions (not to be combined with previous method!!)

Extract and mask input bytes, then perform TLU:

p03 = (uint32) y0 << 4
…
p03 &= 0xff0
…
p03 = *(uint32 *) (table3 + p03)

Mask first and do shifted TLU:

p03 = y0 & 0xff
p03 = *(uint32 *) (table3 + (p03 << 4))

slide-17
SLIDE 17

Optimizing AES for High-Performance [cont.]

  • Availability of 64-bit Registers

– On AMD64 and UltraSparcV9, use padded values in 64-bit registers
– Padding implicitly includes the shift by 4 bit (aka multiplication by 16)
– Padding is applied consistently through entire AES
– Saves 80 instructions (no need to mask top bytes anymore) [BS08]
– AES now has 640 instructions (again, not to be combined)

Example: 0xc66363a5 → 0x0c60063006300a50
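The padding in the example can be reproduced with a few lines of C. This is a sketch assuming each byte is placed in its own 16-bit lane, pre-shifted left by 4; the names `pad_word` and `lane_offset` are illustrative and do not come from [BS08].

```c
#include <assert.h>
#include <stdint.h>

/* Spread the four bytes of w into four 16-bit lanes,
 * each byte pre-shifted left by 4 (i.e. multiplied by 16). */
uint64_t pad_word(uint32_t w)
{
    uint64_t out = 0;
    for (int lane = 3; lane >= 0; lane--) {
        uint64_t byte = (w >> (8 * lane)) & 0xff;
        out |= (byte << 4) << (16 * lane);
    }
    return out;
}

/* Table offset (16 * byte value) held in lane 0..3; lane 3 is the top lane. */
uint32_t lane_offset(uint64_t padded, unsigned lane)
{
    assert(lane < 4);
    if (lane == 3)
        return (uint32_t)(padded >> 48); /* top lane: shift only, no mask */
    return (uint32_t)(padded >> (16 * lane)) & 0xff0;
}
```

Since the table entries themselves are stored in padded form, the output of one round is already in the right shape to index the tables in the next round.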

slide-18
SLIDE 18

Optimizing AES for High-Performance [cont.]

  • Other ways to optimize the T-Table AES in software…

– Special instruction: Combined Load-XOR (x86/AMD64) → saves 168 instructions
– Special instruction: second byte extraction instruction (x86) → saves 40 instructions
– Special instruction: two-byte loads → saves 4 instructions
– Byte extraction via loads → trades 160-320 integer instructions against 200 loads/stores
– Round key caching in extra registers → saves about 44 instructions
– Utilize SSE processor extensions instead of plain CPU ALU
– …

  • Results for common ≥32-bit processors (encrypting 4kByte of data)

– IBM PPC G4 7410: 459 instructions → 14.57 cycles/byte [BS08]
– Intel Pentium 4 f12: 414 instructions → 14.13 cycles/byte [BS08]
– Intel Core 2 Quad Q6600 with SSE3: ≈ 278 instructions* → 9.32 cycles/byte [KS09]
– Sun UltraSparc III: 505 instructions → 12.06 cycles/byte [BS08]
– Intel Core i7 920: ≈ 278 instructions* → 6.92 cycles/byte [KS09]

*estimated since not provided in the original work

slide-19
SLIDE 19

Coding Intermezzo: Have you ever programmed AES-128 in Whitespace?

Whitespace is a programming language developed by Edwin Brady and Chris Morris. The Whitespace interpreter ignores any non-whitespace characters. Only spaces, tabs and linefeeds have a meaning.

slide-20
SLIDE 20

AES on IBM's Cell Processor

  • Hybrid Processor Architecture

– PPC Processor (main/control unit)
– 8 Synergistic Processing Elements (SPE) as workhorses with 128-bit registers
– Fast ring interconnect between PPC and SPE units
– SPEs support efficient byte extraction and manipulation instructions for their 128-bit SIMD registers (e.g., shuffle, select)
– 16-fold byte-sliced implementation per 128-bit register [BOS09]
– Single AES encryption in ≈283 instructions (1752+2764 INS per 16 streams) → 11.7 cycles/byte
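A rough C sketch of the byte-sliced data layout follows. It assumes that "16-fold byte-sliced" means register i holds byte i of 16 independent AES streams; the function names and the plain-array stand-in for 128-bit registers are illustrative only and are not taken from [BOS09].

```c
#include <assert.h>
#include <stdint.h>

enum { STREAMS = 16, BLOCK_BYTES = 16 };

/* Transpose 16 blocks into 16 "registers": regs[i] collects byte i
 * of every stream, so one 128-bit register holds 16 streams' worth. */
void byteslice(uint8_t blocks[STREAMS][BLOCK_BYTES],
               uint8_t regs[BLOCK_BYTES][STREAMS])
{
    assert(blocks != 0 && regs != 0);
    for (int s = 0; s < STREAMS; s++)
        for (int i = 0; i < BLOCK_BYTES; i++)
            regs[i][s] = blocks[s][i];
}

/* AddRoundKey on sliced data: the same key byte is XORed into all
 * 16 lanes of a register, which a SIMD unit can do in one operation. */
void add_round_key_sliced(uint8_t regs[BLOCK_BYTES][STREAMS],
                          const uint8_t round_key[BLOCK_BYTES])
{
    for (int i = 0; i < BLOCK_BYTES; i++)
        for (int s = 0; s < STREAMS; s++)
            regs[i][s] ^= round_key[i];
}
```

The instruction count on the slide is consistent with this amortization: (1752 + 2764) / 16 ≈ 283 instructions per single encryption.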

slide-21
SLIDE 21

AES on NVIDIA's GTX 295 GPU

  • Streaming Processor Architecture

– 2x240 streaming processor units
– Memory design including local, shared, texture and (large) global memory
– Cache for some memories
– Runs a large number of concurrent, synchronized threads
– AES implementation using CUDA language (or OpenCL as alternative)
– 32-bit integer instructions and T-Tables stored in shared memory
– Benchmarking is not precise due to uncontrolled scheduling: throughput up to 59.6 GB/s reported → 0.17 cycles/byte [BOS09]

slide-22
SLIDE 22

AES on Embedded Systems

  • TI TMS320-C6201: 16/32-bit DSP @200MHz

– Parallelized T-Table implementation on four pairs of ALUs
– Encryption in 228 cycles → 14.25 cycles/byte [WWGP00]

  • AVR ATMega: 8-bit RISC microcontroller @8MHz

– 8-bit AES implementation
– Speed-up by 1 cycle per TLU by placing S-box in RAM (not Flash)
– Fast encryption takes 2,153 cycles → 134.56 cycles/byte [BOS09]

  • MARC4: 4-bit RISC microcontroller @1MHz

– 8-bit AES implementation with 2 registers per entry
– "Fast" encryption takes 23,828 cycles → 1,489 cycles/byte [KP12]
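The cycles/byte figures above follow from dividing the cycle count of one block encryption by the 16-byte AES block size. A small C helper (the name `cycles_per_byte` is chosen for this example) makes the conversion explicit:

```c
#include <assert.h>

enum { AES_BLOCK_BYTES = 16 };

/* Convert the cycle count of one block encryption into cycles/byte. */
double cycles_per_byte(unsigned long cycles_per_block)
{
    assert(cycles_per_block > 0);
    return (double)cycles_per_block / AES_BLOCK_BYTES;
}
```

For the three platforms this gives 228 / 16 = 14.25, 2,153 / 16 ≈ 134.56 and 23,828 / 16 ≈ 1,489 cycles/byte.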

slide-23
SLIDE 23

Coding Intermezzo: Programming AES in Colors (aka ColorForth)

MixColumns layer in ColorForth

slide-24
SLIDE 24

AES on Embedded Systems (cont.)

  • GreenArrays GA144 Tile processor

– Asynchronous 144-core device (nodes)
– Each F18A core has 18-bit ALU
– 128 words of memory + 20 words stack
– Up to 4 instructions per word

  • AES implementation in ColorForth, spread over 17 of 144 nodes
  • Asynchronous processor operation disables cycle count metric
  • Absolute time per 128-bit encryption: 38 µs at 2.2 V supply voltage, consuming 0.9 µJ

slide-25
SLIDE 25

Outline

  • Introduction
  • Processor Platforms
  • Tricks, Tweaks and Codes
  • Benchmarks and Results
  • Conclusions
slide-26
SLIDE 26

AES-128 Platform Ranking on High-Performance

1) 32-bit: GTX 295 GPU [BOS09]: 0.17 cycles/byte
2) 64-bit (with 128-bit SSE3): Core i7 920 [KS09]: 6.92 cycles/byte
3) 128-bit: IBM Cell SPE [BOS09]: 11.7 cycles/byte
4) 8-bit: AVR ATMega [BOS09]: 134.56 cycles/byte
5) 16-bit: TI C5420 [TI]: 219 cycles/byte
6) 4-bit: MARC4 [KP12]: 1,489 cycles/byte
X) 18-bit: GreenArrays GA144: 38 µs at 2.2 V

Important Remark: Beware of distortions, e.g., due to little interest in platform (4) or backward applied metrics for platform (1)

AES performance when encrypting large packets (4 kB)

slide-27
SLIDE 27

AES-128 Platform Ranking on Green Cryptography

1) 18-bit: GreenArrays GA144: 0.63 µJ at 1.8 V
2) 32-bit: GTX 295 GPU (TDP 289 W): 0.67 µJ* [max load at 1.2 GHz]
3) 128-bit: IBM Cell (TDP 110 W): 0.91 µJ* [max load at 3.2 GHz]
4) 64-bit: Core i7 920 (TDP 130 W): 1.35 µJ* [max load at 2.66 GHz]
5) 4-bit: MARC4: 8.58 µJ at 1 MHz
6) 8-bit: AVR ATMega: ≈10 µJ at 8 MHz**
7) 16-bit: TI C5420 (TDP 266 mW): 47 µJ* [max load at 20 MHz]

*Extrapolated from TDP with all cores running AES encryption at 100% utilization
**Extrapolated using an averaged AVR power model based on given cycle count

Energy required to encrypt a single 128-bit AES block
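The slides do not spell out the extrapolation formula. One plausible reading, sketched here in C, is energy = TDP × time per block, divided by the number of cores encrypting in parallel. The function name, the formula and the core counts (4 for the Core i7 920, 1 for the TI C5420) are assumptions made for this example.

```c
#include <assert.h>

/* Estimated energy in microjoules to encrypt one 16-byte block,
 * assuming the device draws power_w while all parallel_units
 * encrypt blocks concurrently at the given cycles/byte. */
double energy_per_block_uj(double power_w, double cycles_per_byte,
                           double clock_hz, unsigned parallel_units)
{
    assert(clock_hz > 0.0 && parallel_units > 0);
    double seconds_per_block = 16.0 * cycles_per_byte / clock_hz;
    return power_w * seconds_per_block / parallel_units * 1e6;
}
```

Under these assumptions the Core i7 920 comes out at about 1.35 µJ (130 W, 6.92 cycles/byte, 2.66 GHz, 4 cores) and the TI C5420 at about 47 µJ (266 mW, 219 cycles/byte, 20 MHz, 1 core), which matches the ranking entries. The other starred entries may rest on different core counts or clock rates.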

slide-28
SLIDE 28

AES Benchmarking: More Results

Source for further (symmetric) crypto benchmarks:

  • eBACS: ECRYPT Benchmarking of Cryptographic Systems: http://bench.cr.yp.to/primitives-stream.html
  • Contains latest benchmarks for (currently) 27 different processors running AES-128/192/256

slide-29
SLIDE 29

Outline

  • Introduction
  • Processor Platforms
  • Tricks, Tweaks and Codes
  • Benchmarks and Results
  • Conclusions
slide-30
SLIDE 30

Conclusions

  • You can find for (nearly) any processing device an AES implementation in (nearly) any language
  • What I couldn't find (yet)

– AES in Brainfuck programming language (sorry - no intermezzo slide!)
– AES on 2-bit or 256-bit processors

  • Processors natively supporting the operands in AES (8/32-bit) are still on the top of the list (Fact!)
  • Processor extensions (such as AES NI or SSEx) greatly support AES encryption in software (see Ryad's talk in the afternoon!)

slide-31
SLIDE 31

Tim Güneysu Hardware Security Group Horst Görtz Institute for IT-Security

10/23/2012

Implementing AES on a Bunch of Processors

ECRYPT AES Day – Bruges, Belgium

Questions?

slide-32
SLIDE 32

Bibliography

  • [WWGP00] Thomas Wollinger, M. Wang, Jorge Guajardo Merchan, Christof Paar: How Well Are High-End DSPs Suited for the AES Algorithms? AES Algorithms on the TMS320C6x DSP. The Third Advanced Encryption Standard (AES3) Candidate Conference, New York, USA, April 13-14, 2000.
  • [BS08] Daniel J. Bernstein, Peter Schwabe: New AES Software Speed Records. INDOCRYPT 2008: 322-336
  • [BOS09] Joppe W. Bos, Dag Arne Osvik, Deian Stefan: Fast Implementations of AES on Various Platforms. IACR Cryptology ePrint Archive 2009: 501 (2009)
  • [KS09] Emilia Käsper, Peter Schwabe: Faster and Timing-Attack Resistant AES-GCM. CHES 2009: 1-17
  • [KP12] Tino Kaufmann, Axel Poschmann: Enabling Standardized Cryptography on Ultra-Constrained 4-bit Microcontrollers. IEEE RFID 2012: 32-39