Implementing AES on a Bunch of Processors
ECRYPT AES Day, Bruges, Belgium
Tim Güneysu, Hardware Security Group, Horst Görtz Institute for IT-Security
10/23/2012
Outline
- Introduction
- Processor Platforms
- Tricks, Tweaks and Codes
- Benchmarks and Results
- Conclusions
The AES Crib Sheet
AES Implementation: General Representation
- Two different representations of AES in the original proposal
  – 8-bit standard implementation
  – 32-bit T-Table implementation, e.g.,

$$(E_0, E_1, E_2, E_3) = T_0(A_0) \oplus T_1(A_5) \oplus T_2(A_{10}) \oplus T_3(A_{15}) \oplus (k_{0,j}, k_{1,j}, k_{2,j}, k_{3,j})$$

- ¼ round: 4 Table Look-Ups (TLU) + 4 x 32-bit XOR
- 1 round: 4 x 4 = 16 TLU + XOR
- AES: 160 TLU + XOR per block encryption
- Memory: 4 T-Boxes, 1 kB each
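As a minimal sketch of where the four 1 kB T-boxes come from (this is not the presenter's code): each entry of T0 packs the S-box output multiplied by the MixColumns coefficients (2, 1, 1, 3) into one 32-bit word, and T1 to T3 are byte rotations of T0. All function and array names are illustrative assumptions, and the S-box is computed by brute force here instead of being stored as a constant table. This is setup code and is not meant to be constant time.

```c
#include <stdint.h>

/* Illustrative names; not taken from the slides. */
static uint8_t  sbox[256];
static uint32_t T0[256], T1[256], T2[256], T3[256];   /* 4 tables x 1 kB */

/* Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return r;
}

/* Rotate an 8-bit value left by s bits (1 <= s <= 7). */
static uint8_t rotl8(uint8_t x, int s)
{
    return (uint8_t)((x << s) | (x >> (8 - s)));
}

/* Fill sbox[] and the four T-tables. */
static void init_tables(void)
{
    for (int x = 0; x < 256; x++) {
        /* Multiplicative inverse by exhaustive search; 0 maps to 0. */
        uint8_t inv = 0;
        for (int y = 1; y < 256 && x != 0; y++) {
            if (gf_mul((uint8_t)x, (uint8_t)y) == 1) {
                inv = (uint8_t)y;
                break;
            }
        }
        /* Affine transformation of the AES S-box. */
        uint8_t s = (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                              rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
        sbox[x] = s;

        /* T0 entry = (2*s, s, s, 3*s), most significant byte first. */
        uint32_t t = ((uint32_t)gf_mul(s, 2) << 24) | ((uint32_t)s << 16) |
                     ((uint32_t)s << 8) | (uint32_t)gf_mul(s, 3);
        T0[x] = t;
        T1[x] = (t >> 8)  | (t << 24);   /* T0 rotated right by 8 bits  */
        T2[x] = (t >> 16) | (t << 16);   /* T0 rotated right by 16 bits */
        T3[x] = (t >> 24) | (t << 8);    /* T0 rotated right by 24 bits */
    }
}
```

Call `init_tables()` once before the first lookup. The byte order chosen for the packed word is one common convention; a given implementation may use the opposite one.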
AES Implementation: Choice of Processor
- A dedicated AES processor in HW is not always the preferred option
  – AES is often just a supplementary function to a software application
  – HW development is too costly or the necessary skills are not available
- But when doing AES in software: which processor is the best?
AES Implementation: Parameters
- Key size of AES
  – 128, 192, 256 bit
- Applied mode of operation
  – ECB, CBC, GCM, CTR, …
- Blocks concurrently processed
  – Single block (limited data transfers)
  – Multiple blocks (overhead reduction, bitslicing)
- Round key computation
  – Precomputed (when processing bulk data)
  – On-the-fly (when changing keys frequently)
Outline
- Introduction
- Processor Platforms
- Tricks, Tweaks and Codes
- Benchmarks and Results
- Conclusions
Processors and Platforms
- Native bit sizes of General-Purpose Processors (GPP)
  – 4-bit, e.g., MARC-4, NEC uPD75X in pocket calculators/washing machines
  – 8-bit, e.g., Atmel ATMegaXX, Intel 8051 in (many) embedded systems
  – 16-bit, e.g., TI MSP430, DEC PDP-11 in (fewer) embedded systems
  – 32-bit, e.g., ARM, TriCore in smart phones and automobiles
  – 64-bit, e.g., Intel i3/5/7, AMD A-Series in PCs and workstations
  – 128-bit, e.g., IBM Cell in PS3 (actually, only 128-bit SIMD on SPEs)
- Myth or Fact: AES is always most efficient on native 8-bit and 32-bit processors!?
Processor Architectures
- General processor design
  – RISC vs. CISC (Reduced/Complex Instruction Set Computer)
  – Single-Instruction Multiple-Data (SIMD) operation
  – Super-scalar devices processing more than one instruction per cycle
- Processor interface to memory
  – Von Neumann vs. Harvard: shared memory for data and program?
  – Cache for data and/or program? (Cache attacks!)
  – Static/dynamic external or built-in RAM?
- Additional processor extensions
  – Multimedia/integer co-processor
  – Special/native Instruction Set Extensions (ISE)
Other Processor Architectures
- Streaming processors, such as GPUs
  – Multi-processors run hundreds of concurrent threads
  – High memory bandwidth, but high latency to global memory
- Digital Signal Processors (DSP)
  – Support fast combined arithmetic instructions
  – Are improved arithmetic instructions useful for AES?
- Other array/tile-based processors
  – Synchronous/asynchronous processing cores
  – Processor-based systolic array cores (Tilera, GreenArrays)
Outline
- Introduction
- Processor Platforms
- Tricks, Tweaks and Codes
- Benchmarks and Results
- Conclusions
AES Software Optimization
- General requirements for secure implementation in software
  – Disable (or control) cache to prevent cache attacks
  – Avoid conditional branches to counter timing attacks
- Common tweaks to achieve high performance
  – Make particular use of specialized instructions
  – Unroll rounds and loops to reduce instruction cycle count
  – Optimize register allocation
  – Precompute and store values in tables (e.g., T-Tables, round keys and constants)
- Common tweaks to minimize code size
  – Reuse code by functions to minimize instruction count
  – Limit amount of precomputed and stored values
- Common tweaks for low energy consumption
  – Reduce number of costly load and store operations to memory
  – General approach often similar to the optimization for high performance
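To illustrate the "avoid conditional branches" requirement, here is a minimal sketch of the GF(2^8) doubling step used in MixColumns, written once with a data-dependent branch and once without. The function names are illustrative assumptions, not taken from the talk. Whether a compiler keeps the second variant free of branches has to be checked on the actual target.

```c
#include <stdint.h>

/* Branching variant: control flow depends on the top bit of (secret) data. */
static uint8_t xtime_branch(uint8_t x)
{
    if (x & 0x80)
        return (uint8_t)((x << 1) ^ 0x1b);
    return (uint8_t)(x << 1);
}

/* Branch-free variant: the top bit is turned into a 0x00/0xff mask,
   so the same instruction sequence runs for every input. */
static uint8_t xtime_ct(uint8_t x)
{
    uint8_t mask = (uint8_t)(0u - (unsigned)(x >> 7));   /* 0x00 or 0xff */
    return (uint8_t)((x << 1) ^ (mask & 0x1b));
}
```

Both functions compute the same value for all 256 inputs; only the second avoids input-dependent control flow at the source level.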
Coding Intermezzo: Have you ever tried to implement AES on a Commodore C64?
encrypt
        ldx #$07
.addfirst
        lda aesblock+0,x        ; 4
        eor expkey+0,x          ; 8
        sta tmpblock+0,x        ; 13
        lda aesblock+8,x        ; 17
        eor expkey+8,x          ; 21
        sta tmpblock+8,x        ; 26
        dex                     ; 28
        bpl .addfirst           ; 31
        ldy #$10
.round
        lda expkey+$00,y        ; 4
        ldx tmpblock+4*0+0      ; 7
        eor ssm0,x              ; 11
        ldx tmpblock+4*1+1      ; 14
        eor ssm3,x              ; 18
        ldx tmpblock+4*2+2      ; 21
        eor ssm2,x              ; 25
        ldx tmpblock+4*3+3      ; 28
        eor ssm1,x              ; 32
        sta aesblock+$00        ; 36

        lda expkey+$01,y
        ldx tmpblock+4*0+0
        eor ssm1,x
        ldx tmpblock+4*1+1
        eor ssm0,x
        ldx tmpblock+4*2+2
        eor ssm3,x
        ldx tmpblock+4*3+3
        eor ssm2,x
        sta aesblock+$01

        lda expkey+$02,y
        ldx tmpblock+4*0+0
        eor ssm2,x
        ldx tmpblock+4*1+1
        eor ssm1,x
        ldx tmpblock+4*2+2
        eor ssm0,x
        ldx tmpblock+4*3+3
        eor ssm3,x
        sta aesblock+$02

        lda expkey+$03,y
        ldx tmpblock+4*0+0
        eor ssm3,x
        ldx tmpblock+4*1+1
        eor ssm2,x
        ldx tmpblock+4*2+2
        eor ssm1,x
        ldx tmpblock+4*3+3
        eor ssm0,x
        sta aesblock+$03

        lda expkey+$04,y
        ldx tmpblock+4*1+0
        eor ssm0,x
        ldx tmpblock+4*2+1
        eor ssm3,x
        ldx tmpblock+4*3+2
        eor ssm2,x
        ldx tmpblock+4*0+3
        eor ssm1,x
        sta aesblock+$04

        lda expkey+$05,y
        ldx tmpblock+4*1+0
        eor ssm1,x
        ldx tmpblock+4*2+1
        eor ssm0,x
        ldx tmpblock+4*3+2
        eor ssm3,x
        ldx tmpblock+4*0+3
        eor ssm2,x
        sta aesblock+$05
AES-256 in ACME Assembler [Extract of source at http://www.robos.org/prog]
Commodore C64: 8-bit CPU with 64 KB RAM
Real Coding: Sample T-Table AES in C
(Reference code by Brian Gladman)
- High-performance AES for processors ≥ 32 bit with interleaved T-tables
- Per round, 4 instances of the code snippet are required
- AES has 720 instructions (INS)
  – 208 loads
  – 4 stores
  – 508 integer instructions
    - 160 shifts
    - 176 masks (160 + 16 for the last round)
    - 168 XORs
    - 4 overhead for CTR mode
    /* Read round keys */
    z0 = roundkeys[i * 4 + 0];
    z1 = roundkeys[i * 4 + 1];
    z2 = roundkeys[i * 4 + 2];
    z3 = roundkeys[i * 4 + 3];

    /* Extract and mask input bytes */
    p00 = (uint32) y0 >> 20;
    p01 = (uint32) y0 >> 12;
    p02 = (uint32) y0 >> 4;
    p03 = (uint32) y0 << 4;
    p00 &= 0xff0;
    p01 &= 0xff0;
    p02 &= 0xff0;
    p03 &= 0xff0;

    /* Perform TLU */
    p00 = *(uint32 *) (table0 + p00);
    p01 = *(uint32 *) (table1 + p01);
    p02 = *(uint32 *) (table2 + p02);
    p03 = *(uint32 *) (table3 + p03);

    /* Add TLU to round keys */
    z0 ^= p00;
    z3 ^= p01;
    z2 ^= p02;
    z1 ^= p03;
    …

(only ¼ round shown)

[Figure: Interleaved memory layout (32-bit entries). Entries of Table 0, Table 1, Table 2 and Table 3 alternate in memory at byte offsets 0, 16, 32, 48, …; table0 to table3 point to the first entry of the respective table. Access the j-th table entry of table i via table<i>+16j.]
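The interleaved layout can be sketched in a few lines of C. This is a hedged illustration, not Gladman's reference code: the array `interleaved`, the helper names and the stored value in the usage note are assumptions for demonstration only. The point is that entry j of table i sits at byte offset 4i + 16j, so an input byte that has been shifted and masked to 16·byte can be added to the table base without further scaling.

```c
#include <stdint.h>
#include <string.h>

/* One 4 kB block: entry j of table i is stored at byte offset 4*i + 16*j. */
static uint8_t interleaved[4 * 256 * 4];

/* Store a 32-bit entry (memcpy sidesteps alignment and aliasing issues). */
static void set_entry(int table, int j, uint32_t value)
{
    memcpy(interleaved + 4 * table + 16 * j, &value, sizeof value);
}

/* Look up an entry using a pre-scaled offset p = 16*j, as in the slide code. */
static uint32_t lookup(int table, uint32_t p)
{
    uint32_t v;
    memcpy(&v, interleaved + 4 * table + p, sizeof v);
    return v;
}

/* Offsets for the four bytes of y0, each already multiplied by 16. */
static void extract_offsets(uint32_t y0, uint32_t p[4])
{
    p[0] = (y0 >> 20) & 0xff0;   /* most significant byte  */
    p[1] = (y0 >> 12) & 0xff0;
    p[2] = (y0 >>  4) & 0xff0;
    p[3] = (y0 <<  4) & 0xff0;   /* least significant byte */
}
```

For y0 = 0x12345678 the four offsets are 0x120, 0x340, 0x560 and 0x780, i.e. 16 times each input byte. A real implementation would fill the block with the actual T-table contents.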
Optimizing AES for High-Performance
- Special instruction: Combined Shift-and-Mask
  – On PPC, rlwinm is available as single instruction
  – Saves 160 instructions for separate masking [BS08]
  – AES on PPC has now 560 instructions (720 − 160)
Extract and mask input bytes (separate shift and mask):
    p00 = (uint32) y0 >> 20;
    p01 = (uint32) y0 >> 12;
    p02 = (uint32) y0 >> 4;
    p03 = (uint32) y0 << 4;
    p00 &= 0xff0;
    p01 &= 0xff0;
    p02 &= 0xff0;
    p03 &= 0xff0;

Extract and mask input bytes (combined shift-and-mask):
    p00 = ((uint32) y0 >> 20) & 0xff0;
    p01 = ((uint32) y0 >> 12) & 0xff0;
    p02 = ((uint32) y0 >> 4) & 0xff0;
    p03 = ((uint32) y0 << 4) & 0xff0;
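A small C model can show why one rlwinm replaces a shift plus a mask. The sketch below models the PowerPC instruction (rotate left word immediate, then AND with mask) only for the non-wrapping case MB ≤ ME. The helper names and the chosen operands (12, 20, 27) are my own derivation for the first extraction, not values quoted from [BS08].

```c
#include <stdint.h>

/* Rotate a 32-bit word left by sh bits (0 <= sh <= 31). */
static uint32_t rotl32(uint32_t x, unsigned sh)
{
    return (x << (sh & 31)) | (x >> ((32 - sh) & 31));
}

/* C model of rlwinm for mb <= me.
   PowerPC numbers bits from the most significant (0) to the least (31),
   so the mask covers bits mb..me in that numbering. */
static uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    uint32_t mask = (0xffffffffu >> mb) & (0xffffffffu << (31 - me));
    return rotl32(rs, sh) & mask;
}

/* Two operations: shift, then mask (as in the baseline code). */
static uint32_t extract_two_step(uint32_t y0)
{
    uint32_t p00 = y0 >> 20;
    p00 &= 0xff0;
    return p00;
}

/* One operation: rotate left by 12 (same as right by 20), keep bits 20..27. */
static uint32_t extract_rlwinm(uint32_t y0)
{
    return rlwinm(y0, 12, 20, 27);
}
```

The other three extractions follow the same pattern with rotate amounts 20, 28 and 4 and the same mask bits.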
Optimizing AES for High-Performance [cont.]
- Special instruction: Scaled Index Loads
  – On x86, shift and load instructions can be combined
  – Saves 80 instructions for separate shifting of top and bottom bytes [BS08]
  – AES on x86 has 640 instructions (not to be combined with previous method!!)

Extract and mask input bytes, then perform TLU:
    p03 = (uint32) y0 << 4;
    …
    p03 &= 0xff0;
    …
    p03 = *(uint32 *) (table3 + p03);

Mask first and do shifted TLU:
    p03 = y0 & 0xff;
    p03 = *(uint32 *) (table3 + (p03 << 4));
Optimizing AES for High-Performance [cont.]
- Availability of 64-bit Registers
  – On AMD64 and UltraSparcV9, use padded values in 64-bit registers
  – Padding implicitly includes the shift by 4 bit (aka multiplication by 16)
  – Padding is applied consistently through entire AES
  – Saves 80 instructions (no need to mask top bytes anymore) [BS08]
  – AES now has 640 instructions (again, not to be combined)
  – Example: 0xc66363a5 becomes 0x0c60063006300a50
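The padded representation can be reproduced with a short sketch: every byte of the 32-bit value is placed in its own 16-bit lane and pre-multiplied by 16. The function names are illustrative assumptions, and how the padded words are consumed is my reading of the slide rather than code from [BS08].

```c
#include <stdint.h>

/* Spread the four bytes of a 32-bit value into four 16-bit lanes,
   each lane holding byte << 4 (i.e. byte * 16). */
static uint64_t pad_entry(uint32_t t)
{
    uint64_t r = 0;
    for (int i = 3; i >= 0; i--) {
        uint64_t byte = (t >> (8 * i)) & 0xff;
        r = (r << 16) | (byte << 4);
    }
    return r;
}

/* Table offset from the top lane: a shift alone is enough,
   because nothing sits above the top lane. */
static uint32_t top_offset(uint64_t padded)
{
    return (uint32_t)(padded >> 48);
}

/* Table offset from lane 0..2: shift and mask are still needed. */
static uint32_t lane_offset(uint64_t padded, int lane)
{
    return (uint32_t)(padded >> (16 * lane)) & 0xff0;
}
```

With the slide's example, `pad_entry(0xc66363a5)` yields 0x0c60063006300a50, and the top lane directly gives the offset 0xc60 = 16 · 0xc6 without a mask.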
Optimizing AES for High-Performance [cont.]
- Other ways to optimize the T-Table AES in software…
  – Special instruction: Combined Load-XOR (x86/AMD64) saves 168 instructions
  – Special instruction: second byte extraction instruction (x86) saves 40 instructions
  – Special instruction: two-bytes loads saves 4 instructions
  – Byte extraction via loads trades 160-320 integer instructions against 200 loads/stores
  – Round key caching in extra registers saves about 44 instructions
  – Utilize SSE processor extensions instead of plain CPU ALU
  – …
- Results for common ≥32-bit processors (encrypting 4 kByte of data)
  – IBM PPC G4 7410: 459 instructions, 14.57 cycles/byte [BS08]
  – Intel Pentium 4 f12: 414 instructions, 14.13 cycles/byte [BS08]
  – Intel Core 2 Quad Q6600 with SSE3: ≈ 278 instructions*, 9.32 cycles/byte [KS09]
  – Sun UltraSparc III: 505 instructions, 12.06 cycles/byte [BS08]
  – Intel Core i7 920: ≈ 278 instructions*, 6.92 cycles/byte [KS09]
*estimated since not provided in the original work
Coding Intermezzo: Have you ever programmed AES-128 in Whitespace?
Whitespace is a programming language developed by Edwin Brady and Chris Morris. The Whitespace interpreter ignores any non-whitespace characters. Only spaces, tabs and linefeeds have a meaning.
AES on IBM's Cell Processor
- Hybrid Processor Architecture
  – PPC Processor (main/control unit)
  – 8 Synergistic Processing Elements (SPE) as work horse with 128-bit registers
  – Fast ring interconnect between PPC and SPE units
  – SPEs support efficient byte extraction and manipulation instructions for their 128-bit SIMD registers (e.g., shuffle, select)
  – 16-fold byte-sliced implementation per 128-bit register [BOS09]
  – Single AES encryption in ≈283 instructions (1752+2764 INS per 16 streams), i.e., 11.7 cycles/byte
AES on NVIDIA's GTX 295 GPU
- Streaming Processor Architecture
  – 2x240 streaming processor units
  – Memory design including local, shared, texture and (large) global memory
  – Cache for some memories
  – Runs a large number of concurrent, synchronized threads
  – AES implementation using CUDA language (or OpenCL as alternative)
  – 32-bit integer instructions and T-Tables stored in shared memory
  – Benchmarking is not precise due to uncontrolled scheduling: throughput up to 59.6 Gbit/s reported, i.e., 0.17 cycles/byte [BOS09]
AES on Embedded Systems
- TI TMS320-C6201: 16/32-bit DSP @200MHz
  – Parallelized T-Table implementation on four pairs of ALUs
  – Encryption in 228 cycles, i.e., 14.25 cycles/byte [WWGP00]
- AVR ATMega: 8-bit RISC microcontroller @8MHz
  – 8-bit AES implementation
  – Speed-up by 1 cycle per TLU by placing S-box in RAM (not Flash)
  – Fast encryption takes 2,153 cycles, i.e., 134.56 cycles/byte [BOS09]
- MARC4: 4-bit RISC microcontroller @1MHz
  – 8-bit AES implementation with 2 registers per entry
  – "Fast" encryption takes 23,828 cycles, i.e., 1,489 cycles/byte [KP12]
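The cycles/byte figures above follow from dividing the cycle count of one encryption by the 16-byte AES block size. A minimal sketch of that arithmetic follows; the function names are illustrative assumptions, and the throughput helper ignores everything except the raw cipher cycles.

```c
/* Cycles per byte for one 16-byte AES block. */
static double cycles_per_byte(double cycles_per_block)
{
    return cycles_per_block / 16.0;
}

/* Idealized throughput in bytes per second at a given clock frequency. */
static double bytes_per_second(double cpb, double clock_hz)
{
    return clock_hz / cpb;
}
```

For example, `cycles_per_byte(2153)` gives 134.5625, which the slide rounds to 134.56, and `cycles_per_byte(23828)` gives 1489.25.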
Coding Intermezzo: Programming AES in Colors (aka ColorForth)
MixColumns Layer in ColorForth
AES on Embedded Systems (cont.)
- GreenArrays GA144 Tile processor
  – Asynchronous 144 core device (nodes)
  – Each F18A core has 18-bit ALU
  – 128 words of memory + 20 words stack
  – Up to 4 instructions per word
- AES implementation in ColorForth, spread over 17 of 144 nodes
- Asynchronous processor operation disables cycle count metric
- Absolute time per 128-bit encryption: 38 µs at 2.2 V supply voltage, consuming 0.9 µJ
Outline
- Introduction
- Processor Platforms
- Tricks, Tweaks and Codes
- Benchmarks and Results
- Conclusions
AES-128 Platform Ranking on High-Performance
1) 32-bit: GTX 295 GPU [BOS09]: 0.17 cycles/byte
2) 64-bit (with 128-bit SSE3): Core i7 920 [KS09]: 6.92 cycles/byte
3) 128-bit: IBM Cell SPE [BOS09]: 11.7 cycles/byte
4) 8-bit: AVR ATMega [BOS09]: 134.56 cycles/byte
5) 16-bit: TI C5420 [TI]: 219 cycles/byte
6) 4-bit: MARC4 [KP12]: 1,489 cycles/byte
X) 18-bit: GreenArrays GA144: 38 µs at 2.2 V
Important Remark: Beware of distortions, e.g., due to little interest in platform (4) or backward-applied metrics for platform (1)
AES performance when encrypting large packets (4 kB)
AES-128 Platform Ranking on Green Cryptography
1) 18-bit: GreenArrays GA144: 0.63 µJ at 1.8 V
2) 32-bit: GTX 295 GPU (TDP 289 W): 0.67 µJ* [max load at 1.2 GHz]
3) 128-bit: IBM Cell (TDP 110 W): 0.91 µJ* [max load at 3.2 GHz]
4) 64-bit: Core i7 920 (TDP 130 W): 1.35 µJ* [max load at 2.66 GHz]
5) 4-bit: MARC4: 8.58 µJ at 1 MHz
6) 8-bit: AVR ATMega: ≈10 µJ at 8 MHz**
7) 16-bit: TI C5420 (TDP 266 mW): 47 µJ* [max load at 20 MHz]
*Extrapolated from TDP with all cores running AES encryption at 100% utilization
**Extrapolated using an averaged AVR power model based on given cycle count
Energy required to encrypt a single 128-bit AES block
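The slides give no formula for the TDP-based extrapolation, so the following is only one plausible reading: the energy per block is the TDP multiplied by the time one block takes, divided by the number of cores that encrypt in parallel. The function name, the formula and the core count of four for the Core i7 920 are assumptions on my part, not statements from the talk.

```c
/* Hypothetical energy model: TDP spread over all cores, each core
   encrypting one 16-byte block in (cycles_per_byte * 16) cycles. */
static double energy_per_block_joule(double tdp_watt, double clock_hz,
                                     double cycles_per_byte, int cores)
{
    double seconds_per_block = cycles_per_byte * 16.0 / clock_hz;
    return tdp_watt * seconds_per_block / cores;
}
```

With the Core i7 920 figures listed above (130 W, 2.66 GHz, 6.92 cycles/byte) and four cores, this model gives roughly 1.35 µJ per block. Other entries in the ranking may rest on details this simple model does not capture.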
AES Benchmarking: More Results
Source for further (symmetric) crypto benchmarks:
eBACS: ECRYPT Benchmarking of Cryptographic Systems: http://bench.cr.yp.to/primitives-stream.html
Contains latest benchmarks for (currently) 27 different processors running AES-128/192/256
Outline
- Introduction
- Processor Platforms
- Tricks, Tweaks and Codes
- Benchmarks and Results
- Conclusions
Conclusions
- You can find an AES implementation in (nearly) any language for (nearly) any processing device
- What I couldn't find (yet)
  – AES in the Brainfuck programming language (sorry, no intermezzo slide!)
  – AES on 2-bit or 256-bit processors
- Processors natively supporting the operand sizes of AES (8/32-bit) are still on the top of the list (Fact!)
- Processor extensions (such as AES-NI or SSEx) greatly support AES encryption in software (see Ryad's talk in the afternoon!)
Tim Güneysu Hardware Security Group Horst Görtz Institute for IT-Security
10/23/2012
Implementing AES on a Bunch of Processors
ECRYPT AES Day – Bruges, Belgium
Questions?
Bibliography
- [WWGP00] Thomas Wollinger, M. Wang, Jorge Guajardo Merchan, Christof Paar: How Well Are High-End DSPs Suited for the AES Algorithms? AES Algorithms on the TMS320C6x DSP. The Third Advanced Encryption Standard (AES3) Candidate Conference, New York, USA, April 13-14, 2000.
- [BS08] Daniel J. Bernstein, Peter Schwabe: New AES Software Speed Records. INDOCRYPT 2008: 322-336
- [BOS09] Joppe W. Bos, Dag Arne Osvik, Deian Stefan: Fast Implementations of AES on Various Platforms. IACR Cryptology ePrint Archive 2009: 501 (2009)
- [KS09] Emilia Käsper, Peter Schwabe: Faster and Timing-Attack Resistant AES-GCM. CHES 2009: 1-17
- [KP12] Tino Kaufmann, Axel Poschmann: Enabling Standardized Cryptography on