Implementing AES on a Bunch of Processors (ECRYPT AES Day, Bruges)



  1. Implementing AES on a Bunch of Processors ECRYPT AES Day – Bruges, Belgium Tim Güneysu Hardware Security Group 10/23/2012 Horst Görtz Institute for IT-Security

  2. Outline • Introduction • Processor Platforms • Tricks, Tweaks and Codes • Benchmarks and Results • Conclusions

  3. The AES Crib Sheet

  4. AES Implementation: General Representation • Two different representations of AES in the original proposal – 8-bit standard implementation – 32-bit T-table implementation, e.g., for one output column: $(E_{0,j}, E_{1,j}, E_{2,j}, E_{3,j})^T = T_0(A_0) \oplus T_1(A_5) \oplus T_2(A_{10}) \oplus T_3(A_{15}) \oplus (k_{0,j}, k_{1,j}, k_{2,j}, k_{3,j})^T$ • ¼ round: 4 table look-ups (TLU) + 4 × 32-bit XOR • 1 round: 4 × 4 = 16 TLU + XOR • AES: 160 TLU + XOR per block encryption • Memory: 4 T-boxes, 1 kB each
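
The quarter round on this slide can be sketched in C as follows. This is a minimal illustration, not the reference code: the function names (`build_ttables`, `quarter_round`) are hypothetical, the tables are generated from the S-box definition instead of being stored as constants, and the big-endian word layout (coefficients 02, 01, 01, 03 from the most significant byte down) is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return r;
}

static uint8_t rotl8(uint8_t x, unsigned n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

static uint32_t rotr32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x >> n) | (x << (32 - n));
}

/* AES S-box entry: multiplicative inverse (0 maps to 0), then the affine map. */
static uint8_t sbox_entry(uint8_t x)
{
    uint8_t inv = 0;
    for (unsigned y = 1; y < 256 && x != 0; y++) {
        if (gf_mul(x, (uint8_t)y) == 1) {
            inv = (uint8_t)y;
            break;
        }
    }
    return (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                     rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
}

/* Fill the four 1 kB tables; T[i] is T[0] rotated right by 8*i bits. */
void build_ttables(uint32_t T[4][256])
{
    for (unsigned x = 0; x < 256; x++) {
        uint8_t s = sbox_entry((uint8_t)x);
        uint32_t t0 = ((uint32_t)gf_mul(s, 2) << 24) | ((uint32_t)s << 16) |
                      ((uint32_t)s << 8) | (uint32_t)gf_mul(s, 3);
        T[0][x] = t0;
        T[1][x] = rotr32(t0, 8);
        T[2][x] = rotr32(t0, 16);
        T[3][x] = rotr32(t0, 24);
    }
}

/* One quarter round: 4 table look-ups and 4 XORs yield one output column. */
uint32_t quarter_round(uint32_t T[4][256], uint8_t a0, uint8_t a5,
                       uint8_t a10, uint8_t a15, uint32_t k)
{
    return T[0][a0] ^ T[1][a5] ^ T[2][a10] ^ T[3][a15] ^ k;
}
```

Four such calls per round and ten rounds for AES-128 give the 160 look-ups counted on the slide, if the last round (which has no MixColumns) is counted the same way.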

  5. AES Implementation: Choice of Processor • A dedicated AES processor in HW is not always the preferred option – AES is often just a supplementary function to a software application – HW development is too costly or necessary skills are not available • But when doing AES in software: which processor is the best?

  6. AES Implementation: Parameters • Key size of AES – 128, 192, 256 bit • Applied mode of operation – ECB, CBC, GCM, CTR, … • Blocks concurrently processed – Single block (limited data transfers) – Multiple blocks (overhead reduction, bitslicing) • Round key computation – Precomputed (when processing bulk data) – On-the-fly (when changing keys frequently)
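
To make the key-size and round-key parameters concrete, the sketch below computes the number of rounds and the memory needed for precomputed round keys. The function names are hypothetical. The values follow from the AES specification: 10, 12 or 14 rounds, with one 16-byte round key per round plus the initial one.

```c
#include <assert.h>

/* Number of AES rounds for a 128-, 192- or 256-bit key; 0 if unsupported. */
unsigned aes_rounds(unsigned key_bits)
{
    switch (key_bits) {
    case 128: return 10;
    case 192: return 12;
    case 256: return 14;
    default:  return 0;
    }
}

/* Bytes of memory needed to hold all precomputed round keys. */
unsigned round_key_bytes(unsigned key_bits)
{
    unsigned nr = aes_rounds(key_bits);
    assert(nr != 0);            /* only the three standard key sizes */
    return 16u * (nr + 1u);
}
```

On a small 8-bit device, the 176 to 240 bytes for precomputed round keys can be a noticeable share of RAM, which is one reason on-the-fly key computation is listed as an option.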

  7. Outline • Introduction • Processor Platforms • Tricks, Tweaks and Codes • Benchmarks and Results • Conclusions

  8. Processors and Platforms • Native bit sizes of General-Purpose Processors (GPP) – 4-bit, e.g., MARC-4, NEC uPD75X in pocket calculators/washing machines – 8-bit, e.g., Atmel ATMegaXX, Intel 8051 in (many) embedded systems – 16-bit, e.g., TI MSP430, DEC PDP-11 in (fewer) embedded systems – 32-bit, e.g., ARM, TriCore in smart phones and automobiles – 64-bit, e.g., Intel i3/5/7, AMD A-Series in PCs and workstations – 128-bit, e.g., IBM Cell in PS3 (actually, only 128-bit SIMD on SPEs) • Myth or fact: AES is always most efficient on native 8-bit and 32-bit processors!?

  9. Processor Architectures • General processor design – RISC vs. CISC (Reduced/Complex Instruction Set Computer) – Single-Instruction Multiple-Data (SIMD) operation – Super-scalar devices processing more than one instruction per cycle • Processor interface to memory – Von Neumann vs. Harvard: shared memory for data and program? – Cache for data and/or program? (→ cache attacks!) – Static/dynamic external or built-in RAM? • Additional processor extensions – Multimedia/integer co-processor – Special/native Instruction Set Extensions (ISE)

  10. Other Processor Architectures • Streaming processors, such as GPUs – Multi-processors run hundreds of concurrent threads – High memory bandwidth, but high latency to global memory • Digital Signal Processors (DSP) – Support fast combined arithmetic instructions – Are improved arithmetic instructions useful for AES? • Other array/tile-based processors – Synchronous/asynchronous processing cores – Processor-based systolic array cores (Tilera, GreenArrays)

  11. Outline • Introduction • Processor Platforms • Tricks, Tweaks and Codes • Benchmarks and Results • Conclusions

  12. AES Software Optimization • General requirements for secure implementation in software – Disable (or control) cache to prevent cache attacks – Avoid conditional branches to counter timing attacks • Common tweaks to achieve high performance – Make particular use of specialized instructions – Unroll rounds and loops to reduce instruction cycle count – Optimize register allocation – Precompute and store values in tables (e.g., T-tables, round keys and constants) • Common tweaks to minimize code size – Reuse code via functions to minimize instruction count – Limit the amount of precomputed and stored values • Common tweaks for low energy consumption – Reduce the number of costly load and store operations to memory – General approach often similar to the optimization for high performance
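
As an illustration of the "avoid conditional branches" requirement, the sketch below shows the GF(2^8) doubling step used in MixColumns, once with a data-dependent branch and once without. The function names are hypothetical, and whether a given compiler keeps the second version branch-free is an assumption that should be checked in the generated assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Doubling in GF(2^8) with a data-dependent branch; timing may leak the top bit. */
uint8_t xtime_branchy(uint8_t x)
{
    if (x & 0x80)
        return (uint8_t)((x << 1) ^ 0x1b);
    return (uint8_t)(x << 1);
}

/* Same result without a branch: expand the top bit into a 0x00/0xff mask. */
uint8_t xtime_branchfree(uint8_t x)
{
    uint8_t mask = (uint8_t)(0u - (unsigned)(x >> 7));
    return (uint8_t)((x << 1) ^ (0x1b & mask));
}

/* Exhaustive check that both versions agree on all 256 inputs. */
void xtime_selftest(void)
{
    for (unsigned x = 0; x < 256; x++)
        assert(xtime_branchy((uint8_t)x) == xtime_branchfree((uint8_t)x));
}
```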

  13. Coding Intermezzo: Have you ever tried to implement AES on a Commodore C64? • Commodore C64: 8-bit CPU with 64 KB RAM • AES-256 in ACME Assembler [Extract of source at http://www.robos.org/prog]

        encrypt
                ldx #$07
        .addfirst
                lda aesblock+0,x        ; 4
                eor expkey+0,x          ; 8
                sta tmpblock+0,x        ; 13
                lda aesblock+8,x        ; 17
                eor expkey+8,x          ; 21
                sta tmpblock+8,x        ; 26
                dex                     ; 28
                bpl .addfirst           ; 31

                ldy #$10
        .round
                lda expkey+$00,y        ; 4
                ldx tmpblock+4*0+0      ; 7
                eor ssm0,x              ; 11
                ldx tmpblock+4*1+1      ; 14
                eor ssm3,x              ; 18
                ldx tmpblock+4*2+2      ; 21
                eor ssm2,x              ; 25
                ldx tmpblock+4*3+3      ; 28
                eor ssm1,x              ; 32
                sta aesblock+$00        ; 36

                lda expkey+$01,y
                ldx tmpblock+4*0+0
                eor ssm1,x
                ldx tmpblock+4*1+1
                eor ssm0,x
                ldx tmpblock+4*2+2
                eor ssm3,x
                ldx tmpblock+4*3+3
                eor ssm2,x
                sta aesblock+$01

                lda expkey+$02,y
                ldx tmpblock+4*0+0
                eor ssm2,x
                ldx tmpblock+4*1+1
                eor ssm1,x
                ldx tmpblock+4*2+2
                eor ssm0,x
                ldx tmpblock+4*3+3
                eor ssm3,x
                sta aesblock+$02

                lda expkey+$03,y
                ldx tmpblock+4*0+0
                eor ssm3,x
                ldx tmpblock+4*1+1
                eor ssm2,x
                ldx tmpblock+4*2+2
                eor ssm1,x
                ldx tmpblock+4*3+3
                eor ssm0,x
                sta aesblock+$03

                lda expkey+$04,y
                ldx tmpblock+4*1+0
                eor ssm0,x
                ldx tmpblock+4*2+1
                eor ssm3,x
                ldx tmpblock+4*3+2
                eor ssm2,x
                ldx tmpblock+4*0+3
                eor ssm1,x
                sta aesblock+$04

                lda expkey+$05,y
                ldx tmpblock+4*1+0
                eor ssm1,x
                ldx tmpblock+4*2+1
                eor ssm0,x
                ldx tmpblock+4*3+2
                eor ssm3,x
                ldx tmpblock+4*0+3
                eor ssm2,x
                sta aesblock+$05

  14. Real Coding: Sample T-Table AES in C (reference code by Brian Gladman) • High-performance AES for processors ≥ 32 bit with interleaved T-tables (32-bit entries) • Per round, 4 instances of the code snippet are required • AES has 720 instructions (INS) – 208 loads – 4 stores – 508 integer instructions: 160 shifts, 176 masks (+16 for last round), 168 XORs, 4 overhead for CTR mode

      Code snippet (only ¼ round shown):

        /* Read round keys */
        z0 = roundkeys[i * 4 + 0];
        z1 = roundkeys[i * 4 + 1];
        z2 = roundkeys[i * 4 + 2];
        z3 = roundkeys[i * 4 + 3];

        /* Extract and mask input bytes */
        p00 = (uint32) y0 >> 20;
        p01 = (uint32) y0 >> 12;
        p02 = (uint32) y0 >> 4;
        p03 = (uint32) y0 << 4;
        p00 &= 0xff0;
        p01 &= 0xff0;
        p02 &= 0xff0;
        p03 &= 0xff0;

        /* Perform TLU */
        p00 = *(uint32 *) (table0 + p00);
        p01 = *(uint32 *) (table1 + p01);
        p02 = *(uint32 *) (table2 + p02);
        p03 = *(uint32 *) (table3 + p03);

        /* Add TLU to round keys */
        z0 ^= p00;
        z3 ^= p01;
        z2 ^= p02;
        z1 ^= p03;
        …

      Memory layout (interleaved): each 16-byte group at byte offset 0, 16, 32, 48, … holds one 32-bit entry each of Table 0, Table 1, Table 2 and Table 3; table0 to table3 mark the first entry of each table. Access the j-th entry of table i via table<i>+16j.
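
The interleaved layout can be sketched as follows. This is an illustration under assumptions, not the reference code: `interleave_tables` and `lookup` are hypothetical names, and `memcpy` is used for the loads so that the example does not depend on alignment or aliasing rules.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { ENTRIES = 256, STRIDE = 16 };   /* 4 tables x 4 bytes per group */

/* Interleave four separate 256-entry tables into one 4 kB buffer:
   entry j of table i ends up at byte offset 16*j + 4*i. */
void interleave_tables(uint8_t out[ENTRIES * STRIDE], uint32_t tables[4][ENTRIES])
{
    for (unsigned j = 0; j < ENTRIES; j++)
        for (unsigned i = 0; i < 4; i++)
            memcpy(out + STRIDE * j + 4 * i, &tables[i][j], sizeof(uint32_t));
}

/* Look up table i with a pre-scaled offset 16*j, i.e. a value that was
   shifted and masked with 0xff0 as on the slide. */
uint32_t lookup(const uint8_t *buf, unsigned i, uint32_t offset)
{
    uint32_t v;
    assert(i < 4 && (offset & ~0xff0u) == 0);
    memcpy(&v, buf + 4 * i + offset, sizeof v);
    return v;
}
```

Because the mask `0xff0` already contains the factor 16, the extracted value can be added to the table base directly, with no separate multiplication.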

  15. Optimizing AES for High-Performance • Special instruction: Combined Shift-and-Mask – On PPC, rlwinm is available as a single instruction – Saves 160 instructions for separate masking [BS08] – AES on PPC has now 540 instructions

      Extract and mask input bytes (separate shift and mask):

        p00 = (uint32) y0 >> 20;
        p01 = (uint32) y0 >> 12;
        p02 = (uint32) y0 >> 4;
        p03 = (uint32) y0 << 4;
        p00 &= 0xff0;
        p01 &= 0xff0;
        p02 &= 0xff0;
        p03 &= 0xff0;

      Extract and mask input bytes (combined shift-and-mask):

        p00 = (uint32) y0 >> 20 & 0xff0;
        p01 = (uint32) y0 >> 12 & 0xff0;
        p02 = (uint32) y0 >> 4 & 0xff0;
        p03 = (uint32) y0 << 4 & 0xff0;
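
To show why one rlwinm can replace a shift plus a mask, here is a portable C model of the instruction (rotate left, then AND with a contiguous mask). The operand values in `extract_offsets` are my own derivation from the shift/mask pairs on the slide, not taken from [BS08], and the helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of PowerPC rlwinm: rotate rs left by sh bits, then keep only
   bits mb..me (PowerPC numbering, where bit 0 is the most significant). */
uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    assert(sh < 32 && mb <= me && me < 32);   /* wrap-around masks not modelled */
    uint32_t rot  = (rs << sh) | (rs >> ((32 - sh) & 31));
    uint32_t mask = (0xffffffffu >> mb) & (0xffffffffu << (31 - me));
    return rot & mask;
}

/* The four pre-scaled table offsets of one state word, one rlwinm each. */
void extract_offsets(uint32_t y0, uint32_t p[4])
{
    p[0] = rlwinm(y0, 12, 20, 27);   /* (y0 >> 20) & 0xff0 */
    p[1] = rlwinm(y0, 20, 20, 27);   /* (y0 >> 12) & 0xff0 */
    p[2] = rlwinm(y0, 28, 20, 27);   /* (y0 >>  4) & 0xff0 */
    p[3] = rlwinm(y0,  4, 20, 27);   /* (y0 <<  4) & 0xff0 */
}
```

A rotate works in place of the shift here because the mask discards every bit that wrapped around.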

  16. Optimizing AES for High-Performance [cont.] • Special instruction: Scaled Index Loads – On x86, shift and load instructions can be combined – Saves 80 instructions for separate shifting of top and bottom bytes [BS08] – AES on x86 has 640 instructions (not to be combined with the previous method!)

      Extract and mask input bytes, then perform TLU:

        p03 = (uint32) y0 << 4;
        p03 &= 0xff0;
        p03 = *(uint32 *) (table3 + p03);

      Mask first … and do shifted TLU:

        p03 = y0 & 0xff;
        p03 = *(uint32 *) (table3 + (p03 << 4));
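
The idea of folding the shift into the load can be sketched in C as below. The sketch assumes a plain, non-interleaved table of 32-bit entries, so the scale factor is 4; x86 addressing modes support scale factors 1, 2, 4 and 8. Function names are hypothetical, and whether the compiler actually emits a scaled-index load is an assumption to verify in the generated code.

```c
#include <assert.h>
#include <stdint.h>

/* Explicit version: shift, mask, then load from a byte offset. */
uint32_t tlu_bottom_explicit(const uint32_t *table, uint32_t y0)
{
    uint32_t off = (y0 << 2) & 0x3fc;          /* byte offset = 4 * low byte */
    return *(const uint32_t *)((const unsigned char *)table + off);
}

/* Scaled version: mask only; the multiplication by 4 is left to the
   addressing mode of the load. */
uint32_t tlu_bottom_scaled(const uint32_t *table, uint32_t y0)
{
    return table[y0 & 0xff];
}

/* Top byte: a single shift isolates the index, so no mask is needed. */
uint32_t tlu_top_scaled(const uint32_t *table, uint32_t y0)
{
    return table[y0 >> 24];
}

/* Check that the explicit and scaled bottom-byte look-ups agree. */
void tlu_selftest(const uint32_t *table)
{
    for (uint32_t b = 0; b < 256; b++)
        assert(tlu_bottom_explicit(table, 0xabcdef00u | b) ==
               tlu_bottom_scaled(table, 0xabcdef00u | b));
}
```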

  17. Optimizing AES for High-Performance [cont.] • Availability of 64-bit Registers – On AMD64 and UltraSPARC V9, use padded values in 64-bit registers: 0xc66363a5 → 0x0c60063006300a50 – Padding implicitly includes the shift by 4 bits (i.e., multiplication by 16) – Padding is applied consistently through the entire AES – Saves 80 instructions (no need to mask top bytes anymore) [BS08] – AES now has 640 instructions (again, not to be combined)
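
The padded representation can be reproduced with a few lines of C. This is only a sketch of the data format: `pad_word` and `lane_offset` are hypothetical names, and a real implementation would store the tables in padded form, so the padding step itself would not appear in the round function.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the four bytes of a 32-bit word over four 16-bit lanes, each
   already multiplied by 16: 0xAABBCCDD -> 0x0AA00BB00CC00DD0. */
uint64_t pad_word(uint32_t w)
{
    uint64_t p = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        uint64_t byte = (w >> (8 * lane)) & 0xff;
        p |= byte << (16 * lane + 4);
    }
    return p;
}

/* Table offset (16 * byte) held in a lane. The top lane needs only a
   shift, since nothing lies above it; lower lanes still need a mask. */
uint32_t lane_offset(uint64_t padded, unsigned lane)
{
    assert(lane < 4);
    if (lane == 3)
        return (uint32_t)(padded >> 48);
    return (uint32_t)(padded >> (16 * lane)) & 0xffff;
}
```

With the slide's example value, `pad_word(0xc66363a5)` gives `0x0c60063006300a50`, and the top lane yields the offset `0xc60` without any masking.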
