Implementing AES on a Bunch of Processors
ECRYPT AES Day – Bruges, Belgium
Tim Güneysu
Hardware Security Group, Horst Görtz Institute for IT-Security
10/23/2012

Outline • Introduction • Processor Platforms • Tricks, Tweaks and Codes • Benchmarks and Results • Conclusions

The AES Crib Sheet

AES Implementation: General Representation
• Two different representations of AES in original proposal
– 8-bit standard implementation
– 32-bit T-Table implementation, e.g.,

\begin{pmatrix} E_{0,j} \\ E_{1,j} \\ E_{2,j} \\ E_{3,j} \end{pmatrix} = T_0(A_0) \oplus T_1(A_5) \oplus T_2(A_{10}) \oplus T_3(A_{15}) \oplus \begin{pmatrix} k_{0,j} \\ k_{1,j} \\ k_{2,j} \\ k_{3,j} \end{pmatrix}

• ¼ round: 4 Table Look-Ups (TLU) + 4 x 32-bit XOR
• 1 round: 4 x 4 = 16 TLU + XOR
• AES: 160 TLU+XOR per block encryption
• Memory: 4 T-Boxes, 1 kB each
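As a rough illustration of the T-table representation, the sketch below builds four tables and evaluates one quarter round (one output column). It is not the reference code: the names `ttab`, `build_ttables` and `quarter_round` are made up for this example, the S-box is computed at start-up instead of being stored, and words are assumed to be packed big-endian (row 0 in the most significant byte).

```c
#include <stdint.h>

/* Multiply by 2 in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1. */
static uint8_t xtime(uint8_t x) {
    return (uint8_t)((x << 1) ^ ((x >> 7) * 0x1b));
}

/* Generic GF(2^8) multiplication (only used while building tables). */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return r;
}

static uint8_t rotl8(uint8_t x, int n) {
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* AES S-box: multiplicative inverse (x^254) followed by the affine map. */
static uint8_t sbox(uint8_t x) {
    uint8_t inv = 1;
    for (int i = 0; i < 254; i++) inv = gf_mul(inv, x);
    return (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                     rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
}

static uint32_t rotr32(uint32_t w, int n) {
    return (w >> n) | (w << (32 - n));
}

/* Hypothetical table storage: 4 tables x 256 entries x 4 bytes = 4 kB. */
static uint32_t ttab[4][256];

void build_ttables(void) {
    for (int a = 0; a < 256; a++) {
        uint8_t s = sbox((uint8_t)a);
        /* T0 holds the MixColumns column (2s, s, s, 3s). */
        uint32_t t0 = ((uint32_t)xtime(s) << 24) | ((uint32_t)s << 16) |
                      ((uint32_t)s << 8) | (uint32_t)(xtime(s) ^ s);
        ttab[0][a] = t0;
        ttab[1][a] = rotr32(t0, 8);   /* (3s, 2s, s, s) */
        ttab[2][a] = rotr32(t0, 16);  /* (s, 3s, 2s, s) */
        ttab[3][a] = rotr32(t0, 24);  /* (s, s, 3s, 2s) */
    }
}

/* One quarter round: 4 table look-ups and 4 XORs, including the round key. */
uint32_t quarter_round(uint8_t a0, uint8_t a5, uint8_t a10, uint8_t a15,
                       uint32_t k) {
    return ttab[0][a0] ^ ttab[1][a5] ^ ttab[2][a10] ^ ttab[3][a15] ^ k;
}
```

Calling `build_ttables()` once and then `quarter_round()` four times per round gives the 16 look-ups per round counted on the slide.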

AES Implementation: Choice of Processor
• A dedicated AES processor in HW is not always the preferred option
– AES is often just a supplementary function to a software application
– HW development is too costly or necessary skills are not available
• But when doing AES in software: which processor is the best?

AES Implementation: Parameters
• Key size of AES
– 128, 192, 256 bit
• Applied mode of operation
– ECB, CBC, GCM, CTR, …
• Blocks concurrently processed
– Single block (limited data transfers)
– Multiple blocks (overhead reduction, bitslicing)
• Round key computation
– Precomputed (when processing bulk data)
– On-the-fly (when changing keys frequently)
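To make the on-the-fly option concrete, here is a minimal sketch for AES-128 only: it derives the next round key in place from the previous one, so only 16 bytes of key material are kept instead of the full expanded key. The function name `next_round_key` and the byte layout (key words stored consecutively, most significant byte first) are assumptions of this example, and the S-box is recomputed rather than tabulated to keep the block self-contained.

```c
#include <stdint.h>

/* Multiply by 2 in GF(2^8) modulo the AES polynomial. */
static uint8_t xtime(uint8_t x) {
    return (uint8_t)((x << 1) ^ ((x >> 7) * 0x1b));
}

static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return r;
}

static uint8_t rotl8(uint8_t x, int n) {
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* AES S-box: inverse in GF(2^8) followed by the affine map. */
static uint8_t sbox(uint8_t x) {
    uint8_t inv = 1;
    for (int i = 0; i < 254; i++) inv = gf_mul(inv, x);
    return (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                     rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
}

/*
 * Replace the 16-byte round key rk by the next AES-128 round key.
 * *rcon is the current round constant (0x01 before the first call)
 * and is advanced for the following call.
 */
void next_round_key(uint8_t rk[16], uint8_t *rcon) {
    /* First word: XOR with SubWord(RotWord(last word)) and the constant. */
    rk[0] = (uint8_t)(rk[0] ^ sbox(rk[13]) ^ *rcon);
    rk[1] = (uint8_t)(rk[1] ^ sbox(rk[14]));
    rk[2] = (uint8_t)(rk[2] ^ sbox(rk[15]));
    rk[3] = (uint8_t)(rk[3] ^ sbox(rk[12]));
    /* Remaining three words: XOR with the word just before them. */
    for (int i = 4; i < 16; i++) rk[i] ^= rk[i - 4];
    *rcon = xtime(*rcon);
}
```

The trade-off is the one named on the slide: this costs four S-box evaluations and a few XORs per round, which only pays off when keys change often or memory is scarce.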

Outline • Introduction • Processor Platforms • Tricks, Tweaks and Codes • Benchmarks and Results • Conclusions

Processors and Platforms
• Native bit sizes of General-Purpose Processors (GPP)
– 4-bit, e.g., MARC-4, NEC uPD75X in pocket calculators/washing machines
– 8-bit, e.g., Atmel ATMegaXX, Intel 8051 in (many) embedded systems
– 16-bit, e.g., TI MSP430, DEC PDP-11 in (fewer) embedded systems
– 32-bit, e.g., ARM, TriCore in smart phones and automobiles
– 64-bit, e.g., Intel i3/5/7, AMD A-Series in PCs and workstations
– 128-bit, e.g., IBM Cell in PS3 (actually, only 128-bit SIMD on SPEs)
• Myth or fact: AES is always most efficient on native 8-bit and 32-bit processors!?

Processor Architectures
• General processor design
– RISC vs. CISC (Reduced/Complex Instruction Set Computer)
– Single-Instruction Multiple Data (SIMD) operation
– Super-scalar devices processing more than one instruction per cycle
• Processor interface to memory
– Von-Neumann vs. Harvard: shared memory for data and program?
– Cache for data and/or program? (→ cache attacks!)
– Static/dynamic external or built-in RAM?
• Additional processor extensions
– Multimedia/integer co-processor
– Special/native Instruction Set Extensions (ISE)

Other Processor Architectures
• Streaming processors, such as GPUs
– Multi-processors run hundreds of concurrent threads
– High memory bandwidth, but high latency to global memory
• Digital Signal Processors (DSP)
– Support fast combined arithmetic instructions
– Are improved arithmetic instructions useful for AES?
• Other array/tile-based processors
– Synchronous/asynchronous processing cores
– Processor-based systolic array cores (Tilera, GreenArrays)

Outline • Introduction • Processor Platforms • Tricks, Tweaks and Codes • Benchmarks and Results • Conclusions

AES Software Optimization
• General requirements for secure implementation in software
– Disable (or control) cache to prevent cache attacks
– Avoid conditional branches to counter timing attacks
• Common tweaks to achieve high performance
– Make particular use of specialized instructions
– Unroll rounds and loops to reduce instruction cycle count
– Optimize register allocation
– Precompute and store values in tables (e.g., T-Tables, round keys and constants)
• Common tweaks to minimize code size
– Reuse code by functions to minimize instruction count
– Limit amount of precomputed and stored values
• Common tweaks for low energy consumption
– Reduce number of costly load and store operations to memory
– General approach often similar to the optimization for high performance
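The "avoid conditional branches" requirement can be illustrated with the GF(2^8) doubling step used in MixColumns. The sketch below contrasts a branching version with a branch-free one that derives a mask from the top bit. Both function names are invented for this example, and whether the compiled code is really branch-free depends on the compiler and target, so the generated assembly should still be checked.

```c
#include <stdint.h>

/* Branching variant: execution path depends on the secret top bit. */
uint8_t xtime_branchy(uint8_t x) {
    if (x & 0x80) {
        return (uint8_t)((x << 1) ^ 0x1b);
    }
    return (uint8_t)(x << 1);
}

/* Branch-free variant: the reduction constant is selected by a mask. */
uint8_t xtime_ct(uint8_t x) {
    /* mask is 0xff if the top bit of x is set, 0x00 otherwise. */
    uint8_t mask = (uint8_t)(0u - (unsigned)(x >> 7));
    return (uint8_t)((x << 1) ^ (mask & 0x1b));
}
```

Both functions return the same value for every input; only the control flow differs.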

Coding Intermezzo: Have you ever tried to implement AES on a Commodore C64?
• Commodore C64: 8-bit CPU with 64 KB RAM
• AES-256 in ACME Assembler [Extract of source at http://www.robos.org/prog]

encrypt
        ldx #$07
.addfirst
        lda aesblock+0,x        ; 4
        eor expkey+0,x          ; 8
        sta tmpblock+0,x        ; 13
        lda aesblock+8,x        ; 17
        eor expkey+8,x          ; 21
        sta tmpblock+8,x        ; 26
        dex                     ; 28
        bpl .addfirst           ; 31

        ldy #$10
.round
        lda expkey+$00,y        ; 4
        ldx tmpblock+4*0+0      ; 7
        eor ssm0,x              ; 11
        ldx tmpblock+4*1+1      ; 14
        eor ssm3,x              ; 18
        ldx tmpblock+4*2+2      ; 21
        eor ssm2,x              ; 25
        ldx tmpblock+4*3+3      ; 28
        eor ssm1,x              ; 32
        sta aesblock+$00        ; 36

        lda expkey+$01,y
        ldx tmpblock+4*0+0
        eor ssm1,x
        ldx tmpblock+4*1+1
        eor ssm0,x
        ldx tmpblock+4*2+2
        eor ssm3,x
        ldx tmpblock+4*3+3
        eor ssm2,x
        sta aesblock+$01

        lda expkey+$02,y
        ldx tmpblock+4*0+0
        eor ssm2,x
        ldx tmpblock+4*1+1
        eor ssm1,x
        ldx tmpblock+4*2+2
        eor ssm0,x
        ldx tmpblock+4*3+3
        eor ssm3,x
        sta aesblock+$02

        lda expkey+$03,y
        ldx tmpblock+4*0+0
        eor ssm3,x
        ldx tmpblock+4*1+1
        eor ssm2,x
        ldx tmpblock+4*2+2
        eor ssm1,x
        ldx tmpblock+4*3+3
        eor ssm0,x
        sta aesblock+$03

        lda expkey+$04,y
        ldx tmpblock+4*1+0
        eor ssm0,x
        ldx tmpblock+4*2+1
        eor ssm3,x
        ldx tmpblock+4*3+2
        eor ssm2,x
        ldx tmpblock+4*0+3
        eor ssm1,x
        sta aesblock+$04

        lda expkey+$05,y
        ldx tmpblock+4*1+0
        eor ssm1,x
        ldx tmpblock+4*2+1
        eor ssm0,x
        ldx tmpblock+4*3+2
        eor ssm3,x
        ldx tmpblock+4*0+3
        eor ssm2,x
        sta aesblock+$05

Real Coding: Sample T-Table AES in C (Reference code by Brian Gladman)
• High-performance AES for processors ≥ 32 bit with interleaved T-tables (32-bit entries)
• Per round, 4 instances of code snippet required
• AES has 720 instructions (INS)
– 208 loads
– 4 stores
– 508 integer instructions: 160 shifts, 176 masks (+16 for last rnd), 168 XORs, 4 overhead for CTR mode

Code snippet (only ¼ round shown):

    /* Read interleaved round keys */
    z0 = roundkeys[i * 4 + 0];
    z1 = roundkeys[i * 4 + 1];
    z2 = roundkeys[i * 4 + 2];
    z3 = roundkeys[i * 4 + 3];

    /* Extract and mask input bytes */
    p00 = (uint32) y0 >> 20;
    p01 = (uint32) y0 >> 12;
    p02 = (uint32) y0 >> 4;
    p03 = (uint32) y0 << 4;
    p00 &= 0xff0;
    p01 &= 0xff0;
    p02 &= 0xff0;
    p03 &= 0xff0;

    /* Perform TLU */
    p00 = *(uint32 *) (table0 + p00);
    p01 = *(uint32 *) (table1 + p01);
    p02 = *(uint32 *) (table2 + p02);
    p03 = *(uint32 *) (table3 + p03);

    /* Add TLU to round keys */
    z0 ^= p00;
    z3 ^= p01;
    z2 ^= p02;
    z1 ^= p03;
    …

Memory layout (table0, table1, table2, table3 point to the first entry of each table):

    Offset (byte)   Contents
    0               Table 0 | Table 1 | Table 2 | Table 3
    16              Table 0 | Table 1 | Table 2 | Table 3
    32              Table 0 | Table 1 | Table 2 | Table 3
    48              Table 0 | Table 1 | Table 2 | Table 3

Access j-th table entry of table i via table<i>+16j

Optimizing AES for High-Performance
• Special instruction: Combined Shift-and-Mask
– On PPC, rlwinm is available as single instruction

Extract and mask input bytes with separate shifts and masks:

    p00 = (uint32) y0 >> 20;
    p01 = (uint32) y0 >> 12;
    p02 = (uint32) y0 >> 4;
    p03 = (uint32) y0 << 4;
    p00 &= 0xff0;
    p01 &= 0xff0;
    p02 &= 0xff0;
    p03 &= 0xff0;

Extract and mask input bytes with combined shift-and-mask:

    p00 = (uint32) y0 >> 20 & 0xff0;
    p01 = (uint32) y0 >> 12 & 0xff0;
    p02 = (uint32) y0 >> 4 & 0xff0;
    p03 = (uint32) y0 << 4 & 0xff0;

– Saves 160 instructions for separate masking [BS08]
– AES on PPC has now 540 instructions
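To see why one rlwinm can replace a shift plus a mask, the following sketch emulates the instruction in portable C: rotate left by `sh`, then keep only bits `mb` through `me` in PowerPC numbering (bit 0 is the most significant bit). The emulation function is written for this example and only covers the non-wrapping case `mb <= me`. With `mb = 20` and `me = 27` the mask equals 0xff0, so the four byte extractions become rotations by 12, 20, 28 and 4.

```c
#include <stdint.h>

static uint32_t rotl32(uint32_t x, unsigned sh) {
    sh &= 31;
    return (x << sh) | (x >> ((32 - sh) & 31));
}

/*
 * Emulation of PowerPC "rlwinm rA,rS,SH,MB,ME" for MB <= ME:
 * rotate rS left by SH, then AND with a mask of ones from bit MB
 * to bit ME (bit 0 = most significant bit).
 */
uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me) {
    uint32_t mask = (0xffffffffu >> mb) & (0xffffffffu << (31 - me));
    return rotl32(rs, sh) & mask;
}

/* The four table offsets of one input word, one emulated instruction each. */
void extract_offsets(uint32_t y0, uint32_t p[4]) {
    p[0] = rlwinm(y0, 12, 20, 27);  /* (y0 >> 20) & 0xff0 */
    p[1] = rlwinm(y0, 20, 20, 27);  /* (y0 >> 12) & 0xff0 */
    p[2] = rlwinm(y0, 28, 20, 27);  /* (y0 >> 4)  & 0xff0 */
    p[3] = rlwinm(y0, 4, 20, 27);   /* (y0 << 4)  & 0xff0 */
}
```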

Optimizing AES for High-Performance [cont.]
• Special instruction: Scaled Index Loads
– On x86, shift and load instructions can be combined

Extract and mask input bytes, then perform TLU:

    p03 = (uint32) y0 << 4;
    p03 &= 0xff0;
    p03 = *(uint32 *) (table3 + p03);

Mask first, and do shifted TLU:

    p03 = y0 & 0xff;
    p03 = *(uint32 *) (table3 + (p03 << 4));

– Saves 80 instructions for separate shifting of top and bottom bytes [BS08]
– AES on x86 has 640 instructions (not to be combined with previous method!)

Optimizing AES for High-Performance [cont.]
• Availability of 64-bit Registers
– On AMD64 and UltraSparcV9, use padded values in 64-bit registers, e.g., 0xc66363a5 → 0x0c60063006300a50
– Padding implicitly includes the shift by 4 bit (aka multiplication by 16)
– Padding is applied consistently through entire AES
– Saves 80 instructions (no need to mask top bytes anymore) [BS08]
– AES now has 640 instructions (again, not to be combined)
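The padded representation can be reproduced with a few lines of C. The sketch below spreads each byte of a 32-bit word into its own 16-bit field, already shifted left by 4, which matches the example value on the slide. The helper names `pad_word` and `top_offset` are hypothetical, and this shows only the data layout, not how [BS08] integrates it into the round function.

```c
#include <stdint.h>

/* Spread the 4 bytes of w into 16-bit fields, each pre-multiplied by 16. */
uint64_t pad_word(uint32_t w) {
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t byte = (w >> (8 * i)) & 0xff;
        r |= (byte << 4) << (16 * i);
    }
    return r;
}

/* Table offset of the top byte: a single shift, no mask needed,
 * because nothing but zero padding sits above the top field. */
uint64_t top_offset(uint64_t padded) {
    return padded >> 48;
}
```

For the lower three fields a shift followed by a mask with 0xff0 is still required; only the top byte comes out mask-free.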
