Implementing Lightweight Block Ciphers on x86 Architectures

Ryad Benadjila (1), Jian Guo (2), Victor Lomné (1), Thomas Peyrin (2)
(1) ANSSI, France  (2) NTU, Singapore

SAC, August 15, 2013

Talk Overview

1. Introduction
2. Table-based
3. Vector-Permutation
4. Bitslice
5. Results and Conclusions

Motivations

Existing work:
- At CHES 2012, Matsuda and Moriai gave the first bitslice implementations of PRESENT and Piccolo, showing that lightweight block ciphers can perform very well for some cloud applications.
- The reported speed assumes a use case where long data is enciphered. This is not always the case: the Electronic Product Code, a replacement for barcodes, is usually 64, 96, or 125 bits long, and at such sizes the speed can be significantly lower.
- The key schedule was also excluded from the speed measurements, which does not seem to be a valid assumption for many use cases.

Our work:
- Consider most of the possible use cases: short/long data, shared/independent keys, serial/parallel operation modes.
- Besides bitslice, also apply other implementation techniques: table-based and vector-permutation.
- Use LED, Piccolo, and PRESENT as examples.
- Give a fair and comprehensive comparison of speed over all use cases and all three implementation techniques, tested on 6 different devices/servers.

Introduction

Techniques considered:
- Table-based: table lookups for the sbox implementation.
- Vector-permutation: introduced by the TWINE designers for better software performance.
- Bitslice: the sbox is implemented in algebraic form; usually computes multiple instances together.

Ciphers implemented with each technique:
- LED: 64-bit AES-like design with mainly 64- and 128-bit key sizes and 32/48 rounds, proposed by Guo et al. at CHES 2011.
- Piccolo: 64-bit generalized Feistel structure with 80- and 128-bit key sizes and 25/31 rounds, proposed by Shibutani et al. at CHES 2011.
- PRESENT: 64-bit SP-network design with 80- and 128-bit key sizes and 31 rounds, proposed by Bogdanov et al. at CHES 2007.

Table-based Implementations I

Mainly for designs based on Substitution-Permutation Networks, i.e., the round function consists of a non-linear operation such as an sbox, followed by linear operations, e.g., AES-like designs.

[Figure: AES-like round (AddConstants, SubCells, ShiftRows, MixColumns) on a state of n × n cells, each cell b bits]

Table-based Implementations II

Implementation steps:

Preparation: build tables, with the cell input as index and the corresponding column output as table value.

Usage:
1. Extract the cell value from the column/state representation; this involves "shift" and "logical and" operations.
2. Perform the table lookups.
3. XOR the table lookup values to form the round output.

Pseudo code in C language: computation of a generic SPN lightweight cipher round
Input: State, Tables / Output: Updated state

    t0 = T0[ state             & MASKm];
    t1 = T1[(state >> b)       & MASKm];
    t2 = T2[(state >> (2 * b)) & MASKm];
    ...
    state = t0 ^ t1 ^ t2 ^ ...;
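The preparation and usage steps above can be sketched as a small, self-contained C example. This is only an illustration under stated assumptions, not code from the talk: the state is a toy 16-bit word of four 4-bit cells, the S-box values are those of PRESENT (chosen only for convenience), and `linear_layer` is a hypothetical XOR-linear map that does not belong to LED, Piccolo, or PRESENT. The names `build_tables`, `round_tables`, `round_reference`, and `self_check` are likewise made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define B      4                    /* bits per cell */
#define NCELLS 4                    /* cells in the toy 16-bit state */
#define MASK   ((1u << B) - 1u)     /* plays the role of MASKm with m = 1 */

/* PRESENT 4-bit S-box, used here only as a convenient example S-box. */
static const uint8_t SBOX[16] = {
    0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
    0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2
};

static uint16_t rotl16(uint16_t x, unsigned r) {
    return (uint16_t)((x << r) | (x >> (16 - r)));
}

/* Hypothetical XOR-linear layer (an assumption, not from any real cipher). */
static uint16_t linear_layer(uint16_t x) {
    return (uint16_t)(x ^ rotl16(x, 4) ^ rotl16(x, 8));
}

/* Reference round: apply the S-box to every cell, then the linear layer. */
static uint16_t round_reference(uint16_t state) {
    uint16_t s = 0;
    for (unsigned i = 0; i < NCELLS; i++) {
        s |= (uint16_t)(SBOX[(state >> (B * i)) & MASK] << (B * i));
    }
    return linear_layer(s);
}

/* One table per cell position: cell value -> contribution to the output. */
static uint16_t T[NCELLS][1u << B];

/* Preparation: index is the cell input, value is the linear-layer output
 * of the S-boxed cell placed at its position. */
static void build_tables(void) {
    for (unsigned i = 0; i < NCELLS; i++) {
        for (unsigned v = 0; v < (1u << B); v++) {
            T[i][v] = linear_layer((uint16_t)(SBOX[v] << (B * i)));
        }
    }
}

/* Usage: shift-and-mask to extract cells, look up, XOR the results. */
static uint16_t round_tables(uint16_t state) {
    uint16_t t0 = T[0][ state            & MASK];
    uint16_t t1 = T[1][(state >> B)      & MASK];
    uint16_t t2 = T[2][(state >> (2 * B)) & MASK];
    uint16_t t3 = T[3][(state >> (3 * B)) & MASK];
    return (uint16_t)(t0 ^ t1 ^ t2 ^ t3);
}

/* Exhaustively check that both round functions agree on all 2^16 states. */
static void self_check(void) {
    for (uint32_t s = 0; s <= 0xFFFFu; s++) {
        assert(round_tables((uint16_t)s) == round_reference((uint16_t)s));
    }
}
```

The two round functions agree because the linear layer distributes over XOR, so the output for the full state is the XOR of the outputs for each S-boxed cell taken alone. Call `build_tables()` once before using `round_tables()`.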

Tabulating

Group m cells together, 1 ≤ m ≤ n, to form bigger cells; a round then needs n · ⌈n/m⌉ table lookups, with bigger memory requirements.

Example with m = n = 4:

[Figure: AES-like round (AddConstants, SubCells, ShiftRows, MixColumns) on a state of n × n cells, each cell b bits, with cells grouped for tabulation]

                 No. of Tables/Lookups   Memory (bits)            No. of XORs
No Tabulating    n^2                     n^2 · 2^b · nb           n · (n − 1)
Tabulating       n^2/m                   (n^2/m) · 2^(mb) · nb    n · (⌈n/m⌉ − 1)
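To make the cost formulas above concrete, the following C sketch evaluates them for given n, b, and m. It is an illustration only: the struct `table_cost` and the function `tabulation_cost` are hypothetical names, and the lookup count uses the ceiling form n · ⌈n/m⌉ from the text. When m does not divide n, the memory figure treats every group as full-size, so it should be read as an upper bound.

```c
#include <assert.h>
#include <stdint.h>

/* Per-round cost of a table-based implementation (hypothetical helper type). */
typedef struct {
    uint64_t lookups;      /* number of tables, equal to lookups per round */
    uint64_t memory_bits;  /* total table memory in bits */
    uint64_t xors;         /* XORs needed to combine the lookup results */
} table_cost;

static uint64_t ceil_div(uint64_t a, uint64_t b) {
    return (a + b - 1) / b;
}

/* n: cells per row/column, b: bits per cell, m: cells grouped per lookup. */
static table_cost tabulation_cost(unsigned n, unsigned b, unsigned m) {
    assert(m >= 1 && m <= n && (uint64_t)m * b < 64);

    table_cost c;
    uint64_t groups = ceil_div(n, m);            /* ceil(n / m) */

    c.lookups     = (uint64_t)n * groups;        /* n * ceil(n/m) */
    c.memory_bits = c.lookups
                  * ((uint64_t)1 << (m * b))     /* 2^(mb) entries per table */
                  * ((uint64_t)n * b);           /* nb bits per entry */
    c.xors        = (uint64_t)n * (groups - 1);  /* n * (ceil(n/m) - 1) */
    return c;
}
```

For n = 4 and b = 4 (a 4 × 4 state of nibbles, as in a 64-bit AES-like cipher such as LED), m = 1 gives 16 lookups, 4096 bits of tables, and 12 XORs, while m = 2 gives 8 lookups, 32768 bits (4 KBytes), and 4 XORs.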

Tradeoffs

- Memory/table size vs. number of table lookups, via m. Table size affects the speed of lookup operations because cache size is limited.
- Column vs. state as lookup table values. The column representation is smaller, while the state representation enables integration of other state-wise operations such as "ShiftRows", inter-column tabulating, and the SuperSbox technique.
- SuperSbox covering two rounds with more memory vs. the usual table lookup covering one round with less memory.

Deciding the right m

A bigger m implies more memory and fewer table lookups. However, if the tables do not fit into the cache, lookups slow down. By how much?

Microarchitecture          L1 size (KBytes)   L1 latency (cycles)   L2 size (KBytes)   L2 latency (cycles)
Intel P6                   16 or 32           3                     512                8
Intel Core                 32                 3                     1500               15
Intel Nehalem / Westmere   32                 4                     256                10
Intel Sandy / Ivy Bridge   32                 5                     256                12

$$ l_T = P_{L1} \times l_{L1} + P_{L2} \times l_{L2} + P_{L3} \times l_{L3} + P_M \times l_M + \cdots $$

Here l_T is the expected latency of one table lookup, l_X is the latency of level X (L1, L2, L3, main memory M), and P_X is the probability that the lookup is served from level X. With this model we can "predict" the best choice of m without actual implementations.

Observations: for better performance, fill the L1 cache as much as possible. In most cases, slightly exceeding the L1 cache is better than using only part of it, e.g., m = 2 gives the best speed for LED, and m = 3 is faster than m = 1.
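The latency formula above can be evaluated with a short C sketch. It rests on simplifying assumptions that are not stated in the talk: lookups are uniformly random over the tables, the tables fill the fastest level first, and P_X is taken as the fraction of table bytes residing in level X. Cache inclusivity and other data competing for the cache are ignored. The type `cache_level`, the function `expected_latency`, and the `mem_latency` parameter are hypothetical, and no main-memory latency is given in the table, so any value passed for it is the caller's own estimate.

```c
#include <assert.h>
#include <stddef.h>

/* One cache level (hypothetical helper type for this sketch). */
typedef struct {
    double size_kb;   /* capacity in KBytes */
    double latency;   /* access latency in cycles */
} cache_level;

/* Expected lookup latency l_T = sum over levels of P_X * l_X.
 * Assumption: uniform accesses, tables packed into the fastest levels
 * first, so P_X = (table KBytes held in level X) / (total table KBytes).
 * Whatever does not fit in the listed levels is charged mem_latency. */
static double expected_latency(double table_kb,
                               const cache_level *levels, size_t nlevels,
                               double mem_latency) {
    assert(table_kb > 0.0);

    double remaining = table_kb;
    double total = 0.0;

    for (size_t i = 0; i < nlevels && remaining > 0.0; i++) {
        double here = remaining < levels[i].size_kb ? remaining
                                                    : levels[i].size_kb;
        total += (here / table_kb) * levels[i].latency;   /* P_X * l_X */
        remaining -= here;
    }
    if (remaining > 0.0) {
        total += (remaining / table_kb) * mem_latency;    /* P_M * l_M */
    }
    return total;
}
```

With the Sandy / Ivy Bridge row (32 KBytes at 5 cycles, 256 KBytes at 12 cycles), a 4-KByte table set stays entirely in L1 and the model returns 5 cycles, while a 64-KByte table set is split evenly between L1 and L2 and the model returns 8.5 cycles per lookup.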
