Saber on ARM: CCA-secure module lattice-based key encapsulation on ARM


Angshuman Karmakar. CHES, Amsterdam, 11th September 2018. Joint work with: Jose Maria Bermudo Mera, Sujoy Sinha Roy, Ingrid Verbauwhede.


SLIDE 1

Saber on ARM

CCA-secure module lattice-based key encapsulation on ARM

Angshuman Karmakar
Joint work with: Jose Maria Bermudo Mera, Sujoy Sinha Roy, Ingrid Verbauwhede
CHES, Amsterdam, 11th September 2018

SLIDE 2

Saber: CCA secure post-quantum KEM*

  • Module-LWR: a trade-off between standard and ideal lattices
  • Inherent noise → less randomness
  • Efficient
  • Flexible:
      • Increase/decrease the matrix dimension to increase/decrease security
      • Basic operations stay the same → high code reusability
  • All moduli are powers of two
      ○ Easy rounding
      ○ Easy modular reduction in HW/SW
      ○ Precludes use of the NTT
  • Combination of Toom-Cook, Karatsuba and schoolbook multiplication.

* J.-P. D’Anvers, A. Karmakar, S. Sinha Roy, and F. Vercauteren. Saber: Module-LWR based key exchange, CPA-secure encryption and CCA-secure KEM. In A. Joux, A. Nitaj, and T. Rachidi, editors, Progress in Cryptology – AFRICACRYPT 2018. https://eprint.iacr.org/2018/230.pdf
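Because every modulus is a power of two, modular reduction is a single bit mask and rounding is an addition followed by a shift. The sketch below illustrates this in Python. The 13-bit modulus matches the coefficient size quoted in the schoolbook slides; the 10-bit rounding target and the helper names `mod_pow2` and `round_q_to_p` are assumptions made for illustration only.

```python
EQ = 13  # log2(q): coefficients are 13 bits long
EP = 10  # log2(p): assumed rounding target, for illustration only


def mod_pow2(x: int, bits: int) -> int:
    """Reduce x modulo 2**bits with a single AND (also correct for negative x)."""
    return x & ((1 << bits) - 1)


def round_q_to_p(x: int) -> int:
    """Round a value mod 2**EQ to a value mod 2**EP: add half a step, then shift."""
    half = 1 << (EQ - EP - 1)
    return mod_pow2(x + half, EQ) >> (EQ - EP)


if __name__ == "__main__":
    print(mod_pow2(8192 + 5, EQ))  # 5
    print(mod_pow2(-1, EQ))        # 8191
    print(round_q_to_p(4))         # 1
```

No division or comparison is needed, which is why this step is cheap in both hardware and software.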

SLIDE 3

Polynomial Multiplication

SLIDE 4

Polynomial multiplication C = A × B

Toom-Cook + Karatsuba + schoolbook

[Figure: Toom-Cook 4-way. Each 256-coefficient operand (A and B) is split into 64-coefficient polynomials, giving 7 multiplications of size 64.]

  • A and B are polynomials of size 256.

SLIDE 5

Polynomial multiplication C = A × B

Toom-Cook + Karatsuba + schoolbook

[Figure: Toom-Cook 4-way (256 coefficients → 7 multiplications of size 64), followed by 2-level Karatsuba (each size-64 multiplication → 9 multiplications of size 16), applied to both A and B.]

SLIDE 6

Polynomial multiplication C = A × B

Toom-Cook + Karatsuba + schoolbook

[Figure: Toom-Cook 4-way (256 coefficients → 7 multiplications of size 64), followed by 2-level Karatsuba (each size-64 multiplication → 9 multiplications of size 16), followed by schoolbook multiplication of the 16-coefficient polynomials.]
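The Karatsuba and schoolbook layers of this pipeline can be sketched as follows. This is a minimal Python illustration, not the authors' code: it applies two Karatsuba levels to a 64-coefficient product and counts how many 16-coefficient schoolbook multiplications result. The function names and the test polynomials are made up for the example, and the Toom-Cook layer is left out.

```python
def schoolbook(a, b):
    """Plain O(n^2) polynomial product; returns 2n-1 coefficients."""
    c = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            c[i + j] += x * y
    return c


def karatsuba(a, b, levels, calls):
    """Karatsuba with a fixed number of levels; calls[0] counts base multiplications."""
    if levels == 0:
        calls[0] += 1
        return schoolbook(a, b)
    n, h = len(a), len(a) // 2
    a0, a1, b0, b1 = a[:h], a[h:], b[:h], b[h:]
    lo = karatsuba(a0, b0, levels - 1, calls)
    hi = karatsuba(a1, b1, levels - 1, calls)
    mid = karatsuba([x + y for x, y in zip(a0, a1)],
                    [x + y for x, y in zip(b0, b1)], levels - 1, calls)
    c = [0] * (2 * n - 1)
    for i in range(2 * h - 1):
        c[i] += lo[i]
        c[i + 2 * h] += hi[i]
        c[i + h] += mid[i] - lo[i] - hi[i]  # cross term from three products
    return c


if __name__ == "__main__":
    a = [(7 * i + 3) % 8192 for i in range(64)]
    b = [(11 * i + 5) % 8192 for i in range(64)]
    calls = [0]
    assert karatsuba(a, b, 2, calls) == schoolbook(a, b)
    print(calls[0])  # 9 schoolbook multiplications of size 16
```

With 7 size-64 products from Toom-Cook 4-way and 9 size-16 products per size-64 product, the whole pipeline ends in 63 schoolbook multiplications of 16 coefficients each.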

SLIDE 7

This work

SLIDE 8

Cortex-M0: XMC2Go by Infineon

  • Reduced instruction set
  • Only 8 registers for data-processing instructions
  • 16 KB of RAM

Cortex-M4: STM32F4-discovery by STMicroelectronics

  • DSP instructions
  • 14 registers fully available
  • 192 KB of RAM

Goal

  • Saber is very efficient on high-end processors
  • We show that Saber is also efficient on low-end processors such as the Cortex-M0 and Cortex-M4

SLIDE 9

This work

  • A high-speed implementation on Cortex-M4
      • Efficient use of DSP instructions on M4
          • Fewer instructions to perform a schoolbook multiplication
      • An `in-register` implementation of Toom-Cook multiplication
          • Fewer memory accesses
  • A memory-efficient implementation on Cortex-M0
      • A `Just-In-Time` approach to generate the elements of the public matrix
      • Memory-efficient in-place Karatsuba multiplication

SLIDE 10

Schoolbook multiplication

  • Each coefficient is 13 bits long → fits in a half word of a register
  • Multiplications between half words can be done by the SMLA(B/T)(B/T) instruction
  • SMLA(B/T)(B/T)(ra, rb, rc, rd) := ra ← rbX * rcY + rd, where X, Y ∈ {0, 1} select the bottom (B, 0) or top (T, 1) half word

[Figure: two 16-bit coefficients packed in each 32-bit register]

SLIDE 11

Schoolbook multiplication

  • Each coefficient fits in a half word of a register
  • Multiplications between half words can be done by the SMLA(B/T)(B/T) instruction
  • SMLA(B/T)(B/T)(ra, rb, rc, rd) := ra ← rbX * rcY + rd, where X, Y ∈ {0, 1} select the bottom (B, 0) or top (T, 1) half word

[Figure: two 16-bit coefficients packed in each 32-bit register]

16 instructions!
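The count of 16 is consistent with multiplying two 4-coefficient polynomials held in two registers each, where every coefficient product costs one half-word multiply-accumulate. The Python model below illustrates that reading. The helper names (`pack`, `half`, `smla`) and the 4 × 4 layout are assumptions for this sketch; the slide's figure is not recoverable from the transcript.

```python
MASK32 = 0xFFFFFFFF


def s16(x):
    """Interpret the low 16 bits of x as a signed half word."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def pack(lo, hi):
    """Place two 16-bit coefficients in one 32-bit register (lo = bottom half)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def half(reg, sel):
    """Select the bottom ('B') or top ('T') half word of a register."""
    return s16(reg) if sel == "B" else s16(reg >> 16)


def smla(x, y, rb, rc, rd):
    """Model of SMLA<x><y>: rb.<x> * rc.<y> + rd, truncated to 32 bits."""
    return (half(rb, x) * half(rc, y) + rd) & MASK32


def schoolbook_4x4_smla(a, b):
    """4 x 4 schoolbook product; returns (7 coefficients, instruction count)."""
    A = [pack(a[0], a[1]), pack(a[2], a[3])]
    B = [pack(b[0], b[1]), pack(b[2], b[3])]
    c, count = [0] * 7, 0
    for i in range(4):
        for j in range(4):
            c[i + j] = smla("BT"[i & 1], "BT"[j & 1], A[i >> 1], B[j >> 1], c[i + j])
            count += 1
    return c, count


if __name__ == "__main__":
    print(schoolbook_4x4_smla([1, 2, 3, 4], [5, 6, 7, 8]))
```

Each of the 16 coefficient products maps to exactly one modelled instruction, which is the baseline the later slides improve on.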

SLIDE 12

Schoolbook multiplication

  • DSP instruction SMLADX: cross-multiplies register half words
  • SMLADX(ra, rb, rc, rd) := ra ← rb0 * rc1 + rb1 * rc0 + rd

[Figure: two 16-bit coefficients packed in each 32-bit register]

SLIDE 13

Schoolbook multiplication

  • Replace ra → c1, rb → a, rc → b, rd → 0
  • SMLADX(c1, a, b, 0) := c1 ← a0 * b1 + a1 * b0

[Figure: two 16-bit coefficients packed in each 32-bit register]

SMLADX: instruction count reduces 2 → 1
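The substitution above can be checked with a small Python model of SMLADX, which follows the ARM definition (bottom × top plus top × bottom plus accumulator). The `pack` and `s16` helpers and the sample values are illustrative assumptions.

```python
MASK32 = 0xFFFFFFFF


def s16(x):
    """Interpret the low 16 bits of x as a signed half word."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def pack(lo, hi):
    """Place two 16-bit coefficients in one 32-bit register (lo = bottom half)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def smladx(rb, rc, rd):
    """Model of SMLADX: rb.bottom * rc.top + rb.top * rc.bottom + rd."""
    return (s16(rb) * s16(rc >> 16) + s16(rb >> 16) * s16(rc) + rd) & MASK32


if __name__ == "__main__":
    a = pack(3, 5)    # a0 = 3, a1 = 5
    b = pack(7, 11)   # b0 = 7, b1 = 11
    c1 = smladx(a, b, 0)
    assert c1 == 3 * 11 + 5 * 7  # a0*b1 + a1*b0 in a single instruction
    print(c1)
```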

SLIDE 14

Schoolbook multiplication

  • Replace ra → c1, rb → a, rc → b, rd → 0
  • SMLADX(c1, a, b, 0) := c1 ← a0 * b1 + a1 * b0

[Figure: two 16-bit coefficients packed in each 32-bit register]

Similarly, using SMLADX for the other cross terms: total instruction count 12 (25% reduction)

SLIDE 15

Schoolbook multiplication

  • Pack non-adjacent coefficients in a spare register using PKHBT
  • Apply SMLADX again

[Figure: two 16-bit coefficients packed in each 32-bit register]

PKHBT + SMLADX: total instruction count 11!
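One schedule that reaches 11 instructions for a 4 × 4 product is sketched below in Python. It is an illustration under assumptions, not the authors' schedule: with the register layout assumed here, the repacked register is consumed by the non-exchanging dual multiply SMLAD rather than SMLADX, and the real implementation may pack or order coefficients differently. All function names model ARM instructions of the same name; `schoolbook_4x4_packed` is hypothetical.

```python
MASK32 = 0xFFFFFFFF
trace = []  # names of the modelled instructions, in execution order


def s16(x):
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def pack(lo, hi):
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def _op(name, value):
    trace.append(name)
    return value & MASK32


def smlabb(rn, rm, ra): return _op("SMLABB", s16(rn) * s16(rm) + ra)
def smlatt(rn, rm, ra): return _op("SMLATT", s16(rn >> 16) * s16(rm >> 16) + ra)
def smlad(rn, rm, ra):  return _op("SMLAD", s16(rn) * s16(rm) + s16(rn >> 16) * s16(rm >> 16) + ra)
def smladx(rn, rm, ra): return _op("SMLADX", s16(rn) * s16(rm >> 16) + s16(rn >> 16) * s16(rm) + ra)
def pkhbt(rn, rm):      return _op("PKHBT", (rn & 0xFFFF) | (rm & 0xFFFF0000))  # no shift


def schoolbook_4x4_packed(a, b):
    """4 x 4 product on packed registers; returns (coefficients, instruction count)."""
    trace.clear()
    A0, A1 = pack(a[0], a[1]), pack(a[2], a[3])
    B0, B1 = pack(b[0], b[1]), pack(b[2], b[3])
    P = pkhbt(B1, B0)                         # bottom = b2, top = b1
    c0 = smlabb(A0, B0, 0)                    # a0*b0
    c1 = smladx(A0, B0, 0)                    # a0*b1 + a1*b0
    c2 = smlabb(A1, B0, smlad(A0, P, 0))      # a0*b2 + a1*b1, then + a2*b0
    c3 = smladx(A1, B0, smladx(A0, B1, 0))    # a0*b3 + a1*b2 + a2*b1 + a3*b0
    c4 = smlatt(A0, B1, smlad(A1, P, 0))      # a2*b2 + a3*b1, then + a1*b3
    c5 = smladx(A1, B1, 0)                    # a2*b3 + a3*b2
    c6 = smlatt(A1, B1, 0)                    # a3*b3
    return [c0, c1, c2, c3, c4, c5, c6], len(trace)


if __name__ == "__main__":
    print(schoolbook_4x4_packed([1, 2, 3, 4], [5, 6, 7, 8]))
```

Dropping the packing step and computing c2 and c4 with three single multiply-accumulates each gives the 12-instruction variant; the single PKHBT pays for itself because the packed register is used twice.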

SLIDE 16

Schoolbook multiplication

  • ≅ 37.5% reduction in instruction count for one schoolbook multiplication
  • A basic unrolled 16 × 16 multiplication requires only 168 SMLA instructions
  • A single schoolbook multiplication takes only 587 clock cycles

SLIDE 17

Toom-Cook multiplication

  • During the evaluation phase of Toom-Cook multiplication, polynomial A (and B) is divided into 4 smaller polynomials A3-A0, each with 64 coefficients
  • Further, we need to create weighted sums of these polynomials
  • In the simple method, each weighted sum aw needs to access all 64 coefficients of each of A0-A3, i.e. 256 memory accesses

SLIDE 18

Normal method

[Figure: for each weighted sum, the i-th coefficients of A0, A1, A2 and A3 are loaded from memory into registers and combined into one weighted coefficient aw[i].]

SLIDE 19

Memory access efficient method

[Figure: the i-th coefficients of A0, A1, A2 and A3 are loaded once into registers; partial sums kept in spare registers are combined into the i-th coefficients of several weighted sums (aw1, aw2 and aw5 are shown).]

  • Vertical coefficient scanning approach
  • Use spare registers
  • Perform more `in-register` operations to generate weighted coefficients

SLIDE 20

Memory access efficient method

[Figure: the i-th coefficients of A0, A1, A2 and A3 are loaded once into registers; partial sums kept in spare registers are combined into the i-th coefficients of several weighted sums (aw1, aw2 and aw5 are shown).]

  • The number of memory accesses decreases from 5 × 256 to only 256
  • Memory requirement increases.
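The difference between the two access patterns can be sketched in Python by counting loads. The five weight vectors below correspond to the evaluation points 1, -1, 2, -2 and 1/2 (scaled by 8), a common choice for Toom-Cook 4-way; they are an assumption here, since the exact weights on the slides are not recoverable. Function names are illustrative.

```python
# Weights for A0..A3 at the assumed points 1, -1, 2, -2 and 1/2 (scaled by 8).
WEIGHTS = [(1, 1, 1, 1), (1, -1, 1, -1), (1, 2, 4, 8), (1, -2, 4, -8), (8, 4, 2, 1)]


def evaluate_naive(A):
    """One pass over A0..A3 per weighted sum; returns (sums, number of loads)."""
    loads, out = 0, []
    for w in WEIGHTS:
        row = []
        for i in range(64):
            s = 0
            for k in range(4):
                s += w[k] * A[k][i]  # one memory load per term
                loads += 1
            row.append(s & 0xFFFF)
        out.append(row)
    return out, loads


def evaluate_vertical(A):
    """Load a0[i]..a3[i] once, then derive all five sums in registers."""
    loads = 0
    out = [[0] * 64 for _ in WEIGHTS]
    for i in range(64):
        a0, a1, a2, a3 = A[0][i], A[1][i], A[2][i], A[3][i]
        loads += 4
        e1, o1 = a0 + a2, a1 + a3                  # shared by points 1 and -1
        e2, o2 = a0 + 4 * a2, 2 * a1 + 8 * a3      # shared by points 2 and -2
        sums = (e1 + o1, e1 - o1, e2 + o2, e2 - o2, 8 * a0 + 4 * a1 + 2 * a2 + a3)
        for j, s in enumerate(sums):
            out[j][i] = s & 0xFFFF
    return out, loads


if __name__ == "__main__":
    A = [[(13 * i + 101 * k) % 8192 for i in range(64)] for k in range(4)]
    naive, n_loads = evaluate_naive(A)
    vertical, v_loads = evaluate_vertical(A)
    assert naive == vertical
    print(n_loads, v_loads)  # 1280 256
```

The vertical scan trades loads for live registers: all four input coefficients and the shared partial sums must stay in registers while the five outputs are produced.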

SLIDE 21

Memory Optimization

SLIDE 22

Generation of the public matrix

  • The public matrix requires a large amount of memory to generate.

[Figure: random seed → SHAKE-128() → a00 a01 a02 a10 a11 a12 a20 a21 a22, 3744 bytes in total]

SLIDE 23

Generation of the public matrix

  • We use a `Just-in-Time` strategy to reuse a smaller space for each element polynomial.

[Figure: random seed → Keccak-absorb() → S → Keccak-squeeze() → a 280-byte buffer reused in turn for a00 a01 a02 a10 a11 a12 a20 a21 a22]

slide-26
SLIDE 26
  • We use a `Just-in-Time` strategy to reuse a smaller space for

each element polynomial.

Generation of the Public matrix.

Keccak-absorb() Random seed

  • Requires some bookkeeping.
  • Memory requirement is ≅1/9th of the initial requirement

a00 a01 a02 a10 a11 a12 a20 a21 a22

280 bytes Keccak-squeeze()

. . . . . . . . . . . S

10
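The just-in-time idea can be sketched in Python as a generator that yields one matrix element at a time instead of materialising all nine. Several things here are assumptions: the little-endian 13-bit unpacking, the 416-byte chunk per polynomial (the slides use a 280-byte buffer with extra bookkeeping), and the function names. Python's `hashlib` also has no incremental squeeze, so the generator re-derives the output prefix on each step purely to emulate `Keccak-squeeze()`.

```python
import hashlib

N, EQ, L = 256, 13, 3          # coefficients per polynomial, bits each, matrix dimension
POLY_BYTES = N * EQ // 8       # 416 bytes per polynomial; 9 * 416 = 3744 bytes in total


def unpack_poly(chunk):
    """Split a byte string into N coefficients of EQ bits (assumed bit order)."""
    bits = int.from_bytes(chunk, "little")
    mask = (1 << EQ) - 1
    return [(bits >> (EQ * i)) & mask for i in range(N)]


def matrix_all_at_once(seed):
    """Squeeze the whole matrix into one 3744-byte buffer, then unpack it."""
    stream = hashlib.shake_128(seed).digest(L * L * POLY_BYTES)
    return [unpack_poly(stream[k * POLY_BYTES:(k + 1) * POLY_BYTES])
            for k in range(L * L)]


def matrix_just_in_time(seed):
    """Yield one polynomial at a time, keeping only one chunk alive."""
    xof = hashlib.shake_128(seed)
    for k in range(L * L):
        # Emulated squeeze: keep only the newest POLY_BYTES of the output prefix.
        chunk = xof.digest((k + 1) * POLY_BYTES)[-POLY_BYTES:]
        yield unpack_poly(chunk)


if __name__ == "__main__":
    seed = bytes(32)
    assert list(matrix_just_in_time(seed)) == matrix_all_at_once(seed)
    for k, poly in enumerate(matrix_just_in_time(seed)):
        print(k, poly[:4])  # each element can be consumed and then discarded
```

A consumer that multiplies each element by the secret vector as soon as it is produced never needs more than one polynomial of the matrix in memory.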

SLIDE 27

Results & Conclusion

SLIDE 28

  • Comparison to other NIST-PQC candidates

Each entry is [Kcycles] / [bytes].

Cryptosystem    Platform     Key generation    Encapsulation     Decapsulation     Multiplication
NewHope-CCA*    Cortex-M4    1,246 / 11,160    1,966 / 17,456    1,977 / 19,656    NTT
Kyber*          Cortex-M4    1,200 / 10,304    1,497 / 13,464    1,526 / 14,624    NTT
Saber-speed     Cortex-M4    1,147 / 13,883    1,444 / 16,667    1,543 / 17,763    TC+Kara+SB
Saber-memory    Cortex-M4    1,165 / 6,931     1,530 / 7,019     1,635 / 8,115     TC+Kara+SB
Saber-mem-M0    Cortex-M0    4,786 / 5,031     6,328 / 5,119     7,509 / 6,215     TC+Kara+SB

* pqm4: post-quantum crypto library for the ARM Cortex-M4. https://github.com/mupq/pqm4, 2018. [accessed 15-April-2018]

SLIDE 29

Conclusion

  • Module-lattice-based cryptography can be practical on resource-constrained devices
      • Cortex-M0 → max memory ≅ 6.2 KB, run time < 250 ms
          • Memory requirement 1/3rd of the reference implementation
      • Cortex-M4 → max memory ≅ 17 KB, run time < 9 ms
          • Run time 5-8 times less than the reference
      • The optimizations can be applied on top of each other.
  • Choice of parameters is crucial
      • For small dimensions, the asymptotically slower Toom-Cook and Karatsuba multiplications can be very competitive
          • Irregular memory access of NTT
          • Utilization of special instructions

SLIDE 30

Thank you!