Efficient Side-Channel Protections of ARX Ciphers

SLIDE 1

Efficient Side-Channel Protections of ARX Ciphers

Bernhard Jungk (1), Richard Petri (2), Marc Stöttinger (3)

(1) Fraunhofer Singapore, Singapore, bernhard.jungk@fraunhofer.sg
(2) Fraunhofer SIT, Germany, richard.petri@sit.fraunhofer.de
(3) Continental AG, Germany, marc.stoettinger@contiental-corporation.com

September 10, 2018

SLIDE 7

Protecting ARX Ciphers

◮ ARX ciphers (e.g. Threefish, Speck, ChaCha20) rely on modular Addition, Rotation and XOR
◮ They are easily protected against timing side-channels, but much harder to protect against power/EM side-channels, see e.g.
    ◮ “Butterfly Attack” against modular addition in Skein
    ◮ “Bricklayer Attack” on ChaCha20
◮ Early work by Goubin (2001) suggested Boolean and arithmetic masking, with conversion in-between (Cost: O(k))
◮ Simpler: Apply Boolean masking directly to an addition algorithm in software!

[Figure: ARX round structure on the words a, b, c, d, with a rotation (≪) in each branch]
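
To make the three ARX operations concrete, the following is a minimal, unprotected Python sketch of the ChaCha quarter-round as specified in RFC 7539. The names `rotl32` and `quarter_round` are chosen for this illustration and do not appear on the slides; the sketch only shows where additions, rotations and XORs occur and says nothing about side-channel behaviour.

```python
MASK32 = 0xFFFFFFFF  # ChaCha works on 32-bit words


def rotl32(v, c):
    """Rotate the 32-bit word v left by c bits (0 < c < 32)."""
    return ((v << c) & MASK32) | (v >> (32 - c))


def quarter_round(a, b, c, d):
    """One ChaCha quarter-round: four additions, four XORs, four rotations."""
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 7)
    return a, b, c, d
```

Only the four modular additions are non-linear over XOR, which is why the rest of the talk concentrates on masking the adder.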

SLIDE 13

Our contribution

◮ Threshold Implementations (TI) were initially of interest only for hardware implementations, until recent developments reduced the number of necessary shares
◮ We introduce some optimizations for masking additions
    ◮ Introduce masked versions of combined SHIFT-AND(-XOR) gates
    ◮ Use the “flexible second operand” of the ARM platform, which performs z ← x ∘ (y ≪ c) in one instruction (∘ denoting a logical operation such as AND or XOR)
    ◮ Reduce the number of necessary remasking steps, which reduces the amount of required entropy
◮ Not in this presentation: We introduce a simpler algorithm for modular subtraction

SLIDE 14

Kogge-Stone Adder (KSA)

[Figure: 8-bit Kogge-Stone adder. The inputs (x[b], y[b]) for bits b = 0, ..., 7 pass through Iteration 1, Iteration 2 and Iteration 3 to the output.]

Input nodes map (x[b], y[b]) to (g[b], p[b]):
    g[b] ← x[b] ∧ y[b]
    p[b] ← x[b] ⊕ y[b]

Iteration nodes combine (g[b], p[b]) with (g[b − 2^i], p[b − 2^i]) into a new (g[b], p[b]), using the offsets 2^i = 1, 2, 4 in iterations 1 to 3:
    g[b] ← (p[b] ∧ g[b − 2^i]) ⊕ g[b]
    p[b] ← (p[b] ∧ p[b − 2^i])

Combined SHIFT-AND(-XOR) gates

SLIDE 18

TI AND(-XOR) Gate with 2 shares

(z0 ⊕ z1) ← (x0 ⊕ x1) ∧ (y0 ⊕ y1)

m ← (x0 ≫ 1) ⊕ (u ≪ (k − 1))
s0 ← x0 ∧ y0, s1 ← x0 ∧ y1
s2 ← x1 ∧ y0, s3 ← x1 ∧ y1
t0 ← s0 ⊕ m, t1 ← s1 ⊕ m
z0 ← t0 ⊕ s2, z1 ← t1 ⊕ s3

◮ Direct approach to constructing an AND gate with four output shares, which are registered and recombined
◮ The output is not uniform, which requires remasking with a guard share m
◮ A typical software implementation processes k shares in parallel → use one uniform input share as guard share (only one fresh bit is needed)
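
The gate equations can be checked functionally with a short Python model. This is a sketch under stated assumptions: `share` is a hypothetical helper for splitting a value into two Boolean shares, the word size `k` defaults to 32, and Python integers do not model power or EM leakage, so the model confirms only that the two output shares recombine to x ∧ y.

```python
import secrets


def share(value, k=32):
    """Hypothetical helper: split value into two Boolean shares of k bits."""
    r = secrets.randbits(k)
    return r, value ^ r


def sec_and(x0, x1, y0, y1, u, k=32):
    """2-share AND with guard share m; u is one fresh random bit."""
    m = (x0 >> 1) ^ (u << (k - 1))   # guard share: shifted x0 plus the fresh bit
    s0, s1 = x0 & y0, x0 & y1
    s2, s3 = x1 & y0, x1 & y1
    t0, t1 = s0 ^ m, s1 ^ m          # remask before recombining
    return t0 ^ s2, t1 ^ s3


x, y = 0xDEADBEEF, 0x0F0F1234
x0, x1 = share(x)
y0, y1 = share(y)
z0, z1 = sec_and(x0, x1, y0, y1, secrets.randbits(1))
unmasked = z0 ^ z1                   # equals x & y for any choice of shares
```

The guard share cancels when the outputs are recombined, because it is XORed into both z0 and z1.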

SLIDE 19

TI AND(-XOR) Gate with 2 shares

(z0 ⊕ z1) ← ((x0 ⊕ x1) ∧ (y0 ⊕ y1)) ⊕ (u0 ⊕ u1)

s0 ← x0 ∧ y0, s1 ← x0 ∧ y1
s2 ← x1 ∧ y0, s3 ← x1 ∧ y1
t0 ← s0 ⊕ u0, t1 ← s1 ⊕ u1
z0 ← t0 ⊕ s2, z1 ← t1 ⊕ s3

◮ In the case of z ← (x ∧ y) ⊕ u no guard share is required: the shares u0 and u1 of the XOR operand take the place of m

SLIDE 20

Combined SHIFT-AND(-XOR) gate

m ← (x0 ≫ 1) ⊕ (u ≪ (k − 1))
s0 ← x0 ∧ (x0 ≪ i), s1 ← x0 ∧ (x1 ≪ i)
s2 ← x1 ∧ (x0 ≪ i), s3 ← x1 ∧ (x1 ≪ i)
t0 ← s0 ⊕ m, t1 ← s1 ⊕ m
z0 ← t0 ⊕ s2, z1 ← t1 ⊕ s3

◮ The KSA heavily uses a combined SHIFT-AND (and SHIFT-AND-XOR) operation, which lends itself well to the ARM “flexible second operand”

SLIDE 21

Combined SHIFT-AND(-XOR) gate

s0 ← x0 ∧ (y0 ≪ i), s1 ← x0 ∧ (y1 ≪ i)
s2 ← x1 ∧ (y0 ≪ i), s3 ← x1 ∧ (y1 ≪ i)
t0 ← s0 ⊕ y0, t1 ← s1 ⊕ y1
z0 ← t0 ⊕ s2, z1 ← t1 ⊕ s3

◮ Again, in the case of z ← (x ∧ (y ≪ i)) ⊕ y no guard share is required
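
Both combined gates can be modelled in a few lines of Python. The function names `sec_and_shift` and `sec_and_shift_xor` and the explicit `k`-bit masking are assumptions of this sketch. Each expression of the form `a & ((b << s) & mask)` stands for what a single ARM instruction such as `AND r, a, b, LSL #s` computes on a 32-bit register; the model checks functional correctness only, not leakage.

```python
def sec_and_shift(x0, x1, u, s, k=32):
    """Shares of x & (x << s); remasked with a guard share built from x0 and u."""
    mask = (1 << k) - 1
    m = (x0 >> 1) ^ (u << (k - 1))
    q0, q1 = (x0 << s) & mask, (x1 << s) & mask
    s0, s1 = x0 & q0, x0 & q1
    s2, s3 = x1 & q0, x1 & q1
    return (s0 ^ m) ^ s2, (s1 ^ m) ^ s3


def sec_and_shift_xor(x0, x1, y0, y1, s, k=32):
    """Shares of (x & (y << s)) ^ y; the shares of y replace the guard share."""
    mask = (1 << k) - 1
    q0, q1 = (y0 << s) & mask, (y1 << s) & mask
    s0, s1 = x0 & q0, x0 & q1
    s2, s3 = x1 & q0, x1 & q1
    return (s0 ^ y0) ^ s2, (s1 ^ y1) ^ s3
```

In the KSA, the first gate updates the propagate word p and the second updates the generate word g.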

SLIDE 22

Protected KSA

Require: x, y ∈ Z_{2^k}, k > 0
Ensure: z = (x + y) mod 2^k

n ← max(⌈log2(k − 1)⌉, 1)
g ← x ∧ y
p ← x ⊕ y
for i = 1 to n − 1 do
    g ← (p ∧ (g ≪ 2^(i−1))) ⊕ g
    p ← (p ∧ (p ≪ 2^(i−1)))
end for
g ← (p ∧ (g ≪ 2^(n−1))) ⊕ g
z ← x ⊕ y ⊕ 2g
return z
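
A direct Python transcription of the unprotected algorithm is sketched below. The function name `ksa_add`, the default word size and the use of `int.bit_length` to evaluate ⌈log2(k − 1)⌉ are choices made for this example; the sketch assumes k ≥ 2.

```python
def ksa_add(x, y, k=32):
    """Kogge-Stone addition of two k-bit words, returning (x + y) mod 2^k."""
    mask = (1 << k) - 1
    n = max((k - 2).bit_length(), 1)      # equals max(ceil(log2(k - 1)), 1)
    g = x & y                             # generate
    p = x ^ y                             # propagate
    for i in range(1, n):
        s = 1 << (i - 1)                  # shift distance 2^(i-1)
        g = (p & ((g << s) & mask)) ^ g
        p = p & ((p << s) & mask)
    s = 1 << (n - 1)                      # last step updates g only
    g = (p & ((g << s) & mask)) ^ g
    return (x ^ y ^ (g << 1)) & mask      # carries enter one bit to the left
```

For k = 32 this gives n = 5, so the generate word is updated with the shift distances 1, 2, 4, 8 and 16.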

SLIDE 23

Protected KSA

Require: x0, x1, y0, y1 ∈ Z_{2^k}, k > 0, u ∈ {0, 1}, with x = x0 ⊕ x1 and y = y0 ⊕ y1
Ensure: z = (x + y) mod 2^k, with z = z0 ⊕ z1

n ← max(⌈log2(k − 1)⌉, 1)
(g0, g1) ← SecAnd(x0, x1, y0, y1, u)                      # Shared AND
(p0, p1) ← SecXor(x0, x1, y0, y1)                         # Shared XOR
u ← x0 mod 2                                              # Update guard share
for i = 1 to n − 1 do
    v ← p0 mod 2                                          # Save next guard share
    (g0, g1) ← SecAndShiftXor(p0, p1, g0, g1, 2^(i−1))    # Shared AND-SHIFT-XOR
    (p0, p1) ← SecAndShift(p0, p1, u, 2^(i−1))            # Shared AND-SHIFT
    u ← v                                                 # Update guard share
end for
(g0, g1) ← SecAndShiftXor(p0, p1, g0, g1, 2^(n−1))        # Shared AND-SHIFT-XOR
(z0, z1) ← (x0 ⊕ y0 ⊕ 2g0, x1 ⊕ y1 ⊕ 2g1)                 # Compute final output
return (z0, z1, u)
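
The protected adder can be exercised end to end with a functional Python model. This is a sketch, not the authors' implementation: the gate bodies follow the equations on the earlier slides and are repeated here so the block runs on its own, the helper `_guard` and all function names are assumptions, and the model assumes k ≥ 2. It verifies only that the output shares recombine to (x + y) mod 2^k; instruction ordering and register reuse, which matter for real leakage, are not modelled.

```python
def _guard(a0, u, k):
    """Guard share m <- (a0 >> 1) xor (u << (k - 1))."""
    return (a0 >> 1) ^ (u << (k - 1))


def sec_and(x0, x1, y0, y1, u, k):
    m = _guard(x0, u, k)
    return ((x0 & y0) ^ m) ^ (x1 & y0), ((x0 & y1) ^ m) ^ (x1 & y1)


def sec_and_shift(p0, p1, u, s, k):
    mask = (1 << k) - 1
    m = _guard(p0, u, k)
    q0, q1 = (p0 << s) & mask, (p1 << s) & mask
    return ((p0 & q0) ^ m) ^ (p1 & q0), ((p0 & q1) ^ m) ^ (p1 & q1)


def sec_and_shift_xor(p0, p1, g0, g1, s, k):
    mask = (1 << k) - 1
    h0, h1 = (g0 << s) & mask, (g1 << s) & mask
    return ((p0 & h0) ^ g0) ^ (p1 & h0), ((p0 & h1) ^ g1) ^ (p1 & h1)


def protected_ksa(x0, x1, y0, y1, u, k=32):
    """Returns (z0, z1, u) with z0 ^ z1 == (x + y) mod 2^k."""
    mask = (1 << k) - 1
    n = max((k - 2).bit_length(), 1)         # max(ceil(log2(k - 1)), 1)
    g0, g1 = sec_and(x0, x1, y0, y1, u, k)   # shared AND
    p0, p1 = x0 ^ y0, x1 ^ y1                # shared XOR (share-wise)
    u = x0 & 1                               # update guard share
    for i in range(1, n):
        s = 1 << (i - 1)
        v = p0 & 1                           # save next guard share
        g0, g1 = sec_and_shift_xor(p0, p1, g0, g1, s, k)
        p0, p1 = sec_and_shift(p0, p1, u, s, k)
        u = v
    g0, g1 = sec_and_shift_xor(p0, p1, g0, g1, 1 << (n - 1), k)
    z0 = x0 ^ y0 ^ ((g0 << 1) & mask)
    z1 = x1 ^ y1 ^ ((g1 << 1) & mask)
    return z0, z1, u
```

The returned bit u corresponds to the slide's statement that the adder needs one random bit and outputs one random bit.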

SLIDE 25

Protected KSA

[Figure: 8-bit Kogge-Stone adder with the inputs (x[b], y[b]) for bits b = 0, ..., 7, Iteration 1 to Iteration 3, and the output]

LSB can be used as guard share for next iteration

SLIDE 27

Further optimization

s0 ← x0 ∧ y0, s1 ← x0 ∨ ¬y1
s2 ← x1 ∧ y0, s3 ← x1 ∨ ¬y1
z0 ← s0 ⊕ s1, z1 ← s2 ⊕ s3

◮ Biryukov et al. (2017) introduced a further optimized secure AND gate (SecAndOpt/SecAndShiftOpt), which can be combined with our approach
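
That these equations still compute a sharing of x ∧ y is not obvious at first glance, so here is a small exhaustive check in Python. The name `sec_and_opt` and the `k` parameter are assumptions of this sketch. The identity behind it is a ∨ ¬b = 1 ⊕ b ⊕ (a ∧ b), so the constant and the y1 term cancel when z0 and z1 are recombined.

```python
from itertools import product


def sec_and_opt(x0, x1, y0, y1, k=1):
    """Optimized 2-share AND following the equations above, on k-bit words."""
    mask = (1 << k) - 1
    not_y1 = ~y1 & mask              # bitwise NOT restricted to k bits
    s0, s1 = x0 & y0, x0 | not_y1
    s2, s3 = x1 & y0, x1 | not_y1
    return s0 ^ s1, s2 ^ s3


# Check all 16 combinations of single-bit shares.
all_ok = all(
    (z0 ^ z1) == ((x0 ^ x1) & (y0 ^ y1))
    for x0, x1, y0, y1 in product((0, 1), repeat=4)
    for z0, z1 in [sec_and_opt(x0, x1, y0, y1)]
)
```

Because every operation is bitwise, the single-bit check carries over to k-bit words.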

SLIDE 30

Comparison of masked operations

Instruction counts per masked operation:

                            SecXor  SecShift  SecAnd  SecAndShift / -Opt  SecAndShiftXor
Generic [Coron et al.]        2        4        8          8 + 2            8 + 4 + 2
ARM [Coron et al.]            2        4        8          8 + 2            8 + 4 + 2
Generic [Biryukov et al.]     2        2        7          7 + 2            7 + 2 + 2
ARM [Biryukov et al.]         2        2        6          6 + 2            6 + 2 + 2
Generic [new]                 2                            10 / 9               10
ARM [new]                     2                            8 / 6                8

◮ Combined AND-SHIFT operations save most of the instructions
◮ Especially when combined with the optimizations proposed by Biryukov et al.
◮ Generation of the refresh mask takes only 3 instructions

SLIDE 33

Comparison of masked 32-bit modular addition

[Bar chart: number of instructions for Add, Add (ARM), Sub and Sub (ARM), comparing Coron et al., Biryukov et al. and this work. Bar labels in extraction order: 144, 144, 116, 114, 164, 156, 106, 78, 112, 83]

◮ ARM implementation improved by 31% when combined with the approach by Biryukov et al.
◮ Significantly improved subtraction instruction counts
◮ Needs one random bit, outputs one random bit

SLIDE 36

Application to ChaCha20 cipher

◮ We implemented an unprotected reference and two protected variants
◮ Masked addition is the driving factor
◮ Note: cycle counts are not entirely comparable due to possible differences in memory architecture

                                                    # cycles
Reference                                              1,726
Previous results   Masked [Adomnicai et al.]         121,618
                   Masked Opt. [Adomnicai et al.]     93,993
This work          TI 2-share                         72,721
                   TI 2-share Opt.                    60,623

SLIDE 37

Simulation

◮ The ChaCha implementation was simulated with the Micro-Architectural Power Simulator (MAPS) [1]
◮ The simulator was extended by 11 instructions
◮ The Hamming distance is sampled for each register assignment
◮ t-test with a fixed vs. random setup and 10^5 noise-free traces
◮ Noise amplification methods like shuffling should still be used

[Figure: t-test statistic t (axis from −10 to 10) over time (axis ticks from 5 to 35, in samples ×10^3)]

[1] https://github.com/cryptolu/maps
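
The fixed vs. random t-test can be sketched in plain Python as Welch's t statistic evaluated per sample point. The function names, the list-of-lists trace layout and the handling of zero variance (which can occur with noise-free simulated traces) are assumptions of this example and not taken from MAPS or the slides. A commonly used threshold in such leakage assessments is |t| > 4.5; the slides do not state which threshold was applied.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(fixed, rand):
    """Welch's t statistic for two lists of leakage values at one sample point."""
    denom = sqrt(variance(fixed) / len(fixed) + variance(rand) / len(rand))
    if denom == 0.0:
        # Both groups are constant (possible without noise): report no difference.
        return 0.0
    return (mean(fixed) - mean(rand)) / denom


def t_trace(fixed_traces, random_traces):
    """t statistic for every sample point; all traces have the same length."""
    return [
        welch_t(list(f), list(r))
        for f, r in zip(zip(*fixed_traces), zip(*random_traces))
    ]
```

Here `fixed_traces` would hold the simulated Hamming-distance traces for the fixed input and `random_traces` those for random inputs.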

SLIDE 38

Thank you for listening


SLIDE 39

ChaCha Shuffling (Backup Slide)

◮ In the case of ChaCha, shuffling can be used to amplify the noise
◮ The ChaCha state consists of 4 columns, which are processed independently (within a round)
◮ Instead of processing columns sequentially, one can jump between columns
◮ (4·12)! / (12!)^4 ≈ 2^88 permutations
◮ Noise can be further amplified by splitting the masked addition into several operations

Col0    Col1    Col2    Col3
add     add     add     add
xor     xor     xor     xor
shift   shift   shift   shift
add     add     add     add
xor     xor     xor     xor
shift   shift   shift   shift
add     add     add     add
xor     xor     xor     xor
...     ...     ...     ...
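
The permutation count can be reproduced with a few lines of Python, reading the formula as 4 columns with 12 ordered operations each. The function names are chosen for this sketch, and `random` is used only for illustration; an actual countermeasure would need a suitable randomness source.

```python
import math
import random


def interleavings(columns=4, ops_per_column=12):
    """Number of ways to interleave the columns while keeping each column's order."""
    total = columns * ops_per_column
    return math.factorial(total) // math.factorial(ops_per_column) ** columns


def shuffled_schedule(columns=4, ops_per_column=12, rng=random):
    """Draw one interleaving uniformly as a list of (column, operation index) pairs."""
    slots = [c for c in range(columns) for _ in range(ops_per_column)]
    rng.shuffle(slots)                     # random order of column visits
    next_op = [0] * columns
    schedule = []
    for c in slots:
        schedule.append((c, next_op[c]))   # next pending operation of column c
        next_op[c] += 1
    return schedule


bits = math.log2(interleavings())          # about 87.6, i.e. roughly 2^88
```

Shuffling the multiset of column labels, rather than the operations themselves, keeps the data dependencies inside each column intact.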
