Smartphone/tablet CPUs

1

Smartphone/tablet CPUs iPad 1 (2010) was the first popular tablet: more than 15 million sold. iPad 1 contains 45nm Apple A4 system-on-chip. Apple A4 contains 1GHz ARM Cortex-A8 CPU core + PowerVR SGX 535 GPU. Cortex-A8 CPU core (2005) supports ARMv7-A insn set, including NEON vector insns.

2

Apple A4 also appeared in iPhone 4 (2010). 45nm 1GHz Samsung Exynos 3110 in Samsung Galaxy S (2010) contains Cortex-A8 CPU core. 45nm 1GHz TI OMAP3630 in Motorola Droid X (2010) contains Cortex-A8 CPU core. 65nm 800MHz Freescale i.MX50 in Amazon Kindle 4 (2011) contains Cortex-A8 CPU core.

3

ARM designed more cores supporting same ARMv7-A insns: Cortex-A9 (2007), Cortex-A5 (2009), Cortex-A15 (2010), Cortex-A7 (2011), Cortex-A17 (2014), etc. Also some larger 64-bit cores. A9, A15, A17, and some 64-bit cores are “out of order”: CPU tries to reorder instructions to compensate for dumb compilers.

4

A5, A7, original A8 are in-order, fewer insns at once. ⇒ Simpler, cheaper, more energy-efficient. More than one billion Cortex-A7 devices have been sold. Popular in low-cost and mid-range smartphones: Mobiistar Buddy, Mobiistar Kool, Mobiistar LAI Z1, Samsung Galaxy J1 Ace Neo, etc. Also used in typical TV boxes, Sony SmartWatch 3, Samsung Gear S2, Raspberry Pi 2, etc.

5

NEON crypto Basic ARM insn set uses 16 32-bit registers: 512 bits. Optional NEON extension uses 16 128-bit registers: 2048 bits. Cortex-A7 and Cortex-A8 (and Cortex-A15 and Cortex-A17 and Qualcomm Scorpion and Qualcomm Krait) always have NEON insns. Cortex-A5 and Cortex-A9 sometimes have NEON insns.

6

2012 Bernstein–Schwabe “NEON crypto” software: new Cortex-A8 speed records for various crypto primitives. e.g. Curve25519 ECDH: 460200 cycles on Cortex-A8-fast, 498284 cycles on Cortex-A8-slow. Compare to OpenSSL cycles on Cortex-A8-slow for NIST P-256 ECDH: 9 million for OpenSSL 0.9.8k. 4.8 million for OpenSSL 1.0.1c. 3.9 million for OpenSSL 1.0.2j.

7

NEON instructions

4x a = b + c is a vector of 4 32-bit additions: a[0] = b[0] + c[0]; a[1] = b[1] + c[1]; a[2] = b[2] + c[2]; a[3] = b[3] + c[3].

Cortex-A8 NEON arithmetic unit can do this every cycle. Stage N2: reads b and c. Stage N3: performs addition. Stage N4: a is ready.

[pipeline diagram: ADD, 2 cycles, ADD, 2 cycles, ADD]
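The lane-wise semantics can be written out as a short Python model. This is our own sketch (the function name `vadd_4x32` is ours, not an ARM mnemonic); it only illustrates that each 32-bit lane adds independently, wrapping modulo 2^32 with no carry between lanes.

```python
# Minimal model of the "4x a = b + c" NEON insn described above:
# four independent 32-bit additions, each lane wrapping modulo 2^32.
# Function name is ours; the hardware insn is a VADD.I32-style op.
def vadd_4x32(b, c):
    assert len(b) == len(c) == 4
    return [(b[i] + c[i]) & 0xFFFFFFFF for i in range(4)]

b = [0xFFFFFFFF, 1, 2, 3]
c = [1, 10, 20, 30]
a = vadd_4x32(b, c)
# lane 0 wraps: (2^32 - 1) + 1 = 0 mod 2^32; no carry into lane 1
assert a == [0, 11, 22, 33]
```

The absence of cross-lane carries is why big-integer code on NEON keeps limbs well below 32 bits, as in the radix-2^25.5 representation later in these slides.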

8

4x a = b - c is a vector of 4 32-bit subtractions: a[0] = b[0] - c[0]; a[1] = b[1] - c[1]; a[2] = b[2] - c[2]; a[3] = b[3] - c[3].

Stage N1: reads c. Stage N2: reads b, negates c. Stage N3: performs addition. Stage N4: a is ready.

[pipeline diagram: ADD, 2 or 3 cycles, SUB]

Also logic insns, shifts, etc.

9

Multiplication insn: c[0,1] = a[0] signed* b[0]; c[2,3] = a[1] signed* b[1]. Two cycles on Cortex-A8.

Multiply-accumulate insn: c[0,1] += a[0] signed* b[0]; c[2,3] += a[1] signed* b[1]. Also two cycles on Cortex-A8.

Stage N1: reads b. Stage N2: reads a. Stage N3: reads c if accumulate. ... Stage N8: c is ready.
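A sketch of these two insns' semantics in Python. Here signed* is a signed 32×32 → 64-bit lane multiply, and "c[0,1]" denotes a 64-bit result occupying two 32-bit lanes; the helper names are ours, in the spirit of NEON's VMULL.S32/VMLAL.S32.

```python
# Model of the multiplication insns above: each product is a signed
# 32x32 -> 64-bit multiply; "c[0,1]" is a 64-bit result occupying two
# 32-bit lanes. Helper names are ours, not ARM mnemonics.
def to_signed32(x):
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x >= (1 << 31) else x

def vmull(a, b):
    # c[0,1] = a[0] signed* b[0]; c[2,3] = a[1] signed* b[1]
    return [to_signed32(a[0]) * to_signed32(b[0]),
            to_signed32(a[1]) * to_signed32(b[1])]

def vmlal(c, a, b):
    # c[0,1] += a[0] signed* b[0]; c[2,3] += a[1] signed* b[1]
    p = vmull(a, b)
    return [c[0] + p[0], c[1] + p[1]]

assert vmull([0xFFFFFFFF, 2], [3, 4]) == [-3, 8]  # 0xFFFFFFFF is -1 signed
assert vmlal([10, 20], [1, 1], [5, 6]) == [15, 26]
```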

10

Typical sequence of three insns:
c[0,1] = a[0] signed* b[0]; c[2,3] = a[1] signed* b[1]
c[0,1] += e[2] signed* f[2]; c[2,3] += e[3] signed* f[3]
c[0,1] += g[0] signed* h[2]; c[2,3] += g[1] signed* h[3]
Cortex-A8 recognizes this pattern. Reads c in N6 instead of N3.

11

Time  N1  N2  N3  N4  N5  N6  N7  N8
   1   b
   2       a
   3   f       ×
   4       e       ×
   5   h       ×       ×
   6       g       ×       ×
   7           ×       ×
   8               ×       ×           c
   9                   ×       +
  10                       ×           c
  11                               +
  12                                   c
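The timing in the table can be reproduced with a toy model. The assumptions are our own simplification of the slides: one insn enters N1 every 2 cycles, a result is ready when it leaves N8 (seven cycles after N1), and a chained accumulate reads c in stage N6.

```python
# Toy model (ours) of the chained multiply-accumulate timing above.
# Assumptions: insns issue every 2 cycles; a result is ready leaving
# stage N8; an accumulate insn reads c in stage N6 (forwarding).
def ready_cycle(n_insns, c_read_stage=6):
    ready = None
    for i in range(n_insns):
        n1 = 1 + 2 * i                      # cycle this insn enters N1
        if i > 0:                           # accumulate needs previous c
            c_read = n1 + c_read_stage - 1  # cycle it reads c
            assert c_read >= ready, "would stall waiting for c"
        ready = n1 + 7                      # cycle c leaves N8
    return ready

assert ready_cycle(1) == 8    # matches the table: first c in cycle 8
assert ready_cycle(3) == 12   # matches the table: final c in cycle 12
```

With `c_read_stage=3` (i.e. without the pattern recognition) the second insn would need c in cycle 5, three cycles before it is ready, so the assertion fires; that is why reading c in N6 instead of N3 matters.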

12

NEON also has load/store insns and permutation insns: e.g., r = s[1] t[2] r[2,3] Cortex-A8 has a separate NEON load/store unit that runs in parallel with NEON arithmetic unit. Arithmetic is typically most important bottleneck: can often schedule insns to hide loads/stores/perms. Cortex-A7 is different: one unit handling all NEON insns.

13

Curve25519 on NEON

Radix 2^25.5: Use small integers (f0, f1, f2, f3, f4, f5, f6, f7, f8, f9) to represent the integer f = f0 + 2^26 f1 + 2^51 f2 + 2^77 f3 + 2^102 f4 + 2^128 f5 + 2^153 f6 + 2^179 f7 + 2^204 f8 + 2^230 f9 modulo 2^255 − 19.

Unscaled polynomial view: f is the value at t = 2^25.5 of the poly f0 t^0 + 2^0.5 f1 t^1 + f2 t^2 + 2^0.5 f3 t^3 + f4 t^4 + 2^0.5 f5 t^5 + f6 t^6 + 2^0.5 f7 t^7 + f8 t^8 + 2^0.5 f9 t^9.
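A Python sketch of this representation (helper names `pack`/`unpack` are ours): limb i sits at bit position ⌈25.5 i⌉, so limb widths alternate 26, 25, 26, 25, ..., and splitting/recombining is exact.

```python
# Sketch of the radix-2^25.5 split described above. Limb i sits at bit
# position ceil(25.5*i); pack/unpack are our own helper names.
P = 2**255 - 19
SHIFTS = [0, 26, 51, 77, 102, 128, 153, 179, 204, 230]

def unpack(f):
    """Split 0 <= f < 2^255 into ten small limbs (f0, ..., f9)."""
    limbs = []
    for i in range(10):
        hi = SHIFTS[i + 1] if i < 9 else 255
        limbs.append((f >> SHIFTS[i]) & ((1 << (hi - SHIFTS[i])) - 1))
    return limbs

def pack(limbs):
    """Recombine: sum of fi * 2^SHIFTS[i], reduced mod 2^255 - 19."""
    return sum(fi << s for fi, s in zip(limbs, SHIFTS)) % P

x = P - 1                      # largest field element
assert pack(unpack(x)) == x    # round trip is exact
```

The point of the alternating 26/25-bit widths is that limbs stay far below 32 bits, leaving headroom in 64-bit lanes for the multiply-accumulate chains of the previous slides.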

13

Curve25519 on NEON Radix 225:5: Use small integers (f0; f1; f2; f3; f4; f5; f6; f7; f8; f9) to represent the integer f = f0 + 226f1 + 251f2 + 277f3 + 2102f4 + 2128f5 + 2153f6 + 2179f7 + 2204f8 + 2230f9 modulo 2255 − 19. Unscaled polynomial view: f is value at 225:5 of the poly f0t0 + 20:5f1t1 + f2t2 + 20:5f3t3 + f4t4 + 20:5f5t5 + f6t6 + 20:5f7t7 + f8t8 + 20:5f9t9. h ≡ f g (mod 2255 − 19) where

h0 = f0g0+38f1g9+19f2g8+38f3g h1 = f0g1+ f1g0+19f2g9+19f3g h2 = f0g2+ 2f1g1+ f2g0+38f3g h3 = f0g3+ f1g2+ f2g1+ f3g h4 = f0g4+ 2f1g3+ f2g2+ 2f3g h5 = f0g5+ f1g4+ f2g3+ f3g h6 = f0g6+ 2f1g5+ f2g4+ 2f3g h7 = f0g7+ f1g6+ f2g5+ f3g h8 = f0g8+ 2f1g7+ f2g6+ 2f3g h9 = f0g9+ f1g8+ f2g7+ f3g

Proof: multiply polys mod t

slide-53
SLIDE 53

13

Curve25519 on NEON Radix 225:5: Use small integers (f0; f1; f2; f3; f4; f5; f6; f7; f8; f9) to represent the integer f = f0 + 226f1 + 251f2 + 277f3 + 2102f4 + 2128f5 + 2153f6 + 2179f7 + 2204f8 + 2230f9 modulo 2255 − 19. Unscaled polynomial view: f is value at 225:5 of the poly f0t0 + 20:5f1t1 + f2t2 + 20:5f3t3 + f4t4 + 20:5f5t5 + f6t6 + 20:5f7t7 + f8t8 + 20:5f9t9.

14

h ≡ f g (mod 2255 − 19) where

h0 = f0g0+38f1g9+19f2g8+38f3g7+19f4g6+ h1 = f0g1+ f1g0+19f2g9+19f3g8+19f4g7+ h2 = f0g2+ 2f1g1+ f2g0+38f3g9+19f4g8+ h3 = f0g3+ f1g2+ f2g1+ f3g0+19f4g9+ h4 = f0g4+ 2f1g3+ f2g2+ 2f3g1+ f4g0+ h5 = f0g5+ f1g4+ f2g3+ f3g2+ f4g1+ h6 = f0g6+ 2f1g5+ f2g4+ 2f3g3+ f4g2+ h7 = f0g7+ f1g6+ f2g5+ f3g4+ f4g3+ h8 = f0g8+ 2f1g7+ f2g6+ 2f3g5+ f4g4+ h9 = f0g9+ f1g8+ f2g7+ f3g6+ f4g5+

Proof: multiply polys mod t10 − 19.

slide-54
SLIDE 54

13

Curve25519 on NEON 225:5: Use small integers f2; f3; f4; f5; f6; f7; f8; f9) resent the integer + 226f1 + 251f2 + 277f3 + + 2128f5 + 2153f6 + 2179f7 + + 2230f9 modulo 2255 − 19. Unscaled polynomial view: value at 225:5 of the poly 20:5f1t1 + f2t2 + 20:5f3t3 + 20:5f5t5 + f6t6 + 20:5f7t7 + 20:5f9t9.

14

h ≡ f g (mod 2255 − 19) where

h0 = f0g0+38f1g9+19f2g8+38f3g7+19f4g6+ h1 = f0g1+ f1g0+19f2g9+19f3g8+19f4g7+ h2 = f0g2+ 2f1g1+ f2g0+38f3g9+19f4g8+ h3 = f0g3+ f1g2+ f2g1+ f3g0+19f4g9+ h4 = f0g4+ 2f1g3+ f2g2+ 2f3g1+ f4g0+ h5 = f0g5+ f1g4+ f2g3+ f3g2+ f4g1+ h6 = f0g6+ 2f1g5+ f2g4+ 2f3g3+ f4g2+ h7 = f0g7+ f1g6+ f2g5+ f3g4+ f4g3+ h8 = f0g8+ 2f1g7+ f2g6+ 2f3g5+ f4g4+ h9 = f0g9+ f1g8+ f2g7+ f3g6+ f4g5+

Proof: multiply polys mod t10 − 19.

38f5g5+19f6 19f5g6+19f6 38f5g7+19f6 19f5g8+19f6 38f5g9+19f6 f5g0+19f6 2f5g1+ f6 f5g2+ f6 2f5g3+ f6 f5g4+ f6

slide-55
SLIDE 55

13

NEON small integers ; f6; f7; f8; f9) integer 251f2 + 277f3 + 2153f6 + 2179f7 + modulo 2255 − 19.

  • lynomial view:

5 of the poly

f2t2 + 20:5f3t3 + f6t6 + 20:5f7t7 +

14

h ≡ f g (mod 2255 − 19) where

h0 = f0g0+38f1g9+19f2g8+38f3g7+19f4g6+ h1 = f0g1+ f1g0+19f2g9+19f3g8+19f4g7+ h2 = f0g2+ 2f1g1+ f2g0+38f3g9+19f4g8+ h3 = f0g3+ f1g2+ f2g1+ f3g0+19f4g9+ h4 = f0g4+ 2f1g3+ f2g2+ 2f3g1+ f4g0+ h5 = f0g5+ f1g4+ f2g3+ f3g2+ f4g1+ h6 = f0g6+ 2f1g5+ f2g4+ 2f3g3+ f4g2+ h7 = f0g7+ f1g6+ f2g5+ f3g4+ f4g3+ h8 = f0g8+ 2f1g7+ f2g6+ 2f3g5+ f4g4+ h9 = f0g9+ f1g8+ f2g7+ f3g6+ f4g5+

Proof: multiply polys mod t10 − 19.

38f5g5+19f6g4+38f7g3 19f5g6+19f6g5+19f7g4 38f5g7+19f6g6+38f7g5 19f5g8+19f6g7+19f7g6 38f5g9+19f6g8+38f7g7 f5g0+19f6g9+19f7g8 2f5g1+ f6g0+38f7g9 f5g2+ f6g1+ f7g0 2f5g3+ f6g2+ 2f7g1 f5g4+ f6g3+ f7g2

slide-56
SLIDE 56

13

integers f9) 277f3 + 2179f7 +

255 − 19.

  • ly

:5f3t3 + :5f7t7 +

14

h ≡ f g (mod 2255 − 19) where

h0 = f0g0+38f1g9+19f2g8+38f3g7+19f4g6+ h1 = f0g1+ f1g0+19f2g9+19f3g8+19f4g7+ h2 = f0g2+ 2f1g1+ f2g0+38f3g9+19f4g8+ h3 = f0g3+ f1g2+ f2g1+ f3g0+19f4g9+ h4 = f0g4+ 2f1g3+ f2g2+ 2f3g1+ f4g0+ h5 = f0g5+ f1g4+ f2g3+ f3g2+ f4g1+ h6 = f0g6+ 2f1g5+ f2g4+ 2f3g3+ f4g2+ h7 = f0g7+ f1g6+ f2g5+ f3g4+ f4g3+ h8 = f0g8+ 2f1g7+ f2g6+ 2f3g5+ f4g4+ h9 = f0g9+ f1g8+ f2g7+ f3g6+ f4g5+

Proof: multiply polys mod t10 − 19.

38f5g5+19f6g4+38f7g3+19f8g2+38 19f5g6+19f6g5+19f7g4+19f8g3+19 38f5g7+19f6g6+38f7g5+19f8g4+38 19f5g8+19f6g7+19f7g6+19f8g5+19 38f5g9+19f6g8+38f7g7+19f8g6+38 f5g0+19f6g9+19f7g8+19f8g7+19 2f5g1+ f6g0+38f7g9+19f8g8+38 f5g2+ f6g1+ f7g0+19f8g9+19 2f5g3+ f6g2+ 2f7g1+ f8g0+38 f5g4+ f6g3+ f7g2+ f8g1+

slide-57
SLIDE 57

14

h ≡ f·g (mod 2^255 − 19) where

h0 = f0g0 + 38f1g9 + 19f2g8 + 38f3g7 + 19f4g6 + 38f5g5 + 19f6g4 + 38f7g3 + 19f8g2 + 38f9g1
h1 = f0g1 +   f1g0 + 19f2g9 + 19f3g8 + 19f4g7 + 19f5g6 + 19f6g5 + 19f7g4 + 19f8g3 + 19f9g2
h2 = f0g2 +  2f1g1 +   f2g0 + 38f3g9 + 19f4g8 + 38f5g7 + 19f6g6 + 38f7g5 + 19f8g4 + 38f9g3
h3 = f0g3 +   f1g2 +   f2g1 +   f3g0 + 19f4g9 + 19f5g8 + 19f6g7 + 19f7g6 + 19f8g5 + 19f9g4
h4 = f0g4 +  2f1g3 +   f2g2 +  2f3g1 +   f4g0 + 38f5g9 + 19f6g8 + 38f7g7 + 19f8g6 + 38f9g5
h5 = f0g5 +   f1g4 +   f2g3 +   f3g2 +   f4g1 +   f5g0 + 19f6g9 + 19f7g8 + 19f8g7 + 19f9g6
h6 = f0g6 +  2f1g5 +   f2g4 +  2f3g3 +   f4g2 +  2f5g1 +   f6g0 + 38f7g9 + 19f8g8 + 38f9g7
h7 = f0g7 +   f1g6 +   f2g5 +   f3g4 +   f4g3 +   f5g2 +   f6g1 +   f7g0 + 19f8g9 + 19f9g8
h8 = f0g8 +  2f1g7 +   f2g6 +  2f3g5 +   f4g4 +  2f5g3 +   f6g2 +  2f7g1 +   f8g0 + 38f9g9
h9 = f0g9 +   f1g8 +   f2g7 +   f3g6 +   f4g5 +   f5g4 +   f6g3 +   f7g2 +   f8g1 +   f9g0

Proof: multiply polys mod t^10 − 19.
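These formulas can be cross-checked numerically. A Python sketch (the split/combine helpers are illustrative, not the NEON code): a term f_j·g_k lands in h_((j+k) mod 10); wrapping past t^10 contributes a factor 19 (since t^10 = 2^255 ≡ 19 mod p), and odd·odd limb pairs contribute a factor 2 from the two 2^0.5 scalings.

```python
# Sketch: schoolbook product in radix 2^25.5, matching the formulas above.
P = 2**255 - 19
SHIFTS = [0, 26, 51, 77, 102, 128, 153, 179, 204, 230]

def split(x):
    ends = SHIFTS[1:] + [255]
    return [(x >> s) & ((1 << (e - s)) - 1) for s, e in zip(SHIFTS, ends)]

def combine(limbs):
    return sum(l << s for l, s in zip(limbs, SHIFTS)) % P

def mul(f, g):
    h = [0] * 10
    for i in range(10):
        for j in range(10):
            k = (i - j) % 10
            c = f[j] * g[k]
            if j > i:                       # wrapped past t^10: factor 19
                c *= 19
            if j % 2 == 1 and k % 2 == 1:   # both limbs carry 2^0.5: factor 2
                c *= 2
            h[i] += c
    return h

a, b = 123456789, 987654321
assert combine(mul(split(a), split(b))) == a * b % P
```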


slide-61
SLIDE 61


16

Each hi is a sum of ten products after precomputation of 2f1, 2f3, 2f5, 2f7, 2f9, 19g1, 19g2, ..., 19g9. Each hi fits into 64 bits under reasonable limits on sizes of f1, g1, ..., f9, g9. (Analyze this very carefully: bugs can slip past most tests! See 2011 Brumley–Page–Barbosa–Vercauteren and several recent OpenSSL bugs.) h0, h1, ... are too large for subsequent multiplication.
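A quick numeric sanity check of the 64-bit claim. The limb bounds below (even limbs below 2^26 in absolute value, odd limbs below 2^25) are assumptions for this sketch; as the slide warns, the real analysis must be far more careful than this:

```python
# Sketch: crude worst-case bound on |h_i|, assuming |even limbs| < 2^26
# and |odd limbs| < 2^25 (assumed bounds, not taken from the slides).
def bound(i):
    total = 0
    for j in range(10):
        k = (i - j) % 10
        c = 19 if j > i else 1              # same coefficients as the formulas
        if j % 2 == 1 and k % 2 == 1:
            c *= 2
        total += c * (1 << (26 if j % 2 == 0 else 25)) \
                   * (1 << (26 if k % 2 == 0 else 25))
    return total

# every h_i fits comfortably even in a signed 64-bit accumulator
assert all(bound(i) < 2**63 for i in range(10))
```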

slide-65
SLIDE 65


17

Carry h0 → h1: i.e., replace (h0, h1) with (h0 mod 2^26, h1 + ⌊h0/2^26⌋). This makes h0 small. Similarly for the other hi. Eventually all hi are small enough. We actually use signed coeffs. Slightly more expensive carries (given details of insn set) but more room for ab + c2 etc. Some things we haven't tried yet:

  • Mix signed, unsigned carries.
  • Interleave reduction, carrying.
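The carry step can be sketched in Python (helpers and the oversized test values are illustrative; the real code uses signed coefficients and NEON insns). The limb width w is 26 for even i and 25 for odd i, and the wrap h9 → h0 picks up a factor 19 since 2^255 ≡ 19 (mod p):

```python
# Sketch of one carry and of the serial chain h0 -> h1 -> ... -> h9 -> h0 -> h1.
P = 2**255 - 19
SHIFTS = [0, 26, 51, 77, 102, 128, 153, 179, 204, 230]

def carry(h, i):
    w = 26 if i % 2 == 0 else 25
    c = h[i] >> w                     # arithmetic shift: floor, works signed
    h[i] -= c << w                    # h[i] mod 2^w
    j = (i + 1) % 10
    h[j] += 19 * c if j == 0 else c   # h9 -> h0 wraps with factor 19

def carry_chain(h):
    # the serial chain from the slide: 11 dependent carries
    for i in (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0):
        carry(h, i)
    return h

def value(h):
    return sum(c << s for c, s in zip(h, SHIFTS)) % P

h = [(1 << 60) + i for i in range(10)]   # oversized coefficients
v = value(h)
carry_chain(h)
assert value(h) == v                     # value mod p is preserved
assert all(0 <= c < 2**26 for c in h)    # all limbs now small
```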

slide-69
SLIDE 69


18

Minor challenge: pipelining. Result of each insn cannot be used until a few cycles later. Find an independent insn for the CPU to start working on while the first insn is in progress. Sometimes helps to adjust higher-level computations. Example: carries h0 → h1 → h2 → h3 → h4 → h5 → h6 → h7 → h8 → h9 → h0 → h1 have long chain of dependencies.


slide-73
SLIDE 73


19

Alternative: carry h0 → h1 and h5 → h6; h1 → h2 and h6 → h7; h2 → h3 and h7 → h8; h3 → h4 and h8 → h9; h4 → h5 and h9 → h0; h5 → h6 and h0 → h1. 12 carries instead of 11, but latency is much smaller. Now much easier to find independent insns for CPU to handle in parallel.
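A sketch of the interleaved schedule (pairing as on the slide; the carry helper and test values are illustrative). Carry i writes only h[i] and h[i+1], so the two carries in each row touch disjoint limbs and can overlap in the pipeline:

```python
# Sketch: 12 carries in 6 independent pairs; dependency chains of length 6
# instead of 11.
P = 2**255 - 19
SHIFTS = [0, 26, 51, 77, 102, 128, 153, 179, 204, 230]

def carry(h, i):
    w = 26 if i % 2 == 0 else 25
    c = h[i] >> w
    h[i] -= c << w
    j = (i + 1) % 10
    h[j] += 19 * c if j == 0 else c

PAIRS = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9), (5, 0)]

def carry_interleaved(h):
    for a, b in PAIRS:
        carry(h, a)      # independent of the next carry:
        carry(h, b)      # disjoint limbs, so the CPU can overlap them
    return h

def value(h):
    return sum(c << s for c, s in zip(h, SHIFTS)) % P

h = [(1 << 60) + i for i in range(10)]
v = value(h)
carry_interleaved(h)
assert value(h) == v                    # value mod p preserved
assert all(0 <= c < 2**27 for c in h)   # all limbs small enough
```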


slide-77
SLIDE 77


20

Major challenge: vectorization. e.g. 4x a = b + c does 4 additions at once, but needs a particular arrangement of inputs and outputs. On Cortex-A8, occasional permutations run in parallel with arithmetic, but frequent permutations would be a bottleneck. On Cortex-A7, every operation costs cycles.


slide-85
SLIDE 85

21

Often higher-level operations do a pair of mults in parallel: h = f·g; h′ = f′·g′. Vectorize across those mults. Merge f0, f1, ..., f9 and f′0, f′1, ..., f′9 into vectors (fi, f′i). Similarly (gi, g′i). Then compute (hi, h′i). Computation fits naturally into NEON insns: e.g., c[0,1] = a[0] signed* b[0]; c[2,3] = a[1] signed* b[1].
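A plain-Python model of the paired computation (the split/combine helpers and tuple "lanes" are illustrative stand-ins; real code uses NEON's paired 32×32→64-bit multiplies). Each merged limb is a 2-lane pair (f_i, f′_i), and one lane-wise multiply-accumulate produces contributions to (h_i, h′_i) in both lanes at once:

```python
# Sketch: vectorizing across two independent mults h = f*g and h' = f'*g'.
P = 2**255 - 19
SHIFTS = [0, 26, 51, 77, 102, 128, 153, 179, 204, 230]

def split(x):
    ends = SHIFTS[1:] + [255]
    return [(x >> s) & ((1 << (e - s)) - 1) for s, e in zip(SHIFTS, ends)]

def combine(limbs):
    return sum(l << s for l, s in zip(limbs, SHIFTS)) % P

def mulmul(fv, gv):
    """fv[i] = (f_i, f'_i), gv[i] = (g_i, g'_i) -> hv[i] = (h_i, h'_i)."""
    hv = []
    for i in range(10):
        acc = [0, 0]
        for j in range(10):
            k = (i - j) % 10
            c = (19 if j > i else 1) * (2 if j % 2 and k % 2 else 1)
            acc[0] += c * fv[j][0] * gv[k][0]   # lane 0: h
            acc[1] += c * fv[j][1] * gv[k][1]   # lane 1: h'
        hv.append(tuple(acc))
    return hv

a, b, ap, bp = 3, 5, 7, 11
hv = mulmul(list(zip(split(a), split(ap))), list(zip(split(b), split(bp))))
assert combine([x[0] for x in hv]) == a * b % P
assert combine([x[1] for x in hv]) == ap * bp % P
```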



slide-91
SLIDE 91

22

Example: Recall C = X1 · X2; D = Y1 · Y2 inside point-addition formulas for Edwards curves. Example: Can compute 2P; 3P; 4P; 5P; 6P; 7P as 2P = P + P; 3P = 2P + P and 4P = 2P + 2P; 5P = 4P + P and 6P = 3P + 3P and 7P = 4P + 3P. Example: Typical algorithms for fixed-base scalarmult have many parallel point adds.
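The 2P, ..., 7P schedule above has independent additions at several steps, which is what makes it vectorizable. A toy sketch (integers stand in for curve points, so nP is just n; `add` is a placeholder for a real, vectorizable Edwards point addition):

```python
# Toy sketch: the slide's addition schedule, with pairs of independent adds.
def add(p, q):
    return p + q          # placeholder for Edwards point addition

P1 = 1
p2 = add(P1, P1)                       # 2P = P + P
p3, p4 = add(p2, P1), add(p2, p2)      # 3P and 4P: independent pair
p5, p6 = add(p4, P1), add(p3, p3)      # 5P and 6P: independent pair
p7 = add(p4, p3)                       # 7P = 4P + 3P
assert (p2, p3, p4, p5, p6, p7) == (2, 3, 4, 5, 6, 7)
```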



23

Example: A busy server with a backlog of scalarmults can vectorize across them. Beware a disadvantage of vectorizing across two mults: 256-bit f, f′, g, g′, h, h′ occupy at least 1536 bits, leaving very little room for temporary registers. We use some loads and stores inside vectorized mulmul. Mostly invisible on Cortex-A8, but a bigger issue on Cortex-A7.


slide-96
SLIDE 96

23

Example: A busy server with a backlog of scalarmults can vectorize across them. Beware a disadvantage of vectorizing across two mults: 256-bit f ; f ′; g; g′; h; h′

  • ccupy at least 1536 bits,

leaving very little room for temporary registers. We use some loads and stores inside vectorized mulmul. Mostly invisible on Cortex-A8, but bigger issue on Cortex-A7.

24

Some field ops are hard to pair inside a single scalarmult. Example: At end of ECDH, convert the fraction (X : Z) into Z^−1·X ∈ {0, 1, ..., p − 1}. Easy, constant time: Z^−1 = Z^(p−2). 11M + 254S for p = 2^255 − 19:


z2 = z1^2^1
z8 = z2^2^2
z9 = z1*z8
z11 = z2*z9
z22 = z11^2^1
z_5_0 = z9*z22
z_10_5 = z_5_0^2^5
z_10_0 = z_10_5*z_5_0
z_20_10 = z_10_0^2^10
z_20_0 = z_20_10*z_10_0
z_40_20 = z_20_0^2^20
z_40_0 = z_40_20*z_20_0
z_50_10 = z_40_0^2^10
z_50_0 = z_50_10*z_10_0
z_100_50 = z_50_0^2^50
z_100_0 = z_100_50*z_50_0
z_200_100 = z_100_0^2^100
z_200_0 = z_200_100*z_100_0
z_250_50 = z_200_0^2^50
z_250_0 = z_250_50*z_50_0
z_255_5 = z_250_0^2^5
z_255_21 = z_255_5*z11
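The listing above is an addition chain; `z^2^k` denotes k successive squarings. A direct Python transcription (variable names follow the listing) checks that it really computes z^(p−2) = z^−1 mod p with 11 multiplications and 1+2+1+5+10+20+10+50+100+50+5 = 254 squarings:

```python
p = 2**255 - 19

def sq(z, k):
    # k repeated squarings: z^(2^k) mod p
    for _ in range(k):
        z = z * z % p
    return z

def invert(z1):
    z2 = sq(z1, 1)
    z8 = sq(z2, 2)
    z9 = z1 * z8 % p
    z11 = z2 * z9 % p
    z22 = sq(z11, 1)
    z_5_0 = z9 * z22 % p            # z1^(2^5 - 1)
    z_10_5 = sq(z_5_0, 5)
    z_10_0 = z_10_5 * z_5_0 % p     # z1^(2^10 - 1)
    z_20_10 = sq(z_10_0, 10)
    z_20_0 = z_20_10 * z_10_0 % p   # z1^(2^20 - 1)
    z_40_20 = sq(z_20_0, 20)
    z_40_0 = z_40_20 * z_20_0 % p   # z1^(2^40 - 1)
    z_50_10 = sq(z_40_0, 10)
    z_50_0 = z_50_10 * z_10_0 % p   # z1^(2^50 - 1)
    z_100_50 = sq(z_50_0, 50)
    z_100_0 = z_100_50 * z_50_0 % p
    z_200_100 = sq(z_100_0, 100)
    z_200_0 = z_200_100 * z_100_0 % p
    z_250_50 = sq(z_200_0, 50)
    z_250_0 = z_250_50 * z_50_0 % p # z1^(2^250 - 1)
    z_255_5 = sq(z_250_0, 5)
    z_255_21 = z_255_5 * z11 % p    # z1^(2^255 - 21) = z1^(p - 2)
    return z_255_21
```

By Fermat's little theorem, invert(z) · z ≡ 1 (mod p) for any nonzero z.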


26

Can still vectorize inside a single field op. Strategy in our software: 50 mul insns starting from

(f0,2f1); (f2,2f3); (f4,2f5); (f6,2f7); (f8,2f9);
(f1,f8); (f3,f0); (f5,f2); (f7,f4); (f9,f6);
(g0,g1); (g2,g3); (g4,g5); (g6,g7);
(g0,19g1); (g2,19g3); (g4,19g5); (g6,19g7); (g8,19g9);
(19g2,19g3); (19g4,19g5); (19g6,19g7); (19g8,19g9);
(19g2,g3); (19g4,g5); (19g6,g7); (19g8,g9).

Change carry pattern to vectorize, e.g., (h0, h4) → (h1, h5).
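The operand pairs above come from a 10-limb radix-2^25.5 representation of field elements. As a hypothetical pure-Python sketch (not the NEON code): limb i holds the bits of the integer starting at position ceil(25.5·i), so limb sizes alternate 26 and 25 bits:

```python
import math

# Bit offsets 0, 26, 51, 77, 102, 128, 153, 179, 204, 230, 255:
OFFS = [math.ceil(25.5 * i) for i in range(11)]

def to_limbs(x):
    # Split 0 <= x < 2^255 into 10 limbs of alternating 26/25 bits.
    return [(x >> OFFS[i]) & ((1 << (OFFS[i + 1] - OFFS[i])) - 1)
            for i in range(10)]

def from_limbs(f):
    # Recombine limbs into the integer they represent.
    return sum(limb << off for limb, off in zip(f, OFFS))
```

The factors 2 and 19 in the pair listing are exactly the correction factors this unbalanced radix and the reduction modulo 2^255 − 19 introduce into the schoolbook product.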


27

Core arithmetic: 100 cycles on mul insns for each field mul. Squarings are somewhat faster. Some loss for carries etc.

ECDH: ≈10 field muls · 255 bits. More detailed analysis: 356019 cycles on arithmetic; ≈78% of the software's total Cortex-A8-fast cycles for ECDH. Still some room for improvement.

Each CPU is a new adventure. E.g., could it be better to use the Cortex-A7 FPU with radix 2^21.25?
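As a rough sanity check of these numbers (assumption: the per-mul and per-bit counts above are taken at face value):

```python
cycles_per_mul = 100   # cycles on mul insns per field mul
muls_per_bit = 10      # approx. field muls per ladder bit
bits = 255             # ladder length for Curve25519

estimate = cycles_per_mul * muls_per_bit * bits
print(estimate)  # 255000
```

That lower bound of 255000 cycles is in the same ballpark as the measured 356019 cycles on arithmetic; the gap is the "loss for carries etc." and the cheaper-but-not-free squarings.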


28

Much more work to do. https://bench.cr.yp.to: benchmarks for (currently) 2137 public implementations of hundreds of crypto primitives: 39 DH primitives, 56 signature primitives, 304 authenticated ciphers, etc.

Many interesting primitives are far slower than necessary on many important CPUs.

Exercise: Make them faster!