How cryptographic benchmarking goes wrong - PowerPoint PPT Presentation

Jun 24, 2023

1

How cryptographic benchmarking goes wrong

Daniel J. Bernstein

Thanks to NIST 60NANB12D261 for funding this work, and for not reviewing these slides in advance.

PRESERVE, ending 2015.06.30, was a European project “Preparing Secure Vehicle-to-X Communication Systems”. Project cost: 5383431 EUR, including 3850000 EUR from the European Commission.

2

“About PRESERVE”: “The mission of PRESERVE is, to design, implement, and test a secure and scalable V2X Security Subsystem for realistic deployment scenarios. … [Expected Results:]
1. Harmonized V2X Security Architecture.
2. Implementation of V2X Security Subsystem.
3. Cheap and scalable security ASIC for V2X.
4. Testing results VSS under realistic conditions.
5. Research results for deployment challenges.”

3

Cars already include many CPUs. Why build an ASIC? PRESERVE deliverable 1.1, “Security Requirements of Vehicle Security Architecture”, 2011: “Processing 1,000 packets per second and processing each in 1 ms can hardly be met by current hardware. As discussed in [32], a Pentium D 3.4 GHz processor needs about 5 times as long for a verification … a dedicated cryptographic co-processor is likely to be necessary.”
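
A minimal sketch of the requirement arithmetic, assuming verifications run back-to-back on one core with no packet handling or other load. The helper name `meets_requirement` and its defaults are hypothetical; the 1,000 packets/second and 1 ms figures come from the quote, and 5 ms reads “about 5 times as long” against the 1 ms budget.

```python
import math


def meets_requirement(verify_ms: float, packets_per_s: int = 1000,
                      max_latency_ms: float = 1.0) -> dict:
    """Hypothetical helper: check one core against a throughput and latency target.

    Assumes verifications run back-to-back on a single core with no other work,
    so the per-core rate is simply 1000 ms divided by the time per verification.
    """
    per_core_rate = 1000.0 / verify_ms                        # verifications per second
    cores_needed = math.ceil(packets_per_s / per_core_rate)   # cores for the throughput target
    return {
        "per_core_rate": per_core_rate,
        "cores_needed": cores_needed,
        "latency_ok": verify_ms <= max_latency_ms,             # each packet within budget?
    }


# "about 5 times as long" as the 1 ms budget: roughly 5 ms per verification
print(meets_requirement(5.0))
```

Under these assumptions, extra cores can cover the throughput target, but they do not bring a single 5 ms verification under the 1 ms per-packet budget unless one verification is itself split across cores.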

4

PRESERVE deliverable 5.4, “Deployment Issues Report V4”, 2016: “the number of ECC signature verifications per second is the key performance factor for ASICs in a C2C environment … [On a 4mm×4mm chip] the 180nm technology may only yield enough space for one ECC core, whereas 90nm will allow for up to ten ECC cores and 55nm will allow for even more.” For 180nm core says max 100MHz, 100 verif/second.
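
Converting the quoted 180nm figures into cycles per verification makes them comparable with software numbers. This sketch assumes the core runs at its stated maximum clock and does nothing but verifications; `cycles_per_op` is a hypothetical helper name.

```python
def cycles_per_op(clock_hz: float, ops_per_s: float) -> float:
    """Clock cycles available per operation if the core does nothing else."""
    return clock_hz / ops_per_s


# PRESERVE 180nm ECC core as quoted: max 100MHz, 100 verifications/second
preserve_cycles = cycles_per_op(100e6, 100)
print(f"{preserve_cycles:,.0f} cycles per verification")
```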

5

Compare to, e.g., IAIK NIST P-256 ECC Module: 858 scalarmult/second in 111620 GE at 192 MHz at 180nm (“UMC L180GII technology using Faraday f180 standard cell library (FSA0A C), 9.3744 µm²/GE; worst case conditions (temperature 125°C, core voltage 1.62V)”). Signature verification will be somewhat slower than scalarmult. Still close to 100× more efficient than the PRESERVE estimates.
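
One way to see the size of the gap is throughput per unit area. This rough sketch assumes, as a simplification not stated on the slide, that the single PRESERVE core occupies the whole 4mm×4mm chip, and it treats one scalarmult as one verification even though verification is somewhat slower. The constant names are illustrative.

```python
# Figures quoted for the IAIK NIST P-256 module at 180nm
IAIK_RATE = 858            # scalar multiplications per second
IAIK_GE = 111_620          # gate equivalents
UM2_PER_GE = 9.3744        # µm² per gate equivalent (Faraday f180 cell library)
IAIK_CLOCK_HZ = 192e6

# Figures quoted from PRESERVE deliverable 5.4 at 180nm: one core on a 4mm×4mm chip
PRESERVE_RATE = 100        # verifications per second
CHIP_MM2 = 4.0 * 4.0

iaik_mm2 = IAIK_GE * UM2_PER_GE / 1e6          # µm² -> mm²
iaik_per_mm2 = IAIK_RATE / iaik_mm2            # operations/second per mm²
preserve_per_mm2 = PRESERVE_RATE / CHIP_MM2    # assumption: the one core fills the chip

print(f"IAIK core area: {iaik_mm2:.3f} mm^2")
print(f"cycles per scalarmult: {IAIK_CLOCK_HZ / IAIK_RATE:,.0f}")
print(f"throughput per mm^2 ratio: {iaik_per_mm2 / preserve_per_mm2:.0f}x")
```

The raw ratio lands somewhat above 100×; since verification costs more than one scalarmult, the verification ratio would be lower, which is consistent with “close to 100×”.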

6

Let’s go back to PRESERVE’s core argument for an ASIC. Central claim: “As discussed in [32], a Pentium D 3.4 GHz processor needs about” 5ms (i.e., 17 million CPU cycles) for signature verification. [32] is “Petit, J., Mammeri, Z., ‘Analysis of authentication overhead in vehicular networks’, Third Joint IFIP Wireless and Mobile Networking Conference (WMNC), 2010.”
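
The “17 million CPU cycles” figure is time multiplied by clock rate. A minimal sketch, assuming the processor runs at exactly its nominal 3.4 GHz throughout; `ms_to_cycles` is a hypothetical helper name.

```python
def ms_to_cycles(ms: float, clock_hz: float) -> float:
    """Convert milliseconds to clock cycles, assuming a fixed clock frequency."""
    return ms / 1000.0 * clock_hz


# "needs about" 5ms on a Pentium D 3.4 GHz
pentium_d_cycles = ms_to_cycles(5.0, 3.4e9)
print(f"{pentium_d_cycles:,.0f} cycles")
```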

7

[32] says “1. Introduction. Due to the huge life losses and the economic impacts resulting from vehicular collisions, many governments, automotive companies, and industry consortia have made the reduction of vehicular fatalities a top priority [1]. On average, vehicular collisions cause 102 deaths and 7900 injuries daily in the United States, leaving an economic impact of $230 billion [2]. … [Similar story for EU:] costing €160 billion annually [3].”

8

Vehicles will communicate safety information. “All implementations of IEEE1609.2 standard [7] shall support the Elliptic Curve Digital Signature Algorithm (ECDSA) [8] over the two NIST curves P-224 and P-256. … In this paper, we assess the processing and communication overhead of the authentication mechanism provided by ECDSA. … Table II. Signature generation and verification times on a Pentium D 3.4Ghz workstation [10]”

9

[10] (in [32]) is “Petit J., ‘Analysis of ECDSA Authentication Processing in VANETs’, 3rd IFIP International Conference on New Technologies, Mobility and Security (NTMS), Cairo, December 2009.” [10] says “ECDSA was implemented using MIRACL and following the Fig.1.” For NIST P-224/P-256 on “Pentium D 3.4GHz workstation”: 2.50ms/3.33ms to sign, 4.97ms/6.63ms to verify.
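
Expressing the four quoted timings from [10] as cycles and as single-core rates shows which one the “about 5ms” claim picks up: it is closest to P-224 verification (4.97ms), while P-256 verification is 6.63ms. This sketch assumes a constant 3.4 GHz clock and back-to-back operations; the variable names are illustrative.

```python
# Timings quoted from [10]: MIRACL-based ECDSA on a Pentium D 3.4GHz, in milliseconds
PENTIUM_D_HZ = 3.4e9
timings_ms = {
    ("P-224", "sign"): 2.50,
    ("P-224", "verify"): 4.97,
    ("P-256", "sign"): 3.33,
    ("P-256", "verify"): 6.63,
}

for (curve, op), ms in timings_ms.items():
    cycles = ms / 1000.0 * PENTIUM_D_HZ   # assumes the nominal clock throughout
    rate = 1000.0 / ms                    # operations per second on one core
    print(f"{curve} {op:6}: {ms:5.2f} ms = {cycles / 1e6:5.1f}M cycles, {rate:4.0f}/s")
```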

10

Compare to, e.g., Ed25519 speeds reported for single core of 14nm 3.31GHz Skylake (“2015 Intel Core i5-6600”) on https://bench.cr.yp.to: 0.015ms to sign (49840 cycles), 0.049ms to verify (163206 cycles). This chip didn’t exist in 2009. Compare instead to single core of 65nm 2.4GHz Core 2 (“2007 Intel Core 2 Quad Q6600”): 0.065ms to sign (156843 cycles), 0.232ms to verify (557082 cycles).
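
The millisecond figures follow from the quoted cycle counts and clock rates. The sketch below recomputes them and forms a rough wall-clock ratio against the 6.63ms P-256 verification from [10]. That ratio compares Ed25519 with ECDSA P-256 and different software, as the slide does, so it is indicative only; the dictionary layout is illustrative.

```python
# Single-core Ed25519 cycle counts quoted on the slide (from https://bench.cr.yp.to)
ed25519 = {
    "Core i5-6600 (14nm, 3.31GHz)": (3.31e9, 49_840, 163_206),
    "Core 2 Q6600 (65nm, 2.4GHz)": (2.4e9, 156_843, 557_082),
}
PENTIUM_D_VERIFY_MS = 6.63  # ECDSA P-256 verify on Pentium D 3.4GHz, from [10]

for cpu, (hz, sign_cycles, verify_cycles) in ed25519.items():
    sign_ms = sign_cycles / hz * 1000      # cycles -> seconds -> milliseconds
    verify_ms = verify_cycles / hz * 1000
    print(f"{cpu}: sign {sign_ms:.3f} ms, verify {verify_ms:.3f} ms, "
          f"verify ratio vs 6.63 ms: {PENTIUM_D_VERIFY_MS / verify_ms:.0f}x")
```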

SLIDE 35

9

(in [32]) is “Petit ‘Analysis of ECDSA Authentication Processing in ANETs’, 3rd IFIP International Conference on New Technologies, Mobility and Security (NTMS), December 2009.” ys “ECDSA was implemented using MIRACL following the Fig.1.” NIST P-224/P-256 on entium D 3.4GHz workstation”: 2.50ms/3.33ms to sign, 4.97ms/6.63ms to verify.

10

Compare to, e.g., Ed25519 speeds reported for single core of 14nm 3.31GHz Skylake (“2015 Intel Core i5-6600”) on https://bench.cr.yp.to: 0.015ms to sign (49840 cycles), 0.049ms to verify (163206 cycles).

This chip didn’t exist in 2009. Compare instead to single core of 65nm 2.4GHz Core 2 (“2007 Intel Core 2 Quad Q6600”): 0.065ms to sign (156843 cycles), 0.232ms to verify (557082 cycles).
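The millisecond figures follow from the cycle counts divided by the clock rate. A minimal Python sketch of that conversion; `cycles_to_ms` is a hypothetical helper of ours, and it assumes the core ran at exactly the quoted clock rate for the whole operation (no Turbo Boost or other frequency scaling), which the slide does not state:

```python
def cycles_to_ms(cycles: int, clock_ghz: float) -> float:
    """Convert a cycle count to milliseconds at a fixed clock rate.

    Assumes the core runs at exactly clock_ghz throughout
    (no Turbo Boost, no frequency scaling)."""
    return cycles / (clock_ghz * 1e9) * 1e3


# Ed25519 cycle counts quoted above: (label, cycles, clock in GHz).
measurements = [
    ("Skylake 3.31GHz sign", 49840, 3.31),
    ("Skylake 3.31GHz verify", 163206, 3.31),
    ("Core 2 2.4GHz sign", 156843, 2.4),
    ("Core 2 2.4GHz verify", 557082, 2.4),
]

for label, cycles, ghz in measurements:
    print(f"{label}: {cycles_to_ms(cycles, ghz):.3f} ms")
```

With the four cycle counts above it reproduces the 0.015, 0.049, 0.065 and 0.232 ms figures. Reporting cycles alongside milliseconds lets readers rescale to other clock rates of the same microarchitecture.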

11

2012 Bernstein–Schwabe on 720MHz ARM Cortex-A8: 0.9ms to verify (650102 cycles).

ARM Cortex-A8 cores were in 1000MHz Apple A4 in iPad 1, iPhone 4 (2010); 1000MHz Samsung Exynos 3110 in Samsung Galaxy S (2010); 1000MHz TI OMAP3630 in Motorola Droid X (2010); 800MHz Freescale i.MX50 in Amazon Kindle 4 (2011); ...

Today: in CPUs costing ≈2 EUR. Cortex-A7 is even more popular.

12

180nm 32-bit 2GHz Willamette (“2001 Intel Pentium 4”): 0.46ms (0.9 million cycles) for Curve25519 scalarmult using floating-point multiplier. Integer multiplier is much slower! Nobody has ever bothered adapting this to signatures. Would be ≈0.6ms for verify.

3.4GHz Pentium D (dual core): same basic microarchitecture, more instructions, faster clock. Ed25519 would be >10× faster on one core than Petit’s software.
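The “>10×” estimate rests on clock scaling. A rough Python sketch of that arithmetic, assuming (as the slide suggests, but does not measure) that the cycle count carries over unchanged from the 2GHz Willamette to the 3.4GHz Pentium D, so time scales inversely with clock rate. The ≈0.6ms verify figure is the slide's own estimate, not a benchmark result, and `scale_by_clock` is a hypothetical helper:

```python
def scale_by_clock(ms: float, from_ghz: float, to_ghz: float) -> float:
    """Rescale a timing to another clock rate, assuming the cycle count
    stays the same (same basic microarchitecture)."""
    return ms * from_ghz / to_ghz


# The slide's estimate: Ed25519 verify ~0.6 ms on a 2GHz Pentium 4.
ed25519_verify_pd = scale_by_clock(0.6, 2.0, 3.4)  # estimated ms on 3.4GHz Pentium D

# Petit's reported verification times on the 3.4GHz Pentium D.
petit_verify = {"P-224": 4.97, "P-256": 6.63}

for curve, ms in petit_verify.items():
    ratio = ms / ed25519_verify_pd
    print(f"{curve}: Petit {ms} ms vs Ed25519 est. {ed25519_verify_pd:.2f} ms, ratio {ratio:.1f}x")
```

The clock adjustment matters here: without it, 4.97/0.6 is only about 8.3×; with it, both ratios come out above 10×.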

13

Bad ECDSA-NIST-P-256 design certainly has some impact:

can’t use fastest mulmods;
can’t use fastest curve formulas;
need an annoying inversion;
etc. Typical estimate: 2× slower.
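As one illustration of what “fastest mulmods” can mean (our example, not spelled out on the slide): Curve25519 and Ed25519 work modulo p = 2^255 − 19, and since 2^255 ≡ 19 (mod p), the high part of a product folds into the low part with a single multiplication by 19. The NIST P-256 prime, 2^256 − 2^224 + 2^192 + 2^96 − 1, also has a special form, but its fast reduction involves several added and subtracted terms. A minimal Python sketch of the folding idea on arbitrary-precision integers; real implementations apply it to fixed-size limbs:

```python
P25519 = 2**255 - 19


def fold_reduce(x: int) -> int:
    """Reduce a non-negative integer mod 2^255 - 19 by folding.

    Writes x = hi * 2^255 + lo and uses 2^255 = 19 (mod p), so
    x = lo + 19 * hi (mod p). Each fold strictly shrinks x."""
    mask = (1 << 255) - 1
    while x >> 255:                      # repeat until x < 2^255
        x = (x & mask) + 19 * (x >> 255)
    if x >= P25519:                      # x < 2^255 = p + 19: one subtraction suffices
        x -= P25519
    return x


def mulmod(a: int, b: int) -> int:
    """Multiply two field elements mod 2^255 - 19 using the folding reduction."""
    return fold_reduce(a * b)
```

For example, `fold_reduce(2**255)` gives 19, as the congruence predicts.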

2000 Brown–Hankerson–López–Menezes on 400MHz Pentium II: 4.0ms/6.4ms (1.6/2.6 million cycles) for double scalarmult inside NIST P-224/P-256 verif.

2001 Bernstein, ≈1.6× faster: 0.7 million cycles on Pentium II for NIST P-224 scalarmult.
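For reference, this is where the “double scalarmult” in ECDSA verification comes from, in the standard formulation (notation assumed here: base point G of prime order n, public key Q, message hash e, signature (r, s), after checking 1 ≤ r, s ≤ n − 1). The inversion of s modulo n is presumably one instance of the “annoying inversion” listed above:

```latex
\begin{aligned}
  w &= s^{-1} \bmod n \\
  u_1 &= e\,w \bmod n, \qquad u_2 = r\,w \bmod n \\
  (x_1, y_1) &= u_1 G + u_2 Q \\
  \text{accept} &\iff (x_1, y_1) \neq \mathcal{O} \ \text{and}\ x_1 \bmod n = r
\end{aligned}
```

The third line is the double scalar multiplication; implementations usually compute it jointly (e.g. with Shamir's trick) rather than as two independent scalarmults.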

14

2000 Brown–Hankerson–López–Menezes software uses many more cycles on P4 than on PII. E.g., P-224 scalarmult:

1.2 million cycles on Pentium II.
2.7 million cycles on Pentium 4.

2001 Bernstein P-224 scalarmult:

0.7 million cycles on Pentium II.
0.8 million cycles on Pentium 4.
0.9 million cycles on Pentium 4 using compressed keys.

OpenSSL 1.0.1, P-224 verif: 2.0 million cycles on Pentium D.
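The choice of CPU changes the headline comparison. A small Python sketch computing the ratio between the two P-224 scalarmult implementations on each chip, using only the rounded cycle counts above (Bernstein's uncompressed-key figures):

```python
# P-224 scalarmult, millions of cycles, as quoted above.
bhlm = {"Pentium II": 1.2, "Pentium 4": 2.7}       # Brown–Hankerson–López–Menezes (2000)
bernstein = {"Pentium II": 0.7, "Pentium 4": 0.8}  # Bernstein (2001), uncompressed keys

for cpu in bhlm:
    speedup = bhlm[cpu] / bernstein[cpu]
    print(f"{cpu}: BHLM cycles / Bernstein cycles = {speedup:.2f}x")
```

With these rounded counts the ratio is under 2× on the Pentium II but above 3× on the Pentium 4: the same two pieces of software, compared on a different chip, tell a noticeably different story.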

15

How did Petit manage to use 17 million cycles for P-224 verif, 22 million cycles for P-256 verif? Presumably some combination of bad mulmod and bad curve ops. Why did Petit reimplement ECDSA, using MIRACL for the underlying arithmetic? Why did Petit not simply cite previous speed literature? Why did Petit choose Pentium D? Why did Brown–Hankerson–López–Menezes (BHLM) choose PII?
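The 17 and 22 million figures come from multiplying Petit's verification times by the 3.4GHz clock. A minimal Python sketch, assuming the Pentium D ran at its nominal clock throughout (`ms_to_cycles` is a hypothetical helper); it also compares against the 2.0 million cycles quoted for OpenSSL 1.0.1 P-224 verification on a Pentium D:

```python
def ms_to_cycles(ms: float, clock_ghz: float) -> float:
    """Cycles implied by a timing at a fixed clock rate (no frequency scaling assumed)."""
    return ms * 1e-3 * clock_ghz * 1e9


PENTIUM_D_GHZ = 3.4
petit_verify_ms = {"P-224": 4.97, "P-256": 6.63}  # Petit's reported verify times
openssl_p224_verify_cycles = 2.0e6                # OpenSSL 1.0.1, Pentium D

for curve, ms in petit_verify_ms.items():
    cycles = ms_to_cycles(ms, PENTIUM_D_GHZ)
    print(f"{curve} verify: about {cycles / 1e6:.1f} million cycles")

ratio = ms_to_cycles(petit_verify_ms["P-224"], PENTIUM_D_GHZ) / openssl_p224_verify_cycles
print(f"P-224 verify, Petit vs OpenSSL 1.0.1: {ratio:.1f}x more cycles")
```

This gives about 16.9 and 22.5 million cycles, and Petit's P-224 verification comes out at more than 8× the OpenSSL cycle count on the same kind of CPU.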

16

Petit: “There are three main cryptographic libraries: MIRACL, OpenSSL and Crypto++. Authors in [21] proposed a comparison and concluded that MIRACL has the best performance for operations on elliptic curves over binary fields.”

But NIST P-224 and NIST P-256 are defined over prime fields! [21] says “For elliptic curves over prime fields, OpenSSL has the best performance under all platforms.”
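Why the distinction matters: the two kinds of field use different arithmetic, so a library ranking for one says little about the other. A toy Python sketch contrasting them. The binary field shown is the small GF(2^8) used by AES, chosen only because it has well-known test vectors in FIPS 197 (e.g. {57}·{83} = {c1}); it is not one of the large binary fields used for elliptic curves:

```python
# Prime-field multiplication: ordinary integer multiply, then reduce mod p.
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1  # the NIST P-256 field prime


def gf_p_mul(a: int, b: int, p: int = P256) -> int:
    return (a * b) % p


# Binary-field multiplication: carry-less (XOR) polynomial multiply,
# reducing modulo an irreducible polynomial along the way.
def gf2m_mul(a: int, b: int, modulus: int = 0x11B, m: int = 8) -> int:
    """Multiply in GF(2^m); default is GF(2^8) mod x^8 + x^4 + x^3 + x + 1.

    Expects 0 <= a < 2^m."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # field addition is XOR: no carries
        b >>= 1
        a <<= 1              # multiply a by x
        if a >> m:           # degree reached m: subtract (XOR) the modulus
            a ^= modulus
    return result
```

The prime-field version leans on the CPU's integer multiplier; the binary-field version needs carry-less multiplication, which general-purpose CPUs of that era did not provide directly. Benchmarks of one therefore do not transfer to the other.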

17

More general situation: Paper analyzes impact of crypto upon an application. If the crypto sounds fast: Why is the paper interesting? Why should it be published? If the crypto sounds slower: Paper is more interesting. Look, here’s a speed problem! More likely to be published. More likely to motivate funding to fix the problem.

18

Obvious question whenever an application considers crypto deployment: “Is it fast enough?” Many random methodologies for answering this question. Which CPU to test? What to take from literature and libraries? Reuse mulmod, or curve ops, or more? Slowest, least competent answers are most likely to be published. Situation is fully explainable by randomness + natural selection. There’s no evidence that Petit deliberately slowed down crypto.
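One way to make at least some of these methodological choices explicit is to record them alongside the numbers. A minimal, hypothetical Python harness (the name `benchmark` and its report fields are our own choices) that repeats a measurement, reports the median rather than a single run, and records which machine it ran on. It measures wall-clock time with `time.perf_counter_ns`, not CPU cycles, so it is no substitute for cycle-accurate tools:

```python
import platform
import statistics
import time
from typing import Callable


def benchmark(fn: Callable[[], object], runs: int = 101) -> dict:
    """Time fn() `runs` times; report median and minimum plus platform details.

    Wall-clock nanoseconds, not cycles: results still depend on clock
    scaling, other load on the machine, and the interpreter itself."""
    fn()  # warm-up call, excluded from the timings
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return {
        "median_ns": statistics.median(samples),
        "min_ns": min(samples),
        "runs": runs,
        "machine": platform.machine(),
        "processor": platform.processor(),  # may be an empty string on some systems
        "python": platform.python_version(),
    }


if __name__ == "__main__":
    p = 2**255 - 19
    # Stand-in workload: a modular inversion via Fermat's little theorem.
    print(benchmark(lambda: pow(12345, p - 2, p)))
```

Publishing the platform fields together with the median does not settle which CPU *should* be tested, but it at least makes the choice visible to readers.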

19

Paper introducing new crypto software or hardware has same incentive to report older crypto as slow, and analogous incentive to report its own crypto as fast. Paper will naturally select functions, parameters, input lengths, platforms, I/O format, timing mechanism, etc. that maximize reported improvement from old to new. This is not the same as selecting what matters most for the users.
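Input length is an easy place to see such selection at work. Under a simple cost model (an assumption for illustration: a fixed per-call overhead plus a constant per-byte cost, with made-up numbers for two hypothetical implementations), cycles per byte depend heavily on which message length a paper chooses to report:

```python
def cycles_per_byte(length: int, overhead: float, per_byte: float) -> float:
    """Cycles/byte under a toy model: total = overhead + per_byte * length."""
    return (overhead + per_byte * length) / length


# Hypothetical implementations, invented numbers: "new" has a cheaper
# inner loop but a much larger fixed setup cost than "old".
old = {"overhead": 200.0, "per_byte": 4.0}
new = {"overhead": 2000.0, "per_byte": 2.0}

for length in (16, 64, 1024, 16384):
    o = cycles_per_byte(length, **old)
    n = cycles_per_byte(length, **new)
    print(f"{length:6d} bytes: old {o:7.2f} c/B, new {n:7.2f} c/B")
```

Reporting only the 16384-byte figure makes the new code look almost twice as fast (about 2.1 vs 4.0 cycles/byte); reporting only the 16-byte figure makes it look nearly 8× slower (127 vs 16.5). Which one matters depends on the application's actual message sizes.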

SLIDE 72

18

Obvious question whenever an application considers crypto yment: “Is it fast enough?” random methodologies for ering this question. Which to test? What to take from literature and libraries? Reuse d, or curve ops, or more? st, least competent answers most likely to be published. Situation is fully explainable by randomness + natural selection. There’s no evidence that Petit erately slowed down crypto.

19

SLIDE 75

19

Paper introducing new crypto software or hardware has same incentive to report older crypto as slow, and analogous incentive to report its own crypto as fast. Paper will naturally select functions, parameters, input lengths, platforms, I/O format, timing mechanism, etc. that maximize reported improvement from old to new. This is not the same as selecting what matters most for the users.

20

Bit operations per bit of plaintext (assuming precomputed subkeys), not entirely listed in recent Skinny paper (the Salsa20 rows are not in its list):

key   ops/bit   cipher
256   54        Salsa20/8
256   78        Salsa20/12
128   88        Simon: 60 ops broken
128   100       NOEKEON
128   117       Skinny
256   126       Salsa20
256   144       Simon: 106 ops broken
128   147.2     PRESENT
256   156       Skinny
128   162.75    Piccolo
128   202.5     AES
256   283.5     AES
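A minimal Python sketch of how the comparison set shapes the conclusion: it transcribes the table and asks which 256-bit-key ciphers need fewer bit operations per bit of plaintext than 256-bit Skinny, first without and then with the Salsa20 rows. It assumes, as the slide's wording suggests, that the Salsa20 rows are the ones missing from the Skinny paper's list; `in_skinny_list` and `cheaper_than` are hypothetical helper names for this illustration. It compares operation counts only and says nothing about security.

```python
# Rows transcribed from the slide: (key bits, bit ops per bit of plaintext, cipher).
TABLE = [
    (256, 54.0, "Salsa20/8"),
    (256, 78.0, "Salsa20/12"),
    (128, 88.0, "Simon: 60 ops broken"),
    (128, 100.0, "NOEKEON"),
    (128, 117.0, "Skinny"),
    (256, 126.0, "Salsa20"),
    (256, 144.0, "Simon: 106 ops broken"),
    (128, 147.2, "PRESENT"),
    (256, 156.0, "Skinny"),
    (128, 162.75, "Piccolo"),
    (128, 202.5, "AES"),
    (256, 283.5, "AES"),
]


def in_skinny_list(row):
    """Hypothetical filter: treat every non-Salsa20 row as listed in the Skinny paper."""
    return not row[2].startswith("Salsa20")


def cheaper_than(target, key_bits, rows):
    """Ciphers with the given key size that use fewer ops/bit than `target`, cheapest first."""
    target_cost = next(cost for bits, cost, name in rows
                       if bits == key_bits and name == target)
    same_key = sorted((r for r in rows if r[0] == key_bits), key=lambda r: r[1])
    return [name for _, cost, name in same_key if cost < target_cost]


skinny_rows = [r for r in TABLE if in_skinny_list(r)]
print(cheaper_than("Skinny", 256, skinny_rows))  # ['Simon: 106 ops broken']
print(cheaper_than("Skinny", 256, TABLE))
# ['Salsa20/8', 'Salsa20/12', 'Salsa20', 'Simon: 106 ops broken']
```

Leaving out three rows shrinks four cheaper 256-bit competitors to one, and that one is marked broken.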

21

Many bad examples to imitate, backed by tons of misinformation.

e.g. Do we bother searching for optimized implementations of the older crypto? Take any code! Rely on “optimizing” compiler!

“We come so close to optimal on most architectures that we can’t do much more without using NP complete algorithms instead of heuristics. We can only try to get little niggles here and there where the heuristics get slightly wrong answers.”

22

Reality is more complicated:

23

SUPERCOP benchmarking toolkit includes 2155 implementations of 595 cryptographic primitives.

>20 implementations of Salsa20.

Haswell: Reasonably simple ref implementation compiled with gcc -O3 -fomit-frame-pointer is 6.15× slower than fastest Salsa20 implementation.

merged implementation with “machine-independent” optimizations and best of 121 compiler options: 4.52× slower.
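Gaps of 6.15× and 4.52× are why a benchmark should try every available implementation (and compiler setting) and report the best, after confirming they all compute the same function. The Python sketch below shows only that check-then-time loop; it is not SUPERCOP. `benchmark`, `quarterround_ref` and `quarterround_inline` are hypothetical names, the two candidates are variants of the Salsa20 quarterround, and interpreter timings say nothing about the C cycle counts quoted on the slide.

```python
import random
import statistics
import time

MASK = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK


def quarterround_ref(y0, y1, y2, y3):
    """Salsa20 quarterround, written straight from the specification."""
    z1 = y1 ^ rotl32((y0 + y3) & MASK, 7)
    z2 = y2 ^ rotl32((z1 + y0) & MASK, 9)
    z3 = y3 ^ rotl32((z2 + z1) & MASK, 13)
    z0 = y0 ^ rotl32((z3 + z2) & MASK, 18)
    return z0, z1, z2, z3


def quarterround_inline(y0, y1, y2, y3):
    """Hypothetical variant: the same function with the rotations inlined."""
    t = (y0 + y3) & MASK
    y1 ^= ((t << 7) | (t >> 25)) & MASK
    t = (y1 + y0) & MASK
    y2 ^= ((t << 9) | (t >> 23)) & MASK
    t = (y2 + y1) & MASK
    y3 ^= ((t << 13) | (t >> 19)) & MASK
    t = (y3 + y2) & MASK
    y0 ^= ((t << 18) | (t >> 14)) & MASK
    return y0, y1, y2, y3


# Known-answer test from the Salsa20 specification.
assert quarterround_ref(1, 0, 0, 0) == (0x08008145, 0x00000080, 0x00010200, 0x20500000)


def benchmark(candidates, reference, inputs, reps=7):
    """Check each candidate against `reference`, then time it; fastest first."""
    expected = [reference(*args) for args in inputs]
    results = []
    for name, fn in candidates.items():
        # Never report a speed for code that computes the wrong answer.
        if [fn(*args) for args in inputs] != expected:
            raise ValueError(f"{name} disagrees with the reference")
        timings = []
        for _ in range(reps):
            start = time.perf_counter()
            for args in inputs:
                fn(*args)
            timings.append(time.perf_counter() - start)
        # Median of repeated runs damps one-off interruptions.
        results.append((name, statistics.median(timings)))
    return sorted(results, key=lambda r: r[1])


rng = random.Random(2017)
inputs = [tuple(rng.getrandbits(32) for _ in range(4)) for _ in range(10_000)]
candidates = {"ref": quarterround_ref, "inline": quarterround_inline}
for name, seconds in benchmark(candidates, quarterround_ref, inputs):
    print(f"{name}: {seconds * 1e3:.2f} ms for {len(inputs)} calls")
```

Taking the median of several runs is one reasonable choice; a real harness would likely also count cycles rather than wall-clock seconds and hold the CPU frequency fixed.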

24

Another interesting example: lattice-based signing typically means generating a huge number of random Gaussian samples.

2017.03 Brannigan–Smyth–Oder–Valencia–O’Sullivan–Güneysu–Regazzoni “An investigation of sources of randomness within discrete Gaussian sampling”: benchmarks for RNGs, samplers.

Qualitatively large impacts: choice of RNG ⇒ cost of sampling ⇒ cost of signing.
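To make the RNG ⇒ sampling dependence concrete, here is a minimal Python sketch of one common sampler, cumulative distribution table (CDT) inversion; it is not claimed to be any sampler from the paper. Every sample pulls a fixed 9 bytes from the RNG, so RNG throughput directly bounds sampling throughput. The table is built in floating point and the lookup is not constant-time, so this is for illustration only; `build_cdt`, `CountingRNG` and `sample` are hypothetical names, and the seeded `random.Random` byte source is deliberately non-cryptographic.

```python
import bisect
import math
import random

PRECISION_BITS = 64  # each magnitude draw consumes 64 random bits


def build_cdt(sigma, tail=12.0):
    """Cumulative table for the magnitude |x| of a discrete Gaussian.

    Magnitudes 0..bound get weights rho(0)/2, rho(1), rho(2), ... with
    rho(x) = exp(-x^2 / (2 sigma^2)); halving rho(0) compensates for the
    random sign applied later (+0 and -0 are the same point).
    """
    bound = math.ceil(tail * sigma)
    weights = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(bound + 1)]
    weights[0] /= 2
    total = sum(weights)
    cdt, acc = [], 0.0
    for w in weights:
        acc += w
        cdt.append(int(acc / total * (1 << PRECISION_BITS)))
    cdt[-1] = 1 << PRECISION_BITS  # absorb float rounding: every draw lands in range
    return cdt


class CountingRNG:
    """Wraps a byte source and counts how many bytes the sampler pulls from it."""

    def __init__(self, read_bytes):
        self.read_bytes = read_bytes
        self.bytes_used = 0

    def read(self, n):
        self.bytes_used += n
        return self.read_bytes(n)


def sample(cdt, rng):
    """One CDT sample: 8 bytes choose the magnitude, 1 more byte chooses the sign."""
    u = int.from_bytes(rng.read(PRECISION_BITS // 8), "little")
    magnitude = bisect.bisect_right(cdt, u)
    negative = rng.read(1)[0] & 1
    return -magnitude if negative else magnitude


# Seeded, NON-cryptographic byte source (Python 3.9+), for reproducibility only.
rng = CountingRNG(random.Random(2017).randbytes)
cdt = build_cdt(sigma=3.0)
samples = [sample(cdt, rng) for _ in range(10_000)]
print(rng.bytes_used / len(samples), "RNG bytes per sample")  # 9.0
```

Swapping in another byte source, e.g. `CountingRNG(os.urandom)`, leaves the sampler untouched but changes its speed, which is the kind of choice the paper benchmarks.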

25

Two examples of speed reported in this 2017 paper for a 3.4GHz Skylake (Intel Core i7-6700): 383.69 MByte/sec (8.86 cycles/byte) for AES CTR-DRBG using AES-NI; 106.07 MByte/sec (32 cycles/byte) for ChaCha20.

But wait. eBACS reports 0.92 cycles/byte for AES-256-CTR, 1.18 cycles/byte for ChaCha20.

Author non-response: “essential for us to examine standard open implementations”. Slow ones?
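The slide's cycles/byte numbers follow from the reported throughputs at the nominal clock: cycles/byte = clock rate / bytes per second, with MByte taken as 10^6 bytes (which reproduces 8.86 and 32). A short Python sketch of that arithmetic and of the gap to the eBACS figures, comparing the paper's AES CTR-DRBG number with eBACS's AES-256-CTR number as the slide does. It assumes the core ran at exactly 3.4 GHz throughout, which frequency scaling can break.

```python
CLOCK_HZ = 3.4e9  # nominal clock of the Core i7-6700 on the slide
MBYTE = 1e6       # MByte = 10^6 bytes reproduces the slide's 8.86 and 32


def cycles_per_byte(bytes_per_second, clock_hz=CLOCK_HZ):
    """Cycles per byte implied by a throughput, if the core stays at clock_hz."""
    return clock_hz / bytes_per_second


paper = {  # throughputs reported in the 2017 paper
    "AES": cycles_per_byte(383.69 * MBYTE),
    "ChaCha20": cycles_per_byte(106.07 * MBYTE),
}
ebacs = {"AES": 0.92, "ChaCha20": 1.18}  # eBACS: AES-256-CTR, ChaCha20

for name, cpb in paper.items():
    ratio = cpb / ebacs[name]
    print(f"{name}: {cpb:.2f} cycles/byte in the paper, {ratio:.1f}x the eBACS figure")
```

With these inputs it reports about 8.86 and 32.05 cycles/byte, roughly 9.6× and 27× the eBACS figures.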

SLIDE 109