Low-Precision Arithmetic
CS4787 Lecture 21 — Spring 2020
The standard approach
Single-precision floating point (FP32): 32-bit floating point numbers
Bit layout (bits 31..0): sign (1 bit) | 8-bit exponent | 23-bit mantissa
represented number = (−1)^sign · 2^(exponent − 127) · 1.b22 b21 b20 … b0   (normal numbers; the leading 1 is implicit)
Denormal (subnormal) numbers, when the exponent field is all zeros:
represented number = (−1)^sign · 2^(−126) · 0.b22 b21 b20 … b0
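To make the bit layout concrete, here is a small sketch (not from the slides) that unpacks the sign, exponent, and mantissa fields of a float32 using Python's standard library and recomputes the represented value from the formulas above, including the denormal case.

```python
import struct

def decode_float32(x):
    """Unpack a float32 into (sign, exponent, mantissa) bit fields and
    recompute its value from the formulas above."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))   # raw 32-bit pattern
    sign     = (bits >> 31) & 0x1          # bit 31
    exponent = (bits >> 23) & 0xFF         # bits 30..23 (8 bits)
    mantissa = bits & 0x7FFFFF             # bits 22..0  (23 bits)

    if exponent == 0:
        # denormal: no implicit leading 1, exponent fixed at -126
        value = (-1) ** sign * 2.0 ** (-126) * (mantissa / 2 ** 23)
    elif exponent == 255:
        value = float("nan") if mantissa else (-1) ** sign * float("inf")
    else:
        # normal: implicit leading 1, biased exponent
        value = (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2 ** 23)
    return sign, exponent, mantissa, value

print(decode_float32(-6.25))   # sign=1, exponent=129; value reconstructs -6.25
print(decode_float32(1e-40))   # denormal: exponent field is 0
```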
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
sign 8-bit exponent 23-bit mantissa
If x ∈ R is a real number within the range of a floating-point representation, and x̃ is the closest representable floating-point number to it, then |x̃ − x| = |round(x) − x| ≤ |x| · εmachine. Here, εmachine is called the machine epsilon and bounds the relative error of the format.
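A quick numerical check of this bound (an illustrative sketch, not part of the slides), using NumPy's float32 type: rounding a real number to the nearest float32 incurs relative error at most εmachine.

```python
import numpy as np

eps32 = np.finfo(np.float32).eps          # machine epsilon for float32, about 1.2e-7
x = 0.1                                    # not exactly representable in binary
x_rounded = float(np.float32(x))           # nearest representable float32
rel_error = abs(x_rounded - x) / abs(x)
print(eps32, rel_error, rel_error <= eps32)   # relative error is within eps
```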
If x and y are real numbers representable in a floating-point format, ⊙ denotes an (infinite-precision) binary operation (such as +, ·, etc.), and • denotes the floating-point version of that operation, then x • y = round(x ⊙ y) and |(x • y) − (x ⊙ y)| ≤ |x ⊙ y| · εmachine, as long as the result is in range.
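The same model can be checked for arithmetic: a float32 operation behaves like the exact operation followed by rounding. A minimal sketch, using float64 as a stand-in for the exact result:

```python
import numpy as np

eps32 = np.finfo(np.float32).eps
a, b = np.float32(1.0) / np.float32(3.0), np.float32(2.0) / np.float32(7.0)
exact   = float(a) * float(b)     # float64 product, effectively exact here
rounded = float(a * b)            # float32 product: round(a * b)
print(abs(rounded - exact) / abs(exact) <= eps32)   # True: relative error within eps
```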
If the result is out of range (overflow or underflow), we can get more error than the model predicts.
Hardware used for ML with smaller numbers: Google’s TPU, NVIDIA’s GPUs, Intel’s NNP
A low-precision alternative
Half-precision floating point (FP16): 16-bit floating point numbers
x = (−1)^sign bit · 2^(exponent − 15) · 1.significand₂
Bit layout (bits 15..0): sign (1 bit) | 5-bit exponent | 10-bit mantissa
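NumPy supports IEEE half precision directly, so we can query its key constants (a quick sketch; the values match the comparison given later in the lecture):

```python
import numpy as np

fp16 = np.finfo(np.float16)
print(fp16.eps)    # machine epsilon, about 9.8e-4
print(fp16.max)    # largest representable value: 65504.0 (about 6.5e4)
print(fp16.tiny)   # smallest positive normal number (about 6.1e-5)
```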
SIMD Precision
SIMD Parallelism (one 256-bit vector register):
64-bit float vector (4 × F64): 4 multiplies/cycle (vmulpd instruction)
32-bit float vector (8 × F32): 8 multiplies/cycle (vmulps instruction)
16-bit int vector (16 lanes): 16 multiplies/cycle (vpmaddwd instruction)
8-bit int vector (32 lanes): 32 multiplies/cycle (vpmaddubsw instruction)
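NumPy's elementwise kernels may use these SIMD instructions under the hood, so the effect can be observed roughly by timing multiplies at different precisions. This is only an illustrative sketch: the exact speedups depend on the CPU, and the measurement also includes memory traffic.

```python
import time
import numpy as np

def mult_throughput(dtype, n=1 << 22, reps=20):
    """Rough multiplies-per-nanosecond for elementwise multiplication at one precision."""
    a = np.ones(n, dtype=dtype)
    b = np.ones(n, dtype=dtype)
    start = time.perf_counter()
    for _ in range(reps):
        np.multiply(a, b)
    elapsed = time.perf_counter() - start
    return n * reps / elapsed / 1e9   # multiplies per nanosecond

for dtype in (np.float64, np.float32, np.int16, np.int8):
    print(dtype.__name__, round(mult_throughput(dtype), 2), "multiplies/ns")
```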
Precision in DRAM
Memory Throughput (assuming ~40 GB/sec memory bandwidth):
64-bit float vector: 5 numbers/ns
32-bit float vector: 10 numbers/ns
16-bit int vector: 20 numbers/ns
8-bit int vector: 40 numbers/ns
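The numbers-per-nanosecond figures follow directly from dividing the bandwidth by the element size; a one-line check (assuming the ~40 GB/sec figure from the slide):

```python
bandwidth_bytes_per_ns = 40e9 / 1e9   # ~40 GB/sec = 40 bytes per nanosecond
for name, size in [("float64", 8), ("float32", 4), ("int16", 2), ("int8", 1)]:
    print(name, bandwidth_bytes_per_ns / size, "numbers/ns")
# float64 5.0, float32 10.0, int16 20.0, int8 40.0 numbers/ns
```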
The same idea applies to communication (assuming ~40 GB/sec network bandwidth):
32-bit float vector: 10 numbers/ns
16-bit int vector: 20 numbers/ns
8-bit int vector: 40 numbers/ns
Specialized lossy compression: >40 numbers/ns
[Diagram: a float32 multiplier takes two float32 inputs and produces a float32 output, while an int16 multiplier takes two int16 inputs; the precision of the representation in memory affects the algorithm's runtime on the machine.]
For single-precision (32-bit) floats, εmachine ≈ 1.2 × 10^−7 (about 6.0 × 10^−8 if we account for rounding). The representable range is about ±3.4 × 10^38, with a smallest positive denormal of about 1.4 × 10^−45.
For half-precision (16-bit) floats, εmachine ≈ 9.8 × 10^−4, and the largest representable value is only about 6.5 × 10^4.
Micikevicius et al. “Mixed Precision Training.” arXiv, 2017.
One way to address limited range: more exponent bits
Q: What can we say about the range of bfloat16 numbers as compared with IEEE half-precision floats and single-precision floats? How does their machine epsilon compare?
Bit layout of bfloat16 (bits 15..0): sign (1 bit) | 8-bit exponent | 7-bit mantissa
(Intuition: numbers in ML workloads are more likely to underflow than they are to overflow.)
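To answer the question numerically, here is a sketch: the float16 and float32 constants come from NumPy, while the bfloat16 values are computed by hand from its layout above (8-bit exponent like float32, 7-bit mantissa).

```python
import numpy as np

for name, f in [("float16", np.finfo(np.float16)), ("float32", np.finfo(np.float32))]:
    print(name, "eps =", f.eps, "max =", f.max)

# bfloat16: same 8-bit exponent as float32, but only a 7-bit mantissa
bf16_eps = 2.0 ** -7                       # ~7.8e-3: much coarser than float16
bf16_max = (2 - 2.0 ** -7) * 2.0 ** 127    # ~3.4e38: same range as float32
print("bfloat16 eps =", bf16_eps, "max =", bf16_max)
```

So bfloat16 covers roughly the same range as float32 (its exponent bits match), while its machine epsilon (2^−7 ≈ 7.8 × 10^−3) is coarser than float16's (≈ 9.8 × 10^−4).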
An alternative to low-precision floating point
x = (−1)^sign bit · (integer part + 2^(−q) · fractional part)
Bit layout (bits 15..0): sign (1 bit) | fixed-point number (15 bits: integer and fractional parts)
Q: Why might we want to use fixed-point numbers for ML? Have you used something like fixed-point numbers in a programming assignment?
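A minimal sketch of fixed-point quantization (the function names and the choice of 16 bits with q = 8 fractional bits are illustrative, not from the slides): values are stored as small integers and interpreted with an implicit scale of 2^(−q).

```python
import numpy as np

Q = 8                       # number of fractional bits (illustrative choice)
SCALE = 2.0 ** Q

def to_fixed(x):
    """Quantize real numbers to 16-bit fixed point with Q fractional bits."""
    ints = np.clip(np.round(np.asarray(x) * SCALE), -2 ** 15, 2 ** 15 - 1)
    return ints.astype(np.int16)

def from_fixed(f):
    """Interpret the stored integers: x = (integer part + fractional part) * 2^(-Q)."""
    return f.astype(np.float64) / SCALE

x = np.array([3.14159, -1.25, 100.007])
print(to_fixed(x))              # stored int16 values
print(from_fixed(to_fixed(x)))  # 3.140625, -1.25, 100.0078125: error at most 2^(-Q-1)
```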
A powerful hybrid approach
Store one 8-bit shared exponent for a whole block of numbers; this works well when the numbers in the block lie in the same range.
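A sketch of this hybrid (often called block floating point; the function names are mine): all entries of a block share one exponent chosen from the largest magnitude, and each entry stores only a small integer mantissa.

```python
import numpy as np

def block_quantize(x, mantissa_bits=8):
    """Share one exponent across the whole block; keep a small int mantissa per entry."""
    shared_exp = int(np.ceil(np.log2(np.max(np.abs(x)))))    # one shared exponent per block
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    mantissas = np.round(x / scale).astype(np.int32)          # small signed integers
    return mantissas, shared_exp, scale

def block_dequantize(mantissas, scale):
    return mantissas * scale

x = np.array([0.5, -1.7, 2.2, 0.03])
m, e, s = block_quantize(x)
print(m, e)                     # integer mantissas plus one shared exponent
print(block_dequantize(m, s))   # close to x; the smallest entries lose the most precision
```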
A more specialized approach
Choose a custom set of quantization points: these are the numbers a particular low-precision bit string represents.
See “The ZipML Framework for Training Models with End-to-End Low Precision: The Cans, the Cannots, and a Little Bit of Deep Learning” (Zhang et al., 2017).
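One way to implement this idea (a sketch in the spirit of ZipML, not the paper's exact algorithm): pick an arbitrary sorted set of representable points, then round each value randomly to one of its two neighboring points with probabilities chosen so that the result is unbiased.

```python
import numpy as np

def quantize_to_points(x, points, rng=np.random.default_rng()):
    """Unbiased quantization onto an arbitrary sorted set of representable points."""
    points = np.asarray(points, dtype=np.float64)
    x = np.clip(np.asarray(x, dtype=np.float64), points[0], points[-1])
    hi = np.clip(np.searchsorted(points, x), 1, len(points) - 1)  # index of upper neighbor
    lo = hi - 1
    gap = points[hi] - points[lo]
    p_up = (x - points[lo]) / gap        # probability of rounding up, so E[output] = x
    go_up = rng.random(x.shape) < p_up
    return np.where(go_up, points[hi], points[lo])

points = [-1.0, -0.25, 0.0, 0.1, 0.5, 1.0]    # an arbitrary low-precision "codebook"
x = np.array([0.3, -0.6, 0.07])
print(quantize_to_points(x, points))
```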
Use different precisions for different parts of training.
[Figure: mixed-precision training data flow — low-precision weights and the previous layer's activations feed the forward pass; the backward pass produces activation and weight gradients; a higher-precision weight accumulator stores the master copy of the weights.]
Types of signals in backpropagation:
weights/parameters
activation and gradient vectors
updates communicated among parallel workers
These signals behave differently, so it can be hard to get a sense of how they would perform at low precision.
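This mixed-precision recipe (low-precision forward/backward signals, a higher-precision weight accumulator, and loss scaling) is what frameworks automate. Below is a hedged sketch using PyTorch's automatic mixed precision, assuming a CUDA device is available; the model, data, and hyperparameters are placeholders, not from the slides.

```python
import torch

# Illustrative placeholders: any model, optimizer, and (x, y) batches would do.
model = torch.nn.Linear(784, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # FP32 master weights
scaler = torch.cuda.amp.GradScaler()                      # loss scaling to limit FP16 underflow
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(100):
    x = torch.randn(32, 784, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # forward pass runs mostly in FP16
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()          # backward pass on the scaled loss
    scaler.step(optimizer)                 # unscale gradients, then update the FP32 weights
    scaler.update()                        # adjust the loss scale for the next step
```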
Using unbiased (stochastic) rounding, we can prove guarantees that SGD works with a low-precision model, since a low-precision gradient is an unbiased estimator. Example: 2.7 rounds up to 3.0 with probability 70% and down to 2.0 with probability 30%.
w_{t+1} = Q̃(w_t − α_t ∇f(w_t; x_t, y_t))
E[w_{t+1} | w_t] = E[Q̃(w_t − α_t ∇f(w_t; x_t, y_t)) | w_t] = E[w_t − α_t ∇f(w_t; x_t, y_t) | w_t] = w_t − α_t ∇f(w_t)
Sample u ∼ Unif[0, 1], then set Q(x) = ⌊x + u⌋.
E[Q(x)] = ⌊x⌋ · P(Q(x) = ⌊x⌋) + (⌊x⌋ + 1) · P(Q(x) = ⌊x⌋ + 1)
= ⌊x⌋ + P(Q(x) = ⌊x⌋ + 1)
= ⌊x⌋ + P(⌊x + u⌋ = ⌊x⌋ + 1)
= ⌊x⌋ + P(x + u ≥ ⌊x⌋ + 1)
= ⌊x⌋ + P(u ≥ ⌊x⌋ + 1 − x)
= ⌊x⌋ + 1 − (⌊x⌋ + 1 − x)
= x.
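An empirical sanity check of this argument (an illustrative sketch): implement Q(x) = ⌊x + u⌋ with NumPy and verify that the average of many quantizations recovers x, unlike nearest rounding.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x, size=None):
    """Q(x) = floor(x + u) with u ~ Unif[0, 1): unbiased, so E[Q(x)] = x."""
    u = rng.random(size)
    return np.floor(x + u)

x = 2.7
samples = stochastic_round(x, size=1_000_000)
print(np.mean(samples == 3.0))   # about 0.7: rounds up with probability 70%
print(np.mean(samples))          # about 2.7: unbiased, unlike round(2.7) = 3.0
```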
From https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/
Low precision introduces quantization error in our computations, resulting in less accurate learned systems.