Large-scale Computation
Nathan Lam z5113345 Sophie Calland z5161776 Stephen Webb z5075569
Contents
○ Introduction + Benefits - Sophie Calland
○ Parallelisation - Nathan Lam
○ Memory Access - Stephen Webb
Introduction + Benefits
Sophie Calland z5161776
○ Benefits of FPGAs
○ Disadvantages of FPGAs
○ Hybrid approaches
○ Examples
  ■ CPU-based
  ■ Hybrid approach
Introduction
○ Large-scale computation = using many compute resources to solve a complex problem, or process a large amount of data[4]
○ Useful in science, technology, finance, space/defence, academia, etc.
Why FPGAs?
○ Highly performant + cheap!
○ Amount of data requiring processing is only getting larger
○ Provide acceleration for specific functions
○ Reprogrammable = cheap, flexible
○ Make large-scale computation accessible to embedded systems
Benefits of FPGAs: Speed
○ FPGAs can perform complex calculations fast
○ Deterministic, very specialised
  ■ GPU = must communicate via a CPU, buses
  ■ CPU = 50 microseconds is good
  ■ FPGA = at or below 1 microsecond
○ No Operating System to go through
○ Useful if you want quick calculation + response
Very cool F-35s - contain FPGAs
Picture from https://nationalinterest.org/blog/the-buzz/lockheed-martins-f-35-how-the-joint-strike-fighter-becoming-24259
Benefits of FPGAs: Connectivity
○ Data sources can be connected directly to the chip
  ■ No intermediary bus or OS as required by CPU/GPU designs
○ Potential for much higher bandwidth (and lower latency)
Benefits of FPGAs: Reprogrammable
○ Remove/fix bugs
○ Change accelerators per application
○ Reusable
Disadvantages of FPGAs
○ Development is complex
○ FPGA chips alone don’t have a lot of on-board memory
○ Larger data sets = might not be worth it alone
○ Cost, time
○ Might not be worth it compared to ASICs
  ■ Example: Bitcoin mining; specialised ASICs are better than the previously used FPGAs[7]
Hybrid approaches
○ Combine CPU + FPGA to overcome the limits of either approach alone
○ Latency sensitive tasks and data processing delegated to FPGA
○ CPU has better memory locality
○ Best of both worlds? Kind of …
  ■ Power and engineering effort are still concerns
○ Applications needing lower power and lower space requirements can benefit
○ Example: space computers, smart cameras
CHREC Space Processor v1.0 board[10]
Example - High Performance Computing
○ Massively parallel CPU-based architecture
○ Many CPU nodes linked by an interconnected bus
○ e.g. IBM Blue Gene/Q[9]
Example - High Performance Computing
○ Some supercomputers provide a hybrid approach
○ CPU + GPU computation with acceleration via FPGA
○ Faster compute time
○ e.g. the Cygnus supercomputer[8], OpenPOWER Foundation
Pipelining
○ No pipelining = wasted time + slow
○ Pipelining = faster
○ Independent tasks can benefit
Figures from [6]
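The benefit of pipelining can be shown with a back-of-the-envelope cycle count (an illustrative model, not taken from [6]): with a k-stage datapath at one cycle per stage, n independent tasks finish in k + (n - 1) cycles instead of n * k.

```python
# Illustrative model: cycle counts for n independent tasks on a
# k-stage datapath, one cycle per stage.

def cycles_without_pipelining(n_tasks: int, n_stages: int) -> int:
    # Each task runs all stages to completion before the next starts.
    return n_tasks * n_stages

def cycles_with_pipelining(n_tasks: int, n_stages: int) -> int:
    # After the pipeline fills (n_stages cycles), one task completes per cycle.
    return n_stages + (n_tasks - 1)

print(cycles_without_pipelining(100, 5))  # 500
print(cycles_with_pipelining(100, 5))     # 104
```

With 100 tasks and 5 stages the pipelined version is almost 5x faster, which is the gap the figures illustrate.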
Parallelisation
Nathan Lam z5113345
○ Inter-FPGA
○ Intra-FPGA
○ Merge Sort
○ Paul’s Algorithm
○ Map-reduce
Parallelisation: Inter-FPGA
Advantages
○ Scalable (can distribute the problem to more FPGAs)
Disadvantages
○ Communication overhead between FPGAs (synchronisation)
○ Requires managing a cluster of FPGAs
○ Data transfers are limited by the CPU-FPGA bus
Parallelisation: Intra-FPGA
Advantages
○ Predictable data path (data always goes through the same bus to the same FPGA)
○ Simpler host-side work (e.g. pre-processing)
Disadvantages
○ Limited to the resources of a single FPGA
○ May be bottlenecked by off-chip memory (if datasets are too large for on-chip memory)
Examples
○ Merge sort
○ Map-reduce
Parallelisation: Merge Sort
○ The bottleneck is moving data to and from memory when merging
○ The hybrid CPU-FPGA approach splits the sort into three parts:
  1. CPU partitions the data into sub-blocks
  2. FPGA sorts the sub-blocks of data (using quick-sort)
  3. CPU merges the sorted sub-blocks back together
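The three-phase flow above can be sketched in software; `fpga_sort_block` is a stand-in name for the FPGA quick-sort of one sub-block (an assumption for illustration, not the paper's kernel):

```python
# Sketch of the three-phase CPU/FPGA merge sort. On the real platform the
# per-block sort is offloaded to the FPGA; here it is plain software.

from heapq import merge
from typing import List

def fpga_sort_block(block: List[int]) -> List[int]:
    # Placeholder for the FPGA quick-sort of one sub-block.
    return sorted(block)

def hybrid_merge_sort(data: List[int], block_size: int = 4) -> List[int]:
    # 1. CPU partitions the data into sub-blocks.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # 2. "FPGA" sorts each sub-block (independent, so parallel on hardware).
    sorted_blocks = [fpga_sort_block(b) for b in blocks]
    # 3. CPU merges the sorted sub-blocks back together.
    return list(merge(*sorted_blocks))

print(hybrid_merge_sort([9, 3, 7, 1, 8, 2, 6, 4, 5]))
```

Because the sub-block sorts are independent, step 2 is where the FPGA parallelism pays off; steps 1 and 3 remain serial CPU work.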
Parallelisation: Merge Sort
○ Overall speedup is limited by throughput
○ and by how much of the execution time is spent on the FPGA
○ versus the percentage of execution time left on the CPU
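A quick way to see this limit is the standard Amdahl's-law formula (a general estimate, not a figure from the paper): if a fraction p of the run time is FPGA-accelerated by a factor s, the rest stays on the CPU.

```python
def overall_speedup(p_accelerated: float, s: float) -> float:
    # Amdahl's law: the un-accelerated (1 - p) fraction bounds the gain.
    return 1.0 / ((1.0 - p_accelerated) + p_accelerated / s)

# E.g. 80% of the time spent on the FPGA, 10x faster there:
print(round(overall_speedup(0.8, 10.0), 2))  # 3.57
```

Even an infinitely fast FPGA cannot beat 1 / (1 - p), so the CPU-side partition and merge phases cap the end-to-end gain.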
Parallelisation: Map-Reduce [5]
○ Implements the k-means algorithm, which is an unsupervised machine learning model
○ Work is distributed across multiple FPGAs
○ Each FPGA is connected by a PCIe driver
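One k-means iteration maps cleanly onto map-reduce; the sketch below mirrors that structure in plain Python over 1-D points (in [5] the map workers run on the FPGAs, here they are ordinary functions):

```python
# Minimal software sketch of one k-means map-reduce step.
from collections import defaultdict
from typing import Dict, List, Tuple

def mapper(point: float, centroids: List[float]) -> Tuple[int, float]:
    # Map: emit (nearest-centroid index, point).
    idx = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
    return idx, point

def reducer(groups: Dict[int, List[float]], centroids: List[float]) -> List[float]:
    # Reduce: new centroid = mean of the points assigned to it
    # (unchanged if no points were assigned).
    return [sum(pts) / len(pts) if (pts := groups.get(i)) else c
            for i, c in enumerate(centroids)]

def kmeans_step(points: List[float], centroids: List[float]) -> List[float]:
    groups: Dict[int, List[float]] = defaultdict(list)
    for p in points:
        idx, val = mapper(p, centroids)
        groups[idx].append(val)
    return reducer(groups, centroids)

print(kmeans_step([1.0, 2.0, 9.0, 10.0], [0.0, 8.0]))  # [1.5, 9.5]
```

The map calls are independent per point, which is what lets the work be spread across several FPGAs; only the small reduce step needs the grouped results.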
Parallelisation: Map-Reduce [5]
Parallelisation: Map-Reduce [5]
○ Test cluster: FPGA-accelerated compute nodes and 1 head node.
○ Speedup compared to the software version on Hadoop
○ Speedup compared to the Mahout version on Hadoop
○ Performance scaled across 3 FPGAs consistently
○ (reduced bandwidth requirement for each node)
Parallelisation: Map-Reduce [5]
○ Performance scaled across 3 FPGAs consistently
○ Reduced bandwidth requirement for each node
Memory Access
Stephen Webb z5075569
○ Why memory access matters in LSC
○ Paper 1 [1]
  ■ Problem Space
  ■ Solution
○ Paper 2 [2]
  ■ Problem Space
  ■ Paul’s Algorithm
  ■ Solution
Need for a large amount of memory:
○ Large-scale datasets are commonly in the GBs
○ FPGAs have little on-chip memory (usually 100’s of kB)
Some Requirements for LSC:
○ Access to large off-chip memory with reasonable bandwidth
○ Using a direct algorithm conversion to hardware without regard to memory bandwidth saw a slowdown
High Throughput Large Scale Sorting on a CPU-FPGA Heterogeneous Platform [1]
Paper 1: Problem Space
○ Sorting large datasets on a CPU-FPGA heterogeneous platform
○ Both FPGA and CPU share memory interfaces
○ Data moves between them in cache lines
Paper 1: Solution - Shared Memory
○ Using the CPU last level cache as a buffer
○ FPGA accesses are aligned with the cache lines
○ Transfers are sized to the CPU cache line
○ This technique has seen about a 2-3x improvement compared to an FPGA-only implementation [2]
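The shared-buffer idea can be mimicked in software (a toy model under our own assumptions, not the paper's code): the producer packs sorted output into cache-line-sized batches, and the consumer drains whole lines, so both sides only ever move aligned, fixed-size chunks.

```python
# Toy model of cache-line-sized transfers through a small shared buffer.
from collections import deque
from typing import Deque, List

LINE_WORDS = 8  # e.g. a 64-byte cache line holding eight 8-byte keys

def produce_lines(sorted_run: List[int]) -> List[List[int]]:
    # "FPGA" side: pack a sorted run into cache-line-sized batches.
    return [sorted_run[i:i + LINE_WORDS]
            for i in range(0, len(sorted_run), LINE_WORDS)]

def consume_lines(buffer: Deque[List[int]]) -> List[int]:
    # "CPU" side: drain the shared buffer one whole line at a time.
    out: List[int] = []
    while buffer:
        out.extend(buffer.popleft())
    return out

shared: Deque[List[int]] = deque(produce_lines(list(range(20))))
print(consume_lines(shared))
```

Keeping every transfer a whole line is what lets the real design stage data in the last-level cache instead of round-tripping through main memory.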
An Efficient O(1) Priority Queue for Large FPGA-Based Discrete Event Simulations of Molecular Dynamics [2]
Paper 2: Problem Space
○ The event queue is shared between the CPU and FPGA.
○ Events need to be accessed independently and in parallel
Paul’s Algorithm
○ The event queue is split into different segments for different ranges of Δt
○ Only the nearest segment must stay fully ordered, allowing an O(1) insert to be used
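A simplified software version of such a segmented queue (our sketch, not the paper's hardware design) shows why insertion is cheap: choosing a bucket is a single divide, and only the first non-empty segment is searched when popping.

```python
# Sketch of a segmented event queue: events are binned by their time
# delta into fixed-width segments, so insertion is an O(1) bucket append.
from typing import List, Tuple

class SegmentedEventQueue:
    def __init__(self, n_segments: int, segment_width: float):
        self.segments: List[List[Tuple[float, str]]] = [[] for _ in range(n_segments)]
        self.width = segment_width

    def insert(self, dt: float, event: str) -> None:
        # O(1): pick the segment covering this time range and append.
        idx = min(int(dt / self.width), len(self.segments) - 1)
        self.segments[idx].append((dt, event))

    def pop_next(self) -> Tuple[float, str]:
        # Only the first non-empty segment is searched for the minimum.
        for seg in self.segments:
            if seg:
                nxt = min(seg)
                seg.remove(nxt)
                return nxt
        raise IndexError("queue is empty")

q = SegmentedEventQueue(n_segments=4, segment_width=1.0)
q.insert(2.7, "collision")
q.insert(0.3, "bond update")
q.insert(0.9, "cell crossing")
print(q.pop_next())  # (0.3, 'bond update')
```

Far-future events sit unsorted in later segments and are only ordered once they migrate toward the front, which is what keeps the common-case cost constant.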
Data Structure
○ Each segment holds its own events
○ Segments can be stored as a linked list to allow the queue to be dynamic
Constraints
○ Events migrate between the segments and the ordered list
○ Keep the number of segments as small as possible (around 20)
○ This allows for prefetching and caching within the queue.
Drawbacks
○ The segments must be rebalanced if the distribution of Δt values changes
Paper 2: Solution - FIFO Pre-Fetch / Round-Robin Writeback
During every cycle:
1. Fetch the next address of the ordered list from off-chip memory
2. Load it into SRAM
3. Read from SRAM and send it to the Queue Sorter
4. Take the output of the Queue Sorter and write it to SRAM
5. Write the SRAM back to off-chip memory
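The per-cycle loop can be modelled as a staged software pipeline (an illustration, not RTL): each "cycle" advances every in-flight item one stage along the off-chip → SRAM → sorter → SRAM → off-chip path, so the memory operations of different events overlap instead of serialising.

```python
# Software model of the round-robin per-cycle loop above.
from collections import deque

def run_pipeline(off_chip):
    fetch_buf, sorter, writeback_buf, out = deque(), deque(), deque(), []
    pending = deque(off_chip)
    cycles = 0
    while pending or fetch_buf or sorter or writeback_buf:
        # Stages run back-to-front so each item advances one stage per cycle.
        if writeback_buf:                       # SRAM -> off-chip memory
            out.append(writeback_buf.popleft())
        if sorter:                              # sorter output -> SRAM
            writeback_buf.append(sorter.popleft())
        if fetch_buf:                           # SRAM -> Queue Sorter
            sorter.append(fetch_buf.popleft())
        if pending:                             # off-chip -> SRAM prefetch
            fetch_buf.append(pending.popleft())
        cycles += 1
    return out, cycles

items, cycles = run_pipeline(["e0", "e1", "e2", "e3"])
print(items, cycles)  # ['e0', 'e1', 'e2', 'e3'] 7
```

Four items take 7 cycles instead of 16: each item still traverses four stages, but after the fill phase one item retires per cycle, which is the latency-for-bandwidth trade the slide describes.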
○ Trading off latency for bandwidth
Other Paper - Memory Accelerated In-Line Compression [4]
Large Scale Computing
○ Why?
○ Parallelisation
○ Memory Access
References
[1] Zhang, C., Chen, R., & Prasanna, V. (2016, May). High throughput large scale sorting on a CPU-FPGA heterogeneous platform. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (pp. 148-155). IEEE.
[2] Herbordt, M. C., Kosie, F., & Model, J. (2008, April). An efficient O(1) priority queue for large FPGA-based discrete event simulations of molecular dynamics. In 2008 16th International Symposium on Field-Programmable Custom Computing Machines (pp. 248-257). IEEE.
[3] Fang, J., Mulder, Y. T. B., Hidders, J. et al. (2020). In-memory database acceleration on FPGAs: a survey. The VLDB Journal, 29, 33-59. https://doi.org/10.1007/s00778-019-00581-w
[4] Chew, W. C., & Jiang, L. (2013). Overview of Large-Scale Computing: The Past, the Present, and the Future. Proceedings of the IEEE, 101(2), 227-241.
[5] Choi, Y.-M., & So, H. K.-H. (2014). Map-reduce processing of k-means algorithm with FPGA-accelerated computer cluster. In 2014 IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), Zurich, pp. 9-16.
[6] Stanford - Pipelining - https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/pipelining/index.html
[7] Image used - https://www.buybitcoinworldwide.com/mining/hardware/
[8] Overview of Cygnus: a new supercomputer at CCS - https://www.ccs.tsukuba.ac.jp/wp-content/uploads/sites/14/2018/12/About-Cygnus.pdf
[9] Image used - https://www.researchgate.net/figure/Blue-Gene-Q-packaging-hierarchy_fig3_281396398
[10] Rudolph, D., Wilson, C., Stewart, J., Gauvin, P., George, A., Lam, H., ... & Stoddard, A. (2014). CSP: A multifaceted hybrid architecture for space computing.