


  1. Large-scale Computation
     Nathan Lam z5113345, Sophie Calland z5161776, Stephen Webb z5075569

  2. Contents
     ● Introduction + Benefits
       ○ Sophie Calland
     ● Parallelisation
       ○ Nathan Lam
     ● Memory Access
       ○ Stephen Webb

  3. Introduction + Benefits
     Sophie Calland z5161776
     ● Large scale computation
       ○ Benefits of FPGAs
       ○ Disadvantages of FPGAs
       ○ Hybrid approaches
     ● Example: Supercomputers
       ○ CPU-based
       ○ Hybrid approach
     ● Pipelining review

  4. What is Large Scale Computation?
     ● A tool to speed up calculation of a complex problem, or to process a large amount of data [4]
       ○ Useful in science, technology, finance, space/defence, academia, etc.
     ● Solutions need to be large scale + highly performant + cheap!
       ○ The amount of data requiring processing is only getting larger
     ● Modern FPGAs can help to meet many of these requirements
       ○ Provide acceleration for specific functions
       ○ Reprogrammable = cheap, flexible
     ● Hybrid CPU/FPGAs make large scale computation accessible to embedded systems

  5. FPGA Benefits for Large Scale Computation
     ● Smaller devices require the ability to perform complex calculations fast
     ● Low latency
       ○ Deterministic, very specialised
         ■ GPU = must communicate via a CPU and buses
         ■ CPU = 50 microseconds is good
         ■ FPGA = at or below 1 microsecond
       ○ No operating system to go through
       ○ Useful if you want quick calculation + response
     (Picture: "Very cool F-35s - contain FPGAs", from https://nationalinterest.org/blog/the-buzz/lockheed-martins-f-35-how-the-joint-strike-fighter-becoming-24259)

  6. FPGA Benefits (cont.)
     ● Reprogrammability
       ○ Remove/fix bugs
       ○ Change accelerators per application
       ○ Reusable
     ● Data connections
       ○ Data sources can be connected directly to the chip
         ■ No intermediary bus or OS as required by CPU/GPU designs
       ○ Potential for much higher bandwidth (and lower latency)

  7. FPGA Disadvantages
     ● Memory locality and sharing can be more complex
       ○ FPGA chips alone don't have a lot of on-board memory
       ○ Larger data sets = might not be worth it alone
     ● Engineering effort is greater
       ○ Cost, time
       ○ Might not be worth it
     ● Power increases relative to specialised ASICs
       ○ Bitcoin mining: specialised ASICs are better than the previously used FPGAs [7]

  8. Hybrid CPU/FPGA Approaches
     ● Can address some pitfalls of CPU-only or FPGA-only approaches
       ○ Latency-sensitive tasks and data processing delegated to the FPGA
       ○ CPU has better memory locality
       ○ Best of both worlds? Kind of...
         ■ Power and engineering effort are still concerns
     ● Embedded systems with high data throughput and lower space requirements can benefit
       ○ Examples: space computers, smart cameras
     (Picture: CHREC Space Processor v1.0 board [10])

  9. Example - High Performance Computing
     ● Older supercomputers = massively parallel CPU-based architecture
     ● Nodes communicate via an interconnected bus
     ● High memory throughput
     ● Example: IBM's Blue Gene [9]

  10. Example - High Performance Computing
      ● Modern supercomputers provide a hybrid approach
      ● Allows for hardware acceleration via FPGA
        ○ Faster compute time
      ● Example: Cygnus supercomputer [8] (pictured), OpenPOWER Foundation

  11. Pipelining Review
      ● Problems that can be broken up into independent tasks can benefit
        ○ No pipelining = wasted time + slow; pipelining = faster (figures from [6])
      ● Increases throughput
      ● Can reduce latency for concurrent and independent tasks (a software sketch follows below)
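
To make the review concrete, here is a minimal software analogue of a pipeline (illustrative, not from the deck): two worker stages connected by queues run concurrently, so a new item enters the first stage while earlier items are still in the second. The stage functions are hypothetical.

```python
# Minimal two-stage software pipeline: stages run concurrently, so item i+1
# can be in stage 1 while item i is in stage 2, raising throughput.
import threading
import queue

def stage(inq, outq, fn):
    while True:
        item = inq.get()
        if item is None:        # sentinel: propagate shutdown downstream
            outq.put(None)
            return
        outq.put(fn(item))

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=stage, args=(q1, q2, lambda x: x + 1)).start()  # stage 1
threading.Thread(target=stage, args=(q2, q3, lambda x: x * 2)).start()  # stage 2

for x in range(5):              # feed items in; they overlap inside the pipe
    q1.put(x)
q1.put(None)

while (out := q3.get()) is not None:
    print(out)                  # 2, 4, 6, 8, 10
```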

  12. Types of Parallelisation
      Nathan Lam z5113345
      ● Parallelisation
        ○ Inter-FPGA
        ○ Intra-FPGA
      ● Divide and Conquer Algorithms
        ○ Merge Sort
        ○ Paul's Algorithm
        ○ Map-reduce

  13. Parallelisation: Inter-FPGA
      Advantages
      ● Higher degree of parallelisation (distribute the problem to more FPGAs)
      ● Ability to handle larger amounts of data
      Disadvantages
      ● More challenging memory management (synchronisation)
      ● Larger overhead in coordinating the cluster of FPGAs
      ● More potential bottlenecks (local network, CPU-FPGA bus)

  14. Parallelisation: Intra-FPGA
      Advantages
      ● No overhead in managing the FPGA (data always goes through the same bus to the same FPGA)
      ● Faster on smaller scales (less time pre-processing)
      Disadvantages
      ● Limited to the computational power of a single FPGA
      ● Can still be challenging to manage memory (if datasets are too large for on-chip memory)

  15. Parallelisation: Algorithm Examples
      ● Merge sort
      ● Map-reduce

  16. Parallelisation: Merge Sort
      ● An FPGA-only approach incurs a large overhead transferring data to and from memory during the merge
      ● Pipeline the algorithm into 3 parts (sketched below):
        1. The CPU partitions the data into sub-blocks
        2. The FPGA sorts the sub-blocks of data (using quicksort)
        3. The CPU merges the sorted sub-blocks back together
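
A minimal sketch of the three-part pipeline in plain Python, with the FPGA's quicksort stage stood in by sorted() and an arbitrary block size; the function and parameter names are illustrative, not from the paper.

```python
# Software sketch of the hybrid pipeline. The FPGA quicksort stage is stood
# in by sorted(); block_size is an arbitrary assumption.
import heapq
import random

def hybrid_merge_sort(data, block_size=1024):
    # 1. "CPU": partition the data into sub-blocks
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # 2. "FPGA": sort each sub-block (quicksort on the real hardware)
    sorted_blocks = [sorted(b) for b in blocks]
    # 3. "CPU": k-way merge of the sorted sub-blocks back together
    return list(heapq.merge(*sorted_blocks))

xs = [random.randint(0, 999) for _ in range(5000)]
assert hybrid_merge_sort(xs) == sorted(xs)
```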

  17. Parallelisation: Merge Sort
      ● The hybrid solution has the highest throughput
      ● On smaller datasets, a higher share of execution time is spent on the FPGA
      ● On larger datasets, a higher percentage of execution time is spent on the CPU

  18. Parallelisation: Map-Reduce [5]
      ● Hadoop Map-Reduce algorithm
      ● One use case is the k-means algorithm, an unsupervised machine learning model (sketched below)
      ● Uses clusters of computers with FPGAs
      ● Each node in the cluster has its own CPU and FPGA resources connected via PCIe
      Y. Choi and H. K. So, "Map-reduce processing of k-means algorithm with FPGA-accelerated computer cluster," 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors, Zurich, 2014, pp. 9-16.
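
For orientation, a pure-software sketch of one k-means iteration expressed as map and reduce phases; in [5] it is the mapper's distance computations that the FPGAs accelerate. All names and the data layout here are illustrative assumptions.

```python
# One k-means iteration as map/reduce (pure software; in [5] the mapper's
# distance computations run on each node's FPGA).
import math
from collections import defaultdict

def mapper(point, centroids):
    # Emit (id of the nearest centroid, (point, count=1))
    cid = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
    return cid, (point, 1)

def reducer(values):
    # Average all points assigned to one centroid to get its new position
    dims = len(values[0][0])
    sums, n = [0.0] * dims, 0
    for point, count in values:
        n += count
        for d in range(dims):
            sums[d] += point[d]
    return [s / n for s in sums]

def kmeans_step(points, centroids):
    groups = defaultdict(list)
    for p in points:                                  # map phase
        cid, val = mapper(p, centroids)
        groups[cid].append(val)
    return {cid: reducer(vals) for cid, vals in groups.items()}  # reduce phase
```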

  19. Parallelisation: Map-Reduce [5]

  20. Parallelisation: Map-Reduce [5]
      ● Measurements taken with 3 compute nodes and 1 head node
      ● Up to 20.6x speedup compared to the software version on Hadoop
      ● Up to 16.3x speedup compared to the Mahout version on Hadoop
      ● The same number of mappers spread across 3 FPGAs consistently outperforms 1 FPGA (due to the reduced bandwidth requirement for each node)

  21. Parallelisation: Map-Reduce [5]
      ● The same number of mappers across 3 FPGAs consistently outperforms 1 FPGA
      ● Attributed to the reduced bandwidth requirement for each node

  22. Memory Access
      Stephen Webb z5075569
      ● Overview of the issues with memory access in LSC
      ● Paper 1
        ○ Problem Space
        ○ Solution
      ● Paper 2
        ○ Problem Space
        ○ Paul's Algorithm
        ○ Solution
      ● Other Paper

  23. Overview of the issues with memory access in LSC
      Need for a large amount of memory:
      ● LSC is all about large data
      ● Deals with tasks whose datasets are in the GB range
      ● Unfeasible to store it all in FPGA memory (usually 100s of kB)
      Some requirements for LSC:
      ● Need to be able to fetch the data at reasonable bandwidth
      ● Fast random reads and writes
      ● Multiple parallel reads and writes
      ● The system bus is slow in comparison
      In Paper 2, a direct conversion of the algorithm to hardware, without regard to memory bandwidth, saw a 33x slowdown compared to a pure software solution [2].

  24. Paper 1: High Throughput Large Scale Sorting on a CPU-FPGA Heterogeneous Platform [1]

  25. Paper 1: Problem Space
      ● CPU-FPGA heterogeneous platform
      ● Regular multicore CPU
      ● Coherent memory interfaces (both FPGA and CPU)
      ● High-speed interconnection
      ● DRAM is accessed through cache lines

  26. Paper 1: Solution
      Shared memory: use the CPU's last-level cache as a buffer (sketched below)
      ● Ensure the block size lines up with the cache lines
      ● Fetch a block's data from the CPU cache line
      ● Sort the block
      ● Write the block back to the cache line
      This technique has seen about a 2-3x improvement compared to an FPGA-only implementation [2]
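
A rough software sketch of the buffering loop, assuming a 64-byte cache line and 32-bit keys; the line size, key width, block size, and function name are all illustrative assumptions, not details from the paper.

```python
# Sketch of the shared-memory buffering loop: blocks are sized to whole
# cache lines so every fetch/writeback through the LLC moves full lines.
CACHE_LINE_BYTES = 64
KEY_BYTES = 4
KEYS_PER_LINE = CACHE_LINE_BYTES // KEY_BYTES       # 16 keys per line

def sort_blocks_via_llc(data, lines_per_block=256):
    block_keys = lines_per_block * KEYS_PER_LINE    # block = whole lines
    for i in range(0, len(data), block_keys):
        block = data[i:i + block_keys]   # fetch the block via the CPU cache
        block.sort()                     # the FPGA sorts it in the real design
        data[i:i + block_keys] = block   # write the block back to its lines
```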

  27. Paper 2: An Efficient O(1) Priority Queue for Large FPGA-Based Discrete Event Simulations of Molecular Dynamics [2]

  28. Paper 2: Problem Space
      ● Separate host computer and FPGA
      ● SMP configuration
      ● PCI will work
      ● Several 32-bit SRAM banks accessed independently and in parallel

  29. Paper 2: Paul's Algorithm
      Algorithm
      ● Designed to speed up priority queues
      ● Discretely break up the time series data into different segments for different ranges of Δt
      ● Each segment is stored as an unsorted list
      ● Only sort a segment just before it is about to be used
      Data Structure (sketched below)
      ● An ordered list of all segments
      ● Each segment is an unordered list of events
      ● Segments are limited to a finite size
      ● Both the ordered list and the segments should be stored as linked lists so they can stay dynamic
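
A compact sketch of the structure, assuming a fixed segment width dt and (time, event) pairs; the real structure uses linked lists of bounded segments as described above, and the class and method names here are hypothetical.

```python
# Sketch of Paul's priority queue: events are bucketed into time segments
# kept unsorted; only the segment about to be consumed is sorted. Segment
# width dt and the (time, event) tuple format are illustrative assumptions.
class PaulsQueue:
    def __init__(self, dt):
        self.dt = dt
        self.segments = {}          # segment index -> unsorted list of events

    def insert(self, time, event):
        # O(1): append to the unsorted segment covering this time range
        self.segments.setdefault(int(time // self.dt), []).append((time, event))

    def pop_segment(self):
        # Sort only the earliest segment, just before it is used
        idx = min(self.segments)
        return sorted(self.segments.pop(idx))
```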

  30. Paper 2: Solution
      Data Structure Constraints
      ● Convert the linked lists into discrete arrays
      ● Limit the max size of both the segments and the ordered list
      ● Keep the size of the segments as small as possible (around 20)
      Drawbacks
      ● Not flexible and cannot adapt well to change
      ● Need to know the segment size beforehand
      This allows for prefetching and caching within the queue.

  31. Paper 2: Solution
      FIFO Pre-Fetch / Round-Robin Writeback (sketched below)
      During every cycle:
      ● Fetch the next segment at the next address of the ordered list from off-chip memory
      ● Store the last fetched segment into SRAM
      ● Retrieve the oldest segment in SRAM and send it to the queue sorter
      ● Retrieve the sorted queue from the queue sorter and write it to SRAM
      ● Write back a sorted queue from SRAM to off-chip memory
      Trading off latency for bandwidth
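
One way to picture the schedule in software (the buffer depths, the sorter, and the memory model are all illustrative assumptions): each loop iteration plays the role of one cycle, overlapping a fetch, a sort, and a writeback on different segments.

```python
# Sketch of the per-cycle round-robin schedule: while one segment is being
# sorted, the next is prefetched and an earlier result is written back,
# trading per-segment latency for sustained bandwidth. All names assumed.
from collections import deque

def run_schedule(ordered_list, offchip, sort_segment):
    fetched = deque()                  # SRAM buffer of (addr, raw segment)
    sorted_q = deque()                 # SRAM buffer of (addr, sorted segment)
    for addr in ordered_list:          # one iteration ~ one cycle
        fetched.append((addr, offchip[addr]))        # prefetch next segment
        if len(fetched) > 1:           # sort the oldest fetched segment
            a, seg = fetched.popleft()
            sorted_q.append((a, sort_segment(seg)))
        if len(sorted_q) > 1:          # write an earlier result back
            a, seg = sorted_q.popleft()
            offchip[a] = seg
    for a, seg in fetched:             # drain the pipeline at the end
        sorted_q.append((a, sort_segment(seg)))
    for a, seg in sorted_q:
        offchip[a] = seg

# Usage: run_schedule(range(4), {i: [3, 1, 2] for i in range(4)}, sorted)
```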
