SLIDE 1

Large-scale Computation

Nathan Lam z5113345 Sophie Calland z5161776 Stephen Webb z5075569

SLIDE 2

Contents

  • Introduction + Benefits
    ○ Sophie Calland
  • Parallelisation
    ○ Nathan Lam
  • Memory Access
    ○ Stephen Webb

SLIDE 3

Introduction + Benefits

Sophie Calland z5161776

  • Large-scale computation
    ○ Benefits of FPGAs
    ○ Disadvantages of FPGAs
    ○ Hybrid approaches
  • Example: Supercomputers
    ○ CPU-based
    ○ Hybrid approach
  • Pipelining review

SLIDE 4

What is Large Scale Computation?

  • A tool to speed up calculation of a complex problem, or to process a large amount of data[4]
    ○ Useful in science, technology, finance, space/defence, academia, etc.
  • Solutions need to be large scale + highly performant + cheap!
    ○ The amount of data requiring processing is only getting larger
  • Modern FPGAs can help to meet many of these requirements
    ○ Provide acceleration for specific functions
    ○ Reprogrammable = cheap, flexible
  • Hybrid CPU/FPGAs make large-scale computation accessible to embedded systems

SLIDE 5

FPGA Benefits for Large Scale Computation

  • Smaller devices require the ability to perform complex calculations fast
  • Low latency
    ○ Deterministic, very specialised
      ■ GPU = must communicate via a CPU and buses
      ■ CPU = 50 microseconds is good
      ■ FPGA = at or below 1 microsecond
    ○ No operating system to go through
    ○ Useful if you want quick calculation + response

Very cool: F-35s contain FPGAs. Picture from https://nationalinterest.org/blog/the-buzz/lockheed-martins-f-35-how-the-joint-strike-fighter-becoming-24259

SLIDE 6

FPGA Benefits (cont)

  • Data connections
    ○ Data sources can be connected directly to the chip
      ■ No intermediary bus or OS as required by CPU/GPU designs
    ○ Potential for much higher bandwidth (and lower latency)
  • Reprogrammability
    ○ Remove/fix bugs
    ○ Change accelerators per application
    ○ Reusable

SLIDE 7

FPGA Disadvantages

  • Memory locality and sharing can be more complex
    ○ FPGA chips alone don't have a lot of on-board memory
    ○ Larger data sets = might not be worth it alone
  • Engineering effort is greater
    ○ Cost, time
    ○ Might not be worth it
  • Power increases relative to specialised ASICs
    ○ Bitcoin mining

Bitcoin mining: specialised ASICs are better than the previously used FPGAs[7]

SLIDE 8

Hybrid CPU/FPGA Approaches

  • Can address some pitfalls of CPU or FPGA approaches alone
    ○ Latency-sensitive tasks and data processing delegated to the FPGA
    ○ CPU has better memory locality
    ○ Best of both worlds? Kind of ...
      ■ Power and engineering effort are still concerns
  • Embedded systems with high data throughput and lower space requirements can benefit
    ○ Examples: space computers, smart cameras

CHREC Space Processor v1.0 board[10]

SLIDE 9

Example - High Performance Computing

  • Older supercomputers = massively parallel CPU-based architecture
  • Nodes communicate via an interconnected bus
  • High memory throughput
  • Example: IBM's Blue Gene[9]

SLIDE 10

Example - High Performance Computing

  • Modern supercomputers provide a hybrid approach
  • Allows for hardware acceleration via FPGA
    ○ Faster compute time
  • Example: Cygnus supercomputer[8] (pictured), OpenPOWER Foundation

SLIDE 11

Pipelining Review

No pipelining = wasted time + slow

  • Problems that can be broken up into independent tasks can benefit

Pipelining = faster

  • Increases throughput
  • Can reduce latency for concurrent and independent tasks

Figures from [6]
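The throughput gain can be sketched with a toy timing model. This assumes equal-length stages (an idealisation; real stages rarely balance), and the function names are illustrative, not from [6]:

```python
def sequential_time(n_tasks, n_stages, stage_time):
    # Without pipelining, each task occupies the whole pipeline
    # before the next one may start.
    return n_tasks * n_stages * stage_time

def pipelined_time(n_tasks, n_stages, stage_time):
    # The first task takes n_stages steps to fill the pipe; after
    # that, one task completes every stage_time.
    return (n_stages + n_tasks - 1) * stage_time

print(sequential_time(100, 4, 1))  # 400
print(pipelined_time(100, 4, 1))   # 103
```

For 100 tasks through a 4-stage pipeline, throughput approaches one task per stage-time instead of one per four, which is the speedup the figures illustrate.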

SLIDE 12

Parallelisation

Nathan Lam z5113345

  • Types of Parallelisation
    ○ Inter-FPGA
    ○ Intra-FPGA
  • Divide and Conquer Algorithms
    ○ Merge Sort
    ○ Paul's Algorithm
    ○ Map-reduce

SLIDE 13

Parallelisation: Inter-FPGA

Advantages

  • Higher degree of parallelisation (distribute the problem to more FPGAs)
  • Ability to handle larger amounts of data

Disadvantages

  • More challenging memory management (synchronisation)
  • Larger overhead in coordinating the cluster of FPGAs
  • More potential bottlenecks (local network, CPU-FPGA bus)

SLIDE 14

Parallelisation: Intra-FPGA

Advantages

  • No overhead in managing the FPGA (data always goes through the same bus to the same FPGA)
  • Faster on smaller scales (less time pre-processing)

Disadvantages

  • Limited to the computational power of a single FPGA
  • Can still be challenging to manage memory (if datasets are too large for on-chip memory)

SLIDE 15

Parallelisation: Algorithm Examples

Merge sort


Map-reduce

SLIDE 16

Parallelisation: Merge Sort

  • Using the FPGA alone causes large overhead when transferring data to and from memory when merging
  • Pipeline the algorithm into 3 parts:
    1. CPU partitions the data into sub-blocks
    2. FPGA sorts the sub-blocks of data (using quick-sort)
    3. CPU merges the sorted sub-blocks back together
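The three-stage split can be sketched in Python, with `sorted()` standing in for the FPGA quick-sort stage. This is a toy model of the dataflow, not the paper's implementation, and the function name is illustrative:

```python
import heapq

def hybrid_merge_sort(data, block_size=4):
    # Stage 1 (CPU): partition the input into fixed-size sub-blocks.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # Stage 2 (FPGA in the paper; plain sorted() stands in here):
    # each sub-block is sorted independently, so this stage parallelises.
    sorted_blocks = [sorted(b) for b in blocks]
    # Stage 3 (CPU): k-way merge of the sorted sub-blocks.
    return list(heapq.merge(*sorted_blocks))

print(hybrid_merge_sort([5, 1, 9, 3, 7, 2, 8, 6]))  # [1, 2, 3, 5, 6, 7, 8, 9]
```

Because stage 2 only ever sees one sub-block at a time, its working set fits in on-chip memory, which is why the heavy lifting is placed there.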

SLIDE 17

Parallelisation: Merge Sort

  • The hybrid solution has the highest throughput
  • For smaller datasets, a higher share of execution time is spent on the FPGA
  • Larger datasets have a higher percentage of execution time on the CPU

SLIDE 18

Parallelisation: Map-Reduce [5]

  • Hadoop Map-Reduce algorithm
  • One use case is the k-means algorithm, an unsupervised machine learning model
  • Uses clusters of computers with FPGAs
  • Each node in the cluster has its own CPU and FPGA resources connected by PCIe

Y. Choi and H. K. So, "Map-reduce processing of k-means algorithm with FPGA-accelerated computer cluster," 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors, Zurich, 2014, pp. 9-16.
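One map-reduce iteration of k-means can be sketched as follows for 1-D points. The map (nearest-centroid assignment) step is the part that the FPGA mappers accelerate in [5]; everything here, including the function names, is an illustrative software model:

```python
from collections import defaultdict

def kmeans_iteration(points, centroids):
    # Map: assign each point to its nearest centroid (the step
    # offloaded to FPGA mappers on each cluster node in [5]).
    def nearest(p):
        return min(range(len(centroids)), key=lambda i: (p - centroids[i]) ** 2)
    pairs = [(nearest(p), p) for p in points]

    # Shuffle: group assigned points by centroid index.
    groups = defaultdict(list)
    for k, p in pairs:
        groups[k].append(p)

    # Reduce: recompute each centroid as the mean of its group.
    # (Empty clusters are simply dropped in this sketch.)
    return [sum(g) / len(g) for _, g in sorted(groups.items())]

print(kmeans_iteration([1.0, 2.0, 9.0, 10.0], [0.0, 8.0]))  # [1.5, 9.5]
```

Running the iteration repeatedly until the centroids stop moving gives the full unsupervised clustering; map-reduce makes each iteration distribute naturally across nodes.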

SLIDE 19

Parallelisation: Map-Reduce [5]


SLIDE 20

Parallelisation: Map-Reduce [5]

  • Measurements taken with 3 compute nodes and 1 head node
  • Up to 20.6x speedup compared to the software version on Hadoop
  • Up to 16.3x speedup compared to the Mahout version on Hadoop
  • The same number of mappers spread across 3 FPGAs consistently outperforms 1 FPGA (due to the reduced bandwidth requirement for each node)

SLIDE 21

Parallelisation: Map-Reduce [5]

  • The same number of mappers spread across 3 FPGAs consistently outperforms 1 FPGA
  • Attributed to the reduced bandwidth requirement for each node

SLIDE 22

Memory Access

Stephen Webb z5075569

  • Overview of the issues with memory access in LSC
  • Paper 1
    ○ Problem Space
    ○ Solution
  • Paper 2
    ○ Problem Space
    ○ Paul's Algorithm
    ○ Solution
  • Other Paper

SLIDE 23

Overview of the issues with memory access in LSC

Need for a large amount of memory:

  • LSC is all about large data
  • Dealing with tasks that have datasets in the GB range
  • Unfeasible to store it all in FPGA memory (usually 100s of kB)
  • The system bus is slow in comparison

Some requirements for LSC:

  • Need to be able to fetch the data at a reasonable bandwidth
  • Fast random reads and writes
  • Multiple parallel reads and writes

Using a direct algorithm conversion to hardware without regard to memory bandwidth saw a slowdown of 33x compared to a pure software solution in paper 2. [2]
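A back-of-envelope calculation shows the scale of the mismatch. Both sizes and the bus bandwidth are illustrative assumptions, not figures from the papers:

```python
# Illustrative figures for the dataset / on-chip memory mismatch.
DATASET = 4 * 2**30    # 4 GiB working set (assumed)
ON_CHIP = 512 * 2**10  # 512 KiB of on-chip block RAM (assumed)
BUS_BW = 1 * 2**30     # 1 GiB/s system-bus bandwidth (assumed)

refills = DATASET // ON_CHIP       # times the on-chip memory must be refilled
stream_seconds = DATASET / BUS_BW  # time just to stream the data across the bus once

print(refills)         # 8192
print(stream_seconds)  # 4.0
```

Thousands of refills per pass over the data is why an algorithm translated to hardware without a memory-access strategy can end up far slower than software running next to DRAM.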
SLIDE 24

Paper 1


High Throughput Large Scale Sorting on a CPU-FPGA Heterogeneous Platform [1]

SLIDE 25

Paper 1: Problem Space

  • CPU-FPGA heterogeneous platform
  • Regular multicore CPU
  • Coherent memory interfaces (both FPGA and CPU)
  • High-speed interconnection
  • DRAM is accessed through cache lines

SLIDE 26

Paper 1: Solution - Shared Memory

Using the CPU last-level cache as a buffer:

  • Ensure the block lines up with the cache lines
  • Fetch the block's data from the CPU cache line
  • Sort the block
  • Write the block back to the cache line

This technique has seen about a 2-3x improvement compared to an FPGA-only implementation [2]
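The alignment idea can be sketched in Python: pick a block that is a whole number of cache lines so every fetch and write-back moves full lines. The 64-byte line size, element size, and block length are assumptions, and `list.sort()` stands in for the FPGA sorter:

```python
CACHE_LINE = 64  # bytes per cache line (typical x86; an assumption here)
ELEM = 4         # bytes per 32-bit element

def sort_in_cache_line_blocks(data, lines_per_block=16):
    # Block size in elements = a whole number of cache lines, so each
    # fetch/write-back through the shared last-level cache moves full lines.
    block = lines_per_block * CACHE_LINE // ELEM
    out = []
    for i in range(0, len(data), block):
        chunk = data[i:i + block]  # fetch the block via the cache
        chunk.sort()               # sorted on the FPGA in the real design
        out.append(chunk)          # write the sorted block back
    return out

blocks = sort_in_cache_line_blocks(list(range(1000, 0, -1)))
```

The sorted blocks would then feed the merge stage; the point of the sketch is only the block/cache-line sizing, which avoids partial-line transfers over the coherent interface.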

SLIDE 27

Paper 2


An Efficient O(1) Priority Queue for Large FPGA-Based Discrete Event Simulations of Molecular Dynamics [2]

SLIDE 28

Paper 2: Problem Space

  • Separate host computer and FPGA
  • SMP config
  • PCI will work
  • Several 32-bit SRAM banks accessed independently and in parallel

SLIDE 29

Paper 2: Paul’s Algorithm

Algorithm

  • Designed to speed up priority queues
  • Discretely break up the time-series data into different segments for different ranges of Δt
  • Each segment is stored as an unsorted list
  • Only sort a segment just before it is about to be used

Data Structure

  • Ordered list of all segments
  • Each segment is an unordered list of events
  • Segments are limited to a finite size
  • Both the ordered list and the segments should be stored as linked lists to allow them to be dynamic
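The structure can be sketched as a small Python class (names are mine, not from [2]). Enqueue is O(1) because events are only appended to their segment's unsorted list; sorting is deferred until a segment is about to be drained. The sketch assumes new events never fall into a segment that is already being drained, which holds when event times only move forward, as in a discrete event simulation:

```python
class SegmentedPriorityQueue:
    """Sketch of Paul's algorithm: events binned into fixed-width
    time segments; only the segment about to be used is sorted."""

    def __init__(self, seg_width):
        self.seg_width = seg_width
        self.segments = {}  # segment index -> unsorted [(time, event), ...]
        self.head = []      # current segment, sorted latest-first for O(1) pop

    def push(self, t, event):
        # O(1): append to the unsorted segment covering time t.
        self.segments.setdefault(int(t // self.seg_width), []).append((t, event))

    def pop(self):
        if not self.head:
            # Sort the earliest segment only just before it is used.
            k = min(self.segments)
            self.head = sorted(self.segments.pop(k), reverse=True)
        return self.head.pop()

q = SegmentedPriorityQueue(seg_width=1.0)
for t in [2.5, 0.3, 0.1, 1.7]:
    q.push(t, None)
print(q.pop()[0])  # 0.1
```

Because each segment is small and finite, the per-segment sort cost is bounded regardless of the total queue size, which is where the amortised O(1) behaviour comes from.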

SLIDE 30

Paper 2: Solution - Data Structure

Constraints

  • Convert the linked lists into discrete arrays
  • Limit the max size of both the segments and the ordered list
  • Keep the size of the segments as small as possible (around 20)

This allows for prefetching and caching within the queue.

Drawbacks

  • Not flexible and cannot adapt well to change
  • Need to know the segment size beforehand
SLIDE 31

Paper 2: Solution - FIFO Pre-Fetch / Round-Robin Writeback

During every cycle:

  • Fetch the next segment from the next address of the ordered list in off-chip memory
  • Store the last fetched segment into SRAM
  • Retrieve the oldest segment in SRAM and send it to the queue sorter
  • Retrieve the sorted queue from the queue sorter and write it to SRAM
  • Write back a sorted queue from SRAM to off-chip memory

Trading off latency for bandwidth
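The per-cycle overlap can be sketched as a software pipeline in which every iteration of the loop is one cycle: deques stand in for the SRAM buffers, `sorted()` for the queue sorter, and the function name is illustrative. This is a toy model of the dataflow, not the paper's hardware:

```python
from collections import deque

def process_segments(offchip_segments):
    """One pipeline cycle per loop iteration: prefetch a segment into
    SRAM, sort the oldest buffered segment, write a sorted one back."""
    pending = deque(offchip_segments)  # segments still in off-chip memory
    sram_in = deque()                  # fetched, unsorted segments in SRAM
    sram_out = deque()                 # sorted segments awaiting write-back
    written_back = []                  # "off-chip" destination

    while pending or sram_in or sram_out:
        if sram_out:  # write-back stage: SRAM -> off-chip memory
            written_back.append(sram_out.popleft())
        if sram_in:   # sort stage: queue sorter result back into SRAM
            sram_out.append(sorted(sram_in.popleft()))
        if pending:   # prefetch stage: off-chip memory -> SRAM
            sram_in.append(pending.popleft())
    return written_back

print(process_segments([[3, 1], [2, 0]]))  # [[1, 3], [0, 2]]
```

Each segment takes three cycles end to end (higher latency), but once the pipe is full, one segment is fetched, one sorted, and one written back every cycle, which is the latency-for-bandwidth trade the slide names.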

SLIDE 32

Other Paper - Memory Accelerated

In-line compression: [4]

  • Larger bandwidth
  • Smaller latency
  • Less required memory

SLIDE 33

Thanks for listening! Any Questions?


SLIDE 34

Questions?

Large Scale Computing

  • Can you think of industries or systems that would benefit from FPGAs, either in a hybrid configuration or alone? Why?

Parallelisation

  • What algorithms could NOT be parallelised?
  • Are there any other examples of parallelisation with FPGAs that people are aware of?

Memory Access

  • What's the biggest drawback with heavily constrained input?

SLIDE 35

References

[1] Zhang, C., Chen, R., & Prasanna, V. (2016, May). High throughput large scale sorting on a CPU-FPGA heterogeneous platform. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (pp. 148-155). IEEE.
[2] Herbordt, M. C., Kosie, F., & Model, J. (2008, April). An efficient O(1) priority queue for large FPGA-based discrete event simulations of molecular dynamics. In 2008 16th International Symposium on Field-Programmable Custom Computing Machines (pp. 248-257). IEEE.
[3] Fang, J., Mulder, Y. T. B., Hidders, J. et al. (2020). In-memory database acceleration on FPGAs: a survey. The VLDB Journal, 29, 33-59. https://doi.org/10.1007/s00778-019-00581-w
[4] Chew, W. C., & Jiang, L. (2013). Overview of Large-Scale Computing: The Past, the Present, and the Future. Proceedings of the IEEE, 101(2), 227-241.
[5] Choi, Y., & So, H. K. (2014). Map-reduce processing of k-means algorithm with FPGA-accelerated computer cluster. In 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors (pp. 9-16). Zurich.
[6] Stanford - Pipelining - https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/pipelining/index.html
[7] Image used: https://www.buybitcoinworldwide.com/mining/hardware/
[8] Overview of Cygnus: a new supercomputer at CCS - https://www.ccs.tsukuba.ac.jp/wp-content/uploads/sites/14/2018/12/About-Cygnus.pdf
[9] Image used: https://www.researchgate.net/figure/Blue-Gene-Q-packaging-hierarchy_fig3_281396398
[10] Rudolph, D., Wilson, C., Stewart, J., Gauvin, P., George, A., Lam, H., ... & Stoddard, A. (2014). CSP: A multifaceted hybrid architecture for space computing.