

SLIDE 1

Outline: CBEA, The benchmarks, Getting the full bandwidth, Bandwidth graphs, Results analysis, What’s missing?

Benchmarking the Memory Interface Controller bus of the Cell processor

Nathalie Casati

EPFL

December 18, 2007

SLIDE 6

Cell Broadband Engine Architecture

The Elements

The main elements of the processor are:

  • The Power Processing Element (PPE)
  • The 6 (usable) Synergistic Processing Elements (SPEs)
  • The Element Interconnect Bus (EIB)
  • The I/O Controller (XIO)
  • The Memory Interface Controller (MIC)

SLIDE 10

Cell Broadband Engine Architecture

Other important facts

  • Each SPE has a 256 KB local store and no cache
  • Each element is connected to the EIB with 25.6 GB/s of bandwidth in each direction (incoming / outgoing)
  • The 256 MB main memory is external Rambus XDR memory attached over two channels
  • Each data transfer between an SPE and main memory is an explicit DMA operation of up to 16 KB
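
Because a single DMA operation moves at most 16 KB, a larger buffer has to be issued as several requests. The following is a minimal Python sketch of that chunking, not Cell SDK code: `dma_chunks` and `MAX_DMA_BYTES` are illustrative names, and the model ignores alignment and any other transfer-size rules of the real hardware.

```python
MAX_DMA_BYTES = 16 * 1024  # per-operation limit quoted on the slide


def dma_chunks(total_bytes, max_bytes=MAX_DMA_BYTES):
    """Return (offset, size) pairs that cover total_bytes in chunks of at most max_bytes."""
    if total_bytes < 0 or max_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and max_bytes must be > 0")
    chunks = []
    offset = 0
    while offset < total_bytes:
        # The last chunk may be shorter than max_bytes.
        size = min(max_bytes, total_bytes - offset)
        chunks.append((offset, size))
        offset += size
    return chunks
```

Under these assumptions, a 40 KB buffer splits into two 16 KB requests followed by one 8 KB request.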

SLIDE 11

The main question

What happens if n SPEs issue DMA get operations at the same time, while the MIC provides only 25.6 GB/s of bandwidth?
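
The arithmetic behind the question can be sketched as follows. Each SPE's EIB link could carry 25.6 GB/s on its own, but all gets share one 25.6 GB/s MIC, so the best case per SPE shrinks as n grows. This is an idealized upper bound that assumes perfectly even sharing, not a prediction of measured results; the function name is hypothetical.

```python
MIC_GBPS = 25.6   # MIC bandwidth from the slides
LINK_GBPS = 25.6  # per-element EIB link bandwidth from the slides


def per_spe_upper_bound(n_spes, mic_gbps=MIC_GBPS, link_gbps=LINK_GBPS):
    """Ideal per-SPE bandwidth in GB/s when n_spes share the MIC evenly."""
    if n_spes < 1:
        raise ValueError("n_spes must be at least 1")
    # An SPE is limited by its own link and by its share of the MIC.
    return min(link_gbps, mic_gbps / n_spes)


for n in range(1, 7):
    print(n, "SPEs:", round(per_spe_upper_bound(n), 2), "GB/s each at best")
```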


SLIDE 14

How to get the full bandwidth?

  • Sequential accesses that read or write equal amounts of data to all memory banks
  • Both the effective address and the local store address are 128-byte aligned
  • Other factors, such as avoiding TLB misses
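
The alignment condition can be checked with simple modular arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and addresses are modelled as plain integers rather than real effective or local store addresses.

```python
ALIGNMENT = 128  # bytes, as stated on the slide


def is_aligned(addr, alignment=ALIGNMENT):
    """True if addr is a multiple of alignment."""
    return addr % alignment == 0


def align_up(addr, alignment=ALIGNMENT):
    """Round addr up to the next multiple of alignment."""
    return (addr + alignment - 1) // alignment * alignment


def transfer_is_well_aligned(effective_addr, local_store_addr):
    """Both ends of the transfer must be 128-byte aligned."""
    return is_aligned(effective_addr) and is_aligned(local_store_addr)
```

For example, `transfer_is_well_aligned(0x1000, 0x80)` holds, while an effective address of `0x1010` fails the check and `align_up(0x1010)` gives the next usable address.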
SLIDE 15

Bandwidth graphs (Part 1)

[Figure: four panels, one each for 1, 2, 3 and 4 requests. x-axis: read DMA transfer size, 16 B to 16 KB. y-axis: bandwidth [GB/s], 0.1 to 25.6. One curve per SPE count (1 to 6 SPEs), plus a "Max bandwidth" line.]

SLIDE 16

Bandwidth graphs (Part 2)

[Figure: four panels, one each for 5, 6, 7 and 8 requests. x-axis: read DMA transfer size, 16 B to 16 KB. y-axis: bandwidth [GB/s], 0.1 to 25.6. One curve per SPE count (1 to 6 SPEs), plus a "Max bandwidth" line.]
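
Each point on such a graph is a bandwidth value, that is, bytes moved divided by elapsed time. A minimal sketch of that calculation follows. It assumes every SPE issues the same number of equally sized requests and that 1 GB means 10^9 bytes; the function name is hypothetical, and the slides do not say how the original benchmark measured time.

```python
def bandwidth_gb_per_s(n_spes, requests_per_spe, transfer_bytes, elapsed_seconds):
    """Aggregate bandwidth in GB/s (1 GB = 1e9 bytes) for one benchmark run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    # Total payload moved by all SPEs during the run.
    total_bytes = n_spes * requests_per_spe * transfer_bytes
    return total_bytes / elapsed_seconds / 1e9
```

As a made-up illustration, not a measurement: one SPE moving a single 16 KB block in one microsecond would correspond to about 16.4 GB/s.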

SLIDE 19

Results analysis and conclusion

  • One SPE never gets the whole bandwidth (only about half) → favours parallel accesses
  • The EIB is optimized for larger transfers
  • It is a good idea to use multibuffering
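
Why multibuffering helps can be seen in an idealized timing model: with one buffer, each block's transfer and computation run back to back, whereas with two buffers the transfer of the next block overlaps the computation on the current one. The sketch below assumes fixed per-block times in arbitrary units and perfect overlap; the function names are hypothetical and the model is not taken from the presentation.

```python
def single_buffer_time(n_blocks, t_dma, t_compute):
    """Total time when each block is fetched and then processed, with no overlap."""
    return n_blocks * (t_dma + t_compute)


def double_buffer_time(n_blocks, t_dma, t_compute):
    """Total time when the fetch of block i+1 overlaps the processing of block i."""
    if n_blocks == 0:
        return 0
    # The first fetch and the last computation cannot overlap with anything;
    # in between, each step costs the slower of the two activities.
    return t_dma + (n_blocks - 1) * max(t_dma, t_compute) + t_compute
```

With 4 blocks, a transfer time of 2 and a compute time of 3, the single-buffer model gives 20 time units and the double-buffer model gives 14.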
SLIDE 20

What’s missing?

The same benchmark with a DMA put operation