Platform IO DMA Transaction Acceleration (ICS/CACHES)



SLIDE 1

Platform IO DMA Transaction Acceleration
ICS/CACHES
Steen Larsen (steen.larsen@intel.com)
Ben Lee (benl@eecs.oregonstate.edu)
June 4, 2011

SLIDE 2

Outline

  • Introduction & Motivation
  • Background
  • Proposal
  • Experiments & Analysis
  • Related & Future work
SLIDE 3

10,000-foot view of IO

IO growth is not matching CPU and memory bandwidth growth.

  • Multi-core processors (CMP, SMT)
  • NUMA
SLIDE 4

Typical platform configuration and IO interface

SLIDE 5

Legacy TX
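
The slide's figure is not reproduced in this transcript. As a hedged sketch (class and field names are illustrative assumptions, not taken from the deck), the legacy descriptor-based transmit path works roughly like this: the driver writes a descriptor into a ring in host memory and rings a doorbell, and the NIC then issues two DMA reads across the PCIe link per packet, one for the descriptor and one for the payload:

```python
# Simulation sketch of a legacy descriptor-based NIC transmit path.
# Names (TxDescriptor, LegacyNicTx, pcie_reads) are illustrative
# assumptions, not identifiers from the slides.

class TxDescriptor:
    def __init__(self, addr, length):
        self.addr = addr      # host-memory address of the payload
        self.length = length  # payload length in bytes

class LegacyNicTx:
    def __init__(self, host_memory):
        self.host_memory = host_memory  # dict: addr -> bytes
        self.ring = []                  # descriptor ring in host memory
        self.pcie_reads = 0             # DMA reads crossing the PCIe link
        self.wire = []                  # frames "sent" on the wire

    def driver_post(self, addr, length):
        # CPU builds a descriptor and (conceptually) rings the doorbell.
        self.ring.append(TxDescriptor(addr, length))

    def device_poll(self):
        # NIC fetches the descriptor over PCIe, then the payload over PCIe.
        while self.ring:
            desc = self.ring.pop(0)
            self.pcie_reads += 1  # descriptor fetch
            payload = self.host_memory[desc.addr][:desc.length]
            self.pcie_reads += 1  # payload fetch
            self.wire.append(payload)

mem = {0x1000: b"hello"}
nic = LegacyNicTx(mem)
nic.driver_post(0x1000, 5)
nic.device_poll()
print(nic.pcie_reads)  # two PCIe reads per packet in this model
```

The point of the sketch is the transaction count: every packet costs a descriptor fetch in addition to the payload fetch, which is the overhead the proposal targets.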

SLIDE 6

Legacy RX

SLIDE 7

Critical path latency (10GbE 64B)

SLIDE 8

IO transmit breakdown (10GbE 64B)

SLIDE 9

PCIe bandwidth utilization

SLIDE 10

Basic proposal claims

| Factor | Measurement unit | Descriptor DMA | iDMA | Estimated improvement | Comment/justification |
|---|---|---|---|---|---|
| Latency | Microseconds to send a TCP/IP message between two systems | 8.8 | 7.38 | 16% | Descriptors are no longer latency critical |
| Bandwidth-per-pin | Gbps per serial lane link | 2.5 | 2.67 | 17% | Descriptors no longer consume chip-to-chip bandwidth |
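
As a quick sanity check, the latency, power-efficiency, and quality-of-service percentages in the claims tables follow directly from their before/after values:

```python
# Check of the improvement percentages reported in the claims tables.
def improvement(before, after):
    """Percentage reduction from 'before' to 'after', rounded."""
    return round(100 * (before - after) / before)

print(improvement(8.8, 7.38))  # latency: 8.8 us -> 7.38 us is a 16% reduction
print(improvement(100, 29))    # normalized core power: 100% -> 29% is 71%
print(improvement(600, 50))    # QoS control latency: 600 ns -> 50 ns is 92%
```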

SLIDE 11

Proposed TX
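
Again the figure is lost in this transcript. Assuming the core idea stated in the claims table (descriptors no longer consume chip-to-chip bandwidth), a contrasting sketch of the proposed path would look like this: the integrated DMA engine sits near the memory controller, reads transmit requests directly from a software queue in memory, and only payload data crosses the IO link:

```python
# Simulation sketch of the proposed iDMA transmit path. The structure is
# an assumption drawn from the deck's claims: descriptor bookkeeping
# stays on-die, so only the payload crosses the chip-to-chip link.

class IdmaTx:
    def __init__(self, host_memory):
        self.host_memory = host_memory  # dict: addr -> bytes
        self.queue = []                 # software queue read by the engine
        self.link_reads = 0             # reads crossing the chip-to-chip link
        self.wire = []                  # frames "sent" on the wire

    def enqueue(self, addr, length):
        # CPU hands (addr, length) to the on-die engine; no descriptor
        # ever traverses PCIe in this model.
        self.queue.append((addr, length))

    def engine_run(self):
        while self.queue:
            addr, length = self.queue.pop(0)
            payload = self.host_memory[addr][:length]
            self.link_reads += 1  # payload only
            self.wire.append(payload)

mem = {0x2000: b"hello"}
idma = IdmaTx(mem)
idma.enqueue(0x2000, 5)
idma.engine_run()
print(idma.link_reads)  # one link read per packet, vs. two in the legacy path
```

Halving the per-packet transaction count is what underlies the deck's latency and bandwidth-per-pin claims.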

SLIDE 12

Proposed RX

SLIDE 13

iDMA internals

SLIDE 14

Related work

  • Sun Niagara2: memory-coherent IO

SLIDE 15

| Factor | Measurement unit | Descriptor DMA | iDMA | Estimated improvement | Comment/justification |
|---|---|---|---|---|---|
| Latency | Microseconds to send a TCP/IP message between two systems | 8.8 | 7.38 | 16% | Descriptors are no longer latency critical |
| Bandwidth-per-pin | Gbps per serial lane link | 2.5 | 2.67 | 17% | Descriptors no longer consume chip-to-chip bandwidth |
| Bandwidth scalability | Not quantifiable | n/a | n/a | n/a | Reduced silicon area and power |
| Power efficiency | Normalized core power (maximum) | 100% | 29% | 71% | Power reduction due to more efficient core allocation of IO |
| Quality of service | Nanoseconds to control connection priority from software perspective | 600 | 50 | 92% | Round-trip latency to queuing control reduced from PCIe to system memory |
| Multiple IO complexity | Die cost reduction | 100% | <50% | >50% | Silicon, power regulation, and cooling cost reduction by consolidating multiple IO controllers into a single iDMA instance |
| Security | n/a | n/a | n/a | n/a | Not quantifiable |

SLIDE 16

Thank you! Questions?