SLIDE 1

BigStation: Enable Scalable Real-time Signal Processing in Large MU-MIMO Systems

Qing Yang  Xiaoxiao Li § Hongyi Yao¶ Ji Fang ‡ Kun Tan † Wenjun Hu † Jiansong Zhang† Yongguang Zhang †

†Microsoft Research Asia, Beijing, China

 MSRA and CUHK, Hong Kong § MSRA and Tsinghua University, Beijing, China ¶ MSRA and USTC, He Fei, An Hui, China ‡ MSRA and BJTU, Beijing, China

SLIDE 2

Motivation

  • Demand for more wireless capacity

– Proliferation of mobile devices: wireless access is primary
– Data-intensive applications: video, tele-presence
– “Amount of net traffic carried on wireless will exceed the amount of wired traffic by 2015” (source: Cisco VNI 2011-2016)

SIGCOMM 2013, Hong Kong, Aug 2013

SLIDE 3

Motivation

  • Demand for more wireless capacity

– Proliferation of mobile devices: wireless access is primary
– Data-intensive applications: video, tele-presence
– “Amount of net traffic carried on wireless will exceed the amount of wired traffic by 2015” (source: Cisco VNI 2011-2016)


Can we engineer the next wireless network to match the existing wired network: giga-bit wireless throughput to every user?

SLIDE 4

How to Gain More Wireless Capacity

  • More spectrum (DSA)

– Spectrum is a scarce, shared resource, and there is a limit

  • Spectrum reuse (micro cell, pico cell, …)

– Existing cells are already small (like Wi-Fi)
– Increased deployment and management complexity

  • Spatial multiplexing (MU-MIMO)

– More promising

SLIDE 5

Background: MU-MIMO

  • Transmit to/Receive from multiple mobile stations


Access Point (AP)

Joint Signal Processing

m AP antennas; n total client antennas (mobiles)

Uplink: Y = HS; zero-forcing detection: Ŝ = (H*H)^-1 H* Y
Downlink: zero-forcing precoding: T = H*(HH*)^-1 S, so that Y = HT = S

  • In theory, capacity scales linearly with the # of AP antennas
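The per-subcarrier zero-forcing operations above can be sketched in a few lines of plain Python (the 2x2 channel and symbol values are made up for illustration; a real AP works with m x n matrices per subcarrier):

```python
# Toy zero-forcing uplink detection for one subcarrier.

def conj_transpose(M):
    """Conjugate transpose of a matrix given as a list of row lists."""
    rows, cols = len(M), len(M[0])
    return [[M[r][c].conjugate() for r in range(rows)] for c in range(cols)]

def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]

def inv2(M):
    """Inverse of a 2x2 complex matrix via the adjugate formula."""
    a, b = M[0]
    c, d = M[1]
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical channel and transmitted symbols.
H = [[1 + 1j, 0.5], [0.2j, 1 - 0.5j]]
s = [[1 + 1j], [-1 + 1j]]
y = matmul(H, s)                      # what the AP antennas receive

# Zero-forcing: s_hat = (H*H)^-1 H* y recovers s exactly without noise.
Hc = conj_transpose(H)
s_hat = matmul(inv2(matmul(Hc, H)), matmul(Hc, y))
```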

SLIDE 6

How Many Antennas Do We Need?

  • … for a giga-bit wireless link per user


# of ant    1       2       4       8       16      32      64      128
20MHz      72.2M   144M    289M    578M    1.2G    2.3G    4.6G    9.2G
40MHz      150M    300M    600M    1.2G    2.4G    4.8G    9.6G    19.2G
80MHz      325M    650M    1.3G    2.6G    5.2G    10.4G   20.8G   41.6G
160MHz     650M    1.3G    2.6G    5.2G    10.4G   20.8G   41.6G   83.2G

802.11n → 802.11ac → large-scale MU-MIMO systems. Giga-bit to 20 concurrent users: a 160MHz channel with at least 40 antennas.
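The table's arithmetic can be checked directly: per-stream rates (the single-antenna column) scale linearly with the number of antennas. A small sketch (the function name is mine):

```python
# Per-stream 802.11 PHY rates in Mbps, taken from the slide's table
# (single-antenna column), keyed by channel width in MHz.
per_stream_mbps = {20: 72.2, 40: 150, 80: 325, 160: 650}

def phy_rate_gbps(bw_mhz, n_antennas):
    """Aggregate PHY rate: per-stream rate times spatial streams."""
    return per_stream_mbps[bw_mhz] * n_antennas / 1000.0
```

For 160MHz and 40 antennas this gives 26 Gbps, i.e. over 1 Gbps for each of 20 concurrent users, matching the slide's conclusion.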

SLIDE 7

Challenge

  • Can we build a scalable AP to support such large-scale MU-MIMO operation?

– When n, and hence m, grow large?


Access Point (AP)

Joint Signal Processing

m AP antennas; n total client antennas (mobiles)

SLIDE 8

Computation and Throughput Requirements: A Back-of-the-Envelope Estimation

  • Setting: 160MHz, 40 antennas
  • Data path:

– 160MHz channel width → 5 Gbps of samples per antenna
– 40 antennas → 200 Gbps in total

  • Computation:

– Channel inversion (once every frame), O(mn^2): ≈ 269 GOPS
– Spatial demultiplexing/precoding, O(mn) per sample: ≈ 1.5 TOPS
– Channel decoding, O(n) per sample: ≈ 5.5 TOPS
– ≈ 7.27 TOPS in total!

  • State-of-the-art multi-core CPUs achieve only 50 GOPS
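A quick check of these totals, using the GOPS figures from the slide (variable names are mine; the server count is my derivation, not stated on the slide):

```python
import math

# GOPS figures taken from the slide's back-of-the-envelope estimate.
channel_inversion_gops = 269      # once every frame
spatial_demux_gops = 1500         # spatial demultiplexing / precoding
decoding_gops = 5500              # channel (Viterbi) decoding

total_gops = channel_inversion_gops + spatial_demux_gops + decoding_gops
total_tops = total_gops / 1000.0              # ≈ 7.27 TOPS, as on the slide

# How many 50-GOPS CPUs would a single pool naively need?
servers_needed = math.ceil(total_gops / 50)
```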

SLIDE 9

A Single Central Processing Unit


Access Point (AP)

Joint Signal Processing

m AP antennas; n total client antennas (mobiles)

SLIDE 10

BigStation AP

BigStation: Parallelizing to Scale


m AP antennas; n total client antennas (mobiles); an array of simple processing units connected by an inter-connecting network replaces the single central unit

SLIDE 11

Outline

  • Parallel architecture
  • Parallel algorithms and optimization
  • Performance
  • Conclusion

SLIDE 12

Naive Architecture

  • A pool of processing servers

– Sending samples of the same frame to one server…


  • A pool of processing servers
  • Enough processing capability with ⌈per-frame processing time / frame duration⌉ servers
SLIDE 13

Naive Architecture


  • Issue: long processing latency for a frame (~1 s)
  • Wireless protocols require latencies of milliseconds
SLIDE 14

Our Approach: Distributed Pipeline


  • Parallelizing MU-MIMO processing into a 3-stage pipeline
  • At each stage, the computation is further parallelized among multiple servers

Channel inversion → Spatial demultiplexing → Channel decoding

SLIDE 15

Channel inversion → Spatial demultiplexing → Channel decoding

Data Partitioning across Servers

  • Exploiting data parallelism inside MU-MIMO


OFDM signal: partitioning by subcarriers
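Subcarrier partitioning can be sketched as a simple routing rule (the server count and round-robin mapping here are illustrative, not BigStation's actual scheduler):

```python
# Illustrative subcarrier partitioning: all antennas' samples for a
# given OFDM subcarrier are routed to the same server, so that server
# can run the per-subcarrier processing independently.
NUM_SERVERS = 4   # hypothetical server pool size

def server_for_subcarrier(k):
    """Round-robin mapping from subcarrier index to server id."""
    return k % NUM_SERVERS

# Subcarriers 0..7 spread evenly over the 4 servers.
assignment = [server_for_subcarrier(k) for k in range(8)]
```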

SLIDE 16

Channel inversion → Spatial demultiplexing → Channel decoding

Data Partitioning across Servers

  • Exploiting data parallelism inside MU-MIMO


OFDM signal: partitioning by spatial streams

SLIDE 17

Example

  • Giga-bit to 20 users

– 160MHz → 468 parallel subcarriers

  • Subcarrier partitioning

– Each server needs to handle a minimum of 10Mbps data

  • Spatial stream partitioning

– Each server needs to handle 5Gbps data

  • Generally within an existing server’s processing capability

– Multi-core (4~16 cores)
– 10G NIC
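A sanity check of these figures (the bandwidth numbers come from the slides; the minimum server count is my derivation from them, not stated on the slide):

```python
import math

# Figures from the slides: 40 antennas, ~5 Gbps of samples each.
antennas = 40
gbps_per_antenna = 5.0
total_gbps = antennas * gbps_per_antenna       # 200 Gbps aggregate

# Spatial-stream partitioning: one server handles one antenna stream.
per_stream_gbps = total_gbps / antennas        # the slide's 5 Gbps figure

# With 10G NICs, this many servers are needed just to absorb the I/O.
nic_gbps = 10.0
min_servers_for_io = math.ceil(total_gbps / nic_gbps)
```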

SLIDE 18

Summary

  • Distributed pipeline for low latency
  • Exploiting data parallelism across servers at each processing stage
  • If a single datum is still beyond the capability of a single processing unit

– Building a deeper pipeline (see paper for details)

SLIDE 19

Outline

  • Parallel architecture
  • Parallel algorithms and optimization
  • Performance
  • Conclusion

SLIDE 20

Computation Partitioning in a Server

  • Three key operations in MU-MIMO

– Matrix multiplication
– Matrix inversion
– Viterbi decoding (channel decoding)

SLIDE 21

Parallel Matrix Multiplication

  • Divide-and-conquer


Splitting H into row blocks H1 (core 1) and H2 (core 2):

H*H = [H1* H2*][H1; H2] = H1*H1 + H2*H2
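A toy version of this row-block split, with a small real-valued matrix standing in for H (the matrix values and 2-way split are illustrative; production code would run the partial products on separate cores with SIMD):

```python
# Divide-and-conquer H*H: split H by rows into blocks H1 and H2
# (one per core) and sum the partial products H1*H1 + H2*H2.

def mat_mul(A, B):
    """Multiply two matrices given as lists of row lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

H = [[1, 2], [3, 4], [5, 6], [7, 8]]
H1, H2 = H[:2], H[2:]                  # row blocks for core 1 / core 2

# Full product vs. the sum of the two per-core partial products.
full = mat_mul(transpose(H), H)
partial = [[a + b for a, b in zip(r1, r2)]
           for r1, r2 in zip(mat_mul(transpose(H1), H1),
                             mat_mul(transpose(H2), H2))]
```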

SLIDE 22

Parallel Matrix Inversion

  • Based on Gauss-Jordan method


Augment the n×n channel matrix H with the identity: [H | I], with the columns of the augmented matrix split between core 1 and core 2.

SLIDE 23

Parallel Matrix Inversion

  • Based on Gauss-Jordan method


Row-reducing [H | I] until the left half becomes the identity leaves the inverse on the right: [I | H^-1]. The column updates of each elimination step are split between core 1 and core 2.
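A single-core sketch of the Gauss-Jordan inversion the slides parallelize (the 2x2 input is illustrative; BigStation splits the per-step column updates across cores):

```python
# Gauss-Jordan elimination: augment H with I and row-reduce until the
# left half becomes I; the right half is then H^-1.

def gauss_jordan_inverse(H):
    n = len(H)
    # Build the augmented matrix [H | I].
    A = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(H)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        p = A[col][col]
        A[col] = [x / p for x in A[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return [row[n:] for row in A]

Hinv = gauss_jordan_inverse([[4.0, 7.0], [2.0, 6.0]])
```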

SLIDE 24

Parallel Viterbi Decoding

  • Challenge: sequential operations on a continuous (soft-)bit stream

  • Solution:

– Artificially divide the bit stream into blocks



SLIDE 25

Parallel Viterbi Decoding

  • Challenge: sequential operations on a continuous (soft-)bit stream

  • Solution:

– Artificially divide the bit stream into blocks
– Add overlaps to ensure convergence to the optimal path
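The block-splitting idea can be sketched as follows (block size M, overlap D, and the stream contents are illustrative; a real decoder would run Viterbi over each block):

```python
# Cut the soft-bit stream into per-core blocks of size M that also
# carry D preceding bits of overlap: each decoder's trellis converges
# during the overlap, and the overlap bits are then discarded.

def split_with_overlap(stream, M, D):
    """Return (start, block) pairs; each block except the first
    includes up to D leading overlap bits before position `start`."""
    blocks = []
    start = 0
    while start < len(stream):
        lead = max(0, start - D)
        blocks.append((start, stream[lead:start + M]))
        start += M
    return blocks

blocks = split_with_overlap(list(range(10)), M=4, D=2)
```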


SLIDE 26

Parallel Viterbi Decoding

  • How to choose the right block size?

– The tradeoff between latency and overhead

  • Our goal: fully utilize the computation capacity while keeping the block size M minimal

  • Optimal size: M* = 2Dv/(nw - v)

v: stream bit rate; w: processing rate per core; n: # of cores; D: decoding overlap per block
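Plugging illustrative numbers into the formula (the function name and parameter values are mine; the formula requires the cores to jointly outpace the stream, n·w > v):

```python
# Optimal block size M* = 2 D v / (n w - v): the smallest block that
# keeps n cores of rate w busy on a stream of rate v with overlap D.

def optimal_block_size(D, v, w, n):
    assert n * w > v, "cores must jointly outpace the stream"
    return 2 * D * v / (n * w - v)

# Hypothetical numbers: 64-bit overlap, 100 Mbps stream, two 60-Mbps cores.
M_star = optimal_block_size(D=64, v=100e6, w=60e6, n=2)
```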

SLIDE 27

Optimization: Lock-free Computing Structure

  • Complex interaction between communication and computation threads


Contention at the output buffer → lock-free structure (1.31x speedup)

SLIDE 28

Optimization: Communication

  • Parallelizing communication among multiple cores

  • Dealing with the incast problem

– Application-level flow control

  • Isolating communication and computation on different cores

SLIDE 29

Outline

  • Parallel architecture
  • Parallel algorithms and optimization
  • Performance
  • Conclusion

SLIDE 30

Micro-benchmarks

  • Platform: Dell server with an Intel Xeon E5520 CPU (2.26 GHz, 4 cores)


Channel inversion

SLIDE 31

Micro-benchmarks


Spatial demultiplexing Viterbi decoding

SLIDE 32

Micro-benchmarks


Configurations: 6 users at 100Mbps; 20 users at 600Mbps; 50 users at 1Gbps

SLIDE 33

Prototype

  • Software radio: Sora MIMO Kit

– 4x phase-coherent radio chains
– Extensible with an external clock

SLIDE 34

Capacity Gain


Capped at a constant value due to random user selection!

SLIDE 35

Capacity Gain


Overprovisioned AP antennas: 6.8x

SLIDE 36

Processing Delay


Light load (1 frame per 10ms); heavy load (back-to-back frames): 860 μs

SLIDE 37

Things I didn’t talk about

  • How to get channel state in a scalable way

– Argos [Shepard et al., MobiCom 2012]
– JMB [Rahul et al., SIGCOMM 2012]

  • MU-MIMO MAC

– Better user selection than random? (Future work)

  • Automatic gain control in large-scale MU-MIMO

– Future work


SLIDE 38

Conclusions

  • Scalable processing of large MU-MIMO systems is possible

– Exploiting the parallelism of MU-MIMO operations across processing servers
– Developing a distributed processing pipeline

  • Large-scale MU-MIMO is a promising way to scale wireless capacity by another 100x

– Yet, many challenges remain (user selection, AGC, …)


SLIDE 39

Thanks! I’ll take your questions!
