The search engine you can see
- Connects people to information and services
The search engine you cannot see
- Total data: ~1 EB
- Data processed: ~100 PB/day
- Total web pages: ~1,000 billion
- Web pages updated: ~10 billion/day
- Requests: ~10 billion/day
- Total logs: ~100 PB
- Logs updated: ~1 PB/day
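To get a feel for these daily totals, the short sketch below converts them into average per-second rates. It assumes decimal units (1 PB = 10**15 bytes) and a perfectly even load over 24 hours, neither of which the slide states, so the results are rough averages rather than measured peaks.

```python
# Back-of-the-envelope averages derived from the slide's daily totals.
# Assumption: decimal units (1 PB = 10**15 bytes) and an evenly spread load.
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(daily_total: float) -> float:
    """Average rate per second for a quantity given per day."""
    return daily_total / SECONDS_PER_DAY


requests_per_s = per_second(10e9)            # ~10 billion requests/day
processed_bytes_per_s = per_second(100e15)   # ~100 PB processed/day
log_bytes_per_s = per_second(1e15)           # ~1 PB of new logs/day

print(f"requests:   {requests_per_s:,.0f} per second")
print(f"processing: {processed_bytes_per_s / 1e12:.2f} TB per second")
print(f"new logs:   {log_bytes_per_s / 1e9:.1f} GB per second")
```

Under these assumptions the request rate averages a little under 116,000 per second, and processing runs at a bit more than 1 TB per second.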
The search engine you don’t see
- Large-scale distributed computing
- Large-scale distributed storage
- Speech
- Image
- Recommendation systems
- Intelligent HCI
- Other cutting-edge technologies
The history of Moore’s law
- Moore’s law is coming to an end
[Figure: processor performance, log-scale axis from 1 to 100,000]
The history of the data center
[Figure: processor performance (log-scale axis from 1 to 100,000) across data center eras: mainframe, PC cluster, SDDC]
History of the data center
- PC cluster
– 2000~now
– Scalability
- SDDC
– 2013~
– Efficiency
Outline
- What is a PC cluster
- What is SDDC
- Baidu’s practice
- Conclusion
PC cluster
- Background
– Web-scale applications
– The performance and cost limitations of mainframes
- Scale out PC servers over Ethernet
– Up to 10K servers per cluster
- Typical configurations
– Commodity hardware: x86 CPU, INSPUR, HUAWEI…
– Software: MR/HDFS/Spark…
PC cluster
- Typical stack
– Each layer is independent
– The interfaces are highly abstract
– Follows the technology paradigms of the PC
- Limitations
– Multiple highly abstract layers make it hard to exploit the performance potential
– Commodity hardware cannot support emerging applications, such as AI and big data
- The end of Moore’s law
[Diagram: stack layers: commodity hardware, distributed software, applications]
Software-Defined Data Center - SDDC
- What is SDDC
– Application-driven hardware and software
– Whole-stack co-design
- How
– Algorithm
- Customized for new hardware and architecture
– System and software
- Separate the data path from the control path
– Hardware
- Expose low-level APIs, fully controlled by software
- Customized for applications
[Diagram: SDDC stack (hardware, software, applications) next to the PC cluster stack (commodity hardware, distributed software, applications)]
Software-Defined Data Center - SDDC
- Why SDDC
– Exploit the performance potential across multiple layers
– Customize hardware to extend Moore’s law for emerging applications
- AI and big data
– Achieve extreme efficiency
- The FPGA in SDDC
– Makes whole-stack co-design possible
SDDC – Baidu’s visions and practice
- Vision
– Shift from PC clusters to SDDC in the next 3 years
– Define and design the SDDC in collaboration with partners
- Practice
– SDF: software-defined flash
– SDA: software-defined accelerator
[Timeline: 2011: SDF; 2013: SDA; 2015: design the distributed SD system]
Software-defined flash – background
- Traditional SSD limitations
– Low bandwidth utilization
- 40% or less under real workloads
– Limited capacity utilization
- Only 50%~70% for applications
– Less predictable performance
- Large-scale deployment
– 10,000+ SSDs deployed per year (10 PB+ capacity)
- Challenges
– Acquisition of extra devices
– Higher cost
Software-defined flash – designs
- Software defined
– Expose low-level hardware interfaces to software
– Software can control the hardware completely
- New hardware architecture
– Expose hardware channels to software
– Individual FTL controller for each channel
- New HW/SW interface
– Write in units of the erase block size
– Leverage global resources for data persistence
- Removes cross-channel parity coding
[Figure: conventional SSD vs. SDF. In the conventional SSD, one SSD controller fronts flash channels CH_0 to CH_N and exposes a single device, /dev/sda. In SDF, channels CH_0 to CH_N are exposed as separate devices, /dev/sda0 ~ /dev/sdaN.]
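The interface ideas on this slide (each channel visible to software as its own device, and writes issued in erase-block units) can be illustrated with a small in-memory model. This is only a sketch: the class names `FlashChannel` and `SoftwareDefinedFlash`, the method names, and the choice to allow reads smaller than an erase block are assumptions made for illustration, not Baidu's actual driver interface.

```python
from dataclasses import dataclass, field


@dataclass
class FlashChannel:
    """One flash channel exposed as its own device (hypothetical model)."""
    erase_block_size: int
    num_blocks: int
    blocks: dict = field(default_factory=dict)  # block index -> stored bytes

    def _check(self, block: int) -> None:
        if not 0 <= block < self.num_blocks:
            raise IndexError(f"block {block} out of range")

    def erase(self, block: int) -> None:
        """Erase one block so it can be written again."""
        self._check(block)
        self.blocks.pop(block, None)

    def write_block(self, block: int, data: bytes) -> None:
        """Write exactly one erase block; partial writes are rejected."""
        self._check(block)
        if len(data) != self.erase_block_size:
            raise ValueError("write must be exactly one erase block")
        if block in self.blocks:
            raise RuntimeError("block must be erased before it is rewritten")
        self.blocks[block] = data

    def read(self, block: int, offset: int, length: int) -> bytes:
        """Read part of a written block (finer read granularity is an assumption)."""
        self._check(block)
        return self.blocks[block][offset:offset + length]


class SoftwareDefinedFlash:
    """A card whose channels are individually addressable by software."""

    def __init__(self, num_channels: int, erase_block_size: int, blocks_per_channel: int):
        self.channels = [
            FlashChannel(erase_block_size, blocks_per_channel)
            for _ in range(num_channels)
        ]


# Tiny sizes chosen only to keep the demo readable.
sdf = SoftwareDefinedFlash(num_channels=4, erase_block_size=8, blocks_per_channel=2)
sdf.channels[1].write_block(0, b"ABCDEFGH")  # software picks the channel explicitly
print(sdf.channels[1].read(0, offset=2, length=3))
```

The point of the model is that placement and erase decisions sit with the caller: nothing inside the device remaps data between channels.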
Software-defined flash – designs
- Removing unnecessary software layers
– To reduce latency and CPU cycles
– To remove the complexity of kernel configurations
- User-defined scheduler
– Data layout
– Erase scheduling
[Figure: I/O stacks. (a) Conventional SSD: buffered I/O and direct I/O from user space pass through kernel-space layers including VFS, file system, page cache, block device, generic block layer, I/O scheduler, SCSI mid-layer, SATA and SAS translation, and the low-level device driver. (b) SDF: user space issues IOCTRL calls through the generic block layer to the PCIE driver in kernel space.]
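A user-defined scheduler of the kind listed above could, for example, decide data layout by sending each write to the least-loaded channel and hold back erases until the target channel is idle. The sketch below shows one such policy; the class `ChannelScheduler`, its methods, and the policy itself are hypothetical and are not taken from the slides.

```python
from collections import deque


class ChannelScheduler:
    """Hypothetical user-space policy for write placement and erase timing."""

    def __init__(self, num_channels: int):
        self.pending_writes = [0] * num_channels
        self.deferred_erases = deque()  # (channel, block) pairs waiting to run

    def place_write(self) -> int:
        """Data layout: pick the least-loaded channel (ties go to the lowest index)."""
        ch = min(range(len(self.pending_writes)),
                 key=self.pending_writes.__getitem__)
        self.pending_writes[ch] += 1
        return ch

    def complete_write(self, ch: int) -> None:
        """Record that one outstanding write on this channel has finished."""
        self.pending_writes[ch] -= 1

    def request_erase(self, ch: int, block: int) -> None:
        """Queue an erase instead of issuing it immediately."""
        self.deferred_erases.append((ch, block))

    def runnable_erases(self) -> list:
        """Erase scheduling: release only erases whose channel has no pending writes."""
        ready, waiting = [], deque()
        for ch, block in self.deferred_erases:
            (ready if self.pending_writes[ch] == 0 else waiting).append((ch, block))
        self.deferred_erases = waiting
        return ready


sched = ChannelScheduler(num_channels=3)
placements = [sched.place_write() for _ in range(4)]
sched.request_erase(1, 7)
sched.request_erase(0, 3)
sched.complete_write(1)          # channel 1 becomes idle
ready = sched.runnable_erases()  # only the erase on channel 1 is released
print(placements, ready)
```

Keeping erases away from busy channels is one way an application might pursue more predictable latency, since it knows its own access pattern better than a generic controller does.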
Software-defined flash – designs
- Hardware
– 25 nm MLC NAND, 44 channels, ONFI 1.x asynchronous 40 MHz
– 5 FPGAs: 4 Spartan-6 for FTL, 1 Virtex-5 for PCIe
[Diagram: PCIe x8, Virtex-5, and four Spartan-6 FPGAs with 11 channels each]
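The channel count follows directly from the FPGA layout, and the bus clock gives a rough ceiling on the raw flash interface rate. The arithmetic below assumes an 8-bit data bus with one transfer per clock cycle, which the slide does not state; it also ignores command and address overhead and the flash array's own read and program times, so it is an upper bound on the interface, not a throughput figure from the talk.

```python
# Figures from the slide.
FPGAS_FOR_FTL = 4
CHANNELS_PER_FPGA = 11
BUS_CLOCK_HZ = 40_000_000

# Assumption (not on the slide): 8-bit data bus, one byte transferred per clock.
BUS_WIDTH_BYTES = 1

total_channels = FPGAS_FOR_FTL * CHANNELS_PER_FPGA
per_channel_bytes_per_s = BUS_CLOCK_HZ * BUS_WIDTH_BYTES
aggregate_bytes_per_s = total_channels * per_channel_bytes_per_s

print(f"channels:            {total_channels}")
print(f"per-channel ceiling: {per_channel_bytes_per_s / 1e6:.0f} MB/s")
print(f"aggregate ceiling:   {aggregate_bytes_per_s / 1e9:.2f} GB/s")
```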
Software-defined flash – conclusions
- Key ideas
– Exposes flash channels to software
– SW/HW co-design
- Results
– 95% write and 99% read bandwidth utilization
– 99% capacity utilization
– 50% cost reduction per GB compared with SSD for workloads on the production systems
- 3,000+ deployed in Baidu’s web page storage system
– 3x the performance of commodity SSDs
– 50% cost reduction
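One way to read the capacity-utilization figures is to ask how much raw flash must be bought to deliver a given usable capacity. The sketch below uses the utilization values quoted in the slides (50%~70% for conventional SSDs, 99% for SDF); the 100 TB target and the function name `raw_capacity_needed` are hypothetical. It covers only the capacity effect and does not by itself account for the reported 50% cost reduction.

```python
def raw_capacity_needed(usable_tb: float, capacity_utilization: float) -> float:
    """Raw flash (TB) required to expose `usable_tb` at a given utilization."""
    if not 0 < capacity_utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return usable_tb / capacity_utilization


usable_tb = 100.0  # hypothetical target, chosen only for illustration

scenarios = [
    ("conventional SSD, 50% utilization", 0.50),
    ("conventional SSD, 70% utilization", 0.70),
    ("SDF, 99% utilization", 0.99),
]
for label, utilization in scenarios:
    raw = raw_capacity_needed(usable_tb, utilization)
    print(f"{label}: {raw:.1f} TB raw for {usable_tb:.0f} TB usable")
```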
Software-defined accelerator – background
- AI is the core technology
– Speech, image, page ranking, and ads
- Extremely high computing density
– GPU
- High cost
- High power and high space consumption
- Higher demand on data center cooling, power supply, and space utilization
– CPU
- Medium cost and power consumption
- Low speed
– FPGA
- Greatest potential
- Needs faster development iteration
Software-defined accelerator – design
- Xilinx K7 FPGA
– Best performance/cost/power consumption
- Evaluations
- Batch size=8, layer=8
- Workload1
– Weight matrix size=512
– FPGA is 4.1x as fast as GPU
– FPGA is 3x as fast as CPU
- Workload2
– Weight matrix size=2048
– FPGA is 2.5x as fast as GPU
– FPGA is 3.5x as fast as CPU
- Conclusions
– FPGA can merge small requests to improve performance
– FPGA throughput (Req/s) scales better
[Figure a: workload1; Figure b: workload2. Req/s vs. thread count for CPU, GPU, and FPGA; annotated speedups are 4.1x and 3x for workload1, and 2.5x and 3.5x for workload2.]
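The slides do not say how the accelerator merges small requests, but the general idea can be illustrated in plain Python: requests are grouped into batches, and each row of the weight matrix is then reused for every request in the batch instead of being fetched once per request. The function names below are hypothetical, and the code only demonstrates that batching preserves the results; it makes no claim about FPGA performance.

```python
def matvec(weights, x):
    """Reference path: one request at a time."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def merge_requests(pending, batch_size):
    """Group small requests into batches of at most `batch_size`."""
    return [pending[i:i + batch_size] for i in range(0, len(pending), batch_size)]


def batched_matmul(weights, batch):
    """Apply the weights to a whole batch, visiting each weight row once."""
    outputs = [[0] * len(weights) for _ in batch]
    for r, row in enumerate(weights):   # each weight row is fetched once...
        for b, x in enumerate(batch):   # ...and reused for every request in the batch
            outputs[b][r] = sum(w * xi for w, xi in zip(row, x))
    return outputs


weights = [[1, 2], [3, 4]]
requests = [[1, 0], [0, 1], [1, 1]]

batched_results = []
for batch in merge_requests(requests, batch_size=2):
    batched_results.extend(batched_matmul(weights, batch))

individual_results = [matvec(weights, x) for x in requests]
print(batched_results == individual_results)
```

With a batch size of 8, as in the evaluation settings above, each weight row would serve up to eight requests per pass in this model.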
Conclusion
- Paradigm shift
– From PC cluster to SDDC
- What is SDDC
– Application-driven
– Whole-stack co-design and tuning
- The FPGA in SDDC
– Enables SDDC
- Baidu’s vision and practice