The Challenge of Large-Scale Ceph Cluster
China Railway: a solely state-owned enterprise, the main artery of the national economy
MAIN BUSINESS
Passenger and freight transport service
BUSINESS FEATURE
Large scale, wide coverage, uninterrupted
ENTERPRISE GOAL
To be a world-class modern logistics enterprise
BUSINESS SITUATION

                      2010       2018
Operating Mileage     91k km     131k km
Passenger Volume      1.67B      3.37B
Freight Volume        3.64Bt     4.02Bt
Computer Room Area    690㎡       3200㎡
Equipment Quantity    800        12000
Starting in 2014
Built the cloud data center in stages and gradually migrated production applications to the cloud environment.
OpenStack-based Cloud Data Center
Currently at a scale of thousands of physical machines, with more than 280 applications deployed, including passenger transport, freight, scheduling, locomotive, and public infrastructure platforms.
Cloud Computing Powering Data Center Hub
The newly built Data Center Hub is expected to exceed 15,000 physical machines by the end of 2020.
Cloud Computing Development in China Railway
Application System
Strategic Decision, Management & Development, Transportation Production, Resource Management, Implementation Management, Integrated Collaboration
Cloud Security
Access Security, Host Security, System Security, Data Security, Network Security, Application Security
Cloud Management
Resource Management, Authority Management, Operation & Maintenance Management, Global Monitor, Configuration Management, Asset Management
PaaS of China Railway Cloud
Storage & Back-up Service, Big Data Service, Container Service, Database Service, Middleware Service
IaaS of China Railway Cloud
Computing Management, Network Management, Block Storage Management, Secured Authentication, Object Storage Management, Resource Orchestration, File Storage Management, Resource Metering, Image Management, Bare-metal Management, Load Balance, Flexible Resource Expansion, Physical Machine Monitoring, HA of Virtual Machine, Log Optimization
Computing Resource Pool
KVM/Xen, VMware, PowerVM, Physical Device Management
Network Resource Pool
Virtual Switch, Virtual Router, VLAN, VXLAN
Storage Resource Pool
Distributed Storage, Centralized Storage
SRCloud Architecture and Components
As shown in the figure, this is the current storage usage of the China Railway data center: centralized storage serves the database service, while both centralized and distributed storage serve as back-end storage for VMware and OpenStack.
Storage Usage in China Railway

Resource type                       Storage type
Critical database resource pool     Centralized storage
VMware-based resource pool          Centralized storage & distributed storage
KVM-based resource pool             Centralized storage & distributed storage
Sharing of Experience in Distributed Storage
Deployment Architecture

OpenStack                           Ceph Monitor Nodes   Storage Nodes   OSDs   Capacity
3 controllers + 202 compute nodes   5                    84              1512   2.15P
3 controllers + 161 compute nodes   5                    68              1224   1.71P
3 controllers + 94 compute nodes    3                    40              720    1.02P
……
Challenges in the Deployment of a Ceph Cluster with 1512 OSDs
1. How many Ceph Mon nodes were deployed?
Based on our large-scale testing experience (600 compute nodes, more than 1000 OSDs), five Ceph Mon nodes are necessary to ensure the stability of the Ceph cluster.
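For illustration, here is a minimal sketch that confirms the Mons are in quorum; it assumes the python-rados bindings and the default ceph.conf path, which are not stated in the original slides:

```python
import json

import rados

# Minimal sketch: confirm that all five Mon nodes are in quorum.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ret, outbuf, errs = cluster.mon_command(
        json.dumps({'prefix': 'quorum_status', 'format': 'json'}), b'')
    status = json.loads(outbuf)
    # With 5 Mons, quorum needs a strict majority (3), so the cluster
    # stays available even if two Mon nodes fail at the same time.
    print('Mons in quorum:', status['quorum_names'])
finally:
    cluster.shutdown()
```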
2. How to configure the failure domain?
There is no difference in available capacity between separating replicas across hosts and separating them across racks. However, the two have different fault-tolerance levels: one tolerates host-level faults, the other rack-level faults. We separate replicas across racks to tolerate rack-level faults.
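As an illustration, the sketch below creates a replicated CRUSH rule whose failure domain is the rack, so each replica of a PG lands in a different rack. The rule name `replicated_rack` and the root `default` are placeholder assumptions, and the command requires a Luminous-or-later cluster:

```python
import json

import rados

# Illustrative sketch: create a replicated CRUSH rule with a rack
# failure domain. A pool using this rule survives the loss of a rack.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    cmd = {
        'prefix': 'osd crush rule create-replicated',
        'name': 'replicated_rack',   # hypothetical rule name
        'root': 'default',           # CRUSH root to draw OSDs from
        'type': 'rack',              # failure domain: rack, not host
    }
    ret, outbuf, errs = cluster.mon_command(json.dumps(cmd), b'')
    print('create rule:', ret, errs)
finally:
    cluster.shutdown()
```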
3. How to configure the network?
2 × 10G public network + 2 × 10G cluster network
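This split corresponds to two options in ceph.conf that separate client traffic (public network) from replication and recovery traffic (cluster network). The sketch below just generates an example [global] section; the subnets are placeholders, not the production addressing:

```python
import configparser

# Illustration only: the two ceph.conf options behind the public /
# cluster network split. Subnet values are placeholders.
conf = configparser.ConfigParser()
conf['global'] = {
    'public_network':  '10.10.0.0/24',   # 2 x 10G bond: clients <-> Mon/OSD
    'cluster_network': '10.10.1.0/24',   # 2 x 10G bond: OSD <-> OSD
}
with open('ceph.conf.example', 'w') as f:
    conf.write(f)
```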
4. Stop the Ceph services before adjusting the network configuration
Ceph cluster errors and recovery failures occurred when we adjusted the configuration of network equipment while the Ceph cluster was running. If you need to adjust the physical network, it is suggested to stop all services of the Ceph cluster first.
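As a complement to stopping the services (a hedged sketch, not the exact procedure used here), Ceph's standard maintenance flags can keep a brief loss of connectivity from triggering mass recovery:

```python
import json

import rados

# Sketch: set maintenance flags around planned network work so a brief
# flap does not trigger mass recovery, then unset them afterwards.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

def osd_flag(action, flag):
    """action is 'osd set' or 'osd unset'."""
    ret, _, errs = cluster.mon_command(
        json.dumps({'prefix': action, 'key': flag}), b'')
    print(action, flag, '->', ret, errs)

for flag in ('noout', 'norecover', 'norebalance'):
    osd_flag('osd set', flag)

# ... stop the Ceph services and adjust the physical network here ...

for flag in ('noout', 'norecover', 'norebalance'):
    osd_flag('osd unset', flag)
cluster.shutdown()
```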
5. Creating 100 incremental VMs in batches fails at a high rate
This is mainly caused by the incremental snapshot: concurrent creation is an incremental cloning operation at the rbd layer, and rbd is designed with a distributed exclusive lock. During concurrent cloning, librbd must first acquire the exclusive lock before performing subsequent operations, so under high concurrency the competition for the exclusive lock fails. Eventually we disabled the exclusive lock, as sketched below.
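One way to disable the exclusive-lock feature on an existing rbd image is via the python-rbd bindings; the pool and image names below are placeholders, and this is a sketch rather than the exact commands used:

```python
import rados
import rbd

# Sketch: disable exclusive-lock so concurrent clones no longer
# compete for the lock. Note that object-map, fast-diff and
# journaling (if enabled) must be disabled first.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('volumes')            # hypothetical pool

with rbd.Image(ioctx, 'volume-0001') as image:   # hypothetical image
    # update_features(features, enabled): False turns the feature off.
    image.update_features(rbd.RBD_FEATURE_EXCLUSIVE_LOCK, False)

ioctx.close()
cluster.shutdown()
```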
6. Creating 100 incremental VMs in batches fails at a high rate when there are already 1,500 VMs
In Cinder's code, each time a disk is created from an image, a snapshot is first created for the disk. When creating a snapshot, the rbd client must query the full snapshot list of the target volume, so if the snapshot list is too long the interface call times out and the creation fails. We modified the Cinder code to create a single snapshot per image; when a disk is later created from that image, no new snapshot is created.
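The idea behind the modification can be sketched with the python-rbd bindings as follows; the pool, image, and snapshot names are placeholders, and this is not the actual Cinder patch:

```python
import rados
import rbd

# Sketch: keep one protected snapshot per image and clone every new
# volume from it, so the image's snapshot list no longer grows with
# the number of volumes.
SNAP_NAME = 'cinder-base'                         # hypothetical snapshot

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
images = cluster.open_ioctx('images')             # hypothetical pools
volumes = cluster.open_ioctx('volumes')

with rbd.Image(images, 'centos7-image') as img:   # hypothetical image
    # Create the shared snapshot only once, then reuse it.
    if SNAP_NAME not in (s['name'] for s in img.list_snaps()):
        img.create_snap(SNAP_NAME)
        img.protect_snap(SNAP_NAME)               # clones need protection

# Each new volume is a clone of the one shared snapshot, so the
# snapshot-list query stays short and cannot time out.
rbd.RBD().clone(images, 'centos7-image', SNAP_NAME, volumes, 'volume-0002')

images.close()
volumes.close()
cluster.shutdown()
```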
7. Restarting a Ceph Mon service
Due to operation and maintenance requirements, a Ceph Mon service was restarted. After the restart, 87% of PGs were in the unknown state, dozens of slow requests appeared, and recovery was triggered for a while. Analysis showed that the clocks of the Mon nodes were inconsistent.
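A quick way to check for this condition is the `time-sync-status` Mon command; the sketch below assumes a Jewel-or-later cluster, the python-rados bindings, and the default ceph.conf path:

```python
import json

import rados

# Sketch: report per-Mon clock skew, the root cause found here.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ret, outbuf, errs = cluster.mon_command(
        json.dumps({'prefix': 'time-sync-status', 'format': 'json'}), b'')
    report = json.loads(outbuf)
    for mon, info in report.get('time_skew_status', {}).items():
        print(mon, 'skew:', info.get('skew'), 'latency:', info.get('latency'))
finally:
    cluster.shutdown()
```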
8. Testing the impact of latency on Ceph
We used tc to add 1000 ms of latency to one data network card of an OSD node, which caused some instances on the platform to fail to read and write. After the latency was removed, the virtual machines returned to normal. To discover network problems in time, we have strengthened network monitoring.
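One simple monitoring angle, shown as a hedged sketch below, is to poll per-OSD commit latency via the `osd perf` command and flag outliers; the 100 ms threshold is an arbitrary placeholder, and the JSON field names follow the Luminous-era output format:

```python
import json

import rados

# Sketch: flag OSDs whose commit latency exceeds a placeholder
# threshold, as one signal of network or disk trouble.
THRESHOLD_MS = 100

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ret, outbuf, errs = cluster.mon_command(
        json.dumps({'prefix': 'osd perf', 'format': 'json'}), b'')
    perf = json.loads(outbuf)
    for osd in perf.get('osd_perf_infos', []):
        latency = osd['perf_stats']['commit_latency_ms']
        if latency > THRESHOLD_MS:
            print('osd.{} commit latency {} ms'.format(osd['id'], latency))
finally:
    cluster.shutdown()
```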
Performance of Ceph Cluster with 1512 OSDs
Note: 1344 HDD OSDs, 168 PCIe OSDs

512K Sequential R/W Bandwidth (MB/s)
        512K Seq W   512K Seq R
HDD     30821        34694
PCIe    30465.5      90784.5

4K Random R/W IOPS
        4K Random W   4K Random R
HDD     241.4k        219k
PCIe    442.7k        3270k
HA Test Case of Ceph Cluster with 1512 OSDs
Testing Scenario                                                                             Result
Consistency of data read/write when two Ceph Mon nodes restart                               Pass
Consistency of data read/write when two failure domains restart                              Pass
Pause of VM I/O requests < 10 s when a sysrq kernel exception occurs on one OSD node         Failed
Pause of VM I/O requests < 10 s when one OSD node reboots                                    Pass
Pause of VM I/O requests < 10 s when one disk of an OSD node cannot be read or written       Pass
Pause of VM I/O requests < 10 s when one disk of an OSD node is disconnected                 Pass
VM I/O requests remain continuous when latency occurs on a network card of an OSD node      Failed
VM I/O requests remain continuous when packet loss occurs on a network card of an OSD node   Pass
Pause of VM I/O requests < 10 s when one OSD node is disconnected                            Failed