

SLIDE 1

The Challenge of Large-Scale Ceph Cluster

China Railway

SLIDE 2

China Railway: a solely state-owned enterprise, the main artery of the national economy.

MAIN BUSINESS

Passenger and freight transport service

BUSINESS FEATURE

Large scale, wide coverage, uninterrupted

ENTERPRISE GOAL

To be a world-class modern logistics enterprise

BUSINESS SITUATION (2010 → 2018)

• Operating mileage: 91k km → 131k km
• Passenger volume: 1.67B → 3.37B
• Freight volume: 3.64Bt → 4.02Bt
• Computer room area: 690㎡ → 3200㎡
• Equipment quantity: 800 → 12000

SLIDE 3

Starting in 2014

Built the cloud data center in stages and gradually migrated production applications to the cloud environment.

OpenStack-based Cloud Data Center

Has reached a scale of thousands of physical machines and deployed more than 280 applications, including passenger transport, freight, scheduling, locomotive, and public infrastructure platforms.

Cloud Computing Powering Data Center Hub

The newly built Data Center Hub is expected to exceed 15,000 physical machines by the end of 2020.

Cloud Computing Development in China Railway

SLIDE 4

Application System

Strategic Decision, Management & Development, Transportation Production, Resource Management, Implementation Management, Integrated Collaboration

Cloud Security

Access Security, Host Security, System Security, Data Security, Network Security, Application Security

Cloud Management

Resource Management, Authority Management, Operation & Maintenance Management, Global Monitoring, Configuration Management, Asset Management

PaaS of China Railway Cloud

Storage & Back-up Service, Big Data Service, Container Service, Database Service, Middleware Service

IaaS of China Railway Cloud

Computing Management, Network Management, Block Storage Management, Secured Authentication, Object Storage Management, Resource Orchestration, File Storage Management, Resource Metering, Image Management, Bare-metal Management, Load Balance, Flexible Resource Expansion, Physical Machine Monitoring, HA of Virtual Machine, Log Optimization

Computing Resource Pool

KVM/Xen, VMware, PowerVM, Physical Device Management

Network Resource Pool

Virtual Switch Virtual Router VLAN VXLAN

Storage Resource Pool

Distributed Storage Centralized Storage

SRCloud Architecture and Components

SLIDE 5

This figure shows the current storage usage of the China Railway data center: centralized storage serves the database service, while both centralized and distributed storage serve as the back-end storage for VMware and OpenStack.

Storage Usage in China Railway

Resource type → Storage type:
• Critical database resource pool → Centralized storage
• VMware-based resource pool → Centralized storage & distributed storage
• KVM-based resource pool → Centralized storage & distributed storage

SLIDE 6

Sharing of Experience in Distributed Storage

OpenStack cluster                         Ceph Mon Nodes   Storage Nodes   OSDs   Capacity
3 controller nodes + 202 compute nodes    5                84              1512   2.15P
3 controller nodes + 161 compute nodes    5                68              1224   1.71P
3 controller nodes + 94 compute nodes     3                40              720    1.02P
……

SLIDE 7

Deployment Architecture

SLIDE 8

Challenges in the Deployment of a Ceph Cluster with 1512 OSDs

1. How many Ceph Mon nodes were deployed?

Based on our large-scale testing experience (600 compute nodes, over 1000 OSDs), five Ceph Mon nodes are necessary to ensure the stability of the Ceph cluster.
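The arithmetic behind that choice can be sketched as follows (an illustrative sketch, not Ceph source code): a monitor quorum is a strict majority, so five mons keep quorum through two simultaneous monitor failures, while three mons tolerate only one.

```python
# Illustrative quorum arithmetic for Ceph monitors (not Ceph source code).
# Paxos needs a strict majority of monitors, so n monitors tolerate
# floor((n - 1) / 2) monitor failures before quorum is lost.

def quorum_size(n_mons: int) -> int:
    """Smallest monitor count that still forms a majority."""
    return n_mons // 2 + 1

def tolerated_failures(n_mons: int) -> int:
    """Monitor failures survivable while keeping quorum."""
    return (n_mons - 1) // 2

for n in (3, 5):
    print(f"{n} mons: quorum = {quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```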

SLIDE 9

Challenges in the Deployment of a Ceph Cluster with 1512 OSDs

2. How to configure the failure domain?

There is no difference in available capacity when separating replicas across hosts versus across racks, but the fault-tolerance levels differ: one tolerates host-level faults, the other rack-level faults. We separate replicas across racks to tolerate rack-level faults.
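On recent Ceph releases this rack-level policy can be expressed as a CRUSH rule (a minimal sketch; the rule and pool names here are examples, not from the slides, and assume hosts are already placed under rack buckets in the CRUSH map):

```shell
# Create a replicated CRUSH rule whose failure domain is the rack,
# then point a pool at it ("replicated-rack" / "volumes" are example names).
ceph osd crush rule create-replicated replicated-rack default rack
ceph osd pool set volumes crush_rule replicated-rack
```

These commands require a live cluster, so treat them as a template rather than a copy-paste procedure.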

SLIDE 10

Challenges in the Deployment of a Ceph Cluster with 1512 OSDs

3. How to configure the network?

2 × 10G Public Network, 2 × 10G Cluster Network
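In ceph.conf terms, a public/cluster network split like this might look as follows (a sketch; the subnets are placeholders, not the actual addressing):

```ini
# ceph.conf -- example network split (subnets are placeholders)
[global]
public network  = 10.10.0.0/24   ; client and monitor traffic
cluster network = 10.20.0.0/24   ; OSD replication and recovery traffic
```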

SLIDE 11

Challenges in the Deployment of a Ceph Cluster with 1512 OSDs

4. Stop the Ceph services before adjusting the network configuration

Ceph cluster errors and recovery failures occurred when we adjusted the network equipment configuration while the Ceph cluster was running. If you need to adjust the physical network, stop all Ceph cluster services first.
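A common way to quiesce a cluster for this kind of maintenance uses the standard Ceph cluster flags and systemd units (a sketch assuming a systemd-managed deployment; adapt to your own):

```shell
# Keep Ceph from marking OSDs out or rebalancing during the window,
# then stop the daemons on each node.
ceph osd set noout
ceph osd set norecover
ceph osd set norebalance
systemctl stop ceph.target        # run on every Ceph node
# ... adjust the physical network ...
systemctl start ceph.target
ceph osd unset norebalance
ceph osd unset norecover
ceph osd unset noout
```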

SLIDE 12

Challenges in the Deployment of a Ceph Cluster with 1512 OSDs

5. Creating 100 incremental VMs in batches had a high failure rate

This is mainly caused by the incremental snapshot mechanism. Concurrent creation is an incremental-clone operation at the rbd layer, and rbd is designed with a distributed exclusive lock: during concurrent cloning, librbd must first acquire the exclusive lock before performing the subsequent operations, so under high concurrency the competition for the lock fails. Eventually we disabled the exclusive lock.
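The feature can be disabled per image, or omitted from the default feature set for new images (a sketch; the pool/image names are examples, and note that object-map and fast-diff depend on exclusive-lock, so they must be disabled along with it):

```shell
# Disable exclusive-lock (and its dependent features) on an existing image.
rbd feature disable volumes/volume-0001 object-map fast-diff exclusive-lock
# Or omit it for newly created images via ceph.conf:
#   rbd default features = 3    (layering + striping only)
```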

SLIDE 13

Challenges in the Deployment of a Ceph Cluster with 1512 OSDs

6. Creating 100 incremental VMs in batches had a high failure rate when there were already 1500 VMs

In Cinder's code, each time a disk is created from an image, a snapshot is first created for the disk. When creating a snapshot, the rbd client must query the full snapshot list of the target volume, so if the snapshot list is too long the interface call times out and the creation fails. We modified the Cinder code to create one snapshot per image; later disk creations based on that image create no new snapshots.
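The modified flow can be sketched as follows (an illustrative sketch of the idea only, not actual Cinder code; the class and names are ours):

```python
# Sketch of the "one base snapshot per image" idea described above
# (illustrative only -- real code would call librbd where noted).

class ImageSnapshotCache:
    """Reuse a single base snapshot per image for all clones."""

    def __init__(self):
        self._snap_by_image = {}      # image_id -> snapshot name
        self.snapshots_created = 0

    def get_or_create_snapshot(self, image_id: str) -> str:
        if image_id not in self._snap_by_image:
            # Only the first volume created from an image pays the
            # snapshot cost; rbd never sees a growing snapshot list.
            self._snap_by_image[image_id] = f"{image_id}@base"
            self.snapshots_created += 1
        return self._snap_by_image[image_id]

cache = ImageSnapshotCache()
snaps = {cache.get_or_create_snapshot("img-1") for _ in range(100)}
print(len(snaps), cache.snapshots_created)   # one snapshot serves 100 clones
```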

SLIDE 14

Challenges in the Deployment of a Ceph Cluster with 1512 OSDs

7. Restarting a Ceph Mon service

Due to operation and maintenance requirements, a Ceph Mon service was restarted. Afterwards, 87% of PGs were in the unknown state, dozens of slow requests appeared, and a recovery was triggered for a while. Analysis showed the clocks were inconsistent.
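Monitors tolerate only a small clock drift before reporting skew (0.05 s by default), so the usual fix is keeping chrony/NTP healthy on the mon nodes rather than loosening the threshold. For reference, the relevant settings (defaults shown) are:

```ini
; ceph.conf -- monitor clock-skew tolerance (defaults shown; prefer
; fixing time synchronization over raising these values)
[mon]
mon clock drift allowed = 0.050
mon clock drift warn backoff = 5
```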

SLIDE 15

Challenges in the Deployment of a Ceph Cluster with 1512 OSDs

8. Testing the impact of latency on Ceph

TC was used to add 1000 ms of latency on one data network card of an OSD node, causing some instances on the platform to fail to read and write. After the latency was removed, the virtual machines returned to normal. To discover network problems in time, we have strengthened monitoring of the network.
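The fault injection described above can be reproduced with standard tc/netem on the OSD's cluster-network interface (a sketch; the interface name is an example and the commands require root on the OSD node):

```shell
# Inject 1000 ms of delay on the OSD's data NIC, then remove it.
tc qdisc add dev eth1 root netem delay 1000ms
# ... observe VM I/O stalls ...
tc qdisc del dev eth1 root netem
```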

SLIDE 16

Performance of Ceph Cluster with 1512 OSDs

512K Seq R/W Bandwidth (MB/s):
• 512K Seq W: HDD 30821, PCIE 30465.5
• 512K Seq R: HDD 34694, PCIE 90784.5

4K Random R/W IOPS:
• 4K Random W: HDD 241.4k, PCIE 442.7k
• 4K Random R: HDD 219k, PCIE 3270k

Note: 1344 HDD OSDs, 168 PCIE OSDs

SLIDE 17

HA Test Case of Ceph Cluster with 1512 OSDs

Testing Scenarios → Result:
• Consistency of data reads and writes when two Ceph Mon nodes restart → Pass
• Consistency of data reads and writes when two failure domains restart → Pass
• VM I/O pause time under 10 seconds when a sysrq kernel exception occurs on one OSD node → Failed
• VM I/O pause time under 10 seconds when one OSD node reboots → Pass
• VM I/O pause time under 10 seconds when one disk of an OSD node cannot be read or written → Pass
• VM I/O pause time under 10 seconds when one disk of an OSD node is disconnected → Pass
• VM I/O requests remain continuous when latency occurs on a network card of an OSD node → Failed
• VM I/O requests remain continuous when packet loss occurs on a network card of an OSD node → Pass
• VM I/O pause time under 10 seconds when one OSD node is disconnected → Failed

SLIDE 18

THANK YOU