Stateful Workloads on Kubernetes with Ceph
NAVER, 유장선
NVRAMOS 2019, 10/28/2019
Agenda
▶ CaaS
▶ Kubernetes
▶ Ceph Storage
▶ Operation
Cloud Service Model
Stack layers: Applications / Data / Runtime / Middleware / OS / Virtualization / Server / Storage / Network
Service models: On Premises -> IaaS -> CaaS -> PaaS -> SaaS
On Premises: build whatever you want (non-standard), but cost and lead time grow.
Cloud service models: standardization, cost savings, on demand.
Transformation of deployment
Traditional: Hardware -> OS -> Bin/Library -> App
- Apps interfere with each other, library compatibility issues, higher cost when splitting apps onto separate nodes

Virtualized: Hardware -> OS -> Hypervisor -> VMs (each VM: OS + Bin/Library + Apps)
- Isolation per VM: better security, better resource efficiency and scalability
- But each guest OS adds resource overhead and boot time

Containerized: Hardware -> OS -> Container Runtime -> Containers (each container: App + Bin/Library)
- Lighter than VMs (the host OS is shared), faster deployment
- Isolated by namespaces, split into small independent units, high efficiency / high density
MSA (micro-service architecture)
Monolithic: the code base grows large and complex; the QA scope grows with every change; changes in dependent services ripple across; failures carry higher risk. Meanwhile the market changes fast and the development paradigm is shifting.
Monolithic: one application server hosting Service A / B / C / D on a single DB
Microservices: Service A, Service B, Service C, Service D, each with its own DB
Microservice: split the system into small services; simpler deployments; technology can vary per service (libraries, languages, frameworks); better scalability.
Container Orchestration
Services (Svc A-D) are scheduled and replicated across many worker nodes.
Orchestrators: Docker Swarm, Mesos Marathon, Cloud Foundry, CoreOS Fleet, Kubernetes, Google Container Engine, Amazon ECS, Azure Container Service
Kubernetes (K8S)
Kubernetes
kubectl -> API -> K8s Master -> schedules workloads onto the Worker Nodes

Deployment YAML: kind: Deployment, replicas: 3, selector app: nginx, pod template with image: nginx and label app: nginx
Service YAML: kind: Service, selector app: nginx, type: LoadBalancer

Three nginx pods run across the worker nodes. The LoadBalancer Service exposes them through an external IP / DNS entry, while in-cluster clients reach the backends (BE) through the service's internal IP and load balancer.
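A minimal sketch of the same setup with kubectl (a working cluster and the public nginx image are assumed; names are illustrative):

$ kubectl create deployment nginx --image=nginx                      # Deployment with label app=nginx
$ kubectl scale deployment nginx --replicas=3                        # three pods spread across the worker nodes
$ kubectl expose deployment nginx --port=80 --type=LoadBalancer      # external IP / DNS via the load balancer
$ kubectl get pods,svc -o wide                                       # check pod placement and service IPs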
CI/CD Pipeline
Developer: commit & push code -> Git Repository -> CI Server
CI Server: build & run tests -> build Docker image -> push Docker image to the Docker Registry -> update Kubernetes Deployment
Kubernetes: create new pod -> health check the new pod -> if healthy, delete the old pod; if not healthy, keep the old pod running and restart the new pod
Canary / Blue-Green deployments
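A hedged sketch of what the CI server's deploy step might run (registry URL, image tag, and deployment name are hypothetical):

$ docker build -t registry.example.com/myapp:v2 .                        # build the image from the pushed code
$ docker push registry.example.com/myapp:v2                              # publish it to the Docker registry
$ kubectl set image deployment/myapp myapp=registry.example.com/myapp:v2 # rolling update: new pods must pass health checks before old pods are removed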
Autoscaling
HPA (Horizontal Pod Autoscaler): checks metrics -> when the threshold is met, changes the number of replicas in the Deployment -> pods are scaled in / out.
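A minimal HPA sketch (deployment name and thresholds are illustrative; a metrics server must be running):

$ kubectl autoscale deployment nginx --cpu-percent=70 --min=3 --max=10   # scale out when average CPU exceeds 70%
$ kubectl get hpa                                                        # inspect current vs target utilization and replica count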
VPA (Vertical Pod Autoscaler): checks metrics -> when the threshold is met, changes the CPU / memory values of the Deployment's pods -> pods are scaled up / down by adjusting their resource requests.
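A hedged VPA sketch (the VPA controller is a separate add-on; the API version and target name below are assumptions):

$ kubectl apply -f - <<EOF
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx
  updatePolicy:
    updateMode: "Auto"   # evict pods and recreate them with adjusted CPU/memory requests
EOF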
Storage in K8S
Local Storage
- Ephemeral: data is stored inside the Pod (container); deleted together with the Pod; unusable if the host fails.
- Host local disk: data is stored on the host's local disk; survives Pod deletion; unusable if the host fails.
Remote Storage
- Shared storage: external network storage shared by multiple Pods; survives Pod deletion; no service impact when a host fails.
- Block storage: external network storage with a dedicated volume per Pod; survives Pod deletion; no service impact when a host fails.
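A minimal sketch contrasting these volume types inside one Pod spec (names, paths, and the PVC are illustrative):

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: storage-demo
spec:
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - { name: scratch, mountPath: /scratch }      # ephemeral
    - { name: hostdir, mountPath: /hostdata }     # host local disk
    - { name: data,    mountPath: /data }         # remote block storage via PVC
  volumes:
  - name: scratch
    emptyDir: {}                                  # removed together with the Pod
  - name: hostdir
    hostPath: { path: /mnt/local }                # tied to this particular host
  - name: data
    persistentVolumeClaim: { claimName: data-pvc }  # survives Pod and host failures
EOF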
Volume Plugin in Kubernetes
We developed our own volume plugin in-house.
StatefulSet
apiVersion: apps/v1
kind: StatefulSet
spec:
  replicas: 3
  template:
    spec:
      containers:
      - image: k8s.gcr.io/nginx-slim:0.8
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

Pods web-0, web-1, web-2 each get their own PVC (www-web-0, www-web-1, www-web-2), bound to a dedicated PV (pv-<uuid>) and backing volume.
Volume Plugin in NAVER
A StatefulSet defined in YAML triggers a PV creation request: the Ceph Provisioner checks the PV, authenticates through Keystone, and creates the volume through Cinder on Ceph.
On the worker node, the Ceph Driver attaches and mounts the volume for the Pod: the RBD image is mapped through the kernel (e.g. /dev/rbd0) and mounted into the container.
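A rough sketch of what the attach/mount step amounts to on the worker node (pool, image, user, and mount path are illustrative; the driver performs these steps programmatically):

$ rbd map rbd_pool/pvc-volume --id kube                                 # kernel rbd map -> /dev/rbd0
$ mkfs.ext4 /dev/rbd0                                                   # only on first use of the volume
$ mount /dev/rbd0 /var/lib/kubelet/pods/<pod-uid>/volumes/ceph-volume   # expose it to the Pod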
Distributed platform on distributed storage
Kafka on Ceph RBD (3x replication): Kafka already keeps 3 copies, so 3 copies x 3x RBD replication = 9 copies on disk.
Elasticsearch (warm tier) on Ceph EC (1.5x): ES keeps 2 copies, so 2 copies x 1.5x erasure coding = 3 copies on disk.
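An erasure-coded pool with a 1.5x space overhead corresponds to k=2, m=1 (every object is stored as 2 data chunks plus 1 coding chunk). A minimal sketch, with the profile name, pool name, and PG count as assumptions:

$ ceph osd erasure-code-profile set ec-21 k=2 m=1
$ ceph osd pool create es-warm 1024 1024 erasure ec-21    # 3 chunks written for every 2 chunks of data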
Single Copy Storage
Kafka brokers #1-#3 are placed in separate zone groups (Zone Group #1 / #2 / #3).
Within each zone group, local disks are grouped into VGs, volumes (VOL#1, VOL#2, VOL#3) are carved out of them, and the volumes are exported to the brokers over iSCSI, so only a single copy is kept on the storage side while Kafka handles replication itself.
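A hedged sketch of how one such volume could be carved out and exported (device names, VG/LV names, and IQNs are hypothetical; the presentation does not show the exact tooling):

$ vgcreate vg_zone1 /dev/sdb /dev/sdc /dev/sdd             # group local disks into a VG
$ lvcreate -L 2T -n vol1 vg_zone1                          # carve out VOL#1
$ targetcli /backstores/block create name=vol1 dev=/dev/vg_zone1/vol1
$ targetcli /iscsi create iqn.2019-10.com.example:zone1.vol1
$ targetcli /iscsi/iqn.2019-10.com.example:zone1.vol1/tpg1/luns create /backstores/block/vol1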
Ceph Storage Service
Object (RGW): SWIFT / S3 APIs with OpenStack authentication, used for the Docker Registry, internal storage services, and Object -> NFS
Block (RBD): rbd.ko, nbd.ko, librbd, used from Docker/K8s, PM/VM, QEMU, and for iSCSI export
File (CephFS): fuse and kernel clients, plus CephFS -> NFS
Converting to BlueStore (+ NVMe)

FileStore: 2 disks in RAID1 for the OS, 2 SSDs for the journal, 8 data disks -> 6TB * 8 = 48 TB (only 66% of the bays used for data)
BlueStore: all 12 disks hold data -> 6TB * 12 = 72 TB (100% used for data); a single NVMe device carries the OS, Docker, the BlueStore DB/WAL, and the bcache cache.
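A hedged sketch of creating one such BlueStore OSD with its DB/WAL on an NVMe partition (device paths are illustrative; the slides show a partition-based layout on /dev/bcache0, and ceph-volume is just one way to build an equivalent OSD):

$ ceph-volume lvm create --bluestore \
      --data /dev/bcache0 \
      --block.db /dev/nvme0n1p6     # RocksDB + WAL on NVMe, object data on the bcache-backed HDD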
Providing CephFS
https://docs.ceph.com/docs/master/_images/cephfs-architecture.svg
MDS High Availability : Standby MDS
Floating standby MDS: MDS #1 (rank 0) and MDS #2 (rank 1) are active, with a single shared standby MDS that can take over either rank.
Hot standby MDS: each active MDS (rank 0 and rank 1) has its own hot-standby (standby-replay) MDS following it.
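A hedged sketch of turning on the hot-standby behaviour; the exact knob differs between Ceph releases, so treat this as one possible configuration:

$ ceph fs set <fs_name> allow_standby_replay true    # Nautilus and later: standby daemons follow an active rank in standby-replay mode
# Older releases configure this per daemon, e.g. mds_standby_replay = true in ceph.conf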
Multiple MDS
Single MDS: MDS #1 (rank 0) serves all metadata and becomes a bottleneck.
Multiple MDS: MDS #1 (rank 0), MDS #2 (rank 1), MDS #3 (rank 2) share the metadata load.

$ ceph fs set <fs_name> max_mds 3
Type: Subtree Pinning (static)

$ setfattr -n ceph.dir.pin -v <rank> <path>
The directory tree under the root is sharded: each pinned subtree is served by a fixed MDS rank (MDS #1: rank 0, MDS #2: rank 1, MDS #3: rank 2).

Directory layouts can likewise pin data to a pool (with cephfs_volume_prefix = /ceph_ssd):
$ setfattr -n ceph.dir.layout.pool -v SSD_POOL /ceph_ssd    # /ceph_ssd goes to the SSD pool, /ceph_hdd stays on the default pool
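A small usage sketch combining both attributes (the paths follow the slide; the rank numbers are illustrative):

$ setfattr -n ceph.dir.pin -v 0 /ceph_hdd    # metadata for /ceph_hdd handled by rank 0
$ setfattr -n ceph.dir.pin -v 1 /ceph_ssd    # metadata for /ceph_ssd handled by rank 1
$ getfattr -n ceph.dir.pin /ceph_ssd         # verify the pin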
CephFS clients: fuse / kernel
- ceph-fuse: Application -> VFS -> FUSE -> ceph-fuse (user space); supports quotas.
- kernel mount: Application -> VFS -> CephFS kernel module (kernel space); fast.
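Minimal mount sketches for both clients (monitor address, credentials, and mount point are illustrative):

$ ceph-fuse -m mon1:6789 /mnt/cephfs                                                       # user-space client via FUSE
$ mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret   # kernel client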
https://pommi.nethuis.nl/ssd-caching-using-linux-and-bcache/
Block Cache: bcache (EnhanceIO is another SSD-caching option)
OSD on bcache:
- HDDs (sda, sdb, sdc, sde, sdf, ..., sdx) are the bcache backing devices (bdev0).
- NVMe partitions /dev/nvme0n1p6 .. /dev/nvme0n1p17 are reserved for BlueStore DB/WAL; /dev/nvme0n1p18 .. /dev/nvme0n1p29 are bcache cache devices (cache0).
- A backing HDD plus a cache partition are registered as /dev/bcache0 under /sys/fs/bcache/99c3ab27-e819-4f18-8947-924c53045bbc.
- /dev/bcache0p1 holds the OSD data directory /var/lib/ceph/osd/dd961c1a-0a05-4581-af03-77c28fb8fcbc, which is bind-mounted to /var/lib/ceph/osd/ceph-2/ and read by ceph-osd --id 2.
- Inside that directory, the block and block.db entries are symlinks through /dev/disk/by-partuuid/ (e530dbba-706a-4f2f-91e3-b90b0df3a650, 4c06c8f6-2967-4165-9e85-fa0382d6ab17) to /dev/bcache0p2 and the NVMe DB/WAL partition.
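A hedged sketch of how one such bcache device could be assembled (device names are illustrative; the slide does not show the exact commands):

$ make-bcache -B /dev/sdb                                  # register the HDD as a backing device
$ make-bcache -C /dev/nvme0n1p18                           # register the NVMe partition as a cache set
$ echo <cset-uuid> > /sys/block/bcache0/bcache/attach      # attach the cache set to the backing device
$ echo writeback > /sys/block/bcache0/bcache/cache_mode    # optionally enable writeback caching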
QoS (Quality of Service)

Frontend QoS (current): per-container limits through Docker cgroups (cpu, memory, io, network).
Backend QoS (WIP): CephFS (file), based on a token-bucket algorithm; Ceph RBD (block), based on dmClock.

/sys/fs/cgroup/blkio/docker/{container id}/
  blkio.throttle.read_bps_device
  blkio.throttle.read_iops_device
  blkio.throttle.write_bps_device
  blkio.throttle.write_iops_device
Value format: {major}:{minor} {value}, e.g. 252:48 1048576
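A minimal sketch of applying one frontend limit to a running container (the container id and device numbers are illustrative):

$ echo "252:48 1048576" > /sys/fs/cgroup/blkio/docker/<container-id>/blkio.throttle.read_bps_device   # cap reads from device 252:48 at 1 MiB/s
$ docker run --device-read-bps /dev/rbd0:1mb ...                                                      # the same limit set at container start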
Single Private Docker Registry

Images are pushed to a single private Docker Registry in KOREA; workers in the USA and EUROPE pull images across the WAN, which takes ?? minutes.
Replicate Repository: Harbor

Images are pushed to the master Harbor Registry (backed by Ceph) and replicated to local Harbor Registries (also backed by Ceph); workers #1-#4 pull images from the registry nearest to them. An emergency path to the master registry is kept as a fallback.
Replication Bottleneck

The master Harbor (on Ceph RGW) has to push replicate #1, #2, ... #N to every local Harbor (Harbor Local 1, Harbor Local 2, ..., each on its own Ceph RGW), so the fan-out replication puts heavy load on the master, while workers pull only from their local registries.
Ceph RGW Multi-Site : Sync
Images are pushed to the master Harbor (Ceph radosgw); RGW multi-site sync replicates the objects to the Europe and USA sites, and workers in Europe / USA pull from their local Harbor + radosgw. An emergency path to the master remains as a fallback.
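A hedged sketch of the RGW multi-site pieces this relies on (realm, zonegroup, zone names, and endpoints are hypothetical; the actual configuration is not shown in the slides):

$ radosgw-admin realm create --rgw-realm=registry --default
$ radosgw-admin zonegroup create --rgw-zonegroup=global --endpoints=http://master-rgw:8080 --master --default
$ radosgw-admin zone create --rgw-zonegroup=global --rgw-zone=kr-master --endpoints=http://master-rgw:8080 --master --default
$ radosgw-admin period update --commit
# Secondary sites (Europe / USA) then pull the realm, create their own zones, and radosgw keeps the object data in sync.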
Extend PG NUM
PG Count ≈ (number of OSDs * 100) / replica count, rounded up to the next power of two (the usual rule of thumb).

Doubling the PGs (e.g. from 8192 to 16384):
$ ceph osd pool set <pool> pg_num <int>
$ ceph osd pool set <pool> pgp_num <int>
Changing pg_num by 1 at a time, 8 times, caused 7 data movements; changing it by 8 in one step also caused 7 movements. Conclusion: unless you grow PGs in steps of 128 or more, increasing them one at a time or several at a time moves roughly the same amount of data, and changing 1 PG takes about as long as changing 64 PGs.
Thank you.