

SLIDE 1

Stateful Workloads on Kubernetes with Ceph

유장선, NAVER

SLIDE 2

Agenda

  • CaaS
  • Kubernetes
  • Ceph Storage
  • Operation

SLIDE 3

Cloud Service Model

[Diagram: the stack (Applications, Data, Runtime, Middleware, OS, Virtualization, Server, Storage, Network) shown for On Premises, IaaS, CaaS, PaaS, and SaaS, with progressively more layers managed by the provider.]

On Premises: build everything the way you want (non-standard), at the cost of more money and more time. Cloud models: standardized, cost-saving, on demand.

SLIDE 4

Transformation of deployment

Traditional (App / Bin/Library / OS / Hardware):
  • Apps interfere with one another
  • Library compatibility issues
  • Separating apps onto dedicated nodes raises cost

Virtualized (VMs, each App + Bin/Library + guest OS, on a Hypervisor / OS / Hardware):
  • Isolating apps in VMs improves security
  • Better resource efficiency and scalability
  • But each guest OS adds resource overhead and longer boot times

Containerized (App containers with their Bin/Library on a Container Runtime / OS / Hardware):
  • Lighter than VMs (the host OS is shared), faster to deploy
  • Isolated via namespaces
  • Small, independent units: high efficiency and high density

SLIDE 5

MSA (micro-service architecture)

Monolithic (one application server containing Services A-D over a single DB):
  • The codebase grows large and complex
  • QA scope balloons with every change
  • Changes ripple into dependent services
  • A single failure puts the whole application at risk
  • Hard to keep up with a fast-moving market and shifting development paradigms

Microservice (Services A-D deployed independently, each with its own DB):
  • Services are split into small units
  • Deployment is simplified
  • Technology can vary per service (libraries, languages, frameworks)
  • Better scalability

SLIDE 6

Container Orchestration

  • Provisioning / Deployment of containers
  • Fault Tolerance (Replicas)
  • Load Balancing
  • Service Discovery
  • Auto Scaling (Scale in/out)
  • Resource Limit Control
  • Scheduling
  • Health Checking
  • Cluster Management
  • Configuration Management
  • Monitoring

[Diagram: services A-D replicated and spread across multiple worker nodes.]

Examples: Docker Swarm, Mesos Marathon, Cloud Foundry, CoreOS Fleet, Kubernetes, Google Container Engine, Amazon ECS, Azure Container Service

SLIDE 7

Kubernetes (K8S)

  • The de facto standard container orchestrator
  • Open-source version of Borg, Google's internal container platform (15 years of operational experience)
  • Donated to the CNCF (Cloud Native Computing Foundation)
  • Supports a wide range of cloud and bare-metal environments
  • Written in Go
  • Self-healing
  • Horizontal Scaling
  • Service Discovery / Load Balancing
  • Automatic rollouts / rollbacks
  • Secret / configuration management
  • Storage orchestration
  • Batch execution (CronJob)
SLIDE 8

Kubernetes

[Diagram: kubectl submits YAML to the API server on the K8s master; the master schedules three nginx pods across worker nodes; a LoadBalancer service exposes them through an external IP / DNS, balancing client traffic to the backends over an internal IP.]

Deployment (shorthand): kind: Deployment / replicas: 3 / selector: app: nginx / template: image: nginx, label: app: nginx
Service (shorthand): kind: Service / selector: app: nginx / type: LoadBalancer
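Spelled out, the two shorthand manifests above would look roughly like this; a minimal sketch (object names and the port are illustrative, not from the deck):

$ kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  type: LoadBalancer
  selector:
    app: nginx
  ports:
  - port: 80
EOF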

SLIDE 9

CI/CD Pipeline

Developer: commit & push code to the Git repository.
CI server: build & run tests, build the Docker image, push it to the Docker registry, update the Kubernetes deployment.
Kubernetes: create the new pod and health-check it; if healthy, delete the old pod; if not healthy, keep the old pod running and restart the new pod.

Canary / Blue-Green Deployments
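The "update Kubernetes deployment" step typically reduces to a few kubectl calls; a sketch assuming the nginx deployment above and a hypothetical image tag:

$ kubectl set image deployment/nginx nginx=registry.example.com/nginx:v2   # start a rolling update
$ kubectl rollout status deployment/nginx    # blocks until the new pods pass their health checks
$ kubectl rollout undo deployment/nginx      # if they never become healthy, fall back to the old version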

SLIDE 10

Autoscaling

HPA (Horizontal Pod Autoscaler): checks metrics against a threshold; when the threshold is met, it changes the deployment's replica count, scaling the number of pods in or out.

VPA (Vertical Pod Autoscaler): checks metrics against a threshold; when the threshold is met, it scales pods up or down by changing their CPU / memory values.
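For the HPA, a one-line sketch (deployment name and thresholds are illustrative): hold average CPU around 70%, scaling between 3 and 10 replicas.

$ kubectl autoscale deployment nginx --cpu-percent=70 --min=3 --max=10
$ kubectl get hpa    # shows current vs. target utilization and the replica count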

SLIDE 11

Stateful workloads

SLIDE 12

Storage in K8S

  • Local / Ephemeral: data lives inside the pod (container); deleted along with the pod; unavailable on host failure.
  • Local / Host disk: data lives on the host's local disk; survives pod deletion; unavailable on host failure.
  • Remote / Shared Storage: external network storage shared by multiple pods; survives pod deletion; no service impact on host failure.
  • Remote / Block Storage: external network storage with one volume allocated per pod; survives pod deletion; no service impact on host failure.

SLIDE 13

Volume Plugin in Kubernetes

  • Integrates with OpenStack (Cinder)
  • Multi-tenancy support
  • Authentication / authorization integration (in-house identity service)
  • Implemented as a Flexvolume driver
  • Applies operational know-how from our Docker Volume Plugin
  • On-line resize support
  • Read-only multi-attach support
  • Snapshot support
  • CephFS FUSE / kernel mount support
  • Prevents RBD multi-attach (lock)
  • Adds the node to the blacklist on node drain
  • IO monitoring
  • Front-end QoS (using cgroups)
  • Quota support

The plugin was developed in-house (a minimal driver skeleton follows below).
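The NAVER driver itself is not shown in the deck; as a rough illustration of the Flexvolume contract it implements, kubelet execs the driver binary with a verb plus JSON options and expects a JSON status on stdout:

#!/bin/bash
# Minimal Flexvolume driver skeleton (illustrative only, not the NAVER driver)
case "$1" in
  init)     echo '{"status":"Success","capabilities":{"attach":false}}' ;;
  mount)    MNTPATH=$2; OPTS=$3   # OPTS: JSON such as {"volumeID":"...","kubernetes.io/fsType":"ext4"}
            # ... rbd map + mkfs (first mount only) + mount at $MNTPATH would go here ...
            echo '{"status":"Success"}' ;;
  unmount)  umount "$2" && echo '{"status":"Success"}' ;;
  *)        echo '{"status":"Not supported"}' ;;
esac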

SLIDE 14

Statefulset

apiVersion: apps/v1 kind: StatefulSet spec: replicas: 3 template: spec: containers:

  • name: nginx

image: k8s.gcr.io/nginx-slim:0.8 volumeMounts:

  • name: www

mountPath: /usr/share/nginx/html volumeClaimTemplates:

  • metadata:

name: www spec: accessModes: [ "ReadWriteOnce" ] resources: requests: storage: 1Gi WEB-0

PODs PVC PV

WWW-WEB-O PV-uuid

Vol

WEB-1 WWW-WEB-1 PV-uuid WEB-2 WWW-WEB-2 PV-uuid

… …
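Each replica gets its own PVC, named after the claim template plus the pod ordinal (www-web-0 .. www-web-2, matching the diagram); a rescheduled pod re-binds the same claim, so its data follows it:

$ kubectl get pvc    # lists www-web-0, www-web-1, www-web-2, each Bound to its own PV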

SLIDE 15

Volume Plugin in NAVER

[Diagram: a StatefulSet's YAML definition triggers the Ceph provisioner, which checks the PV, authenticates against Keystone, and asks Cinder to create the volume in Ceph; the Ceph driver then attaches and mounts it into the pod (rbd kernel map / mount, e.g. /dev/rbd0).]
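On the node, the attach/mount step boils down to an rbd map plus a mount; a sketch with illustrative pool, image, credential, and path names:

$ rbd map rbd_pool/pvc-0123 --id k8s --keyring /etc/ceph/ceph.client.k8s.keyring   # -> /dev/rbd0
$ mkfs.ext4 /dev/rbd0                                                              # first attach only
$ mount /dev/rbd0 /var/lib/kubelet/pods/<pod-uid>/volumes/<plugin>/<volume>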

SLIDE 16

Distributed platform on distributed storage

Kafka on Ceph RBD (3-copy replication): Kafka's own 3 replicas × RBD's 3 copies = 9 copies of every byte.
Elasticsearch (warm tier) on Ceph erasure coding (1.5× overhead): ES 2 copies × EC 1.5 = 3 copies.
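The 1.5× figure corresponds to an erasure-code profile with k=2 data chunks and m=1 coding chunk, i.e. (2+1)/2 = 1.5; a sketch with illustrative profile, pool name, and PG counts:

$ ceph osd erasure-code-profile set ec-21 k=2 m=1
$ ceph osd pool create es_warm 1024 1024 erasure ec-21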

SLIDE 17

Single Copy Storage

[Diagram: Kafka brokers #1-#3, one per zone group (#1-#3). In each zone group, volumes VOL#1-VOL#3 are carved from volume groups (VG) built on local disks and attached to the broker over iSCSI, so each byte is stored once and replication is left to Kafka itself.]
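The per-broker volumes can be carved with plain LVM; a sketch with illustrative device, VG, and size values (the volume is then exported to the broker over iSCSI):

$ pvcreate /dev/sdb /dev/sdc /dev/sdd
$ vgcreate vg_zone1 /dev/sdb /dev/sdc /dev/sdd
$ lvcreate -L 2T -n vol1 vg_zone1    # VOL#1 for KAFKA #1, stored exactly once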

SLIDE 18

Ceph Storage

SLIDE 19

Ceph Storage Service

[Diagram: one Ceph cluster serving many frontends: SWIFT / S3 object storage with OpenStack authentication (Docker Registry, in-house storage, Object → NFS); RBD via rbd.ko / nbd.ko / librbd for Docker/K8S, PM/VM, iSCSI export, and QEMU; CephFS via fuse / kernel clients (CephFS → NFS).]

SLIDE 20

BlueStore migration (+ NVMe)

FileStore: 2 disks in RAID1 for the OS, SSDs for journals; 6TB × 8 data disks = 48 TB (66% of the bays used for data).
BlueStore: OS / Docker, DB/WAL, and bcache all on NVMe; 6TB × 12 data disks = 72 TB (100% of the bays used for data).
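With ceph-volume, a BlueStore OSD that keeps data on the (bcache-fronted) HDD and its DB/WAL on NVMe is one command; device paths are illustrative:

$ ceph-volume lvm create --bluestore --data /dev/bcache0 --block.db /dev/nvme0n1p6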

SLIDE 21

CephFS offering

  • Shared File System (like NFS)
  • POSIX-compliant file system
  • Data Pool (coexists with RBD in the same cluster)
  • Metadata Pool
  • Multiple MDS Servers
  • Hot Standby / Standby MDS
  • Scheduling
  • Direct access to file data blocks
  • Fuse / Kernel Mount
  • Quota Support

https://docs.ceph.com/docs/master/_images/cephfs-architecture.svg
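Standing up a CephFS instance takes a data pool, a metadata pool, and the filesystem itself; a sketch with illustrative names and PG counts:

$ ceph osd pool create cephfs_data 4096
$ ceph osd pool create cephfs_metadata 512
$ ceph fs new mycephfs cephfs_metadata cephfs_data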

SLIDE 22

MDS High Availability : Standby MDS

Floating standby: MDS #1 (rank 0) and MDS #2 (rank 1) share a single standby MDS that can take over either rank.
Hot standby: each active MDS (#1 rank 0, #2 rank 1) has its own dedicated hot-standby MDS (H/S).
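In Nautilus-era releases, hot standby (standby-replay, where the standby continuously follows the active MDS's journal) is enabled per filesystem:

$ ceph fs set <fs_name> allow_standby_replay true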

SLIDE 23

Multiple MDS

Single MDS: one MDS (rank 0) handles all metadata and becomes a bottleneck.
Multiple MDS: metadata load is spread across MDS #1 (rank 0), MDS #2 (rank 1), and MDS #3 (rank 2).

$ ceph fs set <fs_name> max_mds 3

SLIDE 24

Subtree Pinning (static)

$ setfattr -n ceph.dir.pin -v <rank> </path>

[Diagram: the directory tree under Root is sharded so that each subtree is served by a fixed MDS rank (MDS #1 rank 0, MDS #2 rank 1, MDS #3 rank 2).]

Per-directory pool layout (e.g. an SSD-backed tree next to an HDD-backed one):
cephfs_volume_prefix = /ceph_ssd
$ setfattr -n ceph.dir.layout.pool -v SSD_POOL /ceph_ssd
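A usage sketch: pinning two subtrees to dedicated ranks and verifying the pin (the paths are illustrative):

$ setfattr -n ceph.dir.pin -v 1 /volumes/tenant-a
$ setfattr -n ceph.dir.pin -v 2 /volumes/tenant-b
$ getfattr -n ceph.dir.pin /volumes/tenant-a    # confirm the pin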

SLIDE 25

CephFS : fuse / kernel

ceph-fuse: Application → VFS → FUSE → ceph-fuse (user space). Supports quotas.
kernel mount: Application → VFS → CephFS kernel client (kernel space). Fast.
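The two client paths side by side (monitor address, credentials, and mountpoints are illustrative):

$ ceph-fuse -n client.admin /mnt/cephfs-fuse
$ mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret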

SLIDE 26

https://pommi.nethuis.nl/ssd-caching-using-linux-and-bcache/

Block Cache : bcache

  • bcache (writeback mode)
  • in the mainline kernel since 3.10
  • alternatives: Flashcache (Facebook), EnhanceIO
  • NVMe as the caching device
  • targets random read/write workloads

SLIDE 27

[Diagram: OSD on bcache. HDDs (sda, sdb, sdc, sde, sdf, ..., sdx) act as backing devices (bdev0); NVMe partitions serve as DB/WAL (/dev/nvme0n1p6 ... /dev/nvme0n1p17) and as bcache cache devices (/dev/nvme0n1p18 ... /dev/nvme0n1p29, cache0), registered under /sys/fs/bcache/99c3ab27-e819-4f18-8947-924c53045bbc. The resulting /dev/bcache0 is partitioned: /dev/bcache0p1 holds the OSD directory, bind-mounted at /var/lib/ceph/osd/ceph-2/ (fsid dd961c1a-0a05-4581-af03-77c28fb8fcbc), whose "block" and "block.db" entries link via /dev/disk/by-partuuid/ (4c06c8f6-2967-4165-9e85-fa0382d6ab17, e530dbba-706a-4f2f-91e3-b90b0df3a650) to /dev/bcache0p2 and the NVMe DB partition; ceph-osd ... --id 2 reads them from there.]
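A minimal bcache bring-up matching the diagram (device names and the uuid placeholder are illustrative):

$ make-bcache -B /dev/sdb                               # backing HDD -> /dev/bcache0
$ make-bcache -C /dev/nvme0n1p18                        # NVMe partition as the cache set
$ bcache-super-show /dev/nvme0n1p18 | grep cset.uuid    # find the cache-set uuid
$ echo <cset-uuid> > /sys/block/bcache0/bcache/attach   # attach the cache to the backing device
$ echo writeback > /sys/block/bcache0/bcache/cache_mode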

SLIDE 28

QoS (Quality of Service)

Frontend QoS (current): per-container docker cgroup limits on cpu, memory, io, and network.

/sys/fs/cgroup/blkio/docker/{container id}/
  blkio.throttle.read_bps_device
  blkio.throttle.read_iops_device
  blkio.throttle.write_bps_device
  blkio.throttle.write_iops_device
format: {major}:{minor} {value}, e.g. 252:48 1048576

Backend QoS (WIP): CephFS (file), based on a token-bucket algorithm; Ceph RBD (block), based on dm-clock.
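Applying the example value from the slide, this limits one container's reads on device 252:48 to 1 MiB/s:

$ echo "252:48 1048576" > /sys/fs/cgroup/blkio/docker/{container id}/blkio.throttle.read_bps_device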

SLIDE 29

Single Private Docker Registry

[Diagram: a single private Docker registry in KOREA; workers push images locally, while workers in the USA and EUROPE pull images across the WAN, taking many ("??") minutes.]

SLIDE 30

Replicate Repository : Harbor

[Diagram: a master Harbor registry (M) backed by Ceph receives pushed images and replicates them to a local Harbor registry (L), also backed by Ceph; workers #1-#4 pull from their local registry, falling back to the master only in an emergency.]

SLIDE 31

Replication Bottleneck

[Diagram: the master Harbor (with its Ceph RGW backend) pushes replicate #1, #2, ..., #N to every local Harbor (each with its own Ceph RGW and Ceph cluster); workers pull from their local registry. Fanning all replication out from the master puts heavy load on it.]

SLIDE 32

Ceph RGW Multi-Site : Sync

[Diagram: images are pushed to the master Harbor + Ceph radosgw; RGW multi-site sync replicates the backing objects to the Harbor + Ceph radosgw sites in Europe and the USA, where local workers pull; the master remains available as an emergency fallback.]
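The shape of adding a secondary RGW zone that syncs from the master (realm/zone names, endpoints, and keys are illustrative):

$ radosgw-admin realm pull --url=http://master-rgw:80 --access-key=<key> --secret=<secret>
$ radosgw-admin zone create --rgw-zonegroup=global --rgw-zone=europe \
      --endpoints=http://eu-rgw:80 --access-key=<key> --secret=<secret>
$ radosgw-admin period update --commit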

SLIDE 33

Extend PG NUM

PG count ≈ (number of OSDs × 100) / replica count, rounded to a power of two (the commonly recommended sizing rule)

  • Too few PGs per OSD can lead to drawbacks in performance, recovery time, and evenness of data distribution

Doubling the PGs (e.g. from 8192 to 16384):
$ ceph osd pool set <pool> pg_num <int>
$ ceph osd pool set <pool> pgp_num <int>

  • pg_num : splitting
  • pgp_num : rebalancing / backfilling
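The doubling from the slide, written out (pool name illustrative); raising pgp_num is what actually triggers the data movement:

$ ceph osd pool set rbd_pool pg_num 16384
$ ceph osd pool set rbd_pool pgp_num 16384
$ ceph -s    # watch the misplaced fraction drain as backfill proceeds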
SLIDE 34

Extend PG NUM

  • Increasing pg_num by 1 misplaces about 0.1% of objects
  • The misplaced ratio grows at that same rate up to an increase of 64
  • An increase of 128 misplaces 9.8%, i.e. 2-3 percentage points less than linear scaling would predict
  • Beyond an increase of 16, the rebalance duration also grows
SLIDE 35

Extend PG NUM

Changing pg_num by 1, eight times, causes 7 data movements; changing it by 8 in one step also causes 7 movements. Conclusion: unless you extend PGs in steps of 128 or more, going one at a time moves about as much data as going several at a time, and the elapsed time for a 1-PG change is nearly the same as for a 64-PG change.

SLIDE 36

Thank you.

Q&A