SLIDE 1

Evaluating Lustre’s Metadata Server on a Multi-Socket Platform

Konstantinos Chasapis

Scientific Computing, Department of Informatics, University of Hamburg

9th Parallel Data Storage Workshop

SLIDE 2

Motivation

Metadata performance can be crucial to overall system performance
  • Applications create thousands of files (file-per-process)
    • Normal output files
    • Files that store their current state (snapshot files)
  • stat operations to check the application’s state
  • Cleaning up “old” snapshot files and temporary files
Solutions:
  • Improve metadata architectures and algorithms
  • Use more sophisticated hardware on the metadata servers
    • Increase processing power in the same server: add cores and sockets
    • Replace HDDs with SSDs

SLIDE 3

Motivation

Our work

Evaluate Lustre metadata server performance when using multi-socket platforms
Contributions:
  • An extensive performance evaluation and analysis of the create and unlink operations in Lustre
  • A comparison of Lustre’s metadata performance with the local file systems ext4 and XFS
  • The identification of hardware best suited for Lustre’s metadata server

SLIDE 4

Outline

1 Motivation
2 Related Work
3 Lustre Overview
4 Methodology
5 Experimental Results
6 Conclusions

SLIDE 5

Related Work

Lustre metadata performance:
  • Alam et al., “Parallel I/O and the Metadata Wall,” PDSW ’11
    • Measured the implications of network overhead on the file system’s scalability
    • Evaluated performance improvements when using SSDs instead of HDDs
  • Shipman et al., “Lessons Learned in Deploying the World’s Largest Scale Lustre File System,” ORNL Tech. Rep., 2010
    • Configurations that can optimize metadata performance in Lustre
Performance scaling with the number of cores:
  • Boyd-Wickizer et al., “An Analysis of Linux Scalability to Many Cores,” OSDI ’10
    • Performed an analysis of the Linux kernel while running on a 48-core server

SLIDE 6

Lustre Overview

  • Parallel distributed file system
  • Separates metadata servers from data servers
  • Version 2.6 supports distributed metadata
  • Uses a back-end file system to store data (ldiskfs or ZFS)

[Architecture diagram: Compute Nodes connected over the network to Object Storage Servers (OSS) with Object Storage Targets (OST), and to Metadata Servers (MDS) with a Metadata Target (MDT)]

SLIDE 7

Lustre Overview

Lustre metadata operation path

  • Complex path that goes through many layers
  • VFS, since Lustre is POSIX compliant
  • LNET network communication protocol
  • File data is stored on the OSSs
  • The Lustre inode stores the object-id-to-OST mapping

[Layer diagram of the metadata operation path. Client: VFS, llite, MDC, PTL-RPC, LNET; MDS: LNET, PTL-RPC, MDD, ldiskfs, jbd2]

SLIDE 8

Methodology

  • Use a single multi-socket server
  • Use the mdtest metadata benchmark to stress the MDS
  • Compare with ext4 and XFS
    • ext4, since ldiskfs is based on it
    • XFS, a high-performance local file system
    • ZFS is not included because of its poor metadata performance
  • Measure create and unlink operations
    • stat performance depends heavily on the OSSs
  • Collect system statistics (a sampling sketch follows this list):
    • CPU utilization
    • Block device utilization
    • OProfile stats: % of CPU consumed by Lustre’s modules
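As an illustration of the statistics collection, here is a minimal sketch (an assumption, not the authors' tooling) that samples overall CPU utilization from /proc/stat; block-device utilization could be derived analogously from /proc/diskstats.

```python
# Minimal sketch: sample overall CPU utilization from /proc/stat over a short interval.
import time

def read_cpu_times():
    with open("/proc/stat") as f:
        # First line: "cpu  user nice system idle iowait irq softirq ..."
        fields = f.readline().split()[1:]
    times = list(map(int, fields))
    idle = times[3] + times[4]          # idle + iowait
    return idle, sum(times)

def cpu_utilization(interval=1.0):
    idle0, total0 = read_cpu_times()
    time.sleep(interval)
    idle1, total1 = read_cpu_times()
    busy = (total1 - total0) - (idle1 - idle0)
    return 100.0 * busy / (total1 - total0)

print(f"CPU utilization: {cpu_utilization():.1f}%")
```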

SLIDE 9

Methodology

Hardware specifications

We use a four-socket server consisting of:
  • Supermicro H8QG6 motherboard
  • 4× AMD Opteron 6168 (Magny-Cours) 12-core processors at 1.9 GHz
  • 128 GB of DDR3 main memory running at 1,333 MHz
  • Western Digital Caviar Green 2 TB SATA2 HDD and 2× Samsung 840 Pro Series 128 GB SATA3 SSDs
  • Memory throughput: 8.7 GB/s local and 4.0 GB/s remote

[Topology diagram: 4 sockets, each containing two NUMA nodes with local memory (NUMA nodes 0-7, 6 cores each, cores 0-47); sockets are connected via HyperTransport (HT) links]

SLIDE 10

Methodology

Testbed Setup

  • CentOS 6.4 with the patched Lustre kernel version 2.6.32-358.6.2.el6
  • Lustre 2.4 RPMs provided by Intel (Whamcloud)
  • The Linux governor is set to performance, which runs all cores at the maximum frequency and yields the maximum memory bandwidth
  • An SSD for the OST
  • For the ext4 and XFS experiments, we use an SSD
  • cfq device scheduler for the MDT (see the sketch after this list)
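A minimal sketch, assuming root access and a placeholder MDT device name, of how the two kernel knobs above (the performance governor and the cfq I/O scheduler) could be applied via sysfs:

```python
# Minimal sketch, assuming root access; "sdb" is a placeholder for the MDT SSD.
import glob

def set_cpu_governor(governor="performance"):
    # Run every core at its maximum frequency.
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"):
        with open(path, "w") as f:
            f.write(governor)

def set_io_scheduler(block_dev, scheduler="cfq"):
    # Select the I/O scheduler for the MDT block device.
    with open(f"/sys/block/{block_dev}/queue/scheduler", "w") as f:
        f.write(scheduler)

set_cpu_governor("performance")
set_io_scheduler("sdb", "cfq")
```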

SLIDE 11

Methodology

mdtest Benchmark

MPI-parallelized benchmark that runs in phases; in each phase a single type of POSIX metadata operation is issued to the underlying file system (an illustrative invocation follows the list below).
Configuration:
  • Private directories per process
  • 500,000 files for Lustre
  • 3,000,000 files for ext4 and XFS
  • Unmount the file system and flush kernel caches after each operation
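A hypothetical sketch of one such run (not the authors' scripts; the MPI launcher, paths, process count, MGS node name, and the assumption that the file count is split across processes are all illustrative), using mdtest's per-process private directories and a cache flush plus remount between phases:

```python
# Hypothetical driver for split mdtest phases; names marked as placeholders are not from the slides.
import subprocess

NPROCS = 24                          # e.g. 2x the number of active cores
TOTAL_FILES = 500_000                # Lustre file count from the slide
MOUNT_POINT = "/mnt/lustre_1"        # placeholder mount point
LUSTRE_FS = "mgsnode@tcp:/lustre"    # placeholder MGS node / fsname

def run_phase(phase_flag):
    # phase_flag: "-C" runs only the create phase, "-r" only the remove (unlink) phase
    subprocess.run(
        ["mpirun", "-np", str(NPROCS), "mdtest",
         "-F",                                   # files only, no directories
         "-u",                                   # private working directory per process
         "-n", str(TOTAL_FILES // NPROCS),       # items per process
         "-d", f"{MOUNT_POINT}/mdtest",
         phase_flag],
        check=True)

def flush_and_remount():
    # Unmount, drop the kernel caches, and remount so the next phase starts cold.
    subprocess.run(["umount", MOUNT_POINT], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3")
    subprocess.run(["mount", "-t", "lustre", LUSTRE_FS, MOUNT_POINT], check=True)

run_phase("-C")      # create
flush_and_remount()
run_phase("-r")      # unlink
```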

SLIDE 12

Methodology

Configurations

1 Scaling with the number of cores

Increase the workload and the active cores

2 Bind mdtest processes to specific sockets

Same workload, with the mdtest processes divided among the active sockets (see the sketch after this list)

3 Use of multiple mount points

Increase the mount points used to access the file system

4 Back-end device limitation

Measure MDS performance while using a kernel RAMDISK as the MDT
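For configuration 2, a minimal sketch (an assumption, not the authors' tooling) of pinning a process to one socket of the 4-socket, 12-cores-per-socket machine before launching mdtest:

```python
# Pin the calling process (and any mdtest it exec's) to the cores of one socket.
# Assumes cores are numbered consecutively per socket, as in the slide 9 topology.
import os

CORES_PER_SOCKET = 12

def bind_to_socket(socket_id):
    first = socket_id * CORES_PER_SOCKET
    os.sched_setaffinity(0, set(range(first, first + CORES_PER_SOCKET)))

# Example: spread 12 mdtest processes over 2 active sockets by rank.
# bind_to_socket(rank % 2); os.execvp("mdtest", ["mdtest", ...])
```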

SLIDE 13

Experimental Results

Configuration: Scaling with the number of cores

SLIDE 14

Experimental Results

Lustre’s performance vs. Active cores

  • #mdtest processes equals 2× #active cores
  • 6 mount points
  • Lustre modules’ CPU usage drops by 2× from 12 to 24 active cores

[Plot: create and unlink throughput (kOps/sec) and % CPU utilization vs. number of active cores (6-48)]

SLIDE 15

Experimental Results

ext4 and XFS performance vs. Active cores

  • #mdtest processes equals 2× #active cores
  • 6 mount points

[Plot: XFS and ext4 create and unlink throughput (kOps/sec) vs. number of active cores (6-48)]

SLIDE 16

Experimental Results

Configuration: Bind mdtest processes per socket

SLIDE 17

Experimental Results

Lustre performance, bind per socket configuration

  • All cores are active
  • Group mdtest processes per socket
  • 12 mdtest processes
  • 12 mount points

[Plot: create and unlink throughput (kOps/sec) and % CPU utilization vs. number of sockets used (1-4)]

SLIDE 18

Experimental Results

Configuration: Use of multiple mount points (MPs)

SLIDE 19

Experimental Results

Multiple mount point configuration

  • Mount the file system in several directories
  • Access the file system from different paths (a mounting sketch follows the diagram below)

[Diagram: the file system mounted at multiple mount points (mnt_lustre_1 … mnt_lustre_m), each holding private mdtest directories (dir_1 … dir_m+n), with one mdtest process (mdtest_1 … mdtest_m+n) working in each directory]
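A hypothetical sketch (mount-point paths and the MGS/fsname are placeholders) of creating several mount points for the same Lustre file system, as depicted above:

```python
# Mount the same Lustre file system at several directories so that each mdtest
# process can be given its own path into the namespace.
import subprocess

LUSTRE_FS = "mgsnode@tcp:/lustre"    # placeholder MGS node / fsname
NUM_MOUNT_POINTS = 12

for i in range(1, NUM_MOUNT_POINTS + 1):
    mp = f"/mnt/lustre_{i}"
    subprocess.run(["mkdir", "-p", mp], check=True)
    subprocess.run(["mount", "-t", "lustre", LUSTRE_FS, mp], check=True)
```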

SLIDE 20

Experimental Results

Lustre’s performance vs. Mount points

  • 24 mdtest processes
  • 12 active cores
  • Lustre modules’ CPU usage increases by 5× from 1 MP to 12 MPs

[Plot: create and unlink throughput (kOps/sec) and % CPU utilization vs. number of mount points (1, 2, 6, 12, 24)]

SLIDE 21

Experimental Results

ext4 and XFS performance vs. Mount points

  • 24 mdtest processes
  • 12 active cores

[Plot: XFS and ext4 create and unlink throughput (kOps/sec) vs. number of mount points (1, 2, 6, 12, 24)]

SLIDE 22

Experimental Results

Configuration: Back-end device limitation

SLIDE 23

Experimental Results

Lustre’s performance using different back-end devices

  • #mdtest processes equals 2× #active cores
  • 6 mount points

[Plot: create and unlink throughput (kOps/sec) with a RAM disk vs. an SSD as the MDT, for 12-48 active cores]

SLIDE 24

Conclusions

Main observations:
  • Lustre MDS performance improvement is limited to a single socket
  • The MDT device does not seem to be the bottleneck
  • Using multiple mount points per client can significantly increase performance
  • Previous work: the number of cores is less significant than the CPU clock
Thank you - Questions?
konstantinos.chasapis@informatik.uni-hamburg.de

SLIDE 25

Backup slides

Lustre module usage when increasing the number of active cores

% of CPU consumed per module (OProfile), for 12 and 24 active cores:

Module        create 12 C   create 24 C   unlink 12 C   unlink 24 C
obdclass      6.38          3.58          12.96         8.57
mdtest        4.22          3.59          <0.1          <0.1
ldiskfs       5.30          2.26          3.01          1.70
lnet          <0.1          <0.1          3.24          2.17
libcfs        2.61          1.28          3.30          2.04
osd_ldiskfs   2.16          0.86          2.06          1.11
jbd2          1.76          0.97          1.23          0.78
lvfs          1.41          0.98          1.93          1.60
lustre        1.32          0.56          1.81          0.88
mdd           1.65          0.30          0.90          0.42
mdt           1.65          0.73          1.84          0.97

SLIDE 26

Backup slides

Lustre module usage when increasing the number of mount points (MPs)

% of CPU consumed per module (OProfile), for 1 and 12 mount points:

Module        create 1 MP   create 12 MP   unlink 1 MP   unlink 12 MP
ptl-rpc       2.38          11.38          3.04          11.67
obdclass      1.56          7.53           3.03          12.86
ldiskfs       1.30          6.53           0.62          3.12
lnet          0.82          3.75           <0.1          <0.1
libcfs        0.68          3.06           <0.1          <0.1
osd_ldiskfs   0.47          2.61           0.43          2.10
jbd2          0.45          2.15           0.38          1.26
mdt           0.36          2.11           <0.1          <0.1
lvfs          0.30          1.74           0.44          1.97
lustre        0.29          1.70           0.42          1.84
lod           0.18          1.06           <0.1          <0.1
mdd           0.18          1.03           <0.1          <0.1
mdc           0.13          0.71           <0.1          <0.1
mdt           <0.1          <0.1           0.42          1.87
