
SLIDE 1

Memory Saving Techniques

Annual Concurrency Forum Meeting, Fermilab, February 5, 2013

SLIDE 2

1. Session Introduction
2. Kernel-compressed Memory

SLIDES 3–6

Introduction

Assumptions
∙ The number of cores grows faster than the amount of memory
∙ Event-level parallelism: memory consumption per event has to decrease
∙ Orthogonal to other parallelization efforts

Memory per core (change of memory model)
∙ On the Grid: 2 GB per core
∙ ARM servers (or: hyper-threading): 1 GB per core/thread
∙ Xeon Phi (MIC): 100 MB per core
∙ GPUs: an order of magnitude less

SLIDE 7

Explored Memory Saving Techniques

Summary of the memory saving techniques explored so far:

Memory Sharing
∙ Fork and copy-on-write: the fork should be done reasonably late (a sketch follows below)
∙ Kernel SamePage Merging: sharing is done automatically, at the cost of speed
∙ Multi-threaded application (Geant4-MT): can go beyond the page-wise sharing of the fork model

Reduction of Memory Consumption
∙ Kernel-compressed memory (zRam, frontswap, cleancache): a virtual swap area is used to compress unused memory
∙ X32 ABI: x86_64 semantics with 32-bit pointers; restricts the address space to 4 GB (which should be acceptable)

These techniques are all (relatively) non-intrusive.
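A minimal sketch of the fork-and-copy-on-write pattern (illustrative only, not code from any of the experiment frameworks; the "geometry" vector stands in for large read-only state):

// Fork-and-copy-on-write: build all large read-only state first, then
// fork workers. The kernel shares the parent's physical pages with the
// children and only copies a page when somebody writes to it.
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main() {
  // Large read-only data, initialized once ("fork reasonably late").
  std::vector<double> geometry(50 * 1000 * 1000, 1.0);  // ~400 MB

  const int nWorkers = 4;
  for (int i = 0; i < nWorkers; ++i) {
    if (fork() == 0) {
      // Child: reading shares the parent's pages; the ~400 MB are not
      // duplicated as long as the child does not write to them.
      double sum = 0;
      for (double x : geometry) sum += x;
      std::printf("worker %d: checksum %f\n", i, sum);
      _exit(0);
    }
  }
  for (int i = 0; i < nWorkers; ++i) wait(nullptr);
  return 0;
}

Memory written after the fork (event data, histograms) is duplicated page by page, which is why forking late matters.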

SLIDE 8

Discussion Items

Job scheduling
∙ For memory sharing: jobs with similar input data should be co-scheduled (see the KSM sketch below)
∙ In general: a good mix of jobs should be scheduled

Techniques provided by the Linux kernel
∙ Many of the new features are not available in SL6
∙ Virtual machines can be used to couple a new kernel with an SL6 user land
∙ Automatically adjusting kernel parameters can be difficult

New platforms
∙ There might be a need to recompile (and verify) the software stack for ARM and/or X32
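Kernel SamePage Merging only scans memory regions that a process has opted in via madvise(2). A minimal sketch, assuming a kernel built with CONFIG_KSM and scanning enabled (echo 1 > /sys/kernel/mm/ksm/run):

// Opt a large anonymous mapping into Kernel SamePage Merging: the ksmd
// kernel thread scans it and transparently merges pages that are
// byte-identical across processes (unmerged again on write).
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
  const std::size_t len = 512UL * 1024 * 1024;  // 512 MB example region
  void* buf = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (buf == MAP_FAILED) return 1;

  // Mark the region as mergeable; without this ksmd ignores it.
  if (madvise(buf, len, MADV_MERGEABLE) != 0)
    std::perror("madvise(MADV_MERGEABLE)");

  // ... load input data into buf; identical pages of co-scheduled jobs
  // reading the same input can now be shared by the kernel ...
  return 0;
}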

SLIDE 9

1. Session Introduction
2. Kernel-compressed Memory

SLIDES 10–12

Kernel-compressed Memory – Principle

∙ The kernel module compcache / zram provides a virtual block device for swapping
∙ Originally developed for “small” devices (netbooks, phones, ...)
∙ Part of the kernel since 2.6.34; can be compiled for SLC6 (with drawbacks)

[Diagram: application pages and kernel pages are swapped not to a physical swap device but to /dev/zram0, where they are stored LZO-compressed in RAM]

Change in strategy: from “do not swap at all” to “swap whenever possible” (/proc/sys/vm/swappiness; a set-up sketch follows below)
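The device itself is configured through sysfs. A sketch of the setup steps (not shown in the slides; /sys/block/zram0/disksize and /proc/sys/vm/swappiness are the standard knobs, while the 1 GB size and the swap priority are arbitrary example values):

// Set up /dev/zram0 as high-priority swap (requires root and a prior
// "modprobe zram"). Writing the sysfs knobs directly mirrors what the
// usual shell recipes do.
#include <cstdlib>
#include <fstream>
#include <string>

static bool writeKnob(const std::string& path, const std::string& value) {
  std::ofstream f(path);
  return static_cast<bool>(f << value << "\n");
}

int main() {
  // Uncompressed capacity of the in-RAM compressed block device.
  if (!writeKnob("/sys/block/zram0/disksize",
                 std::to_string(1024L * 1024 * 1024)))  // 1 GB
    return 1;

  // Format and enable it as swap, preferred over any disk swap.
  if (std::system("mkswap /dev/zram0 && swapon -p 100 /dev/zram0") != 0)
    return 1;

  // "Swap whenever possible": raise swappiness from its default of 60.
  return writeKnob("/proc/sys/vm/swappiness", "100") ? 0 : 1;
}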

SLIDE 13

Kernel-compressed Memory and cgroups

The system memory pressure and the swappiness are not fine-grained enough as handles for measurements. Linux cgroups allow putting the application into a memory-limited container:

$ mkdir /sys/fs/cgroup/memory/restricted
$ echo $((150*1024*1024)) > \
    /sys/fs/cgroup/memory/restricted/memory.limit_in_bytes
$ echo $PID > /sys/fs/cgroup/memory/restricted/tasks
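One conceivable way to record curves like those on the following slides is to sample the container's counters while the job runs. The "rss" and "swap" fields are real cgroup-v1 memory.stat counters (swap requires swap accounting to be enabled); the monitor itself is not from the slides:

// Periodically print the rss and swap counters of the "restricted"
// memory cgroup, in MB, until interrupted.
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main() {
  const std::string stat = "/sys/fs/cgroup/memory/restricted/memory.stat";
  for (;;) {
    std::ifstream f(stat);
    std::string key;
    long long value;
    while (f >> key >> value) {
      if (key == "rss" || key == "swap")
        std::cout << key << "=" << value / (1024 * 1024) << "MB ";
    }
    std::cout << std::endl;
    std::this_thread::sleep_for(std::chrono::seconds(5));
  }
}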

SLIDES 14–16

Kernel-compressed Memory – Figures

AliRoot reconstruction of 2 simulated pp events (v5-04-25-AN):
∙ Normal run
∙ cgroup memory restriction to 950 MB
∙ cgroup memory restriction to 240 MB

[Plots: Memory [MB] vs. Time [s]; curves for vss, rss, physical mem, swapped, 0-pages, rss+swapped]

SLIDES 17–20

Kernel-compressed Memory – Figures

AliRoot reconstruction of 10 PbPb events (v5-03-62-AN):
∙ Normal run
∙ cgroup memory restriction to 900 MB
∙ cgroup memory restriction to 450 MB
∙ cgroup memory restriction to 150 MB

[Plots: Memory [MB] vs. Time [s]; curves for vss, rss, physical mem, swapped, 0-pages, rss+swapped]

SLIDE 21

Zero Pages

Cross-check: scan through a core dump of the application (a sketch follows below). Can we get rid of these hundreds of megabytes of contiguous zeros?

∙ No change by using automatic garbage collection (Boehm’s GC)
∙ Zero pages in LHCb DaVinci: ≈ 700 MB out of 2.3 GB
∙ Zero pages in CMS reconstruction:
  • 180 MB out of 900 MB without output
  • 280 MB out of 1.4 GB with output
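One way the cross-check can be implemented (a sketch: it treats the core dump as a raw byte stream and ignores the ELF structure, which is good enough for an estimate; the file name is an example):

// Count the 4 kB pages of a core dump that consist entirely of zeros.
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

int main(int argc, char** argv) {
  const char* path = (argc > 1) ? argv[1] : "core.12345";  // example name
  std::FILE* f = std::fopen(path, "rb");
  if (!f) { std::perror("fopen"); return 1; }

  const std::size_t kPage = 4096;
  std::vector<char> page(kPage);
  const std::vector<char> zero(kPage, 0);
  std::size_t total = 0, zeroPages = 0;

  // A trailing partial page is ignored.
  while (std::fread(page.data(), 1, kPage, f) == kPage) {
    ++total;
    if (std::memcmp(page.data(), zero.data(), kPage) == 0) ++zeroPages;
  }
  std::fclose(f);

  std::printf("%zu of %zu pages (%.0f MB) are all zeros\n", zeroPages,
              total, zeroPages * kPage / (1024.0 * 1024.0));
  return 0;
}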

SLIDE 22

Forensics: First Results

Idea: inspect memset() calls larger than 4 kB (one possible interposer is sketched below)

Dead pages (AliRoot reco)
∙ ≈ 40 % of zero pages traced back to source code
∙ Breaks down to half a dozen memsets with high impact
∙ No hits after detector initialization
∙ Scattered over uses of TClonesArray

Remaining zero pages
∙ Excluded: read(), mmap()
∙ Excluded: ROOT buffers
∙ Measurement uncertainties at memset boundaries
∙ Only literal memset() calls are covered, not zero-initializing constructors such as int *a = new int[1024*1024]();
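One possible implementation of the memset inspection, as an LD_PRELOAD interposer (illustrative; the slides do not show the actual tool, so the logging format and the backtrace depth are arbitrary choices):

// memset_spy.cpp: log every memset that zeroes more than 4 kB together
// with a short backtrace, so the call can be traced back to source.
// Build: g++ -shared -fPIC -o memset_spy.so memset_spy.cpp -ldl
// Run:   LD_PRELOAD=./memset_spy.so ./reconstruction
#include <dlfcn.h>
#include <execinfo.h>
#include <cstddef>
#include <cstdio>

extern "C" void* memset(void* s, int c, std::size_t n) {
  using Fn = void* (*)(void*, int, std::size_t);
  static Fn realMemset = nullptr;
  if (!realMemset)
    realMemset = reinterpret_cast<Fn>(dlsym(RTLD_NEXT, "memset"));

  // Guard: fprintf/backtrace may call memset themselves.
  static thread_local bool inHook = false;
  if (!inHook && c == 0 && n > 4096) {
    inHook = true;
    std::fprintf(stderr, "memset(0, %zu bytes)\n", n);
    void* frames[8];
    backtrace_symbols_fd(frames, backtrace(frames, 8), 2);  // to stderr
    inHook = false;
  }
  return realMemset(s, c, n);
}

The addresses printed by backtrace_symbols_fd can then be mapped to source lines with addr2line.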

SLIDE 23

Next Steps

1. Forensics: trace large zero-runs back to the malloc() that created them
2. How to choose the zram parameters for an optimal tradeoff with respect to throughput?

Perhaps zram can also be used as an “overflow” mechanism to make sure that a job finishes.