

  1. Memory Saving Techniques — Annual Concurrency Forum Meeting, Fermilab, February 5, 2013

  2. Outline: 1 Session Introduction, 2 Kernel-compressed Memory

  3. Introduction — Assumptions
     The number of cores grows faster than the amount of memory.
     Event-level parallelism: memory consumption per event has to decrease.
     Orthogonal to other parallelization efforts.
     ∙ On the Grid: 2 GB per core
     ∙ ARM servers (or: hyper-threading): 1 GB per core/thread
     ∙ GPUs: an order of magnitude less
     ∙ Xeon Phi (MIC): 100 MB of memory per core, change of memory model

  7. Explored Memory Saving Techniques
     Summary of the memory saving techniques explored so far:
     Memory Sharing
     ∙ Fork and copy-on-write — fork should be done reasonably late
     ∙ Kernel SamePage Merging — sharing is done automatically at the cost of speed
     ∙ Multi-threaded application (Geant4-MT) — can go beyond page-wise sharing of the fork model
     Reduction of Memory Consumption
     ∙ Kernel-compressed memory (zram, frontswap, cleancache) — virtual swap area used to compress unused memory
     ∙ X32 ABI: x86_64 semantics with 32-bit pointers — restricts the address space to 4 GB (which should be acceptable)
     These techniques are all (relatively) non-intrusive.
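To make the Kernel SamePage Merging item concrete, a minimal sketch of enabling KSM via its standard sysfs interface (requires root and a kernel built with CONFIG_KSM=y; the tuning values below are arbitrary examples, not recommendations from the slides). Note that ksmd only scans memory regions an application has advised with madvise(..., MADV_MERGEABLE).

```shell
# Start the ksmd scanner thread (requires root).
echo 1 > /sys/kernel/mm/ksm/run
# Illustrative tuning: pages scanned per wake-up, pause between scans.
echo 1000 > /sys/kernel/mm/ksm/pages_to_scan
echo 20 > /sys/kernel/mm/ksm/sleep_millisecs
# Pages currently deduplicated across processes:
cat /sys/kernel/mm/ksm/pages_sharing
```

The pages_sharing counter gives a direct measure of how much the "at the cost of speed" tradeoff is buying.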

  8. Discussion Items
     Job scheduling
     ∙ For memory sharing: jobs with similar input data should be co-scheduled
     ∙ In general: a good mix of jobs should be scheduled
     Techniques provided by the Linux kernel
     ∙ Many of the new features are not available in SL6
     ∙ Virtual machines can be used to couple a new kernel with an SL6 user land
     ∙ Automatically adjusting kernel parameters can be difficult
     New platforms
     ∙ There might be a need to recompile (and verify) the software stack for ARM and/or X32

  9. Outline: 1 Session Introduction, 2 Kernel-compressed Memory

  10. Kernel-compressed Memory – Principle
      ∙ Kernel module compcache/zram provides a virtual block device for swapping
      ∙ Originally developed for “small” devices (netbooks, phones, ...)
      ∙ Part of kernel >= 2.6.34; can be compiled for SLC6 (with drawbacks)
      [Diagram: application pages and kernel pages are LZO-compressed into /dev/zram0 instead of going to a physical swap device]
      Change in strategy: from “do not swap at all” to “swap whenever possible” (/proc/sys/vm/swappiness)
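The setup described above can be sketched as a short configuration sequence (requires root; the 512 MB device size and swap priority are illustrative choices, not values from the slides):

```shell
# Load the zram module (named compcache in older kernels) and size the device.
modprobe zram
echo $((512*1024*1024)) > /sys/block/zram0/disksize   # 512 MB virtual device
# Turn it into swap space and prefer it over any disk-backed swap.
mkswap /dev/zram0
swapon -p 100 /dev/zram0
# Change in strategy: swap whenever possible instead of never.
echo 100 > /proc/sys/vm/swappiness
```

With a high swappiness the kernel will move cold pages into the compressed device aggressively, which is exactly the behavior the measurements below exercise.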

  13. Kernel-compressed Memory and cgroups
      System-wide memory pressure and the swappiness setting are not fine-grained enough as handles for controlled measurements.
      Linux cgroups allow putting the application into a memory-limited container:
        $ mkdir /sys/fs/cgroup/memory/restricted
        $ echo $((150*1024*1024)) > \
            /sys/fs/cgroup/memory/restricted/memory.limit_in_bytes
        $ echo $PID > /sys/fs/cgroup/memory/restricted/tasks

  14. Kernel-compressed Memory – Figures
      AliRoot reconstruction of 2 simulated pp events (v5-04-25-AN), normal run.
      [Plot: memory (MB) vs. time (s); curves: vss, rss, physical mem, swapped, 0-pages, rss+swapped]

  15. Same measurement with a cgroup memory restriction of 950 MB. [Plot as above]

  16. Same measurement with a cgroup memory restriction of 240 MB. [Plot as above]

  17. Kernel-compressed Memory – Figures
      AliRoot reconstruction of 10 PbPb events (v5-03-62-AN), normal run.
      [Plot: memory (MB) vs. time (s); curves: vss, rss, physical mem, swapped, 0-pages, rss+swapped]

  18. Same measurement with a cgroup memory restriction of 900 MB. [Plot as above]

  19. Same measurement with a cgroup memory restriction of 450 MB. [Plot as above]

  20. Same measurement with a cgroup memory restriction of 150 MB. [Plot as above]

  21. Zero Pages
      Cross-check: scan through a core dump of the application.
      Can we get rid of these hundreds of megabytes of contiguous zeros?
      ∙ No change by using automatic garbage collection (Boehm’s GC)
      ∙ Zero pages in LHCb DaVinci: ≈ 700 MB out of 2.3 GB
      ∙ Zero pages in CMS reconstruction: 180 MB out of 900 MB without output, 280 MB out of 1.4 GB with output
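The zero-page figures above come from scanning a process dump; a minimal sketch of such a scan in shell, comparing each 4 KiB page of a dump file against a page of zeros (the function name is ours; a real scan of a multi-GB dump would be done in C for speed):

```shell
# count_zero_pages FILE: print how many complete 4 KiB pages of FILE
# consist entirely of zero bytes (a trailing partial page is ignored).
count_zero_pages() {
    local dump=$1 page=4096
    local total=$(( $(stat -c %s "$dump") / page ))
    local zero=0 i=0
    while [ "$i" -lt "$total" ]; do
        # Extract page i and compare it byte-for-byte against zeros.
        if dd if="$dump" bs="$page" skip="$i" count=1 2>/dev/null \
               | cmp -s - <(head -c "$page" /dev/zero); then
            zero=$((zero + 1))
        fi
        i=$((i + 1))
    done
    echo "$zero"
}
```

Usage: count_zero_pages core.dump prints the number of all-zero pages, which multiplied by 4 KiB gives the recoverable volume quoted on the slide.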

  22. Forensics: First Results
      Method
      ∙ Idea: inspect memset() calls > 4 kB
      ∙ Excluded: read(), mmap(), ROOT buffers
      ∙ Measurement uncertainties at memset boundaries
      ∙ Only literal memset() covered, not e.g. int *a = new int[1024*1024]();
      Dead pages (AliRoot reco)
      ∙ ≈ 40 % of zero pages traced back to source code
      ∙ Breaks down to half a dozen memsets with high impact
      Remaining zero pages
      ∙ No hits after detector initialization
      ∙ Scattered over uses of TClonesArray

  23. Next Steps
      1 Forensics: track large zero-runs back to a malloc()
      2 How to choose zram parameters for an optimal tradeoff w.r.t. throughput?
      Perhaps zram can also be used as an “overflow” mechanism to make sure that a job finishes.
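For the parameter-tuning question in step 2, the compression side of the tradeoff can be observed directly on a running system; a sketch reading the device statistics from sysfs (attribute names as exported by the zram module of that kernel generation; newer kernels consolidate them into a single mm_stat file):

```shell
# Ratio of uncompressed to compressed data currently held by /dev/zram0.
orig=$(cat /sys/block/zram0/orig_data_size)    # bytes before compression
compr=$(cat /sys/block/zram0/compr_data_size)  # bytes after LZO compression
awk -v o="$orig" -v c="$compr" 'BEGIN { printf "compression ratio: %.2f\n", o / c }'
```

A high ratio at acceptable CPU cost argues for a larger disksize; measuring throughput against this ratio is the open question the slide poses.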
