
NUMA-ICTM: A Parallel Version of ICTM Exploiting Memory Placement Strategies for NUMA Machines

Márcio Castro, Luiz Gustavo Fernandes
GMAP, PPGCC
Pontifícia Universidade Católica do Rio Grande do Sul
Porto Alegre - Brazil
{mcastro, gustavo}@inf.pucrs.br

Christiane Pousa, Jean-François Méhaut
Laboratoire d'Informatique Grenoble
Grenoble Université
Grenoble - France
{christiane.pousa, mehaut}@imag.fr

Marilton Sanchotene de Aguiar
GMFC, PPGInf
Universidade Católica de Pelotas
Pelotas - Brazil
marilton@atlas.ucpel.tche.br

Abstract

In geophysics, the appropriate subdivision of a region into segments is extremely important. ICTM (Interval Categorizer Tessellation Model) is an application that categorizes geographic regions using information extracted from satellite images. The categorization of large regions is a computationally intensive problem, which justifies the proposal and development of parallel solutions in order to improve its applicability. Recent advances in multiprocessor architectures have led to the emergence of NUMA (Non-Uniform Memory Access) machines. In this work, we present NUMA-ICTM: a parallel solution of ICTM for NUMA machines. First, we parallelize ICTM using OpenMP. Then, we improve the OpenMP solution using the MAI (Memory Affinity Interface) library, which allows control of memory allocation in NUMA machines. The results show that optimizing memory allocation leads to significant performance gains over the pure OpenMP parallel solution.

1. Introduction

An adequate subdivision of geographic areas into segments presenting similar characteristics is often convenient in Geophysics. This subdivision enables us to extrapolate the results obtained in some locations within a segment, in which extensive research has been done, to other, less explored locations within the same segment. Thus, we can have a good understanding of the locations which have not been thoroughly analyzed [3].

ICTM (Interval Categorizer Tessellation Model) is a tessellation model for the simultaneous categorization of geographic regions considering several different characteristics (relief, vegetation, climate, land use, etc.) using information extracted from satellite images. The analysis of the function monotonicity, which is embedded in the rules of the model, categorizes each tessellation cell, with respect to the whole considered region, according to its declivity sign (positive, negative or null). The first formalization of ICTM, a single-layered model for the relief categorization of geographic regions, called Topo-ICTM (Interval Categorizer Tessellation Model for Reliable Topographic Segmentation), was initially presented in [4]. Through this work, it was possible to find out that the categorization of large regions requires high computational power, resulting in long execution times on single-processor machines.

Previous works investigated the possibility of parallelizing ICTM using distributed memory platforms such as clusters and grids (see Section 2.2). However, these platforms introduce two important limitations to the ICTM parallelization: (i) they do not allow parallel approaches which need intensive communication between processes, since the communication cost is too significant, and (ii) such platforms usually do not have nodes with large local memories, which are necessary to compute very large regions.

Traditional UMA (Uniform Memory Access) architectures present a single memory controller, which is shared by all processors. This single memory connection often becomes a bottleneck when many processors access the memory at the same time. The problem is even worse in systems with a higher number of processors, in which the single memory controller does not scale satisfactorily. Therefore, these architectures may not fulfill our requirements to develop an efficient parallel solution for ICTM.

NUMA (Non-Uniform Memory Access) architectures appear as an interesting alternative to overcome the UMA scalability problem. In NUMA architectures, the system is split into multiple nodes [6]. These machines have, as their main characteristic, multiple memory levels that are seen by the developers as a single memory. They combine the efficiency and scalability of MPP (Massively Parallel Processing) architectures with the programming facility of SMP (Symmetric Multiprocessing) machines [9]. However, since the memory is divided into blocks, the time spent to access the memory is conditioned by the "distance" between the processor (which accesses the memory) and the memory block (in which the data is allocated).

A parallel solution of ICTM for NUMA machines exploiting memory affinity in order to achieve better performance is the aim of this paper. First, we describe how ICTM was parallelized using OpenMP. After that, considering that OpenMP was originally developed to parallelize applications for UMA machines, we chose the MAI (Memory Affinity Interface) library in order to control memory allocation and thread placement.

This paper is structured as follows: Section 2 describes the general workflow of ICTM and related work on ICTM parallel versions for other high performance platforms. In Section 3, we briefly present how ICTM was parallelized using only the OpenMP library, and we describe the machines used to run our experiments and the case studies used to evaluate its performance. In order to address the limitations of the pure OpenMP solution, in Section 4 we introduce the MAI functionalities used to fine-tune memory allocation. Finally, concluding remarks and future work are pointed out in Section 5.

2. ICTM

ICTM is a multi-layered and multi-dimensional tessellation model for the categorization of geographic regions considering several different characteristics (relief, vegetation, climate, land use, etc.). The number of characteristics to be studied determines the number of layers of the model. In each layer, a different analysis of the region is performed. An appropriate projection of all layers onto a basic layer of the model leads to a meaningful subdivision of the region and to a categorization of the sub-regions that considers the simultaneous occurrence of all characteristics, according to some weights, permitting interesting analyses of their mutual dependency.

The input data is extracted from satellite images, in which the information is given at certain points referenced by their latitude and longitude coordinates. The geographic region is represented by a regular tessellation, determined by the subdivision of the total area into sufficiently small rectangular subareas, each one represented by one cell of the tessellation (Figure 1). This subdivision is done according to a cell size, established by the geophysics or ecology analyst, and it is directly associated with the refinement degree of the tessellation.

Figure 1. ICTM input data. [A satellite image is subdivided into a tessellation; average values per cell form the input data.]

2.1. Categorization Process

In order to categorize the regions of each layer, ICTM executes sequential phases, where each phase uses the results obtained from the previous one (Figure 2). The tessellation shown in Figure 1 is represented as a matrix with nr rows and nc columns.

Figure 2. ICTM categorization process.

In topographic analysis there is usually too much data, most of which is geophysically irrelevant. Thus, for each subdivision, the average value of a specific feature at the points supplied by radar or satellite images is taken. The first phase of the categorization process involves reading this input data (the average values), which is stored in a matrix called the Absolute Matrix. The categorization proceeds to the next phase, in which the data is simplified: the Absolute Matrix is normalized by dividing each element by the largest one, creating the Relative Matrix. Considering that the data extracted from the satellite images is very accurate, the errors contained in the Relative Matrix come from the discretization of the region into tessellation cells. For this reason, Interval Mathematics techniques [8] are used to control the errors associated with cell values (advantages of using intervals can be seen in [3] and [5]). Thus, in the next phase, two Interval Matrices are created in which the interval values for the x and y coordinates are stored.


The most important phase of the entire process is the creation of the Status Matrix. In this phase, each cell is compared to its neighbors in four directions. For each cell, four directed declivity registers, reg.e (east), reg.w (west), reg.s (south) and reg.n (north), are defined, indicating the admissible declivity sign of the function that approximates it in each of these directions, taking into account the values of the neighbor cells. The number of neighbors to be analyzed in each direction is a parameter called the radius.

For non-border cells, reg.X = 0 if there exists a non-increasing approximation function between the cell and its neighbors in the X direction, and reg.X = 1 otherwise. For east, west, south and north border cells, reg.e = 0, reg.w = 0, reg.s = 0 and reg.n = 0, respectively.

Let wreg.e = 1, wreg.s = 2, wreg.w = 4 and wreg.n = 8 be the weights associated with the directed declivity registers. The Status Matrix is defined as an nr x nc matrix where each entry is the state of the corresponding cell, calculated as the binary encoding of its directed declivity registers: statuscell = (1 x reg.e) + (2 x reg.s) + (4 x reg.w) + (8 x reg.n). Thus, each cell assumes one and only one state, given by statuscell = 0..15.
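The encoding above maps each combination of declivity registers to a unique integer in the range 0..15. A minimal sketch in C, assuming the registers have already been computed for a cell (the struct and function names are illustrative, not taken from the ICTM implementation):

    #include <stdio.h>

    /* Directed declivity registers of one tessellation cell (each 0 or 1). */
    struct declivity_regs {
        int e, s, w, n;
    };

    /* Binary encoding described in the text:
     * status = 1*reg.e + 2*reg.s + 4*reg.w + 8*reg.n, so status is in 0..15. */
    static int cell_status(struct declivity_regs r)
    {
        return (1 * r.e) + (2 * r.s) + (4 * r.w) + (8 * r.n);
    }

    int main(void)
    {
        /* Example cell: non-increasing only towards south and west. */
        struct declivity_regs r = { .e = 0, .s = 1, .w = 1, .n = 0 };
        printf("status = %d\n", cell_status(r)); /* prints 6 */
        return 0;
    }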

In the last phase, the Limits Matrix is created. A limit cell occurs where the function changes its declivity, presenting critical points (maximum, minimum or inflection points). To identify such limit cells, a limit register associated with each cell is used. The border cells are assumed to be limit cells.

The categorization of extremely large regions has a high computational cost. This cost is basically related to two parameters: the number of cells of the matrices and the number of neighbors analyzed during the categorization process in each layer (the radius).

2.2. Related Works

In [12], the authors presented a parallel version of ICTM for clusters. In that work, the authors used the master-slave model to compute layers in parallel, since there is no data dependence between layers. However, considering that each slave process computes a given layer of the model, the maximum size of each layer is limited by the memory size of the node in which the slave process is running. As a consequence, very large regions cannot be categorized using this kind of decomposition.

On the other hand, in [11] the authors presented a parallel version of ICTM for grids. That paper presents two different ways to parallelize the ICTM model: using centralized or distributed data. The second solution is more appropriate for grids, since it drastically reduces the communication between computing nodes. Nevertheless, the data must be previously stored in each machine of the grid.

In brief, these two previous solutions presented different ways to parallelize ICTM, showing interesting results. However, they have their limitations, especially when the categorization of extremely large regions is required. In this scenario, shared memory architectures appear as an attractive alternative to achieve better results. Additionally, the specific utilization of NUMA machines allows the use of a larger number of processors.
3. ICTM Parallelization with OpenMP

OpenMP is a widely used standard API (Application Programming Interface) for the development of parallel applications for shared memory environments [13]. It was developed for UMA architectures and it does not make any assumptions about the physical location of data in memory or about threads [13]. Aiming to solve this problem, several extensions of the OpenMP standard for NUMA architectures were proposed [2, 1, 7]. However, none of these extensions became a standard.

In this work, we have used OpenMP to parallelize ICTM. We have chosen OpenMP because of its simplicity, since the sequential code can be parallelized with few modifications. One of its main advantages is that any operation of creation/destruction of threads is done transparently by the API. Moreover, OpenMP uses the fork-join model, which allows the existence of several sequential and parallel regions in the source code. This model can be easily applied to the ICTM sequential code, allowing parallelization inside each step of the categorization process.
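As a generic illustration of the fork-join model (not ICTM code): a program alternates between sequential regions, executed by the master thread only, and parallel regions, each executed by a team of threads that is forked at the region entry and joined at its end.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        printf("sequential region: master thread only\n");

        #pragma omp parallel      /* fork: a team of threads is created */
        {
            printf("parallel region: thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                         /* join: the team synchronizes and ends */

        printf("sequential region again\n");

        #pragma omp parallel      /* a second parallel region, as allowed
                                     by the fork-join model */
        {
            printf("second parallel region: thread %d\n",
                   omp_get_thread_num());
        }
        return 0;
    }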

3.1. Parallel Approach

The ICTM categorization process can basically be divided into two parts. The first part is the data initialization, in which the information read from the satellite image is written into the Absolute Matrix cells. The second part is the categorization itself, which is composed of several phases. In this paper we focus on the second part, since it is the most computationally intensive one.

As mentioned before, each phase executes some computation over all cells of its respective matrix, modifying their values. In a simplified way, each phase was implemented as two nested for loops. So, it is possible to use the omp parallel for directive inside each phase to distribute the work among the threads. Thus, each thread computes a subset of the rows of the respective matrix, as follows:

    #pragma omp parallel for
    for (i = 0; i < rows; i++)
        for (j = 0; j < columns; j++)
            // computation
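For concreteness, the sketch below shows how one phase could look with this structure, using the Relative Matrix phase (each element of the Absolute Matrix divided by the largest one, see Section 2.1) as an example; the function and variable names are illustrative, not taken from the actual ICTM source code.

    #include <omp.h>

    /* Illustrative sketch of one categorization phase parallelized with
     * the omp parallel for directive: each thread computes a subset of
     * the matrix rows. Here the Relative Matrix is built by normalizing
     * the Absolute Matrix by its largest element (assumed precomputed). */
    void compute_relative_matrix(double **absolute, double **relative,
                                 int rows, int columns, double max_value)
    {
        int i, j;
        #pragma omp parallel for private(j)
        for (i = 0; i < rows; i++)
            for (j = 0; j < columns; j++)
                relative[i][j] = absolute[i][j] / max_value;
    }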



The pragma directive is responsible for thread creation, work distribution among the threads, and thread destruction (after the end of the computation). We believe this is a simple and elegant solution, since few changes were made to the sequential source code. However, OpenMP directives do not allow control over memory allocation among the NUMA nodes or over thread migration; those procedures are performed according to the Linux kernel policies.

3.2. Performance Evaluation

In this section we present the performance evaluation of the parallel ICTM using OpenMP. We describe the two NUMA platforms and the case studies we have used in our experiments to evaluate the OpenMP solution. The reason for using two different NUMA machines is to evaluate the impact of different NUMA factors¹ on the choice of memory allocation strategies.

3.2.1 NUMA Machines

The first NUMA machine has eight dual-core AMD Opteron 2.2 GHz processors, each with 2 MB of cache memory. It is organized in eight nodes and has 32 GB of main memory in total, divided among the eight nodes (4 GB of local memory per node); the system page size is 4 KB. Each node has three connections, which are used to link it with other nodes or with input/output controllers (node 0 and node 1). These connections give different memory latencies for remote accesses between nodes (NUMA factor from 1.2 to 1.5). A schematic view of this machine is given in Figure 3. From now on, we will refer to this machine as Opteron.

Figure 3. AMD Opteron machine.

The operating system is the Debian distribution of Linux, version 2.6.23-1-amd64, with NUMA support (system calls and the user API numactl).

¹ The NUMA factor is obtained by dividing the remote memory latency by the local memory latency.

The compiler used for the OpenMP code is the GNU Compiler Collection (GCC).

The second NUMA machine has sixteen Itanium 2 processors running at 1.6 GHz, each with 9 MB of L3 cache memory. It is organized in four nodes of four processors each and has 64 GB of main memory in total, divided into four blocks, one per node (16 GB of local memory per node). Nodes are connected using a FAME Scalability Switch (FSS). This connection gives different memory latencies for remote accesses between nodes (NUMA factor from 2 to 2.5). A schematic view of this machine is given in Figure 4. From now on, we will refer to this machine as Itanium 2.

Figure 4. Itanium 2 machine.

The operating system is the Red Hat distribution of Linux, version 2.6.18-B64k.1.21, with NUMA support (system calls and the user API numactl). The compiler used for the OpenMP code is the Intel C Compiler (ICC version 9.1.045).

3.2.2 Case Studies

The case studies have been chosen in terms of memory usage and computing power requirements. Considering the total amount of memory in both NUMA machines, we have selected four matrix sizes. Moreover, in order to understand the influence of the radius on our parallel solution, we have done experiments with three different values of radius. These case studies are shown in Table 1 and have been used to measure the overall performance of our solution.

Table 1. Case studies.

Name     Size of matrices     Memory usage    Radius
Case 1   4,800 x 4,800        1 GB            20, 40, 80
Case 2   6,700 x 6,700        2 GB            20, 40, 80
Case 3   9,400 x 9,400        4 GB            20, 40, 80
Case 4   13,300 x 13,300      8 GB            20, 40, 80

The results presented for each case study in Sections 3.2.3 and 4.3 were obtained as the average of 10 executions, excluding the best and the worst execution times.


These averages presented a low standard deviation, since all experiments were run with exclusive access to the NUMA machines.

3.2.3 OpenMP Results

In this section we show the results obtained on both the Opteron and Itanium 2 NUMA architectures. First, we fixed the matrix size in order to show how the OpenMP parallel solution behaves as the radius varies. Then, we fixed the radius to compare the speed-ups as the matrix size varies according to the case studies. The matrix size and radius chosen for these experiments were dimension 6,700 (Case 2) and radius 40, respectively. According to a previous analysis of the obtained results, this configuration presents the best balance between the input image size and the level of detail required for a useful analysis, in terms of computational cost.

Figure 5. Speed-ups over Opteron (Case 2). [Speed-up versus number of processors (1-16) for radius 20, 40 and 80, with the ideal curve.]

Figure 5 shows the speed-ups obtained on Opteron with a fixed matrix size. As can be observed, larger radii yield better speed-ups. However, as the number of processors increases we can see a considerable drop in speed-up (especially with smaller radii). The reason is the poor memory allocation control performed by the operating system: the data is not placed in a way that allows performance gains to be extracted from the Opteron machine.

Table 2. Speed-ups on Opteron (radius = 40).

NP    Case 1   Case 2   Case 3   Case 4
 4     3.81     3.86     3.87     3.98
 8     7.02     7.13     7.55     7.64
12     9.08     9.52    10.34    10.65
16    10.18    10.73    12.18    12.90

The influence of the matrix sizes on Opteron can be seen in Table 2, where NP stands for the number of processors. One can notice that speed-ups are higher when we use larger input matrices. Because this machine has a low NUMA factor (from 1.2 to 1.5), even if the data is stored far from the processor that accesses it, the time spent on this operation does not have a significant impact on the overall performance.

Figure 6. Speed-ups over Itanium 2 (Case 2). [Speed-up versus number of processors (1-16) for radius 20, 40 and 80, with the ideal curve.]

The speed-ups obtained on Itanium 2 with a fixed matrix size can be seen in Figure 6. Similarly to Figure 5, we obtain better speed-ups with larger radii. However, when we compare Figures 5 and 6, one can notice that in Figure 6 we obtained better speed-ups with the smaller radii (20 and 40). This is a consequence of the higher NUMA factor of this machine: the larger the radius, the larger the number of remote accesses, since there is no specific control to place data close to the processors that access it. Consequently, by using smaller radii we reduce the impact of remote accesses, resulting in better speed-up factors than those presented in Figure 5.

Table 3. Speed-ups on Itanium 2 (radius = 40).

NP    Case 1   Case 2   Case 3   Case 4
 4     3.57     3.54     3.53     3.53
 8     6.86     6.79     6.68     6.67
12     9.70     9.58     9.36     9.35
16    11.83    11.46    11.27    11.22

In contrast to the Opteron results, Itanium 2 showed worse speed-ups as the matrix size increased (Table 3). Analogously to the radius variation experiments, the high NUMA factor of this machine (from 2 to 2.5) considerably influences the overall performance, since data locality is not controlled properly by the Linux kernel. Therefore, the high number of remote accesses reduces the speed-up factor as we increase the matrix size.


3.3. Discussion

The OpenMP ICTM parallel version presented speed-ups of around 12 for 16 processors on both machines. One can easily conclude that the lack of a better memory allocation strategy is the reason for the performance loss. As mentioned before, it is not possible to control data locality and thread placement using only OpenMP directives. Better control of data locality and thread placement can reduce the interference of non-uniform memory accesses, making it possible to significantly improve the performance gains on NUMA machines.

4. Memory Affinity Improvement

In this section, we introduce a new ICTM parallel version using memory affinity, called NUMA-ICTM. In this solution, we have applied several different memory policies provided by a library named MAI (Memory Affinity Interface). This improvement allows a better use of NUMA machines, making the categorization of large geographic regions even faster.

An alternative to the MAI library would be the NUMA support present in several operating systems, such as Linux and Solaris. This support can be found at the user level (administration tools or shell commands) and at the kernel level (system calls and NUMA APIs) [6].

The user-level support allows the programmer to specify a policy for memory placement and thread scheduling for an application. The advantage of this support is that the programmer does not need to modify the application code. However, the chosen policy is applied to the entire application and cannot be changed during execution.

The NUMA API is an interface that defines a set of system calls to apply memory policies and process/thread scheduling. In this solution, the programmer must change the application code to apply the policies. The main advantage of this solution is that it allows better control of memory allocation. However, developers must know low-level details about the application and the architecture in order to manipulate structures such as memory pages or blocks directly.

4.1. MAI Interface Library

In order to provide an easy way to manage memory affinity while keeping fine-grained control, the MAI library was proposed [10]. MAI is a library developed in C that defines high-level functions to deal with memory affinity in NUMA architectures. This library allows developers to manage memory affinity for each variable/object of their applications. This characteristic makes memory management easier for developers, since they do not need to care about pointers and page addresses, as in the system call APIs for NUMA (libnuma in Linux, for example [6]). Furthermore, with MAI it is possible to have fine control over memory affinity: memory policies can be changed through the application code (different policies for different phases). This is not possible with user-level tools such as numactl in Linux.

The library implements four memory policies: cyclic, cyclic_block, bind_all and bind_block. In the cyclic policies, the memory pages in which the variable/object data are stored are placed in physical memory following a round-robin strategy over the memory blocks. The main difference between cyclic and cyclic_block is the number of memory pages used in each step of the cyclic distribution. In the bind_all and bind_block policies, the memory pages are placed in memory blocks specified by the developer. The main difference between bind_all and bind_block is that in the latter the pages are placed in the memory blocks that hold the threads/processes which will make use of them. Besides memory policy control, MAI also allows memory page migration in order to correct any inappropriate memory placement.

4.2. ICTM with MAI Library

After implementing a parallel solution using OpenMP directives, we added specific MAI functions to the code to apply memory policies and thread placement. Basically, we modified the initialization process, in which the matrices are allocated. Two groups of functions were used to control memory and thread affinity.

Thread affinity is controlled using the MAI bind_threads() function. With this function we can specify where each thread must be physically placed, in terms of processors or CPU cores. Thus, we ensure that thread migration will not occur.

Figure 7. Bind_block policy. [Diagram: the memory pages of matrices M0, M1 and M2 are physically allocated on NUMA nodes 0, 1 and 2, respectively.]

Instead of using malloc() to allocate the matrices (as we did in the OpenMP parallel solution), we have used a specific MAI function called alloc_2D().


In a few words, this function uses the mmap() system call to map physical memory into the virtual address space. The amount of memory and the type of data to be allocated are passed as parameters of alloc_2D(), similarly to malloc(). By using the alloc_2D() function, we can set a specific memory policy to be applied to the matrices.

Figure 7 shows how the bind_block memory policy can be applied to the ICTM matrices (the matrix cells are grouped in terms of memory pages). The memory pages in which the matrix data are stored are physically allocated on the NUMA memory blocks according to the work distribution done by the omp parallel for directive. Thus, each thread will access memory pages stored on its own node, reducing the number of remote accesses. On the other hand, with the bind_all policy, we can specify a set of memory blocks in which the matrices' memory pages can be stored; however, the Linux kernel is responsible for selecting in which memory block each page will be physically allocated.
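To make the effect of bind_block concrete, the small sketch below computes, for a static omp parallel for row distribution, which NUMA node each chunk of rows should be bound to so that the thread computing it accesses local memory. This is a conceptual illustration of the placement MAI performs, not MAI code; the thread and node counts are illustrative (16 threads over 4 nodes, as on the Itanium 2 machine).

    #include <stdio.h>

    /* Conceptual illustration of bind_block: with a static schedule,
     * thread t computes a contiguous chunk of rows; binding that chunk
     * to the node where thread t is pinned makes every access local. */
    int main(void)
    {
        int rows = 6700, threads = 16, nodes = 4;
        int chunk = (rows + threads - 1) / threads;   /* rows per thread */
        int threads_per_node = threads / nodes;

        for (int row = 0; row < rows; row += chunk) {
            int thread = row / chunk;                 /* thread owning this chunk */
            int node = thread / threads_per_node;     /* node where it is pinned  */
            int last = (row + chunk - 1 < rows) ? row + chunk - 1 : rows - 1;
            printf("rows %5d-%5d -> thread %2d -> node %d\n",
                   row, last, thread, node);
        }
        return 0;
    }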

Figure 8. Cyclic policy. [Diagram: the memory pages of matrices M0, M1 and M2 are distributed page by page across NUMA nodes 0, 1 and 2.]

When a cyclic policy is applied, the memory pages are physically allocated as shown in Figure 8. The memory pages are distributed among the NUMA nodes in a cyclic way: the first memory page of each matrix is physically stored on Node 0, the second page on Node 1, the third page on Node 2, the fourth page on Node 0, and so on. A similar behavior occurs when we apply the cyclic_block policy; however, sets of memory pages are distributed instead of individual pages.

4.3. Performance Evaluation

This section shows the results obtained with NUMA-ICTM on the same platforms described in Section 3.2.1. Experiments have been done with the four memory policies implemented by the MAI library. In order to compare the performance of NUMA-ICTM with different policies on both NUMA platforms, we have used the same case study and radius value as in Section 3.2.3: Case 2 with radius 40. Thus, we can compare the OpenMP ICTM parallel version with NUMA-ICTM.

Figure 9 shows a comparison of the four memory policies on Opteron. The bind_all and bind_block policies presented worse speed-ups. On the other hand, one can observe that cyclic and cyclic_block presented similar results. As mentioned before, the difference between the cyclic and cyclic_block policies is the number of memory pages used in the cyclic distribution among memory blocks. In these experiments, the block size of the cyclic_block policy was a group of 10 pages; other experiments showed worse speed-ups as we increased the block size. Thus, it is better to use the cyclic policy on this machine, which distributes page by page.

Figure 9. Speed-ups over Opteron (Case 2). [Speed-up versus number of processors (1-16) for the bind_all, bind_block, cyclic and cyclic_block policies, with the ideal curve.]

Since Opteron has a network bandwidth problem, it is better to spread the data among the NUMA memory blocks. By using this strategy, we reduce the number of simultaneous accesses to the same memory node and, as a consequence, we can achieve better performance. More detailed information about the best memory policy for the Opteron machine (cyclic) can be seen in Table 4, in which the speed-ups of the different case studies are compared.

Table 4. Cyclic policy over Opteron.

NP    Case 1   Case 2   Case 3   Case 4
 4     2.80     3.50     3.39     3.58
 8     7.71     7.32     6.74     7.10
12    11.66    10.74    10.30    10.52
16    15.34    14.01    13.67    13.50

We have done the same experiments on Itanium 2 and the results of the memory policy comparison are shown in Figure 10. One can see that these results are quite different from those obtained on Opteron (Figure 9). By allocating the rows of the matrices on the memory blocks closer to the processors that compute them, we decrease the impact of the high NUMA factor. As a result, bind_block was the best policy for this machine.

Figure 10. Speed-ups over Itanium 2 (Case 2). [Speed-up versus number of processors (1-16) for the bind_all, bind_block, cyclic and cyclic_block policies, with the ideal curve.]

Table 5 shows more information about the performance of the best memory policy on Itanium 2 (bind_block).

Table 5. Bind_block policy over Itanium 2.

NP    Case 1   Case 2   Case 3   Case 4
 4     3.75     3.78     3.73     3.75
 8     7.45     7.53     7.29     7.00
12    11.10    11.18    10.76    10.59
16    14.55    14.64    13.95    13.36

5. Conclusion and Perspectives

In this paper we have presented NUMA-ICTM: a parallel version of ICTM exploiting memory placement strategies for NUMA machines. First, an initial version using only OpenMP was proposed. This solution showed limitations, since data locality and thread placement could not be controlled. Then, we introduced memory optimization using the MAI interface. MAI-specific functions were used to apply memory policies, increasing the scalability and performance of the initial version.

We have observed an overall average performance gain of 12.1% from the memory optimization on both architectures. Moreover, a higher performance gain was obtained as the number of processors increased (an average of 18% from 8 to 16 processors); the gain was about 23.2% if we compare only the results with 16 processors. These results were expected, since as the number of nodes increases, the number of remote accesses also grows, and memory allocation policy optimizations become more important.

As future work we highlight: a new version using OpenMP 3.0 and a distributed implementation to be executed on clusters of NUMA nodes.

References

[1] A. Basumallik, S.-J. Min, and R. Eigenmann. Towards OpenMP Execution on Software Distributed Shared Memory Systems. In ISHPC '02: Proceedings of the 4th International Symposium on High Performance Computing, pages 457-468, London, UK, 2002. Springer-Verlag.
[2] J. Bircsak, P. Craig, R. Crowell, Z. Cvetanovic, J. Harris, C. A. Nelson, and C. D. Offner. Extending OpenMP for NUMA Machines. In SC '00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, pages 48-48, Dallas, Texas, USA, 2000. IEEE Computer Society.
[3] D. Coblentz, V. Kreinovich, B. Penn, and S. Starks. Towards Reliable Sub-Division of Geological Areas: Interval Approach. In NAFIPS '00: Proceedings of the 19th International Meeting of the North American Fuzzy Information Processing Society, pages 368-372, Atlanta, GA, USA, 2000. IEEE Computer Society.
[4] M. S. de Aguiar, G. P. Dimuro, and A. C. da Rocha Costa. ICTM: An Interval Tessellation-Based Model for Reliable Topographic Segmentation. Numerical Algorithms, 37(1-4):3-11, 2004.
[5] R. B. Kearfott and V. Kreinovich. Applications of Interval Computations. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1996.
[6] A. Kleen. A NUMA API for Linux. Technical Report Novell-4621437, Novell, April 2005.
[7] S.-J. Min, A. Basumallik, and R. Eigenmann. Optimizing OpenMP Programs on Software Distributed Shared Memory Systems. International Journal of Parallel Programming, 31(3):225-249, 2003.
[8] R. E. Moore. Methods and Applications of Interval Analysis. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1979.
[9] T. Mu, J. Tao, M. Schulz, and S. A. McKee. Interactive Locality Optimization on NUMA Architectures. In SOFTVIS '03: Proceedings of the ACM 2003 Symposium on Software Visualization, San Diego, CA, USA, 2003. ACM.
[10] C. P. Ribeiro and J.-F. Méhaut. MAI: Memory Affinity Interface. Technical Report 0359, INRIA, 2008.
[11] R. K. S. Silva, M. S. de Aguiar, C. A. F. D. Rose, and G. P. Dimuro. Extending the HPC-ICTM Geographical Categorization Model for Grid Computing. In B. Kågström, E. Elmroth, J. Dongarra, and J. Wasniewski, editors, PARA, volume 4699 of Lecture Notes in Computer Science, pages 850-859. Springer, 2006.
[12] R. K. S. Silva, C. A. F. D. Rose, M. S. de Aguiar, G. P. Dimuro, and A. C. R. Costa. HPC-ICTM: A Parallel Model for Geographic Categorization. In JVA '06: Proceedings of the IEEE John Vincent Atanasoff 2006 International Symposium on Modern Computing, pages 143-148, Washington, DC, USA, 2006. IEEE Computer Society.
[13] C. Terboven, D. an Mey, and S. Sarholz. OpenMP on Multicore Architectures. In A Practical Programming Model for the Multi-Core Era, pages 54-64. Springer, 2008.
