Towards Unification of HPC and Big Data Paradigms, Jesús Carretero

Universidad Complutense de Madrid, Postgraduate Lectures (Conferencias de Postgrado)
Towards Unification of HPC and Big Data Paradigms
Jesús Carretero
Computer Science and Engineering Department, University Carlos III of Madrid
jcarrete@inf.uc3m.es

Science research is changing
- Inference Spiral of System Science
- As models become more complex and new data bring in more information, we require ever-increasing computational resources.

Who is generating Big Data
- Mobile devices (tracking all objects all the time)
- Social media and networks (all of us are generating data)
- Scientific instruments (collecting all sorts of data)
- Sensor technology and networks (measuring all kinds of data)
- Companies and e-commerce (collecting and warehousing data)

Parallel applications require more data every day …
- Simulation has become the way to research and develop new scientific and engineering solutions.
- It is used nowadays in leading science domains such as the aerospace industry and astrophysics.
- Challenges arise related to the complexity, scalability and data production of the simulators.
- Impact on the underlying IT infrastructure.

IoT: the paradigmatic challenge
- Progress and innovation are no longer hindered by the ability to collect data,
- but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.

Cross-fertilization needed
- Cross-fertilization among High Performance Computing (HPC), large-scale distributed systems, and big data management is needed.
- Mechanisms should be valid for HPC, HTC and workflows …
- Data will play an increasingly substantial role in the near future.
- Huge amounts of data are produced by real-world devices, applications and systems (checkpoint, monitoring, …).

Areas of convergence
- HPC simulations and data
  - Challenges arise related to the complexity, scalability and data production of the simulators.
- High-Performance Data Analytics (HPDA)
  - More input data (ingestion)
  - More output data for integration/analysis
  - Real-time and near-real-time requirements

HPC-BD convergence motivation
- Systems are expensive, and not integrating them misses opportunities
  - Leveraging investments and purchasing power
- Integration of computation and observation cycles implicitly requires convergence
- Expanded cross-disciplinary teams of researchers are needed to explore the most challenging problems for society
- Data consolidation trends span Big Data and HPC
  - Categorization of data
  - Structured, semi-structured and unstructured data
  - Computer-generated and observed data

Context
HPC
- Focus: CPU-intensive, tightly-coupled applications.
- Architecture: compute and storage are decoupled, high-speed interconnections.
Big Data
- Focus: large volumes of loosely-coupled tasks.
- Architecture: co-located computation and data, elasticity is required.
HPC-Big Data convergence is a must
- Data-intensive scientific computing
- High-performance data analytics
- Convergence at the infrastructure layer
  - Virtualisation for HPC, deeper storage hierarchy, …

HPC and Big Data models
- HPC requires Computing-Centric Models (CCM)
- Big Data requires Data-Centric Models (DCM)

Platforms & paradigms
Platforms: physical or virtual
- Clusters and supercomputers
  - OpenMP and MPI
  - HPC and supercomputing
- Clouds
  - Virtualized resources
  - Higher-level model
Processing paradigms: general or specific
- Collective model (PGAS, …)
- MapReduce model
- Iterative MapReduce model
- DAG model
- Graph model

Data analytics and computing ecosystem compared
[Figure: comparison of the data analytics and computing ecosystems]
Source: Daniel A. Reed and Jack Dongarra. Exascale Computing and Big Data. Communications of the ACM, 58(7), July 2015.

Non-convergent system architectures
- HPC system: compute farm and storage farm connected by a high-speed network; physical resources.
- Big Data platforms: nodes with processor and local disk connected by a network; virtualized resources.

Integration of computation and observation
- Traditional approach: open loop. The simulation produces results, which are then processed by off-line data analytics to produce further results.
- Desired approach: closed loop. The simulation feeds data to on-line analytics and visualization, whose results flow back to the simulation.

But we need to …
- Integrate the platform layer and data abstractions for both HPC and Big Data platforms.
  - We can use MPI-based MapReduce, but we lose all existing BD facilities.
  - Solution: connection of MPI applications and Spark.
- Avoid data copies between simulation and analysis in every iteration.
  - HPC and Big Data use different file systems.
  - Copying data leads to poor performance and huge storage space requirements.
  - Solution: scalable I/O system architecture.
- Have data-aware allocation of tasks in HPC.
  - Schedulers are CPU-oriented.
  - Solution: connecting the scheduler with data allocation.

Convergence in programming environments?
- HPC and BD have separate computing environment heritages.
  - Data: R, Python, Hadoop, Mahout, MLlib, Spark.
  - HPC: Fortran, C, C++, BLAS, LAPACK, HSL, PETSc, Trilinos.
- Determine capabilities, requirements (application, system, user), opportunities and gaps for:
  - Leveraging HPC library capabilities in BD (e.g., scalable solvers).
  - Providing algorithms in native BD environments.
  - Providing HPC apps and libraries as appliances (containers aaS).

MapReduce is the leading paradigm
- A simple programming model
- Functional model
- A combination of the Map and Reduce models with an associated implementation
- For large-scale data processing
  - Exploits a large set of commodity computers
  - Executes processes in a distributed manner
  - Offers high availability
- Used for processing and generating large data sets
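To make the Map and Reduce combination concrete, here is a minimal single-process sketch of a word count in plain Python. It is an illustration only, not the code of any MapReduce framework: the names map_phase, shuffle, reduce_phase and word_count are hypothetical, and the distributed execution and fault tolerance that a real implementation provides are left out.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit one (word, 1) pair for every word in a single document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    intermediate = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big compute"]))
```

Because each call to map_phase depends only on its own document, and each call to reduce_phase only on its own key, both phases could run in parallel on different nodes, which is what makes the model suitable for large sets of commodity computers.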

Data-driven distribution
- In a MapReduce cluster, data is distributed to all the nodes of the cluster as it is being loaded in.
- An underlying distributed file system (HDFS) splits large data files into chunks, which are managed by different nodes in the cluster.
  [Diagram: input data (a large file) split into chunks of input data held by Node 1, Node 2 and Node 3]
- Even though the file chunks are distributed across several machines, they form a single namespace (key, value).
- Scale: a large number of commodity hardware disks, say 1000 disks of 1 TB each.
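The chunking idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the round-robin placement, the tiny chunk size and the function names are hypothetical and do not reproduce the actual HDFS block placement policy or its replication.

```python
def split_into_chunks(data: bytes, chunk_size: int):
    """Cut one large input into fixed-size chunks (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_chunks(chunks, nodes):
    """Assign chunks to nodes round-robin (an assumption, not the HDFS policy)."""
    placement = {node: [] for node in nodes}
    for index, chunk in enumerate(chunks):
        placement[nodes[index % len(nodes)]].append((index, chunk))
    return placement


def read_file(placement):
    """Reassemble the file from all nodes: one logical file, many machines."""
    indexed = [entry for held in placement.values() for entry in held]
    return b"".join(chunk for _, chunk in sorted(indexed))


# Raw capacity for the scale quoted on the slide: 1000 disks of 1 TB each.
RAW_CAPACITY_TB = 1000 * 1  # 1000 TB, i.e. 1 PB in decimal units

if __name__ == "__main__":
    data = b"abcdefghij"
    placement = place_chunks(split_into_chunks(data, 4), ["node1", "node2", "node3"])
    print(placement)
    print(read_file(placement) == data)
```

The chunk index kept alongside each chunk is what lets a reader see a single file even though its pieces live on different nodes.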

Classes of problems “mapreducable”
- Benchmark for comparing: Jim Gray’s challenge on data-intensive computing. Ex: “Sort”
- Google uses it (we think) for wordcount, AdWords, PageRank, indexing data.
- Simple algorithms such as grep, text indexing, reverse indexing
- Bayesian classification: data mining domain
- Facebook uses it for various operations: demographics
- Financial services use it for analytics
- Astronomy: Gaussian analysis for locating extra-terrestrial objects.
- Expected to play a critical role in the semantic web and Web 3.0

Data-centric adaptation
- Find a way to divide the original simulation into smaller independent simulations (BSP model).
- Analyse the original simulation domain in order to find an independent variable Tx that can act as an index for the partitioned input data:
  - Independent time-domain steps
  - Spatial divisions
  - Range of simulation parameters
The goal is to run the same simulation kernel, but on fragments of the full partitioned data set.

Methodology: two-phase approach
- Data adaptation phase: first MapReduce task
  - Reads the input files and indexes all the necessary parameters by Tx.
  - Reducers provide intermediate <key, value> output for the next step.
  - The original data is partitioned.
  - Subsequent simulations can run autonomously for each (Tx; parameters) entry.
- Simulation phase: second MapReduce task
  - Runs the simulation kernel for each value of the independent variable,
  - with the necessary data that was mapped to it in the previous stage,
  - plus the required simulation parameters that are common to every partition.
  - Reducers are able to gather all the output and provide the same final results as the original application.
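The two phases can be sketched as two chained MapReduce jobs in plain Python. Everything here is an assumption made for illustration: the "Tx,parameter" input format, the COMMON parameter set, the stand-in kernel (a scaled sum) and the helper run_mapreduce are hypothetical and are not taken from the original application.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory MapReduce driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]


# Phase 1: data adaptation. Index every input parameter by Tx.
def adapt_map(line):
    tx, parameter = line.split(",")  # assumed "Tx,parameter" text format
    yield int(tx), float(parameter)


def adapt_reduce(tx, parameters):
    # One (Tx; parameters) entry per partition, ready for the next job.
    yield tx, sorted(parameters)


# Phase 2: simulation. Run the same kernel on each independent partition.
COMMON = {"gain": 2.0}  # hypothetical parameters shared by every partition


def simulate_map(entry):
    tx, parameters = entry
    result = COMMON["gain"] * sum(parameters)  # stand-in simulation kernel
    yield "result", (tx, result)


def simulate_reduce(key, partial_results):
    # Gather every partition's output into one final result set.
    yield dict(partial_results)


def two_phase(lines):
    partitions = run_mapreduce(lines, adapt_map, adapt_reduce)
    return run_mapreduce(partitions, simulate_map, simulate_reduce)[0]


if __name__ == "__main__":
    print(two_phase(["0,1.0", "1,2.0", "0,3.0"]))
```

Since the second job only needs the (Tx; parameters) entries written by the first, each partition can be simulated on a different node without communicating with the others.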

Data-driven architectural model
[Figure: data-driven architectural model]
Source: S. Caíno-Lores, A. García, F. García-Carballeira, J. Carretero, "Efficient design assessment in the railway electric infrastructure domain using cloud computing", Integrated Computer-Aided Engineering, vol. 24, no. 1, pp. 57-72, December 2016.

Hydrogeology simulator adaptation
- The ensemble of realizations constitutes the parallelizable domain (i.e. the key).
- Columns of the model are distributed per realization.

Problem: Scalability
[Figure: scalability results, with panels for Cluster and EC2]
