Introducing MapReduce to High End Computing

Aug 17, 2022

Introducing MapReduce to High End Computing
Grant Mackey, Julio Lopez, Saba Sehrish, John Bent, Salman Habib, Jun Wang
University of Central Florida, Carnegie Mellon University, Los Alamos National Laboratory
Scientific Applications
As the computational scale of scientific applications grows, so does the amount of data they produce, and handling that data becomes difficult.
• Data analytics becomes difficult
• The data becomes too large to move
• Applications become resource intensive
• Applications become more difficult to program
• Do the older existing solutions scale?
Scientific Applications
Bioinformatics (Basic Local Alignment Search Tool, BLAST)
• Genomics machines generate large datasets (GB to TB)
• Data is manually distributed in parallel through an ad hoc job manager script
• The method of parallelizing BLAST is conceptually a manual MapReduce operation
• Using Hadoop would abstract away the manual parallelization of tasks and would provide task resiliency
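To make the "manual MapReduce" observation concrete, here is a minimal sketch in Python of a sequence search split into a map step (score queries against one database chunk) and a reduce step (keep the best hit per query). It is only an illustration: the shared k-mer count is a toy stand-in and is not the BLAST scoring scheme, the function names and sample sequences are hypothetical, and the job runs serially rather than on Hadoop.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def map_chunk(db_chunk, queries, k=3):
    """Map step: score every query against one database chunk.

    Emits (query_id, (score, subject_id)) pairs. The score is a toy
    shared-k-mer count, NOT the BLAST scoring scheme.
    """
    for query_id, query_seq in queries.items():
        query_kmers = kmers(query_seq, k)
        for subject_id, subject_seq in db_chunk.items():
            score = len(query_kmers & kmers(subject_seq, k))
            yield query_id, (score, subject_id)


def reduce_hits(pairs):
    """Reduce step: keep the best-scoring subject for each query."""
    best = {}
    for query_id, hit in pairs:
        if query_id not in best or hit > best[query_id]:
            best[query_id] = hit
    return best


def run_job(db_chunks, queries):
    """Apply the map step to each chunk, then reduce (serial stand-in for a cluster)."""
    emitted = []
    for chunk in db_chunks:
        emitted.extend(map_chunk(chunk, queries))
    return reduce_hits(emitted)


# Hypothetical sample data: a database split into two chunks, and two queries.
db_chunks = [
    {"s1": "ACGTACGT", "s2": "TTTTGGGG"},
    {"s3": "GGGGCCCC", "s4": "ACGTTTTT"},
]
queries = {"q1": "ACGTAC", "q2": "GGGCCC"}
best_hits = run_job(db_chunks, queries)
```

Because each chunk is scored independently, the map calls could run on different nodes; only the small per-query hit lists need to meet in the reduce step.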
Scientific Applications
Cyber-Security: Real-time network analysis
• In a massively multi-user network environment, petabytes of information can pass over the network in a matter of months
• Need a scalable file system that can accommodate the large streaming datasets
• Network events are data independent
• A programming model that abstracts parallelization from the user is convenient
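Since network events are independent of one another, each one can be processed by a map call in isolation and the partial results combined afterwards. The sketch below, in Python, totals bytes per source address. The event format ("source destination bytes" on one line) and all names are assumptions made for illustration, not a format taken from the slides.

```python
from functools import reduce


def map_event(line):
    """Map step: parse one event line into a one-entry {source: bytes} dict."""
    src, _dst, nbytes = line.split()
    return {src: int(nbytes)}


def merge(a, b):
    """Reduce step: add two partial per-source byte totals."""
    out = dict(a)
    for src, nbytes in b.items():
        out[src] = out.get(src, 0) + nbytes
    return out


# Hypothetical event log lines: "source destination bytes".
events = [
    "10.0.0.1 10.0.0.9 500",
    "10.0.0.2 10.0.0.9 120",
    "10.0.0.1 10.0.0.7 300",
]

# Each event is mapped independently, so this list could be built in parallel.
partials = [map_event(line) for line in events]
bytes_by_source = reduce(merge, partials, {})
```

The merge function is associative, so partial totals can be combined in any grouping, which is what lets a framework distribute the reduce work.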
Scientific Applications
Astrophysics: Halo Finding
Current issues
• Ad hoc: the approach is unique
• Too much data movement
• Parallel halo finding tasks are unreliable
Hadoop solutions
• Provides a standard approach
• No data movement
• Hadoop ensures task resilience
Halo Finding
A method used to find clusters of particles in large astrophysics datasets.
Friends of Friends
The algorithm used to perform halo finding.
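The slide names the algorithm without describing it. In its usual formulation, two particles are "friends" if they lie within a chosen linking length of each other, and a halo is a group of particles connected through chains of friends. The Python sketch below implements that definition with a simple union-find and an all-pairs distance check. It is a minimal illustration, not the authors' implementation: the function name and sample points are hypothetical, and the quadratic pair loop would be replaced by a spatial index for real datasets.

```python
import math


def friends_of_friends(points, linking_length):
    """Group points into halos: connected components under the 'friend' relation.

    Two points are friends if their Euclidean distance is at most
    linking_length. Returns a sorted list of lists of point indices.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the group representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # All-pairs check: O(n^2), acceptable only for small illustrative inputs.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= linking_length:
                union(i, j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical particle positions: a chain of three, a pair, and a loner.
points = [
    (0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0),
    (5.0, 5.0, 5.0), (5.4, 5.0, 5.0),
    (9.0, 9.0, 9.0),
]
halos = friends_of_friends(points, linking_length=0.6)
```

Note that particles 0 and 2 are 1.0 apart, farther than the linking length, yet they end up in the same halo because both are friends of particle 1.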
MapReduce model for Halo-Finding
[Figure: diagram with elements labeled HDFS, FoF, R, and M]
Experiences
• There is a reason why people think that Hadoop is only good for data mining applications
• There exists little to no functionality for data types beyond text
• The learning curve is steep for applications that deal with other data types, such as binary
• The programmer has to deal with the new programming model and write their own input classes
• The Hadoop community is very active and incredibly helpful and prompt in responding to issues and bugs
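To give a sense of what writing an input class for binary data involves, the sketch below shows the core logic in Python: given a byte range of a file (a split), skip forward to the first record boundary and read only whole fixed-length records. In Hadoop itself this role is played by InputFormat and RecordReader classes written in Java; the Python here illustrates only the boundary logic, and the record layout (a 64-bit id followed by three 32-bit floats) is an assumption made for the example.

```python
import io
import struct

# Hypothetical record layout: little-endian int64 id, then float32 x, y, z.
RECORD = struct.Struct("<qfff")


def read_records(stream, start, length):
    """Yield the records that begin inside the byte range [start, start + length).

    A record that starts in this range is read in full even if it ends
    past the range; a record that starts before `start` is left to the
    reader of the previous split. Every record is therefore read once.
    """
    size = RECORD.size
    first = -(-start // size) * size  # round start up to a record boundary
    stream.seek(first)
    pos = first
    while pos < start + length:
        raw = stream.read(size)
        if len(raw) < size:
            break  # end of file (or a truncated trailing record)
        yield RECORD.unpack(raw)
        pos += size


# Five sample particles packed into a 100-byte in-memory "file".
particles = [(i, float(i), i + 0.5, i + 0.25) for i in range(5)]
data = b"".join(RECORD.pack(*p) for p in particles)

# Two 50-byte splits that do not align with the 20-byte record size.
split_a = list(read_records(io.BytesIO(data), 0, 50))
split_b = list(read_records(io.BytesIO(data), 50, 50))
```

The first split returns three records (the third straddles the split boundary) and the second returns the remaining two, so no record is lost or duplicated.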
Conclusion
• Hadoop can serve as a viable resource for large, data-intensive computing
• Hadoop runs on an inexpensive commodity computing platform, yet provides powerful tools for large-scale data analytics
• The Hadoop architecture provides task resiliency that other scientific computing methods do not
• Hadoop provides a strict model in which to parallelize a task, and the parallelization has been shown to scale to 1000+ node cluster environments (Amazon’s S3 cluster)
• Hadoop needs more functionality in its API for other data formats
Contact
Grant Mackey: gmackey@cs.ucf.edu
Julio Lopez: jclopez@andrew.cmu.edu
Saba Sehrish: ssehrish@cs.ucf.edu
John Bent: johnbent@lanl.gov
Jun Wang: jwang@cs.ucf.edu
