

SLIDE 1

Daniel Vicory

Allan Hancock College, Computer Science
Mentor: Nan Li
Faculty advisor: Prof. Xifeng Yan
University of California, Santa Barbara

SLIDE 2

Data Mining: Big Picture

  • Big data is rampant across many fields; data mining helps make sense of it
  • Data mining is the process of extracting patterns and meaningful information from large data sets
  • Useful for business, research, medicine, etc.

SLIDE 3

Data Mining Applications

SLIDE 4

What is MapReduce and Hadoop?

  • MapReduce was invented by Google and used to index the web
  • Hadoop is open-source software that implements MapReduce
  • Map and Reduce refer to the two main steps of the algorithm
  • Both MapReduce steps, and the final results, operate on key-value pairs:
  • 1. map (k1,v1) → list(k2,v2)
  • 2. reduce (k2,list(v2)) → list(k3,v3)
SLIDE 5

Word Count in MapReduce


Courtesy of JTeam/Martijn van Groningen <http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/>

[Diagram: word count flowing through the stages Input → Splitting → Mapping → Shuffling → Reducing → Final Result.]
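For concreteness, here is a minimal sketch of word count written against the Hadoop Java MapReduce API of that era (roughly Hadoop 0.20); class names and the driver setup are illustrative, not taken from the slides:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: (offset, line of text) -> list of (word, 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);     // emit (word, 1)
            }
        }
    }

    // Reduce step: (word, list of counts) -> (word, total count)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);       // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // combine locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```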

SLIDE 6

The Problem of Skew

  • A MapReduce job executes in sequential phases, so the slowest task in a phase holds up the entire job
  • Heterogeneous computing environments and non-random datasets can cause each task, or partition of data, to complete at widely varying times, a problem known as skew
  • Skew can mean that a cluster will not be utilized efficiently (see the small worked example below)
  • SkewReduce, a framework developed by University of Washington researchers, addresses the problem of skew
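To make the waste concrete, here is a tiny self-contained illustration; the task times are invented numbers for a hypothetical six-node cluster, not measurements from this project:

```java
public class SkewDemo {
    public static void main(String[] args) {
        // Hypothetical per-task times in minutes on a six-node cluster;
        // task #6 happened to receive an expensive partition of the data.
        double[] taskMinutes = {10, 11, 9, 12, 10, 45};

        double total = 0, slowest = 0;
        for (double t : taskMinutes) {
            total += t;
            slowest = Math.max(slowest, t);
        }
        double balanced = total / taskMinutes.length;   // runtime if work were split evenly

        System.out.printf("Job finishes after %.0f minutes (the slowest task).%n", slowest);
        System.out.printf("With perfectly balanced partitions it would take about %.1f minutes.%n", balanced);
        System.out.printf("Node-minutes wasted waiting on the straggler: %.0f%n",
                taskMinutes.length * slowest - total);
    }
}
```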

SLIDE 7

Skew Illustrated

[Diagram: six tasks (#1–#6) plotted against elapsed time, distinguishing time spent doing the task from time wasted waiting for the slowest task to finish.]

Courtesy of “Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions” by YongChul Kwon, Magdalena Balazinska, Bill Howe, and Jerome Rolia

SLIDE 8

SkewReduce

  • SkewReduce is a framework built on top of Hadoop
  • Provides an API, which is tied to processing specific types of data
  • Also provides an optimizer
    – Makes use of cost analysis functions
    – The cost estimates are used to partition the data so that each computer finishes its task at about the same time as the rest
  • Cost functions require sample data and additional programming, so SkewReduce does not work out of the box (a hypothetical sketch of a cost function follows below)
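SkewReduce's real API is not shown on the slides; the interface below is only a hypothetical sketch of the general shape such a user-supplied cost function might take, to illustrate why it needs both sample records and extra programming:

```java
import java.util.List;

// Hypothetical illustration only; this is NOT SkewReduce's actual API.
// A user-supplied cost function estimates how expensive a candidate partition
// will be, based on a small sample of its records. The optimizer can then split
// the input so that every node finishes at roughly the same time.
public interface CostFunction<R> {
    /**
     * @param sample        a sample of records drawn from the candidate partition
     * @param partitionSize total number of records the partition would contain
     * @return estimated processing cost (e.g. expected seconds of work)
     */
    double estimateCost(List<R> sample, long partitionSize);
}
```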

SLIDE 9

Project Goals

  • Set up a Hadoop cluster and run SkewReduce
  • Work off of SkewReduce as a base
  • Leave the API alone, remove the optimizer
  • Implement a task scheduler that does not make use of cost functions or the sample data they require
  • Compare performance with the default “dumb” Hadoop task scheduler and with SkewReduce’s optimizer

SLIDE 10

Our Optimized Task Scheduler

  • A novel and clever way of “fast-tracking” tasks to completion (a hypothetical sketch of the idea follows below)
  • Does not care about the underlying data or algorithm
  • Tasks that are deemed to take too long in comparison to other equally sized tasks on a machine are stopped, and their remaining work is split up across the rest of the cluster
  • Removes the need for cost functions or sample data
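The slides do not show the scheduler's implementation, so the class below is only a hypothetical sketch of the fast-tracking idea described above; the slowdown threshold, split factor, and byte-range task model are all assumptions. A straggler that has run much longer than its equally sized peers is killed and its unfinished input is re-queued as smaller chunks for the rest of the cluster:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "fast-tracking" idea; not the project's actual code.
public class StragglerMonitor {

    /** A running task responsible for a byte range of the input. */
    public static class Task {
        final long start, end;   // input byte range assigned to this task
        long processed;          // bytes processed so far
        long elapsedMillis;      // wall-clock time spent so far

        Task(long start, long end) { this.start = start; this.end = end; }
        long remaining() { return (end - start) - processed; }
    }

    private final double slowdownThreshold; // e.g. 2.0 = twice the median elapsed time
    private final int splitFactor;          // how many chunks to split a straggler into

    public StragglerMonitor(double slowdownThreshold, int splitFactor) {
        this.slowdownThreshold = slowdownThreshold;
        this.splitFactor = splitFactor;
    }

    /**
     * Returns new, smaller tasks carved out of any straggler's unfinished input.
     * The caller would kill the straggler and schedule these chunks on idle nodes.
     * Note: no cost function or sample data is consulted; only observed progress.
     */
    public List<Task> fastTrack(List<Task> running) {
        List<Task> reassigned = new ArrayList<>();
        if (running.isEmpty()) {
            return reassigned;
        }
        double median = medianElapsed(running);
        for (Task t : running) {
            boolean tooSlow = t.elapsedMillis > slowdownThreshold * median;
            if (tooSlow && t.remaining() > 0) {
                long chunk = Math.max(1, t.remaining() / splitFactor);
                for (long s = t.start + t.processed; s < t.end; s += chunk) {
                    reassigned.add(new Task(s, Math.min(s + chunk, t.end)));
                }
            }
        }
        return reassigned;
    }

    private static double medianElapsed(List<Task> tasks) {
        long[] times = tasks.stream().mapToLong(t -> t.elapsedMillis).sorted().toArray();
        int n = times.length;
        return n % 2 == 1 ? times[n / 2] : (times[n / 2 - 1] + times[n / 2]) / 2.0;
    }
}
```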

SLIDE 11

Task Scheduler Visualized

[Diagram: tasks #1–#6 against elapsed time; a task that is incomplete and running too long is killed, and its remaining work is redistributed as smaller task chunks that complete alongside the other tasks.]

SLIDE 12

Hadoop Cluster Performance Tuning


  • Tested the cluster with 8,665 books from Project Gutenberg (~3.2 GB) using word count
  • Seven-node cluster; each node has a 2.8 GHz Core 2 Duo, 3 GB RAM, and a 160 GB hard drive

Run   Configuration (each run inherits the previous run's configuration)   Runtime
1     8,665 separate files, replication factor 2                           1 hr 3 min 12 sec
2     Compiled into a single file                                          3 min 30 sec
3     Increased file buffer size                                           3 min 25 sec
4     Turned off speculative execution                                     3 min 20 sec
5     Increased MapReduce task memory to 512 MB from 200 MB                3 min 30 sec
6     Increased block size to 128 MB from 64 MB                            3 min 21 sec
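As a rough illustration, the tuning steps in the table map onto standard Hadoop 0.20-era configuration properties roughly as shown below. The concrete buffer size is an assumption, since the slide does not state it; run 2 (merging the books into a single file) is a data-preparation step rather than a configuration property.

```java
import org.apache.hadoop.conf.Configuration;

public class TuningExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Run 1: HDFS replication factor of 2
        conf.setInt("dfs.replication", 2);

        // Run 3: larger file buffer (default is 4096 bytes; 64 KB is a common choice)
        conf.setInt("io.file.buffer.size", 64 * 1024);

        // Run 4: turn off speculative execution for map and reduce tasks
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);

        // Run 5: raise the per-task JVM heap to 512 MB (default -Xmx200m)
        conf.set("mapred.child.java.opts", "-Xmx512m");

        // Run 6: raise the HDFS block size to 128 MB from the 64 MB default
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);
    }
}
```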

SLIDE 13

Experimental Methods

  • Use the datasets from SkewReduce shown in the table below
  • Use SkewReduce’s included MapReduce algorithm, which identifies clusters of particles
  • Benchmark SkewReduce emulating default Hadoop behavior, SkewReduce’s optimizer, and our task scheduler with both datasets


Dataset   Size     # Items   Description
Astro     18 GB    900 M     Cosmology simulation
Seaflow   1.9 GB   59 M      Flow cytometry

SLIDE 14

Expected Runtime Results

[Bar chart: expected runtimes per dataset for Hadoop’s default scheduler (Astro: 14.1 hours; Seaflow: 87.2 minutes), our task scheduler’s goal, and SkewReduce’s optimizer (Astro: 1.6 hours; Seaflow: 14.1 minutes).]

SLIDE 15

Challenges and Future Work

  • Large learning curve for MapReduce, Hadoop, and SkewReduce
  • Finish the task scheduler
  • Ensure the task scheduler requires no changes to the algorithm or dataset
  • Experiment with small variations of the task scheduler algorithm to improve upon it
  • Compare against Hadoop’s scheduler and SkewReduce’s optimizer

SLIDE 16

Acknowledgements

University of California, Santa Barbara
Mentor: Nan Li
Faculty advisor: Prof. Xifeng Yan
Graduate student: Shengqi Yang

University of Washington
“Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions” by YongChul Kwon and Magdalena Balazinska
