

SLIDE 1

Daniel Vicory

Allan Hancock College, Computer Science
Mentor: Nan Li
Faculty advisor: Prof. Xifeng Yan
University of California, Santa Barbara

SLIDE 2

Data Mining: Big Picture

  • Big data is rampant across many fields; data mining helps make sense of it
  • Data mining is the process of extracting patterns and meaningful information from large data sets
  • Useful for business, research, medicine, etc.

SLIDE 3

Data Mining Applications

SLIDE 4

What is MapReduce and Hadoop?

  • MapReduce was invented by Google and used to index the web
  • Hadoop is open-source software that implements MapReduce
  • Map and Reduce refer to the two main steps of the algorithm
  • Both MapReduce steps, and the final results, operate on key-value pairs:
  • 1. map (k1,v1) → list(k2,v2)
  • 2. reduce (k2,list(v2)) → list(k3,v3)
SLIDE 5

Word Count in MapReduce


Courtesy of JTeam/Martijn van Groningen <http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/>

[Diagram: word count flowing through the stages Input → Splitting → Mapping → Shuffling → Reducing → Final Result.]
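For concreteness, here is a minimal sketch of word count written against the Hadoop Java MapReduce API of that era (roughly Hadoop 0.20); class names and the driver setup are illustrative, not taken from the slides:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: (offset, line of text) -> list of (word, 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);     // emit (word, 1)
            }
        }
    }

    // Reduce step: (word, list of counts) -> (word, total count)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);       // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // combine locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```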

SLIDE 6

The Problem of Skew

  • A MapReduce job executes in sequential phases, so the slowest task in a phase holds up the entire job
  • Heterogeneous computing environments and non-random datasets can cause each task, or partition of data, to complete at widely varying times, a problem known as skew
  • Skew can mean that a cluster will not be utilized efficiently (see the small worked example below)
  • SkewReduce, a framework developed by University of Washington researchers, addresses the problem of skew
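To make the waste concrete, here is a tiny self-contained illustration; the task times are invented numbers for a hypothetical six-node cluster, not measurements from this project:

```java
public class SkewDemo {
    public static void main(String[] args) {
        // Hypothetical per-task times in minutes on a six-node cluster;
        // task #6 happened to receive an expensive partition of the data.
        double[] taskMinutes = {10, 11, 9, 12, 10, 45};

        double total = 0, slowest = 0;
        for (double t : taskMinutes) {
            total += t;
            slowest = Math.max(slowest, t);
        }
        double balanced = total / taskMinutes.length;   // runtime if work were split evenly

        System.out.printf("Job finishes after %.0f minutes (the slowest task).%n", slowest);
        System.out.printf("With perfectly balanced partitions it would take about %.1f minutes.%n", balanced);
        System.out.printf("Node-minutes wasted waiting on the straggler: %.0f%n",
                taskMinutes.length * slowest - total);
    }
}
```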

SLIDE 7

Skew Illustrated

[Diagram: six tasks (#1–#6) plotted against elapsed time, distinguishing time spent doing the task from time wasted waiting for the slowest task to finish.]

Courtesy of “Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions” by YongChul Kwon, Magdalena Balazinska, Bill Howe, and Jerome Rolia

SLIDE 8

SkewReduce

  • SkewReduce is a framework built on top of Hadoop
  • Provides an API, which is tied to processing specific types of data
  • Also provides an optimizer
    – Makes use of cost analysis functions
    – The cost estimates are used to partition the data so that each computer finishes its task at about the same time as the rest
  • Cost functions require sample data and additional programming, so SkewReduce does not work out of the box (a hypothetical sketch of a cost function follows below)
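SkewReduce's real API is not shown on the slides; the interface below is only a hypothetical sketch of the general shape such a user-supplied cost function might take, to illustrate why it needs both sample records and extra programming:

```java
import java.util.List;

// Hypothetical illustration only; this is NOT SkewReduce's actual API.
// A user-supplied cost function estimates how expensive a candidate partition
// will be, based on a small sample of its records. The optimizer can then split
// the input so that every node finishes at roughly the same time.
public interface CostFunction<R> {
    /**
     * @param sample        a sample of records drawn from the candidate partition
     * @param partitionSize total number of records the partition would contain
     * @return estimated processing cost (e.g. expected seconds of work)
     */
    double estimateCost(List<R> sample, long partitionSize);
}
```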

SLIDE 9

Project Goals

  • Set up a Hadoop cluster and run SkewReduce
  • Work off of SkewReduce as a base
  • Leave the API alone, remove the optimizer
  • Implement a task scheduler that does not make use of cost functions or the sample data they require
  • Compare performance with the default “dumb” Hadoop task scheduler and with SkewReduce’s optimizer

SLIDE 10

Our Optimized Task Scheduler

  • A novel and clever way of “fast-tracking” tasks to completion (a hypothetical sketch of the idea follows below)
  • Does not care about the underlying data or algorithm
  • Tasks that are deemed to take too long in comparison to other equally sized tasks on a machine are stopped, and their remaining work is split up across the rest of the cluster
  • Removes the need for cost functions or sample data
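The slides do not show the scheduler's implementation, so the class below is only a hypothetical sketch of the fast-tracking idea described above; the slowdown threshold, split factor, and byte-range task model are all assumptions. A straggler that has run much longer than its equally sized peers is killed and its unfinished input is re-queued as smaller chunks for the rest of the cluster:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "fast-tracking" idea; not the project's actual code.
public class StragglerMonitor {

    /** A running task responsible for a byte range of the input. */
    public static class Task {
        final long start, end;   // input byte range assigned to this task
        long processed;          // bytes processed so far
        long elapsedMillis;      // wall-clock time spent so far

        Task(long start, long end) { this.start = start; this.end = end; }
        long remaining() { return (end - start) - processed; }
    }

    private final double slowdownThreshold; // e.g. 2.0 = twice the median elapsed time
    private final int splitFactor;          // how many chunks to split a straggler into

    public StragglerMonitor(double slowdownThreshold, int splitFactor) {
        this.slowdownThreshold = slowdownThreshold;
        this.splitFactor = splitFactor;
    }

    /**
     * Returns new, smaller tasks carved out of any straggler's unfinished input.
     * The caller would kill the straggler and schedule these chunks on idle nodes.
     * Note: no cost function or sample data is consulted; only observed progress.
     */
    public List<Task> fastTrack(List<Task> running) {
        List<Task> reassigned = new ArrayList<>();
        if (running.isEmpty()) {
            return reassigned;
        }
        double median = medianElapsed(running);
        for (Task t : running) {
            boolean tooSlow = t.elapsedMillis > slowdownThreshold * median;
            if (tooSlow && t.remaining() > 0) {
                long chunk = Math.max(1, t.remaining() / splitFactor);
                for (long s = t.start + t.processed; s < t.end; s += chunk) {
                    reassigned.add(new Task(s, Math.min(s + chunk, t.end)));
                }
            }
        }
        return reassigned;
    }

    private static double medianElapsed(List<Task> tasks) {
        long[] times = tasks.stream().mapToLong(t -> t.elapsedMillis).sorted().toArray();
        int n = times.length;
        return n % 2 == 1 ? times[n / 2] : (times[n / 2 - 1] + times[n / 2]) / 2.0;
    }
}
```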

SLIDE 11

Task Scheduler Visualized

[Diagram: tasks #1–#6 against elapsed time; a task that is incomplete and running too long is killed, and its remaining work is redistributed as smaller task chunks that complete alongside the other tasks.]

SLIDE 12

Hadoop Cluster Performance Tuning


  • Tested the cluster with 8,665 books from Project Gutenberg (~3.2 GB) using word count
  • Seven-node cluster; each node has a 2.8 GHz Core 2 Duo, 3 GB RAM, and a 160 GB hard drive

Run   Configuration (each run inherits the previous run's configuration)   Runtime
1     8,665 separate files, replication factor 2                           1 hr 3 min 12 sec
2     Compiled into a single file                                          3 min 30 sec
3     Increased file buffer size                                           3 min 25 sec
4     Turned off speculative execution                                     3 min 20 sec
5     Increased MapReduce task memory to 512 MB from 200 MB                3 min 30 sec
6     Increased block size to 128 MB from 64 MB                            3 min 21 sec
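As a rough illustration, the tuning steps in the table map onto standard Hadoop 0.20-era configuration properties roughly as shown below. The concrete buffer size is an assumption, since the slide does not state it; run 2 (merging the books into a single file) is a data-preparation step rather than a configuration property.

```java
import org.apache.hadoop.conf.Configuration;

public class TuningExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Run 1: HDFS replication factor of 2
        conf.setInt("dfs.replication", 2);

        // Run 3: larger file buffer (default is 4096 bytes; 64 KB is a common choice)
        conf.setInt("io.file.buffer.size", 64 * 1024);

        // Run 4: turn off speculative execution for map and reduce tasks
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);

        // Run 5: raise the per-task JVM heap to 512 MB (default -Xmx200m)
        conf.set("mapred.child.java.opts", "-Xmx512m");

        // Run 6: raise the HDFS block size to 128 MB from the 64 MB default
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);
    }
}
```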

SLIDE 13

Experimental Methods

  • Use the datasets from SkewReduce shown in the table below
  • Use SkewReduce’s included MapReduce algorithm, which identifies clusters of particles
  • Benchmark SkewReduce emulating default Hadoop behavior, SkewReduce’s optimizer, and our task scheduler with both datasets


Dataset   Size     # Items   Description
Astro     18 GB    900 M     Cosmology simulation
Seaflow   1.9 GB   59 M      Flow cytometry

SLIDE 14

Expected Runtime Results

[Bar chart: expected runtimes per dataset for Hadoop’s default scheduler (Astro: 14.1 hours; Seaflow: 87.2 minutes), our task scheduler’s goal, and SkewReduce’s optimizer (Astro: 1.6 hours; Seaflow: 14.1 minutes).]

SLIDE 15

Challenges and Future Work

  • Large learning curve for MapReduce, Hadoop, and SkewReduce
  • Finish the task scheduler
  • Ensure the task scheduler requires no changes to the algorithm or dataset
  • Experiment with small variations of the task scheduler algorithm to improve upon it
  • Compare against Hadoop’s scheduler and SkewReduce’s optimizer

SLIDE 16

Acknowledgements

University of California, Santa Barbara
Mentor: Nan Li
Faculty advisor: Prof. Xifeng Yan
Graduate student: Shengqi Yang

University of Washington
“Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions” by YongChul Kwon and Magdalena Balazinska
