Streaming OODT: Combining Apache Spark's Power with Apache OODT




SLIDE 1

Streaming OODT: Combining Apache Spark's Power with Apache OODT

Michael Starch – NASA Jet Propulsion Laboratory

SLIDE 2

Agenda

  • Data and Processing
  • Data Systems
  • Apache OODT
  • Apache Spark
  • Streaming OODT
  • Examples
  • Where can I get the code?
  • Acknowledgements
  • Questions

SLIDE 3

Data and Processing

SLIDE 4

Data and Processing

Figure 1: What is data processing?


Figure 2: More complex data processing

SLIDE 5

Parallelization

Figure 3: Parallelizing data processing
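
The idea behind Figure 3 can be sketched in a few lines of plain Java: split the input into independent chunks, process each chunk on its own worker, then combine the partial results. This is only an illustration under stated assumptions; the class `ChunkedSum`, its methods, and the sum-of-squares workload are hypothetical and do not come from OODT or Spark.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: none of these names come from OODT or Spark.
public class ChunkedSum {

    // Stand-in for any independent per-chunk computation: sum of squares.
    static long processChunk(List<Integer> chunk) {
        long total = 0;
        for (int x : chunk) {
            total += (long) x * x;
        }
        return total;
    }

    // Split the input into fixed-size chunks, process each on a worker
    // thread, and combine the partial results.
    static long parallelSumOfSquares(List<Integer> data, int chunkSize, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (int start = 0; start < data.size(); start += chunkSize) {
                final List<Integer> chunk =
                        data.subList(start, Math.min(start + chunkSize, data.size()));
                partials.add(pool.submit(() -> processChunk(chunk)));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // blocks until that chunk has finished
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a chunk", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a chunk failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 1; i <= 1000; i++) {
            data.add(i);
        }
        System.out.println(parallelSumOfSquares(data, 100, 4));
    }
}
```

The approach only works this cleanly because each chunk can be processed without looking at any other chunk; the combine step is a simple sum.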

SLIDE 6

Big Data

Figure 4: Data is becoming very large

Figure 5: Parallelizable big-data

SLIDE 7

Data Systems

SLIDE 8

Archival and Search

Figure 6: Archiving and searching in data sets

SLIDE 9

Processing and Resource Management

Figure 7: Processing and resource management

SLIDE 10

Data Ingest and Delivery

Figure 8: Data ingestion and delivery

SLIDE 11

Apache OODT

SLIDE 12

Apache OODT

Figure 9: Base Object-Oriented Data Technology (OODT)

SLIDE 13

Archival and Search

Figure 10: OODT metadata-based search

SLIDE 14

Workflow Management

Figure 11: OODT workflow management

SLIDE 15

Limitations

Figure 12: Simplified OODT Architecture

SLIDE 16

Apache Spark

SLIDE 17

Map Reduce Processing

Figure 13: Map Reduce Processing
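
As a rough single-machine illustration of the map/reduce pattern, the sketch below maps each word to a key (its length) and reduces by counting the words per key. It uses only `java.util.stream` as a stand-in for a distributed framework; the class `WordLengthHistogram` and the word-length workload are hypothetical and are not taken from the slides.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical single-machine illustration of the map/reduce pattern.
public class WordLengthHistogram {

    // Map step: each word is keyed by its length.
    // Reduce step: the words sharing a key are counted.
    static Map<Integer, Long> countByLength(List<String> words) {
        return words.parallelStream()
                .collect(Collectors.groupingBy(
                        String::length,          // key extracted in the map step
                        TreeMap::new,            // sorted result for readable output
                        Collectors.counting())); // reduction applied per key
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("level", "spark", "oodt", "data", "noon");
        System.out.println(countByLength(words));
    }
}
```

A distributed framework applies the same two steps, but spreads the map work across machines and moves the keyed intermediate results over the network before reducing.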

SLIDE 18

Berkeley Data Analytics Stack

Figure 14: Berkeley Data Analytics Stack components
Source: https://amplab.cs.berkeley.edu/software/

SLIDE 19

Apache Spark

Figure 15: Resilient Distributed Datasets

Figure 16: Apache Spark libraries
Source: https://spark.apache.org/images/spark-stack.png

SLIDE 20

Streaming OODT

SLIDE 21

Streaming OODT Design

Figure 17: Design and implementation of Streaming OODT

SLIDE 22

Modified Architecture

Figure 18: Improved OODT Architecture for big-data processing

SLIDE 23

Examples

SLIDE 24

Example - Palindromes

Figure 19: Palindrome detection algorithm

SLIDE 25

Example - Code

    //Example detection algorithm
    ...
    //Strips whitespace, lower-cases, then compares the line to its reverse
    public static boolean isPalindrome(String line) {
        line = line.replaceAll("\\s","").toLowerCase();
        return line.equals(new StringBuilder(line).reverse().toString());
    }
    ...
    //Spark wrapper class for detection algorithm
    //(Function is org.apache.spark.api.java.function.Function)
    static class FilterPalindrome implements Function<String, Boolean> {
        public Boolean call(String s) {
            return isPalindrome(s);
        }
    }
    ...

Sample 1: Palindrome detection shared code
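
To see what the detection logic in Sample 1 accepts, the sketch below copies `isPalindrome` into a standalone class and runs it on a few inputs. The class name `PalindromeDemo` and the sample strings are assumptions made for illustration; only the method body comes from the slide.

```java
// Standalone copy of the Sample 1 logic; the class name and inputs are hypothetical.
public class PalindromeDemo {

    // Same logic as Sample 1: drop whitespace, lower-case, compare to the reversal.
    public static boolean isPalindrome(String line) {
        line = line.replaceAll("\\s", "").toLowerCase();
        return line.equals(new StringBuilder(line).reverse().toString());
    }

    public static void main(String[] args) {
        String[] samples = {"level", "Never odd or even", "spark", "Madam, I'm Adam", ""};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + isPalindrome(s));
        }
    }
}
```

Two properties follow from the code as written: punctuation is not stripped, so "Madam, I'm Adam" is rejected, and an empty or whitespace-only line is accepted because the empty string equals its own reversal.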

SLIDE 26

Example – Data Set

clowring infratrochanteric unlimitable overstaffing ... nonsubstantiality incongeniality ghbor gargil semiconventionality betokens clinodome ... pulviniform actualize cousins moocha Mosaism craals midstout desightment Boehmenism LP ravelins underskirt CSB cossas xen- nonlucidness unvagrantness togata noncaptiousness dromioid lambie undergarments salvages... LAP revealableness outsnore headstalls metallography outgazed unstintingly boongary provinces trans-Mongolian...

Sample 2: Palindrome file sample

... 10,805,887,353 bytes (11 GB), 46284 palindromes

SLIDE 27

Example – Shootout

Spark: 429.774s, 1 CPU

    //Sample Java code
    ...
    JavaRDD<String> rdd = sc.textFile(input.getValue("file"));
    JavaRDD<String> filtered = rdd.filter(new PalindromeUtils.FilterPalindrome());
    long count = filtered.count();
    ...

    //Sample Java code
    ...
    String file = input.getValue("file");
    br = new BufferedReader(new FileReader(file));
    String line;
    while ((line = br.readLine()) != null) {
        if (PalindromeUtils.isPalindrome(line))
            count++;
    }
    ...

Spark: 16.72s, ~92 CPUs

Sample 3: Naïve file processing code
Sample 4: Spark file processing code
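
The two timings on this slide can be turned into a speedup and a per-CPU efficiency with simple arithmetic. The sketch below assumes the 429.774s run on 1 CPU is the baseline and treats "~92 CPUs" as exactly 92, which the slide only gives approximately; the class `Speedup` and its method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper for the arithmetic only; not part of OODT or Spark.
public class Speedup {

    // Ratio of baseline time to parallel time.
    static double speedup(double baselineSeconds, double parallelSeconds) {
        return baselineSeconds / parallelSeconds;
    }

    // Fraction of the ideal linear speedup actually achieved.
    static double efficiency(double speedup, int cpus) {
        return speedup / cpus;
    }

    public static void main(String[] args) {
        double s = speedup(429.774, 16.72);  // timings taken from the slide
        double e = efficiency(s, 92);        // assumption: "~92 CPUs" read as 92
        System.out.println(String.format(Locale.ROOT,
                "speedup %.1fx, efficiency %.2f", s, e));
    }
}
```

Under those assumptions the speedup is roughly 25.7x and the efficiency about 0.28, so the run is much faster but well short of linear scaling across the CPUs used.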

SLIDE 28

Example - Streaming

    JavaReceiverInputDStream<String> stream = ssc.socketTextStream(input.getValue("host"),
            Integer.parseInt(input.getValue("port")));
    JavaDStream<String> filtered = stream.filter(new PalindromeUtils.FilterPalindrome());
    final JavaDStream<Long> count = filtered.count();
    /* Begin: output code */
    count.foreachRDD(new Function<JavaRDD<Long>,Void>(){
        public Void call(JavaRDD<Long> jrdd) throws Exception {
            synchronized(output) {
                Long[] collected = (Long[])jrdd.rdd().collect();
                for (Long item : collected)
                    output.println("Found "+item.longValue()+ " palindromes.");
            }
            return null;
        }
    });
    /* End: output code*/
    ssc.start();
    ssc.awaitTermination();

Sample 5: Streaming palindromes code

SLIDE 29

Example – Streaming Configuration

    ...
    <instanceClass name="org.apache.oodt.cas.resource.spark.examples.StreamingPalindromeExample" />
    <inputClass name="org.apache.oodt.cas.resource.structs.NameValueJobInput">
      <properties>
        <property name="host" value="host" />
        <property name="port" value="7007" />
        <property name="time" value="60000" />
        <property name="output" value="/home/user/files/output-streaming-palindrome.txt" />
      </properties>
    </inputClass>
    <queue>quick</queue>
    <load>1</load>
    ...

Sample 6: Streaming palindromes configuration

SLIDE 30

Example – Streaming In Action

SLIDE 31

Where can I get the code?

  • It’s Open Source! Jump on in!
  • Apache OODT SVN: https://svn.apache.org/repos/asf/oodt/trunk/
  • Mailing List: dev@oodt.apache.org

SLIDE 32

Acknowledgments

  • NASA Jet Propulsion Laboratory: Research & Technology Development, “Archiving, Processing and Dissemination for the Big Data Era”
  • Apache Software Foundation: Apache OODT Project

SLIDE 33

Questions?

你有沒有問題?

Haben Sie Fragen? ¿Tienen preguntas? Avez-vous des questions?