Streaming OODT: Combining Apache Spark's Power with Apache OODT

Jun 03, 2023

SLIDE 1

Streaming OODT:   

Combining Apache Spark's Power with Apache OODT

Michael Starch – NASA Jet Propulsion Laboratory

SLIDE 2

Agenda

– Data and Processing
– Data Systems
– Apache OODT
– Apache Spark
– Streaming OODT
– Examples
– Where can I get the code?
– Acknowledgements
– Questions

SLIDE 3

Data and Processing

SLIDE 4

Data and Processing

Figure 1: What is data processing?

Figure 2: More complex data processing

SLIDE 5

Parallelization

Figure 3: Parallelizing data processing

SLIDE 6

Big Data

Figure 4: Data is becoming very large

Figure 5: Parallelizable big-data

SLIDE 7

Data Systems

SLIDE 8

Archival and Search

Figure 6: Archiving and searching in data sets

SLIDE 9

Processing and Resource Management

Figure 7: Processing and resource management

SLIDE 10

Data Ingest and Delivery

Figure 8: Data ingestion and delivery

SLIDE 11

Apache OODT

SLIDE 12

Apache OODT

Figure 9: Base Object-Oriented Data Technology (OODT)

SLIDE 13

Archival and Search

Figure 10: OODT metadata-based search

SLIDE 14

Workflow Management

Figure 11: OODT workflow management

SLIDE 15

Limitations

Figure 12: Simplified OODT Architecture

SLIDE 16

Apache Spark

SLIDE 17

Map Reduce Processing

Figure 13: Map Reduce Processing
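The map and reduce steps named on this slide can be illustrated with a minimal plain-Java sketch. It uses the standard `java.util.stream` API as a stand-in and is not Spark or Hadoop code; the class name `MapReduceSketch` and the choice of summing word lengths are assumptions made only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of OODT or Spark.
final class MapReduceSketch {
    // Map each word to its length, then reduce the lengths to a single total.
    static int totalLength(List<String> words) {
        return words.stream()
                .map(w -> w.length())        // map step: one output per input
                .reduce(0, Integer::sum);    // reduce step: combine all outputs
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("oodt", "spark", "streaming");
        System.out.println("Total characters: " + totalLength(words));
    }
}
```

Because the map step treats every element independently, a framework is free to run it on many machines and only bring results together in the reduce step.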

SLIDE 18

Berkeley Data Analysis Stack

Figure 14: Berkeley data analysis stack components
Source: https://amplab.cs.berkeley.edu/software/

SLIDE 19

Apache Spark

Figure 15: Resilient Distributed Datasets

Figure 16: Apache Spark libraries
Source: https://spark.apache.org/images/spark-stack.png

SLIDE 20

Streaming OODT

SLIDE 21

Streaming OODT Design

Figure 17: Design and implementation of Streaming OODT

SLIDE 22

Modified Architecture

Figure 18: Improved OODT Architecture for big-data processing

SLIDE 23

Examples

SLIDE 24

Example - Palindromes

Figure 19: Palindrome detection algorithm

SLIDE 25

Example - Code

import org.apache.spark.api.java.function.Function;
//Example detection algorithm
...
public static boolean isPalindrome(String line) {
    //Ignore whitespace and case, then compare the line with its reverse
    line = line.replaceAll("\\s","").toLowerCase();
    return line.equals(new StringBuilder(line).reverse().toString());
}
...
//Spark wrapper class for detection algorithm
static class FilterPalindrome implements Function<String, Boolean> {
    public Boolean call(String s) {
        return isPalindrome(s);
    }
}
...

Sample 1: Palindrome detection shared code
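To see how the detection method behaves on individual lines, here is a minimal standalone sketch. Only `isPalindrome` mirrors the slide; the class name `PalindromeSketch`, the driver, and the sample strings are assumptions added for illustration, and the Spark wrapper is left out so the code runs without any dependencies.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical driver class; only isPalindrome follows the slide's logic.
final class PalindromeSketch {
    // Strip whitespace, lowercase, and compare the result with its reverse.
    static boolean isPalindrome(String line) {
        String cleaned = line.replaceAll("\\s", "").toLowerCase();
        return cleaned.equals(new StringBuilder(cleaned).reverse().toString());
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList("Never odd or even", "racecar", "cousins");
        for (String s : samples) {
            System.out.println(s + " -> " + isPalindrome(s));
        }
    }
}
```

One consequence of this definition is that an empty or whitespace-only line is reported as a palindrome, which may matter when counting matches in a real file.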

SLIDE 26

Example – Data Set

clowring infratrochanteric unlimitable overstaffing ... nonsubstantiality incongeniality ghbor gargil semiconventionality betokens clinodome ... pulviniform actualize cousins moocha Mosaism craals midstout desightment Boehmenism LP ravelins underskirt CSB cossas xen- nonlucidness unvagrantness togata noncaptiousness dromioid lambie undergarments salvages... LAP revealableness outsnore headstalls metallography

utgazed unstintingly boongary provinces trans-Mongolian...

Sample 2: Palindrome file sample

... 10,805,887,353 Bytes (11 GB), 46284 palindromes

SLIDE 27

Example – Shootout

Spark: 429.774s, 1 CPU
Spark: 16.72s, ~92 CPUs

//Sample java code
...
String file = input.getValue("file");
br = new BufferedReader(new FileReader(file));
String line;
while ((line = br.readLine()) != null) {
    if (PalindromeUtils.isPalindrome(line))
        count++;
}
...

Sample 3: Naïve file processing code

//Sample java code
...
JavaRDD<String> rdd = sc.textFile(input.getValue("file"));
JavaRDD<String> filtered = rdd.filter(new PalindromeUtils.FilterPalindrome());
long count = filtered.count();
...

Sample 4: Spark file processing code
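The contrast between the two samples can be tried without a cluster. The sketch below is a plain-Java stand-in, not Spark: it swaps the RDD filter for `parallelStream()` on an in-memory list. The class name `ShootoutSketch` and the sample words are assumptions. The `speedup` helper simply divides the two timings from the slide, assuming both describe the same job over the same file.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical comparison class; parallelStream() stands in for Spark's RDD filter.
final class ShootoutSketch {
    static boolean isPalindrome(String line) {
        String cleaned = line.replaceAll("\\s", "").toLowerCase();
        return cleaned.equals(new StringBuilder(cleaned).reverse().toString());
    }

    // Sequential loop, in the spirit of the naive sample.
    static long countSequential(List<String> lines) {
        long count = 0;
        for (String line : lines) {
            if (isPalindrome(line)) count++;
        }
        return count;
    }

    // Filter-then-count, in the spirit of the Spark sample, on local threads.
    static long countParallel(List<String> lines) {
        return lines.parallelStream().filter(ShootoutSketch::isPalindrome).count();
    }

    // Ratio of baseline time to parallel time.
    static double speedup(double baselineSeconds, double parallelSeconds) {
        return baselineSeconds / parallelSeconds;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("level", "cousins", "Noon", "provinces");
        System.out.println("sequential: " + countSequential(lines));
        System.out.println("parallel:   " + countParallel(lines));
        // Timings taken from the slide: 429.774s on 1 CPU, 16.72s on ~92 CPUs.
        double s = speedup(429.774, 16.72);
        System.out.printf("speedup: %.1fx, per-CPU efficiency: %.2f%n", s, s / 92.0);
    }
}
```

Under that assumption the speedup is roughly 25.7x on about 92 CPUs, so the per-CPU efficiency is well below 1, which is typical once scheduling and I/O overheads are included.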

SLIDE 28

Example - Streaming

JavaReceiverInputDStream<String> stream = ssc.socketTextStream(input.getValue("host"),
        Integer.parseInt(input.getValue("port")));
JavaDStream<String> filtered = stream.filter(new PalindromeUtils.FilterPalindrome());
final JavaDStream<Long> count = filtered.count();
/* Begin: output code */
count.foreachRDD(new Function<JavaRDD<Long>,Void>(){
    public Void call(JavaRDD<Long> jrdd) throws Exception {
        synchronized(output) {
            Long[] collected = (Long[])jrdd.rdd().collect();
            for (Long item : collected)
                output.println("Found "+item.longValue()+ " palindromes.");
        }
        return null;
    }
});
/* End: output code*/
ssc.start();
ssc.awaitTermination();

Sample 5: Streaming palindromes code
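The streaming sample reports one count per batch of received lines. The following plain-Java sketch imitates that behavior without Spark: each inner list stands in for the lines that arrived during one batch interval. The class name `MicroBatchSketch`, the `countPerBatch` helper, and the sample batches are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for per-batch counting; no Spark classes are used.
final class MicroBatchSketch {
    static boolean isPalindrome(String line) {
        String cleaned = line.replaceAll("\\s", "").toLowerCase();
        return cleaned.equals(new StringBuilder(cleaned).reverse().toString());
    }

    // Produce one palindrome count per batch, in arrival order.
    static List<Long> countPerBatch(List<List<String>> batches) {
        List<Long> counts = new ArrayList<>();
        for (List<String> batch : batches) {
            counts.add(batch.stream().filter(MicroBatchSketch::isPalindrome).count());
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("level", "cousins"),
                Arrays.asList("noon", "kayak", "xen"));
        for (Long n : countPerBatch(batches)) {
            System.out.println("Found " + n + " palindromes.");
        }
    }
}
```

The key difference from the file-based samples is that the job never sees the whole data set at once; it emits a small result for each interval while the input keeps arriving.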

SLIDE 29

Example – Streaming Configuration

...
<instanceClass name="org.apache.oodt.cas.resource.spark.examples.StreamingPalindromeExample" />
<inputClass name="org.apache.oodt.cas.resource.structs.NameValueJobInput">
  <properties>
    <property name="host" value="host" />
    <property name="port" value="7007" />
    <property name="time" value="60000" />
    <property name="output" value="/home/user/files/output-streaming-palindrome.txt" />
  </properties>
</inputClass>
<queue>quick</queue>
<load>1</load>
...

Sample 6: Streaming palindromes configuration
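The `<property>` entries in the configuration are name/value pairs that the job later reads with calls such as `input.getValue("port")`. As a rough illustration of that mapping, the sketch below parses a `<properties>` fragment into a map using the JDK's built-in DOM parser. This is not how OODT itself loads the file; the class name `JobInputSketch` and the `readProperties` helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper; OODT's own configuration loading is not shown here.
final class JobInputSketch {
    // Collect every <property name="..." value="..."/> into an ordered map.
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("property");
            Map<String, String> props = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                props.put(e.getAttribute("name"), e.getAttribute("value"));
            }
            return props;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse properties XML", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<properties>"
                + "<property name=\"host\" value=\"host\" />"
                + "<property name=\"port\" value=\"7007\" />"
                + "<property name=\"time\" value=\"60000\" />"
                + "</properties>";
        Map<String, String> props = readProperties(xml);
        // Values arrive as strings, so numeric settings need explicit parsing.
        int port = Integer.parseInt(props.get("port"));
        System.out.println("host=" + props.get("host") + " port=" + port);
    }
}
```

All values are strings at this level, which is why the streaming sample wraps the port lookup in `Integer.parseInt`.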

SLIDE 30

Example – Streaming In Action

SLIDE 31

Where can I get the code?

It’s Open Source! Jump on in!
Apache OODT SVN:

https://svn.apache.org/repos/asf/oodt/trunk/

Mailing List:

dev@oodt.apache.org

SLIDE 32

Acknowledgments

NASA Jet Propulsion Laboratory

Research & Technology Development “Archiving, Processing and Dissemination for the Big Data Era”

Apache Software Foundation

Apache OODT Project

SLIDE 33

Questions?

你有沒有問題?

Haben Sie Fragen? ¿Tienen preguntas? Avez-vous des questions?
