Spark Code Camp: Discover Spark Streaming & Spark SQL



SLIDE 1

Spark Code Camp

Discover Spark Streaming & Spark SQL

SLIDE 2

Project Overview

  • Focus on Spark Streaming and Spark SQL
  • Explored Streaming API of Apache Spark on Ukko Cluster

    ○ Window-based stream content
    ○ Direct stream content

  • Use Twitter Streaming API as a data source
  • Aim: collect tweet data and analyse it

    ○ Find popular hashtags
    ○ Discover tweet frequency per location
    ○ Discover tweeting trends over time
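The first aim, finding popular hashtags, can be sketched in plain Scala collections. This is a minimal illustration, not the project's code: the object `HashtagCounts` and its function names are hypothetical, and it treats any whitespace-separated token starting with `#` as a hashtag.

```scala
object HashtagCounts {
  // Extract hashtags from one tweet's text: whitespace-separated tokens
  // that start with '#'. Trailing punctuation is NOT stripped (simplification).
  def hashtags(text: String): Seq[String] =
    text.split("\\s+").toSeq
      .filter(t => t.startsWith("#") && t.length > 1)
      .map(_.toLowerCase)

  // Count every hashtag across a batch of tweets and return the k most
  // frequent, breaking ties alphabetically so the result is deterministic.
  def topHashtags(tweets: Seq[String], k: Int): Seq[(String, Int)] =
    tweets.flatMap(hashtags)
      .groupBy(identity)
      .map { case (tag, occurrences) => (tag, occurrences.size) }
      .toSeq
      .sortBy { case (tag, count) => (-count, tag) }
      .take(k)
}
```

The same extract, count, and sort steps would apply to each batch of a stream; only the collection type changes.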

SLIDE 3

Open-Source Stack

SLIDE 4

API Stack

  • Spark Core & Streaming

"org.apache.spark" %% "spark-core" % "1.0.2" % "provided"

"org.apache.spark" %% "spark-streaming" % "1.0.2" % "provided"

  • Twitter4j & Twitter Stream

"org.twitter4j" % "twitter4j-core" % "3.0.3"

"org.twitter4j" % "twitter4j-stream" % "3.0.3"

"org.apache.spark" %% "spark-streaming-twitter" % "1.0.2" % "provided"

  • Akka

"com.typesafe.akka" % "akka-actor_2.10" % "2.2-M1"

  • Socko

"org.mashupbots.socko" % "socko-webserver_2.10" % "0.4.2"

  • Spark SQL

"org.apache.spark" %% "spark-sql" % "1.0.0" % "provided"

SLIDE 5

Results

  • Discovered the most popular hashtags in the last n seconds with sliding-window streaming

  • Dynamic Graph Plotting with live feeds from Twitter Stream content
  • Generated a dataset of tweets in text files and in Spark SQL tables

    ○ One million tweets collected

  • Used Spark SQL to analyse tweet dataset
  • Used Actor based interaction between stream content and Web Server
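The "last n seconds with a sliding window" idea can be modelled without a cluster. The sketch below is plain Scala, not the Spark Streaming API (where the corresponding operation on a keyed stream is `reduceByKeyAndWindow`). The names `SlidingWindowModel`, `Observation`, `countsInWindow`, and `slide` are hypothetical and chosen only for illustration.

```scala
object SlidingWindowModel {
  // One hashtag sighting at a given time, in seconds since an arbitrary start.
  case class Observation(timeSec: Long, hashtag: String)

  // Count hashtags seen in the half-open window (now - windowSec, now].
  def countsInWindow(obs: Seq[Observation], now: Long, windowSec: Long): Map[String, Int] =
    obs.filter(o => o.timeSec > now - windowSec && o.timeSec <= now)
      .groupBy(_.hashtag)
      .map { case (tag, hits) => (tag, hits.size) }

  // Re-evaluate the window every slideSec seconds between start and end,
  // which is what makes the window "slide" over the stream.
  def slide(obs: Seq[Observation], start: Long, end: Long,
            windowSec: Long, slideSec: Long): Seq[(Long, Map[String, Int])] =
    (start to end by slideSec).map(now => (now, countsInWindow(obs, now, windowSec)))
}
```

With a 10-second window and a 5-second slide, each sighting contributes to two consecutive results, so a hashtag's count rises and then falls as the window moves past it.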
SLIDE 6
SLIDE 7

Challenges & Learning

  • Explored Streaming API

    ○ Few tutorials available for exploring streaming in Spark
    ○ Few streaming sources: Twitter or other?

  • Build environment

○ Maven or SBT

  • Stack selection based on Learning Curve

    ○ Short time to explore and experiment with different open-source software stacks
    ○ Decision challenges
        ■ Scala-based framework: Akka or Play?
        ■ Web server: Socko or another HTTP web server?
        ■ Graph: Chart.js or other chart libraries?
        ■ Storage: file system, Hive, Shark, or Spark SQL?

  • Stream Handling

    ○ Which attributes of a Twitter status (a user tweet == status) are useful?
    ○ What is possible with a huge stream of data?
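One way to approach the attribute question is to keep only the fields the analysis goals need. The sketch below assumes three fields (creation time, location, text); the slides do not say which attributes the team actually kept, so `TweetRecord` and both helper functions are hypothetical. In twitter4j, values like these would come from `Status` getters such as `getText` and `getCreatedAt`.

```scala
object TweetAttributes {
  // Hypothetical subset of status attributes (an assumption, not the
  // project's schema): enough for per-location and per-time analysis.
  case class TweetRecord(createdAtMillis: Long, location: String, text: String)

  // Number of tweets per location, e.g. for a frequency-by-location chart.
  def perLocation(tweets: Seq[TweetRecord]): Map[String, Int] =
    tweets.groupBy(_.location).map { case (loc, ts) => (loc, ts.size) }

  // Number of tweets per hour bucket (hours since the epoch),
  // e.g. for a trend-over-time chart.
  def perHour(tweets: Seq[TweetRecord]): Map[Long, Int] =
    tweets.groupBy(_.createdAtMillis / 3600000L).map { case (hour, ts) => (hour, ts.size) }
}
```

Keeping the record this small makes it cheap to write to text files or load into a table, at the cost of discarding attributes that later questions might need.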

SLIDE 8

References

  • http://sockoweb.org/
  • https://github.com/mashupbots/socko
  • http://akka.io/
  • https://spark.apache.org/streaming/
  • http://www.chartjs.org/
SLIDE 9
  • Team Members

    ○ Maninder Pal Singh
    ○ Ayesha Ahmad
    ○ Md. Mesbahul Islam