BUILDING APACHE SPARK APPLICATION PIPELINES FOR THE KUBERNETES ECOSYSTEM (PowerPoint PPT Presentation)

SLIDE 1

BUILDING APACHE SPARK APPLICATION PIPELINES FOR THE KUBERNETES ECOSYSTEM

Michael McCune
14 November 2016

SLIDE 2

INTRODUCTION

A little about me:
• Embedded to Orchestration
• Red Hat emerging technologies
• OpenStack Sahara
• Oshinko project for OpenShift

SLIDE 3

OVERVIEW

• Building Application Pipelines
• Case Study: Ophicleide
• Demonstration
• Lessons Learned
• Next Steps

SLIDE 4

INSPIRATION

Larger themes:
• Developer empowerment
• Improved collaboration
• Operational freedom

SLIDE 5

CLOUD APPLICATIONS

What are we talking about?
• Multiple disparate components
• Require deployment flexibility
• Challenging to debug

[Diagram: technology logos including Spark, HTTP, Node.js, MongoDB, Python, MySQL, Ruby, Kafka, HDFS, ActiveMQ, and PostgreSQL]

SLIDE 6

PLANNING

Before you begin engineering:
• Identify moving pieces
• Storyboard the data flow
• Visualize success and failure

[Diagram: Spark, HTTP, Node.js, MongoDB, and Python components]

SLIDE 7

PLANNING

Insightful analytics:
• What dataset?
• How to process?
• Where are the results?

[Diagram: Spark, HTTP, Node.js, MongoDB, and Python components mapped onto three stages: Ingest, Process, Publish]

SLIDE 8

BUILDING

Decompose application components:
• Natural breakpoints
• Build for modularity
• Stateless versus stateful

SLIDE 9

BUILDING

Focus on the communication:
• Coordinate in the middle
• Network resiliency
• Kubernetes DNS (a minimal sketch follows)
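
As a rough sketch of the Kubernetes DNS point: a Service gives a set of pods a stable, well-known DNS name, so components coordinate through a name rather than individual pod IP addresses. The service name, label, and port below are illustrative assumptions, not values from the talk:

apiVersion: v1
kind: Service
metadata:
  name: mongodb            # reachable in-cluster as "mongodb"
spec:
  selector:
    app: mongodb           # route traffic to pods carrying this label
  ports:
  - port: 27017            # port the service exposes
    targetPort: 27017      # port the pods listen on

With a Service like this, clients keep a single hostname in their configuration and Kubernetes keeps the routing valid as pods fail and are rescheduled.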

SLIDE 10

COLLABORATING

Building as a team:
• The right tools
• Modular projects
• Iterative improvements
• Coordinating actions

SLIDE 11

CASE STUDY: OPHICLEIDE

SLIDE 12

CASE STUDY: OPHICLEIDE

What does it do?
• Trains Word2Vec models
• Ingests data available over HTTP
• Answers word-similarity queries

[Diagram: Browser, Node.js, Python, and MongoDB components on Kubernetes; multiple Spark workers read text data]

SLIDE 13

CASE STUDY: OPHICLEIDE

Building blocks:
• Apache Spark
• Word2Vec
• Kubernetes
• OpenShift
• Node.js
• Flask
• MongoDB
• OpenAPI

SLIDE 14

DEEP DIVE

OpenAPI:
• Schema for REST APIs
• Wealth of tooling
• Central discussion point

SLIDE 15

OPENAPI

An excerpt of the swagger.yaml spec:

paths:
  /:
    get:
      description: |-
        Returns information about the server version
      responses:
        "200":
          description: |-
            Valid server info response
          schema:

And the Python code that serves it with connexion:

import connexion

app = connexion.App(__name__, specification_dir='./swagger/')
app.add_api('swagger.yaml',
            arguments={'title': 'The REST API for the Ophicleide '
                                'Word2Vec server'})
app.run(port=8080)

SLIDE 16

DEEP DIVE

Configuration data:
• What is needed?
• How to deliver?

[Diagram: Node.js and Python components on Kubernetes, configured through the environment variables below]

REST_ADDR=127.0.0.1
REST_PORT=8080
MONGO=mongodb://admin:admin@mongodb

SLIDE 17

CONFIGURATION DATA

spec:
  containers:
  - name: ${WEBNAME}
    image: ${WEBIMAGE}
    env:
    - name: OPH_TRAINING_ADDR
      value: ${OPH_ADDR}
    - name: OPH_TRAINING_PORT
      value: ${OPH_PORT}
    - name: OPH_WEB_PORT
      value: "8081"
    ports:
    - containerPort: 8081
      protocol: TCP

SLIDE 18

CONFIGURATION DATA

var express = require('express');
var request = require('request');
var app = express();

var training_addr = process.env.OPH_TRAINING_ADDR || '127.0.0.1';
var training_port = process.env.OPH_TRAINING_PORT || '8080';
var web_port = process.env.OPH_WEB_PORT || 8080;

app.get("/api/models", function(req, res) {
  var url = `http://${training_addr}:${training_port}/models`;
  request.get(url).pipe(res);
});

app.get("/api/queries", function(req, res) {
  var url = `http://${training_addr}:${training_port}/queries`;
  request.get(url).pipe(res);
});

app.listen(web_port, function() {
  console.log(`ophicleide-web listening on ${web_port}`);
});

SLIDE 19

SECRETS

Not used in Ophicleide, but worth mentioning

volumes:
- name: mongo-secret-volume
  secret:
    secretName: mongo-secret
containers:
- name: shiny-squirrel
  image: elmiko/shiny_squirrel
  args: ["mongodb"]
  volumeMounts:
  - name: mongo-secret-volume
    mountPath: /etc/mongo-secret
    readOnly: true
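
For context, a minimal sketch of how the referenced mongo-secret could be defined; the key names match the files read on the next slide, while the values are illustrative base64-encoded placeholders:

apiVersion: v1
kind: Secret
metadata:
  name: mongo-secret
type: Opaque
data:
  username: YWRtaW4=    # base64 of "admin" (echo -n admin | base64)
  password: YWRtaW4=    # placeholder; use a real credential in practice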

SLIDE 20

SECRETS

Each key in the secret is exposed as a file in the container

MONGO_USER=$(cat /etc/mongo-secret/username)
MONGO_PASS=$(cat /etc/mongo-secret/password)

/usr/bin/python /opt/shiny_squirrel/shiny_squirrel.py \
    --mongo \
    mongodb://${MONGO_USER}:${MONGO_PASS}@${MONGO_HOST_PORT}

SLIDE 21

DEEP DIVE

Spark processing:
• Read text from a URL
• Split words
• Create vectors

SLIDE 22

SPARK PROCESSING

import pymongo
from pyspark import SparkConf, SparkContext

def workloop(master, inq, outq, dburl):
    sconf = SparkConf().setAppName(
        "ophicleide-worker").setMaster(master)
    sc = SparkContext(conf=sconf)
    if dburl is not None:
        db = pymongo.MongoClient(dburl).ophicleide
    outq.put("ready")
    while True:
        job = inq.get()
        urls = job["urls"]
        mid = job["_id"]
        model = train(sc, urls)
        items = model.getVectors().items()
        words, vecs = zip(*[(w, list(v)) for w, v in items])

SLIDE 23

SPARK PROCESSING

from functools import reduce
from urllib.request import urlopen

from pyspark.mllib.feature import Word2Vec

def train(sc, urls):
    w2v = Word2Vec()
    rdds = reduce(lambda a, b: a.union(b),
                  [url2rdd(sc, url) for url in urls])
    return w2v.fit(rdds)

def url2rdd(sc, url):
    response = urlopen(url)
    corpus_bytes = response.read()
    text = str(corpus_bytes).replace("\\r", "\r").replace("\\n", "\n")
    # one RDD element per paragraph
    rdd = sc.parallelize(text.split("\r\n\r\n"))
    # join line breaks inside a paragraph before tokenizing
    rdd = rdd.map(lambda l: l.replace("\r\n", " "))
    # cleanstr is a helper defined elsewhere in the project
    return rdd.map(lambda l: cleanstr(l).split(" "))

SLIDE 24

SPARK PROCESSING

def create_query(newQuery) -> str:
    mid = newQuery["model"]
    word = newQuery["word"]
    model = model_cache_find(mid)
    if model is None:
        msg = (("no trained model with ID %r available; " % mid) +
               "check /models to see when one is ready")
        return json_error("Not Found", 404, msg)
    else:  # XXX
        w2v = model["w2v"]
        qid = uuid4()
        try:
            syns = w2v.findSynonyms(word, 5)
            q = {"_id": qid,
                 "word": word,
                 "results": syns,
                 "modelName": model["name"],
                 "model": mid}
            (query_collection()).insert_one(q)

SLIDE 25

DEMONSTRATION

See a demo at https://vimeo.com/189710503

SLIDE 26

LESSONS LEARNED

Things that went smoothly:
• OpenAPI
• Dockerfiles
• Kubernetes templates

SLIDE 27

LESSONS LEARNED

Things that require greater coordination:
• API coordination
• Compute resources
• Persistent storage (a claim sketch follows this list)
• Spark configurations
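
On the persistent storage point, Kubernetes applications typically request storage through a PersistentVolumeClaim that pods then mount by name. A minimal sketch with an assumed claim name and illustrative capacity, not the project's actual settings:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ophicleide-data    # hypothetical claim name
spec:
  accessModes:
  - ReadWriteOnce          # mountable read-write by a single node
  resources:
    requests:
      storage: 10Gi        # illustrative capacity request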

SLIDE 28

LESSONS LEARNED

Compute resources:
• CPU and memory constraints
• Label selectors
(a minimal pod-spec sketch follows the diagram below)

[Diagram: three nodes, each with a kubelet and a varying number of pods]
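
To make both knobs concrete, here is a sketch of how CPU and memory constraints and a label selector appear in a pod spec; the numbers, label, and image name are made-up examples rather than the project's settings:

spec:
  nodeSelector:
    disktype: ssd                    # schedule only on nodes with this label
  containers:
  - name: training
    image: example/training:latest   # hypothetical image
    resources:
      requests:                      # the scheduler reserves at least this much
        cpu: "1"
        memory: 2Gi
      limits:                        # the container is capped at this much
        cpu: "2"
        memory: 4Gi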

SLIDE 29

NEXT STEPS

Where to take this project?
• More Spark!
• Separate query service
• Development versus production

SLIDE 30

PROJECT LINKS

• Ophicleide: https://github.com/ophicleide
• Apache Spark: https://spark.apache.org
• Kubernetes: https://kubernetes.io
• OpenShift: https://openshift.org

SLIDE 31

THANKS!

elmiko
https://elmiko.github.io
@FOSSjunkie
