BUILDING APACHE SPARK APPLICATION PIPELINES FOR THE KUBERNETES ECOSYSTEM (PowerPoint PPT Presentation)

SLIDE 1

BUILDING APACHE SPARK APPLICATION PIPELINES FOR THE KUBERNETES ECOSYSTEM

Michael McCune
14 November 2016

SLIDE 2

INTRODUCTION

A little about me:
• Embedded to Orchestration
• Red Hat emerging technologies
• OpenStack Sahara
• Oshinko project for OpenShift

SLIDE 3

OVERVIEW

• Building Application Pipelines
• Case Study: Ophicleide
• Demonstration
• Lessons Learned
• Next Steps

SLIDE 4

INSPIRATION

Larger themes:
• Developer empowerment
• Improved collaboration
• Operational freedom

SLIDE 5

CLOUD APPLICATIONS

What are we talking about?
• Multiple disparate components
• Require deployment flexibility
• Challenging to debug

[Diagram: technology logos including Spark, HTTP, Node.js, MongoDB, Python, MySQL, Ruby, Kafka, HDFS, ActiveMQ, and PostgreSQL]

SLIDE 6

PLANNING

Before you begin engineering:
• Identify moving pieces
• Storyboard the data flow
• Visualize success and failure

[Diagram: Spark, HTTP, Node.js, MongoDB, and Python components]

SLIDE 7

PLANNING

Insightful analytics:
• What dataset?
• How to process?
• Where are the results?

[Diagram: Spark, HTTP, Node.js, MongoDB, and Python components mapped onto three stages: Ingest, Process, Publish]

SLIDE 8

BUILDING

Decompose application components:
• Natural breakpoints
• Build for modularity
• Stateless versus stateful

SLIDE 9

BUILDING

Focus on the communication:
• Coordinate in the middle
• Network resiliency
• Kubernetes DNS (a minimal sketch follows)
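
As a rough sketch of the Kubernetes DNS point: a Service gives a set of pods a stable, well-known DNS name, so components coordinate through a name rather than individual pod IP addresses. The service name, label, and port below are illustrative assumptions, not values from the talk:

apiVersion: v1
kind: Service
metadata:
  name: mongodb            # reachable in-cluster as "mongodb"
spec:
  selector:
    app: mongodb           # route traffic to pods carrying this label
  ports:
  - port: 27017            # port the service exposes
    targetPort: 27017      # port the pods listen on

With a Service like this, clients keep a single hostname in their configuration and Kubernetes keeps the routing valid as pods fail and are rescheduled.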

SLIDE 10

COLLABORATING

Building as a team:
• The right tools
• Modular projects
• Iterative improvements
• Coordinating actions

SLIDE 11

CASE STUDY: OPHICLEIDE

SLIDE 12

CASE STUDY: OPHICLEIDE

What does it do?
• Trains Word2Vec models
• Ingests data available over HTTP
• Answers word-similarity queries

[Diagram: Browser, Node.js, Python, and MongoDB components on Kubernetes; multiple Spark workers read text data]

SLIDE 13

CASE STUDY: OPHICLEIDE

Building blocks:
• Apache Spark
• Word2Vec
• Kubernetes
• OpenShift
• Node.js
• Flask
• MongoDB
• OpenAPI

SLIDE 14

DEEP DIVE

OpenAPI:
• Schema for REST APIs
• Wealth of tooling
• Central discussion point

SLIDE 15

OPENAPI

An excerpt of the swagger.yaml spec:

paths:
  /:
    get:
      description: |-
        Returns information about the server version
      responses:
        "200":
          description: |-
            Valid server info response
          schema:

And the Python code that serves it with connexion:

import connexion

app = connexion.App(__name__, specification_dir='./swagger/')
app.add_api('swagger.yaml',
            arguments={'title': 'The REST API for the Ophicleide '
                                'Word2Vec server'})
app.run(port=8080)

SLIDE 16

DEEP DIVE

Configuration data:
• What is needed?
• How to deliver?

[Diagram: Node.js and Python components on Kubernetes, configured through the environment variables below]

REST_ADDR=127.0.0.1
REST_PORT=8080
MONGO=mongodb://admin:admin@mongodb

SLIDE 17

CONFIGURATION DATA

spec:
  containers:
  - name: ${WEBNAME}
    image: ${WEBIMAGE}
    env:
    - name: OPH_TRAINING_ADDR
      value: ${OPH_ADDR}
    - name: OPH_TRAINING_PORT
      value: ${OPH_PORT}
    - name: OPH_WEB_PORT
      value: "8081"
    ports:
    - containerPort: 8081
      protocol: TCP

SLIDE 18

CONFIGURATION DATA

var express = require('express');
var request = require('request');
var app = express();

var training_addr = process.env.OPH_TRAINING_ADDR || '127.0.0.1';
var training_port = process.env.OPH_TRAINING_PORT || '8080';
var web_port = process.env.OPH_WEB_PORT || 8080;

app.get("/api/models", function(req, res) {
  var url = `http://${training_addr}:${training_port}/models`;
  request.get(url).pipe(res);
});

app.get("/api/queries", function(req, res) {
  var url = `http://${training_addr}:${training_port}/queries`;
  request.get(url).pipe(res);
});

app.listen(web_port, function() {
  console.log(`ophicleide-web listening on ${web_port}`);
});

SLIDE 19

SECRETS

Not used in Ophicleide, but worth mentioning

volumes:
- name: mongo-secret-volume
  secret:
    secretName: mongo-secret
containers:
- name: shiny-squirrel
  image: elmiko/shiny_squirrel
  args: ["mongodb"]
  volumeMounts:
  - name: mongo-secret-volume
    mountPath: /etc/mongo-secret
    readOnly: true
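
For context, a minimal sketch of how the referenced mongo-secret could be defined; the key names match the files read on the next slide, while the values are illustrative base64-encoded placeholders:

apiVersion: v1
kind: Secret
metadata:
  name: mongo-secret
type: Opaque
data:
  username: YWRtaW4=    # base64 of "admin" (echo -n admin | base64)
  password: YWRtaW4=    # placeholder; use a real credential in practice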

SLIDE 20

SECRETS

Each key in the secret is exposed as a file in the container

MONGO_USER=$(cat /etc/mongo-secret/username)
MONGO_PASS=$(cat /etc/mongo-secret/password)

/usr/bin/python /opt/shiny_squirrel/shiny_squirrel.py \
    --mongo \
    mongodb://${MONGO_USER}:${MONGO_PASS}@${MONGO_HOST_PORT}

SLIDE 21

DEEP DIVE

Spark processing:
• Read text from a URL
• Split words
• Create vectors

SLIDE 22

SPARK PROCESSING

import pymongo
from pyspark import SparkConf, SparkContext

def workloop(master, inq, outq, dburl):
    sconf = SparkConf().setAppName(
        "ophicleide-worker").setMaster(master)
    sc = SparkContext(conf=sconf)
    if dburl is not None:
        db = pymongo.MongoClient(dburl).ophicleide
    outq.put("ready")
    while True:
        job = inq.get()
        urls = job["urls"]
        mid = job["_id"]
        model = train(sc, urls)
        items = model.getVectors().items()
        words, vecs = zip(*[(w, list(v)) for w, v in items])

SLIDE 23

SPARK PROCESSING

from functools import reduce
from urllib.request import urlopen

from pyspark.mllib.feature import Word2Vec

def train(sc, urls):
    w2v = Word2Vec()
    rdds = reduce(lambda a, b: a.union(b),
                  [url2rdd(sc, url) for url in urls])
    return w2v.fit(rdds)

def url2rdd(sc, url):
    response = urlopen(url)
    corpus_bytes = response.read()
    text = str(corpus_bytes).replace("\\r", "\r").replace("\\n", "\n")
    # one RDD element per paragraph
    rdd = sc.parallelize(text.split("\r\n\r\n"))
    # join line breaks inside a paragraph before tokenizing
    rdd = rdd.map(lambda l: l.replace("\r\n", " "))
    # cleanstr is a helper defined elsewhere in the project
    return rdd.map(lambda l: cleanstr(l).split(" "))

SLIDE 24

SPARK PROCESSING

def create_query(newQuery) -> str:
    mid = newQuery["model"]
    word = newQuery["word"]
    model = model_cache_find(mid)
    if model is None:
        msg = (("no trained model with ID %r available; " % mid) +
               "check /models to see when one is ready")
        return json_error("Not Found", 404, msg)
    else:  # XXX
        w2v = model["w2v"]
        qid = uuid4()
        try:
            syns = w2v.findSynonyms(word, 5)
            q = {"_id": qid,
                 "word": word,
                 "results": syns,
                 "modelName": model["name"],
                 "model": mid}
            (query_collection()).insert_one(q)

SLIDE 25

DEMONSTRATION

See a demo at https://vimeo.com/189710503

SLIDE 26

LESSONS LEARNED

Things that went smoothly:
• OpenAPI
• Dockerfiles
• Kubernetes templates

SLIDE 27

LESSONS LEARNED

Things that require greater coordination:
• API coordination
• Compute resources
• Persistent storage (a claim sketch follows this list)
• Spark configurations
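
On the persistent storage point, Kubernetes applications typically request storage through a PersistentVolumeClaim that pods then mount by name. A minimal sketch with an assumed claim name and illustrative capacity, not the project's actual settings:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ophicleide-data    # hypothetical claim name
spec:
  accessModes:
  - ReadWriteOnce          # mountable read-write by a single node
  resources:
    requests:
      storage: 10Gi        # illustrative capacity request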

SLIDE 28

LESSONS LEARNED

Compute resources:
• CPU and memory constraints
• Label selectors
(a minimal pod-spec sketch follows the diagram below)

[Diagram: three nodes, each with a kubelet and a varying number of pods]
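
To make both knobs concrete, here is a sketch of how CPU and memory constraints and a label selector appear in a pod spec; the numbers, label, and image name are made-up examples rather than the project's settings:

spec:
  nodeSelector:
    disktype: ssd                    # schedule only on nodes with this label
  containers:
  - name: training
    image: example/training:latest   # hypothetical image
    resources:
      requests:                      # the scheduler reserves at least this much
        cpu: "1"
        memory: 2Gi
      limits:                        # the container is capped at this much
        cpu: "2"
        memory: 4Gi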

SLIDE 29

NEXT STEPS

Where to take this project?
• More Spark!
• Separate query service
• Development versus production

SLIDE 30

PROJECT LINKS

• Ophicleide: https://github.com/ophicleide
• Apache Spark: https://spark.apache.org
• Kubernetes: https://kubernetes.io
• OpenShift: https://openshift.org

SLIDE 31

THANKS!

elmiko
https://elmiko.github.io
@FOSSjunkie
