Big Data on Google Cloud Using Cloud Dataflow, BigQuery, and friends

  1. Big Data on Google Cloud: Using Cloud Dataflow, BigQuery, and friends to process data “the Cloud way”. William Vambenepe, Lead Product Manager for Big Data, Google Cloud Platform. @vambenepe / vbp@google.com

  2. Agenda
     1. Big Data at Google
     2. Managing data through its lifecycle
     3. Google Cloud Dataflow
     4. Friends of Cloud Dataflow: BigQuery, Pub/Sub, Hadoop/Spark...
     5. Optimizing your time
     6. References and follow-up

  3. Big Data at Google

  4. Building on Google’s infrastructure
     ● 1.5 million devices activated every day (over a billion devices)
     ● 6 billion hours watched every month (10h uploaded every minute)
     ● 20 billion pages crawled every day

  5. Hardware and data center innovation

  6. Software innovation (timeline diagram, 2002 to 2013). Systems shown: GFS, MapReduce, Bigtable, Dremel, Pregel, Flume, Colossus, MillWheel, Spanner, BigQuery, Cloud Dataflow

  7. Managing data through its lifecycle

  8. Data lifecycle (diagram). Components shown: Google App Engine, Cloud Logs, Google Analytics Premium; Cloud Pub/Sub (stream), Cloud Storage (batch); Google Cloud Dataflow; BigQuery (tables), Cloud Storage (files); BigQuery (SQL) analytics; real-time analytics & alerts

  9. Data usage organization maturity lifecycle (stages in order of increasing maturity)
     ● Descriptive
     ● Exploratory + Descriptive
     ● Predictive + Exploratory + Descriptive
     ● Prescriptive + Predictive + Exploratory + Descriptive

  10. Supporting organizations with operational ease of use
     ● no administration
     ● most powerful tools in the easiest way
     ● constant experimentation with low risks & cost
     ● easy collaboration across teams and organizations
     ● low costs without requiring usage commitments
     ● best performance & virtually unlimited scale
     ● always on

  11. Google Cloud Dataflow

  12. Data lifecycle (same diagram as slide 8)

  13. What is Cloud Dataflow?
     ● Cloud Dataflow is a collection of SDKs for building parallelized data processing pipelines ↳ Download from GitHub: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
     ● Cloud Dataflow is a managed service for executing parallelized data processing pipelines ↳ Use on Google Cloud: https://cloud.google.com/dataflow/

  14. Cloud Dataflow SDK - Logical Model. Unified programming model for both batch & stream processing. Pipeline { Who => Inputs; What => Transforms; Where => Windows; When => Watermarks + Triggers; To => Outputs }
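
The logical model on this slide can be illustrated without the SDK. The following is a minimal plain-Python sketch, not Dataflow SDK code: the function names, the 60-second window size, the doubling transform, and the watermark rule (largest event time seen, minus an allowed delay) are all assumptions made for illustration. It reads timestamped inputs, applies a transform, assigns each element to a fixed event-time window, and emits a window once the watermark passes the end of that window.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical fixed window size


def assign_fixed_window(timestamp, size=WINDOW_SECONDS):
    """Where: map an event timestamp to the start of its fixed window."""
    return timestamp - (timestamp % size)


def run_pipeline(events, allowed_delay=0):
    """Process (timestamp, value) events in arrival order.

    Returns (window_start, total) pairs in the order the windows fire.
    """
    open_windows = defaultdict(int)
    fired = []
    watermark = float("-inf")
    for timestamp, value in events:              # Inputs
        doubled = value * 2                      # What: a transform
        window = assign_fixed_window(timestamp)  # Where: a window
        if window + WINDOW_SECONDS <= watermark:
            continue  # window already fired; this sketch drops late data
        open_windows[window] += doubled
        # When: advance the watermark, then fire every window it has passed
        watermark = max(watermark, timestamp - allowed_delay)
        for start in sorted(open_windows):
            if start + WINDOW_SECONDS <= watermark:
                fired.append((start, open_windows.pop(start)))  # Outputs
    # End of input: flush whatever is still open
    for start in sorted(open_windows):
        fired.append((start, open_windows.pop(start)))
    return fired


# The event at t=50 arrives after the event at t=65 (out of order).
sample_events = [(5, 1), (30, 2), (65, 3), (50, 4), (130, 5)]
strict = run_pipeline(sample_events)                     # late element dropped
tolerant = run_pipeline(sample_events, allowed_delay=20) # late element kept
```

With no allowed delay the out-of-order element is dropped because its window has already fired; with a 20-second delay the watermark lags far enough for it to be counted.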

  15. Cloud Dataflow Pipeline
     • A Directed Acyclic Graph of data processing transformations
     • Can be submitted to the Dataflow Service for optimization and execution, or executed on an alternate runner, e.g. Spark
     • May include multiple inputs and multiple outputs
     • May encompass many logical MapReduce operations
     • PCollections flow through the pipeline
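
To make the graph idea concrete, here is a minimal sketch in plain Python rather than the Dataflow SDK. Plain lists stand in for PCollections, and `run_dag`, the step names, and the sample data are hypothetical. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) to run each transformation only after its upstream collections exist, with two inputs and two outputs.

```python
from graphlib import TopologicalSorter


def run_dag(steps, dependencies, sources):
    """Run each step after its upstream steps, passing collections along.

    steps:        name -> function taking a list of upstream collections
    dependencies: name -> upstream names (sources or steps), in argument order
    sources:      name -> input collection (a plain list)
    """
    results = dict(sources)
    # static_order() yields every node after all of its predecessors
    # and raises graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(dependencies).static_order():
        if name in results:
            continue  # a source: nothing to compute
        upstream = [results[dep] for dep in dependencies[name]]
        results[name] = steps[name](upstream)
    return results


# Two inputs feed one merge step, which feeds two independent outputs.
sources = {"clicks": [1, 2, 3], "views": [10, 20]}
steps = {
    "merged": lambda ups: ups[0] + ups[1],
    "evens": lambda ups: [x for x in ups[0] if x % 2 == 0],
    "total": lambda ups: [sum(ups[0])],
}
dependencies = {
    "merged": ["clicks", "views"],
    "evens": ["merged"],
    "total": ["merged"],
}
results = run_dag(steps, dependencies, sources)
```

A real service would also optimize the graph before running it; this sketch only shows dependency-ordered execution.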

  16. Life of a Dataflow Pipeline (diagram). Labels shown: User Code & SDK; Managed Service (Job Manager, Work Manager); Graph optimization; Deploy & Schedule; Progress & Logs; Monitoring UI; Google Cloud Platform

  17. Worker Lifecycle Management throughout batch execution: Deploy, Schedule & Monitor, Tear Down

  18. Worker Optimization: 100 mins. vs. 65 mins.

  19. Continuous worker scaling for long-lived streaming pipelines (chart of load over time; levels shown: 800 RPS, 1,200 RPS, 5,000 RPS, 50 RPS)
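
As a rough illustration of scaling workers with load, the sketch below sizes a worker pool for the request rates shown on the slide, taken in the order they are listed. The rule itself is an assumption, not a description of how the Dataflow service decides: `workers_needed`, the 100 RPS per-worker capacity, and the pool limits are hypothetical values chosen for the example.

```python
import math


def workers_needed(requests_per_second, per_worker_rps=100,
                   min_workers=1, max_workers=100):
    """Smallest worker count that covers the load, clamped to the pool limits.

    per_worker_rps is a hypothetical capacity figure, not a measured one.
    """
    needed = math.ceil(requests_per_second / per_worker_rps)
    return max(min_workers, min(max_workers, needed))


# Load levels from the slide, in the order they are listed.
load_over_time = [800, 1200, 5000, 50]
plan = [workers_needed(rps) for rps in load_over_time]
```

The point of the sketch is that the pool shrinks again when load falls to 50 RPS, instead of staying sized for the 5,000 RPS peak.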

  20. Portability: Cloud Dataflow Runners. The most productive and portable data pipeline SDK.
     • Run the same code in multiple modes using different runners
     • Direct Runner
        • For local, in-memory execution
        • Great for development and unit tests
     • Cloud Dataflow Service Runner
        • Runs on the fully managed Dataflow Service
        • Your code runs distributed across GCE instances
     • Community sourced
        • Spark runner @ github.com/cloudera/spark-dataflow
        • Flink runner coming soon from dataArtisans
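
The runner idea, one pipeline definition executed in different modes, can be sketched in plain Python. Nothing below is SDK code: `InMemoryRunner` and `ThreadPoolRunner` are hypothetical stand-ins for a local runner and a distributed one, and the pipeline is simply a list of per-element functions.

```python
from concurrent.futures import ThreadPoolExecutor


def apply_steps(steps, element):
    """Apply every per-element step of the pipeline to one element."""
    for step in steps:
        element = step(element)
    return element


class InMemoryRunner:
    """Hypothetical stand-in for a local, in-memory runner."""

    def run(self, steps, data):
        return [apply_steps(steps, item) for item in data]


class ThreadPoolRunner:
    """Hypothetical stand-in for a runner that spreads work over workers."""

    def __init__(self, workers=4):
        self.workers = workers

    def run(self, steps, data):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # Executor.map() returns results in input order.
            return list(pool.map(lambda item: apply_steps(steps, item), data))


# The pipeline is defined once and handed to either runner unchanged.
pipeline_steps = [str.strip, str.lower, len]
words = ["  Dataflow ", "BigQuery", " Pub/Sub"]
local_result = InMemoryRunner().run(pipeline_steps, words)
pooled_result = ThreadPoolRunner().run(pipeline_steps, words)
```

Because the pipeline does not know which runner executes it, the in-memory version can be used in unit tests while the same definition runs elsewhere at scale.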

  21. Friends of Cloud Dataflow: BigQuery, Pub/Sub, Hadoop/Spark...

  22. Data lifecycle (same diagram as slide 8)

  23. Cloud Pub/Sub: many-to-many asynchronous messaging; fast and reliable
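
Many-to-many asynchronous messaging can be pictured with a small in-memory model. This is a sketch of the pattern only, not the Cloud Pub/Sub API: the `Broker` class, its methods, and the topic and subscription names are hypothetical. Any number of publishers write to a topic, every subscription on that topic gets its own copy of each message, and publishers never wait for readers.

```python
from collections import defaultdict, deque


class Broker:
    """Hypothetical in-memory broker: topics fan out to subscriptions."""

    def __init__(self):
        # topic -> {subscription name: queue of undelivered messages}
        self._subscriptions = defaultdict(dict)

    def subscribe(self, topic, subscription):
        """Create the subscription's queue if it does not exist yet."""
        self._subscriptions[topic].setdefault(subscription, deque())

    def publish(self, topic, message):
        """Enqueue a copy for every subscription; does not wait for readers."""
        for queue in self._subscriptions[topic].values():
            queue.append(message)

    def pull(self, topic, subscription):
        """Return the oldest undelivered message, or None if none is waiting."""
        queue = self._subscriptions[topic][subscription]
        return queue.popleft() if queue else None


broker = Broker()
broker.subscribe("events", "dashboard")
broker.subscribe("events", "archiver")

# Two independent publishers write to the same topic.
broker.publish("events", "web: page_view")
broker.publish("events", "mobile: app_open")

dashboard_first = broker.pull("events", "dashboard")
archiver_messages = [broker.pull("events", "archiver") for _ in range(3)]
```

Each subscription drains at its own pace, which is what lets a slow consumer fall behind without blocking publishers or other consumers.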

  24. BigQuery
     ● Ingest data via streaming (100K rows/second/table) or file loader
     ● Process interactive SQL queries on TB or PB of data
     ● Zero administration; just upload data and send queries
     ● Pay for storage and query separately, based on actual usage
     ● Non-technical analysts can drive queries on massive datasets using BI tools (e.g. Tableau)
     ● Highly available: data replication in multiple geographies
     ● Secure and easy collaboration: access to data is controlled using customer-owned ACLs

  25. Hadoop and Spark (architecture diagram). Labels shown: Master Node (Name Node); Work Nodes with HDFS (optional); Local SSD (optional), PD SSD, PD standard; GCS Connector and BigQuery Connector; bdutil orchestration

  26. Optimizing your time

  27-35. Optimizing Your Time

  36. References and follow-up

  37. Getting Started
     ● Cloud Dataflow Service: https://cloud.google.com/dataflow
     ● SDK: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
     ● Questions: https://stackoverflow.com/questions/tagged/google-cloud-dataflow
     ● BigQuery: https://cloud.google.com/bigquery/
     ● Cloud Pub/Sub: https://cloud.google.com/pubsub/
     ● Hadoop and Spark: https://cloud.google.com/hadoop/
     Contact me
     ● Twitter: @vambenepe
     ● email: vbp@google.com

  38. Thank You! cloud.google.com
