Synthetic Data Generation for Realistic Analytics Examples and Testing


  1. Synthetic Data Generation for Realistic Analytics Examples and Testing
     Ronald J. Nowling, Red Hat, Inc.
     rnowling@redhat.com
     http://rnowling.github.io/

  2. Who Am I?
     • Software Engineer at Red Hat
     • Data Science Team, Emerging Technologies
       – Evaluate open-source Big Data space
       – Ensure software works for Red Hat customers
       – Promote data science internally through consulting projects
     • Apache BigTop PMC

  3. Synthetic Data
     • No licensing, privacy, or intellectual property concerns
     • Scalable: Laptops to Clusters!
     • More reliable than external data sets
     • Enable more realistic example applications
     • Enable more comprehensive testing than wordcount and TeraSort

  4. Data Transformation and Summarization Pipeline
     [Diagram. Inputs: Accounts and Raw Daily Page Views (raw text). Stages: Parse, Clean & Validate, Transform, Summarize, Aggregate. Outputs: Daily Activity and Cumulative Activity.]
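
The stage names on this slide suggest a pipeline along the following lines. This is only a minimal sketch: the record format (`date,account,url`), the sample data, and every function name are assumptions made for illustration, and the real pipeline ran on a cluster rather than in plain Python.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical raw record format: "<ISO date>,<account id>,<url>"
RAW_LINES = [
    "2015-06-01,acct-1,/home",
    "2015-06-01,acct-1,/pricing",
    "2015-06-01,acct-2,/home",
    "2015-06-02,acct-1,/home",
    "not a valid record",
    "2015-06-02,acct-9,/home",  # account id not in ACCOUNTS
]
ACCOUNTS = {"acct-1", "acct-2"}


def parse(line):
    """Split one raw text line into (day, account, url), or None if malformed."""
    parts = line.strip().split(",")
    if len(parts) != 3:
        return None
    try:
        day = date.fromisoformat(parts[0])
    except ValueError:
        return None
    return day, parts[1], parts[2]


def clean_and_validate(records, accounts):
    """Drop malformed records and page views from unknown accounts."""
    return [r for r in records if r is not None and r[1] in accounts]


def summarize_daily(records):
    """Count page views per (day, account)."""
    return Counter((day, account) for day, account, _url in records)


def aggregate_cumulative(daily):
    """Sum the daily counts into a total per account."""
    totals = defaultdict(int)
    for (_day, account), views in sorted(daily.items()):
        totals[account] += views
    return dict(totals)


daily_activity = summarize_daily(clean_and_validate(map(parse, RAW_LINES), ACCOUNTS))
cumulative_activity = aggregate_cumulative(daily_activity)
```

With the sample lines above, the malformed record and the view from the unknown account are dropped before any counting happens, so neither output includes them.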


  10. Synthetic Data
     • Sensitive Data
       – Real data on cluster for scalability testing and validation
       – Synthetic data for local development and testing
     • Needed smaller data sets for checking calculations
       – Total aggregation results require re-running old pipeline
       – Extra burden on operations team
       – Delay for development team

  11. Validation with Synthetic Data
     [Diagram. Labels: Data Generator, Accounts, Raw Daily Page Views, Transformation and Summarization Pipeline, Daily Activity, Cumulative Activity, Expected Daily Activity, Expected Cumulative Activity, Validation Script.]
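
One way to read this diagram: the generator emits the raw input together with the results a correct pipeline should produce, and a validation script compares the two. The sketch below illustrates that idea under stated assumptions. The record format, the function names, and the stand-in `pipeline_under_test` are all hypothetical and are not taken from the talk.

```python
import random
from collections import Counter


def generate(accounts, days, seed=0):
    """Return (raw_lines, expected_daily, expected_cumulative) for a seeded run."""
    rng = random.Random(seed)
    raw_lines = []
    expected_daily = Counter()
    expected_cumulative = Counter()
    for day in days:
        for account in accounts:
            views = rng.randint(0, 5)
            for _ in range(views):
                raw_lines.append(f"{day},{account},/page")
            if views:
                expected_daily[(day, account)] = views
                expected_cumulative[account] += views
    rng.shuffle(raw_lines)  # do not hand the pipeline conveniently sorted input
    return raw_lines, expected_daily, expected_cumulative


def pipeline_under_test(raw_lines):
    """Stand-in for the real pipeline: recompute both outputs from raw text."""
    daily = Counter()
    for line in raw_lines:
        day, account, _url = line.split(",")
        daily[(day, account)] += 1
    cumulative = Counter()
    for (_day, account), views in daily.items():
        cumulative[account] += views
    return daily, cumulative


def validate(actual, expected, name):
    """Return a list of human-readable mismatches between two count tables."""
    keys = set(actual) | set(expected)
    return [
        f"{name} {key}: expected {expected.get(key, 0)}, got {actual.get(key, 0)}"
        for key in sorted(keys)
        if actual.get(key, 0) != expected.get(key, 0)
    ]


raw, exp_daily, exp_cumulative = generate(
    ["acct-1", "acct-2"], ["2015-06-01", "2015-06-02"]
)
daily, cumulative = pipeline_under_test(raw)
mismatches = validate(daily, exp_daily, "daily") + validate(
    cumulative, exp_cumulative, "cumulative"
)
```

Because the generator is seeded, a failing run can be reproduced locally on a small data set without re-running the old pipeline on real data.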


  17. Issues Tackled
     • Error in account validation introduced while refactoring code
     • Usage of the correct join types
     • Validation of date-time operations
     • Correct output formats
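
The slide does not say which join or which date-time operation caused trouble. As a purely hypothetical illustration of how small synthetic data sets expose these two classes of bug, the sketch below shows an inner join silently dropping an account with no page views, and a timestamp landing on a different calendar day depending on the time zone used. All names and values are made up.

```python
from datetime import datetime, timezone

accounts = ["acct-1", "acct-2", "acct-3"]
daily_views = {"acct-1": 4, "acct-2": 1}  # acct-3 had no page views

# Inner join: only accounts present on both sides survive.
inner = {a: daily_views[a] for a in accounts if a in daily_views}

# Left outer join from accounts: every account is kept, missing counts become 0.
left = {a: daily_views.get(a, 0) for a in accounts}

# The same instant falls on different calendar days in local time and in UTC.
stamp = datetime.fromisoformat("2015-06-01T23:30:00-05:00")
local_day = stamp.date()
utc_day = stamp.astimezone(timezone.utc).date()
```

If the expected output lists every account, the inner join version fails validation as soon as the generator emits an account with zero activity.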

  18. Apache BigTop BigPetStore Blueprints
     • Problem domain: Transactions for a fictional chain of pet stores
     • BigPetStore Data Generator simulates customer purchasing behavior to generate realistic transaction data
     • Blueprints for big data ecosystem
       – Hadoop: MapReduce / Pig / Hive / Mahout
       – Spark
       – Flink (in progress)
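
To make "simulates customer purchasing behavior" concrete, here is a toy simulation. It is not the BigPetStore model: the catalog, the weights, and the exponential gap between store visits are assumptions chosen only to show the general shape of a seeded, behavior-driven generator.

```python
import random

PRODUCTS = {  # hypothetical catalog: pet type -> [(product, relative weight)]
    "dog": [("dog food", 6), ("leash", 1), ("chew toy", 3)],
    "cat": [("cat food", 6), ("litter", 3), ("scratching post", 1)],
}


def simulate(num_customers, num_days, seed=42):
    """Yield (day, customer_id, product) tuples, in time order per customer."""
    rng = random.Random(seed)
    for customer_id in range(num_customers):
        pet = rng.choice(sorted(PRODUCTS))
        names, weights = zip(*PRODUCTS[pet])
        day = rng.expovariate(1 / 7.0)  # assumed mean of 7 days between visits
        while day < num_days:
            product = rng.choices(names, weights=weights, k=1)[0]
            yield int(day), customer_id, product
            day += rng.expovariate(1 / 7.0)


transactions = list(simulate(num_customers=100, num_days=30))
```

Raising `num_customers` or `num_days` scales the output without changing the code, which is the property that lets the same generator serve both a laptop and a cluster.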

  23. BigPetStore
     [Diagram, built up over slides 19 to 23. Component labels: HCFS, Core (RDDs), Spark SQL, MLlib.]

  24. Team Cluster
     • ~10 nodes
     • 40 cores, 400 GB RAM per node

  25. Potential Issues
     • Infrastructure
     • Storage
     • Software Installation
     • Software Upgrades
     • Spark Configuration Tuning
     • User Management

  26. Real Stories
     • Creating a new user
       – User's Gluster permissions incorrect
     • Cluster upgrade
       – Spark upgrade didn't take because of an issue with the Ansible role configuration
       – Wiped out our spark.conf
       – master / mesos settings wrong
     • Gluster mount points disappeared on reboot
       – Not set in fstab

  27. k8petstore
     [Diagram. Labels: Users, Public IP, Proxy, Web Application, BPS Data Generator (three instances), Redis Master, Redis Slave (three instances).]


  31. k8petstore

  32. Use Cases
     • Configuration
     • Scalability
     • Fault Tolerance

  33. k8petstore
     • OpenContrail networking solution demo [1]
     • Kubernetes Juju Charm documentation example [2]
     • Kubernetes v1.0 launch talk at OSCON [3]
     [1] https://pedrormarques.wordpress.com/2015/04/24/kubernetes-and-opencontrail/
     [2] http://kubernetes.io/v1.0/docs/getting-started-guides/juju.html
     [3] http://www.oscon.com/open-source-2015/public/schedule/detail/45281

  34. APACHE BIGTOP DATA GENERATORS

  35. BigPetStore

  36. BigTop Weatherman

  37. BigTop Bazaar

  38. Vision
     • Encourage synthetic data generation for testing and realistic examples
     • Serve as a resource for the larger Apache and open source communities
     • Emphasis on
       – Flexibility
       – Scalability
       – Realism
     • We look forward to collaborating and getting folks involved!

  39. Conclusion
     • Synthetic data generators and blueprints are useful!
     • Case studies:
       – Data Processing Pipelines
       – Cluster Deployment
       – Kubernetes
     • BigPetStore and BigTop Data Generators efforts in Apache BigTop
     • Open invitation to get involved and collaborate

  40. Resources
     http://bigtop.apache.org/
     http://github.com/apache/bigtop
     http://rnowling.github.io/

  41. QUESTIONS
