SLIDE 1

Synthetic Data Generation for Realistic Analytics Examples and Testing

Ronald J. Nowling
Red Hat, Inc.
rnowling@redhat.com
http://rnowling.github.io/

SLIDE 2

Who Am I?

  • Software Engineer at Red Hat
  • Data Science Team, Emerging Technologies
    – Evaluate the open-source Big Data space
    – Ensure software works for Red Hat customers
    – Promote data science internally through consulting projects

  • Apache BigTop PMC

SLIDE 3

Synthetic Data

  • No licensing, privacy, or intellectual property concerns
  • Scalable: Laptops to Clusters!
  • More reliable than external data sets
  • Enables more realistic example applications
  • Enables more comprehensive testing than wordcount and TeraSort

SLIDE 4

Data Transformation and Summarization Pipeline

[Diagram: three parallel Raw Daily Page Views inputs, each with stages labeled Transform Raw Text, Parse, Clean & Validate, and Summarize; an Accounts data set; an Aggregate stage; outputs labeled Daily Activity and Cumulative Activity]
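The stages named in the diagram can be sketched as small functions. This is only an illustration under stated assumptions: the raw line format (`account,date,views`), the validation rules, and all function names are hypothetical and not taken from the talk.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw format: "account_id,ISO date,page views" (an assumption).
RAW_DAY_1 = ["a1,2015-06-01,3", "a2,2015-06-01,5", "bogus line", "zz,2015-06-01,7"]
RAW_DAY_2 = ["a1,2015-06-02,4", "a2,2015-06-02,-1"]
ACCOUNTS = {"a1", "a2"}


def parse(line):
    """Parse one raw text line into (account, day, views), or None if malformed."""
    parts = line.split(",")
    if len(parts) != 3:
        return None
    try:
        return parts[0], date.fromisoformat(parts[1]), int(parts[2])
    except ValueError:
        return None


def clean_and_validate(records, accounts):
    """Keep well-formed records for known accounts with non-negative counts."""
    return [r for r in records if r is not None and r[0] in accounts and r[2] >= 0]


def summarize(records):
    """Sum page views per (account, day): the daily activity for one input."""
    daily = defaultdict(int)
    for account, day, views in records:
        daily[(account, day)] += views
    return dict(daily)


def aggregate(daily_summaries):
    """Sum daily activity across all days into cumulative activity per account."""
    cumulative = defaultdict(int)
    for daily in daily_summaries:
        for (account, _day), views in daily.items():
            cumulative[account] += views
    return dict(cumulative)


def run_pipeline(raw_days, accounts):
    """Run every day's raw lines through the stages, then aggregate."""
    daily = []
    for lines in raw_days:
        records = [parse(line) for line in lines]
        daily.append(summarize(clean_and_validate(records, accounts)))
    return daily, aggregate(daily)


daily_activity, cumulative_activity = run_pipeline([RAW_DAY_1, RAW_DAY_2], ACCOUNTS)
print(cumulative_activity)
```

Each day's input is summarized independently, so the per-day branches could run in parallel before the final aggregation.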

SLIDE 10

Synthetic Data

  • Sensitive Data
    – Real data on cluster for scalability testing and validation
    – Synthetic data for local development and testing
  • Needed smaller data sets for checking calculations
    – Total aggregation results require re-running the old pipeline
    – Extra burden on operations team
    – Delay for development team

SLIDE 11

Validation with Synthetic Data

[Diagram components: Data Generator; Accounts; Raw Daily Page Views; Expected Daily Activity; Expected Cumulative Activity; Transformation and Summarization Pipeline; Daily Activity; Cumulative Activity; Validation Script]
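One way to read the diagram: the generator emits the pipeline inputs together with the results a correct pipeline should produce, and a validation script compares the two. The sketch below assumes that reading. The data format, the function names, and the stand-in pipeline are all hypothetical.

```python
import random
from collections import defaultdict


def generate(seed, n_accounts=3, n_days=2):
    """Hypothetical generator: raw lines plus the totals a correct pipeline must report."""
    rng = random.Random(seed)  # seeded so every run yields the same data set
    accounts = [f"a{i}" for i in range(n_accounts)]
    raw_lines, expected_daily, expected_cumulative = [], {}, defaultdict(int)
    for day in range(1, n_days + 1):
        stamp = f"2015-06-{day:02d}"
        for account in accounts:
            views = rng.randint(0, 9)
            raw_lines.append(f"{account},{stamp},{views}")
            expected_daily[(account, stamp)] = views
            expected_cumulative[account] += views
    return accounts, raw_lines, expected_daily, dict(expected_cumulative)


def pipeline_under_test(accounts, raw_lines):
    """Stand-in for the real pipeline; in practice this is the job being validated."""
    daily, cumulative = defaultdict(int), defaultdict(int)
    known = set(accounts)
    for line in raw_lines:
        account, day, views = line.split(",")
        if account in known:
            daily[(account, day)] += int(views)
            cumulative[account] += int(views)
    return dict(daily), dict(cumulative)


def validate(actual, expected, label):
    """Return human-readable mismatches between actual and expected results."""
    keys = set(actual) | set(expected)
    return [
        f"{label} {key}: got {actual.get(key)}, expected {expected.get(key)}"
        for key in sorted(keys)
        if actual.get(key) != expected.get(key)
    ]


accounts, raw_lines, exp_daily, exp_cumulative = generate(seed=42)
daily, cumulative = pipeline_under_test(accounts, raw_lines)
errors = validate(daily, exp_daily, "daily") + validate(cumulative, exp_cumulative, "cumulative")
print("OK" if not errors else "\n".join(errors))
```

Because the generator knows what it emitted, the expected totals come for free and no second reference pipeline has to be re-run to check the calculations.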

SLIDE 17

Issues Tackled

  • Error in account validation introduced while refactoring code

  • Usage of the correct join types
  • Validation of date-time operations
  • Correct Output Formats
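The slides do not say which join was at fault. As a purely hypothetical illustration of why join types matter in a pipeline like this one, the sketch below joins an account table with daily activity: an inner join silently drops accounts that had no page views, while a left outer join keeps them with a zero count. The tables and helper names are invented for the example.

```python
# Hypothetical tables: account names and page views keyed by account id.
accounts = {"a1": "Alice", "a2": "Bob", "a3": "Carol"}
daily_views = {"a1": 7, "a2": 5}  # a3 had no page views on this day


def inner_join(left, right):
    """Keep only keys present in both tables."""
    return {key: (left[key], right[key]) for key in left if key in right}


def left_outer_join(left, right, default=0):
    """Keep every key of the left table, filling missing right values."""
    return {key: (left[key], right.get(key, default)) for key in left}


inner = inner_join(accounts, daily_views)
left = left_outer_join(accounts, daily_views)

print(sorted(inner))  # a3 is absent from the inner join
print(left["a3"])     # a3 is kept, with a zero view count
```

A small synthetic data set that deliberately includes an inactive account makes this kind of mistake visible in a local test run.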

SLIDE 18

Apache BigTop BigPetStore Blueprints

  • Problem domain: Transactions for a fictional chain of pet stores
  • BigPetStore Data Generator simulates customer purchasing behavior to generate realistic transaction data
  • Blueprints for big data ecosystem
    – Hadoop: MapReduce / Pig / Hive / Mahout
    – Spark
    – Flink (in progress)
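To give a feel for what "simulating customer purchasing behavior" can mean, here is a deliberately tiny toy model. It is not the BigPetStore generator's actual model: the product list, the repurchase rule, and every name below are assumptions made for illustration.

```python
import random

# Hypothetical products, each mapped to the number of days one purchase lasts.
PRODUCTS = {"dog food": 14.0, "cat litter": 7.0, "fish flakes": 30.0}


def simulate_customer(customer_id, days, rng):
    """Toy model: a customer rebuys each product when the last purchase runs out."""
    transactions = []
    for product, lasts in PRODUCTS.items():
        day = rng.uniform(0, lasts)  # first purchase at a random offset
        while day < days:
            transactions.append((customer_id, round(day, 1), product))
            # Noisy repurchase interval, clamped so time always moves forward.
            day += max(1.0, rng.gauss(lasts, lasts * 0.1))
    return sorted(transactions, key=lambda t: t[1])


rng = random.Random(0)  # seeded for reproducible output
transactions = simulate_customer(customer_id=1, days=60, rng=rng)
for customer, day, product in transactions[:5]:
    print(customer, day, product)
```

Even a toy like this produces data with structure (repeat purchases at product-specific intervals) that an analytics example can try to recover, which uniformly random records would not offer.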

SLIDE 23

BigPetStore

[Diagram: BigPetStore on the Spark stack, with layers labeled Spark SQL, MLlib, Core (RDDs), and HCFS]

SLIDE 24

Team Cluster

  • ~10 nodes
  • 40 cores, 400GB RAM per node

SLIDE 25

Potential Issues

  • Infrastructure
  • Storage
  • Software Installation
  • Software Upgrades
  • Spark Configuration Tuning
  • User Management

SLIDE 26

Real Stories

  • Creating a new user
    – User Gluster permissions incorrect
  • Cluster upgrade
    – Spark upgrade didn’t take because of an issue with the Ansible role configuration
    – Wiped out our spark.conf
    – master / mesos settings wrong
  • Gluster mount points disappeared on reboot
    – Not set in fstab

SLIDE 27

k8petstore

[Diagram components: Users; Public IP Proxy; Web Application; Redis Master; Redis Slave (×3); BPS Data Generator (×3)]

SLIDE 31

k8petstore

SLIDE 32

Use Cases

  • Configuration
  • Scalability
  • Fault Tolerance

SLIDE 33

k8petstore

  • OpenContrail networking solution demo [1]
  • Kubernetes Juju Charm documentation example [2]
  • Kubernetes v1.0 launch talk at OSCON [3]

[1] - https://pedrormarques.wordpress.com/2015/04/24/kubernetes-and-opencontrail/
[2] - http://kubernetes.io/v1.0/docs/getting-started-guides/juju.html
[3] - http://www.oscon.com/open-source-2015/public/schedule/detail/45281

SLIDE 34

APACHE BIGTOP DATA GENERATORS

SLIDE 35

BigPetStore

SLIDE 36

BigTop Weatherman

SLIDE 37

BigTop Bazaar

SLIDE 38

Vision

  • Encourage synthetic data generation for testing and realistic examples
  • Serve as a resource for the larger Apache and open source communities
  • Emphasis on
    – Flexibility
    – Scalability
    – Realism
  • We look forward to collaborating and getting folks involved!

SLIDE 39

Conclusion

  • Synthetic data generators and blueprints are useful!
  • Case studies:
    – Data Processing Pipelines
    – Cluster Deployment
    – Kubernetes
  • BigPetStore and BigTop Data Generators efforts in Apache BigTop
  • Open invitation to get involved and collaborate

SLIDE 40

Resources

http://bigtop.apache.org/
http://github.com/apache/bigtop
http://rnowling.github.io/

SLIDE 41

QUESTIONS
