You call it Data Lake; we call it Data Historian

SLIDE 1

You call it Data Lake; we call it Data Historian

Naghman Waheed, Data Platforms Lead
Brian Arnold, Data Platforms Architect
May-24-2018

SLIDE 2

Naghman Waheed, Data Platforms Lead

  • 25+ year career at Monsanto.
  • Data Warehousing, Business Intelligence, Data Architecture, Cloud Engineering.
  • Data solutions spanning key business functions such as Supply Chain, Manufacturing, Order-To-Cash, Finance and Procurement.

Brian Arnold, Data Platforms Architect

  • 10 year career in IT, 6 years in Big Data
  • Software Development, Functional Programming, Streaming, Big Data, Cloud Engineering
  • Ecommerce, Recommendation Engines

SLIDE 3

Monsanto - Who are we?

  • Bringing a broad range of solutions to help nourish our growing world
  • Headquartered in Saint Louis, Missouri
  • >20,000 employees in 66 countries
  • A global company with >50% of employees based outside of the United States
  • One of the 25 World’s Best Multinational Workplaces by Great Place to Work Institute

Produce with more judicious use of limited natural resources.

Improve the lives of the world’s farmers.

Increase production to meet needs of a growing population.

“We succeed when farmers succeed.”
Hugh Grant, Monsanto CEO

SLIDE 4

Solving real challenges in agriculture industry

Rising Population

Growing enough for a growing world

Global Population: 4.4B (1980), 7.1B (today), 9.6B+ (2050)

Limited Farmland

Farmers will need to produce enough food with fewer resources to support our world population

Acres per Person: 1 (1961), <1/3 (2050)

Changing Climate

Farmers are impacted by climate change in many ways:

  • Water availability issues
  • Increasingly unpredictable weather
  • Insect range expansion
  • Weed pressure changes
  • Crop disease increases
  • Planting zone shifts

Changing Economies and Diets

A growing global middle class is choosing animal protein – meat, eggs, and dairy – as a larger part of their diet

Dietary Percentage of Protein: 9% (1965), 14% (2030)

SLIDE 5

Our Solutions for Sustainable Agriculture

Our toolkit includes:

  • Plant Breeding
  • Biotechnology
  • Crop Protection
  • Precision Agriculture

SLIDE 6

Key Technology Trends In Agriculture

Economies of Data Science at Scale

  • A typical farm is generating 20GB of unique field data every year
  • Computing unit costs have gone down by 1,000x in the last 10 years

Mobile Device Proliferation among Growers

  • 94% of US farmers own a mobile phone or a smartphone, compared to less than 10% 10 years ago

Low-cost Observation Technology / IoT

  • Connected sensors on tractors, combines, and in fields have increased over 1000x in the last 10 years
  • The cost of the average digital sensor has dropped by more than half over that time

Source: Gartner Technology Trends 2015

SLIDE 7

Why Data Historian?

Strategy

  • Cloud First
  • Open Source
  • API First
  • Ecosystem fit

Capabilities

  • Ingestion
  • Access
  • Integration
  • Self Service

Architecture

  • Scalable
  • Fault Tolerance
  • Performance

TCO

  • License
  • Infrastructure
  • Cost review
  • Support

Build vs. Buy

  • Customization
  • Iterative release
  • Technology commitment

SLIDE 8

Data Strategy

Diagram labels:

  • Stages: Discover, Ingest, Process, Persist, Integrate, Analyze, Expose
  • Data domains: Company 360, Product 360, Customer 360, Event 360, Location 360, Insights, Others
  • Platforms: Data FrontDoor, Haystack, Kafka, Enterprise Data Hub, Data Historian, Enterprise Data Warehouse, Research Datastore, Ancestry Datastore, Other Datastores
  • Services: Change Data Capture, Extract Transform Load, Quality Management, Geospatial Platform, Analytics Platform, Visualization

SLIDE 9

Data Platforms Ecosystem

Diagram labels:

  • API Gateway, Data FrontDoor, Custom API Harvester, Tag & Register APIs
  • Identity Management, Authentication, Authorization, Virtual Directory Service
  • Transactional Systems, Trusted Partner Portal
  • Company 360, Product 360, Customer 360, Event 360, Location 360, Insights, Others
  • Kafka, Enterprise Data Hub, Topic Metadata, Haystack
  • Data Stores: Enterprise Data Warehouse, Research Datastore, Ancestry Datastore, Other Datastores, Data Historian
  • Change Data Capture, Archive Log (30 minute latency)
  • Batch Ingestion, Streaming Ingestion, API Ingestion, UI Ingestion
  • Quality Management, Extract Transform Load, Visualization, Virtualization, Geospatial Platform, Ontology Management, Analytics Platform
  • Connectors: To API Gateway, To Data Historian, To IDM, Metadata linked to search

SLIDE 10

Data Historian - Reference Architecture

Diagram labels:

  • Users and clients: Monsanto Internal Users, Applications, Adhoc Analysis
  • Data Ingest: Kafka, File Upload, Historian UI, Streaming Data Stores, Data Stores
  • Access: API Gateway, API Access, Historian UI, Query Engine
  • Security: Identity Management, Authentication / Authorization
  • Data Storage & Processing: AWS S3 Storage, Metadata Store, Archive Glacier Storage, Data Historian Processing Engine
  • Metadata and rules: Metadata Management, Governance Rules, Audit Rules

SLIDE 11

Data Historian – Technology

  • S3
  • Glacier
  • Lambda

SLIDE 12

AWS Data Historian Architecture

SLIDE 13
Ingest

  • Batch imports from RDBMS: full, delta, merge
  • Streaming from Kafka (Datahub)
  • File ingestion through API and Data Historian UI
  • Users can append files to existing datasets as well
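
The three batch import modes can be illustrated with a small in-memory sketch. The semantics assumed here (full replaces the dataset, delta appends the incoming rows, merge upserts by key) are a common reading of those terms, not something the slides define. The function `apply_import` and the `id` key are hypothetical names used only for illustration.

```python
from typing import Dict, List

Record = Dict[str, object]


def apply_import(existing: List[Record], incoming: List[Record],
                 mode: str, key: str = "id") -> List[Record]:
    """Return the dataset contents after one batch import.

    Assumed semantics (not taken from the slides):
      full  - the incoming batch replaces the dataset
      delta - the incoming batch is appended as new rows
      merge - incoming rows overwrite existing rows with the same key
    """
    if mode == "full":
        return list(incoming)
    if mode == "delta":
        return existing + list(incoming)
    if mode == "merge":
        merged = {row[key]: row for row in existing}
        for row in incoming:
            merged[row[key]] = row  # upsert by key
        return list(merged.values())
    raise ValueError(f"unknown import mode: {mode}")


current = [{"id": 1, "crop": "corn"}, {"id": 2, "crop": "soy"}]
batch = [{"id": 2, "crop": "cotton"}, {"id": 3, "crop": "wheat"}]

full_result = apply_import(current, batch, "full")
delta_result = apply_import(current, batch, "delta")
merge_result = apply_import(current, batch, "merge")
```

In this sketch a merge keeps row 1, overwrites row 2, and adds row 3, while a delta simply grows the dataset by the size of the batch.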

SLIDE 14

Ingestion Process

Diagram labels:

  • Scheduler
  • Import Raw Records
  • Build Hive Staging Tables in HDFS
  • Validation
  • Export Data To Master Tables in S3
  • Export Raw Data To S3
  • Archive Export
  • Rebuild Materialized View
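
A staged ingestion flow like this one can be sketched as an ordered list of functions that share a context and stop at the first failure. This is only an illustration: the stage names are borrowed from the slide, but their order, the stage bodies, and the validation rule are assumptions, and nothing here talks to Hive, HDFS, or S3.

```python
from typing import Callable, Dict, List, Optional, Tuple

Context = Dict[str, list]
Stage = Tuple[str, Callable[[Context], None]]


def import_raw_records(ctx: Context) -> None:
    # Placeholder source; a real job would read from an RDBMS, Kafka, or a file.
    ctx["raw"] = [{"id": 1, "value": 10}, {"id": 2, "value": None}]


def build_staging(ctx: Context) -> None:
    # Placeholder for building the staging tables.
    ctx["staging"] = list(ctx["raw"])


def validate(ctx: Context) -> None:
    # Assumed rule for illustration: every staged record needs a value.
    bad = [row for row in ctx["staging"] if row["value"] is None]
    if bad:
        raise ValueError(f"{len(bad)} record(s) failed validation")


def export_to_master(ctx: Context) -> None:
    # Placeholder for publishing validated rows to the master tables.
    ctx["master"] = list(ctx["staging"])


# Stage names follow the slide; the order shown is an assumption.
STAGES: List[Stage] = [
    ("Import Raw Records", import_raw_records),
    ("Build Hive Staging Tables in HDFS", build_staging),
    ("Validation", validate),
    ("Export Data To Master Tables in S3", export_to_master),
]


def run_pipeline(stages: List[Stage],
                 ctx: Context) -> Tuple[List[str], Optional[str]]:
    """Run stages in order and stop at the first failure.

    Returns the names of the completed stages and an error message, if any.
    """
    completed: List[str] = []
    for name, stage in stages:
        try:
            stage(ctx)
        except ValueError as exc:
            return completed, f"{name}: {exc}"
        completed.append(name)
    return completed, None


context: Context = {}
done, error = run_pipeline(STAGES, context)
```

Because the sample batch contains a record without a value, the run stops at the validation stage and nothing is written to the master copy.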

SLIDE 15
Metadata

  • Required fields: Name, Description, Source, Publisher, etc.
  • Optional fields: Tags, custom fields
  • Forwarded to our metadata platform (Haystack)
  • Metadata objects pushed through Kafka (Datahub)
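
One way to picture a metadata object with required and optional fields is a small dataclass that refuses to serialize until the required fields are filled in. The class name, the JSON layout, and the sample values are hypothetical; the slides do not show the actual schema, and the sketch stops at producing the message bytes rather than calling any Kafka client.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class DatasetMetadata:
    # Required fields named on the slide.
    name: str
    description: str
    source: str
    publisher: str
    # Optional fields named on the slide.
    tags: List[str] = field(default_factory=list)
    custom_fields: Dict[str, str] = field(default_factory=dict)

    def missing_required(self) -> List[str]:
        """Return the names of required fields that are blank."""
        required = ("name", "description", "source", "publisher")
        return [f for f in required if not getattr(self, f).strip()]

    def to_message(self) -> bytes:
        """Serialize to JSON bytes, e.g. as the value of a Kafka record."""
        missing = self.missing_required()
        if missing:
            raise ValueError(f"missing required fields: {', '.join(missing)}")
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")


meta = DatasetMetadata(
    name="field_trials",
    description="Yield observations",
    source="research_db",
    publisher="data-platforms",
    tags=["iot"],
)
message = meta.to_message()
```

Validating before serialization keeps incomplete metadata from reaching the downstream catalog in this sketch.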

SLIDE 16
Exports

  • Export to RDBMS: full, delta, merge
  • Export to Kafka
  • Export to Redshift
  • Export to S3
  • Materialized Export

Export flow (diagram labels): Scheduler, Calculate Query Predicate, Materialized Export, Archive Export, RDBMS Export, Kafka Export, S3 Export, Purge Target, Purge Source, Validation
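
The "Calculate Query Predicate" step is not detailed in the slides. A plausible reading, sketched below, is that a full export selects everything while a delta or merge export selects only rows changed since the previous run. The function name, the `updated_at` column, and the `?` placeholder style are assumptions made for illustration.

```python
from datetime import datetime
from typing import Optional, Tuple


def export_predicate(mode: str, last_export: Optional[datetime],
                     column: str = "updated_at") -> Tuple[str, Tuple[str, ...]]:
    """Build a parameterized WHERE clause for one export run.

    full         - select every row
    delta, merge - select rows changed since the last export (assumed)
    """
    if mode not in ("full", "delta", "merge"):
        raise ValueError(f"unknown export mode: {mode}")
    if mode == "full" or last_export is None:
        # Nothing to compare against, so select every row.
        return "1 = 1", ()
    # `column` must be a trusted identifier; only the value is parameterized.
    return f"{column} > ?", (last_export.isoformat(),)


where, params = export_predicate("delta", datetime(2018, 5, 24, 12, 0, 0))
first_where, first_params = export_predicate("delta", None)
```

A first delta run has no previous export timestamp, so in this sketch it falls back to selecting all rows.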

SLIDE 17
Governance

  • Archive & Retention
  • Automated Compliance Checks
  • Security
  • Permissions
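
An archive and retention rule can be sketched as a pure function that maps a dataset's age to an action. The thresholds, the action names, and the function itself are hypothetical; the slides do not state the actual retention policy.

```python
from datetime import date


def retention_action(last_modified: date, today: date,
                     archive_after_days: int, purge_after_days: int) -> str:
    """Decide what an automated check would do with a dataset.

    Returns "keep", "archive", or "purge" based on age in days.
    The thresholds are illustrative, not taken from the slides.
    """
    if purge_after_days < archive_after_days:
        raise ValueError("purge threshold must not precede archive threshold")
    age_days = (today - last_modified).days
    if age_days >= purge_after_days:
        return "purge"
    if age_days >= archive_after_days:
        return "archive"
    return "keep"


action = retention_action(
    last_modified=date(2017, 1, 1),
    today=date(2018, 5, 24),
    archive_after_days=365,
    purge_after_days=2555,
)
```

Keeping the rule free of side effects makes it easy to run as a scheduled compliance check and to test against known dates.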

SLIDE 18
APIs & Integration

  • Get/List/Put Datasets
  • Get/Put Dataset Metadata
  • Get Dataset Status
  • Query
  • SDKs: Java, Scala, R, Python
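
The shape of these operations can be illustrated with a toy in-memory stand-in. This is not the Data Historian SDK: the class, method names, status strings, and the callable-based query are all hypothetical, chosen only to mirror the operation names listed on the slide.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class InMemoryHistorian:
    """Toy stand-in that mirrors the listed operations; not the real API."""

    def __init__(self) -> None:
        self._data: Dict[str, List[Row]] = {}
        self._metadata: Dict[str, Dict[str, str]] = {}

    def put_dataset(self, name: str, rows: List[Row]) -> None:
        self._data[name] = list(rows)

    def get_dataset(self, name: str) -> List[Row]:
        return list(self._data[name])

    def list_datasets(self) -> List[str]:
        return sorted(self._data)

    def put_metadata(self, name: str, metadata: Dict[str, str]) -> None:
        self._metadata[name] = dict(metadata)

    def get_metadata(self, name: str) -> Dict[str, str]:
        return dict(self._metadata.get(name, {}))

    def get_status(self, name: str) -> str:
        # Hypothetical status values.
        return "available" if name in self._data else "missing"

    def query(self, name: str, predicate: Callable[[Row], bool]) -> List[Row]:
        return [row for row in self._data[name] if predicate(row)]


client = InMemoryHistorian()
client.put_dataset("sensors", [{"id": 1, "temp": 21}, {"id": 2, "temp": 35}])
client.put_metadata("sensors", {"publisher": "iot-team"})
hot = client.query("sensors", lambda row: row["temp"] > 30)
```

A real client would send these calls through the API gateway with authentication; the stand-in only shows how the operations relate to each other.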

SLIDE 19

APIs - Query

Diagram labels: Client, Data Historian UI, Data Historian JDBC Driver, Data Historian API, Data Historian Security Service, Physical, Virtual

SLIDE 20

Data Historian UI - Query Interface

SLIDE 21

Data Historian UI – Browse Datasets

SLIDE 22

Data Historian UI – Dataset Details

SLIDE 23

Data Historian UI – Permissions Management

SLIDE 24

Data Historian UI - Future

SLIDE 25

Highlights

  • v1.0 production release 16 months ago
  • 164 active datasets in prod
  • 10TB of data in prod
  • 1,000+ query requests per day
  • Early Adopters: Internal Security Office, Research & Development
  • Early Majority: IoT, Data Assets, Supply Chain
  • Late Majority: Finance, Commercial, HR, Other

SLIDE 26

Lessons Learned

  • Open Source
  • Flexibility
  • Learning Curve
  • Specialized Skill Set
  • Cloud – AWS
  • Agility
  • Security
  • Support
  • Resource Staffing

SLIDE 27

Questions?
