Analytics for Object Storage Simplified - Unified File and Object - PowerPoint PPT Presentation


SLIDE 1

Analytics for Object Storage Simplified

  • Unified File and Object for Hadoop

Sandeep R Patil, STSM, Master Inventor, IBM Spectrum Scale
Smita Raut, Object Development Lead, IBM Spectrum Scale

Acknowledgement: Bill Owen, Tomer Perry, Dean Hildebrand, Piyush Chaudhary, Yong Zeng, Wei Gong, Theodore Hoover Jr, Muthuannamalai Muthiah.

SLIDE 2
Agenda

  • Part 1: Need as well as Design Points for Unified File and Object
    Ø Introduction to Object Storage
    Ø Unified File & Object Access
    Ø Use Cases Enabled by UFO
  • Part 2: Analytics with Unified File and Object
    Ø Big Data and Challenges
    Ø Design Points, Approach and Solution

SLIDE 3

Part 1 : Need as well as Design Points for Unified File and Object

Ø Object Storage Introduction

SLIDE 4

Introduction to Object Store

  • Object storage is highly available, distributed, eventually consistent storage.
  • Data is stored as individual objects, each with a unique identifier.
  • A flat addressing scheme allows for greater scalability.
  • Simpler data management and access.
  • REST-based data access with simple atomic operations: PUT, POST, GET, DELETE.
  • Usually software-based, running on commodity hardware.
  • Capable of scaling to 100s of petabytes.
  • Uses replication and/or erasure coding for availability instead of RAID.
  • Access over a RESTful API over HTTP is a great fit for cloud and mobile applications:
    – Amazon S3, Swift, CDMI API

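The flat namespace and the four atomic operations listed above can be illustrated with a tiny in-memory sketch. This is illustrative only, not any real product's API: each object lives under a single unique key, and PUT always replaces the whole object.

```python
# Minimal in-memory sketch of an object store's atomic operations.
# Class and method names are illustrative, not a real storage API.

class ObjectStore:
    """Flat namespace: each object is addressed by one unique key."""

    def __init__(self):
        self._objects = {}  # key -> bytes; no directories, no hierarchy

    def put(self, key, data):
        # PUT replaces the whole object atomically; no partial updates
        self._objects[key] = bytes(data)

    def get(self, key):
        # GET returns the full object for the key
        return self._objects[key]

    def delete(self, key):
        # DELETE is idempotent: removing a missing key is not an error
        self._objects.pop(key, None)

store = ObjectStore()
store.put("container1/report.csv", b"a,b\n1,2\n")
print(store.get("container1/report.csv"))
```

Note that there is no rename, append, or in-place update: whole-object PUT semantics are part of what keeps object stores simple and scalable.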

SLIDE 5

Object Storage Enables The Next Generation of Data Management

  • Multi-Site
  • Cloud Storage
  • Multi-Tenancy
  • Simpler management and flatter namespace
  • Simple APIs/Semantics (Swift/S3, Versioning, Whole File Updates)
  • Scalable Metadata Access
  • Scalable and Highly-Available
  • Cost Savings
  • Ubiquitous Access

SLIDE 6

But Does It Create Yet Another Storage Island in Your Data Center?


SLIDE 7

Ø Unified File and Object Access


SLIDE 8

What is Unified File and Object Access?

  • Accessing objects using file interfaces (SMB/NFS/POSIX) and accessing files using object interfaces (REST) helps legacy applications designed for file seamlessly start integrating into the object world.
  • It allows object data to be accessed using applications designed to process files, and file data to be published as objects.
  • Multi-protocol access for file and object in the same namespace (with common user ID management capability) allows supporting and hosting data oceans of different types of data with multiple access options.
  • It optimizes various use cases and solution architectures, resulting in better efficiency as well as cost savings.


[Diagram: a clustered file system running Swift (with Swift on File) alongside NFS/SMB/POSIX and Object (HTTP) access to the same container: (1) data ingested as objects; (2) objects accessed as files, through file exports created at container level or POSIX access from container level; (3) data ingested as files; (4) files accessed as objects.]
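The core of unified file and object access is that an object's Swift path (account/container/object) corresponds deterministically to a file path inside the same fileset. A minimal sketch of such a mapping, with a hypothetical fileset root; the actual on-disk layout used by Spectrum Scale is an implementation detail not specified here:

```python
import posixpath

# Hypothetical fileset mount point; the real layout may differ.
FILESET_ROOT = "/ibm/gpfs0/obj_fileset"

def object_to_file_path(account, container, obj):
    """Map a Swift object (account/container/object) to a POSIX path."""
    return posixpath.join(FILESET_ROOT, account, container, obj)

path = object_to_file_path("AUTH_media", "container1", "videos/raw.mp4")
print(path)  # the same bytes are now reachable over NFS/SMB/POSIX
```

Because the mapping is deterministic, data ingested over either interface is immediately visible over the other, with no copy step.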

SLIDE 9

Flexible Identity Management Modes

  • Two Identity Management Modes
  • Administrators can choose based on their need and use-case


Identity Management Modes

Local_Mode:
  • Objects created via the object interface are owned by the internal "swift" user.
  • Applications processing the object data from the file interface need the required file ACL to access the data.
  • Object authentication setup is independent of the file authentication setup.
  • Suitable when auth schemes for file and object are different and unified access is for applications.

Unified_Mode:
  • Objects created via the object interface are owned by the user doing the object PUT (i.e., the file is owned by the UID/GID of that user).
  • Users from object and file are expected to share common auth, coming from the same directory service (only AD+RFC 2307 or LDAP).
  • The owner of the object owns and has access to the data from the file interface.
  • Suitable for unified file and object access for end users; leverages common ILM policies for file and object data based on data ownership.
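The ownership rule that distinguishes the two modes fits in a few lines. A sketch with illustrative names, following the behavior described above: in local_mode the backing file belongs to the internal swift identity, while in unified_mode it belongs to the user who performed the PUT.

```python
# Sketch of the ownership rule in the two identity management modes.
# Mode names follow the slides; user/group names are illustrative.

SWIFT_INTERNAL_USER = ("swift", "swift")  # (owner, group) in local_mode

def file_owner_for_object(mode, requesting_user, requesting_group):
    """Return the (owner, group) the backing file gets after an object PUT."""
    if mode == "local":
        # local_mode: every object is owned by the internal swift identity;
        # file-side applications need explicit ACLs to read the data.
        return SWIFT_INTERNAL_USER
    elif mode == "unified":
        # unified_mode: the file is owned by the UID/GID of the user who
        # did the PUT (common directory service across file and object).
        return (requesting_user, requesting_group)
    raise ValueError(f"unknown identity mode: {mode}")

print(file_owner_for_object("unified", "riya", "staff"))  # ('riya', 'staff')
```

This is why unified_mode requires a common directory service: the object-side user must resolve to a valid UID/GID on the file side.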

SLIDE 10

Ø Use Cases Enabled by Unified File Object


SLIDE 11

Use case 1 : Process Object Data with File-Oriented Applications and Publish Outcomes as Objects


[Diagram: a media house on an OpenStack cloud platform (tenant = media house subsidiaries) with Swift on File: (1) media objects are ingested into Container 1 and Container 2, one per subsidiary; (2) Manila shares (NFS exports on each container) are exported only to the matching subsidiary's VM farm for video processing, so raw media content is processed over files (object-to-file access); (3) final processed files are converted into objects for publishing (file-to-object access), landing in a container used for external publishing; (4) the final videos are available as objects for streaming to publishing channels.]

SLIDE 12

Use case 2 : Users read/write data via File and Object with Common User Authentication and Identity


[Diagram: users John and Riya access common data on a clustered file system over NFS, SMB, and Object, using the same user credentials across all protocols, backed by a corporate user directory (Active Directory/LDAP). Riya's data read/written from Object is owned by Riya (User: Riya, UID: 1001, GID: 2000, Domain: XYZ) when accessed from File (SMB/NFS/POSIX).]

SLIDE 13

We have now understood Part 1: Need as well as Design for Unified File and Object. Let us now deep dive into Part 2: Analytics with Unified File and Object.

SLIDE 14

Ø Big Data and Challenges


SLIDE 15

Big Data

§ Big data is a term for data sets that are so large or complex that traditional data processing applications (such as database management tools) are inadequate.
§ The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

Characteristics

§ Volume – The quantity of generated and stored data.
§ Variety – The type and nature of the data.
§ Velocity – The speed at which the data is generated and processed.
§ Variability – Inconsistency of data sets can hamper the processes that manage them.
§ Veracity – The quality of captured data can vary greatly, affecting accuracy.

SLIDE 16

Challenges with the Early Big Data Storage Models

The early model: ingest data at various end points, move the data to the analytics engine, perform analytics... repeat!

  ! It takes hours or days to move the data.
  ! Key business processes now depend on the analytics.
  ! You can't just throw away data, due to regulations or business requirements.
  ! It's not just one type of analytics.
  ! There are more data sources than ever before: not just data you own, but public or rented data.

SLIDE 17

Ø Design Points, Approach & Solution


SLIDE 18

What are the Solution Design Points that we came across?

[Diagram: a common namespace powered by IBM Spectrum Scale spanning disk, tape, flash, shared-nothing clusters, and off-premise storage, serving a compute farm of traditional and new-gen applications, with geographically dispersed management of data including disaster recovery.]

The solution design points (numbered 1 through 6 in the original slide):
  • Bring analytics to the data
  • Encryption for protection of your data
  • Single namespace to house all your data (files and objects)
  • Unified data access with File & Object
  • Optimize economics based on the value of the data
  • Geographically dispersed management of data, including disaster recovery

SLIDE 19

How Did We Approach the Solution & Address the Design Points?

  • Took the Data Ocean Approach

[Diagram: a global namespace on a clustered file system (powered by IBM Spectrum Scale, 4000+ customers) with automated data placement and data migration across disk, tape, flash, shared-nothing clusters (JBOD/JBOF with Spectrum Scale RAID), a transparent cloud tier, and a DR site, plus worldwide data distribution across sites A, B, and C. Access paths: file (SMB, NFS, POSIX, with file encryption), block (iSCSI), object (Swift, S3), OpenStack (Cinder, Glance, Manila), and analytics (Transparent HDFS, Spark), serving client workstations, users and applications, compute farms, and traditional and new-gen applications. Unified File and Object, as explained previously, together with these elements meets design points 1 through 6.]

SLIDE 20

Meeting Design Point 6 – Bring Analytics to Data: Apache Hadoop, a Key Platform for Big Data and Analytics

§ An open-source software framework and the most popular BD&A platform
§ Designed for distributed storage and processing of very large data sets on computer clusters built from commodity hardware
§ The core of Hadoop consists of:

  • A processing part called MapReduce
  • A storage part, known as the Hadoop Distributed File System (HDFS)
  • Hadoop common libraries and components

§ Leading Hadoop distros: Hortonworks, Cloudera, MapR, IBM IOP/BigInsights
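The MapReduce model at Hadoop's core can be illustrated without Hadoop itself: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase folds each group. A pure-Python sketch of the model (not the Hadoop API), using the classic word-count example:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big analytics"])))
print(counts)  # {'big': 2, 'data': 1, 'analytics': 1}
```

In real Hadoop the map and reduce tasks run distributed across the cluster and the shuffle moves data over the network, but the data flow is the same.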

SLIDE 21

Meeting Design Point 6 – Bring Analytics to Data: HDFS Shortcomings

§ HDFS is a shared-nothing architecture, which is very inefficient for high-throughput jobs (disks and cores grow in the same ratio)
§ Costly data protection: uses 3-way replication; limited RAID/erasure coding
§ Works only with Hadoop, i.e., weak support for file or object protocols
§ Clients have to copy data from enterprise storage into HDFS to run Hadoop jobs, which can result in running on stale data

SLIDE 22

Meeting Design Point 6 – How to Bring Analytics to Data ?

Desired Solution: in-place analytics (no copies required); the clustered filesystem should support HDFS connectors.

SLIDE 23

How did we design to overcome the inhibitors: Developing an HDFS Transparency Connector with Unified File and Object Access

[Diagram: applications (Map/Reduce API, Hadoop FS APIs, and higher-level languages such as Hive, BigSQL, JAQL, and Pig) run against an unmodified HDFS client. The HDFS client speaks HDFS RPC over the network (hdfs://hostnameX:portnumber) to GPFS Connector Services; on each GPFS node, a connector built on libgpfs/POSIX APIs implements the Hadoop FileSystem API on top of Spectrum Scale, running on Power Linux nodes over commodity hardware or shared storage.]
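Because the connector speaks HDFS RPC, Hadoop applications and the HDFS client need no modification: in a typical Hadoop setup they are simply pointed at the connector endpoint through the standard `fs.defaultFS` property. A sketch of the relevant core-site.xml fragment, reusing the slide's placeholder `hdfs://hostnameX:portnumber` (the exact configuration steps for Spectrum Scale HDFS Transparency may differ):

```xml
<!-- core-site.xml: point the default filesystem at the HDFS
     Transparency connector endpoint instead of an HDFS NameNode.
     Hostname and port are placeholders from the slide. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hostnameX:portnumber</value>
  </property>
</configuration>
```

Everything built on the Hadoop FileSystem API (MapReduce, Hive, Pig, Spark's Hadoop input formats) then transparently reads and writes Spectrum Scale data.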

SLIDE 24

Meeting Design Point 6- “In-Place” Analytics for Unified File and Object Data Achieved.


[Diagram: an IBM Spectrum Scale unified file and object fileset, with data ingested as objects over HTTP and Spark or Hadoop MapReduce running in-place analytics on the same fileset. Source: https://aws.amazon.com/elasticmapreduce/]

  • Analytics on a traditional object store: data has to be copied from the object store to a dedicated cluster, analyzed there, and the results copied back to the object store for publishing (explicit data movement).
  • Analytics with Unified File and Object Access: object data is available as files on the same fileset, so analytics systems like Hadoop MapReduce or Spark can leverage the data directly. No data movement, i.e., in-place, immediate data analytics, with results published as objects.

SLIDE 25

Thank You