Visual Data Management System



slide-1
SLIDE 1

Intel Labs

Visual Data Management System

Vishakha Gupta-Cledat, Luis Remis, Christina Strong, Ragaad Altarawneh, Scott Hahn
{vishakha.s.gupta, luis.remis, christina.r.strong, ragaad.altarawneh, scott.hahn}@intel.com

[Figure labels: Images; data and metadata; "Find me cats"]

slide-6
SLIDE 6


What is VDMS?

A novel Visual Data Management System

  • For storing, accessing and transforming visual data
  • Primarily geared towards visual analytics pipelines and data science queries
  • With a goal of efficiently achieving cloud scale while maintaining ease-of-use

Also aims to

  • Exploit Intel’s heterogeneous memory and storage hierarchy
  • Be general purpose, e.g. a common core for medical imaging, sports, and retail
slide-9
SLIDE 9

Visual Data: Scale and Applications

Images, videos, and feature vectors / descriptors from billions of sources

  • Large in size (an individual object can range from KB to GB)
  • Increasingly being used for visual understanding in a range of machine learning applications

slide-12
SLIDE 12


The Unsustainable Current Solutions

Resolve visual computing challenges and frameworks first

  • Improving accuracy of algorithms on more and more complex data
  • Storage has not become a bottleneck yet!

Application-specific solutions, if data does become a problem

  • Organize media files
  • Manually gather and normalize relevant metadata
  • Build custom scripts to tie together many stages of complex processing

Visual data management for scale and reuse is still an open problem.

slide-15
SLIDE 15


VDMS Storage Architecture

Exploding amount of visual data

  • For any request, access only the required subset of data – exploit metadata

Even individual objects could be large

  • Speed up access to this desired data
  • Preprocess while reading where possible e.g. crop or detect edges before transferring

High performance as well as ease-of-use

  • Suitable design choices for metadata and data, at scale
  • Intel hardware optimizations e.g. 3D XPoint, media hardware, disk offload
  • Simple API and client libraries
slide-19
SLIDE 19


VDMS Implementation

Efficient metadata access via Persistent Memory Graph Database (PMGD) for visual data

  • Optimized for metadata storage and access patterns
  • Easy to evolve schema with new vision research

Efficient data access via Visual Compute Library

  • Enable alternate image/video analysis friendly storage formats as compared to viewer friendly ones
  • Process data while accessing it

Ease-of-use via Request Server

  • Implement a unified and simple client API
  • Route query (or parts) to the right components for a coherent user response

[Diagram labels: User; VDMS; Request Server; PMGD (Metadata Database); Visual Compute Library; Visual Data Storage]
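
The routing role of the Request Server can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the request layout (`constraints`, `operations`), the in-memory `METADATA` and `BLOBS` stores, and the function names merely stand in for PMGD, the Visual Compute Library, and the client API, none of which are specified on the slides.

```python
import json

# Hypothetical in-memory stand-ins for the metadata database and the data store.
METADATA = [
    {"name": "scan_001.png", "patient": "p1"},
    {"name": "scan_002.png", "patient": "p2"},
]
BLOBS = {
    "scan_001.png": [[1, 2, 3, 4], [5, 6, 7, 8]],
    "scan_002.png": [[9, 9, 9, 9], [0, 0, 0, 0]],
}


def find_metadata(constraints):
    """Metadata component: return records whose fields equal the constraints."""
    return [m for m in METADATA
            if all(m.get(k) == v for k, v in constraints.items())]


def load_and_process(name, operations):
    """Data component: fetch pixels and apply operations while reading."""
    pixels = BLOBS[name]
    for op in operations:
        if op["type"] == "crop":
            x, y, w, h = op["x"], op["y"], op["width"], op["height"]
            pixels = [row[x:x + w] for row in pixels[y:y + h]]
        else:
            raise ValueError(f"unsupported operation: {op['type']}")
    return pixels


def handle_request(request_json):
    """Request server: parse the query, route each part, merge the results."""
    request = json.loads(request_json)
    matches = find_metadata(request.get("constraints", {}))
    operations = request.get("operations", [])
    return [{"metadata": m, "data": load_and_process(m["name"], operations)}
            for m in matches]


request = json.dumps({
    "constraints": {"patient": "p1"},
    "operations": [{"type": "crop", "x": 1, "y": 0, "width": 2, "height": 1}],
})
response = handle_request(request)
print(response)
```

The client sends one request and receives one combined answer; the split between the metadata lookup and the data access stays hidden behind `handle_request`.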

slide-22
SLIDE 22

Where We Are Now

User API v1.0 defined with internal feedback

Functional single-node server and client libraries

Three interesting proofs of concept at various stages of development, with input from product groups

  • Real data and concrete use case: medical imaging application
  • Large scale, real time, intensive use case: FreeD sports storage architecture
  • Integration with a larger analytic framework: Retail shopper insights application
slide-25
SLIDE 25

Medical Imaging Proof of Concept on VDMS

The Cancer Imaging Archive: http://www.cancerimagingarchive.net/

  • 60TB of medical images (volumetric data)
  • Metadata for ~1000 patients (very sparse)

For our PoC:

  • Metadata for 457 patients, including drug and radiation treatments
  • Scans for 384 patients (60K images)
  • Replicated metadata x10 and x100, keeping the original distribution

Segmentation pipeline for demo

slide-33
SLIDE 33

Segmentation Pipeline

[Diagram: PyClient (Segmentation Algorithm for Brain Tumors), VDMS Client Python Module, VDMS Server]

  • Query - Pull Data: Constructed JSON Query
  • Return Data
  • Query - Push Data: Constructed JSON Query + Image Blob
  • Return Successful
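
The pull and push exchanges in this pipeline might look roughly like the Python sketch below. It is an illustration under stated assumptions: the command names `FindImage` and `AddImage`, the `FakeServer` class, and the threshold-based `segment` function are placeholders chosen for this example, since the slides do not show the actual JSON schema or the segmentation algorithm.

```python
import json


class FakeServer:
    """Hypothetical in-memory stand-in for the VDMS server."""

    def __init__(self):
        self.images = {}  # image name -> raw bytes

    def query(self, query_json, blobs=()):
        """Handle one JSON query plus optional binary blobs."""
        command = json.loads(query_json)
        if "AddImage" in command:      # push: JSON query + image blob
            self.images[command["AddImage"]["name"]] = blobs[0]
            return json.dumps({"status": "ok"}), []
        if "FindImage" in command:     # pull: JSON query only
            name = command["FindImage"]["name"]
            return json.dumps({"status": "ok"}), [self.images[name]]
        return json.dumps({"status": "error"}), []


def segment(pixels, threshold=128):
    """Placeholder for the segmentation algorithm: a simple binary threshold."""
    return bytes(255 if p >= threshold else 0 for p in pixels)


server = FakeServer()
# Seed the server with one tiny "image" so there is something to pull.
server.query(json.dumps({"AddImage": {"name": "slice_001"}}),
             [bytes([10, 200, 130, 90])])

# Pull: send a constructed JSON query, receive the image data.
_, blobs = server.query(json.dumps({"FindImage": {"name": "slice_001"}}))
mask = segment(blobs[0])

# Push: send a constructed JSON query together with the result blob.
reply, _ = server.query(json.dumps({"AddImage": {"name": "slice_001_mask"}}),
                        [mask])
status = json.loads(reply)["status"]
print(status, list(mask))
```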

slide-36
SLIDE 36

Domain Specific Queries - Some Examples

  • Query 1: Retrieve a single image (200x200), searching by its unique name. Result: a single image
  • Query 2: Retrieve a complete brain scan (155 images) from a particular patient. Result: 155 images
  • Query 3: Retrieve all brain scans corresponding to people over 75 who had chemotherapy using the drug “Temodar”. Result: 1600 images after 3 neighbor hops
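
A query that needs three neighbor hops can be sketched as a frontier expansion over a small property graph. The schema below (Treatment, Patient, Scan, Image nodes and the edges between them) is an assumption made for illustration; the slides do not show the actual metadata schema, and the data is invented.

```python
from collections import defaultdict

# Hypothetical property graph: node id -> (tag, properties).
nodes = {
    "t1": ("Treatment", {"type": "chemotherapy", "drug": "Temodar"}),
    "t2": ("Treatment", {"type": "radiation"}),
    "p1": ("Patient", {"age": 80}),
    "p2": ("Patient", {"age": 60}),
    "p3": ("Patient", {"age": 77}),
    "s1": ("Scan", {"region": "brain"}),
    "s2": ("Scan", {"region": "brain"}),
    "i1": ("Image", {"name": "s1_000.png"}),
    "i2": ("Image", {"name": "s1_001.png"}),
    "i3": ("Image", {"name": "s2_000.png"}),
}
edges = [
    ("t1", "p1"), ("t1", "p2"), ("t2", "p3"),   # treatment given to patient
    ("p1", "s1"), ("p2", "s2"),                 # patient has scan
    ("s1", "i1"), ("s1", "i2"), ("s2", "i3"),   # scan contains image
]

adjacency = defaultdict(list)
for src, dst in edges:
    adjacency[src].append(dst)


def hop(frontier, tag, predicate=lambda props: True):
    """Follow one edge from every frontier node, keeping matching neighbors."""
    result = []
    for node_id in frontier:
        for neighbor in adjacency[node_id]:
            neighbor_tag, props = nodes[neighbor]
            if neighbor_tag == tag and predicate(props) and neighbor not in result:
                result.append(neighbor)
    return result


# Start from the treatments that used the drug, then hop three times.
start = [n for n, (tag, props) in nodes.items()
         if tag == "Treatment" and props.get("drug") == "Temodar"]
patients = hop(start, "Patient", lambda props: props["age"] > 75)  # hop 1
scans = hop(patients, "Scan")                                      # hop 2
images = hop(scans, "Image")                                       # hop 3
image_names = [nodes[i][1]["name"] for i in images]
print(image_names)
```

In a relational store the same question would typically be expressed as a chain of joins, which is the kind of query where a graph representation tends to help.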

slide-38
SLIDE 38

Comparison Baseline

No single solution to compare against

Create a “likely” combination of well-known options

  • MemSQL for storing metadata
  • Apache HTTP server for requesting images via http
  • OpenCV for performing preprocessing
slide-42
SLIDE 42

Performance Improvements - Metadata

Query 3: Retrieve 1600 image names after 3 neighbor hops

  • VDMS performs up to one order of magnitude better than MemSQL
  • A graph database is a logical choice for visual metadata

slide-45
SLIDE 45

Visual Compute Library: E.g. Transformation Operations

Images in Analytics-friendly TDB Format (uses TileDB)

[Charts: Resize to 256x256; Crop to one-sixth the size]

Images stored in the TDB format provide faster access and processing, making it a great format for visual analytics pipelines, especially for large images.
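
One intuition for why an array-oriented, tiled layout can help operations such as cropping is that only the tiles overlapping the requested region need to be read. The sketch below illustrates that general idea in plain Python; it does not use the TileDB API or reproduce the TDB format, and the tile size and helper names are arbitrary choices for this example.

```python
TILE = 4  # tile edge length in pixels (arbitrary choice for this sketch)


def to_tiles(image):
    """Split a 2-D list of pixels into TILE x TILE blocks keyed by tile coords."""
    tiles = {}
    for r in range(0, len(image), TILE):
        for c in range(0, len(image[0]), TILE):
            tiles[(r // TILE, c // TILE)] = [row[c:c + TILE]
                                             for row in image[r:r + TILE]]
    return tiles


def crop_from_tiles(tiles, top, left, height, width):
    """Return the cropped region and the number of tiles that were touched."""
    touched = set()
    out = []
    for r in range(top, top + height):
        row = []
        for c in range(left, left + width):
            key = (r // TILE, c // TILE)
            touched.add(key)
            row.append(tiles[key][r % TILE][c % TILE])
        out.append(row)
    return out, len(touched)


# 16x16 test image whose pixel value encodes its position.
image = [[r * 16 + c for c in range(16)] for r in range(16)]
tiles = to_tiles(image)
crop, tiles_read = crop_from_tiles(tiles, top=5, left=6, height=3, width=4)
expected = [row[6:10] for row in image[5:8]]
print(tiles_read, "of", len(tiles), "tiles read")
```

With a row-by-row or compressed viewer-oriented format, the same crop would generally require decoding far more of the image than the region of interest.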

slide-47
SLIDE 47

Overall Improvements

[Charts: Query 1: Retrieve single image; Query 2: Retrieve 155 images for a patient; Query 3: Retrieve 1600 images after 3 neighbor hops]

VDMS performs significantly better on more complex queries, without incurring overhead on simpler tasks

slide-48
SLIDE 48

Hermes Peak: A Framework for Ad-hoc Video Analytics

Framework for processing visual data from the edge to cloud with four focus areas within the Intel Science and Technology Center for Visual Cloud Systems

Query Processing and Configuration

  • Tools to configure pipeline and answer queries
  • Visual query compiler
  • Visual kernel repository

In-line Processing

  • Video processing with real time turnaround
  • Support arbitrary number of streams
  • Programmable events
  • Optimized resource utilization

Offline Processing

  • Query and analytics on historic (stored) data
  • Processing of large (cloud) scale video or image libraries
  • Optimized resource utilization

Optimized Storage and Retrieval

  • Optimized metadata DB
  • Analysis friendly media formats
  • Distributed for cloud scale
  • Tiered storage for hot and cold data

slide-50
SLIDE 50

Bigger Picture: Visual Cloud Inferencing Flow

[Diagram labels: Cameras; Sensors; Data Acquisition; Local Analytics; Inline analytics; Offline analytics; Storage; VDMS; Presentation and Interpretation]

Local Analytics

  • Preprocessing
  • Filtering
  • Aggregation

Along with our academic partners, Intel Labs is looking at the entire flow of visual data and processing from edge to cloud

slide-51
SLIDE 51

Hermes Peak: A Framework for Ad-hoc Video Analytics

Query Processing and Configuration

  • Tools to configure pipeline and answer queries
  • Visual query compiler
  • Visual kernel repository

E.g. TBD

In-line Processing

  • Video processing with real time turnaround
  • Support arbitrary number of streams
  • Programmable events
  • Optimized resource utilization

E.g. Streamer (https://github.com/viscloud/streamer)

Offline Processing

  • Query and analytics on historic (stored) data
  • Processing of large (cloud) scale video or image libraries
  • Optimized resource utilization

E.g. Scanner (https://github.com/scanner-research/scanner)

Optimized Storage and Retrieval

  • Optimized metadata DB
  • Analysis friendly media formats
  • Distributed for cloud scale
  • Tiered storage for hot and cold data

E.g. VDMS

slide-54
SLIDE 54

Conclusions and Future Work

Room and need for novel storage methods in vision pipelines

Graph database, made efficient with new technology, a good option for metadata

Analysis friendly data storage a worthwhile research direction

Address feature vector and video storage and search

Scale out to sustain large amounts of data and high rates

  • Also integrate with pub/sub model (Kafka) and evaluate

Next version of the API and open source code

Hermes Peak integration to complete a visual pipeline

slide-55
SLIDE 55

Intel Labs

Backup

slide-56
SLIDE 56

Extracting Value from Visual Data – Machine Learning

slide-57
SLIDE 57

Scale - Ubiquitous Cameras, New Applications

slide-58
SLIDE 58

Despite Computing Challenges, Data Access Can’t be Ignored

E.g. Image Classification using Deep Learning

As processing capabilities and algorithms improve, the amount of data increases, and data reuse becomes a possibility, data access goes from an afterthought to a real challenge

slide-59
SLIDE 59

Exploit Rich Visual Metadata

Media data easily leads to rich metadata, computed in advance or on the fly

Metadata is much smaller and can be used to zoom in on only the desired raw data

Search photos by faces, scenes, objects, and actions/events

Source: Yurong Chen, Intel Labs China

slide-60
SLIDE 60

Representing Media Metadata

While this metadata schema will be application-specific, it looks like a property graph:

  • Nodes connected with Edges
  • Properties on nodes/edges
  • (optional) Group by tags

Support evolving schema

Variety of indexes

Example query: Find all photos of Alice from Hawaii

Nodes in the example graph:

  • Location: Name: Maui, Type: Island, State: Hawaii, Population: 20000
  • Photo: Name: Hawaii1.jpg, Date: 4/15/14, Size: 2MB
  • Photo: Name: Hawaii2.jpg, Date: 4/16/14, Size: 2.5MB
  • Person: Name: Jane Doe, DOB: 4/15/1974
  • Person: Name: John Doe, DOB: 11/1/1975
  • Person: Name: Alice Doe, DOB: 8/15/2000

Edge types: Contains, LocatedAt
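
The example graph can be written down as a small Python structure to show how the query "Find all photos of Alice from Hawaii" maps onto tags, properties, and edges. The node properties come from the slide; which photo contains which person, the edge directions, and the `Faces` property are assumptions, because the original figure did not survive.

```python
# Nodes as listed on the slide: node id -> (tag, properties).
nodes = {
    "maui": ("Location", {"Name": "Maui", "Type": "Island",
                          "State": "Hawaii", "Population": 20000}),
    "photo1": ("Photo", {"Name": "Hawaii1.jpg", "Date": "4/15/14", "Size": "2MB"}),
    "photo2": ("Photo", {"Name": "Hawaii2.jpg", "Date": "4/16/14", "Size": "2.5MB"}),
    "jane": ("Person", {"Name": "Jane Doe", "DOB": "4/15/1974"}),
    "john": ("Person", {"Name": "John Doe", "DOB": "11/1/1975"}),
    "alice": ("Person", {"Name": "Alice Doe", "DOB": "8/15/2000"}),
}

# ASSUMED edges: the slide only names the edge types Contains and LocatedAt.
edges = [
    ("photo1", "Contains", "jane"),
    ("photo1", "Contains", "john"),
    ("photo2", "Contains", "alice"),
    ("photo1", "LocatedAt", "maui"),
    ("photo2", "LocatedAt", "maui"),
]


def neighbors(node_id, edge_type):
    """Return targets of outgoing edges of the given type."""
    return [dst for src, etype, dst in edges
            if src == node_id and etype == edge_type]


def photos_of(person_prefix, state):
    """Names of photos containing a matching person and located in the state."""
    result = []
    for node_id, (tag, props) in nodes.items():
        if tag != "Photo":
            continue
        has_person = any(nodes[p][1]["Name"].startswith(person_prefix)
                         for p in neighbors(node_id, "Contains"))
        in_state = any(nodes[loc][1].get("State") == state
                       for loc in neighbors(node_id, "LocatedAt"))
        if has_person and in_state:
            result.append(props["Name"])
    return result


alice_photos = photos_of("Alice", "Hawaii")

# Evolving schema: a new (hypothetical) property is added to one node
# without touching any other node.
nodes["photo2"][1]["Faces"] = 1
print(alice_photos)
```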

slide-61
SLIDE 61

Persistent Memory Graph Database (PMGD)

Traditional property graph databases are plagued by disk latencies

New non-volatile memory technology (e.g. 3D XPoint) with performance close to DRAM

Opportunity to avoid a lot of legacy software

PMGD

  • Graph database implementation targeting persistent memory
slide-62
SLIDE 62

PMGD Comparison to Neo4j

Queries taken from the LDBC social network benchmark

Bars show speedup over Neo4j

The more graph traversals, the better PMGD does

slide-63
SLIDE 63


Speeding up Access to Desired Data

More and more machine consumption of data for processing

  • Think beyond standard formats for visual data
  • Create formats better suited for processing

Visual Compute Library (VCL)

  • Explore alternate formats for images, videos and feature vectors
  • Implement suitable processing on traditional and new formats
slide-64
SLIDE 64


VCL::Image

Implement alternate image storage formats to use when beneficial

  • TDB format, based on TileDB [1]

Higher level interaction with images in traditional or TDB format

  • Perform processing such as crop, resize, threshold, ROI access, as data is read

[1] Stavros Papadopoulos, Kushal Datta, et al. 2016. The TileDB array data storage manager. VLDB 2016

slide-65
SLIDE 65

TDB Performance

[Charts: Write Performance; Read Performance]

slide-66
SLIDE 66

Request Server

Unified and simple client API

Route query to the right component for a coherent user response

[Diagram: Client API, Parse Request, then function calls to PMGD (metadata) and VCL (data)]

slide-67
SLIDE 67

BraTS Challenge - Driving Application

slide-68
SLIDE 68

VDMS Alternatives

No one solution to do it all

Intel automotive path

  • HDFS for storing data
  • HBase for organizing metadata
  • Another layer to make querying using relationships easier

Initial CMU solution

  • PostgreSQL database for metadata
  • Write their own frame server and use OpenCV
  • Still looking for an API

Facebook’s Tao + Haystack, Amazon’s Neptune + S3

  • Large scale but still not optimized for visual data management