Towards Scalable Multimedia Analytics
Björn Þór Jónsson
data sys group, Computer Science Department, IT University of Copenhagen
Today’s Media Collections
- Massive and growing
– Europeana > 50 million items
– DeviantArt > 250 million items (160K/day)
– Facebook > 1,000 billion items (200M/day)
- Variety of users and applications
– Novices → enthusiasts → scholars → experts
– Current systems aimed at helping experts
- Need for understanding and insights
2
Media Tasks
3
Search Exploration
Media Tasks
4 [Zahálka and Worring, 2014]
Multimedia Analytics
Multimedia Analytics
Multimedia Analysis Visual Analytics
[Zahálka and Worring, 2014] 5
[Zahálka and Worring, 2014; Keim et al., 2010]
From Data to Insight
6
The Two Gaps
7
Semantic Gap [Smeulders et al., 2000]:
– System: generic data and annotation based on objective understanding
– User: specific context and task-driven subjective understanding
Pragmatic Gap [Zahálka and Worring, 2014]:
– System: predefined, fixed annotation based on an understanding of the collection
– User: dynamically evolving and interaction-driven understanding of collections
Multimedia Analytics State of the Art
- Theory is developing
- Early systems have appeared
- No real-life applications (?)
- Small collections only
8
Scalable Multimedia Analytics
Scalable Multimedia Analytics
Multimedia Analysis Visual Analytics Data Management
[Jónsson et al., 2016] 9
The Three Gaps
10
Semantic Gap [Smeulders et al., 2000]:
– System: generic data and annotation based on objective understanding
– User: specific context and task-driven subjective understanding
Pragmatic Gap [Zahálka and Worring, 2014]:
– System: predefined, fixed annotation based on an understanding of the collection
– User: dynamically evolving and interaction-driven understanding of collections
Scale Gap [Jónsson et al., MMM 2016]:
– System: pre-computed indices and bulk processing of large datasets
– User: serendipitous and highly interactive sessions on small data subsets
Ten Research Questions for Scalable Multimedia Analytics
VELOCITY VOLUME VARIETY VISUAL INTERACTION
12
Building Systems?
13
Big Data Framework: Lambda Architecture
14
Figure: Lambda Architecture layers — Batch Layer, Speed Layer, Service Layer, Storage Layer [Marz and Warren, 2015]
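To make the layer roles concrete, here is a minimal, hypothetical sketch (not from the talk; all names and data are illustrative) of how a service layer can answer a query by merging a pre-computed batch view with an incrementally maintained speed-layer view:

```scala
// Hypothetical illustration of the Lambda Architecture layers.
// The batch layer periodically recomputes a complete view from the master
// dataset; the speed layer maintains a view of data that arrived since the
// last batch run; the service layer merges both at query time.
object LambdaSketch {
  // Batch view: recomputed in bulk (e.g., by a Spark job) over all stored data.
  val batchView: Map[String, Long] = Map("imageA" -> 120L, "imageB" -> 42L)

  // Speed view: incrementally updated from recently arrived data.
  val speedView: Map[String, Long] = Map("imageB" -> 3L, "imageC" -> 7L)

  // Service layer: answer queries by combining the two views.
  def query(key: String): Long =
    batchView.getOrElse(key, 0L) + speedView.getOrElse(key, 0L)

  def main(args: Array[String]): Unit =
    println(query("imageB")) // 45: batch result plus recent updates
}
```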
Outline
- Motivation:
Scalable multimedia analytics
- Batch Layer:
Spark and 43 billion high-dim features
- Service Layer:
Blackthorn and 100 million images
- Conclusion:
Importance and challenges of scale!
16
Gylfi Þór Guðmundsson, Laurent Amsaleg, Björn Þór Jónsson, Michael J. Franklin
Towards Engineering a Web-Scale Multimedia Service: A Case Study Using Spark
Proceedings of the ACM Multimedia Systems Conference (MMSys), Taipei, Taiwan, June 2017
17
Spark Case Study: Motivation
- How can multimedia tasks harness the power of cloud-computing?
– Multimedia collections are growing
– Computing power is abundant
- ADCFs = Hadoop || Spark
– Automatically Distributed Computing Frameworks
– Designed for high-throughput processing
18
Design Choices: ADCF = Spark
- Hadoop is not suitable (more later)
- Resilient Distributed Datasets (RDDs)
– Transform one RDD to another via operators
– Lazy execution
– Master and Workers paradigm
- Supports deep pipelines
- Supports worker’s memory sharing
- Lazy execution allows for optimizations
19
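As a concrete illustration of these points (a minimal sketch, not code from the paper), the following Spark snippet builds a small chain of RDD transformations; nothing runs until the final action, so Spark sees the whole pipeline before executing it on the workers:

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("rdd-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // An RDD is transformed into another RDD via operators such as map/filter.
    val features = sc.parallelize(Seq(0.1, 0.5, 0.9, 0.3))
    val scaled   = features.map(_ * 128.0)   // transformation: recorded, not executed
    val filtered = scaled.filter(_ > 50.0)   // still lazy: the pipeline just grows deeper

    // Only an action (here: collect) triggers execution on the workers,
    // giving Spark the full pipeline to optimize before running it.
    println(filtered.collect().mkString(", "))

    spark.stop()
  }
}
```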
Design Choices: Application Domain
- Content-Based Image Retrieval (CBIR)
– Well known application
– Two phases: Off-line & “On-line”
Figure: Query Image → CBIR System → Search results
20
Design Choices: DeCP Algorithm
21
Properties:
- Clustering-based
- Deep hierarchical index
- Approximate k-NN search
- Trades response time for throughput by batching
Why?
- Very simple
- Prototypical of many CBIR algorithms
- Previous Hadoop implementation facilitates comparison
DeCP as a CBIR System
22
- Off-line
– Build the index hierarchy
– Cluster the data collection
- On-line
– Approximate k-NN search
– Vote aggregation
Index is in RAM; clusters reside on disk
Figure: searching a single feature — identify the cluster via the index, retrieve it from the clustered collection, scan it for the k-NN
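To spell out the identify/retrieve/scan steps, here is a simplified sketch under assumed data structures (not the actual DeCP code): the in-RAM hierarchy is descended to the nearest leaf cluster, that cluster is fetched, and it is scanned linearly for the k nearest neighbours.

```scala
// Simplified, hypothetical sketch of searching a single feature in a
// DeCP-like system: identify a cluster via the in-RAM index hierarchy,
// retrieve that cluster, then scan it for the approximate k-NN.
case class IndexNode(centroid: Array[Float], children: Seq[IndexNode], clusterId: Option[Int])

object SingleFeatureSearch {
  def dist(a: Array[Float], b: Array[Float]): Float =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Identify: descend the index, picking the nearest centroid at each level.
  def identify(node: IndexNode, q: Array[Float]): Int = node.clusterId match {
    case Some(id) => id                                               // leaf cluster reached
    case None     => identify(node.children.minBy(c => dist(c.centroid, q)), q)
  }

  // Retrieve + scan: read the chosen cluster (from disk in the real system;
  // an in-memory map here) and rank its features by distance to the query.
  def search(root: IndexNode, clusters: Map[Int, Seq[(Int, Array[Float])]],
             q: Array[Float], k: Int): Seq[Int] =
    clusters(identify(root, q)).sortBy { case (_, f) => dist(f, q) }.take(k).map(_._1)
}
```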
Multi-Core Parallelism: Scaling-Out
- Relative indexing time on 2-24 virtual cores
Figure: relative scalability — time relative to using 2 cores vs. number of cores (real and HT cores); measured wall clock time, optimal trend, and corrected trend
Batch Processing: Throughput
Figure: scan wall time and scan CPU time per image (milliseconds) vs. batch size (10 to 100,000 images); index of 1.6M clusters (L=4) over 8.1B descriptors; annotated point at 81ms
Design Choices: Feature Collection
- YLI feature corpus from YFCC100M
– Various feature sets (visual, semantic, …)
– 99.2M images and 0.8M videos
– Largest dataset publicly available
- Use all 42.9 billion SIFT features!
– Goal is to test at a very large scale
– No feature aggregation or compression
– Largest feature collection reported!
25
Research Questions
- What is the complexity of the Spark pipeline for typical multimedia-related tasks?
- How well does background processing scale as collection size and resources grow?
- How does batch size impact throughput of an off-line service?
26
Requirements for the ADCF
R1 Scalability: ability to scale out with additional computing power
R2 Computational flexibility: ability to carefully balance system resources as needed
R3 Capacity: ability to gracefully handle data that vastly exceeds main memory capacity
R4 Updates: ability to gracefully update the data structures for dynamic workloads
R5 Flexible pipeline: ability to easily implement variations of the indexing and/or retrieval process
R6 Simplicity: how efficiently the programmer's time is spent
27
DeCP on Hadoop
28
- Prior work evaluated DeCP on Hadoop using 30 billion SIFTs on 100+ machines
- Conclusion = limited success
– Scalability limited due to RAM per core
– Two-step Map-Reduce pipeline is too rigid
- Ex: Single data-source only
- Ex: Could not search multiple clusters
– R1, R2, R3 = partially; R4 = no; R5, R6 = no
DeCP on Spark
- A very different ADCF from Hadoop
- Several advantages
– Arbitrarily deep pipelines
- Easily implement all features and functionality
– Broadcast variables
- Solves the RAM per core limitation
– Multiple data sources
- Ex: Allows join operations for maintenance (R4)
29
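A sketch of the broadcast-variable idea (assumed names and toy data, not the paper's code): the index is shipped once per worker as a read-only shared copy, rather than once per core, which is what removes the RAM-per-core bottleneck.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("broadcast-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // The (potentially large) index is built once on the master...
    val index: Map[Int, Array[Float]] = Map(0 -> Array(0f, 0f), 1 -> Array(1f, 1f))

    // ...and broadcast: each worker keeps a single read-only copy shared by
    // all of its cores, instead of one copy per task/core.
    val bIndex = sc.broadcast(index)

    val features = sc.parallelize(Seq(Array(0.1f, 0.2f), Array(0.9f, 1.1f)))
    val assigned = features.map { f =>
      // Assign each feature to its nearest indexed centroid using the shared copy.
      bIndex.value.minBy { case (_, c) =>
        c.zip(f).map { case (x, y) => (x - y) * (x - y) }.sum
      }._1
    }
    println(assigned.collect().mkString(", "))  // cluster ids, e.g. 0, 1

    spark.stop()
  }
}
```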
Spark Pipeline Symbols
30
- .map = one-to-one transformation
- .flatMap = one-to-any transformation
- .groupByKey = Hadoop’s Shuffle
- .reduceByKey = Hadoop’s Reduce
- .collectAsMap = collect to Master
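A tiny, self-contained Scala/Spark illustration (not from the paper) of what each of these operators does on toy data:

```scala
import org.apache.spark.sql.SparkSession

object OperatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("operator-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("big data", "spark"))

    val lengths = words.map(_.length)              // one-to-one: "big data" -> 8
    val tokens  = words.flatMap(_.split(" "))      // one-to-any: "big data" -> "big", "data"
    val pairs   = tokens.map(t => (t.head, t))

    val grouped = pairs.groupByKey()                          // shuffle values to their key
    val counts  = pairs.mapValues(_ => 1).reduceByKey(_ + _)  // combine values per key
    val asMap   = counts.collectAsMap()                       // bring the (small) result to the Master

    println(lengths.collect().toSeq)                              // Seq(8, 5)
    println(grouped.collect().toSeq.map { case (k, v) => (k, v.toSeq) })
    println(asMap)                                                // Map(b -> 1, d -> 1, s -> 1)

    spark.stop()
  }
}
```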
Indexing Pipeline
31
Search Pipeline
32
Figure: indexing and search pipelines
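The pipeline figures are not reproduced here; as a rough sketch of how such a search pipeline can be expressed with the operators above (query features routed to clusters via a broadcast index, candidates scored, and results aggregated per query), consider the following. This is an assumption-laden illustration, not the paper's actual pipeline; all names are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.broadcast.Broadcast

// Rough sketch of a DeCP-style batched search pipeline (illustrative only):
// clusteredData: (clusterId, (imageId, feature)) for the whole collection,
// queries:       (queryId, feature) for a batch of query features.
object SearchPipelineSketch {
  type Feature = Array[Float]

  def dist(a: Feature, b: Feature): Float =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def search(clusteredData: RDD[(Int, (Long, Feature))],
             queries: RDD[(Long, Feature)],
             index: Broadcast[Map[Int, Feature]],
             k: Int): Map[Long, Seq[Long]] = {

    // Assign each query feature to its nearest cluster using the broadcast index.
    val routed: RDD[(Int, (Long, Feature))] = queries.map { case (qid, f) =>
      val cid = index.value.minBy { case (_, c) => dist(c, f) }._1
      (cid, (qid, f))
    }

    // Join queries with the features stored in their cluster, score candidates,
    // and keep the k best image ids per query.
    routed.join(clusteredData)
      .map { case (_, ((qid, qf), (imgId, feat))) => (qid, (imgId, dist(qf, feat))) }
      .groupByKey()
      .mapValues(cands => cands.toSeq.sortBy(_._2).take(k).map(_._1))
      .collectAsMap()
      .toMap
  }
}
```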
Evaluation: Off-line Indexing
- Hardware: 51 AWS c3.8xl nodes
– 800 real cores + 800 virtual cores
– 2.8 TB of RAM and 30 TB of SSD space
- Indexing time as collection grows
33
Features (billions)   Indexing time (seconds)   Scaling (relative)
 8.5                   3,287                     –
17.2                   5,030                     1.53
26.0                  11,943                     3.63
34.5                  14,192                     4.31
42.9                  19,749                     6.00
Evaluation: “On-line” Search
34
- Throughput with batching
Figure: search throughput with batching; the Hadoop limit is marked for comparison
Summary
35
                               Spark                       Hadoop
R1 Scalability                 Yes                         Partial (RAM per core)
R2 Computational Flexibility   Yes                         Partial
R3 Capacity                    Yes                         Partial
R4 Updates                     Partial (full re-shuffle)   No
R5 Flexible Pipelines          Yes                         No
R6 Simplicity                  Yes                         No
demonstration web-site under development
Largest Collections in the Literature...
1. Guðmundsson, Amsaleg, Jónsson, Franklin (2017)
– 42.9 billion SIFTs – 51 servers
2. Moise, Shestakov, Guðmundsson, Amsaleg (2013)
– 30.2 billion SIFTs – 108 servers
3. Lejsek, Jónsson, Amsaleg (2017?)
– 28.5 billion SIFTs – 1 server
4. Lejsek, Jónsson, Amsaleg (2011)
– 2.5 billion SIFTs – 1 server
5. Jégou et al. (2011)
– 2 billion ... – 1 server
6. Sun et al. (2011)
– 1.5 billion ... – 10 servers
Role of Spark
37
Figure: Lambda Architecture layers — Batch Layer (where Spark fits), Speed Layer, Service Layer, Storage Layer
Outline
38
- Motivation:
Scalable multimedia analytics
- Batch Layer:
Spark and 43 billion high-dim features
- Service Layer:
Blackthorn and 100 million images
- Conclusion:
Importance and challenges of scale!
Framework: Lambda Architecture
39
Figure: Lambda Architecture layers — Batch Layer, Speed Layer, Service Layer, Storage Layer
40
Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring
Blackthorn: Large-Scale Interactive Multimodal Learning
Accepted to IEEE Transactions on Multimedia

Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring
Interactive Multimodal Learning on 100 Million Images
Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR), New York, NY, USA, June 2016
[Zahálka and Worring, 2014; Keim et al., 2010]
Multimedia Analytics Process
41
42
Blackthorn Motivation
- Do not impose a dictionary on the user
- Let the user synthesize categories of relevance from semantic annotations on the fly
- Let the user search and explore along those categories interactively
- Interactive semi-supervised learning at scale!
43
Honza’s Scalability Illustration
- “Yesterday”:
10-100K images
- YFCC:
100M images
44
Image credit: http://demonocracy.info/infographics/usa/us_debt/us_debt.html
Scaling > Data Volume
45
- Single (high-end) workstation
– But … 1000D features → 800GB
- Interactive response time!
– But … computing 100M feature scores takes minutes!
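A quick back-of-the-envelope check (assuming 8-byte double-precision values, which is an assumption rather than a figure from the slides): 100M images × 1,000 dimensions × 8 bytes ≈ 800 GB, so the raw visual features alone vastly exceed the memory of a single workstation.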
Blackthorn Overview
46
Blackthorn Data
- Visual:
1000D ImageNet semantic features
- Text:
100D LDA topics
- Sparse data (mostly ~0)
47
Blackthorn Compression
48
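The actual compression scheme is described in the Blackthorn paper; purely as an illustration of the general idea of exploiting sparsity (most dimensions ≈ 0), here is a hypothetical sketch that keeps only the largest non-zero components of each vector and quantizes their values. This is not Blackthorn's scheme; all names and parameters are invented for illustration.

```scala
// Hypothetical sparse-vector compression sketch (NOT Blackthorn's actual scheme):
// keep only the largest components of a mostly-zero feature vector and
// quantize their values to a single byte each (assumes activations in [0, 1]).
object SparseCompressionSketch {
  case class Compressed(indices: Array[Short], values: Array[Byte])

  def compress(dense: Array[Float], keep: Int): Compressed = {
    val top = dense.zipWithIndex
      .filter { case (v, _) => v > 0f }       // drop the (many) zero components
      .sortBy { case (v, _) => -v }
      .take(keep)                             // keep only the strongest components
    Compressed(
      top.map { case (_, i) => i.toShort },
      top.map { case (v, _) => math.min(255, (v * 255).round).toByte }
    )
  }

  // Scoring against a (dense) model only touches the stored components.
  def score(c: Compressed, weights: Array[Float]): Float =
    c.indices.zip(c.values).map { case (i, v) =>
      weights(i) * ((v & 0xFF) / 255f)
    }.sum
}
```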
Blackthorn Overview
49
Experimental Task
50
- Find images relating to a particular city
- Compare to standard linear SVM
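As a generic sketch of what one interactive round of such a task looks like (not Blackthorn's implementation; the learner here is a plain perceptron-style linear scorer rather than its actual model): the user labels a few items, the model is updated, the whole collection is re-scored, and the top-scoring items are shown next.

```scala
// Generic interactive-learning round (illustrative, not Blackthorn's code):
// user labels a few items, a linear model is updated, the collection is
// re-scored, and the top-scoring item ids are returned for display.
object InteractiveRoundSketch {
  def update(w: Array[Float], x: Array[Float], label: Int, lr: Float): Unit =
    for (i <- x.indices) w(i) += lr * label * x(i)   // label: +1 relevant, -1 not

  def round(collection: Seq[(Long, Array[Float])],
            labelled: Seq[(Array[Float], Int)],
            w: Array[Float],
            topK: Int): Seq[Long] = {
    labelled.foreach { case (x, y) => update(w, x, y, lr = 0.1f) }
    collection
      .map { case (id, x) => (id, x.zip(w).map { case (a, b) => a * b }.sum) }
      .sortBy { case (_, s) => -s }
      .take(topK)
      .map(_._1)
  }
}
```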
Blackthorn Results: 1.2M Collection
51
- Compression: 880GB → 5GB
- Precision: 89-108% of SVM
- Scoring time:
60-80x faster
- Recall over time:
Blackthorn rocks!
Blackthorn Results: YFCC100M Collection
- Scoring time: ~1 second!
52
Blackthorn Future Work
- More (user) evaluation is needed
- Other applications may (will) require adaptations
- Further scalability:
Combine eCP and Blackthorn
53
Outline
- Motivation:
Scalable multimedia analytics
- Batch Layer:
Spark and 43 billion high-dim features
- Service Layer:
Blackthorn and 100 million images
- Conclusion:
Importance and challenges of scale!
54
Why Scale?
55
- Current and future applications
- Future of computing
- Because we cannot yet!
“We choose to go to the moon. We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard, because that goal will serve to organize and measure the best of our energies and skills, because that challenge is one that we are willing to accept, one we are unwilling to postpone, and one which we intend to win, and the others, too.” — JFK, September 12, 1962
Scalability Hurdles
56
Industry-Level Collections
– Data
– Processing capacity
Small-Minded Reviewers
– “Are there users willing to explore 100M data sets interactively?”
– Too much engineering, not enough “science”
Interactive Applications
– Application knowledge
– User study “victims”
Summary
- Motivation:
Scalable multimedia analytics
- Batch Layer:
Spark and 43 billion high-dim features
- Service Layer:
Blackthorn and 100 million images
- Conclusion:
Importance and challenges of scale!
57