

SLIDE 1

Towards Scalable Multimedia Analytics

Björn Þór Jónsson, datasys group, Computer Science Department, IT University of Copenhagen

SLIDE 2

Today’s Media Collections

  • Massive and growing

– Europeana > 50 million items
– DeviantArt > 250 million items (160K/day)
– Facebook > 1,000 billion items (200M/day)

  • Variety of users and applications

– Novices → enthusiasts → scholars → experts
– Current systems aimed at helping experts

  • Need for understanding and insights

2

SLIDE 3

Media Tasks

3

[Figure: media tasks, spanning Search and Exploration]

SLIDE 4

Media Tasks

4 [Zahálka and Worring, 2014]

SLIDE 5

Multimedia Analytics

[Figure: Multimedia Analytics as the intersection of Multimedia Analysis and Visual Analytics]

[Zahálka and Worring, 2014] 5

SLIDE 6

[Zahálka and Worring, 2014; Keim et al., 2010]

From Data to Insight

6

SLIDE 7

The Two Gaps

7

Semantic Gap [Smeulders et al., 2000]
– Generic data and annotation based on objective understanding
– vs. specific context and task-driven subjective understanding

Pragmatic Gap [Zahálka and Worring, 2014]
– Predefined, fixed annotation-based understanding of the collection
– vs. dynamically evolving and interaction-driven understanding of collections

SLIDE 8

Multimedia Analytics State of the Art

  • Theory is developing
  • Early systems have appeared
  • No real-life applications (?)
  • Small collections only

8

SLIDE 9

Scalable Multimedia Analytics

[Figure: Scalable Multimedia Analytics as the intersection of Multimedia Analysis, Visual Analytics, and Data Management]

[Jónsson et al., 2016] 9

SLIDE 10

The Three Gaps

10

Semantic Gap [Smeulders et al., 2000]
– Generic data and annotation based on objective understanding
– vs. specific context and task-driven subjective understanding

Pragmatic Gap [Zahálka and Worring, 2014]
– Predefined, fixed annotation-based understanding of the collection
– vs. dynamically evolving and interaction-driven understanding of collections

Scale Gap [Jónsson et al., 2016]
– Pre-computed indices and bulk processing of large datasets
– vs. serendipitous and highly interactive sessions on small data subsets

SLIDE 11

[Jónsson et al., MMM 2016]

Ten Research Questions for Scalable Multimedia Analytics

VELOCITY VOLUME VARIETY VISUAL INTERACTION

SLIDE 12

12

SLIDE 13

Building Systems?

13

SLIDE 14

Big Data Framework: Lambda Architecture

14

[Figure: Lambda Architecture with Batch Layer, Speed Layer, Storage Layer, and Service Layer] [Marz and Warren, 2015]
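Not from the slides: a minimal Python sketch of the Lambda Architecture idea named above, under simple assumptions (a hypothetical batch view recomputed over the full dataset, a realtime view maintained by the speed layer, and a service layer that merges the two to answer queries).

```python
from collections import Counter

# Batch layer: periodically recomputes a view over the entire (large, slowly changing) dataset.
def batch_layer(all_events):
    return Counter(all_events)            # e.g. term -> count over the full history

# Speed layer: incrementally maintains a view over recent data only.
def speed_layer(recent_events, realtime_view):
    realtime_view.update(recent_events)
    return realtime_view

# Service layer: answers queries by merging the precomputed batch view with the realtime view.
def service_layer(query, batch_view, realtime_view):
    return batch_view.get(query, 0) + realtime_view.get(query, 0)

batch_view = batch_layer(["cat", "dog", "cat"])
realtime_view = speed_layer(["cat"], Counter())
print(service_layer("cat", batch_view, realtime_view))    # -> 3
```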

SLIDE 15

Big Data Framework: Lambda Architecture

15

[Figure: Lambda Architecture with Batch Layer, Speed Layer, Storage Layer, and Service Layer]

SLIDE 16

Outline

  • Motivation:

Scalable multimedia analytics

  • Batch Layer:

Spark and 43 billion high-dim features

  • Service Layer:

Blackthorn and 100 million images

  • Conclusion:

Importance and challenges of scale!

16

SLIDE 17

Gylfi Þór Guðmundsson, Laurent Amsaleg, Björn Þór Jónsson, Michael J. Franklin: Towards Engineering a Web-Scale Multimedia Service: A Case Study Using Spark. Proceedings of the ACM Multimedia Systems Conference (MMSys), Taipei, Taiwan, June 2017.

17

SLIDE 18

Spark Case Study: Motivation

  • How can multimedia tasks harness the power of cloud computing?

– Multimedia collections are growing
– Computing power is abundant

  • ADCFs = Hadoop || Spark

– Automatically Distributed Computing Frameworks
– Designed for high-throughput processing

18

SLIDE 19

Design Choices: ADCF = Spark

  • Hadoop is not suitable (more later)
  • Resilient Distributed Datasets (RDDs)

– Transform one RDD into another via operators
– Lazy execution
– Master-and-Workers paradigm

  • Supports deep pipelines
  • Supports memory sharing among workers
  • Lazy execution allows for optimizations (see the sketch below)

19
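Not in the original slides: a small PySpark sketch of the RDD model described above. Transformations only build a lazy pipeline; nothing executes until an action is called, which is what lets Spark fuse arbitrarily deep pipelines. Names and data are illustrative.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-laziness-sketch")

numbers = sc.parallelize(range(1_000_000))

# Transformations only describe the pipeline; nothing runs yet (lazy execution).
squares = numbers.map(lambda x: x * x)
evens   = squares.filter(lambda x: x % 2 == 0)

# The pipeline can be arbitrarily deep; Spark fuses these steps per partition.
result = evens.take(5)        # action: triggers execution on the workers
print(result)                 # [0, 4, 16, 36, 64]

sc.stop()
```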

SLIDE 20

Design Choices: Application Domain

  • Content-Based Image Retrieval (CBIR)

– Well-known application
– Two phases: off-line & “on-line”

[Figure: a CBIR system takes a query image and returns search results]

20

SLIDE 21

Design Choices: DeCP Algorithm

21

Properties:
  • Clustering-based
  • Deep hierarchical index
  • Approximate k-NN search
  • Trades response time for throughput by batching

Why?
  • Very simple
  • Prototypical of many CBIR algorithms
  • Previous Hadoop implementation facilitates comparison

SLIDE 22

DeCP as a CBIR System

22

  • Off-line

– Build the index hierarchy
– Cluster the data collection

  • On-line

– Approximate k-NN search
– Vote aggregation

  • Index is in RAM; clusters reside on disk

[Figure: searching a single feature: identify the nearest cluster in the index, retrieve that cluster from the clustered collection on disk, scan it to produce the k-NN]
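Not from the slides: a minimal NumPy sketch of the single-feature search steps shown in the figure (identify, retrieve, scan). It uses a flat one-level index for brevity, whereas DeCP uses a deep hierarchical index; sizes and data are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_clusters, n_points, k = 128, 50, 5_000, 5

# "Index is in RAM": cluster representatives (a single level here; DeCP uses a hierarchy).
centroids = rng.random((n_clusters, dim), dtype=np.float32)

# "Clusters reside on disk": here just a dict cluster_id -> (ids, features).
points = rng.random((n_points, dim), dtype=np.float32)
assign = np.argmin(((points[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
clusters = {c: (np.where(assign == c)[0], points[assign == c]) for c in range(n_clusters)}

def search(query):
    # 1. Identify: walk the index to the nearest cluster.
    c = int(np.argmin(((centroids - query) ** 2).sum(-1)))
    # 2. Retrieve: load that cluster (from disk in the real system).
    ids, feats = clusters[c]
    # 3. Scan: exact k-NN within the cluster -> approximate k-NN overall.
    order = np.argsort(((feats - query) ** 2).sum(-1))[:k]
    return ids[order]

print(search(rng.random(dim, dtype=np.float32)))
```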

SLIDE 23

Multi-Core Parallelism: Scaling-Out

  • Relative indexing time on 2-24 virtual cores

[Figure: relative scalability of indexing, showing wall-clock time relative to 2 cores versus the number of cores (2-24), with measured time, optimal trend, and corrected trend; the x-axis distinguishes real cores from hyper-threaded (HT) cores]

SLIDE 24

Batch Processing: Throughput

[Figure: scan wall-clock time and CPU time per image (milliseconds) versus batch size (10 to 100,000 images), for an 8.1B-descriptor collection with a 1.6M-cluster, L=4 index; an 81 ms per-image time is marked]

SLIDE 25

Design Choices: Feature Collection

  • YLI feature corpus from YFCC100M

– Various feature sets (visual, semantic, …)
– 99.2M images and 0.8M videos
– Largest dataset publicly available

  • Use all 42.9 billion SIFT features!

– Goal is to test at a very large scale
– No feature aggregation or compression
– Largest feature collection reported!

25

SLIDE 26

Research Questions

  • What is the complexity of the Spark pipeline for typical multimedia-related tasks?
  • How well does background processing scale as collection size and resources grow?
  • How does batch size impact throughput of an offline service?

26

SLIDE 27

Requirements for the ADCF

R1 Scalability: ability to scale out with additional computing power
R2 Computational flexibility: ability to carefully balance system resources as needed
R3 Capacity: ability to gracefully handle data that vastly exceeds main memory capacity
R4 Updates: ability to gracefully update the data structures for dynamic workloads
R5 Flexible pipeline: ability to easily implement variations of the indexing and/or retrieval process
R6 Simplicity: how efficiently the programmer’s time is spent

27

SLIDE 28

DeCP on Hadoop

28

  • Prior work evaluated DeCP on Hadoop using 30 billion SIFTs on 100+ machines
  • Conclusion = limited success

– Scalability limited due to RAM per core
– Two-step Map-Reduce pipeline is too rigid

  • Ex: Single data source only
  • Ex: Could not search multiple clusters

– R1, R2, R3 = partially; R4 = no; R5, R6 = no

SLIDE 29

DeCP on Spark

  • A very different ADCF from Hadoop
  • Several advantages

– Arbitrarily deep pipelines: easily implement all features and functionality
– Broadcast variables: solve the RAM-per-core limitation
– Multiple data sources: e.g., allow join operations for maintenance (R4)

29
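Not in the slides: a tiny PySpark sketch of the broadcast-variable point above. A broadcast object is shipped to each worker once and shared by all tasks on that worker, rather than serialized into every task; in DeCP the broadcast object would be the index hierarchy, here it is just a toy lookup table.

```python
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-sketch")

# Broadcast: the lookup structure is shipped to each worker once and shared by all of
# its tasks, instead of one copy per core/task (the limitation seen with Hadoop).
country_of = sc.broadcast({"cph": "DK", "rvk": "IS", "ams": "NL"})

cities = sc.parallelize(["cph", "ams", "cph", "rvk"])
print(cities.map(lambda c: country_of.value[c]).countByValue())
# e.g. defaultdict(<class 'int'>, {'DK': 2, 'NL': 1, 'IS': 1})

sc.stop()
```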

SLIDE 30

Spark Pipeline Symbols

30

  • .map = one-to-one transformation
  • .flatMap = one-to-any transformation
  • .groupByKey = Hadoop’s Shuffle
  • .reduceByKey = Hadoop’s Reduce
  • .collectAsMap = collect to Master
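Not in the slides: a short PySpark snippet exercising the five operators listed above on toy word-count data, to make the Hadoop analogies concrete (the PySpark spelling is flatMap).

```python
from pyspark import SparkContext

sc = SparkContext(appName="operator-sketch")

words = sc.parallelize(["big data", "big media"])

pairs = (words
         .flatMap(lambda line: line.split())        # one-to-any: line -> words
         .map(lambda w: (w, 1)))                    # one-to-one: word -> (word, 1)

# .groupByKey ~ Hadoop's shuffle: all values for a key end up together.
grouped = pairs.groupByKey().mapValues(list).collect()
print(grouped)                      # e.g. [('big', [1, 1]), ('data', [1]), ('media', [1])]

# .reduceByKey ~ Hadoop's reduce: combine values per key (with map-side combining).
counts = pairs.reduceByKey(lambda a, b: a + b)

# .collectAsMap: pull the (small) result back to the master as a dict.
print(counts.collectAsMap())        # {'big': 2, 'data': 1, 'media': 1}

sc.stop()
```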

SLIDE 31

Indexing Pipeline

31
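The indexing pipeline itself is shown only as a figure in the deck; below is a rough PySpark sketch of what such a pipeline could look like under simple assumptions (a broadcast set of cluster centroids standing in for the index hierarchy, features as NumPy vectors, nearest-centroid assignment followed by a grouping step). It is not the paper's actual code.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="decp-indexing-sketch")

centroids = np.random.rand(1000, 128).astype("float32")   # stand-in for the index hierarchy
index_bc = sc.broadcast(centroids)

def assign(record):
    img_id, feature = record
    c = int(np.argmin(((index_bc.value - feature) ** 2).sum(axis=1)))
    return (c, (img_id, feature))                          # key = cluster id

features = sc.parallelize(
    [(i, np.random.rand(128).astype("float32")) for i in range(10_000)])

# Cluster the collection: assign each feature to its nearest centroid, then gather all
# features of a cluster together (the shuffle), ready to be written out as clusters.
clustered = features.map(assign).groupByKey()
print(clustered.mapValues(len).take(5))                    # (cluster id, cluster size)

sc.stop()
```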

SLIDE 32

Search Pipeline

32

[Figures: the indexing and search pipelines]
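Again a sketch rather than the pipeline from the paper: batched search could map each query feature to its nearest cluster, scan that cluster for candidate neighbours, aggregate votes per (query, image) pair with reduceByKey, and collect the small result to the master. The toy collection is broadcast here only to keep the example self-contained; in the real system the clusters live in the storage layer.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="decp-search-sketch")
rng = np.random.default_rng(1)

centroids = rng.random((200, 32), dtype=np.float32)
index_bc = sc.broadcast(centroids)
nearest = lambda f: int(np.argmin(((index_bc.value - f) ** 2).sum(axis=1)))

# Toy clustered collection, broadcast for simplicity.
db = [(i, rng.random(32, dtype=np.float32)) for i in range(5_000)]
clusters = {}
for i, f in db:
    clusters.setdefault(nearest(f), []).append((i, f))
clusters_bc = sc.broadcast(clusters)

def scan(query):
    qid, f = query
    cands = clusters_bc.value.get(nearest(f), [])
    top = sorted(cands, key=lambda c: float(((c[1] - f) ** 2).sum()))[:3]
    return [((qid, db_id), 1) for db_id, _ in top]          # one vote per neighbour

queries = sc.parallelize([(q, rng.random(32, dtype=np.float32)) for q in range(100)])

votes = (queries.flatMap(scan)                      # query feature -> candidate votes
                .reduceByKey(lambda a, b: a + b))   # aggregate votes per (query, image)
print(dict(list(votes.collectAsMap().items())[:5]))

sc.stop()
```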

SLIDE 33

Evaluation: Off-line Indexing

  • Hardware: 51 AWS c3.8xl nodes

– 800 real cores + 800 virtual cores – 2.8 TB of RAM and 30 TB of SSD space

  • Indexing time as collection grows

33

Features (billions)   Indexing time (seconds)   Scaling (relative)
 8.5                    3,287                    –
17.2                    5,030                    1.53
26.0                   11,943                    3.63
34.5                   14,192                    4.31
42.9                   19,749                    6.00

SLIDE 34

Evaluation: “On-line” Search

34

  • Throughput with batching

[Figure: search throughput as the batch size grows; the throughput limit of the earlier Hadoop implementation is marked for comparison]

SLIDE 35

Summary

35

Requirement                    Spark                       Hadoop
R1 Scalability                 Yes                         Partial (RAM per core)
R2 Computational flexibility   Yes                         Partial
R3 Capacity                    Yes                         Partial
R4 Updates                     Partial (full re-shuffle)   No
R5 Flexible pipelines          Yes                         No
R6 Simplicity                  Yes                         No

Demonstration web-site under development.

SLIDE 36

Largest Collections in the Literature...

1. Guðmundsson, Amsaleg, Jónsson, Franklin (2017)

– 42.9 billion SIFTs – 51 servers

2. Moise, Shestakov, Guðmundsson, Amsaleg (2013)

– 30.2 billion SIFTs – 108 servers

3. Lejsek, Jónsson, Amsaleg (2017?)

– 28.5 billion SIFTs – 1 server

4. Lejsek, Jónsson, Amsaleg (2011)

– 2.5 billion SIFTs – 1 server

5. Jégou et al. (2011)

– 2 billion ... – 1 server

6. Sun et al (2011)

– 1.5 billion ... – 10 servers

SLIDE 37

Role of Spark

37

[Figure: Lambda Architecture, with Spark serving as the Batch Layer alongside the Storage, Speed, and Service Layers]

SLIDE 38

Outline

38

  • Motivation:

Scalable multimedia analytics

  • Batch Layer:

Spark and 43 billion high-dim features

  • Service Layer:

Blackthorn and 100 million images

  • Conclusion:

Importance and challenges of scale!

SLIDE 39

Framework: Lambda Architecture

39

[Figure: Lambda Architecture with Batch Layer, Speed Layer, Storage Layer, and Service Layer]

SLIDE 40

40

Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring: Blackthorn: Large-Scale Interactive Multimodal Learning. Accepted to IEEE Transactions on Multimedia.

Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring: Interactive Multimodal Learning on 100 Million Images. Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR), New York, NY, USA, June 2016.

SLIDE 41

[Zahálka and Worring, 2014; Keim et al., 2010]

Multimedia Analytics Process

41

SLIDE 42

42

SLIDE 43

Blackthorn Motivation

  • Do not impose a dictionary on the user
  • Let the user synthesize categories of relevance from semantic annotations on the fly
  • Let the user search and explore along those categories interactively
  • Interactive semi-supervised learning at scale! (see the sketch below)

43
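Not from the slides: a toy sketch of the kind of interactive learning loop described above. The user labels a handful of items, a linear model is retrained, the whole collection is scored, and the top-scoring items are shown for the next round. It uses scikit-learn's LinearSVC on dense random data purely for illustration; Blackthorn itself works on compressed multimodal features at 100M-item scale.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
collection = rng.random((100_000, 50)).astype(np.float32)   # toy stand-in for the feature matrix

labeled_idx, labels = [0, 1, 2, 3], [1, 0, 1, 0]             # initial user feedback

for round_ in range(3):
    model = LinearSVC(C=1.0).fit(collection[labeled_idx], labels)
    scores = model.decision_function(collection)             # score *all* items
    ranked = np.argsort(-scores)
    suggestions = [i for i in ranked[:10] if i not in labeled_idx][:5]

    # In the real system the user now inspects `suggestions`; here the feedback is faked.
    for i in suggestions:
        labeled_idx.append(int(i))
        labels.append(int(rng.integers(0, 2)))

    print(f"round {round_}: top suggestions {suggestions}")
```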

SLIDE 44

Honza’s Scalability Illustration

  • “Yesterday”:

10-100K images

  • YFCC:

100M images

44

Image credit: http://demonocracy.info/infographics/usa/us_debt/us_debt.html

SLIDE 45

Scaling > Data Volume

45

  • Single (high-end) workstation

– But … 1000D features → 800 GB (see the arithmetic below)

  • Interactive response time!

– But … computing 100M feature scores takes minutes!
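A quick sanity check on the 800 GB figure above, assuming 8-byte double-precision values: 10^8 images × 1,000 dimensions × 8 B = 8 × 10^11 B = 800 GB, before any text features or metadata are added.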

SLIDE 46

Blackthorn Overview

46

SLIDE 47

Blackthorn Data

  • Visual:

1000D ImageNet semantic features

  • Text:

100D LDA topics

  • Sparse data (mostly ~0)

47

SLIDE 48

Blackthorn Compression

48
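The compression scheme itself appears only as a figure in the deck. The sketch below illustrates one generic way to exploit the sparsity mentioned on the previous slide (keep only the k largest components per item and quantize them); this is an assumption for illustration and not necessarily Blackthorn's exact encoding.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k = 1100, 16                      # ~1000D visual + 100D text; keep the k largest components

x = np.zeros(dim, dtype=np.float32)
x[rng.choice(dim, 12, replace=False)] = rng.random(12)    # mostly ~0 (sparse), 12 non-zeros

# Compress: store only the indices and quantized values of the k largest components.
top = np.argsort(-x)[:k]
idx = top.astype(np.uint16)                               # 2 bytes per index
val = np.round(x[top] * 255).astype(np.uint8)             # 1 byte per value

print(f"dense: {x.nbytes} B, compressed: {idx.nbytes + val.nbytes} B")   # 4400 B vs 48 B

# Scoring against a linear model then touches only the stored components.
w = rng.random(dim).astype(np.float32)
approx = float(w[idx] @ (val.astype(np.float32) / 255.0))
print(abs(approx - float(w @ x)))      # tiny: only quantization error, since all non-zeros were kept
```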

SLIDE 49

Blackthorn Overview

49

SLIDE 50

Experimental Task

50

  • Find images relating to a particular city
  • Compare to standard linear SVM
SLIDE 51

Blackthorn Results: 1.2M Collection

51

  • Compression: 880 GB → 5 GB
  • Precision: 89-108% of SVM
  • Scoring time:

60-80x faster

  • Recall over time:

Blackthorn rocks!

SLIDE 52

Blackthorn Results: YFCC100M Collection

  • Scoring time: ~1 second!

52

SLIDE 53

Blackthorn Future Work

  • More (user) evaluation is needed
  • Other applications may (will) require adaptations
  • Further scalability: combine eCP and Blackthorn

53

SLIDE 54

Outline

  • Motivation:

Scalable multimedia analytics

  • Batch Layer:

Spark and 43 billion high-dim features

  • Service Layer:

Blackthorn and 100 million images

  • Conclusion:

Importance and challenges of scale!

54

SLIDE 55

Why Scale?

55

  • Current and future applications
  • Future of computing
  • Because we cannot yet!

“We choose to go to the moon. We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard, because that goal will serve to organize and measure the best of our energies and skills, because that challenge is one that we are willing to accept, one we are unwilling to postpone, and one which we intend to win, and the others, too.”
JFK, September 12, 1962

SLIDE 56

Scalability Hurdles

56

Industry-Level Collections
– Data
– Processing capacity

Small-Minded Reviewers
– “Are there users willing to explore 100M data sets interactively?”
– Too much engineering, not enough “science”

Interactive Applications
– Application knowledge
– User study “victims”

SLIDE 57

Summary

  • Motivation:

Scalable multimedia analytics

  • Batch Layer:

Spark and 43 billion high-dim features

  • Service Layer:

Blackthorn and 100 million images

  • Conclusion:

Importance and challenges of scale!

57