Towards Scalable Multimedia Analytics
Björn Þór Jónsson
data sys group, Computer Science Department, IT University of Copenhagen
Today’s Media Collections
- Massive and growing
– Europeana > 50 million items
– DeviantArt > 250 million items (160K/day)
– Facebook > 1,000 billion items (200M/day)
- Variety of users and applications
– Novices → enthusiasts → scholars → experts
– Current systems aimed at helping experts
- Need for understanding and insights
2
Media Tasks
3
Search Exploration
Media Tasks
4 [Zahálka and Worring, 2014]
Multimedia Analytics
Multimedia Analytics
Multimedia Analysis Visual Analytics
[Zahálka and Worring, 2014] 5
[Zahálka and Worring, 2014; Keim et al., 2010]
From Data to Insight
6
The Two Gaps
7
Semantic Gap [Smeulders et al., 2000]:
– System: generic data and annotation based on objective understanding
– User: specific context and task-driven subjective understanding
Pragmatic Gap [Zahálka and Worring, 2014]:
– System: predefined, fixed annotation based on an understanding of the collection
– User: dynamically evolving and interaction-driven understanding of collections
Multimedia Analytics State of the Art
- Theory is developing
- Early systems have appeared
- No real-life applications (?)
- Small collections only
8
Scalable Multimedia Analytics
Scalable Multimedia Analytics
Multimedia Analysis Visual Analytics Data Management
[Jónsson et al., 2016] 9
The Three Gaps
10
Semantic Gap [Smeulders et al., 2000]:
– System: generic data and annotation based on objective understanding
– User: specific context and task-driven subjective understanding
Pragmatic Gap [Zahálka and Worring, 2014]:
– System: predefined, fixed annotation based on an understanding of the collection
– User: dynamically evolving and interaction-driven understanding of collections
Scale Gap [Jónsson et al., MMM 2016]:
– System: pre-computed indices and bulk processing of large datasets
– User: serendipitous and highly interactive sessions on small data subsets
Ten Research Questions for Scalable Multimedia Analytics
VELOCITY VOLUME VARIETY VISUAL INTERACTION
12
Building Systems?
13
Big Data Framework: Lambda Architecture
14
Figure: Lambda Architecture layers — Batch Layer, Speed Layer, Service Layer, Storage Layer [Marz and Warren, 2015]
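To make the layer roles concrete, here is a minimal, hypothetical sketch (not from the talk; all names and data are illustrative) of how a service layer can answer a query by merging a pre-computed batch view with an incrementally maintained speed-layer view:

```scala
// Hypothetical illustration of the Lambda Architecture layers.
// The batch layer periodically recomputes a complete view from the master
// dataset; the speed layer maintains a view of data that arrived since the
// last batch run; the service layer merges both at query time.
object LambdaSketch {
  // Batch view: recomputed in bulk (e.g., by a Spark job) over all stored data.
  val batchView: Map[String, Long] = Map("imageA" -> 120L, "imageB" -> 42L)

  // Speed view: incrementally updated from recently arrived data.
  val speedView: Map[String, Long] = Map("imageB" -> 3L, "imageC" -> 7L)

  // Service layer: answer queries by combining the two views.
  def query(key: String): Long =
    batchView.getOrElse(key, 0L) + speedView.getOrElse(key, 0L)

  def main(args: Array[String]): Unit =
    println(query("imageB")) // 45: batch result plus recent updates
}
```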
Outline
- Motivation:
Scalable multimedia analytics
- Batch Layer:
Spark and 43 billion high-dim features
- Service Layer:
Blackthorn and 100 million images
- Conclusion:
Importance and challenges of scale!
16
Gylfi Þór Guðmundsson, Laurent Amsaleg, Björn Þór Jónsson, Michael J. Franklin
Towards Engineering a Web-Scale Multimedia Service: A Case Study Using Spark
Proceedings of the ACM Multimedia Systems Conference (MMSys), Taipei, Taiwan, June 2017
17
Spark Case Study: Motivation
- How can multimedia tasks harness the power of cloud-computing?
– Multimedia collections are growing
– Computing power is abundant
- ADCFs = Hadoop || Spark
– Automatically Distributed Computing Frameworks
– Designed for high-throughput processing
18
Design Choices: ADCF = Spark
- Hadoop is not suitable (more later)
- Resilient Distributed Datasets (RDDs)
– Transform one RDD to another via operators
– Lazy execution
– Master and Workers paradigm
- Supports deep pipelines
- Supports worker’s memory sharing
- Lazy execution allows for optimizations
19
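As a concrete illustration of these points (a minimal sketch, not code from the paper), the following Spark snippet builds a small chain of RDD transformations; nothing runs until the final action, so Spark sees the whole pipeline before executing it on the workers:

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("rdd-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // An RDD is transformed into another RDD via operators such as map/filter.
    val features = sc.parallelize(Seq(0.1, 0.5, 0.9, 0.3))
    val scaled   = features.map(_ * 128.0)   // transformation: recorded, not executed
    val filtered = scaled.filter(_ > 50.0)   // still lazy: the pipeline just grows deeper

    // Only an action (here: collect) triggers execution on the workers,
    // giving Spark the full pipeline to optimize before running it.
    println(filtered.collect().mkString(", "))

    spark.stop()
  }
}
```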
Design Choices: Application Domain
- Content-Based Image Retrieval (CBIR)
– Well known application
– Two phases: Off-line & “On-line”
Figure: Query Image → CBIR System → Search results
20
Design Choices: DeCP Algorithm
21
Properties:
- Clustering-based
- Deep hierarchical index
- Approximate k-NN search
- Trades response time for throughput by batching
Why?
- Very simple
- Prototypical of many CBIR algorithms
- Previous Hadoop implementation facilitates comparison
DeCP as a CBIR System
22
- Off-line
– Build the index hierarchy
– Cluster the data collection
- On-line
– Approximate k-NN search
– Vote aggregation
Index is in RAM; clusters reside on disk
Figure: searching a single feature — identify the cluster via the index, retrieve it from the clustered collection, scan it for the k-NN
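To spell out the identify/retrieve/scan steps, here is a simplified sketch under assumed data structures (not the actual DeCP code): the in-RAM hierarchy is descended to the nearest leaf cluster, that cluster is fetched, and it is scanned linearly for the k nearest neighbours.

```scala
// Simplified, hypothetical sketch of searching a single feature in a
// DeCP-like system: identify a cluster via the in-RAM index hierarchy,
// retrieve that cluster, then scan it for the approximate k-NN.
case class IndexNode(centroid: Array[Float], children: Seq[IndexNode], clusterId: Option[Int])

object SingleFeatureSearch {
  def dist(a: Array[Float], b: Array[Float]): Float =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Identify: descend the index, picking the nearest centroid at each level.
  def identify(node: IndexNode, q: Array[Float]): Int = node.clusterId match {
    case Some(id) => id                                               // leaf cluster reached
    case None     => identify(node.children.minBy(c => dist(c.centroid, q)), q)
  }

  // Retrieve + scan: read the chosen cluster (from disk in the real system;
  // an in-memory map here) and rank its features by distance to the query.
  def search(root: IndexNode, clusters: Map[Int, Seq[(Int, Array[Float])]],
             q: Array[Float], k: Int): Seq[Int] =
    clusters(identify(root, q)).sortBy { case (_, f) => dist(f, q) }.take(k).map(_._1)
}
```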
Multi-Core Parallelism: Scaling-Out
- Relative indexing time on 2-24 virtual cores
Figure: relative scalability — time relative to using 2 cores vs. number of cores (real and HT cores); measured wall clock time, optimal trend, and corrected trend
Batch Processing: Throughput
Figure: scan wall time and scan CPU time per image (milliseconds) vs. batch size (10 to 100,000 images); index of 1.6M clusters (L=4) over 8.1B descriptors; annotated point at 81ms
Design Choices: Feature Collection
- YLI feature corpus from YFCC100M
– Various feature sets (visual, semantic, …)
– 99.2M images and 0.8M videos
– Largest dataset publicly available
- Use all 42.9 billion SIFT features!
– Goal is to test at a very large scale
– No feature aggregation or compression
– Largest feature collection reported!
25
Research Questions
- What is the complexity of the Spark pipeline for typical multimedia-related tasks?
- How well does background processing scale as collection size and resources grow?
- How does batch size impact throughput of an off-line service?
26
Requirements for the ADCF
R1 Scalability: ability to scale out with additional computing power
R2 Computational flexibility: ability to carefully balance system resources as needed
R3 Capacity: ability to gracefully handle data that vastly exceeds main memory capacity
R4 Updates: ability to gracefully update the data structures for dynamic workloads
R5 Flexible pipeline: ability to easily implement variations of the indexing and/or retrieval process
R6 Simplicity: how efficiently the programmer's time is spent
27
DeCP on Hadoop
28
- Prior work evaluated DeCP on Hadoop using 30 billion SIFTs on 100+ machines
- Conclusion = limited success
– Scalability limited due to RAM per core
– Two-step Map-Reduce pipeline is too rigid
- Ex: Single data-source only
- Ex: Could not search multiple clusters
– R1, R2, R3 = partially; R4 = no; R5, R6 = no
DeCP on Spark
- A very different ADCF from Hadoop
- Several advantages
– Arbitrarily deep pipelines
- Easily implement all features and functionality
– Broadcast variables
- Solves the RAM per core limitation
– Multiple data sources
- Ex: Allows join operations for maintenance (R4)
29
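A sketch of the broadcast-variable idea (assumed names and toy data, not the paper's code): the index is shipped once per worker as a read-only shared copy, rather than once per core, which is what removes the RAM-per-core bottleneck.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("broadcast-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // The (potentially large) index is built once on the master...
    val index: Map[Int, Array[Float]] = Map(0 -> Array(0f, 0f), 1 -> Array(1f, 1f))

    // ...and broadcast: each worker keeps a single read-only copy shared by
    // all of its cores, instead of one copy per task/core.
    val bIndex = sc.broadcast(index)

    val features = sc.parallelize(Seq(Array(0.1f, 0.2f), Array(0.9f, 1.1f)))
    val assigned = features.map { f =>
      // Assign each feature to its nearest indexed centroid using the shared copy.
      bIndex.value.minBy { case (_, c) =>
        c.zip(f).map { case (x, y) => (x - y) * (x - y) }.sum
      }._1
    }
    println(assigned.collect().mkString(", "))  // cluster ids, e.g. 0, 1

    spark.stop()
  }
}
```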
Spark Pipeline Symbols
30
- .map = one-to-one transformation
- .flatMap = one-to-any transformation
- .groupByKey = Hadoop’s Shuffle
- .reduceByKey = Hadoop’s Reduce
- .collectAsMap = collect to Master
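A tiny, self-contained Scala/Spark illustration (not from the paper) of what each of these operators does on toy data:

```scala
import org.apache.spark.sql.SparkSession

object OperatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("operator-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("big data", "spark"))

    val lengths = words.map(_.length)              // one-to-one: "big data" -> 8
    val tokens  = words.flatMap(_.split(" "))      // one-to-any: "big data" -> "big", "data"
    val pairs   = tokens.map(t => (t.head, t))

    val grouped = pairs.groupByKey()                          // shuffle values to their key
    val counts  = pairs.mapValues(_ => 1).reduceByKey(_ + _)  // combine values per key
    val asMap   = counts.collectAsMap()                       // bring the (small) result to the Master

    println(lengths.collect().toSeq)                              // Seq(8, 5)
    println(grouped.collect().toSeq.map { case (k, v) => (k, v.toSeq) })
    println(asMap)                                                // Map(b -> 1, d -> 1, s -> 1)

    spark.stop()
  }
}
```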
Indexing Pipeline
31
Search Pipeline
32
Figure: indexing and search pipelines
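The pipeline figures are not reproduced here; as a rough sketch of how such a search pipeline can be expressed with the operators above (query features routed to clusters via a broadcast index, candidates scored, and results aggregated per query), consider the following. This is an assumption-laden illustration, not the paper's actual pipeline; all names are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.broadcast.Broadcast

// Rough sketch of a DeCP-style batched search pipeline (illustrative only):
// clusteredData: (clusterId, (imageId, feature)) for the whole collection,
// queries:       (queryId, feature) for a batch of query features.
object SearchPipelineSketch {
  type Feature = Array[Float]

  def dist(a: Feature, b: Feature): Float =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def search(clusteredData: RDD[(Int, (Long, Feature))],
             queries: RDD[(Long, Feature)],
             index: Broadcast[Map[Int, Feature]],
             k: Int): Map[Long, Seq[Long]] = {

    // Assign each query feature to its nearest cluster using the broadcast index.
    val routed: RDD[(Int, (Long, Feature))] = queries.map { case (qid, f) =>
      val cid = index.value.minBy { case (_, c) => dist(c, f) }._1
      (cid, (qid, f))
    }

    // Join queries with the features stored in their cluster, score candidates,
    // and keep the k best image ids per query.
    routed.join(clusteredData)
      .map { case (_, ((qid, qf), (imgId, feat))) => (qid, (imgId, dist(qf, feat))) }
      .groupByKey()
      .mapValues(cands => cands.toSeq.sortBy(_._2).take(k).map(_._1))
      .collectAsMap()
      .toMap
  }
}
```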
Evaluation: Off-line Indexing
- Hardware: 51 AWS c3.8xl nodes
– 800 real cores + 800 virtual cores
– 2.8 TB of RAM and 30 TB of SSD space
- Indexing time as collection grows
33
Features (billions)   Indexing time (seconds)   Scaling (relative)
 8.5                   3,287                     –
17.2                   5,030                     1.53
26.0                  11,943                     3.63
34.5                  14,192                     4.31
42.9                  19,749                     6.00
Evaluation: “On-line” Search
34
- Throughput with batching
Figure: search throughput with batching; the Hadoop limit is marked for comparison
Summary
35
                               Spark                       Hadoop
R1 Scalability                 Yes                         Partial (RAM per core)
R2 Computational Flexibility   Yes                         Partial
R3 Capacity                    Yes                         Partial
R4 Updates                     Partial (full re-shuffle)   No
R5 Flexible Pipelines          Yes                         No
R6 Simplicity                  Yes                         No
demonstration web-site under development
Largest Collections in the Literature...
1. Guðmundsson, Amsaleg, Jónsson, Franklin (2017)
– 42.9 billion SIFTs – 51 servers
2. Moise, Shestakov, Guðmundsson, Amsaleg (2013)
– 30.2 billion SIFTs – 108 servers
3. Lejsek, Jónsson, Amsaleg (2017?)
– 28.5 billion SIFTs – 1 server
4. Lejsek, Jónsson, Amsaleg (2011)
– 2.5 billion SIFTs – 1 server
5. Jégou et al. (2011)
– 2 billion ... – 1 server
6. Sun et al. (2011)
– 1.5 billion ... – 10 servers
Role of Spark
37
Figure: Lambda Architecture layers — Batch Layer (where Spark fits), Speed Layer, Service Layer, Storage Layer
Outline
38
- Motivation:
Scalable multimedia analytics
- Batch Layer:
Spark and 43 billion high-dim features
- Service Layer:
Blackthorn and 100 million images
- Conclusion:
Importance and challenges of scale!
Framework: Lambda Architecture
39
Figure: Lambda Architecture layers — Batch Layer, Speed Layer, Service Layer, Storage Layer
40
Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring
Blackthorn: Large-Scale Interactive Multimodal Learning
Accepted to IEEE Transactions on Multimedia

Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring
Interactive Multimodal Learning on 100 Million Images
Proceedings of the ACM International Conference on Multimedia Retrieval (ICMR), New York, NY, USA, June 2016
[Zahálka and Worring, 2014; Keim et al., 2010]
Multimedia Analytics Process
41
42
Blackthorn Motivation
- Do not impose a dictionary on the user
- Let the user synthesize categories of relevance from semantic annotations on the fly
- Let the user search and explore along those categories interactively
- Interactive semi-supervised learning at scale!
43
Honza’s Scalability Illustration
- “Yesterday”:
10-100K images
- YFCC:
100M images
44
Image credit: http://demonocracy.info/infographics/usa/us_debt/us_debt.html
Scaling > Data Volume
45
- Single (high-end) workstation
– But … 1000D features → 800GB
- Interactive response time!
– But … computing 100M feature scores takes minutes!
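A quick back-of-the-envelope check (assuming 8-byte double-precision values, which is an assumption rather than a figure from the slides): 100M images × 1,000 dimensions × 8 bytes ≈ 800 GB, so the raw visual features alone vastly exceed the memory of a single workstation.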
Blackthorn Overview
46
Blackthorn Data
- Visual:
1000D ImageNet semantic features
- Text:
100D LDA topics
- Sparse data (mostly ~0)
47
Blackthorn Compression
48
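The actual compression scheme is described in the Blackthorn paper; purely as an illustration of the general idea of exploiting sparsity (most dimensions ≈ 0), here is a hypothetical sketch that keeps only the largest non-zero components of each vector and quantizes their values. This is not Blackthorn's scheme; all names and parameters are invented for illustration.

```scala
// Hypothetical sparse-vector compression sketch (NOT Blackthorn's actual scheme):
// keep only the largest components of a mostly-zero feature vector and
// quantize their values to a single byte each (assumes activations in [0, 1]).
object SparseCompressionSketch {
  case class Compressed(indices: Array[Short], values: Array[Byte])

  def compress(dense: Array[Float], keep: Int): Compressed = {
    val top = dense.zipWithIndex
      .filter { case (v, _) => v > 0f }       // drop the (many) zero components
      .sortBy { case (v, _) => -v }
      .take(keep)                             // keep only the strongest components
    Compressed(
      top.map { case (_, i) => i.toShort },
      top.map { case (v, _) => math.min(255, (v * 255).round).toByte }
    )
  }

  // Scoring against a (dense) model only touches the stored components.
  def score(c: Compressed, weights: Array[Float]): Float =
    c.indices.zip(c.values).map { case (i, v) =>
      weights(i) * ((v & 0xFF) / 255f)
    }.sum
}
```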
Blackthorn Overview
49
Experimental Task
50
- Find images relating to a particular city
- Compare to standard linear SVM
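As a generic sketch of what one interactive round of such a task looks like (not Blackthorn's implementation; the learner here is a plain perceptron-style linear scorer rather than its actual model): the user labels a few items, the model is updated, the whole collection is re-scored, and the top-scoring items are shown next.

```scala
// Generic interactive-learning round (illustrative, not Blackthorn's code):
// user labels a few items, a linear model is updated, the collection is
// re-scored, and the top-scoring item ids are returned for display.
object InteractiveRoundSketch {
  def update(w: Array[Float], x: Array[Float], label: Int, lr: Float): Unit =
    for (i <- x.indices) w(i) += lr * label * x(i)   // label: +1 relevant, -1 not

  def round(collection: Seq[(Long, Array[Float])],
            labelled: Seq[(Array[Float], Int)],
            w: Array[Float],
            topK: Int): Seq[Long] = {
    labelled.foreach { case (x, y) => update(w, x, y, lr = 0.1f) }
    collection
      .map { case (id, x) => (id, x.zip(w).map { case (a, b) => a * b }.sum) }
      .sortBy { case (_, s) => -s }
      .take(topK)
      .map(_._1)
  }
}
```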
Blackthorn Results: 1.2M Collection
51
- Compression: 880GB → 5GB
- Precision: 89-108% of SVM
- Scoring time:
60-80x faster
- Recall over time:
Blackthorn rocks!
Blackthorn Results: YFCC100M Collection
- Scoring time: ~1 second!
52
Blackthorn Future Work
- More (user) evaluation is needed
- Other applications may (will) require adaptations
- Further scalability:
Combine eCP and Blackthorn
53
Outline
- Motivation:
Scalable multimedia analytics
- Batch Layer:
Spark and 43 billion high-dim features
- Service Layer:
Blackthorn and 100 million images
- Conclusion:
Importance and challenges of scale!
54
Why Scale?
55
- Current and future applications
- Future of computing
- Because we cannot yet!
“We choose to go to the moon. We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard, because that goal will serve to organize and measure the best of our energies and skills, because that challenge is one that we are willing to accept, one we are unwilling to postpone, and one which we intend to win, and the others, too.” — JFK, September 12, 1962
Scalability Hurdles
56
Industry-Level Collections
– Data
– Processing capacity
Small-Minded Reviewers
– “Are there users willing to explore 100M data sets interactively?”
– Too much engineering, not enough “science”
Interactive Applications
– Application knowledge
– User study “victims”
Summary
- Motivation:
Scalable multimedia analytics
- Batch Layer:
Spark and 43 billion high-dim features
- Service Layer:
Blackthorn and 100 million images
- Conclusion:
Importance and challenges of scale!
57