VCI2R at the NTCIR-13 Lifelog-2 LSAT Task. Presented by: Qianli Xu




SLIDE 1

VCI2R at the NTCIR-13 Lifelog-2 LSAT Task

Presented by: Qianli Xu
Co-authors: Jie Lin, Ana del Molino, Qianli Xu, Fen Fang, V. Subbaraju, Joo-Hwee Lim, Liyuan Li, V. Chandrasekhar
Organization: Institute for Infocomm Research, A*STAR, Singapore

SLIDE 2

About VCI2R

  • Institute for Infocomm Research (I2R), A*STAR, Singapore
    – Visual Computing
    – Human Language Tech
    – Data Analytics
    – Neural Biomedical Tech
    – etc.
  • Visual Computing Department
    – Video/image analytics & search
    – Augmented visual intelligence
    – Visual inspection

Website: www.a-star.edu.sg/i2r/

SLIDE 3

LSAT Framework

[Figure: LSAT framework diagram. Inputs: query topic, lifelog images, training images, time tag, location tag, # people. Components: object classifier and places classifier (CNN), object detector (Faster R-CNN), NTCIR-13 classifier, relevant concepts, feature weights w1 … w7, temporal smoothing. Other labels: user-given, online, offline; semantic gap between image + metadata and query topics.]

  • Relevant concepts: Which CNN predictions are relevant to the query topics?
  • Feature weighting: Which features contribute the most?
  • Temporal smoothing: Enforce temporal coherence, remove outliers
  • Post-filtering: Refine the search using location (GPS) and time

“Castle @ Night” “Working in a coffee shop” “Gardening in my home”

del Molino, et al., 2017, VC-I2R at ImageCLEF2017: Ensemble of deep learned features for lifelog video summarization. CLEF Working Notes, CEUR.

SLIDE 4
1. Getting the Basic Semantics

  • CNN classifiers
    – Object: ResNet152, ImageNet1K
    – Place: ResNet152, Places365
  • CNN detector
    – Faster R-CNN, MSCOCO (80)
  • NTCIR-13 classifier
    – VGG-16, ImageNet1K
    – Replace the last layer (1K neurons) with 634 neurons
    – Sigmoid as the activation function
  • Human detection and counting
    – Sighthound (https://www.sighthound.com)
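The slide does not say why the NTCIR-13 classifier uses a sigmoid output. One plausible reading is that the 634 outputs are treated as independent concepts, several of which can be present in the same image. The sketch below only illustrates that difference between sigmoid and softmax outputs; the three logits are hypothetical values, not outputs of the actual network.

```python
import math


def sigmoid(x: float) -> float:
    """Squash one logit into (0, 1), independently of the other logits."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Normalise logits into a distribution that sums to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three of the 634 concepts (illustrative only).
logits = [2.0, 1.5, -3.0]

sigmoid_scores = [sigmoid(v) for v in logits]
softmax_scores = softmax(logits)

# With sigmoid, the first two concepts can both be "on" at the same time.
active = [i for i, s in enumerate(sigmoid_scores) if s > 0.5]
print(active, [round(s, 3) for s in sigmoid_scores])
```

With softmax the scores compete for a total mass of 1, whereas the sigmoid scores are scored concept by concept, which suits images annotated with several concepts at once.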

SLIDE 5
2. Aggregating & Weighing Features

Relevance mapping for each topic
(column order: Task | Objects: Relevant, Avoid | Places: Relevant, Avoid | MSCOCO: Relevant)

Task 1: computer group meeting computer group meeting etc. laptop keyboard
Task 2: television food glass computer group meeting living room television room etc. conference room lecture room etc. tv remote etc.
Task 3: computer group meeting office coffee shop living room etc. conference room office etc. laptop keyboard
Task 4: computer pencil notebook office living room hotel room etc. conference room office etc. laptop book etc.
Task 5: food glass drum white goods menu food court restaurant etc. fork sandwich etc.

CRF for feature weighting that accommodates individual differences

$$E_\theta(s) = \lambda \underbrace{\sum_i \phi_u(s_i)}_{\text{unary}} + \underbrace{\sum_{ij} \phi_p(s_i, s_j)}_{\text{pairwise}}$$

The unary potentials enforce the selection of static …

[Figure: feature weights w1 … w12 applied to the feature sources ImageNet1K, Places365, MSCOCO, NTCIR, Time, # People, Location tag. Other labels: Training Images, Feature weight, Relevant concepts.]
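As a rough illustration of how a relevance mapping and feature weights could combine into one score per image, here is a minimal sketch. It is not the CRF-based weighting named on the slide: it uses a plain weighted sum, and the rule "mean of relevant activations minus mean of avoided activations" is an assumption made for this example. All function names, activations, and weights are hypothetical.

```python
def source_score(activations, relevant, avoid):
    """Score one feature source (e.g. Places365) for one topic.

    Assumed rule: mean activation of the relevant concepts minus the mean
    activation of the concepts to avoid. Missing concepts count as 0.
    """
    def mean_of(names):
        values = [activations.get(name, 0.0) for name in names]
        return sum(values) / len(values) if values else 0.0

    return mean_of(relevant) - mean_of(avoid)


def image_score(per_source_scores, weights):
    """Weighted sum over feature sources (a stand-in for the learned weights)."""
    return sum(weights[name] * score for name, score in per_source_scores.items())


# Hypothetical activations for one image, using concepts from task 2.
objects = {"television": 0.8, "computer": 0.2}
places = {"living room": 0.7, "television room": 0.5, "conference room": 0.1}

scores = {
    "objects": source_score(objects, ["television"], ["computer"]),
    "places": source_score(
        places,
        ["living room", "television room"],
        ["conference room", "lecture room"],
    ),
}
weights = {"objects": 0.5, "places": 0.5}  # hypothetical, not the tuned values

total = image_score(scores, weights)
print(round(total, 3))
```

Per-user or per-event adaptation would amount to swapping in a different `weights` dictionary for each user or topic.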

SLIDE 6
3. Temporal Smoothing

  • Adjacent lifelog images may belong to the same event.
  • Temporal smoothing is used to ensure semantic coherence.
  • A triangular window of size w is used; w is adapted to the event topic.
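Triangular-window smoothing of a sequence of per-image relevance scores can be sketched as follows. The slides give only the window shape and the fact that w depends on the topic, so the edge handling (renormalising over the part of the window that falls inside the sequence) and the restriction to odd w are assumptions of this example.

```python
def triangular_smooth(scores, w):
    """Smooth a time-ordered list of scores with a triangular window of odd size w.

    The centre image gets the largest weight and the weights fall off linearly
    towards the window edges. Near the ends of the sequence the weights are
    renormalised over the images that actually exist.
    """
    if w < 1 or w % 2 == 0:
        raise ValueError("w must be a positive odd integer")
    half = w // 2
    smoothed = []
    for i in range(len(scores)):
        num = 0.0
        den = 0.0
        for k in range(-half, half + 1):
            j = i + k
            if 0 <= j < len(scores):
                weight = half + 1 - abs(k)  # e.g. 1, 2, 1 for w = 3
                num += weight * scores[j]
                den += weight
        smoothed.append(num / den)
    return smoothed


# An isolated spike is spread over its neighbours.
print(triangular_smooth([0, 0, 1, 0, 0], 3))  # [0.0, 0.25, 0.5, 0.25, 0.0]
```

A larger w suits long events, while a small w preserves short events that last only a few images.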

4. Post-filtering

  • Increase the diversity of retrieved images (avoid retrieving images of the same event).
  • Use time and location (GPS) to filter images.
  • Exclude images that are close in time and location to images already retrieved.
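One way to realise this kind of post-filtering is a greedy pass over the ranked results that drops any image lying close, in both time and GPS distance, to an image already kept. The sketch below assumes exactly that; the thresholds, the dictionary fields, and the requirement that both conditions hold are illustrative choices, not values from the slides.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two GPS points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def post_filter(ranked, min_gap_s=1800, min_dist_m=200):
    """Keep images in rank order, skipping near-duplicates of kept images.

    An image is skipped when it is within min_gap_s seconds AND within
    min_dist_m metres of an image that was already kept.
    """
    kept = []
    for img in ranked:
        too_close = any(
            abs(img["t"] - k["t"]) < min_gap_s
            and haversine_m(img["lat"], img["lon"], k["lat"], k["lon"]) < min_dist_m
            for k in kept
        )
        if not too_close:
            kept.append(img)
    return kept


# Hypothetical ranked results (best first); t is seconds since midnight.
ranked = [
    {"id": "a", "t": 0, "lat": 1.30, "lon": 103.80},
    {"id": "b", "t": 60, "lat": 1.30, "lon": 103.80},    # same place, 1 min later
    {"id": "c", "t": 7200, "lat": 1.30, "lon": 103.80},  # same place, 2 h later
    {"id": "d", "t": 120, "lat": 1.40, "lon": 103.80},   # about 11 km away
]
kept_ids = [img["id"] for img in post_filter(ranked)]
print(kept_ids)  # ['a', 'c', 'd']
```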

SLIDE 7

Result

  • Official score (precision): 57.6%

[Figure: per-topic mAP (axis 0.2 to 1) for User 1 and User 2. Topics: Eat Lunch, Gardening, Castle at Night, Coffee, Sunset, Graveyard, Lecturing, Shopping, Working Late, On Computer, Cooking, Flying, Juice, Photo of Sea, Beers in Bar, Greek Amphit, TV Recording, Work w Coffee, Painting Walls, Eating Pasta, Exercises, Mountain Hiking, Turtles.]

SLIDE 8

Analysis (Fine-tuning)

[Figure: mAP for User 1 and User 2 under Fixed, Adaptive (User), and Adaptive (User + Event) settings. Values as extracted: 0.502, 0.528, 0.654, 0.748, 0.761, 0.826.]

Feature importance

Decrease in performance when we remove one type of feature. The bigger the decrease, the more important the feature.

[Figure: mAP (axis 0.4 to 0.9) for User 1 and User 2 with all features and with one feature type removed: − NTCIR-13, − ImageNet1K, − Places365, − MSCOCO, − Location, − Time, − #People.]

Effect of temporal smoothing

Whether temporal smoothing is performed or not.

[Figure: mAP for User 1 and User 2 with no smoothing and with temporal smoothing. Values as extracted: 0.528, 0.543, 0.761, 0.789.]

Effect of threshold for relevant concept searching

Semantic concepts whose activation level is above the threshold are considered relevant to the query topic.
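The thresholding rule for relevant concepts on this slide can be sketched as below. The slides do not say which images the activation levels come from, so this example assumes a handful of example images for the topic and averages their per-concept activations before comparing against the threshold. The function name and the data are hypothetical.

```python
def relevant_concepts(example_activations, threshold):
    """Return the concepts whose mean activation exceeds the threshold.

    example_activations: list of {concept: activation} dicts, one per
    example image of the query topic (an assumption of this sketch).
    """
    totals = {}
    for activations in example_activations:
        for concept, value in activations.items():
            totals[concept] = totals.get(concept, 0.0) + value
    n = len(example_activations)
    return sorted(c for c, total in totals.items() if total / n > threshold)


# Hypothetical activations for two example images of an "on computer" topic.
examples = [
    {"laptop": 0.9, "keyboard": 0.6, "cup": 0.1},
    {"laptop": 0.7, "keyboard": 0.2, "cup": 0.3},
]

strict = relevant_concepts(examples, 0.5)   # only strongly activated concepts
lenient = relevant_concepts(examples, 0.3)  # a lower threshold admits more
print(strict, lenient)
```

A lower threshold admits more concepts per topic, which may add useful evidence or noise depending on the topic.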

SLIDE 9

LIT

Summary

Effective Lifelog Image Retrieval: High Quality Data; Good Semantic Features; Reasonable Ground Truth; Intelligence in Interpretation of Query Topics; Intelligence in Model Fine-tuning

  • A lot of fine-tuning and manual intervention are involved in the retrieval → Over-fitting?
  • “Relevant” concepts may not be contributing, and vice versa.
  • Interactive retrieval is probably a good intermediate solution.

Email: qxu@i2r.a-star.edu.sg