  1. Generating Visual Explanations. Lisa Anne Hendricks et al. (Mar 2016), UC Berkeley. Presenter: Anurag Patil

  2. Outline 1. Motivation 2. The Problem and Importance 3. The Approach a. The Relevance Loss b. The Discriminative Loss 4. Dataset 5. Experiments and Results 6. Critique

  3. Motivation. Explainable AI: why should we care about it? Explainability is about trust: it is important to know why our self-driving car decided to slam on the brakes. Explanations are required for regulatory compliance in certain industries, e.g. medical diagnosis, or the Equal Credit Opportunity Act in the US. Explanations can also facilitate model validation and debugging: models learn associative (not necessarily causal) patterns in the training data, and explanations can reveal spurious associations. However, there is a tradeoff between performance and explainability.

  4. Motivation: Explainable Models. Two broad ideas: 1. Introspection explanation systems, which explain how a model determines its final output (e.g. "This is a Western Grebe because filter 2 has a high activation"). 2. Justification explanation systems, which produce sentences detailing how the visual evidence is compatible with the system's output (e.g. "This is a Western Grebe because it has red eyes..."). Here, we look at justification explanation systems because they are better suited for non-experts, and apply this idea of explainability to classification by visual systems.

  5. The Problem and Importance. Description: a sentence based only on visual information (image captioning systems). Visual explanation: a sentence that details why a certain category is appropriate for a given image, while only mentioning image-relevant features.

  6. The Approach. Condition language generation on both the image and the predicted class label (other captioning models condition only on visual features). This uses a fine-grained recognition pipeline plus a novel loss function to include class-discriminative information. Challenge: class specificity is a global sentence property, i.e. the words "black" or "red eye" are not very class-discriminative on their own, but the entire sentence "This is an all black bird with a bright red eye" is specific to the class Bronzed Cowbird. Typical loss functions only optimize the alignment between the generated sentence and the ground truth.

  7. Note on LRCN

  8. Model Inputs : [image, category label, ground truth sentence]
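
The model conditions sentence generation on both image features and the category. Below is a minimal PyTorch sketch of one way such conditioning can be wired into an LSTM decoder; the dimensions, the learned class embedding (the paper instead derives the class representation from averaged LSTM hidden states, see slide 10), and the simple concatenation scheme are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch (assumptions flagged above): an LSTM decoder conditioned on
# image features and a class label by concatenating them to each word input.
import torch
import torch.nn as nn

class ConditionedDecoder(nn.Module):
    def __init__(self, vocab_size, num_classes, img_dim=1024, emb_dim=256, hid_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.class_emb = nn.Embedding(num_classes, emb_dim)  # stand-in for the paper's class representation
        self.img_proj = nn.Linear(img_dim, emb_dim)
        # Input at each time step: [word embedding ; image features ; class embedding]
        self.lstm = nn.LSTM(emb_dim * 3, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, words, img_feats, class_ids):
        # words: (B, T) token ids, img_feats: (B, img_dim), class_ids: (B,)
        T = words.size(1)
        w = self.word_emb(words)                                      # (B, T, emb)
        i = self.img_proj(img_feats).unsqueeze(1).expand(-1, T, -1)   # (B, T, emb)
        c = self.class_emb(class_ids).unsqueeze(1).expand(-1, T, -1)  # (B, T, emb)
        h, _ = self.lstm(torch.cat([w, i, c], dim=-1))
        return self.out(h)                                            # (B, T, vocab) next-word logits

# Example usage with random inputs (batch of 4, sentence length 12):
decoder = ConditionedDecoder(vocab_size=1000, num_classes=200)
logits = decoder(torch.randint(0, 1000, (4, 12)), torch.randn(4, 1024), torch.randint(0, 200, (4,)))
```

Training such a decoder with cross-entropy against the ground-truth next word corresponds to the relevance loss discussed on the next slides.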

  9. Proposed Loss. The proposed loss combines two terms, a relevance loss and a discriminative loss (written out below). - The relevance loss (L_R) is related to image relevance. - The discriminative loss (E[R_D(w̃)]) is related to class relevance.
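
Written with explicit signs (a restatement, not a quote from the paper: both terms are treated here as quantities to maximize, and λ is the weight balancing them), the combined training objective is:

```latex
\max_{W}\;\; \mathcal{L}_R(W) \;+\; \lambda\,
\mathbb{E}_{\tilde{w}\sim p(w\mid I,C)}\!\big[R_D(\tilde{w})\big]
```

where L_R is the sentence log-likelihood (image relevance) and the expectation term is the expected classifier reward (class relevance).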

  10. Relevance Loss. Notation: N = batch size | w_t = ground-truth word | I = image | C = category (loss reconstructed below). - Produces sentences that correspond to the image content. - Does not explicitly encourage sentences that are both image relevant and category specific. - Class labels: represented by the average hidden state of another, separate LSTM trained to generate word sequences conditioned on images (averaged across all sequences for all classes in the train set).
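
With this legend, the relevance loss can be reconstructed as the standard average per-word log-likelihood over a batch (a reconstruction from the definitions above, in the maximization convention of slide 9; T denotes the sentence length):

```latex
\mathcal{L}_R \;=\; \frac{1}{N}\sum_{n=1}^{N}\sum_{t=0}^{T-1}
\log p\!\left(w_{t+1}\,\middle|\,w_{0:t},\, I_n,\, C_n\right)
```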

  11. Discriminative Loss. Notation: p(w|I,C) = the model's estimated conditional distribution | w̃ = a description sampled from the LSTM (p(w|I,C)) | R_D(w̃) = reward for the sampled description | E[R_D(w̃)] = expectation of the reward (formalized below). - Based on a reinforcement learning paradigm: Agent = the LSTM; Environment = the previously generated words; Action = predict the next word based on the policy and the environment; Policy = defined by the LSTM weights W. - R_D(w̃) = p_D(C|w̃), where p_D(C|w̃) is a pretrained sentence classifier. - The accuracy of this (pretrained) classifier is not important (22%); it is applied to sentences sampled from the LSTM (p_L(w)).
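
Putting the notation together, the discriminative term is the expected classifier score of the sampled description under the true class (a restatement of the bullets above):

```latex
\mathbb{E}_{\tilde{w}\sim p(w\mid I,C)}\!\big[R_D(\tilde{w})\big],
\qquad R_D(\tilde{w}) \;=\; p_D\!\left(C \mid \tilde{w}\right)
```

Maximizing this expectation pushes the generator toward sentences that the sentence classifier can attribute to the correct class; it enters the overall objective with the weight λ from slide 9.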

  12. Minimizing the Loss. - Since the expectation over descriptions, E[R_D(w̃)], is intractable, it is estimated with Monte Carlo sampling from the LSTM, p(w|I,C). - p(w|I,C) is a discrete distribution. - To avoid differentiating R_D(w̃) with respect to W, the REINFORCE property is used (see the sketch below). - The final gradient used to update the weights W combines both terms, where log p(w̃) = log-likelihood of the sampled description and L_R = log-likelihood of the ground-truth description.
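
The REINFORCE property referred to above is the identity ∇_W E_{w̃∼p_W}[R_D(w̃)] = E_{w̃∼p_W}[R_D(w̃) ∇_W log p_W(w̃)], which lets the expectation be estimated from Monte Carlo samples without differentiating R_D itself; the overall weight update then combines ∇_W L_R with λ times this estimate (in the maximization convention used above). The toy sketch below, which uses a 5-way softmax in place of the sentence distribution and an arbitrary fixed reward vector, only checks the identity numerically; it is illustrative, not the paper's training code.

```python
# Toy check of the REINFORCE identity:
#   grad_W E_{x~p_W}[R(x)] = E_{x~p_W}[ R(x) * grad_W log p_W(x) ]
# Here p_W is a small softmax over 5 outcomes (standing in for whole
# sampled sentences) and R is an arbitrary fixed reward vector.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=5)                      # "policy" logits
R = np.array([0.1, 0.9, 0.2, 0.4, 0.05])    # reward for each outcome

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(W)

# Closed-form gradient of E[R] w.r.t. the logits, for reference:
#   d/dW_j E[R] = p_j * (R_j - E[R])
exact_grad = p * (R - p @ R)

# REINFORCE / Monte Carlo estimate of the same gradient.
num_samples = 100_000
samples = rng.choice(len(W), size=num_samples, p=p)
grad_est = np.zeros_like(W)
for x in samples:
    grad_log_p = -p.copy()
    grad_log_p[x] += 1.0                    # grad of log softmax at the sampled index
    grad_est += R[x] * grad_log_p
grad_est /= num_samples

print("exact    :", np.round(exact_grad, 4))
print("estimate :", np.round(grad_est, 4))
```

In the actual model, p_W is the word-by-word LSTM distribution p(w|I,C) and the reward R_D is the pretrained sentence classifier's score p_D(C|w̃).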

  13. Dataset. Caltech-UCSD Birds: 200 classes of North American bird species | 11,788 images | 5 captions per image. - Every image belongs to a class, so each sentence and image is associated with a single label. - Captions give descriptive details about each bird class. - They do not explain why an image belongs to a certain class.

  14. Experiments. Baseline and ablation models: - Description model: generates sentences conditioned only on images (equivalent to LRCN). - Definition model: generates sentences using only the image label as input. - Explanation-label: not trained with the discriminative loss. - Explanation-discriminative: not conditioned on the predicted class. Metrics: - Image relevance: METEOR, CIDEr. - Class relevance: class similarity score, class rank.

  15. Results. Small gains on the automatic image-relevance metrics, but large gains on the class-relevance metrics.

  16. Results. Comparison of explanations, baselines, and ablations. - Green: correct, yellow: mostly correct, red: incorrect. - 'Red eye' is a class-relevant attribute.

  17. Results. Comparison of explanations and definitions. - The definition model can produce sentences that are not image relevant.

  18. Results Comparison of Explanations and Descriptions. - Both models generate visually correct sentences. - ’Black head’ is one of the most prominent distinguishing properties of this vireo type.

  19. Critique – The Good ● Motivation: ○ Novel motivation of making models more explainable to non-experts. ● Explanation model: ○ Novel loss function that captures a global sentence property. ○ The loss function is also broadly applicable beyond this task. ● Ablation study: ○ Ablations of all the important model components, with reasoning behind the design decisions.

  20. Critique – The not so good ● Motivation: ○ What if the underlying network feature was not identifying the red eye but instead the fact that there is a bird flying over water? There would be no way to know. ● Dataset: ○ Every image belongs to a class, so each sentence and image is associated with a single label. ● Explanation model: ○ Could the variance of the REINFORCE gradient estimate be reduced by including a baseline? ○ What about other reward functions based on class similarity or class rank? ○ Could attention layers be used to combine text and image features? ● Missing details: ○ Why didn't the accuracy of the LSTM sentence classifier matter? ● Evaluation methodology: ○ Why no comparison with other SOTA image captioning models? ● Human evaluation improvements: ○ Include the reason why evaluators ranked one sentence higher than another.

  21. References - Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, Trevor Darrell, Generating Visual Explanations, European Conference on Computer Vision (ECCV), 2016

  22. Additional Examples
