Learning Multi-Modal Grounded Linguistic Semantics by Playing “I Spy” - PowerPoint PPT Presentation




SLIDE 1

Tong Gao

March 2020

Learning Multi-Modal Grounded Linguistic Semantics by Playing “I Spy”

SLIDE 2

Introduction

  • Early work on grounded language learning enabled a machine to map from adjectives and nouns to objects in a scene
  • But other sensory modalities, such as haptic and auditory, are also useful
  • This paper proposes the first robotic system to perform natural language grounding using multi-modal sensory perception

SLIDE 3

Robot Configuration

  • Kinova MICO arm mounted on top of a custom-built mobile base
  • Perception:
    – Sensors in each motor
    – A microphone
    – Xtion ASUS Pro RGBD camera

SLIDE 4

Robot Actions

  • Grasp
  • Lift
  • Hold
  • Lower
  • Drop
  • Push
  • Press
SLIDE 5

Objects in the Dataset

  • 32 common household items
    – Cups
    – Bottles
    – Cans
  • Some contain liquids or other contents
  • Others are empty
SLIDE 6

Sensory Data Gathering

  • Given a context $c \in C$ and an object $i \in O$:
  • The robot performed the full sequence of exploratory actions on object $i$ five times; let the set $X_i^c$ contain all five feature vectors.

SLIDE 7

Visual Context

  • Robot performs a look action, which produces:
    – RGB color histogram, 8 bins per channel
    – Fast point feature histogram (FPFH) shape features
    – Deep visual features from a 16-layer VGG network
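The color feature listed above can be illustrated with a short sketch. It assumes 8-bit pixel values and per-channel normalization, neither of which the slide specifies; the function name `rgb_histogram` and the list-of-tuples input are illustrative only.

```python
def rgb_histogram(pixels, bins_per_channel=8):
    """Concatenate one normalized histogram per color channel.

    pixels: iterable of (r, g, b) tuples with 8-bit values (0-255), an
    assumed input format. Returns 3 * bins_per_channel floats.
    """
    pixels = list(pixels)
    if not pixels:
        raise ValueError("need at least one pixel")
    bin_width = 256 // bins_per_channel  # 32 intensity levels per bin for 8 bins
    hist = [0.0] * (3 * bins_per_channel)
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            # Clamp so bin counts that do not divide 256 stay in range.
            index = min(value // bin_width, bins_per_channel - 1)
            hist[channel * bins_per_channel + index] += 1.0
    # Normalize so each channel's bins sum to 1, independent of image size.
    return [count / len(pixels) for count in hist]


# Three made-up pixels: two reddish, one blue.
features = rgb_histogram([(255, 0, 0), (250, 10, 5), (0, 0, 255)])
```

With 8 bins per channel the result is a 24-dimensional vector, which is small next to the deep visual features and easy to inspect by hand.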

SLIDE 8

Multimodal Context

Record haptic and auditory data while executing each action; record proprioceptive data during the grasp action.

  • Haptics & proprioception: joint efforts & joint positions, recorded for 6 joints at 15 Hz
  • Audio: Discrete Fourier Transform, 65 frequency bins
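A real-valued window of N samples has N/2 + 1 distinct DFT bins, so 65 bins is what a 128-sample window would give. The sketch below assumes that window length, which the slide does not state, and uses a synthetic tone in place of a recorded sound.

```python
import cmath
import math


def dft_magnitudes(window):
    """Magnitudes of the non-redundant DFT bins of a real-valued window.

    A window of N real samples has N // 2 + 1 distinct frequency bins,
    so the assumed N = 128 gives 65.
    """
    n = len(window)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(
            sample * cmath.exp(-2j * math.pi * k * t / n)
            for t, sample in enumerate(window)
        )
        mags.append(abs(coeff))
    return mags


# Synthetic stand-in for a recorded sound: a pure tone centered on bin 5.
N = 128
tone = [math.sin(2 * math.pi * 5 * t / N) for t in range(N)]
spectrum = dft_magnitudes(tone)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

The direct O(N²) sum is used here only to keep the example dependency-free; a real pipeline would use an FFT routine.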

SLIDE 9

“I Spy” Task

  • Human and robot take turns describing objects from among 4 on a tabletop
  • Participants describe objects using attributes, and the robot guesses
    – E.g. “black rectangle” as opposed to “whiteboard eraser”
  • Robot describes a random object with up to 3 predicates, and participants guess

SLIDE 10

“I Spy” Task

  • Metrics: robot guess and human guess (the number of guesses needed to identify the object on each side's turn)
  • Compare performance of two playing systems: multi-modal and vision-only

SLIDE 11

Predicate Classifier

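For each language predicate $p$, a classifier $G_p \in [-1, 1]$ is learned to decide whether an object possesses the attribute $p$. $G_p$ combines one weighted term per context:

$\kappa_{c_1} \times M_{c_1}(X_i^{c_1})$, $\kappa_{c_2} \times M_{c_2}(X_i^{c_2})$, …, $\kappa_{c_{18}} \times M_{c_{18}}(X_i^{c_{18}})$

  • $M_c(X_i^c) \in [-1, 1]$: a quadratic-kernel SVM
  • $\kappa_c \in [0, 1]$: Cohen’s kappa, measuring the performance of $M_c$ on the ground truth labels
  • Number of contexts: 18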
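One way to read this slide is as a kappa-weighted sum of per-context decisions. The sketch below follows that reading and makes two assumptions that are not on the slide: kappas are clipped to [0, 1], and the weights are normalized so the result stays in [-1, 1]. The per-context decisions are made-up numbers standing in for SVM outputs, and the context names are illustrative.

```python
def cohen_kappa(truth, predicted):
    """Cohen's kappa for two equal-length lists of labels."""
    n = len(truth)
    labels = set(truth) | set(predicted)
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    expected = sum(
        (truth.count(label) / n) * (predicted.count(label) / n)
        for label in labels
    )
    if expected == 1.0:  # both lists use the same single label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def combine_contexts(decisions, kappas):
    """Kappa-weighted combination of per-context decisions in [-1, 1].

    Assumption: kappas are clipped to [0, 1] and normalized to sum to 1,
    which keeps the result in [-1, 1].
    """
    weights = {c: min(1.0, max(0.0, k)) for c, k in kappas.items()}
    total = sum(weights.values())
    if total == 0.0:
        return 0.0  # no context is trusted, so no decision either way
    return sum(weights[c] * decisions[c] for c in decisions) / total


# Hypothetical numbers for one predicate on one object, two of the 18 contexts.
kappas = {"grasp/audio": 0.5, "look/color": 0.0}
decisions = {"grasp/audio": 0.8, "look/color": -1.0}
score = combine_contexts(decisions, kappas)
kappa = cohen_kappa([1, 1, -1, -1], [1, 1, -1, 1])
```

Here the color context has kappa 0, so its strong negative vote is ignored and the combined score equals the audio context's decision. The sign of the score is the decision and its magnitude is the confidence.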

SLIDE 12

For example…

  • With a wide, yellow cylinder, we want to determine whether “fat” is applicable to it
  • $G_{fat}(\text{wide-yellow-cylinder}) = 0.137$
    – Correlated, since the sign is positive
    – With confidence $|0.137| = 0.137$
  • $\kappa_{grasp,auditory} = 0.515$
    – In $G_{fat}$, we are confident in the decision made by classifier $M_{grasp,auditory}$

SLIDE 13

Grounded Language Learning – Human Turn

  • Participant picks one of the four objects and describes it in one phrase
  • Robot:
    – Strips out stopwords; the remaining words are treated as a set $H_p$ of language predicates
    – Computes the sum of scores $G_p$ for $p \in H_p$ for each object
    – Guesses objects in descending order by score (ties are broken randomly)
    – Adds a positive training example for all $p \in H_p$ after a correct guess
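The robot's guessing procedure on the human turn can be sketched as follows. The stopword list, the nested-dictionary layout of the scores, and the function name `rank_objects` are all illustrative assumptions; the score values are invented.

```python
import random

# Illustrative stopword list; the actual list is not given on the slide.
STOPWORDS = {"a", "an", "the", "is", "it", "and", "of", "that", "one"}


def rank_objects(description, scores, rng=random):
    """Return object ids ordered from best to worst guess.

    scores: {predicate: {object_id: score in [-1, 1]}} (assumed layout).
    Predicates the robot has never heard contribute 0 to every object.
    """
    predicates = {w for w in description.lower().split() if w not in STOPWORDS}
    objects = sorted({obj for per_obj in scores.values() for obj in per_obj})
    totals = {
        obj: sum(scores.get(p, {}).get(obj, 0.0) for p in predicates)
        for obj in objects
    }
    # Shuffle first: Python's sort is stable, so tied objects keep the random order.
    rng.shuffle(objects)
    return sorted(objects, key=lambda obj: totals[obj], reverse=True)


scores = {
    "black": {"o1": 0.75, "o2": -0.5, "o3": 0.25, "o4": -1.0},
    "rectangle": {"o1": 0.25, "o2": 0.5, "o3": -0.5, "o4": -1.0},
}
guesses = rank_objects("the black rectangle", scores, random.Random(0))
```

The robot would then point at `guesses[0]`, move on to the next entry after each wrong guess, and add positive examples for "black" and "rectangle" once it guesses correctly.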

SLIDE 14

Grounded Language Learning – Robot Turn

  • Robot attempts to describe the chosen object unambiguously, using predicates
  • Denote $O_T$ the set of objects on the table during a given game. For the chosen object $i^*$, compute a score $R$ for each predicate $p$ (the formula appears on the original slide)
  • Choose up to 3 highest scoring predicates $\hat{P}$
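The scoring formula itself did not survive in this transcript, so the sketch below uses one plausible form as an assumption: reward a predicate's score on the chosen object and subtract its scores on the other objects on the table. The positive-score cutoff and the function name `describe` are also assumptions, and the numbers are invented.

```python
def describe(chosen, table, scores, max_predicates=3):
    """Pick up to max_predicates predicates that single out `chosen`.

    Assumed score: R(p) = |O_T| * G_p(i*) - sum of G_p(j) over the other
    objects j on the table. This is an illustrative choice, not a formula
    taken from the slide.
    """
    ranked = []
    for predicate, per_obj in scores.items():
        others = sum(per_obj[obj] for obj in table if obj != chosen)
        ranked.append((len(table) * per_obj[chosen] - others, predicate))
    ranked.sort(reverse=True)
    # Keep only predicates with a positive score (an added assumption).
    return [p for r, p in ranked[:max_predicates] if r > 0]


table = ["o1", "o2", "o3", "o4"]
scores = {
    "red": {"o1": 0.5, "o2": -0.5, "o3": -0.5, "o4": -0.5},
    "tall": {"o1": 0.5, "o2": 0.5, "o3": 0.5, "o4": 0.5},
    "heavy": {"o1": -0.5, "o2": 0.25, "o3": 0.25, "o4": 0.25},
    "full": {"o1": 0.25, "o2": -0.25, "o3": 0.0, "o4": 0.0},
}
description = describe("o1", table, scores)
```

In this toy table "red" ranks first because it applies to the chosen object and to none of the others, while "tall" ranks last among the selected predicates because it applies to every object equally.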

SLIDE 15

Grounded Language Learning – Follow up

  • In addition to $\hat{P}$, the robot also selects $5 - |\hat{P}|$ additional predicates whose confidences are likely to be close to zero
  • Asks participants whether or not each predicate $p$ can be applied to the object $i^*$
    – Collects more positive/negative examples
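The follow-up selection resembles uncertainty sampling: ask about the predicates the robot is least sure of. A minimal sketch, assuming confidence is the absolute value of a predicate's score on the chosen object; the function name and the numbers are hypothetical.

```python
def follow_up_predicates(chosen, described, scores, total_questions=5):
    """Choose extra predicates to ask about, least confident first.

    Confidence is taken to be the absolute score on the chosen object
    (an assumption). Returns total_questions - len(described) predicates.
    """
    candidates = [p for p in scores if p not in described]
    candidates.sort(key=lambda p: abs(scores[p][chosen]))
    return candidates[: max(0, total_questions - len(described))]


# Made-up scores for the chosen object "o1" only.
scores = {
    "red": {"o1": 0.5},
    "full": {"o1": 0.25},
    "tall": {"o1": 0.5},
    "heavy": {"o1": -0.5},
    "round": {"o1": 0.0},
    "shiny": {"o1": -0.125},
    "soft": {"o1": 0.75},
}
extra = follow_up_predicates("o1", ["red", "full", "tall"], scores)
```

Labels gathered for near-zero predicates are the ones most likely to change a classifier's decision, which is why they are worth a participant's time.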

SLIDE 16

Experiments

  • 42 human participants
  • Undergraduate & graduate students + some staff at our university
  • Divide 32-object dataset into 4 folds
  • ≥ 10 human participants played “I Spy” on both systems for each fold
  • Each participant played 4 games
SLIDE 17

Experiments

  • For fold 0, the systems were undifferentiated, so only one set of 2 games was played by each participant
  • For subsequent folds, the systems were incrementally trained using labels from previous folds only, such that the systems were always being tested against novel, unseen objects
SLIDE 18

Quantitative Results – Robot Guess

SLIDE 19

Quantitative Results – Human Guess

  • Human guesses hovered around 2.5 throughout all levels of training and sets of objects.
  • Reflection: Confidence $\kappa$ can easily reach 1 with only a few examples. Could the system perform better if $R$ favored predicates trained with many examples?

SLIDE 20

Quantitative Results – Predicate Agreement

  • Train the predicate classifiers using leave-one-out cross validation over objects
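Splitting over objects rather than over individual observations keeps all of an object's feature vectors out of training when that object is tested. The sketch below shows the split and an agreement rate; `train_and_predict` is a hypothetical callback standing in for fitting the predicate classifiers, and the majority-vote stand-in and labels are invented.

```python
def leave_one_object_out(object_ids):
    """Yield (train_ids, held_out_id) pairs, holding out each object once."""
    ids = list(object_ids)
    for held_out in ids:
        yield [obj for obj in ids if obj != held_out], held_out


def agreement(labels, train_and_predict):
    """Fraction of held-out objects whose predicted label matches `labels`.

    train_and_predict(train_labels, held_out_id) is a hypothetical callback
    standing in for fitting the predicate classifiers and scoring one object.
    """
    hits = 0
    for train_ids, held_out in leave_one_object_out(labels):
        train_labels = {obj: labels[obj] for obj in train_ids}
        hits += train_and_predict(train_labels, held_out) == labels[held_out]
    return hits / len(labels)


def majority(train_labels, held_out_id):
    """Toy stand-in: predict the majority training label (ties count as True)."""
    positives = sum(train_labels.values())
    return positives * 2 >= len(train_labels)


# Made-up labels for one predicate over four objects.
labels = {"o1": True, "o2": True, "o3": True, "o4": False}
splits = list(leave_one_object_out(labels))
rate = agreement(labels, majority)
```

With 32 objects this procedure trains each predicate classifier 32 times, once per held-out object.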

SLIDE 21

Qualitative Results – When Does Multi-Modal Help?

SLIDE 22

Qualitative Results – Correlations to physical properties

  • Compute Pearson’s correlation $r$ between each predicate's decisions and the objects' weight, height and width
  • Vision only: no predicates correlated against these physical object features
  • Multi-modal:
    – Tall - height (0.521)
    – Small - width (-0.665)
    – Water - weight (0.814)
    – Blue - weight (0.549, spurious)
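For reference, Pearson's r between a predicate's per-object decisions and a measured property can be computed as below. The decisions and heights are made-up values chosen to be perfectly linear, not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up decisions for one predicate and made-up object heights in cm.
decisions = [-1.0, -0.5, 0.5, 1.0]
heights = [5.0, 10.0, 20.0, 25.0]
r = pearson_r(decisions, heights)
```

A value near +1 or -1 indicates a strong linear relation, but as the "blue" row shows, a sizeable r on 32 objects can still be spurious.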

SLIDE 23

Critique

  • Do pre-defined actions really produce informative audio?
  • To what extent should the robot apply force, so that sound can be produced while the object is not destroyed?
  • While some features might be too complicated for an SVM, others might be redundant; an ablation study would help
  • Lack of generalization ability to new predicates?
    – Classify new predicates by their distance to learned predicates in WordNet or word embeddings?
  • Qualitative results only cared about the weight, height and width
    – If that is all they want, why not measure these properties with a ruler and a weighing scale?
    – Should select some physical properties that can only be obtained by a multi-modal system.

SLIDE 24

Thank you!