 
Learning Multi-Modal Grounded Linguistic Semantics by Playing “I Spy”
Tong Gao
March 2020
Introduction
• Early work on grounded language learning enabled a machine to map from adjectives and nouns to objects in a scene
• But other sensory modalities, such as haptic and auditory, are also useful
• This paper proposes the first robotic system to perform natural language grounding using multi-modal sensory perception
Robot Configuration
• Kinova MICO arm mounted on top of a custom-built mobile base
• Perception:
  – Sensors in each motor
  – A microphone
  – Xtion Asus Pro RGBD camera
Robot Actions
• Grasp
• Lift
• Hold
• Lower
• Drop
• Push
• Press
Objects in the Dataset
• 32 common household items
  – Cups
  – Bottles
  – Cans
  – …
• Some contain liquids or other contents
• Others are empty
Sensory Data Gathering
• Given a context $c \in C$ and an object $i \in O$,
• The robot performed the full sequence of exploratory actions on object $i$ five different times; let the set $X_i^c$ contain all five feature vectors.
Visual Context
• The robot performs a look action, which produces:
  – RGB color histogram, 8 bins per channel
  – Fast point feature histogram (FPFH) shape features
  – Deep visual features from the 16-layer VGG network
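The 8-bins-per-channel color histogram can be sketched as follows. This is a minimal illustration, assuming 8-bit pixels given as (r, g, b) tuples; the function name `rgb_histogram` and the normalization by pixel count are assumptions for the sketch, not details from the slides.

```python
def rgb_histogram(pixels, bins=8):
    """Concatenate per-channel histograms (R, G, B), each with `bins` bins.

    Assumes 8-bit channel values in 0..255. Normalizing by the pixel
    count is an illustrative choice, not taken from the slides.
    """
    width = 256 // bins                      # 32 intensity levels per bin when bins=8
    hist = [0.0] * (3 * bins)
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            hist[channel * bins + min(value // width, bins - 1)] += 1
    n = max(len(pixels), 1)                  # avoid division by zero on empty input
    return [count / n for count in hist]
```

With 8 bins per channel the resulting feature vector has 24 entries.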
Multimodal Context
• Haptic and auditory data are recorded while the robot executes each action; proprioceptive information is recorded during the grasp action.
  – Haptics & proprioception: joint efforts and joint positions, recorded for 6 joints at 15 Hz
  – Audio: Discrete Fourier Transform, 65 frequency bins
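A real-valued window of N samples has N/2 + 1 non-redundant DFT bins, so 65 bins is what a 128-sample window would give. The sketch below illustrates this with a plain DFT; the 128-sample window and the test tone are assumptions for illustration, since the slides do not state the window length.

```python
import cmath
import math

def dft_magnitudes(window):
    """Magnitudes of the non-redundant DFT bins of a real-valued window."""
    n = len(window)
    mags = []
    for k in range(n // 2 + 1):              # bins 0 .. n/2 inclusive
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(window))
        mags.append(abs(acc))
    return mags

# Hypothetical 128-sample window containing a pure tone centered on bin 5.
window = [math.sin(2 * math.pi * 5 * t / 128) for t in range(128)]
spectrum = dft_magnitudes(window)            # 65 magnitudes, peak at index 5
```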
“I Spy” Task
• Human and robot take turns describing objects from among 4 on a tabletop
• Participants describe objects using attributes, and the robot guesses
  – E.g. “black rectangle” as opposed to “whiteboard eraser”
• The robot describes a random object with up to 3 predicates, and participants guess
“I Spy” Task
• Metrics: robot guess and human guess
• Compare the performance of two playing systems: multi-modal and vision-only
Predicate Classifier
• For each language predicate $p$, a classifier $G_p \in [-1, 1]$ is learned to decide whether objects possess the attribute $p$
• $M_c(X_i^c) \in [-1, 1]$: a quadratic-kernel SVM
• $\kappa_c \in [0, 1]$: Cohen’s kappa, measuring the performance of $M_c$ on the ground-truth labels
• Number of contexts: 18; context $c_k$ contributes the term $\kappa_{c_k} \times M_{c_k}(X_i^{c_k})$ to $G_p$, for $k = 1, \dots, 18$
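The two ingredients of this slide, Cohen’s kappa as a reliability weight and a kappa-weighted combination of per-context decisions, can be sketched as below. This is an illustration under stated assumptions: the slide does not spell out how the 18 terms are combined, so the normalized weighted sum and the clamping of negative kappas to 0 are assumptions chosen to keep the result in [-1, 1]. The function names are hypothetical.

```python
def cohens_kappa(truth, predicted):
    """Cohen's kappa for two equal-length label sequences."""
    truth, predicted = list(truth), list(predicted)
    n = len(truth)
    labels = set(truth) | set(predicted)
    p_observed = sum(t == p for t, p in zip(truth, predicted)) / n
    p_expected = sum((truth.count(l) / n) * (predicted.count(l) / n)
                     for l in labels)
    if p_expected == 1.0:
        return 0.0                           # degenerate case: only one label in use
    return (p_observed - p_expected) / (1.0 - p_expected)

def combine_contexts(kappas, decisions):
    """Kappa-weighted combination of per-context decisions in [-1, 1].

    Assumption: negative kappas are clamped to 0 and weights are
    normalized so the combined score stays in [-1, 1].
    """
    weights = [max(k, 0.0) for k in kappas]
    total = sum(weights)
    if total == 0.0:
        return 0.0                           # no context is trusted
    return sum(w * d for w, d in zip(weights, decisions)) / total
```

A context whose classifier agrees with the labels no better than chance gets weight 0 and so has no influence on the combined decision.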
For example…
• Given a wide, yellow cylinder, we want to determine whether “fat” applies to it
• $G_{\text{fat}}(\text{wide-yellow-cylinder}) = 0.137$
  – Correlated, since the sign is positive
  – With confidence $|0.137| = 0.137$
• $\kappa_{\text{grasp},\text{auditory}} = 0.515$
  – Within $G_{\text{fat}}$, this is our confidence in the decision made by classifier $M_{\text{grasp},\text{auditory}}$
Grounded Language Learning – Human Turn
• The participant picks one of the four objects and describes it in one phrase
• Robot
  – Strips out stopwords; the remaining words are treated as a set $H_p$ of language predicates
  – Computes the sum of scores $G_p$ over $p \in H_p$ for each object
  – Guesses objects in descending order by score
    • Ties are broken randomly
  – Adds a positive training example for all $p \in H_p$ after a correct guess
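The robot’s guessing procedure on the human turn can be sketched as follows. The stopword list, the function name `rank_guesses`, and the `score(predicate, obj)` callback standing in for the learned predicate classifier are all hypothetical; only the steps (strip stopwords, sum scores, sort descending, break ties randomly) come from the slide.

```python
import random

STOPWORDS = {"a", "an", "the", "is", "it", "and", "of"}   # illustrative list only

def rank_guesses(description, objects, score, rng=random):
    """Order objects by summed predicate score, highest first, random tie-break.

    `score(predicate, obj)` is a hypothetical stand-in for the classifier output.
    """
    predicates = {w for w in description.lower().split() if w not in STOPWORDS}
    totals = {obj: sum(score(p, obj) for p in predicates) for obj in objects}
    shuffled = list(objects)
    rng.shuffle(shuffled)                    # random order first; the stable sort
    return sorted(shuffled, key=lambda obj: -totals[obj])  # then keeps ties random
```

Shuffling before a stable sort is one simple way to get random tie-breaking without comparing random keys.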
Grounded Language Learning – Robot Turn
• The robot attempts to describe the object unambiguously with predicates
• Let $O_T$ denote the set of objects on the table during a given game. For the chosen object $i^*$, a score $R$ is computed for each predicate $p$ [equation not preserved in the slide text]
• Choose up to 3 highest-scoring predicates $\hat{P}$
Grounded Language Learning – Follow up
• In addition to $\hat{P}$, the robot also selects $5 - |\hat{P}|$ additional predicates whose confidences are likely to be close to zero
• It asks participants whether or not each such predicate $p$ can be applied to the object $i^*$
  – Collects more positive/negative examples
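One way to pick the follow-up predicates is to take those whose classifier output for the described object is nearest zero. The sketch below does this deterministically, which is an assumption: the slide only says the confidences are "likely to be close to zero", so the original system may have sampled instead. The name `follow_up_predicates` and the `scores` mapping are hypothetical.

```python
def follow_up_predicates(scores, chosen, budget=5):
    """Pick up to `budget - len(chosen)` extra predicates with |score| nearest zero.

    `scores` maps predicate -> classifier output for the described object.
    Deterministic selection is an assumption made for this sketch.
    """
    remaining = [p for p in scores if p not in chosen]
    remaining.sort(key=lambda p: abs(scores[p]))   # least confident first
    return remaining[:max(budget - len(chosen), 0)]
```

Asking about the least confident predicates is where a yes/no answer from the participant adds the most information.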
Experiments
• 42 human participants
• Undergraduate & graduate students + some staff at our university
• Divide the 32-object dataset into 4 folds
• ≥ 10 human participants played “I Spy” on both systems for each fold
• Each participant played 4 games
Experiments
• For fold 0, the systems were undifferentiated, so only one set of 2 games was played by each participant
• For subsequent folds, the systems were incrementally trained using labels from previous folds only, so they were always tested against novel, unseen objects
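The incremental protocol, training only on labels from earlier folds and testing on the current one, can be written down as a simple schedule. The function name `fold_schedule` is hypothetical; the 4-fold default follows the experiment description.

```python
def fold_schedule(num_folds=4):
    """Return (training_folds, test_fold) pairs for incremental training.

    When testing on fold k, only labels gathered on folds 0..k-1 are used,
    so the test objects are always unseen.
    """
    return [(list(range(k)), k) for k in range(num_folds)]
```

For fold 0 the training list is empty, which matches the slide’s remark that the two systems were undifferentiated at that point.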
Quantitative Results – Robot Guess
Quantitative Results – Human Guess
• Human guesses hovered around 2.5 throughout all levels of training and sets of objects
• Reflection: confidence $\kappa$ can easily reach 1 with only a few examples. Could the system perform better if $R$ favored predicates trained with many examples?
Quantitative Results – Predicate Agreement
• Train the predicate classifiers using leave-one-out cross validation over objects
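Leave-one-out cross validation over objects holds out one object at a time and trains on all the others. A minimal sketch of the split generation, with a hypothetical function name:

```python
def leave_one_object_out(objects):
    """Yield (training_objects, held_out_object) pairs, one per object."""
    for index, held_out in enumerate(objects):
        yield objects[:index] + objects[index + 1:], held_out
```

Splitting by object rather than by observation keeps all five observations of the held-out object out of the training set.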
Qualitative Results – When does multi-modal help?
Qualitative Results – Correlations to physical properties
• Compute Pearson’s correlation $r$ between the decision and the object’s weight, height and width
• Vision only: no predicates correlated with these physical object features
• Multi-modal:
  – Tall - height (0.521)
  – Small - width (-0.665)
  – Water - weight (0.814)
  – Blue - weight (0.549, spurious)
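For reference, Pearson’s correlation between a predicate’s decisions and a physical measurement can be computed as in this sketch. The function name and input layout (two equal-length lists, one value per object) are assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)    # undefined if either input is constant
```

A value near +1 or -1 means the decision tracks the measurement closely, positively or negatively; a value near 0 means no linear relationship.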
Critique
• Do the pre-defined actions really produce informative audio?
• How much force should the robot apply so that sound is produced without destroying the object?
• Some features might be too complicated for an SVM, while others might be redundant
  – Ablation study
• Lack of generalization ability to new predicates?
  – Classify new predicates by their distance to learned predicates in WordNet or word embeddings?
• The qualitative results only considered weight, height and width
  – If that is all they want, why not measure these properties with a ruler and a scale?
  – They should select physical properties that can only be obtained by a multi-modal system
Thank you!