Learning Multi-Modal Grounded Linguistic Semantics by Playing "I Spy" - PowerPoint PPT Presentation

  1. March 2020. Learning Multi-Modal Grounded Linguistic Semantics by Playing "I Spy". Presented by Tong Gao.

  2. Introduction
     • Early work on grounded language learning enabled a machine to map from adjectives and nouns to objects in a scene
     • But other sensory modalities, such as haptic and auditory, are also useful
     • This paper proposes the first robotic system to perform natural language grounding using multi-modal sensory perception

  3. Robot Configuration
     • Kinova MICO arm mounted on top of a custom-built mobile base
     • Perception:
       – Sensors in each motor
       – A microphone
       – Asus Xtion Pro RGBD camera

  4. Robot Actions
     • Grasp
     • Lift
     • Hold
     • Lower
     • Drop
     • Push
     • Press

  5. Objects in the Dataset
     • 32 common household items
       – Cups
       – Bottles
       – Cans
       – …
     • Some contain liquids or other contents
     • Others are empty

  6. Sensory Data Gathering
     • Given a context c ∈ C and an object i ∈ O, the robot performed the full sequence of exploratory actions on object i five different times; the set X_i contains all five resulting feature vectors.

  7. Visual Context
     • The robot performs a look action, which produces:
       – RGB color histogram, 8 bins per channel
       – Fast point feature histogram (FPFH) shape features
       – Deep visual features from a 16-layer VGG network
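
As a small illustration of the color part of these look features, here is a minimal sketch of an 8-bin-per-channel RGB histogram in NumPy; the exact binning and normalization are assumptions, since the slide does not specify them, and the FPFH and VGG features would need a point-cloud library and a deep-learning framework, so they are omitted.

```python
import numpy as np

def rgb_color_histogram(image, bins_per_channel=8):
    """8-bin-per-channel RGB color histogram (24-D vector) for a look observation.

    `image` is assumed to be an H x W x 3 uint8 array with values in 0-255;
    the normalization below is an illustrative choice, not taken from the paper.
    """
    channels = []
    for ch in range(3):
        counts, _ = np.histogram(image[:, :, ch], bins=bins_per_channel, range=(0, 256))
        channels.append(counts)
    hist = np.concatenate(channels).astype(float)
    return hist / hist.sum()  # normalize so the bins sum to 1
```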

  8. Multi-Modal Context
     • Record the haptic and auditory sensory modalities while executing actions; record proprioceptive information during the grasp action.
       – Haptics & proprioception: joint efforts and joint positions, recorded for 6 joints at 15 Hz
       – Audio: Discrete Fourier Transform, 65 frequency bins
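
A minimal sketch of how 65-bin audio features could be computed: a 128-point real DFT yields exactly 128 // 2 + 1 = 65 non-negative frequency bins, matching the count on the slide. The frame and hop lengths below are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def audio_dft_features(signal, frame_len=128, hop=64):
    """Per-frame magnitude spectra with 65 frequency bins.

    A 128-point real DFT gives 65 non-negative frequency bins; the frame
    and hop lengths here are illustrative assumptions.
    """
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.abs(np.fft.rfft(f)) for f in frames])  # shape (n_frames, 65)
```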

  9. “I Spy” Task
     • Human and robot take turns describing objects from among 4 on a tabletop
     • Participants describe objects using attributes; the robot guesses
       – E.g., “black rectangle” as opposed to “whiteboard eraser”
     • The robot describes a random object with up to 3 predicates; participants guess

  10. “I Spy” Task
     • Metrics: robot guess and human guess
     • Compare the performance of two playing systems: multi-modal and vision-only

  11. Predicate Classifier
     • Number of contexts: 18
     • For each language predicate p, a classifier G_p(X_i) ∈ [−1, 1] is learned to decide whether object i possesses the attribute p
     • G_p combines the kappa-weighted context decisions κ_{c_1}·M_{c_1}(X_i), κ_{c_2}·M_{c_2}(X_i), …, κ_{c_18}·M_{c_18}(X_i)
       – M_c(X_i) ∈ [−1, 1] is a quadratic-kernel SVM for context c
       – κ_c ∈ [0, 1] is Cohen’s kappa, measuring the performance of M_c on the ground-truth labels
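
A hedged sketch of how such a predicate classifier could be assembled with scikit-learn: one quadratic-kernel SVM per context, weighted by Cohen's kappa estimated from cross-validated predictions. Combining the contexts as a kappa-normalized vote and clamping negative kappa to zero are assumptions made to keep G_p in [−1, 1]; the paper's exact combination rule may differ.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import cross_val_predict

class PredicateClassifier:
    """Kappa-weighted ensemble of per-context quadratic-kernel SVMs (sketch)."""

    def fit(self, context_features, labels):
        """context_features: dict context name -> (n_objects, d_c) feature array.
        labels: array of +1/-1 ground-truth labels for this predicate
        (assumes a few labelled examples of each class are available)."""
        self.models, self.kappas = {}, {}
        for c, X in context_features.items():
            svm = SVC(kernel="poly", degree=2)           # quadratic-kernel SVM M_c
            preds = cross_val_predict(svm, X, labels, cv=3)
            # Clamp negative agreement to 0 so that kappa_c stays in [0, 1].
            kappa = max(0.0, cohen_kappa_score(labels, preds))
            self.models[c] = svm.fit(X, labels)
            self.kappas[c] = kappa
        return self

    def decide(self, object_features):
        """Return G_p in [-1, 1] for one object (dict context -> 1-D feature vector)."""
        num = den = 0.0
        for c, x in object_features.items():
            decision = self.models[c].predict(np.asarray(x).reshape(1, -1))[0]  # +1/-1
            num += self.kappas[c] * decision
            den += self.kappas[c]
        return num / den if den > 0 else 0.0
```

Under this sketch, the worked example on the next slide would correspond to decide(...) returning 0.137 for the "fat" classifier on the wide yellow cylinder.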

  12. For example…
     • Given a wide, yellow cylinder, we want to determine whether “fat” is applicable to it
     • G_fat(wide-yellow-cylinder) = 0.137
       – “fat” is judged to apply, since the sign is positive
       – With confidence |0.137| = 0.137
     • κ_{grasp, auditory} = 0.515
       – Within G_fat, we are confident in the decision made by the context classifier M_{grasp, auditory}

  13. Grounded Language Learning – Human Turn
     • The participant picks one of the four objects and describes it in one phrase
     • The robot:
       – Strips out stopwords; the remaining words are treated as a set H_p of language predicates
       – Computes, for each object, the sum of scores G_p for p ∈ H_p
       – Guesses objects in descending order by score
         • Ties are broken randomly
       – Adds a positive training example for every p ∈ H_p after a correct guess
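
A minimal sketch of the human-turn guessing logic described above; the stopword list and the PredicateClassifier.decide interface are illustrative assumptions carried over from the previous sketch.

```python
import random

STOPWORDS = {"a", "an", "the", "is", "it", "this", "that", "and", "of"}  # toy list

def human_turn_guesses(description, objects, predicate_classifiers, features):
    """Order the tabletop objects by summed predicate scores (sketch).

    description: the participant's phrase, e.g. "black rectangle"
    objects: ids of the objects currently on the table
    predicate_classifiers: dict predicate -> PredicateClassifier (previous sketch)
    features: dict object id -> per-context feature dict
    """
    H_p = {w for w in description.lower().split() if w not in STOPWORDS}
    scores = {
        obj: sum(predicate_classifiers[p].decide(features[obj])
                 for p in H_p if p in predicate_classifiers)
        for obj in objects
    }
    # Guess in descending order of score; ties are broken randomly.
    return sorted(objects, key=lambda o: (-scores[o], random.random()))
```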

  14. Grounded Language Learning – Robot Turn
     • The robot attempts to describe the chosen object with predicates (unambiguously)
     • Denote by O_T the set of objects on the table during a given game; for the chosen object i*, a score R is computed for each predicate p
     • The robot chooses up to 3 highest-scoring predicates P̂

  15. Grounded Language Learning – Follow-Up
     • In addition to P̂, the robot also selects 5 − |P̂| additional predicates, whose confidences are likely to be close to zero
     • It asks participants whether or not each such p can be applied to the object i*
       – This collects more positive/negative examples
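
A small sketch of this follow-up selection: among predicates not already used, pick those whose current decisions are closest to zero. The function and argument names are illustrative assumptions.

```python
def follow_up_predicates(chosen, scores_on_object, budget=5):
    """Pick extra predicates to ask about, preferring the least certain ones (sketch).

    chosen: the up-to-3 predicates already used to describe the object (P-hat)
    scores_on_object: dict predicate -> G_p score on the chosen object, in [-1, 1]
    budget: total number of predicates to discuss (5 on the slide)
    """
    remaining = [p for p in scores_on_object if p not in chosen]
    remaining.sort(key=lambda p: abs(scores_on_object[p]))  # closest to zero first
    return remaining[:max(0, budget - len(chosen))]
```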

  16. Experiments
     • 42 human participants
       – Undergraduate and graduate students, plus some staff at our university
     • The 32-object dataset is divided into 4 folds
     • ≥ 10 human participants played “I Spy” on both systems for each fold
     • Each participant played 4 games

  17. Experiments
     • For fold 0, the systems were undifferentiated, so only one set of 2 games was played by each participant
     • For subsequent folds, the systems were incrementally trained using labels from previous folds only, so that the systems were always tested against novel, unseen objects

  18. Quantitative Results – Robot Guess

  19. Quantitative Results – Human Guess
     • Human guesses hovered around 2.5 throughout all levels of training and sets of objects
     • Reflection: the confidence κ can easily reach 1 with only a few examples
       – Could the system perform better if R favored predicates trained on many examples?
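
A tiny worked example of that reflection: with only a handful of labelled objects, perfect agreement already yields κ = 1, so maximal confidence does not imply much evidence.

```python
from sklearn.metrics import cohen_kappa_score

# Four labelled objects and a context classifier that happens to get all of
# them right already earn the maximum confidence, kappa = 1.0.
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, -1]
print(cohen_kappa_score(y_true, y_pred))  # 1.0
```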

  20. Quantitative Results – Predicate Agreement
     • The predicate classifiers are trained using leave-one-out cross-validation over objects
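
A minimal sketch of leave-one-object-out evaluation for a single context, reusing the quadratic-kernel SVM from the earlier sketch; the evaluation in the paper aggregates over all contexts, which is omitted here.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def leave_one_object_out_predictions(X, y):
    """Hold out each object once, train on the rest, and predict the held-out object."""
    preds = np.zeros(len(y))
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = SVC(kernel="poly", degree=2).fit(X[train_idx], y[train_idx])
        preds[test_idx] = model.predict(X[test_idx])
    return preds  # compare against y to measure predicate agreement
```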

  21. Qualitative Results – When does multi-modal help?

  22. Qualitative Results – Correlations with Physical Properties
     • Compute Pearson’s correlation r between each predicate’s decision and the object’s weight, height, and width
     • Vision-only: no predicates correlated with these physical object features
     • Multi-modal:
       – tall – height (0.521)
       – small – width (−0.665)
       – water – weight (0.814)
       – blue – weight (0.549, spurious)
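
A small sketch of the correlation analysis, using SciPy's pearsonr on made-up placeholder numbers (not the paper's data), purely to show the computation.

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder decisions and measurements, purely illustrative.
tall_decisions = np.array([0.42, -0.10, 0.65, -0.33, 0.20])   # G_tall per object
heights_cm     = np.array([18.0,  7.5, 22.0,  6.0, 12.0])     # measured heights

r, p_value = pearsonr(tall_decisions, heights_cm)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```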

  23. Critique
     • Do the pre-defined actions really produce informative audio?
     • To what extent should the robot apply force, so that sound is produced while the object is not destroyed?
     • While some features might be too complicated for an SVM, others might be redundant – an ablation study would help
     • Lack of generalization to new predicates?
       – Could new predicates be classified by their distance to learned predicates in WordNet or word embeddings?
     • The qualitative results only considered weight, height, and width
       – If that is all they want, why not measure these properties with a ruler and a weight scale?
       – They should select physical properties that can only be obtained by the multi-modal system.

  24. Thank you!
