Part 3: Audio-Visual Child-Robot Interaction, by Petros Maragos (PowerPoint presentation)




Slide 1

Computer Vision, Speech Communication & Signal Processing Group, Intelligent Robotics and Automation Laboratory, National Technical University of Athens (NTUA), Greece

Robot Perception and Interaction Unit, Athena Research and Innovation Center (Athena RIC)

Petros Maragos

Tutorial at INTERSPEECH 2018, Hyderabad, India, 2 Sep. 2018

slides: http://cvsp.cs.ntua.gr/interspeech2018

Part 3: Audio-Visual Child-Robot Interaction

Slide 2


EU project BabyRobot: Experimental Setup Room

Slide 3


Interspeech 2018 Tutorial: Multimodal Speech & Audio Processing in Audio-Visual Human-Robot Interaction

Video: experiments with TD (typically developing) children

Slide 4


Perception System

Sense – Think – Act loop: the audio and visual streams feed the perception modules, whose outputs drive robot behavior generation (IrisTK) through the IrisBroker hub, with Wizard-of-Oz supervision.

Action branch (child's activity): Visual Gesture Recognition, Distant Speech Recognition, AV Localization & Tracking, Action Recognition, 3D Object Tracking.

Behavioral branch (child's behavioral state): Visual Emotion Recognition, Speech Emotion Recognition, Text Emotion Recognition, Behavioral Monitoring.

Slide 5

Experimental Setup: Hardware & Software

Slide 6


Action Branch: Developed Technologies

• 3D Object Tracking
• Multi-view Gesture Recognition
• Multi-view Action Recognition
• Speaker Localization and Distant Speech Recognition

Slide 7


Audio-Visual Localization Evaluation

• Track multiple persons using the Kinect skeleton.
• Select the person closest to the estimated auditory source position.
• Rcor: percentage of correct estimates (deviation from ground truth less than 0.5 m).

Results:
• Audio-only source localization: 45.5%
• Audio-visual localization: 85.6%
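The selection rule and the Rcor metric can be sketched in a few lines of Python (a minimal illustration under assumed 3D-point inputs in metres; function names are not from the actual Multi3 code):

```python
import math

def av_localize(audio_est, person_positions):
    """Audio-visual fusion: among Kinect-tracked person positions,
    pick the one closest to the audio source location estimate."""
    return min(person_positions, key=lambda p: math.dist(p, audio_est))

def rcor(estimates, ground_truth, thresh=0.5):
    """Rcor: percentage of estimates within `thresh` metres of ground truth."""
    hits = sum(1 for e, g in zip(estimates, ground_truth)
               if math.dist(e, g) < thresh)
    return 100.0 * hits / len(estimates)
```

Visual tracking thus acts as a spatial prior that corrects noisy audio-only estimates, which matches the jump from 45.5% to 85.6% Rcor reported on the slide.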

Slide 8


Multi-view Gesture Recognition

• Multiple views of the child's gesture from different sensors
• Fusion of the three sensors' decisions
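Decision-level fusion across the three sensors can be sketched as a majority vote over per-sensor labels, with a score-sum variant when per-class classifier scores are available (a hedged illustration, not the system's actual fusion code):

```python
from collections import Counter

def fuse_votes(votes):
    """Majority vote over per-sensor labels; ties go to the earliest sensor."""
    return Counter(votes).most_common(1)[0][0]

def fuse_scores(sensor_scores):
    """Sum per-class scores (e.g. classifier posteriors) across sensors
    and return the top-scoring class."""
    total = Counter()
    for scores in sensor_scores:
        total.update(scores)
    return max(total, key=total.get)
```

Score-sum fusion is usually preferable when the per-view classifiers output calibrated confidences, since it lets a very confident view outvote two uncertain ones.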

Slide 9


Gesture Recognition – Vocabulary

Nod, Greet, Come Closer, Sit, Stop, Point, Circle

Slide 10


Multi-view Gesture Recognition - Evaluation

• 7 classes: nod, greet, come closer, sit, stop, point, circle
• Average classification accuracy (%) for the employed gestures performed by 28 children (development corpus)
• Results for the five different features in both single- and multi-stream cases

Slide 11


Multi-view Gesture Recognition - Children vs. Adults

• Different training schemes: adult models, children models, mixed model
• Employed features: MBH

• A. Tsiami, P. Koutras, N. Efthymiou, P. Filntisis, G. Potamianos, P. Maragos, "Multi3: Multi-sensory Perception System for Multi-modal Child Interaction with Multiple Robots", Proc. ICRA, 2018.

Slide 12


Distant Speech Recognition System


• DSR model training and adaptation per Kinect (Greek models)

Collected Data

• "I think that you are hammering a nail"
• "I think that you are painting"
• "I think that it is the rabbit"
• "It relates to peace"

Slide 13


Spoken Command Recognition Evaluation


• TD (typically-developing) children data: 40 phrases
• Average word accuracy (WCOR) and sentence accuracy (SCOR) for the DSR task, per utterance set, for all adaptation choices
• 4-fold cross-validation
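Word and sentence accuracy can be computed roughly as follows (a sketch: word alignment here uses Python's difflib rather than the edit-distance scoring a standard ASR toolkit would apply, so WCOR ignores insertions):

```python
from difflib import SequenceMatcher

def wcor(refs, hyps):
    """Word correct rate: % of reference words also present, in order,
    in the hypothesis (substitutions/deletions count as errors)."""
    correct = total = 0
    for ref, hyp in zip(refs, hyps):
        r, h = ref.split(), hyp.split()
        m = SequenceMatcher(a=r, b=h, autojunk=False)
        correct += sum(block.size for block in m.get_matching_blocks())
        total += len(r)
    return 100.0 * correct / total

def scor(refs, hyps):
    """Sentence correct rate: % of utterances recognized exactly."""
    hits = sum(r.split() == h.split() for r, h in zip(refs, hyps))
    return 100.0 * hits / len(refs)
```

For 4-fold cross-validation as on the slide, these scores would be averaged over the four held-out child subsets.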
Slide 14


Spoken Command Recognition – Children vs Adults

• Different training schemes: adult models, children models, mixed model

Slide 15


Slide 16


Action Recognition – Vocabulary

Cleaning a window, Ironing a shirt, Digging a hole, Driving a bus, Painting a wall, Hammering a nail, Wiping the floor, Reading, Swimming, Working Out, Playing the guitar, Dancing

Slide 17


Multi-view Action Recognition – Evaluation

• 13 classes of pantomime actions
• Average classification accuracy (%) for the actions performed by 28 children (development corpus)
• Results for the five different features in both single- and multi-stream cases

• N. Efthymiou, P. Koutras, P. Filntisis, G. Potamianos, P. Maragos, "Multi-view Fusion for Action Recognition in Child-Robot Interaction", Proc. ICIP, 2018.

Slide 18


Multi-view Action Recognition – Children vs Adults

• Different training schemes: adult models, children models, mixed model
• Employed features: MBH

Slide 19


Child-Robot Interaction: TD video – Rock-Paper-Scissors

• A. Tsiami, P. Filntisis, N. Efthymiou, P. Koutras, G. Potamianos, P. Maragos, "Multi3: Multi-sensory Perception System for Multi-modal Child Interaction with Multiple Robots", Proc. ICRA, 2018.

Slide 20


Part 3: Conclusions

Synopsis:

• Data collection and annotation: 28 TD and 15 ASD children (+ 20 adults)
• Audio-visual localization and tracking
• 3D object tracking
• Multi-view gesture and action recognition
• Distant speech recognition
• Multimodal emotion recognition

Ongoing work:

• Evaluate the whole perception system with TD and ASD children
• Extend and develop methods for engagement and behavioral understanding

Tutorial slides: http://cvsp.cs.ntua.gr/interspeech2018
For more information, demos, and current results: http://cvsp.cs.ntua.gr and http://robotics.ntua.gr