SLIDE 13 Look, Listen, and Learn (L3-Net)
Video subnetwork
  Input: single video frame, size (224, 224, 3)
  Batch Normalization
  2 x [Conv: 64 (3,3) + BN + ReLU], Max pool: (2,2)
  2 x [Conv: 128 (3,3) + BN + ReLU], Max pool: (2,2)
  2 x [Conv: 256 (3,3) + BN + ReLU], Max pool: (2,2)
  2 x [Conv: 512 (3,3) + BN + ReLU], Max pool: (28,28)
Audio subnetwork
  Input: 1 s Mel-spectrogram, size (256, 199, 1)
  Batch Normalization
  2 x [Conv: 64 (3,3) + BN + ReLU], Max pool: (2,2)
  2 x [Conv: 128 (3,3) + BN + ReLU], Max pool: (2,2)
  2 x [Conv: 256 (3,3) + BN + ReLU], Max pool: (2,2)
  2 x [Conv: 512 (3,3) + BN + ReLU], Max pool: (32,24)
Fusion layers
  Concatenate
  Dense: 128 + ReLU
  Dense: 2 + SoftMax
  Output: Correspond? (Yes / No)
Figure 2: Architecture of the L3-Net embedding models
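The layer sizes in Figure 2 can be sanity-checked with a small shape walk-through. The sketch below is not the authors' code: it assumes the 3x3 convolutions use "same" padding and that pooling drops partial border windows (neither is stated on the slide), and the helper names are made up for illustration. Under those assumptions, the final pooling layer of each subnetwork collapses the feature map to a single 512-dimensional vector.

```python
# Shape walk-through of the two L3-Net subnetworks listed in Figure 2.
# Assumptions (not stated on the slide): 3x3 convolutions use "same" padding,
# and max pooling is non-overlapping and floors odd sizes.

def conv_same(shape, filters):
    """A 'same'-padded convolution keeps height/width and sets the channel count."""
    h, w, _ = shape
    return (h, w, filters)

def max_pool(shape, pool):
    """Non-overlapping max pooling; partial windows at the border are dropped."""
    h, w, c = shape
    return (h // pool[0], w // pool[1], c)

def subnetwork_output(input_shape, final_pool):
    """Apply the four conv blocks from Figure 2 and return the output shape."""
    shape = input_shape
    for filters in (64, 128, 256):
        shape = conv_same(conv_same(shape, filters), filters)  # two conv layers
        shape = max_pool(shape, (2, 2))
    shape = conv_same(conv_same(shape, 512), 512)
    return max_pool(shape, final_pool)

audio_out = subnetwork_output((256, 199, 1), final_pool=(32, 24))
video_out = subnetwork_output((224, 224, 3), final_pool=(28, 28))
fused_size = audio_out[2] + video_out[2]  # input width of the fusion layers

print(audio_out, video_out, fused_size)  # (1, 1, 512) (1, 1, 512) 1024
```

The concatenated 1024-dimensional vector then feeds the Dense: 128 and Dense: 2 fusion layers.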
L3-Net trains an audio embedding by learning associations between audio snippets and video frames [1]
Audio-Visual Correspondence (AVC) task: predict whether an audio snippet and a video frame correspond (Yes / No)
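One way to picture the AVC training data is as labelled (frame, audio) pairs. The sketch below is only an illustration: the slide does not say how pairs are sampled, so it assumes positives take both items from the same video, negatives take the audio from a different video, and the two classes are balanced. The function name `make_avc_pairs` and the dictionary layout are hypothetical.

```python
import random

def make_avc_pairs(videos, num_pairs, seed=0):
    """Build (frame, audio, label) examples; label 1 = correspond, 0 = do not.

    `videos` maps a video id to a (frame, audio) tuple (hypothetical layout).
    Assumption: negatives pair a frame with audio from a different video.
    """
    rng = random.Random(seed)
    ids = sorted(videos)
    pairs = []
    for i in range(num_pairs):
        frame_id = rng.choice(ids)
        if i % 2 == 0:
            audio_id = frame_id  # positive: same video
        else:
            # negative: audio drawn from any other video
            audio_id = rng.choice([v for v in ids if v != frame_id])
        frame = videos[frame_id][0]
        audio = videos[audio_id][1]
        pairs.append((frame, audio, int(audio_id == frame_id)))
    return pairs

# Placeholder strings stand in for real frames and Mel-spectrograms.
videos = {
    "v1": ("frame_v1", "audio_v1"),
    "v2": ("frame_v2", "audio_v2"),
    "v3": ("frame_v3", "audio_v3"),
}
pairs = make_avc_pairs(videos, num_pairs=6)
labels = [label for _, _, label in pairs]
```

Each pair would be passed through the video and audio subnetworks, and the two-way softmax output would be trained against the label.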
[1] Arandjelović, Relja and Zisserman, Andrew. "Look, Listen and Learn". IEEE ICCV, 2017.
PAISE 2019 Workshop May 24, 2019 8 / 26