

SLIDE 1

LID-senone Extraction via Deep Neural Networks for End-to-End Language Identification

Ma Jin 1, Yan Song 1, Ian McLoughlin 2, Li-Rong Dai 1, Zhong-Fu Ye 1,3

1 National Engineering Laboratory of Speech and Language Information Processing

University of Science and Technology of China, China

2 School of Computing, University of Kent, Medway, UK 3 State Key Laboratory of Mathematical Engineering and Advanced Computing, China

Presented by Professor Ian McLoughlin

2016.06.22

SLIDE 2

Outline

  • Introduction
  • Proposed Method
  • Experiments and Analysis
  • Conclusion and Future Work
SLIDE 3

Introduction – background

  • What is Language Identification?
  • extract an utterance-level language representation from a given speech signal
  • State-of-the-art Method
  • GMM/i-vector
  • trained in an unsupervised fashion
  • Deep Learning Methods
  • natural advantage of supervised training
SLIDE 4

Introduction – existing method

  • Improved i-vector Method via Deep Learning
  • Deep bottleneck network based i-vector representation for language identification (Song et al.)
  • Study of senone-based deep neural network approaches for spoken language recognition (Ferrer et al.)
  • End-to-End Neural Network
  • Automatic language identification using deep neural networks (Lopez-Moreno et al.)
  • Automatic language identification using long short-term memory recurrent neural networks (Gonzalez-Dominguez et al.)
  • An end-to-end approach to language identification in short utterances using convolutional neural networks (Lozano-Diez et al.)
SLIDE 5

Outline

  • Introduction
  • Proposed Method
  • Experiments and Analysis
  • Conclusion and Future Work
SLIDE 6

Proposed Method – motivation and structure

  • Convolutional Neural Network
  • convolutional layers: feature extractor at frame level
  • pooling layers: map frame level features to utterance representation
  • Structure
  • DNN layer: transform acoustic features to a compact representation frame by frame
  • convolutional layer: transform bottleneck (BN) features into units discriminative between languages
SLIDE 7

Proposed Method – structure details

  • LID-feature
  • general acoustic features contain too much irrelevant information, which may degrade performance
  • deep bottleneck features (DBFs) are discriminative for phones, not for languages
  • LID-features are discriminative between languages, with decorrelated dimensions (obtained with a large conv kernel)
  • Spatial Pyramid Pooling
  • spans features from the frame level to the utterance level
  • deals with arbitrary input sizes
  • obtains statistical information at different time scales
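The pooling idea above can be sketched as follows; the pyramid levels (1/2/4 bins) and average pooling are illustrative assumptions, not necessarily the paper's exact SPP configuration:

```python
import numpy as np

def spatial_pyramid_pool(frames, levels=(1, 2, 4)):
    """Pool variable-length frame features [T, D] into a fixed-length
    utterance vector: the time axis is split into 1, 2 and 4 bins and
    each bin is average-pooled, so any T maps to the same output size."""
    T, D = frames.shape
    pooled = []
    for n_bins in levels:
        # split the time axis into n_bins roughly equal segments
        edges = np.linspace(0, T, n_bins + 1).astype(int)
        for b in range(n_bins):
            pooled.append(frames[edges[b]:edges[b + 1]].mean(axis=0))
    return np.concatenate(pooled)  # shape: (D * sum(levels),)

# Different utterance lengths yield the same representation size:
short = spatial_pyramid_pool(np.random.randn(300, 64))   # e.g. 3 s at 100 fps
long_ = spatial_pyramid_pool(np.random.randn(3000, 64))  # e.g. 30 s
```

This is what lets one network handle arbitrary input durations: the bins stretch with the input while the output dimensionality stays fixed.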
SLIDE 8

Proposed Method – incremental training strategy

  • LID-features cannot be extracted directly from general acoustic features
  • lack of training data
  • features should be tied to phones at the frame level, so the training target cannot be languages
  • Incremental Training Strategy
  • transfer learning from a large-scale corpus
  • incremental training with the language corpus

SLIDE 9

Proposed Method – LID-senone and its statistics

[Figure: LID-senone statistics, discriminative at the utterance level vs. at the frame level]

  • only a few LID-senones can be activated

SLIDE 10

Proposed Method – hybrid temporal evaluation

  • 30s/10s/3s neural networks are trained independently
  • 30s speech can be segmented into 10s/3s chunks and scored with the corresponding networks
  • 10s speech can be segmented into 3s chunks and scored with the corresponding network
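A minimal sketch of this evaluation scheme, assuming 100 frames per second and simple score averaging as the fusion rule (the fusion details are an assumption here, not stated on the slide):

```python
import numpy as np

def hybrid_temporal_scores(frames, nets, fps=100):
    """Score an utterance with every duration-specific network that fits:
    e.g. a 30 s input is also cut into non-overlapping 10 s and 3 s chunks,
    each chunk is scored by the matching network, and all chunk scores
    are averaged into one fused language-score vector."""
    all_scores = []
    for dur, net in nets.items():
        win = dur * fps
        if len(frames) < win:
            continue  # utterance too short for this network
        for start in range(0, len(frames) - win + 1, win):
            all_scores.append(net(frames[start:start + win]))
    return np.mean(all_scores, axis=0)

# Toy duration-specific "networks" returning scores for 6 languages:
nets = {30: lambda x: np.ones(6),
        10: lambda x: np.ones(6),
        3:  lambda x: np.ones(6)}
fused = hybrid_temporal_scores(np.zeros((3000, 40)), nets)  # 30 s utterance
```

A 30 s utterance thus contributes 1 + 3 + 10 = 14 chunk scores across the three networks before fusion.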

SLIDE 11

Outline

  • Introduction
  • Proposed Method
  • Experiments and Analysis
  • Conclusion and Future Work
SLIDE 12

Experiments and Analysis

  • Dataset
  • six most confusable languages from NIST LRE 09 (Dari, Farsi, Russian, Ukrainian, Hindi and Urdu)
  • training data: about 150 hours
  • evaluation on 30s/10s/3s utterances
  • Performance indicator: Equal Error Rate (EER)
  • System
  • baseline1: BN-GMM/i-vector
  • baseline2: BN-DNN/i-vector
  • proposed network1: LID-net
  • proposed network2: LID-HT-net, LID-net with hybrid temporal evaluation
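EER is the operating point at which the miss rate equals the false-alarm rate. A simple threshold-sweep approximation (illustrative only, not the scoring tool used in the paper):

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all scores and
    finding where miss rate and false-alarm rate are closest."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]            # sweep low -> high
    miss = np.cumsum(labels) / labels.sum()        # targets below threshold
    fa = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()  # nontargets above
    i = np.argmin(np.abs(miss - fa))
    return (miss[i] + fa[i]) / 2

# Well-separated target/nontarget scores give a zero EER:
eer = equal_error_rate(np.array([0.9, 0.8, 0.7]), np.array([0.1, 0.2, 0.3]))
```

In practice a DET-curve interpolation is used for exact EER values, but the crossing-point idea is the same.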
SLIDE 13

Experiments and Analysis

  • Evaluation of Different Convolutional Filter Sizes

[Table: EER as the convolutional filter size changes]

  • As a consequence, a filter size of 50x21 is selected for all of the following experiments.

SLIDE 14

Experiments and Analysis

  • Evaluation of Convolutional Layer Complexity

[Table: EER as the complexity of the conv. layer changes]

  • The performance improves when the complexity increases

SLIDE 15

Experiments and Analysis

  • Hybrid Temporal Evaluation
  • the final LID-net performs well compared with the two baseline systems
  • i-vector uses both zeroth-order and first-order Baum-Welch statistics; in LID-net, the SPP layer only uses zeroth-order statistics
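The distinction can be made concrete: given per-frame LID-senone posteriors, the zeroth-order statistics are occupancy counts (what the SPP layer effectively pools), while the first-order statistics additionally weight the frame features. A sketch, with shapes and names chosen for illustration:

```python
import numpy as np

def baum_welch_stats(posteriors, features):
    """posteriors: [T, C] per-frame senone posteriors;
    features:   [T, D] frame-level features.
    Returns zeroth-order occupancy counts [C] and
    first-order posterior-weighted feature sums [C, D]."""
    n = posteriors.sum(axis=0)      # zeroth order: how often each senone fires
    f = posteriors.T @ features     # first order: where in feature space it fires
    return n, f

post = np.random.rand(500, 32)      # 500 frames, 32 senones
feat = np.random.randn(500, 40)     # 40-dim features
n, f = baum_welch_stats(post, feat)
```

This is why the slide's closing question matters: the first-order term carries per-senone feature information that the zeroth-order counts alone discard.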

SLIDE 16

Outline

  • Introduction
  • Proposed Method
  • Experiments and Analysis
  • Conclusion and Future Work
SLIDE 17

Conclusion and Future Work

  • Conclusion
  • we have proposed a comprehensive task-aware network spanning the frame level to the utterance level
  • an incremental training strategy has been introduced to address over-fitting in the deep structure
  • hybrid temporal evaluation is proposed to handle the various time scales in the same test dataset

  • Future Work
  • consider a more comprehensive network rather than relying on three independent networks
  • can we incorporate first-order Baum-Welch statistics?