7-Speech Quality Assessment


Jul 27, 2023


SLIDE 1

7-Speech Quality Assessment

Quality Levels
Subjective Tests
Objective Tests
Intelligibility
Naturalness

SLIDE 2

Quality Levels

Synthetic Quality (under 4.8 kbps)
Communication Quality (4.8 to 13 kbps)
Toll Quality (13 to 64 kbps)
Broadcast Quality (above 64 kbps)
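The bit-rate ranges above can be read as a simple lookup. The sketch below is only an illustration: the function name `quality_level` is hypothetical, and the slide does not say which level the boundary values 4.8, 13 and 64 kbps belong to, so the assignment used here is an assumption.

```python
def quality_level(bitrate_kbps: float) -> str:
    """Map a codec bit rate in kbps to the quality level listed on the slide.

    Assumption: 4.8 and 13 kbps are assigned to the higher level and
    64 kbps to Toll Quality; the slide leaves the boundaries open.
    """
    if bitrate_kbps <= 0:
        raise ValueError("bit rate must be positive")
    if bitrate_kbps < 4.8:
        return "Synthetic Quality"
    if bitrate_kbps < 13:
        return "Communication Quality"
    if bitrate_kbps <= 64:
        return "Toll Quality"
    return "Broadcast Quality"
```

For example, a 2.4 kbps vocoder falls under Synthetic Quality and a 64 kbps PCM channel under Toll Quality.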

SLIDE 3

Test Types

             Intelligibility             Naturalness
Subjective   DRT, MRT                    MOS, DAM
Objective    None (future ASR systems)   AI, Global SNR, Seg. SNR, FW-Seg. SNR, Itakura Measure, WSSM

SLIDE 4

First Class Subjective Intelligibility Tests

Diagnostic Rhyme Test (DRT)

– The listener selects between two CVC (consonant-vowel-consonant) words that differ only in the first consonant
– The first consonant should have specific properties
– Ex. hop - fop and than - dan

Modified Rhyme Test (MRT)

– The listener selects among several CVC words that differ only in the first consonant
– Ex. cat, bat, rat, mat, fat, sat

SLIDE 5

First Class (Cont’d) Subjective Intelligibility tests

DRT is widely applicable and credible.
In this test the listener can only hear the speech.

\[
\mathrm{DRT} = \frac{N_{\mathrm{Correct}} - N_{\mathrm{Incorrect}}}{N_{\mathrm{Tests}}} \times 100\,\%
\]
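The DRT score is the number of correct answers minus the number of incorrect answers, divided by the number of tests and expressed as a percentage. A minimal sketch follows; the function name `drt_score` is hypothetical, and it assumes every test item is answered, so that the number of tests equals correct plus incorrect.

```python
def drt_score(n_correct: int, n_incorrect: int) -> float:
    """Return the DRT score in percent.

    Assumption: N_tests = n_correct + n_incorrect (no unanswered items).
    """
    n_tests = n_correct + n_incorrect
    if n_tests == 0:
        raise ValueError("at least one test item is required")
    return 100.0 * (n_correct - n_incorrect) / n_tests
```

With this scoring a listener who guesses between the two words (half right, half wrong) scores 0 %, not 50 %.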

SLIDE 6

Second Class Subjective Naturalness tests

Mean Opinion Score (MOS)

– MOS is widely applicable and credible
– In this test the listener can hear the speech many times

Diagnostic Acceptability Measure (DAM)

– This test is very complex

SLIDE 7

Mean Opinion Score (MOS)

MOS scores are assigned as follows:

Score   Speech Quality
1       Not Acceptable
2       Weak
3       Medium
4       Good
5       Excellent
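Since MOS is the mean of the listeners' scores on this 1 to 5 scale, it can be computed as below. This is a sketch only; the names `MOS_LABELS` and `mean_opinion_score` are hypothetical, and integer ratings are assumed.

```python
# Labels of the five-point scale from the slide.
MOS_LABELS = {
    1: "Not Acceptable",
    2: "Weak",
    3: "Medium",
    4: "Good",
    5: "Excellent",
}


def mean_opinion_score(ratings):
    """Average a list of integer listener ratings on the 1..5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if r not in MOS_LABELS:
            raise ValueError(f"rating {r!r} is outside the 1..5 scale")
    return sum(ratings) / len(ratings)
```

For instance, the ratings 4, 5, 3 and 4 from four listeners give a MOS of 4.0.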

SLIDE 8

Diagnostic Acceptability Measure (DAM)

This test is very complex.
It scores speech on 19 different parameters, which fall into 3 main groups:

– Signal Quality
– Background Quality
– Total Quality

SLIDE 9

Objective Tests

These tests cannot be used for intelligibility, because a system cannot recognize speech intelligibility.
Objective tests can only be used for speech naturalness.

SLIDE 10

Objective Tests (Cont’d)

Articulation Index (AI)

Signal to Noise Ratio (SNR)

– Global (Classic) SNR
– Segmental SNR
– Frequency Weighted Segmental SNR

SLIDE 11

Articulation Index (AI)

AI assumes that the distortions in different frequency bands are independent, and measures signal quality in each band separately. In each band it determines the percentage of the signal that is perceptible to the listener.

20 bands, from 200 Hz to 6100 Hz

SLIDE 12

Articulation index (Cont’d)

The signal is perceptible to the listener when it is:

– 1- Above the human hearing threshold
– 2- Below the human pain threshold
– 3- Above the masking noise level
– In each case one of conditions 1 or 3 prevails

SLIDE 13

Articulation index (Cont’d)

In AI, the SNR is measured separately in each band:

\[
\mathrm{AI} = \frac{1}{20} \sum_{i=1}^{20} \frac{\min(\mathrm{SNR}_i,\, 30)}{30}
\]
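The AI averages, over the bands, each band SNR capped at 30 dB and divided by 30. A minimal sketch under that reading; the function name `articulation_index` is hypothetical, the band SNRs are assumed to be given in dB, and no lower limit is applied because the slide only shows the 30 dB cap.

```python
def articulation_index(band_snrs_db):
    """Average of min(SNR_i, 30) / 30 over the bands (20 on the slide)."""
    if not band_snrs_db:
        raise ValueError("at least one band SNR is required")
    capped = [min(snr, 30.0) / 30.0 for snr in band_snrs_db]
    return sum(capped) / len(capped)
```

With 20 bands all at 30 dB or more the index is 1.0, and with all bands at 15 dB it is 0.5.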

SLIDE 14

Signal To Noise Ratio(SNR)

\[
\varepsilon(n) = s(n) - \hat{s}(n)
\]

\[
E_{\varepsilon} = \sum_{n} \varepsilon^{2}(n) = \sum_{n} [s(n) - \hat{s}(n)]^{2}
\]

\[
E_{s} = \sum_{n} s^{2}(n)
\]

\[
\mathrm{SNR}_{global} = 10 \log_{10} \frac{E_{s}}{E_{\varepsilon}} = 10 \log_{10} \frac{\sum_{n} s^{2}(n)}{\sum_{n} [s(n) - \hat{s}(n)]^{2}}
\]
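The global SNR is the signal energy divided by the energy of the error between the original signal s(n) and the processed signal ŝ(n), expressed in dB. A minimal sketch; the function name `global_snr_db` is hypothetical, and both signals are assumed to be time-aligned sequences of equal length.

```python
import math


def global_snr_db(s, s_hat):
    """10*log10( sum s(n)^2 / sum [s(n) - s_hat(n)]^2 ) over the whole signal."""
    if len(s) != len(s_hat):
        raise ValueError("signals must have the same length")
    e_s = sum(x * x for x in s)
    e_err = sum((x - y) ** 2 for x, y in zip(s, s_hat))
    if e_s == 0.0:
        raise ValueError("reference signal has zero energy")
    if e_err == 0.0:
        return math.inf  # identical signals: no error energy
    return 10.0 * math.log10(e_s / e_err)
```

An error amplitude of one tenth of the signal amplitude gives an energy ratio of 100, that is 20 dB.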


SLIDE 15

Segmental SNR

\[
\mathrm{SNR}_{seg} = \frac{1}{N} \sum_{j=1}^{N} 10 \log_{10} \left[ \frac{\sum_{n=m_j-M+1}^{m_j} s^{2}(n)}{\sum_{n=m_j-M+1}^{m_j} [s(n) - \hat{s}(n)]^{2}} \right]
\]

The term inside the outer sum is the SNR of the j-th frame.

N: number of frames
M: frame length

Usually averaged over "good frames" only: frames whose SNR is higher than -10 dB, with frame SNRs saturated at +30 dB.
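Segmental SNR computes the SNR of each frame of M samples and averages the frame values in dB, keeping only "good frames" above -10 dB and saturating each frame at +30 dB. A minimal sketch; the function name `segmental_snr_db` and its parameters are hypothetical, frames are assumed non-overlapping, and skipping silent reference frames is an assumption, not something the slide states.

```python
import math


def segmental_snr_db(s, s_hat, frame_len, floor_db=-10.0, ceil_db=30.0):
    """Average per-frame SNR in dB over non-overlapping frames."""
    if len(s) != len(s_hat):
        raise ValueError("signals must have the same length")
    frame_snrs = []
    for start in range(0, len(s) - frame_len + 1, frame_len):
        ref = s[start:start + frame_len]
        est = s_hat[start:start + frame_len]
        e_s = sum(x * x for x in ref)
        e_err = sum((x - y) ** 2 for x, y in zip(ref, est))
        if e_s == 0.0:
            continue  # assumption: silent reference frames are skipped
        snr = math.inf if e_err == 0.0 else 10.0 * math.log10(e_s / e_err)
        if snr > floor_db:  # keep only "good frames"
            frame_snrs.append(min(snr, ceil_db))  # saturate at +30 dB
    if not frame_snrs:
        raise ValueError("no good frames found")
    return sum(frame_snrs) / len(frame_snrs)
```

Unlike the global SNR, a single loud frame cannot dominate the result, because every frame contributes one value between the floor and the ceiling.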

SLIDE 16

Frequency Weighted Segmental SNR

Siemens Formula:

\[
\mathrm{SNR}_{FWS} = \frac{1}{N} \sum_{k=1}^{N} \frac{1}{W_k} \sum_{j=1}^{F} w_{j,k}\, 10 \log_{10} \frac{\sum s(n)^{2}}{\sum [s(n) - \hat{s}(n)]^{2}}
\]

\[
W_k = \sum_{j=1}^{F} w_{j,k}
\]

F: number of frequency bands
N: number of frames

SLIDE 17

Frequency Weighted Segmental SNR

Deller Formula:

\[
\mathrm{SNR}_{fw\text{-}seg} = \frac{1}{M} \sum_{j=1}^{M} 10 \log_{10} \left[ \frac{\sum_{k=1}^{K} w_{j,k}\, 10 \log_{10} [E_{s,k}(m_j) / E_{\varepsilon,k}(m_j)]}{\sum_{k=1}^{K} w_{j,k}} \right]
\]

SLIDE 18

Frequency Weighted Segmental SNR

Other Formulas:

\[
\mathrm{SNR}_{fw\text{-}seg} = \frac{1}{M} \sum_{j=1}^{M} \frac{1}{\sum_{k=1}^{K} w_{j,k}} \sum_{k=1}^{K} w_{j,k}\, 10 \log_{10} \frac{E_{s,k}(m_j)}{E_{\varepsilon,k}(m_j)}
\]

\[
\mathrm{SNR}_{fw\text{-}seg} = \frac{1}{M} \sum_{j=1}^{M} \left[ \frac{\sum_{k=1}^{K} w_{j,k}\, 10 \log_{10} [E_{s,k}(m_j) / E_{\varepsilon,k}(m_j)]}{\sum_{k=1}^{K} w_{j,k}} \right]
\]

SLIDE 19

The Final Formula

The right formula for fw-seg SNR is thus:

\[
\mathrm{SNR}_{fw\text{-}seg} = \frac{1}{M} \sum_{j=1}^{M} \left[ \frac{\sum_{k=1}^{K} w_{j,k}\, 10 \log_{10} [E_{s,k}(m_j) / E_{\varepsilon,k}(m_j)]}{\sum_{k=1}^{K} w_{j,k}} \right]
\]

SLIDE 20

The Final Formula

Where

– M is the number of frames
– j is the frame index
– k is the frequency band index
– $w_{j,k}$ is the weight of the k-th band of the j-th frame
– $E_{s,k}$ and $E_{\varepsilon,k}$ are the energies of the k-th band of the signal and the noise, respectively
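With these definitions, the final formula takes, in each frame, the weighted average over bands of the per-band SNR in dB, and then averages over the M frames. A minimal sketch; the function name `fw_seg_snr_db` and the nested-list layout (one list of K band values per frame) are assumptions made for illustration, and the band energies and weights are taken as already computed.

```python
import math


def fw_seg_snr_db(signal_energy, noise_energy, weights):
    """Frequency-weighted segmental SNR in dB.

    Each argument is a list of M frames; each frame is a list of K
    positive band values (signal energy, noise energy, weight).
    """
    if not (len(signal_energy) == len(noise_energy) == len(weights)):
        raise ValueError("all inputs must have the same number of frames")
    total = 0.0
    for es_j, ee_j, w_j in zip(signal_energy, noise_energy, weights):
        # Weighted sum of per-band SNRs (dB) for frame j.
        weighted = sum(
            w * 10.0 * math.log10(es / ee)
            for es, ee, w in zip(es_j, ee_j, w_j)
        )
        total += weighted / sum(w_j)  # normalise by the frame's weights
    return total / len(signal_energy)
```

For one frame with band SNRs of 20 dB and 10 dB and weights 1 and 3, the result is (1*20 + 3*10) / 4 = 12.5 dB.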

SLIDE 21

Itakura Measure

Uses an all-pole (AR) model.

\[
S(\omega) = \mathcal{F}\{R(\tau)\}, \qquad S(\omega) = |X(\omega)|
\]

$H(\omega)$ is the envelope spectrum.

[Figure: spectrum $S(\omega)$ with its envelope $H(\omega)$]

SLIDE 22

Itakura Measure (Cont’d)

\[
H(\omega) = \frac{1}{1 - \sum_{i=1}^{p} a_i e^{-j\omega i}}
\]

The measure is based on the spectral difference between the original signal and the signal under assessment.

Autoregressive Coefficients
Reflection Coefficients
Autocorrelation Coefficients

SLIDE 23

Itakura Measure (Cont’d)

\[
d(g_s(m), g_{\hat{s}}(m)) = \frac{1}{M} \sum_{l=1}^{M} [g_s(l, m) - g_{\hat{s}}(l, m)]^{2}
\]

m: index of frame
l: index of coefficients
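For one frame m, this distance is the mean squared difference between the M coefficients of the original signal and those of the processed signal. A minimal sketch of that per-frame computation; the function name `coefficient_distance` is hypothetical, and the coefficient vectors (for example AR coefficients) are assumed to be already extracted for the frame.

```python
def coefficient_distance(g_ref, g_test):
    """Mean squared difference between two coefficient vectors of one frame."""
    if len(g_ref) != len(g_test):
        raise ValueError("coefficient vectors must have the same length")
    if not g_ref:
        raise ValueError("at least one coefficient is required")
    return sum((a - b) ** 2 for a, b in zip(g_ref, g_test)) / len(g_ref)
```

Identical coefficient vectors give a distance of 0, and the value grows with the spectral mismatch between the two frames.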

SLIDE 24

Itakura Measure (Cont’d)

\[
\tilde{d}_{lp}(\xi_s(m), \xi_{\hat{s}}(m')) = \left[ \sum_{l=1}^{M} W_{l,m,m'} [\xi_s(l, m) - \xi_{\hat{s}}(l, m')] \right] \left[ \sum_{l=1}^{M} W_{l,m,m'} \right]^{-1}
\]

$\xi(l, m)$ is the l-th parameter of the frame that ends at the m-th sample.

SLIDE 25

Weighted Spectral Slope Measure (WSSM)

\[
\Delta|s(k, m)| = |s(k+1, m)| - |s(k, m)|, \qquad \Delta|\hat{s}(k, m)| = |\hat{s}(k+1, m)| - |\hat{s}(k, m)|
\]

\[
d_{WSSM}(|s(\omega, m)|, |\hat{s}(\omega, m)|) = K + \sum_{k=1}^{36} W_{k,m} [\Delta|s(k, m)| - \Delta|\hat{s}(k, m)|]^{2}
\]

$s(k, m)$ is the STFT of the k-th band of the frame that ends at the m-th sample.

$|s(k+1, m)|$ and $|s(k, m)|$ are in dB.
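WSSM compares the slopes of the two spectra, that is the differences between the dB magnitudes of adjacent bands, and sums the weighted squared slope differences. A minimal sketch for one frame; the names `spectral_slopes`, `wssm_distance` and `k_const` are hypothetical, the band magnitudes are assumed to be given in dB, and the weights are taken as inputs because the slides do not say how they are chosen.

```python
def spectral_slopes(band_db):
    """Slope of band k: magnitude of band k+1 minus magnitude of band k (dB)."""
    return [band_db[k + 1] - band_db[k] for k in range(len(band_db) - 1)]


def wssm_distance(ref_db, test_db, weights, k_const=0.0):
    """Constant plus weighted sum of squared spectral-slope differences."""
    slopes_ref = spectral_slopes(ref_db)
    slopes_test = spectral_slopes(test_db)
    if not (len(slopes_ref) == len(slopes_test) == len(weights)):
        raise ValueError("need one weight per slope and equal band counts")
    return k_const + sum(
        w * (a - b) ** 2
        for w, a, b in zip(weights, slopes_ref, slopes_test)
    )
```

Because only slopes are compared, adding the same number of dB to every band of one spectrum leaves the distance unchanged.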

SLIDE 26

PESQ

Perceptual Evaluation of Speech Quality

SLIDE 27

PESQ

The most prominent result of PESQ is the MOS, which directly expresses the voice quality. The PESQ MOS as defined by ITU recommendation P.862 ranges from 1.0 (worst) up to 4.5 (best). This may be surprising at first glance, since the ITU scale ranges up to 5.0, but the explanation is simple: PESQ simulates a listening test and is optimized to reproduce the average result over all listeners (remember, MOS stands for Mean Opinion Score). Statistics show, however, that the best average result one can generally expect from a listening test is not 5.0 but about 4.5. Subjects appear to be cautious about giving a 5, meaning "excellent", even when there is no degradation at all.