FINDING YOUR VOICE IN THE REGULATORY AGE
NIGEL CANNINGS CTO
nigel.cannings@intelligentvoice.com @intelligentvox
THE YEAR OF VOICE
2017! 2016? 2015
LIBOR FX Scandal
Banks face Multi-Billion $ fines
Amazon Alexa
SIRI(?)
Almost 50% of all corporate data will have a voice component within 5 years, either as audio or video. All companies, but particularly banks and insurance companies, need to get a handle not just on where this data is being held, but also on what is being said in it and who is saying it.
AUDIENCE PARTICIPATION
HOW OFTEN DO YOU USE A VOICE ASSISTANT?
[Bar chart: share of respondents answering Daily, Weekly, Monthly, Never; axis 0% to 90%]
Results taken from a survey on 5th October 2017 of 1500 people across Europe
Of the people with a smartphone, how often do they use its integrated voice assistant (e.g. Siri, Cortana):
HOW OFTEN DO YOU USE A VOICE ASSISTANT?
[Bar chart: share of respondents answering Daily, Weekly, Monthly; axis 0% to 50%]
Results taken from a survey on 5th October 2017 of 1500 people across Europe
Of the people with an Alexa home assistant, how often do they use it:
Where?
What? Who?
Where is your voice stored?
Your voice could be used to determine any of the following:
→ Your identity, gender, nationality (accent), emotional state…
C Glackin, G Chollet, N Dugan, N Cannings, J Wall, S Tahir, IG Ray. Privacy preserving encrypted phonetic search of speech data. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
S Tahir, S Ruj, Y Rahulamathavan, M Rajarajan, C Glackin. A New Secure and Lightweight Searchable Encryption Scheme over Encrypted Cloud Data. IEEE Transactions on Emerging Topics in Computing, 2017.
AES Encryption (Public key)
Powered by machine learning
Powered by GPU
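The idea of searching speech without exposing its content can be illustrated with a toy index in which phoneme n-grams are replaced by keyed hashes (HMAC-SHA256), so a server can match query tokens without seeing the phonemes themselves. This is a minimal sketch under stated assumptions, not the scheme from the papers cited above: the function names, the trigram size, the key, and the sample phoneme sequences are all hypothetical.

```python
import hashlib
import hmac
from collections import defaultdict


def trapdoor(key: bytes, phones: tuple) -> str:
    """Keyed hash of a phoneme n-gram; only holders of `key` can compute it."""
    message = " ".join(phones).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def ngrams(phones: list, n: int = 3) -> list:
    """All consecutive n-grams of a phoneme sequence."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def build_index(key: bytes, recordings: dict, n: int = 3) -> dict:
    """Map each hashed n-gram to the ids of the recordings that contain it."""
    index = defaultdict(set)
    for rec_id, phones in recordings.items():
        for gram in ngrams(phones, n):
            index[trapdoor(key, gram)].add(rec_id)
    return dict(index)


def search(index: dict, key: bytes, query: list, n: int = 3) -> set:
    """Return the recordings that contain every n-gram of the query."""
    tokens = [trapdoor(key, gram) for gram in ngrams(query, n)]
    if not tokens:
        return set()
    hits = [index.get(token, set()) for token in tokens]
    return set.intersection(*hits)


# Hypothetical data: two recordings already transcribed to phoneme sequences.
key = b"hypothetical-secret-key"
recordings = {
    "call-1": ["l", "ay", "b", "ao", "r", "r", "ey", "t"],
    "call-2": ["m", "ah", "n", "iy", "b", "ae", "k"],
}
index = build_index(key, recordings)
matches = search(index, key, ["l", "ay", "b", "ao", "r"])
```

The index holds only hash values, and a query built with a different key produces tokens that match nothing. A real scheme also has to hide access patterns and result sizes, which this sketch does not attempt.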
Recurrent lateral inhibitory spiking networks for speech enhancement J Wall, C Glackin, N Cannings, G Chollet, N Dugan International Joint Conference on Neural Networks (IJCNN), pp. 1023-1028, 2016.
CONVOLUTIONAL NEURAL NETWORKS FOR ACOUSTIC MODELLING
TIMIT Speech Corpus
1.4M spectrograms for the training set
Sliding window used for timing
4 to 5 phones in each 0.256 second window
61 Phoneme Classes
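The sliding-window step can be sketched as follows. TIMIT audio is sampled at 16 kHz, so a 0.256 second window covers 4096 samples; with 4 to 5 phones per window, each phone spans roughly 51 to 64 ms. The 10 ms hop between windows is an assumption for illustration (the slide does not state the step size), and `sliding_windows` is a hypothetical helper name.

```python
def sliding_windows(num_samples, sample_rate=16000, window_s=0.256, hop_s=0.010):
    """Return (start, end) sample indices of every full window in the signal."""
    window = round(window_s * sample_rate)  # 4096 samples at 16 kHz
    hop = round(hop_s * sample_rate)        # 160 samples; the hop is an assumed value
    return [(start, start + window)
            for start in range(0, num_samples - window + 1, hop)]


# A 3-second utterance at 16 kHz.
windows = sliding_windows(num_samples=3 * 16000)
```

Each window would then be turned into one spectrogram image and labelled with a phoneme class for the network to learn.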
[Chart: processing speed, measured in times real time]
100x Realtime using P5000
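As a worked example of what a "times real time" figure means in practice: processing time is the audio duration divided by the speed factor. The helper name below is hypothetical.

```python
def processing_time_s(audio_s: float, times_real_time: float) -> float:
    """Seconds needed to process `audio_s` seconds of audio at a given speed factor."""
    return audio_s / times_real_time


# One hour of audio at 100x real time.
one_hour = processing_time_s(3600, 100)
```

At 100x real time, an hour of recorded calls takes 36 seconds to process.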
But this is just the beginning: Voice data is generated not only in the organisation, but externally, maybe as YouTube content. One area commonly forgotten is mobile
not just on recording calls made from a regulated
Intelligent Voice are working with Telefonica/O2 to capture, index and analyse mobile phone calls, and introduce them as part of a compliance and monitoring workflow for MiFID II.
WHAT IS WRONG WITH THESE STATEMENTS?
“Woke up at 7:30. Had a shower. Made breakfast and read the newspaper. At 8:30, drove to work.”
“We should have done a better job.”
“That’s their way of doing things.”
“You’d better ask them.”
Alleged robbery victim: “The man asked for my money.” “He told me not to look at him. He said he would shoot me if I screamed.”
Pronouns: Omission, improper use, higher rates of third-person plural pronouns
Complexity: Parameters such as number of letters/syllables per word, higher word count, higher rate of pauses
Speaking verbs: Strong tone (told, demanded, telling), soft tone (said, asked, stated, saying) – tone changes
Tempo: Slow tempo (indicator of cognitive load), fast tempo (indicator of arousal and negative effects)
Pitch: Higher pitch/lower voice quality at specific times are indications of fraud-related utterances
Specific Words: Explainers (so, since, therefore, because…)
These are just a few of the indicators of suspicious language
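A few of these indicators can be counted with plain keyword matching, as a rough illustration of what they look for in a transcript. This is a sketch under stated assumptions, not the presenters' method: the verb and explainer lists come from the slide, while the pronoun lists, the `indicators` function, and the output fields are hypothetical.

```python
import re

# Verb and explainer lists come from the slide; the pronoun lists are assumptions.
STRONG_VERBS = {"told", "demanded", "telling"}
SOFT_VERBS = {"said", "asked", "stated", "saying"}
EXPLAINERS = {"so", "since", "therefore", "because"}
THIRD_PLURAL = {"they", "them", "their", "theirs"}
FIRST_SINGULAR = {"i", "me", "my", "mine"}


def indicators(statement: str) -> dict:
    """Count simple lexical indicators in one transcribed statement."""
    words = re.findall(r"[a-z']+", statement.lower())
    # Record speaking verbs in order, so a strong-to-soft change is visible.
    tones = ["strong" if w in STRONG_VERBS else "soft"
             for w in words if w in STRONG_VERBS | SOFT_VERBS]
    return {
        "third_plural": sum(w in THIRD_PLURAL for w in words),
        "first_singular": sum(w in FIRST_SINGULAR for w in words),
        "explainers": sum(w in EXPLAINERS for w in words),
        "tones": tones,
        "tone_change": len(set(tones)) > 1,
    }


robbery = indicators("He told me not to look at him. He said he would shoot me if I screamed.")
morning = indicators("Woke up at 7:30. Had a shower. Made breakfast and read the newspaper.")
deflect = indicators("That's their way of doing things.")
```

The robbery statement shows a change from a strong verb ("told") to a soft one ("said"), the morning statement contains no first-person pronouns at all, and the third statement uses a third-person plural pronoun. Keyword counts like these ignore context, which is one motivation for a learned tagger.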
Voice Activity Detection
i-vector diarization
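Voice activity detection marks which parts of a recording contain speech before diarization assigns them to speakers. The slide does not say which method is used, so the sketch below uses a simple energy threshold as a stand-in; the frame length, threshold, function names, and synthetic test signal are all assumptions, and i-vector diarization itself is not shown.

```python
import math


def frame_rms(samples, frame_len):
    """Root-mean-square energy of each non-overlapping frame."""
    return [math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def speech_segments(samples, frame_len=160, threshold=0.1):
    """Return (start_frame, end_frame) pairs, end exclusive, of frames above threshold."""
    segments, start = [], None
    energies = frame_rms(samples, frame_len)
    for i, rms in enumerate(energies):
        if rms > threshold and start is None:
            start = i                      # speech begins
        elif rms <= threshold and start is not None:
            segments.append((start, i))    # speech ends
            start = None
    if start is not None:                  # speech runs to the end of the signal
        segments.append((start, len(energies)))
    return segments


# Synthetic signal at 16 kHz: silence, a 200 Hz tone standing in for speech, silence.
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(480)]
signal = [0.0] * 320 + tone + [0.0] * 320
segments = speech_segments(signal)
```

With 160-sample frames (10 ms at 16 kHz), the tone occupies frames 2 to 4, so one segment is returned.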
INTERVIEWER: What happened next?
CALLER: He told me not to look at him. He said he would shoot me if…
[Diagram: the tokens “… He told me not to look at him . He said …” pass through an Embedding layer and two LSTM layers; the output is tagged as a strong tone followed by a weak tone]
Inspired by recurrent networks for named entity recognition and part of speech tagging
We can use bi-directional recurrent networks to attach credibility tags to the speech transcription
Bi-directionality is important for context
Network can tag explainers, changes in tone, pronouns etc.
GPU-accelerated RNN-based Speech to Text
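The tagging architecture described above can be sketched as a forward pass in plain Python: one recurrent pass reads the tokens left to right, a second reads them right to left, and the two states for each token are concatenated before scoring the tags. This is an illustration under stated assumptions, not the presenters' network: it swaps the LSTM cells for a simple tanh recurrent cell, the weights are random and untrained (so the tags it outputs carry no meaning), and the tag set, dimensions, and function names are hypothetical.

```python
import math
import random

TAGS = ["O", "STRONG_TONE", "SOFT_TONE", "PRONOUN", "EXPLAINER"]  # hypothetical tag set


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_pass(inputs, w_in, w_rec, hidden):
    """Run a simple tanh recurrent cell over the inputs; return the state at each step."""
    state, states = [0.0] * hidden, []
    for x in inputs:
        a, b = matvec(w_in, x), matvec(w_rec, state)
        state = [math.tanh(p + q) for p, q in zip(a, b)]
        states.append(state)
    return states


def tag(tokens, embed_dim=8, hidden=6, seed=0):
    """Assign one tag per token using random (untrained) weights."""
    rng = random.Random(seed)
    embedding = {w: [rng.uniform(-1, 1) for _ in range(embed_dim)]
                 for w in sorted(set(tokens))}
    inputs = [embedding[w] for w in tokens]
    fwd_in, fwd_rec = rand_matrix(hidden, embed_dim, rng), rand_matrix(hidden, hidden, rng)
    bwd_in, bwd_rec = rand_matrix(hidden, embed_dim, rng), rand_matrix(hidden, hidden, rng)
    w_out = rand_matrix(len(TAGS), 2 * hidden, rng)
    forward = rnn_pass(inputs, fwd_in, fwd_rec, hidden)               # left context
    backward = rnn_pass(inputs[::-1], bwd_in, bwd_rec, hidden)[::-1]  # right context
    tags = []
    for f, b in zip(forward, backward):
        scores = matvec(w_out, f + b)  # concatenate both directions, then score
        tags.append(TAGS[scores.index(max(scores))])
    return tags


tokens = "he told me not to look at him . he said".split()
tags = tag(tokens)
```

Because the backward pass has already read "said" when it reaches "told", the score for "told" can depend on the softer verb that follows it, which is the kind of tone change a forward-only network could not see at that position.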
[Diagram labels: RASTA, SOX, MATLAB, PYTHON, RASTA 12]
Dialect identification via images and DIGITS
NIST evaluation of 500 hours and 20 dialects
Preliminary Results
[Bar chart: preliminary result per dialect, axis 0 to 100. Dialect labels as extracted: English-, Portuguese-Brazilian, Spanish-, Spanish-European, Chinese-Min_Dong, Arabic, Chinese-Cantonese, Arabic-Egyptian, English-British, Spanish-Caribbean, Slavic-Russian, Arabic-Maghrebi, Chinese-Mandarin, Arabic-Iraqi, English-American, Chinese-Wu, Slavic-Polish, French-Haitian, Arabic-Levantine, French-West_African]
https://celebsoundalike.com/
Tweet your results to @intelligentvox
THANK YOU
nigel.cannings@intelligentvoice.com @intelligentvox