
Natural language analysis to detect Parkinson’s disease



  1. Natural language analysis to detect Parkinson’s disease
     Paula Andrea Pérez-Toro (1), Juan Camilo Vásquez-Correa (1, 2), Martin Strauss (2), Juan Rafael Orozco-Arroyave (1, 2), and Elmar Nöth (2)
     (1) Faculty of Engineering, University of Antioquia, Medellín, Colombia
     (2) Pattern Recognition Lab, Friedrich-Alexander University of Erlangen-Nürnberg
     September 30, 2019

  2. Introduction: Parkinson’s Disease (PD)
     • Second most common neurodegenerative disorder worldwide.
     • 6,000,000 Parkinson’s patients around the world.
     • Neurologists evaluate PD according to the MDS-UPDRS-III scale (Goetz et al. 2008).
     Motor impairments
     • Bradykinesia
     • Rigidity
     • Resting tremor
     • Micrographia
     • Dysarthria
     J. C. Vásquez-Correa | TSD 2019, Ljubljana, Slovenia September 30, 2019

  3. Introduction: Parkinson’s Disease (PD)
     Non-motor symptoms
     • Sleep disturbances.
     • Depression.
     • Cognitive impairments.
     • Communication disorders.

  4. Introduction: Parkinson’s Disease (PD)
     Communication and language impairments
     • Deficits in grammar production.
     • Less use of action verbs.
     • Low information content.
     • Simple syntax.
     • Differences in sentence length, number of propositions, and grammatical complexity.

  5. Introduction: Hypothesis and Aims
     Hypothesis: We believe that NLP methods can capture the effect of the language impairments that affect communication capabilities in PD, and can also detect the presence of the disease.

  6. Introduction: Hypothesis and Aims
     Hypothesis: We believe that NLP methods can capture the effect of the language impairments that affect the communication capabilities of PD patients, and can detect the presence of the disease.
     Aims:
     • To model components related to communication deficits in PD using verbal information.
     • To analyze the suitability of NLP methods to discriminate PD patients vs. Healthy Control (HC) subjects.

  7. Database
     Table: General information of the subjects. Time since diagnosis, age, and education are given in years.

                                  PD patients                 HC subjects
     Gender [F/M]                 25/25                       25/25
     Age [F/M]                    60.7 (7.3) / 61.3 (11.7)    61.4 (7.1) / 60.5 (11.6)
     Education [F/M]              11.5 (4.1) / 10.9 (4.5)     11.5 (5.2) / 10.6 (4.4)
     Time since diagnosis [F/M]   12.6 (11.5) / 8.7 (5.8)     -
     MDS-UPDRS-III [F/M]          37.6 (14.0) / 37.8 (22.1)   -

     • The task consisted of asking the participants to talk about their daily routines.
     • Average duration of the monologues: 48 ± 29 seconds for the patients and 45 ± 24 seconds for the healthy subjects.

  8. Methods: Methodology
     [Figure: block diagram of the methodology. Training path: training and development sets → text pre-processing → feature extraction → classifier training. Test path: test sample → text pre-processing → feature extraction → class prediction. Pre-processing covers noisy entities removal and lexicon normalization and yields cleaned text; feature extraction covers BoW, TF-IDF, and W2V; the last stage is classification.]

  9. Methods: Pre-processing
     The data is cleaned and standardized, making it noise-free and ready for analysis.
     [Figure: pre-processing steps: noise removal, lexicon normalization, tokenization.]
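The pre-processing steps named on this slide can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the slides do not specify which entities count as noise or how the lexicon was normalized, so the `FILLERS` set, the use of lowercasing as the only normalization, and the function name `preprocess` are all assumptions.

```python
import re
from typing import List

# Hypothetical filler words treated as noise; the slides do not list them.
FILLERS = {"eh", "mmm"}


def preprocess(transcript: str) -> List[str]:
    """Return the cleaned tokens of one transcript."""
    # Lexicon normalization (assumed here to be simple lowercasing).
    text = transcript.lower()
    # Noise removal: replace punctuation, digits, and underscores with spaces.
    text = re.sub(r"[^\w\s]|\d|_", " ", text)
    # Tokenization: split on whitespace.
    tokens = text.split()
    # Drop the assumed filler words.
    return [t for t in tokens if t not in FILLERS]


tokens = preprocess("Me levanto, eh, a las 7 y desayuno.")
```

A stemmer or lemmatizer could replace the lowercasing step if a stronger normalization is wanted.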

  10. Methods: Bag of Words (BoW)
      Collection of words into a feature vector.
      1. The sentences are represented as a collection of words.
      2. Vocabulary → 1182 words.
      3. The words of the transcripts are counted and stored as the feature vector.
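The three steps above can be sketched in a few lines. The two toy transcripts and the helper names `build_vocabulary` and `bow_vector` are made up for illustration; with the real transcripts the vocabulary would have 1182 entries instead of 7.

```python
from collections import Counter


def build_vocabulary(docs):
    """Sorted list of the distinct words found in all tokenized transcripts."""
    return sorted({word for doc in docs for word in doc})


def bow_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in one transcript."""
    counts = Counter(doc)
    return [counts[word] for word in vocabulary]


# Toy tokenized transcripts (illustrative only, not from the database).
docs = [
    ["me", "levanto", "y", "desayuno"],
    ["leo", "y", "leo", "el", "periodico"],
]
vocab = build_vocabulary(docs)
features = [bow_vector(doc, vocab) for doc in docs]
```

Each transcript becomes one vector with as many entries as the vocabulary has words, so every subject gets a feature vector of the same length.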

  11. Methods: Term Frequency-Inverse Document Frequency (TF-IDF)
      • TF: the frequency of a specific word in a document.
      • IDF: reflects how rare the word is across the collection of documents.
      • TF-IDF features aim to model the vocabulary of the patients and the relevance of the words they use in their transcripts.
      • The TF-IDF weight W_{i,j} of the term i in the document j is given by:

        \[ W_{i,j} = \mathrm{TF}_{i,j} \, \log\left( \frac{N}{df_i} \right) \]

        TF_{i,j}: the number of occurrences of the term i in the document j.
        df_i: the number of documents containing i.
        N: the total number of documents.
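A minimal sketch of the weighting formula on this slide, written directly from its definitions. The slide does not state the base of the logarithm, so the natural logarithm is an assumption, and the toy transcripts and the function name `tfidf` are illustrative.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, with W = TF * log(N / df)."""
    n_docs = len(docs)
    # df: number of documents that contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # number of occurrences of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy tokenized transcripts (illustrative only).
docs = [
    ["me", "levanto", "y", "desayuno"],
    ["leo", "y", "leo", "el", "periodico"],
]
weights = tfidf(docs)
```

In this toy case "y" occurs in both documents, so its weight is zero, while "leo" occurs twice in a single document and gets the largest weight.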

  12. Methods: Word2Vec (W2V)
      • A neural network with one hidden layer.
      • Input → one-hot encoding representation of the words.
      • Activations of the hidden layer are the “word vectors”.
      [Figure: network with an input layer (x_1 ... x_V), a hidden layer (h_1 ... h_N), and an output layer (y_1 ... y_V). The weight matrices are labelled W_{V×N} = {w_ki} and W'_{V×N} = {w'_ij}.]
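The reason the hidden activations can be read as word vectors is that a one-hot input selects a single row of the input weight matrix. The sketch below shows this for one input word with a toy matrix; the sizes (V = 3, N = 2), the values, and the function names are made up for illustration.

```python
def one_hot(index, size):
    """One-hot vector with a 1 at the given vocabulary index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def hidden_activation(x, W):
    """Compute h = W^T x for a V x N weight matrix W and a length-V input x."""
    n_hidden = len(W[0])
    return [sum(x[k] * W[k][i] for k in range(len(W))) for i in range(n_hidden)]


# Toy weights: 3 vocabulary words, 2 hidden units.
W = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]
h = hidden_activation(one_hot(1, 3), W)
```

Here `h` equals row 1 of `W`, so after training each row of the matrix serves as the vector of the corresponding word.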

  13. Methods: Word2Vec (W2V)
      • The model was trained with a continuous bag of words (CBOW) architecture.
      • Trained using the Spanish WikiCorpus, which contains 120 million words.
      • The model considered a window size of 7 words to model the temporal context.
      • The dimension of the word vectors was set to 100.
      • Statistical functionals were computed for the transcript of each user: average, standard deviation, skewness, and kurtosis.
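One way to turn the word vectors of a transcript into a fixed-length feature vector with the four functionals is sketched below. Several details are assumptions, since the slides do not give them: the functionals are computed per dimension and concatenated (which would give 400 features for 100-dimensional vectors), population moments are used, and kurtosis is the non-excess (Pearson) form.

```python
import math


def functionals(vectors):
    """Per-dimension mean, std, skewness, and kurtosis, concatenated in that order."""
    n = len(vectors)
    dims = len(vectors[0])
    means, stds, skews, kurts = [], [], [], []
    for d in range(dims):
        column = [v[d] for v in vectors]
        mean = sum(column) / n
        # Central moments of order 2, 3, and 4 (population estimates).
        m2 = sum((x - mean) ** 2 for x in column) / n
        m3 = sum((x - mean) ** 3 for x in column) / n
        m4 = sum((x - mean) ** 4 for x in column) / n
        means.append(mean)
        stds.append(math.sqrt(m2))
        # Constant dimensions have no defined shape; 0.0 is used as a placeholder.
        skews.append(m3 / m2 ** 1.5 if m2 > 0 else 0.0)
        kurts.append(m4 / m2 ** 2 if m2 > 0 else 0.0)
    return means + stds + skews + kurts


# Toy transcript with three 2-dimensional word vectors (illustrative only).
word_vectors = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
feats = functionals(word_vectors)
```

The result has the same length for every subject regardless of how many words the transcript contains, which is what a classifier needs.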

  14. Methods: Classification
      • Two classifiers are considered: a soft-margin Support Vector Machine (SVM) with a Gaussian kernel, and a Random Forest (RF).
      • Validation: a ten-fold cross-validation scheme was implemented.
      • An early fusion strategy was implemented to combine the different feature sets.
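Early fusion and the ten-fold split can be sketched without any classifier library. Early fusion is taken here to mean concatenating the feature vectors of each subject before classification. Whether the original folds were stratified by class is not stated on the slides, so the stratification, the fixed seed, and the function names are assumptions.

```python
import random


def early_fusion(*feature_sets):
    """Concatenate, subject by subject, the vectors of several feature sets."""
    return [
        [value for vector in per_subject for value in vector]
        for per_subject in zip(*feature_sets)
    ]


def stratified_folds(labels, k=10, seed=0):
    """Split subject indices into k folds with a balanced class mix."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# 50 PD patients (label 1) and 50 HC subjects (label 0), as in the database.
labels = [1] * 50 + [0] * 50
folds = stratified_folds(labels)
fused = early_fusion([[1, 2]], [[3]], [[4, 5]])
```

In each round one fold serves as the test set and the remaining nine are used for training, so every subject is tested exactly once.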

  15. Results
      [Figure: word cloud representation. A) PD patient. B) HC subject.]

  16. Results
      Table: Classification results.

                  RBF-SVM                           RF
      Features    Acc(%)  Sens(%)  Spe(%)  AUC      Acc(%)  Sens(%)  Spe(%)  AUC
      BoW         62.0    70.0     54.0    0.60     70.0    74.0     66.0    0.76
      TF-IDF      58.0    58.0     56.0    0.60     67.0    68.0     66.0    0.71
      W2V         72.0    92.0     52.0    0.66     67.0    74.0     60.0    0.71
      Fusion      60.0    62.0     58.0    0.62     66.0    68.0     64.0    0.71

      Notes: Acc: accuracy. Sens: sensitivity. Spe: specificity. AUC: area under the ROC curve.
      • PD patients are better discriminated than HC subjects in most cases (sensitivity is higher than specificity).
      • The fusion strategy did not improve the results, which suggests that the considered features are not complementary.
      • Further research is required to find an optimal strategy to merge such information.
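The three percentage columns of the table follow from the confusion counts, with PD as the positive class. The counts in the example are not reported in the slides: they are inferred from the BoW/RBF-SVM row under the assumption that all 50 PD patients and 50 HC subjects were classified once.

```python
def metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) in percent."""
    sensitivity = 100.0 * tp / (tp + fn)   # share of PD patients detected
    specificity = 100.0 * tn / (tn + fp)   # share of HC subjects recognized
    accuracy = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity


# Inferred counts for BoW with the RBF-SVM: 35 of 50 PD and 27 of 50 HC correct.
acc, sens, spe = metrics(tp=35, fn=15, tn=27, fp=23)
```

With balanced classes the accuracy is the mean of sensitivity and specificity, which is a quick consistency check for the rows of the table.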

  17. Results
      [Figure: ROC curves for the different feature sets (true positive vs. false positive). BoW: AUC = 0.7588. TF-IDF: AUC = 0.7098. W2V: AUC = 0.7018. Fusion: AUC = 0.7084.]

  18. Results
      [Figure: scores obtained for the BoW feature set.]

  19. Conclusion
      • Several NLP techniques were considered in this paper to discriminate between HC subjects and PD patients.
      • The proposed approach allows the study of different communication disorders that cannot be observed in motor activities.
      • PD patients mainly do passive activities such as reading, thinking, and taking their medication, while HC subjects do more active ones.
      • The results suggest that there is information that reflects language impairments in PD patients.

  20. Conclusion
      • Limitation: the task performed by the participants might reflect the difference between the daily routines of the patients and the HC subjects rather than the communication deficits of PD patients.
      • Our team is currently collecting more recordings to evaluate the suitability of other tasks.
      • Further experiments will explore more robust word embedding methods such as ELMo or BERT to improve the performance of the system.
      • Fusion of acoustic and language information will be implemented.
      • Specific non-motor impairments of PD patients, such as depression and anxiety, will be evaluated in further experiments.

  21. Thank you for your attention. Questions?
      Camilo Vasquez
      Pattern Recognition Lab, Department of Computer Science, Friedrich-Alexander University Erlangen-Nürnberg, Erlangen, Germany
      juan.vasquez@fau.de
      This project has received funding from the European Union’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie grant agreement No 766287. This project was also funded by CODI at UdeA, grant # PRG2017-15530.

  22. References
      Goetz, C. G. et al. (2008). “Movement Disorder Society-sponsored revision of the Unified Parkinson’s Disease Rating Scale (MDS-UPDRS): Scale presentation and clinimetric testing results”. In: Movement Disorders 23.15, pp. 2129–2170.
