MULTILINGUAL AUTOMATED TEXT ANONYMIZATION Francisco Dias


  1. MULTILINGUAL AUTOMATED TEXT ANONYMIZATION
     Francisco Dias, francisco.m.c.dias@tecnico.ulisboa.pt
     03/06/2016, Instituto Superior Técnico, Slide 1 of 52

  2. - INTRODUCTION
     - RELATED WORK
     - ARCHITECTURE
     - ANONYMIZATION METHODS
     - EVALUATION
     - INTEGRATING OUR SYSTEM
     - CONCLUSION

  3. INTRODUCTION
     ANONYMIZATION
     - From the Ancient Greek anónumos ("without name");
     - It suppresses names and sensitive information;
     TEXT
     - It processes data in the form of text;
     - A text contains unstructured data;
     AUTOMATED
     - It runs without human intervention;
     MULTILINGUAL
     - It processes texts written in different languages.

  4. MOTIVATION
     - Sharing information in text form is important in some areas (clinical and scientific research, decision making, among others);
     - Texts may contain private information, protected by law;
     - To share information in text form, all sensitive information should be removed;
     - Manual redaction is a hard and time-consuming task.
     An automated anonymization system could help with this task.

  5. CHALLENGE
     - Implement a multilingual anonymization system:
       → STRING NLP Chain;
       → Unbabel Translation Pipeline;
     - Support 4 languages: English, German, Portuguese, Spanish;
     - Evaluate the anonymization system:
       → does it remove all sensitive information?
       → does it replace the same entities with the same label?
       → do the results look natural to a human reader?

  6. RELATED WORK

  7. RELATED WORK
     - Most previous work is based on NER (Named Entity Recognition) techniques;
     - Previous work was evaluated on the detection of entities in the text;
     - i2b2 launched two de-identification challenges, in 2006 and 2014.

  8. RELATED WORK - MITRE, Wellner et al., 2006
     - Model-based and pattern-matching techniques;
     - Best performance in the i2b2 2006 challenge.

  9. RELATED WORK - Szarvas et al. system, 2006
     - Model-based classifiers running in parallel, plus a voting module;
     - Post-processing iteration to detect more candidates.

  10. RELATED WORK - Aramaki et al. system, 2006
      - A CRF* classifier detects candidates for sensitive information;
      - Label-consistency post-processing.
      * CRF: Conditional Random Fields

  11. RELATED WORK - HIDE, Gardner et al., 2008
      - A CRF classifier detects candidates for sensitive information;
      - Uses coreferences to detect more candidates.

  12. RELATED WORK - "Nottingham System", Yang & Garibaldi, 2014
      - Model-based (CRF) and pattern-matching techniques;
      - Uses coreferences to detect more candidates;
      - Best performance in the i2b2 2014 challenge.

  13. ARCHITECTURE

  14. ARCHITECTURE
      - Pipeline with 5 modules;
      - Based on NER techniques;
      - Post-processing and coreference modules;
      Pipeline: Pre-processing → NER → Second-pass Detection → Coreference Resolution → Anonymization

  15. ARCHITECTURE
      - The NER module detects sensitive information contained in the text;
      - It is composed of several parallel components;
      Components: Main NER Classifier, Pattern-matching, Parallel NER Classifier 1, Parallel NER Classifier 2 → Voting
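
The voting step on the slide above could be sketched as a per-token majority vote over the labels proposed by each parallel component. The slides do not say how votes are combined, so this is only an illustration: the function name `vote`, the `O` (outside) label and the tie-break in favour of the main classifier are all assumptions.

```python
from collections import Counter

def vote(predictions, main_index=0):
    """Combine per-token labels from several detectors by majority vote.

    predictions: one label sequence per detector, all of the same length.
    Ties go to the detector at main_index (an assumption, not from the slides).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all detectors must label the same tokens")
    result = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        best = max(counts.values())
        winners = [label for label, n in counts.items() if n == best]
        main_label = labels[main_index]
        result.append(main_label if main_label in winners else winners[0])
    return result

# Hypothetical outputs for the tokens: Maria lives in Lisbon
main = ["PERSON", "O", "O", "O"]
patterns = ["O", "O", "O", "LOCATION"]
parallel_1 = ["PERSON", "O", "O", "LOCATION"]
parallel_2 = ["PERSON", "O", "O", "LOCATION"]
combined = vote([main, patterns, parallel_1, parallel_2])
```

Here the main classifier misses "Lisbon", but the other three components outvote it.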

  16. ARCHITECTURE
      SECOND-PASS DETECTION
      - Post-processing step: corrections over NER results;
      - It applies Short-forms and Label-Consistency;
      COREFERENCE RESOLUTION
      - Groups named entities into mentions;
      - Each mention refers to the same extra-linguistic object;
      ANONYMIZATION MODULE
      - Implements anonymization methods;
      - Returns an anonymized text and a table of solutions.
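
One way to picture second-pass detection is as label propagation: every word that NER labelled inside an entity is remembered, and unlabelled repeats of that word (including short forms such as a surname on its own) receive the same label. This is a rough sketch under that reading of "Short-forms" and "Label-Consistency"; the function `second_pass`, the `O` label and the capitalised-words-only restriction are assumptions rather than the system's actual rules.

```python
def second_pass(tokens, labels, outside="O"):
    """Propagate NER labels to unlabelled repeats and short forms."""
    known = {}
    i = 0
    while i < len(tokens):
        if labels[i] == outside:
            i += 1
            continue
        # Find the end of the run of tokens sharing this entity label.
        j = i
        while j < len(tokens) and labels[j] == labels[i]:
            j += 1
        # Remember each capitalised word of the entity; lower-case words
        # such as "of" are skipped to limit false positives (assumption).
        for word in tokens[i:j]:
            if word[:1].isupper():
                known.setdefault(word, labels[i])
        i = j
    # Only unlabelled tokens are changed; existing labels are kept.
    return [known.get(tok, lab) if lab == outside else lab
            for tok, lab in zip(tokens, labels)]

tokens = ["Francisco", "Dias", "spoke", ".", "Dias", "left", "."]
labels = ["PERSON", "PERSON", "O", "O", "O", "O", "O"]
corrected = second_pass(tokens, labels)
```

The second "Dias", missed by the first pass, is labelled `PERSON` after the correction.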

  17. ANONYMIZATION METHODS

  18. ANONYMIZATION METHODS
      - The methods obfuscate the original entities in the text using replacement tags or entities;
      - We implemented 4 anonymization methods:
        - Suppression: Lisbon → *****
        - Tagging: Lisbon → [LOCATION]
        - Random Substitution: Lisbon → Cairo
        - Generalization: Lisbon → City
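
The two simplest methods listed above, suppression and tagging, can be sketched as span replacement over the detected entities. This is a minimal illustration, not the system's code: the function `anonymize`, the character-offset span format and the shape of the returned table of solutions are assumptions.

```python
def anonymize(text, entities, method="tagging"):
    """Replace entity spans; return (anonymized_text, table_of_solutions).

    entities: non-overlapping (start, end, label) character spans.
    The table pairs each replacement with the original string it hides.
    """
    if method not in ("suppression", "tagging"):
        raise ValueError(f"unsupported method: {method}")
    pieces, table, cursor = [], [], 0
    for start, end, label in sorted(entities):
        replacement = "*****" if method == "suppression" else f"[{label}]"
        pieces.append(text[cursor:start])
        pieces.append(replacement)
        table.append((replacement, text[start:end]))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), table

text = "Maria flew to Lisbon."
entities = [(0, 5, "PERSON"), (14, 20, "LOCATION")]
tagged, solutions = anonymize(text, entities, method="tagging")
suppressed, _ = anonymize(text, entities, method="suppression")
```

With these inputs, `tagged` is `[PERSON] flew to [LOCATION].` and `suppressed` is `***** flew to *****.`.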

  19. RANDOM SUBSTITUTION
      - Random substitution replaces an entity with another random entity of the same class and with the same morphosyntactic features;
      - Morphosyntactic features are determined by the headword:
        ● Aeroporto (masc, sing) → Recinto (masc, sing)
        ● Frankfurts (masc, sing, genitive) → Wegs (masc, sing, genitive)
      - Random entities are looked up from a default list of entities:

      | Language | Class    | Number   | Gender    | Case       | Term        |
      |----------|----------|----------|-----------|------------|-------------|
      | PT       | location | singular | masculine |            | recinto     |
      | ES       | location | singular | feminine  |            | arena       |
      | EN       | location | singular |           |            | venue       |
      | DE       | location | singular | neuter    | nominative | Wahrzeichen |
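
The lookup described on this slide can be sketched as a filter over the default list followed by a random choice. The four entries below are the ones shown in the slide's table; the dictionary layout, the function `random_substitute` and the use of `None` for features a language does not mark are assumptions.

```python
import random

# Default list of entities, transcribed from the table on the slide.
DEFAULT_ENTITIES = [
    {"lang": "PT", "cls": "location", "number": "singular",
     "gender": "masculine", "case": None, "term": "recinto"},
    {"lang": "ES", "cls": "location", "number": "singular",
     "gender": "feminine", "case": None, "term": "arena"},
    {"lang": "EN", "cls": "location", "number": "singular",
     "gender": None, "case": None, "term": "venue"},
    {"lang": "DE", "cls": "location", "number": "singular",
     "gender": "neuter", "case": "nominative", "term": "Wahrzeichen"},
]

def random_substitute(lang, cls, features, rng=random):
    """Pick a random entry matching language, class and headword features.

    features: e.g. {"number": "singular", "gender": "masculine"}.
    Returns None when no entry in the default list matches.
    """
    candidates = [
        entry["term"] for entry in DEFAULT_ENTITIES
        if entry["lang"] == lang and entry["cls"] == cls
        and all(entry.get(key) == value for key, value in features.items())
    ]
    return rng.choice(candidates) if candidates else None

# Headword "Aeroporto" is masculine singular, so the replacement must be too.
replacement = random_substitute(
    "PT", "location", {"number": "singular", "gender": "masculine"})
```

With only one Portuguese entry in this toy list, the choice is always `recinto`; a real list would hold many candidates per feature combination.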

  20. GENERALIZATION
      - Generalization is any method of replacing an entity with another that mentions an item of the same type but in a more general way;
      - This method accesses a Knowledge Base to retrieve the superclasses of a given entity.
      Example: City is the superclass of Berlin, London, Lisbon and Madrid.
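
Superclass retrieval can be illustrated with a toy knowledge base held in a dictionary. The slides do not name the Knowledge Base or its query interface, so the mapping `KNOWLEDGE_BASE` and the functions `superclasses` and `generalize` are hypothetical stand-ins; only the four cities and their `City` superclass come from the slide.

```python
# Toy knowledge base: each entity or class points to its direct superclass.
KNOWLEDGE_BASE = {
    "Berlin": "City",
    "London": "City",
    "Lisbon": "City",
    "Madrid": "City",
}

def superclasses(entity, kb):
    """Return the chain of superclasses, nearest first, guarding against cycles."""
    chain = []
    seen = {entity}
    current = entity
    while current in kb and kb[current] not in seen:
        current = kb[current]
        chain.append(current)
        seen.add(current)
    return chain

def generalize(entity, kb, fallback=None):
    """Replace an entity with its nearest superclass, or fallback if unknown."""
    chain = superclasses(entity, kb)
    return chain[0] if chain else fallback

general = generalize("Lisbon", KNOWLEDGE_BASE)
```

Returning the whole chain from `superclasses` leaves room to pick a more general replacement when the nearest superclass is still too revealing.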

  21. EVALUATION

  22. EVALUATION
      1) Detection of sensitive information (named entities)
         → does it remove all sensitive information?
      2) Performance of the coreference between entities
         → does it replace the same entities with the same label?
      3) Adequacy of the replacements
         → do the results look natural to a human reader?
      Evaluation in previous studies on anonymization was aimed at:
      - detection of entities in a text (we also evaluate points 2 and 3);
      - clinical report text style (we target various text styles).

  23. DATASETS
      - We target different text domains and languages;
      - We use corpora divided into documents from 2 different sources, with different text domains for each language:
        - English: CoNLL 2003 + DCEP
        - German: CoNLL 2003 + DCEP
        - Portuguese: Segundo HAREM + DCEP
        - Spanish: CoNLL 2002 + DCEP
      - DCEP reports were manually annotated for named entities;
      - All datasets were manually annotated for coreference between entities.

  24. DETECTION OF SENSITIVE INFORMATION
      - Intrinsic evaluation of the performance of NER: F1-score (also recall);
      - 3 classes of entities: Location, Organization and Person;
      - 5 configurations:
        - Baseline;
        - Baseline + Pattern-matching;
        - Baseline + Second-pass Detection;
        - Baseline + Parallel NER classifier;
        - Baseline + All previous configurations;
      - Statistically different results from the baseline.
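
For reference, precision, recall and F1-score over detected entities can be computed as below. The slides do not state the matching criterion, so exact matching of span and class is an assumption here, and the function `prf1` and the example spans are purely illustrative.

```python
def prf1(gold, predicted):
    """Precision, recall and F1 over exactly matching (start, end, class) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical annotations: token spans with one of the 3 entity classes.
gold = [(0, 1, "PERSON"), (5, 6, "LOCATION"), (9, 11, "ORGANIZATION")]
predicted = [(0, 1, "PERSON"), (5, 6, "ORGANIZATION")]
precision, recall, f1 = prf1(gold, predicted)
```

Recall is the figure that matters most for anonymization, since every missed entity is sensitive information left in the text.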

  25. DETECTION OF SENSITIVE INFORMATION
      - Baseline performance depends on the text domain;
      - Gazetteers significantly* improve recall;
      - Second-pass detection significantly* improves F1-score on some datasets, but also adds false positives;
      - Parallel NER improves performance (same training text domain), but not significantly* when compared with Second-pass in CoNLL;
      - All modules together improve F1-score only on DCEP;
      * p < 0.01, compared with the baseline

  26. DETECTION OF SENSITIVE INFORMATION

  27. DETECTION OF SENSITIVE INFORMATION

  28. COMPARING WITH I2B2 RESULTS
      - 2006: 5th, against 6 other systems;
      - 2014: 5th, against 10 other systems.

  29. COREFERENCE RESOLUTION
      - Baseline: no coreference;
      - Metrics: B-Cubed Score;
      - Results depend on the language and text domain;
      - Performance of coreference resolution is satisfactory:
        - Precision close to 1.0;
        - Recall much higher than the baseline.
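
The B-Cubed score averages, over all mentions, how much of each mention's predicted cluster is correct (precision) and how much of its true cluster was recovered (recall). The sketch below assumes gold and system output cover the same set of mentions; the function `b_cubed` and the mention identifiers are made up for illustration.

```python
def b_cubed(gold_clusters, response_clusters):
    """Return (precision, recall) of the B-Cubed coreference metric."""
    gold_of = {m: frozenset(c) for c in gold_clusters for m in c}
    resp_of = {m: frozenset(c) for c in response_clusters for m in c}
    mentions = gold_of.keys() & resp_of.keys()
    if not mentions:
        return 0.0, 0.0
    precision = sum(len(gold_of[m] & resp_of[m]) / len(resp_of[m])
                    for m in mentions) / len(mentions)
    recall = sum(len(gold_of[m] & resp_of[m]) / len(gold_of[m])
                 for m in mentions) / len(mentions)
    return precision, recall

# Gold: m1, m2 and m3 refer to the same entity; m4 stands alone.
gold = [{"m1", "m2", "m3"}, {"m4"}]
# "No coreference" baseline: every mention is its own cluster.
baseline = [{"m1"}, {"m2"}, {"m3"}, {"m4"}]
# A hypothetical system output that links only m1 and m2.
system = [{"m1", "m2"}, {"m3"}, {"m4"}]

baseline_p, baseline_r = b_cubed(gold, baseline)
system_p, system_r = b_cubed(gold, system)
```

In this toy case both outputs reach a precision of 1.0, while recall rises from 0.5 for the baseline to about 0.67 for the system, the same pattern the slide reports.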

  30. COREFERENCE RESOLUTION

  31. ANONYMIZATION
      - Metrics: Availability and relevance of a substitution;
      - Effects of anonymization in the coreference of entities;
      - The relevance of a substitution within a context was measured using human raters, as the ratio:
