

  1. An Approach to Automated Thesaurus Construction Using Clusterization-Based Dictionary Analysis
     Nadezhda Lagutina, P.G. Demidov Yaroslavl State University, Yaroslavl, Russia

  2. Thesaurus definition
     A thesaurus is a vocabulary of a controlled indexing language, formally organized so that a priori relationships between concepts are made explicit. [J. Aitchison, A. Gilchrist, D. Bawden. Thesaurus Construction and Use: A Practical Manual]

  3. Thesaurus purpose
     - Indexing documents with concepts from semantic resources and improving the user's search results
     - Classifying documents and dividing them into clusters

  4. Thesaurus construction
     - Domain
     - Clusters
     - Terms
     - Hierarchical relations
     - Associative relations
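The components listed on this slide can be pictured as a small data structure. The following is a minimal sketch, not the author's implementation: the class names `Term` and `Thesaurus`, their fields, and the helper methods are all hypothetical, and the example relations at the bottom are illustrative only.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus term with its relations (hypothetical layout)."""
    name: str
    broader: set[str] = field(default_factory=set)    # hierarchical: parent terms
    narrower: set[str] = field(default_factory=set)   # hierarchical: child terms
    related: set[str] = field(default_factory=set)    # associative relations


@dataclass
class Thesaurus:
    """A domain, its clusters, and its terms (hypothetical layout)."""
    domain: str
    clusters: dict[str, set[str]] = field(default_factory=dict)
    terms: dict[str, Term] = field(default_factory=dict)

    def add_term(self, name: str, cluster: str) -> None:
        # Register the term once and file it under the given cluster.
        self.terms.setdefault(name, Term(name))
        self.clusters.setdefault(cluster, set()).add(name)

    def add_hierarchy(self, broader: str, narrower: str) -> None:
        # Hierarchical relations are stored in both directions.
        self.terms[broader].narrower.add(narrower)
        self.terms[narrower].broader.add(broader)

    def add_association(self, a: str, b: str) -> None:
        # Associative relations are symmetric.
        self.terms[a].related.add(b)
        self.terms[b].related.add(a)


# Illustrative usage; the relations below are examples, not taken from the slides.
th = Thesaurus(domain="Cardiology")
th.add_term("heart", "Standard anatomy of heart and blood vessels")
th.add_term("endocardium", "Standard anatomy of heart and blood vessels")
th.add_term("cardiomegaly", "Pathology of heart and blood vessels")
th.add_hierarchy("heart", "endocardium")
th.add_association("heart", "cardiomegaly")
```

Storing each relation in both directions keeps lookups simple when an expert later reviews a term and wants to see everything connected to it.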

  5. Disadvantages of manual thesaurus-making
     - High cost
     - Long duration
     - Limits of manual analysis on a large text corpus

  6. Automated approach to thesaurus construction
     - Preliminary processing of the text corpus
     - Automatic generation of a set of candidate terms
     - Expert correction of the candidate set produced in the previous step
     - Automatic clustering of the terms
     - Expert evaluation of the clustering results
     - Expert establishment of the semantic relations between the terms

  7. Example of thesaurus construction
     - Domain: Cardiology
     - Dictionary: Online Stedman’s Medical Dictionary
     - Key words: heart, -card-, valv-, vessel, trunk, vascular, vein, artery, aorta, atrium, ventric-, block, hypertension, hypotension

  8. Automatic generation of the candidate term list
     Key word search:
     - by full words
     - by word morphemes
     Quantitative characteristics:
     - number of key words (morphemes) in a dictionary entry (absolute frequency)
     - percentage of key words (morphemes) in a dictionary entry (relative frequency)
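The two quantitative characteristics can be sketched as follows. This is a rough illustration under stated assumptions: the slides do not say how an entry is tokenized or how a match is counted, so the tokenizer, the substring test for morphemes, and the function name `keyword_frequencies` are all hypothetical. The key word lists come from the cardiology example, and the sample entry is the "trochocardia" entry from the slides.

```python
import re

# Key words searched as full words, and morphemes searched as substrings
# (both lists taken from the cardiology example in the slides).
KEY_WORDS = ["heart", "vessel", "trunk", "vascular", "vein", "artery",
             "aorta", "atrium", "block", "hypertension", "hypotension"]
KEY_MORPHEMES = ["card", "valv", "ventric"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (assumed scheme)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def keyword_frequencies(entry, words, morphemes):
    """Return (absolute, relative) key word frequency for one dictionary entry.

    absolute: number of tokens that equal a key word or contain a key morpheme
    relative: the same count as a percentage of all tokens in the entry
    """
    tokens = tokenize(entry)
    word_set = set(words)
    hits = sum(
        1 for tok in tokens
        if tok in word_set or any(m in tok for m in morphemes)
    )
    relative = 100.0 * hits / len(tokens) if tokens else 0.0
    return hits, relative


entry = "trochocardia. A rotary displacement of the heart around its axis."
absolute, relative = keyword_frequencies(entry, KEY_WORDS, KEY_MORPHEMES)
print(absolute, relative)  # 2 hits ("trochocardia", "heart") out of 10 tokens
```

An entry would then be kept as a candidate term when one of these values passes a threshold; the slides do not state the thresholds used, so none is hard-coded here.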

  9. Candidate term list
     - Stedman’s Medical Dictionary: 100 000 terms
     - Candidate term list: 2 039 terms
     - Example: trochocardia. A rotary displacement of the heart around its axis.

  10. Clustering
     The CLOPE algorithm [Y. Yang, X. Guan, J. You. CLOPE: a fast and effective clustering algorithm for transactional data]
     - $T = \{t_1, t_2, \ldots, t_n\}$: set of transactions
     - $t_i = \{x_{i1}, x_{i2}, \ldots\}$: dictionary entry
     - $x_{ij}$: word
     - Clustering: $C = \{C_1, \ldots, C_k\}$

     $$\mathrm{Profit}_r(C) = \frac{\sum_{i=1}^{k} \frac{S(C_i)}{|D(C_i)|^{r}} \times |C_i|}{\sum_{i=1}^{k} |C_i|} \to \max$$
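To make the profit criterion concrete, here is a minimal sketch of it together with a single greedy allocation pass. It follows the CLOPE paper's definitions, where $S(C_i)$ is the total number of item occurrences in cluster $C_i$, $|D(C_i)|$ is the number of distinct items, $|C_i|$ is the number of transactions, and $r$ is the repulsion parameter. The names `Cluster`, `delta_add`, and `clope_first_pass` are hypothetical, the toy transactions are invented for illustration, and the full algorithm additionally repeats passes that move transactions between clusters until the profit stops improving, which this sketch omits.

```python
from collections import Counter


class Cluster:
    """Running statistics for one cluster."""

    def __init__(self):
        self.n = 0            # |C|: number of transactions
        self.size = 0         # S(C): total number of item occurrences
        self.occ = Counter()  # item -> occurrences; len(occ) is |D(C)|

    def add(self, t):
        self.n += 1
        self.size += len(t)
        self.occ.update(t)


def profit(clusters, r):
    """Profit_r(C) as defined on the slide."""
    num = sum(c.size / len(c.occ) ** r * c.n for c in clusters if c.n)
    den = sum(c.n for c in clusters)
    return num / den if den else 0.0


def delta_add(c, t, r):
    """Change in the profit numerator if transaction t joined cluster c."""
    new_size = c.size + len(t)
    new_width = len(c.occ) + sum(1 for x in t if x not in c.occ)
    new_val = new_size * (c.n + 1) / new_width ** r
    old_val = c.size * c.n / len(c.occ) ** r if c.n else 0.0
    return new_val - old_val


def clope_first_pass(transactions, r):
    """Assign each transaction to the existing or new cluster with the best gain."""
    clusters, labels = [], []
    for t in transactions:
        t = set(t)  # a dictionary entry is treated as a set of distinct words
        if not t:
            raise ValueError("empty transactions are not supported in this sketch")
        candidates = clusters + [Cluster()]
        best = max(candidates, key=lambda c: delta_add(c, t, r))
        if best is candidates[-1]:
            clusters.append(best)
        best.add(t)
        labels.append(clusters.index(best))
    return clusters, labels


# Toy transactions, invented for illustration only.
entries = [
    {"heart", "displacement"},
    {"heart", "displacement", "axis"},
    {"vein", "suture"},
    {"vein", "suture", "wall"},
]
clusters, labels = clope_first_pass(entries, r=2.0)
print(labels, profit(clusters, r=2.0))
```

A larger $r$ penalizes clusters with many distinct words more strongly, so it tends to produce more and tighter clusters; the slides do not state which value was used.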

  11. Small clusters (< 10 terms)
     - dexiocardia / dextrocardia
     - pericardium / heart-sac
     - sphygmocardioscope / sphygmocardiograph
     - bradycardia / bradycardia / brachycardia / areocardia / araiocardia

  12. Larger cluster (> 50 terms)
     - [Standard anatomy of heart and blood vessels]: interatrial, interventricular, intravascular, intra-atrial, endocardium, intravenous, intramyocardial, periatrial, …
     - [Pathology of heart and blood vessels]: phlebocholosis, phlebectasia, vasoconstriction, cardiopalmus, cardiomegaly, capillarectasia, …
     - [Standard anatomy of heart and blood vessels]: angiogenesis, intra-auricular, …
     - [Tools and instruments]: phleborrhaphy, venesuture, …
     - [Pathology of heart and blood vessels]: telangiitis, cardiodynia, angiocarditis, omphalophlebitis

  13. Semantic clusters
     - Standard anatomy of heart and blood vessels
     - Standard physiology of heart and blood vessels
     - Pathology of heart and blood vessels
     - Tools and instruments
     - Pharmacology
     - Surgical intervention and manipulations

  14. Results
     - The proposed approach makes it possible to construct a thesaurus corpus that adequately represents the target domain
     - The clustering results simplify the expert’s work because terms from the same area usually follow one another within the clusters
