An Approach to Automated Thesaurus Construction Using Clusterization-Based Dictionary Analysis
Nadezhda Lagutina
P.G. Demidov Yaroslavl State University, Yaroslavl, Russia
Thesaurus definition
A thesaurus is a vocabulary of a controlled indexing language, formally organized so that a priori relationships between concepts are made explicit.
[J. Aitchison, A. Gilchrist, D. Bawden. Thesaurus construction and use: a practical manual]
Thesaurus purpose
Indexing of documents using concepts from semantic resources, and improvement of the user’s search results
Classification and division of documents into clusters
Thesaurus construction
Domain
Clusters
Terms
Hierarchical relations
Associative relations
Disadvantages of manual thesaurus-making
High cost
Long duration
Limitations of manual analysis of a large text corpus
Automated approach for thesaurus construction
Preliminary processing of the text corpus
Automatic generation of a set of candidate terms
Expert correction of the candidate set produced by the previous step
Automatic clustering of the terms, with expert evaluation of the clustering results
Establishment of semantic relations between the terms by the expert
Example of thesaurus construction
Domain: Cardiology
Dictionary: Online Stedman’s Medical Dictionary
Key words: heart, -card-, valv-, vessel, trunk, vascular, vein, artery, aorta, atrium, ventric-, block, hypertension, hypotension
Automatic generation of the candidate term list
Key word search
  by full words
  by word morphemes
Quantitative characteristics
  number of key words (morphemes) in a dictionary entry (absolute frequency)
  percentage of key words (morphemes) in a dictionary entry (relative frequency)
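The two quantitative characteristics above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The matching rules are assumptions: a key written as `-x-` is treated as a morpheme that may occur anywhere in a word, `x-` as a word-initial morpheme, and a key without hyphens as a full word that must match exactly. The tokenization (lowercase alphabetic runs, headword included) and the names `matches` and `keyword_frequencies` are likewise hypothetical.

```python
import re

# Key words and morphemes from the cardiology example.
# Assumption: hyphens mark where a morpheme may attach to the rest of a word.
KEYS = ["heart", "-card-", "valv-", "vessel", "trunk", "vascular", "vein",
        "artery", "aorta", "atrium", "ventric-", "block",
        "hypertension", "hypotension"]


def matches(word, key):
    """Return True if `word` matches `key` under the assumed hyphen rules."""
    stem = key.strip("-")
    if key.startswith("-") and key.endswith("-"):
        return stem in word              # morpheme anywhere in the word
    if key.endswith("-"):
        return word.startswith(stem)     # word-initial morpheme
    return word == stem                  # full-word match (exact, no stemming)


def keyword_frequencies(entry, keys=KEYS):
    """Return (absolute, relative) key-word frequency of a dictionary entry.

    absolute: number of words in the entry that match at least one key
    relative: that number as a percentage of all words in the entry
    """
    words = re.findall(r"[a-z]+", entry.lower())
    hits = sum(1 for w in words if any(matches(w, k) for k in keys))
    relative = 100.0 * hits / len(words) if words else 0.0
    return hits, relative


entry = "trochocardia. A rotary displacement of the heart around its axis."
absolute, relative = keyword_frequencies(entry)
print(absolute, relative)
```

For this entry, two of the ten words match ("trochocardia" through the morpheme -card-, and "heart" as a full word). How such scores are thresholded to admit an entry into the candidate list is not stated on the slide and is left open here.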
Candidate term list
Stedman’s Medical Dictionary: 100 000 terms
Candidate term list: 2 039 terms
Example: trochocardia. A rotary displacement of the heart around its axis.
Clustering
The CLOPE algorithm
[Y. Yang, X. Guan, J. You. CLOPE: a fast and effective clustering algorithm for transactional data]
$T = \{t_1, t_2, \ldots, t_n\}$: set of transactions
$t_i = \{x_{i1}, x_{i2}, \ldots\}$: dictionary entry
$x_{ij}$: word
Clustering $C = \{C_1, \ldots, C_k\}$
$$\mathrm{Profit}_r(C) = \frac{\sum_{i=1}^{k} \frac{S(C_i)}{|D(C_i)|^{r}} \times |C_i|}{\sum_{i=1}^{k} |C_i|} \to \max$$
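To make the criterion concrete, here is a minimal Python sketch of the profit function alone, not of the full CLOPE allocation procedure. Following the CLOPE paper, S(C_i) is taken as the total number of word occurrences in cluster C_i, D(C_i) as its set of distinct words, |C_i| as the number of entries in the cluster, and r as the repulsion parameter. The function name `profit`, the default `r=2.0`, and the toy entries are assumptions for illustration only.

```python
from collections import Counter


def profit(clusters, r=2.0):
    """CLOPE profit of a clustering.

    clusters: list of clusters; each cluster is a list of transactions,
              and each transaction is a set of words (a dictionary entry).
    r:        repulsion parameter (the default is an arbitrary choice).
    """
    numerator = 0.0
    total_entries = 0
    for cluster in clusters:
        if not cluster:
            continue                      # empty clusters contribute nothing
        occurrences = Counter(w for entry in cluster for w in entry)
        size = sum(occurrences.values())  # S(C_i): total word occurrences
        width = len(occurrences)          # |D(C_i)|: number of distinct words
        numerator += size / width ** r * len(cluster)
        total_entries += len(cluster)
    return numerator / total_entries if total_entries else 0.0


# Toy entries (hypothetical): two share most of their words, one shares none.
e1 = {"displacement", "heart"}
e2 = {"displacement", "heart", "axis"}
e3 = {"vein", "suture"}

split = [[e1, e2], [e3]]      # overlapping entries grouped together
merged = [[e1, e2, e3]]       # everything in one cluster

print(profit(split), profit(merged))
```

In this toy case the split clustering scores higher than the merged one, which matches the intent of the criterion: clusters whose entries share many words are rewarded. Larger values of r favor tighter clusters.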
Small cluster (<10 terms)
dexiocardia / dextrocardia
pericardium / heart-sac
sphygmocardioscope / sphygmocardiograph
bradycardia / bradycardia / brachycardia / areocardia / araiocardia
Larger cluster (> 50 terms)
[Standard anatomy of heart and blood vessels]: interatrial, interventricular, intravascular, intra-atrial, endocardium, intravenous, intramyocardial, periatrial, …
[Pathology of heart and blood vessels]: phlebocholosis, phlebectasia, vasoconstriction, cardiopalmus, cardiomegaly, capillarectasia, …
[Standard anatomy of heart and blood vessels]: angiogenesis, intra-auricular, …
[Tools and instruments]: phleborrhaphy, venesuture, …
[Pathology of heart and blood vessels]: telangiitis, cardiodynia, angiocarditis, omphalophlebitis
Semantic clusters
Standard anatomy of heart and blood vessels
Standard physiology of heart and blood vessels
Pathology of heart and blood vessels
Tools and instruments
Pharmacology
Surgical intervention and manipulations
Results
The proposed approach makes it possible to construct a thesaurus corpus that adequately represents the target domain
The clustering results simplify the expert’s