SLIDE 1

An Approach to Automated Thesaurus Construction Using Clusterization-Based Dictionary Analysis

Nadezhda Lagutina
P.G. Demidov Yaroslavl State University
Yaroslavl, Russia

SLIDE 2

Thesaurus definition

A thesaurus is a vocabulary of a controlled indexing language, formally organized so that a priori relationships between concepts are made explicit.

[J. Aitchison, A. Gilchrist, D. Bawden. Thesaurus Construction and Use: A Practical Manual]

SLIDE 3

Thesaurus purpose

• Indexing documents with concepts from semantic resources and improving the user’s search results
• Classifying documents and dividing them into clusters

SLIDE 4

Thesaurus construction

• Domain
• Clusters
• Terms
• Hierarchical relations
• Associative relations
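The components listed above can be pictured as a small data structure. The following Python sketch is illustrative only: the class and field names (`Thesaurus`, `Term`, `broader`, `related`) are assumptions rather than anything defined in the presentation, and the sample relations are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Term:
    """A thesaurus term with its relations (hypothetical layout)."""
    name: str
    cluster: str                                      # semantic cluster of the term
    broader: Set[str] = field(default_factory=set)    # hierarchical relations
    related: Set[str] = field(default_factory=set)    # associative relations


@dataclass
class Thesaurus:
    domain: str
    terms: Dict[str, Term] = field(default_factory=dict)

    def add_term(self, name: str, cluster: str) -> Term:
        """Add a term, or return the existing one with the same name."""
        return self.terms.setdefault(name, Term(name, cluster))

    def add_broader(self, narrower: str, broader: str) -> None:
        """Record a hierarchical relation from a narrower to a broader term."""
        self.terms[narrower].broader.add(broader)

    def add_related(self, a: str, b: str) -> None:
        """Record a symmetric associative relation between two terms."""
        self.terms[a].related.add(b)
        self.terms[b].related.add(a)

    def clusters(self) -> Dict[str, List[str]]:
        """Group term names by cluster, in insertion order."""
        grouped: Dict[str, List[str]] = {}
        for term in self.terms.values():
            grouped.setdefault(term.cluster, []).append(term.name)
        return grouped


# Illustrative usage; the relations below are examples, not results from the talk.
thesaurus = Thesaurus(domain="Cardiology")
thesaurus.add_term("heart", "Standard anatomy of heart and blood vessels")
thesaurus.add_term("endocardium", "Standard anatomy of heart and blood vessels")
thesaurus.add_term("cardiomegaly", "Pathology of heart and blood vessels")
thesaurus.add_broader("endocardium", "heart")
thesaurus.add_related("cardiomegaly", "heart")
```

Storing relations as sets of term names keeps the structure easy to serialize; a real thesaurus tool would likely need typed relations and synonym handling as well.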

SLIDE 5

Disadvantages of manual thesaurus-making

• High cost
• Long duration
• Limits on manual analysis of a large text corpus

SLIDE 6

Automated approach for thesaurus construction

• Preliminary processing of the text corpus
• Automatic generation of a set of candidate terms
• Correction of the candidate set from the previous step by the expert
• Automatic clustering of the terms
• Evaluation of the clustering results by the expert
• Establishment of the semantic relations between the terms by the expert

SLIDE 7

Example of thesaurus construction

• Domain: Cardiology
• Dictionary: Online Stedman’s Medical Dictionary
• Key words: heart, -card-, valv-, vessel, trunk, vascular, vein, artery, aorta, atrium, ventric-, block, hypertension, hypotension

SLIDE 8

Automatic generation of the candidate term list

Key word search

• by full words
• by word morphemes

Quantitative characteristics

• number of key words (morphemes) in a dictionary entry (absolute frequency)
• percentage of key words (morphemes) in a dictionary entry (relative frequency)
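The two characteristics above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes that a key such as `-card-` may match anywhere inside a word, that `valv-` must match at the start of a word, that a key without hyphens must equal a whole word, and that the headword counts as part of the entry. The function names are hypothetical.

```python
import re

# Key words and morphemes from the cardiology example in this presentation.
KEYS = ["heart", "-card-", "valv-", "vessel", "trunk", "vascular", "vein",
        "artery", "aorta", "atrium", "ventric-", "block",
        "hypertension", "hypotension"]


def tokenize(text):
    """Lowercase the entry and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def matches(token, key):
    """Apply the assumed hyphen notation for morphemes to a single token."""
    stem = key.strip("-")
    if key.startswith("-") and key.endswith("-"):
        return stem in token              # morpheme anywhere in the word
    if key.endswith("-"):
        return token.startswith(stem)     # word-initial morpheme
    if key.startswith("-"):
        return token.endswith(stem)       # word-final morpheme
    return token == stem                  # full-word match


def key_word_frequencies(entry, keys=KEYS):
    """Return (absolute frequency, relative frequency in percent) for one entry."""
    tokens = tokenize(entry)
    absolute = sum(1 for tok in tokens if any(matches(tok, k) for k in keys))
    relative = 100.0 * absolute / len(tokens) if tokens else 0.0
    return absolute, relative


entry = "trochocardia. A rotary displacement of the heart around its axis."
print(key_word_frequencies(entry))  # → (2, 20.0)
```

In this entry the matching tokens are "trochocardia" (through the morpheme -card-) and "heart", which gives 2 matches out of 10 tokens. The presentation does not state which threshold on these values admits an entry to the candidate list, so no cut-off is applied here.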

SLIDE 9

Candidate term list

• Stedman’s Medical Dictionary: 100 000 terms
• Candidate term list: 2 039 terms
• Example: trochocardia. A rotary displacement of the heart around its axis.

SLIDE 10

Clustering

The CLOPE algorithm

[Y. Yang, X. Guan, J. You. CLOPE: a fast and effective clustering algorithm for transactional data]

T = {t1, t2, ..., tn}: set of transactions
ti = {xi1, xi2, ...}: dictionary entry
xij: word
Clustering: C = {C1, ..., Ck}

$$\mathrm{Profit}_r(C) = \frac{\sum_{i=1}^{k} \frac{S(C_i)}{|D(C_i)|^{r}} \times |C_i|}{\sum_{i=1}^{k} |C_i|} \to \max$$
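To make the criterion concrete, here is a small Python sketch of the profit function together with a single greedy assignment pass. Following the cited CLOPE paper, S(C_i) is taken as the total number of word occurrences over the entries in cluster C_i, D(C_i) as the set of distinct words in the cluster, and r as the repulsion parameter. The brute-force evaluation, the single pass (the published algorithm keeps iterating until no entry moves), the value r = 2.0, and the toy word sets are all simplifying assumptions made for illustration.

```python
def profit(clusters, r=2.0):
    """Profit_r for a clustering given as a list of clusters.

    Each cluster is a list of transactions, and each transaction is a
    non-empty set of words (one dictionary entry).
    """
    numerator = 0.0
    total = 0
    for cluster in clusters:
        if not cluster:
            continue
        size = sum(len(t) for t in cluster)       # S(C_i)
        width = len(set().union(*cluster))        # |D(C_i)|
        numerator += size / width ** r * len(cluster)
        total += len(cluster)
    return numerator / total if total else 0.0


def greedy_pass(transactions, r=2.0):
    """Place each transaction where it maximizes the overall profit.

    Every existing cluster and one new cluster are tried for each
    transaction. The profit is recomputed from scratch for each trial,
    which is simple but slow.
    """
    clusters = []
    for t in transactions:
        best_profit, best_index = None, None
        for i in range(len(clusters) + 1):
            trial = [list(c) for c in clusters]
            if i == len(clusters):
                trial.append([t])                 # open a new cluster
            else:
                trial[i].append(t)                # join existing cluster i
            p = profit(trial, r)
            if best_profit is None or p > best_profit:
                best_profit, best_index = p, i
        if best_index == len(clusters):
            clusters.append([t])
        else:
            clusters[best_index].append(t)
    return clusters


# Toy entries (invented word sets, not taken from the dictionary).
t1 = {"heart", "valve"}
t2 = {"heart", "valve", "wall"}
t3 = {"drug", "dose"}
result = greedy_pass([t1, t2, t3], r=2.0)
print(len(result))  # → 2
```

With these toy entries the two overlapping word sets end up in one cluster and the unrelated one in its own. A larger r penalizes wide clusters more strongly, so it tends to produce more, smaller clusters.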

SLIDE 11

Small cluster (<10 terms)

dexiocardia / dextrocardia, pericardium / heart-sac, sphygmocardioscope / sphygmocardiograph, bradycardia / bradycardia / brachycardia / areocardia / araiocardia

SLIDE 12

Larger cluster (> 50 terms)

[Standard anatomy of heart and blood vessels]: interatrial, interventricular, intravascular, intra-atrial, endocardium, intravenous, intramyocardial, periatrial, …
[Pathology of heart and blood vessels]: phlebocholosis, phlebectasia, vasoconstriction, cardiopalmus, cardiomegaly, capillarectasia, …
[Standard anatomy of heart and blood vessels]: angiogenesis, intra-auricular, …
[Tools and instruments]: phleborrhaphy, venesuture, …
[Pathology of heart and blood vessels]: telangiitis, cardiodynia, angiocarditis, omphalophlebitis

SLIDE 13

Semantic clusters

• Standard anatomy of heart and blood vessels
• Standard physiology of heart and blood vessels
• Pathology of heart and blood vessels
• Tools and instruments
• Pharmacology
• Surgical intervention and manipulations

SLIDE 14

Results

• The proposed approach makes it possible to construct a thesaurus corpus that adequately represents the target domain
• The clustering results simplify the expert’s work because terms from the same area usually follow one another in the clusters