

SLIDE 1

Introduction Material and Methods Results and Discussion Conclusion and Perspectives

Comparative study between expert and non expert biomedical writings: their morphology and semantics

Jolanta Chmielik (1), Natalia Grabar (1, 2)

(1) INSERM UMRS 872, eq. 20, Université René Descartes, Paris, France

(2) DIH-HEGP, APHP, 20 rue Leblanc, Paris 15

(29/08/2009 - 02/09/2009 — MIE 2009)

SLIDE 2

Outline

Introduction
Material and Method
Results and Discussion
Conclusions and Perspectives

SLIDE 3

Introduction

Internet:
  favoured place for searching medical and health information (Fox, 2006)

Quality of health information: HON, CISMeF

Technical heterogeneity:
  expert and non-expert documents co-exist
  negative effect on communication between medical professionals and patients (AMA, 1999; McCray, 2005)

Distinction of discourses: HON, CISMeF, GoogleCoop health
  categorization remains manual

⇒ Propose criteria for the automatic distinction of discourses
  guide users (especially non-experts) towards appropriate sources of information

Hypothesis: the morpho-semantic level provides relevant criteria

SLIDE 4

Similar work

Readability formulae: mean length of words and sentences (Flesch, 1948; Gunning, 1973; Björnsson et al., 1979)

Readability formulae and medical vocabulary (Kokkinakis & Gronostaj, 2006)

Supervised learning with different features (Poprat et al., 2006; Zheng et al., 2002; Grabar et al., 2007; Goeuriot et al., 2007; Miller et al., 2007)

Combination of various features:
  linguistic features, readability, salient lexicon (Wang, 2006)
  readability, frequent grammatical categories, familiarity of terms (Zeng-Treitler et al., 2007)

More detailed linguistic analysis of discourses:
  Consumer Health Vocabulary: aligned expert and non-expert vocabularies in English (Zeng et al., 2006; Zeng & Tse, 2006)
  analysis of the syntactic level (Zeng-Treitler et al., 2007)
  acquisition and alignment of expert/non-expert paraphrases in French (Deléger & Zweigenbaum, 2008)

SLIDE 5

Objectives

Objectives:
  Analyze and exploit the morpho-semantic level of documents
  Define salient features for the distinction of expert and non-expert medical discourses
  Facilitate the automatic recognition of discourses

Framework of the study:
  Language: French
  Source of documents: CISMeF portal www.cismef.org
  Three topics: cardiology, pneumology, diabetes
  Three discourses: expert, didactic, non-expert

SLIDE 6

Objectives

Objectives:
  ⇒ Analyze and exploit the morpho-semantic level of documents
  Define salient features for the distinction of expert and non-expert medical discourses
  Facilitate the automatic recognition of discourses

Framework of the study:
  Language: French
  Source of documents: CISMeF portal www.cismef.org
  Three topics: cardiology, pneumology, diabetes
  ⇒ Three discourses: expert, didactic, non-expert

SLIDE 7

Building the corpora

Source of documents: CISMeF
  Over 43,000 health and medical documents in French
  Various characterizations of documents:
    1. documents accessible through their URL links
    2. documents indexed with MeSH keywords: cardiology, pneumology, diabetes
    3. documents profiled according to their discourse: for students, professionals, patients

Preparing the corpus:
  Automatic downloading of documents
  Filtering and selection of HTML and XML files
  Conversion to raw text format
  Application of NLP tools for accessing the morpho-semantic level of documents
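
The filtering and conversion steps above could be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' actual pipeline: the names `TextExtractor`, `html_to_text` and `keep_document` are hypothetical, and filtering by file extension is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks and collapse the whitespace left behind by the markup.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Conversion step: turn an HTML string into raw text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def keep_document(filename):
    """Filtering step (assumed criterion): keep HTML and XML files by extension."""
    return filename.lower().endswith((".html", ".htm", ".xml"))
```

Chunks are joined with a space, so text split by inline tags inside a single word would gain a space; a production converter would need to treat inline and block elements differently.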

SLIDE 8

Building the corpora

Size of corpora

Specialties   Number of documents               Number of occurrences
              expert  didactic  non-expert      expert   didactic  non-expert
Cardiology    1,583   205       143             942,409  449,765   157,382
Pneumology      742   127       134             600,524  213,379    96,559
Diabetes        213    23        52             181,039   44,847    29,817

Cardiology provides the largest number of medical documents
Expert corpora are the most complete

SLIDE 9

Accessing the morpho-semantic level

Applied NLP tools

TreeTagger: morpho-syntactic tagging (Schmid, 1994)
  est VER:pres être
  antiinflammatoire PRO:POS antiinflammatoire

FLEMM: lemmatizer and morphological checker (Namer, 2000)
  est VER(pres):3p:s:pst:ind être:3g
  antiinflammatoire NOM: :s antiinflammatoire

DériF: morpho-semantic analyzer (Namer, 2003)
  angioblastique/ADJ
  [ [ angi N* ] [ blast N* ] ique ADJ ]
  (angioblastique/ADJ, [angi,N*]:blast/N*)
  Qui est en relation avec cellule embryonnaire et vaisseau
  (Which is in relation with embryonic cell and vessel)
  Constituants = /angi/blast/ique
  Type = anatomie

SLIDE 10

Accessing the morpho-semantic level

Selection of bases

For each specialty and discourse, the most productive bases are selected:
  size of morphological families
  number of lexemes formed with a given base:
    cardio- (cardie-, carde-): 57 (cardio-did); 26 (cardio-exp); 20 (cardio-nexp)
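
The selection criterion above can be illustrated with a short sketch. It assumes that each lexeme comes with a list of its morphological constituents (in the spirit of the DériF "Constituants" field); the function names and the sample analyses are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def family_sizes(analyses):
    """Map each base to the number of distinct lexemes built with it.

    `analyses` is an iterable of (lexeme, constituents) pairs.
    """
    families = defaultdict(set)
    for lexeme, constituents in analyses:
        for base in constituents:
            families[base].add(lexeme.lower())
    return {base: len(lexemes) for base, lexemes in families.items()}


def most_productive(sizes, n):
    """Return the n bases with the largest families (ties broken alphabetically)."""
    return sorted(sizes, key=lambda base: (-sizes[base], base))[:n]


# Hypothetical analyses for illustration only (not actual DériF output).
sample = [
    ("cardiologie", ["cardio", "logie"]),
    ("cardiopathie", ["cardio", "pathie"]),
    ("angiopathie", ["angi", "pathie"]),
    ("angioblastique", ["angi", "blast", "ique"]),
    ("cardiologie", ["cardio", "logie"]),  # repeated occurrence, counted once
]
sizes = family_sizes(sample)
top = most_productive(sizes, 3)
```

Using a set per base means that repeated occurrences of the same lexeme do not inflate the family size, which keeps productivity separate from frequency.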

SLIDE 11

Studying and contrasting discourses

Two approaches for contrasting expert, didactic and non-expert discourses:
  1. Productivity of bases within morphological families:
       size of morphological families
       the more productive a base is, the larger its family is
  2. Frequencies of lexemes within corpora

Two values for features:
  Raw values:
    the exact number of constructed lexemes
  Normalized values:
    normalization by the size of the corresponding corpus
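
A small sketch of the difference between raw and normalized values, using the cardio- figures and the cardiology corpus sizes given in the slides. Two points are assumptions: that cardio-did, cardio-exp and cardio-nexp refer to the didactic, expert and non-expert cardiology corpora, and that corpus size is measured in occurrences and scaled per 1,000 (the slides do not state the exact normalization).

```python
def normalize(raw_value, corpus_size, per=1000):
    """Scale a raw count by corpus size, expressed per `per` occurrences."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_value * per / corpus_size


# Raw family sizes for cardio- and cardiology corpus sizes, as given in the slides.
raw = {"expert": 26, "didactic": 57, "non-expert": 20}
occurrences = {"expert": 942_409, "didactic": 449_765, "non-expert": 157_382}

# Normalized values under the assumed per-1,000-occurrences scaling.
normalized = {d: normalize(raw[d], occurrences[d]) for d in raw}
for discourse, value in normalized.items():
    print(f"{discourse:<11} raw={raw[discourse]:>2}  per 1,000 occurrences={value:.3f}")
```

Under these assumptions the expert corpus has the larger raw family (26 against 20) but the smaller normalized value, because it is about six times larger than the non-expert corpus.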

SLIDE 12

Results and Discussion

1. Selected morphological material
2. Productivity of bases
3. Frequencies of bases

SLIDE 13

Selected morphological material

Total number of the selected bases: n=45
  38 suppletive bases:
    32 Greek bases: angi(o)-
    6 Latin bases: artério-
  7 autonomous bases:
    bronches (bronchus), bactérie (bacterium), ...

2,295 lexemes constructed with these 45 bases

SLIDE 14

Selected morphological material

Limitations of the NLP tools

TreeTagger:
  Web-related problems:
    concatenated words: diabétiquesaide, diabétiquedidier, bronchopathieschronique
    encoding: diabÃ¨tique
    misspelling: diabèéte, diadétique, cardiomoypathie
  Conversion not processed:
    {diabétique/Adj, diabétique/N}

DériF:
  Lexemes missing in the reference lexicon and not processed:
    épinéphrine, rétéplase, phosphodiestérase
  Some morphological components not processed:
    -logue: neurologue, diabétologue, ...
  Erroneous morpho-semantic analyses:
    gymnase: enzyme du nu (enzyme of the nude)
  Non grouped morphological bases:
    hém(o)-, hém(a)-, hémato-, -èm-

SLIDE 15

Productivity of bases

[Figure: raw productivity and normalized productivity of bases by discourse (non-expert, didactic, expert) for diabetes, cardiology and pneumology. (a) Raw values; (b) Normalized values]

Didactic: the largest families (except diabetes):

  didactic documents ⇒ more diversified vocabulary

Expert vs non-expert:
  (a) raw size of families: expert > non-expert
  (b) normalized size of families: expert < non-expert

Among the most productive morphological families:
  hém(o)-: relatif au sang (related to blood)
  -pathie: en relation avec une maladie (related to a disease)
  -ite: une maladie inflammatoire (an inflammatory disease)

SLIDE 16

Frequencies of bases

[Figure: normalized frequency of bases by discourse (non-expert, didactic, expert) for diabetes, cardiology and pneumology]

Frequencies of bases within morphological families:
  didactic > non-expert > expert
  important differences:
    didactic vs expert, didactic vs non-expert

SLIDE 17

Conclusion

Two experiments carried out:
  Productivity within morphological families
  Frequencies of lexemes within corpora

Detection of characteristics on the morpho-semantic level
  cross-domain properties

Didactic discourse has several specificities:
  documents created for teaching purposes:
    productivity of morphological families
    frequencies of lexemes

Distinction between expert and non-expert documents:
  Less salient

                                                     expert   non-expert
  morphological families / diversified vocabulary      +          -
  frequencies in corpus / redundant vocabulary         -          +

SLIDE 18

Perspectives

Analysis of the complexity of medical lexemes
Analysis of the distribution of lexemes
Exploitation of expert medical words and of their paraphrases
  go beyond the morphology
  take into account more semantics
  phlébectasie (phlebectasia) – dilatation veineuse (venous dilatation) – varices (varicose veins)
Inter-lingual comparison (Kokkinakis & Gronostaj, 2006)
Exploitation of morphological criteria for the automatic categorization of documents
