
Ontology learning for Information Extraction in genomics – the Caderige Project

Ontology learning for Information Extraction in genomics – the Caderige Project. Philippe Bessières (MIG-INRA, Jouy-en-Josas), Gilles Bisson (Leibniz – IMAG, CNRS Grenoble), Adeline Nazarenko (LIPN – Université Paris-Nord & CNRS)


  1. Ontology learning for Information Extraction in genomics – the Caderige Project
     - Philippe Bessières, MIG-INRA, Jouy-en-Josas, philb@biotec.jouy.inria.fr
     - Gilles Bisson, Leibniz – IMAG, CNRS Grenoble, gilles.bisson@imag.fr
     - Adeline Nazarenko, LIPN – Université Paris-Nord & CNRS, nazarenko@lipn.univ-paris13.fr
     - Mohammed Ould Abdel Vetah, LRI & Valigen, Mohammed.Ould-Abdel-Vetah@lri.fr
     - Claire Nédellec, LRI, Université Paris-Sud & CNRS, cn@lri.fr
     - Thierry Poibeau, Thalès Group, thierry.poibeau@thalesgroup.com

  2. Outline
     1. Overall approach: from scientific abstracts to gene interaction database
     2. A knowledge-based extraction method
     3. Building classes for semantic tagging
     4. Learning extraction rules
     5. Towards a conceptual representation of texts

  3. An Information Extraction problem
     Functional genomics: gene interaction discovery
     - Experimental approaches (sequencing, functional analysis)
     - Information Extraction in genomics literature
     Examples of bibliography databases:
     DB                MedLine                    FlyBase
     Size              > 16 million references    > 9500 genes recorded
     Abstract length   10 sentences               2-3 sentences

  4. Example: a MedLine abstract
     AB - GerE is a transcription factor produced in the mother cell compartment of sporulating Bacillus subtilis. It is a critical regulator of cot genes encoding proteins that form the spore coat late in development. Most cot genes, and the gerE gene, are transcribed by sigmaK RNA polymerase. Previously, it was shown that the GerE protein inhibits transcription in vitro of the sigK gene encoding sigmaK. Here, we show that GerE binds near the sigK transcriptional start site, to act as a repressor. A sigK-lacZ fusion containing the GerE-binding site in the promoter region was expressed at a 2-fold lower level during sporulation of wild-type cells than gerE mutant cells. Likewise, the level of SigK protein (i.e. pro-sigmaK and sigmaK) was lower in sporulating wild-type cells than in a gerE mutant. These results demonstrate that sigmaK-dependent transcription of gerE initiates a negative feedback loop in which GerE acts as a repressor to limit production of sigmaK. In addition, GerE directly represses transcription of particular cot genes. We show that GerE binds to two sites that span the -35 region of the cotD promoter. A low level of GerE activated transcription of cotD by sigmaK RNA polymerase in vitro, but a higher level of GerE repressed cotD transcription. The upstream GerE-binding site was required for activation but not for repression. These results suggest that a rising level of GerE in sporulating cells may first activate cotD transcription from the upstream site then repress transcription as the downstream site becomes occupied. Negative regulation by GerE, in addition to its positive effects on transcription, presumably ensures [..]

  5. Example of information extracted from a text fragment
     Fragment from a Medline abstract:
       the GerE protein inhibits transcription in vitro of the sigK gene encoding sigmaK
     Filled form:
       Interaction
         Type: negative
         Agent: GerE protein
         Target: Expression
           Source: gene sigK
           Product: protein sigmaK
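
The filled form above can be read as a nested record. The following is a minimal sketch in Python, assuming that Source and Product belong to the Expression target; the class and field names (`Interaction`, `Expression`, `interaction_type`) are illustrative and are not taken from the Caderige system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    """Target of the interaction: a gene expression event (hypothetical type)."""
    source: str   # the gene being expressed
    product: str  # the protein it encodes


@dataclass(frozen=True)
class Interaction:
    """One filled extraction form (hypothetical type)."""
    interaction_type: str  # "negative" in the slide's example
    agent: str
    target: Expression


# The form from the slide, filled from the fragment
# "the GerE protein inhibits transcription in vitro of the sigK gene encoding sigmaK".
form = Interaction(
    interaction_type="negative",
    agent="GerE protein",
    target=Expression(source="gene sigK", product="protein sigmaK"),
)
```

Nesting the target keeps the gene and its product together, so the same `Expression` record could be reused wherever a form refers to that expression event.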

  6. Information Extraction in Genomics
     [Diagram: database in biology (MedLine, FlyBase) → Information Retrieval (keyword query) → potentially relevant abstracts → Fragment Selection → potentially relevant fragments → Information Extraction (NL query / template)]

  7. Overall approach
     Information is scattered (around 3% of the abstract sentences are relevant for the discovery of gene interactions), so a full-text analysis is too costly.
     A two-step approach: “selection first, then extraction”
     • Relevant fragment selection: fast and robust processing based on surface clues and keywords
     • Knowledge extraction: apply extraction rules to “normalized” texts
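
The selection step can be illustrated with a small sketch. It assumes a sentence is kept when it mentions at least two names from a gene/protein lexicon plus one interaction cue word; the lexicon, the cue list, and the function names are hypothetical, and the slides do not specify the actual surface clues used.

```python
import re

# Hypothetical lexicons; a real system would load much larger resources.
GENE_NAMES = {"gere", "sigk", "sigmak", "cotd"}
INTERACTION_CUES = {"inhibits", "represses", "activates", "stimulates", "binds"}


def tokens(sentence):
    """Lower-cased alphanumeric tokens of a sentence."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def is_relevant(sentence):
    """Surface test: at least two known names and one interaction cue."""
    toks = set(tokens(sentence))
    return len(toks & GENE_NAMES) >= 2 and bool(toks & INTERACTION_CUES)


def select_fragments(abstract):
    """Split an abstract into sentences and keep the relevant ones."""
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [s for s in sentences if is_relevant(s)]


abstract = (
    "GerE is a transcription factor produced in the mother cell compartment "
    "of sporulating Bacillus subtilis. Previously, it was shown that the GerE "
    "protein inhibits transcription in vitro of the sigK gene encoding sigmaK."
)
selected = select_fragments(abstract)
```

Only the second sentence passes the filter here; the costlier extraction rules would then run on that fragment alone.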

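  8. Limitations of keyword-based approaches (1)
     Identifying the presence of an interaction between 2 genes using word weights
     • 80% recall and precision for sentences including 2 gene names
     • Little information is extracted (classification-based approach)
     $$\mathrm{Recall}(\mathrm{Class}_i) = \frac{|\{\mathit{Ex} \in \mathrm{Class}_i \text{ and classified in } \mathrm{Class}_i\}|}{|\{\mathit{Ex} \in \mathrm{Class}_i\}|}$$
     $$\mathrm{Precision}(\mathrm{Class}_i) = \frac{|\{\mathit{Ex} \in \mathrm{Class}_i \text{ and classified in } \mathrm{Class}_i\}|}{|\{\mathit{Ex} \text{ classified in } \mathrm{Class}_i\}|}$$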
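
As a worked illustration of the two measures, the sketch below computes per-class recall and precision from a list of true labels and a list of predicted labels. The function name, the label values, and the toy data are assumptions made for the example.

```python
def recall_precision(gold, predicted, cls):
    """Recall and precision of class `cls`.

    gold[k] is the true class of example k, predicted[k] the class
    assigned by the classifier.
    """
    # Examples that belong to cls and were classified as cls.
    correct = sum(1 for g, p in zip(gold, predicted) if g == cls and p == cls)
    in_class = sum(1 for g in gold if g == cls)         # Ex in Class_i
    classified = sum(1 for p in predicted if p == cls)  # Ex classified in Class_i
    recall = correct / in_class if in_class else 0.0
    precision = correct / classified if classified else 0.0
    return recall, precision


# Toy data: 5 sentences labelled as describing an interaction or not.
gold = ["interaction", "interaction", "none", "interaction", "none"]
predicted = ["interaction", "none", "interaction", "interaction", "none"]
recall, precision = recall_precision(gold, predicted, "interaction")
```

In this toy run, 2 of the 3 true interaction sentences are found and 2 of the 3 sentences labelled as interactions are correct, so both measures equal 2/3.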

  9. Limitations of keyword-based approaches (2)
     Identifying interaction triples (gene name/protein, interaction verb, gene name/protein): more information, but low precision
     Example: GerE stimulates cotD transcription and […] cotA transcription […], and, unexpectedly, inhibits […] transcription of the gene (sigK) […]
     Constraint on the number of words between the elements of the triple:
     • Distance ≤ 5 words: good precision but low recall
     • Distance > 5 words: lower precision
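
The effect of the distance constraint can be reproduced with a small sketch. It assumes that "distance" means the number of words strictly between two consecutive elements of the triple; the name and verb lexicons are hypothetical, and the test sentence is a simplified one modeled on the slide's example, not a quotation.

```python
import re

# Hypothetical lexicons.
NAMES = {"GerE", "cotD", "sigK"}
VERBS = {"stimulates", "inhibits", "represses", "activates"}


def extract_triples(sentence, max_distance=5):
    """Return (name, verb, name) triples whose gaps are within max_distance."""
    words = re.findall(r"[A-Za-z0-9]+", sentence)
    name_pos = [i for i, w in enumerate(words) if w in NAMES]
    verb_pos = [i for i, w in enumerate(words) if w in VERBS]
    triples = []
    for v in verb_pos:
        for a in name_pos:
            for t in name_pos:
                # First name before the verb, second name after it,
                # with at most max_distance words in each gap.
                if a < v < t and v - a - 1 <= max_distance and t - v - 1 <= max_distance:
                    triples.append((words[a], words[v], words[t]))
    return triples


sentence = ("GerE stimulates cotD transcription and, unexpectedly, "
            "inhibits in vitro transcription of the gene sigK")
narrow = extract_triples(sentence, max_distance=5)
wide = extract_triples(sentence, max_distance=8)
```

With the narrow window only (GerE, stimulates, cotD) is returned, so the inhibition of sigK is missed. With the wider window (GerE, inhibits, sigK) is found, but the spurious (cotD, inhibits, sigK) is returned as well.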

  10. Combining different levels of textual analysis
      For good precision and high recall, extraction rules should include conditions on different levels of textual analysis.
      1. Sentence processing: parsing and semantic tagging lead to an enriched and normalized text representation
         [Figure: the fragment "[cspAp] [directs] [the expression of the cspA gene]" annotated with semantic categories (Protein, Positive_interaction, Production, Gene), syntactic categories (Noun, Verb, Det, Prep; GP, NP, NprepN), and syntactic relations (Subject, Direct object)]
      2. Application of extraction rules (automata) on the resulting interpretation

  11. Automata examples: protein identification
      The automata use the syntactic and semantic information from the parsing phase to recognize interactions.
      [Figure: automata PROTEIN (Semantic Class: Protein) and GENE EXPRESSION ("expression", Prep "of", Semantic Class: Gene; constraints NprepN($1, $2) and NP($3, $4); output tag <Gene_expression>)]

  12. Automaton example: interaction identification and mark up
      [Figure: automaton POSITIVE INTERACTION. It matches [NP (Semantic Class: <Protein>)] as $1, [Verb (Semantic Class: Positive interaction)] as $2, and [NP (<Gene_expression>)] as $3, under the constraints Subject($2, $1) and Dobj($2, $3); the match is marked up with <interaction>, <protein>, and <gene_expression> tags]
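
A rule of this kind can be approximated as a check over tagged chunks and dependency relations. The sketch below is only an illustration: the `Chunk` type, the relation encoding `(name, head_index, dependent_index)`, the class labels, and the output markup are assumptions, and the actual Caderige automata formalism is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    category: str        # syntactic category, e.g. "NP" or "Verb"
    semantic_class: str  # e.g. "Protein", "Positive_interaction", "Gene_expression"


def match_positive_interaction(chunks, relations):
    """Apply the rule Subject($2, $1), Dobj($2, $3) with semantic conditions.

    relations is a set of (name, head_index, dependent_index) tuples.
    """
    subjects = {h: d for name, h, d in relations if name == "Subject"}
    objects = {h: d for name, h, d in relations if name == "Dobj"}
    marked = []
    for v in sorted(subjects.keys() & objects.keys()):
        subj, verb, obj = chunks[subjects[v]], chunks[v], chunks[objects[v]]
        if (subj.category == "NP" and subj.semantic_class == "Protein"
                and verb.category == "Verb"
                and verb.semantic_class == "Positive_interaction"
                and obj.category == "NP"
                and obj.semantic_class == "Gene_expression"):
            marked.append(
                f"<interaction><protein>{subj.text}</protein> {verb.text} "
                f"<gene_expression>{obj.text}</gene_expression></interaction>"
            )
    return marked


# The fragment used in the slides, with tags assumed to come from earlier steps.
chunks = [
    Chunk("cspAp", "NP", "Protein"),
    Chunk("directs", "Verb", "Positive_interaction"),
    Chunk("the expression of the cspA gene", "NP", "Gene_expression"),
]
relations = {("Subject", 1, 0), ("Dobj", 1, 2)}
result = match_positive_interaction(chunks, relations)
```

Because the conditions refer to syntactic relations rather than word positions, the rule does not depend on how many words separate the verb from its arguments.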

  13. Syntactic and semantic knowledge needed
      [Figure: the annotated fragment "[cspAp] [directs] [the expression of the cspA gene]", showing semantic categories, syntactic categories, and syntactic relations]
      Types of knowledge needed and how to get it:
      Tools exist:
      • Syntactic categories (parts of speech): morphosyntactic taggers
      • Syntactic relations (dependencies): syntactic parsers (SP XRCE)
      Knowledge can be learned from the corpus:
      • Semantic categories (conceptual hierarchies)
      • Extraction rules
      • Predicate schemata

  14. Architecture of Caderige
      [Diagram. Recoverable labels: Document collection (Medline, Flybase, etc.); Tagging; Relevant fragment selection; Syntactic parsing; Machine Learning; Domain knowledge (Lexicon, Thesauri, Extraction rules); Semantic labeling; Semantic analysis; Conceptual representation; Query / Extraction template; Pattern matching; Extraction; Storage; answer to the query / filled template]

  15. Knowledge learning and exploitation (Information Extraction task)
      [Diagram. Recoverable labels: Learning step (Corpus, Machine Learning, Knowledge Base); Exploitation step (Queries, Application, Knowledge Extraction, Knowledge Base, Document library)]

  16. Learning conceptual hierarchies for semantic tagging
      [Figure: an is_a hierarchy over the classes Cell_cycle, Growth, Sporulation, Differentiation, Devt, DNA sequence, Promoter, Gene, Dfd, bicD, Protein, Enzyme, Hemoglobin]
      Hierarchies of semantic classes can be learned if the following conditions are satisfied:
      • from a homogeneous corpus, written in a specialized language
      • using a robust parser
      • with the help of an expert (or user)

  17. Classical approaches to word classes building
      Harris’ assumption of distributional semantics: the semantics is reflected by the syntax in specific domain corpora. Some semantics can be learned by observing syntactic regularities.
      • The classes are based on the semantic proximity between words
      • The similarity measure of two words is based on the number of contexts they share in the training corpus
      • Traditional context definitions: word co-occurrences within a window or in a document; co-occurrences of words in a syntactic dependency relation
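
The window-based context definition can be sketched as follows. The context of a word is taken to be the set of words found within a fixed window around its occurrences, and the similarity of two words is the size of the intersection of their context sets. The function names, the window size, and the toy token sequence are assumptions made for the example.

```python
from collections import defaultdict


def window_contexts(tokens, window=2):
    """Map each word to the set of words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        contexts[word].update(tokens[j] for j in range(lo, hi) if j != i)
    return contexts


def similarity(contexts, w1, w2):
    """Number of contexts shared by w1 and w2."""
    return len(contexts.get(w1, set()) & contexts.get(w2, set()))


# Toy token sequence built from the phrases used in the slides.
tokens = "expression of spoIIIG transcription of comG".split()
contexts = window_contexts(tokens, window=2)
shared = similarity(contexts, "expression", "transcription")
```

Window contexts need no parser, but they mix function words and unrelated neighbours into the context sets, which is one motivation for using syntactic dependencies instead.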

  18. Similarity based on the syntactic context
      • Parsing gives syntactic relations between the predicates (verb/noun) and their arguments
      • Syntactic dependencies are represented as triplets (predicate, relation, argument)
      • These triplets are the learning examples
      [Figure: the fragment "[cspAp] [directs] [the expression of the cspA gene]" with the relations Subject, Direct object, NN, and NprepN]
      Patterns: Expression NprepN (of) N; Transcription NprepN (of) N
      Examples:
      [ Expression ] [ of spoIIIG ].
      [ Transcription ] [ SpoIIIG ].
      [ Transcription ] [ comG ].
      [ Expression ] [ of ykuD ].
      [ Transcription ] [ ydhD ].
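
The triplet representation lends itself to a direct sketch of the similarity computation. Below, each argument is described by the set of (predicate, relation) pairs it occurs with, and two arguments are compared by counting the pairs they share. The relation label `NprepN_of`, the lower-casing of predicates, the normalization of "SpoIIIG" to "spoIIIG", and the assumption that all five examples use the "of" relation are choices made for this illustration.

```python
from collections import defaultdict

# Learning examples as (predicate, relation, argument) triplets,
# transcribed from the slide under the assumptions stated above.
triplets = [
    ("expression", "NprepN_of", "spoIIIG"),
    ("transcription", "NprepN_of", "spoIIIG"),
    ("transcription", "NprepN_of", "comG"),
    ("expression", "NprepN_of", "ykuD"),
    ("transcription", "NprepN_of", "ydhD"),
]


def contexts_by_argument(triplets):
    """Map each argument to the set of (predicate, relation) pairs it fills."""
    contexts = defaultdict(set)
    for predicate, relation, argument in triplets:
        contexts[argument].add((predicate, relation))
    return contexts


def common_contexts(contexts, w1, w2):
    """Number of syntactic contexts shared by two arguments."""
    return len(contexts.get(w1, set()) & contexts.get(w2, set()))


contexts = contexts_by_argument(triplets)
```

Here spoIIIG shares one context with comG (both are arguments of "transcription") and one with ykuD (both are arguments of "expression"), while comG and ykuD share none. Counts of this kind are what a clustering step would use to group words into candidate semantic classes.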
