genomics the Caderige Project Philippe Bessires Gilles Bisson - PowerPoint PPT Presentation

Ontology learning for Information Extraction in genomics - the Caderige Project
Gilles Bisson, Leibniz - IMAG, CNRS Grenoble, gilles.bisson@imag.fr
Adeline Nazarenko, LIPN - Université Paris-Nord & CNRS, nazarenko@lipn.univ-paris13.fr
Philippe Bessières, MIG - INRA, Jouy-en-Josas, philb@biotec.jouy.inria.fr
Mohammed Ould Abdel Vetah, LRI & Valigen, Mohammed.Ould-Abdel-Vetah@lri.fr
Claire Nédellec, LRI, Université Paris-Sud & CNRS, cn@lri.fr
Thierry Poibeau, Thalès Group, thierry.poibeau@thalesgroup.com

Outline
1. Overall approach: from scientific abstracts to gene interaction database
2. A knowledge-based extraction method
3. Building classes for semantic tagging
4. Learning extraction rules
5. Towards a conceptual representation of texts

An Information Extraction problem
Functional Genomics: gene interaction discovery
- Experimental approaches (sequencing, functional analysis)
- Information Extraction in Genomics literature

Examples of bibliography databases:
                    MedLine               FlyBase
DB size             > 16 million refs.    > 9500 genes recorded
Abstract length     10 sentences          2-3 sentences

Example: a MedLine abstract
AB - GerE is a transcription factor produced in the mother cell compartment of sporulating Bacillus subtilis. It is a critical regulator of cot genes encoding proteins that form the spore coat late in development. Most cot genes, and the gerE gene, are transcribed by sigmaK RNA polymerase. Previously, it was shown that the GerE protein inhibits transcription in vitro of the sigK gene encoding sigmaK. Here, we show that GerE binds near the sigK transcriptional start site, to act as a repressor. A sigK-lacZ fusion containing the GerE-binding site in the promoter region was expressed at a 2-fold lower level during sporulation of wild-type cells than gerE mutant cells. Likewise, the level of SigK protein (i.e. pro-sigmaK and sigmaK) was lower in sporulating wild-type cells than in a gerE mutant. These results demonstrate that sigmaK-dependent transcription of gerE initiates a negative feedback loop in which GerE acts as a repressor to limit production of sigmaK. In addition, GerE directly represses transcription of particular cot genes. We show that GerE binds to two sites that span the -35 region of the cotD promoter. A low level of GerE activated transcription of cotD by sigmaK RNA polymerase in vitro, but a higher level of GerE repressed cotD transcription. The upstream GerE-binding site was required for activation but not for repression. These results suggest that a rising level of GerE in sporulating cells may first activate cotD transcription from the upstream site then repress transcription as the downstream site becomes occupied. Negative regulation by GerE, in addition to its positive effects on transcription, presumably ensures [..]

Example of information extracted from a text fragment
Fragment from a Medline abstract:
"the GerE protein inhibits transcription in vitro of the sigK gene encoding sigmaK"
Filled form:
Interaction
  Type: negative
  Agent: GerE protein
  Target: Expression
    Source: gene sigK
    Product: protein sigmaK
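The filled form above can be pictured as a small nested record. The following is only an illustrative sketch: the class and field names (`Interaction`, `Expression`, `interaction_type`, and so on) are assumptions chosen to mirror the slots on the slide, not the project's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    """Target of the interaction: the expression of a gene into a product."""
    source: str   # the gene being expressed
    product: str  # the protein it encodes


@dataclass(frozen=True)
class Interaction:
    """One filled extraction template (hypothetical field names)."""
    interaction_type: str  # "negative" or "positive"
    agent: str
    target: Expression


# The form from the slide, filled from the GerE / sigK fragment.
form = Interaction(
    interaction_type="negative",
    agent="GerE protein",
    target=Expression(source="gene sigK", product="protein sigmaK"),
)
```

Nesting the target keeps the "Source" and "Product" slots attached to the expression event rather than to the interaction itself, which matches the indentation of the form.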

Information Extraction in Genomics
[Figure: processing chain. Database in biology (MedLine, FlyBase) + keyword query -> Information Retrieval -> potentially relevant abstracts -> Fragment Selection -> potentially relevant fragments -> Information Extraction, driven by an NL query / template]

Overall approach
As information is scattered (around 3 % of the abstract sentences are relevant for the discovery of gene interactions), a full text analysis is too costly.
A two-step approach: “selection first, then extraction”
• Relevant fragment selection: a fast and robust processing based on surface clues and keywords
• Knowledge extraction: apply extraction rules on “normalized” texts
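A minimal sketch of the first step, fragment selection from surface clues, might look as follows. The gene lexicon, the cue-word list and the "two gene names plus one cue word" criterion are all assumptions made for illustration; the slides do not specify the actual clues used.

```python
import re

# Hypothetical toy lexicons; a real system would use domain resources.
GENE_NAMES = {"gere", "sigk", "cotd"}
INTERACTION_CUES = {"inhibits", "represses", "activates", "binds"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by spaces."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def is_relevant(sentence):
    """Keep a sentence if it names at least two genes and one cue word."""
    tokens = {t.lower() for t in re.findall(r"[A-Za-z0-9]+", sentence)}
    return len(tokens & GENE_NAMES) >= 2 and bool(tokens & INTERACTION_CUES)


def select_fragments(abstract):
    """Step 1 of 'selection first, then extraction': a cheap sentence filter."""
    return [s for s in split_sentences(abstract) if is_relevant(s)]


abstract = (
    "GerE is a transcription factor produced in the mother cell. "
    "The GerE protein inhibits transcription in vitro of the sigK gene. "
    "Most cot genes are transcribed by sigmaK RNA polymerase."
)
selected = select_fragments(abstract)
# selected == ["The GerE protein inhibits transcription in vitro of the sigK gene."]
```

Only the selected sentences would then be passed to the more expensive parsing and rule-based extraction step.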

Limitations of keyword-based approaches (1)
Identifying the presence of an interaction between 2 genes using word weights
• 80 % recall and precision for sentences including 2 gene names
• Little information is extracted (classification-based approach)

\[
\mathrm{Recall}(\mathrm{Class}_i) = \frac{|\{Ex \in \mathrm{Class}_i \text{ and classified in } \mathrm{Class}_i\}|}{|\{Ex \in \mathrm{Class}_i\}|}
\]
\[
\mathrm{Precision}(\mathrm{Class}_i) = \frac{|\{Ex \in \mathrm{Class}_i \text{ and classified in } \mathrm{Class}_i\}|}{|\{Ex \text{ classified in } \mathrm{Class}_i\}|}
\]
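The two measures can be computed directly from gold labels and classifier output. This is a small sketch on made-up labels (the five examples below are invented for illustration and are not results from the project).

```python
def precision_recall(gold, predicted, cls):
    """Precision and recall of class `cls`, following the definitions above."""
    pairs = list(zip(gold, predicted))
    # Examples that belong to the class AND were classified in it.
    true_pos = sum(1 for g, p in pairs if g == cls and p == cls)
    in_class = sum(1 for g, _ in pairs if g == cls)      # Ex in Class_i
    classified = sum(1 for _, p in pairs if p == cls)    # Ex classified in Class_i
    precision = true_pos / classified if classified else 0.0
    recall = true_pos / in_class if in_class else 0.0
    return precision, recall


# Invented toy labels for five sentences.
gold = ["interaction", "interaction", "none", "none", "interaction"]
predicted = ["interaction", "none", "interaction", "none", "interaction"]

precision, recall = precision_recall(gold, predicted, "interaction")
# 2 correct out of 3 predicted, and 2 found out of 3 actual: both are 2/3.
```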

Limitations of keyword-based approaches (2)
Identifying interaction triples (gene name/protein, interaction verb, gene name/protein): more information, but low precision
Example: "GerE stimulates cotD transcription and […] cotA transcription […], and, unexpectedly, inhibits […] transcription of the gene (sigK) […]"
Constraint on the number of words between the elements of the triple:
• Distance ≤ 5 words: good precision but low recall
• Distance > 5 words: lower precision
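The trade-off can be illustrated with a toy triple finder. This is a sketch under stated assumptions: the lexicons are hypothetical, the distance is taken as the number of words between consecutive elements of the triple, and the example sentence is a simplified version of the one on the slide.

```python
import re

# Hypothetical toy lexicons, for illustration only.
GENES = {"GerE", "cotD", "cotA", "sigK"}
VERBS = {"stimulates", "inhibits", "represses"}


def find_triples(sentence, max_distance=5):
    """Return (gene, verb, gene) triples whose consecutive elements are
    separated by at most `max_distance` intervening words."""
    tokens = re.findall(r"\w+", sentence)
    triples = []
    for i, agent in enumerate(tokens):
        if agent not in GENES:
            continue
        for j in range(i + 1, len(tokens)):
            if j - i - 1 > max_distance:
                break  # verb would be too far from the agent
            if tokens[j] not in VERBS:
                continue
            for k in range(j + 1, len(tokens)):
                if k - j - 1 > max_distance:
                    break  # target would be too far from the verb
                if tokens[k] in GENES:
                    triples.append((agent, tokens[j], tokens[k]))
    return triples


sentence = ("GerE stimulates cotD transcription and unexpectedly "
            "inhibits transcription of the gene sigK")

loose = find_triples(sentence, max_distance=5)
tight = find_triples(sentence, max_distance=2)
```

With the tight window only `("GerE", "stimulates", "cotD")` is found, so the inhibition of sigK is missed. With the loose window that interaction is recovered, but the spurious triple `("cotD", "inhibits", "sigK")` appears as well.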

Combining different levels of textual analysis
For good precision and high recall, extraction rules should include conditions on different levels of textual analysis.
1. Sentence processing: parsing and semantic tagging lead to an enriched and normalized text representation
[Figure: the fragment "[cspAp] [directs] [the expression of the cspA gene]" annotated with semantic categories (Gene, Protein, Production, Positive_interaction), syntactic categories (Det, Noun, Prep, Verb, NP, NprepN) and syntactic relations (Subject, Direct object)]
2. Application of extraction rules (automata) on the resulting interpretation

Automata examples: protein identification
The automata use the syntactic and semantic information from the parsing phase to recognize interactions.
[Figure: two automata. PROTEIN: matches a word whose semantic class is Protein. GENE EXPRESSION: matches NprepN($1, $2) where $1 is "expression", followed by Prep "of" and NP($3, $4) whose semantic class is Gene; the match is marked up as <Gene_expression>]

Automaton example: interaction identification and mark up
[Figure: POSITIVE INTERACTION automaton. Conditions: Subject($2, $1) and Dobj($2, $3), where $1 is an NP marked <protein>, $2 is a Verb whose semantic class is Positive interaction, and $3 is an NP marked <gene_expression>. The matched span is wrapped in <interaction> ... </interaction>]
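The conditions of the positive-interaction automaton can be approximated by a short rule over chunks and dependency relations. This is a rough sketch, not the project's automaton formalism: the `Chunk` structure, the relation tuples and the semantic class assigned to each chunk are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    ident: int
    text: str
    category: str        # syntactic category, e.g. "NP" or "Verb"
    semantic_class: str  # e.g. "Protein", "Gene_expression"


def positive_interactions(chunks, relations):
    """Mark up spans where Subject(verb, s) and Dobj(verb, o) hold, the verb is
    a positive-interaction verb, s is a protein and o is a gene expression."""
    by_id = {c.ident: c for c in chunks}
    marked = []
    for verb in chunks:
        if verb.category != "Verb" or verb.semantic_class != "Positive_interaction":
            continue
        subjects = [by_id[arg] for rel, head, arg in relations
                    if rel == "Subject" and head == verb.ident]
        objects = [by_id[arg] for rel, head, arg in relations
                   if rel == "Dobj" and head == verb.ident]
        for s in subjects:
            for o in objects:
                if s.semantic_class == "Protein" and o.semantic_class == "Gene_expression":
                    marked.append(
                        f"<interaction><protein>{s.text}</protein> {verb.text} "
                        f"<gene_expression>{o.text}</gene_expression></interaction>"
                    )
    return marked


# Assumed analysis of the fragment used on the slides.
chunks = [
    Chunk(1, "cspAp", "NP", "Protein"),
    Chunk(2, "directs", "Verb", "Positive_interaction"),
    Chunk(3, "the expression of the cspA gene", "NP", "Gene_expression"),
]
relations = [("Subject", 2, 1), ("Dobj", 2, 3)]  # (relation, head id, argument id)

markup = positive_interactions(chunks, relations)
```

Because the rule tests syntactic relations rather than word distance, the subject and object may be arbitrarily far from the verb in the surface sentence.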

Syntactic and semantic knowledge needed
[Figure: the fragment "[cspAp] [directs] [the expression of the cspA gene]" with its semantic categories, syntactic categories and syntactic relations]

Types of knowledge needed                         How to get it
Syntactic categories (parts of speech)            Tools exist: morphosyntactic taggers
Syntactic relations (dependencies)                Tools exist: syntactic parsers (SP XRCE)
Semantic categories (conceptual hierarchies)      Knowledge can be learned from the corpus
Extraction rules                                  Knowledge can be learned from the corpus
Predicate schemata                                Knowledge can be learned from the corpus

Architecture of Caderige
[Figure: architecture diagram. Components: Document collection (Medline, Flybase, etc.); Relevant fragment selection; Tagging; Syntactic parsing; Semantic analysis; Conceptual representation; Machine Learning; Domain knowledge (Lexicon, Thesauri, Semantic labeling, Extraction rules); Query / Extraction template; Pattern matching; Extraction; Storage; Answer to the query / filled template]

Knowledge learning and exploitation (Information Extraction task)
[Figure: two phases. Learning step: Corpus -> Machine Learning -> Knowledge Base. Exploitation step: Queries, Application, Document library -> Knowledge Extraction, using the Knowledge Base]

Learning conceptual hierarchies for semantic tagging
[Figure: example is_a hierarchy of semantic classes. Recoverable node labels: Cell_cycle, DNA sequence, Protein, Growth, Sporulation, Promoter, Differenciation, Gene, Devt, Enzym, Dfd, Hemoglobin, bicD, 1.28]
Hierarchies of semantic classes can be learned if the following conditions are satisfied:
• the corpus is homogeneous and written in a specialized language
• a robust parser is used
• an expert (or user) helps

Classical approaches to word class building
Harris’ assumption of distributional semantics: in specific domain corpora, the semantics is reflected by the syntax. Some semantics can therefore be learned by observing syntactic regularities.
• The classes are based on the semantic proximity between words
• The similarity measure of two words is based on the number of contexts they have in common in the training corpus
• Traditional context definitions: word co-occurrences within a window or in a document; co-occurrences of words in a relation of syntactic dependency
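As a rough illustration of the window-based context definition, the sketch below counts the contexts two words share. The window size, the tiny tokenized corpus and the raw-count similarity are assumptions for the example; real systems typically weight or normalize these counts.

```python
from collections import defaultdict


def window_contexts(sentences, window=2):
    """Map each word to the set of words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            contexts[word].update(tokens[lo:i] + tokens[i + 1:hi])
    return contexts


def common_contexts(contexts, w1, w2):
    """Similarity as the number of contexts the two words have in common."""
    return len(contexts[w1] & contexts[w2])


# Toy tokenized corpus (invented for illustration).
corpus = [
    ["expression", "of", "spoIIIG"],
    ["transcription", "of", "comG"],
    ["expression", "of", "ykuD"],
]
contexts = window_contexts(corpus)

same_predicate = common_contexts(contexts, "spoIIIG", "ykuD")   # share "expression" and "of"
other_predicate = common_contexts(contexts, "spoIIIG", "comG")  # share only "of"
```

Note how the function word "of" contributes a shared context in both cases; this kind of noise is one motivation for using syntactic dependencies instead of plain windows.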

Similarity based on the syntactic context
• Parsing gives syntactic relations between the predicates (verb/noun) and their arguments
• Syntactic dependencies are represented as triplets (predicate, relation, argument)
• These triplets are the learning examples
[Figure: parse of "[cspAp] [directs] [the expression of the cspA gene]" with the relations Subject, Direct object and NprepN]
Patterns: Expression NprepN (of) N; Transcription NprepN (of) N
Examples:
[Expression] [of spoIIIG].
[Transcription] [SpoIIIG].
[Transcription] [comG].
[Expression] [of ykuD].
[Transcription] [ydhD].
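The triplet representation lends itself to a simple grouping of arguments by the syntactic contexts they occur in. The sketch below is an assumption-laden illustration: the triplets are transcribed from the examples above with names lowercased or normalized for matching, all are assumed to use the NprepN(of) relation, and similarity is again a raw count of shared contexts.

```python
from collections import defaultdict

# (predicate, relation, argument) triplets, assumed from the slide's examples.
triplets = [
    ("expression", "NprepN(of)", "spoIIIG"),
    ("transcription", "NprepN(of)", "spoIIIG"),
    ("transcription", "NprepN(of)", "comG"),
    ("expression", "NprepN(of)", "ykuD"),
    ("transcription", "NprepN(of)", "ydhD"),
]

# Syntactic context of an argument = the (predicate, relation) pairs it fills.
contexts_of = defaultdict(set)
# Candidate class for a context = the arguments observed in it.
arguments_of = defaultdict(set)
for predicate, relation, argument in triplets:
    contexts_of[argument].add((predicate, relation))
    arguments_of[(predicate, relation)].add(argument)


def similarity(a, b):
    """Number of syntactic contexts shared by two arguments."""
    return len(contexts_of[a] & contexts_of[b])


transcribed = arguments_of[("transcription", "NprepN(of)")]
# transcribed == {"spoIIIG", "comG", "ydhD"}: a candidate semantic class
```

Here `similarity("comG", "ydhD")` is 1 and `similarity("comG", "ykuD")` is 0, while spoIIIG links the two groups because it occurs with both predicates.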
