An Ontology-Based Approach for Facilitating Information Retrieval


slide-1
SLIDE 1

An Ontology-Based Approach for Facilitating Information Retrieval from Disparate Sources: Patent System as an Exemplar

Kincho H. Law
Professor of Civil and Environmental Engineering
Engineering Informatics Group, Stanford University

Collaborators:
  • Jay P. Kesan, Professor, College of Law, UIUC
  • Siddharth Taduri (Former Student), Stanford University
  • Gloria Lau, Consulting Assoc. Professor, Stanford University

Ontology Summit, March 10, 2016

Ref: S. Taduri, Information Retrieval Across Multiple Information Sources Using Knowledge-Based Approach, Engineering Degree Thesis, Stanford University, March 2012.

slide-2
SLIDE 2

Motivation

  • Patents: Can we obtain all relevant information (validity, enforceability, and infringement) related to the patents in a particular sector, category, or market segment, and analyze that information?
  • In the patent context:
  • What are the issued patents in a given space?
  • What is the legal scope of protection for the same or similar patents?
  • Who are the competitors?
  • Have any of the same or similar patents been challenged in court?
  • Is there any relevant scientific literature, or are there prior court decisions, laws, or regulations, that could be used to challenge and invalidate some patent claims?
  • Focus: Biomedical Patents
  • Other Similar Problems: integrating administrative agencies, courts, technical/scientific literature, and technical product literature in a host of law and science areas (pharmaceuticals, biofuels, ...)

slide-3
SLIDE 3

Problem Statement

  • Patent validity and infringement/enforcement questions involve analysis of documents in various domains: patents, USPTO file wrappers, court documents, scientific/technical publications, and technical product literature
  • Owned by disparate public (government) and private sectors
  • The information is often available online, but siloed into several diverse information sources
  • Today, the analysis is done manually and poorly by companies offering various patent research and strategy services

[Figure: siloed information sources: Issued Patents and Applications; Court Cases; File Wrappers; Technical Publications; Regulations and Laws]

slide-4
SLIDE 4

Use-Case: Erythropoietin (Repository)

  • Synthetic production of the hormone has made it possible to treat diseases such as anemia
  • Core patents: U.S. Patents 5,621,080, 5,756,349, 5,955,422, 5,547,933, 5,618,698
  • 135 directly related patents and over 3000 related publications
  • Around 30 court cases (patent litigation) involving major companies including Amgen, Hoechst Marion Roussel, Inc., and Transkaryotic Therapies, Inc.
  • Over 162,000 full-text scientific publications from 49 prominent journals in biomedicine from the TREC 2007 Genome Dataset (http://ir.ohsu.edu/genomics/2007protocol.html)
  • Comprehensive domain knowledge available
slide-5
SLIDE 5

Domain Terminology is Everywhere

Excerpt from U.S. Patent# 5,441,868
Title: Production of recombinant erythropoietin
Abstract: Disclosed are novel polypeptides possessing part or all of the primary structural conformation and one or more of the biological properties of mammalian erythropoietin ("EPO") which are characterized in preferred forms by being the product of procaryotic or eucaryotic host expression of an exogenous DNA sequence. Illustratively, genomic DNA, cDNA and manufactured DNA sequences coding for part or all of the sequence of amino acid residues of EPO or for analogs thereof are incorporated into autonomously replicating plasmid or viral vectors employed to transform or transfect suitable procaryotic or eucaryotic host cells such as bacteria, yeast or vertebrate cells in culture. Upon isolation from culture media or cellular lysates or fragments, products of expression of the …

Excerpt from scientific publication
Title: Regional variability in the incidence of end-stage renal disease: an epidemiological approach.
…. Regional variability in the incidence of end-stage renal disease (ESRD) in Austria is reported. Our aim was …. low rates in the state of Tyrol. …. ESRD incidence data were obtained from …. Between 1995 and 1999, 4811 new cases of ESRD were recorded; the state of Tyrol (T) …. incidence of ESRD patients with type 2 diabetes mellitus …. the difference in the overall ESRD incidence …. prevalence of DM, a highly significant correlation was found between ESRD incidence and DM. …. variability in the ESRD incidence in Austria is explained mainly by regional differences in DM-2. Data from similar studies …. allocation for ESRD ….

Excerpt from court case: Amgen, Inc. v. Chugai Pharm.
On June 30, 1987, the United States Patent and Trademark Office (PTO) issued to Dr. Rodney Hewick U.S. Patent 4,677,195, entitled "Method for the Purification of Erythropoietin and Erythropoietin Compositions" (the '195 patent). The patent claims both homogeneous EPO and compositions thereof and a method for purifying human EPO using reverse phase high performance liquid chromatography. The method claims are not before us.

slide-6
SLIDE 6

Problem Statement

  • Sources are diverse in structure, formats, semantics and syntax
  • How to retrieve patent information in a particular technological space? A knowledge-driven (ontology-based) approach
  • Knowledge of scientific/technical domain
  • Knowledge of patent system domain

[Figure: the siloed sources (Issued Patents and Applications; Court Cases; File Wrappers; Technical Publications; Regulations and Laws) for a specific technical domain are integrated through Knowledge Source 1: Patent System Ontology and Knowledge Source 2: Bio Ontology]

slide-7
SLIDE 7

Why Ontology?

  • An ontology is an explicit description of a domain:
  • concepts
  • properties and attributes of concepts
  • constraints on properties and attributes
  • An ontology defines
  • a common vocabulary
  • a shared understanding
slide-8
SLIDE 8

Domain (Bio) Ontologies

  • Bio Ontologies serve as standards for terminology in the biomedical (science) domain

(Ref: Bioportal.bioontology.org, accessed March 2012)

slide-9
SLIDE 9

Using Concept Hierarchy to Determine Relevancy

  • Direct term-based matching cannot relate the two documents
  • Bio-ontology reveals that EPO and erythropoietin are synonymous
  • Class hierarchy provides concepts (such as colony stimulating factor) useful for determining relevance between documents (with an appropriate weighting scheme)

[Figure: Bio Ontology fragment with Erythropoietin (synonym: EPO), Colony Stimulating Factor, and Hematopoietic Growth Factor. Doc 1: "… erythropoietin … colony stimulating factor …"; Doc 2: "… EPO … growth factor …". There is no direct similarity between the two documents; the super class concept is used for relevancy.]
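
The concept-hierarchy matching described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the tiny hierarchy is taken from the slide, while the halving of the weight per hierarchy level (DECAY), the function names, and the scoring formula are assumptions. Doc 2's "growth factor" is written out as "hematopoietic growth factor" so that it maps to a concept in the toy hierarchy.

```python
# Toy fragment of a bio-ontology, taken from the slide (illustration only).
SYNONYMS = {"epo": "erythropoietin"}
PARENT = {
    "erythropoietin": "colony stimulating factor",
    "colony stimulating factor": "hematopoietic growth factor",
}
DECAY = 0.5  # assumed weight reduction per step up the hierarchy


def concept_weights(terms):
    """Map each concept reachable from the terms to its best weight."""
    weights = {}
    for term in terms:
        concept = SYNONYMS.get(term.lower(), term.lower())
        weight = 1.0
        while concept is not None:
            weights[concept] = max(weights.get(concept, 0.0), weight)
            concept = PARENT.get(concept)  # None once the root is passed
            weight *= DECAY
    return weights


def relevance(terms_a, terms_b):
    """Sum, over shared concepts, the product of the two weights."""
    wa, wb = concept_weights(terms_a), concept_weights(terms_b)
    return sum(wa[c] * wb[c] for c in wa.keys() & wb.keys())


doc1 = ["erythropoietin", "colony stimulating factor"]
doc2 = ["EPO", "hematopoietic growth factor"]

# Plain term matching finds nothing in common between the two documents.
direct_overlap = {t.lower() for t in doc1} & {t.lower() for t in doc2}
print(len(direct_overlap))     # 0
print(relevance(doc1, doc2))   # 2.0
```

With plain term overlap the two documents share nothing, whereas mapping terms to concepts and walking up the hierarchy gives a non-zero score; how quickly the weight should decay is a tuning choice.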

slide-10
SLIDE 10

Expanded Query (with domain ontology)

Original Term: Erythropoietin
Synonyms: Erythropoietin, Recombinant Erythropoietin, erythropoietin receptor binding, Hematopoietin, Recombinant EPO, Erythrocyte Colony Stimulating Factor, Epoetin, EPO …
Children: Darbepoetin Alfa, Epoetin Alfa, Epoetin Beta …
Parents: Colony Stimulating Factors, cytokine receptor binding, recombinant hematopoietic growth factors …
Grand-Parents: hematopoietic growth factor, receptor binding, recombinant growth factor …

  • An appropriate ranking function is applied to balance the more general terms. Heuristically, we assign a higher weight to synonyms, and a lower weight as we traverse away from the concept node
  • Resulting Query: “original term” OR [synonyms]^weight OR [children]^weight OR ….
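
A minimal sketch of how such an expanded query string might be assembled, assuming a Lucene-style syntax in which `^` boosts a parenthesized group. The function name and the boost values (0.9, 0.6, 0.4) are hypothetical; the slide only says that synonyms are weighted higher and more distant concepts lower.

```python
def expand_query(term, synonyms=(), children=(), parents=(), boosts=None):
    """Build a boosted OR query in Lucene-style syntax from ontology neighbours."""
    # Assumed boosts: synonyms weigh most, more distant concepts weigh less.
    boosts = boosts or {"synonyms": 0.9, "children": 0.6, "parents": 0.4}
    clauses = [f'"{term}"']
    for relation, terms in (("synonyms", synonyms),
                            ("children", children),
                            ("parents", parents)):
        if terms:  # skip relations with no terms
            group = " OR ".join(f'"{t}"' for t in terms)
            clauses.append(f"({group})^{boosts[relation]}")
    return " OR ".join(clauses)


query = expand_query(
    "erythropoietin",
    synonyms=["EPO", "Epoetin"],
    children=["Epoetin Alfa"],
    parents=["Colony Stimulating Factors"],
)
print(query)
# "erythropoietin" OR ("EPO" OR "Epoetin")^0.9 OR ("Epoetin Alfa")^0.6 OR ("Colony Stimulating Factors")^0.4
```

Quoting every term keeps multi-word concepts such as "Epoetin Alfa" together as phrases rather than letting them match as independent words.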

slide-11
SLIDE 11

Competency Questions

Patent Domain:

  • Return all patent documents which contain the phrase ‘recombinant erythropoietin receptor’ in the claims
  • Return all the patent documents which contain the phrase ‘recombinant erythropoietin receptor’, at least 3 claims, issued before 02-02-1999 and assigned to Genetics Inc.

Court Case Domain:

  • Return all court cases which contain the term ‘erythropoietin’
  • Return all court cases which involve the company Amgen Inc. either as the plaintiff or defendant, and from the District Court of Massachusetts

Multi-domain:

  • Return all patents which contain the term ‘erythropoietin’ in their claims, which are involved in at least one court litigation.
  • Return all court cases with the term ‘erythropoietin’. From these court cases, return the patents involved. From these patents, follow the backward and forward citations to identify more important patents.

Patent System Ontology (patent documents, court cases, file wrappers)

slide-12
SLIDE 12

Patent Documents

  • More than 8 million U.S. patents (2.2 million in force today)
  • In 2009, 485,312 patent applications were filed
  • Information is contained in various sections of the documents; a full-text search alone is not sufficient, and other metrics such as classification and citations need to be considered
  • Documents are available in HTML format and can be easily parsed

slide-13
SLIDE 13

Conceptual View of Patent Documents

Patent System Ontology

slide-14
SLIDE 14

Court Cases

  • Court cases are not very well structured
  • Comparatively more difficult to parse information
  • PACER (Public Access to Court Electronic Records), the database system for U.S. courts, requires one to know the judicial district, party/assignee name, case number/type, etc., which may not be known
  • Bloomberg Law is better but has limitations

927 F.2d 1200 (1991)
AMGEN, INC., Plaintiff/Cross-Appellant, v. CHUGAI PHARMACEUTICAL CO., LTD., and Genetics Institute, Inc., Defendants-Appellants.
Nos. 90-1273, 90-1275.
United States Court of Appeals, Federal Circuit.
March 5, 1991.
Suggestion for Rehearing Declined May 20, 1991.
… …
Before MARKEY, LOURIE and CLEVENGER, Circuit Judges.
…
THE PATENTS
On June 30, 1987, the United States Patent and Trademark Office (PTO) issued to Dr. Rodney Hewick U.S. Patent 4,677,195, entitled "Method for the Purification of Erythropoietin and Erythropoietin Compositions" (the '195 patent). The patent claims both homogeneous EPO and compositions thereof and a method for purifying human EPO using reverse phase high performance liquid chromatography. The method claims are not before us. The relevant claims of the '195 patent are:

1. Homogeneous erythropoietin characterized by a molecular weight of about 34,000 daltons on SDS PAGE, movement as a single peak on reverse phase high performance liquid chromatography and a specific activity of at least 160,000 IU per absorbance unit at 280 nanometers.
* * * * * *
3. A pharmaceutical composition for the treatment of anemia comprising a therapeutically effective amount of the homogeneous erythropoietin of claim 1 in a pharmaceutically acceptable vehicle.
4. Homogeneous erythropoietin characterized by a molecular weight of about 34,000 daltons on SDS PAGE, movement as a single peak on reverse phase high performance liquid chromatography and a specific activity of at least about 160,000 IU per absorbance unit at 280 nanometers.

slide-15
SLIDE 15

Conceptual View of Court Cases

Patent System Ontology

slide-16
SLIDE 16

Patent File Wrappers

  • File wrappers are folders which contain all documents exchanged between a patent applicant and the patent office
  • Every file wrapper is different! Limited standardized ordering of events
  • The relevant information is embedded within lots of irrelevant text
  • File wrappers are available as images, requiring additional processing in order to extract the text

[Figure: example file wrapper, annotated with Events and Text]

slide-17
SLIDE 17

Events Contained in a File Wrapper

Patent System Ontology

slide-18
SLIDE 18

Cross-Referencing

  • There are many aspects of these documents which can be utilized, especially the cross-referencing between the documents

PATENT
United States Patent 5,955,422, September 21, 1999
Production of erythropoietin
Abstract: Disclosed are novel polypeptides possessing part or all of the primary structural conformation and one or more of the biological properties of mammalian erythropoietin ("EPO") …
Inventors: Lin; Fu-Kuen (Thousand Oaks, CA)
Assignee: Kirin-Amgen, Inc. (Thousand Oaks, CA)
Appl. No.: 08/100,197
Filed: August 2, 1993.

COURT CASE
314 F.3d 1313 (2003)
AMGEN INC., Plaintiff-Cross Appellant v. HOECHST MARION ROUSSEL, INC. (now known as Aventis Pharmaceuticals, Inc.) and Transkaryotic Therapies, Inc., Defendants-Appellants.
… Plaintiff-Cross Appellant Amgen Inc. is the owner of numerous patents directed to the production of erythropoietin ("EPO"), … alleging that TKT's Investigational New Drug Application ("INDA") infringed United States Patent Nos. 5,547,933; 5,618,698; and 5,621,080. The complaint was amended in October 1999 to include United States Patent Nos. 5,756,349 and 5,955,422, which issued after suit was filed.

FILE WRAPPER
U.S. Patent 5,955,422
… Claims 61-63 are rejected under 35 U.S.C. § 103 as being unpatentable over any one of Miyake et al., 1977 (R) … In accordance with the provisions of 37 C.F.R. §1.607, the present continuation is being filed for the purpose of …

PUBLICATION DATABASE

REGULATIONS: U.S. Code Title 35, C.F.R. Title 37, M.P.E.P. …

BIOPORTAL: DOMAIN KNOWLEDGE
Erythropoietin, Epoetin, EPO …

slide-19
SLIDE 19

Patent System Ontology

Top Level Ontology for the Patent System

slide-20
SLIDE 20

Parsing the Document to Instantiate the Ontology

  • Documents are automatically parsed using a regular-expression-based script
  • Separate scripts needed for each document domain
  • Ontology is automatically instantiated using the Protégé-OWL API

[Figure: instance "Case 1" linked to "Amgen .." by hasPlaintiff and to "Chugai .." by hasDefendant]
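
As a rough sketch of the regular-expression parsing step, the snippet below extracts the parties from a court case caption and emits hasPlaintiff/hasDefendant triples. The pattern, the function name, and the "Case1" identifier are assumptions for illustration, and plain Python tuples stand in for the ontology instances that the authors create with the Protégé-OWL API.

```python
import re

# Hypothetical pattern for captions of the form
# "<plaintiff>, Plaintiff..., v. <defendants>, Defendants..."
CAPTION = re.compile(
    r"(?P<plaintiff>.+?),\s*Plaintiff[^,]*,\s*v\.\s*"
    r"(?P<defendants>.+?),\s*Defendants",
    re.DOTALL,
)


def parse_caption(case_id, text):
    """Return (subject, property, value) triples for the parties in a caption."""
    match = CAPTION.search(text)
    if match is None:
        return []  # caption does not follow the assumed layout
    triples = [(case_id, "hasPlaintiff", match.group("plaintiff").strip())]
    # Defendants are assumed to be joined by ", and ".
    for name in re.split(r",\s*and\s+", match.group("defendants")):
        triples.append((case_id, "hasDefendant", name.strip()))
    return triples


caption = (
    "AMGEN, INC., Plaintiff/Cross-Appellant, v. CHUGAI PHARMACEUTICAL CO., "
    "LTD., and Genetics Institute, Inc., Defendants-Appellants."
)
triples = parse_caption("Case1", caption)
for triple in triples:
    print(triple)
```

A single pattern like this is brittle: a caption with a different layout (for example, no comma before "v.") returns no triples, which is consistent with the point above that each document domain needs its own script.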

slide-21
SLIDE 21

Patent System Ontology

  • Established semantics allow us to reason over the classes, properties and instances to infer new facts
  • Documents can be connected to form a network similar to citation networks, except that the links now include not just citations but also other metadata such as co-inventorships, technological classification, and other cross-domain relevancy metrics between documents (e.g., patents occurring in court cases)
  • Rules can be developed to perform additional inferences over the knowledge
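
To make the idea of inferring links between documents concrete, here is a small sketch that applies two hand-written rules to a set of triples. The facts are taken from other slides in this deck; the rule names (sharesInventorWith, coLitigatedWith), the "US"-prefixed identifiers, and the plain-Python rule engine are assumptions, standing in for ontology reasoning and rules in the actual system.

```python
from itertools import combinations

# Facts drawn from other slides; identifiers are illustrative.
triples = {
    ("US7304150", "hasInventor", "Elliott_Steven_G"),
    ("US7217689", "hasInventor", "Elliott_Steven_G"),
    ("US5756349", "hasInventor", "Lin_Fu-Kuen"),
    ("US5955422", "hasInventor", "Lin_Fu-Kuen"),
    ("Case_314_F3d_1313", "patentsInvolved", "US5756349"),
    ("Case_314_F3d_1313", "patentsInvolved", "US5547933"),
}


def infer_links(facts):
    """Infer patent-to-patent links from shared inventors and shared cases."""
    by_inventor, by_case = {}, {}
    for subject, prop, value in facts:
        if prop == "hasInventor":
            by_inventor.setdefault(value, set()).add(subject)
        elif prop == "patentsInvolved":
            by_case.setdefault(subject, set()).add(value)
    inferred = set()
    for label, groups in (("sharesInventorWith", by_inventor),
                          ("coLitigatedWith", by_case)):
        for patents in groups.values():
            # Sorting makes each undirected link appear once, in a fixed order.
            for a, b in combinations(sorted(patents), 2):
                inferred.add((a, label, b))
    return inferred


links = infer_links(triples)
for link in sorted(links):
    print(link)
```

The inferred links can be added back to the triple set, so the resulting network combines citation-like edges with edges derived from inventors and litigation.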

slide-22
SLIDE 22

Information Retrieval Framework

slide-23
SLIDE 23

Prototype System Implementation

[Figure: prototype system architecture; labelled components include SWRL and Virtuoso]

slide-24
SLIDE 24

Summary of the Implementation

  • Jena libraries and triple store integration for modifying the patent system ontology through new constructs, cross-references, or rules
  • Solr and Lucene libraries to create, update, and query the text indexes
  • Generic API for integration with sources of domain knowledge such as BioPortal
  • Automatic query generation, abstracting the syntactic details from the user
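
Automatic query generation could look roughly like the sketch below, which turns a class name and a list of (property, value) constraints into a SPARQL SELECT string so that the user never writes SPARQL by hand. The function name, its parameters, and the `ont:` namespace IRI are hypothetical; the property names follow the competency-question slides.

```python
# Hypothetical namespace for the patent system ontology.
PREFIX = "PREFIX ont: <http://example.org/patent-ontology#>"


def build_sparql(var, rdf_class, constraints, limit=None):
    """Assemble a SELECT query from literal-valued (property, value) constraints."""
    lines = [PREFIX,
             f"SELECT ?{var} WHERE {{",
             f"  ?{var} a ont:{rdf_class} ."]
    for prop, value in constraints:
        # Escape backslashes and quotes so the value stays a valid literal.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'  ?{var} ont:{prop} "{escaped}" .')
    lines.append("}")
    if limit is not None:
        lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


sparql = build_sparql(
    "case",
    "CourtCase",
    [("hasPlaintiff", "Amgen Inc.")],
    limit=10,
)
print(sparql)
```

The generated string can then be sent to whatever SPARQL endpoint the triple store exposes; only literal-valued constraints are handled here, so constraints on other resources would need an extra case.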

slide-25
SLIDE 25

Expressing Competency Questions in SPARQL

Competency Question: Return all court cases which involve the company Amgen Inc. as the plaintiff and from the District Court of Massachusetts
SPARQL Query:

    SELECT ?case
    WHERE {
      ?case type CourtCase .
      ?case hasPlaintiff "Amgen Inc." .
      ?case hasCourt "District Court…"
    }

Competency Question: Return all patents which contain the phrase ‘recombinant erythropoietin receptor’ in the claims and IPC class “A61K”
SPARQL Query:

    SELECT ?pat
    WHERE {
      ?pat type Patent .
      ?pat hasClaim ?clm .
      ?clm hasTerm "recombinant …" .
      ?pat hasIPCClass "A61K" .
    }

Example Query

slide-26
SLIDE 26

Example Query

  • Return all the patent documents which contain the keyword “erythropoietin” in the Claims and Assigned to “Amgen_Inc”.
  • SPARQL Query:

    SELECT DISTINCT ?patent ?inventor
    FROM <http://localhost:8890/PatentOntologyInferred>
    WHERE {
      ?patent a ont:Patent .
      ?patent ont:hasAbstract ?abs .
      ?abs ont:resourceVal ?val .
      ?val bif:contains "erythropoietin" .
      ?patent ont:hasAssignee ont:Amgen_Inc .
      ?patent ont:hasInventor ?inventor
    }
    LIMIT 10

Results:

    Patent   | Inventor
    5856298  | Strickland_Thomas_W
    5885574  | Elliott_Steven_G
    7304150  | Egrie_Joan_C
    7304150  | Elliott_Steven_G
    7304150  | Browne_Jeffrey_K
    7304150  | Sitney_Karen_C
    7217689  | Elliott_Steven_G
    7217689  | Byrne_Thomas_E
    6319499  | Elliott_Steven_G
    5756349  | Lin_Fu-Kuen

slide-27
SLIDE 27

SPARQL Query to Retrieve Information Related to “erythropoietin”

    SELECT ?party ?pat ?class ?inventor ?assignee
    WHERE {
      # Retrieve all cases related to erythropoietin
      ?case a CourtCase .
      ?case hasBody ?body .
      FILTER REGEX(?body, "erythropoietin", "i")
      # Retrieve plaintiffs and defendants
      { { ?case hasPlaintiff ?party . } UNION { ?case hasDefendant ?party . } } .
      # Retrieve involved patents, US classification, inventors, assignees
      ?case patentsInvolved ?pat .
      ?pat hasUSClass ?class .
      ?pat hasInventor ?inventor .
      ?pat hasAssignee ?assignee .
    }
    LIMIT 4

Example Query

slide-28
SLIDE 28

Summary of Extracted Information

    Plaintiffs/Defendants  | Patents Involved in Cases | US Class  | Inventor           | Assignee
    Amgen Inc.             | 5,955,422                 | 514/8     | Lin, Fu-Kuen       | Kirin-Amgen, Inc.
    Chugai Pharmaceuticals | 5,547,933                 | 530/350   | Hewick, Rodney, M. | Amgen, Inc.
    Hoechst Marion Roussel | 5,621,080                 | 536/23.51 | Seehra, Jasbir, S. | Kirin-Amgen, Inc.
    Genetics Inc.          | 5,618,698                 | 435/325   | Seehra, Jasbir, S. | Genetics Institute, Inc.

Example Query

slide-29
SLIDE 29

SPARQL Query to Retrieve Information Related to U.S. Patent 5,955,422

    SELECT ?pat1 ?pat2 ?case ?pub ?inv ?assg ?class
    WHERE {
      # Query court case
      ?case a CourtCase .
      ?case patentsInvolved US5955422 .
      # Retrieve patents
      ?case patentsInvolved ?pat1 .
      { ?pat1 hasCitation ?pat2 . }
      { ?pat1 hasInventor ?inv . }
      { ?pat1 hasAssignee ?assg . }
      { ?pat1 hasUSClass ?class . }
      # Retrieve information across domains
      ?pat1 hasClaim ?claim .
      ?pub a Publication .
      ?pub hasBody ?body .
      FILTER REGEX(?body, ?claim, "i")
    }

Example Query

slide-30
SLIDE 30

Actual Documents Retrieved by Querying Patent System Ontology

Example Query

slide-31
SLIDE 31

Multi-Domain Information Retrieval

Initial Features (Drug Ontology): {{erythropoietin, epo}, {epoetin alfa, epogen, procrit …} …}

Step I: Search TREC dataset (Disease Ontology, MEDLINE Metadata)
  Query: [DISEASE] AND {{erythropoietin, epo}, {epoetin alfa, epogen, procrit …} …}
  Acquired Features-I: {anemia, {aplastic anemia, hemolytic anemia, …}, {esrd, chronic kidney disease…}…}

Step II: Search TREC dataset (TREC corpus)
  Query: [AUTHOR] AND {anemia, {aplastic anemia, hemolytic anemia, …}} AND {{erythropoietin, epo} …}
  Acquired Features-II: {Miyake, {Goldwasser E., Eugene Goldwasser} …}

Step III: Search Patent Database (USPTO)
  Query: [PATENT] AND {{Goldwasser, Eugene G, …}, …} AND {{anemia, …}, …}

[Figure labels: Drug Ontology; Disease Ontology; Symptom Ontology; MEDLINE Metadata; U.S. Patent Classification; Patent System Ontology; New Features]

Results: the 5 core patents that originated from Amgen Inc. (U.S. Patents 5,621,080, 5,756,349, 5,955,422, 5,547,933, 5,618,698)
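
The three-step feature acquisition can be sketched as a chain of searches in which each step's extracted features constrain the next query. Everything below is a toy: the four-document corpus, its field names, and the helper functions are hypothetical stand-ins for the TREC dataset, MEDLINE metadata, and the USPTO patent database.

```python
import re


def mentions(text, term):
    """True if the term occurs in the text as a whole word or phrase."""
    return re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None


def search(corpus, doc_type, *feature_groups):
    """Docs of doc_type whose text mentions at least one term from every group."""
    return [
        doc for doc in corpus
        if doc["type"] == doc_type
        and all(any(mentions(doc["text"], t) for t in group)
                for group in feature_groups)
    ]


def acquire(hits, field):
    """Collect the metadata values of one field across the hits."""
    return {value for doc in hits for value in doc.get(field, [])}


# Toy corpus (hypothetical records, for illustration only).
CORPUS = [
    {"type": "publication",
     "text": "Erythropoietin therapy for anemia in ESRD patients",
     "diseases": ["anemia", "esrd"], "authors": ["Miyake", "Goldwasser"]},
    {"type": "publication",
     "text": "Insulin dosing in diabetes mellitus",
     "diseases": ["diabetes mellitus"], "authors": ["Doe"]},
    {"type": "patent", "id": "US5955422",
     "text": "Production of erythropoietin for treating anemia; cites Miyake et al."},
    {"type": "patent", "id": "US0000000",
     "text": "A fastening device for footwear"},
]

drug_features = {"erythropoietin", "epo"}                 # initial features
step1 = search(CORPUS, "publication", drug_features)
disease_features = acquire(step1, "diseases")             # Step I
step2 = search(CORPUS, "publication", disease_features, drug_features)
author_features = acquire(step2, "authors")               # Step II
patents = search(CORPUS, "patent", author_features, disease_features)  # Step III
patent_ids = [doc["id"] for doc in patents]
print(patent_ids)  # ['US5955422']
```

Each step narrows or enriches the feature set, so a patent that never uses the drug name can still be reached through the diseases and authors acquired along the way.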

slide-32
SLIDE 32

Summary: BIO-REGNET

[Figure: siloed patent system information (Issued Patents and Applications; Court Cases; File Wrappers; Technical Publications; Regulations and Laws) is linked through two knowledge sources: the Patent System Ontology (business/legal domain) and the Bio Ontology (technical domain) from BioPortal (bioportal.bioontology.org). Example linked documents: Scientific Publication, Court Case, Patent Document.]

slide-33
SLIDE 33

Summary and Discussion

  • IP informatics: from research/development and patent filings to infringement and IP protection
  • Knowledge-Driven Ontology-Based Approach
  • Technological ontologies
  • Patent system ontology
  • Generalization: linking to other information sources, such as technical/scientific publications and product literature
  • User Interface: efficient presentation of relevant (semantic) information
  • Comparative analysis of documents
  • Scalability (Graph Database?)
  • Experiment with more use cases in other technical domains outside of the biomedical domain

slide-34
SLIDE 34

Acknowledgement

This research is partially supported by

  • NSF Grant Number IIS-0811975 awarded to the University of Illinois at Urbana-Champaign
  • NSF Grant Number IIS-0811460 to Stanford University
  • NIST Award Number 60NANAB11D129 to Stanford University

Any opinions and findings are those of the authors, and do not necessarily reflect the views of the National Science Foundation (NSF) or the National Institute of Standards and Technology (NIST). Certain identification of public or commercial systems in the paper/presentation does not imply recommendation or endorsement by NSF or NIST; nor does it imply that the products identified are necessarily the best available for the purpose.