NTCIR-7 Patent Mining Experiments at RALI Guihong Cao, Jian-Yun Nie - PowerPoint PPT Presentation

NTCIR-7 Patent Mining Experiments at RALI
Guihong Cao, Jian-Yun Nie and Lixin Shi
Department of Computer Science and Operations Research
University of Montreal, Canada

Outline
• Introduction
• Our Approaches
• Issues Investigated
• Experiments
• Conclusion

Introduction
• Patent Mining Project
  – Each patent has an IPC code
• Task
  – Query: abstract of a research paper
  – Document collection: patents with IPC codes
  – Task: assign IPC codes to each research paper according to relevance
• Possible solution
  – View it as a text categorization problem

Introduction (Cont.)
• Difference in writing style between patents and research papers
  – Patent: more general terms, to cover more related things
  – Research paper: more precise and technical terms
  – E.g., "music player" vs. "Apple iPod"
• Complexity of the classification problem
  – More than 50,000 IPC codes
  – Very unbalanced
  – Cannot be tackled with traditional text classification approaches

Distribution of IPC codes in US patents

#Patent      #IPC
1~10         25944
11~100       10911
101~500      1430
501~1000     129
1001~2000    46
2001~3000    5
3001~4000    3
4001~5000    0
>5000        23

Outline
• Introduction
• Our Approaches
  – Basic approach
  – System Description
• The Issues Investigated
• Experiments
• Conclusion

Basic Approach
• Classify the research paper with a K-NN classifier
  – The patents are the labeled instances
  – Measure the distance between patents and the research paper according to relevance
• Find the closest documents with information retrieval
  – Language modeling approach for information retrieval
  – Relevance is measured by query likelihood

Language Modeling Approach for Information Retrieval
• Documents are represented with unigram models, i.e., P(w|D)
  – P(w|D) is smoothed to avoid zero probability (Zhai and Lafferty, 2001):

    $$ P(w \mid D) = \lambda \frac{tf(w, D)}{|D|} + (1 - \lambda) P(w \mid C) $$

• A query is represented as a sequence of words
• Relevance is measured by the likelihood of the query with respect to the document model:

    $$ P(q \mid D) = \prod_{i} P(q_i \mid D) $$
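The query-likelihood scoring on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system's actual code: the function names, the toy documents, the default `lam=0.5`, and the choice to skip query words absent from the whole collection are all hypothetical. Log probabilities are summed instead of multiplying raw probabilities, which gives the same ranking while avoiding numerical underflow on long queries.

```python
import math
from collections import Counter


def smoothed_prob(word, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """P(w|D) = lam * tf(w,D)/|D| + (1 - lam) * P(w|C)."""
    p_doc = doc_tf.get(word, 0) / doc_len if doc_len else 0.0
    p_coll = coll_tf.get(word, 0) / coll_len if coll_len else 0.0
    return lam * p_doc + (1.0 - lam) * p_coll


def query_log_likelihood(query, doc_tokens, coll_tf, coll_len, lam=0.5):
    """log P(q|D) = sum_i log P(q_i|D); summing logs avoids underflow."""
    doc_tf = Counter(doc_tokens)
    log_p = 0.0
    for word in query:
        p = smoothed_prob(word, doc_tf, len(doc_tokens), coll_tf, coll_len, lam)
        if p > 0.0:  # assumption: skip words absent from the whole collection
            log_p += math.log(p)
    return log_p


# Toy collection (hypothetical data, for illustration only).
docs = [["music", "player", "device"], ["engine", "fuel", "pump"]]
coll_tf = Counter(w for d in docs for w in d)
coll_len = sum(coll_tf.values())
query = ["music", "player"]
scores = [query_log_likelihood(query, d, coll_tf, coll_len) for d in docs]
```

With this toy data the first document, which contains both query words, receives the higher score.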

System Description
• The whole system is implemented using the INDRI system (Strohman et al., 2005)
• INDRI system
  – Language modeling approach for IR
  – Allows retrieval using different fields
• Classification algorithm

    $$ \mathrm{score}(c, q) = \sum_{i=1}^{K} \delta\big(\mathrm{ipc}(d_i) = c\big) \, P(q \mid d_i) $$

  – $\delta(\mathrm{ipc}(d_i) = c)$: indicator function
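The classification rule sums the query likelihoods of the top K retrieved patents that carry a given IPC code. A minimal sketch follows, assuming the retrieval step has already produced a ranked list of (IPC code, P(q|d)) pairs; the function name, the input format, and the toy values are hypothetical and not taken from the INDRI-based system.

```python
from collections import defaultdict


def score_ipc_codes(ranked, k):
    """score(c, q) = sum over the top-k patents d_i with ipc(d_i) == c of P(q|d_i).

    ranked: list of (ipc_code, p_q_given_d) pairs, sorted by descending P(q|d).
    Returns (ipc_code, score) pairs sorted by descending score.
    """
    scores = defaultdict(float)
    for ipc_code, p in ranked[:k]:
        # The indicator function selects only neighbours labelled with this code.
        scores[ipc_code] += p
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical retrieval output: (IPC code of patent, query likelihood).
ranked = [("G06F", 0.4), ("H04L", 0.3), ("G06F", 0.2), ("A61K", 0.1)]
top_codes = score_ipc_codes(ranked, k=3)
```

Here "G06F" ranks first because two of the three nearest patents vote for it. Real query likelihoods for long abstracts are extremely small, so an implementation would likely need to work from log scores; that detail is left out of the sketch.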

Outline
• Introduction
• Our Approaches
• The Issues Investigated
• Experiments
• Conclusion

Investigations
• Term Distillation
  – Aims to bridge the different writing styles of research papers and patent descriptions
    • Some words that are common in research papers are not common in patent descriptions, e.g., paper, study, propose
    • These words introduce noise into patent retrieval
• Our approach
  – Select a set of common terms in research papers according to document frequency
  – Filter out these common words at query time
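The two steps of term distillation (pick common research-paper terms by document frequency, then drop them from the query) could look roughly like the sketch below. The function names, the `min_df_ratio` threshold, and the toy abstracts are assumptions made for illustration; the slides do not say how the cut-off was chosen.

```python
from collections import Counter


def common_terms(abstracts, min_df_ratio):
    """Return terms that occur in at least min_df_ratio of the abstracts."""
    df = Counter()
    for tokens in abstracts:
        df.update(set(tokens))  # count each term once per abstract
    n = len(abstracts)
    return {term for term, count in df.items() if count / n >= min_df_ratio}


def distill(query, filtered_terms):
    """Drop the selected common terms from a query, keeping word order."""
    return [term for term in query if term not in filtered_terms]


# Toy research-paper abstracts (hypothetical data).
abstracts = [
    ["paper", "propose", "retrieval"],
    ["paper", "study", "patent"],
    ["paper", "propose", "model"],
]
filtered = common_terms(abstracts, min_df_ratio=0.6)
distilled = distill(["paper", "patent", "retrieval"], filtered)
```

In this toy run "paper" and "propose" exceed the threshold, so the query keeps only "patent" and "retrieval".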

Common terms

lt        propose    prepare     shows
gt        proposed   prepares    showing
paper     based      preparing   shown
papers    obtain     prepared    report
method    obtains    carry       reported
methods   obtained   carries
study     find       carrying
studies   found      carried
studying  result     show
studied   results    showed

Mining Patent Structures
• Patents are structured documents
• Different fields have different impacts
• Four main fields
  – Title, abstract, specification and claim
• The specification can be divided into four sub-fields
  – Background, description, summary and drawing
• Experiments:
  – Using some of the fields
  – Aggregating occurrences of query terms in different fields with linear interpolation
    • With equal weights
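One way to read "linear interpolation with equal weights" is as a mixture of per-field term estimates, P(w|D) = Σ_f α_f · tf(w, D_f)/|D_f| with all α_f equal. The sketch below follows that reading; it is an assumption about the setup rather than the actual INDRI configuration, collection smoothing is left out for brevity, and the function name and toy patent are hypothetical.

```python
from collections import Counter


def field_mixture_prob(word, fields, weights=None):
    """Interpolate per-field maximum-likelihood estimates of P(word | field).

    fields: dict mapping field name -> list of tokens.
    weights: dict mapping field name -> weight; equal weights if omitted.
    """
    if weights is None:
        weights = {name: 1.0 / len(fields) for name in fields}
    prob = 0.0
    for name, tokens in fields.items():
        if tokens:  # an empty field contributes nothing
            prob += weights[name] * Counter(tokens)[word] / len(tokens)
    return prob


# Toy patent with two fields (hypothetical data).
patent = {
    "title": ["music", "player"],
    "abstract": ["a", "portable", "music", "device"],
}
p_music = field_mixture_prob("music", patent)
```

With equal weights, "music" gets 0.5 · 1/2 from the title plus 0.5 · 1/4 from the abstract.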

Query Expansion
• An effective technique to enrich the query with terms from top-ranked documents
• Pseudo-relevance feedback
• The number of feedback documents and query terms is a key issue
• More effective for short queries
• Is it effective for the Patent Mining task (quite long queries)?
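The idea of pseudo-relevance feedback can be shown with a deliberately simple sketch: treat the top-ranked documents as relevant and append their most frequent new terms to the query. Raw term frequency is used here as a stand-in term selector; the slides do not describe the expansion model actually used, and the function name, parameters, and toy data are hypothetical.

```python
from collections import Counter


def expand_query(query, ranked_docs, n_docs, n_terms):
    """Append the n_terms most frequent new terms from the top n_docs documents."""
    counts = Counter()
    for tokens in ranked_docs[:n_docs]:
        counts.update(tokens)
    original = set(query)
    candidates = [(t, c) for t, c in counts.items() if t not in original]
    # Sort by descending frequency, breaking ties alphabetically for determinism.
    candidates.sort(key=lambda tc: (-tc[1], tc[0]))
    return list(query) + [t for t, _ in candidates[:n_terms]]


# Hypothetical ranking returned for the query, best document first.
ranked_docs = [
    ["music", "audio", "device"],
    ["audio", "player", "storage"],
    ["engine", "fuel"],
]
expanded = expand_query(["music", "player"], ranked_docs, n_docs=2, n_terms=1)
```

Only the first two documents are used for feedback, so "audio" (seen twice) is the single term added.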

Outline
• Introduction
• Our Approaches
• The Issues Investigated
• Experiments
• Conclusion

Experiments
• Query and document processing: standard
  – Porter stemmer
  – Stop word removal
• Evaluation metrics
  – Mean average precision (MAP)
  – Precision at top N documents (P@N)
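For reference, the two metrics can be computed as in the sketch below, using their standard definitions. The function names and toy rankings are illustrative, and the ranked lists are assumed to contain no duplicates; the official NTCIR evaluation scripts may differ in details.

```python
def precision_at_n(ranked, relevant, n):
    """Fraction of the top n ranked items that are relevant."""
    return sum(1 for item in ranked[:n] if item in relevant) / n


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two toy queries with hypothetical IPC-code rankings.
runs = [
    (["A", "B", "C", "D"], {"A", "C"}),
    (["X", "Y"], {"Y"}),
]
map_score = mean_average_precision(runs)
p_at_2 = precision_at_n(runs[0][0], runs[0][1], 2)
```

For the first query, average precision is (1/1 + 2/3) / 2; for the second it is 1/2.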

Term Distillation Results

Model              P@30    P@100   MAP
Original           0.0277  0.0047  0.1502
Term Distillation  0.0282  0.0046  0.1491

Does not seem to be effective. Is it due to the terms selected?

The Effectiveness of Query Expansion (top 20 documents)

#Exp. Terms  P@30    P@100   MAP
0            0.0271  0.0047  0.1488
20           0.0274  0.0029  0.1470
40           0.0274  0.0030  0.1451
60           0.0277  0.0029  0.1447
80           0.0277  0.0030  0.1439
100          0.0276  0.0030  0.1456

Observation: Not very effective, possibly because the queries (paper abstracts) are already quite long.

The Impact of Different Fields

T: title, A: abstract, S: specification, C: claim, B: background, D: description, M: summary, R: drawing

Fields     P@30    P@100   MAP
T+A+S+C    0.0277  0.0047  0.1502
T+A+B      0.0270  0.0041  0.1470
T+A+B+D    0.0281  0.0049  0.1489
T+A+B+D+M  0.0276  0.0047  0.1495

No significant differences.

The Impact of Different K Values

Formal Run Results

rali_baseline: Title+Abstract+Specification+Claim
rali_short_doc: Title+Abstract+Description

Run ID          P@30    P@100   MAP
rali_baseline   0.0234  0.0050  0.1423
rali_short_doc  0.0241  0.0048  0.1437

Marginal effect. More experiments using different fields are needed.

Conclusion
• Classification of research abstracts into IPC codes
  – K-NN classifier
• Investigated several issues
  – Only the value of K has some impact on classification effectiveness
  – The other factors do not seem to affect the classification accuracy:
    • Different fields
    • Pseudo-relevance feedback
    • Term distillation
• Questions:
  – Exploiting more characteristics of patents?
  – Term relationships?

Thanks!
