
KNN and re-ranking models for English patent mining at NTCIR-7
Tong Xiao, Feifei Cao, Tianning Li, Guolong Song, Ke Zhou, Jingbo Zhu and Huizhen Wang
Natural Language Processing Lab, Northeastern University (P. R. China)
xiaotong@mail.neu.edu.cn

Outline
• Overview
• Basic idea
• Methodology
– KNN-based method
– Re-ranking
• Experiment
• Discussion
• Summary

Introduction of our group
• Natural Language Processing Laboratory, College of Information Science and Engineering, Northeastern University
• Working on a variety of problems related to Natural Language Processing
– Statistical machine translation
– Syntactic parsing
– Applied semantics ontology learning
– Text mining
• Focus on patent mining from 2007
• Welcome to our homepage http://www.nlplab.com

Patent mining task at NTCIR-7
• Patent mining task
– Mapping research papers into patent taxonomy (International Patent Classification)
• Three sub-tasks
– English patent mining
– Japanese patent mining
– Cross language patent mining
• We participated in the English patent mining sub-task

[Figure: patent data → patent mining system; input: title and abstract of the paper to be searched; output: ranked list of IPC codes]

Patent data (example):
<TITLE>End-ventilating adjustable pitch arcuate roof ventilator</TITLE>
<ABSTRACT>A roof ridge ventilator is provided, comprising preferably a molded ventilator, with openings along the sides thereof for passage of air therethrough and with openings at ends thereof for passage of air therethrough via gaps provided in pluralities of rows of tabs …</ABSTRACT>
<IPC> F24F_7_02, F24F_7_007 </IPC>
<CLAIM>What is claimed is: 1. A roofing ridge ventilator for venting a roof for …</CLAIM>
……

Input (example):
<TITLE>Study on a Natural Ventilation System Using a Pitched Roof with Breathing Walls Part 1 Proposal of the System and Its Design for Ventilation</TITLE>
<ABSTRACT>We proposed a natural ventilation system using a pitched roof with Breathing Walls, …</ABSTRACT>

Output (example):
IPC code      Rank   Score
E04B_1_70     1      14.23
F24F_7_10     2      13.06
F24F_7_007    3      12.76
F24F_1_00     4      11.70
F24F_7_08     5      11.51
F24F_7_013    6      11.38
F24F_7_06     7      9.923
F24F_1_02     8      7.686
…

Challenges
• Huge amount of training data
– over 3 million training samples
– how to train a supervised classifier or ranker
• Huge label set and multi-label
– IPC is a hierarchical classification system which consists of more than 60,000 IPC codes.

[Figure: millions of patents (USPTO, PAJ), each labeled with IPC codes such as F24F_7_08, F24F_7_10, E06B_7_02; the IPC taxonomy as a tree (… E, F, G …; F24F_7 with children F24F_7_10, F24F_7_08, F24F_7_06); very large number of IPC codes]

Challenges
• Class imbalance problem of IPC
– The distribution of IPC codes is skewed
• Different writing styles between research papers and patents
– conflicts with the foundational hypothesis of supervised document classification theory

[Figure: number of patents per IPC code (IPC1 to IPC6)]
[Figure: a research paper and a patent on the same topic; "Similarity = 1.0" marked with "?"]

Motivation
• It is difficult to apply sophisticated machine learning methods such as maximum entropy methods and support vector machines to the patent mining task
– a great deal of memory space and time is required
– no good solutions to multi-label classification on a very large class set
• The K-Nearest Neighbor (KNN) method is a comparatively easy solution
– it extracts similar examples, and no training process is required
– KNN is itself a ranking

[Figure: a test sample surrounded by samples in class1 and samples in class2]

KNN-based method
• Key components
– KNN-based ranking
– Re-ranking
• Each document is represented as a vector in our system

[Figure: system pipeline. Research paper → Pre-processing (extracting title and abstract, tokenization and removing case info., stemming) → KNN-based ranking (similarity calculation against English patents (for training), ranking) → Re-ranking (rank combination, SVM)]
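The pre-processing step in the pipeline can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `preprocess`, `toy_stem` and `to_vector` are hypothetical names, the suffix stripper is only a stand-in for whatever stemmer the system actually used, and a raw term-frequency dictionary is assumed as the document vector.

```python
import re
from collections import Counter


def toy_stem(token):
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(title, abstract):
    """Join title and abstract, drop case information, tokenize, and stem."""
    text = f"{title} {abstract}".lower()      # removing case info
    tokens = re.findall(r"[a-z0-9]+", text)   # simple tokenization
    return [toy_stem(t) for t in tokens]


def to_vector(tokens):
    """Represent a document as a sparse term-frequency vector (an assumption)."""
    return Counter(tokens)


tokens = preprocess("Roof Ventilators", "Breathing walls")
vector = to_vector(tokens)
```

The same function would be applied to both research papers and patents so that their vectors share one vocabulary.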

Similarity calculation
• Calculate the similarity between the test sample (research paper) and the training samples (patents)
• State-of-the-art methods
– Cosine + tfidf
– BM25 (Robertson et al, 1998)
– SMART (Buckley et al, 1996)
– PIV (Singhal et al, 1996)
– Or some other …
• Log-linear method
– Combine different similarities (features) to generate a refined similarity
– Different weights to different features

$$\mathrm{Score}_{\text{log-linear}}(c) = \frac{\sum_{m=1}^{M} \lambda_m \cdot \exp(\mathrm{Score}_m(c))}{\sum_{c'} \sum_{m=1}^{M} \lambda_m \cdot \exp(\mathrm{Score}_m(c'))}$$

[Figure: test sample and training samples → BM25, cosine, SMART, … → sim1, sim2, sim3, … → log-linear → combined similarity]
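A minimal sketch of two pieces of this slide: cosine similarity over tf-idf vectors, and a log-linear style combination that takes a weighted sum of exponentiated per-method scores and normalizes it over all candidates. The function names, the smoothed idf variant, the toy scores and the equal weights are assumptions made for illustration; the slides do not say how the weights λ were set.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse tf-idf dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf; the exact tf-idf variant is an assumption.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def log_linear(scores_per_method, weights):
    """scores_per_method[m][c] is the score of candidate c under method m.
    Returns the weighted sum of exp(score), normalized over all candidates."""
    candidates = list(scores_per_method[0])
    raw = {
        c: sum(lam * math.exp(s[c]) for lam, s in zip(weights, scores_per_method))
        for c in candidates
    }
    z = sum(raw.values())
    return {c: r / z for c, r in raw.items()}


# Toy scores for two candidate patents under two similarity methods.
combined = log_linear(
    [{"p1": 0.2, "p2": 0.1}, {"p1": 0.5, "p2": 0.4}],
    weights=[0.5, 0.5],
)
```

Because of the normalization, the combined scores sum to one over the candidates, so they can be compared across methods with different score ranges.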

Ranking
• 1. Original KNN ranking method
– Score each IPC code by the number of its occurrences in the extracted top-k documents
• 2. Naïve method
– The order of IPC codes follows the order of their first occurrences in the extracted top-k documents
• 3. Sum/SumAver
– The score is calculated by summing up the similarities of all the extracted documents containing the given IPC code
– For SumAver, we average the similarity for each sample
• 4. Listweak/ListweakAver
– To emphasize the patents ranked in the front part of the list, a new factor is introduced
• 5. Weak/WeakAver
– A drawback of KNN is that the prediction for the input document tends to be dominated by the classes with more frequent examples, due to the class imbalance problem
– Punish the classes which contain more training samples
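Method 3 (Sum/SumAver) can be sketched as below on a toy top-5 list. This is an illustration under assumptions: `sum_scores` and `rank` are hypothetical names, and SumAver is read here as dividing each code's summed similarity by the number of extracted documents that contain it, which is one plausible reading of "average the similarity for each sample".

```python
from collections import defaultdict

# Toy top-5 retrieval result: (patent id, IPC codes, similarity).
top_k = [
    ("p02", ["IPC1", "IPC2"], 0.21),
    ("p03", ["IPC3", "IPC4"], 0.11),
    ("p04", ["IPC2"], 0.09),
    ("p05", ["IPC2"], 0.09),
    ("p01", ["IPC1"], 0.07),
]


def sum_scores(results, average=False):
    """Sum (or average) the similarities of the documents containing each code."""
    total = defaultdict(float)
    count = defaultdict(int)
    for _pid, codes, sim in results:
        for code in codes:
            total[code] += sim
            count[code] += 1
    if average:
        return {c: total[c] / count[c] for c in total}
    return dict(total)


def rank(scores):
    """Order IPC codes by descending score (ties keep first-seen order)."""
    return sorted(scores, key=scores.get, reverse=True)


sum_ranking = rank(sum_scores(top_k))
sum_aver_ranking = rank(sum_scores(top_k, average=True))
```

On this list, Sum favors IPC2, which appears in three documents, while SumAver favors IPC1, whose documents are more similar on average.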

Ranking – method 1
• 1. Original KNN ranking method:
– Score each IPC code by the number of its occurrence in the extracted top-k documents

Suppose that we obtain the following list (top-5) after similarity calculation:

Rank   Patent(id)   IPC          sim
1      p02          IPC1, IPC2   0.21
2      p03          IPC3, IPC4   0.11
3      p04          IPC2         0.09
4      p05          IPC2         0.09
5      p01          IPC1         0.07

IPC list after ranking:

IPC    score
IPC2   3      (occurred 3 times)
IPC1   2
IPC3   1
IPC4   1
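The occurrence counting in this example can be reproduced with a short sketch. The function name `knn_count_ranking` and the tuple layout are assumptions for illustration; the data is the top-5 list from the slide.

```python
from collections import Counter

# Top-5 list from the slide: (patent id, IPC codes, similarity).
top_k = [
    ("p02", ["IPC1", "IPC2"], 0.21),
    ("p03", ["IPC3", "IPC4"], 0.11),
    ("p04", ["IPC2"], 0.09),
    ("p05", ["IPC2"], 0.09),
    ("p01", ["IPC1"], 0.07),
]


def knn_count_ranking(results):
    """Score each IPC code by how many of the top-k documents carry it."""
    counts = Counter(code for _pid, codes, _sim in results for code in codes)
    # most_common() sorts by count; equal counts keep first-seen order.
    return counts.most_common()


count_ranking = knn_count_ranking(top_k)
```

Note that the similarity values are ignored entirely here; only membership in the top-k list matters.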

Ranking – method 2
• 2. Naïve method
– the order of IPC codes follows the order of their first occurrences in the extracted top-k documents

Suppose that we obtain the following list (top-5) after similarity calculation:

Rank   Patent(id)   IPC          sim
1      p02          IPC1, IPC2   0.21
2      p03          IPC3, IPC4   0.11
3      p04          IPC2         0.09
4      p05          IPC2         0.09
5      p01          IPC1         0.07

IPC list after ranking:

IPC    score
IPC1   0.21   (first occurrence)
IPC2   0.21   (second occurrence)
IPC3   0.11
IPC4   0.11
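A sketch of the naïve method on the same list: walk the ranked documents in order and keep each IPC code the first time it appears, along with the similarity of that document. The name `naive_ranking` is hypothetical, and it is assumed that codes within one patent are taken in the order they are listed.

```python
# Top-5 list from the slide: (patent id, IPC codes, similarity).
top_k = [
    ("p02", ["IPC1", "IPC2"], 0.21),
    ("p03", ["IPC3", "IPC4"], 0.11),
    ("p04", ["IPC2"], 0.09),
    ("p05", ["IPC2"], 0.09),
    ("p01", ["IPC1"], 0.07),
]


def naive_ranking(results):
    """Order IPC codes by first occurrence in the similarity-ranked list."""
    first_seen = {}  # dicts preserve insertion order in Python 3.7+
    for _pid, codes, sim in results:
        for code in codes:
            if code not in first_seen:
                first_seen[code] = sim
    return list(first_seen.items())


naive_result = naive_ranking(top_k)
```

Later occurrences of a code (for example IPC2 in p04 and p05) do not change its position or score.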
