CS344: Introduction to Artificial Intelligence - PowerPoint PPT Presentation

CS344: Introduction to Artificial Intelligence
Vishal Vachhani, M.Tech, CSE
Lecture 34-35: CLIR and Ranking in IR

Road Map
• Cross Lingual IR
  – Motivation
  – CLIA architecture
  – CLIA demo
• Ranking
  – Various ranking methods
  – Nutch/Lucene ranking
  – Learning a ranking function
  – Experiments and results

Cross Lingual IR
• Motivation
  – Information unavailability in some languages
  – Language barrier
• Definition
  – Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (Wikipedia).
• Example
  – A user may ask a query in Hindi but retrieve relevant documents written in English.

Why CLIR?
[Figure labels: Query in Tamil; System search; English Document; Marathi Document; Snippet Generation and Document Translation]

Cross Lingual Information Access
• Cross Lingual Information Access (CLIA)
  – A web portal supporting monolingual and cross-lingual IR in 6 Indian languages and English
  – Domain: Tourism
  – It supports:
    – Summarization of web documents
    – Snippet translation into the query language
    – Template-based information extraction
• The CLIA system is publicly available at
  – http://www.clia.iitb.ac.in/clia-beta-ext

CLIA Demo

Various Ranking Methods
• Vector space model
  – Lucene, Nutch, Lemur, etc.
• Probabilistic ranking model
  – Classical Spärck Jones ranking (log-odds ratio)
• Language model
• Ranking using machine learning algorithms
  – SVM, learning to rank, SVM-MAP, etc.
• Link analysis based ranking
  – PageRank, Hubs and Authorities, OPIC, etc.

Nutch Ranking
• CLIA is built on top of Nutch, an open-source web search engine.
• It is based on the vector space model.
[Equation: scoring formula not recoverable from the slide text]
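
As a rough illustration of vector-space scoring, the sketch below follows the classic TF-IDF similarity documented for Lucene, the library Nutch builds on: score(q, d) = coord(q, d) · queryNorm(q) · Σ tf(t, d) · idf(t)² · norm(d). Whether this is exactly the formula shown on the slide is an assumption. The function name `classic_tfidf_score` and the toy data are hypothetical, query terms are assumed to be distinct, and boosts are left out.

```python
import math

def classic_tfidf_score(query_terms, doc_terms, doc_freq, num_docs):
    """Hypothetical helper: score one document in the style of Lucene's classic TF-IDF similarity."""
    def idf(term):
        # Classic default: 1 + ln(numDocs / (docFreq + 1))
        return 1.0 + math.log(num_docs / (doc_freq.get(term, 0) + 1))

    # queryNorm only rescales scores for a given query; it does not change the ranking.
    sum_sq = sum(idf(t) ** 2 for t in query_terms)
    query_norm = 1.0 / math.sqrt(sum_sq) if sum_sq > 0 else 0.0

    # Length normalisation: shorter fields get a higher weight.
    length_norm = 1.0 / math.sqrt(len(doc_terms)) if doc_terms else 0.0

    matched = [t for t in query_terms if t in doc_terms]
    # coord rewards documents that match more of the query terms.
    coord = len(matched) / len(query_terms) if query_terms else 0.0

    total = 0.0
    for t in matched:
        tf = math.sqrt(doc_terms.count(t))  # classic default: sqrt of raw frequency
        total += tf * idf(t) ** 2 * length_norm
    return coord * query_norm * total

# Toy collection (made-up data, for illustration only).
doc_freq = {"goa": 1, "beach": 1, "hotel": 2}
d1 = ["goa", "beach", "hotel"]
d2 = ["delhi", "hotel", "hotel", "metro"]
s1 = classic_tfidf_score(["goa", "beach"], d1, doc_freq, num_docs=2)
s2 = classic_tfidf_score(["goa", "beach"], d2, doc_freq, num_docs=2)
```

Here `d1` matches both query terms and gets a positive score, while `d2` matches none and scores zero.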

Link analysis
• Calculates the importance of pages using the web graph
  – Nodes: pages
  – Edges: hyperlinks between pages
• Motivation: a link analysis based score is hard to manipulate using spamming techniques
• Plays an important role in the web IR scoring function
  – PageRank
  – Hubs and Authorities
  – Online Page Importance Computation (OPIC)
• The link analysis score is used along with the tf-idf based score
• We use the OPIC score as a factor in CLIA.
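
The cash-and-history idea behind OPIC can be sketched in a few lines. This is a simplified illustration, not the CLIA or Nutch implementation: the function name `opic_scores`, the round-robin visiting order, the fixed number of passes, and the handling of pages without out-links are all assumptions made for the example, and every link target is assumed to appear as a key of the graph.

```python
def opic_scores(out_links, passes=200):
    """Hypothetical helper: estimate page importance with a simplified OPIC-style loop."""
    pages = list(out_links)
    cash = {p: 1.0 / len(pages) for p in pages}  # total cash in the system is 1
    history = {p: 0.0 for p in pages}            # cash each page has collected so far

    for _ in range(passes):
        for page in pages:  # assumption: visit pages in round-robin order
            amount = cash[page]
            history[page] += amount
            cash[page] = 0.0
            # Assumption: a page with no out-links spreads its cash over all pages.
            targets = out_links[page] or pages
            share = amount / len(targets)
            for target in targets:
                cash[target] += share

    total = sum(history.values()) + 1.0  # the 1.0 is the cash still in circulation
    return {p: (history[p] + cash[p]) / total for p in pages}

# Toy web graph (made-up): a links to b and c, b links to c, c links to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = opic_scores(graph)
```

On this toy graph, page b receives only half of a's cash on each pass and ends up with the lowest score, while a and c end up with roughly equal scores.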

Learning a ranking function
• How much weight should be given to the different parts of a web document while ranking the documents?
• A ranking function can be learned using the following method
  – Machine learning algorithms: SVM, Max-entropy
  – Training
    – A set of queries, with some relevant and non-relevant documents for each query
    – A set of features to capture the similarity of documents and query
    – In short, learn the optimal weight of each feature
  – Ranking
    – Use the trained model to generate a score, by combining the different feature scores, for each document in which the query words appear
    – Sort the documents by score and display them to the user
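
The training and ranking steps can be sketched with a small linear model. The slide names SVM and Max-entropy as the learners; the example below swaps in a plain pairwise perceptron to keep the code dependency-free, so it illustrates the workflow rather than the method used in the lecture. The function names, the two features (title match, body match), and all numbers are made up for the example.

```python
def dot(weights, features):
    return sum(w * x for w, x in zip(weights, features))

def train_pairwise(training, n_features, epochs=10, lr=1.0):
    """Hypothetical helper: learn feature weights so relevant docs outscore non-relevant ones."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for docs in training.values():
            relevant = [x for x, label in docs if label == 1]
            non_relevant = [x for x, label in docs if label == 0]
            for r in relevant:
                for n in non_relevant:
                    diff = [a - b for a, b in zip(r, n)]
                    # Update only when the pair is ordered wrongly (or tied).
                    if dot(weights, diff) <= 0:
                        weights = [w + lr * d for w, d in zip(weights, diff)]
    return weights

def rank(weights, candidates):
    """Sort candidate document ids by model score, best first."""
    return sorted(candidates, key=lambda doc_id: dot(weights, candidates[doc_id]), reverse=True)

# Toy training set: query -> list of (feature vector, relevance label).
# Features are [title match, body match]; values are invented.
training = {
    "q1": [([1.0, 0.2], 1), ([0.0, 0.9], 0)],
    "q2": [([0.8, 0.1], 1), ([0.1, 0.5], 0)],
}
weights = train_pairwise(training, n_features=2)
ranking = rank(weights, {"d1": [0.9, 0.1], "d2": [0.1, 0.8]})
```

With this toy data the learned weight for the title feature ends up larger than the weight for the body feature, so `d1` is ranked ahead of `d2`.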

Extended Features for Web IR
1. Content based features
  – Tf, IDF, length, co-ord, etc.
2. Link analysis based features
  – OPIC score
  – Domain based OPIC score
3. Standard IR algorithm based features
  – BM25 score
  – Lucene score
  – LM based score
4. Language categories based features
  – Named Entity
  – Phrase based features
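
Of the standard IR features listed above, BM25 is compact enough to sketch. The slides do not give the formula, so this is the usual Okapi BM25 form with commonly used defaults (k1 = 1.2, b = 0.75) and a non-negative IDF variant; these choices, the function name `bm25_score`, and the toy data are assumptions.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Hypothetical helper: Okapi BM25 score of one document for one query."""
    score = 0.0
    for term in set(query_terms):
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = doc_freq.get(term, 0)
        # Non-negative IDF variant (assumption): ln(1 + (N - df + 0.5) / (df + 0.5))
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        # Term-frequency saturation with document-length normalisation.
        denom = tf + k1 * (1.0 - b + b * len(doc_terms) / avg_doc_len)
        score += idf * tf * (k1 + 1.0) / denom
    return score

# Toy data (made-up): two documents of equal length.
doc_freq = {"goa": 10, "hotel": 50}
d1 = ["goa", "beach", "goa", "hotel"]
d2 = ["goa", "beach", "city", "hotel"]
b1 = bm25_score(["goa"], d1, doc_freq, num_docs=100, avg_doc_len=4)
b2 = bm25_score(["goa"], d2, doc_freq, num_docs=100, avg_doc_len=4)
b3 = bm25_score(["temple"], d1, doc_freq, num_docs=100, avg_doc_len=4)
```

Because `d1` contains the query term twice and `d2` once at the same length, `d1` scores higher; a query term absent from the document contributes nothing.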

Content based Features

Feature | Formulation | Descriptions
C1 | $\sum_{q_i \in q \cap d} c(q_i, d)$ | Term frequency (tf)
C2 | $\sum_{q_i \in q \cap d} \log\left(c(q_i, d) + 1\right)$ | SIGIR feature
C3 | $\sum_{q_i \in q \cap d} \frac{c(q_i, d)}{|d|}$ | Normalized tf
C4 | $\sum_{q_i \in q \cap d} \log\left(1 + \frac{c(q_i, d)}{|d|}\right)$ | SIGIR feature
C5 | $\sum_{q_i \in q \cap d} \log\left(\frac{|C|}{df(q_i)}\right)$ | Inverse doc frequency (IDF)
C6 | $\sum_{q_i \in q \cap d} \log\left(\log\left(\frac{|C|}{df(q_i)}\right)\right)$ | SIGIR feature
C7 | $\sum_{q_i \in q \cap d} \log\left(1 + \frac{|C|}{c(q_i, C)}\right)$ | SIGIR feature
C8 | $\sum_{q_i \in q \cap d} c(q_i, d) \cdot \log\left(\frac{|C|}{df(q_i)}\right)$ | Tf*IDF
C9 | $\sum_{q_i \in q \cap d} \log\left(1 + \frac{c(q_i, d)}{|d|} \cdot \log\left(\frac{|C|}{df(q_i)}\right)\right)$ | SIGIR feature
C10 | $\sum_{q_i \in q \cap d} \log\left(1 + \frac{c(q_i, d)}{|d|} \cdot \frac{|C|}{c(q_i, C)}\right)$ | SIGIR feature
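
Four of the named features in this table (term frequency, normalized tf, IDF, and tf*IDF) can be computed as in the sketch below. It assumes each feature is a sum over the query terms that occur in the document, that document length is measured in tokens, and that IDF is the natural log of collection size over document frequency; the function name `content_features` and the toy numbers are hypothetical.

```python
import math

def content_features(query_terms, doc_terms, doc_freq, num_docs):
    """Hypothetical helper: tf, normalized tf, IDF and tf*IDF for one query/document pair."""
    # Only query terms that actually occur in the document contribute.
    matched = [t for t in set(query_terms) if t in doc_terms]
    counts = {t: doc_terms.count(t) for t in matched}
    # Assumption: every matched term has a document frequency of at least 1.
    idf = {t: math.log(num_docs / doc_freq[t]) for t in matched}
    return {
        "tf": sum(counts.values()),
        "normalized_tf": sum(c / len(doc_terms) for c in counts.values()),
        "idf": sum(idf.values()),
        "tf_idf": sum(counts[t] * idf[t] for t in matched),
    }

# Toy data (made-up): "temple" is in the query but not in the document.
doc = ["goa", "beach", "goa", "hotel"]
query = ["goa", "hotel", "temple"]
doc_freq = {"goa": 10, "hotel": 50, "temple": 5}
features = content_features(query, doc, doc_freq, num_docs=100)
```

In the feature list, the same ten formulas are applied separately to the title, body, URL, and anchor text, so `doc_terms` would be the tokens of one such field at a time.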

Details of features

Feature No | Descriptions
1 | Length of body
2 | Length of title
3 | Length of URL
4 | Length of anchor
5-14 | C1-C10 for title of the page
15-24 | C1-C10 for body of the page
25-34 | C1-C10 for URL of the page
35-44 | C1-C10 for anchor of the page
45 | OPIC score
46 | Domain based classification score

Details of features (Cont.)

Feature No | Descriptions
48 | BM25 score
49 | Lucene score
50 | Language modeling score
51-54 | Named entity weight for title, body, anchor, URL
55-58 | Multi-word weight for title, body, anchor, URL
59-62 | Phrasal score for title, body, anchor, URL
63-66 | Co-ord factor for title, body, anchor, URL
71 | Co-ord factor for H1 tag of web document

Experiments and results

Ranking method |  |  |  | MAP
Nutch Ranking | 0.2267 | 0.2267 | 0.2667 | 0.2137
DIR with Title + content | 0.6933 | 0.64 | 0.5911 | 0.3444
DIR with URL + content | 0.72 | 0.62 | 0.5333 | 0.3449
DIR with Title + URL + content | 0.72 | 0.6533 | 0.56 | 0.36
DIR with Title + URL + content + anchor | 0.73 | 0.66 | 0.58 | 0.3734
DIR with Title + URL + content + anchor + NE feature | 0.76 | 0.63 | 0.6 | 0.4
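
For reference, mean average precision (MAP) can be computed as in the sketch below. The slides do not say how the evaluation was run, so this is only the standard definition; the function names and the relevance lists are made up. When `num_relevant` is not given, the code assumes every relevant document appears somewhere in the ranked list.

```python
def average_precision(relevance, num_relevant=None):
    """Hypothetical helper: relevance is a ranked list of 0/1 judgements, best rank first."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    # Assumption: if num_relevant is omitted, all relevant docs are in the list.
    denominator = num_relevant if num_relevant is not None else hits
    return precision_sum / denominator if denominator else 0.0

def mean_average_precision(runs):
    """Average the per-query average precision over all queries."""
    return sum(average_precision(r) for r in runs) / len(runs) if runs else 0.0

# Toy judgements for two queries (invented).
ap_first = average_precision([1, 0, 1])
map_value = mean_average_precision([[1, 0, 1], [0, 1]])
```

For the first query the relevant documents sit at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2; the second query has its only relevant document at rank 2, giving 1/2.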

Thanks
