Focussed Web Crawling Using RL

Reinforcement Learning, Lecture 18a
Gillian Hayes, 7th March 2007

Focussed Web Crawling Using RL

• Searching web for pages relevant to a specific subject
• No organised directory of web pages

Web Crawling: start at one root page, follow links to other pages, follow their links to further pages, etc.

Focussed Web Crawling: specific topic. Find maximum set of relevant pages having traversed minimum number of irrelevant pages.

Why try this?
• Less bandwidth, storage, time (exhaustive search can take weeks: billions of web pages)
• Good for dynamic content: can do frequent updates
• Can get indexing for a particular topic

Alexandros Grigoriadis, MSc AI, Edinburgh 2003 + CROSSMARC project: extracting multilingual info from web on specific domains, e.g. laptop retail info, job adverts on companies' web pages

Web Crawler

[Diagram: crawler loop connecting www, Retrieve base set, Evaluate pages, Good pages, Extract links, Evaluate links, RL link scorer, Link queue]

• Link Queue: current set of links that have to be visited. Fetch link with highest score on queue
• Evaluate page this link points to: based on set of text/content attributes. If relevant, store on Good Pages
• Get links from page
• Evaluate links, add to link queue. Does the link point to a relevant page? Will it lead to relevant pages in future?
• Where can we use RL? In the link scorer
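The crawler loop above (fetch the highest-scoring link, evaluate the page, extract its links, score them, queue them) can be sketched as a best-first search over a priority queue. This is only an illustrative sketch, not the CROSSMARC crawler: the in-memory `TOY_WEB` graph stands in for the web, and `focused_crawl`, `toy_is_relevant` and `toy_score` are hypothetical names. The hand-written `toy_score` marks the slot where a learned RL link scorer would go.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple


def focused_crawl(
    graph: Dict[str, List[str]],
    root: str,
    is_relevant: Callable[[str], bool],
    score_link: Callable[[str], float],
    max_fetches: int,
) -> List[str]:
    """Best-first crawl: always fetch the highest-scoring link on the queue."""
    good_pages: List[str] = []
    seen: Set[str] = {root}
    # heapq is a min-heap, so scores are negated to pop the highest score first.
    queue: List[Tuple[float, str]] = [(-score_link(root), root)]
    fetches = 0
    while queue and fetches < max_fetches:
        _, url = heapq.heappop(queue)
        fetches += 1
        if is_relevant(url):              # "Evaluate page"
            good_pages.append(url)        # "store on Good Pages"
        for link in graph.get(url, []):   # "Get links from page"
            if link not in seen:
                seen.add(link)
                # "Evaluate links, add to link queue"
                heapq.heappush(queue, (-score_link(link), link))
    return good_pages


# Hypothetical toy site: page name -> outgoing links (an assumption for illustration).
TOY_WEB: Dict[str, List[str]] = {
    "root": ["about", "careers"],
    "about": ["history", "contact"],
    "careers": ["job1", "job2"],
    "history": [], "contact": [], "job1": [], "job2": [],
}


def toy_is_relevant(url: str) -> bool:
    return url.startswith("job")


def toy_score(url: str) -> float:
    # Stand-in for the RL link scorer: a fixed keyword heuristic.
    return 1.0 if ("job" in url or "career" in url) else 0.0


if __name__ == "__main__":
    print(focused_crawl(TOY_WEB, "root", toy_is_relevant, toy_score, max_fetches=4))
```

With a budget of four fetches, the scored crawl reaches both job pages, whereas a scorer that returns the same value for every link spends the budget on irrelevant pages. That difference is the motivation for learning the scorer.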

RL Crawling

• Reward when it finds relevant pages
• Needs to recognise important attributes and follow most promising links first
• Aim is to get π*
• How to formulate problem? What are states? What are actions?

Alternatives:
• State = a link, Action = {follow, don't follow}
• State = web page, Action = links
• Learn V? Must do local search to get policy
• Learn Q? More training examples needed since Q(s,a). But faster to use

Choice: Action = links and learn V using TD(λ)

How to Characterise a State?

• Use text analyser to come up with keywords for domain: these words typically appear on web pages on this subject area
• Feature vector of 500 binary attributes: existence or not of a keyword
• State space: 2^500 states ~ 10^150, too large for a table
• Use a neural network for function approximation to give V(s)
• Learn weights of network using temporal difference learning
• Eligibility trace on weights instead of states
• Reward is 1/0 if page is/is not relevant

State Values V

[Diagram: Tabular: state s → V table → V(s). Function approximation: state s → feature vector encoding f(s) → V network → V(f(s))]

Learning Procedure

• Use a number of training sets of web pages, e.g. different companies' web sites containing numbers of pages with job adverts, and start with a random policy
• Learn V^π, need to do GPI to get V*
• Then incorporate into a regular crawler: the RL neural net evaluates each page, and the V value is its score
• Which link to choose? Must do one-step lookahead: follow all links in current page, evaluate the pages they lead to
• Place new pages on link queue according to score
• Follow link at front of link queue to next page with highest likely relevance
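The state encoding, the TD(λ) weight update and the one-step lookahead can be sketched together as below. Several things here are assumptions made for brevity and do not come from the lecture: a linear approximator replaces the neural network (so the gradient of V with respect to the weights is just the feature vector), three made-up keywords replace the 500, the last page of an episode is treated as terminal, and the step size, discount and λ are arbitrary. All function names are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical keywords; the lecture uses about 500 chosen by a text analyser.
KEYWORDS = ["vacancy", "salary", "apply"]


def features(text: str, keywords: Sequence[str] = KEYWORDS) -> List[float]:
    """Binary feature vector: 1.0 if the keyword occurs in the page text."""
    words = set(text.lower().split())
    return [1.0 if k in words else 0.0 for k in keywords]


def value(weights: Sequence[float], f: Sequence[float]) -> float:
    """Linear stand-in for the V network: V(s) = w . f(s)."""
    return sum(w * x for w, x in zip(weights, f))


def td_lambda_episode(
    weights: Sequence[float],
    pages: Sequence[str],
    rewards: Sequence[float],
    alpha: float = 0.1,
    gamma: float = 0.9,
    lam: float = 0.8,
) -> List[float]:
    """TD(lambda) along one crawl path, with the eligibility trace on the weights.

    rewards[t] is the reward for moving from pages[t] to pages[t + 1]
    (1 if the page reached is relevant, 0 otherwise).
    """
    w = list(weights)
    trace = [0.0] * len(w)
    for t in range(len(pages) - 1):
        f_now = features(pages[t])
        reached_last = t + 1 == len(pages) - 1
        # Assumption: the final page is terminal, so its value is 0.
        v_next = 0.0 if reached_last else value(w, features(pages[t + 1]))
        delta = rewards[t] + gamma * v_next - value(w, f_now)
        # Accumulating trace: decay, then add the gradient of V (the features).
        trace = [gamma * lam * e + x for e, x in zip(trace, f_now)]
        w = [wi + alpha * delta * e for wi, e in zip(w, trace)]
    return w


def rank_links(weights: Sequence[float], linked_pages: Dict[str, str]) -> List[str]:
    """One-step lookahead: order links by the V estimate of the page each leads to."""
    return sorted(
        linked_pages,
        key=lambda url: value(weights, features(linked_pages[url])),
        reverse=True,
    )


if __name__ == "__main__":
    w = td_lambda_episode([0.0, 0.0, 0.0],
                          ["careers apply now", "vacancy salary details"],
                          [1.0])
    print(w)
    print(rank_links(w, {"a": "home news", "b": "apply here"}))
```

Keeping the trace on the weights means its size is the number of weights, not the number of states, which is what makes the method feasible when the state space is around 2^500.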

Performance

Finds relevant pages (if > 1) following fewer links than the CROSSMARC non-RL web crawler, but searches more pages in the 1-step lookahead. Not so good at finding a single relevant page on a site.

• Datasets: up to 2000 pages, 16000 links, tiny number of relevant pages in each dataset, English and Greek, 1000 training episodes

Issues

Depends on: graphical structure of pages

• Features chosen: many attributes were 0, so not discriminating enough
• Need to try on bigger datasets
• Paper outlines alternative learning procedures

Andrew McCallum's CORA: searching computer science research papers

• Treated roughly as a bandit problem learning Q(a). Action a = link on a web page and words in its neighbourhood
• Choose the link expected to give highest future discounted reward
• 53,000 documents, half a million links, 3x increase in efficiency (no. links followed before 75% of docs found vs. breadth-first search)

References

Alexandros Grigoriadis, Georgios Paliouras: Focused crawling using temporal difference-learning. Proceedings of the Panhellenic Conference in Artificial Intelligence (SETN), Lecture Notes in Artificial Intelligence 3025, 142–153, Springer-Verlag, 2004.

Andrew McCallum et al.: Building domain-specific search engines with ML techniques. Proc AAAI-99 Spring Symposium on Intelligent Agents in Cyberspace.
