Servers + Crawlers: Connecting on the WWW
  1. Outline
     • HTTP
     • Crawling the Web
     • Server Architecture

     Servers + Crawlers: Connecting on the WWW

     What happens when you click?
     • Suppose
       – You are at www.yahoo.com/index.html
       – You click on www.grippy.org/mattmarg/
     • Browser uses DNS to resolve www.grippy.org to an IP address
     • Opens a TCP connection to that address
     • Sends an HTTP request (a hand-rolled sketch follows this block):

         GET /mattmarg/ HTTP/1.0                        (request line)
         User-Agent: Mozilla/2.0 (Macintosh; I; PPC)    (request headers)
         Accept: text/html; */*
         Cookie: name = value
         Referer: http://www.yahoo.com/index.html
         Host: www.grippy.org
         Expires: …
         If-Modified-Since: ...

     HTTP Response
     • Example:

         HTTP/1.0 200 OK                                (status line)
         Date: Mon, 10 Feb 1997 23:48:22 GMT
         Server: Apache/1.1.1 HotWired/1.0
         Content-Type: text/html   (or image/jpeg, ...)
         Last-Modified: Tues, 11 Feb 1999 22:45:55 GMT

     Response Status Lines
     • 1xx Informational
     • 2xx Success
       – 200 OK
     • 3xx Redirection
       – 302 Moved Temporarily
     • 4xx Client Error
       – 404 Not Found
     • 5xx Server Error

     • One click => several responses (the page plus its embedded elements)
     • HTTP/1.0: new TCP connection for each element/page
     • HTTP/1.1: Keep-Alive – several requests per connection
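     To make the request/response exchange above concrete, here is a minimal
     sketch in Python that opens a TCP connection and issues an HTTP/1.0 GET
     by hand, roughly the steps the browser takes on the slide. The host and
     path simply mirror the slide's example; whether that server still answers
     is not guaranteed, and a real client would use an HTTP library instead.

         import socket

         # Hand-rolled HTTP/1.0 GET, mirroring the browser steps on the slide:
         # DNS lookup, TCP connect, send request line + headers, read response.
         def http_get(host: str, path: str) -> bytes:
             request = (
                 f"GET {path} HTTP/1.0\r\n"
                 f"Host: {host}\r\n"
                 "User-Agent: toy-client/0.1\r\n"
                 "Accept: text/html, */*\r\n"
                 "\r\n"
             ).encode("ascii")

             with socket.create_connection((host, 80), timeout=10) as sock:
                 sock.sendall(request)
                 chunks = []
                 while True:              # HTTP/1.0: server closes when done
                     data = sock.recv(4096)
                     if not data:
                         break
                     chunks.append(data)
             return b"".join(chunks)

         if __name__ == "__main__":
             raw = http_get("www.grippy.org", "/mattmarg/")
             headers, _, body = raw.partition(b"\r\n\r\n")
             print(headers.decode("iso-8859-1"))   # status line + headers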

  2. HTTP Methods
     • GET
       – Bring back a page
     • HEAD
       – Like GET, but return only the headers
     • POST
       – Used to send data to the server to be processed (e.g. CGI)
       – Different from GET:
         • A block of data is sent with the request, in the body, usually with
           extra headers like Content-Type: and Content-Length:
         • The request URL is not a resource to retrieve; it's a program to
           handle the data being sent
         • The HTTP response is normally program output, not a static file
     • PUT, DELETE, ...

     Logging Web Activity
     • Most servers support the "common logfile format" or "extended logfile
       format":

         127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

     • Apache lets you customize the format
     • Every HTTP event is recorded:
       – Page requested
       – Remote host
       – Browser type
       – Referring page
       – Time of day
     • Applications of data-mining logfiles?
       (a parsing sketch follows this block)

     HTTPS
     • Secure connections
     • Encryption: SSL/TLS
     • Fairly straightforward:
       – Agree on a crypto protocol
       – Exchange keys
       – Create a shared key
       – Use the shared key to encrypt data
     • Certificates

     Cookies (a small sketch follows this block)
     • Small piece of info
       – Sent by the server as part of the response header
       – Stored on disk by the browser; returned in the request header
       – May have an expiration date (then deleted from disk)
     • Associated with a specific domain & directory
       – Only returned to the site where originally set
       – Many sites have multiple cookies; some have multiple cookies per page!
     • Most data stored as name=value pairs
     • See, e.g.:
       – C:\Program Files\Netscape\Users\default\cookies.txt
       – C:\WINDOWS\Cookies

     Standard Web Search Engine Architecture
     • Crawlers crawl the web
     • Store documents, check for duplicates, extract links (DocIds)
     • Create an inverted index
     • A user query goes to the search engine servers, which consult the
       inverted index and show results to the user
     [Slide adapted from Marti Hearst / UC Berkeley]
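     Following up the "Logging Web Activity" slide, the sketch below parses
     common-logfile-format lines like the example shown. The regular expression
     and field names are my own choices; extended formats add fields such as
     referrer and user-agent.

         import re

         # Common logfile format:
         # host ident authuser [date] "request" status bytes
         LOG_RE = re.compile(
             r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
             r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
             r'(?P<status>\d{3}) (?P<size>\d+|-)'
         )

         line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'

         m = LOG_RE.match(line)
         if m:
             fields = m.groupdict()
             method, path, version = fields["request"].split()
             print(fields["host"], method, path, fields["status"], fields["size"])
             # -> 127.0.0.1 GET /apache_pb.gif 200 2326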

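     Relating to the Cookies slide: a small sketch of how a server can build a
     Set-Cookie response header with Python's standard http.cookies module. The
     cookie name, domain, path, and lifetime below are illustrative only.

         from http.cookies import SimpleCookie

         # Build a Set-Cookie response header: a name=value pair plus attributes
         # that scope the cookie to a domain/path and give it a lifetime.
         cookie = SimpleCookie()
         cookie["session_id"] = "abc123"                 # name = value (illustrative)
         cookie["session_id"]["domain"] = "www.example.com"
         cookie["session_id"]["path"] = "/shop/"
         cookie["session_id"]["max-age"] = 3600          # browser drops it after 1 hour

         print(cookie.output())
         # e.g. Set-Cookie: session_id=abc123; Domain=www.example.com; Max-Age=3600; Path=/shop/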
  3. Your Project Architecture?
     • Same pipeline as the standard architecture: crawl the web, store
       documents, check for duplicates, extract links (DocIds)
     • Possible additions: Classify? Extract? A standard crawler, a relational
       DB, and a front end serving the user query and showing results
     [Slide adapted from Marti Hearst / UC Berkeley]

     Open-Source Crawlers
     • GNU Wget
       – Utility for downloading files from the Web
       – Fine if you just need to fetch files from 2-3 sites
     • Heritrix
       – Open-source, extensible, Web-scale crawler
       – Easy to get running
       – Web-based UI
     • Nutch
       – Featureful, industrial-strength Web search package
       – Includes the Lucene information retrieval part
         • TF/IDF and other document ranking
         • Optimized, inverted-index data store
       – You get complete control through easy programming

     How Inverted Files Are Created  (a sketch follows this block)
     • The crawler writes pages into a repository
     • Scan the repository to build a forward index (pointers to docs, NF)
       and a lexicon
     • Sort the forward index
     • Scan the sorted index to produce the inverted file list

     Thinking about Efficiency
     • Clock cycle: 2 GHz
       – Typically completes 2 instructions / cycle
       – ~10 cycles / instruction, but pipelining & parallel execution
       – Thus: ~4 billion instructions / sec
     • Disk access: 1-10 ms
       – Depends on seek distance; published average is 5 ms
       – Thus: ~200 seeks / sec
       – (And we are ignoring rotation and transfer times)
     • Disk is ~20 million times slower!  (a quick check follows this block)
     • 300 million documents; 300 GB RAM, terabytes of disk
       – Store the index in an Oracle database?
       – Store the index using files and the Unix filesystem?

     Search Engine Architecture
     • Crawler (spider)
       – Searches the web to find pages; follows hyperlinks; never stops
     • Indexer
       – Produces data structures for fast searching of all words in the pages
     • Retriever
       – Query interface
       – Database lookup to find hits
       – Ranking, summaries
     • Front end
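     A minimal sketch of the forward-index → sort → inverted-file pipeline from
     the "How Inverted Files Are Created" slide, using a tiny in-memory toy
     repository. The documents and whitespace tokenizer are invented for
     illustration; a real build is disk-based and batched.

         from collections import defaultdict

         # Toy repository: docid -> text (stand-in for the crawled page store).
         repository = {
             1: "the web is big",
             2: "crawl the web politely",
             3: "index the crawl",
         }

         # 1) Scan the repository, emitting (term, docid) postings (forward pass).
         postings = []
         for docid, text in repository.items():
             for term in text.split():
                 postings.append((term, docid))

         # 2) Sort by term (then docid), as the slide's Sort stage does.
         postings.sort()

         # 3) Scan the sorted run, collapsing into term -> sorted list of docids.
         inverted = defaultdict(list)
         for term, docid in postings:
             if not inverted[term] or inverted[term][-1] != docid:
                 inverted[term].append(docid)

         print(dict(inverted))
         # {'big': [1], 'crawl': [2, 3], 'index': [3], 'is': [1],
         #  'politely': [2], 'the': [1, 2, 3], 'web': [1, 2]}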

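     The "disk is ~20 million times slower" figure on the efficiency slide
     follows directly from the numbers given there; a quick check:

         # Numbers from the slide: 2 GHz clock, ~2 instructions per cycle,
         # and ~200 disk seeks per second (5 ms average seek).
         instructions_per_sec = 2e9 * 2        # ~4 billion instructions / sec
         seeks_per_sec = 1 / 0.005             # ~200 seeks / sec

         print(instructions_per_sec / seeks_per_sec)   # 20000000.0  (~20 million)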
  4. Spiders (Crawlers, Bots)
     • Queue := initial page URL0
     • Do forever:
       – Dequeue a URL
       – Fetch P
       – Parse P for more URLs; add them to the queue
       – Pass P to a (specialized?) indexing program
       (a sketch of this loop appears after this block)
     • Issues…
       – Which page to look at next? (keywords, recency, focus, ???)
       – Avoid overloading a site
       – How deep within a site to go?
       – How frequently to visit pages?
       – Traps!

     Spiders = Crawlers
     • 1000s of spiders
     • Various purposes:
       – Search engines
       – Digital rights management
       – Advertising
       – Spam
       – Link checking – site validation

     Crawling Issues
     • Storage efficiency
     • Search strategy
       – Where to start
       – Link ordering
       – Circularities
       – Duplicates
       – Checking for changes
     • Politeness
       – Forbidden zones: robots.txt
       – CGI & scripts
       – Load on remote servers
       – Bandwidth (download only what you need)
     • Parsing pages for links
     • Scalability
     • Malicious servers: SEOs

     Robot Exclusion
     • A person may not want certain pages indexed
     • Crawlers should obey the Robot Exclusion Protocol – but some don't
     • Look for the file robots.txt at the highest directory level
       – If the domain is www.ecom.cmu.edu, robots.txt goes at
         www.ecom.cmu.edu/robots.txt
     • A specific document can be shielded from a crawler by adding the line:
       <META NAME="ROBOTS" CONTENT="NOINDEX">

     Robots Exclusion Protocol  (a checking sketch follows this block)
     • Format of robots.txt
       – Two fields: User-agent to specify a robot,
         Disallow to tell the agent what to ignore
     • To exclude all robots from a server:
         User-agent: *
         Disallow: /
     • To exclude one robot from two directories:
         User-agent: WebCrawler
         Disallow: /news/
         Disallow: /tmp/
     • View the robots.txt specification at
       http://info.webcrawler.com/mak/projects/robots/norobots.html

     Outgoing Links?
     • Parse HTML… looking for… what?
     [Figure: a fragment of messy markup ("A href = www.cs …", frame/font tags,
      malformed <li> tags) illustrating what the link extractor must wade through]
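     For the Robot Exclusion slides, Python's standard urllib.robotparser
     performs the robots.txt check a polite crawler makes before fetching. The
     rules here simply reproduce the slide's WebCrawler example; in a live
     crawler you would point the parser at a site's real robots.txt URL.

         from urllib import robotparser

         # Parse rules equivalent to the slide's example robots.txt.
         rp = robotparser.RobotFileParser()
         rp.parse([
             "User-agent: WebCrawler",
             "Disallow: /news/",
             "Disallow: /tmp/",
         ])

         print(rp.can_fetch("WebCrawler", "/news/today.html"))    # False
         print(rp.can_fetch("WebCrawler", "/index.html"))         # True
         print(rp.can_fetch("SomeOtherBot", "/news/today.html"))  # True (rule doesn't apply)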

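     And a sketch of the basic spider loop from the "Spiders (Crawlers, Bots)"
     slide above: a FIFO queue of URLs plus a URL-seen set, i.e. breadth-first
     crawling that avoids circularities. fetch_links is a hypothetical
     placeholder for the real fetch-and-parse step; politeness and indexing
     are elided.

         from collections import deque

         def crawl(seed_urls, fetch_links, max_pages=100):
             """Breadth-first spider skeleton: FIFO frontier plus a URL-seen set.

             fetch_links(url) is assumed to download the page and return the
             URLs it links to (fetching and parsing are elided here).
             """
             frontier = deque(seed_urls)       # Queue := initial page URLs
             seen = set(seed_urls)
             fetched = 0

             while frontier and fetched < max_pages:  # "do forever", bounded here
                 url = frontier.popleft()             # dequeue URL
                 outlinks = fetch_links(url)          # fetch P, parse P for more URLs
                 fetched += 1
                 for link in outlinks:
                     if link not in seen:             # URL-seen test avoids cycles
                         seen.add(link)
                         frontier.append(link)        # add them to the queue
             return seen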
  5. Web Crawling Strategy
     • Starting location(s)
     • Traversal order
       – Depth-first (LIFO)
       – Breadth-first (FIFO)
       – Or ???
     • Politeness
     • Cycles?
     • Coverage?

     Which Tags / Attributes Hold URLs?  (an extraction sketch follows this block)
     • Anchor tag: <a href="URL" …> … </a>
     • Option tag: <option value="URL" …> … </option>
     • Map: <area href="URL" …>
     • Frame: <frame src="URL" …>
     • Link to an image: <img src="URL" …>
     • Relative path vs. absolute path: <base href= …>
     • Bonus problem: Javascript
     • In our favor: Search Engine Optimization

     Structure of Mercator Spider
     • Components: URL frontier (priority queue), document fingerprints
     • Processing steps:
       1. Remove URL from queue
       2. Simulate network protocols & REP
       3. Read with RewindInputStream (RIS)
       4. Has document been seen before? (checksums and fingerprints)
       5. Extract links
       6. Download new URL?
       7. Has URL been seen before?
       8. Add URL to frontier

     URL Frontier (priority queue)
     • Most crawlers do breadth-first search from seeds
     • Politeness constraint: don't hammer servers!
       – Obvious implementation: "live host table"
       – Will it fit in memory? Is this efficient?
     • Mercator's politeness:
       – One FIFO subqueue per thread
       – Choose the subqueue by hashing the host's name
       – Dequeue the first URL whose host has NO outstanding requests

     Fetching Pages
     • Need to support http, ftp, gopher, .... – extensible!
     • Need to fetch multiple pages at once
     • Need to cache as much as possible
       – DNS
       – robots.txt
       – Documents themselves (for later processing)
     • Need to be defensive!
       – Time out HTTP connections
       – Watch for "crawler traps" (e.g., infinite URL names); see section 5 of
         the Mercator paper; use the URL filter module
       – Checkpointing!

     Duplicate Detection  (a fingerprinting sketch follows this block)
     • URL-seen test: has this URL been seen before?
       – To save space, store a hash
     • Content-seen test: different URL, same doc
       – Suppress link extraction from mirrored pages
     • What to save for each doc?
       – 64-bit "document fingerprint"
       – Minimize the number of disk reads upon retrieval
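     A sketch of link extraction for the tag/attribute list above, using
     Python's standard html.parser and urljoin to resolve relative paths
     against the page URL (or a <base href> override). The tag-to-attribute
     table comes from the slide; as the slide notes, JavaScript-generated
     links are out of reach for this kind of parser.

         from html.parser import HTMLParser
         from urllib.parse import urljoin

         # Tags/attributes that hold URLs, per the slide.
         URL_ATTRS = {"a": "href", "area": "href", "frame": "src",
                      "img": "src", "option": "value"}

         class LinkExtractor(HTMLParser):
             def __init__(self, page_url):
                 super().__init__()
                 self.base = page_url          # <base href=...> may override this
                 self.links = []

             def handle_starttag(self, tag, attrs):
                 attrs = dict(attrs)
                 if tag == "base" and attrs.get("href"):
                     self.base = attrs["href"]
                 elif tag in URL_ATTRS and attrs.get(URL_ATTRS[tag]):
                     # Resolve relative paths against the (possibly overridden) base.
                     self.links.append(urljoin(self.base, attrs[URL_ATTRS[tag]]))

         html = '<a href="/mattmarg/">link</a> <img src="logo.gif"> <frame src="nav.html">'
         parser = LinkExtractor("http://www.grippy.org/index.html")
         parser.feed(html)
         print(parser.links)
         # ['http://www.grippy.org/mattmarg/', 'http://www.grippy.org/logo.gif',
         #  'http://www.grippy.org/nav.html']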

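     Finally, a sketch of the two duplicate-detection tests on the last slide:
     a URL-seen set that stores hashes rather than full URLs, and a
     content-seen set keyed on a 64-bit document fingerprint. Mercator's actual
     fingerprinting scheme differs; a truncated SHA-256 stands in for it here.

         import hashlib

         def fp64(data: bytes) -> int:
             """64-bit fingerprint (truncated SHA-256 stands in for Mercator's scheme)."""
             return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

         seen_urls = set()      # URL-seen test: store hashes, not full URLs, to save space
         seen_content = set()   # content-seen test: same doc reachable via different URLs

         def is_new(url: str, body: bytes) -> bool:
             url_fp = fp64(url.encode("utf-8"))
             if url_fp in seen_urls:
                 return False                  # URL already crawled
             seen_urls.add(url_fp)

             doc_fp = fp64(body)
             if doc_fp in seen_content:
                 return False                  # mirror: suppress link extraction
             seen_content.add(doc_fp)
             return True

         print(is_new("http://a.example/x", b"<html>hello</html>"))   # True
         print(is_new("http://a.example/x", b"<html>hello</html>"))   # False (URL seen)
         print(is_new("http://b.example/x", b"<html>hello</html>"))   # False (content seen)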