1 Fetcher WebDB/Fetcher Updates Fetcher is very stupid. Not a - PDF document

All About Nutch
Michael J. Cafarella
CSE 454
April 14, 2005

Meta-details
- Built to encourage public search work
- Open-source, w/pluggable modules
- Cheap to run, both machines & admins
- Goal: Search more pages, with better quality, than any other engine
- Pretty good ranking
- Currently can do ~200M pages

Outline
- Nutch design
  - Link database, fetcher, indexer, etc…
- Supporting parts
  - Distributed filesystem, job control
- Nutch for your project

[Diagram: Nutch architecture. Inject feeds the WebDB; the WebDB emits Fetchlists 0..2 of N; Fetchers 0..2 of N produce Content and Updates 0..2 of N, with the Updates flowing back into the WebDB; Indexers 0..2 of N build Indexes 0..2 of N; Searchers 0..2 of N serve the indexes to WebServers 0..2 of M.]

Moving Parts
- Acquisition cycle
  - WebDB
  - Fetcher
- Index generation
  - Indexing
  - Link analysis (maybe)
- Serving results

WebDB
- Contains info on all pages, links
  - URL, last download, # failures, link score, content hash, ref counting
  - Source hash, target URL
- Must always be consistent
- Designed to minimize disk seeks
  - 19ms seek time x 200M new pages/mo = ~44 days of disk seeks!
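The disk-seek figure on the WebDB slide can be checked with a few lines of arithmetic. This is only a sketch: the function name and the assumption of exactly one random seek per page are illustrative and do not come from the slides.

```python
# Back-of-the-envelope check of the disk-seek estimate on the WebDB slide.
SEEK_TIME_S = 0.019                 # 19 ms per random seek (from the slide)
NEW_PAGES_PER_MONTH = 200_000_000   # 200M new pages per month (from the slide)
SECONDS_PER_DAY = 24 * 60 * 60


def seek_days(pages: int, seek_time_s: float = SEEK_TIME_S,
              seeks_per_page: int = 1) -> float:
    """Days spent seeking if every page update costs `seeks_per_page` seeks.

    One seek per page is an assumption made for this sketch.
    """
    return pages * seeks_per_page * seek_time_s / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{seek_days(NEW_PAGES_PER_MONTH):.1f} days of pure seek time")
```

Under these assumptions a month of new pages costs roughly 44 days of seeking, which is why the WebDB is updated by sequential passes rather than by random access.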

Fetcher
- Fetcher is very stupid. Not a “crawler”
- Divide “to-fetch list” into k pieces, one for each fetcher machine
- URLs for one domain go to same list, otherwise random
  - “Politeness” w/o inter-fetcher protocols
  - Can observe robots.txt similarly
  - Better DNS, robots caching
  - Easy parallelism
- Two outputs: pages, WebDB edits

WebDB/Fetcher Updates
[Diagram: WebDB records before and after the fetcher edits are applied]

WebDB (before)
  URL: http://www.cs.washington.edu/index.html   LastUpdated: 3/22/05   ContentHash: MD5_sdflkjweroiwelksd
  URL: http://www.cnn.com/index.html   LastUpdated: Never   ContentHash: None
  URL: http://www.yahoo/index.html   LastUpdated: 4/07/05   ContentHash: MD5_toewkekqmekkalekaa

Fetcher edits
  Edit: DOWNLOAD_CONTENT   URL: http://www.yahoo/index.html   ContentHash: MD5_toewkekqmekkalekaa
  Edit: DOWNLOAD_CONTENT   URL: http://www.cnn.com/index.html   ContentHash: MD5_balboglerropewolefbag
  Edit: NEW_LINK   URL: http://www.flickr.com/index.html   ContentHash: None

WebDB (after)
  URL: http://www.cnn.com/index.html   LastUpdated: Today!   ContentHash: MD5_balboglerropewolefbag
  URL: http://www.flickr/com/index.html   LastUpdated: Never   ContentHash: None
  URL: http://www.yahoo.com/index.html   LastUpdated: Today!   ContentHash: MD5_toewkekqmekkalekaa

1. Write down fetcher edits
2. Sort edits (externally, if necessary)
3. Read streams in parallel, emitting new database
4. Repeat for other tables

Indexing
- Iterate through all k page sets in parallel, constructing inverted index
- Creates a “searchable document” of:
  - URL text
  - Content text
  - Incoming anchor text
- Other content types might have different document fields
  - E.g., email has sender/receiver
  - Any searchable field end-user will want
- Uses Lucene text indexer

Link analysis
- A page’s relevance depends on both intrinsic and extrinsic factors
  - Intrinsic: page title, URL, text
  - Extrinsic: anchor text, link graph
- PageRank is most famous of many
- Others include:
  - HITS
  - Simple incoming link count
- Link analysis is sexy, but importance generally overstated

Query Processing
[Diagram: the query “britney” is sent to five index partitions (Docs 0-1M, Docs 1-2M, Docs 2-3M, Docs 3-4M, Docs 4-5M). Each partition returns its own matches (for example Ds 1, 29; Ds 1.2M, 1.7M; Ds 2.3M, 2.9M; Ds 4.4M, 4.5M), and the results are merged into one ranked list: 1.2M, 4.4M, 29, …]

Link analysis (2)
- Nutch performs analysis in WebDB
  - Emit a score for each known page
  - At index time, incorporate score into inverted index
- Extremely time-consuming
  - In our case, disk-consuming, too (because we want to use low-memory machines)
- 0.5 * log(# incoming links)
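The four update steps from the WebDB/Fetcher Updates slide (write down edits, sort them, read the streams in parallel, emit a new database) amount to a sort-merge pass. The sketch below is one possible reading, not Nutch's actual code: the `PageRecord` and `Edit` classes, the `apply_edits` function, and the in-memory `sorted()` call that stands in for an external sort are all assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass, replace
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class PageRecord:            # hypothetical WebDB row
    url: str
    last_updated: Optional[str] = None    # None stands for "Never"
    content_hash: Optional[str] = None


@dataclass(frozen=True)
class Edit:                  # hypothetical fetcher edit
    url: str
    kind: str                # "DOWNLOAD_CONTENT" or "NEW_LINK"
    content_hash: Optional[str] = None


def apply_edits(db: Iterable[PageRecord], edits: Iterable[Edit],
                today: str) -> Iterator[PageRecord]:
    """Merge a URL-sorted database with URL-sorted edits in one sequential pass."""
    tagged_db = ((r.url, 0, r) for r in db)        # tag 0: existing record
    tagged_edits = ((e.url, 1, e) for e in edits)  # tag 1: edit, sorts after record
    merged = heapq.merge(tagged_db, tagged_edits, key=lambda t: (t[0], t[1]))
    current: Optional[PageRecord] = None
    for url, is_edit, item in merged:
        if current is not None and current.url != url:
            yield current                          # finished with the previous URL
            current = None
        if not is_edit:
            current = item
        elif item.kind == "DOWNLOAD_CONTENT":
            base = current if current is not None else PageRecord(url)
            current = replace(base, last_updated=today,
                              content_hash=item.content_hash)
        elif item.kind == "NEW_LINK" and current is None:
            current = PageRecord(url)              # newly discovered, never fetched
    if current is not None:
        yield current


if __name__ == "__main__":
    db = sorted([
        PageRecord("http://www.cs.washington.edu/index.html", "3/22/05",
                   "MD5_sdflkjweroiwelksd"),
        PageRecord("http://www.cnn.com/index.html"),
        PageRecord("http://www.yahoo.com/index.html", "4/07/05",
                   "MD5_toewkekqmekkalekaa"),
    ], key=lambda r: r.url)
    edits = sorted([                               # stands in for an external sort
        Edit("http://www.yahoo.com/index.html", "DOWNLOAD_CONTENT",
             "MD5_toewkekqmekkalekaa"),
        Edit("http://www.cnn.com/index.html", "DOWNLOAD_CONTENT",
             "MD5_balboglerropewolefbag"),
        Edit("http://www.flickr.com/index.html", "NEW_LINK"),
    ], key=lambda e: e.url)
    for record in apply_edits(db, edits, today="Today!"):
        print(record)
```

Because both inputs are sorted by URL, the old database is read once from start to end and the new one is written once, so the pass needs no random seeks.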

Administering Nutch
- Admin costs are critical
  - It’s a hassle when you have 25 machines
  - Google has maybe > 100k
- Files
  - WebDB content, working files
  - Fetchlists, fetched pages
  - Link analysis outputs, working files
  - Inverted indices
- Jobs
  - Emit fetchlists, fetch, update WebDB
  - Run link analysis
  - Build inverted indices

Administering Nutch (2)
- Admin sounds boring, but it’s not!
  - Really
  - I swear
- Large-file maintenance
  - Google File System (Ghemawat, Gobioff, Leung)
  - Nutch Distributed File System
- Job Control
  - Map/Reduce (Dean and Ghemawat)

Nutch Distributed File System
- Similar, but not identical, to GFS
- Requirements are fairly strange
  - Extremely large files
  - Most files read once, from start to end
  - Low admin costs per GB
- Equally strange design
  - Write-once, with delete
  - Single file can exist across many machines
  - Wholly automatic failure recovery

NDFS (2)
- Data divided into blocks
- Blocks can be copied, replicated
- Datanodes hold and serve blocks
- Namenode holds metainfo
  - Filename -> block list
  - Block -> datanode-location
- Datanodes report in to namenode every few seconds

NDFS File Read
[Diagram: a Namenode and Datanodes 0-5. The namenode's entry for “crawl.txt” reads: (block-33 / datanodes 1, 4) (block-95 / datanodes 0, 2) (block-65 / datanodes 1, 4, 5)]
1. Client asks namenode for filename info
2. Namenode responds with blocklist, and location(s) for each block
3. Client fetches each block, in sequence, from a datanode

NDFS Replication
[Diagram: a Namenode and six datanodes holding these blocks: Datanode 0 (33, 95), Datanode 1 (46, 95), Datanode 2 (33, 104), Datanode 3 (21, 33, 46), Datanode 4 (90), Datanode 5 (21, 90, 104). The namenode issues the order “Blk 90 to dn 0”.]
1. Always keep at least k copies of each blk
2. Imagine datanode 4 dies; blk 90 lost
3. Namenode loses heartbeat, decrements blk 90’s reference count. Asks datanode 5 to replicate blk 90 to datanode 0
4. Choosing replication target is tricky
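The replication steps above can be sketched as a small planning function on the namenode side. This is a toy model, not NDFS code: `plan_replication` is a hypothetical name, and the target-selection rule (least-loaded live datanode, ties broken by lowest id) is an assumption chosen only because it is simple; the slide itself notes that choosing the target is tricky.

```python
from typing import Dict, List, Set, Tuple


def plan_replication(block_locations: Dict[int, Set[int]],
                     live_datanodes: Set[int],
                     k: int) -> List[Tuple[int, int, int]]:
    """Return (block, source_datanode, target_datanode) copy orders.

    Any block with fewer than k copies on live datanodes is re-replicated.
    The target rule (least-loaded live node, lowest id on ties) is an
    assumption made for this sketch.
    """
    # Forget copies held by datanodes that stopped sending heartbeats.
    locations = {blk: holders & live_datanodes
                 for blk, holders in block_locations.items()}
    load = {dn: 0 for dn in live_datanodes}
    for holders in locations.values():
        for dn in holders:
            load[dn] += 1

    orders: List[Tuple[int, int, int]] = []
    for blk in sorted(locations):
        holders = locations[blk]
        if not holders:
            continue                      # every copy is gone; nothing to copy from
        while len(holders) < k:
            candidates = [dn for dn in live_datanodes if dn not in holders]
            if not candidates:
                break                     # not enough live datanodes for k copies
            target = min(candidates, key=lambda dn: (load[dn], dn))
            orders.append((blk, min(holders), target))
            holders = holders | {target}
            load[target] += 1
        locations[blk] = holders
    return orders


if __name__ == "__main__":
    # Block layout taken from the NDFS Replication slide.
    blocks = {33: {0, 2, 3}, 95: {0, 1}, 46: {1, 3},
              104: {2, 5}, 21: {3, 5}, 90: {4, 5}}
    live = {0, 1, 2, 3, 5}                # datanode 4 has died
    print(plan_replication(blocks, live, k=2))
```

With the slide's block layout, k = 2, and datanode 4 dead, this rule happens to produce the same order as the slide: copy block 90 from datanode 5 to datanode 0.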

Map/Reduce
- Map/Reduce is programming model from Lisp (and other places)
  - Easy to distribute across nodes
  - Nice retry/failure semantics
- map(key, val) is run on each item in set
  - emits key/val pairs
- reduce(key, vals) is run for each unique key emitted by map()
  - emits final output
- Many problems can be phrased this way

Map/Reduce (2)
- Task: count words in docs
  - Input consists of (url, contents) pairs
  - map(key=url, val=contents):
    - For each word w in contents, emit (w, “1”)
  - reduce(key=word, values=uniq_counts):
    - Sum all “1”s in values list
    - Emit result “(word, sum)”

Map/Reduce (3)
- Task: grep
  - Input consists of (url+offset, single line)
  - map(key=url+offset, val=line):
    - If contents matches regexp, emit (line, “1”)
  - reduce(key=line, values=uniq_counts):
    - Don’t do anything; just emit line
- We can also do graph inversion, link analysis, WebDB updates, etc

Map/Reduce (4)
- How is this distributed?
  1. Partition input key/value pairs into chunks, run map() tasks in parallel
  2. After all map()s are complete, consolidate all emitted values for each unique emitted key
  3. Now partition space of output map keys, and run reduce() in parallel
- If map() or reduce() fails, reexecute!

Map/Reduce Job Processing
[Diagram: a JobTracker and TaskTrackers 0-5; the client submits a “grep” job to the JobTracker]
1. Client submits “grep” job, indicating code and input files
2. JobTracker breaks input file into k chunks (in this case 6). Assigns work to ttrackers.
3. After map(), tasktrackers exchange map-output to build reduce() keyspace
4. JobTracker breaks reduce() keyspace into m chunks (in this case 6). Assigns work.
5. reduce() output may go to NDFS

Searching webcams
- Index size will be small
- Need all the hints you can get
  - Page text, anchor text
  - URL sources like Yahoo or DMOZ entries
- Webcam-only content types
  - Avoid processing images at query time
  - Take a look at Nutch pluggable content types (current examples include PDF, MS Word, etc.). Might work.
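The word-count and grep tasks from the Map/Reduce slides can be run in a single process to see the model at work. The sketch below is a toy driver, not the Nutch or Google implementation: `run_map_reduce` and the other function names are hypothetical, the example URLs and log lines are made up, and there is no partitioning across machines or retry on failure.

```python
import re
from collections import defaultdict
from typing import Callable, Iterable, Iterator, List, Tuple


def map_word_count(url: str, contents: str) -> Iterator[Tuple[str, str]]:
    """map(key=url, val=contents): emit (w, "1") for each word w."""
    for word in contents.split():
        yield (word, "1")


def reduce_word_count(word: str, values: List[str]) -> Iterator[Tuple[str, int]]:
    """reduce(key=word, values): sum the "1"s and emit (word, sum)."""
    yield (word, sum(int(v) for v in values))


def make_map_grep(pattern: str) -> Callable:
    """Build a grep map(): emit (line, "1") when the line matches the regexp."""
    regex = re.compile(pattern)

    def map_grep(key, line: str) -> Iterator[Tuple[str, str]]:
        if regex.search(line):
            yield (line, "1")
    return map_grep


def reduce_grep(line: str, values: List[str]) -> Iterator[str]:
    """reduce(key=line, values): do nothing except emit the line."""
    yield line


def run_map_reduce(inputs: Iterable[tuple], map_fn: Callable,
                   reduce_fn: Callable) -> list:
    """Single-process stand-in for the three distributed phases."""
    grouped = defaultdict(list)
    for key, val in inputs:                       # 1. run map() on every input pair
        for out_key, out_val in map_fn(key, val):
            grouped[out_key].append(out_val)      # 2. consolidate values per key
    results = []
    for key in sorted(grouped):                   # 3. run reduce() once per unique key
        results.extend(reduce_fn(key, grouped[key]))
    return results


if __name__ == "__main__":
    docs = [("http://example.com/a", "to be or not to be"),
            ("http://example.com/b", "to do")]
    print(run_map_reduce(docs, map_word_count, reduce_word_count))

    lines = [(("log", 0), "fetch ok"),
             (("log", 9), "fetch failed"),
             (("log", 22), "index failed")]
    print(run_map_reduce(lines, make_map_grep("failed"), reduce_grep))
```

In a real deployment, the grouping step is where map output is exchanged between machines, and each of the two loops would be split into chunks that can be re-executed independently if a task fails.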

Searching webcams (2)
- Annotate Lucene document with new fields
  - “Image qualities” might contain “indoors” or “daylight” or “flesh tones”
  - Parse text for city names to fill “location” field
  - Multiple downloads to compute “latitude” field
  - Others?
- Will require new search procedure, too

Conclusion
- http://www.nutch.org/
  - Partial documentation
  - Source code
  - Developer discussion board
- “Lucene in Action” by Hatcher, Gospodnetic (you can borrow mine)
- Questions?
