[537] Search Engines, Tyler Harter, 12/10/14



slide-1
SLIDE 1

[537] Search Engines

Tyler Harter 12/10/14

slide-2
SLIDE 2

Flash Review

slide-3
SLIDE 3

Flash Hierarchy

Plane: 1024 to 4096 blocks
  • planes accessed in parallel

Block: 64 to 256 pages
  • unit of erase

Page: 2 to 8 KB
  • unit of read and program
slide-4
SLIDE 4

Block

1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111

slide-5
SLIDE 5

Block

1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111

one block (the entire row)
slide-6
SLIDE 6

Block

1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111

one page (a single 4-bit group)
slide-7
SLIDE 7

Block

1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111

slide-8
SLIDE 8

Block

1111 1111 1111 1111 1111 1111 1001 1111 1111 1111 1111 1111 1111 1111 1111 1111 program

slide-9
SLIDE 9

Block

1111 1111 1111 1111 1111 1111 1001 1111 1111 1111 1111 1111 1111 1111 1111 1111

slide-10
SLIDE 10

Block

1111 1111 1111 1111 1111 1111 1001 1100 1111 1111 1111 1111 1111 1111 1111 1111 program

slide-11
SLIDE 11

Block

1111 1111 1111 1111 1111 1111 1001 1100 1111 1111 1111 1111 1111 1111 1111 1111

slide-12
SLIDE 12

Block

1111 1111 1111 1111 1111 1111 1001 1100 1111 1111 1111 1111 1110 0001 1111 1111 program

slide-13
SLIDE 13

Block

1111 1111 1111 1111 1111 1111 1001 1100 1111 1111 1111 1111 1110 0001 1111 1111

slide-14
SLIDE 14

Block

1111 1111 1111 1111 1111 1111 1001 1100 1111 1111 1111 1111 1110 0001 1111 1111 erase

slide-15
SLIDE 15

Block

1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 erase

slide-16
SLIDE 16

Block

1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111 1111
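
The program/erase sequence on the preceding slides can be sketched in code. This is a toy model, not a real device interface: the class name `FlashBlockSketch`, the 16 pages of 4 bits, and the rule that a page must be erased before it is programmed are assumptions chosen to match the pictures.

```java
import java.util.Arrays;

// Toy model of one flash block: 16 pages of 4 bits each, as drawn on the slides.
class FlashBlockSketch {
    static final int PAGES = 16;
    static final int ERASED = 0b1111;            // an erased page reads as all 1s
    private final int[] pages = new int[PAGES];

    FlashBlockSketch() { erase(); }

    // Erase is block-granular: every page goes back to 1111.
    void erase() { Arrays.fill(pages, ERASED); }

    // Program is page-granular. Assumption: a page must be erased before it is programmed.
    void program(int page, int bits) {
        if (pages[page] != ERASED)
            throw new IllegalStateException("page " + page + " is not erased");
        pages[page] = bits & ERASED;
    }

    int read(int page) { return pages[page]; }

    public static void main(String[] args) {
        FlashBlockSketch block = new FlashBlockSketch();
        block.program(6, 0b1001);                // same writes as the slide sequence
        block.program(7, 0b1100);
        block.program(12, 0b1110);
        block.program(13, 0b0001);
        System.out.println(Integer.toBinaryString(block.read(6)));
        block.erase();                           // the only way to get the 1s back
        System.out.println(Integer.toBinaryString(block.read(6)));
    }
}
```

The point of the model: changing a programmed page in place is rejected, so an update needs either a whole-block erase or a write somewhere else.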

slide-17
SLIDE 17

Traditional File Systems

File System Storage Device

Traditional API:

  • read sector
  • write sector

not same as flash

slide-18
SLIDE 18

Flash Translation Layer

00 01

block 0

00 10 00 11 00 00 10 01

block 1

11 11 11 11 11 11

1 2 3 4 5 6 7 physical: logical:

slide-19
SLIDE 19

Flash Translation Layer

00 01

block 0

00 10 00 11 00 00 10 01

block 1

11 11 11 11 11 11

1 2 3 4 5 6 7 physical: logical: write 1101

slide-20
SLIDE 20

Flash Translation Layer

00 01

block 0

00 10 00 11 00 00 10 01

block 1

11 01 11 11 11 11

1 2 3 4 5 6 7 physical: logical: write 1101

slide-21
SLIDE 21

Flash Translation Layer

00 01

block 0

00 10 00 11 00 00 10 01

block 1

11 01 11 11 11 11

1 2 3 4 5 6 7 physical: logical: write 1101

slide-22
SLIDE 22

Flash Translation Layer

00 01

block 0

00 10 00 11 00 00 10 01

block 1

11 01 11 11 11 11

1 2 3 4 5 6 7 physical: logical:

slide-23
SLIDE 23

Flash Translation Layer

00 01

block 0

00 10 00 11 00 00 10 01

block 1

11 01 11 11 11 11

1 2 3 4 5 6 7 physical: logical:

must eventually be garbage collected
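
A minimal sketch of the remapping idea, assuming a log-style FTL that always writes to the next erased physical page. The class `FtlSketch` and its methods are hypothetical names; real FTLs also handle erase blocks, wear leveling, and garbage collection itself, all omitted here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Toy flash translation layer: a logical page is remapped on every write.
class FtlSketch {
    private final int[] flash;                   // physical pages; -1 means erased
    private final boolean[] stale;               // old copies waiting for garbage collection
    private final Map<Integer, Integer> mapping = new HashMap<>(); // logical -> physical
    private int nextFree = 0;                    // next erased physical page

    FtlSketch(int physicalPages) {
        flash = new int[physicalPages];
        stale = new boolean[physicalPages];
        Arrays.fill(flash, -1);
    }

    void write(int logical, int data) {
        if (nextFree == flash.length)
            throw new IllegalStateException("no erased pages left; garbage collection needed");
        Integer old = mapping.put(logical, nextFree);
        if (old != null) stale[old] = true;      // old copy cannot be overwritten in place
        flash[nextFree++] = data;
    }

    int read(int logical) {
        Integer physical = mapping.get(logical);
        if (physical == null)
            throw new IllegalArgumentException("logical page " + logical + " was never written");
        return flash[physical];
    }

    int physicalOf(int logical) { return mapping.get(logical); }

    int stalePages() {
        int count = 0;
        for (boolean s : stale) if (s) count++;
        return count;
    }

    public static void main(String[] args) {
        FtlSketch ftl = new FtlSketch(8);
        ftl.write(3, 0b0011);
        ftl.write(3, 0b1101);                    // rewrite goes to a new physical page
        System.out.println(ftl.read(3) + " at physical page " + ftl.physicalOf(3));
        System.out.println("stale pages: " + ftl.stalePages());
    }
}
```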

slide-24
SLIDE 24

MapReduce Review

slide-25
SLIDE 25

ZIP    Sale
53715  100
92245  10
53703  15
93422  45
99210  9
54622  20

slide-26
SLIDE 26

ZIP    Sale
53715  100
92245  10
53703  15
93422  45
99210  9
54622  20

mapper 1 input: (53715, 100), (92245, 10), (53703, 15)

mapper 2 input: (93422, 45), (99210, 9), (54622, 20)

slide-27
SLIDE 27

ZIP    Sale
53715  100
92245  10
53703  15
93422  45
99210  9
54622  20

mapper 1 input:  (53715, 100), (92245, 10), (53703, 15)
mapper 1 output: WI 100,15; CA 10

mapper 2 input:  (93422, 45), (99210, 9), (54622, 20)
mapper 2 output: CA 45,9; WI 20

slide-28
SLIDE 28

ZIP    Sale
53715  100
92245  10
53703  15
93422  45
99210  9
54622  20

mapper 1 input:  (53715, 100), (92245, 10), (53703, 15)
mapper 1 output: WI 100,15; CA 10

mapper 2 input:  (93422, 45), (99210, 9), (54622, 20)
mapper 2 output: CA 45,9; WI 20

reducer 1: Reduce WI
reducer 2: Reduce CA

slide-29
SLIDE 29

ZIP    Sale
53715  100
92245  10
53703  15
93422  45
99210  9
54622  20

mapper 1 input:  (53715, 100), (92245, 10), (53703, 15)
mapper 1 output: WI 100,15; CA 10

mapper 2 input:  (93422, 45), (99210, 9), (54622, 20)
mapper 2 output: CA 45,9; WI 20

reducer 1: Reduce WI => WI 135
reducer 2: Reduce CA => CA 64
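
The same job can be sketched in plain Java, without Hadoop. Everything here is illustrative: `SalesByStateSketch` is a made-up name, and `stateOf` is a hypothetical ZIP-to-state rule that only fits the six records on the slide.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SalesByStateSketch {
    // Hypothetical ZIP -> state rule; it matches the slide's grouping, not real ZIP codes.
    static String stateOf(String zip) { return zip.startsWith("5") ? "WI" : "CA"; }

    // Map phase: emit one (state, sale) pair per input record.
    static List<Map.Entry<String, Integer>> map(String[][] records) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String[] r : records)
            out.add(new AbstractMap.SimpleEntry<>(stateOf(r[0]), Integer.parseInt(r[1])));
        return out;
    }

    // Shuffle + reduce phase: group pairs by state and sum each group.
    static Map<String, Integer> reduce(List<List<Map.Entry<String, Integer>>> mapperOutputs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapperOutputs)
            for (Map.Entry<String, Integer> pair : output)
                totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        return totals;
    }

    static Map<String, Integer> run() {
        String[][] split1 = {{"53715", "100"}, {"92245", "10"}, {"53703", "15"}};  // mapper 1
        String[][] split2 = {{"93422", "45"}, {"99210", "9"}, {"54622", "20"}};    // mapper 2
        return reduce(Arrays.asList(map(split1), map(split2)));
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```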

slide-30
SLIDE 30

WordCount

public void map(LongWritable key, Text value) {
    String line = value.toString();
    StringTokenizer st = new StringTokenizer(line);
    while (st.hasMoreTokens())
        output.collect(st.nextToken(), 1);   // emit (word, 1)
}

public void reduce(Text key, Iterator<IntWritable> values) {
    int sum = 0;
    while (values.hasNext())
        sum += values.next().get();          // add up the 1s for this word
    output.collect(key, sum);                // emit (word, total)
}

slide-31
SLIDE 31

Search Engines

slide-32
SLIDE 32

Search Engine Goal

  • Users should be able to enter search phrases.
  • Want to return results that are:
      • high quality (how to judge?)
      • relevant
  • It’s ok to do a lot of processing offline, but searches must be fast!

slide-33
SLIDE 33

Internet Search Engine

slide-34
SLIDE 34

Searchers

Web Servers

Internet Search Engine

slide-35
SLIDE 35

Crawler Web Servers

Internet Search Engine

Webpages Searchers

slide-36
SLIDE 36

Crawler Web Servers Snapshot of Pages

Internet Search Engine

Webpages Searchers

slide-37
SLIDE 37

Crawler Web Servers Snapshot of Pages

Indexes Indexing

Internet Search Engine

Webpages Searchers

slide-38
SLIDE 38

Crawler Web Servers Snapshot of Pages

Relevance? Quality?

Indexing

Internet Search Engine

Webpages Searchers

slide-39
SLIDE 39

Crawler Web Servers Snapshot of Pages

Relevance? Quality? MapReduce Jobs

Internet Search Engine

Webpages Searchers

slide-40
SLIDE 40

Outline

  • Web Crawling
  • Indexing
  • PageRank
  • Inverted Indexes
  • Searching

[Diagram labels: Crawler, Web Servers, Snapshot of Pages, Relevance? Quality?, MapReduce Jobs, Internet Search Engine, Webpages, Searchers]

slide-42
SLIDE 42

Web Crawler

  • Maintain list of pages to crawl.
  • Grabbing/saving a copy removes work from list.
  • Fetched pages may have more links, leading to more work.
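
The work-list loop can be sketched as follows. This is an assumption-laden toy: `CrawlerSketch` is a made-up name, and the `web` map is a hypothetical in-memory stand-in for actually fetching pages and extracting their links.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlerSketch {
    // `web` maps a URL to the links found on that page (stand-in for an HTTP fetch).
    static List<String> crawl(String seed, Map<String, List<String>> web, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>();   // list of pages still to crawl
        Set<String> seen = new HashSet<>();            // never queue the same URL twice
        List<String> saved = new ArrayList<>();        // order in which copies were saved
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && saved.size() < maxPages) {
            String url = frontier.poll();              // saving a copy removes work...
            saved.add(url);
            for (String link : web.getOrDefault(url, Collections.emptyList()))
                if (seen.add(link)) frontier.add(link);   // ...but new links add work
        }
        return saved;
    }

    static List<String> demo() {
        Map<String, List<String>> web = new HashMap<>();
        web.put("a", Arrays.asList("b", "c"));
        web.put("b", Arrays.asList("a", "c"));
        web.put("c", Arrays.asList("d"));
        return crawl("a", web, 10);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The `maxPages` cap is one crude guard against a crawl that never runs out of work.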

slide-43
SLIDE 43

Fetching a Page

  1. Convert domain name to IP address.
  2. Fetch page from server at IP address.

  • High-performance crawlers maintain a very large DNS cache to minimize step 1.
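
A minimal sketch of such a cache, assuming the real lookup is passed in as a function. `DnsCacheSketch` and its `resolver` parameter are hypothetical; a production cache would also expire entries, which is omitted here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class DnsCacheSketch {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> resolver;   // hypothetical slow lookup (step 1)
    int lookups = 0;                                    // how often the slow path ran

    DnsCacheSketch(Function<String, String> resolver) { this.resolver = resolver; }

    String resolve(String domain) {
        String ip = cache.get(domain);
        if (ip == null) {                               // miss: do the real lookup once
            lookups++;
            ip = resolver.apply(domain);
            cache.put(domain, ip);
        }
        return ip;
    }

    public static void main(String[] args) {
        // 192.0.2.1 is a documentation-only address used as a placeholder answer.
        DnsCacheSketch dns = new DnsCacheSketch(domain -> "192.0.2.1");
        dns.resolve("www.example.com");
        dns.resolve("www.example.com");
        System.out.println("slow lookups: " + dns.lookups);
    }
}
```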

slide-44
SLIDE 44

Spider Traps

  • Server returns data so that page example.com/N has a link to example.com/(N+1).
  • From crawler’s perspective, web is infinite!
  • Prioritize via heuristics (avoid dynamic content) and quality rankings (later).

slide-45
SLIDE 45

robots.txt

robots.txt file can tell crawlers not to crawl. Example:

    User-agent: googlebot        # all Google services
    Disallow: /private/          # disallow this directory

    User-agent: googlebot-news   # only the news service
    Disallow: /                  # disallow everything

    User-agent: *                # any robot
    Disallow: /something/        # disallow this directory

  • Some web developers set up intentional spider traps to punish crawlers that ignore these.

example source: http://en.wikipedia.org/wiki/Robots_exclusion_standard
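
A crawler-side check against rules like these might look like the sketch below. It is deliberately simplified and is not a full implementation of the exclusion standard: it assumes one User-agent line per group, exact agent-name matching with a fallback to `*`, and plain prefix matching for Disallow (no Allow lines, no wildcards). The class name `RobotsSketch` is made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RobotsSketch {
    private final Map<String, List<String>> disallow = new HashMap<>();  // agent -> path prefixes

    RobotsSketch(String robotsTxt) {
        String agent = null;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.replaceAll("#.*", "").trim();   // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                agent = value.toLowerCase();
                disallow.putIfAbsent(agent, new ArrayList<>());
            } else if (field.equals("disallow") && agent != null && !value.isEmpty()) {
                disallow.get(agent).add(value);
            }
        }
    }

    boolean allowed(String agent, String path) {
        List<String> rules = disallow.get(agent.toLowerCase());
        if (rules == null) rules = disallow.getOrDefault("*", Collections.emptyList());
        for (String prefix : rules)
            if (path.startsWith(prefix)) return false;
        return true;
    }

    // The example from the slide.
    static final String EXAMPLE = String.join("\n",
        "User-agent: googlebot        # all Google services",
        "Disallow: /private/          # disallow this directory",
        "User-agent: googlebot-news   # only the news service",
        "Disallow: /                  # disallow everything",
        "User-agent: *                # any robot",
        "Disallow: /something/        # disallow this directory");

    public static void main(String[] args) {
        RobotsSketch robots = new RobotsSketch(EXAMPLE);
        System.out.println(robots.allowed("googlebot", "/private/notes.html"));
        System.out.println(robots.allowed("googlebot", "/index.html"));
    }
}
```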

slide-46
SLIDE 46

“Almost daily, we receive an email something like, ‘Wow, you looked at a lot of pages from my web site. How did you like it?’”

- Sergey Brin + Lawrence Page

Source: The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)

slide-47
SLIDE 47

Outline

  • Web Crawling
  • Indexing
  • PageRank
  • Inverted Indexes
  • Searching

[Diagram labels: Crawler, Web Servers, Snapshot of Pages, Relevance? Quality?, MapReduce Jobs, Internet Search Engine, Webpages, Searchers]

slide-48
SLIDE 48

Quality Problem

Web pages “proliferate free of quality control”.

  • Contrast with peer-reviewed academic papers.
  • Need to infer quality from the web graph.
slide-49
SLIDE 49

Quality Problem

Web pages “proliferate free of quality control”.

  • Contrast with peer-reviewed academic papers.
  • Need to infer quality from the web graph.
  • Give every page a single PageRank score representing quality.

slide-50
SLIDE 50

Strategy: Count Backlinks

Importance: A = 1 B = 4 C = 1 D = 0 E = 1 F = 1

A B C D E F

slide-51
SLIDE 51

Strategy: Count Backlinks

Importance: A = 1 B = 4 C = 1 D = 0 E = 1 F = 1

A B C D E F should A get 2 “votes”?

slide-52
SLIDE 52

Strategy: Count Backlinks

Importance: A = 1 B = 3.5 C = 0.5 D = 0 E = 0.5 F = 0.5

A B C D E F

0.5 0.5 0.5 0.5

slide-53
SLIDE 53

Strategy: Count Backlinks

Importance: A = 1 B = 3.5 C = 0.5 D = 0 E = 0.5 (from A’s vote) F = 0.5

A B C D E F

0.5 0.5 0.5 0.5

slide-54
SLIDE 54

Strategy: Count Backlinks

Importance: A = 1 B = 3.5 C = 0.5 (from B’s vote) D = 0 E = 0.5 (from A’s vote) F = 0.5

A B C D E F

0.5 0.5 0.5 0.5

slide-55
SLIDE 55

Strategy: Count Backlinks

Importance: A = 1 B = 3.5 C = 0.5 (from B’s vote) D = 0 E = 0.5 (from A’s vote) F = 0.5

A B C D E F

0.5 0.5 0.5 0.5

Why do A and B get same votes? B is more important.

slide-56
SLIDE 56

Circular Votes

  • Want: number of votes you get determines number of votes you give.
  • Problem: changing A’s votes changes B’s votes, which changes A’s votes…

slide-57
SLIDE 57

Circular Votes

  • Want: number of votes you get determines number of votes you give.
  • Problem: changing A’s votes changes B’s votes, which changes A’s votes…
  • Fortunately, if you just keep updating every PageRank, it eventually converges.

slide-58
SLIDE 58

Convergence Goal (Simplified)

Rank(x) = “sum of all votes for x”

“x” is a page, Rank(x) is its PageRank.

slide-59
SLIDE 59

Convergence Goal (Simplified)

$$\mathrm{Rank}(x) = \sum_{y \in \mathrm{LinksTo}(x)} \text{“y’s vote for x”}$$

LinksTo(x) is the set of all pages linking to x.

slide-60
SLIDE 60

Convergence Goal (Simplified)

$$\mathrm{Rank}(x) = \sum_{y \in \mathrm{LinksTo}(x)} \frac{\mathrm{Rank}(y)}{N_y}$$

$N_y$ is the number of links from y to other pages.

slide-61
SLIDE 61

Convergence Goal (Simplified)

$$\mathrm{Rank}(x) = c \sum_{y \in \mathrm{LinksTo}(x)} \frac{\mathrm{Rank}(y)}{N_y}$$

Normalize with “c” to get desired amount of “rank” in system.

slide-62
SLIDE 62

Convergence Goal (Simplified)

$$\mathrm{Rank}(x) = c \sum_{y \in \mathrm{LinksTo}(x)} \frac{\mathrm{Rank}(y)}{N_y}$$

Keep updating rank for every page until ranks stop changing much.

slide-63
SLIDE 63

Intuition: Random Surfer

Imagine!

  • 1. a bunch of web surfers start on various pages
  • 2. they randomly click links, forever
  • 3. you measure webpage visit frequency
slide-64
SLIDE 64

Intuition: Random Surfer

Imagine!

  • 1. a bunch of web surfers start on various pages
  • 2. they randomly click links, forever
  • 3. you measure webpage visit frequency
  • Visit frequency will be proportional to PageRank.
slide-65
SLIDE 65

Graph 1

A B C

slide-66
SLIDE 66

Graph 1

A B C 0.5

0.25 0.25

slide-67
SLIDE 67

Graph 1

A B C 0.5

0.25 0.25

Rank(B) = (0.25 / 1) + (0.25 / 1) = 0.5
Rank(A) = (0.5 / 2) = 0.25
Rank(C) = (0.5 / 2) = 0.25

$$\mathrm{Rank}(x) = c \sum_{y \in \mathrm{LinksTo}(x)} \frac{\mathrm{Rank}(y)}{N_y}$$

slide-68
SLIDE 68

Graph 2

A B C

Problem: random surfers without links die (and take the rank with them!).

slide-69
SLIDE 69

Graph 3

A B C D

Problem: ???

slide-70
SLIDE 70

Graph 3

A B C D

Problem: Surfers get stuck in C and D. C+D is called a rank “sink”. A and B get 0 rank.

slide-71
SLIDE 71

Problems

  • Problem A: dangling links
  • Problem B: rank sinks
  • Solution?
slide-72
SLIDE 72

Problems

  • Problem A: dangling links
  • Problem B: rank sinks
  • Solution?
  • Surfers should jump to new random page with some probability.
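
A minimal sketch of the random-jump fix, under stated assumptions: `d` is the probability of following a link (0.85 below is a commonly used value, not one given on the slides), jumps land uniformly on all pages, and a dangling page is treated as always jumping. The edges used for Graph 1 (A -> B, B -> A and C, C -> B) are inferred from the arithmetic on the Graph 1 slide.

```java
import java.util.Arrays;

class PageRankSketch {
    // links[y] lists the pages that page y links to.
    static double[] pageRank(int[][] links, double d, double threshold) {
        int n = links.length;
        double[] ranks = new double[n];
        Arrays.fill(ranks, 1.0 / n);                      // start with equal rank everywhere
        double change;
        do {
            double[] next = new double[n];
            Arrays.fill(next, (1 - d) / n);               // rank arriving via random jumps
            for (int y = 0; y < n; y++) {
                if (links[y].length == 0) {               // dangling page: surfer always jumps
                    for (int x = 0; x < n; x++) next[x] += d * ranks[y] / n;
                } else {                                  // otherwise split the vote over links
                    for (int x : links[y]) next[x] += d * ranks[y] / links[y].length;
                }
            }
            change = 0;
            for (int x = 0; x < n; x++) change += Math.abs(next[x] - ranks[x]);
            ranks = next;
        } while (change > threshold);                     // stop when ranks stop changing much
        return ranks;
    }

    // Graph 1 with pages A=0, B=1, C=2.
    static double[] graph1() {
        return pageRank(new int[][] {{1}, {0, 2}, {1}}, 0.85, 1e-9);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(graph1()));
    }
}
```

With the jump in place, the total rank stays at 1 on every iteration, so neither dangling pages nor sinks can drain it.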

slide-73
SLIDE 73

Computation

ranks = INIT_RANKS;  // rank for each page
do {
    new_ranks = compute_ranks(ranks, edges);
    change = compute_diff(new_ranks, ranks);
    ranks = new_ranks;
} while (change > threshold);

slide-74
SLIDE 74

Computation

ranks = INIT_RANKS;  // rank for each page
do {
    new_ranks = compute_ranks(ranks, edges);
    change = compute_diff(new_ranks, ranks);
    ranks = new_ranks;
} while (change > threshold);

Many MapReduce jobs can be used.

slide-76
SLIDE 76

Mappers Send Votes From Pages

public void map(…) {
    double rank = value.get();
    String linkstring = dataval.toString();
    output.collect(key, RETAINFAC);                    // rank every page keeps regardless of votes
    String[] links = linkstring.split(" ");
    double delta = rank * DAMPINGFAC / links.length;   // this page's vote, split evenly over its links
    for (String link : links)
        output.collect(link, delta);
}

Adapted from: https://code.google.com/p/i-mapreduce/wiki/PagerankExample

slide-77
SLIDE 77

Reducers Sum Votes for Each Page

public void reduce(…) {
    double rank = 0.0;
    while (values.hasNext())
        rank += values.next().get();                   // sum all votes sent to this page
    output.collect(key, new DoubleWritable(rank));
}

Adapted from: https://code.google.com/p/i-mapreduce/wiki/PagerankExample
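
To see what one such job computes, here is a plain-Java imitation of a single map, shuffle, and reduce pass (no Hadoop involved). The values chosen for `RETAINFAC` and `DAMPINGFAC` are assumptions, since the slides do not give them, and the three-page graph is the one inferred from the Graph 1 slide. The sketch also assumes every page has at least one outgoing link.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PageRankPassSketch {
    static final double RETAINFAC = 0.15;    // assumed value
    static final double DAMPINGFAC = 0.85;   // assumed value

    // One pass: links maps each page to the pages it links to.
    static Map<String, Double> pass(Map<String, Double> ranks, Map<String, String[]> links) {
        Map<String, List<Double>> shuffled = new TreeMap<>();
        // Map phase: each page keeps RETAINFAC and sends an equal share of its rank per link.
        for (Map.Entry<String, String[]> page : links.entrySet()) {
            emit(shuffled, page.getKey(), RETAINFAC);
            double delta = ranks.get(page.getKey()) * DAMPINGFAC / page.getValue().length;
            for (String link : page.getValue()) emit(shuffled, link, delta);
        }
        // Reduce phase: sum the values collected for each page.
        Map<String, Double> next = new TreeMap<>();
        for (Map.Entry<String, List<Double>> e : shuffled.entrySet()) {
            double rank = 0.0;
            for (double vote : e.getValue()) rank += vote;
            next.put(e.getKey(), rank);
        }
        return next;
    }

    // Stand-in for the shuffle: group emitted values by key.
    static void emit(Map<String, List<Double>> shuffled, String key, double value) {
        shuffled.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    static Map<String, Double> demo() {
        Map<String, String[]> links = new TreeMap<>();
        links.put("A", new String[] {"B"});
        links.put("B", new String[] {"A", "C"});
        links.put("C", new String[] {"B"});
        Map<String, Double> ranks = new TreeMap<>();
        for (String page : links.keySet()) ranks.put(page, 1.0);
        return pass(ranks, links);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Running `pass` repeatedly, feeding each output back in as `ranks`, corresponds to the do/while loop on the Computation slide.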

slide-78
SLIDE 78

Computation

ranks = INIT_RANKS;  // rank for each page
do {
    new_ranks = compute_ranks(ranks, edges);
    change = compute_diff(new_ranks, ranks);
    ranks = new_ranks;
} while (change > threshold);

What is “change” over time?

slide-79
SLIDE 79

The PageRank Citation Ranking: Bringing Order to the Web (http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf)

slide-80
SLIDE 80

Personalized Search

  • Quality is subjective, and different measures may be best for different people.
  • Currently, our random surfer occasionally jumps to a random page. PageRank reflects this.
  • Personalized strategy: bias random jumps towards pages relevant to type of user.

slide-81
SLIDE 81

“To test the utility of PageRank for search, we built a web search engine called Google”

- Larry Page et al.

The PageRank Citation Ranking: Bringing Order to the Web (http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf)

slide-82
SLIDE 82

Outline

  • Web Crawling
  • Indexing
  • PageRank
  • Inverted Indexes
  • Searching

[Diagram labels: Crawler, Web Servers, Snapshot of Pages, Relevance? Quality?, MapReduce Jobs, Internet Search Engine, Webpages, Searchers]

slide-83
SLIDE 83

Relevance Problem

  • A website may be important, but is it relevant to the user’s current query?
  • Infer relevance by page contents, such as:
      • html body
      • title
      • meta tags
      • headers
      • etc.
slide-84
SLIDE 84

Indexing

  • Strategy: indexing.
  • Generate files organized by topic, keyword, or some other criteria that organize documents.
  • For a given word, we want to be able to find all related documents.

slide-85
SLIDE 85

Representation

For fast processing, assign:

  • docID to each unique page
  • wordID to each unique word on the web

Lorem ipsum dolor sit amet, lorem soluta delicata no vim. Te vel facete ornatus, mei aeque maiestatis te.

http://www.example.com/…

slide-86
SLIDE 86

Representation

For fast processing, assign:

  • docID to each unique page
  • wordID to each unique word on the web

Lorem ipsum dolor sit amet, lorem soluta delicata no vim. Te vel facete ornatus, mei aeque maiestatis te.

http://www.example.com/…

5 922 2 66 42 5 15 79 1431 21 3 22 68 12 47 887 244 3

docID=1442
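
One way to hand out wordIDs, as a sketch. The assumptions: words are lowercased so "Lorem" and "lorem" share an ID (as on the slide, where both map to 5), and IDs are assigned sequentially in order of first appearance, so the numbers will differ from the slide's. `LexiconSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class LexiconSketch {
    private final Map<String, Integer> wordIds = new HashMap<>();

    // Return the existing ID for a word, or assign the next unused one.
    int idOf(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        Integer id = wordIds.get(key);
        if (id == null) {
            id = wordIds.size() + 1;
            wordIds.put(key, id);
        }
        return id;
    }

    // Turn a page's text into the list of wordIDs in reading order.
    int[] encode(String text) {
        List<Integer> ids = new ArrayList<>();
        for (String word : text.split("[^\\p{L}\\p{N}]+"))   // split on anything but letters/digits
            if (!word.isEmpty()) ids.add(idOf(word));
        return ids.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        LexiconSketch lexicon = new LexiconSketch();
        int[] ids = lexicon.encode("Lorem ipsum dolor sit amet, lorem soluta delicata no vim. "
                + "Te vel facete ornatus, mei aeque maiestatis te.");
        System.out.println(java.util.Arrays.toString(ids));
    }
}
```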

slide-87
SLIDE 87

Forward Index

docID=1442: 5 922 2 66 42 5 15 79 1431 21 3 22 68 12 47 887 244 3
docID=9977: 522 141 553 999 243 66 42 5 15 79 15 79 1431 21 3 22
…

forward index:

docID  wordID
1442   5
1442   922
1442   2
1442   66
1442   42
1442   5
…      …

slide-88
SLIDE 88

Inverted Index

forward index:

docID  wordID
1442   5
1442   922
1442   2
1442   66
1442   42
1442   5
…      …

slide-89
SLIDE 89

Inverted Index

forward index:

docID  wordID        docID  wordID
1442   5             1442   5
1442   922           1442   922
1442   2             1442   2
1442   66            1442   66
1442   42            1442   42
1442   5             1442   5
…      …             …      …

slide-90
SLIDE 90

Inverted Index

forward index:       swap columns:

docID  wordID        wordID  docID
1442   5             5       1442
1442   922           922     1442
1442   2             2       1442
1442   66            66      1442
1442   42            42      1442
1442   5             5       1442
…      …             …       …

slide-91
SLIDE 91

Inverted Index

forward index:       sort by wordID:

docID  wordID        wordID  docID
1442   5             1       244
1442   922           2       1442
1442   2             5       1442
1442   66            5       1442
1442   42            5       999
1442   5             6       133
…      …             …       …

slide-92
SLIDE 92

Inverted Index

forward index:       inverted index:

docID  wordID        wordID  docID
1442   5             1       244
1442   922           2       1442
1442   2             5       1442,1442,999
1442   66            6       133,411
1442   42            7       1442,133,999
1442   5             9       411,875
…      …             …       …
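
The swap, sort, and group steps can be sketched in a few lines of plain Java. The class name and the small forward index in `demo` are illustrative; the rows reuse docIDs and wordIDs that appear on the Forward Index slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

class InvertedIndexSketch {
    // Each forward-index row is {docID, wordID}.
    static SortedMap<Integer, List<Integer>> invert(int[][] forward) {
        SortedMap<Integer, List<Integer>> inverted = new TreeMap<>();   // keeps wordIDs sorted
        for (int[] row : forward) {
            int docId = row[0];
            int wordId = row[1];                                        // swap columns
            inverted.computeIfAbsent(wordId, k -> new ArrayList<>()).add(docId);  // group by wordID
        }
        return inverted;
    }

    static SortedMap<Integer, List<Integer>> demo() {
        int[][] forward = {
            {1442, 5}, {1442, 922}, {1442, 2}, {1442, 66}, {1442, 42}, {1442, 5},
            {9977, 522}, {9977, 66}, {9977, 5}
        };
        return invert(forward);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Like the slide's table, this keeps one docID entry per occurrence, so a document appears twice in a posting list if the word occurs in it twice.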

slide-93
SLIDE 93

Pages without Text

  • What if pages have no text?
  • When computing the inverted index for a page, include text of hyperlinks referring to that page.

slide-94
SLIDE 94

Extra Metadata

Extra information makes inverted index more useful. E.g., word position, text type, etc.

wordID  docID
1       244
2       1442
5       1442, 1442, 999
…       …

slide-95
SLIDE 95

Extra Metadata

Extra information makes inverted index more useful. E.g., word position, text type, etc.

wordID  (docID, position, text type)
1       (244,14,h1)
2       (1442,56,h4)
5       (1442,32,b), (1442,10,i), (999,80,h4)
…       …

slide-96
SLIDE 96

Computing Inverted Index with MapReduce

Mapper: read words from files
  • out key: word
  • out val: file name

Reducer: make list of file names
  • out key: word
  • out val: list of file names
slide-97
SLIDE 97

Inverted Index: Mapper

public void map(…) {
    FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
    String fileName = fileSplit.getPath().getName();
    StringTokenizer itr = new StringTokenizer(val);
    while (itr.hasMoreTokens())
        output.collect(itr.nextToken(), fileName);   // emit (word, file name)
}

Adapted from: https://developer.yahoo.com/hadoop/tutorial/module4.html#solution

slide-98
SLIDE 98

Inverted Index: Reducer

public void reduce(…) {
    StringBuilder toReturn = new StringBuilder();
    while (values.hasNext())
        toReturn.append(values.next().toString() + " ");   // space-separated file names
    output.collect(key, toReturn);                          // emit (word, list of file names)
}

Adapted from: https://developer.yahoo.com/hadoop/tutorial/module4.html#solution

slide-99
SLIDE 99

Outline

  • Web Crawling
  • Indexing
  • PageRank
  • Inverted Indexes
  • Searching

[Diagram labels: Crawler, Web Servers, Snapshot of Pages, Relevance? Quality?, MapReduce Jobs, Internet Search Engine, Webpages, Searchers]

slide-100
SLIDE 100

One-word Queries

Inverted index may be split into “posting files” across many machines. wordID => machine is known.

  • Front-end server takes query, converts to wordID.
  • Front-end fetches docIDs from server with posting file.
  • docIDs are sorted based on PageRank and relevance and returned to user.

slide-101
SLIDE 101

Multi-Word Queries

  • Query is converted into list of wordIDs.
  • docIDs from the posting files for each wordID are retrieved.
  • The lists of docIDs can be unioned (OR) or intersected (AND).
  • Position metadata is useful: documents with words near each other are preferred.
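
The OR and AND steps can be sketched as list operations. This assumes each posting list is already sorted by docID with no duplicates; `PostingListOps` is a made-up name, and the docIDs in `main` are borrowed from the Inverted Index slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class PostingListOps {
    // AND: walk both sorted lists in step and keep docIDs present in both.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) { out.add(a.get(i)); i++; j++; }
            else if (cmp < 0) i++;      // a is behind, advance it
            else j++;                   // b is behind, advance it
        }
        return out;
    }

    // OR: every docID that appears in either list, sorted, without duplicates.
    static List<Integer> union(List<Integer> a, List<Integer> b) {
        SortedSet<Integer> all = new TreeSet<>(a);
        all.addAll(b);
        return new ArrayList<>(all);
    }

    public static void main(String[] args) {
        List<Integer> first = Arrays.asList(133, 999, 1442);
        List<Integer> second = Arrays.asList(411, 999, 1442);
        System.out.println("AND: " + intersect(first, second));
        System.out.println("OR:  " + union(first, second));
    }
}
```

The intersection never looks at an element twice, so its cost grows with the combined length of the two lists.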

slide-102
SLIDE 102

Phrase Search

  • Again use position metadata from posting list.
  • Only look for documents with adjacent query words.

wordID  (docID, position, text type)
hello   (244,14,h1), (999,2,h1), (999,103,b)
world   (244,56,h4), (999,104,b)
…       …

slide-103
SLIDE 103

Phrase Search

  • Again use position metadata from posting list.
  • Only look for documents with adjacent query words.

wordID  (docID, position, text type)
hello   (244,14,h1), (999,2,h1), (999,103,b)
world   (244,56,h4), (999,104,b)
…       …

Search for “hello world” returns docID 999, but not 244.
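
The adjacency test can be sketched for a two-word phrase as follows. Postings are reduced to {docID, position} pairs (the text-type field is dropped), and "adjacent" is assumed to mean the second word sits at the first word's position plus one. `PhraseSearchSketch` is a hypothetical name.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class PhraseSearchSketch {
    // Returns docIDs where some posting of `second` directly follows a posting of `first`.
    static Set<Integer> phrase(int[][] first, int[][] second) {
        Set<Long> firstHits = new HashSet<>();
        for (int[] p : first) firstHits.add(key(p[0], p[1]));
        Set<Integer> docs = new TreeSet<>();
        for (int[] p : second)
            if (firstHits.contains(key(p[0], p[1] - 1)))   // same doc, one position earlier
                docs.add(p[0]);
        return docs;
    }

    // Pack (docID, position) into one long so the pair can be stored in a set.
    static long key(int docId, int position) {
        return ((long) docId << 32) | (position & 0xffffffffL);
    }

    static Set<Integer> demo() {
        int[][] hello = {{244, 14}, {999, 2}, {999, 103}};   // postings from the slide
        int[][] world = {{244, 56}, {999, 104}};
        return phrase(hello, world);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In docID 244 the two words are at positions 14 and 56, so that document matches an AND query but not the phrase.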

slide-104
SLIDE 104

Search is Resource Intense

  • Indexes greatly reduce data that must be considered relative to the grep approach.
  • However! Most of the data read from the posting lists won’t be relevant, so a lot of data must be scanned.

slide-105
SLIDE 105

Summary

  • Crawler: watch for robots.txt
  • PageRank: simulate random surfer
  • Inverted Index: list of docs containing a word
  • Search: take intersection of posting lists
slide-106
SLIDE 106

Announcements

Last class. :(

  • Feedback forms: volunteer?
  • Office hours after class in lab.
  • p5a and p5b due Fri. Hard deadline on Dec 17th.
  • T-Shirts ordered for malloc winners.
  • Final @ 10:05am next Tue. Review to be planned.