Algorithms for Web Indexing and Searching, Rolf Fagerberg, Fall 2004

SLIDE 1

Algorithms for Web Indexing and Searching

Rolf Fagerberg Fall 2004

SLIDE 4

The Internet

  • Very large amount of information.
  • Unstructured.

How do we find relevant info? Search Engines!

History:

  • 94: Lycos, ...: First search engines.
  • 96: Alta Vista: many pages indexed.
  • 99: Google: many pages indexed and good ranking.

SLIDE 6

Modern Search Engines

Impressive performance, e.g.:

  • Searches 4.3 · 10^9 pages (Sept 04).
  • Response time ≈ 0.1 seconds.
  • 1000 queries per second.
  • Finds relevant pages (Do you feel lucky...?)

Who uses bookmarks any more?

SLIDE 7

Not So Modern Search Engines

Advanced methods do make a difference (example, circa 1998, from M. Henzinger, Web Information Retrieval):

Query: princess diana

Engines compared: Engine 1, Engine 2, Engine 3

Result labels shown:

  • Relevant but low quality
  • Not relevant (index pollution)
  • Relevant and high quality

SLIDE 10

Course Motivation

How does [image] work?
⇓
How do search engines work?
⇓
Algorithms for web indexing and searching

SLIDE 11

Subjects – Search Engines

Acquiring data

  • Web crawling

Processing data

  • Parsing
  • Indexing
  • Sorting
  • Duplicate removal
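
As a rough illustration of the processing steps listed above, the following Python sketch parses a few pages, drops exact duplicates, and builds a sorted inverted index. The page contents, the `a.example` URLs, the function names, and the choice of a SHA-1 content hash for duplicate detection are all assumptions made for this example; the slides do not prescribe a particular method.

```python
import hashlib
import re
from collections import defaultdict


def tokenize(text):
    """Parsing step: lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index (word -> sorted list of URLs) from {url: text}.

    A page whose token sequence equals that of an earlier page is skipped.
    This is exact-duplicate removal only, an assumption of this sketch.
    """
    seen_hashes = set()
    index = defaultdict(set)
    for url in sorted(pages):  # fixed order, so the surviving copy is predictable
        tokens = tokenize(pages[url])
        digest = hashlib.sha1(" ".join(tokens).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # duplicate removal
        seen_hashes.add(digest)
        for word in tokens:
            index[word].add(url)  # indexing
    # Sorting step: keep every posting list in sorted order.
    return {word: sorted(urls) for word, urls in index.items()}


# Hypothetical toy collection; page 3 duplicates page 1 after parsing.
pages = {
    "a.example/1": "Web indexing and searching",
    "a.example/2": "Searching the web",
    "a.example/3": "web indexing AND searching",
}
index = build_index(pages)
```

Near-duplicate detection (pages that differ only slightly) needs other techniques and is not covered by this sketch.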

Storing data

  • Data structures storing:
      – Keywords
      – URLs
      – links
      – full pages

  • Distribution of data storage
  • Compression of data.
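
One common way to compress stored index data is to keep each sorted list of document IDs as gaps between neighbours and to write the gaps with a variable-byte code. The sketch below illustrates that idea only; the slides do not say which compression scheme the course uses, and the function names and sample IDs are chosen for this example.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers with 7 payload bits per byte.

    The high bit is set on the last byte of each number.
    """
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk.reverse()      # most significant 7-bit group first
        chunk[-1] |= 0x80    # mark the final byte of this number
        out.extend(chunk)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers


def compress_postings(doc_ids):
    """Store a sorted list of document IDs as variable-byte coded gaps."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild the document IDs by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


postings = [824, 829, 215406]          # hypothetical document IDs
packed = compress_postings(postings)   # gaps 824, 5, 214577
```

Small gaps fit in a single byte, so long posting lists for frequent words compress well under this scheme.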

SLIDE 12

Subjects – Search Engines

Searching in data

  • Query types
  • Algorithms

Ranking results

  • Word based (number and position of occurrences)
  • Link based (PageRank, others)
  • Query dependent
  • Query independent
  • Other heuristics (e.g. recognition of home pages, news, ...)
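
To make the link-based ranking idea concrete, here is a minimal sketch of PageRank computed by power iteration on a toy link graph. The slide only names PageRank; the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without outgoing links, and the four-page graph are assumptions of this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict mapping page -> score; the scores sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from the uniform distribution
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page without outgoing links: spread its rank over all pages.
                share = damping * rank[p] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: "c" is linked from three pages, "d" from none.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

In this graph the page with the most incoming links ends up with the highest score, and the page nobody links to keeps only the random-jump share. The score depends on the link structure alone, not on any query.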

SLIDE 13

Related Subjects

  • String algorithms and data structures.
  • Techniques for massive data sets.
  • Internet protocols
  • Classical IR (high-dimensional vector spaces).
  • Search engine evaluation.
  • Graph models of the web.
  • Data mining.
  • Similarity measures (nearest neighbor, clustering, latent semantic indexing).
  • Web caching.
  • Web applications of game theory (auctions, mechanism design).

SLIDE 14

Formal Course Description

Prerequisites: DM02
Literature: Research papers
Evaluation: Implementation project, oral exam
Credits: 7.5 ECTS
Course language: Danish or English

SLIDE 15

Project

Implement a search engine.

Goal: Search engine for domain [illegible]

  • Large scale project.
  • Programming groups (crawling, indexing, ranking, query interface).

  • Cooperation.

SLIDE 16

Informal Course Description

In the course you will meet:

  • Real life search engines
  • Algorithms and data structures
  • Mathematical models
  • Hands-on experience
  • Teamwork
