Algorithms for Web Indexing and Searching

Rolf Fagerberg, Fall 2004


  1. Algorithms for Web Indexing and Searching
     Rolf Fagerberg
     Fall 2004

  4. The Internet
     • Very large amount of information.
     • Unstructured.
     How do we find relevant info? Search engines!
     History:
     • 1994: Lycos, ...: first search engines.
     • 1996: Alta Vista: many pages indexed.
     • 1999: Google: many pages indexed and good ranking.

  6. Modern Search Engines
     Impressive performance. E.g.:
     • Searches 4.3 · 10^9 pages (Sept 04).
     • Response time ≈ 0.1 seconds.
     • 1000 queries per second.
     • Finds relevant pages (Do you feel lucky...?)
     Who uses bookmarks any more?

  7. Not So Modern Search Engines
     Advanced methods do make a difference. Example, circa 1998, for the query "princess diana": the results of three engines (Engine 1, Engine 2, Engine 3) ranged across:
     • Relevant and high quality
     • Relevant but low quality
     • Not relevant (index pollution)
     Source: M. Henzinger, Web Information Retrieval.

  10. Course Motivation
      How does [image] work?
      ⇓
      How do search engines work?
      ⇓
      Algorithms for web indexing and searching

  11. Subjects – Search Engines
      Acquiring data
      • Web crawling
      Processing data
      • Parsing
      • Indexing
      • Sorting
      • Duplicate removal
      Storing data
      • Data structures storing:
        – Keywords
        – URLs
        – Links
        – Full pages
      • Distribution of data storage
      • Compression of data
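
The indexing and keyword-storage subjects listed on this slide can be illustrated with a small inverted index. This is only a minimal sketch, not the course's or any real engine's implementation: the function names (`tokenize`, `build_index`, `search`), the tokenization rule, and the toy pages and URLs are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index: word -> {url: [word positions in that page]}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(url, []).append(position)
    return index


def search(index, query):
    """Return the sorted URLs of pages containing every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    result = set(index.get(words[0], {}))
    for word in words[1:]:
        result &= set(index.get(word, {}))
    return sorted(result)


# Hypothetical toy collection standing in for crawled pages.
pages = {
    "http://example.org/a": "Princess Diana biography",
    "http://example.org/b": "Diana Ross discography",
    "http://example.org/c": "Web crawling and indexing",
}
index = build_index(pages)
print(search(index, "princess diana"))
```

Storing word positions alongside each URL is one possible design choice; it keeps the information that position-based ranking would need, at the cost of a larger index.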

  12. Subjects – Search Engines
      Searching in data
      • Query types
      • Algorithms
      Ranking results
      • Word based (number and position of occurrences)
      • Link based (PageRank, others)
      • Query dependent
      • Query independent
      • Other heuristics (e.g. recognition of home pages, news, ...)
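
As a rough illustration of the link-based ranking mentioned on this slide, the following sketch computes PageRank-style scores by power iteration on a tiny link graph. The damping factor 0.85, the fixed iteration count, the even redistribution of rank from pages without outgoing links, and the four-page graph are all assumptions chosen for the example, not details taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration. links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the same baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page "c" is linked to by "a", "b" and "d".
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

The scores depend only on the link structure, not on any query, which is why such a ranking falls under the query-independent methods.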

  13. Related Subjects
      • String algorithms and data structures.
      • Techniques for massive data sets.
      • Internet protocols.
      • Classical IR (high-dimensional vector spaces).
      • Search engine evaluation.
      • Graph models of the web.
      • Data mining.
      • Similarity measures (nearest neighbor, clustering, latent semantic indexing).
      • Web caching.
      • Web applications of game theory (auctions, mechanism design).

  14. Formal Course Description
      Prerequisites: DM02
      Literature: Research papers
      Evaluation: Implementation project, oral exam
      Credits: 7.5 ECTS
      Course language: Danish or English

  15. Project
      Implement a search engine
      Goal: Search engine for domain
      • Large scale project.
      • Programming groups (crawling, indexing, ranking, query interface).
      • Cooperation.

  16. Informal Course Description
      In the course you will meet:
      • Real life search engines
      • Algorithms and data structures
      • Mathematical models
      • Hands-on experience
      • Teamwork
