Algorithms for Web Indexing and Searching, Rolf Fagerberg, Fall 2007


  1. Algorithms for Web Indexing and Searching. Rolf Fagerberg, Fall 2007

  2. The Internet • Very large amount of information. • Unstructured. How do we find relevant info?

  3. The Internet • Very large amount of information. • Unstructured. How do we find relevant info? Search Engines!

  4. The Internet • Very large amount of information. • Unstructured. How do we find relevant info? Search Engines! History: 94: Lycos, World Wide Web Worm, ...: first search engines. 96: Alta Vista: many pages indexed. 98: Google: many pages indexed and good ranking.

  5. Modern Search Engines. Impressive performance, e.g.: • Searches 10^10 pages. • Response time ≈ 0.1 seconds. • 1000+ queries per second. • Finds relevant pages (Do you feel lucky...?)

  6. Modern Search Engines. Impressive performance, e.g.: • Searches 10^10 pages. • Response time ≈ 0.1 seconds. • 1000+ queries per second. • Finds relevant pages (Do you feel lucky...?) Who uses bookmarks any more?

  7. Not So Modern Search Engines. Advanced methods do make a difference (example, circa 1998). Query: princess diana. Results compared across three engines (Engine 1, Engine 2, Engine 3), with outcomes ranging from: • Relevant and high quality • Relevant but low quality • Not relevant (index pollution)

  8. Course Motivation. How does [image] work?

  9. Course Motivation. How does [image] work? ⇓ How do search engines work?

  10. Course Motivation. How does [image] work? ⇓ How do search engines work? ⇓ Algorithms for web indexing and searching

  11. Subjects – Search Engines. Acquiring data: • Web crawling. Processing data: • Parsing • Indexing • Sorting • Duplicate removal. Storing data: • Data structures storing keywords, URLs, links, and full pages • Distribution of data storage • Compression of data
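As a rough illustration of how the parsing, indexing, and duplicate removal subjects on this slide fit together, here is a minimal sketch of an inverted index in Python. It is not taken from the course material: the function names `tokenize` and `build_index`, the sample pages, and the choice to treat two pages as duplicates when their token sequences are identical are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Parsing step: lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(pages):
    """Indexing step: map each keyword to a list of (url, positions) postings.

    `pages` is a list of (url, text) pairs. A page whose token sequence
    exactly matches an earlier page is skipped (naive duplicate removal).
    """
    postings = defaultdict(dict)
    seen = set()
    for url, text in pages:
        tokens = tuple(tokenize(text))
        if tokens in seen:
            continue  # exact duplicate of a page already indexed
        seen.add(tokens)
        for position, word in enumerate(tokens):
            postings[word].setdefault(url, []).append(position)
    # Sorting step: keep each postings list ordered by URL.
    return {word: sorted(by_url.items()) for word, by_url in postings.items()}


# Hypothetical sample data; the third page duplicates the first.
pages = [
    ("http://a.dk/", "Princess Diana news"),
    ("http://b.dk/", "Diana Ross news and more news"),
    ("http://c.dk/", "princess diana NEWS"),
]
index = build_index(pages)
print(index["news"])  # [('http://a.dk/', [2]), ('http://b.dk/', [2, 5])]
```

Storing word positions alongside each URL is what later allows word based ranking by number and position of occurrences. A real system would use compressed, distributed postings lists rather than an in-memory dictionary.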

  12. Subjects – Search Engines. Searching in data: • Query types • Algorithms. Ranking results: • Word based (number and position of occurrences) • Link based (PageRank, others) • Query dependent • Query independent • Other heuristics (e.g. recognition of home pages, news, ...)
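To make the link based, query independent ranking mentioned on this slide a little more concrete, the following is a small sketch of PageRank computed by power iteration. The slides only name PageRank, so the details here are assumptions for illustration: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, the uniform redistribution of rank from pages without outlinks, and the toy link graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    `links` maps each page to the list of pages it links to.
    Pages with no outgoing links spread their rank over all pages.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the same base amount from random jumps.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: distribute its rank uniformly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked from three pages, "d" from none.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

The scores depend only on the link structure, not on any query, which is why this kind of ranking can be precomputed once and combined with word based scores at query time.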

  13. Related Subjects • String algorithms and data structures. • Techniques for massive data sets. • Internet protocols. • Classical Information Retrieval (vector space models). • Search engine evaluation. • Graph models of the web. • Similarity measures (nearest neighbor, clustering, latent semantic indexing). • Web applications of game theory (auctions, mechanism design).

  14. Formal Course Description. Prerequisites: DM02/DM507 Algorithms and Data Structures. Literature: research papers. Evaluation: implementation project, oral exam (??). Credits: 7.5 ECTS. Course language: English.

  15. Project: Implement a search engine. Goal: Search engine for domain .dk • Large scale project in several parts (crawling, indexing, ranking, query interface). • Larger programming groups than normal (4 persons?). Train cooperation and project planning.

  16. Informal Course Description. In the course you will meet: • Real life search engines – a showcase of the direct impact computer science can have on everybody's daily life. • Algorithms and data structures • Mathematical models • Hands-on experience and teamwork
