Exploring the Academic Invisible Web: PowerPoint Presentation

SLIDE 1

Exploring the Academic Invisible Web

(German title: Das wissenschaftliche Invisible Web erkunden)

  • Dr. Dirk Lewandowski

Heinrich-Heine-Universität Düsseldorf, Information Science

Research done in collaboration with Philipp Mayr, Bonn

SLIDE 2

Agenda

  • 1. Introduction
  • 2. The (Academic) Invisible Web defined
  • 3. The size of the (Academic) Invisible Web
  • 4. AIW relevant to...
  • 5. Opening the AIW – different models
SLIDE 3

1 Introduction

  • Users expect their search services to be comprehensive and integrated.
  • Up-to-dateness and completeness are important factors in research.

SLIDE 4

2 The Invisible Web defined

Definitions for Invisible/Deep Web

  • “Text pages, files, or other often high-quality authoritative information available via the World Wide Web that general-purpose search engines cannot, due to technical limitations, or will not, due to deliberate choice, add to their indices of Web pages” (Sherman & Price 2001).
  • “The deep Web - those pages do not exist until they are created dynamically as the result of a specific search” (Bergman 2001).

SLIDE 5

Type of Invisible Web Content | Why It's Invisible
Disconnected page | No links for crawlers to find the page
Pages consisting primarily of images, audio, or video | Insufficient text for the search engine to "understand" what the page is about
Pages consisting primarily of PDF or Postscript, Flash, Shockwave, Executables (programs) or Compressed files (.zip, .tar, etc.) | Technically indexable, but usually ignored, primarily for business or policy reasons
Content in relational databases | Crawlers can't fill out required fields in interactive forms
Real-time content | Ephemeral data; huge quantities; rapidly changing information
Dynamically generated content | Customized content is irrelevant for most searchers; fear of "spider traps"

Source: Sherman & Price 2001

SLIDE 6

From the Invisible Web to the Academic Invisible Web

  • Nowadays, the Invisible Web (IW) problem is mainly a problem of database contents.
  • For the academic sector, sources from the surface Web are relevant, as are sources from the Invisible Web.
  • The Academic Invisible Web (AIW) consists of the databases relevant to academia.
  • Or, more narrowly: the AIW consists of the databases that libraries should index (using search engine technology).

SLIDE 7

3 The size of the Invisible Web

SLIDE 8

Bergman’s calculation

  • Average size of IW databases:
      – 5.43 million documents (mean)
      – 4,950 documents (median)
  • Total size: 100,000 databases × 5.43 million documents = 543 billion documents in total.
  • Size of the surface Web: 1 billion documents (2001).

→ The Invisible/Deep Web is 550 times larger than the surface Web.
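
The arithmetic behind this slide can be reproduced in a few lines. The following is a minimal sketch that uses only the figures quoted on the slide; the variable names are illustrative and not taken from Bergman's paper. Note that the plain product gives a factor of 543, slightly below the headline figure of 550 stated on the slide.

```python
# Figures as quoted on the slide; the names are illustrative assumptions.
NUM_DATABASES = 100_000            # assumed number of Invisible Web databases
MEAN_DOCS_PER_DB = 5_430_000       # mean database size in documents
SURFACE_WEB_DOCS = 1_000_000_000   # surface Web size in documents (2001)

# Total size = number of databases x mean size per database.
total_docs = NUM_DATABASES * MEAN_DOCS_PER_DB

# How many times larger than the surface Web this total would be.
ratio = total_docs / SURFACE_WEB_DOCS

print(f"Estimated Invisible Web size: {total_docs:,} documents")
print(f"Ratio to surface Web: {ratio:.0f}x")
```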

SLIDE 9

Bergman’s calculation

But:

  • Use of the mean, although the distribution of sizes is highly skewed:
      – 5.43 million documents (mean)
      – 4,950 documents (median)
  • The top 60 databases contain 85 billion documents (748,504 GB).
  • The top 2 contain 585,400 GB (>75% of the top 60).

[Figure: Bergman top 60 file sizes. x-axis: databases 1 to 60; y-axis: size in GB (0 to 400,000).]
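
The criticism on this slide can be checked with simple arithmetic. The sketch below uses only the figures quoted on the slides; the variable names are illustrative. Multiplying the database count by the median is not a valid estimate of the total and is not proposed by the authors. It is shown only to illustrate how strongly the result depends on the choice of average when the distribution is skewed.

```python
# Figures as quoted on the slides; the names are illustrative assumptions.
NUM_DATABASES = 100_000
MEAN_DOCS = 5_430_000      # mean database size in documents
MEDIAN_DOCS = 4_950        # median database size in documents
TOP60_GB = 748_504         # combined size of Bergman's top 60 databases
TOP2_GB = 585_400          # combined size of the two largest databases

# Share of the top-60 volume held by just two databases.
top2_share = TOP2_GB / TOP60_GB

# In a symmetric distribution mean and median are close together;
# a very large ratio signals a heavily skewed distribution.
mean_to_median = MEAN_DOCS / MEDIAN_DOCS

# Sensitivity illustration only: the total changes by three orders of
# magnitude depending on which "average" is multiplied out.
total_via_mean = NUM_DATABASES * MEAN_DOCS
total_via_median = NUM_DATABASES * MEDIAN_DOCS

print(f"Top 2 share of top 60: {top2_share:.1%}")
print(f"Mean / median: {mean_to_median:,.0f}")
print(f"Total via mean:   {total_via_mean:,} documents")
print(f"Total via median: {total_via_median:,} documents")
```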

SLIDE 10

Contents of Bergman’s Top 60

Basis: Database sizes in GB

Chart 1:

  • Scientific: 90%
  • Other: 10%

Chart 2 (scientific content split into raw data and the rest):

  • Raw data: 86%
  • Scientific without raw data: 4%
  • Other: 10%

SLIDE 11

Summary of the Bergman criticism

  • Database selection
      – Database types
      – Database content
  • Calculation
SLIDE 12

Size comparison: Gale Directory of Databases

  • Contains approx. 16,000 databases (2003); covers all major academic databases.
  • Total size estimate for all databases: 18.55 billion documents (includes CD-ROM databases).
  • The estimate is based on less than 10 percent of all databases.
  • 5 percent of all databases contain >1 million documents, some more than 100 million.
  • Some of the databases included in Bergman’s top 60 are missing from Gale.
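
For comparison with Bergman's figures, the Gale numbers above can be turned into a per-database average. This is a rough sketch based only on the figures quoted on the slides. The variable names are illustrative, and reading "5 percent of all databases" as 5 percent of the full 16,000 is an assumption.

```python
# Figures as quoted on the slides; the names are illustrative assumptions.
GALE_DATABASES = 16_000
GALE_TOTAL_DOCS = 18_550_000_000
BERGMAN_MEAN_DOCS = 5_430_000

# Average documents per database implied by the Gale total.
gale_mean_docs = GALE_TOTAL_DOCS / GALE_DATABASES

# "Less than 10 percent" of the databases means fewer than this many
# databases underlie the size estimate.
max_sampled = GALE_DATABASES // 10

# Assumption: "5 percent of all databases" refers to all 16,000 entries.
large_dbs = GALE_DATABASES * 5 // 100

print(f"Implied mean per Gale database: {gale_mean_docs:,.0f} documents")
print(f"Bergman mean / Gale mean: {BERGMAN_MEAN_DOCS / gale_mean_docs:.1f}")
print(f"Estimate based on fewer than {max_sampled:,} databases")
print(f"Databases with >1 million documents: about {large_dbs:,}")
```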

SLIDE 13

Will the AIW also show an exponential distribution?

[Figure: Dialog file sizes, linear scale. x-axis: files 1 to 337; y-axis: file sizes (0 to 200,000,000).]

SLIDE 14

Will the AIW also show an exponential distribution?

[Figure: Dialog file sizes, logarithmic scale. x-axis: files 1 to 337; y-axis: file sizes (1 to 1,000,000,000).]
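
One way to examine the question on this slide is to check whether ranked file sizes fall on a straight line when plotted on a logarithmic axis, which is what an exponential rank-size relationship would produce. The sketch below is an illustration under stated assumptions, not the authors' method: the function name `fit_log_linear` is hypothetical, and the input is synthetic data generated for the example, not the Dialog file sizes.

```python
import math

def fit_log_linear(sizes):
    """Least-squares fit of log10(size) against rank (1 = largest).

    Returns (slope, intercept, r_squared). An r_squared close to 1 means
    the ranked sizes lie on a straight line on a logarithmic axis.
    """
    ranked = sorted(sizes, reverse=True)
    xs = list(range(1, len(ranked) + 1))
    ys = [math.log10(s) for s in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical input: 337 sizes decaying exponentially with rank.
# These values are synthetic and are NOT the Dialog file sizes.
synthetic_sizes = [2e8 * math.exp(-0.04 * rank) for rank in range(1, 338)]

slope, intercept, r_squared = fit_log_linear(synthetic_sizes)
print(f"slope={slope:.4f}, r_squared={r_squared:.4f}")
```

With real file sizes, an r_squared well below 1 would suggest that the distribution is not simply exponential, for example because a few very large files dominate.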

SLIDE 15

Conclusion: Size of the Invisible Web

  • Bergman’s size estimate of 550 billion documents is a considerable overestimate.
  • An exact calculation from the distribution of Bergman’s top 60 is not possible.
  • The size estimate from the Gale directory includes databases beyond the Web, but does not include all Web databases.
  • The estimate from Gale is probably too low.
SLIDE 16

4 AIW relevant for scholars, searchers, librarians, information professionals

SLIDE 17

4 AIW relevant for scholars, searchers, librarians, information professionals

  • Everything relevant for the scientific process
      – Literature (articles, dissertations, reports, books, …)
      – Data
      – Purely online content (e.g. Open Access)
  • Providers of AIW content
      – Database vendors (metadata) + human indexing
      – Library content (OPACs, collections) + human indexing
      – Publishers’ content (full text) + mixed indexing
      – Other repositories
  • Many of these materials are not necessarily part of the AIW, yet are in fact not covered by the main search engines and tools.

SLIDE 18

5 Opening the AIW – different models

  • Commercial search engines
      – Google Scholar
      – Scirus
  • Libraries & database vendors
      – BASE (Bielefeld Academic Search Engine)
      – Vascoda (Integration of library and database collections)
  • Open Access repositories
      – Citebase
      – OpenROAR

SLIDE 19

Conclusion

SLIDE 20

Summary

  • Existing search tools and approaches show potential to make the AIW visible.
  • All stakeholders should work together:
      – Commercial search engine providers, with their computing and financial power
      – Librarians, with their experience in collection building and subject access (e.g. thesauri, classification, taxonomies)
      – Publishers and database vendors, by opening their collections

SLIDE 21

Future research

  • Building an AIW sample for further tests.
  • Better size estimates from this sample.
  • Classification of AIW content.
  • Distinction between Academic Surface Web and AIW.
SLIDE 22

Thank you.

dirk.lewandowski@uni-duesseldorf.de
www.durchdenken.de/lewandowski

SLIDE 23

References

  • Bergman, M.K. (2001). The Deep Web: Surfacing Hidden Value. Journal of Electronic Publishing, 7(1).
  • Sherman, C., & Price, G. (2001). The Invisible Web: Uncovering Information Sources Search Engines Can't See. Medford, NJ: Information Today.