An updated Portrait of the Portuguese Web - João Miranda, Daniel Gomes - PowerPoint PPT Presentation



slide-1
SLIDE 1

An updated Portrait of the Portuguese Web

João Miranda, Daniel Gomes {joao.miranda,daniel.gomes} @ fccn.pt http://arquivo.pt

slide-2
SLIDE 2

2/28

Summary

  • Introduction
  • Methodology
  • Metrics
  • Conclusions
slide-3
SLIDE 3

3/28

Introduction

slide-4
SLIDE 4

4/28

The Web

  • The Web is a huge source of information

      – Information published exclusively on the Web
      – Information disappears

  • Preservation started by the Web Archives

– Access for future generations

  • First initiative: Internet Archive
slide-5
SLIDE 5

5/28

Web Archives

  • Altavista across time

[Figure: Altavista as archived in 1996, 1997, 1998, 1999, 2001, 2003, 2004 and 2009]
slide-6
SLIDE 6

6/28

The Portuguese Web Archive

  • The Portuguese Web Archive

[Figure: components of the Web archive: Web, Crawler, Content Storage, Indexer, User Interface]

slide-7
SLIDE 7

7/28

Crawler

  • What is a crawler?

      – Collects contents from the Web
      – Starts from an initial set of addresses

  • How does it work?

      – Iteratively downloads contents
      – Extracts links to find new ones
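
The loop described on this slide can be sketched in a few lines of Python. This is only an illustration of the general idea, not the crawler used in the study: the `crawl` function, the `fetch` callback and the three-page `FAKE_WEB` dictionary are hypothetical names introduced here so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_contents=1000):
    """Breadth-first crawl; fetch(url) returns HTML text, or None on failure."""
    frontier = deque(seeds)      # addresses still to visit
    seen = set(seeds)            # addresses already queued
    downloaded = []
    while frontier and len(downloaded) < max_contents:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:         # broken link: nothing to download
            continue
        downloaded.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return downloaded


# Hypothetical three-page site standing in for the Web.
FAKE_WEB = {
    "http://example.pt/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.pt/a.html": '<a href="/b.html">B</a>',
    "http://example.pt/b.html": '<a href="/missing.html">broken</a>',
}

pages = crawl(["http://example.pt/"], FAKE_WEB.get)
```

Here `pages` holds the three reachable addresses in the order they were visited; the broken link to `/missing.html` is queued but never downloaded.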

slide-8
SLIDE 8

8/28

Methodology

slide-9
SLIDE 9

9/28

Methodology

  • Crawl of the Portuguese Web (March-May, 2008)

      – .PT domain
      – Heritrix crawler
      – 180 000 initial addresses
      – 48 million contents
      – 2.5 TB

  • No content analysis, only log analysis
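
Log analysis of this kind might start from a parser like the one below. The line layout (timestamp, status code, size in bytes, URL, media type, separated by whitespace) is a simplification assumed for this sketch and is not the actual format of the logs used in the study; `LogEntry`, `parse_line` and `SAMPLE_LOG` are hypothetical names.

```python
from typing import NamedTuple


class LogEntry(NamedTuple):
    status: int
    size: int
    url: str
    media_type: str


def parse_line(line):
    """Parse one line of the assumed layout: timestamp status size url media-type."""
    fields = line.split()
    if len(fields) < 5:
        return None
    _timestamp, status, size, url, media_type = fields[:5]
    try:
        return LogEntry(int(status), int(size), url, media_type.lower())
    except ValueError:
        return None   # non-numeric status or size: skip the line


# Hypothetical log excerpt, invented for illustration.
SAMPLE_LOG = """\
2008-03-01T10:00:00Z 200 5120 http://example.pt/ text/html
2008-03-01T10:00:01Z 404 312 http://example.pt/old.html text/html
2008-03-01T10:00:02Z 200 20480 http://example.pt/logo.gif image/gif
malformed line
"""

entries = [e for e in map(parse_line, SAMPLE_LOG.splitlines()) if e is not None]
```

Every metric in the following slides can be derived from such entries without opening the downloaded contents themselves.
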
slide-10
SLIDE 10

10/28

Metrics

slide-11
SLIDE 11

11/28

Sites hosted per IP address – Why?

  • Politeness policies for crawling

[Diagram: a crawler fetching siteA, siteB and siteC from three separate servers (server1, server2, server3), compared with the same three sites all hosted on server1]

slide-12
SLIDE 12

12/28

Sites hosted per IP address - Results

  • 75% of the IP addresses host 1 site

      – 1 site: 75% of the IP addresses
      – ]1,10] sites: 23%
      – >10 sites: 2%
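
A distribution like this one could be computed from a site-to-IP mapping as sketched below. The function name, the sample sites and the documentation-range IP addresses are assumptions made for the example; the percentages it prints are for the toy data only.

```python
from collections import Counter, defaultdict


def sites_per_ip_distribution(site_to_ip):
    """Return the percentage of IP addresses in each sites-per-IP bucket."""
    sites_by_ip = defaultdict(set)
    for site, ip in site_to_ip.items():
        sites_by_ip[ip].add(site)

    buckets = Counter()
    for sites in sites_by_ip.values():
        n = len(sites)
        if n == 1:
            buckets["1 site"] += 1
        elif n <= 10:
            buckets["]1,10] sites"] += 1
        else:
            buckets[">10 sites"] += 1

    total = len(sites_by_ip)
    return {label: 100.0 * count / total for label, count in buckets.items()}


# Hypothetical mapping: three IPs host one site each, one IP hosts two.
SAMPLE = {
    "a.example.pt": "192.0.2.1",
    "b.example.pt": "192.0.2.2",
    "c.example.pt": "192.0.2.3",
    "d.example.pt": "192.0.2.4",
    "e.example.pt": "192.0.2.4",
}

distribution = sites_per_ip_distribution(SAMPLE)
print(distribution)
```

A crawler that applies its politeness delay per IP address rather than per site name would use the same grouping to avoid overloading a server that hosts many sites.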

slide-13
SLIDE 13

13/28

Successful responses – Why?

  • Quality indicator
  • A large percentage of broken links undermines the trust of users
slide-14
SLIDE 14

14/28

Successful responses - Results

  • 18% of the sites returned 100% OK responses
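
A per-site success rate could be derived from (URL, status code) pairs as in this sketch. It assumes that an "OK response" means HTTP status 200 and that a site is identified by the host name of the URL; both are assumptions of the example, as are the function name and the sample data.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def success_rate_per_site(responses):
    """responses: iterable of (url, status). Returns {site: fraction of 200s}."""
    ok = defaultdict(int)
    total = defaultdict(int)
    for url, status in responses:
        site = urlsplit(url).hostname
        total[site] += 1
        if status == 200:       # assumption: only 200 counts as OK
            ok[site] += 1
    return {site: ok[site] / total[site] for site in total}


# Hypothetical responses, invented for illustration.
RESPONSES = [
    ("http://a.example.pt/", 200),
    ("http://a.example.pt/news.html", 200),
    ("http://b.example.pt/", 200),
    ("http://b.example.pt/old.html", 404),
]

rates = success_rate_per_site(RESPONSES)
fully_ok = [site for site, rate in rates.items() if rate == 1.0]
share_fully_ok = 100.0 * len(fully_ok) / len(rates)
```

In this toy data one of the two sites answers every request successfully, so `share_fully_ok` is 50.0.
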
slide-15
SLIDE 15

15/28

Media types – Why?

  • Browsers or document viewers for cellphones
  • Parsing and indexing for search engines
slide-16
SLIDE 16

16/28

Media types - Results

  • 90% of the number of contents are HTML, JPEG, GIF

      Rank  Media type       % contents
      1     text/html        57.8%
      2     image/jpeg       22.8%
      3     image/gif         9.4%
      4     text/xml          1.9%
      -     Other             8.1%

  • 69% of the amount of data are HTML, PDF, JPEG

      Rank  Media type       % amount of data
      1     text/html        35.4%
      2     application/pdf  17.9%
      3     image/jpeg       16.1%
      4     text/plain        4.2%
      -     Other            26.4%
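
The two rankings differ because one counts contents and the other counts bytes. A minimal sketch of both tallies follows; the function name and the four sample records are hypothetical and chosen only to show how a few large PDF files can dominate the data volume.

```python
from collections import Counter


def media_type_shares(records):
    """records: iterable of (media_type, size_in_bytes).

    Returns two dicts: percentage by number of contents and by amount of data.
    """
    count = Counter()
    volume = Counter()
    for media_type, size in records:
        count[media_type] += 1
        volume[media_type] += size
    n = sum(count.values())
    total = sum(volume.values())
    by_count = {m: 100.0 * c / n for m, c in count.items()}
    by_volume = {m: 100.0 * v / total for m, v in volume.items()}
    return by_count, by_volume


# Hypothetical records: two small HTML pages, one GIF, one large PDF.
RECORDS = [
    ("text/html", 1000),
    ("text/html", 1000),
    ("image/gif", 500),
    ("application/pdf", 7500),
]

by_count, by_volume = media_type_shares(RECORDS)
```

In this toy data HTML accounts for half of the contents but only a fifth of the bytes, while the single PDF accounts for three quarters of the bytes.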

slide-17
SLIDE 17

17/28

Content size – Why?

  • Estimate the storage resources required to create Web data repositories

slide-18
SLIDE 18

18/28

Content size - Results

  • 96% of the contents are smaller than 128 KB
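
The share of contents under a size threshold, and a rough average size from the crawl totals reported in the Methodology slide (48 million contents, 2.5 TB), can be worked out as below. The sketch assumes 1 TB = 10^12 bytes and 128 KB = 128 × 1024 bytes; the function name and sample sizes are hypothetical.

```python
def share_below(sizes, threshold_bytes=128 * 1024):
    """Percentage of contents strictly smaller than the threshold."""
    sizes = list(sizes)
    if not sizes:
        return 0.0
    below = sum(1 for size in sizes if size < threshold_bytes)
    return 100.0 * below / len(sizes)


# Hypothetical content sizes in bytes.
SIZES = [1000, 2000, 131072, 500000]
share = share_below(SIZES)

# Rough average content size from the reported crawl totals,
# assuming decimal terabytes (10**12 bytes).
mean_bytes = 2.5e12 / 48e6
print(f"{share:.1f}% below 128 KB, mean size about {mean_bytes / 1000:.0f} KB")
```

Under these assumptions the average content in the crawl is roughly 52 KB, which gives a first-order storage estimate per million contents.
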
slide-19
SLIDE 19

19/28

Dynamically generated contents – Why?

  • Identify technological trends in Web publishing
slide-20
SLIDE 20

20/28

Dynamically generated contents - Results

  • At least 46.3% of the contents were dynamically generated

      – Not dynamically generated: 53.7%
      – PHP: 22.4%
      – ASP: 10%
      – JSP: 0.9%
      – Other with parameters: 13%
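
One way to obtain categories like these from URLs alone is a heuristic on the path extension and the query string. The rules below are an assumption of this sketch, not the classification actually used in the study; in particular, counting `.aspx` as ASP is a choice made here. A URL-based heuristic cannot detect dynamic pages served under plain-looking addresses, which would be consistent with the "at least" in the result.

```python
from urllib.parse import urlsplit


def classify(url):
    """Classify a URL with a simple extension and query-string heuristic."""
    parts = urlsplit(url)
    path = parts.path.lower()
    if path.endswith(".php"):
        return "PHP"
    if path.endswith((".asp", ".aspx")):   # assumption: .aspx counted as ASP
        return "ASP"
    if path.endswith(".jsp"):
        return "JSP"
    if parts.query:                        # has parameters after "?"
        return "Other with parameters"
    return "Not dynamically generated"


# Hypothetical URLs, invented for illustration.
for url in [
    "http://example.pt/index.php?id=3",
    "http://example.pt/Default.aspx",
    "http://example.pt/cgi-bin/search?q=web",
    "http://example.pt/about.html",
]:
    print(classify(url), url)
```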

slide-21
SLIDE 21

21/28

URL length – Why?

  • Influences interaction design
  • Determine adequate length for input boxes that receive URLs
  • How many characters should be presented on a search engine results page

slide-22
SLIDE 22

22/28

URL length - Results

  • 84% of the URLs are shorter than 100 characters
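
A measurement like this, and the display decision it informs, might look as follows. Both helpers and the sample URLs are hypothetical; the 100-character limit simply mirrors the threshold on this slide.

```python
def share_shorter_than(urls, limit=100):
    """Percentage of URLs with fewer than `limit` characters."""
    urls = list(urls)
    if not urls:
        return 0.0
    return 100.0 * sum(1 for url in urls if len(url) < limit) / len(urls)


def truncate_for_display(url, limit=100):
    """Shorten a URL to at most `limit` characters for a results page."""
    if len(url) <= limit:
        return url
    return url[: limit - 3] + "..."


# Hypothetical URLs: one short, one far longer than the limit.
SHORT_URL = "http://example.pt/"
LONG_URL = "http://example.pt/" + "a" * 200

share = share_shorter_than([SHORT_URL, LONG_URL])
shown = truncate_for_display(LONG_URL)
```

With a limit chosen from the measured distribution, most URLs are shown in full and only the long tail is truncated.
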
slide-23
SLIDE 23

23/28

Conclusions

slide-24
SLIDE 24

24/28

Conclusions I

  • Long URL addresses
  • Half of the contents are dynamically generated (mainly PHP)
  • 90% of the contents are HTML, JPEG and GIF
  • 69% of the amount of data are HTML, PDF and JPEG

slide-25
SLIDE 25

25/28

Conclusions II

  • 96% of the contents are smaller than 128 KB
  • Half of the sites present a successful response rate below 80%

  • Most IP addresses host a single site
slide-26
SLIDE 26

26/28

“Future” work

  • Study trends in the evolution of web characteristics

      – João Miranda, Daniel Gomes, Trends in Web characteristics, 7th Latin American Web Congress

  • Analyze metrics extracted from content and link analysis

slide-27
SLIDE 27

27/28

Contribute to preserve the Web

  • Anyone can contribute to preserve the Web
  • Lend disk space to keep backup copies

      – Just need to install rARC
      – http://arquivo.pt/rarc

  • Help required to test beta version
slide-28
SLIDE 28

28/28

Thank you.

Logs used in this study are available for research purposes. Please contact us.

http://arquivo.pt