WEB ARCHIVING @ LIP6 | Stéphane Gançarski, Z. Pehlivan, M. Cord, M. Ben-Saad, A. Sanoja, N. Thome, M. Law | PowerPoint PPT Presentation



SLIDE 1

WEB ARCHIVING @ LIP6

STÉPHANE GANÇARSKI

  • Z. PEHLIVAN, M. CORD, M. BEN-SAAD, A. SANOJA, N. THOME, M. LAW

• French ANR project Cartec (ended)
• European project Scape (ends 9/2014)

SLIDE 2

Web Archives

• The Web is ephemeral and constantly evolving
• Need to preserve information

• Web: 50,000,000,000 pages (Google, 2012), non-cumulative index, main concern: freshness
• Web archives: 165,000,000,000 pages (Internet Archive, 2012), cumulative index, main concerns: coherence, completeness, preservation

2 / 54

SLIDE 3

Issues

• Crawling: temporal completeness, coherence; spatial completeness, discovery; avoiding duplicates
• Access: queries, IR, indexing, navigation, coherence
• Preservation: emulation, migration, cloud computing

SLIDE 4

1. Efficient crawling using segmentation and patterns

• Efficient:
  • Maximize temporal completeness and coherence
  • Under limited resources (bandwidth, politeness, storage…)
• Temporal completeness:
  • How well the archive captures the history of a page
  • Relevant for medium-size archives (e.g. INA legal deposit)
  • For large archives, spatial completeness matters more
• Temporal coherence:
  • Capture versions of different pages that were present at the same time on the Web

SLIDE 5

Temporal completeness

How to measure it? Temporal completeness is the importance of the captured versions relative to the importance of all the versions that appeared on the Web:

temporal completeness = Σ ω(captured versions) / Σ ω(all versions on the Web)

The importance of a page version combines page importance and change importance:

ω(υ_i^j) = ω(P_j) × impCh(υ_{i-1}^j, υ_i^j)

where ω(P_j) is the importance of page P_j and impCh(υ_{i-1}^j, υ_i^j) is the importance of the changes between two consecutive versions.
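As a sketch (not the project's actual code; all weights below are hypothetical), the completeness of a crawl strategy can be computed from these version importances:

```python
def version_importance(page_weight, change_importance):
    # omega(v_i) = omega(P) * impCh(v_{i-1}, v_i)
    return page_weight * change_importance

def temporal_completeness(captured, all_versions):
    """Importance of the captured versions divided by the importance
    of all the versions that appeared on the Web."""
    return sum(captured) / sum(all_versions)

# Hypothetical page of weight 0.8 with four versions and their change importances.
all_v = [version_importance(0.8, c) for c in (0.5, 0.1, 0.9, 0.3)]
captured = [all_v[0], all_v[2]]  # only two versions were crawled
print(round(temporal_completeness(captured, all_v), 3))  # 0.778
```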
SLIDE 6

Measuring change importance (version n-1 vs. version n)

• What are the changes (update, insert, …)? Are they important? It depends on where the change occurs in the page.
• Change importance is related to what users see:
  • Render pages before analysis
  • Users see blocks of information
  • Use web page segmentation

18 November 2011, Qualité des Archives Web : Modélisation et Optimisation (Web Archive Quality: Modeling and Optimization)

SLIDE 7

Global overview

[Diagram] The crawler fetches version V(n) of a page from the Web and stores it in the archive next to V(n-1); change detection compares V(n) with V(n-1); importance estimation feeds time series, from which pattern discovery derives the change patterns that drive the crawler.

SLIDE 8

Segmentation: VIPS [Cai03] extension

The page is segmented into a hierarchy of blocks (B1; B2 with sub-blocks B2.1, B2.2, B2.3; B3). For each block, links, images and texts are extracted and stored in a Vi-XML document:

<xml>
  <Page url='' version=''>
    <Block ref='B1' pos=''>
      <Links id=''>
        <link name='Link1' adr=''/>
        <link name='Link2' adr=''/>
      </Links>
      <Images id=''>
        <img name='img1' src=''/>
      </Images>
      <Texts id='' text='Text1'/>
    </Block>
    <Block ref='B2' id=''>
      …
    </Block>
  </Page>
</xml>
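For illustration, a Vi-XML document in this spirit can be handled with standard XML tooling. The fragment below only mimics the slide's structure; the attribute values are placeholders, not real data:

```python
import xml.etree.ElementTree as ET

# Minimal, hypothetical Vi-XML fragment shaped like the slide's example.
vi_xml = """
<Page url="http://example.org" version="3">
  <Block ref="B1" pos="0,0,800,120">
    <Links><link name="Link1" adr="http://example.org/a"/></Links>
    <Texts id="t1" text="Header text"/>
  </Block>
  <Block ref="B2">
    <Block ref="B2.1"/>
    <Block ref="B2.2"/>
  </Block>
</Page>
"""

root = ET.fromstring(vi_xml)
# Collect every block reference, including nested sub-blocks (document order).
refs = [b.get("ref") for b in root.iter("Block")]
print(refs)  # ['B1', 'B2', 'B2.1', 'B2.2']
```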

SLIDE 9

Changes detection: Vi-DIFF

Vi-DIFF compares the Vi-XML documents of version n-1 and version n and detects:
• structural changes
• content changes

Complexity: O(n log n), instead of O(n²) for generic tree-diff approaches.
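Vi-DIFF itself is not reproduced here; the sketch below only illustrates the idea that hashing block content lets versions be compared without pairwise O(n²) block matching:

```python
from hashlib import sha1

def block_hash(block):
    # Hash a block's textual content; the real Vi-DIFF also compares
    # links and images, not just text.
    return sha1(block["text"].encode()).hexdigest()

def diff_blocks(old_blocks, new_blocks):
    """Match blocks by reference and content hash instead of comparing
    every pair of blocks, keeping the work near O(n log n)."""
    old = {b["ref"]: block_hash(b) for b in old_blocks}
    new = {b["ref"]: block_hash(b) for b in new_blocks}
    updated  = {r for r in old if r in new and old[r] != new[r]}
    inserted = set(new) - set(old)
    deleted  = set(old) - set(new)
    return updated, inserted, deleted

v1 = [{"ref": "B1", "text": "news"}, {"ref": "B2", "text": "menu"}]
v2 = [{"ref": "B1", "text": "fresh news"}, {"ref": "B3", "text": "ads"}]
print(diff_blocks(v1, v2))  # ({'B1'}, {'B3'}, {'B2'})
```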

SLIDE 10

From changes to crawl scheduling

• From successive delta files we can compute change patterns for pages
• Blocks are weighted according to [Song&al@WWW04]
• Change importance between page versions = (normalized) weighted sum of the block changes
• From page patterns and the last crawl date, we can compute an urgency function that estimates the change importance accumulated on the page since the last crawl
• We crawl the page that has maximum urgency
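A minimal scheduler in this spirit (the urgency model below, a flat per-hour rate, is a deliberate simplification of pattern-based accumulation; URLs and rates are made up):

```python
def urgency(hourly_change_importance, hours_since_crawl):
    """Change importance assumed to accumulate linearly since the last
    crawl; the real approach accumulates along the page's change pattern."""
    return hourly_change_importance * hours_since_crawl

def pick_next(pages):
    # Greedy policy from the slide: crawl the page with maximum urgency.
    return max(pages, key=lambda p: urgency(p["rate"], p["age_h"]))

pages = [
    {"url": "news.example/home",    "rate": 0.9, "age_h": 4},   # hot page
    {"url": "news.example/archive", "rate": 0.1, "age_h": 30},  # cold page
]
print(pick_next(pages)["url"])  # news.example/home
```

Note that the cold page eventually wins once its accumulated urgency (0.1 per hour) overtakes the hot page's, so every page gets recrawled.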

SLIDE 11

Experiments

Page versions are crawled from a "complete" archive, so that we can compute the completeness of each strategy.

SLIDE 12

2. Accessing Web archives

What if we click on this link?

SLIDE 13

Coherent navigation

Related Work

 Recent: it returns the closest

version before tq: P2[t1]

 Nearest: it returns the nearest

version by minimizing |tq – tx|:P2[t2] Our approach

 Choose the version of P2 that

has the maximum probability to be coherent with P1[tq] according to P1 and P2s patterns

13 / 54

P2[t1] P2[t2] ? ? P1[tq] [tq] Crawl time Ø
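The two baselines are easy to state precisely (the coherence-maximizing choice itself depends on the learned change patterns and is not reproduced here; the timestamps are hypothetical):

```python
def recent(capture_times, tq):
    """Recent: the closest version captured at or before tq."""
    before = [t for t in capture_times if t <= tq]
    return max(before) if before else None

def nearest(capture_times, tq):
    """Nearest: the version minimizing |tq - tx|."""
    return min(capture_times, key=lambda t: abs(tq - t))

p2 = [3, 8, 15]   # capture times of page P2
tq = 12           # timestamp of the P1 version we navigate from
print(recent(p2, tq), nearest(p2, tq))  # 8 15
```

The two strategies disagree here: Recent never looks forward in time, while Nearest may jump past tq, which is exactly why neither guarantees coherence.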

SLIDE 14

Experiments

• Count how many times the coherent version is chosen
• Dataset:
  • 60 France TV channels
  • 1000 hourly crawls
• Simulation:
  • Links between pages
  • Crawling
• Results:
  • 15% better than Nearest
  • 40% better than Recent

SLIDE 15

Access to WACs today

• Wayback Machine, Internet Archive

SLIDE 16

Access to WACs today

• Full-text search

SLIDE 17

Why Query Language?

Web users ≠ WAC users

• For Web users, full-text search and navigation are usually sufficient
• Not sufficient for WAC users: historians, journalists, researchers, Web philologists, Web archaeologists

SLIDE 18

Operators

• Classic operators
• Search only within blocks: InBlock
• Get the version of page p at time t: Wayback
  • Incomplete archive: if p was not crawled at t: Nearest / Recent / Coherent
• Navigational operators: in, out, jump

[Figure: navigation graph over pages A, B, C, D, E]

SLIDE 19

Static Index Pruning

• Index compression: discard postings (term, doc) that are less likely to affect retrieval performance
• So that (part of) the index fits in main memory
• Off-line

State of the art:
• Give a score to each posting
• Based on a threshold, filter out a part of the postings
• Obtain a prune ratio (% of removed postings)
• Methods: Random, TCP (Carmel et al., SIGIR '01), IP-u (Chen et al., CIKM '12), 2N2P (Thota et al., ECIR '11), PRPP (Blanco et al., ACM '10)
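The common skeleton of these methods can be sketched as follows (posting scores here are arbitrary; each cited method differs in how it computes them):

```python
def prune_index(postings, scores, prune_ratio):
    """Rank (term, doc) postings by score and drop the prune_ratio
    fraction least likely to affect retrieval performance."""
    ranked = sorted(postings, key=lambda p: scores[p], reverse=True)
    keep = int(len(ranked) * (1 - prune_ratio))
    return set(ranked[:keep])

postings = [("iraq", 1), ("iraq", 2), ("war", 1), ("war", 3)]
scores = {("iraq", 1): 0.9, ("iraq", 2): 0.2,
          ("war", 1): 0.7, ("war", 3): 0.4}
pruned = prune_index(postings, scores, prune_ratio=0.5)
print(sorted(pruned))  # [('iraq', 1), ('war', 1)]
```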

SLIDE 20

Introducing diversification

• Existing pruning techniques were not designed with time in mind
• The temporal dimension should be preserved while pruning
• An ANOVA test shows a link between temporal coverage and retrieval performance
• We designed 3 methods that take diversification into account for pruning
• Temporal aspect models: windows of fixed size (simple and sliding) and dynamic (Gaussian mixture model)
• Example query: "Iraq War"
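A fixed-window variant might look like the sketch below (a toy illustration under an assumed data layout, not the paper's algorithms): pruning happens inside each time window separately, so no period of the archive loses all its postings.

```python
from collections import defaultdict

def diversified_prune(postings, prune_ratio, window=5):
    """Prune inside each time window separately, so every period keeps
    at least one posting and temporal coverage is preserved."""
    buckets = defaultdict(list)
    for term, doc, t, score in postings:
        buckets[t // window].append((score, (term, doc, t)))
    kept = []
    for bucket in buckets.values():
        bucket.sort(reverse=True)  # best scores first
        keep = max(1, int(len(bucket) * (1 - prune_ratio)))
        kept.extend(p for _, p in bucket[:keep])
    return kept

posts = [("war", 1, 0, 0.9), ("war", 2, 1, 0.8),   # early, high-scored window
         ("war", 3, 7, 0.1), ("war", 4, 8, 0.2)]   # late, low-scored window
print(sorted(diversified_prune(posts, prune_ratio=0.5)))
# [('war', 1, 0), ('war', 4, 8)]  -- the late period still survives
```

A global score threshold would have discarded both late postings; windowing is what keeps the "Iraq War" timeline covered.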

SLIDE 21

Experiments - Dataset

• The English Wikipedia of July 7, 2009, with temporal queries and relevance judgements [Berberich et al. ECIR '10]
• 2,300,277 Wikipedia articles with temporal expressions
• 40 queries with a temporal dimension and related relevance judgements
• Our methods outperform the existing ones, mostly when the prune ratio is high
• We are currently experimenting on larger archives (Portuguese Web Archive)

SLIDE 22

Results


SLIDE 23

3. Web page segmentation

• Detect blocks of information in a page
• Many applications in data preservation and access:
  • Crawl scheduling (cf. first part of the talk)
  • Emulation control: check whether an archived page can be properly rendered with a new browser (if not, keep the old browser)
  • Migration control: archived pages must be migrated (e.g. archive file format changes); check that rendering is the same before and after migration
  • Mobile devices (small screens): display blocks, not the whole page
  • HTML4 to HTML5 migration (map blocks to HTML5 tags, current work)
  • Etc.

SLIDE 24

Block-o-Matic: web page segmentation

[Pipeline] W (web page) → rendering → DOM → analysis → content structure → understanding → logical structure → reconstruction → W′

Content categories: root, tabular, forms, links, …
Labels: header, navigation, article, content, …

SLIDE 25

Evaluation Method

Ground truth vs. MoB tool segmentation: count the number of elements in common.

• F. Shafait, D. Keysers, and T. Breuel. Performance evaluation and benchmarking of six page segmentation algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(6):941–954, 2008.
SLIDE 26

Experiments

SLIDE 27

Future/current work

• Block-based information extraction:
  • Detecting blocks is an extraction task
  • Information extraction = add semantics to blocks
  • Automatic classification: specific rules based on geometry, text proportion, …
  • Extract the information contained in blocks
  • Leverage the segmentation to optimize the extraction process (e.g. objects extracted from the same block, adjacent blocks, …)
  • Related to linked objects
• ML / image processing techniques (cf. N. Thome's presentation)
• Enhance the comparison algorithm (classifier: similar / different)
• Learning block weights (go beyond Song's approach)

SLIDE 28

Obrigado. (Thank you.)