WEB ARCHIVING @ LIP6
STÉPHANE GANÇARSKI
- Z. PEHLIVAN, M. CORD, M. BEN-SAAD, A. SANOJA, N. THOME, M. LAW
French ANR project Cartec (ended), European project Scape (ends 9/2014)
WEB ARCHIVES
The Web is ephemeral and constantly evolving: we need to preserve its information.
Web: 50,000,000,000 pages (Google@2012), non-cumulative index, main concern: freshness
Web archives: 165,000,000,000 pages (Internet Archive@2012), cumulative index, main concerns: coherence, completeness, preservation
Web archiving tasks:
- Crawling: temporal completeness and coherence; spatial completeness (discovery); avoid duplicates
- Access: queries, IR, indexing, navigation, coherence
- Preservation: emulation, migration, cloud computing
Efficient crawling: maximize temporal completeness and coherence under limited resources (bandwidth, politeness, storage, ...).
- Temporal completeness: how well the archive captures the history of a page. Relevant for medium-size archives (e.g. the INA legal deposit); for large archives, spatial completeness matters more.
- Temporal coherence: capture versions of different pages that were present at the same time on the site.
Change importance and page importance together weight each version. Completeness of the archive:

Completeness = Σ importance(captured versions) / Σ importance(all versions that appeared on the Web)
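A minimal sketch of this measure in Python, assuming the importance scores have already been estimated (the names and the dict-based interface are illustrative, not the project's code):

def weighted_completeness(captured, appeared):
    """captured/appeared: version id -> importance score
    (change importance x page importance)."""
    total = sum(appeared.values())
    if total == 0:
        return 1.0  # no version ever appeared: trivially complete
    return sum(captured.values()) / total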
What are the changes between version n-1 and version n (e.g. inserts, updates)? Are they important? It depends on where the change occurs.
Change importance is related to what users see:
- Render pages before analysis
- Users see blocks of information
- Use web page segmentation
[Architecture: a crawler fetches version V(n) of a page from the Web and stores it in the archive; change detection compares V(n) with V(n-1); importance estimation and pattern discovery maintain per-page time series of change patterns]
[Figure: VIPS segmentation of a page into a block tree: Page → B1, B2 (B2.1, B2.2, B2.3), B3]
VIPS extension: extract the links, images and texts of each block.
Vi-XML document:

<?xml version="1.0"?>
<Page url="" version="">
  <Block ref="B1" pos="">
    <Links id="">
      <link name="" adr=""/>
      <link name="" adr=""/>
    </Links>
    <Images id="">
      <img name="" src=""/>
    </Images>
    <Texts id="" text=""/>
  </Block>
  <Block ref="B2" id="">
    ...
  </Block>
</Page>
[Figure: the corresponding Vi-XML tree: blocks B1, B2.1, B2.2, B2.3, B3, each carrying IDs/IDLists; leaves hold links (name, adr), images (name, src) and texts]
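Since Vi-XML is plain XML, it can be inspected with a standard parser; a small sketch assuming the element names above (not the project's actual tooling):

import xml.etree.ElementTree as ET

def links_per_block(path):
    """Map each block's ref to the link addresses it contains."""
    root = ET.parse(path).getroot()
    return {block.get("ref"): [l.get("adr") for l in block.iter("link")]
            for block in root.iter("Block")}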
Segmentation is based on VIPS [Cai03].
Vi-DIFF: change detection between the Vi-XML documents of version n-1 and version n (block trees with their IDs). Exploiting the block structure reduces the complexity from O(n²) (generic tree diff) to O(n·log n) (sketched below).
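A toy sketch of the underlying idea: match blocks by a content signature instead of comparing all pairs (the actual Vi-DIFF algorithm is more involved):

def diff_blocks(old, new):
    """old/new: block id -> content string. Returns edit operations."""
    old_by_content = {c: b for b, c in old.items()}  # signature index
    new_contents = set(new.values())
    ops = []
    for b, c in new.items():
        if b in old:
            if old[b] != c:
                ops.append(("update", b))            # same block, new content
        elif c in old_by_content:
            ops.append(("move", old_by_content[c], b))  # content relocated
        else:
            ops.append(("insert", b))
    for b, c in old.items():
        if b not in new and c not in new_contents:
            ops.append(("delete", b))
    return ops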
From successive delta files we can compute change patterns for pages.
Blocks are weighted according to [Song&al@WWW04]; the change importance between two page versions is the (normalized) weighted sum of block changes.
From a page's patterns and its last crawl date, we compute an urgency function that estimates the change importance accumulated on the page since the last crawl.
We crawl the page that has maximum urgency (a sketch follows).
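An illustrative sketch of the scheduler, not the paper's implementation; a page's pattern is assumed to give the expected change importance for each hour of the day, and urgency accumulates it since the last crawl:

def next_page_to_crawl(pages, now_hour):
    """pages: dicts with 'pattern' (24 expected importances, one per
    hour of day) and 'last_crawl_hour' (integer timestamp)."""
    def urgency(page):
        return sum(page["pattern"][h % 24]
                   for h in range(page["last_crawl_hour"], now_hour))
    return max(pages, key=urgency)

After a page is crawled, its last_crawl_hour is reset, so its urgency drops to zero and other pages move up the ranking.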
Evaluation: page versions are crawled from a "complete" archive, so that we can compute the completeness achieved by each strategy.
What if we click on this link?
Related work:
- Recent: returns the closest version before tq: P2[t1]
- Nearest: returns the version minimizing |tq − tx|: P2[t2]
Our approach:
- Choose the version of P2 that has the maximum probability of being coherent with P1[tq], according to P1's and P2's change patterns (sketched after the timeline figure below)
[Timeline: crawl times of P1 and P2; candidate versions P2[t1] and P2[t2] around the query time tq of P1[tq]]
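A hedged sketch of the selection rule; the coherence probability would be derived from P1's and P2's change patterns, and is passed in here as a function:

def pick_coherent_version(p1_version, p2_versions, coherence_prob):
    """Return the archived version of P2 most likely to have
    coexisted with p1_version on the live site."""
    return max(p2_versions, key=lambda v2: coherence_prob(p1_version, v2))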
Evaluation: count how often the coherent version is chosen.
Dataset: 60 France TV channels, 1000 hourly crawls.
Simulation: links between pages, crawling.
Results: 15% better than Nearest, 40% better than Recent.
Wayback Machine, Internet Archive
Full-text search
Historians, journalists, researchers, Web philologists, Web archaeologists
Full-text search and navigation are usually sufficient for casual users, but not for these advanced users.
Classic operators:
- Search restricted to blocks
- Get the version of page p at time t: incomplete if p was not captured at t
Navigational operators:
- Navigate between archived pages following links (e.g. from page A to pages B and C, then D and E)
(A sketch of both operator kinds follows.)
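A sketch over a toy archive mapping each page URL to its list of (timestamp, version) captures; the interface is hypothetical:

def get_version(archive, page, t):
    """Classic operator: latest capture of `page` not after t, or
    None if the page was not captured by then (incompleteness)."""
    candidates = [(ts, v) for ts, v in archive.get(page, []) if ts <= t]
    return max(candidates, key=lambda c: c[0])[1] if candidates else None

def navigate(archive, link_url, t):
    """Navigational operator: follow a link found in a version
    displayed at time t, staying inside the archive."""
    return get_version(archive, link_url, t)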
Index pruning: discard the (term, doc) postings that matter least, so that (part of) the index fits in main memory; pruning is done off-line (a generic sketch follows the list of strategies).
Existing strategies: Random, TCP (Carmel et al. SIGIR '01), IP-u (Chen et al. CIKM '12), 2N2P (Thota et al. ECIR '11), PRPP (Blanco et al. ACM '10).
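A minimal sketch of off-line posting pruning (keep the top-k postings per term by score); this is illustrative, not one of the cited methods:

def prune_index(index, k):
    """index: term -> list of (doc_id, score). Keep the k
    highest-scoring postings per term."""
    return {term: sorted(postings, key=lambda p: p[1], reverse=True)[:k]
            for term, postings in index.items()}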
Existing pruning techniques were not designed with time in mind; the temporal dimension should be preserved while pruning. An ANOVA test shows a link between temporal coverage and retrieval performance.
We design 3 pruning methods that take temporal diversification into account. Temporal aspects are modeled with windows: fixed size (simple and sliding) and dynamic (Gaussian mixture model); the fixed-window variant is sketched below.
Example query: "Iraq War".
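A sketch of the fixed-window variant, under the assumed semantics that the best postings inside every time window are kept so each period of the archive stays represented after pruning:

def prune_time_aware(index, k, window):
    """index: term -> list of (doc_id, timestamp, score).
    Keep the top-k postings per term per fixed time window."""
    pruned = {}
    for term, postings in index.items():
        buckets = {}
        for doc_id, ts, score in postings:
            buckets.setdefault(ts // window, []).append((doc_id, ts, score))
        pruned[term] = [p for bucket in buckets.values()
                        for p in sorted(bucket, key=lambda x: x[2],
                                        reverse=True)[:k]]
    return pruned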
Dataset: the English Wikipedia of July 7, 2009, with 2,300,277 articles carrying temporal information, and 40 queries with a temporal dimension and associated relevance judgments.
Our methods outperform the existing ones in most settings. We are currently trying to experiment on larger archives.
Detect blocks of information in a page. Many applications in data preservation and access:
- Crawl scheduling (cf. first part of the talk)
- Emulation control: check whether an archived page renders properly with a new browser (if not, keep the old browser)
- Migration control: archived pages must be migrated (e.g. when the archive file format changes); check that rendering is the same before/after migration
- Mobile devices (small screens): display blocks, not the whole page
- HTML4 to HTML5 migration (map blocks to HTML5 tags, current work)
- Etc.
[Pipeline: page W → rendering (DOM) → analysis (content structure) → understanding (logical structure) → reconstruction → segmented page W']
Content categories: root, tabular, forms, links, ...
Labels: header, navigation, article, content, ...
Evaluation against a ground-truth segmentation built with the MoB tool: a computed segmentation is scored by the number of elements it places in common with the ground truth (evaluation scheme from IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(6):941–954, 2008). A sketch of such a score follows.
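A hedged approximation of a block-correspondence score (the actual metric follows the cited evaluation scheme): the share of ground-truth elements recovered by the best-matching computed block.

def segmentation_score(computed, truth):
    """computed/truth: block id -> set of DOM element ids."""
    total = sum(len(els) for els in truth.values())
    if total == 0:
        return 1.0
    shared = sum(max((len(els & c) for c in computed.values()), default=0)
                 for els in truth.values())
    return shared / total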
Block-based information extraction:
- Detecting blocks is itself an extraction task
- Information extraction = adding semantics to blocks
- Automatic classification: specific rules based on geometry, text proportion, ... (illustrated below)
- Extract the information contained in blocks; leverage the segmentation to optimize the extraction process (e.g. objects extracted from the same block, or from adjacent blocks, are related/linked)
ML/image processing techniques (cf. N. Thome's presentation):
- Enhance the comparison algorithm (a similar/different classifier)
- Learn block weights (go beyond Song's approach)
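An illustrative rule-based block classifier in the spirit described above; the features, thresholds and order of the rules are invented for the sketch:

def classify_block(block):
    """block: {'top': y position in px, 'height': px,
    'link_ratio': links per word, 'text_len': characters}."""
    if block["top"] == 0 and block["height"] < 120:
        return "header"      # thin band at the top of the page
    if block["link_ratio"] > 0.5:
        return "navigation"  # mostly links
    if block["text_len"] > 500:
        return "article"     # long running text
    return "content"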