The Evolution of Web Content and Search Engines
Ricardo Baeza-Yates, Yahoo! Research, DCC/UChile, UPF/Spain
Álvaro Pereira Jr, DCC/UFMG/Brazil
Nivio Ziviani, DCC/UFMG/Brazil
WEBKDD’06, August 20, 2006, Philadelphia, USA
To state the following hypothesis:
When pages have sources (content originated from other
engines
To study how new content is generated in the web
How old content is used to compose new pages
Definition of genealogical trees for the web
Objective:
from old documents, returned by the same query
For that we simulate a user performing a query in the search
We used a set of the most frequent queries of each query log
Algorithm divided into two steps
First step: finding new documents that have content from the old documents
Second step: filter the documents
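The first step can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes documents are plain text, that paragraphs are separated by blank lines, and that two paragraphs match only when they are identical after whitespace normalization. The names `split_paragraphs`, `find_candidates`, and `min_identical` are hypothetical.

```python
from collections import defaultdict


def split_paragraphs(text):
    """Split text on blank lines and normalize whitespace (an assumed paragraph definition)."""
    paragraphs = []
    for block in text.split("\n\n"):
        normalized = " ".join(block.split())
        if normalized:
            paragraphs.append(normalized)
    return paragraphs


def find_candidates(old_docs, new_docs, min_identical=1):
    """Map each new URL to the old URLs it shares identical paragraphs with.

    old_docs and new_docs are dicts of URL -> document text.
    """
    # Index every old paragraph by the old documents that contain it.
    index = defaultdict(set)
    for url, text in old_docs.items():
        for paragraph in split_paragraphs(text):
            index[paragraph].add(url)

    candidates = {}
    for new_url, text in new_docs.items():
        shared = defaultdict(set)  # old URL -> distinct shared paragraphs
        for paragraph in set(split_paragraphs(text)):
            for old_url in index.get(paragraph, ()):
                shared[old_url].add(paragraph)
        # Keep only old documents sharing at least min_identical paragraphs.
        kept = {u: p for u, p in shared.items() if len(p) >= min_identical}
        if kept:
            candidates[new_url] = kept
    return candidates
```

Raising `min_identical` makes the match stricter, so fewer new documents qualify as derived from old ones.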
Number of paragraphs in both old and new
New document composed by two old documents
At least two distinct paragraphs from each old document
The new document URL cannot exist in the old
Duplicates are not allowed for both old and new
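The conditions listed above could be checked with a filter along these lines. It is a hedged sketch, not the authors' code: the function name `passes_filter` and its parameters are hypothetical, "duplicates" is interpreted here as old documents contributing exactly the same set of paragraphs, and the paragraph-count condition is not modeled because its wording is incomplete in the slide text.

```python
def passes_filter(new_url, shared, old_urls, min_sources=2, min_paragraphs=2):
    """Return True if a candidate new document satisfies the listed conditions.

    shared: dict of old URL -> set of paragraphs the new document shares with it.
    old_urls: set of all URLs in the old collection.
    """
    # The new document's URL must not already exist in the old collection.
    if new_url in old_urls:
        return False
    # Keep only old documents contributing enough distinct paragraphs.
    sources = {u: p for u, p in shared.items() if len(p) >= min_paragraphs}
    # Assumption: old documents contributing the same paragraph set are duplicates.
    distinct = {frozenset(p) for p in sources.values()}
    # The new document must be composed from at least min_sources old documents.
    return len(distinct) >= min_sources
```

With the defaults, a new document passes only if at least two non-duplicate old documents each contribute at least two distinct paragraphs.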
[Figure labels: Log 2002, Log 2003, Log 2004]
Real query log
[Figure: new documents found versus minimal number of identical paragraphs; series: log 2002, log 2003, and log 2004 on collection 2003]
Different query logs on old collection 2003 and new
Main components of the tree considering collection
Sample of 120,000 documents
We have presented evidence that a portion of the
A significant portion of the Web has evolved from old
The number of copies from previously copied web
Do search engines contribute to this situation?
[Figure, two panels (bimonthly logs 2002; bimonthly logs 2004): new documents found versus minimal number of identical paragraphs; series: bimonthly logs 1 to 5]
[Figure, two panels: new documents found versus minimal number of identical paragraphs; first panel series: log 2002 and log 2004 on collection 2002; second panel series: log 2004 and log 2002 on collection 2004]
Bimonthly logs 4 and
Main component of the tree considering collection
Sample of 120,000 documents