SLIDE 1 Scalable, Generic, and Adaptive Systems for Focused Crawling
Georges Gouriten* - georges@netiru.fr Silviu Maniu° Pierre Senellart*°
* Télécom Paristech – Institut Mines-Télécom – LTCI CNRS ° Hong Kong University
SLIDE 2
What is focused crawling?
SLIDE 3
A directed graph
SLIDE 4
Web Social network P2P etc.
SLIDE 5 Weighted
[Figure: the directed graph with a weight on each node]
SLIDE 6
Let u be a node, β(u) = count of the word Bhutan in all the tweets of u
SLIDE 7 Even more weighted
[Figure: the directed graph with a weight on each edge]
SLIDE 8
Let (u, v) be an edge, α(u, v) = count of the word Bhutan in all the tweets of u mentioning v
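The two weights can be illustrated with a minimal sketch. The tweet representation (a dict with `text` and `mentions` keys) and the case-insensitive whole-word matching are assumptions made for this example; the slides do not specify either.

```python
import re

# Assumption: whole-word, case-insensitive matching of the topic word.
WORD = re.compile(r"\bBhutan\b", re.IGNORECASE)


def count_word(text):
    """Number of occurrences of the topic word in one tweet."""
    return len(WORD.findall(text))


def beta(tweets):
    """Node score: occurrences of the word in all the tweets of a user u."""
    return sum(count_word(t["text"]) for t in tweets)


def alpha(tweets, v):
    """Edge score for (u, v): occurrences in the tweets of u that mention v."""
    return sum(count_word(t["text"]) for t in tweets if v in t["mentions"])


# Hypothetical tweets of a user u.
tweets_of_u = [
    {"text": "Bhutan trip with @v, Bhutan is beautiful", "mentions": ["v"]},
    {"text": "Reading about Bhutan", "mentions": []},
    {"text": "Lunch with @v", "mentions": ["v"]},
]
```

With this data, `beta(tweets_of_u)` is 3 and `alpha(tweets_of_u, "v")` is 2.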
SLIDE 9
The total graph
SLIDE 10
A seed list
SLIDE 11 The frontier
[Figure: the frontier on the example graph]
SLIDE 12 Crawling one node
[Figure: crawling one node on the example graph]
SLIDE 13
A crawl sequence
Let V0 be the seed list, a set of nodes. A crawl sequence starting from V0 is a sequence (v0, v1, ..., vn) such that, for every i,
vi ∈ frontier(V0 ∪ {v0, v1, ..., vi-1})
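The definition can be checked mechanically. The sketch below assumes that the frontier of a set of crawled nodes is the set of uncrawled nodes reachable by one outgoing edge from it; the slides show the frontier only in a figure, so this reading is an assumption, and the adjacency-dict representation is hypothetical.

```python
def frontier(graph, crawled):
    """Uncrawled nodes pointed to by at least one crawled node (assumed definition)."""
    reached = set()
    for u in crawled:
        reached.update(graph.get(u, ()))
    return reached - set(crawled)


def is_crawl_sequence(graph, seeds, sequence):
    """True if every node of the sequence is in the frontier when it is crawled."""
    crawled = set(seeds)
    for v in sequence:
        if v not in frontier(graph, crawled):
            return False
        crawled.add(v)
    return True


# Hypothetical toy graph: node -> list of out-neighbors.
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
```

Starting from the seed list `{"a"}`, the sequence `["b", "d", "c"]` is a crawl sequence, while `["d", "b"]` is not, because `d` is not yet in the frontier when it is picked.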
SLIDE 14
Goal of a focused crawler
Produce crawl sequences with global scores (sum) as high as possible
SLIDE 15
A high-level algorithm
Estimate scores at the frontier
Pick a node from the frontier
Crawl the node
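These three steps can be sketched as a loop with a pluggable estimator. The function names, the adjacency-dict graph, the `budget` parameter, and the choice of always picking the node with the highest estimate are assumptions for illustration, not the authors' implementation.

```python
def focused_crawl(graph, scores, seeds, estimate, budget):
    """Run `budget` steps of estimate / pick / crawl and return (sequence, total score)."""
    crawled = set(seeds)
    sequence = []
    for _ in range(budget):
        # Frontier: uncrawled out-neighbors of crawled nodes (assumed definition).
        front = set()
        for u in crawled:
            front.update(graph.get(u, ()))
        front -= crawled
        if not front:
            break
        # 1. Estimate scores at the frontier.
        estimates = {v: estimate(v, crawled) for v in front}
        # 2. Pick a node (here: highest estimate, ties broken by node name).
        picked = max(sorted(front), key=estimates.get)
        # 3. Crawl the node: its true score is now known.
        crawled.add(picked)
        sequence.append(picked)
    return sequence, sum(scores[v] for v in sequence)


# Hypothetical data: true scores and an imperfect estimator.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
scores = {"a": 0, "b": 1, "c": 3, "d": 5, "e": 2}
guesses = {"b": 1.0, "c": 2.5, "d": 0.5, "e": 4.0}
sequence, total = focused_crawl(graph, scores, {"a"}, lambda v, crawled: guesses[v], 2)
```

Here the crawl visits `c` then `e`, for a total score of 5.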
SLIDE 16
Supposing a perfect estimator
SLIDE 17
Finding an optimal crawl sequence offline: NP-hard
Greedy wins once the crawled graph exceeds 1000 nodes
A refresh rate of 1 is better
SLIDE 18
Estimation in practice
SLIDE 19
Different kinds of estimators
SLIDE 20 bfs
[Figure: breadth-first search on the example graph]
SLIDE 21 bfs
[Figure: breadth-first search on the example graph, continued]
SLIDE 22
bfs
SLIDE 23
nr
navigational rank: score propagation from the ancestors of a node, then to the children of a node
SLIDE 24
nr
SLIDE 25
opic
online page importance computation
~ online PageRank computation
SLIDE 27 Open spaces in the state-of-the-art
nr has a quadratic complexity
the rest is about how to score
SLIDE 28
First-level neighborhood
SLIDE 29
Second-level neighborhood
SLIDE 30
Neighborhood-based estimators
SLIDE 31
deg, e, n, ne
deg: number of neighbors
e: sum of the scores of incoming edges
n: sum of the scores of incoming nodes
ne: sum of the (node score × edge score) products over incoming edges
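A minimal sketch of the four features for one frontier node follows. It assumes that only already crawled in-neighbors contribute (their scores are the only ones known), and that `deg` counts those in-neighbors; the `in_edges` structure and function name are hypothetical.

```python
def neighborhood_features(v, in_edges, beta, crawled):
    """Compute deg, e, n and ne for frontier node v.

    in_edges[v] maps each in-neighbor u of v to the edge score alpha(u, v).
    Assumption: only crawled in-neighbors are taken into account.
    """
    preds = {u: a for u, a in in_edges.get(v, {}).items() if u in crawled}
    return {
        "deg": len(preds),                                   # number of neighbors
        "e": sum(preds.values()),                            # sum of edge scores
        "n": sum(beta[u] for u in preds),                    # sum of node scores
        "ne": sum(beta[u] * a for u, a in preds.items()),    # sum of products
    }


# Hypothetical data: v has three in-neighbors, two of them crawled.
in_edges = {"v": {"a": 2, "b": 1, "c": 3}}
beta = {"a": 3, "b": 5, "c": 4}
features = neighborhood_features("v", in_edges, beta, crawled={"a", "b"})
```

With this data, `features` is `{"deg": 2, "e": 3, "n": 8, "ne": 11}`.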
SLIDE 32
Linear regressions
SLIDE 33 Multi-armed bandits (1)
slot machine 1 slot machine 2 slot machine 3 slot machine 4 ...
SLIDE 34
Multi-armed bandits (2)
Budget n, how to maximize the reward?
Balance exploration and exploitation
SLIDE 35
Applied to focused crawling
Slot machines: estimators
Reward: score of the top node
SLIDE 36
mab_ε
probability 1-ε: slot machine with the highest average reward
probability ε: random slot machine
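This ε-greedy rule can be sketched as follows. Each slot machine stands for one estimator; the list of average rewards and the function name are assumptions for the example.

```python
import random


def epsilon_greedy_pick(avg_rewards, epsilon, rng=random):
    """Return the index of the slot machine (estimator) to play at this step."""
    if rng.random() < epsilon:
        # Exploration, with probability epsilon: a random slot machine.
        return rng.randrange(len(avg_rewards))
    # Exploitation, with probability 1 - epsilon: highest average reward so far.
    return max(range(len(avg_rewards)), key=avg_rewards.__getitem__)


# Hypothetical average rewards of three estimators.
avg_rewards = [0.1, 0.9, 0.4]
```

With `epsilon=0.0` the function always returns index 1, the estimator with the highest average reward; with `epsilon=1.0` every choice is random.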
SLIDE 37
mab_ε-first
steps [0, ⌊ε × N⌋]: random slot machine
steps [⌊ε × N⌋ + 1, N]: slot machine with the highest average reward
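The ε-first schedule differs from ε-greedy only in when exploration happens: all of it comes first. A minimal sketch, with hypothetical names and the step index passed in explicitly:

```python
import math
import random


def epsilon_first_pick(step, total_steps, avg_rewards, epsilon, rng=random):
    """Return the index of the slot machine (estimator) to play at `step`."""
    if step <= math.floor(epsilon * total_steps):
        # Exploration phase: a random slot machine.
        return rng.randrange(len(avg_rewards))
    # Exploitation phase: the slot machine with the highest average reward.
    return max(range(len(avg_rewards)), key=avg_rewards.__getitem__)


# Hypothetical average rewards of three estimators.
avg_rewards = [0.1, 0.9, 0.4]
```

With `total_steps=10` and `epsilon=0.5`, steps 0 to 5 are random and steps 6 to 10 always return index 1.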
SLIDE 38
mab_var
Succession of ε-first strategies, with a reset every r steps, r varying with the context
SLIDE 39
Their running times
SLIDE 40 Expected running times
Twitter API for one week:
One domain website for one week:
SLIDE 41
Experimental framework (1)
SLIDE 42
Experimental framework (2)
─ Graph score
  10 seed graphs
  1 seed graph: 50 seeds picked randomly among nodes with non-zero β
  Arithmetic average of the crawl scores (sum)
─ Global score
  Normalization with a baseline, giving a relative score
  Geometric average among the five graphs
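The two aggregation levels can be sketched as below. Dividing by the baseline's graph score is an assumed reading of "normalization with a baseline"; the slide does not say which baseline is used, and all numbers are made up.

```python
from statistics import geometric_mean, mean


def graph_score(crawl_scores):
    """Arithmetic average of the crawl scores obtained on the seed graphs of one graph."""
    return mean(crawl_scores)


def global_score(graph_scores, baseline_graph_scores):
    """Geometric average, over the graphs, of the scores relative to a baseline.

    Assumption: the relative score is the ratio graph score / baseline graph score.
    """
    relative = [s / b for s, b in zip(graph_scores, baseline_graph_scores)]
    return geometric_mean(relative)


# Hypothetical numbers for two graphs (the slides use five).
example_graph_score = graph_score([10, 20, 30])
example_global_score = global_score([20, 40], [10, 5])
```

Here the relative scores are 2 and 8, so the global score is their geometric mean, 4. A geometric mean keeps one graph with a very large ratio from dominating the aggregate.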
SLIDE 43
Datasets and code are online
http://netiru.fr/research/14fc
SLIDE 44
To measure the running times
Same crawl sequence: the oracle
Storage in RAM (20 GB)
3.6 GHz CPU
SLIDE 45
The running times (ms)
SLIDE 46
nr
Quadratic complexity, with large constant factors
SLIDE 47
Their precision
SLIDE 48
The precision
Same crawl sequence: the oracle
Precision: distance of the top node to the actual top node
Arithmetically averaged over a window of 1000 steps
SLIDE 49
For bretagne
SLIDE 50
Their ability to lead crawls
SLIDE 51
Leading the crawl
Different crawl sequences: defined by the top estimated nodes
SLIDE 52
Average graph scores for France
SLIDE 53
The multi-armed bandits
SLIDE 54
All the estimators
SLIDE 55
Conclusion
SLIDE 56
What we learnt
Generic model
NP-hardness offline
Refresh rate of 1
Greedy
Neighborhood features
Linear regressions
Multi-armed bandit strategy
SLIDE 57
Future work
Approximation of the optimal score
Distributed crawl
Recrawling nodes
Further multi-armed bandits comparisons
SLIDE 58
Thank you.
georges@netiru.fr
SLIDE 59
Finding the optimal crawl sequences in a known graph
SLIDE 60
PTime many-one reduction from the LST-Graph problem
Problem remains hard if nodes, not edges, are weighted
A subtree rooted at r is seen as a crawl sequence starting from r
Free edges are added to the graph to allow free crawls from the seed to any potential root of a subtree
SLIDE 61
Rich friends will make you richer
SLIDE 62
The greedy strategy
Node picked = argmax(β(v)), v in frontier
SLIDE 63 Is not always optimal
[Figure: a weighted example graph on which the greedy strategy is not optimal]
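A small counterexample makes the point concrete. The graph below is a hypothetical toy, not the one drawn on the slide, and the frontier is assumed to be the set of uncrawled out-neighbors of crawled nodes. Greedy is given perfect knowledge of β, and the optimum is found by exhaustive search, which is only feasible on tiny graphs.

```python
def frontier(graph, crawled):
    """Uncrawled nodes pointed to by at least one crawled node (assumed definition)."""
    reached = set()
    for u in crawled:
        reached.update(graph.get(u, ()))
    return reached - crawled


def greedy_score(graph, beta, seeds, budget):
    """Total score when always crawling the frontier node with the highest beta."""
    crawled, total = set(seeds), 0
    for _ in range(budget):
        front = frontier(graph, crawled)
        if not front:
            break
        picked = max(sorted(front), key=beta.get)
        crawled.add(picked)
        total += beta[picked]
    return total


def optimal_score(graph, beta, crawled, budget):
    """Best total score over all crawl sequences of at most `budget` steps."""
    if budget == 0:
        return 0
    return max(
        (beta[v] + optimal_score(graph, beta, crawled | {v}, budget - 1)
         for v in frontier(graph, crawled)),
        default=0,
    )


# Hypothetical graph: the high-scoring node d hides behind the low-scoring node b.
graph = {"s": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
beta = {"s": 0, "a": 3, "b": 2, "c": 1, "d": 20}
```

With a budget of 2 steps from seed `s`, greedy crawls `a` then `b` for a score of 5, while the optimal sequence `b`, `d` scores 22.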
SLIDE 64
The altered greedy strategy
Node picked =
probability q: argmax(β(v))
probability 1-q: random v so that max(β(u)) - β(v) <= ζ × max(β(u))
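The altered rule can be sketched as follows, taking the maximum over the frontier nodes. The function name and the dict of frontier scores are hypothetical, and scores are assumed to be non-negative (they are word counts), so the best node always satisfies the ζ condition.

```python
import random


def altered_greedy_pick(frontier_scores, q, zeta, rng=random):
    """Pick a frontier node; frontier_scores maps node -> beta (assumed non-negative)."""
    best = max(frontier_scores.values())
    if rng.random() < q:
        # With probability q: the plain greedy choice.
        return max(sorted(frontier_scores), key=frontier_scores.get)
    # With probability 1 - q: a random node whose score is within zeta * best of the best.
    near_best = sorted(v for v, s in frontier_scores.items() if best - s <= zeta * best)
    return rng.choice(near_best)


# Hypothetical frontier scores.
frontier_scores = {"a": 10, "b": 6, "c": 2}
```

With `zeta=0.5`, nodes `a` and `b` qualify for the random branch (10 - 6 = 4 is at most 5) and `c` does not (10 - 2 = 8 exceeds 5).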
SLIDE 65
Altered greedy vs greedy for jazz
SLIDE 66
The refresh rate disadvantage
SLIDE 67
When estimation takes too long
SLIDE 68
The score degradation (%) at different steps