Scalable, Generic, and Adaptive Systems for Focused Crawling

  1. Scalable, Generic, and Adaptive Systems for Focused Crawling. Georges Gouriten* (georges@netiru.fr), Silviu Maniu°, Pierre Senellart*°. * Télécom ParisTech, Institut Mines-Télécom, LTCI CNRS; ° Hong Kong University

  2. What is focused crawling?

  3. A directed graph

  4. Web, social network, P2P, etc.

  5. Weighted (figure: the graph with a weight on each node)

  6. Let u be a node, β(u) = count of the word Bhutan in all the tweets of u

  7. Even more weighted (figure: the graph with a weight on each edge)

  8. Let (u, v) be an edge, α(u, v) = count of the word Bhutan in all the tweets of u mentioning v

  9. The total graph (figure: the graph with node and edge weights)

  10. A seed list (figure: seed nodes highlighted in the graph)

  11. The frontier (figure: frontier nodes highlighted in the graph)

  12. Crawling one node (figure: one frontier node added to the crawled graph)

  13. A crawl sequence: let V_0 be the seed list, a set of nodes. A crawl sequence starting from V_0 is a sequence (v_0, v_1, ...) such that, for every i, v_i ∈ frontier(V_0 ∪ {v_0, v_1, ..., v_{i-1}})
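
The definition above can be sketched in Python. This is only an illustration, not the authors' code: it assumes that the frontier of a set of crawled nodes is the set of uncrawled nodes pointed to by at least one crawled node, and the names `frontier`, `is_crawl_sequence`, and the toy graph are hypothetical.

```python
from typing import Dict, Iterable, List, Set

# Directed graph as an adjacency map: node -> set of successors (assumed encoding).
Graph = Dict[str, Set[str]]


def frontier(graph: Graph, crawled: Iterable[str]) -> Set[str]:
    """Uncrawled nodes reachable through one outgoing edge of a crawled node."""
    crawled_set = set(crawled)
    reachable: Set[str] = set()
    for u in crawled_set:
        reachable |= graph.get(u, set())
    return reachable - crawled_set


def is_crawl_sequence(graph: Graph, seeds: Iterable[str], sequence: List[str]) -> bool:
    """Check that each v_i lies in frontier(V_0 ∪ {v_0, ..., v_{i-1}})."""
    crawled = set(seeds)
    for v in sequence:
        if v not in frontier(graph, crawled):
            return False
        crawled.add(v)
    return True


# Toy graph (hypothetical): a -> b, a -> c, b -> d.
toy_graph: Graph = {"a": {"b", "c"}, "b": {"d"}, "c": set(), "d": set()}
valid = is_crawl_sequence(toy_graph, {"a"}, ["b", "d"])    # d becomes reachable once b is crawled
invalid = is_crawl_sequence(toy_graph, {"a"}, ["d", "b"])  # d is not in the initial frontier
```

Under this reading, a node can only be crawled after at least one of its predecessors has been crawled, which is what makes the order of the sequence matter.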

  14. Goal of a focused crawler: produce crawl sequences with global scores (sum) as high as possible

  15. A high-level algorithm: estimate scores at the frontier; pick a node from the frontier; crawl the node
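
The three steps can be sketched as a loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the estimator interface `estimate(v, graph, known)`, the example estimator `parent_sum`, and the toy data are all hypothetical, and the true score β(v) is assumed to become known only once v is crawled.

```python
from typing import Callable, Dict, List, Set

Graph = Dict[str, Set[str]]  # node -> set of successors (assumed encoding)


def frontier(graph: Graph, crawled: Set[str]) -> Set[str]:
    """Uncrawled nodes pointed to by at least one crawled node."""
    return {v for u in crawled for v in graph.get(u, set()) if v not in crawled}


def crawl(graph: Graph, beta: Dict[str, float], seeds: List[str],
          estimate: Callable[[str, Graph, Dict[str, float]], float],
          budget: int) -> List[str]:
    crawled = set(seeds)
    known = {s: beta[s] for s in seeds}  # scores are known for crawled nodes only
    sequence: List[str] = []
    for _ in range(budget):
        front = frontier(graph, crawled)
        if not front:
            break
        # 1. Estimate scores at the frontier.
        estimates = {v: estimate(v, graph, known) for v in front}
        # 2. Pick the node with the highest estimate (ties broken by name).
        v = max(sorted(estimates), key=estimates.get)
        # 3. Crawl the node: its true score becomes known.
        crawled.add(v)
        known[v] = beta[v]
        sequence.append(v)
    return sequence


def parent_sum(v: str, graph: Graph, known: Dict[str, float]) -> float:
    """Hypothetical estimator: sum of the known scores of crawled predecessors of v."""
    return sum(score for u, score in known.items() if v in graph.get(u, set()))


toy_graph: Graph = {"s1": {"a"}, "s2": {"b"}, "a": {"c"}, "b": set(), "c": set()}
toy_beta = {"s1": 5, "s2": 1, "a": 4, "b": 0, "c": 2}
toy_sequence = crawl(toy_graph, toy_beta, ["s1", "s2"], parent_sum, budget=3)
```

Here the estimates are recomputed before every pick, so any estimator with this signature can be swapped in without touching the loop.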

  16. Supposing a perfect estimator

  17. Finding an optimal crawl sequence offline: NP-hard; greedy wins for a crawled graph of more than 1000 nodes; a refresh rate of 1 is better

  18. Estimation in practice

  19. Different kinds of estimators

  20. bfs (figure: breadth-first search on the node-weighted graph)

  21. bfs (figure: breadth-first search on the node-weighted graph, continued)

  22. bfs

  23. nr: navigational rank; score propagation from the ancestors of a node, then to the children of a node

  24. nr

  25. opic: online page importance computation, roughly an online PageRank computation

  26. opic (figure)

  27. Open spaces in the state of the art: nr has a quadratic complexity; opic focuses on popularity; the rest is about how to score

  28. First-level neighborhood

  29. Second-level neighborhood

  30. Neighborhood-based estimators

  31. deg, e, n, ne: deg = number of neighbors; e = sum of incoming edges; n = sum of incoming nodes; ne = sum of incoming (node × edge)s
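
One possible reading of these four estimators, sketched in Python. The interpretation is an assumption: "incoming" is taken to mean edges (u, v) from already crawled nodes u to the frontier node v, with node weights β(u) and edge weights α(u, v). The function name and the toy data are hypothetical.

```python
from typing import Dict, Set, Tuple


def neighborhood_estimates(v: str,
                           crawled: Set[str],
                           beta: Dict[str, float],
                           alpha: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Estimate the score of frontier node v from its crawled first-level neighborhood."""
    # Edges (u, v) whose source u has already been crawled (assumed meaning of "incoming").
    incoming = [(u, w) for (u, target), w in alpha.items()
                if target == v and u in crawled]
    return {
        "deg": len(incoming),                         # number of crawled neighbors
        "e": sum(w for _, w in incoming),             # sum of incoming edge weights
        "n": sum(beta[u] for u, _ in incoming),       # sum of incoming node weights
        "ne": sum(beta[u] * w for u, w in incoming),  # sum of node weight x edge weight
    }


# Toy data (hypothetical): a and b are crawled and both point to x.
toy_beta = {"a": 5, "b": 3}
toy_alpha = {("a", "x"): 2, ("b", "x"): 1, ("a", "y"): 4}
estimates_x = neighborhood_estimates("x", {"a", "b"}, toy_beta, toy_alpha)
```

All four values come from a single pass over the incoming edges of v, in contrast with the quadratic cost noted for nr.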

  32. Linear regressions

  33. Multi-armed bandits (1): slot machine 1, slot machine 2, slot machine 3, slot machine 4, ...

  34. Multi-armed bandits (2): with a budget n, how to maximize the reward? Balance exploration and exploitation

  35. Applied to focused crawling: slot machines = estimators; reward = score of the top node

  36. mab_ε: with probability 1-ε, the slot machine with the highest average reward; with probability ε, a random slot machine
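
The ε-greedy rule above can be sketched as follows. This is a generic illustration rather than the authors' code: the class name `EpsilonGreedy`, the arm names, and the choice to treat a never-played arm as having average reward 0 are assumptions.

```python
import random
from typing import Dict, List, Optional


class EpsilonGreedy:
    """Each arm (slot machine) stands for one estimator."""

    def __init__(self, arms: List[str], epsilon: float,
                 rng: Optional[random.Random] = None) -> None:
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.total: Dict[str, float] = {a: 0.0 for a in self.arms}
        self.count: Dict[str, int] = {a: 0 for a in self.arms}

    def average(self, arm: str) -> float:
        # Assumption: an arm that was never played has average reward 0.
        return self.total[arm] / self.count[arm] if self.count[arm] else 0.0

    def select(self) -> str:
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)     # probability ε: random arm
        return max(self.arms, key=self.average)   # probability 1-ε: best average so far

    def update(self, arm: str, reward: float) -> None:
        # In the crawling setting, the reward would be the score of the node
        # that the chosen estimator ranked first.
        self.total[arm] += reward
        self.count[arm] += 1


bandit = EpsilonGreedy(["bfs", "n", "ne"], epsilon=0.0, rng=random.Random(0))
bandit.update("bfs", 1.0)
bandit.update("n", 5.0)
bandit.update("ne", 2.0)
choice = bandit.select()  # with ε = 0 the choice is purely exploitative
```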

  37. mab_ε-first: steps [0, ⌊ε × N⌋]: random slot machine; steps [⌊ε × N⌋ + 1, N]: slot machine with the highest average reward
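
The ε-first schedule differs from ε-greedy only in when exploration happens. A minimal sketch, assuming the average rewards are tracked elsewhere and passed in as a dictionary; the function name and arguments are hypothetical.

```python
import math
import random
from typing import Dict


def epsilon_first_select(step: int, n_steps: int, epsilon: float,
                         averages: Dict[str, float], rng: random.Random) -> str:
    """Pure exploration for steps 0..floor(ε·N), pure exploitation afterwards."""
    if step <= math.floor(epsilon * n_steps):
        return rng.choice(sorted(averages))   # random slot machine
    return max(sorted(averages), key=averages.get)  # highest average reward


toy_averages = {"bfs": 1.0, "n": 5.0, "ne": 2.0}
rng = random.Random(0)
early = epsilon_first_select(0, 10, 0.5, toy_averages, rng)  # exploration phase
late = epsilon_first_select(6, 10, 0.5, toy_averages, rng)   # exploitation phase
```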

  38. mab_var: succession of ε-first strategies, with a reset every r steps, r varying with the context

  39. Their running times

  40. Expected running times: Twitter API for one week: 3 s per node, 200,000 nodes; one-domain website for one week: 1 s per node, 600,000 nodes (one week is 604,800 s)

  41. Experimental framework (1)

  42. Experimental framework (2). Graph score: 10 seed graphs; 1 seed graph = 50 seeds picked randomly among nodes with non-zero β; arithmetic average of the crawl scores (sum). Global score: normalization with a baseline (relative score); geometric average among the five graphs
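
The two-level aggregation can be sketched with the standard library. The function names and the numbers below are hypothetical; only the choice of averages (arithmetic within a graph, geometric across graphs after dividing by a baseline) comes from the slide.

```python
from statistics import geometric_mean, mean  # geometric_mean requires Python 3.8+
from typing import Sequence


def graph_score(crawl_scores: Sequence[float]) -> float:
    """Arithmetic average of the crawl scores obtained from the seed graphs of one graph."""
    return mean(crawl_scores)


def global_score(graph_scores: Sequence[float],
                 baseline_scores: Sequence[float]) -> float:
    """Geometric average, across graphs, of the scores relative to a baseline."""
    relative = [s / b for s, b in zip(graph_scores, baseline_scores)]
    return geometric_mean(relative)


# Made-up numbers, for illustration only.
example_graph_score = graph_score([10, 20, 30])
example_global_score = global_score([2.0, 8.0], [1.0, 1.0])
```

A geometric average is a natural fit for relative scores, since a graph where an estimator doubles the baseline and one where it halves it cancel out.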

  43. Datasets and code are online: http://netiru.fr/research/14fc

  44. To measure the running times: same crawl sequence for all estimators (the oracle's); storage in RAM (20 GB); 3.6 GHz CPU

  45. The running times (ms)

  46. nr: quadratic complexity, with large constant factors

  47. Their precision

  48. The precision: same crawl sequence for all estimators (the oracle's); precision = distance of the top estimated node to the actual top node; arithmetically averaged over a window of 1000 steps

  49. For bretagne

  50. Their ability to lead crawls

  51. Leading the crawl: different crawl sequences, each defined by the top estimated nodes

  52. Average graph scores for France

  53. The multi-armed bandits

  54. All the estimators

  55. Conclusion

  56. What we learnt: generic model; NP-hardness offline; refresh rate of 1; greedy; neighborhood features; linear regressions; multi-armed bandit strategy

  57. Future work: approximation of the optimal score; distributed crawl; recrawling nodes; further multi-armed bandit comparisons

  58. Thank you. georges@netiru.fr

  59. Finding the optimal crawl sequences in a known graph

  60. PTime many-one reduction from the LST-Graph problem; the problem remains hard if nodes, not edges, are weighted; a subtree rooted at r is seen as a crawl sequence starting from r; free edges are added to the graph to allow free crawls from the seed to any potential root of a subtree

  61. Rich friends will make you richer

  62. The greedy strategy: node picked = argmax(β(v)), v in frontier

  63. Is not always optimal (figure: a node-weighted graph on which greedy misses the best crawl sequence)
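
A small hypothetical example (not the graph from the slide) illustrates why greedy can miss the best sequence: a low-scoring node may be the only way to reach a high-scoring one. The sketch assumes a perfect estimator, compares greedy with an exhaustive search that is only practical on tiny graphs, and uses made-up names and weights.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Graph = Dict[str, Set[str]]  # node -> set of successors (assumed encoding)


def frontier(graph: Graph, crawled) -> Set[str]:
    return {v for u in crawled for v in graph.get(u, set()) if v not in crawled}


def greedy_sequence(graph: Graph, beta: Dict[str, float],
                    seeds: Set[str], budget: int) -> List[str]:
    """Always crawl the frontier node with the highest true score."""
    crawled, sequence = set(seeds), []
    for _ in range(budget):
        front = frontier(graph, crawled)
        if not front:
            break
        v = max(sorted(front), key=lambda x: beta[x])
        crawled.add(v)
        sequence.append(v)
    return sequence


def best_sequence(graph: Graph, beta: Dict[str, float],
                  seeds: Set[str], budget: int) -> Tuple[float, List[str]]:
    """Exhaustive search over all crawl sequences (exponential time)."""
    def search(crawled: FrozenSet[str], remaining: int) -> Tuple[float, List[str]]:
        front = frontier(graph, crawled)
        if remaining == 0 or not front:
            return 0, []
        best: Tuple[float, List[str]] = (float("-inf"), [])
        for v in sorted(front):
            score, tail = search(crawled | {v}, remaining - 1)
            if beta[v] + score > best[0]:
                best = (beta[v] + score, [v] + tail)
        return best
    return search(frozenset(seeds), budget)


# s -> a (score 3), s -> b (score 2), b -> c (score 20); two crawl steps allowed.
toy_graph: Graph = {"s": {"a", "b"}, "b": {"c"}}
toy_beta = {"s": 0, "a": 3, "b": 2, "c": 20}
greedy = greedy_sequence(toy_graph, toy_beta, {"s"}, budget=2)
greedy_total = sum(toy_beta[v] for v in greedy)
optimal_total, optimal = best_sequence(toy_graph, toy_beta, {"s"}, budget=2)
```

On this toy graph greedy crawls a then b for a total of 5, while crawling b first unlocks c for a total of 22.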

  64. The altered greedy strategy: node picked = with probability q, argmax(β(v)); with probability 1-q, a random v such that max(β(u)) - β(v) ≤ ζ × max(β(u))
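
The selection rule can be sketched as a single function. This is an illustration under assumptions: the frontier scores are passed in as a dictionary, scores are non-negative, and the function name and toy values are hypothetical.

```python
import random
from typing import Dict


def altered_greedy_pick(frontier_scores: Dict[str, float], q: float,
                        zeta: float, rng: random.Random) -> str:
    """With probability q pick the best node, otherwise a random near-best node."""
    best = max(frontier_scores.values())
    if rng.random() < q:
        return max(sorted(frontier_scores), key=frontier_scores.get)
    # Nodes whose score is within a fraction ζ of the best score.
    near_best = [v for v in sorted(frontier_scores)
                 if best - frontier_scores[v] <= zeta * best]
    return rng.choice(near_best)


toy_scores = {"a": 10, "b": 6, "c": 4}
rng = random.Random(0)
always_best = altered_greedy_pick(toy_scores, q=1.0, zeta=0.5, rng=rng)
relaxed = altered_greedy_pick(toy_scores, q=0.0, zeta=0.5, rng=rng)  # a or b, never c
```

With ζ = 0 or q = 1 the rule reduces to the plain greedy strategy (up to ties).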

  65. Altered greedy vs greedy for jazz

  66. The refresh rate disadvantage

  67. When estimation takes too long

  68. The score degradation (%) at different steps
