Ranking the Web with Spark (Apache Big Data Europe 2016)



slide-1
SLIDE 1

Ranking the Web with Spark

Apache Big Data Europe 2016

sylvain@sylvainzimmer.com @sylvinus

slide-2
SLIDE 2

/usr/bin/whoami

  • Jamendo (Founder & CTO, 2004-2011)
  • TEDxParis (Co-founder, 2009-2012)
  • dotConferences (Founder, 2012-)
  • Pricing Assistant (Co-founder & CTO, 2012-)
slide-3
SLIDE 3

transparency reproducibility

slide-4
SLIDE 4
slide-5
SLIDE 5

https://uidemo.commonsearch.org

slide-6
SLIDE 6

https://explain.commonsearch.org/?q=python&g=en

slide-7
SLIDE 7

Ranking

slide-8
SLIDE 8

Disclaimer: IANASRE

(I Am Not A Search Relevance Engineer)

slide-9
SLIDE 9

What's in a score

score = fn( doc, query, language, user, time )

slide-10
SLIDE 10

What's in a score

score = fn( doc, query )

slide-11
SLIDE 11

What's in a score

score = fn( static_score, dynamic_score ( query ))
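
The slide leaves fn unspecified. As a minimal sketch, assuming a simple linear blend (the function name, the 0.3 weight and the example URLs are illustrative assumptions, not the Common Search scoring function), the split between a query-independent and a query-dependent score could look like this:

```python
def final_score(static_score: float, dynamic_score: float,
                static_weight: float = 0.3) -> float:
    """Blend a query-independent score with a query-dependent one.

    The linear blend and the default weight are assumptions for illustration.
    """
    if not 0.0 <= static_weight <= 1.0:
        raise ValueError("static_weight must be in [0, 1]")
    return static_weight * static_score + (1.0 - static_weight) * dynamic_score


# Static scores can be computed offline, once per document.
# Dynamic scores are only known once a query arrives.
docs = {
    "a.example/": {"static": 0.9, "dynamic": 0.2},
    "b.example/python": {"static": 0.4, "dynamic": 0.8},
}
ranked = sorted(
    docs,
    key=lambda url: final_score(docs[url]["static"], docs[url]["dynamic"]),
    reverse=True,
)
```

With these made-up numbers the page with the stronger query match ranks first even though its static score is lower.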

slide-12
SLIDE 12

Static score

slide-13
SLIDE 13

Static features

  • Scopes:
  • Page: URL depth, markup stats, ...
  • Domain: Age, page count, blacklists, ...
  • WebGraph: PageRank, ...
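
As an example of a page-scope static feature, URL depth can be computed from the URL alone, with no query involved. This is a sketch under the assumption that "depth" means the number of non-empty path segments; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def url_depth(url: str) -> int:
    """Count the non-empty path segments of a URL (assumed meaning of 'depth')."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])


home_depth = url_depth("https://example.com/")
deep_depth = url_depth("https://example.com/docs/api/index.html")
```
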
slide-14
SLIDE 14

http://infolab.stanford.edu/~backrub/google.html

The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)

Crawler, Indexer, Database, Searcher, Ranker

slide-15
SLIDE 15

Dynamic score

slide-16
SLIDE 16

Dynamic features

  • Text match: TF-IDF, BM25, proximity, topic, ...
  • Query-level: number of words, popularity, ...
  • Usage: clicks, dwell time, reformulations, ...
  • Time
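
To make the text-match features concrete, here is a minimal sketch of BM25 over pre-tokenized documents. It uses the common k1 = 1.2, b = 0.75 defaults and an IDF variant that stays non-negative; these choices, the function name and the toy documents are assumptions for illustration, not the scoring used by Common Search or Elasticsearch.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: long documents need more matches to score as high.
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


docs = [
    "python is a programming language".split(),
    "the python is a large snake".split(),
    "spark runs python jobs on a cluster".split(),
    "ranking the web".split(),
]
scores = bm25_scores("python language".split(), docs)
```

The first document matches both query terms and scores highest; the last one matches neither and scores zero.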
slide-17
SLIDE 17

Scoring function

slide-18
SLIDE 18

Components:

  • Data sources: Common Crawl, Alexa top 1M, ...
  • Indexer: Python, Spark
  • Database: Elasticsearch
  • Searcher: Go
  • Users

Data flows:

  • Offline (Indexer to Database): words, static score
  • Online (Searcher and Database): query; top 10 docs, final scores

slide-19
SLIDE 19
slide-20
SLIDE 20

https://explain.commonsearch.org/?q=python&g=en

slide-21
SLIDE 21

Issues with this architecture

  • Static & dynamic scoring are in different codebases

  • No control over result diversity
  • Hard to optimize
  • Very dependent on Elasticsearch
slide-22
SLIDE 22

Rescoring

slide-23
SLIDE 23

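Components: Users, Database, Indexer, Searcher, Rescorer

Data flows:

  • Indexer output: words, static score, features
  • Searcher input: query
  • Rescorer input: top 1k docs, features
  • Rescorer output: final 10 docs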
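
The rescoring step can be sketched as a function that takes the candidate documents with their features and keeps only the best few under a new scoring function. The feature names, weights and synthetic candidates below are hypothetical; the talk does not specify them.

```python
import heapq


def rescore(candidates, score_fn, k=10):
    """Return the k best candidates under score_fn, best first."""
    return heapq.nlargest(k, candidates, key=score_fn)


def example_score(doc):
    # Hypothetical feature names and weights, for illustration only.
    return 0.6 * doc["text_match"] + 0.4 * doc["static_score"]


# Stand-in for the "top 1k docs, features" returned by the first-pass search.
candidates = [
    {
        "url": f"https://example.com/{i}",
        "text_match": (i % 7) / 7,
        "static_score": (i % 5) / 5,
    }
    for i in range(1000)
]
top10 = rescore(candidates, example_score)
```

Because only the first-pass candidates are rescored, a good document that falls outside the top 1k can never reach the final 10, which is one reason pagination gets harder with this design.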

slide-24
SLIDE 24

Issues with rescoring

  • Latency
  • Pagination
  • Harder to explain
slide-25
SLIDE 25

Learning to rank

slide-26
SLIDE 26

LTR Model

  • Features
  • Training dataset
  • Evaluation: NDCG, ERR, ...
  • Algorithms: AdaRank, ListNet, LambdaMART, ...
  • Learning with Spark!
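
For the evaluation bullet, NDCG can be computed from the graded relevance labels of a ranked result list. A minimal sketch, assuming the exponential-gain form of DCG (one of two common variants) and hypothetical function names:

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain of the first k results (exponential gain)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg(relevances, k=10):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg([3, 2, 1, 0])    # already in ideal order
reversed_ = ndcg([0, 1, 2, 3])  # worst possible order of the same labels
```
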
slide-27
SLIDE 27

The right questions

  • What do users expect?
  • What features?
  • How to evaluate and fine-tune in the real world?
slide-28
SLIDE 28

PageRank with Spark

slide-29
SLIDE 29
slide-30
SLIDE 30

http://commoncrawl.org

slide-31
SLIDE 31

https://github.com/commonsearch/cosr-back

slide-32
SLIDE 32

Common Search Pipeline

  • Doc sources: Common Crawl, WARC files, URLs, ...
  • Filter plugins
  • Document parsing
  • Output plugins
  • Data output: Database, file, HDFS, S3, ...

slide-33
SLIDE 33

Most popular Wikipedia pages

slide-34
SLIDE 34

Dumping the web graph

slide-35
SLIDE 35

Naive pyspark PageRank
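
The code from this slide is not preserved in the transcript. As a stand-in, here is a plain-Python sketch of the same iteration (it is not pyspark and not the cosr-back job): each page splits its rank among its outlinks, and contributions are summed per destination, which is the shape a flatMap followed by reduceByKey would take in Spark. The damping factor of 0.85, the uniform handling of dangling pages and the toy graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {dst for dsts in links.values() for dst in dsts}
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        contribs = {page: 0.0 for page in pages}
        dangling = 0.0
        for page in pages:
            outlinks = links.get(page, [])
            if outlinks:
                # "flatMap": split this page's rank evenly among its outlinks.
                share = ranks[page] / len(outlinks)
                for dst in outlinks:
                    contribs[dst] += share  # "reduceByKey": sum per destination
            else:
                # Pages without outlinks spread their rank over all pages.
                dangling += ranks[page]
        ranks = {
            page: (1.0 - damping) / n + damping * (contribs[page] + dangling / n)
            for page in pages
        }
    return ranks


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

On this toy graph, "c" collects links from three pages and ends up with the highest rank, while "d", which nothing links to, keeps only the baseline share.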

slide-36
SLIDE 36

GraphFrames

slide-37
SLIDE 37

SparkSQL PageRank

slide-38
SLIDE 38

SparkSQL PageRank

https://github.com/commonsearch/cosr-back/blob/master/spark/jobs/pagerank.py

slide-39
SLIDE 39

Tests

http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm

https://github.com/commonsearch/cosr-back/blob/master/tests/sparktests/test_pagerank.py

slide-40
SLIDE 40

https://about.commonsearch.org/developer/get-started

slide-41
SLIDE 41
slide-42
SLIDE 42

Top 10

slide-43
SLIDE 43
slide-44
SLIDE 44

Spam

slide-45
SLIDE 45
slide-46
SLIDE 46

Spamdexing

  • Keyword stuffing, hidden text
  • Scraper sites, Mirrors
  • Link farms
  • Splogs, Comment spam
  • Domaining
  • Cloaking
  • Bombing
slide-47
SLIDE 47

Questions?

  • https://about.commonsearch.org/contributing
  • https://github.com/commonsearch
  • contact@commonsearch.org
  • slack.commonsearch.org