Bing Search: An Engine in the Clouds



slide-1
SLIDE 1

Bing Search:

An Engine in the Clouds

slide-2
SLIDE 2
  • Founded in January 2009
  • Offices in London and Munich
  • More than 70 employees in total
  • Close collaborations with Microsoft Research in Redmond and Cambridge
  • Collaborations with various MSFT Product Groups (incl. Office, Skype, Windows, Xbox etc.)

  • STC-E Munich: web ranking
  • Other STCs in Beijing, Hyderabad, and Silicon Valley

Office locations: Munich (Rablstr.), London (Cardinal Place), Munich (Gewürzmühlstr.)

slide-3
SLIDE 3
  • Applied Research branch of MSR working on:
  • IT-Security
  • Data-privacy
  • Mobility
  • Mobile Solutions
  • Web-Services

Deep technical expertise (8 years of experience in Stream Analytics, Windows platform and small devices)

Relationships with engineering groups (Windows, Office 365, Azure)

Strategic customer scenarios (Telemetry, Early fault detection, Predictive maintenance)

Enabling acquisition of data on all classes of devices/sensors, and providing industry-leading analytics and processing capabilities from the edge to cloud, leveraging:

http://research.microsoft.com/en-us/labs/atle/default.aspx

slide-4
SLIDE 4


slide-5
SLIDE 5
  • Content Gathering
  • Crawling
  • Indexing
  • Matching Query Words to Content
  • Query modifications needed?
  • Ranking Results
  • Features used
slide-6
SLIDE 6
  • Comprehensiveness
  • Serving & discovery of hundreds of billions of documents
  • Frequency
  • Optimized towards freshness & politeness
  • Balance
  • Depth vs. breadth in the processing of document content
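
The freshness vs. politeness trade-off above can be illustrated with a minimal sketch. It assumes a hypothetical policy in which each URL has a due time (freshness) and each host may be fetched at most once every `min_delay` seconds (politeness). The class name `PoliteScheduler` and its parameters are illustrative only and are not taken from the talk.

```python
import heapq
from urllib.parse import urlparse


class PoliteScheduler:
    """Toy crawl frontier: earliest-due URL first, but never fetch a host
    more often than once every `min_delay` seconds (hypothetical policy)."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self._heap = []        # entries: (due_time, sequence, url)
        self._next_free = {}   # host -> earliest time the host may be fetched again
        self._seq = 0          # tie-breaker so equal due times keep insertion order

    def add(self, url, due_time=0.0):
        heapq.heappush(self._heap, (due_time, self._seq, url))
        self._seq += 1

    def pop(self, now):
        """Return a URL that may be fetched at time `now`, or None."""
        deferred = []
        result = None
        while self._heap:
            due, seq, url = heapq.heappop(self._heap)
            if due > now:
                # Heap is ordered by due time, so nothing later is ready either.
                deferred.append((due, seq, url))
                break
            host = urlparse(url).netloc
            if self._next_free.get(host, 0.0) <= now:
                self._next_free[host] = now + self.min_delay
                result = url
                break
            # Host was fetched too recently: retry once the host is free again.
            deferred.append((self._next_free[host], seq, url))
        for entry in deferred:
            heapq.heappush(self._heap, entry)
        return result
```

With `min_delay=10`, two URLs on the same host are served ten seconds apart, while a URL on a different host can be fetched in between.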
slide-7
SLIDE 7
  • Intelligent document selection matters
  • Result counts are misleading

slide-8
SLIDE 8
  • Frequency
  • Optimized towards freshness & politeness
  • Balance
  • Depth vs. breadth in the processing of document content

slide-9
SLIDE 9
  • Most components are easily parallelizable such as crawling and document processing

  • Developed on top of Private Cloud
  • Leveraging Cosmos as a highly scalable storage and computing system
  • Running under highly performant Datacenter Management System
slide-10
SLIDE 10
  • Petabyte Store and Computation System
  • About 62 physical petabytes stored (~275 logical petabytes stored in 2011)
  • Tens of thousands of computers across many datacenters
  • Massively parallel processing based on Dryad
  • Similar to MapReduce but can represent arbitrary DAGs of computation
  • Automatic computation placement with data
  • SCOPE (Structured Computation Optimized for Parallel Execution)
  • SQL-like language with set-oriented record and column manipulation
  • Automatically compiled and optimized for execution over Dryad
  • Management of hundreds of “Virtual Clusters” for computation allocation

Source: http://research.microsoft.com/en-us/events/fs2011/helland_cosmos_big_data_and_big_challenges.pdf
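
To make the "arbitrary DAGs of computation" point concrete, here is a minimal single-machine sketch. It is not Dryad's or SCOPE's API; it only shows a job whose stages form a diamond-shaped graph, something a single map-then-reduce pass cannot express directly. The function `run_dag` and the stage names are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from collections import Counter
from graphlib import TopologicalSorter


def run_dag(stages, deps, inputs):
    """Run named stages in dependency order.

    stages: name -> function taking the outputs of its upstream stages (in order)
    deps:   name -> list of upstream stage names (must form a DAG)
    inputs: name -> precomputed value for source vertices
    """
    results = dict(inputs)
    for name in TopologicalSorter(deps).static_order():
        if name in results:
            continue  # source vertex, value already supplied
        args = [results[d] for d in deps.get(name, [])]
        results[name] = stages[name](*args)
    return results


# Diamond: docs -> tokens -> {counts, lengths} -> report
deps = {
    "tokens": ["docs"],
    "counts": ["tokens"],
    "lengths": ["tokens"],
    "report": ["counts", "lengths"],
}
stages = {
    "tokens": lambda docs: [d.split() for d in docs],
    "counts": lambda toks: Counter(w for t in toks for w in t),
    "lengths": lambda toks: [len(t) for t in toks],
    "report": lambda counts, lengths: (counts.most_common(1)[0][0], sum(lengths)),
}
results = run_dag(stages, deps, {"docs": ["bing search engine", "search in the cloud"]})
```

In a distributed system each vertex would run on many machines, placed near its data; here the vertices simply run one after another in topological order.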

slide-11
SLIDE 11
  • Linguistic alterations
  • Stemming, morphological variations, plurals
  • Spell Corrections
  • Approximately one third of queries contain misspellings
  • Synonyms
  • Used sparingly, only in high confidence situations

slide-12
SLIDE 12
  • Web represents Knowledge
  • Approximately one third of queries contain misspellings
  • Users often correct themselves
  • Patterns of common (and not so common) misspellings are discovered
  • Linguistic Depth
  • Less useful
  • Rules like “i” before “e” except after “c” do not scale
  • Instead develop computer models of correct spelling
  • Statistics & usage data help
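
As an illustration of a statistical spelling model (as opposed to hand-written rules), here is a minimal sketch of a frequency-based corrector in the style of well-known edit-distance correctors. It is an assumption for illustration, not the model used in Bing: it considers only candidates one edit away and picks the one with the highest count in a hypothetical word-frequency table.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return the most frequent known word within one edit of `word`.

    `counts` maps known words to usage counts (hypothetical data, e.g. derived
    from query logs). Unknown words with no close candidate are returned as-is.
    """
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word


# Made-up counts for demonstration only.
counts = Counter({"receive": 50, "recipe": 20, "believe": 30})
```

Note that no "i before e" rule appears anywhere: `correct("recieve", counts)` and `correct("beleive", counts)` are both resolved purely by which nearby string is frequent in the data.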
slide-13
SLIDE 13
slide-14
SLIDE 14
slide-15
SLIDE 15
  • Heavy dependence on interaction logs (queries, sessions…)
  • User behavior is king
  • Leveraging Cosmos for processing billions of entries
  • Deriving query histogram
  • Deriving query reformulation graph
  • Leveraging Machine Learning for ranking suggestions
  • Training word-level/contextual-level models
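
The two log-derived structures named above can be sketched on a toy scale. The following assumes sessions are given as ordered lists of query strings (a hypothetical format) and ranks suggestions by raw transition counts; a learned ranking model would replace that last step in practice.

```python
from collections import Counter, defaultdict


def query_histogram(sessions):
    """Count how often each query string occurs across all sessions."""
    return Counter(q for session in sessions for q in session)


def reformulation_graph(sessions):
    """Weighted edges q1 -> q2 for consecutive, distinct queries in a session."""
    graph = defaultdict(Counter)
    for session in sessions:
        for prev, nxt in zip(session, session[1:]):
            if prev != nxt:
                graph[prev][nxt] += 1
    return graph


def suggest(query, graph, k=3):
    """Top-k follow-up queries by transition count (a stand-in for a learned ranker)."""
    return [q for q, _ in graph.get(query, Counter()).most_common(k)]


# Made-up sessions for demonstration only.
sessions = [
    ["spagetti western", "spaghetti western"],
    ["spagetti western", "spaghetti western", "spaghetti western films"],
    ["spaghetti western"],
]
hist = query_histogram(sessions)
graph = reformulation_graph(sessions)
```

Here the edge from the misspelled query to its correction has weight 2, which is how self-corrections by users surface as spelling and suggestion signals.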

slide-16
SLIDE 16

1. Create lots of features (attributes)
2. Calculate them per query/document pair
3. Let the computer learn how to rank with millions/billions of examples
4. Rinse & Repeat

slide-17
SLIDE 17
  • Do the words appear in the title of the document?
  • Do the words appear in the order specified?
  • Are the words a substantial part of the title?
  • Are the words excessively repeated?
  • Is the title uncommonly long?
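
The questions above translate naturally into numeric features computed per query/title pair. The sketch below is one possible encoding; the feature names and the simple word tokenizer are assumptions. Deliberately, it emits raw values (repeat counts, title length) rather than yes/no answers, leaving it to the learner to decide what counts as "excessive" or "uncommonly long".

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens (punctuation is dropped)."""
    return re.findall(r"\w+", text.lower())


def title_features(query, title):
    """Compute illustrative title features for one query/document pair."""
    q, t = tokenize(query), tokenize(title)
    qset = set(q)
    n = len(q)
    # True if the query words occur as a contiguous phrase, in order.
    in_order = n > 0 and any(t[i:i + n] == q for i in range(len(t) - n + 1))
    matched = sum(1 for w in t if w in qset)
    return {
        "all_words_in_title": bool(q) and all(w in t for w in q),
        "words_in_order": in_order,
        "title_coverage": matched / len(t) if t else 0.0,  # share of title made of query words
        "max_repeats": max((t.count(w) for w in q), default=0),
        "title_length": len(t),
    }
```

For the query "spaghetti western", a short title that starts with those two words scores high on order and coverage, while a title with the words reversed fails the order check even though both words are present.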
slide-18
SLIDE 18
  • Title: Spaghetti Western - Wikipedia (Good)
  • Title: Western Spaghetti Recipe – Taste of Home (Not so good)
  • Title: The Spaghetti Western Orchestra Tickets 2013 - The Spaghetti Western Orchestra Concert tour 2013 Tickets (Not so good)

slide-19
SLIDE 19

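[Figure: example decision tree. Decision nodes: "Words in Title in Correct Order?", "Is Title uncommonly long?", "Are words repeated excessively?", each with yes/no branches; leaf scores -1, +4 and +2. Annotations: "1000s of other features…" and "1000s of other trees…"]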

slide-20
SLIDE 20
  • The document with highest cumulative score is ranked highest
  • Improvements to algorithms & features are made constantly & shipped frequently
  • Machine Learning allows for scalability across many dimensions
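
The "highest cumulative score" idea can be sketched as an additive ensemble of small decision trees: each tree maps a document's features to a score, the scores are summed, and documents are sorted by the total. Everything concrete below (the tree encoding, thresholds, leaf values, feature names and example documents) is made up for illustration. In a real system the trees are learned from data rather than written by hand.

```python
def eval_tree(node, features):
    """Walk one tree. Internal nodes are (feature, threshold, low, high) tuples;
    leaves are plain numbers holding the score contribution."""
    while not isinstance(node, (int, float)):
        name, threshold, low, high = node
        node = high if features[name] > threshold else low
    return node


def score(features, trees):
    """Cumulative score: sum of the contributions of all trees."""
    return sum(eval_tree(tree, features) for tree in trees)


def rank(docs, trees):
    """Return document names ordered by descending cumulative score."""
    return sorted(docs, key=lambda name: score(docs[name], trees), reverse=True)


# Hand-written toy trees (hypothetical thresholds and leaf values).
trees = [
    # Words not in order: -1. In order and short title: +4. In order but long: +2.
    ("words_in_order", 0.5, -1.0, ("title_length", 10, 4.0, 2.0)),
    # Penalize titles that repeat a query word more than once.
    ("max_repeats", 1, 1.0, -3.0),
]
docs = {
    "wikipedia": {"words_in_order": 1, "title_length": 3, "max_repeats": 1},
    "recipe": {"words_in_order": 0, "title_length": 6, "max_repeats": 1},
    "tickets": {"words_in_order": 1, "title_length": 14, "max_repeats": 2},
}
ranking = rank(docs, trees)
```

Because the model is just a sum over trees, adding a new feature or thousands more trees does not change the scoring code, which is one reason this approach scales across many dimensions.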

slide-21
SLIDE 21
  • Titles
  • Document Content
  • Links
  • Clicks
  • Word Frequencies
  • Images
  • Visual Layout
  • Co-occurrences of words
  • Freshness
  • Word Proximity

… and many more

slide-22
SLIDE 22
  • Leveraging Hybrid Cloud Computing Platform
  • Built on top of heterogeneous computing resources:
  • Single box
  • Cosmos: Map/Reduce
  • MPI
  • Easy deployment/management of applications (modules)
  • Powerful graphical layer of abstraction defining data-workflow.
  • Batch mode allowing to scale across dimensions such as markets
  • Leveraging Machine Learning as a Service
slide-23
SLIDE 23

Accelerating Feature computation through Field Programmable Gate Arrays

Sources:
http://www.altera.com/technology/system-design/articles/2014/cpu-architects.html
http://www.enterprisetech.com/2014/09/03/microsoft-using-fpgas-speed-bing-search/

slide-24
SLIDE 24
  • Feature Fundamentals
  • Which features are “universal”, which are market-specific?
  • Multiple Queries
  • When does it make sense to issue multiple (altered) queries?
  • How does one merge the results for the user?
  • Anchor & Link Signals
  • Which links and which anchor text are informative?
  • What features need to be leveraged in this context?
  • Knowledge Modeling
  • View the task of ranking as a translation from “query language” to “document language”

slide-25
SLIDE 25
  • Global Ranking
  • Everything mentioned before … for International (no English-US, no CJK)

  • Improving, Evaluating and Shipping Rankers in hundreds of markets
  • Universal Ranker
  • One Team and One Ranker
slide-26
SLIDE 26

http://www.bingiton.com/