Bing Search: An Engine in the Clouds

Mar 17, 2024

SLIDE 1

Bing Search:

An Engine in the Clouds

SLIDE 2

Founded in January 2009
Offices in London and Munich
More than 70 employees in total
Close collaborations with Microsoft Research in Redmond and Cambridge
Collaborations with various MSFT Product Groups (incl. Office, Skype, Windows, Xbox etc.)

STC-E Munich: web ranking
Other STCs in Beijing, Hyderabad, and Silicon Valley

Munich – Rablstr.
London – Cardinal Place
Munich – Gewürzmühlstr.

SLIDE 3

Applied Research branch of MSR working on:
IT-Security
Data-privacy
Mobility
Mobile Solutions
Web-Services

Deep technical expertise (8 years of experience in Stream Analytics, Windows platform and small devices)
Relationships with engineering groups (Windows, Office 365, Azure)

Strategic customer scenarios (Telemetry, Early fault detection, Predictive maintenance)
Enabling acquisition of data on all classes of devices/sensors, and providing industry leading analytics and processing capabilities from the edge to cloud, leveraging:
http://research.microsoft.com/en-us/labs/atle/default.aspx

SLIDE 4


SLIDE 5

Content Gathering
Crawling
Indexing
Matching Query Words to Content
Query modifications needed?
Ranking Results
Features used

SLIDE 6

Comprehensiveness
Serving & discovery of hundreds of billions of documents
Frequency
Optimized towards freshness & politeness
Balance
Depth vs. breadth in the processing of document content
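
The trade-off between freshness and politeness can be illustrated with a small scheduling sketch. This is not Bing's crawler: the function name `schedule_crawl`, the staleness estimate (change rate times time since last crawl), and the fixed per-host delay are all assumptions made for illustration.

```python
import heapq


def schedule_crawl(pages, now, host_delay=10.0):
    """Plan fetch times for pages (hypothetical sketch).

    pages: list of (url, host, change_rate, last_crawled) tuples.
    Returns a list of (fetch_time, url) sorted by fetch time.
    """
    # Freshness: pages expected to have changed the most go first.
    # heapq is a min-heap, so the staleness estimate is negated.
    heap = [(-(rate * (now - last)), url, host)
            for url, host, rate, last in pages]
    heapq.heapify(heap)

    next_free = {}  # host -> earliest time the host may be contacted again
    plan = []
    while heap:
        _, url, host = heapq.heappop(heap)
        # Politeness: never hit the same host twice within host_delay.
        fetch_time = max(now, next_free.get(host, now))
        plan.append((fetch_time, url))
        next_free[host] = fetch_time + host_delay
    return sorted(plan)


pages = [
    ("a.com/news", "a.com", 1.0, 0.0),    # changes often
    ("a.com/about", "a.com", 0.01, 0.0),  # rarely changes
    ("b.org/", "b.org", 0.5, 0.0),
]
plan = schedule_crawl(pages, now=100.0)
```

In this sketch the rarely changing page on a.com is pushed back by the host delay, while the page on b.org can be fetched immediately.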

SLIDE 7

Intelligent document selection matters
Result counts are misleading

SLIDE 8

Frequency
Optimized towards freshness & politeness
Balance
Depth vs. breadth in the processing of document content

SLIDE 9

Most components are easily parallelizable such as crawling and document processing

Developed on top of Private Cloud
Leveraging Cosmos as a highly scalable storage and computing system
Running under highly performant Datacenter Management System

SLIDE 10

Petabyte Store and Computation System
About 62 physical petabytes stored (~275 logical petabytes stored in 2011)
Tens of thousands of computers across many datacenters
Massively parallel processing based on Dryad
Similar to MapReduce but can represent arbitrary DAGs of computation
Automatic computation placement with data
SCOPE (Structured Computation Optimized for Parallel Execution)
SQL-like language with set-oriented record and column manipulation
Automatically compiled and optimized for execution over Dryad
Management of hundreds of “Virtual Clusters” for computation allocation

Source: http://research.microsoft.com/en-us/events/fs2011/helland_cosmos_big_data_and_big_challenges.pdf
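
To make the "arbitrary DAGs of computation" point concrete, here is a minimal single-machine sketch of running stages in dependency order. It is not Dryad or SCOPE: the stage names, the `run_dag` helper, and the toy data are hypothetical, and the only library used is Python's standard `graphlib`.

```python
from graphlib import TopologicalSorter


def run_dag(stages, deps, inputs):
    """Run each stage once every stage it depends on has finished.

    stages: name -> function taking a list of upstream outputs
    deps:   name -> list of upstream stage names (empty for sources)
    inputs: name -> initial data for source stages
    """
    results = {}
    graph = {name: set(upstream) for name, upstream in deps.items()}
    for name in TopologicalSorter(graph).static_order():
        upstream = [results[d] for d in deps[name]]
        if not upstream:  # a source stage reads its own input
            upstream = [inputs[name]]
        results[name] = stages[name](upstream)
    return results


# Two sources feed a join: a shape a single map/reduce pass does not express.
deps = {
    "load_queries": [],
    "load_clicks": [],
    "normalize": ["load_queries"],
    "join": ["normalize", "load_clicks"],
    "count": ["join"],
}
stages = {
    "load_queries": lambda up: up[0],
    "load_clicks": lambda up: up[0],
    "normalize": lambda up: [q.strip().lower() for q in up[0]],
    "join": lambda up: [(q, up[1].get(q, 0)) for q in up[0]],
    "count": lambda up: sum(clicks for _, clicks in up[0]),
}
inputs = {
    "load_queries": ["Bing ", "bing", "Cosmos"],
    "load_clicks": {"bing": 2, "cosmos": 1},
}
results = run_dag(stages, deps, inputs)
```

A real system would also place each stage near its data and run independent stages in parallel; the sketch only shows the dependency ordering.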

SLIDE 11

Linguistic alterations
Stemming, morphological variations, plurals
Spell Corrections
Approximately one third of queries contain misspellings
Synonyms
Used sparingly, only in high confidence situations

SLIDE 12

Web represents Knowledge
Approximately one third of queries contain misspellings
Users often correct themselves
Patterns of common (and not so common) misspellings are discovered
Linguistic Depth
Less useful
Rules like “i” before “e” except after “c” do not scale
Instead develop computer models of correct spelling
Statistics & usage data help
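
One way to picture learning spelling from usage data rather than rules is the sketch below. It assumes a log of (typed, retyped) query pairs from the same session. The function `learn_corrections`, the similarity threshold, and the minimum count are illustrative assumptions, not the models actually used.

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def learn_corrections(reformulations, min_similarity=0.8, min_count=2):
    """Build a misspelling -> correction table from (typed, retyped) pairs."""
    pair_counts = defaultdict(Counter)
    for typed, retyped in reformulations:
        if typed == retyped:
            continue
        # Only near-identical strings count as self-corrections; a large
        # change is more likely a new topic than a fixed typo.
        if SequenceMatcher(None, typed, retyped).ratio() >= min_similarity:
            pair_counts[typed][retyped] += 1

    table = {}
    for typed, targets in pair_counts.items():
        best, count = targets.most_common(1)[0]
        # Keep a rewrite only if it is seen often enough and the reverse
        # rewrite is rarer, so the more popular direction wins.
        reverse = pair_counts.get(best, Counter())[typed]
        if count >= min_count and reverse < count:
            table[typed] = best
    return table


log = [
    ("recieve", "receive"),
    ("recieve", "receive"),
    ("recieve", "receive"),
    ("receive", "recieve"),   # rare reverse rewrite, outvoted
    ("spaghetti", "pizza"),   # topic change, not a correction
    ("wether", "weather"),    # plausible, but seen only once
]
corrections = learn_corrections(log)
```

No spelling rule appears anywhere in the sketch; the table comes entirely from how often users rewrite one string into another.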

SLIDE 13

SLIDE 14

SLIDE 15

Heavy dependence on interaction logs (queries, sessions…)
User behavior is king
Leveraging Cosmos for processing Billions of entries
Deriving query histogram
Deriving query reformulation graph
Leveraging Machine Learning for ranking suggestions
Training word-level/contextual-level models
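
The two derived structures named above can be sketched on a toy log. The session format (an ordered list of query strings per session) and the function names are assumptions; at the scale described, the same counting would run as a distributed job rather than in memory.

```python
from collections import Counter, defaultdict


def query_histogram(sessions):
    """Count how often each query string occurs across all sessions."""
    return Counter(query for session in sessions for query in session)


def reformulation_graph(sessions):
    """Count edges query -> next query within the same session."""
    graph = defaultdict(Counter)
    for session in sessions:
        for previous, following in zip(session, session[1:]):
            if previous != following:  # ignore plain repeats
                graph[previous][following] += 1
    return graph


sessions = [
    ["spagetti western", "spaghetti western"],
    ["spaghetti western"],
    ["spagetti western", "spaghetti western", "spaghetti western orchestra"],
]
histogram = query_histogram(sessions)
graph = reformulation_graph(sessions)
```

Frequent queries from the histogram and heavily weighted edges from the graph are natural candidates for suggestions, which a learned model could then rank.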

SLIDE 16

1. Create lots of features (attributes)

2. Calculate them per query/document pair

3. Let the computer learn how to rank with millions/billions of examples

4. Rinse & Repeat

SLIDE 17

Do the words appear in the title of the document?
Do the words appear in the order specified?
Are the words a substantial part of the title?
Are the words excessively repeated?
Is the title uncommonly long?
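
The questions above translate fairly directly into per query/document features. The sketch below is one possible reading: the function `title_features`, the whitespace tokenization, and both thresholds are assumptions chosen for illustration.

```python
def title_features(query, title, long_title_words=12, max_repeats=1):
    """Compute simple title features for one query/document pair."""
    q = query.lower().split()
    # Strip surrounding punctuation (including the en dash) from title words.
    t = [w.strip(".,:;!?-\u2013") for w in title.lower().split()]
    t = [w for w in t if w]

    # "In the order specified": the query occurs as a contiguous run.
    in_order = any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))
    query_words_in_title = sum(1 for w in t if w in q)

    return {
        "all_words_in_title": all(w in t for w in q),
        "words_in_order": in_order,
        "title_coverage": query_words_in_title / len(t) if t else 0.0,
        "excessive_repetition": any(t.count(w) > max_repeats for w in q),
        "title_too_long": len(t) > long_title_words,
    }


f_wiki = title_features("spaghetti western", "Spaghetti Western - Wikipedia")
f_recipe = title_features(
    "spaghetti western", "Western Spaghetti Recipe \u2013 Taste of Home")
f_tickets = title_features(
    "spaghetti western",
    "The Spaghetti Western Orchestra Tickets 2013 - The Spaghetti Western "
    "Orchestra Concert tour 2013 Tickets")
```

With these arbitrary thresholds, the first title matches in order and is short, the second contains both words but in the wrong order, and the third is flagged as both long and repetitive.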

SLIDE 18

Title: Spaghetti Western - Wikipedia (Good)
Title: Western Spaghetti Recipe – Taste of Home (Not so good)
Title: The Spaghetti Western Orchestra Tickets 2013 - The Spaghetti Western Orchestra Concert tour 2013 Tickets (Not so good)

SLIDE 19

[Figure: decision tree with yes/no branches. Nodes: "Words in Title in Correct Order?", "Is Title uncommonly long?", "Are words repeated excessively?", "1000s of other features…". Leaf scores include +4 and +2. Label: "1000s of other trees…"]

SLIDE 20

The document with highest cumulative score is ranked highest
Improvements to algorithms & features are made constantly & shipped frequently
Machine Learning allows for scalability across many dimensions
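
The idea of a cumulative score can be sketched as a sum over small decision trees. Everything concrete here is hypothetical: the two hand-written trees, their leaf values, and the feature names stand in for the thousands of learned trees and features that a production ranker would use.

```python
def tree_title(features):
    """Hypothetical tree: reward in-order title matches, less if too long."""
    if features["words_in_order"]:
        return 2.0 if features["title_too_long"] else 4.0
    return 0.0


def tree_repetition(features):
    """Hypothetical tree: penalize excessive repetition of query words."""
    return -1.0 if features["excessive_repetition"] else 1.0


def score(features, trees):
    """Cumulative score: the sum of every tree's leaf value."""
    return sum(tree(features) for tree in trees)


def rank(candidates, trees):
    """Sort (doc_id, features) pairs by cumulative score, highest first."""
    return sorted(candidates, key=lambda c: score(c[1], trees), reverse=True)


trees = [tree_title, tree_repetition]
candidates = [
    ("recipe", {"words_in_order": False, "title_too_long": False,
                "excessive_repetition": False}),
    ("wikipedia", {"words_in_order": True, "title_too_long": False,
                   "excessive_repetition": False}),
    ("tickets", {"words_in_order": True, "title_too_long": True,
                 "excessive_repetition": True}),
]
ranking = rank(candidates, trees)
```

In a learned ensemble the splits and leaf values come from training on labeled query/document examples; only the summation step is the same as in this sketch.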

SLIDE 21

Titles
Document Content
Links
Clicks
Word Frequencies
Images
Visual Layout
Co-occurrences of words
Freshness
Word Proximity

… and many more

SLIDE 22

Leveraging Hybrid Cloud Computing Platform
Built on top of heterogeneous computing resources:
Single box
Cosmos: Map/Reduce
MPI
…
Easy deployment/management of applications (modules)
Powerful graphical layer of abstraction defining data-workflow.
Batch mode that allows scaling across dimensions such as markets
Leveraging Machine Learning as a Service

SLIDE 23

Accelerating Feature computation through Field Programmable Gate Arrays

Sources:
http://www.altera.com/technology/system-design/articles/2014/cpu-architects.html
http://www.enterprisetech.com/2014/09/03/microsoft-using-fpgas-speed-bing-search/

SLIDE 24

Feature Fundamentals
Which features are “universal”, which are market-specific?
Multiple Queries
When does it make sense to issue multiple (altered) queries?
How does one merge the results for the user?
Anchor & Link Signals
Which links and which anchor text are informative?
What features need to be leveraged in this context?
Knowledge Modeling
View the task of ranking as a translation from “query language” to “document language”

SLIDE 25

Global Ranking
Everything mentioned before … for International (no English-US, no CJK)

Improving, Evaluating and Shipping Rankers in hundreds of markets
Universal Ranker
One Team and One Ranker

SLIDE 26

http://www.bingiton.com/