Bing Search: An Engine in the Clouds
- Founded in January 2009
- Offices in London and Munich
- More than 70 employees in total
- Close collaborations with Microsoft Research in Redmond and Cambridge
- Collaborations with various MSFT Product Groups (incl. Office, Skype, Windows, Xbox etc.)
- STC-E Munich: web ranking
- Other STCs in Beijing, Hyderabad, and Silicon Valley
Munich – Rablstr. | London – Cardinal Place | Munich – Gewürzmühlstr.
- Applied Research branch of MSR working on:
- IT-Security
- Data-privacy
- Mobility
- Mobile Solutions
- Web-Services
Enabling acquisition of data on all classes of devices/sensors, and providing industry-leading analytics and processing capabilities from the edge to cloud, leveraging:
- Deep technical expertise (8 years of experience in Stream Analytics, Windows platform and small devices)
- Relationships with engineering groups (Windows, Office 365, Azure)
- Strategic customer scenarios (Telemetry, Early fault detection, Predictive maintenance)
http://research.microsoft.com/en-us/labs/atle/default.aspx
- Content Gathering
- Crawling
- Indexing
- Matching Query Words to Content
- Query modifications needed?
- Ranking Results
- Features used
- Comprehensiveness
- Serving & discovery of hundreds of billions of documents
- Frequency
- Optimized towards freshness & politeness
- Balance
- Depth vs. breadth in the processing of document content
- Intelligent document selection matters
- Result counts are misleading
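The trade-off between freshness and politeness can be illustrated with a small scheduling sketch. It is not how the crawler described here works; it only assumes that URLs arrive in priority order (for example, by expected freshness gain) and that each host may be fetched at most once per `host_delay` seconds. The function name, the delay value, and the example URLs are all hypothetical.

```python
from urllib.parse import urlparse


def schedule_crawl(urls, host_delay=2.0, start=0.0):
    """Assign a fetch time to each URL without hitting any host too often.

    `urls` is assumed to be ordered by priority (most urgent first).
    `host_delay` is an assumed politeness interval in seconds.
    """
    next_free = {}  # host -> earliest time the host may be fetched again
    plan = []
    for url in urls:
        host = urlparse(url).netloc
        fetch_time = max(start, next_free.get(host, start))
        plan.append((fetch_time, url))
        next_free[host] = fetch_time + host_delay
    return sorted(plan)


plan = schedule_crawl([
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
])
for fetch_time, url in plan:
    print(fetch_time, url)
```

In this sketch the second URL on `a.example` is pushed back by the politeness interval, while the URL on `b.example` can be fetched immediately.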
- Most components, such as crawling and document processing, are easily parallelizable
- Developed on top of Private Cloud
- Leveraging Cosmos as a highly scalable storage and computing system
- Running under highly performant Datacenter Management System
- Petabyte Store and Computation System
- About 62 physical petabytes stored (~275 logical petabytes stored in 2011)
- Tens of thousands of computers across many datacenters
- Massively parallel processing based on Dryad
- Similar to MapReduce but can represent arbitrary DAGs of computation
- Automatic computation placement with data
- SCOPE (Structured Computation Optimized for Parallel Execution)
- SQL-like language with set-oriented record and column manipulation
- Automatically compiled and optimized for execution over Dryad
- Management of hundreds of “Virtual Clusters” for computation allocation
Source: http://research.microsoft.com/en-us/events/fs2011/helland_cosmos_big_data_and_big_challenges.pdf
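To make the difference between a fixed map/reduce pipeline and an arbitrary DAG of computation concrete, here is a minimal single-machine sketch. It is not Dryad or SCOPE; the `run_dag` helper and the stage names are hypothetical, and the only library call used is `graphlib.TopologicalSorter` from the Python standard library (3.9+). One stage fans out into two independent stages whose results are joined again, a shape that a single map/reduce pass does not express directly.

```python
from collections import Counter
from graphlib import TopologicalSorter


def run_dag(stages, deps, inputs):
    """Run every stage after its dependencies and return all outputs.

    stages: stage name -> function taking the outputs of its dependencies
    deps:   stage name -> list of names it depends on
    inputs: name -> precomputed value (the sources of the DAG)
    """
    results = dict(inputs)
    for name in TopologicalSorter(deps).static_order():
        if name in results:  # source data, nothing to compute
            continue
        args = [results[d] for d in deps.get(name, ())]
        results[name] = stages[name](*args)
    return results


stages = {
    "tokens": lambda docs: [d.lower().split() for d in docs],
    "term_counts": lambda tokens: Counter(t for doc in tokens for t in doc),
    "doc_lengths": lambda tokens: [len(doc) for doc in tokens],
    "summary": lambda counts, lengths: {
        "vocab": len(counts),
        "avg_len": sum(lengths) / len(lengths),
    },
}
deps = {
    "tokens": ["docs"],
    "term_counts": ["tokens"],
    "doc_lengths": ["tokens"],
    "summary": ["term_counts", "doc_lengths"],
}
out = run_dag(stages, deps, {"docs": ["Spaghetti Western", "Western Spaghetti Recipe"]})
print(out["summary"])
```

A real system would additionally place each stage close to its data and run independent stages in parallel; the sketch only shows the dependency ordering.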
- Linguistic alterations
- Stemming, morphological variations, plurals
- Spell Corrections
- Approximately one third of queries contain misspellings
- Synonyms
- Used sparingly, only in high confidence situations
- Web represents Knowledge
- Approximately one third of queries contain misspellings
- Users often correct themselves
- Patterns of common (and not so common) misspellings are discovered
- Linguistic Depth
- Less useful
- Rules like “i” before “e” except after “c” do not scale
- Instead develop computer models of correct spelling
- Statistics & usage data help
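The idea of replacing hand-written spelling rules with statistics can be sketched with a very small frequency-based corrector. This is a well-known textbook approach (candidates within one edit, ranked by how often they occur in usage data), not the model used in production; the word counts below are made up for illustration.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return the most frequent known word among word and its neighbours."""
    candidates = [w for w in edits1(word) | {word} if counts.get(w, 0) > 0]
    if not candidates:
        return word  # nothing known nearby, leave the word unchanged
    return max(candidates, key=lambda w: counts.get(w, 0))


# Hypothetical usage counts; misspellings occur in logs too, only less often.
counts = Counter({"receive": 950, "recieve": 3, "recipe": 400})
print(correct("recieve", counts))
```

No spelling rule is encoded anywhere: the correction comes entirely from the counts, which is why more usage data tends to help.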
- Heavy dependence on interaction logs (queries, sessions…)
- User behavior is king
- Leveraging Cosmos for processing billions of entries
- Deriving query histogram
- Deriving query reformulation graph
- Leveraging Machine Learning for ranking suggestions
- Training word-level/contextual-level models
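As a rough illustration of the two derived structures, the sketch below builds a query histogram and a query reformulation graph from a handful of sessions. The session data, function names, and the `min_count` threshold are assumptions made for the example; at the scale described here the same aggregation would run as a distributed job rather than in memory.

```python
from collections import Counter, defaultdict


def query_histogram(sessions):
    """Count how often each query string occurs across all sessions."""
    return Counter(q for session in sessions for q in session)


def reformulation_graph(sessions):
    """Count transitions from one query to the next within a session."""
    graph = defaultdict(Counter)
    for session in sessions:
        for prev, nxt in zip(session, session[1:]):
            if prev != nxt:
                graph[prev][nxt] += 1
    return graph


def suggest(query, graph, min_count=2):
    """Return the most common rewrite of query, if seen often enough."""
    targets = graph.get(query)
    if not targets:
        return None
    best, seen = targets.most_common(1)[0]
    return best if seen >= min_count else None


# Hypothetical sessions: each list is one user's consecutive queries.
sessions = [
    ["spagetti western", "spaghetti western"],
    ["spagetti western", "spaghetti western", "sergio leone"],
    ["spaghetti western"],
]
hist = query_histogram(sessions)
graph = reformulation_graph(sessions)
print(hist.most_common(1))
print(suggest("spagetti western", graph))
```

Because users often correct themselves, frequent edges in the graph are natural candidates for suggestions; a learned ranker would then order those candidates.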
1. Create lots of features (attributes)
2. Calculate them per query/document pair
3. Let the computer learn how to rank with millions/billions of examples
4. Rinse & Repeat
- Do the words appear in the title of the document?
- Do the words appear in the order specified?
- Are the words a substantial part of the title?
- Are the words excessively repeated?
- Is the title uncommonly long?
- Title: Spaghetti Western - Wikipedia (Good)
- Title: Western Spaghetti Recipe – Taste of Home (Not so good)
- Title: The Spaghetti Western Orchestra Tickets 2013 - The Spaghetti Western Orchestra Concert tour 2013 Tickets (Not so good)
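The title questions above can be turned into per query/document features along the following lines. This is a minimal sketch: the query "spaghetti western" is assumed from the example titles, "in order" is interpreted as a contiguous phrase match, and the thresholds `long_title_words` and `max_repeats` are arbitrary illustrative values, not ones from the talk.

```python
def title_features(query, title, long_title_words=10, max_repeats=1):
    """Compute simple title features for one query/document pair."""
    q = query.lower().split()
    # Lowercase, split on whitespace, and drop surrounding punctuation.
    t = [w.strip(".,:;!?-–") for w in title.lower().split()]
    t = [w for w in t if w]
    return {
        "words_in_title": all(w in t for w in q),
        "words_in_order": any(
            t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1)
        ),
        "title_coverage": sum(w in q for w in t) / len(t) if t else 0.0,
        "words_repeated": any(t.count(w) > max_repeats for w in q),
        "title_too_long": len(t) > long_title_words,
    }


query = "spaghetti western"  # assumed query for the example titles
titles = [
    "Spaghetti Western - Wikipedia",
    "Western Spaghetti Recipe – Taste of Home",
    "The Spaghetti Western Orchestra Tickets 2013 - "
    "The Spaghetti Western Orchestra Concert tour 2013 Tickets",
]
features = [title_features(query, title) for title in titles]
for title, f in zip(titles, features):
    print(title[:40], f)
```

Under these assumptions only the first title has the words in order without being long or repetitive, which matches the "Good" label in the list.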
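[Figure: decision tree over title features. Decision nodes: "Words in title in correct order?", "Is title uncommonly long?", "Are words repeated excessively?"; leaf scores include -1, +4 and +2; the diagram also indicates 1000s of other features and 1000s of other trees.]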
- The document with the highest cumulative score is ranked highest
- Improvements to algorithms & features are made constantly & shipped frequently
- Machine Learning allows for scalability across many dimensions
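The notion of a cumulative score can be sketched as a sum over many small decision trees. The two trees below are hand-written and hypothetical; in practice the trees, their thresholds, and their leaf values are learned from labeled examples, and there are thousands of them. The feature values are likewise made up for three example documents.

```python
def title_tree(f):
    """Hypothetical tree over title features; leaf values are illustrative."""
    if not f["words_in_order"]:
        return 0.0
    if f["title_too_long"]:
        return -1.0
    if f["words_repeated"]:
        return 2.0
    return 4.0


def coverage_tree(f):
    """Hypothetical second tree: reward titles dominated by query words."""
    return 1.0 if f["title_coverage"] >= 0.5 else 0.0


TREES = [title_tree, coverage_tree]  # a real ensemble has many more trees


def score(f, trees=TREES):
    """Cumulative score: the sum of the outputs of all trees."""
    return sum(tree(f) for tree in trees)


def rank(docs, trees=TREES):
    """Order document names by descending cumulative score."""
    return sorted(docs, key=lambda name: score(docs[name], trees), reverse=True)


# Hypothetical feature values for three documents and one query.
docs = {
    "wikipedia": {"words_in_order": True, "title_too_long": False,
                  "words_repeated": False, "title_coverage": 0.67},
    "recipe": {"words_in_order": False, "title_too_long": False,
               "words_repeated": False, "title_coverage": 0.33},
    "orchestra": {"words_in_order": True, "title_too_long": True,
                  "words_repeated": True, "title_coverage": 0.29},
}
ranking = rank(docs)
print(ranking)
```

Adding a feature or a tree changes only the inputs to the sum, which is one reason improvements can be shipped frequently without redesigning the ranker.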
- Titles
- Document Content
- Links
- Clicks
- Word Frequencies
- Images
- Visual Layout
- Co-occurrences of words
- Freshness
- Word Proximity
… and many more
- Leveraging Hybrid Cloud Computing Platform
- Built on top of heterogeneous computing resources:
- Single box
- Cosmos: Map/Reduce
- MPI
- …
- Easy deployment/management of applications (modules)
- Powerful graphical layer of abstraction defining data-workflow.
- Batch mode allows scaling across dimensions such as markets
- Leveraging Machine Learning as a Service
Accelerating Feature computation through Field Programmable Gate Arrays
Sources:
http://www.altera.com/technology/system-design/articles/2014/cpu-architects.html
http://www.enterprisetech.com/2014/09/03/microsoft-using-fpgas-speed-bing-search/
- Feature Fundamentals
- Which features are “universal”, which are market-specific?
- Multiple Queries
- When does it make sense to issue multiple (altered) queries?
- How does one merge the results for the user?
- Anchor & Link Signals
- Which links and which anchor text are informative?
- What features need to be leveraged in this context?
- Knowledge Modeling
- View the task of ranking as a translation from “query language” to “document language”
- Global Ranking
- Everything mentioned before … for International (no English-US, no CJK)
- Improving, Evaluating and Shipping Rankers in hundreds of markets
- Universal Ranker
- One Team and One Ranker
http://www.bingiton.com/