Semantic Technology for Online, Broadcast and Print Media



slide-1
SLIDE 1

Future Media  BBC MMXII

Semantic Technology for Online, Broadcast and Print Media

  • Jem Rayfield: Head of Solution Architecture
  • Financial Times: www.ft.com
slide-2
SLIDE 2

Outline

  • BBC: Dynamic Semantic Publishing and the World Cup 2010
  • BBC: Sport 2012 + Olympics
  • Financial Times: Semantic Re-platform
  • Financial Times: Semantic Prototype
  • Financial Times: Behavioral Recommendations

slide-3
SLIDE 3

BBC World Cup 2010 http://bbc.co.uk/worldcup

slide-4
SLIDE 4

  • 1. 32 teams + 8 groups + 736 players = 776 pages
  • 2. Fixtures & Results, Groups & Teams pages
  • 3. Too many web pages for too few journalists
  • 4. Improve the publishing system to help achieve all of this

World Cup 2010

slide-5
SLIDE 5

Page Per Player

http://news.bbc.co.uk/sport/football/world_cup_2010/groups_and_teams/team/england/wayne_rooney

slide-6
SLIDE 6

Page Per Team

slide-7
SLIDE 7

Page Per Group

slide-8
SLIDE 8

Semantic publishing

User experience | Ontology | Triple store

slide-9
SLIDE 9

BBC Sport: http://www.bbc.co.uk/ontologies/sport

Open Sport Ontology

slide-10
SLIDE 10

Journalism  BBC MMIX

Extendable Domain Driven Asset Tagging

slide-11
SLIDE 11

Open ontology/dataset reuse: Event | GeoNames | FOAF | etc.

slide-12
SLIDE 12

Infer… player->team->competition
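
The player -> team -> competition inference can be illustrated with a minimal sketch. This is not the BBC rule set (that ran inside the triple store as OWL reasoning); it only shows the idea of deriving a new statement from two asserted ones. The triples and the `playsFor` predicate are hypothetical; `competesIn` mirrors the `sport:competesIn` property seen in the API output.

```python
from collections import defaultdict

# Hypothetical asserted triples (illustrative identifiers, not BBC data).
TRIPLES = [
    ("wayne_rooney", "playsFor", "england"),
    ("england", "competesIn", "world_cup_2010"),
]


def infer_competitions(triples):
    """Derive (player, competesIn, competition) from playsFor + competesIn."""
    team_competitions = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "competesIn":
            team_competitions[subject].add(obj)

    inferred = set()
    for subject, predicate, obj in triples:
        if predicate == "playsFor":
            for competition in team_competitions[obj]:
                inferred.add((subject, "competesIn", competition))
    return inferred


inferred = infer_competitions(TRIPLES)
```

A journalist who tags a story with the player would therefore also surface it on the team and competition aggregations, without tagging those explicitly.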

slide-13
SLIDE 13

Graffiti: Suggest -> Tag [Player]

slide-14
SLIDE 14

Graffiti: Suggest -> Tag [Location] (Geonames)

slide-15
SLIDE 15

World Cup DSP Architecture

slide-16
SLIDE 16

API Stack

slide-17
SLIDE 17

Highly Scalable Clustered BigOWLIM

slide-18
SLIDE 18

<http://www.chelseafc.com/>
    domain:documentType <http://www.bbc.co.uk/things/document-types/homepage> ,
        <http://www.bbc.co.uk/things/document-types/external> .

<http://www.bbc.co.uk/sport/football/teams/chelsea>
    domain:documentType <http://www.bbc.co.uk/things/document-types/bbc-document> ,
        <http://www.bbc.co.uk/things/document-types/homepage> .

<http://www.bbc.co.uk/things/2acacd19-6609-1840-9c2b-b0820c50d281#id>
    a sport:CompetitiveSportingOrganisation ;
    domain:canonicalName "Chelsea"^^<xsd:string> ;
    domain:document <http://www.chelseafc.com/> ,
        <http://www.bbc.co.uk/sport/football/teams/chelsea> ;
    domain:externalId <http://dbpedia.org/resource/Chelsea_F.C.> ,
        <urn:sports-stats:137316635> ;
    domain:name "Chelsea" ;
    domain:shortName "Chelsea"^^<xsd:string> ;
    sport:competesIn <http://www.bbc.co.uk/things/5cd4682a-7643-f445-8b1f-bcbaf450bc89#id> .

<http://dbpedia.org/resource/Chelsea_F.C.>
    domain:externalIdType <http://www.bbc.co.uk/things/external-id-types/dbpedia> .

<urn:sports-stats:137316635>
    domain:externalIdType <http://www.bbc.co.uk/things/external-id-types/bbc-sport-stats> .

<http://www.bbc.co.uk/things/5cd4682a-7643-f445-8b1f-bcbaf450bc89#id>
    domain:canonicalName "Premier League"^^<xsd:string> ;
    domain:externalId <urn:sports-stats:118996114> ;
    sport:competitionType <http://www.bbc.co.uk/things/competition-types/domestic-league> .

GET https://api.live.bbc.co.uk/dsp/sport/football/teams/chelsea
Accept: text/rdf+n3
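
As a rough sketch, the request on this slide could be built in Python as follows. Only the URL and the `Accept` value come from the slide; the helper name is hypothetical, and the request is constructed but not sent, since the endpoint may not be publicly reachable.

```python
from urllib.request import Request


def build_team_request(team_slug: str) -> Request:
    """Build (but do not send) a GET request asking for an N3 representation."""
    url = f"https://api.live.bbc.co.uk/dsp/sport/football/teams/{team_slug}"
    return Request(url, headers={"Accept": "text/rdf+n3"}, method="GET")


request = build_team_request("chelsea")
```

Content negotiation via the `Accept` header is what lets one team resource be served as RDF to machines and as HTML to browsers.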

slide-19
SLIDE 19

Rationale

  • Automated content publishing
  • Huge increase in content breadth (number of manageable pages)
  • Content re-use and re-purposing, increasing reach
  • Simplified content management
  • Journalist headcount reduction
  • Multi-dimensional entry points and semantic navigation
  • Improved user experience with high levels of user engagement
  • Dynamic, state (time|event) and semantic driven page layout
  • Personalized content aggregations
  • Open data and APIs
slide-20
SLIDE 20

  • 750+ dynamic aggregations/pages (Player, Squad, Group, etc.)
  • Average unique page requests a day: 2 million+
  • Average OWLIM SPARQL queries a day: 1 million
  • 100s of RDF statement updates/inserts per minute with full OWL reasoning and associated inference
  • Multi data center, fully resilient, clustered 6-node triple store
  • RDF graph model ideally suited to modelling domain representations such as sport

World Cup statistics: the GOOD

slide-21
SLIDE 21

  • Sports stories and indices static
  • Sport content not responsive or personalized
  • RDF store unable to handle thousands of statistic updates a second
  • RDF store forward-chained closures are expensive and increase write latency
  • RDF graph model and SPARQL not ideally suited to the BBC’s News and Sport document publication model

World Cup statistics: the BAD

slide-22
SLIDE 22

BBC Sport 2012; Online Refresh http://bbc.co.uk/sport

slide-23
SLIDE 23

Sport Refresh 2012

  • Page per athlete [10,000+], page per country [200+], page per discipline [400-500], page per venue, page per team
      • A lot of output…
  • Almost real-time statistics and live event pages
  • Time-coded, metadata-annotated, on-demand video, 58,000 hours of content
  • Far too many web pages for far too few journalists
  • DSP annotation architecture to automate content aggregation
slide-24
SLIDE 24

10000+ Dynamic Aggregations

slide-25
SLIDE 25

Lots of Dynamic (Live) sports stats

slide-26
SLIDE 26

slide-27
SLIDE 27

Video delivery

slide-28
SLIDE 28

Augment architecture with a Content Store

  • 1. Atomic content assets stored in MarkLogic XML store
  • 2. XML content queryable via XQuery
  • 3. Content assets searchable
  • 4. Sports statistics searchable/queryable via XQuery
  • 5. Ontological SPARQL via BigOWLIM, assets XQuery via MarkLogic

slide-29
SLIDE 29

API Stack: MarkLogic, OWLIM Enterprise

slide-30
SLIDE 30

Ontology Aware NLP

  • Information Workbench
  • OWLIM
  • (Spice) GATE+Ontotext
slide-31
SLIDE 31

Ontology Aware NLP and Semantic Disambiguation

Pipeline (over OWLIM): Generic Analysis … KB Gazetteer … Disambiguation … Relevance Ranking

Example text: Ex-England boss Sven-Goran Eriksson says a "smear campaign" has been aimed at Roy Hodgson for omitting Rio Ferdinand.

KB Gazetteer candidates:

  • Roy Hodgson: coach
  • Roy Hodgson: hockey player
  • ……….

Disambiguation (selected candidates):

  • Roy Hodgson: coach (selected over Roy Hodgson: hockey player)
  • Rio Ferdinand
  • Sven-Goran Eriksson

Relevance Ranking: 1. Eriksson (78%) 2. Roy Hodgson (69%) 3. Rio Ferdinand (58%) 4. …

CES APP

Curate, Update, Retrain & Adapt

slide-32
SLIDE 32

Entity Relevance: Objective

  • Rank entities by their relatedness to the article
  • Accuracy 75%
  • We consider various frequencies of entity mentions in the article and in the entire set of articles
  • Positions in the article fields or in the first paragraphs of the body boost the relevance
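
One way to read the bullets above is as a TF-IDF style score with a positional boost. The sketch below is an assumption about the general shape of such a scorer, not the production model: the boost value, the log-based corpus weighting and the data layout are all hypothetical.

```python
import math


def entity_relevance(entity, article, doc_freq, corpus_size, lead_boost=2.0):
    """Score one entity for one article.

    article:  {"title": [entity ids], "paragraphs": [[entity ids], ...]}
    doc_freq: entity id -> number of corpus articles that mention it
    """
    # Mentions in the title and in the first paragraph count extra.
    weighted_tf = lead_boost * article["title"].count(entity)
    for index, paragraph in enumerate(article["paragraphs"]):
        weight = lead_boost if index == 0 else 1.0
        weighted_tf += weight * paragraph.count(entity)
    # Entities that are rare across the corpus are more informative.
    idf = math.log(corpus_size / (1 + doc_freq.get(entity, 0)))
    return weighted_tf * idf


def rank_entities(article, doc_freq, corpus_size):
    """Return the article's entities ordered from most to least relevant."""
    entities = set(article["title"])
    for paragraph in article["paragraphs"]:
        entities.update(paragraph)
    return sorted(
        entities,
        key=lambda e: entity_relevance(e, article, doc_freq, corpus_size),
        reverse=True,
    )


article = {
    "title": ["eriksson"],
    "paragraphs": [["eriksson", "hodgson"], ["ferdinand"]],
}
doc_freq = {"eriksson": 10, "hodgson": 10, "ferdinand": 10}
ranking = rank_entities(article, doc_freq, corpus_size=1000)
```

With equal corpus frequencies, the ordering here is driven purely by position: the entity in the title and lead paragraph ranks first.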

slide-33
SLIDE 33

Confidence and Relevance

The relevance of an entity in an arbitrary document may depend on:

  • Text context and the vicinity of an entity/concept within the text (Confidence)
  • Ontological graph context and the vicinity of an entity/concept within the graph's knowledge model
  • The frequencies of entities in the corpus and document (Relevance)

slide-34
SLIDE 34

Disambiguation of Locations

  • Geospatial distance: a feature of OWLIM (GeoSPARQL)
  • Super region: GeoNames hierarchy and containment relations, e.g. parentFeature
  • RDF Rank: similar to PageRank, but computed over RDF links
  • Human approval score (on the basis of curated documents)
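
The geospatial-distance signal can be sketched in isolation. In the architecture described here the distance comes from OWLIM's geospatial support; the sketch below substitutes a plain haversine calculation and a hypothetical two-entry gazetteer, purely to show how proximity to already-resolved places can pick between same-named candidates.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def disambiguate(candidates, context_points):
    """Pick the candidate closest, on average, to locations already resolved."""
    def mean_distance(coords):
        return sum(haversine_km(coords, p) for p in context_points) / len(context_points)
    return min(candidates, key=lambda name: mean_distance(candidates[name]))


# Hypothetical gazetteer entries for the ambiguous mention "London".
candidates = {
    "London, England": (51.5074, -0.1278),
    "London, Ontario": (42.9849, -81.2453),
}
# Another location already resolved in the same article (Manchester).
context_points = [(53.4808, -2.2426)]
chosen = disambiguate(candidates, context_points)
```

In practice this signal would be combined with the other three (super region, RDF Rank, human approval) rather than used alone.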
slide-35
SLIDE 35

Plenty of Caching

slide-36
SLIDE 36

Sport Stats REST API

slide-37
SLIDE 37

@prefix sport: <http://www.bbc.co.uk/ontologies/sport/> .
@prefix asset: <http://www.bbc.co.uk/ontologies/asset/> .
@prefix tag: <http://www.bbc.co.uk/ontologies/tag/> .
@prefix domain: <http://www.bbc.co.uk/ontologies/domain/> .
@prefix sesame: <http://www.openrdf.org/schema/sesame#> .
@prefix owlim: <http://www.ontotext.com/> .
@prefix oly: <http://www.bbc.co.uk/ontologies/2012olympics/> .
@prefix par: <http://purl.org/vocab/participation/schema#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .

<http://www.bbc.co.uk/things/82f5db84-0591-49ee-b6f4-a1d26e9381fb#id>
    a sport:Person ;
    rdfs:label "Usain Bolt"^^xsd:string , "Bolt Usain-athletics-jam-1986-08-21"^^xsd:string ;
    foaf:name "Usain Bolt"^^xsd:string , "Bolt Usain-athletics-jam-1986-08-21"^^xsd:string ;
    domain:canonicalName "Bolt Usain-athletics-jam-1986-08-21"^^xsd:string ;
    foaf:givenName "Usain"^^xsd:string ;
    foaf:familyName "Bolt"^^xsd:string ;
    domain:name "Usain Bolt"^^xsd:string ;
    oly:dateOfBirth "1986-08-21"^^xsd:date ;
    oly:gender "M"^^xsd:string ;
    oly:height "195.0"^^xsd:float ;
    oly:weight "94.0"^^xsd:float ;
    oly:worldOlympicDream "true"^^xsd:boolean ;
    sport:discipline <http://www.bbc.co.uk/things/b3a086df-ab42-2b44-be8b-76b600bfcdce#id> ;
    sport:competesIn <http://www.bbc.co.uk/things/1b499a08-4f02-4196-aa6c-c43ea353138b#id> .

<http://www.bbc.co.uk/things/b3a086df-ab42-2b44-be8b-76b600bfcdce#id>
    a sport:SportsDiscipline ;
    domain:name "Athletics"^^xsd:string ;
    domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/athletics> .

<http://www.bbc.co.uk/things/1b499a08-4f02-4196-aa6c-c43ea353138b#id>
    a sport:MedalCompetition ;
    domain:name "Men's 100m"^^xsd:string ;
    domain:shortName "Men's 100m"^^xsd:string ;
    domain:document <http://www.bbc.co.uk/sport/olympics/2012/sports/athletics/events/mens-100m> ;
    oly:measurementType <http://www.bbc.co.uk/things/measurement-types/time> ;
    domain:externalId <urn:ioc2012:ATM001000> .

<http://www.bbc.co.uk/things/903ef380-bdae-4a45-9a8b-5e5a270a7d6c#id>
    oly:oneToWatch <http://www.bbc.co.uk/things/82f5db84-0591-49ee-b6f4-a1d26e9381fb#id> .

<http://news.bbc.co.uk/sport1/hi/athletics/16554814.stm#asset>
    tag:tag <http://www.bbc.co.uk/things/a50dc8ba-947e-4856-8eb0-1cdbbf208ef7#thing> ;

Olympics API (RDF)

slide-38
SLIDE 38

  • Unique browser requests per day peaked at just over 8 million UK and 11 million globally
  • Cumulative unique browsers online: 34.6 million in total

Olympic numbers

slide-39
SLIDE 39

  • On the busiest day, the BBC delivered 2.8 petabytes, with the peak traffic moment occurring when Bradley Wiggins won Gold and we shifted 700 Gb/s
  • 106 million requests for BBC Olympic video content across all online platforms
  • Number of people watching individual streams

Olympic numbers

slide-40
SLIDE 40

FT?

…some background about the Financial Times

slide-41
SLIDE 41
slide-42
SLIDE 42

The FT Audience: The world’s most wealthy and powerful people

slide-43
SLIDE 43
slide-44
SLIDE 44

“The old ‘page editor’ is dead; long live today’s content editor”

Global network of over 500 journalists

slide-45
SLIDE 45

The FT product portfolio

slide-46
SLIDE 46

FT has been successfully charging for online content since 2002

  • 2002: FT.com subscriptions first introduced
  • 2007: FT.com metered model is launched
  • 2012: Digital subs overtake print circulation
  • 2013: FT has 343k paying digital subscribers

  • Subscriptions were launched in 2002
  • The model was flexible and business rules were determined by various departments
  • Customer experience wasn’t great! (Customer data wasn’t being captured or used intelligently)

slide-47
SLIDE 47

Digital subscriptions continue to grow

  • 2002: FT.com subscriptions first introduced
  • 2007: FT.com metered model is launched
  • 2012: Digital subs overtake print circulation
  • 2013: FT has 443k paying digital subscribers

  • 600,000+: combined print & digital circulation
  • 443,000: digital subscribers
  • 2 million: people worldwide read the FT in print or online
  • 14% in past year
  • 51%+: revenue expected to be generated through content sales/subscriptions by end of 2013

slide-48
SLIDE 48

Digital is our future

slide-49
SLIDE 49

The FT is where our readers wish to consume it

slide-50
SLIDE 50

Meteoric rise of mobile

  • Over 50% of subscriber consumption
  • Third of FT.com traffic
  • Driving 24% of new digital subs
  • Mobile advertising up 26% yoy
slide-51
SLIDE 51

First major publisher to launch an HTML5 web app

  • Supporting offline content consumption

slide-52
SLIDE 52

Existing, very complex publishing architecture, just to publish a content object. No search, no metadata?

[Architecture diagram. Labels: Content Creation UI [methode] (Edit | Publish Article); Layout UI [preditor] (Manage ft.com Layout); Staging platform adapter; Staging: content; Staging: ft.com layout; Publish: content; Publish: ft.com layout; Delivery platform; ft-cms render; delivery api; content api; ft.com HTML; FT HTML 5 Web App; Audience (Read Site)]

slide-53
SLIDE 53

Taxonomy based Search and Topic pages (Poor Quality)

Top story is about Samsung?

slide-54
SLIDE 54

Taxonomy Based Tagging & Event Based Story editing

slide-55
SLIDE 55

Complex Taxonomy Based Search

[Search architecture diagram. Labels: User search; Mobile search; Search API; FAST ESP; Search Index; Search Feeder; Search Feeder DB; Video Feeder DB; Video Delivery Platform; ft.com stack; Blogs process. Flows: reads on schedule; writes on publish; reads new video schedule]

slide-56
SLIDE 56

Moving towards : Semantic Publishing Platform

Provide a Universal Publishing "UP" platform where all FT content is universally accessible across all of its channels:

  • RDF Store: semantically understand, interlink and store all FT content within a single pan-FT Content Store
  • APIs: content, metadata and the links between them will be exposed and made publicly available via APIs
  • Publishing tools: provide tooling which allows journalists, users, readers and contributors to publish to the Content Store via APIs
  • Metadata and Semantics: author content directly and, in addition, publish data about the content (semantic metadata)
  • Provide intelligent query and search capabilities empowering the next generation of FT products

slide-57
SLIDE 57

Dynamic Semantic Publishing Architecture

slide-58
SLIDE 58

Annotation ontology - Semantic fingerprint

slide-59
SLIDE 59

News Business ontology Collaboration: (FT, BBC, Reuters, Bloomberg, EuroMoney)

slide-60
SLIDE 60

Future: Story line ontology

slide-61
SLIDE 61

JSON-LD for all public (secure) endpoints (Demo)
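
For readers unfamiliar with the format, a minimal sketch of what an article could look like as JSON-LD follows. The FT's actual context and vocabulary are not shown on the slide; the schema.org terms, the helper name and the DBpedia subject URI are assumptions chosen only to illustrate `@context`, `@id` and `@type`. The headline and article identifier are taken from the recommendations example later in the deck.

```python
import json


def article_as_jsonld(uuid, headline, about_uris):
    """Wrap article metadata in a JSON-LD document (hypothetical vocabulary)."""
    return {
        "@context": {
            "headline": "http://schema.org/headline",
            # "@type": "@id" marks the values of "about" as links, not strings.
            "about": {"@id": "http://schema.org/about", "@type": "@id"},
        },
        "@id": f"http://www.ft.com/cms/s/0/{uuid}.html",
        "@type": "http://schema.org/NewsArticle",
        "headline": headline,
        "about": about_uris,
    }


document = article_as_jsonld(
    "3c6e330a-e74b-11e3-8b4e-00144feabdc0",
    "Apple seeks to work Jobs magic on the internet of things",
    ["http://dbpedia.org/resource/Apple_Inc."],
)
serialized = json.dumps(document, indent=2)
```

The appeal for an API is that the payload stays ordinary JSON for existing clients while remaining interpretable as RDF by semantic tooling.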

slide-62
SLIDE 62

FT.com Semantic Prototype (Demo)…

slide-63
SLIDE 63

FT.com Semantic Prototype (Demo)…

slide-64
SLIDE 64

FT.com Semantic Prototype (Demo)…

slide-65
SLIDE 65

Related/Similar Content

slide-66
SLIDE 66

Recommendations

Primary objective: serve relevant articles to increase user engagement and improve usability

slide-67
SLIDE 67

Recommendations

Subject, Object and Action

  • Subject: User
  • Object: Article, Media Asset, Data, …
  • Action: Read, Preview, Comment, …
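
The subject/action/object model maps naturally onto a small event record. The sketch below is one possible shape, with hypothetical class and field names; only the three actions named on the slide are included, and the identifiers are the ones used in the API example later in the deck.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    READ = "read"
    PREVIEW = "preview"
    COMMENT = "comment"


@dataclass(frozen=True)
class UserEvent:
    """One behavioural observation: who did what to which item."""
    subject: str   # user identifier
    action: Action
    obj: str       # content identifier (article, media asset, data, ...)


event = UserEvent(
    subject="6753665",
    action=Action.READ,
    obj="398e6d5e-e858-11e3-9cb3-00144feabdc0",
)
```

Keeping the record immutable makes it safe to append to a click stream and replay later when rebuilding user profiles.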

slide-68
SLIDE 68

Contextual Recommendations

Contextual Similarity

slide-69
SLIDE 69

Behavioural Recommendations

Behavioural Similarity

User Profile

slide-70
SLIDE 70

Combined Contextual + Behavioural Recommendations

Behavioural and Contextual Similarity

Reads

User Profile

slide-71
SLIDE 71

User Actions: Limited to User Reads Article

reads

slide-72
SLIDE 72

User Actions: A wider behavioural perspective

[Diagram. Relations: perform, comments, votes, posts, preview, read, contains, leads to read, leads to preview. Nodes: Article, Search Action, Result, Date, FTS Q., Tag, Cat, Tag set, results, cat taxonomy, Search Log]

slide-73
SLIDE 73

“Content” Based Recommendations

  • Results will be based on the similarity of content items. Similarity will be defined by the terms/entities within the content (content fingerprint)
  • Results will be based on the past choices of an individual user (a user's profile, i.e. the user's content fingerprint)
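
A common way to compare two fingerprints of weighted terms is cosine similarity. The slide does not name the measure actually used, so treat the sketch below as an assumption; the example terms and weights are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two {term: weight} fingerprints."""
    shared_terms = set(a) & set(b)
    dot = sum(a[term] * b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical fingerprints: an article, and a user profile built from past reads.
article_fingerprint = {"apple": 0.8, "internet of things": 0.6}
user_fingerprint = {"apple": 0.5, "samsung": 0.5}
score = cosine_similarity(article_fingerprint, user_fingerprint)
```

Because a user profile is just another fingerprint, the same function covers both bullets: article-to-article and user-to-article similarity.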

slide-74
SLIDE 74

Content based ranking

Two-fold scoring approach

  • 1. Similarity to recently viewed articles (context)
  • 2. Relevance to a long-term user profile:
      • Weights reflecting the relative importance of the individual terms (static component)
      • Transition likelihoods among any pair of terms (dynamic component)
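
The two-fold approach could be combined along the following lines. This is a sketch under stated assumptions: Jaccard overlap stands in for the context similarity, the static and dynamic components are simply summed, and the mixing weight `alpha` is hypothetical. None of these choices is given on the slide.

```python
def context_score(article_terms, recent_articles):
    """Mean Jaccard overlap between the candidate and recently viewed articles."""
    if not recent_articles:
        return 0.0
    overlaps = [
        len(article_terms & seen) / len(article_terms | seen)
        for seen in recent_articles
        if article_terms | seen
    ]
    return sum(overlaps) / len(recent_articles)


def profile_score(article_terms, static_weights, transitions, last_terms):
    """Static term weights plus transition likelihoods from the last-read terms."""
    static = sum(static_weights.get(term, 0.0) for term in article_terms)
    dynamic = sum(
        transitions.get((previous, term), 0.0)
        for previous in last_terms
        for term in article_terms
    )
    return static + dynamic


def combined_score(context, profile, alpha=0.5):
    """Blend the short-term (context) and long-term (profile) signals."""
    return alpha * context + (1 - alpha) * profile


candidate = {"apple", "iot"}
ctx = context_score(candidate, [{"apple", "jobs"}])
prof = profile_score(
    candidate,
    static_weights={"apple": 0.6, "banks": 0.4},
    transitions={("jobs", "iot"): 0.5},
    last_terms={"apple", "jobs"},
)
total = combined_score(ctx, prof)
```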

slide-75
SLIDE 75

Collaborative Ranking

  • Results based on statistics that reflect the past choices of all users
  • Results based on user ratings, and the similarity of users or items
  • Content-agnostic
  • Aware of the quality of content
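
A simple content-agnostic statistic of this kind is co-visitation: counting how often two articles are read by the same user. The sketch below uses invented read streams and hypothetical function names; it illustrates the idea and is not the FT implementation.

```python
from collections import Counter
from itertools import combinations


def covisitation(read_streams):
    """Count, for each pair of articles, how many users read both."""
    counts = Counter()
    for articles in read_streams.values():
        for first, second in combinations(sorted(set(articles)), 2):
            counts[(first, second)] += 1
    return counts


def also_read(article, counts, top_n=3):
    """Articles most often co-read with the given one, best first."""
    related = Counter()
    for (first, second), n in counts.items():
        if first == article:
            related[second] += n
        elif second == article:
            related[first] += n
    return [item for item, _ in related.most_common(top_n)]


# Hypothetical read streams: user id -> article ids read.
streams = {
    "u1": ["a", "b"],
    "u2": ["a", "b", "c"],
    "u3": ["a", "c"],
    "u4": ["a", "b"],
}
counts = covisitation(streams)
suggestions = also_read("a", counts)
```

No article text is inspected at any point, which is what makes the approach content-agnostic while still reflecting what readers actually chose.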
slide-76
SLIDE 76

Collaborative Ranking Mechanisms

  • User to Content Similarity Score
  • User to User Similarity Score
  • Content to Content Similarity Score
slide-77
SLIDE 77

API: Recommend for:

http://www.ft.com/cms/s/0/398e6d5e-e858-11e3-9cb3-00144feabdc0.html

GET http://ft-recommendations.ft.com/api/behavioural/getsmartcontent?userid=6753665&contentid=398e6d5e-e858-11e3-9cb3-00144feabdc0

{
  "SmartAPI": "Recommendations API v1.8",
  "type": "behavioural",
  "status": "ok",
  "method": "getsmartcontent",
  "response": {
    "result": [
      {
        "headline": "Apple seeks to work Jobs magic on the internet of things",
        "feedid": 21,
        "feed": "FT Mobile Content API",
        "rel": 1.331479549407959,
        "date": "Thu May 29 18:33:04 UTC 2014",
        "cid": "3c6e330a-e74b-11e3-8b4e-00144feabdc0",
        "url": "http://www.ft.com/cms/s/0/3c6e330a-e74b-11e3-8b4e-00144feabdc0.html",
        "popularity": 2452
      },
      {
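
A client for this response might look like the sketch below. The URL, query parameters and JSON field names are taken from the slide; the helper names are hypothetical, and nothing is fetched over the network because the endpoint is unlikely to be publicly reachable. The sample payload is a closed-off copy of the single result visible on the slide.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://ft-recommendations.ft.com/api/behavioural/getsmartcontent"


def build_url(user_id, content_id):
    """Build the recommendation request URL for a user reading an article."""
    return f"{BASE_URL}?{urlencode({'userid': user_id, 'contentid': content_id})}"


def top_recommendations(payload, limit=5):
    """Return (headline, url) pairs ordered by the 'rel' relevance score."""
    data = json.loads(payload)
    if data.get("status") != "ok":
        raise ValueError(f"unexpected status: {data.get('status')!r}")
    results = data["response"]["result"]
    ranked = sorted(results, key=lambda item: item["rel"], reverse=True)
    return [(item["headline"], item["url"]) for item in ranked[:limit]]


sample_payload = json.dumps({
    "status": "ok",
    "response": {"result": [{
        "headline": "Apple seeks to work Jobs magic on the internet of things",
        "rel": 1.331479549407959,
        "url": "http://www.ft.com/cms/s/0/3c6e330a-e74b-11e3-8b4e-00144feabdc0.html",
    }]},
})
url = build_url(6753665, "398e6d5e-e858-11e3-9cb3-00144feabdc0")
recommendations = top_recommendations(sample_payload)
```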

slide-78
SLIDE 78

Architecture

[Architecture diagram. Components: AWS Elastic LB; three AWS instances (RR); Recommendation API; Varnish Cache; SOLR 1, SOLR 2, SOLR 3; CS Node 1, CS Node 2, CS Node 3 (Replication Group I); FT API; Fetch & Annotation; OWLIM; Worker; content store; user profiles]

Read Article flow:

  • 1. get related
  • 2. ask
  • 3. on cache miss
  • 4. query

Content flow:

  • 1. pull content
  • 2. annotate
  • 3. index

Other labels: annotate; update popularity; click stream; update user

slide-79
SLIDE 79

Main Actions

  • 1. Pull content – annotate/enrich – index
  • 2. Accumulate/update user profile
  • 3. Recommend
slide-80
SLIDE 80

Main Actions

Profile Update

Request (User ID, Item ID)

Query Generation

Items Index (Solr) Profile Storage (Cassandra)

Recommendation Request (User ID)

Profile Update

User:

  • context
  • static component
  • dynamic component

Article:

  • co-visitation matrix
  • popularity

Boosted sub-queries for all involved ranking schemes: content-based, collaborative, popularity, recency
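
The "boosted sub-queries" step can be pictured as assembling one Solr query string in which each ranking scheme contributes a clause with its own boost, using the standard `field:(...)^boost` syntax. The field names `concepts` and `keyphrases` appear on the Solr slide; the boost values, terms and helper name below are hypothetical.

```python
def boosted_query(sub_queries):
    """Join (field, terms, boost) sub-queries into one boosted Solr query string."""
    clauses = []
    for field, terms, boost in sub_queries:
        if not terms:
            continue  # a scheme with nothing to contribute adds no clause
        joined = " OR ".join(f'"{term}"' for term in terms)
        clauses.append(f"{field}:({joined})^{boost}")
    return " OR ".join(clauses)


query = boosted_query([
    ("concepts", ["Apple"], 2.0),                 # e.g. from the user profile
    ("keyphrases", ["internet of things"], 1.5),  # e.g. from recent context
    ("body", [], 1.0),                            # empty: skipped
])
```

Tuning the relative boosts is then the main lever for trading off content-based, collaborative, popularity and recency signals.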

slide-81
SLIDE 81

Cassandra

[Diagram. Labels: User Profile (concept, weight, top terms); Article Profile (concept, weight, top terms); Concept Activation Matrix; Article Covisitation; User Read Stream (reads)]

slide-82
SLIDE 82

Solr

[Diagram. Article Profile fields: summary, title, body, concepts (C), keyphrases (KP); each holding top terms]

slide-83
SLIDE 83

Thank you. Questions?

Jem.rayfield@ft.com @jemrayfield