USING MACHINE LEARNING TO AUTOMATE CONTENT METADATA Gareth Seneque - - PowerPoint PPT Presentation



slide-1
SLIDE 1

USING MACHINE LEARNING TO AUTOMATE CONTENT METADATA

Gareth Seneque seneque@gmail.com @garethseneque https://search.abc.net.au

slide-2
SLIDE 2

THE PLAN!

1. What is the ABC/Search at the ABC
2. An overview of metadata – the what/why
3. A platform: what have we built?
4. Automating transcription of audio content
5. Automated generation of keywords/synopses
6. Some fun!
7. The future?

slide-3
SLIDE 3

THE ABC

  • We make lots of things!
  • You may have seen these things on one of your many screens :)
  • A trusted source in 2019 across the political spectrum – imagine that!
  • “Majority (68%) of respondents think the ABC is more important in an age of social media and fake news, including 64% of LNP and 61% of One Nation voters;
  • The results show 57% of respondents do not trust social media, while just 12% said they do trust social media;
  • Over three times more voters trust the ABC (52%) than trust commercial media (14%)”

Source: The Australia Institute http://www.tai.org.au/content/abc-still-australia-s-most-trusted-news-source

slide-4
SLIDE 4

THE ABC

  • But for our purposes today:

The ABC is in the business of words and pixels!

  • To come: lots of words about words, some words about pixels, too
slide-5
SLIDE 5

SEARCH @ THE ABC

  • https://search.abc.net.au
  • Algolia back-end (also used by Twitch, Stripe)
  • ~600k objects in our primary index
  • Covering all major content types from the last decade
  • ~230k articles
  • ~270k audio
  • ~85k video
  • Other things like recipes – very popular and worthy of their own content type!
slide-6
SLIDE 6

SEARCH @ THE ABC

  • ~500k searches/month
  • Peaks during weekdays, traffic nearly halves on weekends
  • (Aussies love a good weekend!)
  • Two challenges:
  • How do we get people to use our search?
  • Expectation of what search can do set by Google etc.
  • How do we deliver the most relevant results to those using our search?
  • Ensure high-quality metadata!
slide-7
SLIDE 7

METADATA: AN OPPORTUNITY!

  • Video: episode of BTN
  • Missing keywords
  • Missing synopsis
  • Audio: episode of Life Matters podcast
  • No show name, just episode name
  • Spelling mistake in a keyword
  • Total of 5 keywords
  • No synopsis
  • No transcript
  • Article: News
  • Synopsis is first sentence of article
  • 4 keywords for lengthy article

slide-8
SLIDE 8

METADATA: OVERVIEW

  • Article → BodyText → Keywords/Synopsis
  • Audio → Transcript → Keywords/Synopsis
  • Video → Closed Captions/Transcript → Keywords/Synopsis

Looking to the future…

  • Images & Video → ‘Individual interacting with object’ → New attributes
  • Geoff Hinton in 2015: “I will be disappointed if in five years time we do not have something that can watch a YouTube video and tell a story about what happened”

  • Somehow it is already 2019, so…
slide-9
SLIDE 9

AN AUTOMATED METADATA PLATFORM

Credit: 20th Century Fox

slide-10
SLIDE 10

AN AUTOMATED METADATA PLATFORM

  • Transcript pipeline
  • 2x Lambda functions
  • Monitoring content notifiers, picking up Podcasts, generating transcription requests, picking up the results, separating transcript from word-confidence scores/timestamps and pushing to our search index

  • S3 buckets for storing transcripts
  • General attribute infrastructure & tools (for now, keywords/synopsis)
  • Load-balanced/distributed EC2 instances, CI/CD the things
  • Deploys models, API
  • Tools to update search objects (bulk/incremental)
  • All written in Go!
slide-11
SLIDE 11

WHY TRANSCRIPTS?

  • ~130 Podcasts in the Listen app alone
  • ~19 million Podcast downloads every month (!!)
  • Less than 5% have transcripts
  • Transcripts are expensive when produced by humans!
  • All that great content not easily discoverable
  • Hypothesis: we can increase audience engagement with searchable transcripts

slide-12
SLIDE 12

EXPERIMENTS WITH DEEPSPEECH

  • Mozilla’s open-source implementation of Baidu’s paper Deep Speech: Scaling up end-to-end speech recognition (2014)
  • A Recurrent Neural Network with LSTM units and CTC (Connectionist Temporal Classification)
  • Optimisation method that controls for different patterns of speech
  • You want your network to understand when the slurring drunk and the impatient teetotaller ask their phone for directions

  • You’ll see the co-inventor of the LSTM in the next slide
  • Pre-trained models: limited to ~30 second clips at 16kHz/mono
  • Need to build system to manage slicing up inputs/reconstructing outputs
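
The slicing/reconstruction step mentioned above could look roughly like the following. This is a minimal sketch, not the system built at the ABC: the function names `slice_audio` and `stitch_transcripts` are hypothetical, audio is assumed to be already decoded to a list of 16 kHz mono samples, and the speech model itself is left out.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000    # 16 kHz mono, as stated on the slide
MAX_CLIP_SECONDS = 30   # approximate limit of the pre-trained models


def slice_audio(samples: List[int],
                clip_seconds: int = MAX_CLIP_SECONDS,
                sample_rate: int = SAMPLE_RATE) -> List[Tuple[float, List[int]]]:
    """Split mono samples into consecutive clips tagged with their start time (seconds)."""
    clip_len = clip_seconds * sample_rate
    clips = []
    for start in range(0, len(samples), clip_len):
        clips.append((start / sample_rate, samples[start:start + clip_len]))
    return clips


def stitch_transcripts(parts: List[Tuple[float, str]]) -> str:
    """Rejoin per-clip transcripts in time order, skipping empty results."""
    ordered = sorted(parts, key=lambda part: part[0])
    return " ".join(text.strip() for _, text in ordered if text.strip())
```

Fixed-length cuts can land in the middle of a word, so a real system would probably cut on silence or overlap adjacent clips; that is outside this sketch.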
slide-13
SLIDE 13

?????

slide-14
SLIDE 14

DEEPSPEECH 🤗 VS. HUMAN TRANSCRIPT 😭

VS: “On May 13, 1968, students and workers joined together in Paris in one of the largest protests the France had seen. They threatened the stability of the national government and arguably shifted the way we think about protests and political demonstrations forever. Yet that is just a small part of the story of 1968. On every continent, in almost every nation on earth”
slide-15
SLIDE 15

MLAAS (PRONOUNCED MRMMLYAAAS)

  • Recurrent what? Maths? Who cares!
  • As of 2017 this stuff has been available as an AWS service with a simple API call – AWS Transcribe

slide-16
SLIDE 16

HUMAN TRANSCRIPT 😭

A lot of this is taken directly from the example of the 1960s. And so this question of when the 1960s ends and its legacy I think is most apparent in the fact that in a lot of ways the 1960s hasn't ended yet, we are still grappling with many of these most basic ideas. Annabelle Quince: Zachary Scarlett, co-editor of The Third World in the Global 1960s. You also heard from: Gerard De Groot, author of Student Protest: The Sixties and After; Heike Becker, Professor of Anthropology at the University of the Western Cape; and Gerd-Rainer Horn, author of The Spirit of '68: Rebellion in Western Europe and North America, 1956-1976. The sound engineer is Russel Stapleton. I'm Annabelle Quince and you've been listening to Rear Vision on RN.

slide-17
SLIDE 17

AWS TRANSCRIBE 🤗

A lot of this is taken directly from the example of the nineteen sixties, and so this question of when the nineteen sixties ends and it's legacies, i think, is most apparent in the fact that in a lot of ways the nineteen sixties hasn't ended yet. We're still grappling with many of these most basic ideas. Sekeras scarlet, co editor of the third world in the global nineteen sixties. You also heard from gerard degroot, the author of student protest the sixties and after heika bika, professor of anthropology at the university of the western cape, and god rainer horn, author of the spirit of sixty eight. The sound engineer is russell stapleton. I'm annabelle quints, and this is revision on our end.
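
One common way to put a number on the gap between a machine transcript and a human one is word error rate (WER): the word-level edit distance divided by the length of the reference. The talk does not say which metric, if any, was used, so the sketch below is only an illustration; `tokenize` and `word_error_rate` are hypothetical helper names.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lower-case and keep only word-like tokens so punctuation and casing are ignored."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference prefix
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Without further normalisation, "1960s" against "nineteen sixties" counts as errors even though a listener would hear the same thing, so raw WER overstates the problem on transcripts like the pair above.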

slide-18
SLIDE 18

TRANSCRIBE OUTPUT: UNDER THE HOOD
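
A Transcribe job result is a JSON document with the full transcript under `results.transcripts` and one entry per word or punctuation mark under `results.items`. Separating the plain transcript from the word-level confidence scores and timestamps might be sketched as below. The function name `split_transcribe_output` and the sample payload are made up for illustration, and numeric fields are assumed to arrive as strings.

```python
import json
from typing import Dict, List, Tuple


def split_transcribe_output(raw: str) -> Tuple[str, List[Dict[str, object]]]:
    """Return (plain transcript, per-word confidence/timestamp records)."""
    results = json.loads(raw)["results"]
    transcript = " ".join(t["transcript"] for t in results["transcripts"])
    words = []
    for item in results["items"]:
        if item.get("type") != "pronunciation":
            continue  # punctuation items carry no timestamps
        best = item["alternatives"][0]
        words.append({
            "word": best["content"],
            "confidence": float(best["confidence"]),
            "start": float(item["start_time"]),
            "end": float(item["end_time"]),
        })
    return transcript, words


# Hypothetical, heavily shortened payload used only to exercise the function.
SAMPLE = json.dumps({
    "jobName": "example-job",
    "results": {
        "transcripts": [{"transcript": "We're still grappling."}],
        "items": [
            {"start_time": "0.04", "end_time": "0.31", "type": "pronunciation",
             "alternatives": [{"confidence": "0.98", "content": "We're"}]},
            {"start_time": "0.31", "end_time": "0.58", "type": "pronunciation",
             "alternatives": [{"confidence": "1.0", "content": "still"}]},
            {"start_time": "0.58", "end_time": "1.12", "type": "pronunciation",
             "alternatives": [{"confidence": "0.87", "content": "grappling"}]},
            {"type": "punctuation",
             "alternatives": [{"confidence": "0.0", "content": "."}]},
        ],
    },
})

transcript, words = split_transcribe_output(SAMPLE)
```

The transcript string is what would go into a search index, while the word records can be stored separately, for example to flag low-confidence passages for human review.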

slide-19
SLIDE 19

KEYWORD/SUMMARY METADATA

  • Not enough cold-drip in the world for our team to create keyword/summary metadata for 600k objects
  • Two attributes suitable for NER and extractive/abstractive summarization
  • 2018/2019 has seen major breakthroughs in NLP/large language models – SOTA results across a range of tasks (Google’s BERT, OpenAI’s GPT-2, AllenAI’s ELMo)
  • Can any of these breakthroughs help us? How do they compare to more mature/minimal approaches?
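
As a point of comparison for the large models, one example of a "mature/minimal" keyword approach is TF-IDF ranking: a term scores highly when it is frequent in one document but rare across the corpus. The slides do not name the baseline that was actually evaluated, so this is an assumed stand-in; `tfidf_keywords` and the tiny stopword list are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List

# Deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "are", "for", "that", "with", "this", "from", "has", "have"}


def tokenize(text: str) -> List[str]:
    """Lower-case word tokens, dropping stopwords and very short words."""
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if len(w) > 2 and w not in STOPWORDS]


def tfidf_keywords(doc: str, corpus: List[str], top_n: int = 5) -> List[str]:
    """Rank the terms of `doc` by term frequency times smoothed inverse document frequency."""
    tokens = tokenize(doc)
    if not tokens:
        return []
    term_counts = Counter(tokens)
    doc_sets = [set(tokenize(d)) for d in corpus]
    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for s in doc_sets if term in s)
        idf = math.log((1 + len(corpus)) / (1 + doc_freq)) + 1
        scores[term] = (count / len(tokens)) * idf
    # Sort by score, breaking ties alphabetically so output is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]
```

Unlike NER, this returns single words rather than named entities, so it cannot tell a person from a place; it is cheap enough to run over every object in an index, though.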

slide-20
SLIDE 20

EXPERIMENTS WITH BERT

  • Fine-tuning BERT for NER on CoNLL-2003
  • Viz to the right is the PCA of embeddings
  • Tensorboard!
  • That little cluster there is the label [unused]
  • doh
  • 1.2GB model, trained on K80 cloud GPUs
  • Overkill!
slide-21
SLIDE 21

GENERATED VS EXISTING KEYWORDS

Article: “Fact checking key claims of the 2019 federal election leaders' debate” – ABC News – 29/04/19

slide-22
SLIDE 22

GENERATED VS EXISTING SUMMARIES

Water restrictions will be introduced in Sydney if drought conditions don't ease in the next three months, according a report on dwindling dam levels in New South Wales. The latest research from Sydney Water reveals levels across 11 dams in Greater Sydney are dropping faster than they have in decades. NSW Water Minister Melinda Pavey said Water Rise Rules — which recommend reducing shower time and fixing tap leaks — applied to everyone in Sydney, the Blue Mountains and Illawarra.

Water levels in dams are dropping faster than they have in decades, according to new research by Sydney Water — edging Sydney closer to the re-introduction of water restrictions.

Article: Water restrictions loom for Sydney as drought continues to impact on dam levels – ABC News – 05/05/19
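
An extractive summary is built by picking whole sentences from the article rather than writing new ones. A very small frequency-based version is sketched below, on the assumption that sentences full of the article's most common content words are the most representative. This is not the model behind the slide; `extractive_summary` and the stopword list are hypothetical.

```python
import re
from collections import Counter
from typing import List

# Deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are", "if", "may"}


def content_words(text: str) -> List[str]:
    """Lower-case word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text: str, max_sentences: int = 2) -> str:
    """Keep the highest-scoring sentences, returned in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence: str) -> float:
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: (-score(sentences[i]), i))[:max_sentences]
    return " ".join(sentences[i] for i in sorted(ranked))
```

Averaging by sentence length stops long sentences from winning by default; abstractive summarization, by contrast, generates new wording and needs a trained language model.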

slide-23
SLIDE 23

SOME (EARLY) RESULTS

  • Across the top News articles over the past week
  • 280% average increase in number of keywords
  • 22% increase in audio content availability across range of popular terms
  • 3-14% increase in CTR in A/B tests for combinations of ordered/unordered keywords/extractive summaries

  • Tests running as we speak!
  • Abstractive summarization experiments continuing!
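
A CTR increase like the one quoted above is a relative lift between a control and a variant, and whether it is more than noise depends on sample size. The sketch below shows the arithmetic with a standard two-proportion z-test; the click and impression counts are invented for illustration and are not ABC data, and the talk does not state which significance test was used.

```python
import math


def relative_lift(control_ctr: float, variant_ctr: float) -> float:
    """Relative change of the variant CTR over the control CTR (0.10 means +10%)."""
    return (variant_ctr - control_ctr) / control_ctr


def two_proportion_z(clicks_a: int, n_a: int, clicks_b: int, n_b: int) -> float:
    """z statistic for the difference between two click-through rates (pooled variance)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / std_err


def p_value_two_sided(z: float) -> float:
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: control (A) vs. generated keywords (B).
lift = relative_lift(2000 / 10_000, 2200 / 10_000)
z = two_proportion_z(2000, 10_000, 2200, 10_000)
p = p_value_two_sided(z)
```

With these made-up numbers the lift is 10% and the difference is clearly significant; with only a few hundred searches per arm the same lift would not be, which is why larger sample sizes matter for tests like these.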
slide-24
SLIDE 24

CONSIDER THESE SNIPPETS

  • The Australian Securities Exchange, or ASX, is the world's biggest importer of corporate bonds including government and industrial bonds. But despite the presence of bonds from Australia, a divided market continues to pour in. The yield on the ASX 200 index on the first day of trading has dropped from 1.8 per cent to 1.6 per cent after the British economy was hurt by the global financial crisis. But it fell to 1.4 per cent on Friday, after the ASX 200 index was downgraded to junk status.

  • It took a lot of digging and a lot of sweat to find this picture of a tiger that had been spotted on a beach in Indonesia. It was taken on the first of three days of tracking by a trapeze dog. It was originally thought the tiger was a female but it has now been confirmed that it is a male. Local fishermen were quick to point out that the picture was not meant for social media, but is a tribute to a friend who passed away in 2015. It's not the first time the tiger that has been spotted on a beach in Indonesia has been photographed.

slide-25
SLIDE 25

CONTEXT: PROGRESS IN LANGUAGE MODELS

Credit: @OriolVinyalsML – Google Deepmind

slide-26
SLIDE 26

THE FUTURE

  • Have Search integrated with every ABC platform/app
  • Searchable transcripts in Listen!
  • Deliver more people more content, larger sample sizes for A/B tests
  • Abstractive summarization
  • Challenging/a game that moves as you play, but potential for unique metadata & downstream tasks

  • Generating image/video metadata
  • (Safe) opt-in-only personalisation
  • Enabling experiments with things like DRN: A Deep Reinforcement Learning Framework for News Recommendation – Microsoft Research/Penn State University (2018)

slide-27
SLIDE 27

SEND US FEEDBACK!